Concurrent validity is similar to predictive validity; however, rather than predicting, it postdicts: the criterion measures and the test scores are obtained simultaneously. I would ascertain this validity by correlating the test scores with data already gathered, e.g. I would correlate the new test with an existing, established test.
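As a minimal sketch of this procedure (all scores and array names are hypothetical), the validity coefficient is simply the correlation between the two sets of scores:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same examinees on the new test and on an
# already-established criterion test, gathered at the same time.
new_test = np.array([12, 18, 25, 30, 22, 15, 28, 20])
established_test = np.array([14, 20, 27, 33, 21, 13, 30, 22])

# The concurrent validity coefficient is the correlation between them.
r, p = pearsonr(new_test, established_test)
print(f"concurrent validity r = {r:.2f} (p = {p:.3f})")
```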
To test convergent validity, I would see how highly the test correlates with other tests or factors with which, in theory, it should overlap. To establish divergent validity, I would see how poorly the test correlates with other tests or factors with which, in theory, it should not overlap.
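As a sketch (all scores are hypothetical), both checks could be made at once by inspecting the correlations of the new test with a theoretically related measure and a theoretically unrelated one:

```python
import numpy as np

# Hypothetical scores for the same examinees: the new test, a test that
# should overlap with it in theory, and a test that should not.
new_test = np.array([12, 18, 25, 30, 22, 15, 28, 20])
related = np.array([10, 17, 23, 31, 20, 14, 27, 19])
unrelated = np.array([55, 40, 48, 43, 52, 47, 41, 50])

corr = np.corrcoef([new_test, related, unrelated])
print(f"convergent r (new vs related)   = {corr[0, 1]:.2f}")  # expect high
print(f"divergent  r (new vs unrelated) = {corr[0, 2]:.2f}")  # expect low
```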
A construct is an intangible quality or trait such as depression, intelligence or love. Behaviour implies the existence of certain constructs, so constructs can be seen, to some extent, to predict, influence and/or determine behaviour. Construct validity concerns the soundness of the test-based inferences about the construct the test supposedly measures. Construct validity does not provide a single quantitative measure of validity, because there is no mathematical basis for determining it directly. Furthermore, judgments of construct validity change continually as the character of the underlying construct is further documented.
To determine whether the test has construct validity, I would test for internal consistency, i.e. check whether the test items or subtests are homogeneous and, as such, measure a single construct. I would do this by correlating each item and subtest with the total score. I would also study developmental changes to determine whether they are consistent with the theory of the construct, and would ascertain whether group differences on test scores are theory-consistent. I would correlate the test with other related and unrelated tests and measures to establish convergent and discriminant validation and, finally, I would use factor analysis to identify the minimum number of determinants (factors) required to account for the inter-correlations among a battery of tests.
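As a minimal sketch of the item-total correlation check (the item-score matrix is hypothetical):

```python
import numpy as np

# Hypothetical 0/1 item-score matrix: rows = examinees, columns = items.
items = np.array([
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
])

total = items.sum(axis=1)

# Correlate each item with the total score; a low or negative correlation
# flags an item that may not measure the same construct as the rest.
for i in range(items.shape[1]):
    r = np.corrcoef(items[:, i], total)[0, 1]
    print(f"item {i + 1}: item-total r = {r:.2f}")
```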
RELIABILITY:
Once I have established that the test is valid, I would want to establish how accurate it is, i.e. how dependable is this test: can the measurement device yield the same result over and over again under similar conditions? Some degree of inconsistency is always present from one moment to the next, so reliability should be viewed as a continuum.
Unsystematic measurement errors (arising from item selection, test administration and test scoring) and systematic measurement errors are the major sources of measurement error in psychological testing.
The classical theory of measurement holds that an observed score consists of a true score plus measurement error. Measurement error is random and the mean error of measurement is zero; true scores and error scores are uncorrelated, and error scores on different tests are uncorrelated. Therefore the variance of the obtained scores is the variance of the true scores plus the variance of the error scores. To establish the reliability of this test in accordance with true score and error score variances, I would evaluate reliability as the relative influence of true and error scores on the obtained test scores, and the reliability coefficient as the ratio of true score variance to total variance (i.e. true score variance plus error score variance). The value of the reliability coefficient can range from 0.0 to 1.0.

Therefore: reliability coefficient = true score variance / (true score variance + error score variance)
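As a worked example with purely hypothetical variances: if the true score variance is 80 and the error score variance is 20, the reliability coefficient is 80 / (80 + 20) = 0.80, meaning that 80% of the variance in obtained scores reflects true score differences.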
Item selection refers to the instrument itself, i.e. the selection and wording of the questions. Some questions chosen for a test may be biased in favour of certain populations, and I would need to take this into account when evaluating the test's reliability.
I will not be able to test for errors arising from test administration, as these relate to the actual administration of the test and involve such things as general environmental conditions, examinee fluctuations and the examiner. These have to do with administration and not construction.
I would evaluate the way in which the test will be scored (test scoring), since it is always possible that subjectivity in scoring may undermine reliability. Machine-scoring can be used for multiple-choice items; however, I would check that scoring guidelines have been constructed so as to minimise measurement error.
In identifying any possible systematic measurement error, I would need to determine whether the test is consistently measuring something other than the trait it is supposed to measure. If this is happening, the obtained score can be written as:

X = T + Es + Eu

where:
X = obtained score
T = true score
Es = the systematic error (e.g. due to an anxiety subcomponent)
Eu = the collective effect of the unsystematic measurement errors
This is difficult to test for because the systematic error is typically unknown to the test developer.
To further determine the reliability of the test, I would administer the test twice (or more) to the same group and obtain a coefficient of reliability between the scores on each testing. If the test is perfectly reliable, each person's second score will be completely predictable from their first score. I would take into account those factors that could raise the second score (such as having seen the test before, practice, maturation, schooling, etc.). This is called test-retest reliability.
I might also use alternative-forms reliability by administering parallel sets of items of similar difficulty (i.e. an alternative form of the same test) to the same group of test subjects and correlating the scores on the two forms. This would show the reliability of the new test; however, it is more expensive than test-retest reliability because of the cost of developing and publishing a second form.
The test does not have to be administered twice to test reliability. Here are three further ways in which I could check the reliability of the new test:
split-half reliability: I would administer the test, divide it into two equal halves, and correlate the scores on one half with the scores on the other half. If the correlation is strong, reliability is high.
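A minimal sketch of this procedure (the item-score matrix is hypothetical), including the Spearman-Brown correction that estimates full-length reliability from the half-test correlation:

```python
import numpy as np

# Hypothetical item scores: rows = examinees, columns = items.
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
])

# Split into odd- and even-numbered items and correlate the half scores.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Spearman-Brown correction estimates the reliability of the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(f"split-half r = {r_half:.2f}, Spearman-Brown corrected = {r_full:.2f}")
```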
coefficient alpha: this is the mean of all split-half coefficients, corrected by the Spearman-Brown formula, and will allow me to determine the degree to which the items of the test measure the same construct.
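A minimal sketch of coefficient (Cronbach's) alpha computed directly from the variance-based formula (the item scores are hypothetical):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Alpha = k/(k-1) * (1 - sum of item variances / total score variance)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical item scores: rows = examinees, columns = items.
items = np.array([
    [4, 3, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])
print(f"coefficient alpha = {cronbach_alpha(items):.2f}")
```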
inter-scorer reliability: if the test involves judgmental scoring, such as a projective test, I will have a sample of tests independently scored by two or more examiners and will then correlate their scores to establish the typical degree of agreement between scorers.
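As a sketch (both sets of scores are hypothetical), the agreement between two examiners could be estimated by correlating their independent scores for the same sample of tests:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores assigned independently by two examiners to the same tests.
scorer_a = np.array([7, 5, 8, 6, 9, 4, 7])
scorer_b = np.array([6, 5, 9, 6, 8, 5, 7])

r, _ = pearsonr(scorer_a, scorer_b)
print(f"inter-scorer reliability r = {r:.2f}")
```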
If the test I am evaluating measures an unstable characteristic such as emotional reactivity, which is highly volatile, or is a speed test with items of uniformly low difficulty, or if there is restriction of range in the sample, I may choose not to use the traditional approaches to estimating reliability, because they could be misleading.
TEST CONSTRUCTION
Once reliability and validity have been established, I would look at the measurement scales to be used in this test to see if they best suit the type of test it is. There are a number of scales and formats which can be used:
The Nominal scale is the lowest measurement scale. Categories are arbitrary and do not designate "more" or "less" of anything. Nominal scales can be used for determining the mode, percentage values or chi-square.
The Ordinal scale is a measurement scale that allows for ranking, although it does not provide information about the relative strength of rankings. Ordinal scales can be used for determining the mode, percentages and chi-square, and also the median and percentile rank.
The Interval scale is a measurement scale that provides information about ranking and the relative strength of ranks. It is based on the assumption of equal-sized units or intervals for the underlying scale and provides a metric for gauging differences between rankings. It can be used for determining the mode, the mean, the standard deviation, the t-test, the F-test and the product-moment correlation.
The Ratio scale, which adds a true zero point, is rarely used in psychological testing.
It is important that the test does not have a ceiling or floor effect, which would produce an unrealistic number of exceptionally high or exceptionally low scores.
I will need to clarify how the test items were tested and ascertain why particular items were selected. Using item analysis, I would revise those items that do not add to the test and are not purposeful in targeting the specific traits and/or knowledge the test is trying to measure. Once this reassessment is complete, the test will have been revised and amended, with new items added and weak items omitted, and will be more accurate. To conduct an item analysis, I could use one of the following procedures (a sketch of the first appears after the list):
- Item-difficulty index
- Item-reliability index
- Item-validity index
- Item-characteristic curve
- Item-discrimination index
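As a minimal sketch of the item-difficulty index (the response matrix is hypothetical): the index is simply the proportion of examinees answering each item correctly.

```python
import numpy as np

# Hypothetical 0/1 response matrix: rows = examinees, columns = items.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
])

# Item difficulty p = proportion correct; values near 0.0 or 1.0 flag items
# that are too hard or too easy to discriminate among examinees.
difficulty = responses.mean(axis=0)
for i, p in enumerate(difficulty, start=1):
    print(f"item {i}: p = {p:.2f}")
```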
I would then cross-validate the test using a new sample of examinees.
THE USE OF STANDARDISED SCORES, INDICATING THE IMPORTANCE OF THE NORMATIVE GROUP
To make a raw score meaningful, it needs to be compared to a norm group. Scores from a representative sample of examinees are vitally important in interpreting raw test scores; without these representative scores, individual test scores would be nonsensical and useless.
So a norm group is essential, as it is the group used to establish a standard. The norm group's performance gives meaning to individual raw scores.
A norm group must contain a large, random, representative sample drawn from a cross-section of the population for which the test has been designed. Geographic location, diversity of background, social class and urban versus rural setting must be proportionately represented in the sample.
It can be very difficult and costly to obtain a large representative sample, so researchers sometimes decide to use smaller norm groups. In such instances stratified random sampling may be used: the population is stratified according to very specific variables such as age, sex, race, social class, educational level, etc. Even this reduced norm-group configuration can be hard to compile, in which case good-faith sampling will be used.
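As an illustrative sketch of proportionate stratified random sampling (the population frame, the strata and the 10% sampling fraction are all assumptions for illustration):

```python
import pandas as pd

# Hypothetical population frame with two stratification variables.
population = pd.DataFrame({
    "age_band": ["18-30", "18-30", "31-50", "31-50", "51+", "51+"] * 50,
    "sex": ["F", "M"] * 150,
    "person_id": range(300),
})

# Proportionate stratified random sample: draw 10% from every stratum,
# so each age-by-sex cell keeps its share of the population.
sample = population.groupby(["age_band", "sex"], group_keys=False).sample(
    frac=0.10, random_state=42
)
print(sample.groupby(["age_band", "sex"]).size())
```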
A standardised score is a derived score based on the standard deviation. The distribution of standardised scores has the same shape as the distribution of raw scores (because it maintains the relative magnitudes of the distances between successive values), which means that distortion is unlikely (percentile scores, by contrast, can be very distorting, especially at the extremes). With standardised scoring it is possible to express results from two different tests on a common scale and make direct comparisons.
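As a minimal sketch of this conversion (the norm-group scores, the raw score of 60 and the choice of a T-score scale are all hypothetical):

```python
import numpy as np

# Hypothetical norm-group raw scores.
norm_scores = np.array([42, 55, 48, 61, 50, 47, 53, 58, 45, 51])
mean, sd = norm_scores.mean(), norm_scores.std(ddof=1)

raw = 60  # an individual examinee's raw score

# z-score: distance from the norm-group mean in standard deviation units.
z = (raw - mean) / sd

# The same score re-expressed as a T-score (mean 50, SD 10).
t_score = 50 + 10 * z
print(f"z = {z:.2f}, T = {t_score:.1f}")
```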
Finally, I would check that the test is accompanied by a technical manual containing the information an evaluator will require. This should include the purpose for which the test was designed, as well as information about reliability, validity, item analysis, normative group data and cross-validation studies. The standardised procedures for administering and scoring the test must also be included, together with guidelines for interpreting the test.
SOURCES CONSULTED
Gregory, R.J. (2000). Psychological testing: History, principles, and applications (3rd ed.). Boston: Allyn and Bacon.