Scales of measurement (levels of measurement)
1. Nominal scales. Nominal scales contain qualitatively different categories to which we attach names rather than numerical meaning. Nominal scales are very common: e.g. sex, marital status. In nominal measurement, numbers are merely substituted for names or verbal labels and serve only to classify and identify persons. For example, in a football game each player is assigned a specific uniform number, which is used to identify the player throughout the game. Each player receives a different number and retains that number for the duration of the game.
2. Ordinal scales. An ordinal scale contains categories that can be ordered by rank on a continuum. The categories have a rudimentary arithmetic meaning, such as more or less of the quantity being measured. An ordinal scale does not imply anything about the arithmetical values other than that they are in order. Many scales in the social sciences are ordinal, and ordinal measurement is extremely common in psychology. For instance, in comparing scores on a test of knowledge we might be prepared to state with some confidence that a person who receives a high score is more knowledgeable than someone who receives a low score.
3. Interval scales. When the numbers attached to a variable imply not only that 3 is more than 2 and 2 is more than 1, but also that the size of the interval between 3 and 2 is the same as the interval between 2 and 1, they form an interval scale. The essential quality is that differences are equal. The numbers on an interval scale can be added or subtracted without distortion but cannot be meaningfully multiplied or divided, because the scale does not have a true zero. Temperature scales such as Fahrenheit and Centigrade are interval scales. Such scales are useful because they tell us not only about the ordering of measured items but also about the relative value of differences. For instance, if we measure the temperature in each month of the year in Leeds, any scale of measurement we use should tell us that July is hotter than December (even in Leeds), but also whether the difference between July and December temperatures is greater than, smaller than or the same as the difference between April and June temperatures.
4. Ratio scales. Ratio scales have a true zero and as a result the scale values represent multipliable quantities. Physical scales of weight and length are ratio scales, e.g. a 4' board is twice the length of a 2' board. The important property of a ratio scale is that ratios between numbers correspond to ratios between the attributes measured in those persons or objects.
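The interval/ratio distinction can be checked with a small worked example. The following sketch (in Python; the temperature values are arbitrary) shows why ratios are misleading on the Celsius interval scale but meaningful on the Kelvin ratio scale, which has a true zero.

```python
# Interval vs ratio scales: ratios are meaningful only with a true zero.
# 20 degrees C is not "twice as hot" as 10 degrees C, because 0 C is an
# arbitrary zero point; Kelvin, a ratio scale, gives the physical ratio.

def celsius_to_kelvin(c):
    return c + 273.15

naive_ratio = 20 / 10                                       # interval-scale "ratio"
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)  # ratio-scale ratio

print(naive_ratio)            # 2.0
print(round(true_ratio, 3))   # 1.035
```

The 4' vs 2' board comparison in the text works precisely because length, unlike Celsius temperature, has a true zero.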
4. Reliability: Introduction
Neither physical measurements nor psychological tests are completely consistent; if some attribute of a person is measured twice, the two scores are likely to differ. For example, if a person's height is measured twice in the same day, a value of 1.55 m may be obtained the first time and a value of 1.56 m the second. A person taking two forms of a general intelligence test may obtain an IQ score of 110 on one test and 114 on another. Thus scores on psychological tests and other measures show some inconsistency. On the other hand most measurements are not completely random. Methods of studying, defining, and estimating the consistency or inconsistency of test scores form the central focus of research and theory dealing with the reliability of test scores.
5. Classical Test Theory
A perfect measure would consistently assign numbers to the attributes of persons according to some well-specified rule. In practice, our measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of psychological measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of 2 sorts of factors:
1. Factors that contribute to consistency: stable characteristics of the individual or the attribute one is trying to measure.
2. Factors which contribute to inconsistency: features of the individual or the situation which can affect test scores, but which have nothing to do with the attribute being measured.
An individual's observed score (X) on a test is made up of two components, the true score (T) and an error score (E):
X = T + E
The true score is conceived as the mean score this person would obtain over a very large number of testing occasions, and the error score represents the sum total of all the effects on any one testing occasion that cause the observed score to depart from the true score; that is, T and E are the sums of the components contributing to consistency and inconsistency respectively. True score is thus given a particular meaning here: it represents the combination of all the factors that lead to consistency in measuring the attribute of interest.
Errors in measurement represent discrepancies between scores obtained on tests and the corresponding true scores:
E = X - T.
The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimised. The error in measurement can arise from a variety of sources - from the conditions of the experiment or from the observer - but it is assumed that over a number of repeated measurements these errors will be random and will tend to cancel one another out, so that the mean of all the measurements will be a more accurate measurement than any single measurement. The central assumption of reliability theory that measurement errors are essentially random does not mean that errors arise from random or mysterious processes. On the contrary, for any individual an error in measurement is not a completely random event. However, across a large number of individuals, the causes of measurement error are assumed to be so varied and complex that measurement errors act like random variables. Thus a theory which assumes that measurement errors are essentially random may provide a pretty good description of their effects.
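The claim that random errors cancel over repeated measurements can be illustrated with a short simulation (a sketch in Python; the true score of 100 and the error spread are invented values, not taken from any real test).

```python
import random

# Sketch of the classical model X = T + E: a hypothetical fixed true score
# plus random error. Over many testing occasions the errors tend to cancel,
# so the mean of repeated measurements approaches the true score T.
random.seed(42)

TRUE_SCORE = 100.0   # T: assumed stable attribute value (invented)
ERROR_SD = 5.0       # spread of the random error component E (invented)

def observe():
    """One observed score X = T + E, with E drawn as random noise."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

scores = [observe() for _ in range(10_000)]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))   # close to the true score of 100
```

Any single observed score may miss T by several points, yet the mean of many observations lands very close to it, exactly as the theory assumes.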
6. Methods of estimating reliability.
Test-retest Reliability
The simplest method of estimating reliability requires the administration of the same test to the same group of people on two different occasions: the reliability is estimated simply by the correlation between the 2 sets of scores. This method is known as the test-retest method. The rationale behind this method is that since each test is administered twice and every test is parallel with itself, differences between scores on the test and on the retest should be due solely to measurement error. And indeed this argument is probably true for physical measurements such as length measured by a ruler. Unfortunately this argument is often inappropriate for psychological measurement since it is often impossible to consider the second administration of a test a parallel measure to the first. Thus it may be inaccurate to treat test-retest correlation as a measure of reliability.
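Computationally, the test-retest estimate is just the Pearson correlation between the two administrations. A minimal sketch (in Python, with invented scores for six people):

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

first = [12, 15, 11, 18, 14, 16]    # invented scores, first administration
retest = [13, 14, 12, 17, 15, 17]   # invented scores, retest

print(round(pearson_r(first, retest), 2))
```

The same correlation computation serves for the alternate forms method below; only the source of the two score sets differs.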
Alternate forms reliability.
The alternate forms method for estimating reliability is, on the surface, the closest approximation to the method suggested by the parallel tests model. The key to this method is the development of alternate forms of the test that are, to the highest degree possible, equivalent in terms of content, response processes and statistical characteristics. The alternate form should be equally valid in all respects and should ideally have the same statistical characteristics (e.g. M, SD). The method involves administering one form of the test to a group at one point in time and the other form at another, and correlating the scores on the 2 forms. Alternate forms reliability is traditionally regarded as the best method of estimating reliability, first because it accounts for most of the major sources of error variance and second because it makes the fewest assumptions.
Split-half Methods
Split-half methods of estimating reliability provide a simple solution to the 2 practical problems which plague the alternate forms method:
- the difficulty in developing alternate forms
- the need for 2 separate test administrations.
The reasoning behind the split-half methods is quite straightforward. The simplest way to create 2 alternate forms of a test is to split the existing test in half, and use the two halves as alternate forms. The split-half method of estimating reliability thus involves administering a test to a group of individuals, splitting the test in half, and correlating the scores on one half with the scores on the other half. The correlation between these 2 split halves is used in estimating the reliability of the test.
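A minimal sketch of the procedure (in Python, with an invented 5-person by 6-item matrix of right/wrong scores). The final step applies the Spearman-Brown correction, the usual adjustment for the fact that each half is only half the length of the full test; the text above stops at the half-test correlation, so treat that step as a conventional addition.

```python
import statistics

# Invented item responses: rows = persons, columns = items (1 = correct).
responses = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Odd-even split: odd-numbered items form one half, even-numbered the other.
half_a = [sum(row[0::2]) for row in responses]
half_b = [sum(row[1::2]) for row in responses]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

r_halves = pearson_r(half_a, half_b)
# Spearman-Brown step-up from half-test to full-test reliability.
r_full = 2 * r_halves / (1 + r_halves)
print(round(r_halves, 2), round(r_full, 2))
```

The odd-even split shown here is only one of many possible splits, which is why different splits can give different reliability estimates for the same test.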
Internal consistency.
Internal consistency methods of estimating test reliability appear to be quite different from the methods presented so far, at least initially. Internal consistency methods estimate the reliability of a test based solely upon the number of items in the test (k) and the average intercorrelation among test items (rij). These 2 factors can be combined in the following formula to estimate the reliability of the test:
rxx = k(rij) / (1 + (k - 1)(rij))
where rxx is the reliability, k is the number of items, and rij is the average inter-item correlation. Thus the internal consistency method involves administering a test to a group of individuals, computing the correlations among all items, computing the average of those correlations, and using the above equation to estimate reliability. This form of reliability is in fact closely linked to the other ways of estimating reliability we have considered. For instance, coefficient alpha (Cronbach, 1951), which is the most widely used and most general form of internal consistency estimate, represents the mean reliability coefficient one would obtain from all possible split-halves. Hence the only real difference between the split-half and internal consistency methods is the unit of analysis: half-tests in the split-half method and individual items in the internal consistency methods.
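The computation can be sketched as follows (in Python, with an invented 5-person by 6-item matrix; rij is taken as the average of all pairwise inter-item Pearson correlations).

```python
import statistics
from itertools import combinations

# Invented item responses: rows = persons, columns = items (1 = correct).
responses = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

items = list(zip(*responses))     # transpose: one tuple of scores per item
k = len(items)
pairs = list(combinations(items, 2))
r_bar = sum(pearson_r(a, b) for a, b in pairs) / len(pairs)

# Reliability from k and the average inter-item correlation, as in the
# formula above: rxx = k(rij) / (1 + (k - 1)(rij)).
r_xx = (k * r_bar) / (1 + (k - 1) * r_bar)
print(round(r_bar, 2), round(r_xx, 2))
```

Note that this formula assumes the items have roughly equal variances; Cronbach's alpha proper is computed from item variances and is the more general estimate.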
7. Generalizability of test scores
Reliability theory tends to classify all the factors that may affect test scores into 2 components: true scores and random errors of measurement. This is useful when thinking about physical measurements but is not necessarily the best way to treat psychological measurements. The most serious weakness in applying classical reliability theory is the concept of error. As we have just seen, the factors that determine the amount of measurement error are different for internal consistency methods than for test-retest or alternate forms methods. We typically think of the reliability coefficient as a ratio of true score to true score plus error, but if the makeup of the true-score and error parts of a measure changes when we change our estimation procedures, something is seriously wrong.
The theory of generalizability presented by Cronbach et al. (1972) represents an alternate approach to measuring and studying the consistency of test scores. In reliability theory the central question is "How much random error is there in my measures?" In generalizability theory the focus is on our ability to generalize from one set of measures to a set of other plausible measures. The central question in generalizability theory is "What are the conditions under which I can generalize?" or "Under what sorts of conditions would I expect either similar or different results than the ones obtained here?" Generalizability theory attacks this question by systematically studying the many sources of consistency and inconsistency in test scores.
8. References
General references:
Kline, P. (1993). The handbook of psychological testing. London: Routledge.
Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Guilford, J.P. (1971). Psychometric methods. New York: McGraw-Hill.
Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.
Rust, J. & Golombok, S. (1989). Modern psychometrics: the science of psychological assessment. London: Routledge.
Scales of measurement:
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Torgerson, W.S. (1958). Theory and methods of scaling. London: Wiley.
Measurement theory:
Guilford, J.P. (1971). Psychometric methods. New York: McGraw-Hill.
Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.
Reliability and validity:
Kerlinger, F.N. (1986). Foundations of behavioral research. London: Holt.
Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.
Generalizability theory:
Shavelson, R.J., Webb, N.M., & Rowley, G.L. (1989). Generalizability theory. American Psychologist, 44, 922-932.