Scales of measurement (levels of measurement)
1. Nominal scales. Nominal scales contain qualitatively different categories to which we attach names rather than numerical meaning. Nominal scales are very common: e.g. sex, marital status. In nominal measurement, numbers are merely substituted for names or verbal labels and serve only to classify and identify persons. For example, in a football game each player is assigned a specific uniform number, which is used to identify the player throughout the game. Each player receives a different number and retains that number for the duration of the game.
2. Ordinal scales. An ordinal scale contains categories that can be ordered by rank on a continuum. The categories have a rudimentary arithmetic meaning, such as more or less of the quantity being measured. An ordinal scale does not imply anything about the arithmetical values other than that they are in order. Many scales in the social sciences are ordinal, and ordinal measurement is extremely common in psychology. For instance, in comparing scores on a test of knowledge we might be prepared to state with some confidence that a person who receives a high score is more knowledgeable than someone who receives a low score.
3. Interval scales. When the numbers attached to a variable imply not only that 3 is more than 2 and 2 is more than 1, but also that the size of the interval between 3 and 2 is the same as the interval between 2 and 1, they form an interval scale. The essential quality is that differences are equal. The numbers on an interval scale can be added or subtracted without distortion but cannot be meaningfully multiplied or divided, because the scale does not have a true zero. Temperature scales such as Fahrenheit and Centigrade are interval scales. Such scales are useful because they tell us not only about the ordering of measured items but also about the relative value of differences. For instance, if we measure the temperature in each month of the year in Leeds, any scale of measurement we use should tell us that July is hotter than December (even in Leeds), but also whether the difference between July and December temperatures is greater than, smaller than or the same as the difference between April and June temperatures.
4. Ratio scales. Ratio scales have a true zero and as a result the scale values represent multipliable quantities. Physical scales of weight and length are ratio scales, e.g. a 4' board is twice the length of a 2' board. The important property of a ratio scale is that ratios between numbers correspond to ratios between the attributes measured in those persons or objects.
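The interval/ratio distinction can be checked with a small worked example. The following sketch (in Python; the temperature values are arbitrary) shows why ratios are misleading on the Celsius interval scale but meaningful on the Kelvin ratio scale, which has a true zero.

```python
# Interval vs ratio scales: ratios are meaningful only with a true zero.
# 20 degrees C is not "twice as hot" as 10 degrees C, because 0 C is an
# arbitrary zero point; Kelvin, a ratio scale, gives the physical ratio.

def celsius_to_kelvin(c):
    return c + 273.15

naive_ratio = 20 / 10                                       # interval-scale "ratio"
true_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)  # ratio-scale ratio

print(naive_ratio)            # 2.0
print(round(true_ratio, 3))   # 1.035
```

The 4' vs 2' board comparison in the text works precisely because length, unlike Celsius temperature, has a true zero.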
4. Reliability: Introduction
Neither physical measurements nor psychological tests are completely consistent; if some attribute of a person is measured twice, the two scores are likely to differ. For example, if a person's height is measured twice in the same day, a value of 1.55 m may be obtained the first time and a value of 1.56 m the second. A person taking two forms of a general intelligence test may obtain an IQ score of 110 on one test and 114 on another. Thus scores on psychological tests and other measures show some inconsistency. On the other hand most measurements are not completely random. Methods of studying, defining, and estimating the consistency or inconsistency of test scores form the central focus of research and theory dealing with the reliability of test scores.
5. Classical Test Theory
A perfect measure would consistently assign numbers to the attributes of persons according to some well-specified rule. In practice, our measures are never perfectly consistent. Theories of test reliability have been developed to estimate the effects of inconsistency on the accuracy of psychological measurement. The basic starting point for almost all theories of test reliability is the idea that test scores reflect the influence of 2 sorts of factors:
1. Factors that contribute to consistency: stable characteristics of the individual or the attribute one is trying to measure.
2. Factors which contribute to inconsistency: features of the individual or the situation which can affect test scores, but which have nothing to do with the attribute being measured.
An individual's observed score (X) on a test is made up of two components, the true score (T) and an error score (E):
X = T + E
The true score is conceived as the mean score this person would obtain over a very large number of testing occasions, and the error score represents the sum total of all the effects on any one testing occasion that cause the observed score to depart from the true score; that is, T and E are the sums of the components contributing to consistency and inconsistency respectively. True score is thus given a particular meaning here: it represents the combination of all the factors that lead to consistency in measuring the attribute of interest.
Errors in measurement represent discrepancies between scores obtained on tests and the corresponding true scores:
E = X - T.
The goal of reliability theory is to estimate errors in measurement and to suggest ways of improving tests so that errors are minimised. The error in measurement can arise from a variety of sources - from the conditions of the experiment or from the observer - but it is assumed that over a number of repeated measurements these errors will be random and will tend to cancel one another out, so that the mean of all the measurements will be a more accurate measurement than any single measurement. The central assumption of reliability theory that measurement errors are essentially random does not mean that errors arise from random or mysterious processes. On the contrary, for any individual an error in measurement is not a completely random event. However, across a large number of individuals, the causes of measurement error are assumed to be so varied and complex that measurement errors act like random variables. Thus a theory which assumes that measurement errors are essentially random may provide a pretty good description of their effects.
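The claim that random errors cancel over repeated measurements can be illustrated with a short simulation (a sketch in Python; the true score of 100 and the error spread are invented values, not taken from any real test).

```python
import random

# Sketch of the classical model X = T + E: a hypothetical fixed true score
# plus random error. Over many testing occasions the errors tend to cancel,
# so the mean of repeated measurements approaches the true score T.
random.seed(42)

TRUE_SCORE = 100.0   # T: assumed stable attribute value (invented)
ERROR_SD = 5.0       # spread of the random error component E (invented)

def observe():
    """One observed score X = T + E, with E drawn as random noise."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

scores = [observe() for _ in range(10_000)]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))   # close to the true score of 100
```

Any single observed score may miss T by several points, yet the mean of many observations lands very close to it, exactly as the theory assumes.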
6. Methods of estimating reliability.
Test-retest Reliability
The simplest method of estimating reliability requires the administration of the same test to the same group of people on two different occasions: the reliability is estimated simply by the correlation between the 2 sets of scores. This method is known as the test-retest method. The rationale behind this method is that since each test is administered twice and every test is parallel with itself, differences between scores on the test and on the retest should be due solely to measurement error. And indeed this argument is probably true for physical measurements such as length measured by a ruler. Unfortunately this argument is often inappropriate for psychological measurement since it is often impossible to consider the second administration of a test a parallel measure to the first. Thus it may be inaccurate to treat test-retest correlation as a measure of reliability.
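Computationally, the test-retest estimate is just the Pearson correlation between the two administrations. A minimal sketch (in Python, with invented scores for six people):

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

first = [12, 15, 11, 18, 14, 16]    # invented scores, first administration
retest = [13, 14, 12, 17, 15, 17]   # invented scores, retest

print(round(pearson_r(first, retest), 2))
```

The same correlation computation serves for the alternate forms method below; only the source of the two score sets differs.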
Alternate forms reliability.
The alternate forms method for estimating reliability is, on the surface, the closest approximation to the method suggested by the parallel tests model. The key to this method is the development of alternate forms of the test that are, to the highest degree possible, equivalent in terms of content, response processes and statistical characteristics. The alternate form should be equally valid in all respects and should ideally have the same statistical characteristics (e.g. M, SD). The method involves administering one form of the test to a group at one point in time and the other form at another, and correlating the scores on the 2 forms. Alternate forms reliability is traditionally regarded as the best method of estimating reliability, first because it accounts for most of the major sources of error variance and second because it makes the fewest assumptions.
Split-half Methods
Split-half methods of estimating reliability provide a simple solution to the 2 practical problems which plague the alternate forms method:
- the difficulty in developing alternate forms
- the need for 2 separate test administrations.
The reasoning behind the split-half methods is quite straightforward. The simplest way to create 2 alternate forms of a test is to split the existing test in half, and use the two halves as alternate forms. The split-half method of estimating reliability thus involves administering a test to a group of individuals, splitting the test in half, and correlating the scores on one half with the scores on the other half. The correlation between these 2 split halves is used in estimating the reliability of the test.
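A minimal sketch of the procedure (in Python, with an invented 5-person by 6-item matrix of right/wrong scores). The final step applies the Spearman-Brown correction, the usual adjustment for the fact that each half is only half the length of the full test; the text above stops at the half-test correlation, so treat that step as a conventional addition.

```python
import statistics

# Invented item responses: rows = persons, columns = items (1 = correct).
responses = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Odd-even split: odd-numbered items form one half, even-numbered the other.
half_a = [sum(row[0::2]) for row in responses]
half_b = [sum(row[1::2]) for row in responses]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

r_halves = pearson_r(half_a, half_b)
# Spearman-Brown step-up from half-test to full-test reliability.
r_full = 2 * r_halves / (1 + r_halves)
print(round(r_halves, 2), round(r_full, 2))
```

The odd-even split shown here is only one of many possible splits, which is why different splits can give different reliability estimates for the same test.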
Internal consistency.
Internal consistency methods of estimating test reliability appear to be quite different from the methods presented so far, at least initially. Internal consistency methods estimate the reliability of a test based solely upon the number of items in the test (k) and the average intercorrelation among test items (rij). These 2 factors can be combined in the following formula to estimate the reliability of the test:
rxx = k(rij) / (1 + (k - 1)(rij))
where rxx is the reliability, k is the number of items, and rij is the average inter-item correlation. Thus the internal consistency method involves administering a test to a group of individuals, computing the correlations among all items, computing the average of those correlations, and using the above equation to estimate reliability. This form of reliability is in fact closely linked to the other ways of estimating reliability we have considered. For instance, coefficient alpha (Cronbach, 1951), which is the most widely used and most general form of internal consistency estimate, represents the mean reliability coefficient one would obtain from all possible split-halves. Hence the only real difference between the split-half and internal consistency methods is the unit of analysis: half-tests in the split-half method and individual items in the internal consistency methods.
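The computation can be sketched as follows (in Python, with an invented 5-person by 6-item matrix; rij is taken as the average of all pairwise inter-item Pearson correlations).

```python
import statistics
from itertools import combinations

# Invented item responses: rows = persons, columns = items (1 = correct).
responses = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

items = list(zip(*responses))     # transpose: one tuple of scores per item
k = len(items)
pairs = list(combinations(items, 2))
r_bar = sum(pearson_r(a, b) for a, b in pairs) / len(pairs)

# Reliability from k and the average inter-item correlation, as in the
# formula above: rxx = k(rij) / (1 + (k - 1)(rij)).
r_xx = (k * r_bar) / (1 + (k - 1) * r_bar)
print(round(r_bar, 2), round(r_xx, 2))
```

Note that this formula assumes the items have roughly equal variances; Cronbach's alpha proper is computed from item variances and is the more general estimate.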
7. Generalizability of test scores
Reliability theory tends to classify all the factors that may affect test scores into 2 components: true scores and random errors of measurement. This is useful when thinking about physical measurements but is not necessarily the best way to treat psychological measurements. The most serious weakness in applying classical reliability theory is the concept of error. As we have just seen, the factors that determine the amount of measurement error are different for internal consistency methods than for test-retest or alternate forms methods. We typically think of the reliability coefficient as a ratio of true score to true score plus error, but if the makeup of the true-score and error parts of a measure changes when we change our estimation procedures, something is seriously wrong.
The theory of generalizability presented by Cronbach et al. (1972) represents an alternate approach to measuring and studying the consistency of test scores. In reliability theory the central question is "How much random error is there in my measures?" In generalizability theory the focus is on our ability to generalize from one set of measures to a set of other plausible measures. The central question in generalizability theory is "What are the conditions under which I can generalize?" or "Under what sorts of conditions would I expect either similar or different results than the ones obtained here?" Generalizability theory attacks this question by systematically studying the many sources of consistency and inconsistency in test scores.
8. References
General references:
Kline, P. (1993). The handbook of psychological testing. London: Routledge.
Anastasi, A. (1988). Psychological testing. New York: Macmillan.
Guilford, J.P. (1971). Psychometric methods. New York: McGraw-Hill.
Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.
Rust, J. & Golombok, S. (1989). Modern psychometrics: the science of psychological assessment. London: Routledge.
Scales of measurement:
Stevens, S.S. (1946). On the theory of scales of measurement. Science, 103, 677-680.
Torgerson, W.S. (1958). Theory and methods of scaling. London: Wiley.
Measurement theory:
Guilford, J.P. (1971). Psychometric methods. New York: McGraw-Hill.
Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.
Reliability and validity:
Kerlinger, F.N. (1986). Foundations of behavioral research. London: Holt.
Nunnally, J.C. (1978). Psychometric theory. New York: McGraw-Hill.
Generalizability theory:
Shavelson, R.J., Webb, N.M., & Rowley, G.L. (1989). Generalizability theory. American Psychologist, 44, 922-932.