Table 3: Items receiving a greater lower group pass rate
1.2 Item Discriminability
(The results of the calculations are given below; the full results are contained in Appendix 2.)
Table 4: Item discriminability results
According to the discriminability formula, .67 is considered the lowest acceptable level, with scores closer to 1 being the most acceptable and scores closer to 0 the least. As can be seen in Table 4, only four questions (2, 6, 24 and 27) meet this criterion.
As observed in the previous section, the item facility index alone does not provide all the information needed to analyse the results. The item discriminability analysis therefore allows the questions to be contrasted with one another by differentiating each item’s capacity to test grammar knowledge. Questions 3, 10, 13 and 20 all fall significantly below the acceptable level. Question 2, when analysed by this method, showed a discriminability score of .69 despite a relatively high item facility index (.70). Similarly, questions 6, 24 and 27 fell outside the recommended facility range of .33 to .67 but, according to the item discriminability formula, were discriminating questions. Carrol and Hall (1985) (cited in Fulcher 2002) recommend redrafting and re-examining any questions outside this range.
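The discriminability check described above can be sketched in Python. This is only an illustration under a stated assumption: the formula itself did not survive in this text, so the sketch assumes the common form D = Hc / (Hc + Lc), where Hc and Lc are the numbers of correct responses in the high and low groups. The function name and the choice of three sample items are hypothetical; the counts are copied from the totals rows in Appendix 2.

```python
def discriminability(high_correct, low_correct):
    """Assumed form of the index: D = Hc / (Hc + Lc).

    Returns None when nobody in either group answered correctly,
    because the ratio is undefined in that case.
    """
    total = high_correct + low_correct
    if total == 0:
        return None
    return high_correct / total


# (Hc, Lc) for three items, copied from the Appendix 2 totals rows.
counts = {2: (30, 13), 3: (12, 17), 6: (28, 11)}
THRESHOLD = 0.67  # lowest acceptable level cited in the text

results = {item: discriminability(hc, lc) for item, (hc, lc) in counts.items()}

for item, d in results.items():
    status = "meets" if d >= THRESHOLD else "falls below"
    print(f"Item {item}: D = {d:.3f} ({status} the {THRESHOLD} criterion)")
```

Under this assumed formula, items 2 and 6 clear the .67 criterion while item 3 does not, which agrees with the pattern reported for those three items.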
1.3 Internal Consistency
The third statistical test to be applied here is for internal consistency. As the data in Appendix 3 show, splitting the test into two halves gives mean scores of 11.8 and 9.8 for the two halves. A Pearson product-moment correlation between the two sets of half scores yielded a coefficient of 0.40, an estimate of the half-test reliability of our test. Using the Spearman-Brown Prophecy Formula (below) we corrected for the reduction in length:
Applying this calculation we can say:
\[
\text{Reliability} = \frac{2 \times 0.40}{1 + 0.40} = \frac{0.80}{1.40} \approx 0.57
\]
This implies that roughly .43 of the score variance is attributable to measurement error, which is an unsatisfactory result if we take .70 as the minimum acceptable reliability. Taken alongside the facility and item discriminability tests, these results indicate the need to revise certain items.
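The split-half procedure above can be sketched in Python as follows. It is a minimal illustration, not the calculation actually used for Appendix 3: the half-test scores in the example are hypothetical, the function names are invented for the sketch, and only the 0.40 correlation is taken from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(r_half, factor=2):
    """Predicted reliability when the test is lengthened by `factor`.

    With factor=2 this reduces to 2r / (1 + r), the split-half correction.
    """
    return factor * r_half / (1 + (factor - 1) * r_half)


# Hypothetical half-test scores for five students (not the study's data).
first_half = [12, 10, 14, 9, 13]
second_half = [10, 9, 11, 10, 12]
r_half = pearson(first_half, second_half)

# The half-test correlation reported in the text is 0.40.
full_reliability = spearman_brown(0.40)
print(round(full_reliability, 2))  # 0.57
```

The `factor` parameter is included only to show that the split-half correction is the special case of doubling the test length.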
2. Why are some of the items not operating as hoped?
First, the analysis which follows shows that questions 3, 10, 13 and 20 contain errors which invalidate them as test questions, a problem reflected in the internal consistency result. Although they met the acceptability criteria of the item difficulty test, they failed the item discriminability test. These errors caused many higher-level students to choose the wrong answer, which skewed the average scores in favour of the lower group and lowered the discriminability values. Second, as we saw in Table 1, some of the questions were too easy for the group, with virtually all the students answering correctly.
3. An analysis of the test and answer key from a qualitative perspective.
In this section we will look at the test from a qualitative perspective, particularly regarding three areas: relevance, unambiguousness and face validity (Hughes, 1990).
Our first question, then, concerns whether the MC test correctly measures that which it purports to measure: grammar. From this perspective it is clear that question 26 is not a test of grammar but a test of lexis:
26) Writing this report is ________ very consuming.
a) seeming b) appearing c) looking d) proving.
Henning writes that such mistakes ‘lack validity as measures of what they are purported to measure’ (ibid. p.44).
Secondly, we should ensure that all the items are unambiguously stated (Hughes 1990). There are a number of questions in the test which offer more than one grammatically acceptable answer (see questions 10, 13, 20). One example is question 10:
10) As a result of his lectures, she__________ influenced by this new approach to teaching.
a) was b) shall be c) has d) has been
We can see that two of the options could acceptably be inserted. Such ambiguity must be avoided in MC tests (Fulcher, 2002): distractors should serve solely to test ability and competence, and should not require subjective decisions about which responses are valid or invalid (Annable, 2006).
Similarly, question 13 was open to subjective opinion:
13) His examination results were not as bad as they ______ been.
a) need have b) might have c) could have d) should have
Any of three answers (b, c or d) would be grammatically acceptable, yet the answer sheet gives d as the only correct answer.
Thirdly, we can evaluate a test by its face validity (Hughes, 1989). If a test contains spelling mistakes, students may be critical of it, especially given that it is an English test. Question 4, which has a spelling mistake (‘yeserday’), is one such example; another is the incorrect answer given for question 3 (which the teacher could not present as a feedback example):
3) My results are the same _______ yours.
a) that b) was c) than d) like
The answer key gives b as the correct answer, but this is clearly wrong, as are all the other alternatives. In terms of face validity, the item offers an inaccurate sample of English to students who are themselves being marked on precision. The subjective choice required by question 13 likewise makes for a poor multiple-choice question.
4. Recommendations
Our analysis of the test and question sheet highlights the importance of both a statistical and a qualitative assessment. Owing to ambiguously worded and sometimes wrong questions and answers, questions 3, 10, 13 and 20 favoured the weaker group. Question 3’s unreliability was due to a spelling mistake in choice b), which should read ‘as’, not ‘was’; the MC options should be amended accordingly. Similarly, the options for question 13 should be changed to provide suitable distractors. Suggested options could be:
a) has b) having c) might having d) might has
Question 26, which tests lexis rather than grammar, could be improved by offering an alternative set of MC options based on tense, i.e. prove, proving, proves, proved. According to the discriminability formula, only questions 2, 6, 24 and 27 were suitable, which means that the remaining questions need to be made more difficult if the test is to provide an accurate measurement of the range of individual differences. In particular, the results from the item difficulty index indicate that items 1, 7, 11, 15 and 17 need to be eliminated or radically changed. Alternatively, Henning (1987) suggests that they could all be placed at the beginning of the test and used as ‘warm up’ questions that are not scored as part of the test.
Lastly, although the grammar test failed to meet the statistical criterion for internal consistency, it is worth noting that a test may have scientific validity yet not receive validation from the students. For this reason the ‘bottom line’ for Bachman (1990: 288) is whether students find the test useful and whether it allows for effective washback.
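The facility screening behind these recommendations can be sketched in Python. This is a minimal illustration, assuming that the facility index is the proportion of test takers who answer an item correctly and using the .33 to .67 range cited earlier in the text. The item labels, the counts and the group size of 60 are hypothetical and are not the study's data.

```python
def facility(correct, total):
    """Facility index, assumed here to be the proportion answering correctly."""
    return correct / total


def classify(index, low=0.33, high=0.67):
    """Label an item against the recommended facility range."""
    if index > high:
        return "too easy"
    if index < low:
        return "too hard"
    return "acceptable"


# Hypothetical counts of correct answers out of 60 test takers.
TOTAL_TAKERS = 60
correct_counts = {"A": 58, "B": 30, "C": 12}

labels = {
    item: classify(facility(count, TOTAL_TAKERS))
    for item, count in correct_counts.items()
}

for item, label in labels.items():
    print(f"Item {item}: {label}")
```

Items labelled "too easy" in such a screen are the candidates for removal, revision, or unscored use as warm-up questions.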
Bibliography
Annable (2006). Lancaster University. http://www.ling.lancs.ac.uk/groups/crile/docs/crile44annable.pd
Bachman, L. (1990). Fundamental Considerations in Language Testing. Oxford University Press.
Brown, J. D. (1996). Testing in Language Programs. Upper Saddle River, NJ: Prentice Hall.
Celce-Murcia, M. & Larsen-Freeman, D. (1999). The Grammar Book: An ESL/EFL Teacher’s Course. Heinle and Heinle Publishers.
Fulcher, G. (2002). Language Testing. University of Surrey.
Henning, G. (1987). A Guide to Language Testing: Development Evaluation Research. Newbury House Publishers.
Hughes, A. (1990). Testing for Language Teachers. Cambridge Handbooks for Language Teachers. Cambridge University Press.
Kitao, S. (1986). The Internet TESL Journal, Vol. II, No. 6, June 1
Appendix 1: Item Difficulty
Appendix 2: Item Discrimination
Highlighted numbers represent a wrong answer: e.g. 1 2 3 4 5
TOP GROUP (items 1 to 30)
Student   Score
T1        27
T2        27
T3        26
T4        26
T5        26
T6        26
T7        26
T8        25
T9        25
T10       25
T11       25
T12       25
T13       24
T14       24
T15       24
T16       24
T17       24
T18       24
T19       24
T20
T21
T22
T23
T24
T25
Totals:
Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Incorrect: - - 18 - - 2 2 2 1 18 - 2 17 4 - 5
Correct 30 30 12 30 30 28 28 28 29 12 30 28 13 26 30 25
Item: 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Incorrect: 2 3 2 14 1 4 7 3 1 12 1 1 14 3
Correct: 28 27 28 16 29 26 23 28 29 18 29 29 16 28
BOTTOM GROUP (items 1 to 30)
Student   Score
T1        9
T2        11
T3        13
T4        14
T5        15
T6        15
T6        15
T7        16
T8        16
T9        16
T10       17
T11       17
T12       17
T13       17
T14       17
T15       17
T16       18
T17       18
T18       18
T19       18
T20       19
T21       19
T22       19
T23       19
T24       19
T25       19
Totals:
Item 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Incorrect: 0 17 13 7 2 19 6 15 10 10 0 15 9 10 3 13
Correct 30 13 17 23 28 11 24 15 20 20 30 15 21 20 27 17
Item: 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Incorrect: 5 15 13 15 10 15 13 18 11 20 16 15 18 17
Correct: 25 15 17 15 20 15 17 12 19 10 14 15 12 13
Using the item discriminability formula,
D = discriminability
Hc = the number of correct responses in the high group
Lc = the number of correct responses in the low group
Internal Consistency: Split half
Source:
See Appendix 1 for results.
Brown (1996) asks ‘what is acceptable?’ It seems that the answer depends on a number of factors, including the number of questions, the ‘fit’ of the test and how the test is applied.