Language Testing Practical Task

Question 1: The three files provided for this task contain information about a grammar test (the test itself, the answer key and a set of data from 100 students who sat the test). Using this information, write a review of the test. The review should contain:

An analysis of the test items (at least item facility & discriminability)

An estimate of the internal consistency of the test

An analysis of the test and answer key from a qualitative perspective (supported by the information gathered in the item and test analyses)

A short discussion of why some of the items are not operating as hoped for (or why they are operating as expected)

A series of recommendations for the improvement of the test

(1675 words)

CONTENTS

Introduction

1. Analysis of the test items

1.1 Item Difficulty

1.2 Item Discriminability

1.3 Internal Consistency

2. Why are some of the items not operating as hoped?

3. An analysis of the test and answer key from a qualitative perspective

4. Recommendations

Bibliography

Appendix 1: Item Difficulty

Appendix 2: Item Discrimination

Appendix 3: Split Half Method

Introduction

Multiple choice (MC) tests are one of the most common methods of testing grammatical ability (Kitao, 1996) and, as such, have become an important element of assessment for many second language learners. In analysing the test items we shall employ three statistical measures: item difficulty, item discriminability and a comparison of mean scores, all of which aim to gauge how effectively the test discriminates between students. We shall then apply Pearson's product-moment correlation to estimate the internal consistency of the test, and use the data collected to judge whether the test is adequate for measuring grammatical ability. The question paper and answer key will then be analysed qualitatively, followed by a short discussion of the main issues. One of the main aims of this paper is to examine why some items appear to favour the lower scoring group of test takers over the higher scoring group. Finally, recommendations for improving the test will be given.

1. Analysis of the test items

1.1 Item Difficulty

The item facility index allows us to measure how difficult a question is for the test taker and is useful for ordering the items in a test, with the easier items at the beginning. Questions 1, 7, 11, 15 and 17 all have a facility value of 1 or close to 1, indicating very low difficulty: almost 100 per cent of the students answered them correctly. If the items are to discriminate between candidates' abilities, a much lower facility value is needed.

Table 1: Chart showing easiest questions from the grammar test
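The facility value described above is simply the proportion of candidates who answered an item correctly. A minimal sketch of the calculation follows; the 0/1 response matrix and the function name item_facility are assumptions made for illustration and are not taken from the test data.

```python
from typing import List


def item_facility(responses: List[List[int]]) -> List[float]:
    """Return the proportion of correct answers for each item.

    `responses` holds one row per student and one 0/1 entry per item
    (1 = correct). This layout is an assumption for illustration.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[i] for row in responses) / n_students
        for i in range(n_items)
    ]


# Hypothetical data: 4 students, 3 items (not the real test results).
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
facility = item_facility(scores)
print(facility)
```

In this invented data set the first item has a facility value of 1, which is the pattern reported for the easiest questions: everyone answers correctly, so the item cannot separate stronger from weaker candidates.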

A second function of the facility index is to identify the mid-range of the scores and to decide on an appropriate level of difficulty for the test items. According to Tuckman (1978, cited in Fulcher, 2002), a facility value between 0.33 and 0.67 is considered acceptable, yet as the results in table 2 (below) show, only 9 questions meet this criterion:

Table 2: Items meeting the acceptability criteria
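Checking items against the 0.33 to 0.67 band can be sketched as a simple filter. The facility values below are invented for illustration, the function name items_in_band is hypothetical, and treating the band as inclusive at both ends is an assumption.

```python
from typing import List


def items_in_band(facility: List[float],
                  low: float = 0.33,
                  high: float = 0.67) -> List[int]:
    """Return 1-based item numbers whose facility lies in [low, high].

    The inclusive boundaries are an assumption; the source only gives
    the range 0.33 to 0.67.
    """
    return [
        number
        for number, value in enumerate(facility, start=1)
        if low <= value <= high
    ]


# Hypothetical facility values for five items (not the real results).
facility_values = [1.0, 0.5, 0.25, 0.67, 0.9]
acceptable = items_in_band(facility_values)
print(acceptable)
```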

However, our second statistical test, the item discriminability index (table 4), highlights inconsistencies between the higher and lower scoring groups, as identified in the results below (table 3). These show that a greater proportion of the lower scoring group answered questions 3, 10 and 13 correctly (with question 20 showing similar scores for both groups), despite that group's lower mean score. Henning (1987) contends that such anomalies ‘certainly cause us to have different thoughts about the suitability of such [items]’ (p.51).

Table 3: Items receiving a greater lower group pass rate
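Items of this kind can be found by ranking candidates on total score, separating a higher and a lower group, and comparing the pass rate of each group item by item. The sketch below assumes the two groups are the top and bottom halves; the group size actually used in this analysis is not stated here, and the data and function names are hypothetical.

```python
from typing import List, Tuple

Row = List[int]


def split_groups(responses: List[Row],
                 fraction: float = 0.5) -> Tuple[List[Row], List[Row]]:
    """Rank students by total score and return (upper, lower) groups.

    `fraction` is the share of students placed in each group; 0.5 is an
    assumption, and other cut-offs could be substituted.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


def pass_rate(group: List[Row], item: int) -> float:
    """Proportion of a group answering the given (0-based) item correctly."""
    return sum(row[item] for row in group) / len(group)


def reversed_items(responses: List[Row], fraction: float = 0.5) -> List[int]:
    """Return 1-based numbers of items the lower group passed more often."""
    upper, lower = split_groups(responses, fraction)
    n_items = len(responses[0])
    return [
        i + 1
        for i in range(n_items)
        if pass_rate(lower, i) > pass_rate(upper, i)
    ]


# Hypothetical data: 4 students, 4 items (not the real test results).
scores = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
]
flagged = reversed_items(scores)
print(flagged)
```

Here only the fourth item is flagged: both students in the lower half answer it correctly, against one of the two in the upper half.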

1.2 Item Discriminability

The results from the calculations are given below; the full results are contained in appendix 2.

Table 4: Item discriminability results

According to the discriminability formula, 0.67 is considered the lowest acceptable level; the closer a value is to 1 the more acceptable the item, and the closer to 0 the less acceptable. As can be seen in table 4, only four questions (2, 6, 24 and 27) meet this criterion.
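The formula itself is not reproduced in this section. One formulation that ranges from 0 to 1 and is consistent with a 0.67 cut-off is the sample-separation index D = Hc / (Hc + Lc), where Hc and Lc are the numbers of correct answers in the higher and lower scoring groups. The sketch below assumes that formulation; the counts are invented and the function name is hypothetical.

```python
from typing import Dict, Tuple


def discriminability(high_correct: int, low_correct: int) -> float:
    """Sample-separation index D = Hc / (Hc + Lc).

    Assumed formulation, not confirmed by the source. Returns 0.0 when
    nobody in either group answered correctly, which is a convention
    chosen here to avoid division by zero.
    """
    total = high_correct + low_correct
    if total == 0:
        return 0.0
    return high_correct / total


# Hypothetical counts per item: (correct in higher group, correct in lower group).
counts: Dict[int, Tuple[int, int]] = {
    2: (20, 8),
    3: (10, 15),
}
d_values = {item: discriminability(h, l) for item, (h, l) in counts.items()}
acceptable = [item for item, d in d_values.items() if d >= 0.67]
print(d_values, acceptable)
```

Under this formulation a value of 0.67 means the higher group produced twice as many correct answers as the lower group, and any value below 0.5 means the lower group answered correctly more often.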

As observed in the previous section, the results from ...