Ensuring that the sample members are selected at random is not the only requirement of a good sample. The sample size is also important: not much information can be obtained from a small sample of, say, three words, while a sample of five hundred would be impractical to collect. I will therefore use a sample size of fifty words per population, 100 in total, which I think is an ideal size on which to base my conclusions.
When I have collected my sample I will organise the data into a frequency table. This is to see if any trends are visible, and it will make it easier to transfer the information into a spreadsheet. While the data is in the spreadsheet package I will be able, with the aid of cell calculations, to check the accuracy of my calculations of the mean and variance for both samples. I will have the fields x, f, x², fx and fx².
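This frequency-table step can be sketched in Python; the word lengths below are made-up illustrative values, not my actual sample:

```python
from collections import Counter

# Made-up word lengths standing in for a sample (illustrative only)
lengths = [3, 5, 4, 7, 5, 3, 6, 4, 5, 2]

freq = Counter(lengths)  # maps word length x -> frequency f

# Print the columns x, f, fx and fx², as in the planned spreadsheet
print(" x   f   fx  fx²")
for x in sorted(freq):
    f = freq[x]
    print(f"{x:2d} {f:3d} {f*x:4d} {f*x*x:5d}")
```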
Precautions and Assumptions
When choosing the words for my sample I may encounter a problem. Although I have chosen to investigate word length, it may in fact be difficult to establish how long a word really is. This is the case for words which are contracted, e.g. "don't": is this a four-letter word (dont), five letters (do not), or a two-letter and a three-letter word? If this happens I will simply exclude the word in question and randomly select, with my calculator, another word from the same line.
When I do my calculations, especially for the mean and standard deviation, I must always keep to two decimal places. I must not round up or down beyond this limit, because I expect there will only be a small difference between the mean word lengths of the adult and children's books; if I round too much this may make my results inaccurate.
When I compare the two samples I need to consider the fact that decimal places may not be meaningful, as it is impossible to have one tenth of a word, for example.
Raw data
Please find on a separate sheet the raw (unorganised) data. It shows the fifty randomly selected words with the page number, line number and word length for each of the two books. As I suspected, certain word lengths are duplicated, so I have converted the information into frequency tables as they are easier to view.
Organized data
Please find on a separate sheet my spreadsheet containing my organized and grouped data.
Calculations
Notation Used
Mean
x̄ = Σx / n
The mathematical average of a range of numbers (calculated by dividing the sum total of all the items in the range by the total number of items in the range).
Variance
s² = Σx²/n − x̄²
A measure of dispersion of a set of data points around their mean value. The mathematical expectation of the squared deviations from the mean. The square root of the variance is the standard deviation.
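Both definitions translate directly into code. A minimal Python sketch with made-up numbers (not my sample data):

```python
# Illustrative data set, not taken from either book
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

mean = sum(data) / n                               # x̄ = Σx / n
variance = sum(x * x for x in data) / n - mean**2  # s² = Σx²/n − x̄²
std_dev = variance ** 0.5                          # standard deviation

print(mean, variance, std_dev)  # 5.0 4.0 2.0
```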
With my sample I am now able to calculate the mean and standard deviation of both categories. As described above, the mean is simply the sum of the total word lengths divided by the size of the sample:
Adult book: x̄ = 282/50 = 5.64
Children's book: x̄ = 245/50 = 4.90
Using these means I can now calculate the standard deviation of each sample; as described above, the variance is a measurement of the dispersal of values around the mean. The standard deviation is simply the square root of the variance.
Adult book
s² = Σfx²/Σf − x̄² = (1874/50) − 5.64² = 5.67, so s = 2.38

Children's book
s² = Σfx²/Σf − x̄² = (1463/50) − 4.90² = 5.25, so s = 2.29
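These grouped-data calculations can be double-checked with a short Python sketch using the totals from my frequency tables:

```python
# Column totals taken from my frequency tables
n = 50
adult_sum_fx, adult_sum_fx2 = 282, 1874
child_sum_fx, child_sum_fx2 = 245, 1463

adult_mean = adult_sum_fx / n                  # Σfx / Σf = 5.64
adult_var = adult_sum_fx2 / n - adult_mean**2  # Σfx²/Σf − x̄² ≈ 5.67

child_mean = child_sum_fx / n                  # 4.90
child_var = child_sum_fx2 / n - child_mean**2  # ≈ 5.25

# The standard deviations are the square roots of the variances
print(round(adult_var**0.5, 2), round(child_var**0.5, 2))  # 2.38 2.29
```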
Now that I have calculated the means and standard deviations by hand, I will use formulae in a spreadsheet package to double-check their accuracy. The spreadsheet shows the organized data. To add up all of the values in any given column I simply typed "=SUM(C1:C12)" into a cell and the computer automatically added up all of the values from cell C1 to C12. As you can see from the spreadsheet, the formula used to calculate the mean is the total of the "fx" column divided by the total of the "f" column. For the variance I divided the sum of the "fx²" column by the total of the "f" column and subtracted the mean². I have given my results correct to two decimal places for improved accuracy rather than just one. Using these formulae in a spreadsheet package was very quick and accurate; it also eliminated the human error (e.g. a mistype on the calculator) that can occur when calculating by hand.
Confidence interval
μ = x̄ ± z σx̄
A confidence interval is an interval used to estimate the likely size of a population parameter. It gives an estimated range of values (calculated from a given set of sample data) that has a specified probability of containing the parameter being estimated. Most commonly used are the 95% and 99% confidence intervals that have 0.95 and 0.99 probabilities respectively of containing the parameter. The width of the confidence interval gives some indication about how uncertain we are about the unknown population parameter. Confidence intervals provide a range of plausible values for the unknown parameter.
I will use the values for the sample mean and standard deviation calculated previously to obtain two confidence intervals for each sample. I do not know the population variance or standard deviation, so I will need to use my sample variance to find what is called an unbiased estimate of the population variance. This is found using the formula:
σ²n−1 = n s² / (n − 1), where s² is the sample variance
The standard error of the mean is the (estimated) population standard deviation divided by the square root of the number of data values. It measures the extent to which we expect the sample mean to differ (±) from the population mean; the more data we have, the smaller this range becomes. Using the unbiased estimate of the variance in place of the sample variance keeps the confidence interval free from bias. I will have to calculate the standard error for each of my samples.
Population            Sample
μ = ?                 n = 50
σ² = 5.786            x̄ = 5.64
σ = 2.405             s = 2.38 (biased)
                      s² = 5.67 (biased)

σ²n−1 = (n/(n−1)) s² = (50/49) × 5.67 = 5.786
σx̄ = σ/√n = 2.405/√50 = 0.340
μ(x̄) = μ
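The unbiased estimate and standard error for the adult book can be reproduced in Python from the sample variance above:

```python
import math

n = 50
sample_var = 5.67  # adult-book sample variance (biased)

unbiased_var = n / (n - 1) * sample_var             # σ²(n−1) = n s² / (n − 1)
std_error = math.sqrt(unbiased_var) / math.sqrt(n)  # σ / √n

print(round(unbiased_var, 3), round(std_error, 3))  # 5.786 0.34
```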
Central Limit theorem
z = (x̄ − μx̄)/σx̄   or   z = (x − μ)/σ
The central limit theorem demonstrates that for large enough samples the distribution of the sample mean approximates a normal curve, amazingly, regardless of the shape of the distribution from which it is sampled. The larger the sample size (n), the better the approximation to the normal. I think it is safe to say that a sample size of fifty satisfies this theorem, therefore the following calculations are valid.
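The theorem can be illustrated with a small simulation; the skewed "population" of word lengths below is invented purely for the demonstration:

```python
import random

random.seed(1)  # fixed seed so the demonstration is repeatable

# An invented, deliberately skewed population of word lengths
population = [1] * 5 + [3] * 20 + [5] * 15 + [9] * 10
pop_mean = sum(population) / len(population)  # 4.6

# Draw 2000 samples of size 50 and record each sample mean
means = [sum(random.choices(population, k=50)) / 50 for _ in range(2000)]

# Despite the skewed population, the sample means cluster closely
# around the population mean, as the central limit theorem predicts
print(round(pop_mean, 2), round(sum(means) / len(means), 2))
```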
Normal Distribution
Most of the values are near the mean, which is in the middle, and the distribution is roughly symmetrical; it is often described as "bell-shaped". There is in fact a suitable mathematical model to describe this. The approach is pursued by first standardising the data and then making the area under each histogram independent of the total frequency. A standardised data set has a mean of 0 and a standard deviation of 1.
Suppose a data set has mean μ and standard deviation σ. The standardised value of an observation x is z, where z = (x − μ)/σ.
When a data set has been standardised, the particular normal curve which models the data has a number of important properties. The normal curve for standardised data is called the standard normal curve, and the area under the whole curve is equal to 1. Therefore, as I am calculating a 99% confidence interval, this is converted to a probability of 0.99.
z = 2.5758
−z = −2.5758

±z = (x̄ − μ)/σx̄
μ = x̄ ± z σx̄
μ = 5.64 ± 2.5758 × 0.340
μ = 4.76 or 6.52
μ ∈ [4.76, 6.52]
What this result means is that I can be 99% confident that the population mean lies within this range. I think it is safe to say that my rounding of results to two decimal places does not have a detrimental effect on the overall conclusions, as I am talking about word lengths and you cannot have 0.76 of a word, for example.
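The interval itself is quick to verify in Python from the mean, standard error and z-value used above:

```python
mean = 5.64        # adult-book sample mean
std_error = 0.340  # standard error calculated earlier
z99 = 2.5758       # z-value for 99% confidence

lower = mean - z99 * std_error
upper = mean + z99 * std_error
print(f"99% CI: [{lower:.2f}, {upper:.2f}]")  # 99% CI: [4.76, 6.52]
```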
Please see on separate sheets the rest of the confidence intervals; these include the 95% interval for the adult book and the confidence intervals for the children's book (99% and 95%).
These are the results for the two confidence intervals for my two samples:
The confidence intervals have been represented graphically to portray any trends or relationships. As you can see, for both the 99% and 95% levels the two intervals overlap one another significantly. This means there is no real difference between the mean word lengths of the adult and children's books, and therefore no evidence to suggest my hypothesis (that an adult book will have a larger mean word length than a children's one) is true. There are many possible reasons for this. I also suggested that the adult book might have a greater variance than the children's book, but again, as my graphs and calculations show, this is not supported. I calculated the unbiased estimates of the population variances above; these provided the following information: the adult book had a word-length variance of 5.786 whereas the children's book had a variance of 5.357. As I say, the decimal places are not really meaningful here, so the variance for both books rounds to about 6, and therefore there is no evidence whatsoever to support my second hypothesis. I will now explain possible reasons why my hypotheses were not supported and how I could improve the statistical investigation.
Firstly, was my hypothesis sensible? How many words are there in the English language that have over seven letters? Even those are very rarely used, especially in a fiction book, where they would often be out of context. Most of the words used in the English language are in the three-to-six-letter range, so there is little variation. I think this showed in my results, as the frequency graphs show a large bulk between 3 and 6 letters. I think that if I did my calculations to a much larger number of decimal places, perhaps a small relationship would be shown between the word lengths of an adult and a children's book.
When I collected my sample I did it in the most random way that was physically possible in the time available, and I think the technique used was generally good. I recorded the number of pages by turning to the back of the book and noting the last page number. To record the number of lines per page and the number of words on any one line, I randomly selected a page and simply counted each, but these counts may not be consistent throughout the whole book. Slight inaccuracies may therefore have been caused by this sampling technique, but considering the time allowance and limited access to resources I think it is valid. I would not try to improve the investigation by increasing the sample size, because I feel that a size of 50 is sufficient: 20 would be far too small and 200 would be impractical. I have worked out on separate sheets what the sample size would need to be for the 99% and 95% confidence intervals for both samples. I found that for a 99% confidence interval for the adult book 50 words are needed, and for a 95% interval only 46, so my sample size was sufficient for the confidence needed to deduce accurate conclusions.
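The sample-size check can be sketched with the usual formula n = (zσ/E)², where E is the desired half-width of the interval; the half-width of 0.9 letters below is an illustrative choice, not the one used on my sheets:

```python
import math

sigma = 2.405     # estimated population standard deviation from my sample
half_width = 0.9  # hypothetical target: estimate the mean to within ±0.9 letters

# n = (z * sigma / E)², rounded up to the next whole word
for label, z in [("99%", 2.5758), ("95%", 1.9600)]:
    n_needed = math.ceil((z * sigma / half_width) ** 2)
    print(label, n_needed)  # 99% -> 48, 95% -> 28
```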
Would the length of words in an adult and a children's book be a good indication of authorship? If I were to extend my statistical investigation, I would investigate whether there is a stronger relationship between the number of words in a sentence of an adult book and of a children's book. I would expect to find that the adult book has a larger mean sentence length than the children's, considering the attention span of children and the consistent use of pictures in children's books.
Another way of extending this investigation would be to see whether it is possible to gain information about the intended audience of a text using statistical measures. I would investigate whether there are more columns within a story from a broadsheet newspaper (The Independent) than from a tabloid (The Sun). It is generally held that a more mature reader tends to read the broadsheet rather than the tabloid, and as the audiences are different, the lengths of stories (number of columns) may be different too.
Research has shown that J.K. Rowling, the author of the "Harry Potter" series of books, has sold many millions of copies. As extra work I could investigate why this is. I could compare word lengths with a less successful author, perhaps S.B. Chapman, who wrote "Fog", and see if it is possible to gain information about authorship of a text using statistical measures. Maybe J.K. Rowling's books have more words per sentence than those of the less famous author.