The Normal Distribution

Design

The aim of this coursework is to compare the word and sentence lengths of an adult’s book and a child’s book. The results should reflect a higher level of difficulty in the adult’s book.

The strategy that I will be using is simple. I am going to take samples of word and sentence lengths from both books, taking two sets of each measure for each book. To make the comparison fair and reliable, the samples will be random.

The main objective of the coursework is to demonstrate the difficulty of an adult’s book compared to a child’s book. If the words and sentences of the adult’s book are longer by a reasonable amount, I will judge that the adult’s book is more difficult.

The population that I will be using consists of two fiction books chosen from a library. One book was selected from the adults’ section and one from the children’s section. The adult’s book is ‘The Regeneration Trilogy’ by Pat Barker. The child’s book is ‘The Borrowers Afloat’ by Mary Norton.

To obtain our sample we decided to take each measure separately. We first took the word lengths from the child’s book, followed by the word lengths from the adult’s book. First of all we randomly selected a page in the child’s book using the ‘Random Number Generator’ on a calculator. Once we had our page we randomly selected a line on the page using the same method. This gave us our starting point. We then counted the number of letters in each word, starting with the first word of our selected line, until we had 100 words. Each word length was recorded in a frequency table (as shown in the appendix). We then repeated this using another randomly generated page and line. We then moved on to the adult’s book, using exactly the same method.
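The sampling steps above can be sketched in Python. This is only an illustration under assumptions: the book is represented as a hypothetical list of pages, each page being a list of lines; Python's `random` module stands in for the calculator's random number generator; and the name `sample_word_lengths` is my own, not part of the original method.

```python
import random
from collections import Counter

def sample_word_lengths(pages, n_words, rng):
    """Pick a random page and line, then tally the letter counts of the
    next n_words words, reading on from that starting line."""
    page_idx = rng.randrange(len(pages))            # random page
    line_idx = rng.randrange(len(pages[page_idx]))  # random line on that page

    # Collect every word from the starting line to the end of the text.
    words = []
    for p in range(page_idx, len(pages)):
        start = line_idx if p == page_idx else 0
        for line in pages[p][start:]:
            words.extend(line.split())

    # Count letters only, so punctuation does not inflate a word's length.
    lengths = [sum(ch.isalpha() for ch in w) for w in words]
    lengths = [n for n in lengths if n > 0][:n_words]
    # Note: fewer than n_words are returned if the text runs out first.
    return Counter(lengths)  # frequency table: word length -> count

# Hypothetical one-page, one-line "book", used only to show the call.
toy_pages = [["The cat sat on the mat."]]
table = sample_word_lengths(toy_pages, n_words=4, rng=random.Random(1))
print(sorted(table.items()))
```

Passing the random number generator in as `rng` makes a run repeatable, which is useful when checking the frequency table by hand.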

Now that we had our word lengths we could do sentence length. Once again we did the child’s book first. This time we randomly selected a page in the book followed by a randomly selected sentence. This sentence then became our starting point. We then counted the number of words in each sentence until we had sampled 20 sentences. We recorded them in a frequency chart (shown in the appendix). We then did the same thing with another randomly selected page and starting sentence. We then repeated this whole method for the adult’s book.

I will be using a number of statistical theories. The first is the ‘Central Limit Theorem’. This means that even if our population is not normal, the distribution of the sample mean will still be approximately normal, provided the sample size is large enough. The mean of our sample means will be approximately equal to the mean of the parent population, which in this case is the book. The variance of the distribution of our sample means will be approximately the variance of the parent population divided by our sample size. It is possible to state the ‘Central Limit Theorem’ symbolically. This is shown below:

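$$\text{if } X \sim \text{(unknown)}(\mu, \sigma^2) \text{ then } \bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ approximately}$$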

A good sample size to use is n ≥ 30. This means that our sample sizes of 200 word lengths per population and 40 sentence lengths per population are large enough.
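The Central Limit Theorem can be checked with a short simulation. The sketch below assumes a made-up, skewed population of word lengths (it is not data from either book) and draws 2000 samples of size n = 200 with replacement. The mean of the sample means should come out close to μ, and their variance close to σ²/n.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical skewed population of word lengths (NOT taken from either book).
population = ([1] * 5 + [2] * 15 + [3] * 25 + [4] * 20 + [5] * 12
              + [6] * 8 + [7] * 6 + [8] * 4 + [9] * 3 + [12] * 2)

mu = statistics.fmean(population)          # population mean
sigma2 = statistics.pvariance(population)  # population variance

n = 200  # same size as the word-length sample for one book
sample_means = [
    statistics.fmean(rng.choices(population, k=n))  # sample with replacement
    for _ in range(2000)
]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print("population mean:", round(mu, 3), "mean of sample means:", round(mean_of_means, 3))
print("sigma^2 / n:", round(sigma2 / n, 4), "variance of sample means:", round(var_of_means, 4))
```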

I will also be using an estimate of the population mean. This will be the sample mean ($\bar{x}$), which is an unbiased estimator. I will also estimate the population variance. The sample variance by itself is a biased estimator, so we will use the formula:

$$\text{Sample variance} \times \left(\frac{n}{n-1}\right)$$
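As a small worked example of this correction, the Python sketch below computes the mean, the biased sample variance and the unbiased estimate from a frequency table. The table is hypothetical (it is not one of the tables in the appendix) and the function name `summarise` is my own.

```python
import statistics

def summarise(freq):
    """freq maps a value (e.g. a word length) to its frequency.
    Returns (n, mean, biased sample variance, unbiased variance estimate)."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    # Biased sample variance: divide the sum of squared deviations by n.
    biased = sum(f * (x - mean) ** 2 for x, f in freq.items()) / n
    # Unbiased estimate of the population variance: multiply by n / (n - 1).
    unbiased = biased * n / (n - 1)
    return n, mean, biased, unbiased

# Hypothetical frequency table: one word of length 2, two of length 4, one of length 6.
toy_table = {2: 1, 4: 2, 6: 1}
n, mean, biased, unbiased = summarise(toy_table)
print(n, mean, biased, round(unbiased, 4))

# Cross-check: statistics.variance already divides by (n - 1).
check = statistics.variance([2, 4, 4, 6])
```

With only four values the correction is large (2 becomes about 2.67), but with n = 200 the factor n / (n − 1) is very close to 1.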

I will be using the normal distribution, applied to the distribution of sample means. Using this distribution I will be able to find confidence intervals. I will use the 68%, 90%, 95% and 99% confidence intervals depending on which are suitable for my results, but I will definitely use 95%. I will get my confidence intervals from:

$$\bar{X} - 1.96\,(\text{s.e.}) < \mu < \bar{X} + 1.96\,(\text{s.e.})$$

This is used to find the 95% confidence interval, where s.e. is the standard error of the sample mean. The 1.96 comes from the normal distribution tables. In the tables I looked up 0.9750, which is 97.5%. This may seem strange, but the reason I looked up this value and not 95% is the symmetry of the normal curve. Finding 97.5% leaves 2.5% above that point and, by symmetry, 2.5% below the matching point at the other end of the curve. These add up to 5%, so 95% is left in the middle. This is shown more clearly in the diagram on the next page.
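The table look-up can be reproduced in Python with `statistics.NormalDist`. The sketch below assumes the standard error is the square root of the unbiased variance estimate divided by n, in line with the Central Limit Theorem; the function name `confidence_interval` and the numbers passed to it are made up for illustration and are not my results.

```python
import math
from statistics import NormalDist

def confidence_interval(mean, unbiased_variance, n, level=0.95):
    """Two-sided confidence interval for the population mean."""
    # Split the leftover probability equally between the two tails,
    # e.g. 95% -> look up 0.975, leaving 2.5% in each tail.
    table_value = 1 - (1 - level) / 2
    z = NormalDist().inv_cdf(table_value)
    se = math.sqrt(unbiased_variance / n)  # standard error (assumed definition)
    return mean - z * se, mean + z * se

z95 = NormalDist().inv_cdf(0.975)
print(round(z95, 2))  # 1.96

# Made-up figures: sample mean 4.0, unbiased variance 4.0, n = 100.
low, high = confidence_interval(4.0, 4.0, 100)
print(round(low, 3), round(high, 3))
```

Changing `level` to 0.90 or 0.99 gives the multipliers for the other confidence intervals in the same way.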

...