If X ~ (unknown)(μ, σ²), then X̄ ~ N(μ, σ²/n)
Once I have collected the data I will calculate the mean, standard deviation and variance of each sample. From these figures I can estimate the variance and standard deviation of the population. Next I will calculate the standard error, which will allow me to calculate confidence intervals for the population mean. When calculating confidence intervals I will use the tables for the normal distribution.
Data Collection
Data For Horror Book
Data For Drama Book
The first thing for me to do is to find the mean, standard deviation and variance of the samples I have taken. As it would be extremely time consuming to work these out directly from 100 individual results, I have set up frequency tables, which allow me to calculate the mean and variance more quickly. I have chosen quite small class intervals so that the calculations will be as accurate as possible.
Drama Book
Horror Book
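The frequency-table calculation can be sketched in Python as follows. The class intervals and frequencies here are invented purely for illustration (they are not the figures from my tables), and the function name is my own. Each class is represented by its midpoint, which is why small class intervals give a more accurate answer.

```python
# Hypothetical grouped frequency table: (lower bound, upper bound, frequency).
# These classes and counts are made up for illustration only.
classes = [(1, 5, 4), (6, 10, 7), (11, 15, 5), (16, 20, 3), (21, 25, 1)]


def grouped_mean_and_variance(table):
    """Estimate the mean and variance from class midpoints and frequencies."""
    n = sum(f for _, _, f in table)
    midpoints = [(lo + hi) / 2 for lo, hi, _ in table]
    sum_fx = sum(m * f for m, (_, _, f) in zip(midpoints, table))
    sum_fx2 = sum(m * m * f for m, (_, _, f) in zip(midpoints, table))
    mean = sum_fx / n
    # Divisor n: this is the variance of the sample itself,
    # not yet the unbiased estimate of the population variance.
    variance = sum_fx2 / n - mean ** 2
    return n, mean, variance


n, mean, variance = grouped_mean_and_variance(classes)
print(n, round(mean, 2), round(variance, 2))
```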
From my frequency tables I have been able to use a number of graphical methods to show the data. I have worked out the median to be 11 for the horror book and 12 for the drama book. The lower quartiles are 7 for the horror book and 8 for the drama book, whilst the upper quartiles are 20 for the horror book and 19 for the drama book. The interquartile range is therefore 13 for the horror book and 11 for the drama book. This tells me that the data for the horror book appears to be more spread out, so I would expect it to have a larger variance.
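The comparison of spread can be checked with a short Python sketch using the quartiles quoted above. The dictionary layout and function name are only illustrative choices.

```python
# Quartiles quoted in the text (words per sentence).
quartiles = {
    "horror": {"lower": 7, "median": 11, "upper": 20},
    "drama": {"lower": 8, "median": 12, "upper": 19},
}


def interquartile_range(q):
    """Upper quartile minus lower quartile."""
    return q["upper"] - q["lower"]


iqr = {book: interquartile_range(q) for book, q in quartiles.items()}
print(iqr)
```

A larger interquartile range for the horror book is consistent with expecting a larger variance, although the two measures of spread are not the same thing.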
Having found the means of the samples, I can use them as estimates of the population means, because the sample mean is an unbiased estimator of the population mean. So my estimate of the mean sentence length is 14.15 words for the horror book and 14.3 words for the drama book. However, the variance obtained for the sample is not the same as that for the population: it is a biased estimator. This means that the mean of its distribution is not equal to the population value it is estimating. To obtain an unbiased estimate of the population variance we can use the formula

σ̂² = n/(n − 1) × s²

where s² is the sample variance and n is the sample size.
We can see that this did not have much of an effect on the variance, because the value of n was quite large and so n/(n − 1) was almost 1.
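To show how small the correction is, here is a minimal Python sketch. It assumes n = 100 sentences per book (the "100 results" mentioned earlier), and the sample variance of 50.0 is a made-up value used only to demonstrate the function.

```python
def unbiased_variance(sample_variance, n):
    """Scale the sample variance by n / (n - 1) to estimate the population variance."""
    return sample_variance * n / (n - 1)


n = 100  # assumed sample size per book
correction = n / (n - 1)
print(round(correction, 4))

# Hypothetical sample variance, for illustration only.
example = unbiased_variance(50.0, n)
print(round(example, 2))
```

With a sample of this size the correction factor is only about 1% above 1, so the estimate barely moves.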
In order to judge the accuracy of the sample mean you can calculate the standard error (s.e.). This is the standard deviation of the distribution of sample means, and it is found using the formula below.

s.e. = σ̂/√n

where σ̂ is the estimated population standard deviation and n is the sample size.
We can see that the s.e. for the horror book was 0.937 and for the drama book it was 0.891. These standard errors are quite small, so I can be fairly confident that the actual population means are close to the sample means. A better way of showing how confident I can be in my estimates is to use confidence intervals. Confidence intervals allow you to give a percentage value to how confident you can be that the mean of the population is within certain values.
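The standard errors can be reproduced with a short Python sketch. It uses the estimated population variances quoted in this report (87.8 and 79.3) and assumes a sample size of 100 sentences per book. The function name is my own.

```python
import math


def standard_error(population_variance_estimate, n):
    """Standard deviation of the sample mean: sqrt(variance / n)."""
    return math.sqrt(population_variance_estimate / n)


n = 100  # assumed sample size per book
se_horror = standard_error(87.8, n)
se_drama = standard_error(79.3, n)
print(round(se_horror, 3), round(se_drama, 3))
```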
The central limit theorem says that when a large enough sample is taken, the sample mean is approximately normally distributed, with a mean equal to the population mean. This means that we can use the tables for the normal distribution to find out how confident we can be that the population mean is within a certain range. For example, when 68% of the graph is shaded (below) we can use the normal tables to work out that this corresponds to ± 1 s.e. either side of the mean. So if you took a sample you could be 68% confident the sample mean was within ± 1 s.e. of the population mean. This can be written as the inequality:

μ − s.e. < x̄ < μ + s.e.
However, because we don't know the value of μ, we must rearrange this to form the inequality:

x̄ − s.e. < μ < x̄ + s.e.
I am going to use this to calculate 90%, 95% and 99% confidence intervals.
The z value for a 90% confidence interval is 1.645, so I can be 90% sure that the sample mean is within 1.645 s.e. of the population mean. This means the calculations are:
For 95% confidence I found that the z value was 1.96, so you can be 95% sure the sample mean is within 1.96 s.e. of the population mean.
The z value for 99% confidence was 2.58, so you can be 99% sure the sample mean is within 2.58 s.e. of the population mean.
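The three sets of intervals can be worked through with the following Python sketch. It uses the sample means, standard errors and z values quoted in this report. The dictionary names and the function are illustrative, and the printed limits depend on the rounded standard errors, so they may differ slightly in the last decimal place from a hand calculation.

```python
# z values from the normal tables, as quoted in the text.
Z_VALUES = {90: 1.645, 95: 1.96, 99: 2.58}

# (sample mean, standard error) for each book, as quoted in the text.
BOOKS = {"horror": (14.15, 0.937), "drama": (14.3, 0.891)}


def confidence_interval(mean, se, z):
    """Return (lower, upper) limits: mean - z*se < mu < mean + z*se."""
    margin = z * se
    return mean - margin, mean + margin


intervals = {
    book: {level: confidence_interval(mean, se, z) for level, z in Z_VALUES.items()}
    for book, (mean, se) in BOOKS.items()
}

for book, by_level in intervals.items():
    for level, (low, high) in by_level.items():
        print(f"{book} {level}%: {low:.2f} < mu < {high:.2f}")
```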
Data Interpretation
From my calculations I have been able to estimate the population parameters for the 2 books. Firstly I found that the mean for Alfred Hitchcock's horror book was 14.15 words per sentence whilst the mean for the drama book was 14.3. I estimated the population variance to be 87.8 for the horror book and 79.3 for the drama book. The confidence intervals I calculated for the horror book were
And for the drama book they were
Although this supports my prediction that horror books would have fewer words per sentence, I am not actually that confident in this conclusion. This is because the 99% confidence intervals are wide, with a range of 4.84 words for the horror book and 4.60 words for the drama book. This means that the actual population means could be quite different from the sample means I calculated, and so it could be that the population mean for the horror book is actually greater than that of the drama book.

I also found that the variance for the horror book was greater than that of the drama book. This is probably because a drama book is likely to keep the same style of writing throughout, with roughly the same sentence length, whereas a horror book is likely to contain parts where there is suspense and the sentences are short, and parts where there is description and the sentences are much longer. One of the problems with my findings was that the calculated means were not whole numbers. It is impossible to have fractions of a word, so if you round the means to the nearest whole word they are exactly the same at 14.
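One simple way to see why I cannot be confident about the difference is to check whether the two 99% confidence intervals overlap. The Python sketch below does this with the means, standard errors and z value quoted in this report. The helper functions are my own, and an overlap check is only a rough guide rather than a formal test of the difference between two means.

```python
def interval(mean, se, z):
    """Confidence limits for the population mean."""
    return (mean - z * se, mean + z * se)


def overlaps(a, b):
    """True if intervals a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


z99 = 2.58
horror = interval(14.15, 0.937, z99)
drama = interval(14.3, 0.891, z99)

print(overlaps(horror, drama))
```

Because the intervals overlap almost completely, the sample does not rule out either book having the larger population mean.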
There were a number of limitations with this investigation. Firstly, I could not be very confident that the means I obtained were accurate; to be more accurate I would have to sample a lot more sentences. For example, if I wanted to be 99% sure that the sample mean was within 0.1 of a word of the population mean, I would have to sample over 58,000 sentences (see below) for the horror book and over 52,000 sentences for the drama book.
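The required sample sizes come from rearranging z × σ/√n ≤ 0.1 to give n ≥ (z × σ/0.1)². A minimal Python sketch of this calculation, using the variance estimates and the 99% z value quoted in this report (the function name is my own), is:

```python
import math


def required_sample_size(variance, z, margin):
    """Smallest n with z * sqrt(variance / n) <= margin."""
    return math.ceil(z ** 2 * variance / margin ** 2)


z99 = 2.58    # z value for 99% confidence, from the normal tables
margin = 0.1  # desired half-width of the interval, in words

n_horror = required_sample_size(87.8, z99, margin)
n_drama = required_sample_size(79.3, z99, margin)
print(n_horror, n_drama)
```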
Obviously this is highly impractical, but it shows how imprecise my estimate is because I sampled so few sentences. Also, I only sampled 1 book from each genre, so it is difficult for me to say that all books from these genres will be the same. It is possible that different authors with different writing styles will produce different sentence lengths. For example, another horror writer may use longer sentences whilst another drama writer might use shorter sentences.
So if I were to extend this investigation I would firstly take a larger sample to ensure greater accuracy, which would allow greater certainty in any conclusions drawn. Secondly, I would compare a number of different horror books against each other to see if their population parameters were similar or if they varied. Another progression could be to sample a number of horror books by the same author to see if they are at all similar in their population parameters.