Differences in standard deviation change the shape of the distribution. Although the distribution remains symmetric, it becomes flatter as the standard deviation increases. This corresponds to more diversity between the observations.
Problems
What should I do if:
- There are no words on the selected page?
Randomly generate another page number.
- There are no words on the selected line?
Randomly generate another line number.
- The word number “k” doesn’t exist?
Randomly generate another word number.
- The word continues onto the next line?
Count the whole length of the word, i.e. both parts of the word but not the hyphen.
- The word is a sound, e.g. “ssssh”?
Count the characters as I would for a normal word.
- The word is in digit form, e.g. “1357”?
Count the number of characters, in this case four.
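As a rough sketch, the counting rules above could be written as a short Python function. The name word_length is made up for illustration, and the sketch assumes the word has already been copied from the page as a piece of text, with the line-break hyphen still attached if the word ran onto the next line.

```python
def word_length(word: str) -> int:
    """Count the characters in a sampled word, following the rules above.

    Hyphens are not counted, so a word split across two lines
    (e.g. "some-" + "thing") counts both parts but not the hyphen.
    Sounds ("ssssh") and digits ("1357") are counted like normal words.
    """
    return sum(1 for ch in word if ch != "-")


print(word_length("ssssh"))       # 5
print(word_length("1357"))        # 4
print(word_length("some-thing"))  # 9
```

The sketch does not deal with punctuation attached to a word, because the rules above do not mention it.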
Frequency of Word Length for Emma
Mean = x̄ = Σfx / Σf = 455 / 70 = 6.5
Variance = s² = Σfx² / Σf − x̄² = 3330 / 70 − 6.5² = 5.3
Standard deviation = s = √variance = √5.3 = 2.3
I have calculated the sample mean. This is probably not the same as the population mean, but it is an unbiased estimator of the population mean. I have also calculated the sample standard deviation, s. This is not an unbiased estimator of the population standard deviation, so to find an unbiased estimator, σ, I use the formula:
σ = √(n × s² / (n − 1)) = √(70 × 2.3² / 69) = 2.3
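The calculation above can be checked with a few lines of Python. This is only a sketch: the variable names are my own, and it uses the unrounded variance rather than 2.3², so the working differs slightly from the hand calculation even though the rounded answers agree.

```python
import math

n = 70           # sample size (Σf)
sum_fx2 = 3330   # Σfx² from the frequency table
mean = 6.5       # sample mean

variance = sum_fx2 / n - mean ** 2          # sample variance s²
s = math.sqrt(variance)                     # sample standard deviation
sigma = math.sqrt(n / (n - 1) * variance)   # unbiased estimator of the population sd

print(round(variance, 1), round(s, 1), round(sigma, 1))  # 5.3 2.3 2.3
```

With a sample of 70 the factor n / (n − 1) is very close to 1, which is why s and σ agree to one decimal place.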
The standard deviation of the distribution of the sample means (the standard error) = σ / √n
According to the central limit theorem, the sample means are approximately normally distributed, and 95% of the values of a normal distribution lie within 1.96 standard deviations of the mean. This can be written as
x̄ − z × σ/√n < μ < x̄ + z × σ/√n
where x̄ = the sample mean, z = the z value for the confidence level, σ = the unbiased estimator of the population standard deviation and n = the size of the sample.
[Diagram: normal curve centred on μ with 1.96 marked on either side]
This can be rearranged as x̄ ± z × σ/√n
x̄ − z × σ/√n < μ < x̄ + z × σ/√n
For a 90% confidence interval, z = 1.65:
6.5 ± 1.65 × 2.3/√70
6.04 < μ < 6.95
This tells me that I can be 90% confident that the population mean lies between 6.04 and 6.95. I am now going to calculate a 95% confidence interval so I can compare the two.
[Diagram: normal curve centred on μ with 1.65 marked on either side]
6.5 ± 1.96 × 2.3/√70
5.96 < μ < 7.04
This tells me that I can be 95% confident that the population mean lies between 5.96 and 7.04.
I am now going to calculate a 99% confidence interval.
[Diagram: normal curve centred on μ with 2.575 marked on either side]
x̄ − z × σ/√n < μ < x̄ + z × σ/√n
6.5 ± 2.575 × 2.3/√70
5.79 < μ < 7.21
This tells me that I can be 99% confident that the population mean lies between 5.79 and 7.21.
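The three confidence intervals for Emma can be reproduced with a small Python function. This is a sketch only: the function name confidence_interval is my own, and because it uses the rounded value of 2.3 for the standard deviation, the last decimal place may differ slightly from the figures worked out by hand.

```python
import math


def confidence_interval(mean, sigma, n, z):
    """Return (lower, upper) for mean ± z × sigma / √n."""
    half_width = z * sigma / math.sqrt(n)
    return mean - half_width, mean + half_width


# z values for the 90%, 95% and 99% confidence levels used in the text
for level, z in [("90%", 1.65), ("95%", 1.96), ("99%", 2.575)]:
    lower, upper = confidence_interval(6.5, 2.3, 70, z)
    print(f"{level}: {lower:.2f} < μ < {upper:.2f}")
```

The same function can be reused for the other samples by changing the mean and the standard deviation, since the sample size stays at 70.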
Frequency of word length - Bridget Jones’ Diary
Mean = 5.6
Variance = 229.7
Standard deviation = √ 229.7 = 15.1
Using the same theory as earlier, I am now going to calculate an unbiased estimator of the population standard deviation and then use it to calculate a 90% confidence interval.
Unbiased estimator of population standard deviation (σ) = 15.2
2.58 < μ < 8.61
This tells me that I can be 90% confident that the population mean lies between 2.58 and 8.61.
I am now going to calculate 95% and 99% confidence intervals so that I can see whether there is any overlap between the confidence intervals at 90%, 95% and 99%.
2.03 < μ < 6.03
This tells me that I can be 95% confident that the population mean lies between 2.03 and 6.03.
5.04 < μ < 9.17
This tells me that I can be 99% confident that the population mean lies between 5.04 and 9.17.
Comparing the two intervals
These diagrams help to illustrate more clearly the confidence intervals.
The first diagram is for the 90% confidence interval of the mean word length and shows clearly that the interval for Bridget Jones’ Diary completely encompasses the interval for Emma. This is not what I would have expected and suggests that the true means of the two books could well be the same.
The second diagram is for the 95% confidence interval of the mean word length and shows that the interval for Bridget Jones’ Diary only overlaps the interval for Emma a little. This is what I would have expected and suggests that the true means of the two books are not likely to be the same.
The third diagram is for the 99% confidence interval of the mean word length and shows that the interval for Bridget Jones’ Diary only overlaps the interval for Emma a little. This is what I would have expected and also suggests that the true means of the two books are not likely to be the same.
Conclusion
Having found the sample mean word length for each of the two books and compared the confidence intervals, I have found that the confidence intervals overlap at all three levels of confidence. This tells me that the true means for the two books could be very similar. However, the 95% and 99% confidence intervals only overlap a little, meaning that there is still a good chance that the means are not the same and that the mean for Emma is higher. This is not what I had expected; I had expected to find that there was very little or no overlap and that the mean word length of Emma was longer.
However accurate I believe my results to be, they can never be perfectly accurate as the true mean is unattainable with such a comparatively small sample size. The only way to find the true mean would be to sample every word in the book. I feel that my prediction at the start of the investigation that the means would be very different now seems not to be true, and that the means are probably quite close in value.
I think my results are realistic because, if you look at the samples of words, there are almost as many short words in Emma as in Bridget Jones’ Diary; there are just more long ones and fewer in the middle. I don’t think the way I have sampled my data will have had much effect on the sample, as it was completely random and should have produced a fair sample.
If I were to do this again I would take a far larger sample, say 200 words. I might also take the sample differently, but I’m not sure how I could do this and still keep it random.
As there is an overlap, it would be interesting to investigate it at even lower confidence levels. I could also take a larger sample or more samples. Alternatively, I could investigate average sentence length, as this would probably be a better indication of how hard the book is to read.
Sentence Length
I will take 70 samples from each book as before and evaluate the results in the same way to establish confidence intervals. I will again take a random sample using numbers generated by my calculator.
I will then use this formula:
RAN × p + o
where p is the difference between the page number of the last page of the main body of the text and the first page of the main body of the text (i.e. not including any introductions, acknowledgements or appendices), and o is the number of pages before the main story.
Using the formula
RAN × p
I can determine the line number by taking p as the maximum number of lines on a page. I will then use the sentence that contains the first word of that line, including any parts of it on the previous or following lines.
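The two formulas above could be sketched in Python as follows. Several things here are assumptions: Python's random number generator stands in for the calculator's random number (taken to be between 0 and 1), the function names are made up, the result is rounded to the nearest whole number because the rounding rule is not stated, and the values of p, o and the number of lines in the usage example are hypothetical.

```python
import random


def random_page(p: int, o: int, rng: random.Random) -> int:
    """Pick a page with RAN × p + o, where RAN is a random number in [0, 1)."""
    ran = rng.random()          # stands in for the calculator's random number
    return round(ran * p + o)   # assumption: round to the nearest whole page


def random_line(max_lines: int, rng: random.Random) -> int:
    """Pick a line with RAN × p, where p is the maximum number of lines on a page."""
    ran = rng.random()
    return round(ran * max_lines)  # a result of 0 would mean drawing again


rng = random.Random(1)                     # fixed seed so the sketch is repeatable
page = random_page(p=400, o=10, rng=rng)   # hypothetical book: 400 pages of story, 10 before it
line = random_line(max_lines=35, rng=rng)  # hypothetical maximum of 35 lines per page
print(page, line)
```

If the chosen line does not exist on the chosen page, another line number would be generated, in the same way as for the word length sample.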
I think I will find that the sentences in Emma will be considerably longer as I am expecting the sentences to be more complex and descriptive.
What should I do if:
- The sentence includes a hyphenated word?
Treat it as one word; this is what I did in the word length section.
- The sentence includes speech?
If the speech is part of the sentence, i.e. there is no full stop or question/exclamation mark followed by a capital letter, then it will be included.
- The word is a sound, e.g. “ssssh”?
Count the word as I would for a normal word.
- The word is in digit form, e.g. “1357”?
Include it as one word.
Frequency of sentence length - Emma
Mean = 30.2
Variance = 1057.7
Standard deviation = 32.5
Unbiased estimator of the population standard deviation = 33.0
23.69 < μ < 36.71
This tells me that I can be 90% confident that the population mean lies between 23.69 and 36.71.
I am now going to calculate 95% and 99% confidence intervals so that I can see whether there is any overlap between the confidence intervals at 90%, 95% and 99%.
22.47 < μ < 37.93
This tells me that I can be 95% confident that the population mean lies between 22.47 and 37.93.
20.04 < μ < 40.36
This tells me that I can be 99% confident that the population mean lies between 20.04 and 40.36.
Frequency of sentence length – Bridget Jones’ Diary
Mean = 19.0
Variance = 45.82
Standard deviation = 6.77
Using the same theory as earlier, I am now going to calculate an unbiased estimator of the population standard deviation and then use it to calculate a 90% confidence interval.
Unbiased estimator of the population standard deviation = 6.87
17.65 < μ < 20.35
This tells me that I can be 90% confident that the population mean lies between 17.65 and 20.35.
I am now going to calculate 95% and 99% confidence intervals. This will allow me to compare the intervals for the two books and look for any overlap.
17.40 < μ < 20.60
This tells me that I can be 95% confident that the population mean lies between 17.40 and 20.60.
16.91 < μ < 21.01
This tells me that I can be 99% confident that the population mean lies between 16.91 and 21.01.
Comparing the two intervals
These diagrams help to illustrate more clearly the confidence intervals.
The first diagram is for the 90% confidence interval of the mean sentence length and shows clearly that the interval for Bridget Jones’ Diary does not overlap the interval for Emma at all. This is what I expected and suggests that the true means of the two books are very unlikely to be the same.
The second diagram is for the 95% confidence interval of the mean sentence length and also shows that the interval for Bridget Jones’ Diary does not overlap the interval for Emma at all. This again suggests that the true means of the two books are very unlikely to be the same.
The third diagram is for the 99% confidence interval of the mean sentence length and shows that the interval for Bridget Jones’ Diary only overlaps the interval for Emma a little. This is what I would have expected and also suggests that the true means of the two books are not likely to be the same.
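The overlaps described above can also be checked directly from the interval end points. The short Python sketch below does this for the sentence length intervals. The function name overlap is my own, and the numbers are the intervals calculated earlier.

```python
def overlap(a, b):
    """Return the overlapping part of two intervals as (low, high), or None."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low < high else None


# Sentence length confidence intervals calculated earlier
emma = {"90%": (23.69, 36.71), "95%": (22.47, 37.93), "99%": (20.04, 40.36)}
bridget = {"90%": (17.65, 20.35), "95%": (17.40, 20.60), "99%": (16.91, 21.01)}

for level in emma:
    print(level, overlap(emma[level], bridget[level]))
# 90% None
# 95% None
# 99% (20.04, 21.01)
```

Only the 99% intervals share any values, and the shared region is less than one word wide.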
Final Conclusion
From my work on word length, the means looked to have a high probability of being the same or similar. From my work on sentence length I can see that the means have little chance of being the same and that the mean for Emma is higher.
As my original task was to compare the writing styles of the two authors, this is helpful as it allows me to see that although Jane Austen only used slightly longer words, her sentences are far longer and therefore take more effort to read. Though it is debatable whether or not longer words and sentences are actually harder to read, I had to assume this in order to carry out this coursework.
I think that my sampling method was a legitimate random sampling method as it made sure the sample was random, and therefore hopefully not biased. The investigation could have been extended by investigating paragraph or chapter length for these books, and samples could have been taken from other books by the same authors to further compare their writing styles. It would probably have been fairer to take a larger sample, but this was not practical at the time of the investigation as I was limited by time.
My prediction regarding word length was wrong. At the 90% level the interval for Bridget Jones’ Diary overlaps the interval for Emma completely, which means the means could be close, if not the same. On the other hand, the 95% and 99% intervals only overlap slightly, which suggests that the means are probably not the same. For sentence length, however, the 90% and 95% confidence intervals do not overlap at all and the 99% intervals only overlap slightly. This shows that it is very unlikely that the means are the same, though when there is any overlap there is always a chance that the true means could be the same, just as there is always a chance that the true mean may fall outside the confidence interval. For 99% intervals, the true mean will fall outside the confidence interval for about one in one hundred samples. From this information I would conclude that my prediction for sentence length was right.
I cannot draw a firm conclusion that this is the case unless I carry out a hypothesis test. This is because, although it seems from my results that Emma has a longer mean sentence length than Bridget Jones’ Diary, it may be that, despite my random sampling, I selected unusually short sentences for Bridget Jones’ Diary and unusually long ones for Emma. To do the hypothesis test I would make a null hypothesis (H0) that the mean of Emma − the mean of Bridget Jones = 0 and an alternative hypothesis (H1) that the mean of Emma is greater than the mean of Bridget Jones. I have chosen a one-sided alternative hypothesis, testing that Emma is greater than Bridget Jones, because that is what my samples have indicated. I would then calculate the probability of obtaining a difference between the sample means at least as large as the one I found if the null hypothesis were true. If this probability was large, I would have no evidence to reject the null hypothesis that the true means were the same; if the probability was small, I could conclude that the two means weren’t the same and that Emma had a longer mean sentence length.
I would determine large as greater than 0.05 and small as less than or equal to 0.05.
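One way the hypothesis test could be carried out is a one-sided two-sample z-test, sketched below in Python. The choice of this particular test is my assumption, as is the function name. It uses the sentence length means and unbiased standard deviation estimates found earlier, and it relies on the samples of 70 being large enough for the sample means to be treated as normally distributed.

```python
import math
from statistics import NormalDist


def one_sided_z_test(mean_a, sigma_a, n_a, mean_b, sigma_b, n_b):
    """Test H0: mean_a - mean_b = 0 against H1: mean_a > mean_b.

    Returns the z statistic and the one-sided p-value.
    """
    # Standard error of the difference between two independent sample means
    standard_error = math.sqrt(sigma_a ** 2 / n_a + sigma_b ** 2 / n_b)
    z = (mean_a - mean_b) / standard_error
    # Probability of a difference at least this large if H0 were true
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Emma first, then Bridget Jones' Diary (sentence length samples)
z, p = one_sided_z_test(30.2, 33.0, 70, 19.0, 6.87, 70)
print(f"z = {z:.2f}, p = {p:.4f}")
print("reject H0" if p <= 0.05 else "do not reject H0")
```

The last line applies the 0.05 cut-off described above to decide between the two hypotheses.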
Finally I feel that the results I have obtained are quite realistic and that overall the investigation has been a success. This is shown by the fact that all of my results are sensible, even if not what I predicted and all of my aims and objectives have been met.
Emma by Jane Austen - word sampling
Bridget Jones’ Diary by Helen Fielding - word sampling
Emma by Jane Austen - sentence sampling
Bridget Jones’ Diary by Helen Fielding - sentence sampling