- Level: AS and A Level
- Subject: Maths
- Word count: 3043
Statistics - My aim is to investigate whether it is possible to gain information about authorship of a text by using statistical measures.
Statistics Coursework – Authorship
Design
Aim
My aim is to investigate whether statistical measures can reveal information about the authorship of a text. I will compare a text written for adults with a text written for children. For a sample from each text I will calculate the mean, and then the variance and standard deviation using the unbiased estimator of the population variance. From these I will calculate the standard error of the mean and a confidence interval for each population mean. I will present my data in frequency distribution tables and frequency distribution graphs, and I will illustrate the confidence intervals with normal distribution diagrams.
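The measures named in the aim can be written out as follows. This is a sketch of the standard formulas rather than a statement of the exact working used later: the symbols $x_i$ (letters in the $i$-th sampled word), $n$ (sample size) and $z$ (normal critical value) are notation introduced here, and the 95% level with $z \approx 1.96$ is an assumption, since the aim does not state a confidence level.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{(sample mean)} \\
s^2 &= \frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)
  && \text{(unbiased estimator of the population variance)} \\
\mathrm{SE}(\bar{x}) &= \frac{s}{\sqrt{n}}
  && \text{(standard error of the mean)} \\
\bar{x} &\pm z\,\frac{s}{\sqrt{n}}
  && \text{(confidence interval for the population mean)}
\end{align*}
```

With $n = 50$ the sample mean is approximately normally distributed by the Central Limit Theorem, which is what allows a normal critical value to be used even though word lengths themselves are not normally distributed.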
Hypothesis
I predict that words in Great Expectations by Charles Dickens will contain more letters on average than words in Charlie and the Great Glass Elevator by Roald Dahl, so the mean number of letters per word will be larger for Great Expectations. I also expect Great Expectations to have a larger standard deviation because it uses a larger vocabulary.
Population
I will randomly select 50 pages from each book using the RAND function in Microsoft Excel. For each selected page I will then select a random line, and from that line a random word. This gives a sample of 50 words from each book.
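The three-stage selection described above can be sketched in Python. This is an illustration only: the book dimensions below are hypothetical placeholders (the text does not state them), and the `INT(RAND() * N) + 1` idiom being imitated is an assumption about how RAND was turned into whole numbers.

```python
import random

# Hypothetical book dimensions: the source does not state these values.
PAGES_IN_BOOK = 190
LINES_PER_PAGE = 33
WORDS_PER_LINE = 12


def rand_int(upper, rng):
    """Imitate INT(RAND() * upper) + 1: a whole number from 1 to upper."""
    return int(rng.random() * upper) + 1


def draw_positions(n, rng):
    """Return n (page, line, word) positions, each drawn uniformly at random."""
    return [
        (
            rand_int(PAGES_IN_BOOK, rng),
            rand_int(LINES_PER_PAGE, rng),
            rand_int(WORDS_PER_LINE, rng),
        )
        for _ in range(n)
    ]


# A fixed seed makes the sketch repeatable; Excel's RAND cannot be seeded.
positions = draw_positions(50, random.Random(1))
for page, line, word in positions[:5]:
    print(f"page {page}, line {line}, word {word}")
```

Pages are drawn with replacement here, so the same page can come up more than once. In a real book, lines differ in length, so a drawn word number that runs past the end of the line would need to be redrawn.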
Using the RAND function
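Page No. | Line No. | Word No. | Word | Letters in word |
... | ... | ... | ... | 3 |
327 | 2 | 12 | THIS | 4 |
474 | 8 | 6 | MY | 2 |
459 | 33 | 9 | YOU'VE | 5 |
454 | 23 | 10 | PUT | 3 |
308 | 25 | 1 | HAD | 3 |
406 | 29 | 11 | TONE | 4 |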
Raw data for Charlie and the Great Glass Elevator by Roald Dahl.
Page No. | Line No. | Word No. | Word | Letters in word |
77 | 18 | 7 | GRIN | 4 |
150 | 11 | 6 | MORE | 4 |
131 | 9 | 9 | TO | 2 |
143 | 14 | 1 | EXPLOSIONS | 10 |
164 | 12 | 1 | ISN'T | 4 |
140 | 31 | 7 | AGAIN | 5 |
92 | 2 | 1 | RED | 3 |
176 | 26 | 2 | ALL | 3 |
74 | 2 | 6 | EYE | 3 |
41 | 1 | 8 | OFF | 3 |
14 | 30 | 7 | GREEN | 5 |
120 | 2 | 3 | A | 1 |
55 | 25 | 5 | ONE | 3 |
146 | 16 | 9 | FEEDING | 7 |
93 | 19 | 1 | CRIPPLED | 8 |
57 | 8 | 10 | MARS | 4 |
23 | 8 | 2 | ABOUT | 5 |
119 | 9 | 1 | LOOK | 4 |
26 | 29 | 1 | WORTH | 5 |
74 | 22 | 5 | WONKA | 5 |
24 | 7 | 2 | YOU | 3 |
111 | 25 | 3 | YOU | 3 |
138 | 2 | 6 | I | 1 |
70 | 23 | 6 | RAN | 3 |
158 | 27 | 1 | VAPOUR | 6 |
152 | 28 | 3 | PINE | 4 |
165 | 18 | 6 | OLD | 3 |
89 | 5 | 4 | BESIDE | 6 |
111 | 26 | 7 | TO | 2 |
43 | 20 | 6 | MANDARIN | 8 |
23 | 3 | 1 | SERIOUS | 7 |
181 | 12 | 3 | MOMENT | 6 |
117 | 18 | 2 | ABOUT | 5 |
38 | 5 | 6 | SPY | 3 |
170 | 18 | 3 | SAID | 4 |
181 | 13 | 9 | OF | 2 |
78 | 7 | 6 | YOU | 3 |
65 | 21 | 9 | OR | 2 |
75 | 28 | 8 | BUMP | 4 |
50 | 24 | 1 | STRAIGHT | 8 |
14 | 8 | 7 | OUT | 3 |
98 | 1 | 6 | ELEVATOR | 8 |
172 | 10 | 1 | FORTY | 5 |
130 | 19 | 3 | QUIET | 5 |
104 | 9 | 8 | WONKA | 5 |
183 | 2 | 5 | LETTER | 6 |
17 | 1 | 7 | MR | 2 |
183 | 27 | 4 | A | 1 |
107 | 16 | 8 | TO | 2 |
129 | 9 | 1 | PILL | 4 |
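The summary statistics named in the aim can be computed directly from the "Letters in word" column of the table above. The sketch below is a recomputation from that table, not a copy of the original working, and the 95% confidence level (z = 1.96) is an assumption because the text does not state a level.

```python
import math
import statistics

# "Letters in word" column of the Roald Dahl table, in row order.
dahl_letters = [
    4, 4, 2, 10, 4, 5, 3, 3, 3, 3,
    5, 1, 3, 7, 8, 4, 5, 4, 5, 5,
    3, 3, 1, 3, 6, 4, 3, 6, 2, 8,
    7, 6, 5, 3, 4, 2, 3, 2, 4, 8,
    3, 8, 5, 5, 5, 6, 2, 1, 2, 4,
]

n = len(dahl_letters)
mean = statistics.mean(dahl_letters)
# statistics.variance divides by n - 1, i.e. it is the unbiased estimator.
variance = statistics.variance(dahl_letters)
std_dev = math.sqrt(variance)
std_error = std_dev / math.sqrt(n)

Z_95 = 1.96  # assumed confidence level of 95%
ci_lower = mean - Z_95 * std_error
ci_upper = mean + Z_95 * std_error

print(f"n = {n}")
print(f"mean = {mean:.2f}")
print(f"unbiased variance = {variance:.3f}")
print(f"standard deviation = {std_dev:.3f}")
print(f"standard error = {std_error:.3f}")
print(f"95% confidence interval = ({ci_lower:.2f}, {ci_upper:.2f})")
```

The same steps applied to the Great Expectations sample would give the second set of figures needed for the comparison.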
Frequency distribution
I
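A frequency distribution for the Roald Dahl sample can be tallied from the "Letters in word" column of the raw data table. This is a minimal sketch built from that table; the text layout of the output is illustrative and is not the layout of the original frequency table or graph.

```python
from collections import Counter

# "Letters in word" column of the Roald Dahl table, in row order.
dahl_letters = [
    4, 4, 2, 10, 4, 5, 3, 3, 3, 3,
    5, 1, 3, 7, 8, 4, 5, 4, 5, 5,
    3, 3, 1, 3, 6, 4, 3, 6, 2, 8,
    7, 6, 5, 3, 4, 2, 3, 2, 4, 8,
    3, 8, 5, 5, 5, 6, 2, 1, 2, 4,
]

# Count how many sampled words have each length.
frequency = Counter(dahl_letters)

print("Letters in word | Frequency |")
for length in range(1, max(frequency) + 1):
    count = frequency[length]  # Counter returns 0 for lengths that never occur
    print(f"{length} | {count} | {'#' * count}")
```

The row of `#` characters gives a rough text version of the frequency distribution graph, which makes the skew towards short words easy to see.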
Communication
Limitations
One major limitation was the size of my samples. Given the time allowed to complete the investigation, collecting 50 words from each book seemed sensible, but larger samples would have given more precise estimates of the population means. If I were to repeat the investigation, I would increase the sample size for this reason.
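The effect of sample size on precision can be illustrated with the standard error of the mean, which shrinks in proportion to the square root of the sample size. In the sketch below the standard deviation of 2.0 letters is an illustrative assumption, not a figure taken from the results.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the sample mean for a sample of size n."""
    return std_dev / math.sqrt(n)


ASSUMED_STD_DEV = 2.0  # illustrative value only, not a reported result

for n in (50, 100, 200):
    se = standard_error(ASSUMED_STD_DEV, n)
    print(f"n = {n}: standard error = {se:.3f}")
```

Quadrupling the sample from 50 to 200 words halves the standard error, so the confidence interval becomes half as wide. Doubling the sample gives a smaller gain, which is worth weighing against the extra collection time.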
Extensions
To extend the investigation, I could have looked at other measures, such as the number of words per line, the number of words per page or the number of paragraphs per page.
Improvements
To improve the investigation, I could have collected more results. A larger sample would be likely to give a sample mean closer to the population mean.
I could also have collected different types of results, such as the number of words per page.
Conclusion
In conclusion, my results suggest that it is possible to gain information about the authorship of a text using statistical measures: the adult text has a higher mean number of letters per word and greater variation in word length. More information could be gained by collecting a larger sample.
