# Determine whether it is possible to gain information about authorship of a text using statistical measures.

Introduction

Aim: the aim of this investigation is to determine whether it is possible to gain information about the authorship of a text using statistical measures. I will compare two books: the first is aimed at an adult audience, and the second is aimed at children and written in simpler English.

I will compare the complexity of the two books statistically by calculating the average number of words per sentence and the average number of letters per word in each book. Using this information I will then calculate confidence intervals, which are what I will compare between the two books. I have chosen these two measures because neither is affected by font size, the number of lines on a page or the number of words per line.

The final outcome of this investigation should provide the evidence needed to decide which of the two books is more complex, that is, which uses longer words and sentences.

The two books I have used are:

“Diana, Her True Story” – Andrew Morton (174 pages, max 39 lines per page)

“A Series of Unfortunate Events” – Lemony Snicket (190 pages, max 21 lines per page)

Middle

As shown on the sample mean distribution, the sample mean I calculate can fall anywhere on this normal curve. However, I can calculate confidence intervals, which let me state, to any given percentage, how confident I can be that the mean of the sample means (the true mean for the book) lies within an interval around my individual sample mean. This relies on the central limit theorem: for a large enough sample, the sample means are approximately normally distributed around the true mean.
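As a minimal sketch of the confidence interval calculation described above, the Python below uses the usual normal-based formula: sample mean plus or minus z times the standard error. The function name `confidence_interval` and the list `example_counts` are hypothetical, and the numbers are made up for illustration rather than taken from either book. It assumes the sample is large enough (50 in this investigation) for the normal approximation to be reasonable; the example list is shorter only to keep it readable.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Return (lower, upper) for a normal-based confidence interval of the mean."""
    n = len(sample)
    sample_mean = mean(sample)
    # Standard error: sample standard deviation divided by the square root of n.
    standard_error = stdev(sample) / sqrt(n)
    # z-value for a two-sided interval, roughly 1.96 when level is 0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return sample_mean - z * standard_error, sample_mean + z * standard_error


# Hypothetical words-per-sentence counts, NOT data from either book.
example_counts = [12, 25, 18, 31, 9, 22, 15, 27, 20, 14]

lower, upper = confidence_interval(example_counts)
print(f"95% interval for the mean: {lower:.2f} to {upper:.2f}")
```

Calling `confidence_interval(example_counts, level=0.99)` gives a wider interval, because being more confident requires allowing a larger range around the sample mean.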

In order to obtain the sample data of sample size 50 for the number of words per sentence I:

- Generated 50 random numbers for each book, each one picking a random page. The adult book has 174 pages and the children's book has 190 pages. When generating random numbers on a calculator I used the whole-number part and ignored the decimals, e.g. if 163.691 came up I would use page 163. I also generated numbers up to one page higher than the last page, e.g. up to 191 for the children's book, so that the last page has the same chance of occurring as any other. Generating up to 191 means the highest number that can come up is 190.999 (to 3 d.p.), which counts as page 190. The number must therefore lie between 0 and 175 for the adult book and between 0 and 191 for the children's book.

- Next I randomly generated a number to choose which sentence to count on that page, according to the number of sentences on it, e.g. if a page has 11 sentences, I generated numbers up to 12.
- I counted the number of words in that sentence and recorded it in a table that I constructed.
- I repeated the above procedure for both books.
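The sampling steps above can be sketched in Python roughly as follows. This imitates the calculator method of generating a decimal up to one more than the highest value and keeping only the whole-number part. All names (`pick_whole_number`, `pick_pages`, the constants) are hypothetical. One assumption is not stated in the text: if the truncated number is 0, which matches no page or sentence, the sketch simply draws again.

```python
import random

ADULT_PAGES = 174  # "Diana, Her True Story"
CHILD_PAGES = 190  # "A Series of Unfortunate Events"
SAMPLE_SIZE = 50


def pick_whole_number(highest, rng):
    """Draw a decimal below highest + 1 and keep its whole-number part."""
    while True:
        # e.g. 163.691 becomes 163, mirroring the calculator method.
        number = int(rng.random() * (highest + 1))
        # Assumption: 0 (or anything out of range) is discarded and redrawn.
        if 1 <= number <= highest:
            return number


def pick_pages(last_page, sample_size=SAMPLE_SIZE, seed=None):
    """Return sample_size random page numbers; repeated pages are allowed."""
    rng = random.Random(seed)
    return [pick_whole_number(last_page, rng) for _ in range(sample_size)]


adult_sample_pages = pick_pages(ADULT_PAGES, seed=1)
child_sample_pages = pick_pages(CHILD_PAGES, seed=1)

# On a page with 11 sentences, choose which sentence to count.
sentence_number = pick_whole_number(11, random.Random(2))
print(adult_sample_pages[:5], child_sample_pages[:5], sentence_number)
```

Fixing the `seed` only makes the sketch repeatable; leaving it as `None` gives a fresh random sample each time, which is closer to using a calculator.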

Conclusion

- When counting the number of words in a line or sentence, a number, e.g. 5, is not counted as a word. The characters . , “ ( ) ‘ ? ! : ; + - % etc. are not counted as words either.

- A word which is split by a hyphen, e.g. co-ordination, is classed as one word. Direct speech is included when counting words. When counting the letters in a word, only letters of the alphabet count: a hyphen is not counted as a letter, e.g. in co-ordination.

These rules must be specified so that I treat every piece of data in exactly the same way and so avoid bias.
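One way to apply the counting rules consistently is to write them down as code. The Python below is a sketch only, with hypothetical function names and a made-up example sentence. It assumes that words are separated by spaces and that any item containing at least one letter counts as a word, so a mixed item such as "5th" would be counted; the rules above do not cover that case.

```python
def is_word(token):
    """A token counts as a word only if it contains at least one letter."""
    return any(ch.isalpha() for ch in token)


def count_words(sentence):
    """Count words, skipping bare numbers and stand-alone symbols."""
    # Splitting on spaces keeps a hyphenated word such as
    # "co-ordination" together, so it is counted once.
    return sum(1 for token in sentence.split() if is_word(token))


def count_letters(word):
    """Count only letters; hyphens, apostrophes and other marks are skipped."""
    return sum(1 for ch in word if ch.isalpha())


# Made-up sentence containing direct speech, a number and a hyphenated word.
sentence = '"There were 5 people," she said, "and no co-ordination!"'

print(count_words(sentence))           # 8: the number 5 is not counted
print(count_letters("co-ordination"))  # 12: the hyphen is not counted
```

The average number of letters per word for a sample is then the total from `count_letters` divided by the number of words counted.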

