Determine whether it is possible to gain information about authorship of a text using statistical measures.

Extracts from this document...


Aim: to determine whether it is possible to gain information about the authorship of a text using statistical measures. I will be comparing two books, the first aimed at an adult audience and the second aimed at children and written in simpler English.

I will compare the complexity of the two books statistically by calculating the average number of words per sentence and the average number of letters per word in each book. Using this information I will then calculate confidence intervals, which are what I will compare between the two books. I have chosen to investigate these two measures because neither is affected by font size, the number of lines on a page or the number of words per line.

The final outcome of this investigation should give the evidence needed to decide which of the two books is more complex, that is, which uses longer words and sentences.

The two books I have used are:

“Diana, Her True Story” – Andrew Morton (174 pages, max 39 lines per page)

“A Series of Unfortunate Events” – Lemony Snicket (190 pages, max 21 lines per page)



As shown on the sample mean distribution, the actual sample mean I calculate can fall anywhere on this normal curve. However, I am able to calculate confidence intervals, which tell me, for any chosen percentage, how confident I can be that the mean of the sample means (the true mean for the book) lies within a given interval around my individual sample mean. All of this relies on the central limit theorem, which says that for a large enough sample the sample means are approximately normally distributed.
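The interval described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `mean_confidence_interval`, the 95% level (z = 1.96) and the use of the sample standard deviation are choices made for the example, not details taken from the investigation, and the data values are made up.

```python
import math
import statistics

def mean_confidence_interval(sample, z=1.96):
    """Return (lower, upper) for an approximate confidence interval of the mean.

    Assumption: the normal approximation is used, with z = 1.96 giving
    roughly a 95% confidence level.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    # Sample standard deviation (n - 1 divisor) estimates the population spread.
    s = statistics.stdev(sample)
    half_width = z * s / math.sqrt(n)
    return mean - half_width, mean + half_width

# Made-up words-per-sentence counts, purely for illustration.
example_counts = [12, 18, 25, 9, 31, 14, 22, 17]
lower, upper = mean_confidence_interval(example_counts)
print(f"Interval for the mean: {lower:.2f} to {upper:.2f}")
```

If the intervals for the two books do not overlap, that would suggest a real difference between their mean sentence lengths (or word lengths) at the chosen confidence level.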

In order to obtain the sample data of sample size 50 for the number of words per sentence I:  

  • Generated 50 random numbers for each book to pick random pages. The adult book has 174 pages and the children's book has 190 pages. When generating random numbers on a calculator I kept the whole-number part and ignored the decimals, e.g. if 163.691 came up I used page 163. I also generated numbers up to one page higher than the last page, e.g. up to 191 for the children's book, so that the last page had an equal chance of being chosen: the highest number that can then be generated is 190.999 (to 3 d.p.), which is classed as page 190. The generated number must therefore be between 0 and 175 for the adult book and between 0 and 191 for the children's book.
  • Next I randomly generated the number of the sentence to count on that page, according to the number of sentences on it, e.g. if a page had 11 sentences, I generated numbers up to 12.
  • I counted the number of words in that sentence and recorded it in a table that I had constructed.
  • I completed the above procedure for both books.
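The page-picking step above can be sketched in Python as follows. The function name `random_page` is hypothetical, and the redraw when the result is 0 is an assumption: the text does not say how a generated value below 1 was handled, but there is no page 0 to use.

```python
import random

def random_page(num_pages, rng=random):
    """Pick a page by generating a number up to num_pages + 1 and dropping the decimals."""
    while True:
        # e.g. 163.691 becomes page 163
        page = int(rng.uniform(0, num_pages + 1))
        # Assumption: a result outside 1..num_pages (such as 0) is discarded
        # and a new number is generated.
        if 1 <= page <= num_pages:
            return page

rng = random.Random(0)  # seeded only so the example is repeatable
adult_pages = [random_page(174, rng) for _ in range(50)]
child_pages = [random_page(190, rng) for _ in range(50)]
print(adult_pages[:5], child_pages[:5])
```

The same function could be reused to pick the sentence on a page by passing the number of sentences instead of the number of pages.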


  • If the randomly chosen sentence is the last on the page and runs over onto the next page, it is still counted in full, from the page where it starts to the page where it finishes.
  • A sentence ends with . ? ! or .... unless these characters are part of a name, web address etc. in the book, e.g. Dr. Who or www.webaddress.com.
  • The first sentence on a page is the first full sentence that starts on that page, not the end of a sentence carried over from the page before. It therefore begins after a . ? ! or .... . The beginning of a new chapter or paragraph may also be the first sentence.
  • When counting the number of words in a line or sentence, a number, e.g. 5, is not counted as a word, and the characters . , “ ( ) ‘ ? ! : ; + - % etc. are not counted as words either.
  • A word which is split by a hyphen, e.g. co-ordination, is classed as one word. Direct speech is included when counting words. When counting the letters in a word, only letters of the alphabet count, so a hyphen is not counted as a letter, e.g. in co-ordination.

These rules must be specified so that I treat every piece of data in exactly the same way and therefore avoid a biased approach.
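The counting rules can be written down as a short Python sketch. The function names `count_words` and `count_letters` are hypothetical, and treating any space-separated chunk that contains at least one letter as a word is an assumption made to keep the example simple (so a mixed chunk such as "5th" would count as a word).

```python
def count_words(sentence):
    """Count words: a chunk counts only if it contains at least one letter.

    Numbers and stand-alone punctuation are skipped, and a hyphenated
    word such as co-ordination counts once.
    """
    return sum(1 for chunk in sentence.split()
               if any(ch.isalpha() for ch in chunk))

def count_letters(word):
    """Count only letters of the alphabet, ignoring hyphens and punctuation."""
    return sum(1 for ch in word if ch.isalpha())

print(count_words('He said, "co-ordination takes 5 days!"'))  # 5
print(count_letters("co-ordination"))  # 12
```

Writing the rules as functions makes it easy to check that the same rule is applied to every sentence and word in both books.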


This student written piece of work is one of many that can be found in our GCSE Comparing length of words in newspapers section.
