Determine whether it is possible to gain information about authorship of a text using statistical measures.

Extracts from this document...


Aim: to determine whether it is possible to gain information about the authorship of a text using statistical measures. I will compare two books: the first is aimed at an adult audience, and the second is aimed at children and is written in simpler English.

I will compare the complexity of the two books statistically by calculating the average number of words per sentence and the average number of letters per word in each book. Using this information I will then calculate confidence intervals, which are what I will compare between the two books. I have chosen these two measures because neither is affected by font size, the number of lines on a page or the number of words per line.

The outcome of this investigation should provide evidence to show which of the two books is more complex, that is, which uses longer words and longer sentences.

The two books I have used are:

“Diana, Her True Story” – Andrew Morton (174 pages, max 39 lines per page)

“A Series of Unfortunate Events” – Lemony Snicket (190 pages, max 21 lines per page)



As shown on the sample mean distribution, the actual sample mean I calculate can fall anywhere on this normal curve. However, I am able to calculate a confidence interval: a range around my individual sample mean which, at any given percentage level of confidence, should contain the mean of the sample means. This relies on the central limit theorem, which says that for a large enough sample size the sample means are approximately normally distributed.
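
The calculation described above can be sketched in Python. This is only a minimal illustration, assuming the usual normal-approximation interval (sample mean plus or minus z times the standard error). The function name `confidence_interval` and the example numbers are hypothetical and are not data from this investigation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Return (lower, upper) bounds for the mean using a normal approximation."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (divides by n - 1)
    # z value for the chosen confidence level, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * s / sqrt(n)  # z times the standard error of the mean
    return x_bar - half_width, x_bar + half_width


# Hypothetical words-per-sentence counts (made up for illustration only).
example = [12, 25, 9, 31, 18, 22, 14, 27, 16, 20]
low, high = confidence_interval(example)
print(f"mean = {mean(example):.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

The example list has only 10 values to keep it short. The normal approximation is more reasonable for a sample of 50, which is the size used in this investigation.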

To obtain a sample of size 50 for the number of words per sentence, I:

  • Generated 50 random numbers for each book to pick random pages. The adult book has 174 pages and the children's book has 190 pages. When generating random numbers on a calculator, I used the whole-number part and ignored the decimals, e.g. if 163.691 came up on the calculator I used page 163. I also generated numbers up to one page higher than the last page, e.g. up to 191 for the children's book, so that the last page had an equal chance of being chosen: the highest number that can then be generated is 190.999 (to 3 d.p.), which counts as page 190. The number must therefore lie between 0 and 175 for the adult book and between 0 and 191 for the children's book.
  • Next, I randomly generated the number of the sentence to count on that page, according to how many sentences the page has, e.g. if a page has 11 sentences, I generated numbers up to 12.
  • I counted the number of words in that sentence and recorded it in a table that I constructed.
  • I repeated the above procedure for both books.
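
The random selection in the steps above can be sketched in Python in place of a calculator. This is a rough sketch, not the exact procedure used. The names `random_index` and `sample_pages` are hypothetical, and drawing again when the whole-number part is 0 is my own assumption, since the steps do not say what to do in that case.

```python
import random


def random_index(last, rng=random):
    """Pick a whole number from 1 to last by truncating a random decimal."""
    while True:
        value = rng.uniform(0, last + 1)  # e.g. 0 to 191 for a 190-page book
        index = int(value)                # keep the whole number: 163.691 -> 163
        # Assumption: 0 (or the upper end point itself) is not a real page
        # or sentence number, so in that case draw again.
        if 1 <= index <= last:
            return index


def sample_pages(last_page, count=50, seed=None):
    """Return `count` randomly chosen page numbers for one book."""
    rng = random.Random(seed)
    return [random_index(last_page, rng) for _ in range(count)]


adult_pages = sample_pages(174)   # adult book: 174 pages
child_pages = sample_pages(190)   # children's book: 190 pages
sentence_number = random_index(11)  # a page with 11 sentences
```

The same function serves for both the page and the sentence, because each step picks a whole number between 1 and some upper limit.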


  • If the randomly chosen sentence is the last on the page and runs over onto the next page, the whole sentence is still counted, from the page where it starts to the page where it finishes.
  • A sentence ends with . ? ! or .... unless these characters are part of a name, web address, etc. in the book, e.g. Dr. Who or www.webaddress.com.
  • The first sentence on a page is the first full sentence on that page, not the end of a sentence carried over from the page before. It therefore begins after a . ? ! or .... The start of a new chapter or paragraph may also mark the first sentence.
  • When counting the number of words in a line or sentence, a number, e.g. 5, isn’t counted as a word, and the characters . , “ ( ) ‘ ? ! : ; + - % etc. are not counted as words either.
  • A word which is split by a hyphen, e.g. co-ordination, is classed as one word. Direct speech is also included when counting the words. When counting the letters in a word, only the letters of the alphabet count, so a hyphen is not counted as a letter, e.g. in co-ordination.

These rules must be specified so that I treat every piece of data in exactly the same way and therefore avoid bias.
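
The word and letter counting rules can be sketched in Python as follows. This is a minimal sketch under my own assumptions: the function names `words_in` and `letter_count` are hypothetical, words are taken to be separated by spaces, and a token containing both digits and letters (such as 5th) is treated as a word, which the rules above do not cover.

```python
def words_in(sentence):
    """Return the parts of a sentence that count as words under the rules."""
    words = []
    for token in sentence.split():
        # A token counts as a word only if it contains at least one letter,
        # so a number such as 5 and stand-alone punctuation are skipped.
        # A hyphenated word such as co-ordination stays as one token.
        if any(ch.isalpha() for ch in token):
            words.append(token)
    return words


def letter_count(word):
    """Count only letters of the alphabet, ignoring hyphens and punctuation."""
    return sum(1 for ch in word if ch.isalpha())


# Made-up example sentence, not taken from either book.
sentence = 'She said, "The co-ordination of 5 teams was hard!"'
words = words_in(sentence)
letters = sum(letter_count(w) for w in words)
print(len(words), "words,", letters, "letters")
```

Dividing the letter total by the word total gives the average number of letters per word for that sentence.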


This student written piece of work is one of many that can be found in our GCSE Comparing length of words in newspapers section.
