• Join over 1.2 million students every month
  • Accelerate your learning by 29%
  • Unlimited access from just £6.99 per month

Determine whether it is possible to gain information about authorship of a text using statistical measures.

Extracts from this document...


Aim: to determine whether it is possible to gain information about the authorship of a text using statistical measures. I will compare two books: the first is aimed at an adult audience, and the second is aimed at children and written in simpler English.

I will compare the complexity of the two books statistically by calculating the mean number of words per sentence and the mean number of letters per word in each book. Using this information I will then calculate confidence intervals, and it is these intervals that I will compare between the two books. I have chosen these two measures because neither is affected by font size, the number of lines on a page or the number of words per line.

The outcome of this investigation should provide evidence of which of the two books is more complex, that is, which uses longer words and longer sentences.

The two books I have used are:

“Diana, Her True Story” – Andrew Morton (174 pages, max 39 lines per page)

“A Series of Unfortunate Events” – Lemony Snicket (190 pages, max 21 lines per page)

...read more.


As shown on the distribution of sample means, the sample mean I calculate can fall anywhere on this normal curve. However, I am able to calculate confidence intervals, which let me say, for any chosen percentage, how confident I can be that the mean of the sample means lies within a given interval around my individual sample mean. All of this relies on the central limit theorem.
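As an illustration of the confidence interval idea, the sketch below computes an interval for a mean from one sample using the normal approximation. It is only a sketch under stated assumptions: the function name `confidence_interval`, the default 95% level and the use of the sample standard deviation in place of the unknown population value are my own choices for illustration, not necessarily the method used in this investigation.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(sample, level=0.95):
    """Return (lower, upper) bounds for the population mean.

    Assumption: the sample is large enough (e.g. n = 50) for the
    central limit theorem to make the sample mean roughly normal,
    and the sample standard deviation stands in for the unknown
    population standard deviation.
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # sample standard deviation (divisor n - 1)
    # z value that leaves (1 - level) / 2 in each tail of N(0, 1)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width
```

Calling `confidence_interval(words_per_sentence, level=0.99)` on a list of 50 recorded sentence lengths (the list name is hypothetical) would give a wider interval than the 95% one, because a higher confidence level needs a larger z value.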

In order to obtain the sample data of sample size 50 for the number of words per sentence I:  

  • Generated 50 random numbers for each book, each one picking a random page. The adult book has 174 pages and the children's book has 190 pages. When generating random numbers on a calculator I used the whole-number part and ignored the decimals, e.g. if 163.691 came up on the calculator I would use page 163. I also generated numbers up to one page higher than the last page, e.g. up to 191 for the children's book, so that the last page has an equal chance of occurring: if I generate up to 191, the highest number that can come up is 190.999 (to 3 d.p.), which is classed as page 190. The number must therefore be between 0 and 175 for the adult book and between 0 and 191 for the children's book.
  • Next I randomly generated the number of the sentence to count on that page in the same way, according to the number of sentences on the page, e.g. if the page has 11 sentences, I generated numbers up to 12.
  • I counted the number of words in that sentence and recorded the result in a table that I had constructed.
  • I repeated the above procedure for both books.
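The sampling steps above can be sketched in Python as follows. This is a minimal sketch, not the calculator procedure itself: the function names and the `sentences_on_page` callable (which would return the number of sentences counted on a given page of the physical book) are hypothetical, and redrawing whenever the truncated number is 0 is my own assumption, since the text does not say what happens when a number below 1 comes up.

```python
import random


def random_whole_number(highest, rng):
    """Truncate a random decimal in the range 0 to highest + 1.

    Generating up to highest + 1 and dropping the decimals gives
    every whole number from 1 to highest the same chance.
    Assumption: a result of 0 (no such page or sentence) is redrawn.
    """
    while True:
        value = int(rng.uniform(0, highest + 1))
        if 1 <= value <= highest:
            return value


def sample_positions(last_page, sentences_on_page, sample_size=50, seed=None):
    """Return a list of (page, sentence number) pairs to count."""
    rng = random.Random(seed)
    positions = []
    for _ in range(sample_size):
        page = random_whole_number(last_page, rng)
        sentence = random_whole_number(sentences_on_page(page), rng)
        positions.append((page, sentence))
    return positions
```

For the adult book this would be called as `sample_positions(174, sentences_on_page)`, and for the children's book as `sample_positions(190, sentences_on_page)`.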
...read more.


  • If the randomly chosen sentence is the last on the page and runs over onto the next page, it is still counted in full, from the page where it starts to the page where it finishes.
  • A sentence ends with . ? ! or .... unless these characters are part of a name, web address etc. in the book, e.g. Dr. Who or www.webaddress.com.
  • The first sentence on a page is the first full sentence on that page, not the end of a sentence carried over from the page before. It therefore begins either after one of the sentence-ending characters or at the start of a new chapter or paragraph.
  • When counting the number of words in a line or sentence, a number, e.g. 5, is not counted as a word, and the characters . , “ ( ) ‘ ? ! : ; + - % etc. are not counted as words either.
  • A word that is split by a hyphen, e.g. co-ordination, is classed as one word. Direct speech is included when counting words. When counting the letters in a word, only letters of the alphabet count, so a hyphen is not counted as a letter, e.g. in co-ordination.

These rules must be specified so that I treat every piece of data in exactly the same way and therefore avoid bias.
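The counting rules can be written down as a short Python sketch, which shows how applying them mechanically treats every sentence the same way. The function names are hypothetical, and splitting the sentence on spaces is an assumption about how the words are separated on the page.

```python
def words_in_sentence(sentence):
    """Return the tokens that count as words under the rules above.

    A token counts as a word only if it contains at least one letter,
    so a number such as 5 and stray characters such as - or % are
    ignored. A hyphenated word such as co-ordination stays as one token.
    """
    return [token for token in sentence.split()
            if any(ch.isalpha() for ch in token)]


def count_words(sentence):
    """Number of words in the sentence, including direct speech."""
    return len(words_in_sentence(sentence))


def count_letters(word):
    """Count only letters of the alphabet; hyphens and punctuation are skipped."""
    return sum(1 for ch in word if ch.isalpha())


def mean_letters_per_word(sentence):
    """Mean word length for one sentence, in letters."""
    words = words_in_sentence(sentence)
    return sum(count_letters(w) for w in words) / len(words)
```

For example, `count_letters("co-ordination")` counts the twelve letters and skips the hyphen, while `count_words` treats the same token as a single word.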

...read more.

This student written piece of work is one of many that can be found in our GCSE Comparing length of words in newspapers section.
