

Kuljit Bahra        AS Maths Coursework        10/03/2003

Statistics Coursework – Authorship

Design

Aim

        My aim is to investigate whether it is possible to gain information about the authorship of a text by using statistical measures. I will compare an adult text with a child text, using the number of letters per word as my variable. For each sample I will calculate the mean, the variance and the standard deviation, using the unbiased estimator of the population variance. I will then calculate the standard error and a confidence interval for the mean of each population. I will present my data in frequency distribution tables and graphs, and I will illustrate the confidence intervals with normal distribution diagrams.
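        The calculations named in the aim can be sketched in a few lines of Python. This is only an illustration: the word lengths below are made up, the function name summarise is hypothetical, and the 95% level (z = 1.96) is an assumption, since the aim does not state which confidence level will be used.

```python
import math
import statistics


def summarise(word_lengths, z=1.96):
    """Summarise a sample of word lengths (letters per word).

    z = 1.96 gives an approximate 95% confidence interval; the level
    is an assumption made for this sketch.
    """
    n = len(word_lengths)
    mean = statistics.mean(word_lengths)
    # statistics.variance divides by n - 1, i.e. the unbiased estimator
    variance = statistics.variance(word_lengths)
    sd = math.sqrt(variance)
    # Standard error of the sample mean
    se = sd / math.sqrt(n)
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "se": se,
        "ci": (mean - z * se, mean + z * se),
    }


# Made-up word lengths, for illustration only (not data from either book)
sample = [3, 5, 2, 7, 4, 6, 3, 4, 5, 1]
result = summarise(sample)
print(result)
```

        The same function would be applied once to the sample from each book, so that the two means and their confidence intervals can be compared.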

Hypothesis

        I predict that the words in Great Expectations by Charles Dickens will have more letters on average than the words in Charlie and the Great Glass Elevator by Roald Dahl, so the mean number of letters per word will be larger for Great Expectations. I also expect Great Expectations to have a larger standard deviation, because it uses a wider vocabulary.

Population

        I will randomly select 50 pages from each book by using the RAND function in Microsoft Excel. Once I have 50 random pages for each book, I will select a random line for each page. I will finally select a random word from each line.

Using the RAND function

        I got my random numbers by using the following process.

        e.g.                248 × RAND        (248 = number of pages in book)

                        36 × RAND                (36 = number of lines on page)

                        13 × RAND                (13 = number of words on line)

        I will count the number of lines on each page and multiply this count by RAND, so that the random number covers the correct range each time. I will use the same process to choose which word to select on each line.
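        The selection process above can be imitated in Python, whose random() plays the same role as RAND by returning a value from 0 up to (but not including) 1. The text does not say how the product is turned into a whole number, so the sketch assumes it is truncated and 1 is added, which gives every value from 1 to the count an equal chance. The function name pick and the seed are hypothetical.

```python
import random


def pick(count, rng=random):
    """Return a whole number from 1 to count, each equally likely.

    rng.random() lies in [0, 1), so count * rng.random() lies in
    [0, count). Truncating and adding 1 is an assumed rounding rule.
    """
    return int(count * rng.random()) + 1


rng = random.Random(2003)  # arbitrary seed so the run can be repeated

page = pick(248, rng)  # 248 = number of pages in book
line = pick(36, rng)   # 36 = number of lines on the chosen page
word = pick(13, rng)   # 13 = number of words on the chosen line
print(page, line, word)
```

        In practice the counts 36 and 13 would be replaced by the actual number of lines on the chosen page and words on the chosen line.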

Sampling

        Sampling is the selection of individual members from a population. The advantages of taking a sample rather than a census are that it is cheaper and quicker, and the results are easier to analyse. The disadvantage is that the results are subject to natural variation or bias, so the sample may not be representative of the whole population and the estimates from it may not be accurate.

        There are rules that must be followed when choosing a sample.

  • The sample size must be large enough for the results to be reasonably accurate. A very small sample may not represent the rest of the population, so I must make sure that any sample I take is large enough to be representative of the population as a whole. For this reason, I am going to take a sample of 50 words from each book.

  • The sample should be taken at random. If a random sample is not taken, then my results may be biased. If I chose the page, line and word myself, I would end up with data which is unrepresentative. To get a representative set of data, I used the RAND function in Microsoft Excel to generate the random page number, line number and word number.

Method

        For this investigation, I am finding out whether it is possible to gain information about the authorship of a text. I will be using an adult text and a child text. The adult text is Great Expectations by Charles Dickens, which consists of 484 pages. The child text is Charlie and the Great Glass Elevator by Roald Dahl, which consists of 190 pages, minus eight pages at the beginning of the text. I will select 50 random pages from each book. I will then select a random line ...
