
AS Mathematics: (AQA) Statistics Coursework

DESIGN

Introduction:

The aim of this investigation is to gain statistical information to show authorship of a text. For this investigation, I will use two pieces of text. For the investigation to be valid, the two texts should differ in theme. By theme, I mean that they need to differ in a broad way, e.g. a different genre or a different age of reader.

I had a number of different texts to compare, but I decided to use one adult text and one child text, as this should give a more obvious variation between the two and a clearer expectation of the results.

For this investigation I will calculate the sample mean for each text. I will then calculate the variance and standard deviation, using the unbiased estimator of the population variance in each case. Finally, I will calculate the standard error and confidence intervals for both populations.
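As a rough illustration of these calculations, here is a minimal Python sketch. The sample values are made up for the example and are not data from either book. The 95% confidence level and the use of the normal distribution for the interval are assumptions of mine, as is the function name `summarise`.

```python
import math
from statistics import NormalDist, mean, variance


def summarise(sample, confidence=0.95):
    """Return the summary statistics for one sample (hypothetical helper)."""
    n = len(sample)
    x_bar = mean(sample)
    # statistics.variance divides by n - 1, i.e. the unbiased estimator.
    s_squared = variance(sample)
    std_error = math.sqrt(s_squared / n)
    # z value for a two-sided interval (assumed normal-based interval).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "n": n,
        "mean": x_bar,
        "variance": s_squared,
        "std_dev": math.sqrt(s_squared),
        "std_error": std_error,
        "interval": (x_bar - z * std_error, x_bar + z * std_error),
    }


# Made-up values purely for illustration.
hypothetical_sample = [3, 5, 4, 7, 2, 6, 4, 5, 3, 6]
result = summarise(hypothetical_sample)
print(result)
```

The same function would be run once for each text, so that the two means and intervals can be compared.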

My data will be presented in frequency distribution tables, which show the same trends as a frequency distribution graph. Normal distribution diagrams will also be used to represent the confidence intervals.
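A frequency distribution table of this kind could be built along the following lines. This is only a sketch: the values are invented, and treating them as word lengths is my assumption for the sake of the example.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(values)
    return [(value, counts[value]) for value in sorted(counts)]


# Invented values, assumed here to be word lengths.
hypothetical_lengths = [3, 5, 4, 7, 2, 6, 4, 5, 3, 6]
table = frequency_table(hypothetical_lengths)

print(" value | frequency")
for value, freq in table:
    print(f"{value:>6} | {freq}")
```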

Population:

In a statistical enquiry, you often need information about a particular group. This group is known as the POPULATION, and it may be small, large or infinite. The population for my investigation is all the words in each separate book.

Sampling:

Sampling is the selection of individual members of a population. The advantage of taking a sample is that it is cheaper and quicker, the results are easier to analyse, and it is appropriate for this type of investigation. Unfortunately, it has some disadvantages that are difficult to avoid: the results may include natural variation or bias, and so may not be representative of the whole population, in which case the results are meaningless.

The following are important factors to take in consideration when choosing a sample for this type of investigation:

==> The sample size must be large enough for the results to be accurate. A very small sample may not represent the rest of the population, so I must make sure that any sample I take is large enough to be representative of the population as a whole. I am therefore going to take a sample of 50 from each of the books.

==> The sample should be taken at random. If a random sample is not taken, then my results may be biased. If I chose for myself which page and which line to use, I would end up with data that is unrepresentative. So, in order to get a set of data that is representative, I will use the RAND function on my Casio calculator to get the random page number, line number and word number.

Sample size and method:

For this investigation, I have decided to select 50 pages at random from each text. The main reason for choosing the pages at random is to avoid biased results, which would be inaccurate and would therefore give a poor estimate of the population parameters.

Fifty pages is also enough to get an overview of the mean, standard deviation etc. A larger sample gives more accurate results than a smaller one, and 50 pages is sufficient to give accurate estimates of specific parameters. It is also a large enough sample for a theory I am going to use later in the coursework.

Method:

For this investigation, I am finding out whether it is possible to gain information about the authorship of a text. I will be using an adult text and a child text. The adult text is 'Watchers' by Dean R Koontz, a book of 507 pages. The child text is 'Million Dollar Egg' by Roderick Hunt. This book consists of 45 pages, minus three pages at the beginning of the text. I will select 50 random pages from each book. I will then select a random line and a random word on each of these pages.

How I will select the pages:

I will select 50 random pages from each book by using the RAND function on my calculator. Once I have 50 random pages for each book, I will select a random line on each of those pages. Finally, I will select a random word from each of those lines.

Scientific calculators have a random key, which can be used to generate random numbers. On the majority of scientific calculators these are decimals given correct to 3 decimal places. By multiplying by a fixed value, you can generate values up to a fixed maximum.

e.g. If there are 50 pages in a book and a random page is required, typing RANDOM * 50 into a calculator generates numbers whose whole-number parts run from 0 to 49. So, to choose a random page from all 50 pages, I would have to type (RANDOM * 50) + 1.

After typing (RANDOM * 50) + 1, I will keep only the digits before the decimal point, as book pages are consecutive whole numbers, not decimals. Rounding to the nearest whole number instead could give 51, which is not a page in the book:

e.g. 304.9 will be 304, 76.3 will be 76
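The (RANDOM * 50) + 1 calculation can be checked with a short Python sketch. Here `random.random()` stands in for the calculator's random key, and the function name `random_whole_number` is my own. The sketch keeps only the whole-number part of the result, which is an assumption on my part: it guarantees a value from 1 up to the maximum, whereas rounding to the nearest whole number can occasionally give one more than the maximum.

```python
import random


def random_whole_number(maximum, rand=random.random):
    """Mimic (RAND * maximum) + 1, keeping only the whole-number part.

    `rand` must return a decimal from 0 up to (but not including) 1,
    like the calculator's random key.
    """
    return int(rand() * maximum) + 1


# The two extremes of a 3 d.p. random key, for a 50-page book.
print(random_whole_number(50, rand=lambda: 0.000))  # 1
print(random_whole_number(50, rand=lambda: 0.999))  # 50

# Rounding to the nearest whole number instead can overshoot the book.
print(round((0.999 * 50) + 1))  # 51

# A random page from a 507-page book.
print(random_whole_number(507))
```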

Using the RAND Function:

(X, Y and Z are the totals for the number of pages, lines and words respectively)

e.g. X × RAND function (X is the number of pages in the book)

Y × RAND function (Y is the number of lines on the selected page)

Z × RAND function (Z is the number of words on the selected line)

The number of pages in each book is fixed. However, the number of lines on a page varies, for example where a chapter ends, and the number of words on each line varies, as longer words use up more space, giving fewer words per line, and vice versa. For this reason I will count the number of lines on each selected page and multiply this by the RAND function, so that the random number is accurate each time. The same process will be used for the word selection procedure.
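The three-stage selection of page, then line, then word can be sketched in Python as follows. The tiny two-page "book" is invented for the example, representing a book as a list of pages (each a list of lines) is my own assumption, and the function names are hypothetical. The line and word totals are counted afresh at each stage, as described above.

```python
import random


def pick(maximum, rng):
    """Whole number from 1 to maximum, as with (RAND * maximum) + 1."""
    return int(rng.random() * maximum) + 1


def sample_word(book, rng):
    """Pick a random page, then a random line on it, then a random word.

    Assumes every page has at least one line and every line has at
    least one word.
    """
    page_no = pick(len(book), rng)        # X = number of pages
    lines = book[page_no - 1]
    line_no = pick(len(lines), rng)       # Y = lines on the selected page
    words = lines[line_no - 1].split()
    word_no = pick(len(words), rng)       # Z = words on the selected line
    return page_no, line_no, word_no, words[word_no - 1]


# Invented text: two pages with different numbers of lines and words.
hypothetical_book = [
    ["the dog ran across the field", "it was late"],
    ["nobody answered", "he waited by the gate for an hour", "then he left"],
]

rng = random.Random(1)  # fixed seed so the run can be repeated
samples = [sample_word(hypothetical_book, rng) for _ in range(5)]
for sample in samples:
    print(sample)
```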


Encountering difficulties

When selecting the random words, by chance I could encounter a word with hyphens or some other grammatical expression. To avoid an unfair investigation, I have chosen to ignore these types of words or expressions and select another random word instead, e.g. Mr. is not considered a word. I will also ignore conjunctions. Before I collect my data I must also consider one other thing to do with the two texts that could affect the reliability of my results. The main one was to do with punctuation. I would only be ...
