Compare a modern romantic comedy with a very old romantic comedy - Compare word lengths mathematically.

Statistics Coursework

Introduction

For my investigation I intend to compare a modern romantic comedy with a very old romantic comedy. I have chosen Bridget Jones’s Diary by Helen Fielding and Emma by Jane Austen. Bridget Jones’s Diary was first published in 1996, whereas Emma was first published in 1816. As the books were written a hundred and eighty years apart, I think there will be a difference in the writing styles, and I think Emma will be harder to read. Although I cannot compare writing style and difficulty mathematically, I can compare word length, on the assumption that longer words are more difficult to read.

Hypothesis

Old books are harder to read than modern books of the same genre. I hope to find that the confidence intervals I calculate will not overlap, or will overlap only slightly; this would show that the population means are unlikely to be the same.

Method

In order to compare the word lengths, I am going to take a sample of eighty words from each book, then compare the means, variances and standard deviations in order to decide whether or not Emma is a more difficult book to read than Bridget Jones’s Diary.

To ensure that the sample is completely random I am using the random button on my calculator to pick a random page number and line number for the sample of sentences, and a random page, line number and word number for the sample of random words.

To determine my numbers I will use the formula:

                                RAN x p + o

Here p is the difference between the page number of the last page of the main body of the text and the first page of the main body of the text (i.e. not including any introductions, acknowledgements or appendices), and o is the number of pages before the main story.

Using the formula:

                                RAN x p

I can determine the line number by taking p as the maximum number of lines on a page, and I can determine the word number by taking p as the maximum number of words on a line.
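The two formulas above can be sketched in Python as follows. This is only an illustration: `random.random()` stands in for the calculator's RAN key, the names `pick_number` and `pick_position` are hypothetical, and the rounding rule (round down, then add 1 so that counting starts from 1) is an assumption, since the method does not say how the decimal result is turned into a whole number.

```python
import math
import random


def pick_number(ran, p, o=0):
    """Apply RAN x p + o and turn the result into a whole number.

    ran: a random decimal with 0 <= ran < 1 (the calculator's RAN key)
    p:   the size of the range (pages, lines per page or words per line)
    o:   the offset (pages before the main story; 0 for lines and words)
    Assumption: round down and add 1, so results run from o + 1 to o + p.
    """
    return math.floor(ran * p + o) + 1


def pick_position(pages, offset, max_lines, max_words, rng=random):
    """Pick a random (page, line, word) position for one sampled word."""
    page = pick_number(rng.random(), pages, offset)
    line = pick_number(rng.random(), max_lines)
    word = pick_number(rng.random(), max_words)
    return page, line, word


# Illustrative figures only, not taken from either book.
print(pick_position(pages=300, offset=10, max_lines=35, max_words=12))
```

If the chosen position does not exist in the book (for example a short line), a new number would simply be generated for that part.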

I will do this for both books, then put all the numbers into a table and find their corresponding words. I will then draw up a frequency table to make the data easier to interpret. From this I will calculate the sample mean, sample standard deviation, and unbiased estimator of the population standard deviation. I will then use these to calculate 90%, 95% and 99% confidence intervals, which will allow me to compare the means.
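As a rough sketch of these calculations, the Python below works from a frequency table of word lengths. The function names and the frequency table are invented for illustration and are not data from either book. The multipliers 1.645, 1.960 and 2.576 are the standard normal values for two-sided 90%, 95% and 99% intervals, and the interval is assumed to take the form mean ± z × s / √n, where s is the estimate that divides by n - 1.

```python
import math

# Standard normal multipliers for two-sided 90%, 95% and 99% intervals.
Z_VALUES = {90: 1.645, 95: 1.960, 99: 2.576}


def summarise(freq):
    """Return n, mean, sample sd and the n - 1 based sd from a frequency table.

    freq maps each word length to the number of words with that length.
    """
    n = sum(freq.values())
    mean = sum(length * f for length, f in freq.items()) / n
    sum_sq_dev = sum(f * (length - mean) ** 2 for length, f in freq.items())
    sample_sd = math.sqrt(sum_sq_dev / n)          # divides by n
    unbiased_sd = math.sqrt(sum_sq_dev / (n - 1))  # divides by n - 1
    return n, mean, sample_sd, unbiased_sd


def confidence_interval(mean, sd, n, z):
    """Return (lower, upper) = mean -/+ z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Made-up frequency table for illustration; NOT data from either book.
example = {1: 4, 2: 14, 3: 18, 4: 16, 5: 10, 6: 8, 7: 5, 8: 3, 9: 2}
n, mean, sample_sd, unbiased_sd = summarise(example)
print(f"n = {n}, mean = {mean:.3f}, sd = {sample_sd:.3f}, s = {unbiased_sd:.3f}")
for level, z in Z_VALUES.items():
    lower, upper = confidence_interval(mean, unbiased_sd, n, z)
    print(f"{level}% CI: ({lower:.3f}, {upper:.3f})")
```

Running the same functions on the frequency table for each book would give two sets of intervals that can be checked for overlap.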

Theory

Central Limit Theorem

The central limit theorem tells us that if we were to take all possible samples of a given large size and find all the sample means, then the distribution of these means would be approximately normal, whatever the distribution of the population. If I then found the mean of the sample means, it would be equal to the population mean.

        

Even if I only have one sample and I do not know whether the population is normally distributed or not, I can still use the central limit theorem, provided that:

  • The sample is random
  • The sample size is large ( ≥ 30)

This can be written as

$X \sim \text{any distribution}\,(\mu, \sigma^2) \;\Rightarrow\; \bar{X} \sim N(\mu, \sigma^2/n)$ approximately, for large $n$

This is ideal because I do not know the distribution of the data. As the sample I have taken is random and n = 70, I can use the central limit theorem.
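The theorem can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the population is taken to be exponential with mean 1 and standard deviation 1 (chosen simply because it is clearly not normal, not because word lengths follow it), and the function name `simulate_sample_means` is hypothetical.

```python
import random
import statistics


def simulate_sample_means(n=70, trials=5000, seed=0):
    """Draw many random samples of size n and return their means.

    The population is exponential with mean 1 and standard deviation 1,
    chosen only because it is clearly not a normal distribution.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


means = simulate_sample_means()
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.pstdev(means), 3))
print("sigma / sqrt(n):     ", round(1 / 70 ** 0.5, 3))
```

The mean of the sample means should come out close to the population mean of 1, and their standard deviation close to σ/√n, in line with the statement of the theorem.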

Confidence Intervals

A confidence interval gives an estimated range of values, which is likely to include an unknown population parameter, in this case the mean, the estimated range being calculated from a given set of sample data.

If independent samples are taken repeatedly from the same population, and a confidence interval is calculated for each sample, then a certain percentage (the confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can also produce 90%, 99% or 99.9% confidence intervals for the unknown parameter.

The confidence level is the probability value (1-α) associated with a confidence interval. It is often expressed as a percentage. For example, if α = 0.05 = 5%, then the confidence level = (1-0.05) = 0.95, that is, a 95% confidence level.

The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter. A very wide interval may indicate that more data should be collected before anything very definite can be said about the parameter.
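The link between α, the confidence level and the width of the interval can be sketched in Python using the standard library's `statistics.NormalDist`. The function names `z_for_level` and `interval_width` are hypothetical, the standard deviation of 2 is an illustrative figure, and the width formula 2 × z × s / √n assumes a normal-based interval for a mean.

```python
import math
from statistics import NormalDist


def z_for_level(level):
    """Return the two-sided standard normal multiplier for a confidence level."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)


def interval_width(level, sd, n):
    """Return the full width 2 * z * sd / sqrt(n) of the interval for a mean."""
    return 2 * z_for_level(level) * sd / math.sqrt(n)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for_level(level):.3f}")

# With an illustrative standard deviation of 2, a larger sample narrows the interval.
print(interval_width(0.95, sd=2, n=80) < interval_width(0.95, sd=2, n=20))
```

A higher confidence level gives a larger multiplier and so a wider interval, while a larger sample gives a narrower one.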

Normal Distribution

The normal curve was developed in 1733 by de Moivre as an approximation to the binomial distribution. His paper was not discovered until 1924, by Karl Pearson. Laplace used the normal curve in 1783 to describe the distribution of errors. Subsequently, Gauss used the normal curve to analyse astronomical data in 1809. The normal curve is often called the Gaussian distribution, and in everyday usage it is often called the bell-shaped curve.

The normal distribution is an approximation to the distribution of values of a characteristic. The distribution is useful as a model for the length of certain animals, the distribution of IQ scores, and so on. The exact shape of the normal distribution depends on the mean and the standard deviation of the distribution. The standard deviation is a measure of spread and indicates the amount of departure of the values from the mean.

Differences in standard deviation change the shape of the distribution. Although the distribution remains symmetric, it becomes flatter if we increase the standard deviation. This corresponds to more diversity between the observations.
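The flattening effect can be seen by comparing the height of the normal curve at its mean for different standard deviations. The short Python sketch below uses the standard library's `statistics.NormalDist`; the function name `peak_height` and the standard deviations tried are chosen only for illustration.

```python
from statistics import NormalDist


def peak_height(sd, mean=0.0):
    """Return the height of the normal curve at its mean."""
    return NormalDist(mean, sd).pdf(mean)


for sd in (0.5, 1.0, 2.0):
    print(f"sd = {sd}: peak height = {peak_height(sd):.3f}")
```

The peak height is 1 / (σ√(2π)), so doubling the standard deviation halves the height of the curve while the total area under it stays equal to 1.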

Problems

What should I do if:

  • There are no words on the selected page?

Randomly generate another page number.

  • There are no words on the selected line?

Randomly generate another line number.

  • The word number “k” doesn’t exist?

Randomly generate another word number.

  • The word continues onto the next line?

Count the whole length of the word.

  • The word is ...
