Statistically comparing books

Rebecca Nielson

Statistics Coursework

I am going to statistically compare two different books: The Order of the Phoenix by J. K. Rowling and Nicholas Nickleby by Charles Dickens. I will take a sample of words and sentences from each book to try to find similarities or differences between the two.

The Order of the Phoenix

This is the fifth book in the Harry Potter series, which tells the tale of Harry’s fifth year back at Hogwarts School of Witchcraft and Wizardry. The series so far has already made author J. K. Rowling into a multi-millionaire.

Nicholas Nickleby

This is the third book written by Charles Dickens. It tells the story of Mr Nickleby, who dies penniless, leaving his wife, daughter and son to fend for themselves. Nicholas, the son, soon finds his own way out of his family's desperate situation through perseverance and good fortune.

I have chosen these books because both are stories about a boy’s childhood; however, they were written in different centuries. From my data I hope to see whether there is a difference in the way the two books are written.

There are many features of a novel that I could use to compare books statistically; I am only going to use word length, sentence length and the number of syllables per word.

This should be enough information to see whether there is a difference between the two books.

Hypotheses

I think that Nicholas Nickleby will have a higher mean word length than The Order of the Phoenix because it is aimed at an adult audience, so it will contain more long words, which are usually harder to read.

I also think that there will be a good positive correlation between mean sentence length and reading age in both books. This is because longer sentences normally contain longer words, which are probably harder to read, so the reading age should rise as sentence length increases.

I think The Order of the Phoenix will have a smaller standard deviation of sentence length than Nicholas Nickleby. This is because Nicholas Nickleby is aimed at adults, so it will have more complex sentences, which will cause more variation in length.

I think the mean number of syllables per word will be very similar in both books. Even though they are aimed at different audiences, both will still contain plenty of one- and two-syllable words. If there is a difference I think it will only be small, with Nicholas Nickleby having a slightly higher mean number of syllables per word.

I’m going to use stratified sampling, which is similar to random sampling except that the sample is spread more evenly over the chapters, so it offers a wider view of the book. Both of my books are divided into chapters, so I will use them as my strata. This method isn’t as fast as simple random sampling but is still relatively quick, and because it offers a wider view of the book I think it is more reliable.

I will also take blocks of text and find the average letters per word, words per sentence and reading age of each block. I will use the readability statistics in Microsoft Word to find this information, which will save time as the computer works it all out instantly. Although this will show the flow of the writing, it doesn’t give a spread across the book, so it could be biased, as there will only be four or five blocks from the book.

Data Collection

Before I can begin to compare the books and test my hypotheses I need to collect data samples from each of them. I have looked on the internet to see whether the texts of my books were available, but they weren’t, so I will be collecting the data from both books myself. This may take some time, but this way I will know exactly what goes into my data.

I have to choose an appropriate sample size to represent the whole book. A sample size of around 20 would be too small to represent the whole book. A sample size of around 200 would be more representative than a sample size of 75, but it would take too long to collect. I have decided to use a sample size of 75 for both words and sentences. I believe that this size will represent the book and is manageable to collect.

In The Order of the Phoenix there are 38 chapters of similar length, so I will take two samples (at random) from each chapter and only one sample from the shortest chapter (37 × 2 + 1 = 75). In Nicholas Nickleby there are 64 chapters, so I will take one sample (at random) from each chapter and two samples from each of the 11 longest chapters (53 + 11 × 2 = 75). This method makes sure my data fits the sample size and is spread across both books.
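The allocation can be checked with a few lines of Python. This is only a sketch to confirm the arithmetic; the function name and its parameters are my own and are not part of the sampling method itself.

```python
def allocation_total(num_chapters, base, adjusted_chapters, adjusted):
    """Total sample size for a chapter-by-chapter allocation.

    Most chapters give `base` samples; `adjusted_chapters` of them give
    `adjusted` samples instead. (Hypothetical helper for checking totals.)
    """
    if not 0 <= adjusted_chapters <= num_chapters:
        raise ValueError("adjusted_chapters must be between 0 and num_chapters")
    regular_chapters = num_chapters - adjusted_chapters
    return regular_chapters * base + adjusted_chapters * adjusted


# Nicholas Nickleby as described above: 64 chapters, one sample each,
# except the 11 longest chapters, which give two samples each.
print(allocation_total(64, base=1, adjusted_chapters=11, adjusted=2))  # 75
```

The same function can check any other allocation, such as two samples per chapter with a single sample from the shortest chapter, by passing base=2, adjusted_chapters=1 and adjusted=1.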

So that my data isn’t biased I will use my Sharp calculator to generate random numbers, which will determine the page numbers. I will use the RAND function (pressing ‘2nd’ then ‘7’), multiply the result by the number of pages in the chapter and add the first page number of the chapter. This gives me a page in that chapter at random.
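The same calculation could be written in Python, roughly as below. This is a sketch under two assumptions that are not stated above: Python's random.random() stands in for the calculator's RAND key (a decimal from 0 up to, but not including, 1), and the result is rounded down to a whole page number. The function name and the example chapter are made up.

```python
import random


def random_page(first_page, pages_in_chapter, rng=random):
    """Pick a random page number within one chapter."""
    r = rng.random()  # stands in for RAND: 0 <= r < 1
    # Multiply by the number of pages, round down (assumption),
    # then add the first page number of the chapter.
    return first_page + int(r * pages_in_chapter)


# Made-up chapter: starts on page 120 and is 18 pages long,
# so the result is always between 120 and 137 inclusive.
print(random_page(120, 18))
```

Rounding down means every page in the chapter, including the first and the last, has the same chance of being picked.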

Words

I could choose any word on the page, but I’ve decided to choose the 25th word on each page. This means the word is in the actual flow of the writing, not right at the beginning or end of the page. If I used the 1st word it would be more likely to be the start of a sentence.

Sentences

I am going to choose the 3rd sentence on the page every time during the sampling. Choosing the 3rd sentence means that it isn’t at the start or near the end; it is in the flow of the writing. It will also save me a lot of time: I could use the calculator again to randomly select a sentence, but this would be too time-consuming.

In my data collection I may come across hyphenated words, direct speech, words with an apostrophe, or names of places and people. I have decided that if I come across any of these I will include them in my data, as they are part of the book and show the author’s writing style.

If I come across a hyphenated word I will class it as one word and disregard the hyphen.

E.g. First-class would be counted as one word with 10 letters

If I come across a word with an apostrophe I will class the apostrophe as a letter.

E.g. shouldn’t would be counted as 9 letters
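These two counting rules can be written as a short Python function. This is only a sketch of the rules as stated above; the function name is my own, and treating any other punctuation attached to a word (such as a comma) as not counting is an assumption.

```python
def word_length(word):
    """Count the letters in a word using the rules above.

    Hyphens are disregarded, apostrophes (straight or curly) count as
    a letter, and any other punctuation is ignored (assumption).
    """
    return sum(1 for ch in word if ch.isalpha() or ch in "'’")


print(word_length("First-class"))  # 10
print(word_length("shouldn’t"))    # 9
```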

If I come across a page with a picture and no text, or a chapter title page (one that only gives the name of the chapter), I will select a new page using the random numbers on the calculator. I don’t think that a chapter title page is relevant to what I’m doing, so its text would not be useful.

If a page is selected that has no sentences on it, or the same page is selected twice, I will simply use the calculator to find a new page in that chapter that hasn’t already been chosen and that has a sentence on it.
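The reselection rule amounts to drawing again until a usable page turns up. A minimal Python sketch is given below. It assumes random.random() in place of the calculator's RAND key and rounding down to a whole page; has_sentence is a hypothetical check that stands in for looking at the printed page, and the example chapter is made up.

```python
import random


def draw_usable_page(first_page, pages_in_chapter, used_pages, has_sentence,
                     rng=random, max_tries=1000):
    """Keep drawing random pages in a chapter until one is usable."""
    for _ in range(max_tries):
        # Decimal in [0, 1) times the number of pages, rounded down
        # (assumption), plus the first page number of the chapter.
        page = first_page + int(rng.random() * pages_in_chapter)
        if page not in used_pages and has_sentence(page):
            used_pages.add(page)  # remember it so it is not chosen twice
            return page
    raise RuntimeError("no usable page found in this chapter")


# Made-up chapter covering pages 10 to 14: page 10 is a chapter title
# page with no sentences, and page 12 has already been sampled.
used = {12}
page = draw_usable_page(10, 5, used, has_sentence=lambda p: p != 10)
print(page)  # one of 11, 13 or 14
```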

If a sentence on one page continues onto the next page I will simply continue counting onto the next page until I reach a full stop, as I’m looking at full sentences, not parts of sentences.

I will be using the three main averages: mean, mode and median.
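As a quick illustration, the three averages can be worked out in Python with the standard statistics module. The word lengths below are made up purely to show the calculation and are not data from either book; the function name is my own.

```python
from statistics import mean, median, multimode


def three_averages(values):
    """Return the mean, median and mode(s) of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        # multimode returns a list because there can be more than one mode
        "mode": multimode(values),
    }


# Made-up word lengths, for illustration only.
example_lengths = [3, 5, 2, 7, 4, 3, 6, 3, 9, 4]
print(three_averages(example_lengths))
# {'mean': 4.6, 'median': 4.0, 'mode': [3]}
```

Here the mean is pulled upwards by the few long words, while the mode shows the most common length, which is why it is useful to look at all three.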

...