If a selected page has no sentences on it, or the same page is selected twice, I will simply use the calculator to find a new page in that chapter that hasn’t already been chosen and that has a sentence on it.
If a sentence on one page continues onto the next page, I will simply continue counting onto the next page until I reach a full stop, as I’m looking at full sentences, not parts of sentences.
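The redraw rule above can be sketched in code. This is only an illustration: the function name `pick_pages`, the `has_sentence` check and the page range are made-up stand-ins for the calculator's RAND method and for looking at the page by hand.

```python
import random

def pick_pages(first_page, last_page, how_many, has_sentence, seed=None):
    """Pick distinct random pages, drawing again for blanks and repeats."""
    rng = random.Random(seed)
    # Guard against an endless loop if too few pages contain a sentence.
    usable = [p for p in range(first_page, last_page + 1) if has_sentence(p)]
    if how_many > len(usable):
        raise ValueError("not enough pages with sentences to choose from")
    chosen = []
    while len(chosen) < how_many:
        page = rng.randint(first_page, last_page)
        if page in chosen or not has_sentence(page):
            continue  # repeated page or page with no sentence: draw again
        chosen.append(page)
    return chosen

# Hypothetical chapter running from page 1 to 20 where only even pages have text.
pages = pick_pages(1, 20, 5, lambda page: page % 2 == 0, seed=1)
print(pages)
```

Drawing again, rather than moving to the next page along, keeps every usable page equally likely to be picked.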
I will be using the three main averages: mean, mode and median.
The mean is worked out by adding all the values together and then dividing the total by how many values there are. The mean can be very useful and it uses all of the data, but can be distorted by anomalies.
The mode is the value that occurs most often. This is easily found but it isn’t that useful. Anomalies won’t distort the mode but there isn’t always a mode in the data.
The median is the middle value of a list of values after they have been put into order, from smallest to largest. It can be affected by anomalies but only slightly, and isn’t always a value from the data. Also the median isn’t as useful as the mean but it can be used.
I will be able to compare the mean, mode and median from each book to see if there are any differences or similarities. I will be mainly concentrating on the mean as it is more useful to my work.
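As a small illustration of how the three averages behave, here is a sketch using Python's standard `statistics` module. The word lengths are made up for the example and are not taken from either book.

```python
from statistics import mean, median, multimode

def averages(values):
    """Return (mean, median, list of modes) for a list of numbers."""
    return mean(values), median(values), multimode(values)

word_lengths = [3, 4, 4, 5, 6, 7, 4]   # made-up sample
with_anomaly = word_lengths + [20]      # the same sample plus one anomaly

print(averages(word_lengths))   # median 4, mode [4]
print(averages(with_anomaly))   # mean 6.625, median 4.5, mode [4]
```

The anomaly pulls the mean up a lot, moves the median only slightly (to 4.5, which is not a value in the data) and leaves the mode alone.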
I’m going to use standard deviation to look at the spread of my data. I’m going to use the inter-quartile range as well; this will help with my comparison as I will be able to see how varied each book is.
Word Length
This is the table I’m going to use to record my data. It will allow me to see all my data clearly and let me see how spread out my data is. I have decided to add an extra column for syllables per word, so that it will save time and I will be able to see the data in one chart, instead of two. Also this extra column will let me use one sample of data instead of two; this will save time but won’t affect my data.
I will be using the RAND method on my calculator to choose page numbers; this avoids bias and is very quick to do with a large quantity. Also this gives a fair and even spread over the whole book, so the data will be nearer to the total population of words. I will use all the averages but the mean will probably be most significant.
I will use the Readability Statistics on Microsoft Word, as it is possible for it to work out the averages for me. It will find out the reading age and I will be able to see if there is any correlation between the reading age and sentence length. I will then be able to put the results of both books on the same scatter graph and compare them alongside each other.
I will use all three averages for word length; I have chosen these because the use of the Median in a Box Plot with the two books alongside each other will help me to see if there are differences or similarities. It will also help me find the upper and lower quartiles so then I will be able to find the Inter-quartile range.
I will find the Standard deviation too as I will be able to see the spread of the data. I will need to draw a Cumulative Frequency graph and also draw a Box Plot. I will also draw a Histogram for the data with unequal groups, which will permit me to compare the distribution easily.
Syllables
I will use the word length table as it already covers amount of syllables. This will help me as I will be comparing them and it is good to have them next to each other.
I am going to use only the mean and standard deviation, because they are the only ones I need, as I will only be using syllables as a variable.
Sentence Length
This is the table I’m going to use to record my data. It will allow me to see all my data clearly and let me see how spread out my data is. I will be using the same method as I used for word length; this avoids bias and is very quick to do with a large quantity. I will use the Readability Statistics in the same way as I did for word length, but compare reading age with sentence length, and do a scatter graph for sentence length as well.
I will use all three averages for sentence length; I have chosen these because the use of the Median in a Box Plot with the two books alongside each other will help me to see if there are differences or similarities. It will also help me find the upper and lower quartiles so then I will be able to find the Inter-quartile range.
I will find the Standard deviation too as I will be able to see the spread of the data. I will need to draw a Cumulative Frequency graph and also draw a Box Plot. I will also draw a Histogram for the data with unequal groups, which will allow me to compare the distribution easily.
Analysis (calculations, tables and graphs)
Word Length
These are the results for the word length in both books. This includes the mean, median, mode and range values. I will be able to find the LQ (lower quartile), UQ (upper quartile) and IQR (inter-quartile range).
I will be able to work out the standard deviation, using the standard deviation formula which is:
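As a sketch of the calculation, assuming the frequency-table form of the standard deviation, sd = √(Σfx² ÷ Σf − mean²), which may not be the exact form used in the original working. The frequencies below are made up.

```python
from math import sqrt

def frequency_sd(table):
    """table maps each value x to its frequency f; returns (mean, sd)."""
    n = sum(table.values())                                    # Σf
    mean = sum(x * f for x, f in table.items()) / n            # Σfx ÷ Σf
    mean_of_squares = sum(x * x * f for x, f in table.items()) / n  # Σfx² ÷ Σf
    return mean, sqrt(mean_of_squares - mean ** 2)

# Made-up table: one 2-letter word, three 4-letter words, one 6-letter word.
example_mean, example_sd = frequency_sd({2: 1, 4: 3, 6: 1})
print(example_mean, round(example_sd, 3))
```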
Values
LQ = 19th
UQ = 57th
Median = 38th
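The positions above are consistent with the (n + 1) rule for a sample of 75 values. The sample size is not stated in this section, so n = 75 is an assumption inferred from the positions themselves. A minimal sketch:

```python
def quartile_positions(n):
    """Positions of the LQ, median and UQ in an ordered list of n values."""
    return {
        "LQ": (n + 1) / 4,
        "Median": (n + 1) / 2,
        "UQ": 3 * (n + 1) / 4,
    }

positions = quartile_positions(75)   # n = 75 is an assumed sample size
print(positions)   # {'LQ': 19.0, 'Median': 38.0, 'UQ': 57.0}
```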
Nicholas Nickleby
Order of the Phoenix
The frequency polygon to show the amount of distribution and spread of the two books. (Graph 1)
Syllables
These are the results for the number of syllables in both books. This includes the mean and standard deviation, as they are the only measures that can be used for a mathematical purpose here. I will not draw any graphs as this data will only be used in a scatter graph later to see if there is any correlation between the word lengths and number of syllables, and then to compare the two books.
Nicholas Nickleby
Order of the Phoenix
Sentence Length
These are the results for the sentence length in both books. I have unequal sized groups of data, so I will draw a Histogram. I will also draw a box plot so I will be able to see the Upper and Lower Quartiles and therefore the Inter-quartile range. The table I will use will be different to the other ones as it will have a mid-point value so I can work out an estimate of the mean. As I am going to draw a cumulative frequency graph, the table will also have a column for cumulative frequency.
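The mid-point and cumulative frequency columns can be sketched like this. The groups and frequencies are made up for illustration and are not the data from either book.

```python
from itertools import accumulate

# Each entry is ((lower bound, upper bound), frequency); made-up values.
groups = [((0, 10), 4), ((10, 20), 3), ((20, 40), 2), ((40, 80), 1)]

def estimated_mean(groups):
    """Estimate the mean using the mid-point of each group."""
    total = sum(f for _, f in groups)
    return sum((low + high) / 2 * f for (low, high), f in groups) / total

def cumulative_frequency(groups):
    """Running total of the frequencies, one entry per group."""
    return list(accumulate(f for _, f in groups))

print(estimated_mean(groups))        # 18.5
print(cumulative_frequency(groups))  # [4, 7, 9, 10]
```

The result is only an estimate because every value in a group is treated as if it sat at the mid-point.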
Nicholas Nickleby
Order of the Phoenix
The cumulative frequency graphs will allow me to find the quartiles and the inter-quartile range. (Graph 2)
From the Cumulative frequency graph I have found for Nicholas Nickleby the Median was 16, the LQ was 5.5, and the UQ was 34 so the IQR was 28.5. For Order of the Phoenix the Median was 10, the LQ was also 5.5, and the UQ was 21 so the IQR was 15.5. I will now put these results into a Box plot. (Graph 3)
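The inter-quartile ranges can be checked with a short sketch, using the quartile values read from the graph above. The variable names are just for illustration.

```python
# (LQ, median, UQ) as read from the cumulative frequency graph.
quartiles = {
    "Nicholas Nickleby": (5.5, 16, 34),
    "Order of the Phoenix": (5.5, 10, 21),
}

def iqr(lq, uq):
    """Inter-quartile range: upper quartile minus lower quartile."""
    return uq - lq

iqrs = {book: iqr(lq, uq) for book, (lq, _median, uq) in quartiles.items()}
print(iqrs)   # {'Nicholas Nickleby': 28.5, 'Order of the Phoenix': 15.5}
```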
This is my table to find out the heights of the bars I am going to use for the Histogram.
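With unequal groups the height of each bar is the frequency density, which is the frequency divided by the group width. A minimal sketch with made-up groups (not the data from either book):

```python
# Each entry is ((lower bound, upper bound), frequency); made-up values.
groups = [((0, 10), 4), ((10, 20), 3), ((20, 40), 2), ((40, 80), 1)]

def bar_heights(groups):
    """Frequency density for each group: frequency ÷ group width."""
    return [f / (high - low) for (low, high), f in groups]

heights = bar_heights(groups)
print(heights)
```

Using the density means the area of each bar, not its height, represents the frequency, so wide groups do not look bigger than they are.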
Nicholas Nickleby (Graph 4)
Order of the Phoenix (Graph 5)
Interpretation
Word Length
In both books the Frequency Polygon shows a similar pattern in the way the data is spread, although Nicholas Nickleby’s Mode was 4 letters with a frequency of 17, compared with Order of the Phoenix, which had a Mode of 6 letters per word with a frequency of 13.
The Frequency Polygon showed that Nicholas Nickleby rose very quickly from 1 to 4 letters per word, reaching its highest point with a frequency of 17. It then went down to 11 for 5 letters and stayed there for 6 letters also. Between 6 and 8 letters it decreased by 6, but when it got to 9 letters it rose by 1. The frequency then went down, rose by 1 at the 11 letters mark, and then went down by 2.
The Frequency Polygon for Order of the Phoenix is very similar but had a rise which wasn’t as steep and was more moderate. The peak was at 6 letters with a frequency of 13. It then reversed and went back down until it reached 10 letters, which had a frequency of 2. Then, similar to Nicholas Nickleby, it rose by 1 at the 11 letters mark, but went down by 3 by the time it got to 13.
The Quartiles show a similarity between the 2 books. Nicholas Nickleby and Order of the Phoenix have very similar LQ, UQ and IQR values, as they are only 1 away from each other. This has shown me, even more so than the Frequency Polygon, that the word lengths in the 2 books are very alike.
In both books the Means were very close. The mean for Nicholas Nickleby was 6.13 and for Order of the Phoenix it was 6.31. This showed Order of the Phoenix had more letters per word, but only just. These were not the real Means of the complete books as each was only a sample, so they are not the definite averages.
The standard deviations for both books are quite similar too. The standard deviation for Nicholas Nickleby was 2.535, whereas Order of the Phoenix’s standard deviation was 2.366. I said in my hypothesis that Nicholas Nickleby would have more variation in its word length, and this was correct for this sample: its standard deviation was greater, which means more variation, so this sample supports my hypothesis.
I said in my Hypotheses that I thought both books would have a similar mean word length and that Nicholas Nickleby would have a slightly greater mean word length. I was correct in saying that they would be similar, but I was wrong about which was greater: it was Order of the Phoenix that had the greater mean. The results are so close together that if I took another sample the result could change. A sample is not the complete book, so the averages will change from one sample to another.
Sentence Length
In both books the Estimated Means for sentence lengths were totally different. Nicholas Nickleby’s was 30.55, which was a lot greater than Order of the Phoenix’s estimated mean of 19.17. This shows a vast difference, with Order of the Phoenix’s mean sentence length being only about 62.7% of Nicholas Nickleby’s.
The Box Plot shows a vast difference as well: the LQ is the same, but the Median and UQ for Nicholas Nickleby are much greater than for Order of the Phoenix. It shows that the IQR for Nicholas Nickleby is just less than double that of Order of the Phoenix. The data for Order of the Phoenix is bunched towards the LQ, with a longer spread above the median, so it is positively skewed. For Nicholas Nickleby the data is also positively skewed but not as much as for Order of the Phoenix.
The vast difference in IQR also shows in the standard deviation. It is 26.7777 for Nicholas Nickleby, a vast difference from Order of the Phoenix, which was 13.6001. It is almost double, and shows plainly that Nicholas Nickleby uses a lot more variety in sentence length than Order of the Phoenix.
I said in my Hypotheses that I believed Nicholas Nickleby’s mean sentence length would be much greater than Order of the Phoenix’s. My hypothesis was correct, as it is plainly much greater.
I also said I believed that the Standard deviation would be much greater for Nicholas Nickleby than for Order of the Phoenix. I was also right: there is much more variety in sentence length in Nicholas Nickleby, with the standard deviation being almost double that of Order of the Phoenix.
Syllables
In each book the means for syllables per word were, strangely, exactly the same, and so were the Standard deviations, which surprised me. This supports my hypothesis that they would be very similar; in fact they were the same.
Conclusion
The two books are very different in some ways, yet very similar in others. The word length and syllables per word are very alike in the two books I had chosen, but the sentence lengths are totally different from each other. I feel this is to do with the different eras in which they were written. Nicholas Nickleby uses a lot of very long sentences, whereas Order of the Phoenix has shorter sentences, as if the longer sentences used in Nicholas Nickleby had been broken down. I believe that the sentences in Order of the Phoenix may have been kept shorter because it is designed for a younger audience; it will sell better, as children will want to buy the goods after the book is released.
The most major results, with the biggest difference, were in the sentence length. Nicholas Nickleby’s variation, measured using the standard deviation, was totally different and almost double that of Order of the Phoenix.
Although these findings are good, they are only estimates for the whole book. The sample could be totally different to the real averages of the whole book. If I did another sample it may give totally different results; the only true averages are those of the whole book.
I could try to improve my investigation by using a different method or a larger sample size. I could perhaps use other books the authors had written to spot if there is any pattern in the way they write. Using another book by each author would also enable me to see if there is a more evident pattern in the way authors wrote their books in their different eras.
I established the confidence intervals for sentence length for both books by using the formula
mean ± 1.96 × (standard deviation ÷ √n)
where n is the sample size and the + or – depends on whether you’re finding the upper or lower bound. I worked this out because the Means come only from samples, and without doing a census I wouldn’t be able to find out the Mean of the population. Working it out gives me two numbers, and I can be 95% sure that the population mean lies between them. For Nicholas Nickleby the limits were 24.49 and 36.6. For Order of the Phoenix the limits were 16.63 and 22.78.
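The interval calculation can be sketched as follows, assuming the usual 95% form, mean ± 1.96 × sd ÷ √n. The sample size is an assumption: n = 75 is not stated here but is inferred from the quartile positions used earlier. The mean and standard deviation are the Nicholas Nickleby values quoted in the Interpretation section.

```python
from math import sqrt

def confidence_interval_95(mean, sd, n):
    """Lower and upper 95% limits for the population mean."""
    margin = 1.96 * sd / sqrt(n)
    return mean - margin, mean + margin

# Nicholas Nickleby sentence length: mean 30.55, sd 26.7777, assumed n = 75.
low, high = confidence_interval_95(30.55, 26.7777, 75)
print(round(low, 2), round(high, 2))   # 24.49 36.61
```

A larger sample would make the margin smaller, because the standard deviation is divided by √n.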
I will now draw up a simple graph to demonstrate the confidence intervals. (Graph 6)
When looking at the graph I can see that the confidence intervals do not overlap. I can be 95% confident that the true values of the mean sentence length for the two books are different, and as predicted Nicholas Nickleby has a larger mean sentence length. This is a very strong piece of evidence based on the distribution of the mean sentence length, and it shows even more that the Hypothesis saying that Nicholas Nickleby had a greater sentence length was correct.
I could also, and I am going to next, use the computer to work out the Readability Statistics, contrast reading ages, use Spearman’s Rank correlation coefficient between the reading age and mean sentence length, and also use Microsoft Excel. I will be drawing scatter graphs for syllables per word and word length. I will also draw scatter graphs for sentence length and number of words above 6 letters.
Other Investigations Using ICT
Readability Statistics
I will use Microsoft Word to calculate some averages using the Readability Statistics in its program. I will use the Flesch-Kincaid Grade Level (the reading age in U.S. school years; I will convert it to years old by adding 6 to get the actual age), the characters per word and the words per sentence to compare blocks of text from each book against each other, but also to compare them against the original sample.
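For reference, the Flesch-Kincaid Grade Level is normally worked out as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula and then the add-6 conversion described above. The input numbers are made up, and the function names are only for illustration.

```python
def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    """Flesch-Kincaid Grade Level (U.S. school year)."""
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

def reading_age(grade_level):
    """Convert a U.S. grade level to an age in years by adding 6."""
    return grade_level + 6

# Made-up block of text: 20 words per sentence, 1.5 syllables per word.
grade = flesch_kincaid_grade(20, 1.5)
print(round(grade, 2), round(reading_age(grade), 2))   # 9.91 15.91
```

Both longer sentences and words with more syllables push the grade level up, which is why the two measures are compared with reading age.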
I searched on the internet for the texts from both books; unfortunately I couldn’t find any available resources. This meant I had to find 5 blocks of data (at least 100 words) from each book. I used the RAND function on my calculator to select a page at random, and then I selected a block of text from that page.
I typed all the 5 blocks from Nicholas Nickleby into Microsoft Word. The Readability Statistics showed:
This would mean the reading age for Nicholas Nickleby would be:
I typed all the 5 blocks from Order of the Phoenix into Microsoft Word. The Readability Statistics showed:
This would mean the reading age for Order of the Phoenix would be:
These Readability Statistics show how different this sample is to the first sample I took from the two books.
In the second sample (blocks of text) the mean word length for Nicholas Nickleby was 4.4 and in the first sample (words on their own) the mean word length was 6.13. That is a big difference in word length. Also in Nicholas Nickleby the estimated Mean sentence length for the first sample was 30.55 and in the second sample it was 21.6, which also showed a big difference. I think this shows the first sample was completely wrong and was way higher than the actual amounts.
In the second sample the mean words per sentence in Order of the Phoenix was 15 and for the first sample the mean words per sentence was 19.17. This shows a difference of around 4; this is still quite a large difference but not as much of a difference as was found in Nicholas Nickleby. In the first sample the mean word length was 6.31 but in the second sample the mean word length was 4.1. This is a big difference and shows how the first sample differs a lot from the second sample.
The Readability Statistics themselves show that Nicholas Nickleby’s words per sentence, word length and Reading age are all higher than in Order of the Phoenix.
I think the Reading age is higher in Nicholas Nickleby because the word length and sentence length are higher. This shows the longer words used and the longer sentences used make it harder to read, therefore the reading age increases.
I will use Spearman’s Rank correlation coefficient between Reading Age and Mean sentence length. For this I am going to use 5 sets of data from Readability Statistics. So each of my blocks will be used separately.
I will now put them into rank, with the biggest ranked 1 and the smallest marked 5.
Nicholas Nickleby
Order of the Phoenix
I can see already, without doing the calculation of
p = 1 - 6Σd² ÷ (n(n² - 1))
(with n being the amount of data, p as Spearman’s rank correlation and d as the difference between the rankings of one item), that both books have a correlation of 1, so it’s a perfect positive correlation using Spearman’s Rank correlation coefficient.
This means my original hypothesis was completely correct in there being a very strong correlation, but this correlation is better than strong: it is perfect.
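The ranking and the Spearman’s Rank calculation can be sketched as below, assuming the standard formula 1 − 6Σd² ÷ (n(n² − 1)) and no tied values. The five pairs of numbers are made up and are not the blocks from either book.

```python
def ranks(values):
    """Rank the values with the biggest ranked 1 (assumes no ties)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation coefficient for paired data."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Made-up reading ages and mean sentence lengths for 5 blocks of text.
reading_ages = [12.1, 14.3, 10.5, 15.0, 13.2]
sentence_lengths = [18, 25, 12, 30, 21]
shuffled_lengths = [21, 18, 30, 12, 25]

print(spearman(reading_ages, sentence_lengths))            # 1.0
print(round(spearman(reading_ages, shuffled_lengths), 2))  # -0.9
```

When the two sets of ranks match exactly every d is 0, so the result is 1 without any further working, which is the situation described above.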
In this investigation I have found out many things about the word length, sentence length, syllables and words over 6 letters in a sentence. I found that word length and syllables were around the same for both books, but words over 6 letters in a sentence and sentence length in Nicholas Nickleby were much greater than in Order of the Phoenix. I think it would be interesting to see if there are similarities like this in any of the other books that have been written by the same authors, as that would give me much stronger evidence that the way they wrote was due to the times in which they were written.
Scatter Graphs
The first scatter graph I will do is sentence length and number of words above 6 letters in that sentence. These are the two variables for the graph.
I am going to use Microsoft Excel to produce the graph for me. I will also let Microsoft Excel put a trend line in for me and produce the R^2 value (the square of the correlation coefficient), and I will use my calculator to find R.
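The link between R^2 and R can be sketched as follows: R is the square root of R^2, taken as positive when the trend line slopes upwards. The function below is the standard product-moment correlation, and the data is made up rather than taken from either book.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation coefficient R for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: sentence length and number of words over 6 letters.
sentence_lengths = [5, 10, 15, 20, 30]
long_words = [1, 2, 2, 5, 6]

r = pearson_r(sentence_lengths, long_words)
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))
```

Because R^2 is always positive, the direction of the correlation has to be read from the slope of the trend line.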
The hypothesis I have used for this graph is that there will be a very strong positive correlation for both books. This means as 1 of the variables increases so does the other. I think this because the longer a sentence is, the more words it has, so there are usually more long words in it.
Nicholas Nickleby
This graph shows a positive correlation; it is quite a good positive correlation but I thought it would have been a little stronger. This graph shows that the longer the sentence, the more words over six letters it has.
Order of the Phoenix
This graph has a positive correlation. It’s a good positive correlation. It shows that a longer sentence has more words over six letters than a short one: the longer the sentence, the greater the number of words over six letters.
Both books are very similar and have strong positive correlations. Nicholas Nickleby’s line of best fit is steeper, but its points aren’t very close to the line, unlike in Order of the Phoenix.
Nicholas Nickleby’s correlation is almost perfect and Order of the Phoenix’s is also almost perfect but not as close.
My hypotheses were correct, as I said they would both have strong positive correlations. I think the reason I gave for believing this was correct as well.
Word Length for Nicholas Nickleby
Word Length for Order of the Phoenix
Sentence Length for Nicholas Nickleby
Sentence length for Order of the Phoenix