X – 1.96 (s.e) < μ < X + 1.96 (s.e)
This formula gives the 95% confidence interval. The 1.96 comes from the normal distribution tables, where I looked up 0.9750 (97.5%). This may seem strange, but the reason I looked for this and not 95% is the symmetry of the normal curve. A cumulative value of 97.5% leaves 2.5% in the upper tail and, by symmetry, another 2.5% in the lower tail. These add up to 5%, so 95% is left in the middle. This is shown more clearly in the diagram on the next page.
The standard error (s.e) is the standard deviation of the distribution of the sample means. It is found using the formula:
s.e. = σ / √n
where σ is the population standard deviation and n is the sample size.
I was required to make a number of assumptions before I started to collect any data. The main one was to do with punctuation: I would only be counting letters. This meant that things like full stops, apostrophes and even numbers left in numerical form were excluded. Abbreviations such as ‘Dr’ were counted as they were seen, i.e. ‘Dr’ counts as 2 letters and not 6. I also started on the first full word of each randomly generated line.
Analysis
The raw data that was used is in my appendix. The appendix also includes all of the pages that we took our samples from and all of the frequency charts that were used at the time to record our data.
I am now going to organise my raw data in the tables below:
The child’s book (word length)
The child’s book (sentence length)
The adult’s book (word length)
The adult’s book (sentence length)
There was also one sentence of length 39, which I will show separately below, as it would be a waste of space to extend my table up to 39:
I have used an adequate amount of data. For each population I have sampled 200 word lengths and 40 sentence lengths. As I have said earlier a good sample size to use is n ≥ 30. I have exceeded this by far for my word length and I am also fine on my sentence length.
I am now going to start my calculations. As I briefly stated before, I am going to find the confidence intervals for the two populations. I will first try 95% for each. I will then be able to see whether I can be 95% confident that the population mean of the adult’s book is larger than the population mean of the child’s book (or vice versa). If the intervals of the two populations overlap, I will not be able to say that I am 95% confident the population mean of one is higher than the population mean of the other.
For example, if I were 95% confident that the population mean of the child’s book was in the interval below (left), and 95% confident that the population mean of the adult’s book was in the interval below (right), I could not be 95% confident that the adult’s population mean is higher than the child’s:
Children’s book (3.4 , 4.5) Adults Book (4.1 , 5.5)
Although the population mean for the adult’s book looks higher, there is an overlap. Each population mean could be anywhere in its interval. For example, the population mean for the child’s book could be 4.4 and the population mean for the adult’s book could be 4.2, in which case the child’s would be the larger. If an overlap were to occur, I would move to one of the other confidence levels. In this case I would find the interval for the population mean at 90% confidence. The intervals might now look like:
Children’s book (3.7 , 4.2) Adults Book (4.4 , 5.2)
The intervals do not overlap so I can be 90% sure that the population mean for the adults book is higher than the population mean for the child’s book.
I may also find that the 95% confidence intervals do not overlap. If this were to happen I would try 99%. If the intervals still did not overlap, I could be 99% sure that the population mean for the adult’s book was higher than the population mean for the child’s book.
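The overlap test described above can be sketched in a few lines of Python. This is only an illustration: the function name `intervals_overlap` is an assumption, and the two pairs of intervals are the made-up examples from the text, not real results.

```python
def intervals_overlap(a, b):
    """Return True if intervals a = (low, high) and b = (low, high) share any value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Illustrative intervals from the text (child's book first, adult's book second).
example_95 = intervals_overlap((3.4, 4.5), (4.1, 5.5))  # the overlapping pair
example_90 = intervals_overlap((3.7, 4.2), (4.4, 5.2))  # the separated pair

print("First pair overlaps:", example_95)
print("Second pair overlaps:", example_90)
```

Two intervals overlap exactly when each one starts before the other one ends, which is why the check needs only two comparisons.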
To be 68% confident of μ, x ± 1 s.e.
To be 90% confident of μ, x ± 1.645 s.e.
To be 95% confident of μ, x ± 1.96 s.e.
To be 99% confident of μ, x ± 2.58 s.e.
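The multipliers in this list can be checked without tables. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_multiplier` is an assumption made for illustration. It repeats the table look-up described earlier: for a 95% interval it finds the value with 97.5% of the curve below it.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Two-sided multiplier for a confidence level given as a fraction, e.g. 0.95."""
    tail_area = (1 - confidence) / 2  # area left in each tail, e.g. 0.025 for 95%
    # Same as searching the body of the normal table for 1 - tail_area (0.975 for 95%).
    return NormalDist().inv_cdf(1 - tail_area)


for level in (0.68, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: {z_multiplier(level):.3f}")
```

The 68% multiplier comes out at about 0.994, which the list above rounds to 1, and the 99% multiplier at about 2.576, which tables commonly give as 2.58.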
Calculations
Comparing the word length of an adult’s and a child’s book with a 95% confidence interval.
Child’s Book
n = 200
x = 4.33
s = 1.890
σ² = s² × n / (n − 1)
σ = 1.890 × √(200 / 199)
= 1.894
s.e = σ / √n
= 1.894 / √200
= 0.1340
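As a cross-check on the working above, here is a minimal Python sketch of the standard error calculation. It assumes that 1.890 is the sample standard deviation and that the correction intended in the working is the usual n / (n − 1) scaling of the sample variance. The function name `standard_error` is hypothetical.

```python
import math


def standard_error(sample_sd, n):
    """Estimate the standard error of the mean from a sample standard deviation."""
    # Assumed correction: scale the sample variance by n / (n - 1)
    # to estimate the population variance.
    sigma_estimate = math.sqrt(sample_sd ** 2 * n / (n - 1))
    # Standard error = estimated population standard deviation / sqrt(n).
    return sigma_estimate / math.sqrt(n)


se_child_words = standard_error(1.890, 200)
print(f"Standard error for the child's word length: {se_child_words:.4f}")
```

Under that assumption the result agrees with the 0.1340 found above to four decimal places.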
The 95% confidence interval for the child’s book is:
X – 1.96 (s.e) < μ < X + 1.96 (s.e)
4.33 – 1.96 (0.1340) < μ < 4.33 + 1.96 (0.1340)
4.067 < μ < 4.593
I am 95% confident that the population mean for the word length of the child’s book is in the interval:
(4.07 , 4.59)
Adults Book
n = 200
x = 4.45
s = 2.366
σ² = s² × n / (n − 1)
σ = 2.366 × √(200 / 199)
= 2.372
s.e = σ / √n
= 2.372 / √200
= 0.1677
The 95% confidence interval for the adult’s book is:
X – 1.96 (s.e) < μ < X + 1.96 (s.e)
4.45 – 1.96 (0.1677) < μ < 4.45 + 1.96 (0.1677)
4.121 < μ < 4.779
I am 95% confident that the population mean for the word length of the adult’s book is in the interval:
(4.12 , 4.78)
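The two 95% intervals for word length can be reproduced with a short Python sketch. The means and standard errors are the values calculated above; the function name `confidence_interval` is an assumption made for illustration.

```python
def confidence_interval(mean, se, z):
    """Return (lower, upper) for mean - z * se < mu < mean + z * se."""
    return (mean - z * se, mean + z * se)


child_95 = confidence_interval(4.33, 0.1340, 1.96)
adult_95 = confidence_interval(4.45, 0.1677, 1.96)

# The intervals overlap if each one starts before the other one ends.
overlap_95 = child_95[0] <= adult_95[1] and adult_95[0] <= child_95[1]

print("Child's book:", tuple(round(v, 3) for v in child_95))
print("Adult's book:", tuple(round(v, 3) for v in adult_95))
print("Overlap:", overlap_95)
```

The adult's lower limit falls below the child's upper limit, which is the overlap that prevents a conclusion at 95%.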
There is an overlap so I am going to try the 90% confidence intervals for each population. To save time I am now going to use my graphical calculator.
Child’s Book
The 90% confidence interval for the child’s book is:
X – 1.645 (s.e) < μ < X + 1.645 (s.e)
4.33 – 1.645 (0.1340) < μ < 4.33 + 1.645 (0.1340)
4.110 < μ < 4.550
I am 90% confident that the population mean for the word length of the child’s book is in the interval:
(4.11 , 4.55)
Adults Book
The 90% confidence interval for the adult’s book is:
X – 1.645 (s.e) < μ < X + 1.645 (s.e)
4.45 – 1.645 (0.1677) < μ < 4.45 + 1.645 (0.1677)
4.174 < μ < 4.726
I am 90% confident that the population mean for the word length of the adult’s book is in the interval:
(4.17 , 4.73)
There is still an overlap so I am going to try the 68% confidence interval.
Child’s Book
The 68% confidence interval for the child’s book is:
X – 1 (s.e) < μ < X + 1 (s.e)
4.33 – 1 (0.1340) < μ < 4.33 + 1 (0.1340)
4.196 < μ < 4.464
I am 68% confident that the population mean for the word length of the child’s book is in the interval:
(4.20 , 4.46)
Adults Book
The 68% confidence interval for the adult’s book is:
X – 1 (s.e) < μ < X + 1 (s.e)
4.45 – 1 (0.1677) < μ < 4.45 + 1 (0.1677)
4.282 < μ < 4.618
I am 68% confident that the population mean for the word length of the adult’s book is in the interval:
(4.28 , 4.62)
The 68% intervals of both populations for word length overlap. I have decided that I cannot judge which book has a higher level of difficulty from these results.
I’m now going to try sentence length to see if any new conclusions can be made.
Comparing the sentence length of an adult’s and a child’s book with a 95% confidence interval.
Child’s Book
n = 40
x = 12.25
s = 6.278
σ² = s² × n / (n − 1)
σ = 6.278 × √(40 / 39)
= 6.348
s.e = σ / √n
= 6.348 / √40
= 1.004
The 95% confidence interval for the child’s book is:
X – 1.96 (s.e) < μ < X + 1.96 (s.e)
12.25 – 1.96 (1.004) < μ < 12.25 + 1.96 (1.004)
10.28 < μ < 14.22
I am 95% confident that the population mean for the sentence length of the child’s book is in the interval:
(10.28 , 14.22)
Adults Book
n = 40
x = 10.85
s = 8.356
σ² = s² × n / (n − 1)
σ = 8.356 × √(40 / 39)
= 8.463
s.e = σ / √n
= 8.463 / √40
= 1.338
The 95% confidence interval for the adult’s book is:
X – 1.96 (s.e) < μ < X + 1.96 (s.e)
10.85 – 1.96 (1.338) < μ < 10.85 + 1.96 (1.338)
8.228 < μ < 13.47
I am 95% confident that the population mean for the sentence length of the adult’s book is in the interval:
(8.23 , 13.47)
It now appears that my results are not going to agree with my prediction. The sentence length for the child’s book appears to be larger than the sentence length for the adult’s book. I am still going to try the 90% confidence interval.
Child’s book
The 90% confidence interval for the child’s book is:
X – 1.645 (s.e) < μ < X + 1.645 (s.e)
12.25 – 1.645 (1.004) < μ < 12.25 + 1.645 (1.004)
10.60 < μ < 13.90
I am 90% confident that the population mean for the sentence length of the child’s book is in the interval:
(10.60 , 13.90)
Adults book
The 90% confidence interval for the adult’s book is:
X – 1.645 (s.e) < μ < X + 1.645 (s.e)
10.85 – 1.645 (1.338) < μ < 10.85 + 1.645 (1.338)
8.649 < μ < 13.05
I am 90% confident that the population mean for the sentence length of the adult’s book is in the interval:
(8.65 , 13.05)
The two intervals still overlap. I am finally going to try 68%.
Child’s book
The 68% confidence interval for the child’s book is:
X – 1 (s.e) < μ < X + 1 (s.e)
12.25 – 1 (1.004) < μ < 12.25 + 1 (1.004)
11.25 < μ < 13.25
I am 68% confident that the population mean for the sentence length of the child’s book is in the interval:
(11.25 , 13.25)
Adults book
The 68% confidence interval for the adult’s book is:
X – 1 (s.e) < μ < X + 1 (s.e)
10.85 – 1 (1.338) < μ < 10.85 + 1 (1.338)
9.512 < μ < 12.19
I am 68% confident that the population mean for the sentence length of the adult’s book is in the interval:
(9.51 , 12.19)
The results are inconclusive because there is an overlap. There is no point in trying any lower confidence intervals.
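The procedure of stepping down through 95%, 90% and 68% can be summarised in one short Python sketch. It uses the means, standard errors and multipliers from the calculations above. The function name `first_separating_level` is an assumption, and the final call uses invented numbers purely to show what a conclusive result would look like.

```python
# (confidence level in %, multiplier) pairs, in the order they were tried in the text.
LEVELS = [(95, 1.96), (90, 1.645), (68, 1.0)]


def first_separating_level(mean_a, se_a, mean_b, se_b):
    """Return the highest level at which the two intervals do not overlap, else None."""
    for level, z in LEVELS:
        low_a, high_a = mean_a - z * se_a, mean_a + z * se_a
        low_b, high_b = mean_b - z * se_b, mean_b + z * se_b
        if high_a < low_b or high_b < low_a:
            return level
    return None


word_result = first_separating_level(4.33, 0.1340, 4.45, 0.1677)
sentence_result = first_separating_level(12.25, 1.004, 10.85, 1.338)
made_up_result = first_separating_level(4.0, 0.1, 5.0, 0.1)  # invented, well separated

print("Word length:", word_result)
print("Sentence length:", sentence_result)
print("Invented example:", made_up_result)
```

Both real comparisons return `None`, matching the inconclusive results above, while the invented pair separates at the first level tried.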
Interpretation and Validation
I did not get the results that I wanted. All of the confidence intervals that I tried overlapped. This meant that I could not say the mean for one population is larger than the mean of the other. I expected the adult’s book to have longer words and sentences, and I was disappointed that my results showed the two books to have similar word and sentence lengths. The fact that the intervals still overlapped at 68% confidence showed me that, on this evidence, word and sentence length could not be used as measures of the difficulty of a book.
There are many reasons why I may not have got the results that I was looking for. The sample that I used was possibly too small. If I had used larger samples I may have found different results. I could also have improved my assumptions. If I had excluded certain words such as ‘I’, ‘to’ and ‘a’ I may have had a more realistic set of results.
From my results I have concluded that the length of words and sentences does not determine the difficulty level of a book. One thing that I did notice was a larger range for all of the adult’s data. For word length the adult’s book had a lowest value of 1 and a highest of 17. For the child’s book the lowest value was 1 and the highest 11. For sentence length the adult’s book had a range of 1 to 39 while the range of the child’s book was only 3 to 32. From this I can see that the adult’s book does contain some longer words and sentences than the child’s book. It is just that the pages I selected also contained many short words and sentences.
In the end I tried 95%, 90% and 68% confidence intervals for every set of results. All of these proved inconclusive. The strange thing was that for word length it appeared more likely that the population mean for the adult’s book was higher, but for sentence length it was the other way around. I believe that if I had used a bigger sample the results could have been very different. Firstly, it would have given a wider range of words and sentences, which would reduce the chance of the results depending on an easy or hard page. It would also have made the standard error smaller. This would then have made the intervals a lot narrower. The reason for this is the formula used to calculate the standard error:
s.e. = σ / √n
As you can see, the standard error equals the population standard deviation divided by the square root of n. If the sample size were larger, n would be larger, which would make the standard error smaller.
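The effect of sample size on the standard error can be illustrated with a short Python sketch. It uses the estimated standard deviation of 1.894 from the child's word length calculation and assumes, hypothetically, that this estimate would stay the same for larger samples. The sample sizes 800 and 3200 are invented for illustration, and the function name `standard_error_for` is an assumption.

```python
import math


def standard_error_for(sigma, n):
    """Standard error of the mean: sigma divided by the square root of n."""
    return sigma / math.sqrt(n)


sigma_child_words = 1.894  # estimate from the child's word length calculation

for n in (200, 800, 3200):  # 800 and 3200 are hypothetical sample sizes
    se = standard_error_for(sigma_child_words, n)
    width_95 = 2 * 1.96 * se  # full width of the 95% confidence interval
    print(f"n = {n}: s.e. = {se:.4f}, 95% interval width = {width_95:.3f}")
```

Because n sits under a square root, the sample size has to be multiplied by four to halve the standard error, so narrowing the intervals a lot would need a much larger sample.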
In my aim I set out to show that the adult’s book was more difficult than the child’s book. I have not done this because all of my intervals overlap. I could have kept on lowering the confidence level until there was no overlap, but I didn’t see the point. If the intervals had not overlapped at 95%, there would have been a clear difference between the two population means. If the intervals are still overlapping at 68%, then the results are very similar. I stopped at this point because I believed that I could not prove that the adult’s book was more difficult.
My sampling method is a possible reason why I did not get the results that I wanted. Firstly, I could have used more pages to sample. I could have selected a random page, then a random line, and finally a word on that line. I then could have repeated this 100 times for each book. For the sentence length I could have selected a page at random followed by a sentence from that page. This could have been done 40 times for each book. This would have reduced the chance of the sample being dominated by an easy or hard page in the book.
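The page, line and word selection described above could be sketched in Python as follows. Everything here is hypothetical: the function name `sample_word_lengths`, the nested-list layout of the book and the tiny `toy_book` are assumptions made only to show the idea, and letters are counted in the same way as in my assumptions (punctuation ignored).

```python
import random


def sample_word_lengths(book, sample_size, seed=None):
    """book is a list of pages; each page is a list of lines; each line is a list of words."""
    rng = random.Random(seed)
    lengths = []
    for _ in range(sample_size):
        page = rng.choice(book)   # random page
        line = rng.choice(page)   # random line on that page
        word = rng.choice(line)   # random word on that line
        # Count letters only, so "Dr." counts as 2 and apostrophes are ignored.
        lengths.append(sum(ch.isalpha() for ch in word))
    return lengths


# Hypothetical two-page "book", only to make the sketch runnable.
toy_book = [
    [["Dr.", "Smith", "opened", "the", "door"], ["It", "was", "raining", "again"]],
    [["She", "couldn't", "see", "anything"], ["Nobody", "answered"]],
]

lengths = sample_word_lengths(toy_book, sample_size=100, seed=1)
print(len(lengths), "word lengths sampled; mean =", sum(lengths) / len(lengths))
```

One limitation of this scheme is that words on short lines or short pages are slightly more likely to be chosen than words elsewhere, so it is not a perfectly equal-chance sample of words.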
If I were to do this coursework again I could change the populations. I could possibly choose a book which has an adult’s version and a child’s version. I feel this would be fairer because it is possible that I chose an advanced child’s book or an easy adult’s book. Instead of collecting two sets of data for each population, I could have chosen two adult’s and two child’s books. I then could have taken 100 words from each book and combined the two child’s samples and the two adult’s samples. Another possible way would be to choose two fact books. It would be interesting to see if this made any difference to the final results.