The sample mean is an unbiased estimator of the population mean. A confidence interval expresses the degree of confidence in an estimate more precisely than simply stating the standard error of the mean and the size of the sample. It does this by giving an interval estimate rather than a single value.
I will be calculating confidence intervals of 95% and 90% for my data.
With a sample, the formula for the variance (and hence the standard deviation) is the same as that for the parent population, except that $\sigma$ is replaced with $S$ and $\mu$ is replaced with $\bar{x}$:
$$S^2 = \frac{\sum (X - \bar{x})^2}{n}$$
$\bar{X}$ is an unbiased estimator because its expected value is equal to the mean of the parent population, $\mu$.
However, the same is not true of the sample variance and the population variance. The sample variance ($S_n^2$) is a biased estimator of the population variance, but a different rule can be used in its place to obtain an unbiased estimator:
$$S_n^2 \times \frac{n}{n-1}$$
which is also known as $S_{n-1}^2$.
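The bias can be seen exactly on a small example. The sketch below is illustrative only: the four-value population and the sample size of 2 are hypothetical, chosen so that every possible sample can be listed. It averages $S_n^2$ and $S_{n-1}^2$ over all equally likely samples drawn with replacement and compares each average with the population variance.

```python
from itertools import product
from statistics import mean

# Hypothetical tiny population, chosen only so every sample can be listed.
population = [2, 4, 6, 8]
n = 2  # sample size (also an assumption for illustration)

mu = mean(population)
sigma2 = mean((x - mu) ** 2 for x in population)  # population variance


def biased_var(sample):
    """S_n^2: sum of squared deviations from the sample mean, divided by n."""
    m = mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)


def unbiased_var(sample):
    """S_{n-1}^2: the same sum rescaled by n / (n - 1)."""
    k = len(sample)
    return biased_var(sample) * k / (k - 1)


# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_biased = mean(biased_var(s) for s in samples)
avg_unbiased = mean(unbiased_var(s) for s in samples)

print("population variance:", sigma2)
print("average of S_n^2:", avg_biased)
print("average of S_(n-1)^2:", avg_unbiased)
```

On this example the average of $S_n^2$ falls short of the population variance by the factor $(n-1)/n$, while the average of $S_{n-1}^2$ matches it.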
General Formulae
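$$\bar{x} = \frac{\sum fx}{n}$$
$$S = \sqrt{\frac{\sum fx^2}{n} - \bar{x}^2}$$
$$S^2 = \frac{\sum fx^2}{n} - \bar{x}^2$$
$$S_{n-1}^2 = \frac{n \times S_n^2}{n-1}$$
$$S_{n-1} = \sqrt{S_{n-1}^2}$$
$$\text{Unbiased s.e.} = \frac{\sqrt{S_{n-1}^2}}{\sqrt{n}}$$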
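The general formulae can be applied to a frequency table in a few lines. This is a minimal sketch rather than the method used for the coursework: the function name `summarise` and the small example table are hypothetical, and the table maps each sentence size x to its frequency f.

```python
from math import sqrt


def summarise(freq_table):
    """Apply the general formulae to a {value x: frequency f} table."""
    n = sum(freq_table.values())
    sum_fx = sum(x * f for x, f in freq_table.items())
    sum_fx2 = sum(x * x * f for x, f in freq_table.items())

    mean = sum_fx / n
    var_n = sum_fx2 / n - mean ** 2   # S_n^2, the biased sample variance
    var_n1 = var_n * n / (n - 1)      # S_{n-1}^2, the unbiased estimate
    return {
        "n": n,
        "mean": mean,
        "S_n": sqrt(var_n),
        "var_n": var_n,
        "var_n1": var_n1,
        "S_n1": sqrt(var_n1),
        "se": sqrt(var_n1) / sqrt(n),  # unbiased standard error
    }


# Hypothetical table: sentence size -> number of sentences of that size.
example = {5: 2, 10: 4, 15: 2}
stats = summarise(example)

for name, value in stats.items():
    print(name, round(value, 3))
```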
The Normal Distribution
When the sample mean $\bar{x}$ is modelled with a normal distribution, the standard deviation of $\bar{x}$ is called the standard error (s.e.). This can be calculated using the following formula:
$$\text{s.e.} = \frac{\sigma}{\sqrt{n}}$$
After looking at data tables for the normal distribution function, it is possible to obtain a value of Z for which an area equal to the required probability lies between −Z and +Z. For example, looking at the following diagram, it is clear that when modelled normally, the Z value for a 90% confidence interval lies somewhere between 1 s.e. and 2 s.e. from the mean. A central area of 90% leaves 5% in each tail, so the value needed is the one for which $\Phi(z) = 0.95$.
From the data tables, Z for a 90% confidence interval is 1.645. The 90% confidence interval for $\mu$ is therefore $(\bar{x} - 1.645\ \text{s.e.},\ \bar{x} + 1.645\ \text{s.e.})$.
Similarly, the 95% confidence interval lies between $\bar{x} - 1.96\ \text{s.e.}$ and $\bar{x} + 1.96\ \text{s.e.}$
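The table look-up can be reproduced with the inverse of the standard normal distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper name `z_for_confidence` is made up for illustration.

```python
from statistics import NormalDist


def z_for_confidence(level):
    """Return z such that the area between -z and +z equals `level`."""
    tail = (1 - level) / 2                # area left in each tail
    return NormalDist().inv_cdf(1 - tail)  # z with Phi(z) = 1 - tail


z90 = z_for_confidence(0.90)
z95 = z_for_confidence(0.95)

print(round(z90, 3))  # 1.645
print(round(z95, 2))  # 1.96
```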
Method of Data Collection
Certain precautions were taken to ensure that the data collected was consistent. When working out the sentence sizes for the two books, I picked 10 pages at random by generating a random number on my calculator. The method of generating the random number was as follows: (Rnd × 50 + 1). The integer part of the answer was taken as the random number. For each page, I then generated a further 5 random numbers corresponding to line numbers on the page, taking into account the number of lines actually on it. This gave a total sample size of 50 from each book. When retrieving the data, once the line number had been generated for each sample, I recorded the size of the first sentence that began on or after that line. For example, if line number 15 was generated but no sentence began on that line, I looked at line 16 to see whether a sentence began there, and so on. In cases where I had to turn to the next page to find the next sentence, the data was still counted as coming from the original, randomly generated page number.
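The page and line selection could be sketched in code as follows. This is only an illustration of the procedure described, not the calculator routine itself. The names are hypothetical, and the value of 30 lines per page is an assumption, since the actual number of lines varies by page and is not stated. As in the procedure described, repeated pages or line numbers are not filtered out.

```python
import random

PAGES = 50            # upper limit implied by Rnd x 50 + 1
PAGES_SAMPLED = 10
LINES_PER_PAGE = 5    # line numbers generated for each sampled page
LINES_ON_PAGE = 30    # assumption: the real line count is not given


def random_int(upper, rng):
    """Integer part of Rnd x upper + 1, i.e. a whole number from 1 to upper."""
    return int(rng.random() * upper) + 1


def draw_sample_points(seed=None):
    """Return a list of (page, line) pairs, one per sentence to record."""
    rng = random.Random(seed)
    points = []
    for _ in range(PAGES_SAMPLED):
        page = random_int(PAGES, rng)
        for _ in range(LINES_PER_PAGE):
            points.append((page, random_int(LINES_ON_PAGE, rng)))
    return points


points = draw_sample_points(seed=1)
print(len(points))  # 50 sample points per book
```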
In terms of what is counted as one word or two, a hyphenated word is consistently counted as two words, and the letter ‘a’ is consistently counted as a word.
The above table shows tabulated data for sentence sizes in the adult text.
The table below shows data for sentence sizes from the child’s text.
Interpretation of Adult Text data
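$$n = 50$$
$$\bar{x} = \frac{\sum fx}{n} = \frac{564}{50} = 11.28$$
$$S = \sqrt{\frac{\sum fx^2}{n} - \bar{x}^2} = \sqrt{\frac{8732}{50} - 127.24} = 6.88$$
$$S^2 = 47.40$$
$$S_{n-1}^2 = \frac{n \times S_n^2}{n-1} = \frac{50 \times 47.40}{49} = 48.37$$
$$S_{n-1} = \sqrt{S_{n-1}^2} = 6.95$$
$$\text{Unbiased s.e.} = \frac{\sqrt{S_{n-1}^2}}{\sqrt{n}} = \frac{6.95}{7.07} = 0.98$$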
90% confidence interval $= (\bar{x} - 1.645 \times \text{s.e.},\ \bar{x} + 1.645 \times \text{s.e.})$
$= (11.28 - 0.98 \times 1.645,\ 11.28 + 0.98 \times 1.645) = (9.67,\ 12.89)$
95% confidence interval $= (\bar{x} - 1.96 \times \text{s.e.},\ \bar{x} + 1.96 \times \text{s.e.})$
$= (11.28 - 0.98 \times 1.96,\ 11.28 + 0.98 \times 1.96) = (9.36,\ 13.20)$
I am 95% confident that the true mean sentence size in the adult's text lies between these values.
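As a check, the adult-text figures can be re-derived from the three totals quoted in the working (n = 50, Σfx = 564, Σfx² = 8732). The sketch below carries full precision throughout, whereas the hand calculation rounds the standard error to 0.98 first, so the interval endpoints can differ from those above by about 0.01.

```python
from math import sqrt

# Totals quoted in the adult-text working.
n = 50
sum_fx = 564
sum_fx2 = 8732

mean = sum_fx / n                 # sample mean
var_n = sum_fx2 / n - mean ** 2   # S_n^2
var_n1 = var_n * n / (n - 1)      # S_{n-1}^2
se = sqrt(var_n1 / n)             # unbiased standard error


def interval(z):
    """Confidence interval for the mean with z standard errors either side."""
    return (mean - z * se, mean + z * se)


ci90 = interval(1.645)
ci95 = interval(1.96)

print("mean:", round(mean, 2), "s.e.:", round(se, 2))
print("90% CI:", tuple(round(v, 2) for v in ci90))
print("95% CI:", tuple(round(v, 2) for v in ci95))
```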
Interpretation of Child Text Data
$n = 50$, $\bar{x} = 15.92$, $S = 9.19$, $S^2 = 84.46$, $S_{n-1}^2 = 86.19$
$S_{n-1} = 9.28$, Unbiased s.e. $= \frac{9.28}{7.07} = 1.31$
90% confidence interval $= (15.92 - 1.31 \times 1.645,\ 15.92 + 1.31 \times 1.645) = (13.77,\ 18.07)$
95% confidence interval $= (15.92 - 1.31 \times 1.96,\ 15.92 + 1.31 \times 1.96) = (13.35,\ 18.49)$
I am 95% confident that the true mean sentence size in the child's text lies between these values.
COMPARING 90% CONFIDENCE INTERVALS
From the interpretation of my results, it is apparent that my prediction was fairly accurate, and I feel the results are realistic because of this. The results show less variation in the adult text ($S^2 = 47.40$) than in the child's text ($S^2 = 84.46$), which suggests that sentence sizes in the adult text are more consistent. In terms of where the true mean lies, my prediction is again supported: the graph and analysis show that the sample mean sentence length is 15.92 words for the child's text and 11.28 words for the adult text. The confidence interval diagrams above show how confident I am in each calculated mean. In each case I am 90% or 95% confident that the true mean lies within the interval highlighted on the diagram. The width of the confidence interval decreases as the confidence level decreases. The intervals for the two books do not overlap at either level: even at 95%, the adult interval ends at 13.20, below the start of the child's interval at 13.35. Overall, I would say that the data shows evidence of a difference in sentence sizes between the two books, and it generally backs up the argument put forward in my hypothesis.
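The comparison of the intervals can be made explicit with a short check. This is an informal sketch using the interval endpoints calculated earlier. The helper `overlaps` is a made-up name, and checking for overlap is only a rough guide, not a formal hypothesis test.

```python
def overlaps(a, b):
    """True if the closed intervals a = (low, high) and b share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Interval endpoints taken from the calculations for each book.
adult = {"90%": (9.67, 12.89), "95%": (9.36, 13.20)}
child = {"90%": (13.77, 18.07), "95%": (13.35, 18.49)}

for level in adult:
    # Gap between the top of the adult interval and the bottom of the child's.
    gap = child[level][0] - adult[level][1]
    print(level, "overlap:", overlaps(adult[level], child[level]),
          "gap:", round(gap, 2))
```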
In order to make my results more accurate, I would need to review my method of data collection. I feel that it may have caused a slight misrepresentation, because at times I counted the same sentence twice. I chose random line numbers independently each time, so if, for example, lines 16 and 17 were both chosen but the first sentence after line 16 began on line 17, the same sentence was recorded twice. I may also have made errors when counting the words in a sentence, which may have limited the validity of my results. If more time were available, I would compare word sizes as well as sentence sizes. I could also compare around 5 different sets of adult and children's books, which would allow me to look at a larger sample as well as the variation between different authors and styles of book. I could control a variable such as genre, for example by comparing an adult horror book with a child's horror book, and an adult thriller with a child's thriller. I would also consider increasing the sample size from 50 to 100. The Central Limit Theorem states that the distribution of the sample mean becomes closer to a normal distribution as the sample size increases, and the standard error, which is proportional to $1/\sqrt{n}$, becomes smaller, giving a narrower confidence interval.
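The effect of doubling the sample size on the standard error can be estimated directly. The sketch below assumes, purely for illustration, that the adult text's $S_{n-1}$ of 6.95 would stay the same in a larger sample. A real sample of 100 would give its own value.

```python
from math import sqrt

s = 6.95  # S_{n-1} for the adult text, assumed unchanged for a larger sample


def standard_error(s, n):
    """Standard error of the mean for standard deviation s and sample size n."""
    return s / sqrt(n)


se_50 = standard_error(s, 50)
se_100 = standard_error(s, 100)
ratio = se_100 / se_50  # 1 / sqrt(2): doubling n does not halve the s.e.

print(round(se_50, 3), round(se_100, 3), round(ratio, 3))
```

Under that assumption the 95% interval would narrow by about 29%, so the sample size would need to be quadrupled to halve it.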