In cluster sampling, the population is divided into smaller groups called clusters. One or more clusters are chosen at random, and these make up the sample. This form of sampling is also very cheap, but it can easily be biased if the clusters are not similar to one another. I shall not be using this form of sampling, as it would be very difficult to ensure that the clusters are all alike, so the results would be likely to be biased.
I shall use stratified sampling as it is easier to minimise bias when using this method. There shall be no problem with expense as I have already collected my data from secondary sources, but it shall be slightly time-consuming as I shall have to separate my data into strata. I have decided to separate my data so that the number of individuals of each age in the sample is proportional to the number of that age in the population.
There are 6 different age groups in my data: 11 year olds, 12 year olds, 13 year olds, 14 year olds, 15 year olds and 16 year olds. I decided to take a sample of roughly 5% as this should provide a good representation of my population. When I calculated 5% of 1182, I got 59.1, so I decided to use 60 pupils as this would be easier to handle.
Now I shall separate the population into separate strata and use the random number generator on my calculator to decide which individuals shall be present in my sample.
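As a rough illustration of the sampling plan described above, the sketch below shares a sample of 60 across the age strata in proportion to their sizes and then picks individuals at random within each stratum. Only the total of 1182 and the sample size of 60 come from the text; the individual stratum sizes, the function name `allocate_sample` and the fixed seed are assumptions made purely for the example.

```python
import random

def allocate_sample(strata_sizes, sample_size):
    """Share sample_size across strata in proportion to each stratum's size."""
    total = sum(strata_sizes.values())
    # Exact (fractional) share for each stratum.
    exact = {k: sample_size * v / total for k, v in strata_sizes.items()}
    # Round down first, then give the leftover places to the strata with
    # the largest fractional parts so the total is exactly sample_size.
    alloc = {k: int(share) for k, share in exact.items()}
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

# HYPOTHETICAL stratum sizes: only their total (1182) comes from the text.
strata_sizes = {11: 200, 12: 210, 13: 195, 14: 197, 15: 190, 16: 190}
allocation = allocate_sample(strata_sizes, 60)

# Pick that many distinct positions (1-based) at random within each stratum.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
picks = {age: sorted(rng.sample(range(1, strata_sizes[age] + 1), n))
         for age, n in allocation.items()}

print(allocation)
print(picks[11])
```

Each number in `picks` plays the same role as the calculator's random number: it is the position of a pupil within that age group's list.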
For the 1st age group the calculator randomly selected the 9th individual, who was Ahmed DJ. I did the same for all the others and these are the individuals that were selected:
I shall include the full sample in the Appendix.
Results:
The first thing I planned to do was to find the mean KS2 levels and plot them in pie charts for all the data, the data for males and the data for females. The mean is a measure of central tendency, which is meant to give a good idea of a typical figure for all the data. It is easily distorted by extreme values or 'outliers'. I found each student's mean by adding the three KS2 levels together and dividing the answer by the number of levels. For example, Ahmed DJ achieved a 4, 4, 4 in his KS2 exams. If I add the three 4s together, I get 12. If I then divide 12 by 3, as there were three results, I get 4, so the mean KS2 level for Ahmed DJ is 4.
Before I made the pie charts, I arranged the mean KS2 levels in order of size. This is called a distribution. It allowed me to notice that no student got a set of levels with a difference of more than 1 between the highest and the lowest, e.g. no student achieved a 5, 4, 3 or a 6, 5, 4, as there is a difference of more than 1 between 5 and 3 or between 6 and 4.
This is the pie chart which shows the mean KS2 levels for all the students. Pie charts are useful for looking at a category’s proportion compared to the whole. A category variable is one which contains discrete data, that is, data which falls into separate values with nothing in between, e.g. if you ask someone how old they are in years, they would say either 15 or 16, not 15.5532 or something like that.
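The worked example for one student can be checked with a couple of lines of Python. This is only a sketch; the function name `mean_level` is an assumption for illustration.

```python
def mean_level(levels):
    """Mean of a student's KS2 levels: total divided by how many there are."""
    return sum(levels) / len(levels)

# The three KS2 levels quoted in the text for Ahmed DJ.
print(mean_level([4, 4, 4]))  # 4.0
```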
This is the equivalent pie chart for just males. It shows that no male student achieved a KS2 level under 3 and over 50% got at least two 4s.
This is the equivalent pie chart for females. It shows that no female student achieved a level above 5; it also shows that over 50% achieved at least two 4s.
In these aspects, males and females are equal, but it is noticeable that males got a larger number of 4s and above, which would suggest that they do better in their KS2 exams.
Now I intend to look at the numbers of students achieving certain levels. One way of doing this would be comparative pie charts, which improve on ordinary pie charts because they show how many attained each level whilst also showing the proportions compared to the total. Another way of doing this would be by simply drawing a stem and leaf diagram. A stem and leaf diagram allows me to compare the numbers of levels attained, the range of the levels and the median for both genders. I shall construct a stem and leaf diagram of IQ for both genders.
I shall now calculate the range, which is a measure of spread. It is calculated by subtracting the lowest number from the highest number.
Range for Males: 121-90= 31 Range for Females: 132-90= 42
I also wish to find a measure of central tendency for the data. The median is calculated by putting the data in order of magnitude and then taking the middle number. It is useful when there are extreme values or ‘outliers’ as it does not take these into account. I shall use this as my central tendency as it is clearly noticeable that there are extreme values in this sample.
Median for Males: (27+1)/2 = 14, so the median shall be the 14th value, which is 100.
Median for Females: (33+1)/2 = 17, so the median shall be the 17th value, which is 102.
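The range and median-position calculations above can be reproduced as follows. This is a small sketch using only the figures quoted in the text; the function names are assumptions.

```python
def value_range(lowest, highest):
    """Range = highest value minus lowest value."""
    return highest - lowest

def median_position(n):
    """Position of the median in an ordered list of n values."""
    return (n + 1) / 2

print(value_range(90, 121))  # males: 31
print(value_range(90, 132))  # females: 42
print(median_position(27))   # males: 14.0, i.e. the 14th value
print(median_position(33))   # females: 17.0, i.e. the 17th value
```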
This would suggest that the IQ of females tends to be more dispersed but the IQ for a male tends to be lower than that of a female.
In order to look further at the dispersion of these two samples, I shall examine their Inter-Quartile ranges (IQ-range). The IQ-range measures the spread of the middle half of the data and is not affected by extreme values. To do this, I must find the median of the lower half (lower quartile) and the median of the upper half (upper quartile). To find the IQ-range, you subtract the lower quartile from the upper quartile.
In order to further compare the 2 genders, I shall construct a step polygon. A step polygon shows the running total of the frequencies on the vertical axis and is used for discrete data. This would allow me to compare Inter Quartile ranges as it is easy to estimate this from a cumulative frequency diagram. I shall have to create a running total of how many people had a certain IQ level.
IQ-range for males: (13+1)/2 = 7, so each quartile is the 7th value in its half.
Lower Quartile = 99, Upper Quartile = 108, so the IQ-range is 108 – 99 = 9
IQ-range for females: (17+1)/2 = 9, so each quartile is the 9th value in its half.
Lower Quartile = 100, Upper Quartile = 105, so the IQ-range is 105 – 100 = 5
This would suggest that the females are actually less dispersed than the males, which is the opposite of the picture given by the range. Because the two measures disagree, we can tell that the range for females has been affected by extreme values.
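The method of taking the median of each half can be sketched as below. The list of values is made up purely for illustration (the full sample is not reproduced here), and this version leaves the middle value out of both halves when the number of values is odd, which is one common convention.

```python
from statistics import median

def quartiles(values):
    """Lower and upper quartiles as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[len(data) - half:]
    return median(lower_half), median(upper_half)

# HYPOTHETICAL data, only to show the method.
example = [90, 95, 99, 100, 103, 108, 121]
lower_q, upper_q = quartiles(example)
print(lower_q, upper_q)   # 95 108
print(upper_q - lower_q)  # IQ-range: 13

# Using the quartiles quoted in the text:
print(108 - 99)   # males: 9
print(105 - 100)  # females: 5
```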
In order to confirm this, I shall need to investigate if there really are 'outliers' present in the data for females. I could do this by drawing a box and whisker diagram and comparing certain values with the nearest quartile. An outlier is a very small or very large value in a set of data. They are often ignored as they can distort the data. A value is classified as an outlier if the distance from the nearest quartile is greater than 1.5 times the IQ-range. A box and whisker diagram would be useful as it would show the skewness of a distribution and allows you to easily compare distributions. Data can be positively skewed or negatively skewed. When a set of data is skewed, the median tends to be closer to one of the quartiles. When the median is closer to the lower quartile, the data is positively skewed, and when the median is closer to the upper quartile, the data is negatively skewed. Symmetrical distributions occur when the median is directly in the middle of the IQ-range and the range.
For a male value to be an outlier, it must be more than 1.5 × 9 = 13.5 away from the nearest quartile. This means it would have to be either above 121.5 or below 85.5. The highest male IQ, 121, is just inside the upper limit of 121.5, so strictly it does not count as an outlier, although it comes very close. For a female value to be an outlier, it must be more than 1.5 × 5 = 7.5 away from the nearest quartile, so any value above 112.5 or below 92.5 is an outlier. I have marked all outliers with an x.
It is clear from these diagrams that females are strongly concentrated close to their median of 102. It also becomes clear that the males have a stronger positive skewness, which means that more of the male pupils will have a higher IQ than the median when compared to females. This has now confused the conclusions which I was beginning to draw up, as I was beginning to interpret that females did tend to have an IQ higher than males, but this may suggest something slightly different. To investigate this further, I shall have to look at the middle 20% to confirm whether the middle values for males are lower than those for females. To do this, I must find the range between the 4th decile and the 6th decile. A decile is equal to 10 percentiles. To find the position of a decile, you find the matching multiple of 10% of the total number in the group you are working on. This gives you the required position and you just read off the value.
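The outlier limits can be worked out directly from the quartiles given earlier. The sketch below uses the 1.5 × IQ-range rule described above; the function names are assumptions for illustration.

```python
def outlier_limits(lower_q, upper_q, k=1.5):
    """Values beyond these limits are more than k x IQ-range from a quartile."""
    iq_range = upper_q - lower_q
    return lower_q - k * iq_range, upper_q + k * iq_range

def is_outlier(value, lower_q, upper_q):
    low, high = outlier_limits(lower_q, upper_q)
    return value < low or value > high

print(outlier_limits(99, 108))   # males: (85.5, 121.5)
print(outlier_limits(100, 105))  # females: (92.5, 112.5)
print(is_outlier(132, 100, 105)) # highest female IQ: True
print(is_outlier(90, 100, 105))  # lowest female IQ: True
```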
The 4th decile is the equivalent of the 40th percentile so I find 40% of 27 and 33. I do the same with 60%.
Males: positions 10.8 to 16.2
Females: positions 13.2 to 19.8
Males: The 10th and 11th values are both 100, so the 4th decile is 100. The 16th and 17th values are 103 and 105, and I have to go 0.2 of the way between these two values. To do this, I work out the difference, which is 2. I then multiply 2 by 0.2, which comes to 0.4. I then add this to 103, so the 6th decile is 103.4. This means that the middle 20% of IQ for males ranges between 100 and 103.4.
Females: The 13th and 14th values are both 101. The 19th and 20th values are both 103. This means that the middle 20% of IQ for females ranges between 101 and 103.
This difference is extremely minor and cannot be used to prove anything, so I must come to the conclusion that gender may not have any effect on IQ, but I cannot be sure.
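The decile positions and the step of going part of the way between two neighbouring values can be sketched like this. Only the figures quoted in the text are used; the function names are assumptions.

```python
def decile_position(n, decile):
    """Position of the given decile in an ordered list of n values."""
    return n * decile / 10

def interpolate(lower_value, upper_value, fraction):
    """Go `fraction` of the way from lower_value to upper_value."""
    return lower_value + fraction * (upper_value - lower_value)

print(decile_position(27, 4), decile_position(27, 6))  # males: 10.8 16.2
print(decile_position(33, 4), decile_position(33, 6))  # females: 13.2 19.8

# Males, 6th decile: 0.2 of the way from the 16th value (103) to the 17th (105).
print(round(interpolate(103, 105, 0.2), 1))  # 103.4
# Males, 4th decile: the 10th and 11th values are both 100.
print(round(interpolate(100, 100, 0.8), 1))  # 100.0
```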
The next stage of my investigation was to decide if KS2 results and IQ are correlated. I can do this by constructing a scatter diagram of average KS2 results against IQ.
Here is my scatter graph for IQ against mean KS2 levels for the whole sample:
This would suggest to me that there is some weak positive correlation. To further my investigation into this, I shall compare the same for both genders.
Here is the scatter diagram for the mean KS2 levels against IQ for males:
And for females:
From this, I can tell that the males have a stronger relationship between their IQ and their KS2 results than the females. But to examine the strength of the correlation and to give further evidence that IQ and KS2 level are related, I have to use a correlation coefficient. One type of correlation coefficient is Spearman's rank correlation coefficient. Spearman's would require me to rank 60 different students by their IQ and by their mean KS2 level, so this method would be unsuitable as it is too time-consuming. The product-moment correlation coefficient allows you to determine how strong the relationship is between two variables but is only a measure of linear correlation. It takes into account every value and measures the correlation between all of them. When using this, I do not have to rank the data.
When the product-moment coefficient is equal to 0, it means that there is no linear relationship between the two variables. When it is equal to 1, there is perfect positive correlation, and when it is equal to –1, there is perfect negative correlation.
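The formula is:
r = (∑(xy)/n − x̄ȳ) ÷ √[(∑x²/n − x̄²) × (∑y²/n − ȳ²)]
where x is the IQ, y is the mean KS2 level, x̄ and ȳ are their means and n is the number of pupils.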
To calculate the product-moment correlation coefficient, I need to draw up a table with the required values.
n = 60
x̄ = 103.3667
ȳ = 4.2
∑(xy)/n = 26268.33 / 60 = 437.8056
∑x²/n = 644660 / 60 = 10744.33
∑y²/n = 1080 / 60 = 18
∑(xy)/n − x̄ȳ = 437.8056 – 434.14 = 3.665556
∑x²/n − x̄² = 10744.33 – 10684.67 = 59.66556
∑y²/n − ȳ² = 18 – 17.64 = 0.36
(∑x²/n − x̄²) × (∑y²/n − ȳ²) = 59.66556 × 0.36 = 21.4796
√21.4796 = 4.634609
r = 3.665556 ÷ 4.634609 = 0.790909
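The steps of this calculation can be checked with a short Python sketch that works from the same totals. The function name `pmcc_from_totals` is an assumption, x is taken to be IQ and y the mean KS2 level, and the inputs are the rounded figures quoted above, so the last decimal places may differ slightly from the hand calculation.

```python
from math import sqrt

def pmcc_from_totals(n, mean_x, mean_y, sum_xy, sum_x2, sum_y2):
    """Product-moment correlation coefficient from summary totals."""
    covariance = sum_xy / n - mean_x * mean_y  # sum(xy)/n - mean_x * mean_y
    variance_x = sum_x2 / n - mean_x ** 2      # sum(x^2)/n - mean_x^2
    variance_y = sum_y2 / n - mean_y ** 2      # sum(y^2)/n - mean_y^2
    return covariance / sqrt(variance_x * variance_y)

r = pmcc_from_totals(n=60, mean_x=103.3667, mean_y=4.2,
                     sum_xy=26268.33, sum_x2=644660, sum_y2=1080)
print(round(r, 2))  # 0.79
```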
This would show that IQ and KS2 levels are correlated, as 0.79 (0.790909 to six decimal places) is classed as high positive correlation. Even so, correlation alone does not show that one variable causes the other, and there is no evidence here to suggest which of the two would be the cause and which the effect.
A trendline, also known as a line of best fit, is a line which takes into account all the values on the graph and acts as a type of mean line, one which is representative of all the points. As the trendline is a straight line, it takes the form y = mx + c. The equation of the trendline can be worked out by first finding the gradient and then the point of intercept.
To work out the gradient of a line, you take 2 points on the line and look at their coordinates: m = (y2 - y1) ÷ (x2 - x1). To find c, you can either extend the line and find the point of intercept on the y-axis, or substitute one point from the line along with the gradient into y = mx + c.
The formula for this trendline is y = 0.0614x - 2.1503, where x is the IQ and y is the mean KS2 level. This means that to find an estimate for someone's IQ from their mean KS2 level, you add on 2.1503 and then divide by 0.0614.
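As an illustration of using the trendline in both directions, here is a minimal sketch. It assumes x is the IQ and y is the mean KS2 level, as implied by the rearrangement described above; the function names are made up for the example.

```python
GRADIENT = 0.0614    # m in y = mx + c
INTERCEPT = -2.1503  # c in y = mx + c

def gradient(x1, y1, x2, y2):
    """Gradient of the line through two points."""
    return (y2 - y1) / (x2 - x1)

def ks2_from_iq(iq):
    """Estimate the mean KS2 level from an IQ using the trendline."""
    return GRADIENT * iq + INTERCEPT

def iq_from_ks2(ks2):
    """Rearranged trendline: add 2.1503, then divide by 0.0614."""
    return (ks2 - INTERCEPT) / GRADIENT

print(round(ks2_from_iq(100), 2))  # 3.99
print(round(iq_from_ks2(4), 1))    # 100.2
```

These estimates are only sensible for values inside the range of the data that the trendline was drawn through.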
I shall now compare IQ to age so I can look at the relationship between the year someone took their Key Stage 2 exams and their results:
This graph shows little correlation and therefore could not be used to prove that IQ is related to the year you took your KS2 exam.
I shall now look at the relationship between these variables and the normal curve. The normal curve describes the distribution that is expected for many naturally occurring variables in a large population. It is a bell-shaped curve and is symmetrical. The mean, mode and median are all equal and all lie on the axis of symmetry. The normal curve looks something like this:
About 68% of all the values of any normal distribution lie within 1 standard deviation of the mean, about 95% lie within 2 standard deviations of the mean and roughly 99% lie within 2.5 standard deviations of the mean. Here the standard deviation is being used as a unit of measurement, and this shall become clearer later. The standard deviation is a type of average distance from the mean for the values in a set of data. It is used as a measure of spread, as the higher the standard deviation, the further the values tend to be from the mean, so the larger the spread of the data.
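These proportions can be checked with the `NormalDist` class from Python's standard `statistics` module (available from Python 3.8). This is just a sketch; the helper name `proportion_within` is an assumption.

```python
from statistics import NormalDist

def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 2.5):
    print(k, round(proportion_within(k) * 100, 1))
```

The three results come out close to the 68%, 95% and 99% figures quoted above.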
I shall now draw the distribution of IQ for my sample; this would allow me to make some estimates on the chances of another student being within 1 standard deviation of the mean and those within two standard deviations of the mean etc.
As you may have noticed, this looks nothing like the normal distribution, so I will now group the data in order to emphasise any patterns, and we should begin to see some similarities to the normal curve. Grouping data loses the individual values but allows the data to be handled with more ease, and it also emphasises patterns within a distribution.
Although this is beginning to look a lot more like the normal distribution, it is not close enough for me to apply the same proportions, so I cannot make assumptions about IQ based on the normal curve.
I shall now look at the distribution of mean KS2 levels:
This graph shows some similarities to the normal curve so I shall group the data in order to emphasise any patterns within the distribution.
This shows strong similarities to the normal curve, so it should be reasonable to use the same proportions as estimates. For example, it should be safe to say that roughly 68% of the students would lie within 1 standard deviation of the mean.
Mean ± 1 SD = 68%
This would enable me to estimate the confidence that I can have in my sample. I can do this because a sample-mean can be used as an estimate for the population mean. I shall now be abbreviating these to s-mean and p-mean. Using the s-mean to estimate the p-mean is an example of statistical inference.
To compare the sample to the population, I must first calculate the mean of the mean KS2 levels and the standard deviation of the mean KS2 levels. I have already explained how to calculate the mean but to calculate the standard deviation you use this formula:
x̄ = 4.2
I shall include the calculations for how I calculated the standard deviation in the Appendix.
The standard deviation is 0.36
The mean is 4.2.
So I can now say, based on the proportions of the normal curve, that:
68% = 4.2 ± 0.36
68% of the pupils lie between 3.84 and 4.56
Only 26 of the 60 pupils lie between these two figures. To work out the percentage, I divided this number by the total number in the sample and then multiplied by 100. This came to 43%, which is well below 68%. This may be due to the fact that the variable has so little flexibility (one can have a mean KS2 level of 3.6 or 4, for example, but nothing in between).
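The interval and the percentage above can be reproduced as follows, using the mean, the standard deviation and the count of 26 as they are stated in the text. The list of possible mean levels (thirds from 2 up to about 5.67) is an assumption based on each student having three whole-number KS2 levels.

```python
mean = 4.2
sd = 0.36  # standard deviation as stated in the text

low, high = mean - sd, mean + sd
print(round(low, 2), round(high, 2))  # 3.84 4.56

# Percentage of the sample inside the interval (26 out of 60, from the text).
print(round(26 / 60 * 100))  # 43

# Possible mean KS2 levels: 2, 2.33, 2.67, ..., 5.67 (steps of one third).
possible_levels = [k / 3 for k in range(6, 18)]
inside = [round(v, 2) for v in possible_levels if low <= v <= high]
print(inside)  # [4.0, 4.33]
```

Only two of the possible mean levels fall inside the interval, which illustrates how coarse the discrete values are compared with the width of the interval.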
But this was expected as the data I am using is discrete data. Discrete data is data which can only take certain individual values and in this case your mean KS2 levels could only be 2, 2.3, 2.6, 3, 3.3, 3.6, 4, 4.3, 4.6, 5, 5.3, 5.6. This data introduced inaccuracies into the normal curve but may still be able to be used to estimate a distribution of sample means. On the other hand, if I had continuous data, the estimates which I would make using the normal curve would be much more accurate as continuous data is data that can have any number or value within a certain range.
I shall now look at an estimate for the distribution of sample means in order to estimate the confidence boundaries for my sample. These confidence boundaries will show me how confident I can be that my sample provides an accurate representation of the population. To do this, I must first work out the standard error. This is an estimate for the standard deviation of sample-means and will allow me to utilise the normal curve to estimate the whereabouts of the population-mean.
So 0.36 is my sample-SD and there were 60 individuals in my sample. The SE is the sample-SD divided by the square root of the sample size, so my SE is equal to 0.36 ÷ √60 = 0.0464758. So there is roughly a 68% chance that the p-mean is within 0.0464758 of my s-mean, and approximately a 95% chance that it is within 2 SE of my s-mean, that is, within 4.2 ± 2(0.0464758). This means that there is approximately a 95% chance of the p-mean being between 4.1070484 and 4.2929516. As we saw before, these values are not fully reliable because the data is discrete, but this estimate would still suggest that the p-mean has roughly a 95% chance of lying between about 4.1 and 4.3.
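The standard error and the confidence boundaries can be checked with the sketch below. It uses the sample-SD, sample size and s-mean exactly as stated in the text; the variable names are assumptions.

```python
from math import sqrt

sample_sd = 0.36   # sample-SD as stated in the text
sample_size = 60
sample_mean = 4.2  # s-mean

# Standard error: estimated standard deviation of the sample means.
standard_error = sample_sd / sqrt(sample_size)
print(round(standard_error, 7))  # 0.0464758

# Approximate 95% boundaries: s-mean plus or minus 2 standard errors.
lower_bound = sample_mean - 2 * standard_error
upper_bound = sample_mean + 2 * standard_error
print(round(lower_bound, 7), round(upper_bound, 7))  # 4.1070484 4.2929516
```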
Conclusion:
I can now make conclusions based on my investigation. My conclusions are:
- Gender has either little or no effect on one's Intelligence Quotient. I found this out from my investigation into gender and IQ.
- IQ is strongly related to Key Stage 2 levels and a person with a high IQ is likely to have high KS2 levels. I found this out through my investigation of IQ and mean KS2 levels.
- The year you took your KS2 exams has little or no impact on the levels you achieve. I found this out through my investigation of age and mean KS2 levels.
- I can now also conclude that these conclusions would apply to all the 11-16 year olds who attend Mayfield High School, as the sample I took was randomly selected to be representative of the entire school.
- There is roughly a 95% chance that the mean KS2 level for the whole of Mayfield High School is somewhere around 4 (between about 4.1 and 4.3).
Limitations:
My investigation was limited for several reasons which include:
- My sample may not have been representative of the whole school, as my strata were based only on age groups. This means that, for example, if there were 60 students in the school with learning difficulties, it is possible (although very unlikely) that I randomly selected all 60 of these. Such a problem would have introduced bias into my sample, but it was unavoidable as the data which I obtained on my population did not include such details.
- My conclusions could not be used to represent any population other than the 11-16 year olds who attended Mayfield High at the time, as the sample may not fairly represent the different social classes, ethnic groups and several other groups which may have introduced other factors into the investigation.
- There is always a chance that some of the results obtained in this investigation may be inaccurate or incorrect due to mistakes in my sample. However, there is a greater chance of them being correct and reliable as statistical theory has been used and explained and bias has been minimised where possible through the use of stratified sampling and taking a large sample.
- Some of my data was discrete and this limited the accuracy of some of my estimates.
- I could not use comparative figures as I had no knowledge of how the IQ was measured and I therefore could not compare this with a different measure of IQ.
- Statistical analysis is inevitably subject to a certain degree of error. These errors can never be eliminated; they can only be reduced.
Bibliography:
- quote from psychologists
- Stanley Thornes Advanced Level Statistics
- Stanley Thornes GCSE Level Statistics
- Statistics Without Tears, Derek Rowntree
Appendix
Sample:
Standard Deviation: