Other sampling methods that I could have used include simple random sampling and stratified random sampling. For simple random sampling, I could have assigned every record a number, and then used a random number generator to select records, ignoring any repeated numbers. For stratified random sampling, I could have used a field such as gender or Year to split the overall data into groups, and selected the same number of records from each group.
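The three sampling methods mentioned can be sketched in Python as follows. This is only an illustration: the `records` list, its `id` and `gender` fields, and the function names are made up for the example and are not taken from the census spreadsheet. The systematic version assumes that "every 2nd record" starts with the 2nd row.

```python
import random

def simple_random_sample(records, k, seed=0):
    """Pick k distinct records; random.sample never picks the same position twice."""
    rng = random.Random(seed)
    return rng.sample(records, k)

def stratified_sample(records, key, per_group, seed=0):
    """Split records into groups by key, then take per_group records from each group."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(key(record), []).append(record)
    sample = []
    for name in sorted(groups):
        sample.extend(rng.sample(groups[name], per_group))
    return sample

def systematic_sample(records, step=2):
    """Take every step-th record, working from the top downwards (assumed to start at the 2nd)."""
    return records[step - 1::step]

# Hypothetical records, used only to show how the functions are called.
records = [{"id": i, "gender": "M" if i % 2 else "F"} for i in range(1, 21)]
every_second = systematic_sample(records)
print([r["id"] for r in every_second])  # [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
```

Using `random.sample` avoids having to ignore repeated numbers by hand, because it selects without replacement.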
Sampling the Data
As stated above, I am going to use the systematic method of sampling, choosing every 2nd record working from the top of my spreadsheet downwards. For the wrist circumference and thumb circumference I will sample from all the data, but for the height values I will sample only from Year 9 boys and girls, and for the foot size readings only from Years 7 and 11. Below are the tables containing the data I need for my hypotheses.
Removing Outliers
I will now remove any outlier values that could unfairly change my results as I stated I would above. I will start with the Year 9 male and female heights. I firstly rearranged the table on my spreadsheet so that the heights were in ascending order. I have 50 pieces of data for each of the male and female tables, so to find the median I would need to find out what the 25th value was. The median for the male set of data is 165, and for the female set of data it is 163.
I will now find the upper and lower quartile values and, from those, the inter-quartile range. To find the lower quartile value, I need to find the value that is ¼ of the way through the entire 50 values (which is halfway between the 12th and 13th values), and for the upper quartile value I need to find the value that is ¾ of the way through the entire 50 values (which is halfway between the 37th and 38th values).
For the male set of data, the 12th piece of data is 157, and the 13th is 159, therefore the lower quartile is (157+159)/2 = 158. The 37th value is 173, and the 38th value is 175, therefore the upper quartile is (173+175)/2 = 174. The inter-quartile range is the upper quartile value minus the lower quartile value, so it is 174-158 = 16.
For the female set of data, the 12th piece of data is 158, and the 13th is 159, therefore the lower quartile is (158+159)/2 = 158.5. The 37th value is 165, and the 38th value is 166, therefore the upper quartile is (165+166)/2 = 165.5. The inter-quartile range is the upper quartile value minus the lower quartile value, so it is 165.5-158.5 = 7.
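The quartile arithmetic above can be checked with a short Python sketch. The function names are made up for illustration; the eight input values are the 12th, 13th, 37th and 38th heights quoted above for each data set.

```python
def halfway_between(a, b):
    """Value halfway between two neighbouring data values."""
    return (a + b) / 2

def quartile_summary(v12, v13, v37, v38):
    """Return (lower quartile, upper quartile, inter-quartile range) for 50 sorted values."""
    lower = halfway_between(v12, v13)
    upper = halfway_between(v37, v38)
    return lower, upper, upper - lower

male = quartile_summary(157, 159, 173, 175)
female = quartile_summary(158, 159, 165, 166)
print(male)    # (158.0, 174.0, 16.0)
print(female)  # (158.5, 165.5, 7.0)
```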
From these values, I will draw a box-plot chart for each of the male and female data sets. I am using box-plot charts as they are the easiest and clearest way to see skewness, and by drawing the whiskers you can spot outliers easily. Below you can see the box-plot charts that I drew on my computer:
Male height box-plot:
Female height box-plot:
As you can see from the box-plots, both sets of data have quite strong negative skewness as they have very long whiskers on their left hand side.
The way to identify outliers is to say that any value more than 1.5*inter-quartile range above the upper quartile value, or more than 1.5*inter-quartile range below the lower quartile value, is an outlier and will be taken out.
Inter-quartile range for males = 16
16*1.5 = 24
Therefore any values below 134 (158-24) and above 198 (174+24) will be taken out. This has been done in the tables above, and the outliers have been highlighted.
Inter-quartile range for females = 7
7*1.5 = 10.5
Therefore any values below 148 (158.5-10.5) and above 176 (165.5+10.5) will be taken out. This has been done in the tables above, and the outliers have been highlighted.
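The outlier rule can be written as a small Python sketch. The function names and the four example heights are hypothetical; the quartile values passed in are the ones calculated above.

```python
def outlier_fences(lower_quartile, upper_quartile, multiplier=1.5):
    """Return the (lowest, highest) values that are NOT treated as outliers."""
    iqr = upper_quartile - lower_quartile
    margin = multiplier * iqr
    return lower_quartile - margin, upper_quartile + margin

def remove_outliers(values, lower_quartile, upper_quartile):
    """Keep only the values that lie on or between the two fences."""
    low, high = outlier_fences(lower_quartile, upper_quartile)
    return [v for v in values if low <= v <= high]

print(outlier_fences(158, 174))      # (134.0, 198.0) for the males
print(outlier_fences(158.5, 165.5))  # (148.0, 176.0) for the females

# Hypothetical heights, only to show the filtering step.
print(remove_outliers([130, 150, 165, 200], 158, 174))  # [150, 165]
```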
I used this same method of finding outliers for year 7 foot size, year 11 foot size, wrist circumference and thumb circumference. Below are the results I got, and the outliers I found have been highlighted in the table above.
Year 7 foot size-
Median: 23
Lower Quartile value: 21.75
Upper Quartile value: 25
Inter-Quartile Range: 3.25
1.5*IQR: 4.875
Year 11 foot size-
Median: 25.5
Lower Quartile value: 23
Upper Quartile value: 27.5
Inter-Quartile Range: 4.5
1.5*IQR: 6.75
Wrist Circumference-
Median: 160
Lower Quartile value: 147.5
Upper Quartile value: 170
Inter-Quartile Range: 22.5
1.5*IQR: 33.75
(No outliers)
Thumb Circumference-
Median: 60
Lower Quartile value: 40
Upper Quartile value: 67.5
Inter-Quartile Range: 27.5
1.5*IQR: 41.25
(No outliers)
Data Analysis and Interpretation
Hypothesis 1:
I firstly produced the new table for year 9 male and female heights, which excludes the outliers and is in ascending order.
I then grouped the data into groups of varying sizes. I used 8 groups, as I feel that more would be too many but fewer would make the histogram too simplistic. This grouping enabled me to produce the following table:
I then used the equation Frequency Density = Frequency / Class Width to work out the frequency density to allow me to draw frequency density histograms.
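As a rough sketch of this calculation in Python: the class boundaries below are an assumption, chosen so that their midpoints match the midpoints used in the grouped mean calculation for Hypothesis 1 (140, 152.5, ..., 185), and the frequencies are the Year 9 male frequencies from that same calculation. The function and variable names are made up for illustration.

```python
def frequency_densities(class_bounds, frequencies):
    """Frequency density = frequency / class width, for each class in turn."""
    return [f / (high - low) for (low, high), f in zip(class_bounds, frequencies)]

# Assumed class boundaries in cm (inferred from the midpoints, not stated in the text).
assumed_bounds = [(130, 150), (150, 155), (155, 160), (160, 165),
                  (165, 170), (170, 175), (175, 180), (180, 190)]
male_frequencies = [7, 1, 8, 9, 7, 7, 6, 2]

male_densities = frequency_densities(assumed_bounds, male_frequencies)
for (low, high), density in zip(assumed_bounds, male_densities):
    print(f"{low}-{high}cm: {density:.2f}")
```

Dividing by the class width is what makes the wide first class (20cm) comparable with the 5cm classes in the histogram.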
Year 9 Males:
Year 9 Females:
I used my computer to draw the following graphs:
Year 9 Male height histogram and probability distribution curve:
Year 9 Female height histogram and probability distribution curve:
I am also going to find the mean of both the raw and grouped data for the two sets of data. To find the mean of the raw data, you add all of the height values together and divide by the number of values.
Males’ Raw data mean – 7724/47 = 164cm (to 3 sig.fig.)
Females’ Raw data mean – 7311/45 = 162cm (to 3 sig.fig)
To work out the mean of the grouped data, the midpoint of each group must be calculated and multiplied by the frequency for that group. All these values are added together and the result is divided by the total frequency.
Males’ Grouped data mean –
( (140*7)+(152.5*1)+(157.5*8)+(162.5*9)+(167.5*7)+(172.5*7)+(177.5*6)+(185*2))
= 7670cm
7670/47 = 163cm (to 3 sig. fig)
Females’ Grouped data mean –
( (140*0)+(152.5*5)+(157.5*14)+(162.5*16)+(167.5*6)+(172.5*4)+(177.5*0)+(185*0))
= 7262.5cm
7262.5/45 = 161cm (to 3 sig. fig)
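The grouped mean calculations above can be reproduced with a short Python sketch. The function name is made up for illustration; the midpoints and frequencies are the ones used in the working above.

```python
def grouped_mean(midpoints, frequencies):
    """Return (sum of midpoint*frequency, that sum divided by the total frequency)."""
    total = sum(m * f for m, f in zip(midpoints, frequencies))
    return total, total / sum(frequencies)

midpoints = [140, 152.5, 157.5, 162.5, 167.5, 172.5, 177.5, 185]
male_total, male_mean = grouped_mean(midpoints, [7, 1, 8, 9, 7, 7, 6, 2])
female_total, female_mean = grouped_mean(midpoints, [0, 5, 14, 16, 6, 4, 0, 0])

print(male_total, round(male_mean))      # 7670.0 163
print(female_total, round(female_mean))  # 7262.5 161
```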
The histograms and probability distribution curves show that the males have a larger range of heights than the girls, and have both lower and higher values than the girls. The girls’ heights are more concentrated around 160cm, and the boys’ are mainly spread over 150-180cm. The means that I calculated both support my hypothesis that boys would be, on average, taller than girls; however, the difference in both cases is only 2cm, which is less than I expected.
Hypothesis 2:
I firstly produced the new table for year 7 and 11 foot sizes, which excludes the outliers and is in ascending order.
I then grouped the data into groups of varying sizes to allow me to draw comparative pie charts.
To draw comparative pie charts, I must make sure that the areas of the two pie charts are in proportion to the total pieces of data in my sample. The radius for my Year 7 pie chart will be 1.5cm, which is an area of 2.25π (using the formula π r²), and the ratio between the amount of Year 7 and Year 11 values that I have is 45:44, or 1:0.977.
Therefore as 1*2.25π=2.25π then 0.977*2.25π = 2.19825π so using the formula A=π r² I can work out that the radius for the Year 11 pie chart is the square root of 2.19825, which is 1.48cm (to 3 sig.fig.). I used my computer to construct these pie charts:
Year 7 Foot Size Pie Chart Year 11 Foot Size Pie Chart
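The radius calculation for comparative pie charts can be sketched in Python. Because area is proportional to r², the second radius is the first radius multiplied by the square root of the ratio of the sample sizes. The function name is made up for illustration, and this version uses the exact ratio 44/45 rather than the rounded 0.977, which gives the same radius to 3 significant figures.

```python
import math

def comparative_radius(base_radius, base_count, other_count):
    """Radius of a second pie chart whose area is in proportion to its sample size."""
    return base_radius * math.sqrt(other_count / base_count)

year11_radius = comparative_radius(1.5, 45, 44)
print(round(year11_radius, 2))  # 1.48
```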
I then worked out the mean of both the raw and grouped data for Year 7 and Year 11 foot size. To work out the means for the raw and grouped data I used the same methods as in hypothesis 1.
Year 7 raw data mean: 1042.9/45 = 23.2cm (to 3 sig.fig.)
Year 11 raw data mean: 1124.5/44 = 25.6cm (to 3 sig. fig.)
Year 7 grouped data mean:
((21*14)+(22.5*11)+(23.5*8)+(24.5*6)+(26.5*6))/45 = 23.0cm (to 3 sig. fig.)
Year 11 grouped data mean:
((21*3)+(22.5*8)+(23.5*3)+(24.5*6)+(26.5*24))/44 = 24.9cm (to 3 sig. fig.)
The pie charts clearly show that in Year 11 there are more people with feet that are between 25 and 28cm long than in Year 7, and fewer people with feet that are between 20 and 22cm long than in Year 7. Therefore people in Year 11 have larger feet than people in Year 7, and this is supported numerically by the means that I calculated in both the grouped and raw data cases. Therefore, my hypothesis stating that, on average, children in Year 7 will have smaller feet than children in Year 11 has been proven correct.
Hypothesis 3
As in the other two hypotheses, I started off by producing a table of the wrist circumference and the thumb circumference, without outliers.
I then used my computer to draw a scatter graph of the data.
I then calculated the averages (mean, median, and mode) and range of each set of data.
Wrist Circumference –
Mean: (using the same method for raw data as in previous hypotheses)
7860/50 = 157.2cm
Median: (using the same method as previously)
25th value = 160cm
Mode: (the most frequently occurring value)
=150cm
Range: (highest value – the lowest value)
200-120 = 80cm
Thumb Circumference –
Mean: 2739/50 = 54.78cm
Median: 25th value = 60cm
Mode: = 70cm
Range = 80 – 20 = 60cm
In my opinion, the median is the best of these averages, as the mean (although the most ‘accurate’) can be easily distorted by extreme values, and the mode cannot really be used to help me analyse the data any further.
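The three averages and the range can also be worked out with Python's standard `statistics` module. The five values below are hypothetical and are only there to show the calls; they are not the census readings.

```python
import statistics

# Hypothetical measurements, used only to demonstrate the functions.
values = [120, 150, 150, 160, 200]

mean = statistics.mean(values)           # sum of the values divided by how many there are
median = statistics.median(values)       # middle value once the data is sorted
mode = statistics.mode(values)           # most frequently occurring value
value_range = max(values) - min(values)  # highest value minus lowest value

print(mean, median, mode, value_range)  # 156 150 150 80
```

Note that `statistics.median` averages the two middle values when the number of values is even, so for 50 values it may differ slightly from simply taking the 25th value.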
I then calculated the spread of the data by working out the inter-quartile range and standard deviation of the data. The inter-quartile range shows the range between the value one quarter of the way through the data and the value three-quarters of the way through the data. (I have already calculated the inter-quartile ranges when I was removing outliers from the data, and as I found no outliers in either the Wrist Circumference or the Thumb Circumference data, those values can be used again now.)
Inter-quartile range for Wrist Circumference = 22.5cm
Inter-quartile range for Thumb Circumference = 27.5cm
To work out the standard deviation of each set of data you use the formula √((∑x²)/n - x̄²), where x̄ is the mean and n is the number of pieces of data. This means that every value is squared and the squares are added together, this total is divided by the number of pieces of data, and then the square of the mean is subtracted. This gives the variance, which can be square rooted to find the standard deviation.
I then used this method to work out that the standard deviation of the Wrist Circumference is 21.0038cm, and that of the Thumb Circumference is 18.9698cm.
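A minimal Python sketch of this standard deviation calculation is given here, using the form "mean of the squares minus the square of the mean". The function name and the eight example values are hypothetical; the result is compared with `statistics.pstdev`, the standard library's population standard deviation, as a check.

```python
import math
import statistics

def standard_deviation(values):
    """Population standard deviation: sqrt(sum(x^2)/n - mean^2)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(x * x for x in values) / n
    variance = mean_of_squares - mean ** 2
    return math.sqrt(variance)

# Hypothetical data chosen so that the answer is a whole number.
example = [2, 4, 4, 4, 5, 5, 7, 9]
print(standard_deviation(example))  # 2.0
print(statistics.pstdev(example))   # 2.0
```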
I then attempted to find out if there was a correlation between the two sets of data using Spearman’s Rank Correlation Coefficient. I used the formula rs = 1-((6∑d²)/(n(n²-1))) to work this out, and the result that I produced was rs = 0.5768.
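The Spearman's Rank formula can be sketched in Python as shown here. This is a simplified version that assumes no two values in a data set are equal (tied values need averaged ranks, which this sketch does not handle). The function names and the five pairs of readings are hypothetical, not the census data.

```python
def ranks(values):
    """Rank the values from 1 (smallest) upwards; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rank(xs, ys):
    """rs = 1 - (6 * sum of d^2) / (n * (n^2 - 1)), where d is the difference in ranks."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical wrist and thumb readings, only to show how the function is used.
wrist = [150, 160, 170, 180, 190]
thumb = [55, 50, 65, 60, 70]
print(spearman_rank(wrist, thumb))  # 0.8
```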
Using a critical values table for Spearman’s Rank Correlation Coefficient (shown below), I saw that there was no significant correlation between my data, as 0.5768 is not greater than the critical value of 0.7647 or less than -0.7647.
Should there have been a correlation, I would have drawn a line of regression on my scatter diagram, however as there is no correlation I will not.
Therefore, my hypothesis has been shown incorrect, as there is no correlation between the Wrist Circumference and the Thumb Circumference, not even a weak one.
Conclusion
In summary, the first two of my hypotheses have been proven correct as shown by the graphs and calculations done in the Hypothesis 1 and Hypothesis 2 sections, however my third hypothesis was proven incorrect as shown by the graphs and calculations in the Hypothesis 3 section. In conclusion, I feel that I have successfully investigated the various statistics from the school census and have come to the conclusions shown in my investigations above.
Limitations and Extensions
The main limitation that I faced during this investigation was time. I feel that, had I had more time, I could have investigated my hypotheses further and come up with other hypotheses to test regarding the school census. Some extensions that I could attempt in the future include retesting my hypotheses using much larger samples of data (200 or more) to build up a more accurate picture of the data, trying different types of graphs to compare and show the data (such as cumulative frequency graphs), and using methods similar to those in my investigation above to test other variables from the census (such as height of belly button).