I will analyse the distribution of 100 metre times by grouping the data into class intervals, calculating the frequency density for each group, making a grouped data table and then producing a histogram with Autograph. I can use this to see whether the data looks symmetrical, which would suggest that the distribution is likely to be normal. I will then test this by comparing the mean, median and mode of the grouped data, as a normal distribution would mean these values were approximately the same. Finally I can work out the standard deviation and look at the percentage of data within 1, 2 and 3 standard deviations of the mean.
Data Collection
For each hypothesis I will need to decide upon how much, and which data to use in graphs and diagrams. As there is a very large amount of data, it is necessary to only use parts of it for some hypotheses.
For the first hypothesis, using the scatter diagram, I will use a data sample of 100 items. This will be stratified by age from years 7 to 10, as there are no height or weight values for students in year 11, so BMI cannot be calculated for them. I will use the number of students with height, weight and bleep test values in each year to work out how many pupils from that year go into the sample.
This table shows the number of pupils with all three values for each year.
In order to collect the data for each sample, i.e. the 26 from the 75 year 7 pupils, I used a simple random sampling method. I gave each pupil of the 75 a number from 1 to 75. I then used a random number generator to generate 26 numbers between 1 and 75, choosing the corresponding pupils to use.
I repeated this for each of years 7, 8, 9 and 10, leaving me with my sample of 100.
As I was sampling the data, if I had a random number repeated I just generated another one.
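The year 7 draw described above can be sketched in Python. This is only an illustration, not the random number generator actually used: the function name and the seed are assumptions made for the example. Sampling without replacement has the same effect as discarding a repeated number and generating another one.

```python
import random


def sample_pupils(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct pupil numbers from 1..population_size."""
    rng = random.Random(seed)
    # rng.sample never returns the same number twice, so no pupil
    # can be chosen more than once.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# 26 pupils from the 75 year 7 pupils; the seed is arbitrary and only
# makes the example repeatable.
year7_sample = sample_pupils(75, 26, seed=1)
print(year7_sample)
```

The same call, with each year's own pupil count and sample size, would cover years 8, 9 and 10.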
My data sample for the first hypothesis is therefore random and should be free from bias. As it was made using stratified sampling it should give a fair representation of the whole school’s data.
For the second hypothesis I shall use all of the year 9 data for shot putt, as year 9 has the most students with a recorded distance for each of years 7, 8 and 9. This gives me a sample of 82 students.
To test the third hypothesis I shall use all of the students with 100 metre times from all years, as this will give me the biggest chance of being able to draw a valid conclusion from the histograms. This sample includes: 96 students from year 7; 96 students from year 8; 96 students from year 9; 95 students from year 10; and 96 students from year 11, a total of 479 100 metre times.
Hypothesis 1
I entered the sampled data into Autograph as raw data, with BMI as the x value and bleep test score as the y value. I also found the double mean point, which splits the graph into quadrants.
The data clearly showed almost no correlation, so it would be pointless to calculate Spearman’s rank correlation coefficient. Instead I made a box plot and a dot plot of BMI first, in order to study outliers.
This suggests that there may be outliers towards the higher end of BMI. To calculate if there are outliers I used:
Any value > Q3 + 1.5 x IQR or < Q1 – 1.5 x IQR is an outlier.
In the data Lower Quartile = 17.51
Median =19.05
Upper Quartile =20.83
Therefore the inter quartile range is 3.32. 1.5 X 3.32 = 4.98
17.51 – 4.98 = 12.53 therefore there are no lower outliers
20.83 + 4.98 = 25.81 so the data highlighted are upper outliers.
12: 1.6 2.69
15: 0.11 0.15 0.35 0.41 0.42 0.73 0.82 1 1.2 1.23 1.44 1.44 1.46 1.63 1.65 1.87 1.97 2.04 2.07 2.09 2.1 2.31 2.51 2.52 2.63 2.67 2.75
18: 0.03 0.05 0.05 0.14 0.31 0.34 0.36 0.43 0.49 0.52 0.61 0.66 0.72 0.73 0.78 0.8 1 1.03 1.03 1.05 1.05 1.05 1.13 1.13 1.15 1.29 1.38 1.49 1.56 1.59 1.61 1.68 1.75 1.84 1.91 1.94 2.08 2.14 2.16 2.28 2.37 2.37 2.44 2.57 2.7 2.83 2.9 2.96
21: 0.13 0.46 0.79 0.88 1.15 1.16 1.16 1.53 1.59 1.86 2.04 2.15 2.23 2.42
24: 0.03 0.62 1.04 1.83 1.88 2.09
27: 0.47 0.78
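The outlier boundaries worked out above can be checked with a short Python sketch. The function name is made up for the example, and the BMI values are read from the listing on the assumption that each row gives a starting value followed by the amounts to add to it (so "24: 1.83" is read as 25.83). Only the lowest and highest rows are included, as these are the only values near the boundaries.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) boundaries Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of BMI from the text.
lower, upper = outlier_fences(17.51, 20.83)
print(round(lower, 2), round(upper, 2))

# Values from the "12", "24" and "27" rows (assumed reading of the listing).
lowest = [13.6, 14.69]
highest = [24.03, 24.62, 25.04, 25.83, 25.88, 26.09, 27.47, 27.78]

outliers = [v for v in lowest + highest if v < lower or v > upper]
print(outliers)
```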
I then repeated this for the bleep test scores, making another dot plot and box and whisker diagram.
As there are clearly no great outliers I did not need to calculate the boundaries.
I then created another scatter diagram with the outliers removed.
There was still very little correlation, so my hypothesis is not supported: in this sample, bleep test score does not appear to be related to BMI.
Hypothesis 2
For the second hypothesis I entered the data from my appendix and created three box and whisker diagrams.
The blue diagram, of data from year 9, appears to have outliers that should be removed.
I used the same method as in hypothesis 1.
Any value > Q3 + 1.5 x IQR or < Q1 – 1.5 x IQR is an outlier.
In the data Lower Quartile = 5.5
Median = 6.15
Upper Quartile = 6.925
Therefore the inter quartile range is 1.425.
1.5 X 1.425 = 2.1375
5.5 – 2.1375 = 3.3625, so any value below 3.3625 is a lower outlier
6.925 + 2.1375 = 9.0625, so any value above 9.0625 is an upper outlier.
3: 0.8
4: 0.2 0.4 0.7 0.7 0.8 0.8 0.95
5: 0 0 0 0.2 0.3 0.3 0.3 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.7 0.7 0.7 0.75 0.8 0.9
6: 0 0 0 0 0.1 0.1 0.1 0.1 0.2 0.4 0.4 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.6 0.6 0.6 0.6 0.75 0.8 0.8 0.9 0.9
7: 0 0 0.2 0.3 0.3 0.4 0.5 0.5 0.5 0.5
8: 0 0 0 0 0.2 0.3 0.3 0.5 0.7
9: 0.5
I discarded these data then looked at the yellow year 8 diagram.
In the data Lower Quartile = 5.25
Median = 6
Upper Quartile = 6.5625
Therefore the inter quartile range is 1.3125.
1.5 X 1.3125 = 1.96875
5.25 – 1.96875 = 3.28125 therefore there are no lower outliers
6.5625 + 1.96875 = 8.53125 and also no upper outliers
3: 0.8
4: 0 0 0.5 0.5 0.5 0.6 0.75 0.75 0.8
5: 0 0 0 0 0 0 0 0 0.2 0.25 0.25 0.25 0.3 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.6
6: 0 0 0 0 0 0 0 0 0 0 0 0.1 0.3 0.4 0.4 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.75 0.75 0.8
7: 0 0 0 0 0 0 0 0.1 0.2 0.3 0.5 0.5 0.75
8: 0 0.3 0.4 0.5
Finally I looked at the orange year 7 diagram as before.
In the data Lower Quartile = 4
Median = 5
Upper Quartile = 6
Therefore the inter quartile range is 2
1.5 X 2 = 3
4 – 3 = 1 so there are no lower outliers
6 + 3 = 9, so any value above 9 is an upper outlier.
2: 0.75
3: 0 0 0 0 0 0.25 0.5 0.5 0.5 0.5 0.5 0.5 0.5
4: 0 0 0 0 0 0 0 0 0 0 0 0.1 0.2 0.25 0.5 0.5 0.5 0.5 0.7 0.75 0.75
5: 0 0 0 0 0 0 0 0 0 0 0 0.2 0.25 0.25 0.5 0.5 0.5 0.5 0.6 0.75 0.8
6: 0 0 0 0 0 0 0 0 0 0.1 0.1 0.1 0.2 0.2 0.5 0.5 0.75
7: 0 0 0 0 0 0 0.3 0.5
8: 0.3
Having then discarded all of the outlying data I created the box and whisker diagrams again.
From the orange year 7 box to the year 8 box there is a definite increase in the median. The interquartile range is smaller in year 8, so the data is more closely grouped, and grouped towards a longer throw. The high points and low points are also both greater in year 8 than in year 7. Neither data set is particularly skewed.
The difference between years 8 and 9 is not as conclusive. The quartiles are all slightly higher, but the high point drops. This, however, is likely to be due to the difference in weight of the shot thrown: it increases between years 8 and 9, but not between years 7 and 8.
To conclude, the box and whisker diagrams support my hypothesis once the change in weight of the shot is taken into account.
Hypothesis 3
To create a histogram I made tables of grouped data with unequal intervals, looking at each year separately, so I can also observe how the distribution of data changes with age.
Firstly I ordered the data for 100 metre times of year 7 and found that the median is 18.25. I therefore chose my group intervals so that the classes are narrower near the median. I then entered the frequency of data in each group.
I then repeated this for each year, finding the median, then making groups based upon this and finding frequency.
Year 8
Median = 17.6
Year 9
Median = 16.42
Year 10
Median = 15.7
Year 11
Median = 15.2
I then used Autograph to create a histogram of frequency density for each year group (frequency density is equal to frequency divided by class width).
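As a small illustration of the frequency density calculation, here is a Python sketch. The class boundaries and frequencies are hypothetical values made up to show unequal class widths, and are not taken from my grouped data tables.

```python
def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)


# Hypothetical classes (lower bound, upper bound, frequency): wide classes
# at the ends and narrow classes in the middle. NOT the real tables.
classes = [
    (14.0, 17.0, 12),
    (17.0, 18.0, 10),
    (18.0, 19.0, 11),
    (19.0, 22.0, 9),
]

densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]
for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo} to {hi}: frequency {f}, frequency density {d}")
```

Because the bar height is the frequency density, the area of each bar (height multiplied by class width) gives back the frequency, which is what makes bars of different widths comparable.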
Year 7
Year 8
Year 9
Year 10
Year 11
None of the histograms is symmetrical, so my hypothesis is not supported: the data for 100 metre times does not appear to follow a normal distribution.
In years 7, 8 and 9, there are classes (bars) which have much lower frequency densities than they would if the data were normally distributed, and in year 9 there are several times at the very far right of the graph. The data does, however, become more evenly distributed in the older years.
This increase in normality of distribution with age could be because in years 7 and 8, for example, children are still growing rapidly and at different rates, whereas by year 11 their growth is starting to slow down and the gaps narrow as they even out in ability.
The skew of the data, which is almost normally distributed but with a tail of slower times to the right, can also be explained logically. A year group is far more likely to contain pupils who are much slower than the rest, for example due to injury or disease, than pupils who are much quicker than the rest of their year.
Another way to consider the distribution is to compare the mean, median and mode (in this case the modal class).
For year 7: Mean = 18.6771
Median = 18.25
Modal class = 17 – 17.5
For year 8: Mean = 17.9818
Median = 17.5833
Modal class = 18 – 18.5
For year 9: Mean = 17.0625
Median = 16.4375
Modal class = 16 – 16.5
For year 10: Mean = 16.1447
Median = 15.7679
Modal class = 15.5 – 16
For year 11: Mean = 15.4479
Median = 15.1538
Modal class = 14.5 – 15 & 15.5 – 16
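This broadly supports the finding that the data becomes more normally distributed as the age of the year group increases: the gap between the mean and the median is smallest in year 11 (about 0.29) and larger in year 7 (about 0.43), and the two modal classes in year 11 lie either side of the median. The trend is not perfectly steady, however, as year 9 has the largest gap (about 0.63). The mean is above the median in every year, which matches the tail of slower times seen in the histograms.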
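The gap between the mean and the median for each year can be tabulated with a short Python sketch, using the values listed above. The variable names are just for the example. A positive gap means the mean is pulled above the median, as happens when there is a tail of slower times.

```python
# (mean, median) of the 100 metre times for each year, from the lists above.
summary = {
    7: (18.6771, 18.25),
    8: (17.9818, 17.5833),
    9: (17.0625, 16.4375),
    10: (16.1447, 15.7679),
    11: (15.4479, 15.1538),
}

# Mean minus median, rounded to match the 4 decimal places of the inputs.
gaps = {year: round(mean - median, 4) for year, (mean, median) in summary.items()}

for year, gap in gaps.items():
    print(f"Year {year}: mean - median = {gap:.4f}")
```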
Finally I can consider distribution of data by looking at the percentage of data within standard deviations of the mean value.
The formula for standard deviation is σ = √( Σ(x – μ)² / n ), where μ is the mean and n is the number of values,
however Autograph calculates this for me.
If the data is normally distributed, approximately 68% of the data lies within one standard deviation of the mean
i.e. 68% lies within μ ± σ
Similarly about 95% lies within μ ± 2σ
And about 99.7% lies within μ ± 3σ
For year 7: Standard deviation = 2.1952
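The check against standard deviations can be sketched in Python as below. The function names are made up for the example. The year 7 mean (18.6771) and standard deviation (2.1952) are the values given earlier, and `percent_within` would need the full list of year 7 times, which is not reproduced here.

```python
def band(mean, sd, k):
    """Return the interval (mean - k*sd, mean + k*sd)."""
    return mean - k * sd, mean + k * sd


def percent_within(values, mean, sd, k):
    """Percentage of `values` lying within k standard deviations of the mean."""
    low, high = band(mean, sd, k)
    inside = sum(1 for v in values if low <= v <= high)
    return 100 * inside / len(values)


# Year 7 figures from the text.
year7_mean, year7_sd = 18.6771, 2.1952
for k in (1, 2, 3):
    low, high = band(year7_mean, year7_sd, k)
    print(f"Within {k} standard deviation(s): {low:.4f} to {high:.4f}")
```

The percentages returned by `percent_within` for k = 1, 2 and 3 can then be compared with the percentages expected for a normal distribution.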
Evaluation
Hypothesis 2 was the only hypothesis to be supported by the data. However, I was able to analyse why hypothesis 3 was incorrect, and also to look at links between the distribution and age.
Hypothesis 1 was incorrect. However, this was the least likely to be proven right, as BMI is a simple indication of something that is often too complicated to be shown in such a categorical way.
Overall the project therefore had mixed results. However, I was able to draw conclusions from all three hypotheses, which is a strong positive. I tried to make hypotheses that were not obvious, as there would be no point in stating obvious points and then proving them correct, so it is understandable that the whole project did not go completely smoothly.
To improve the investigation I would use a wider variety of results if possible, as there are obvious limitations with the data I used for this project. It is only from one school, and only from boys. There are also not very many pupils who have complete records, and many pieces of data are missing. I could, for example, use a national database with much more data, including results for both genders, to reduce the risk of anomalous graphs and to make the project more reliable and valid.