Hypothesis 1
I plotted three scatter diagrams: one for year 7, one for year 9 and one for year 11. In all three diagrams, there seems to be a negative correlation as when the 100m time increases, the distance jumped tends to decrease. This means that the faster somebody runs (the lower the 100m time), the further s/he will jump (the higher the long jump distance).
To check for any outliers in my data, I used Microsoft Office Excel’s QUARTILE and MEDIAN functions to quickly find the median and the quartiles. I used $Q_3 + 1.5 \times IQR$ and $Q_1 - 1.5 \times IQR$ to define whether data is an outlier or not. Any value which is larger than $Q_3 + 1.5 \times IQR$ or smaller than $Q_1 - 1.5 \times IQR$ is an outlier.
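A minimal sketch of this outlier check in Python, assuming the usual 1.5 × IQR fences. The function name `find_outliers` is illustrative, and `statistics.quantiles` with `method="inclusive"` is assumed to interpolate the quartiles in the same way as Excel’s QUARTILE. The times are the year 11 100m times read from the stem and leaf diagram in Hypothesis 2.

```python
from statistics import quantiles

def find_outliers(values):
    """Return the values lying outside the 1.5 x IQR fences."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in sorted(values) if v < lower_fence or v > upper_fence]

# Year 11 100m times in seconds, read from the stem and leaf diagram
year11_times = [
    13.2, 13.4, 13.5, 13.8, 13.8, 13.8, 13.9, 13.9, 13.9,
    14.1, 14.4, 14.4, 14.5, 14.5, 14.7, 14.8, 14.8, 14.9, 14.9,
    15.0, 15.5, 15.6, 15.7, 15.8, 15.8, 15.8,
    16.0, 16.1, 16.4, 16.5, 16.5, 16.8,
    17.1, 17.6, 18.3, 18.3, 18.8, 19.2, 20.7, 22.8,
]

print(find_outliers(year11_times))
```

With these times the sketch flags 20.7s and 22.8s, since both lie above the upper fence.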
From the table above, there are a few entries in years 9 and 11 which are outliers, which I will discard from my sample. Four entries were discarded in total:
I plotted the double mean point and the quadrants using the Autograph software for each graph to see if there was any relationship between the 100m times and the long jump distances. By doing this, I will find out if there is any correlation between the two sets of data.
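The quadrant check can be sketched in Python as follows. This is only an illustration of the idea: the function name `quadrant_counts` and the five data pairs are hypothetical and are not taken from the data sample or from Autograph.

```python
from statistics import mean

def quadrant_counts(times, distances):
    """Count points in each quadrant around the double mean point."""
    mean_time = mean(times)
    mean_distance = mean(distances)
    counts = {"top_left": 0, "top_right": 0, "bottom_left": 0, "bottom_right": 0}
    for t, d in zip(times, distances):
        if t == mean_time or d == mean_distance:
            continue  # points on a dividing line belong to no quadrant
        vertical = "top" if d > mean_distance else "bottom"
        horizontal = "left" if t < mean_time else "right"
        counts[f"{vertical}_{horizontal}"] += 1
    return (mean_time, mean_distance), counts

# Hypothetical values for illustration only (100m time in s, long jump in m)
times = [14.0, 15.0, 16.0, 17.0, 18.0]
distances = [4.4, 4.2, 3.6, 3.7, 3.1]

mean_point, counts = quadrant_counts(times, distances)
print(mean_point, counts)
```

If most points fall in the top left and bottom right quadrants, as they do here, that points towards negative correlation.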
As the majority of the points are in the top left and bottom right quadrants, it is likely that all three graphs have negative correlation. To calculate the strength of the correlation, I used Spearman’s rank correlation coefficient, which is a number between -1 and 1 which shows the strength of the correlation: 1 being perfect positive correlation and -1 being perfect negative correlation. Spearman’s rank correlation coefficient ($r_s$) is calculated as: $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$, where $d$ is the difference between the two ranks of each entry and $n$ is the number of entries.
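As a hedged illustration, the standard formula $r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$ can be applied in Python as below. The helper names and the five data pairs are hypothetical. Tied values are given their mean rank, although the formula is only exact when there are no ties.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rank correlation using 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

# Hypothetical values for illustration only (100m time in s, long jump in m)
times = [14.0, 15.0, 16.0, 17.0, 18.0]
distances = [4.4, 4.2, 3.6, 3.7, 3.1]
print(spearman(times, distances))
```

For these made-up values the rank differences are -4, -2, 1, 1 and 4, so the sum of $d^2$ is 38 and the coefficient is 1 - 228/120 = -0.9.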
Below is the Spearman’s rank correlation coefficient calculated for Year 7:
As the correlation coefficient for all years is between -0.5 and -1, all three graphs have negative correlation. Years 7 and 9 have weak negative correlation as the coefficient is closer to -0.5 than -1 whereas year 11 has strong negative correlation as its coefficient is closer to -1 than -0.5. This means that when a person in years 7 or 9 runs faster, it is somewhat likely that s/he will jump further, although this is not always the case. A person in year 11 who runs faster has a high probability of jumping further. On all three graphs, I drew a line of best fit.
The line of best fit is given by $y = mx + c$, where x is the 100m time and y is the long jump distance. The “c” shows that if x (or m) were 0, y would equal c. So if a year 7’s 100m time was 0, theoretically his/her long jump distance would be 5.108m. However, this would not be true practically as it is impossible to score 0 seconds in the 100m. The “m” shows that if x increased or decreased by 1, y would increase or decrease by m. So if a year 11’s 100m time was 15s, his/her long jump distance would be 3.986m (-4.491 + 8.477, where -4.491 is 15 × -0.2994) whereas if another year 11’s 100m time was 16s, his/her long jump distance would be 3.6866m (-4.7904 (-4.491 + -0.2994) + 8.477). Although the equation of the line of best fit allows estimation to a number of decimal places, in reality, the actual distance will not be that specific value. The actual distance will be roughly around that value, but due to the number of factors involved (height, weight, technique, fitness etc.), it is impossible to predict an exact distance for a person.
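A small Python sketch of how the line of best fit can be used for estimation. It assumes, as in the year 7 intercept example, that x is the 100m time and y is the long jump distance, and it uses the year 11 gradient (-0.2994) and intercept (8.477) quoted above. The function name `predict_distance` is illustrative.

```python
def predict_distance(time_s, gradient=-0.2994, intercept=8.477):
    """Predicted long jump distance (m) from a 100m time (s) using y = mx + c."""
    return gradient * time_s + intercept

# Each extra second on the 100m time changes the estimate by the gradient
for time_s in (15, 16):
    print(time_s, round(predict_distance(time_s), 4))
```

The estimates are only meaningful for times inside the range of the sample, since the line was fitted to those times alone.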
In conclusion, as the speed of a person increases (i.e. the 100m time is lower), the distance jumped tends to be longer. This supports the original hypothesis. However, predictions of a person’s 100m time based on their long jump distance (or vice-versa) will only be reasonably accurate within the range of the data collected and if that person attends RGS.
Hypothesis 2
I drew a stem and leaf diagram to check the quartiles, the median and the outliers for the 100m times. This is to ensure that the values calculated by Microsoft Office Excel were correct. These values were correct apart from Q1 and Q3, which were different by about 0.025. However, this did not affect which values were outliers.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Year 7:
15:| 40, 60,
|=================================
16:| 10, 20, 30, 40, 50, 90,
|=================================
17:| 00, 10, 10, 10, 20, 40, 80, 80,
|=================================
18:| 10, 30, 40, 40, 50, 60, 70,
|=================================
19:| 20, 20, 60, 70, 90,
|=================================
20:| 20, 70, 80, 90,
|=================================
21:| 00, 00, 40, 70,
|=================================
22:| 30, 73,
|=================================
23:|
|=================================
24:| 00, 30,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Year 9:
14:| 16, 20, 60, 64, 70,
|=====================================================
15:| 00, 30, 40, 48, 50, 90,
|=====================================================
16:| 00, 10, 10, 40, 42, 53, 76, 80, 89,
|=====================================================
17:| 00, 05, 10, 26, 40, 40, 42, 48, 67, 67, 80, 90, 90,
|=====================================================
18:| 20, 40, 50, 50, 60,
|=====================================================
19:|
|=====================================================
20:| 70,
|=====================================================
21:| 70,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Year 11:
13:| 20, 40, 50, 80, 80, 80, 90, 90, 90,
|=========================================
14:| 10, 40, 40, 50, 50, 70, 80, 80, 90, 90,
|=========================================
15:| 00, 50, 60, 70, 80, 80, 80,
|=========================================
16:| 00, 10, 40, 50, 50, 80,
|=========================================
17:| 10, 60,
|=========================================
18:| 30, 30, 80,
|=========================================
19:| 20,
|=========================================
20:| 70,
|=========================================
21:|
|=========================================
22:| 80,
I then drew box plots for the 100m times of years 7, 9 and 11. It is visually evident that there is a positive skew in years 7 and 11 and a negative skew in year 9. The box plots also shift towards the left as the year increases, suggesting that the times improve as pupils get older. The inter-quartile range and range for year 9 are smaller than those for year 7. This shows that the year 9 times are more tightly clustered than the year 7 times. The inter-quartile range in year 11 is larger than that in year 9, but smaller than that in year 7. A reason for this could be the two outliers, which would affect the quartiles and the median.
In general, the box plots, including the quartiles, median, highest and lowest non-outlier values shift towards the left as the year increases. The inter-quartile range and range might also decrease as the year increases. This shows that 100m times generally improve as someone grows older.
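The values behind the box plots can be recomputed from the stem and leaf diagrams with a short Python sketch. The helper names are illustrative, the quartiles are assumed to be interpolated as in Excel’s QUARTILE (`method="inclusive"`), and the summaries are taken over all 40 times in each year, including the outliers.

```python
from statistics import quantiles

def expand(stem_and_leaf):
    """Turn {stem: [two-digit leaves]} into sorted times, e.g. 15 | 40 -> 15.40s."""
    return sorted(stem + leaf / 100
                  for stem, leaves in stem_and_leaf.items() for leaf in leaves)

def five_number_summary(times):
    """Return (minimum, Q1, median, Q3, maximum) for one year group."""
    q1, median, q3 = quantiles(times, n=4, method="inclusive")
    return min(times), q1, median, q3, max(times)

# 100m times copied from the stem and leaf diagrams above
years = {
    "Year 7": expand({
        15: [40, 60], 16: [10, 20, 30, 40, 50, 90],
        17: [0, 10, 10, 10, 20, 40, 80, 80], 18: [10, 30, 40, 40, 50, 60, 70],
        19: [20, 20, 60, 70, 90], 20: [20, 70, 80, 90],
        21: [0, 0, 40, 70], 22: [30, 73], 24: [0, 30]}),
    "Year 9": expand({
        14: [16, 20, 60, 64, 70], 15: [0, 30, 40, 48, 50, 90],
        16: [0, 10, 10, 40, 42, 53, 76, 80, 89],
        17: [0, 5, 10, 26, 40, 40, 42, 48, 67, 67, 80, 90, 90],
        18: [20, 40, 50, 50, 60], 20: [70], 21: [70]}),
    "Year 11": expand({
        13: [20, 40, 50, 80, 80, 80, 90, 90, 90],
        14: [10, 40, 40, 50, 50, 70, 80, 80, 90, 90],
        15: [0, 50, 60, 70, 80, 80, 80], 16: [0, 10, 40, 50, 50, 80],
        17: [10, 60], 18: [30, 30, 80], 19: [20], 20: [70], 22: [80]}),
}

summaries = {name: five_number_summary(times) for name, times in years.items()}
for name, (low, q1, median, q3, high) in summaries.items():
    print(name, low, q1, median, q3, high, "IQR:", round(q3 - q1, 4))
```

Comparing the medians and inter-quartile ranges printed for each year gives a numerical check on what the box plots show visually.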
Hypothesis 3
From the box plots, I can see that the data for year 7 is positively skewed as the median is closer to the lower quartile than the upper quartile. The year 9 data is negatively skewed as the median is closer to the upper quartile than the lower quartile. The year 11 data is slightly positively skewed, albeit not as much as the year 7 data. The year 11 data is closest out of the three years to having a symmetrical box plot, so I will use the year 11 100m time data to find out whether it follows a normal distribution.
To check that the year 11 data is the least skewed, I will use the following formula to calculate how much the 100m time data for years 7, 9 and 11 is skewed by: . The further away from 0 the value is, the more skewed the data is.
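The formula referred to here is not reproduced in this copy, so the sketch below is only an assumption about one way such a check could be done. It uses the quartile coefficient of skewness, $(Q_3 - 2Q_2 + Q_1) / (Q_3 - Q_1)$, which matches the quartile-based reasoning above but may not be the formula actually used. The function name and the example quartiles are hypothetical.

```python
def quartile_skewness(q1, median, q3):
    """Quartile coefficient of skewness: (Q3 - 2*Q2 + Q1) / (Q3 - Q1).

    Positive when the median is nearer Q1, negative when it is nearer Q3,
    and 0 when the median is exactly halfway between the quartiles.
    """
    return (q3 - 2 * median + q1) / (q3 - q1)

# Hypothetical quartiles for illustration only
print(quartile_skewness(10, 12, 16))  # median nearer the lower quartile
print(quartile_skewness(10, 14, 16))  # median nearer the upper quartile
print(quartile_skewness(10, 13, 16))  # symmetrical box
```

Whichever measure is used, the year with the value nearest 0 is the least skewed.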
I used the Autograph graphing software to draw my histogram. I inserted the year 11 100m time data and manually edited the class intervals. I kept re-editing the class intervals by trial and error until the histogram looked reasonably symmetrical. To achieve this, I needed to have smaller class intervals near the median (15.25) and larger class intervals for data further away from the median. This would cause the class intervals further away from the median to appear shorter and wider, and so increase its similarity to a bell-shaped curve, which is a feature of a normal distribution. At first, I made all the class intervals the same to see which groups needed to be wider.
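The reason wider classes appear shorter can be sketched in Python, assuming the bar heights are frequency densities (frequency divided by class width), as is standard for histograms with unequal class widths. The times and class boundaries below are hypothetical and are not the ones entered into Autograph.

```python
def frequency_densities(values, edges):
    """Return (lower, upper, frequency, frequency density) for each class.

    A value v is counted in a class when lower <= v < upper.
    """
    rows = []
    for lower, upper in zip(edges, edges[1:]):
        frequency = sum(1 for v in values if lower <= v < upper)
        rows.append((lower, upper, frequency, frequency / (upper - lower)))
    return rows

# Hypothetical times (s) and class boundaries, for illustration only
times = [13.2, 13.8, 14.1, 14.4, 14.9, 15.0, 15.5, 15.8, 16.5, 17.6, 18.3, 19.2]
edges = [13, 14, 15, 16, 18, 20]

rows = frequency_densities(times, edges)
for lower, upper, frequency, density in rows:
    print(f"{lower}-{upper}: frequency {frequency}, density {density}")
```

In this made-up example the 13-14 class and the 16-18 class both contain two times, but the wider class has half the frequency density, so its bar is half as tall.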
The first histogram had a class width of 2 whereas the second histogram had a class width of 1. The second histogram shows that most of the data is between 13 and 14 (the mode is 13.8). As there is a positive skew, the mode and median (14.95) are smaller than the mean (15.42). Because of this, it can be suggested that the data does not follow a normal distribution. To see if this is the case, I worked out the standard deviation for this set of data using the below formula.
Although only 65.8% of the data lies within 1 standard deviation of the mean (compared with about 68% for a normal distribution), 97.4% of data and all data lie within 2 and 3 standard deviations respectively (compared with about 95% and 99.7%). Because these proportions are close to the expected values, the data just about follows a normal distribution.
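A minimal Python sketch of this kind of check, assuming the population form of the standard deviation (`statistics.pstdev`). The function name `share_within` and the eight sample values are hypothetical, so the proportions printed are not the ones quoted above.

```python
from statistics import mean, pstdev

def share_within(values, k):
    """Fraction of values within k population standard deviations of the mean."""
    centre = mean(values)
    spread = pstdev(values)
    inside = [v for v in values if abs(v - centre) <= k * spread]
    return len(inside) / len(values)

# Hypothetical sample for illustration only: mean 5, standard deviation 2
sample = [2, 4, 4, 4, 5, 5, 7, 9]
for k in (1, 2, 3):
    print(k, share_within(sample, k))
```

The results can then be compared with the approximate 68%, 95% and 99.7% expected for a normal distribution.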
The final histogram is as shown below (larger version can be found in Appendix):
I used class widths that decreased in size the closer they were to the mean. It is evident that there is a positive skew as the highest bar is the bar to the left of the bar containing the mean. The mean and median are also in different bars. Although the histogram is not symmetrical, the tallest bar is in the centre with bars each side of it decreasing in size which can suggest a normal distribution is possible with this data.
Conclusion and evaluation
In conclusion, there is a negative correlation between the 100m times and the long jump distances. This shows that as long jump distances increase, the 100m times are lower. The box plots of the 100m times, although skewed, shift towards a smaller time as the year increases, showing that the times improve over time as someone grows older. The 100m times for year 11 just about follow a normal distribution, despite having a positive skew and only 65.8% of data within ±1 standard deviation.
This data has a number of limitations. For example, the results are only valid for one school (RGS) and for one gender (boys). Values may differ between different schools, areas and genders. An improvement could be to include results from all schools in the area, or from all schools in the country if possible, and to include results from girls as well.
The data was also taken from a secondary source. This could affect the results as the person(s) collecting the data might have made errors. To eliminate sources of error as far as possible, primary data could have been used or I could have been physically present when the results were recorded to ensure there were no errors.
Only three years (years 7, 9 and 11) were used in my sample. It is possible that there are errors in one of these years, which could affect the results. Including all years in my sample might highlight these errors, which might improve the accuracy of the graphs, box plots and histograms.
Overall, I think that this investigation is valid only for male students at RGS, but can be further improved in a number of ways.
Appendix
Data sample: