Maths Handling Data Coursework: Mayfield High School
Jonathan Nisner
For this statistical data-handling coursework I will investigate the heights and weights of students in years 7 to 11 at Mayfield High School. I will look for a trend in the heights and weights of the students to see whether the taller they are, the more they weigh. My null hypothesis is that there is no correlation between height and weight, and my alternative hypothesis is that there is a strong, positive correlation between them. I will then compare the heights of the boys in year 7 with those of the girls, and do the same for year 11, so that I can compare the two sets of results.
I am carrying out this investigation because I want to know, from my hypotheses, whether students in the older years should be separated from the younger students. To carry it out, I need the heights and weights of all the students at Mayfield High School in years 7 to 11 inclusive. Instead of collecting the data myself, I can find it on an exam board website. This data is reliable in that it is provided by the exam board and is based on real students. However, it is secondary data rather than primary data, since I am not collecting it myself, so it may be unreliable: the students may not have measured or weighed themselves on the day and may have guessed their measurements instead. Height and weight are continuous data, so only certain graphs and calculations are suitable.
I have decided to select a sample of 100 students because 1183 students is too many to handle. With that many I would not be able to compare the data well, since I would have such a wide range of results. I also would not be able to fit all the information on one graph or one box plot, making the comparison even more difficult. I will be able to cope with 100 students' heights and weights better.
In order to get a sample of 100 students, I will take a stratified sample of each year group, and then of each gender within the year groups. I will use stratified sampling because it gives a fair sample in which each year group is represented in proportion to its share of the whole school.
Once I have my stratified sample sizes for each year group, I will use random sampling to choose which students I will take my data from. This is a fair way of sampling because no biased decision is made about who is picked.
The following data is what I will use in order to take a stratified sample of 100 students:
Year Group   Number of Boys   Number of Girls   Total
7            151              131               282
8            145              125               270
9            118              143               261
10           106              94                200
11           84               86                170
Total        604              579               1183
To take a stratified sample of the year groups I will do the following calculations:
Year Group   Calculation          Total (answer)
7            (282/1183) x 100     = 24
8            (270/1183) x 100     = 23
9            (261/1183) x 100     = 22
10           (200/1183) x 100     = 17
11           (170/1183) x 100     = 14
Total                             = 100
To stratify the year groups into gender I will do the following calculations:
Year Group   Gender   Calculation        Total (answer)
7            Boys     (151/282) x 24     = 13
7            Girls    (131/282) x 24     = 11
8            Boys     (145/270) x 23     = 12
8            Girls    (125/270) x 23     = 11
9            Boys     (118/261) x 22     = 10
9            Girls    (143/261) x 22     = 12
10           Boys     (106/200) x 17     = 9
10           Girls    (94/200) x 17      = 8
11           Boys     (84/170) x 14      = 7
11           Girls    (86/170) x 14      = 7
I have rounded each answer to the nearest whole number because the number of people is discrete data: a fraction of a person is impossible.
Now that I know how many students I should take data from, I need to choose which students to use. In order to make this as fair as possible, I will choose students at random. I will do this by separating the students into year groups and then by gender. Then I will go down the list and record the height and weight of the student whose number comes up on the calculator when I do these calculations:
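The two rounds of stratification can be sketched in Python as a check on the arithmetic. This is only an illustration: the function name `stratified_allocation` is made up for this sketch, and the counts are the ones quoted in the tables above.

```python
# Illustrative sketch of proportional (stratified) allocation.
# stratified_allocation is a hypothetical helper, not part of any library.

def stratified_allocation(strata_sizes, sample_size):
    """Give each stratum its proportional share of the sample,
    rounded to the nearest whole person."""
    population = sum(strata_sizes.values())
    return {
        name: round(size / population * sample_size)
        for name, size in strata_sizes.items()
    }

# Numbers of students in each year group at Mayfield High School.
boys = {7: 151, 8: 145, 9: 118, 10: 106, 11: 84}
girls = {7: 131, 8: 125, 9: 143, 10: 94, 11: 86}
year_totals = {year: boys[year] + girls[year] for year in boys}

# Stage 1: share of the 100-student sample for each year group.
per_year = stratified_allocation(year_totals, 100)

# Stage 2: split each year group's share between boys and girls.
per_gender = {
    year: stratified_allocation(
        {"boys": boys[year], "girls": girls[year]}, per_year[year]
    )
    for year in per_year
}

for year in per_year:
    print(year, per_year[year], per_gender[year])
```

Rounding each stratum separately does not always add back up to the intended total, so it is worth checking the sum, as is done in the investigation.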
For year 7 boys: RAN# x 151 and press = 13 times.
For year 7 girls: RAN# x 131 and press = 11 times.
For year 8 boys: RAN# x 145 and press = 12 times.
For year 8 girls: RAN# x 125 and press = 11 times.
For year 9 boys: RAN# x 118 and press = 10 times.
For year 9 girls: RAN# x 143 and press = 12 times.
For year 10 boys: RAN# x 106 and press = 9 times.
For year 10 girls: RAN# x 94 and press = 8 times.
For year 11 boys: RAN# x 84 and press = 7 times.
For year 11 girls: RAN# x 86 and press = 7 times.
To be certain that I had the correct number of students, I added the answers to the above calculations up to check that I had 100 students.
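The calculator method could also be imitated in Python. The sketch below is an assumption about how one might do it, not the method used here: `pick_students` is a hypothetical helper, and unlike pressing RAN# repeatedly, `random.sample` never picks the same student twice.

```python
import random

def pick_students(population_size, how_many, seed=None):
    """Pick distinct student numbers from 1 to population_size at random."""
    rng = random.Random(seed)  # a fixed seed makes the choice repeatable
    return sorted(rng.sample(range(1, population_size + 1), how_many))

# Example: 13 of the 151 year 7 boys (the seed value is arbitrary).
year7_boys = pick_students(151, 13, seed=1)
print(year7_boys)
```

Each number in the result is the position of a student in the list for that year group and gender.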
I will collect the information from the chosen students and make a scatter diagram on the computer showing the correlation between the heights and weights of the sample of 100 students from Mayfield High School. I will use a scatter diagram because it shows the correlation between two variables well and the correlation can be seen at a glance. I will add the line of best fit and its equation. The data is on the next page.
The scatter diagram shows a fairly strong positive correlation, which is supported by the line of best fit. Its equation is y = 0.5078x + 136.04, where x is the weight in kilograms and y is the height in centimetres. The gradient of the line of best fit is positive, and with every extra kilogram the height increases by 0.5078 cm. I will now use the equation of the line of best fit to estimate the height of a student whose weight is 65 kg.
y = mx + c
y = 0.5078x + 136.04
y = 0.5078(65) + 136.04
y = 169.047
y = 169 cm (to the nearest centimetre)
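The same substitution can be written as a small Python function, which makes it easy to try other weights. The names `GRADIENT`, `INTERCEPT` and `predicted_height_cm` are chosen for this sketch only; the two numbers are the ones from the line of best fit.

```python
# Line of best fit from the scatter diagram: height = 0.5078 x weight + 136.04
GRADIENT = 0.5078    # centimetres of height per kilogram of weight
INTERCEPT = 136.04   # centimetres

def predicted_height_cm(weight_kg):
    """Estimate a student's height from their weight using the line of best fit."""
    return GRADIENT * weight_kg + INTERCEPT

estimate = predicted_height_cm(65)
print(round(estimate))  # height to the nearest centimetre
```

An estimate like this is only sensible for weights inside the range of the sample, since the line says nothing reliable beyond the data it was fitted to.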
My scatter diagram shows two obvious anomalies, which could be due to the students not writing the correct information about themselves.
I will now calculate Spearman's coefficient of rank correlation to get a more precise measure of the correlation between the heights and weights. My alternative hypothesis is that the heights and weights of students at Mayfield High School are positively correlated. My null hypothesis is that there is no correlation between the data. To calculate Spearman's rank, I will use the following formula:
$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
ρ = Spearman's coefficient of rank correlation (rho)
Σ = sum of (sigma)
d = difference between the two rankings of one item of data
n = number of items of data
Firstly, I will rank the heights with 1 being the shortest height and 100 being the tallest. Then I will rank the weights with 1 being the lowest weight and 100 being the highest. Then I will find the difference d between the two rankings of each student and square it. I will then add all the $d^2$ values together and use the formula to find Spearman's coefficient of rank correlation.
The value of ρ will always be between -1 and +1.
-1                                     0                                     +1
strong negative    weak negative       no             weak positive    strong positive
correlation        correlation         correlation    correlation      correlation
$$\rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$
$$\rho = 1 - \frac{6 \times 91740}{100(100^2 - 1)}$$
$$\rho = 0.44950495$$
The calculated value is 0.44950495, which is roughly midway between 0 and +1. This means that the heights and weights of the sample of 100 students from Mayfield High School are positively correlated, but neither strongly nor weakly. This shows that the taller a person is, the heavier they are likely to be, as I stated in my alternative hypothesis, although there can be exceptions. I will now look up the critical value of ρ in a table of critical values at the 1% significance level. This means that there would be only a 1% chance of getting a value as high as the critical value if there were really no correlation.
n = 100, and in the table a sample size of 100 gives 0.2327 as the critical value of ρ. Since 0.44950495 is higher than 0.2327, I can reject the null hypothesis and conclude that the heights and weights of students at Mayfield High School follow a positive trend.
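As a check on the calculation, the formula and the comparison with the critical value can be sketched in Python. The function names are made up for this sketch. The helper `average_ranks` gives tied values the mean of the ranks they share, which is an assumption about how ties are handled, and the simple formula is only exact when there are no ties.

```python
def spearman_rho(sum_d_squared, n):
    """Spearman's rank correlation from the sum of squared rank differences."""
    return 1 - (6 * sum_d_squared) / (n * (n ** 2 - 1))

def average_ranks(values):
    """Rank values from 1 (smallest) upwards; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def spearman_from_data(xs, ys):
    """Rank both lists, sum the squared differences, then apply the formula."""
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return spearman_rho(sum_d_squared, len(xs))

# Values from the investigation: sum of d squared = 91740, n = 100.
rho = spearman_rho(91740, 100)
CRITICAL_VALUE = 0.2327  # table value quoted above for n = 100 at the 1% level
reject_null = rho > CRITICAL_VALUE
print(round(rho, 8), reject_null)
```

`spearman_from_data` takes two equal-length lists, for example the 100 heights and the 100 weights in the same student order.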
My second hypothesis is that the girls will be taller than the boys in year 7.
For my second hypothesis I will investigate the heights of the boys and girls in year 7. In order to do this I will take a sample of 60 year 7 students from Mayfield High School, because 282 students is too much data to handle and I would not be able to fit them all on a graph or box plot. I will be able to compare the heights of 60 students better than those of 282. I will take my sample by stratifying the boys and girls in the year. By using stratified sampling I will get a fair sample, since the numbers of boys and girls in the sample will be in proportion to the numbers of boys and girls in the year group. This is how I will take a stratified sample:
Year 7:
Boys: (151/282) x 60 = 32.12765957 = 32 boys (to the nearest whole number)
Girls: (131/282) x 60 = 27.87234043 = 28 girls (to the nearest whole number)
Now I will use random sampling in order to choose a sample of 28 girls and 32 boys out of the whole of year 7. This is how I will do it:
Boys: RAN# x 151 and press = 32 times
Girls: RAN# x 131 and press = 28 times
I will find the chosen boys and girls in the list of year 7 students on the Edexcel website and take their heights. This is continuous data. I will put the heights in order from smallest to largest and then put them into two cumulative frequency tables. From the tables I will be able to draw two cumulative frequency curves on one graph. From these I will be able to find the lower quartile, upper quartile and median for each of the curves for boys and girls. They will show me whether the boys or the girls are generally taller in the year.
This is my cumulative frequency table for the heights of boys in year 7 in Mayfield High School:
Height, x (cm)    Frequency   Cumulative Frequency
120 ≤ x < 130     0           0
130 ≤ x < 140     1           1
140 ≤ x < 150     6           7
150 ≤ x < 160     15          22
160 ≤ x < 170     9           31
170 ≤ x < 180     1           32
180 ≤ x < 190     0           32
This is my cumulative frequency table for the heights of girls in year 7 in Mayfield High School:
Height, x (cm)    Frequency   Cumulative Frequency
120 ≤ x < 130     1           1
130 ≤ x < 140     0           1
140 ≤ x < 150     4           5
150 ≤ x < 160     10          15
160 ≤ x < 170     10          25
170 ≤ x < 180     2           27
180 ≤ x < 190     1           28
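The way a cumulative frequency table is built can be sketched in Python. This is an illustration only: `cumulative_frequency_table` is a hypothetical helper, and the list holds the sampled year 7 girls' heights converted to centimetres.

```python
def cumulative_frequency_table(heights_cm, start, stop, width):
    """Count heights in each class lower <= x < upper and keep a running total."""
    rows = []
    running_total = 0
    for lower in range(start, stop, width):
        upper = lower + width
        frequency = sum(1 for h in heights_cm if lower <= h < upper)
        running_total += frequency
        rows.append((lower, upper, frequency, running_total))
    return rows

# Heights of the 28 sampled year 7 girls, in centimetres.
girls_year7_cm = [
    125, 141, 143, 145, 148, 151, 152, 152, 152, 153, 155, 156, 156, 158,
    159, 160, 160, 160, 160, 161, 161, 162, 162, 163, 165, 170, 175, 180,
]

table = cumulative_frequency_table(girls_year7_cm, 120, 190, 10)
for lower, upper, frequency, cumulative in table:
    print(f"{lower} <= x < {upper}: frequency {frequency}, cumulative {cumulative}")
```

The last cumulative frequency should always equal the sample size, which is a quick way to check that no height has been missed or counted twice.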
I will draw two box plots, one for the boys' heights and one for the girls' heights. I will draw these because they are a good way of comparing the median, lower quartile and upper quartile of the boys' and girls' heights in year 7 of Mayfield High School, since they show the differences clearly. I will then find the interquartile ranges. By finding the IQR of the boys and of the girls, I will be able to compare the spread of each set of heights.
From the box plots that I drew, I found that the girls' median height, 160 cm, was larger than the boys' median height, 156.5 cm. This tells me that on average the girls in year 7 are taller than the boys. The girls' lower and upper quartiles were also higher than the boys', showing that on the whole the girls in year 7 are taller than the boys. The interquartile range of the boys' heights is 10.5 cm whereas the girls' is 12.5 cm. This tells me that the middle half of the boys have similar heights, whereas the girls' heights are slightly more spread out, meaning that they are not as similar to each other as the boys' heights are. From the box plots I can also see that the boys' heights are very slightly positively skewed, since the median is a little closer to the lower quartile than to the upper quartile, but the girls' heights are a little negatively skewed, since the median is a little closer to the upper quartile. This is another difference between the girls' and boys' heights in year 7. Both box plots are almost symmetrical, however, meaning that the heights are distributed almost equally on either side of the median.
I will now look to see if there are any outliers in my data, since any outliers may have distorted it. To do this I must multiply each interquartile range by 1.5. For the boys this is 10.5 x 1.5 = 15.75. For the girls this is 12.5 x 1.5 = 18.75. Any boy whose height is 15.75 cm or more below the lower quartile, or 15.75 cm or more above the upper quartile, is an outlier. Any girl whose height is 18.75 cm or more below the lower quartile, or 18.75 cm or more above the upper quartile, is an outlier. Here are the heights of the boys and girls in year 7:
Year Group   Gender   Height (m)     Year Group   Gender   Height (m)
7            Female   1.25           7            Male     1.36
7            Female   1.41           7            Male     1.40
7            Female   1.43           7            Male     1.41
7            Female   1.45           7            Male     1.42
7            Female   1.48           7            Male     1.46
7            Female   1.51           7            Male     1.47
7            Female   1.52           7            Male     1.48
7            Female   1.52           7            Male     1.48
7            Female   1.52           7            Male     1.50
7            Female   1.53           7            Male     1.51
7            Female   1.55           7            Male     1.51
7            Female   1.56           7            Male     1.52
7            Female   1.56           7            Male     1.52
7            Female   1.58           7            Male     1.53
7            Female   1.59           7            Male     1.54
7            Female   1.60           7            Male     1.54
7            Female   1.60           7            Male     1.54
7            Female   1.60           7            Male     1.54
7            Female   1.60           7            Male     1.55
7            Female   1.61           7            Male     1.57
7            Female   1.61           7            Male     1.58
7            Female   1.62           7            Male     1.58
7            Female   1.62           7            Male     1.59
7            Female   1.63           7            Male     1.61
7            Female   1.65           7            Male     1.61
7            Female   1.70           7            Male     1.61
7            Female   1.75           7            Male     1.61
7            Female   1.80           7            Male     1.62
                                     7            Male     1.62
                                     7            Male     1.65
                                     7            Male     1.68
                                     7            Male     1.73
Male: 151.5 - 15.75 = 135.75cm
Male: 162 + 15.75 = 177.75cm
Any boy that is 135.75cm or less, or 177.75cm or more is an outlier.
Female: 153.5 - 18.75 = 134.75cm
Female: 166 + 18.75 = 184.75cm
Any girl that is 134.75cm or less, or 184.75cm or more is an outlier.
There are no outliers for the boys, but there is one girl who is 125 cm tall, so she is the only outlier in year 7. This may have occurred because she did not know what her height really was, or she may really be that small. This may have distorted my data, but since there is only one outlier, probably not by much.
In conclusion, my graphs and calculations support my hypothesis. The girls in year 7 are taller than the boys. This may be because the girls have already started their growth spurts, whereas for the boys this will happen in a few years' time.
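The outlier check can be sketched in Python as follows. The function names are hypothetical, the quartiles are the ones read from the box plots, and the heights are the sampled year 7 values in centimetres. The sketch follows the rule used here, where a height exactly on a boundary counts as an outlier; some textbooks only count values strictly beyond the boundary.

```python
def outlier_fences(lower_quartile, upper_quartile):
    """Return the two boundaries: 1.5 x IQR below Q1 and 1.5 x IQR above Q3."""
    margin = 1.5 * (upper_quartile - lower_quartile)
    return lower_quartile - margin, upper_quartile + margin

def find_outliers(heights_cm, lower_quartile, upper_quartile):
    """List the heights on or beyond either boundary."""
    low, high = outlier_fences(lower_quartile, upper_quartile)
    return [h for h in heights_cm if h <= low or h >= high]

girls_year7_cm = [
    125, 141, 143, 145, 148, 151, 152, 152, 152, 153, 155, 156, 156, 158,
    159, 160, 160, 160, 160, 161, 161, 162, 162, 163, 165, 170, 175, 180,
]
boys_year7_cm = [
    136, 140, 141, 142, 146, 147, 148, 148, 150, 151, 151, 152, 152, 153,
    154, 154, 154, 154, 155, 157, 158, 158, 159, 161, 161, 161, 161, 162,
    162, 165, 168, 173,
]

print(outlier_fences(151.5, 162), find_outliers(boys_year7_cm, 151.5, 162))
print(outlier_fences(153.5, 166), find_outliers(girls_year7_cm, 153.5, 166))
```

The same two functions could be reused for year 11 by passing in that year's quartiles and heights.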
My third hypothesis is that boys in year 11 will be taller than the girls in year 11.
For my third hypothesis, I will investigate the heights of girls and boys in Year 11. In order to do this I will take a sample size of 60 year 11 students from Mayfield High School because it is the same sample size as I took for my second hypothesis when comparing the heights of boys and girls in Year 7. This way I will be able to compare the heights of the boys in year 7 with the boys in year 11, and the heights of the girls in year 7 with the girls in year 11. This will show the changes that can happen in height over the years.
I will take my sample by stratifying the boys and girls in the year group. Stratified sampling is a fair way of taking a sample because the numbers of boys and girls in the sample are in proportion to the numbers in the year. This is how I will take a stratified sample of 60 students from year 11:
Boys: (84/170) x 60 = 29.64705882 = 30 boys (to the nearest whole number)
Girls: (86/170) x 60 = 30.35294118 = 30 girls (to the nearest whole number)
I will now use random sampling to choose the 60 students whose data I will use. I will do this by doing the following calculations:
Boys: RAN# x 84 and press = 30 times
Girls: RAN# x 86 and press = 30 times
I will find the students I have chosen in the list of year 11 students on the Edexcel website and take their heights. The list is on the next page. I will put the boys' heights into a cumulative frequency table and the girls' heights into another. From the tables I will be able to draw a cumulative frequency curve for the boys and a separate one for the girls. From these I will be able to find and compare the lower quartiles, medians and upper quartiles for each of the curves. This will tell me whether the boys are generally taller than the girls or the other way round.
This is my cumulative frequency table for the heights of boys in year 11 in Mayfield High School:
Height, x (cm)    Frequency   Cumulative Frequency
150 ≤ x < 160     1           1
160 ≤ x < 170     15          16
170 ≤ x < 180     5           21
180 ≤ x < 190     7           28
190 ≤ x < 200     1           29
200 ≤ x < 210     1           30
This is my cumulative frequency table for the heights of girls in year 11 in Mayfield High School:
Height, x (cm)    Frequency   Cumulative Frequency
100 ≤ x < 110     1           1
110 ≤ x < 120     0           1
120 ≤ x < 130     0           1
130 ≤ x < 140     2           3
140 ≤ x < 150     0           3
150 ≤ x < 160     9           12
160 ≤ x < 170     13          25
170 ≤ x < 180     5           30
After drawing my cumulative frequency curves for the boys' and girls' heights, I will draw a box and whisker diagram for the boys' heights and then one for the girls' heights. I will draw these because they are a good way of comparing the lower quartile, upper quartile and median of the boys' and girls' heights, since they show the differences clearly. From them I will be able to find the interquartile ranges of the boys and the girls. This will show me the difference between the spread of heights of the girls and the boys.
From the box plots that I drew, I can see that the boys' median height, 158.5 cm, is higher than the girls' median height, 153 cm. The boys' lower and upper quartiles are also higher than the girls': 153 cm and 172.5 cm compared to 146 cm and 159 cm. This tells me that overall the boys in year 11 are taller than the girls. The boys' interquartile range, 19.5 cm, is also higher than the girls' interquartile range, 13 cm. This tells me that the boys' heights are very spread out whereas the girls' heights are closer to each other. The box plot that I drew for the boys' heights is positively skewed, but the box plot for the girls' heights is slightly negatively skewed. The positive skew tells me that the boys' median is closer to the lower quartile than to the upper quartile, so the heights just below the median are bunched together while those above it are more spread out. The slightly negative skew means that the girls' median is a little closer to the upper quartile than to the lower quartile, but since the skew is so slight the distribution is almost equal on both sides of the median. This is another difference between the heights of the boys and girls in year 11 at Mayfield High School.
I will now try to find any outliers that may have distorted my data. To do this I need to multiply each interquartile range by 1.5. For the boys this is 19.5 x 1.5 = 29.25 and for the girls this is 13 x 1.5 = 19.5. Any boy whose height is 29.25 cm or more below the lower quartile, or 29.25 cm or more above the upper quartile, is an outlier. Any girl whose height is 19.5 cm or more below the lower quartile, or 19.5 cm or more above the upper quartile, is an outlier. Here are the heights of the boys and girls in year 11:
Year Group   Gender   Height (m)     Year Group   Gender   Height (m)
11           Female   1.03           11           Male     1.50
11           Female   1.33           11           Male     1.61
11           Female   1.37           11           Male     1.62
11           Female   1.50           11           Male     1.62
11           Female   1.52           11           Male     1.62
11           Female   1.52           11           Male     1.65
11           Female   1.53           11           Male     1.65
11           Female   1.54           11           Male     1.65
11           Female   1.55           11           Male     1.65
11           Female   1.55           11           Male     1.67
11           Female   1.56           11           Male     1.67
11           Female   1.57           11           Male     1.67
11           Female   1.61           11           Male     1.67
11           Female   1.62           11           Male     1.68
11           Female   1.62           11           Male     1.68
11           Female   1.62           11           Male     1.68
11           Female   1.63           11           Male     1.70
11           Female   1.63           11           Male     1.72
11           Female   1.63           11           Male     1.72
11           Female   1.63           11           Male     1.73
11           Female   1.65           11           Male     1.75
11           Female   1.65           11           Male     1.80
11           Female   1.68           11           Male     1.81
11           Female   1.68           11           Male     1.84
11           Female   1.69           11           Male     1.85
11           Female   1.70           11           Male     1.85
11           Female   1.72           11           Male     1.86
11           Female   1.72           11           Male     1.88
11           Female   1.73           11           Male     1.97
11           Female   1.74           11           Male     2.06
Male: 153 - 29.25 = 123.75cm
Male: 172.5 + 29.25 = 201.75cm
Any boy that is 123.75cm or less, or 201.75cm or more is an outlier.
Female: 146 - 19.5 = 126.5cm
Female: 159 + 19.5 = 178.5cm
Any girl that is 126.5cm or less, or 178.5cm or more is an outlier.
The shortest male is 150cm so there is no outlier there but there is one boy who is 206cm. He is the only outlier for the boys.
The shortest girl is 103cm, which is the only outlier since the tallest girl is 174cm, not an outlier.
So overall there is one outlier for the boys and one outlier for the girls. These may have occurred because the students may have not known their height when giving the information so may have guessed inaccurately. They may have distorted my data slightly.
In conclusion, my graphs and calculations support my hypothesis. The boys are taller than the girls in year 11. This may be because the girls have already stopped growing, or their growth spurts have slowed down, whereas the boys are now in the middle of their growth spurts and are therefore taller than the girls.
I will now take this investigation further by comparing the heights of boys in year 7 with the heights of boys in year 11, and the heights of girls in year 7 with the heights of girls in year 11. I will look to see what changes can occur in height over a period of 4 years.
I will do this by comparing the medians, lower quartiles and upper quartiles and the interquartile ranges. I expect that the boys and girls of year 11 will be taller than the boys and girls in year 7.
Year Group   Gender   Median      Lower Quartile   Upper Quartile   Interquartile Range
7            Girls    160cm       153.5cm          166cm            12.5cm
11           Girls    153cm       146cm            159cm            13cm

Year Group   Gender   Median      Lower Quartile   Upper Quartile   Interquartile Range
7            Boys     156.5cm     151.5cm          162cm            10.5cm
11           Boys     158.5cm     153cm            172.5cm          19.5cm
From these tables I can see that on average the girls in year 7 are taller than the girls in year 11, since their median, lower quartile and upper quartile are all bigger than the year 11 girls'. This contradicts what I had expected, since I thought that the girls would have grown by the time they reached year 11. This unexpected result could have happened because the students may have written down incorrect information, since it is highly unlikely that girls in year 7 would on average be taller than both the girls and the boys in year 11. The boys in year 7 are on average shorter than the boys in year 11, which is expected: the year 11 boys' median, lower quartile and upper quartile are all larger than the year 7 boys'. From these tables I can also see that the spreads of the girls' heights in years 7 and 11 are very similar, with interquartile ranges of 12.5 cm and 13 cm. The year 7 boys' interquartile range, 10.5 cm, is not that wide either, whereas the year 11 boys' heights are very spread out, with an interquartile range of 19.5 cm.
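The comparison between the year groups can also be laid out in Python. This is only a sketch: the `Summary` class and `median_change` function are invented for the illustration, and the values are the quartiles and medians read from the box plots.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    """Box plot summary for one group; all values in centimetres."""
    lower_quartile: float
    median: float
    upper_quartile: float

    @property
    def iqr(self):
        # Interquartile range: spread of the middle half of the heights.
        return self.upper_quartile - self.lower_quartile

summaries = {
    ("girls", 7): Summary(153.5, 160, 166),
    ("girls", 11): Summary(146, 153, 159),
    ("boys", 7): Summary(151.5, 156.5, 162),
    ("boys", 11): Summary(153, 158.5, 172.5),
}

def median_change(gender):
    """Year 11 median minus year 7 median; negative means year 11 is shorter."""
    return summaries[(gender, 11)].median - summaries[(gender, 7)].median

for gender in ("girls", "boys"):
    print(gender, median_change(gender),
          summaries[(gender, 7)].iqr, summaries[(gender, 11)].iqr)
```

A negative change for the girls and a positive one for the boys matches what the tables show.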
I have found support for my three hypotheses. They were:
1. The taller you are, the heavier you are.
2. Girls in year 7 are taller than the boys in year 7.
3. Boys in year 11 are taller than the girls in year 11.
I accept my three hypotheses since my data supports them, but when I extended my second and third hypotheses by comparing year 7 with year 11, I found that part of what I expected was incorrect.
There are certain limitations to my investigation. One of these is that my sample size may have been too small. I could have made it bigger, but that would have been time consuming and would have produced too much data to put on appropriate graphs, so I would not have been able to compare the information properly.
The information provided may be inaccurate or incorrect, since the pupils at Mayfield High School may not have known their heights and weights accurately and may have given incorrect guesses. The data set also included a lot of information that was irrelevant to my investigation, such as the students' names, which is not worth investigating.
The data could be biased because it was only collected from one school and not from other areas of the country. For example, there may have been a lot of snack shops or fast food restaurants near the school, which may have affected the students' weights.
I could have investigated further by using more than one school's data and comparing the schools, and I could have tested more hypotheses. This would have given more reliable results because of the larger range and the larger sample.
I could have drawn more graphs and made more calculations to get more accurate results. For example, I could have calculated the standard deviation for each of my three hypotheses to get a measure of spread that uses every value, rather than only the quartiles as the interquartile range does.
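As an illustration of that further work, the standard deviation could be calculated in Python as sketched below. This was not part of the investigation. The sketch assumes the population form of the formula, the square root of (Σx²/n minus the mean squared), and uses the sampled year 7 girls' heights in centimetres.

```python
import math
import statistics

# Heights of the 28 sampled year 7 girls, in centimetres.
girls_year7_cm = [
    125, 141, 143, 145, 148, 151, 152, 152, 152, 153, 155, 156, 156, 158,
    159, 160, 160, 160, 160, 161, 161, 162, 162, 163, 165, 170, 175, 180,
]

n = len(girls_year7_cm)
mean = statistics.mean(girls_year7_cm)

# Standard deviation by hand: square root of (mean of the squares minus the
# square of the mean).
sd_by_hand = math.sqrt(sum(x * x for x in girls_year7_cm) / n - mean ** 2)

# The same quantity from the standard library (population standard deviation).
sd_library = statistics.pstdev(girls_year7_cm)

print(round(mean, 2), round(sd_by_hand, 2), round(sd_library, 2))
```

Unlike the interquartile range, this measure is affected by every height, so a single outlier such as 125 cm would increase it.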