First I needed to find the gradient of the line of best fit and the y intercept. This was so I could then substitute them into the equation.
Gradient = Difference in y axis (weight)/Difference in x axis (height)
Gradient = 10/12
Gradient =0.83
Y intercept = 24
(Because my graph did not pass through the y axis I had to extend my graph and line of best fit)
Substituted into the equation: y=0.83x + 24
I then substituted a value for height into the equation from a pupil who lay directly on the line of best fit (to ensure the pupil was representative) to check if the equation worked.
X = 1.62
Y=0.83x1.62+24
Y=25.3446
Weight = 25.3446
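The check above can be reproduced in a few lines of Python. This is only a sketch: the name predict_weight is a hypothetical helper of my own, and the gradient and intercept are simply the values read from the hand-drawn graph.

```python
# Gradient and intercept as read from the line of best fit.
GRADIENT = 10 / 12   # difference in weight / difference in height
INTERCEPT = 24       # where the extended line met the y axis


def predict_weight(height, gradient=0.83, intercept=INTERCEPT):
    """Return y = gradient * x + intercept (hypothetical helper)."""
    return gradient * height + intercept


print(round(GRADIENT, 2))              # gradient rounded to 2 d.p.
print(round(predict_weight(1.62), 4))  # predicted weight for x = 1.62
```

The rounded gradient (0.83) is used as the default inside the function so that the result matches the hand calculation.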
However, this value is clearly inaccurate, as the approximate weight of someone of height 1.62m should be 52kg, not 25kg. This meant that my equation was incorrect. Because the x axis on my graph began at 1.30m (not at 0, as it could have done), I thought that this could be affecting my equation. I therefore tried adding 130cm, that is 1.30m (where my x axis started), to my final result for height, as this seemed a logical suggestion:
Y = 52
52 = 0.83x + 24
0.83x = 28
Therefore x = 28/0.83 = 33.73
x = 33.73 + 130 = 163.73 (in cm)
x = 1.637m (which is accurate considering the actual height should be 1.62m)
So the equation for finding height when only weight is given is:
X = (y - 24)/0.83 + 130, where X is in centimetres (divide by 100 for metres).
Similarly, the equation can easily be rearranged to find a value for weight if only height is given based upon the y axis beginning at 30kg:
Y= (0.83x+24) +30
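The worked rearrangement for height can be sketched as a small function. The function and parameter names are hypothetical, and I am assuming the axis offset is applied in centimetres (130), as in the worked numbers, with the result divided by 100 to give metres.

```python
def height_cm_from_weight(weight_kg, gradient=0.83, intercept=24,
                          x_axis_start_cm=130):
    """Rearrange y = gradient * x + intercept for x, then add the
    point where the x axis started (assumed to be in centimetres)."""
    return (weight_kg - intercept) / gradient + x_axis_start_cm


height_cm = height_cm_from_weight(52)
print(round(height_cm, 2))        # height in centimetres
print(round(height_cm / 100, 3))  # the same height in metres
```

Keeping the offset as a named parameter makes it clear that the correction depends on where the axis of the hand-drawn graph began.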
After analysing the scatter graphs, we can see that there is a consistently positive gradient (albeit a varying one) in the relationship between the heights and weights of the pupils in the sample. I will now narrow the focus of my investigation onto the spread of the heights and weights for each sex. To do this I will use histograms to represent the data collected on height and a stem and leaf diagram to analyse the data on weight. I will use a stem and leaf diagram for the weights because it shows the shape of the distribution. I will use histograms for the heights because they are a good way of representing continuous data and show the frequency of the results in a different way.
Stem and leaf diagram
From the stem and leaf diagram I could find the mode, modal group, median, range, lower quartile, upper quartile and interquartile range (see table below) by simply looking at the diagram and finding the results needed.
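As a rough illustration of how these measures can be read off, the sketch below builds a stem and leaf diagram and a summary with Python's statistics module. The weights are made-up values for illustration only, not my sample, and stem_and_leaf is a hypothetical helper. Quartile conventions differ between textbooks and software, so the quartiles may not match values found by hand.

```python
import statistics

# Hypothetical weights in kg, for illustration only (not the sample data).
weights = [38, 40, 42, 45, 45, 47, 50, 52, 54, 57, 60, 63]


def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem), units as leaves."""
    rows = {}
    for v in sorted(values):
        rows.setdefault(v // 10, []).append(v % 10)
    return rows


for stem, leaves in stem_and_leaf(weights).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

# statistics.quantiles with n=4 returns the three quartile cut points.
lower_q, median, upper_q = statistics.quantiles(weights, n=4)
summary = {
    "mode": statistics.mode(weights),
    "median": statistics.median(weights),
    "range": max(weights) - min(weights),
    "lower quartile": lower_q,
    "upper quartile": upper_q,
    "interquartile range": upper_q - lower_q,
}
print(summary)
```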
The stem and leaf diagram of weight for the boys and girls revealed that, whilst the boys' results were relatively normal, with the modal group occurring relatively close to the middle of the range of weights, the results for the girls were very different. The modal group for the girls lay at the lowest end of the scale, in the 40-49kg group. This difference suggested to me that further investigation was needed.
I will now investigate the heights of the pupils at Mayfield High. To do this I will use histograms, cumulative frequency graphs and box and whisker plots.
Histograms
I have chosen to use a histogram to represent the heights of the pupils. This is because histograms are used to represent continuous data that is numerical and grouped, much like the heights of our pupils. I will draw two histograms to illustrate the heights of boys and girls.
The first set of histograms I will draw will use the length of the bar to represent the frequency so I must ensure that all the class intervals are equal in width (5cm).
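Sorting heights into equal 5cm classes can be sketched as follows. The heights are hypothetical values for illustration, equal_width_frequencies is my own helper name, and I am assuming every value is at least as large as the starting bound.

```python
from collections import Counter

# Hypothetical heights in cm, for illustration only.
heights_cm = [152, 154, 158, 160, 161, 163, 164, 166, 169, 171, 175]


def equal_width_frequencies(values, start, width):
    """Count values in classes [start, start + width), and so on.
    Assumes every value is >= start."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in range(max(counts) + 1)
    }


table = equal_width_frequencies(heights_cm, start=150, width=5)
for (low, high), freq in table.items():
    print(f"{low} <= h < {high}: {freq}")
```

Because every class has the same width, the bar length alone can stand for the frequency.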
From these two histograms a rough idea of the dispersion of the data can be gained. For roughly normal data, the standard deviation is the distance either side of the mean that encompasses about 68% of the total area of the bars.
The histogram for the heights of the girls showed a ‘tighter’ distribution of results where most results are within a narrow range either side of the mean. The value for standard deviation will therefore be quite small.
The histogram for the heights of the boys shows a higher dispersion than that of the girls. It has a fairly large spread of results away from the mean which suggests that the standard deviation will work out to be larger than that of the girls.
To represent the data of the heights in a different way I will now draw a set of histograms with different class widths. To do this I will have to calculate the frequency density. It will be the area of each bar that will determine the frequency.
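The frequency density calculation is frequency divided by class width. A minimal sketch, using a made-up grouped table (the class boundaries and frequencies are assumptions for illustration only):

```python
# Hypothetical grouped table: (lower bound cm, upper bound cm, frequency).
classes = [(130, 150, 4), (150, 160, 12), (160, 165, 11),
           (165, 170, 9), (170, 190, 8)]


def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)


densities = [frequency_density(*c) for c in classes]
for (lower, upper, freq), density in zip(classes, densities):
    # The area of each bar (density * width) recovers the frequency.
    print(f"{lower}-{upper}cm: frequency {freq}, density {density}")
```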
Cumulative frequency graph
Using a cumulative frequency curve will show me how the frequencies build up across the range of heights and how spread out the data values are.
I will plot the cumulative frequency curves for both the boys’ heights and the girls’ heights on one graph to make the comparison clearer. The cumulative frequency is obtained by adding together the frequencies to give a ‘running total’. First I will put the data into a grouped frequency table:
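The running total can be sketched with itertools.accumulate from the Python standard library. The class bounds and frequencies below are hypothetical, chosen only to show the method.

```python
from itertools import accumulate

# Hypothetical grouped data: upper class bound (cm) and class frequency.
upper_bounds = [150, 155, 160, 165, 170, 175, 180]
frequencies = [3, 6, 10, 13, 9, 6, 3]

# Each cumulative value is the sum of all frequencies up to that class.
cumulative = list(accumulate(frequencies))

for bound, total in zip(upper_bounds, cumulative):
    print(f"height < {bound}cm: {total}")
```

Each cumulative value is plotted against the upper bound of its class, which is why the upper bounds are stored rather than the midpoints.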
The cumulative frequency curves in the graph show that most girls’ heights tend to be lower than the boys’ heights; however, this relationship changes at a height of approximately 168cm, where the girls become taller than the boys.
The girls’ heights (represented by the red line on the graph) show a tighter distribution around the median, which is reflected by a small interquartile range. This shows that the girls’ results were more consistent than the boys’ results. In comparison, the boys’ heights (represented by the green line on the graph) show a more widely spread set of data and therefore have a larger interquartile range.
I can use the cumulative frequency curves to work out what percentage of the boys in the school are between 1.6m and 1.8m tall, and what percentage of the girls are in this range.
The cumulative frequency chart tells me that there are 25 girls between 1.60m and 1.80m and there are 31 boys between 1.60m and 1.80m.
This means that 25/50 = 1/2, or 50%, of the girls in the sample are between 1.60m and 1.80m, and 31/50, or 62%, of the boys are between 1.60m and 1.80m, so I estimate the same percentages for the school as a whole. If I selected a boy at random from the school, there would be a probability of about 0.62 of him being between 1.60m and 1.80m. Also, because the percentage of girls in this range is lower than the percentage of boys, it suggests that boys are in general taller than girls.
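The proportions can be checked with a short calculation. The counts (25 girls and 31 boys out of 50 each) are the ones read from my cumulative frequency chart; share_in_range is a hypothetical helper name.

```python
def share_in_range(count_in_range, sample_size):
    """Proportion of the sample that falls inside the chosen range."""
    return count_in_range / sample_size


girls = share_in_range(25, 50)
boys = share_in_range(31, 50)
print(f"girls: {girls:.0%}, boys: {boys:.0%}")
```

Treating the sample proportion as a probability for the whole school assumes that the sample is representative of the school.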
Box and Whisker Plot:
I have used the cumulative frequency curves to draw a box and whisker plot by extending the lines taken from the curve showing the lower quartile, median and upper quartile. However, these values are only estimates because the graph, being hand drawn, may be inaccurate. It is much easier to represent the values for the lower quartile (25th percentile), median (50th percentile) and upper quartile (75th percentile) on a box and whisker diagram. I used a box and whisker diagram because it makes it easy to see how the data are distributed, whether they are skewed and where the interquartile range lies. The box represents the central 50% of the data compared to the overall spread of the data (the whiskers extend to the lowest and highest values).
The median height for boys in the sample is 1.62m compared to the median height of 1.63m for girls. This shows that, on average, the girls were slightly taller than the boys. The width of the box represents the interquartile range, a measure of the variability of the data. For the boys, the interquartile range was 14cm (1.69m – 1.55m) and for the girls it was 11cm (1.68m – 1.57m). This shows that there is more variability in the boys’ data, which is why their box is wider.
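The interquartile ranges can be recomputed from the quartiles quoted above. In this sketch the values are converted to whole centimetres to avoid rounding problems with decimals; the dictionary layout is just one convenient choice.

```python
# Quartiles in cm, as read from the box and whisker plots.
quartiles_cm = {
    "boys": {"lower": 155, "median": 162, "upper": 169},
    "girls": {"lower": 157, "median": 163, "upper": 168},
}


def interquartile_range(q):
    """Upper quartile minus lower quartile."""
    return q["upper"] - q["lower"]


iqr = {sex: interquartile_range(q) for sex, q in quartiles_cm.items()}
print(iqr)
```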
The box plot for the boys’ heights has a roughly symmetrical distribution, with the median in the centre of the box. However, the whiskers are of unequal length, with the right hand whisker being considerably longer than the left hand whisker. This shows that the central values for the boys’ heights are roughly symmetrical but that, due to some boys being extremely tall, a large range is shown. In comparison, the median for the girls’ heights is towards the upper quartile, indicating that the values are negatively skewed, and again the whisker to the right is considerably longer than the whisker to the left, showing that the girls too have some extremely tall pupils within their sample.
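One way to put a number on the skew of the box is the quartile skewness coefficient (often called Bowley's coefficient), (Q3 + Q1 - 2 x median) / (Q3 - Q1). This is not a measure I used in my investigation; it is offered only as a sketch, using the quartiles in centimetres read from my box plots. It describes the box only and ignores the whiskers.

```python
def quartile_skewness(lower_q, median, upper_q):
    """Quartile (Bowley) skewness: 0 means the median is central in the
    box, negative means the median sits nearer the upper quartile."""
    return (upper_q + lower_q - 2 * median) / (upper_q - lower_q)


boys_skew = quartile_skewness(155, 162, 169)
girls_skew = quartile_skewness(157, 163, 168)
print(boys_skew, round(girls_skew, 3))
```

A value of zero for the boys agrees with the median lying in the centre of the box, and a negative value for the girls agrees with their median lying towards the upper quartile, although the size of the girls' value suggests the skew in the box is slight.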
I will now use grouped standard deviation to show how representative my sample is.
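Grouped standard deviation uses the midpoint of each class in place of the individual values. A minimal sketch follows, with a hypothetical grouped table (not my data) and a helper name of my own; it uses the population form of the formula, dividing by n.

```python
import math

# Hypothetical grouped heights: (lower cm, upper cm, frequency).
grouped = [(150, 155, 4), (155, 160, 9), (160, 165, 16),
           (165, 170, 13), (170, 175, 8)]


def grouped_mean_and_sd(classes):
    """Estimate mean and standard deviation from class midpoints."""
    n = sum(f for _, _, f in classes)
    mids = [((lo + hi) / 2, f) for lo, hi, f in classes]
    mean = sum(m * f for m, f in mids) / n
    variance = sum(f * (m - mean) ** 2 for m, f in mids) / n
    return mean, math.sqrt(variance)


mean, sd = grouped_mean_and_sd(grouped)
print(round(mean, 2), round(sd, 2))
```

Because midpoints stand in for the real values, the result is only an estimate of the standard deviation of the individual heights.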
Throughout the investigation so far, it has emerged that, in general, the boys from the sample show a stronger correlation than the girls (see scatter graphs) and that it is the boys who pull the girls’ ‘diluting’ results together, giving the whole school scatter graph a positive correlation. Furthermore, the results from the grouped standard deviation show that, whilst the boys in the sample give a relatively representative result, the girls do not, which suggests that the girls’ results require further investigation to find out why their sample is not representative. Because the boys’ results are consistently reliable and comply satisfactorily with expectations, I have chosen to set them aside from this point. If unlimited time were available, I would closely analyse both the boys and the girls so that accurate comparisons between the two could be made. However, due to time constraints, I am forced to choose only one sex to examine in detail, and because the results found so far for the girls have suggested abnormality, I have chosen to narrow the investigation onto the girls in the sample. Similarly, due to time constraints, I can fully analyse only one factor, and I have chosen height as this gave the more interesting results.
Taken as a whole, the girls’ results are reliable and do show a positive correlation (as can be seen in the relevant scatter graph), albeit not as strong as that of the boys. However, when split into their individual year groups, the girls generally show a poor correlation. This result is slightly strange and could suggest that the girls in one particular year group have a very strong positive correlation that counterbalances the pattern of poor correlation among the other girls. This irregularity will provide me with an interesting further investigation.
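The comparison between the whole group and the individual year groups could be made numerically with the product-moment correlation coefficient. I judged correlation from the scatter graphs rather than calculating it, so this is only a sketch; the (height, weight) pairs are made up for illustration and pearson_r is a hypothetical helper.

```python
import math


def pearson_r(pairs):
    """Product-moment correlation coefficient for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical (height in m, weight in kg) pairs for two year groups.
year_groups = {
    7: [(1.45, 42), (1.50, 38), (1.55, 44), (1.52, 40)],
    11: [(1.60, 55), (1.65, 50), (1.70, 56), (1.68, 52)],
}

per_year = {year: pearson_r(pairs) for year, pairs in year_groups.items()}
pooled = pearson_r([p for pairs in year_groups.values() for p in pairs])
print({year: round(r, 2) for year, r in per_year.items()}, round(pooled, 2))
```

In this made-up data the pooled correlation is much stronger than either year group's, simply because the older girls are both taller and heavier on average. Whether the same thing happens in my sample would need checking against the real data.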
The majority of girls have heights in the range 1.60 – 1.65m, which is quite close to the mean of 1.64m; however, the mean is at the higher end of this class. What this tells me is that the highest values may be having a disproportionate effect. In order to see precisely what effect these higher values are having, I will carry out a test for individual standard deviation on this discrete data.
I will split the girls into year groups and take only the results of years 7 and 11, giving me two different groups to compare (I would have analysed every year group for greater accuracy if I had had sufficient time). I chose years 7 and 11 as these should provide me with the most extreme comparison.
Standard deviation for Year 7 girls’ heights
Standard deviation for Year 11 girls’ heights
From these standard deviations I can see that neither the data for the heights of year 7 girls nor the data for the heights of year 11 girls was representative. Both showed the same pattern of meeting (or almost meeting) the criterion for the proportion of values lying within one standard deviation of the mean. However, the majority of heights lay below the mean in each case, and 100% of each sample lay within two standard deviations of the mean.
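The check of how much of a sample lies within one or two standard deviations of the mean can be sketched as below. The heights are hypothetical values for illustration only, share_within is my own helper name, and the population standard deviation (statistics.pstdev) is assumed.

```python
import statistics

# Hypothetical heights in metres, for illustration only.
heights = [1.48, 1.50, 1.52, 1.53, 1.55, 1.56, 1.58, 1.60, 1.72, 1.76]


def share_within(values, k):
    """Proportion of values within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    inside = [v for v in values if abs(v - mean) <= k * sd]
    return len(inside) / len(values)


below_mean = sum(v < statistics.mean(heights) for v in heights)
print(share_within(heights, 1), share_within(heights, 2), below_mean)
```

For roughly normal data about 68% of values would be expected within one standard deviation and about 95% within two, so large departures from these figures hint that a few extreme values are affecting the mean.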
Conclusion:
As a result of my calculations and graphs, I can say that in general, the results were what I expected and do support the statement I made when describing the task.
My scatter graphs were useful as they illustrated the general pattern of the results, linking the information of both height and weight together. From these I could then see trends and based upon these trends, choose where to take the investigation next.
The stem and leaf diagram showed me the frequency of the weights and also allowed me to see the shape of the dispersion of weights.
My histograms then gave me an idea of the general dispersion of the heights of both the boys and girls.
The cumulative frequency graph combined with the box and whisker plots gave me another way of representing the data on heights and allowed me to clearly compare the boys’ results with those from the girls.
It was clear from the grouped standard deviation I carried out that the data from the boys was far more representative than the girls. When I focused the investigation onto the girls, and used individual standard deviation, it was revealed that this was due to an imbalance in data – that most girls’ heights were well below the mean.
Because of this lack of representative sample from the girls, the conclusions I draw from the investigation are likely to be unreliable; however I think I can still safely say I have proved my hypothesis correct in that all calculations/graphs have shown that height increases in relation to weight. In addition I have discovered differences between the heights and weights in boys and girls from the sample, such as girls have proved to be taller than boys up until key stage four when boys ‘take over’ and are the taller sex.
If I had had more time, I would have taken a new sample for the girls’ results, as this sample proved not to be very representative (shown by standard deviation). This way I could be more confident that the conclusions drawn from my investigation were accurate.
If I could repeat my investigation, I would use computer software to draw my graphs and thus improve the accuracy of my graphs (particularly cumulative frequency). I would also use a larger sample as although this would be time consuming, it would increase the level of accuracy in my findings. In addition, I would take a sample that represented exactly the number of boys in the school against the number of girls. Although I included 50 of each sex in my sample (as this would give me easier numbers to work with), the total number of girls differs from the total number of boys and in theory this should be represented in the sample if the sample is really representative.
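Taking a sample that reflects the balance of boys and girls in the school is a proportional (stratified) allocation. A minimal sketch, assuming a made-up school roll (these totals are not the real figures) and a hypothetical helper name:

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Share out the sample in proportion to the size of each group.
    Rounding means the parts may not always add up to sample_size."""
    total = sum(stratum_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in stratum_sizes.items()
    }


# Hypothetical numbers of boys and girls on the school roll.
roll = {"boys": 600, "girls": 560}
allocation = proportional_allocation(roll, sample_size=100)
print(allocation)
```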
Despite this, my calculations were generally accurate throughout. I used the full values (e.g. 45.28467102 instead of rounding it to 45.3) to get a more precise result, rather than rounding numbers to make them easier to work with. This can be seen in the work on standard deviation.
Furthermore, to ensure my equations were correct, I checked them against a representative pupil’s statistics.
Further investigations I could do on the topic of data handling could involve an investigation into body mass index, or to investigate height in relation to exact age as opposed to year group.