I will work out the product moment correlation coefficient, r, between the height variable (x) and the weight variable (y) for the year 11 boys and girls, which will give me a measure of linear correlation. Afterwards I will interpret the two r values obtained and suggest reasons for the strength of each correlation and why they may differ. The linear equation that best describes the relationship between x and y can be found by least squares regression. The means by which I shall do this are as follows.
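I shall use the following equations later in my coursework to find the correlation values (n is the number of students in the sample).
Sxx = Σx² - (Σx)²/n
Syy = Σy² - (Σy)²/n
Sxy = Σxy - (Σx)(Σy)/n
r = Sxy / √(Sxx × Syy)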
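The sums above can be worked out directly from the column totals. The following is only a minimal sketch: the function names s_xx and s_xy and the three toy data points are assumptions made for illustration and are not taken from the boys' or girls' tables.

```python
def s_xx(sum_x2, sum_x, n):
    """Sxx = Σx² - (Σx)²/n, computed from column totals."""
    return sum_x2 - sum_x ** 2 / n


def s_xy(sum_xy, sum_x, sum_y, n):
    """Sxy = Σxy - (Σx)(Σy)/n, computed from column totals."""
    return sum_xy - sum_x * sum_y / n


# Hypothetical toy data, used only to show the arithmetic.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
n = len(xs)

sum_x, sum_y = sum(xs), sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

sxx = s_xx(sum_x2, sum_x, n)          # 14 - 36/3
syy = s_xx(sum_y2, sum_y, n)          # Syy has the same form as Sxx, with y totals
sxy = s_xy(sum_xy, sum_x, sum_y, n)   # 31 - 78/3
```

Syy reuses s_xx because the two formulas have the same shape; only the totals passed in change.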
Linear Regression And The Least Squares Regression Line
Regression is the process by which you determine the function that best fits the points on a scatter diagram. The fitted line passes through the mean point, (x̄, ȳ). Because this is linear regression, the function is a straight line, so the function (f) is:
f(x) = a + bx where a = ȳ - b x̄
The least squares regression line, a form of linear regression, is the line that lies as close as possible to all the points on a graph. It is the line with the least square error (the minimal sum of the squared deviations, Σd²). Some deviations are positive and some are negative; however, when they are squared they all become positive, and the line is chosen to give the lowest possible Σd². It can also be shown that Σd² is minimised when:
b = Sxy / Sxx
To calculate the deviation (residual) of each point from the regression line, assuming that the data points are (x1, y1), (x2, y2), etc., the residual d fits the formula
di = yi - (a + bxi)
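The least squares idea can be checked numerically: the fitted line should give a smaller sum of squared residuals than any nearby line. The sketch below uses four hypothetical (height, weight) pairs, not values from my sample. It calculates Sxx and Sxy in the deviation form Σ(x - x̄)² and Σ(x - x̄)(y - ȳ), which is algebraically the same as the totals form. The helper names are illustrative assumptions.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + bx with minimal Σd²."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = mean_y - b * mean_x  # forces the line through (x̄, ȳ)
    return a, b


def sum_squared_residuals(xs, ys, a, b):
    """Σd², where each d = y - (a + bx)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data for illustration only.
heights = [1.50, 1.60, 1.70, 1.80]
weights = [45.0, 52.0, 58.0, 66.0]

a, b = least_squares_line(heights, weights)
best = sum_squared_residuals(heights, weights, a, b)

# Nudging either coefficient away from the fitted value increases Σd².
worse_gradient = sum_squared_residuals(heights, weights, a, b + 1.0)
worse_intercept = sum_squared_residuals(heights, weights, a + 1.0, b)
```

The positive and negative residuals of the fitted line cancel out, which is why the squared deviations are used to measure closeness rather than the plain sum.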
Hypotheses
- I think that both boys and girls will have a moderate to strong positive correlation.
- I think that girls will have a weaker correlation than boys; however, the two r values will be within about 0.15 of one another.
- I think that the boys will have an overall lower residual range.
- I think that the boys' regression line will have a larger gradient than that of the girls, because boys build muscle and girls gain fat through puberty (and muscle weighs more). I am presuming that the shorter students are in early puberty and haven't developed as much as the taller students.
- I think that height and weight will be very dependent on each other.
Product Moment Correlation Coefficient - Girls
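I used the totals of x, y, x², y² and xy to calculate r using the following equations:
Sxx = Σx² - (Σx)²/n
Syy = Σy² - (Σy)²/n
Sxy = Σxy - (Σx)(Σy)/n
r = Sxy / √(Sxx × Syy)
I entered the values into the equations:
Sxx = 93.12 - (Σx)²/n = 0.19
Syy = 94954 - (Σy)²/n = 2433.54
Sxy = 2933.86 - (Σx)(Σy)/n = 7.41
r = 7.41 / √(0.19 × 2433.54) = 0.34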
Product Moment Correlation Coefficient – Boys
I used the same equations to do this, this time with the data obtained from the boys' table.
Sxx = 105.11 - (Σx)²/n = 0.64
Syy = 127984 - (Σy)²/n = 7208.69
Sxy = 3594.81 - (Σx)(Σy)/n = 42.63
r = 42.63 / √(0.64 × 7208.69) = 0.63
*all answers on this page given to 2 decimal places
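The two r values can be re-checked from the Sxx, Syy and Sxy figures quoted above using r = Sxy / √(Sxx × Syy). This is a small sketch; the function name pmcc is an assumption chosen for illustration. Because the inputs are already rounded to 2 decimal places, the result is only reliable to about that precision.

```python
import math


def pmcc(s_xx, s_yy, s_xy):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    return s_xy / math.sqrt(s_xx * s_yy)


# Values quoted in the girls' and boys' working above.
r_girls = pmcc(0.19, 2433.54, 7.41)
r_boys = pmcc(0.64, 7208.69, 42.63)

print(round(r_girls, 2), round(r_boys, 2))  # 0.34 0.63
```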
Line of Least Square Regression – Girls
The equations used for this are:
b = Sxy / Sxx
a = ȳ - b x̄ = Σy/n - b(Σx/n)
y = a + bx
I then entered the values:
b = 7.41 / 0.19 = 39.00
a = ȳ - 39.00 x̄ = -12.23
therefore… y = -12.23 + 39x (**shown on graph)
Line of Least Square Regression – Boys
I used the same equations as for the girls, this time with the data obtained from the boys' table.
b = 42.63 / 0.64 = 66.61
a = ȳ - 66.61 x̄ = -56.34
therefore… y = -56.34 + 66.61x (**shown on graph)
*all answers on this page given to 2 decimal places
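The two gradients can be re-checked from the quoted Sxy and Sxx values using b = Sxy / Sxx. In the sketch below the function names gradient and intercept are illustrative assumptions. The mean height and weight are not quoted in the text, so the intercept function is only demonstrated on a made-up mean point rather than on the real data.

```python
def gradient(s_xy, s_xx):
    """b = Sxy / Sxx."""
    return s_xy / s_xx


def intercept(mean_x, mean_y, b):
    """a = ȳ - b x̄, so the line passes through the mean point."""
    return mean_y - b * mean_x


# Values quoted in the working above.
b_girls = gradient(7.41, 0.19)
b_boys = gradient(42.63, 0.64)
print(round(b_girls, 2), round(b_boys, 2))  # 39.0 66.61

# Hypothetical mean point (2.0, 10.0) with gradient 3.0, for illustration only.
a_example = intercept(2.0, 10.0, 3.0)
```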
Analysis, Improvements and Further Work
Correlation
For year 11 boys I found that the correlation between height and weight was r = 0.63, a relatively high positive correlation suggesting that weight is closely related to height. For year 11 girls, however, the correlation was much lower, at r = 0.34. There are many possible reasons for this. One is the difference in the range of heights: the boys' heights range from 1.5 m up to 2.0 m (a range of 0.50 m) and the girls' from 1.4 m up to 1.73 m (a range of 0.33 m), a 17 cm difference in range between the two. The girls are therefore much less spread out, so if one girl is extremely overweight or extremely skinny it affects the correlation a lot more dramatically than it would for the boys. The correlation could also be like this due to dieting, as females of this age are exposed to glamour models and ultra-thin celebrity bodies, whereas males are generally lazy at this age and most don't do anything that would change their weight, like jogging. I expected the boys to have a stronger correlation, but not this much stronger, and this may come down to the random selection process.
Regression – Girls
The equation gained through the process of least squares regression line in the form y = a + bx for girls was:
y = -12.23 + 39x
This shows that 2 year 11 girls with a height difference of 1 metre should theoretically have a 39 kg difference in weight, as indicated by the gradient (slope). This seems slightly high, as the average weight of a girl in year 11 is 51 kg, and when you look into the regression in more depth you see that it is only reliable within the set of data it was fitted to. For example, according to the regression line a girl with a height of 0 m should weigh -12.23 kg, which is ridiculous. This is called extrapolation, which occurs when the function is used to predict values outside the given range (between the lowest and highest x-values), and it can be highly inaccurate. This is why you should only use values of x within the given range, which is called interpolation.
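One way to guard against extrapolation is to refuse to predict outside the range of heights in the sample. The sketch below assumes the girls' line y = -12.23 + 39x and the girls' height range of 1.40 m to 1.73 m quoted elsewhere in this write-up. The function name and its parameters are illustrative assumptions.

```python
def predict_girl_weight(height_m, a=-12.23, b=39.0,
                        min_height=1.40, max_height=1.73):
    """Predict weight (kg) from height (m), interpolating only.

    Raises ValueError when the height lies outside the sampled range,
    because the line is not reliable there (extrapolation).
    """
    if not (min_height <= height_m <= max_height):
        raise ValueError(
            f"height {height_m} m is outside the sampled range "
            f"{min_height} m to {max_height} m"
        )
    return a + b * height_m


inside = predict_girl_weight(1.60)  # interpolation: allowed

try:
    predict_girl_weight(0.0)        # extrapolation: rejected
    rejected = False
except ValueError:
    rejected = True
```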
To check how accurate my line of best fit is, I chose 3 girls at random and put their height values into the line equation in place of x to give a theoretical value for y (weight), which I then compared to their actual weight:
- height 1.73m weight 58kg
- height 1.60m weight 45kg
- height 1.52m weight 48kg
To calculate the deviation (residual) of each point from the regression line, assuming that the data points are (x1, y1), (x2, y2), etc., the residual d would fit the formula:
di = yi - (a + bxi)
I did not use this formula directly, so that I can show the stages of the working.
- y = -12.23 + (39 × 1.73) = 55.24kg
To work out the residual here I take the actual weight away from the theoretical weight (the opposite sign to di), and I give the percentage error as a percentage of the theoretical weight.
residual = 55.24 - 58 = -2.76, a percentage error of 5.00%
- y = -12.23 + (39 × 1.60) = 50.17kg
residual = 50.17 - 45 = 5.17, a percentage error of 10.30%
- y = -12.23 + (39 × 1.52) = 47.05kg
residual = 47.05 - 48 = -0.95, a percentage error of 2.02%
This shows that my least squares regression line is reasonably accurate, with 2 of the 3 random girls falling within about 5% of the line and the third only just over 10% (10.30%). This could be one method of determining which results are anomalous and are going to be influential points when plotting the regression line; however, it is simpler and less time-consuming to refer to the graph and choose anomalies visually.
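The three checks above can be repeated in a few lines of code. This sketch follows the same convention as the working: the residual is the theoretical weight minus the actual weight, and the percentage error is taken relative to the theoretical weight. The function name check_point is an illustrative assumption.

```python
A_GIRLS, B_GIRLS = -12.23, 39.0  # girls' regression line y = a + bx


def check_point(height_m, actual_kg, a, b):
    """Return (theoretical weight, residual, percentage error), each to 2 d.p."""
    predicted = round(a + b * height_m, 2)
    residual = round(predicted - actual_kg, 2)
    pct_error = round(abs(residual) / predicted * 100, 2)
    return predicted, residual, pct_error


# (height in m, actual weight in kg) for the three girls chosen at random.
girls = [(1.73, 58), (1.60, 45), (1.52, 48)]
girl_checks = [check_point(h, w, A_GIRLS, B_GIRLS) for h, w in girls]

for row in girl_checks:
    print(row)
```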
Regression – Boys
The equation gained through the process of least squares regression line in the form y = a + bx for boys was:
y = - 56.34 + 66.61x
This shows that, according to the least squares regression line, 2 boys in year 11 with a height difference of 1 metre should theoretically have a weight difference of 66.61 kg, as indicated by the gradient (b). This figure seems ridiculously high. I haven't grown over the past year and have only put on 3 kg (from 67 kg to 70 kg), and my height is 1.78 m, yet according to this line someone in year 11 who is only 0.78 m tall should theoretically weigh less than nothing (-56.34 + 66.61 × 0.78 = -4.38 kg). As with the girls' regression line, if you look into the data in more depth you realise that the line is only useful for values within the given range (between the lowest and highest x-values). Calculating for values outside this range is called extrapolation. An example of this on the boys' graph is height = 0 m, where the weight according to the least squares regression line should be -56.34 kg, which is clearly incorrect as soon as you see it is a negative figure. This is the reason we should only interpolate (use values within the range) when using the least squares regression line.
To check how accurate my line of best fit is, I chose 3 boys at random and put their height values into the line equation in place of x to give a theoretical value for y (weight), which I then compared to their actual weight:
- height = 1.91m weight = 82kg
- height = 1.80m weight = 63kg
- height = 1.68m weight = 56kg
- y = -56.34 + (66.61 × 1.91) = 70.89kg
residual = 70.89 - 82 = -11.11, a percentage error of 15.67%
- y = -56.34 + (66.61 × 1.80) = 63.56kg
residual = 63.56 - 63 = 0.56, a percentage error of 0.88%
- y = -56.34 + (66.61 × 1.68) = 55.56kg
residual = 55.56 - 56 = -0.44, a percentage error of 0.79%
I felt this was an unfair representation of the boys, even though it was randomly selected, so I am going to check a further 2 boys to get a better representation.
- height = 1.55m weight = 54kg
- height = 1.77m weight = 57kg
- y = -56.34 + (66.61 × 1.55) = 46.91kg
residual = 46.91 - 54 = -7.09, a percentage error of 15.11%
- y = -56.34 + (66.61 × 1.77) = 61.56kg
residual = 61.56 - 57 = 4.56, a percentage error of 7.41%
The reason I felt it was unfair is that the boys had a stronger correlation than the girls and therefore should theoretically have lower percentage errors from the least squares process. I believe the boys' errors are more varied than the girls' because of the random selection; however, I feel that I have still shown that both the boys' and girls' least squares regression lines are reasonably accurate for their respective data. Both the boys and girls have positive correlations (relatively strong for the boys, weaker for the girls), both have steep gradients for their regression lines, and most of the residuals are low. This indicates that height and weight are dependent on one another, i.e. the taller the year 11 student (male or female), the more they are likely to weigh, within the range of the data.
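One rough way to compare how well the two lines fit the checked students is to average the absolute percentage errors for each group. This average was not part of the working above, so the sketch below is only an illustration. It uses the same conventions as the checks (theoretical weight rounded to 2 decimal places, error relative to the theoretical weight), and the function name is an assumption.

```python
def mean_abs_pct_error(points, a, b):
    """Average of |theoretical - actual| / theoretical * 100 over the points."""
    errors = []
    for height_m, actual_kg in points:
        predicted = round(a + b * height_m, 2)  # rounded as in the working
        errors.append(abs(predicted - actual_kg) / predicted * 100)
    return sum(errors) / len(errors)


# (height in m, actual weight in kg) for the students checked above.
girls = [(1.73, 58), (1.60, 45), (1.52, 48)]
boys = [(1.91, 82), (1.80, 63), (1.68, 56), (1.55, 54), (1.77, 57)]

girls_error = mean_abs_pct_error(girls, a=-12.23, b=39.0)
boys_error = mean_abs_pct_error(boys, a=-56.34, b=66.61)

print(round(girls_error, 2), round(boys_error, 2))
```

With only 3 girls and 5 boys checked, the two averages depend heavily on which students happened to be picked, so they should not be read as a firm comparison.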
Hypotheses - Reviewed
- I think that both boys and girls will have a moderate to strong positive correlation.
I was slightly off with this hypothesis: the boys had a relatively strong correlation, whereas the girls had a low to moderately low correlation; however, both were positive.
- I think that girls will have a weaker correlation than boys; however, the two r values will be within about 0.15 of one another.
The girls did have a weaker correlation; however, the difference was larger than 0.15 (0.63 - 0.34 = 0.29), which could be put down to the sample not being very representative of the entire year group.
- I think that the boys will have an overall lower residual range.
Looking at the graphs and comparing them, this appears correct; however, when working out residuals through the equations the boys showed a very large range (from 0.79% error up to 15.67%).
- I think that the boys' regression line will have a larger gradient than that of the girls, because boys build muscle and girls gain fat through puberty (and muscle weighs more). I am presuming that the shorter students are in early puberty and haven't developed as much as the taller students.
I was correct in thinking that the boys would have a steeper gradient, but that doesn't necessarily mean my reason is correct; there could be many different reasons that I have overlooked, such as the sport they do.
- I think that height and weight will be very dependent on each other.
I was correct in thinking that height and weight would be dependent on each other.
Modifications That Would Make The Investigation More Reliable
- Look at different year groups and compare results
- Take a larger sample (the larger the sample, the more reliable the results)
- Gather secondary information from the internet, looking at national data for height and weight rather than localised data
- Look at year 11 pupils from different schools for a wider range of sources to produce a more reliable result
- Use more precise measuring instruments which could measure to 3 or possibly 4 decimal places
- Use a stratified sample, which would take into account that there are more boys than girls in year 11
- Take many small samples and average the results; this will give more accurate readings and will also reduce the effect of anomalies
- Look into the subjects' backgrounds and see whether environmental differences such as wealth show a trend in the data
All of the above would improve the accuracy of my results; however, I feel that the easiest to do, with the most impact on reliability, would be to increase the sample size, i.e. use the whole year group. With equal ease you could also take a stratified sample, which would correct for the error caused by the difference between the numbers of boys and girls.
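A stratified sample could be sized by giving each group a share of the sample in proportion to its share of the year group. The sketch below is only an illustration: the year group counts (60 boys, 40 girls) and the sample size of 30 are hypothetical, since the real numbers are not given here, and the function name is an assumption. Leftover places after rounding down go to the groups with the largest fractional parts.

```python
def stratified_sizes(group_counts, sample_size):
    """Split sample_size across groups in proportion to their counts."""
    total = sum(group_counts.values())
    exact = {g: sample_size * c / total for g, c in group_counts.items()}
    sizes = {g: int(v) for g, v in exact.items()}  # round each share down
    leftover = sample_size - sum(sizes.values())
    # Hand any remaining places to the groups with the largest remainders.
    by_remainder = sorted(exact, key=lambda g: exact[g] - sizes[g], reverse=True)
    for g in by_remainder[:leftover]:
        sizes[g] += 1
    return sizes


# Hypothetical year group: these counts are NOT from the real school data.
year_11 = {"boys": 60, "girls": 40}
sample_plan = stratified_sizes(year_11, 30)
print(sample_plan)  # {'boys': 18, 'girls': 12}
```

Within each group the students would still be chosen at random; the stratification only fixes how many are taken from each group.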