# I will be testing the following hypothesis in my pilot study: The taller the student the heavier the student will be



Statistics Coursework: Mayfield High School

In this coursework I will be looking at data collected from students at a fictional school, Mayfield High School. For each pupil I was given information such as age, IQ, weight and height. From this data I had to come up with a line of enquiry to explore. I will be using statistical presentation methods such as graphs, along with various calculations, to test my hypotheses. I am going to look at the height and weight of students at the school to see whether an increase in height is reflected in a person's weight, and vice versa.

The data I am going to be looking at is secondary data. One advantage is that it saves time in collecting the data, provided the data is correct. Furthermore, secondary data gives me access to data that I could not otherwise have obtained. However, a disadvantage is that it may not be accurate and could have parts missing. The data could also be biased, meaning that each student did not have an equal chance of being chosen. I would have no way of knowing whether it is biased, because I did not collect the data myself and the person who collected it may have changed it to their benefit. The data may also be out of date.

I could have used primary data instead. Unlike secondary data, it would come directly from the population and I would be able to control how it was collected, which would help me to avoid bias. It can also give the researcher a more realistic view of the topic under consideration. Furthermore, I would know how the data was obtained, which would give me more confidence that it is accurate.


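[Extract from the sample data table. The column headings are missing from this extract; those shown are inferred, and the final column appears to be a student number.]

Gender | Year | Weight (kg) | Height (m) | Number |

F | 10 | 56 | 1.71 | 877 |

F | 10 | 54 | 1.61 | 1114 |

M | 11 | 52 | 1.62 | 1039 |

M | 11 | 50 | 1.70 | 1037 |

M | 11 | 54 | 1.61 | 1111 |

M | 11 | 73 | 1.85 | 1120 |

M | 11 | 67 | 1.78 | 1182 |

M | 11 | 54 | 1.7 | 1006 |

M | 11 | 72 | 1.55 | 1059 |

F | 11 | 39 | 1.74 | 1093 |

F | 11 | 42 | 1.71 | 1056 |

F | 11 | 44 | 1.52 | 1061 |

F | 11 | 48 | 1.60 | 1033 |

F | 11 | 54 | 1.65 | 1076 |

F | 11 | 48 | 1.73 | |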

I will now create a scatter graph, which is a graph of plotted points that shows the relationship between two sets of data. In my pilot study, each dot represents one person's weight against their height. I will also work out the correlation coefficient, because it measures the correlation between height and weight. The correlation indicates the strength and direction of a linear relationship between two variables, in this case height and weight. I expect the correlation to be positive, which would give me confidence in investigating my hypothesis further.

The product-moment correlation coefficient, 'r' (not to be confused with Spearman's rank correlation coefficient, which compares ranks rather than values), is a more accurate way to compare correlation than judging a scatter graph by eye. This is because it gives me a single number for each sample, so I can compare samples or year groups directly. It uses the mean of each set of data and looks at the distance of each point from the mean. The formula for the correlation coefficient 'r' is

$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} $$

(where $\bar{x}$ and $\bar{y}$ are the means of the x and y values respectively)

The value of ‘r’ determines correlation. It is always between –1 and 1.
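As a rough illustration, the correlation coefficient described above can be sketched in Python, assuming the standard product-moment definition (sum of products of the distances from the two means, divided by the square root of the product of the summed squared distances). The function name `correlation_coefficient` is my own. The example uses only the first five students listed in the data above, which is far too few to say anything about the pilot study as a whole.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Product-moment correlation coefficient r for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of the distances from each mean (top of the fraction).
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared distances from each mean (under the square root).
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable never changes")
    return s_xy / sqrt(s_xx * s_yy)


# Heights (m) and weights (kg) of the first five students in the data above.
heights = [1.71, 1.61, 1.62, 1.70, 1.61]
weights = [56, 54, 52, 50, 54]

r = correlation_coefficient(heights, weights)
print(round(r, 3))
```

Data lying exactly on a rising straight line gives r = 1, and data lying exactly on a falling straight line gives r = -1, which is a quick way to check the function.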

The scatter graph should show that there is some correlation between the height and weight of the students. Furthermore, the software Autograph worked out the correlation coefficient for me and also the equation of the line of best fit. The graphs below will give you an idea of what the types of correlation look like.

Value of r | Type of correlation |

-1 | Perfect negative correlation |

-0.8 | Good negative correlation |

-0.5 | Some negative correlation |

0 | No correlation |

0.5 | Some positive correlation |

0.8 | Good positive correlation |

1 | Perfect positive correlation |
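One possible way to use the scale above is to describe a calculated value of r by the nearest reference value. This nearest-value rule is my own assumption, as the scale itself does not say where one category ends and the next begins, and the name `describe_correlation` is hypothetical.

```python
# Reference values and labels taken from the scale above.
SCALE = [
    (-1.0, "Perfect negative correlation"),
    (-0.8, "Good negative correlation"),
    (-0.5, "Some negative correlation"),
    (0.0, "No correlation"),
    (0.5, "Some positive correlation"),
    (0.8, "Good positive correlation"),
    (1.0, "Perfect positive correlation"),
]


def describe_correlation(r):
    """Return the label of the reference value closest to r (assumed rule)."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    # Pick the scale entry whose reference value is nearest to r.
    _value, label = min(SCALE, key=lambda entry: abs(entry[0] - r))
    return label


print(describe_correlation(0.375909))  # Some positive correlation
```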

The scale above gives me an idea of the correlation, whether it is negative, positive or not present at all. I may have some difficulty with my graph and calculations because, as mentioned previously, the data I am analysing is secondary data. My graphs and calculations may therefore not be accurate, which could lead to my hypothesis appearing false. There may also be outliers. An outlier is a value that "lies outside" (is much smaller or larger than) most of the other values in a set of data: it lies at least 1.5 times the interquartile range below the lower quartile or above the upper quartile. Therefore, I am going to carry out a test to identify these outliers and decide what to do with them (replace them or leave them as they are). I am also going to show the graphs both with and without the outliers, because this shows how my results differ between the two.

There are two methods for identifying outliers. One uses the standard deviation and the other uses the interquartile range. I will be using the interquartile range method because I find it the easier of the two. There was an outlier in the sample of data: student 531, who had a weight of 110kg and was 1.7 metres tall. I handled this by choosing another student number at random and using that student in place of the outlier. The number generated was 456, a student with a weight of 42kg and a height of 1.62m.

Having worked it out, the correlation between the height and weight of the students is 0.375909. The graph also gives me an indication of this correlation.

This shows that there is a relationship between the two and that there is a line of enquiry to investigate further. The relationship is that a student's height is reflected in their weight, which supports my hypothesis. However, this does not support my hypothesis fully, as some students may not fit the trend, i.e. they weigh more than usual or are taller than usual.

MAIN STUDY

Research on Height and Weight

From my research, I found that adolescence is a time of great change in males, both physically and mentally. Changes in a male's body are greater at this time than at any other time in his life. Puberty usually occurs between the ages of 10 and 15, or occasionally a little earlier or later. I also found that for a female, adolescence is the time when she will see the greatest amount of growth in height and weight. Puberty occurs earlier for a girl than for a boy, usually from 11 to 14 years of age. During puberty, putting on weight and growing taller occur at different times.

For my first hypothesis I think that the Mayfield High School spreadsheet will support my investigation because it has the relevant data that will allow me to test my hypothesis. Also, the scatter graph shown above shows that there is some correlation between height and weight, so I can investigate further knowing that I could achieve a positive result. Overall, my pilot study did not prove my hypothesis to be fully correct, due to the effects of puberty and/or other variables.

I will now refine my hypothesis to include how I think age and gender will affect the results. I am doing this because my pilot study shows that there is a line of enquiry to investigate further, so I will gather more information and try to prove my hypothesis fully correct.

For my next hypothesis I have decided to investigate the relationship between the height and weight of the pupils and how this differs between year groups. Therefore, from my research:

- I think that females in year 7 will be likely to be taller and weigh more than the males. A number of the girls may start to go through puberty at this time. Therefore, I think that the spread of data for the girls will be greater than the spread for the boys.
- I also believe that almost every girl in year 8 will be taller and weigh more than the boys in year 7 because puberty occurs earlier for a girl than for a boy. Also, as in year 7, I think that the spread of data will be greater for girls than for boys, but because the boys will have started to go through puberty, the spread of the boys' data will increase.
- I believe that in year 9 the heights and weights will start to be about the same. The boys may be taller and heavier due to puberty occurring. I also think that the spread for both year 9 boys and girls will be greater. The spreads for the two genders may start to even out, but I think that the boys' spread may start to increase as the majority of the boys will have started to experience puberty, whereas the girls' spread would also increase as puberty is still occurring but coming to its final stages.
- For year 10, I believe that the boys will tend to be slightly taller and heavier than the girls. There will be a smaller spread as most of the students will have been through puberty. The spread of the data will start to even out; however, from my research I have found that at the end of puberty boys' body growth tends to increase a lot more than girls'.
- Finally, I also believe that year 11 boys will be approximately the same weight and height as the year 11 girls. The spread for the boys will be higher than for the girls because, even though girls experience puberty earlier than boys, the effect that puberty has on boys is larger. Therefore, I believe that, if anything, boys will weigh more and be taller than the girls.

However, height and weight are not directly proportional to each other. This is because you cannot control how much you grow but you can, to some extent, control how much you weigh. Weight can be affected by eating disorders, genetic makeup and activity level. The table below is a two-way table, as it shows two variables (year group and gender) at the same time, which helps to present the data clearly.

Year Group | Number of Boys | Number of Girls | Total |

7 | 151 | 131 | 282 |

8 | 145 | 125 | 270 |

9 | 118 | 143 | 261 |

10 | 106 | 94 | 200 |

11 | 84 | 86 | 170 |

TOTAL | 604 | 579 | 1183 |

I will use stratified sampling to investigate my first hypothesis. This is because it met all my needs for sampling the data: the method is easy to use, the sample can be easily handled, and it only requires a simple understanding of the subject.

The variables for the sample are gender and age, so I had to take separate samples for boys and girls and vary the number taken from each year to keep the sample unbiased and sufficient.

I will be analysing the data by taking a sample, as it would be time consuming and difficult to analyse the whole population of 1183. I will be sampling each of the 10 groups (males and females in each year) separately so that I can make comparisons across year groups and gender. To do this I will need a larger sample than in my pilot study. Therefore, 60 students (30 boys and 30 girls) from each year group should be enough to perform statistical calculations on, which gives me a total sample of 300 students (150 boys and 150 girls). Like my pilot study, this is a stratified sample. I have chosen this method as it takes a proportional number from each group in the population so that each group is fairly represented. Furthermore, I have chosen to sample 300 students as I think that this will be enough to represent the whole population fairly.
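For comparison, a strictly proportional stratified sample would give each group a share of the 300 that matches its share of the 1183 pupils. The sketch below works this out from the two-way table above; it is an illustration of proportional allocation rather than the equal groups of 30 actually chosen here, and the names `POPULATION` and `proportional_sample_sizes` are my own.

```python
# Pupils per group, copied from the two-way table above.
POPULATION = {
    (7, "boys"): 151, (7, "girls"): 131,
    (8, "boys"): 145, (8, "girls"): 125,
    (9, "boys"): 118, (9, "girls"): 143,
    (10, "boys"): 106, (10, "girls"): 94,
    (11, "boys"): 84, (11, "girls"): 86,
}


def proportional_sample_sizes(population, sample_total):
    """Sample size per group = group size / population size * sample total."""
    total = sum(population.values())
    # Rounding to whole pupils; in general the rounded sizes may not add up
    # exactly to sample_total and might need a small manual adjustment.
    return {
        group: round(size * sample_total / total)
        for group, size in population.items()
    }


sizes = proportional_sample_sizes(POPULATION, 300)
for (year, gender), n in sizes.items():
    print(f"Year {year} {gender}: {n}")
print("Total:", sum(sizes.values()))
```

With these figures the rounded sizes range from 21 (year 11 boys) to 38 (year 7 boys) and happen to add up to exactly 300.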

This is ideal for carrying out the statistical calculations and graphs needed on more than one section of the whole population. As my data has already been sorted into alphabetical order (on the Mayfield High School spreadsheet), as shown previously in the first couple of pages, I simply need to return to this and collect 60 random numbers from each year group (giving my sample a total of 300), 30 of them between the highest and lowest male positions and 30 between the highest and lowest female positions. E.g. in Year 7 the highest male position shown in my table above is 279 and the lowest is 133. Therefore, I need 30 numbers in between these two integers.
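The selection step described above could be sketched with Python's built-in `random` module, used here as a stand-in for a random number website. The function name `pick_student_numbers` is hypothetical, and the bounds 133 and 279 are simply the Year 7 example quoted above.

```python
import random


def pick_student_numbers(lowest, highest, how_many, seed=None):
    """Choose distinct positions between lowest and highest (inclusive)."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    # sample() never returns the same position twice, so no pupil is
    # picked more than once.
    return sorted(rng.sample(range(lowest, highest + 1), how_many))


chosen = pick_student_numbers(133, 279, 30, seed=1)
print(chosen)
```

Sampling without replacement matters here: generating 30 independent random integers could return the same pupil twice, which would then need to be replaced by hand.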

I also need to consider the fact that the data I am going to be analysing for this hypothesis is secondary data. Therefore, this could affect the outcome of the results and could result in my hypothesis being false. I may also notice anomalies in my results. For example, someone is 3m tall but weighs 5kg.

The reason why I did not discuss anomalies and outliers in my pilot study was that I wanted to see whether it would be necessary to discuss them in the main study.

An outlier is any value which is 1.5 times the inter-quartile range (or more) below the lower quartile, or 1.5 times the inter-quartile range (or more) above the upper quartile.
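The outlier rule above can be sketched as follows. The quartiles are found with `statistics.quantiles` using its "inclusive" method, which is an assumption on my part: different textbooks and software packages calculate quartiles slightly differently, so the limits may differ a little from those found by hand. The names `outlier_limits` and `find_outliers` are my own, and the example list combines the weights in the data extract above with the 110kg weight mentioned in the pilot study, purely for illustration.

```python
from statistics import quantiles


def outlier_limits(values):
    """Return (lower limit, upper limit) using the 1.5 x IQR rule."""
    lq, _median, uq = quantiles(values, n=4, method="inclusive")
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr


def find_outliers(values):
    """Values lying 1.5 x IQR (or more) beyond the quartiles."""
    low, high = outlier_limits(values)
    return [v for v in values if v <= low or v >= high]


# Weights (kg) from the data extract, plus the 110kg pilot-study value.
weights = [56, 54, 52, 50, 54, 73, 67, 54, 72, 39, 42, 44, 48, 54, 48, 110]

print(outlier_limits(weights))  # (31.875, 74.875)
print(find_outliers(weights))   # [110]
```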

There are two methods for identifying outliers. One is the standard deviation method and the other is the inter-quartile range method. I will be using the inter-quartile range method. I will be removing the outliers because they may distort my results and make my hypothesis appear false when it is true. When using the inter-quartile range method I will be able to spot any outlier because it will be 1.5 times the inter-quartile range (or more) above the upper quartile (UQ) or below the lower quartile (LQ).

I am also going to carry out a test to identify anomalies and decide what to do with them (replace them or leave them as they are), because this could make a difference to the outcome of my results. An anomaly is an impossible value, meaning a value that stands out when compared to the rest of the sample.

I am also going to show the graphs before and after the calculations for the anomalies to show the graphs with the anomalies and also without the anomalies. This is because it shows how the outcome of my results would differ between the two graphs (before and after).

Investigation

For the sampling I am going to use the website randomintegers.org (mentioned before) to randomly select the numbers of the pupils. I will do this by entering how many integers I need (in this case 30 boys and 30 girls) and the values between which the numbers have to lie. I am then going to carry out the following for each group:

- Scatter graphs to show whether or not the two sets of data are related to each other. I expect these to show that the two sets of data are related and that my hypothesis is correct.
- Correlation coefficient to measure the strength of the linear association between the two variables (height and weight). The scatter graph will support this, and I expect the correlation to show that there is quite a strong relationship between height and weight.
- Line of best fit on the scatter graphs, to show the model of association between the two variables. It is drawn so that the plotted points on a scatter diagram are evenly scattered on either side of the line.
- Standard deviation for heights (or weights) for each group. Standard deviation measures the variation around the mean value. Some values will be below the mean, some above, and some will be equal to the mean. So, some of the differences between the individual measurements and the mean will be positive, some negative, some zero.

Positive | More than the mean |

Negative | Less than the mean |

Zero | Equal to the mean |

- Minimum, lower quartile, median, upper quartile, inter-quartile range and maximum for weights (or heights).

- Inter-quartile Range – This is also a measure of spread but looks at the spread of the middle 50% of the data around the median. It is found by subtracting the lower quartile from the upper quartile (calculating UQ-LQ).
- Box and whisker plots: As well as an average, such as the mean, I need a measure of the spread of the data about the average if I am going to describe it in more detail. From this I can find the range and inter-quartile range and produce these diagrams. A box and whisker diagram can be drawn to represent important features of the data, e.g. the maximum and minimum values, the median and the upper and lower quartiles. I expect the figures to increase from one year group to the next, which would show that my hypothesis is correct.

[Figure: a simple example of a box and whisker plot]
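The measures of spread listed above (the differences from the mean, the standard deviation, and the five values needed for a box and whisker plot) can be sketched for one group as follows. This assumes the population form of the standard deviation (`statistics.pstdev`, dividing by n) and the "inclusive" quartile method, both of which are my own choices, and the name `spread_summary` is hypothetical. The example weights are the first five in the data extract above.

```python
from statistics import mean, median, pstdev, quantiles


def spread_summary(values):
    """Mean, differences from the mean, standard deviation and quartiles."""
    avg = mean(values)
    # Positive = more than the mean, negative = less, zero = equal to it.
    differences = [v - avg for v in values]
    lq, _, uq = quantiles(values, n=4, method="inclusive")
    return {
        "mean": avg,
        "differences": differences,
        "standard_deviation": pstdev(values),
        "minimum": min(values),
        "lower_quartile": lq,
        "median": median(values),
        "upper_quartile": uq,
        "inter_quartile_range": uq - lq,
        "maximum": max(values),
    }


weights = [56, 54, 52, 50, 54]  # kg, first five weights in the data extract
summary = spread_summary(weights)
for name, value in summary.items():
    print(name, value)
```

The differences from the mean always add up to zero (apart from rounding), which is why the standard deviation squares them before averaging.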


Evaluation

My results are not very reliable for drawing general conclusions. This is because I do not know enough about Mayfield High School to make assertions about teenagers in general. It would also be better to track one year group over time instead of using different pupils from each year group.

Furthermore, I cannot apply my findings to the whole school as my samples of pupils were not big enough. The findings could be used to make assumptions about my own school, but I would prefer not to because it is a completely different school and the hypothesis may prove to be completely false there.

I think that my sample was representative considering the fact that there were 1183 pupils at Mayfield High School. However, as the data was secondary data, I was limited to using the data given, and this limited my results: any error or miscalculation may have been down to some of the data being incorrect or missing. If I were to do this coursework again, I would check from the beginning whether any data was missing or inaccurate. This may be difficult, as the database is secondary data and it will therefore be hard to find and remove these problems. However, as it would be my second time doing the project, I would be familiar with what to do and the project would take me less time to complete. I could therefore spend more time searching for and correcting these problems.

