Statistics GCSE Coursework. Height and weight of pupils.


Pilot Study

A pilot study is used to test the viability of research. It is a study done prior to the main test, enabling researchers to improve the design of the main study and so make the outcome of the test more reliable. To select the data for my pilot study I took a sample of the data. The reason for doing this is to cut down time, as otherwise I would have to sift through all of the data to delete anomalies, and a well-chosen sample should still be representative of the population. Working with a smaller set of data also makes it easier to check each value, so the results I obtain should be more reliable. However, there are some blanks in the data and some of the data provided is clearly inaccurate (e.g. a year 11 male pupil who is 1.69m tall and weighs 5kg). This data is physically impossible; if used, it would make my results less reliable and my conclusion would not be valid. To overcome this problem, whenever a selected record is blank or impossible I will select the next reliable result in the datasheet.

        The sampling method I am going to use is stratified sampling. This method is appropriate as the data is already split into strata (year group and gender). I chose this sampling method as it has several benefits. When the data is divided into strata I can make deductions that might not have been possible if I took a simple random sample; for my hypothesis I can see whether height or weight depends on gender and draw conclusions from that. Stratified sampling also increases the accuracy of the estimates made from my data, as each stratum is treated as an individual population and is therefore represented proportionally in the overall sample. This means that each year group and gender is represented fairly. There is still no guarantee that the sample will be representative, as a large number of abnormal observations could be selected by chance; however, this is rarely the case, and if the sample is not representative it should be discarded.

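        To take a stratified sample from the 1183 students I used this formula:

Number sampled from a stratum = (number of pupils in the stratum ÷ total number of pupils) × sample size

Here the total number of pupils is 1183 and the sample size is 100. I combined this with the data in the table below to work out how many people should be in each stratum: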
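The proportional calculation described above can be sketched in Python as follows. This is only an illustration and not the working I did in Excel: the stratum sizes are made-up numbers chosen to total 1183, the year-group labels are assumed, and the way leftover places are shared out (largest remainders first) is an assumption, since simple rounding can leave the total slightly above or below the intended sample size.

```python
def stratified_allocation(stratum_sizes, sample_size):
    """Return how many pupils to sample from each stratum, in proportion to its size."""
    population = sum(stratum_sizes.values())
    # Exact (unrounded) share of the sample for each stratum.
    exact = {name: size / population * sample_size
             for name, size in stratum_sizes.items()}
    # Start from the whole-number part of each share.
    allocation = {name: int(share) for name, share in exact.items()}
    # Give the leftover places to the strata with the largest fractional parts,
    # so that the allocations add up to the intended sample size.
    remaining = sample_size - sum(allocation.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - allocation[name],
                         reverse=True)
    for name in by_fraction[:remaining]:
        allocation[name] += 1
    return allocation


# Hypothetical stratum sizes (NOT the real school figures); they total 1183.
example_sizes = {
    ("Year 7", "Male"): 150, ("Year 7", "Female"): 130,
    ("Year 8", "Male"): 145, ("Year 8", "Female"): 125,
    ("Year 9", "Male"): 120, ("Year 9", "Female"): 140,
    ("Year 10", "Male"): 105, ("Year 10", "Female"): 95,
    ("Year 11", "Male"): 85, ("Year 11", "Female"): 88,
}

allocation = stratified_allocation(example_sizes, 100)
for stratum, count in allocation.items():
    print(stratum, count)
```

Each stratum ends up within one pupil of its exact proportional share, and the counts always add up to the chosen sample size.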

        I then took a sample of the data in the table based on the number of people in each stratum and collected the data using Microsoft Excel. I made sure the data I selected from the spreadsheet was random by using the RAND() function in Excel to generate a random number for each pupil. I then sorted the data by the random number (and then by year group and gender) to make sure the results were random. The height and weight of each student were kept attached to the random number, so the sort I performed moved each pupil’s whole row rather than rearranging the sort column only. Within each stratum this method is simple random sampling, as each member has an equal chance of being selected because the numbers assigned to them are random. This means the pupils I obtain should be representative of each stratum, and so the strata together should be representative of the data as a whole. As mentioned previously, to reduce the chance of the sample being unrepresentative of the population (through a large number of untypical observations), any data that is missing or physically impossible will not be included in my pilot study, so the outcome should be valid.
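The random-number sort described above could be imitated in Python roughly as shown here. This is a sketch rather than the Excel method itself: the field names `height_m` and `weight_kg`, the example rows, and the plausibility limits are all assumptions made for illustration.

```python
import random


def is_plausible(pupil):
    """Reject rows with blanks or physically impossible measurements.

    The limits used here are illustrative assumptions, not values from the coursework.
    """
    height, weight = pupil.get("height_m"), pupil.get("weight_kg")
    if height is None or weight is None:
        return False
    return 1.0 <= height <= 2.2 and 20 <= weight <= 150


def sample_stratum(pupils, how_many, rng=random):
    """Pick `how_many` pupils from one stratum by sorting on a random key."""
    # Attach a random number to each whole row, like a RAND() helper column.
    keyed = [(rng.random(), pupil) for pupil in pupils]
    # Sort by the random number only; each pupil's data travels with its key.
    keyed.sort(key=lambda pair: pair[0])
    # Walk down the shuffled rows, skipping unusable ones and taking the next usable one.
    usable = [pupil for _, pupil in keyed if is_plausible(pupil)]
    return usable[:how_many]


# Hypothetical rows for one stratum; the 5 kg entry mirrors the impossible record noted earlier.
year11_males = [
    {"height_m": 1.69, "weight_kg": 5},    # physically impossible, skipped
    {"height_m": 1.75, "weight_kg": 62},
    {"height_m": None, "weight_kg": 58},   # blank height, skipped
    {"height_m": 1.68, "weight_kg": 55},
    {"height_m": 1.80, "weight_kg": 70},
]

picked = sample_stratum(year11_males, 2, random.Random(1))
print(picked)
```

Because the random key is paired with the whole row before sorting, a pupil's height and weight can never be separated from each other by the sort.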

        

Another method I could have used for this was systematic sampling. This sampling method has a random start point and then selects every nth item after that point. However, this sampling method can be unrepresentative of the population, as a hidden pattern in the data might coincide with the sampling interval, meaning that the result I obtain could not be trusted. For that reason I chose to use simple random sampling to collect my data.
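For comparison, systematic sampling as described above might look like this in Python. It is a minimal sketch under my own assumptions: the interval is taken as the list length divided by the sample size (rounded down), and the list of numbers simply stands in for rows of the datasheet.

```python
import random


def systematic_sample(items, how_many, rng=random):
    """Take every nth item after a random starting point."""
    step = len(items) // how_many      # the "n" in "every nth item"
    if step < 1:
        raise ValueError("sample size is larger than the list")
    start = rng.randrange(step)        # random start within the first interval
    return items[start::step][:how_many]


rows = list(range(1, 104))             # stand-in for 103 rows of a datasheet
sample = systematic_sample(rows, 10, random.Random(3))
print(sample)
```

If the datasheet happened to repeat in blocks of the same length as the interval (for example, the same ordering of pupils every ten rows), every selected row would fall at the same position in its block, which is the hidden-pattern problem mentioned above.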

For the data I collected through the stratified sampling I was able to produce a scatter graph to see if there is any correlation between the two variables:

To produce my graph I plotted the independent variable (height) against the dependent variable (weight), as my hypothesis is that the taller you are the more you weigh. As you can tell from my graph, there is a clear correlation between height and weight. This correlation is positive, meaning that as height increases weight also increases (supporting my hypothesis).

I have plotted a line of best fit on my graph, which allows me to make approximate predictions of the weight of a person based on their height. For example, from my line of best fit I can predict that a person who is 1.6 metres tall should weigh 51.2kg (1dp).

        I was able to produce an equation for my line of best fit as follows:

y= -5.6572+35.5356x

In simple terms:

Weight (kg) = -5.6572 + (35.5356 x height in metres)

        The equation can be interpreted as follows: the gradient 35.5356 indicates that for every metre gained in height 35.5356kg is gained in weight, so for every 0.1m, 3.55kg (2dp) is gained in weight.
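The arithmetic behind the prediction and the gradient can be checked with a few lines of Python. The intercept and gradient are the values from my equation; the function name `predicted_weight` is just a label chosen for this sketch.

```python
INTERCEPT = -5.6572   # kg, from the line of best fit
GRADIENT = 35.5356    # kg gained per metre of height


def predicted_weight(height_m):
    """Weight in kg predicted by the line of best fit for a height in metres."""
    return INTERCEPT + GRADIENT * height_m


print(round(predicted_weight(1.6), 1))   # 51.2
print(round(GRADIENT * 0.1, 2))          # 3.55
```

The line should only be used for heights within the range of the sample, as it says nothing reliable about pupils much shorter or taller than those measured.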

        

        The correlation of the data tells us how strong the linear association is between two variables allowing me to see if there are grounds for further study.

        For the final part of my pilot study I decided to find the correlation coefficient of my data, as it would easily tell me if there are grounds for further investigation: the answer indicates how strong a correlation there is (if any). The answer should be within the range of -1 to 1, and the correlation can be described using this rough guide:

-1    = Perfect Negative Correlation                1 = Perfect Positive Correlation

-0.8 = Good Negative Correlation                0.8 = Good Positive Correlation

-0.5 = Some Negative Correlation                0.5 = Some Positive Correlation

0 = No Correlation

The equation for Spearman’s rank correlation coefficient is as follows (here r is the correlation):

r = 1 - (6 × Σd²) ÷ (n(n² - 1))

To obtain a result I first ordered the “x” observations, giving the largest value rank 1, the second largest rank 2, etc. However, as some of the data holds the same value, I had to assign tied values the same rank: I added together the ranks they would have occupied and divided by how many tied values there were. I then did the same for the “y” observations. I created a column entitled d for the difference between the two ranks and another entitled d² for the square of the differences, which I need for the equation as you can see above. In this equation n is the number of data items I have, which is 100 as I took a sample of 100 pupils (as previously explained), and n² is the number of data items squared. The table I produced looked like this:


I inputted the results I got from the above table into the equation to get:

The answer I got is 0.4 (to 1dp). This shows there is some positive correlation (when used in conjunction with the table on page 4), so I do have some grounds for further investigation. The answer also suggests a positive association between the two variables. Based on the value of r² (0.42108310812² = 0.177310984) therefore I can say that with a 17% chance that from any point on the line of best fit an increase in height will ...
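The ranking steps and the Spearman’s rank calculation described above could be sketched in Python as follows. This is an illustration only: the function names are my own, the short lists of heights and weights are invented rather than taken from my sample of 100, and when there are tied ranks the formula gives only an approximate value.

```python
def average_ranks(values):
    """Rank values with the largest as rank 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    position = 0
    while position < len(order):
        # Find the run of tied values starting at this position.
        end = position
        while end + 1 < len(order) and values[order[end + 1]] == values[order[position]]:
            end += 1
        # Positions are 0-based and ranks 1-based: average the ranks position+1 .. end+1.
        shared = (position + 1 + end + 1) / 2
        for i in order[position:end + 1]:
            ranks[i] = shared
        position = end + 1
    return ranks


def spearman_r(xs, ys):
    """r = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference in ranks."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Invented heights (m) and weights (kg), for illustration only.
heights = [1.52, 1.60, 1.60, 1.71, 1.75]
weights = [45, 50, 48, 66, 60]
print(average_ranks(heights))
print(spearman_r(heights, weights))

# Squaring the coefficient reported in the text reproduces the r² value quoted there.
r = 0.42108310812
print(round(r ** 2, 9))
```

With the full sample, the same two functions would be applied to the 100 heights and 100 weights in place of the invented lists.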
