Stratified Sample
The entire secondary data supplied to me comprises 1183 pupils and 31941 data points. This data set is too large for me to handle in this investigation. I have, therefore, decided to analyse a sample of the secondary data of the pupils from Mayfield High School.
I have decided that an appropriate sample is 50 pupils as it is a more manageable set of data, large enough to be a fair representation, but small enough to handle. I considered that, since both my hypotheses may be affected by age, it was appropriate to use a stratified sample so that each gender in each year group was fairly represented.
Using a stratified sample is very important in the case of both of my investigations because weight and IQ are likely to vary over time. Each year group should, therefore, be fairly represented as, between them, they represent a period of five years. Proper representation of all the year groups may counteract anomalies to the hypothesis such as year 11 pupils who used to watch a lot of TV (and have therefore gained a lot of weight) but, because they are in an examination year, may now only watch a few hours.
Because a stratified sample was important, I did not consider that a pilot survey would be appropriate. A pilot survey of, for example, 10 pupils, would not be likely to give a proper representation of each gender in each year group. Whatever the results of the pilot survey, I would still have had to carry out the full investigations into each hypothesis using the bigger sample of 50 pupils.
For each hypothesis, I have analysed the same stratified sample which I produced by the method given below.
Producing the Stratified Sample
So that my results are not biased towards one particular gender, I ensured that I took a representative number of pupils from each gender in each year group. This is particularly important in the case of the investigation into the weight of pupils, because it might be said that girls usually weigh less than boys: if that is true, then it is appropriate to have both genders fairly represented. The numbers of pupils taken from each gender in each year group were calculated according to the table below.
[Table: number of pupils sampled from each year group, with columns for Girls and Boys]
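The calculation behind a table of this kind is a proportional share: each group's size divided by the total number of pupils, multiplied by the sample size of 50. The sketch below illustrates that idea only. The group sizes and the Year 7 to Year 11 labels are hypothetical (they are not the Mayfield figures, although they are chosen to add up to 1183), and the rule for handing out the places left over after rounding down is one reasonable choice among several.

```python
def allocate(strata_sizes, sample_size):
    """Share sample_size across groups in proportion to their sizes."""
    total = sum(strata_sizes.values())
    # Exact (fractional) share for each group.
    exact = {k: n * sample_size / total for k, n in strata_sizes.items()}
    # Round every share down first.
    alloc = {k: int(v) for k, v in exact.items()}
    # Give the remaining places to the groups with the largest fractional parts.
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical group sizes for illustration only (they total 1183).
strata = {
    ("Year 7", "Girls"): 130, ("Year 7", "Boys"): 150,
    ("Year 8", "Girls"): 125, ("Year 8", "Boys"): 145,
    ("Year 9", "Girls"): 140, ("Year 9", "Boys"): 120,
    ("Year 10", "Girls"): 95, ("Year 10", "Boys"): 105,
    ("Year 11", "Girls"): 85, ("Year 11", "Boys"): 88,
}

allocation = allocate(strata, 50)
for group, count in allocation.items():
    print(group, count)
```

Whatever rounding rule is used, the counts should be checked to confirm that they add up to exactly 50.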
I created, from my secondary data, separate charts for each gender and year group and numbered each pupil within that group according to their alphabetical order. To ensure that there was no bias, the actual pupils were selected using a random number generator, which is a function available in Excel. The random number generated was always between 0 and 1. I, therefore, multiplied each number generated by the number of pupils in the relevant gender and year group to ensure that each pupil was capable of being selected. If the same number in the same group was generated twice, which did happen once, I selected another random number using the random number button on my calculator and, again, multiplied it by the number of pupils in that gender and year group.
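The selection procedure described above can be sketched in a few lines. This is an illustration, not the spreadsheet actually used: the function name `pick_pupils` is hypothetical, Python's random number generator stands in for the Excel and calculator functions, and the step that turns the scaled random number into a whole pupil number (round down, then add 1) is an assumption.

```python
import math
import random


def pick_pupils(group_size, how_many, rng=random):
    """Pick distinct pupil numbers from 1..group_size using scaled random numbers."""
    if how_many > group_size:
        raise ValueError("cannot pick more pupils than the group contains")
    chosen = []
    while len(chosen) < how_many:
        r = rng.random()                          # random number with 0 <= r < 1
        number = math.floor(r * group_size) + 1   # assumed rounding: gives 1..group_size
        if number not in chosen:                  # a repeated number is drawn again
            chosen.append(number)
    return chosen


# Example: 6 pupils from a hypothetical group of 150.
print(pick_pupils(150, 6))
```

Because every pupil number from 1 to the group size corresponds to an equal slice of the interval from 0 to 1, each pupil in the group has the same chance of being picked.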
In this way I produced my stratified sample of 50 pupils, which I have set out at Appendix 1. The number column in Appendix 1 gives the number of the pupil within that year and gender group, as selected through the random number generation process.
Analysis of Stratified Sample
Because both of my hypotheses predict a connection between the two sets of data in each hypothesis, I decided that, for each hypothesis, I should present the data on a scatter graph.
I used Excel to create both scatter graphs, and in each scatter graph I inserted a line of best fit. At Appendix 2, I show the scatter graph created for investigating my first hypothesis (TV vs Weight), and at Appendix 3, I show the scatter graph created for investigating my second hypothesis (TV vs IQ).
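A straight line of best fit of this kind is normally found by least squares, which is the method a linear trendline in Excel uses. As a minimal sketch of that calculation (the function name and the small data set are made up for illustration and are not taken from my sample):

```python
def line_of_best_fit(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = gradient * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x   # the line passes through the mean point
    return gradient, intercept


# Made-up example: hours of TV per week against weight in kg.
hours = [5, 10, 14, 20, 25]
weights = [52, 47, 55, 49, 50]
print(line_of_best_fit(hours, weights))
```

The gradient depends on the units on each axis, so on its own it shows the direction of a trend, not how strong the correlation is.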
In both my scatter graphs, one pupil (the same in each graph) would appear to be an anomaly in that he is recorded as watching 79 hours of TV per week, which is excessive by comparison with the other pupils. However, there is not sufficient information to suggest that this is dirty data: although the figure is excessive, it is possible. I have, therefore, kept this pupil in my sample and used his data for my calculations, treating him simply as an outlier.
The scatter graphs in both investigations (see below) showed possible but weak correlations. To test further whether there was a correlation in each case, I used Spearman’s coefficient of rank correlation. At Appendix 4, I set out the table from which I calculated Spearman’s Rank using the data in the first hypothesis (TV vs Weight). At Appendix 5, I show the table from which I calculated Spearman’s Rank using the data in the second hypothesis (TV vs IQ).
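For reference, Spearman’s Rank is usually worked out by ranking each set of data, finding the difference d between the two ranks for each pupil, and applying the formula 1 − 6Σd² / (n(n² − 1)). The sketch below follows that standard method. It is not a copy of the tables in the appendices; the function names are illustrative, and giving tied values the average of the ranks they share is an assumption about how ties are handled. The formula is exact only when there are no ties and approximate otherwise.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upwards; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1   # mean of the rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation using 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Made-up example data, not taken from the sample.
print(spearman([5, 10, 14, 20, 25], [52, 47, 55, 49, 50]))
```

A value near +1 or −1 indicates a strong correlation, and a value near 0 indicates little or none.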
RESULTS AND ANALYSIS
The First Hypothesis – comparing time spent watching TV with weight
From the scatter graph at Appendix 2 we see that there is, contrary to my hypothesis, a possible weak negative correlation. The correlation is negative because the line of best fit slopes downwards (negative gradient), which would appear to show that people who watch more TV weigh less. The correlation is weak because the points on the graph are well spread out from the line of best fit.
However, the extremely slight gradient of the line of best fit tends to suggest that there is very little correlation at all between these two sets of data; i.e. the gradient of –0.26 is hardly large enough to suggest a correlation. I, therefore, decided to check whether there was a correlation using Spearman’s coefficient of rank correlation.
At Appendix 4, I show how I have calculated the Spearman’s Rank value comparing time spent watching TV with weight. The Spearman’s Rank value of –0.14369 also suggests only a weak negative correlation.
The Second Hypothesis – comparing time spent watching TV with IQ
From the scatter graph at Appendix 3 we see that there is, in support of my hypothesis, a weak positive correlation. The correlation is positive because the line of best fit slopes upwards (positive gradient), which suggests that people who watch more TV have higher IQs. The correlation is weak because the points on the graph are spread out from the line of best fit.
However, the extremely slight gradient of the line of best fit tends to suggest that there is very little correlation at all between these two sets of data; i.e. the gradient of 0.15 is hardly large enough to suggest a correlation. I, therefore, decided to check whether there was a correlation using Spearman’s coefficient of rank correlation.
At Appendix 5, I show how I have calculated the Spearman’s Rank value comparing time spent watching TV with IQ. The Spearman’s Rank value of 0.05348 shows an extremely weak positive correlation. Although I predicted a positive correlation, the weakness of the Spearman’s Rank value suggests that there is almost no correlation at all between the two sets of data.
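One rough way to judge how weak these two values are is to convert each one into a t-statistic using t = r√((n − 2) / (1 − r²)) and compare its size with the 5% two-tailed critical value for n − 2 = 48 degrees of freedom, which is about 2.01. This check was not part of my investigation and is only an approximation for rank correlation; it is sketched here under those assumptions, with an illustrative function name.

```python
import math


def t_statistic(r, n):
    """Approximate t-statistic for a correlation coefficient r from n pairs of data."""
    return r * math.sqrt((n - 2) / (1 - r * r))


CRITICAL_T = 2.01   # approximate 5% two-tailed value for 48 degrees of freedom

for label, r in [("TV vs Weight", -0.14369), ("TV vs IQ", 0.05348)]:
    t = t_statistic(r, 50)
    verdict = "exceeds" if abs(t) > CRITICAL_T else "does not exceed"
    print(f"{label}: t = {t:.2f}, which {verdict} the critical value")
```

For both hypotheses the size of t falls well short of 2.01, which is consistent with reading both correlations as too weak to distinguish from chance in a sample of 50.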
CONCLUSION
The results suggest a weak negative correlation in the data comparing time spent watching TV with weight, i.e. that those pupils who spend more time watching TV weigh less. Although weak, the correlation is noticeably stronger than the positive correlation found in the data comparing time spent watching TV with IQ. I conclude that there is no real correlation to speak of between IQ and time spent watching TV, and only a weak (negative) correlation between weight and time spent watching TV.
It must be remembered that these results have been drawn from a sample only. More conclusive results are likely to be obtained from a larger collection of data than the 50 pupils in my sample.
I am surprised by my finding of a negative correlation in the TV vs Weight comparison. Possible explanations for such a correlation, if it is genuine (ignoring factors such as dirty data and the limited sample), may include the fact that there is a lot of sport on television and it is possible that the more athletic pupils watch it.
However, my investigations into both hypotheses were limited and were always unlikely to prove a link between the two sets of data in each case, and even less likely to prove a direct causal relationship. For instance, I did not take into account external factors which could have affected my results.
External factors which I could have taken into account to improve both my investigations include age, height, gender, method of travel to school and programmes watched. Most of these factors were available to me in the secondary data supplied. I would have had to carry out further primary investigation (e.g. through questionnaires) into the types of programme watched if I wanted to discover further possible explanations for any correlations found. A more conclusive investigation is likely to result if this additional material is taken into account, and it might then support a direct causal relationship.