I chose two stratified samples because a stratified sample allows each gender to be represented fairly, taking proportion into account, so if there are twice as many males as females in the population, my sample will also have twice as many males. I shall do one stratified sample for Year 7 and one stratified sample for Year 11. My sample size shall be 50 people. I chose 50 because it is big enough to represent the population fairly but not so big that drawing the diagrams becomes too time-consuming.
To do a stratified sample I will split my population into two, one side for males and one side for females, and label the data in each side numerically in ascending order. I will count how many pieces of data are in each gender. I will then find the fraction of the population that each gender makes up, so for example 1/3 are boys and 2/3 are girls (a ratio of 1:2). I will then multiply each gender's fraction by my overall sample size (if my answer is a decimal I shall round it to a whole number); for example, with an overall sample size of 60, 60 × 1/3 = 20. This will give me my gender sample size. After this I will use a random number generator to generate a random sequence from 1 to the number of pieces of data in that gender. From this sequence I will take as many numbers as my gender sample size, starting from the top (so in my example I would select the first 20 numbers in the random sequence). I will then select the pieces of data in that gender whose labels match these numbers. I shall then do the same for the other gender in the same year and merge the two samples together to form a stratified sample of one year group. Then I shall repeat all of the previous steps for my other year.
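The steps above can be sketched in Python. This is only an illustration: the function name `stratified_sample`, the dictionary layout and the example numbers are my own assumptions, and it assumes each gender's share of the sample is its fraction of the year group multiplied by the overall sample size.

```python
import random


def stratified_sample(strata, total_sample_size, seed=None):
    """Draw a proportional stratified sample.

    strata: dict mapping a stratum name (e.g. "male") to its list of data.
    total_sample_size: overall sample size for the year group.
    """
    rng = random.Random(seed)
    population_size = sum(len(values) for values in strata.values())
    sample = {}
    for name, values in strata.items():
        # Stratum share = (stratum size / population size) * overall sample size,
        # rounded to a whole number of people.
        k = round(total_sample_size * len(values) / population_size)
        # rng.sample picks k distinct items at random, which plays the role of
        # taking the first k numbers of a random sequence.
        sample[name] = rng.sample(values, k)
    return sample


# Hypothetical year group: 20 boys and 40 girls (a 1:2 ratio).
year_group = {
    "male": [("M", i) for i in range(1, 21)],
    "female": [("F", i) for i in range(1, 41)],
}
picked = stratified_sample(year_group, total_sample_size=30, seed=1)
merged = picked["male"] + picked["female"]
```

Because each stratum size is rounded separately, the merged sample can occasionally be one more or one less than the intended total, so it is worth checking the final count.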
I will do the outlier test because I do not want my data to be spread out severely by 1 or 2 extreme values that would change my average and the spread of my data. I shall do the outlier test on my sample. To do the outlier test I will take the lower quartile away from the upper quartile to find the interquartile range (IQR). Then I will add 1.5*IQR to the upper quartile to get my upper bound. To calculate my lower bound I will take 1.5*IQR away from the lower quartile. Any number that is beyond these bounds will be deemed an outlier, and on my box plot I will mark the outliers with an X.
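As a rough check of the outlier test, here is a short Python sketch. The function names and the example estimates are made up for illustration. It uses `statistics.quantiles` from the standard library, whose default quartile convention may give slightly different quartiles from a method done by hand.

```python
import statistics


def outlier_bounds(data):
    """Return (lower bound, upper bound) using the 1.5 * IQR rule."""
    lower_quartile, _median, upper_quartile = statistics.quantiles(data, n=4)
    iqr = upper_quartile - lower_quartile
    return lower_quartile - 1.5 * iqr, upper_quartile + 1.5 * iqr


def find_outliers(data):
    """List every value that lies beyond the bounds."""
    lower, upper = outlier_bounds(data)
    return [x for x in data if x < lower or x > upper]


# Hypothetical estimates of 60 seconds, with one extreme value.
estimates = [52, 55, 57, 58, 60, 61, 63, 65, 120]
bounds = outlier_bounds(estimates)
outliers = find_outliers(estimates)
```

For this made-up data the quartiles are 56 and 64, so the IQR is 8, the bounds are 44 and 76, and only the value 120 is flagged.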
From these two samples I will draw 2 box plots. Box plots are good because they allow me to compare the average of a group of data (median), the consistency or spread of a group of data (IQR and range), and the shape of a group of data (skew).
I want both box plots to be on the same scale so I am able to compare the two directly. I also want one year group to have a considerably higher median than the other, by at least 5 seconds; furthermore it should be more consistent, so it should have a smaller IQR. Upon looking at the box plots I will see whether there is a difference, whether one year group is clearly better than the other, and whether I will reject my hypothesis or not. If I accept hypothesis 1 then I will investigate my second hypothesis (the better you are at estimating 30 seconds, the better you are at estimating 60 seconds) and my third hypothesis using Year 11s, because I believe that, being better at estimating periods of time, they will show a better correlation between the two estimates. If my box plots show that Year 7s are better at estimating 30/60 seconds then I shall use Year 7s for hypotheses 2 and 3.
HYPOTHESIS 2
The better you are at estimating 30 seconds the better you are at estimating 60 seconds
I am interested in investigating this hypothesis because I believe that a person's estimates of two different periods of time, i.e. 30 and 60 seconds, will be correlated, so if someone is bad at estimating 30 seconds then their estimate for 60 seconds will also be bad. I believe this because estimating a time period relies on a technique, such as counting by repeating a word, like saying "elephant" 60 times for 60 seconds. If this technique results in underestimation for 60 seconds then it will also result in underestimation for 30 seconds, as the problem is in the technique.
I shall be using the same stratified sample that I collected above, as mentioned in my acceptance criteria. I shall use the year group that is better at estimating 60 seconds, as they have a better grasp of time and I believe they would suit my hypothesis. Again I want to represent gender in proportion, because I don't want gender to have an effect on my results; my hypothesis is general, that anyone who is good at estimating 30 seconds is good at estimating 60 seconds. This is why I chose to use my previous stratified sample. Because I am using my previous data, I have already taken the precautions of removing bias, understanding the nature of my data, deciding how to take the sample and doing the outlier test.
The diagram that I shall draw will be a scatter graph. A scatter graph shows bivariate data (two pieces of data that come from one source and cannot be separated from each other). A scatter graph allows me to see visually whether a set of data is positively correlated, negatively correlated or not correlated. I shall draw a diagram for one year group (the year group I accept from H1) and will have the 30 second estimate on one axis and the 60 second estimate on the other. To draw a scatter graph I have to draw axes with a suitable and consistent scale and then plot each piece of bivariate data on my graph. If the scatter graph clearly shows no correlation then I shall reject my hypothesis. If I see some correlation or am in doubt then I will want a better way of measuring the correlation than a mere visual look, namely a numeric value. To do this I shall use Spearman's Rank Correlation Coefficient (SRCC).
Spearman's rank is good as it gives me a numeric value for the correlation and allows me to see if the correlation is positive or negative. My value will be between -1 and +1. -1 means that the data is perfectly negatively correlated, 0 means that the data has no correlation and +1 means the data is perfectly positively correlated. To do Spearman's rank you first rank the 30 second estimates from smallest to largest (1, 2, 3 and so on) and separately rank the 60 second estimates in the same way. Then for each person you find the difference between their two ranks (the rank of their 30 second estimate take away the rank of their 60 second estimate). You do this for each piece of bivariate data. Some differences may be negative, so to get rid of the negatives we square each difference. Then we find the sum of all the differences squared. We then times the sum by 6 to get a "number". We divide this number by n(n^2-1), where n represents the sample size. So you do the sample size squared, take away 1, then times by the sample size; you do the bracket first using BIDMAS. So you do the "number" divided by n(n^2-1). After this you get a decimal. You then do 1 take away the decimal to get the answer. The answer will be between -1 and +1. I will accept my hypothesis if my SRCC is above 0.6 or below -0.6, because 0.5 shows a medium correlation and I need a clear number beyond this to show there is strong correlation.
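The calculation can be sketched in Python as below. This is an illustration only: the function names and the five pairs of estimates are invented, and it assumes the differences are taken between ranks, which is how Spearman's coefficient is defined. Tied values are given the average of the ranks they share; the simple formula is only exact when there are no ties.

```python
def ranks(values):
    """Rank values from smallest (rank 1) upwards; ties share an average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """SRCC = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is a rank difference."""
    n = len(xs)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical estimates from five people (30 second and 60 second tasks).
thirty = [25, 28, 30, 33, 35]
sixty = [50, 58, 55, 64, 70]
srcc = spearman(thirty, sixty)
```

For these made-up numbers the rank differences are 0, -1, 1, 0, 0, so the sum of the differences squared is 2 and the SRCC is 1 - 12/120 = 0.9.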
HYPOTHESIS 3
Year 11 data follows the normal distribution.
For hypothesis 3 I want to see if the Year 11 data (or the Year 7 data, if I reject H1) is normally distributed. I am interested in investigating this hypothesis as a normal distribution looks at how much data is within 1, 2 and 3 standard deviations of the mean (I shall explain this later). The normal distribution models how data is spread out and is found in nature (e.g. the size of people's index fingers is normally distributed). Because the normal distribution is a model, I shall use a quota sample, as the gender split nationally is roughly equal and a quota sample gives each gender an equal chance of being represented. Most data in a normal distribution is close to the mean (central tendency), with a few pieces of data far above it and a few pieces of data far below it. Due to this, the normal distribution is in the shape of a bell curve, as shown in the picture.
To create a quota sample I shall use my population from Year 11 that I collected from Winston Churchill. The reason I won't collect a new population is that this would be a waste of time when I have already collected a large amount of data that represents my population. My sample size for my quota sample is 60. The reason this sample size is bigger than my previous sample sizes is that I am using the normal distribution to represent the nation, so a bigger sample would represent it more accurately. As mentioned before, my population has no biased data in it. I shall split my Year 11 population into two, one part for female Year 11s and one for male Year 11s, and then I will number each part in order, e.g. 1, 2, 3, and this will be my gender population number. I shall take the male sample first, so I shall use a random number generator to generate a random sequence from 1 to the maximum gender population number of the male population. From this I shall take the first 30 numbers. The reason it is 30 is that my overall sample size is 60, so for a quota sample I do 60 divided by the number of genders (2), and this is my sample size for each gender. I shall use the 30 random numbers to identify the pieces of estimation data that correspond to them. I shall do the same for the other gender and then merge the two together to create a quota sample for Year 11. On my sample I shall do an outlier test as I have shown previously. I will do this because I don't want extreme values skewing my data, so that when I draw the normal distribution the curve is not skewed to one side because of a single piece of data.
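A minimal Python sketch of this equal-split sample follows. The function name `quota_sample` and the group sizes are hypothetical; the only thing taken from the plan is that the overall sample of 60 is divided equally between the two genders.

```python
import random


def quota_sample(groups, total_sample_size, seed=None):
    """Take the same number of randomly chosen items from every group."""
    rng = random.Random(seed)
    per_group = total_sample_size // len(groups)  # e.g. 60 // 2 = 30
    sample = {}
    for name, values in groups.items():
        # A random ordering of the whole group, like a random sequence of labels.
        random_sequence = rng.sample(values, len(values))
        # Keep the first per_group items of that sequence.
        sample[name] = random_sequence[:per_group]
    return sample


# Hypothetical Year 11 population: 45 males and 70 females.
year_11 = {
    "male": [("M", i) for i in range(1, 46)],
    "female": [("F", i) for i in range(1, 71)],
}
quota = quota_sample(year_11, total_sample_size=60, seed=2)
quota_merged = quota["male"] + quota["female"]
```

Unlike the stratified sample, the group sizes here do not depend on how many males and females are in the population.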
To draw a normal distribution from my sample I need to find the standard deviation of my data. Standard deviation shows how far my data is from the mean. For example the mean of 29, 30, 31 is 30, and 20, 30, 40 also has a mean of 30, but the first set of numbers is closer to the mean so it has a smaller standard deviation. The first step is to find the mean of the data. You do this by adding all the pieces of data in the sample and dividing by the sample size. Then you take the mean away from each individual piece of data and square the answer; this is called the difference squared. You repeat this step for each individual piece of data. After this you add all the differences squared to get the sum of the differences squared, Σ(x - x̄)^2, where x̄ is the mean. Then we take the sample size, so 60 in my quota sample, and take 1 away from it to get 59. Afterwards we do Σ(x - x̄)^2 divided by the sample size take away 1 (59) to get a value. Finally we square root this value to get the standard deviation (SD).
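The steps for the standard deviation can be written as a short Python function. The name `sample_sd` is my own; the two data sets are the ones used in the example above.

```python
import math


def sample_sd(data):
    """Standard deviation dividing by (n - 1), following the steps above."""
    n = len(data)
    mean = sum(data) / n
    # Sum of the differences squared: the sum of (x - mean)^2.
    sum_diff_squared = sum((x - mean) ** 2 for x in data)
    return math.sqrt(sum_diff_squared / (n - 1))


close_to_mean = sample_sd([29, 30, 31])
far_from_mean = sample_sd([20, 30, 40])
```

Both sets have a mean of 30, but the first has a standard deviation of 1 and the second a standard deviation of 10, which matches the idea that the first set is closer to the mean.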
We then use the standard deviation to help see if our data follows the normal distribution. The normal distribution has certain percentages of the entire data within 1, 2 and 3 standard deviations of the mean. So 68% of the data should be within one SD above and below the mean if our data follows the normal distribution. About 95% of my data should be within 2*SD above and below the mean, and about 99.7% of my data should be within 3*SD above and below the mean. So I need to calculate the percentages of my data within 1, 2 and 3 SD. For the percentage of my data within 1 SD I shall do the mean plus the SD to get my upper bound, then the mean subtract the SD for my lower bound. I shall count how many pieces of my data lie between these two bounds, then divide this number by the sample size (60) and times by 100 to get my percentage for how much data lies within 1 SD. I shall do the same for 2 SD and 3 SD, but instead of adding or subtracting one SD I shall add or subtract 2*SD or 3*SD, and find the percentage within 2 SD and 3 SD as explained above.
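Counting the data within 1, 2 and 3 SD could be done with a sketch like the one below. The function name and the eight example values are invented for illustration, and the standard deviation is the (n - 1) version from Python's `statistics.stdev`.

```python
import statistics


def percent_within(data, k):
    """Percentage of the data lying within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # divides by (n - 1)
    lower_bound = mean - k * sd
    upper_bound = mean + k * sd
    count_inside = sum(1 for x in data if lower_bound <= x <= upper_bound)
    return 100 * count_inside / len(data)


# Hypothetical estimates of 60 seconds.
estimates = [50, 55, 58, 60, 60, 62, 65, 70]
within_1sd = percent_within(estimates, 1)
within_2sd = percent_within(estimates, 2)
within_3sd = percent_within(estimates, 3)
```

With only eight made-up values the percentages (75% within 1 SD and 100% within 2 SD) are far from 68% and 95%, which shows why a larger sample is needed before comparing with the normal distribution.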
My acceptance criteria for this hypothesis are as follows. The percentage within 1 SD has to be between 65% and 71%, as these values are close to 68%. For the percentage within 2 SD I will allow between 93% and 96%, and for 3 SD I will allow between 99% and 100%. The reason I chose these ranges is that the normal distribution doesn't exist exactly in reality but real data can be similar to it, and because I used a sample, not my whole population, I have less data to work the percentages out with, so they might be a few percent off.
H1 Year 11s are better at estimating 60 seconds than Year 7s.
H2 The better you are at estimating 30 seconds the better you are at estimating 60 seconds.
H3 Year 11 data follows the normal distribution.