Stratified sampling and Hypotheses - Taller people tend to be heavier - Males are taller than females - Males height is more variable than females
INTRODUCTION
Stratified Sampling Method
Sample
PLAN
HYPOTHESIS 1
Scatter graphs
HYPOTHESIS 2
Mean
Median
Grouped Frequency Tables
Histograms
Cumulative Frequency
Stem and Leaf Diagram
Relative Frequency
HYPOTHESIS 3
Range
Sample Standard Deviation
CONCLUSION
INTRODUCTION
For this investigation, I was given a collection of data with 328 records. These records were of students with certain details about themselves. These details included:
- Mathematics Set
- Height (cm)
- Weight (kg)
- Shoe size
- Hand span size
I realized that this database containing 328 records of students was too large to work with. Therefore I decided to create a sample of this database using the stratified sampling method.
Stratified Sampling Method
Firstly, I counted the total number of male and female students in each maths set. I used a tally chart to help me with this, with one row for each maths set (Red, Mauve, Blue and Green) and one column each for Male and Female.
I then counted each tally to receive the following numbers:
Red: 58 male, 65 female
Mauve: 32 male, 40 female
Blue: 53 male, 63 female
Green: 12 male, 5 female
Once I had finished counting the number of male and female students from each maths set, I added up all my results:
56
65
32
40
53
63
12
+ 5
328
I added all my results together as a check: the total number of students is 328, so if my results had not added up to 328 I would have known that I had miscounted or missed students out. I then calculated each category of male and female students as a percentage of the total 328 students:
Percentage of Red Males: 56 ÷ 328 × 100 = 17% (to the nearest whole number)
Percentage of Red Females: 65 ÷ 328 × 100 = 20% (to the nearest whole number)
Percentage of Mauve Males: 32 ÷ 328 × 100 = 10% (to the nearest whole number)
Percentage of Mauve Females: 40 ÷ 328 × 100 = 12% (to the nearest whole number)
Percentage of Blue Males: 53 ÷ 328 × 100 = 16% (to the nearest whole number)
Percentage of Blue Females: 63 ÷ 328 × 100 = 19% (to the nearest whole number)
Percentage of Green Males: 12 ÷ 328 × 100 = 4% (to the nearest whole number)
Percentage of Green Females: 5 ÷ 328 × 100 = 2% (to the nearest whole number)
Now that I have calculated the percentage of students in each category, I can use these percentages to choose my sample. Each category is known as a stratum (plural strata), which simply means a layer; in this case the strata are the male and female students of each maths set, e.g. red, blue, mauve etc. I will use the figure that I obtained from each percentage calculation as the number of students to pick from that stratum. For example, the percentage I received for the males in the red maths set was 17%. Therefore I will pick 17 males of the red maths set at random from the database of 328 students and include them in my sample. I will use the same approach to select the rest of the students for my sample. Therefore in total I will have 17 males and 20 females from the red maths set, 10 males and 12 females from the mauve maths set, 16 males and 19 females from the blue maths set, and 4 males and 2 females from the green maths set. The reason I used the percentages as the number of students to extract from each category is that percentages are out of 100. Therefore, with all my chosen students added together, I will have 100 students, which will make up my complete sample.
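The allocation described above can be sketched in Python as a cross-check. This is only an illustrative sketch: the names `allocate` and `draw`, the dictionary layout and the record IDs in the usage line are my own assumptions, and the counts are the figures used in the percentage calculations.

```python
import random

POPULATION_SIZE = 328  # total number of records in the database
SAMPLE_SIZE = 100      # chosen so that 1% of the database = 1 sampled student

# Stratum counts as used in the percentage calculations.
strata = {
    ("Red", "M"): 56, ("Red", "F"): 65,
    ("Mauve", "M"): 32, ("Mauve", "F"): 40,
    ("Blue", "M"): 53, ("Blue", "F"): 63,
    ("Green", "M"): 12, ("Green", "F"): 5,
}


def allocate(strata, population_size, sample_size):
    """Number of students to take from each stratum, rounded to the nearest whole number."""
    return {
        key: round(count / population_size * sample_size)
        for key, count in strata.items()
    }


def draw(record_ids, k, seed=None):
    """Pick k distinct records at random from one stratum."""
    return random.Random(seed).sample(record_ids, k)


allocation = allocate(strata, POPULATION_SIZE, SAMPLE_SIZE)

# Hypothetical record IDs for the red males, only to show the random draw.
red_male_sample = draw(list(range(1, 57)), allocation[("Red", "M")], seed=1)
```

Rounding each stratum separately does not always give exactly 100 students in general, so it is worth checking that the allocated numbers add up to the intended sample size, as they do here.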
I chose the stratified sampling method because it was the most appropriate to use, seeing as the database has several different strata. By using this method, the proportion of each stratum in the sample closely matches the proportion of that stratum in the database. This provides me with a sample that represents the entire database more accurately:
Sample
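No. | Maths Set | Male/Female | Height (cm) | Weight (kg) | Shoe Size | Hand Span (1 d.p.)
035 | Red | M | 170 | 69 | 0 | 25
037 | Red | M | 169 | 69 | 7 | 22
049 | Red | M | 180 | 65 | 0.5 | 23
030 | Red | M | 170 | 60 | 9 | 21.5
031 | Red | M | 179 | 73 | 0 | 22.5
032 | Red | M | 172 | 70 | 8 | 23
034 | Red | M | 172 | 59 | 0.5 | 20.5
035 | Red | M | 170 | 69 | 0 | 23
036 | Red | M | 160 | 45 | 5 | 20
037 | Red | M | 169 | 69 | 7 | 22
040 | Red | M | 172 | 58 | 9 | 23.6
043 | Red | M | 185 | 90 | 9.5 | 24
046 | Red | M | 177.5 | 49 | 1 | 2.5
047 | Red | M | 173.75 | 50 | 2 | 23
048 | Red | M | 174 | 62 | 9 | 20
049 | Red | M | 180 | 65 | 0.5 | 23
073 | Red | M | 165 | 63 | 0 | 24
008 | Red | F | 187 | 70 | 6.5 | 20
024 | Red | F | 170 | 46 | 7 | 9
025 | Red | F | 178.5 | 60 | 6.5 | 20.5
026 | Red | F | 166 | 46 | 5 | 21.6
027 | Red | F | 163 | 63 | 6 | 22
028 | Red | F | 177 | 60 | 8 | 22
029 | Red | F | 160 | 53 | 5.5 | 9
033 | Red | F | 163 | 47 | 6.5 | 20
038 | Red | F | 169 | 58 | 6 | 22
039 | Red | F | 180 | 70 | 8 | 22.3
041 | Red | F | 163 | 43 | 5 | 20
042 | Red | F | 160 | 43 | 4 | 8
044 | Red | F | 167 | 62 | 6 | 9.5
074 | Red | F | 166 | 50 | 5.5 | 9
075 | Red | F | 175 | 61 | 9 | 9.5
079 | Red | F | 158 | 51.5 | 4 | 20
080 | Red | F | 165 | 53 | 5 | 7.4
081 | Red | F | 165 | 54 | 6 | 9
082 | Red | F | 163 | 63 | 7 | 8.6
311 | Red | F | 167 | 75 | 7 | 8.5
000 | Mauve | M | 176 | 62 | 8 | 9.9
001 | Mauve | M | 174 | 57 | 0 | 21.8
002 | Mauve | M | 178 | 55 | 8.5 | 8
003 | Mauve | M | 172 | 52 | 9 | 24
004 | Mauve | M | 175 | 60 | 9 | 23.5
018 | Mauve | M | 184 | 60 | 1 | 21
019 | Mauve | M | 160 | 56 | 7 | 21
050 | Mauve | M | 175 | 63 | 1 | 21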
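053 | Mauve | M | 172.5 | 54 | 8.5 | 22
55 | Mauve | M | 179.1 | 75 | 2 | 23
006 | Mauve | F | 162 | 65 | 5.5 | 8.2
007 | Mauve | F | 160 | 70 | 6 | 8.7
010 | Mauve | F | 160 | 53 | 5.5 | 9.5
011 | Mauve | F | 165 | 59 | 9 | 21.7
012 | Mauve | F | 162.5 | 44.5 | 7 | 5
013 | Mauve | F | 155 | 55 | 5 | 20.5
014 | Mauve | F | 155 | 45 | 5.5 | 21
015 | Mauve | F | 159 | 59 | 5.5 | 6.1
016 | Mauve | F | 160 | 62 | 5 | 7.5
017 | Mauve | F | 173 | 70 | 6.5 | 8.5
020 | Mauve | F | 170 | 55 | 5 | 8
021 | Mauve | F | 162.5 | 55 | 5 | 20
04 | Blue | M | 178 | 54.4 | 0.5 | 21.2
07 | Blue | M | 181 | 74 | 0.5 | 24.5
11 | Blue | M | 175 | 66 | 0 | 23
13 | Blue | M | 150 | 77 | 0.5 | 21.5
18 | Blue | M | 178 | 70 | 1 | 24
25 | Blue | M | 165 | 70 | 0 | 20.5
26 | Blue | M | 185 | 57 | 0 | 21
27 | Blue | M | 180 | 70 | 1 | 22
30 | Blue | M | 150 | 57 | 7 | 24
59 | Blue | M | 190 | 82 | 0 | 22.2
60 | Blue | M | 174 | 60 | 1 | 21
61 | Blue | M | 170 | 57 | 0 | 21
64 | Blue | M | 180 | 79.5 | 2 | 22
65 | Blue | M | 170 | 60.4 | 0 | 9
66 | Blue | M | 173 | 70 | 1 | 21
67 | Blue | M | 177 | 57.2 | 0.5 | 23
72 | Blue | F | 175 | 65 | 6.5 | 22
73 | Blue | F | 173 | 60 | 7 | 20
75 | Blue | F | 165 | 44.5 | 3 | 8
213 | Blue | F | 163 | 54 | 5.5 | 9
214 | Blue | F | 163 | 84 | 5 | 7.5
217 | Blue | F | 168 | 63.2 | 7.5 | 20
56 | Blue | F | 162 | 55 | 5 | 8
57 | Blue | F | 160 | 55 | 5 | 8
58 | Blue | F | 174 | 70 | 7 | 20
62 | Blue | F | 165 | 47 | 5 | 21
63 | Blue | F | 162 | 44 | 5 | 9
099 | Blue | F | 165 | 51.2 | 5 | 20
00 | Blue | F | 168 | 57.6 | 3.5 | 9
01 | Blue | F | 155 | 44.8 | 6 | 9
02 | Blue | F | 147 | 57.6 | 5 | 9.9
03 | Blue | F | 165 | 64 | 6 | 21
05 | Blue | F | 155 | 41.6 | 4 | 8.8
06 | Blue | F | 160 | 52 | 6 | 20.4
08 | Blue | F | 150 | 45 | 5 | 20
77 | Green | M | 170 | 39 | 7.7 | 21
86 | Green | M | 166 | 35 | 9.5 | 8.5
80 | Green | M | 162 | 40 | 8 | 22
81 | Green | M | 172 | 35 | 0 | 21
78 | Green | F | 160 | 34 | 5 | 8.5
83 | Green | F | 158 | 37 | 5 | 8.5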
PLAN
In this investigation, I have chosen 3 different hypotheses which I will attempt to prove. A hypothesis is a statement that has not yet been tested and that can be checked against evidence. The 3 hypotheses I made for this investigation are as follows:
Hypotheses
- Taller people tend to be heavier
- Males are taller than females
- Males' height is more variable than females'
I will use a series of mathematical calculations and diagrams to retrieve enough information relating to these 3 hypotheses such as:
- Scatter graphs
- Grouped frequency tables
- Cumulative frequency graphs
- Relative frequency
- Histograms
- Lower quartile
- Median
- Upper quartile
- Interquartile range
- Box and whisker diagrams
- Stem and leaf diagram
- Modal group
- Mean
- Range
- Sample standard deviation
By doing this, I will be able to analyze the information so that I am able to reach a substantial conclusion for each hypothesis. I am relying on my calculations and diagrams to support each of my hypotheses.
HYPOTHESIS 1
- Taller people tend to be heavier.
Scatter graphs
Both of these scatter graphs show a linear relationship between height and weight.
As you can see from the male scatter graph, the line of best fit shows that there is a positive correlation between the height and weight of the male students because the line slopes upwards. The correlation is moderate rather than strong. Generally the graph does show that the taller the male students are, the heavier they are. However, there are some male students who are shorter than others and yet heavier than them. This is why the correlation between height and weight is only moderate.
The female scatter graph also shows a positive correlation between the height and weight of female students, with the line of best fit sloping upwards. The line of best fit on the female scatter graph has a slightly steeper gradient than the line of best fit on the male scatter graph. This means that, along the line, height rises by slightly more for each extra kilogram of weight for females than for males. The correlation is also slightly stronger than the male correlation, because the points on the female scatter graph are slightly closer to the line of best fit than the points on the male scatter graph. However, the line of best fit on the female scatter graph has a lower y-intercept.
By drawing the line of best fit on both graphs, we can predict a person's height or weight. For example, if a person weighs 40kg, we can look at the line of best fit on both graphs to see what height the line gives where x is 40. As you can see, for a male, the estimated height would be 168cm. The estimated height for a female with a weight of 40kg would be 160cm. However, because there is not a strong linear relationship on either graph, these estimated values may not be accurate.
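Reading a prediction off a line of best fit can also be sketched in code. The sketch below uses a least-squares line, which is a substitute for a line drawn by eye and will not reproduce the 168cm and 160cm estimates read from my graphs. The function names and the three data points are hypothetical and are not taken from the sample.

```python
def line_of_best_fit(xs, ys):
    """Least-squares gradient and y-intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def predict_height(weight_kg, gradient, intercept):
    """Height (cm) that the line gives for a chosen weight (kg)."""
    return gradient * weight_kg + intercept


# Hypothetical (weight, height) points lying exactly on one line.
weights = [40, 50, 60]
heights = [160, 165, 170]

gradient, intercept = line_of_best_fit(weights, heights)
estimate = predict_height(40, gradient, intercept)
```

With weight on the x-axis and height on the y-axis, the gradient is the change in height for each extra kilogram, so a steeper line means height changes more quickly with weight.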
Having analyzed both graphs, they both generally show that the taller the students are, the heavier they are, because the line of best fit has a positive gradient on both graphs. The female graph supports my first hypothesis better because its points lie closer to the line of best fit, which gives a slightly stronger correlation. With both graphs supporting my initial hypothesis, it has now been proven that "taller people tend to be heavier."
HYPOTHESIS 2
- Males are taller than females
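Male Height (cm): 170, 169, 180, 170, 179, 172, 172, 170, 160, 169, 172, 185, 177.5, 173.75, 174, 180, 165, 176, 174, 178, 172, 175, 184, 160, 175, 172.5, 179.1, 178, 181, 175, 150, 178, 165, 185, 180, 150, 190, 174, 170, 180, 170, 173, 177, 170, 166, 162, 172
Female Height (cm): 187, 170, 178.5, 166, 163, 177, 160, 163, 169, 180, 163, 160, 167, 166, 175, 158, 165, 165, 163, 167, 162, 160, 160, 165, 162.5, 155, 155, 159, 160, 173, 170, 162.5, 175, 173, 165, 163, 163, 168, 162, 160, 174, 165, 162, 165, 168, 155, 147, 165, 155, 160, 150, 160, 158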
Mean
Mean Height of Males
8129.85 ÷ 47 = 172.98 (2 d.p.)
To the nearest whole number = 173cm
Mean Height of Females
8719.5 ÷ 53 = 164.52 (2 d.p.)
To the nearest whole number = 165cm
In the calculations above, I added together all the heights of the males from the sample. I then divided this total by the number of males to receive a mean value for the height of the males. I used the same process to work out the mean value for the height of the females from the sample. As you can see, the mean figure that I obtained for the male height is 173cm to the nearest whole number. The mean figure that I obtained for the female height is 165cm to the nearest whole number. The mean height of the males is greater than the mean height of the females by 8cm. These calculations show that, on average, males are 8cm taller than females in the sample.
Median
Median - Males
150, 150, 160, 160, 162, 165, 165, 166, 169, 169, 170, 170, 170, 170, 170, 170, 172, 172, 172, 172, 172, 172.5, 173, 173.75, 174, 174, 174, 175, 175, 175, 176, 177, 177.5, 178, 178, 178, 179, 179.1, 180, 180, 180, 180, 181, 184, 185, 185, 190
Median = 173.75cm
Median - Females
147, 150, 155, 155, 155, 155, 158, 158, 159, 160, 160, 160, 160, 160, 160, 160, 160, 162, 162, 162, 162.5, 162.5, 163, 163, 163, 163, 163, 163, 165, 165, 165, 165, 165, 165, 165, 166, 166, 167, 167, 168, 168, 169, 170, 170, 173, 173, 174, 175, 175, 177, 178.5, 180, 187
Median = 163cm
As you can see from these calculations, I ordered all the heights of the males from smallest to largest. I then found the middle value of the ordered list, which is the median of the male heights. With 47 male heights, the middle value is the 24th, which is 173.75cm. I used the same approach to find the median of the 53 female heights, which is the 27th value, 163cm. The median of the female heights is lower than the median of the male heights by 10.75cm. This shows that the males are generally taller in this sample.
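The mean and median working can be cross-checked with a short Python sketch. The variable names are my own, and the two lists are the ordered heights from the median working, with the lowest values read as 150cm and 147cm to match the lowest heights quoted under Range.

```python
from statistics import mean, median

male_heights = [
    150, 150, 160, 160, 162, 165, 165, 166, 169, 169,
    170, 170, 170, 170, 170, 170, 172, 172, 172, 172,
    172, 172.5, 173, 173.75, 174, 174, 174, 175, 175, 175,
    176, 177, 177.5, 178, 178, 178, 179, 179.1, 180, 180,
    180, 180, 181, 184, 185, 185, 190,
]
female_heights = [
    147, 150, 155, 155, 155, 155, 158, 158, 159, 160,
    160, 160, 160, 160, 160, 160, 160, 162, 162, 162,
    162.5, 162.5, 163, 163, 163, 163, 163, 163, 165, 165,
    165, 165, 165, 165, 165, 166, 166, 167, 167, 168,
    168, 169, 170, 170, 173, 173, 174, 175, 175, 177,
    178.5, 180, 187,
]

# Mean: total of the heights divided by the number of heights.
male_mean = mean(male_heights)
female_mean = mean(female_heights)

# Median: middle value of the ordered list (24th of 47, 27th of 53).
male_median = median(male_heights)
female_median = median(female_heights)

print(round(male_mean), round(female_mean), male_median, female_median)
```

Because both groups contain an odd number of heights, each median is a single value from the list rather than the average of two middle values.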
Grouped Frequency Tables
Males
Height (cm) | Frequency | Mid-Interval Value | Frequency x Mid-Interval Value
141 - 150 | 2 | 145.5 | 291
151 - 160 | 2 | 155.5 | 311
161 - 170 | 12 | 165.5 | 1986
171 - 180 | 26 | 175.5 | 4563
181 - 190 | 5 | 185.5 | 927.5
Total | 47 | | 8078.5
Females
Height (cm) | Frequency | Mid-Interval Value | Frequency x Mid-Interval Value
141 - 150 | 2 | 145.5 | 291
151 - 160 | 15 | 155.5 | 2332.5
161 - 170 | 27 | 165.5 | 4468.5
171 - 180 | 8 | 175.5 | 1404
181 - 190 | 1 | 185.5 | 185.5
Total | 53 | | 8681.5
In the tables above, I have set out the working for the estimated mean and modal group using grouped frequency tables for both male and female heights. Firstly, I arranged the heights of the males into different groups and added all the frequencies together. I then multiplied the frequency of each group by the mid-interval value and added these separate values together. I then divided the overall total by the frequency total to receive an estimated mean. I followed the same procedure for the female heights to receive the estimated mean and modal group.
As you can see from the end results, the estimated mean of the male heights is 8078.5 ÷ 47 = 171.88cm. The estimated mean of the female heights is 163.80cm. The mean of the male heights is once again larger than the mean of the female heights, showing that males, on average, are taller than females. The modal group of the males (171 - 180cm) is also higher than the modal group of the females (161 - 170cm). In other words, the most frequent male height group is taller than the most frequent female height group. This means that more males than females fall into the taller height groups in the sample.
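The grouped-table method can be sketched in Python as follows. The function names are hypothetical, and the group frequencies are my reading of the grouped frequency tables, taken so that they add up to the 47 males and 53 females in the sample.

```python
def estimated_mean(groups):
    """Estimated mean from (lower, upper, frequency) groups.

    Each group is represented by its mid-interval value.
    """
    total_frequency = sum(f for _, _, f in groups)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in groups)
    return total_fx / total_frequency


def modal_group(groups):
    """The (lower, upper) bounds of the group with the highest frequency."""
    lower, upper, _ = max(groups, key=lambda group: group[2])
    return lower, upper


male_groups = [
    (141, 150, 2), (151, 160, 2), (161, 170, 12),
    (171, 180, 26), (181, 190, 5),
]
female_groups = [
    (141, 150, 2), (151, 160, 15), (161, 170, 27),
    (171, 180, 8), (181, 190, 1),
]

male_estimate = estimated_mean(male_groups)
female_estimate = estimated_mean(female_groups)
```

The result is only an estimate because every height in a group is treated as if it were the mid-interval value, which is why it differs slightly from the mean of the raw heights.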
Histograms
Male
Height (cm) | Frequency
140 - 160 | 4
160 - 170 | 12
170 - 190 | 31
Female
Height (cm) | Frequency
140 - 160 | 17
160 - 170 | 27
170 - 190 | 9
As you can see from these two histograms, the male histogram shows a larger number of tall students than the female histogram. Within the region 170cm - 190cm, the bar on the male histogram covers a very large area in comparison to the bar on the female histogram, which does not cover much area. There are very few male students with a height between 140cm and 160cm in comparison to the female students: the bar within this region of the female histogram covers a much larger area. This shows that there are more tall male students than tall female students.
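Because the height classes have different widths, the area of each bar only represents its frequency if the bar height is the frequency density (frequency divided by class width). The sketch below assumes the histograms were drawn this way. The function name is hypothetical, and the class frequencies are my reading of the histogram tables, taken so that they add up to 47 males and 53 females.

```python
def frequency_density(classes):
    """Bar heights for a histogram with unequal class widths.

    classes: list of (lower, upper, frequency) tuples.
    Returns (lower, upper, density), where density = frequency / class width.
    """
    return [(lower, upper, f / (upper - lower)) for lower, upper, f in classes]


male_classes = [(140, 160, 4), (160, 170, 12), (170, 190, 31)]
female_classes = [(140, 160, 17), (160, 170, 27), (170, 190, 9)]

male_bars = frequency_density(male_classes)
female_bars = frequency_density(female_classes)

# Area of a bar = density x width, which gives back the frequency.
male_areas = [density * (upper - lower) for lower, upper, density in male_bars]
```

Comparing areas rather than bar heights is what allows the 170cm - 190cm bars of the two histograms to be compared fairly with the narrower 160cm - 170cm bars.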
Cumulative Frequency
Male
Height (cm) | Frequency | Cumulative Frequency
141 - 150 | 2 | 2
151 - 160 | 2 | 4
161 - 170 | 12 | 16
171 - 180 | 26 | 42
181 - 190 | 5 | 47
Female
Height (cm) | Frequency | Cumulative Frequency
141 - 150 | 2 | 2
151 - 160 | 15 | 17
161 - 170 | 27 | 44
171 - 180 | 8 | 52
181 - 190 | 1 | 53
Having worked out the cumulative frequency for each group of heights for both male and female, I plotted the results on two different cumulative frequency graphs. I then drew 2 separate box and whisker diagrams which correspond to the two graphs. As you can see, the male cumulative frequency curve shows a moderately tight distribution around the median. This shows that the heights of the males are not widely spread out. The interquartile range of the male box and whisker diagram is also quite small, with a value of 9cm. This makes the box rather short. Again, this shows that the male heights are not widely spread.
On the other hand, the box on the female box and whisker diagram is longer. The female box and whisker diagram has an interquartile range of 16cm. This shows that the female heights are more widely spread out than the male heights. The female cumulative frequency curve is also a bit looser around the median, which also shows that the female heights are a little more spread out than the male heights. However, in relation to who is taller, the box of the female box and whisker diagram is much further to the left. By looking at the scale below the box and whisker diagram we can see that the further the box is to the left, the lower the heights it covers. If you compare both box and whisker diagrams, although the male box is shorter, you can see that it is further to the right. This shows that the male interquartile range consists of greater heights than the female interquartile range.
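A cumulative frequency column is a running total of the group frequencies, and it can be sketched in Python as below. The function names are hypothetical, and the frequencies are my reading of the grouped tables, taken so that they add up to 47 males and 53 females. The last running total should always equal the number of people in the group, which is a quick check on the column.

```python
from itertools import accumulate


def cumulative_frequency(frequencies):
    """Running total of the group frequencies."""
    return list(accumulate(frequencies))


def quartile_positions(n):
    """Positions on the cumulative frequency axis used to read off
    the lower quartile, median and upper quartile from the curve."""
    return n / 4, n / 2, 3 * n / 4


male_frequencies = [2, 2, 12, 26, 5]
female_frequencies = [2, 15, 27, 8, 1]

male_cumulative = cumulative_frequency(male_frequencies)
female_cumulative = cumulative_frequency(female_frequencies)
```

The interquartile range is then the height read off the curve at the upper quartile position minus the height read off at the lower quartile position.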
Stem and Leaf Diagram - Male and Female (cm)
The diagram above is a stem and leaf diagram of both male and female heights. The figures in the centre column are the first 2 digits of the male and female heights. The digits on either side of this column, arranged in rows, show the 3rd digit of the male and female heights. The left hand side represents the males and the right hand side represents the females. It is very simple to understand this diagram. For example, 5|18|0 represents a male with a height of 185cm and a female with a height of 180cm.
The top number in the centre column, 14, is only there to serve the female heights, because there are no digits to the left of the 14 on the male side. There is no male height in the sample which is that low. The same applies to the figure 19 in the centre column, but the other way around: there is a digit to the left of this number but not to the right. This shows that the lowest height in the sample is a female height and the greatest height is a male height.
As you can see, by drawing this stem and leaf diagram, we can easily notice trends from this visual representation of male and female heights. We can see that there is one very long row of digits on each side of the stem and leaf diagram. The longest male row is slightly below the longest female row. As you may have noticed, the further down the stem and leaf diagram you go, the larger the heights become. The fact that the longest row of male digits is below the longest row of female digits shows something very significant: the majority of males are taller than the majority of females.
Relative Frequency
By using the stem and leaf diagram, we can easily see the number of male and female students for each height. I used this feature to assist me in calculating the relative frequency. To calculate the relative frequency, we must divide the number of males or females that fall into each category below by the sample size, which is 100. This provides us with a figure representing the relative frequency:
Height (cm) | Relative Frequency (Male) | Relative Frequency (Female)
< 140 | 0 |
< 150 | 0.02 | 0.09
< 160 | 0.10 | 0.42
< 170 | 0.39 | 0.51
< 180 | 0.47 | 0.53
< 190 | 0.48 | 0.53
As you can see from the table above, there is an obvious general trend. The trend is that all of the relative frequencies for the male height groups are below the relative frequencies for the female height groups. Like the previous forms of evidence, this supports the hypothesis that males are taller than females. For example, for the height group < 170cm, the male relative frequency is 0.39, whereas the female relative frequency is 0.51. There is a large difference of 0.12 between the two figures. This means that there are fewer males than females below this height. This trend is consistent throughout the table above. Therefore, males are generally taller than females.
I have used various different forms of calculations and diagrams, and all of them have presented me with information in different forms. All of this information indicated something similar: that, generally, females are shorter than males. Therefore, my second hypothesis has now been proven based on the series of calculations made and diagrams used in this section.
"Males are taller than females"
HYPOTHESIS 3
- Males' height is more variable than females'.
Range
Male
Lowest male height = 150cm
Highest male height = 190cm
Range: 190 - 150 = 40cm
Female
Lowest female height = 147cm
Highest female height = 187cm
Range: 187 - 147 = 40cm
To calculate the range for males, I identified the lowest value and the highest value of the male heights. The range is the distance from the lowest value to the highest value, so all I was required to do was subtract 150cm from 190cm. Therefore, the height of males from my sample has a range of 40cm. As you can see from the above calculations, I used the same approach to calculate the range of the female heights. Astonishingly, the female heights also have a range of 40cm. This value is exactly the same as the range of the male heights. This shows that both male and female heights vary across the same sized scope. However, the range only uses the two most extreme heights in each group, so it is too vague to really tell us which is more variable.
Sample Standard Deviation
Males
Standard Deviation = $\sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
The formula above is the formula that is used to calculate the standard deviation of a sample. By calculating the standard deviation we can measure the dispersion of both male and female heights. This will provide us with information to decide which is more variable - male or female.
The letter "$x$" represents each height. The symbol "$\bar{x}$" represents the mean of the heights and the letter "$n$" represents the number of heights. Firstly, we must calculate the mean value of the male heights:
$\bar{x} = 8129.85 \div 47 = 172.9755$ (4 d.p.)
The next step is to subtract the mean from each male height and square each value. Once this is complete, we must add all the squared values together. I have done this and tabulated the results:
$x$ | $x - \bar{x}$ (4 d.p.) | $(x - \bar{x})^2$ (4 d.p.)
150 | -22.9755 | 527.8751
150 | -22.9755 | 527.8751
160 | -12.9755 | 168.3644
160 | -12.9755 | 168.3644
162 | -10.9755 | 120.4623
165 | -7.9755 | 63.6091
165 | -7.9755 | 63.6091
166 | -6.9755 | 48.6580
169 | -3.9755 | 15.8049
169 | -3.9755 | 15.8049
170 | -2.9755 | 8.8538
170 | -2.9755 | 8.8538
170 | -2.9755 | 8.8538
170 | -2.9755 | 8.8538
170 | -2.9755 | 8.8538
170 | -2.9755 | 8.8538
172 | -0.9755 | 0.9517
172 | -0.9755 | 0.9517
172 | -0.9755 | 0.9517
172 | -0.9755 | 0.9517
172 | -0.9755 | 0.9517
172.5 | -0.4755 | 0.2261
173 | 0.0245 | 0.0006
173.75 | 0.7745 | 0.5998
174 | 1.0245 | 1.0495
174 | 1.0245 | 1.0495
174 | 1.0245 | 1.0495
175 | 2.0245 | 4.0985
175 | 2.0245 | 4.0985
175 | 2.0245 | 4.0985
176 | 3.0245 | 9.1474
177 | 4.0245 | 16.1963
177.5 | 4.5245 | 20.4708
178 | 5.0245 | 25.2453
178 | 5.0245 | 25.2453
178 | 5.0245 | 25.2453
179 | 6.0245 | 36.2942
179.1 | 6.1245 | 37.5091
180 | 7.0245 | 49.3432
180 | 7.0245 | 49.3432
180 | 7.0245 | 49.3432
180 | 7.0245 | 49.3432
181 | 8.0245 | 64.3921
184 | 11.0245 | 121.5389
185 | 12.0245 | 144.5878
185 | 12.0245 | 144.5878
190 | 17.0245 | 289.8325
Having obtained the total, I must now divide this value by the number of heights minus 1, which is 46. Once I receive this number, I must find the square root of it:
Standard Deviation = $\sqrt{\dfrac{2952.2444}{46}}$
Standard Deviation = $\sqrt{64.179}$
Standard Deviation = 8.011 (3 d.p.)
After making various different calculations, I now have a standard deviation for the height of males. Using the same approach, I calculated the standard deviation for the height of females:
Females
Standard Deviation = 7.432 (3d.p)
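Both standard deviations and both ranges can be cross-checked with a short Python sketch. The variable names are my own, and the lists are the ordered male and female heights from the median working, with the lowest values read as 150cm and 147cm. `statistics.stdev` is the sample standard deviation, which divides by n - 1 as in the formula used for the males.

```python
from statistics import stdev

male_heights = [
    150, 150, 160, 160, 162, 165, 165, 166, 169, 169,
    170, 170, 170, 170, 170, 170, 172, 172, 172, 172,
    172, 172.5, 173, 173.75, 174, 174, 174, 175, 175, 175,
    176, 177, 177.5, 178, 178, 178, 179, 179.1, 180, 180,
    180, 180, 181, 184, 185, 185, 190,
]
female_heights = [
    147, 150, 155, 155, 155, 155, 158, 158, 159, 160,
    160, 160, 160, 160, 160, 160, 160, 162, 162, 162,
    162.5, 162.5, 163, 163, 163, 163, 163, 163, 165, 165,
    165, 165, 165, 165, 165, 166, 166, 167, 167, 168,
    168, 169, 170, 170, 173, 173, 174, 175, 175, 177,
    178.5, 180, 187,
]

# Range: highest value minus lowest value.
male_range = max(male_heights) - min(male_heights)
female_range = max(female_heights) - min(female_heights)

# Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1)).
male_sd = stdev(male_heights)
female_sd = stdev(female_heights)

print(male_range, female_range, round(male_sd, 3), round(female_sd, 3))
```

Unlike the range, the standard deviation uses every height in the group, which is why it can separate two groups that share the same range.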
Now that I have two separate standard deviations for the height of males and females, I can compare the variability. As you can see, the standard deviation of the male heights (8.011cm) is slightly greater than the standard deviation of the female heights (7.432cm). This shows that the male heights in the sample are more spread out around their mean than the female heights. Even though the range of both the male and female heights was the same, the heights of the males varied more than those of the females within the same scope of 40cm. Therefore, I can finally come to a substantial conclusion. Based on these calculations, my third and final hypothesis has now been proven.
"Males height is more variable than females"
CONCLUSION
By carrying out extensive calculations, and with the assistance of very useful diagrams, I proved all three of my hypotheses. However, the hypotheses were only proven using a small sample. The characteristics of the rest of the database of students could be slightly different, which could contradict all three of my hypotheses. If I was given more time, I would try using a larger sample. This would provide me with more accurate results, and the conclusions that I would make would be more valid in relation to the whole database, not just the sample. I would also try different sampling methods. I would then be able to support my hypotheses even further by making calculations relating to all 3 hypotheses using a number of different samples.