Frequency table
This is a grouped frequency table for the IQ values in my sample. It shows the number of times each value occurred in my data. From it I can see that the largest number of pupils had an IQ between 100 and 109. I also included cumulative frequency in the table. Cumulative frequency is the running total of the frequencies at the end of each class interval.
Cumulative frequency will be useful when I have to find the interquartile range.
Frequency tables can also be displayed graphically. One way to display frequency data is with a frequency polygon, which presents the data in a way that is easier to understand. The area of the polygon is proportional to the frequency of the data. This is the frequency polygon for the IQ of my data.
It shows that the class with the highest frequency is IQ 100-109.
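The way a grouped frequency table with a cumulative frequency column is built can be sketched in Python. This is only an illustration: the IQ scores below are made up, and the function name `grouped_frequency` and the class width of 10 are my own assumptions, not the actual sample.

```python
from collections import Counter

def grouped_frequency(values, width=10):
    """Return rows of (class label, frequency, cumulative frequency)."""
    # Map each value to the lower bound of its class interval, e.g. 104 -> 100.
    counts = Counter((v // width) * width for v in values)
    rows = []
    running_total = 0
    for lower in sorted(counts):
        running_total += counts[lower]  # cumulative frequency so far
        label = f"{lower}-{lower + width - 1}"
        rows.append((label, counts[lower], running_total))
    return rows

# Hypothetical IQ scores, invented for illustration only.
sample_iq = [76, 88, 92, 97, 100, 101, 103, 104, 108, 112, 115, 124]

table = grouped_frequency(sample_iq)
for label, freq, cum in table:
    print(f"{label:>8}  {freq:>3}  {cum:>3}")
```

The last cumulative frequency always equals the number of values, which is a quick check that no value was missed.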
Frequency table for SAT results
I now have a frequency table and a frequency polygon for my IQ data. I also need a frequency table and polygon for my SAT results. However, there is no need to group this data. This is my frequency table.
Frequency table for SAT results
From this frequency table I can see that the value that occurred most often was Level 4. This tells me that, out of the 100 pupils in my sample, almost half got Level 4 in their SAT results. It also tells me that not many pupils got results below Level 4. This data can also be put into a frequency polygon for a more interesting presentation.
Frequency polygon of the SAT Results
As you can see, my frequency polygon represents my frequency table. It shows that the larger part of my sample got Level 4 or above in their SATs.
Mean
The mean is one form of average. The mean of the data is equal to the sum of all observed values divided by the total number of values. It can be put into this formula:
Mean = Sum of values / Number of values
Or it can be written using ∑:
$\bar{x} = \frac{\sum x}{n}$
Where $\bar{x}$ = mean of the values
∑x = sum of all the values
n = number of values
I have decided to find the mean of both the IQ and the English SAT results. I will try to see how the average IQ relates to the average SAT result.
The mean is a very useful average when you want a ‘typical’ value for a set of data. However, you must be careful when you use it, since any extreme values in the data can heavily influence the mean. Since my data doesn’t have any extremes and is quite closely grouped, I think the mean is appropriate.
Since I have a sample of 100 to analyse, doing the calculation with a calculator was impractical, and even more so with pen and paper. Instead I did my mean calculations using Excel.
Mean of student IQ
Mean = Sum of the IQ/number of students
Mean = 10362/100
Mean = 103.62
Mean of English SAT results
Mean = Sum of SAT results/number of SAT results
Mean = 419/100
Mean = 4.19
The mean of English SAT results is 4.19
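The two mean calculations above can be checked with a short Python sketch. It uses the totals already given (10362 and 419, each over 100 pupils); the function name `mean_from_totals` is just a label I chose for illustration.

```python
def mean_from_totals(total, n):
    """Mean = sum of values / number of values."""
    return total / n

# Totals taken from the calculations above.
iq_mean = mean_from_totals(10362, 100)
sat_mean = mean_from_totals(419, 100)

print(round(iq_mean, 2))   # mean IQ
print(round(sat_mean, 2))  # mean English SAT level
```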
The mean shows that, for Mayfield High School, the average IQ of the pupils is 103.62. It also shows that the school's pupils got an average of 4.19 in their SAT results. This applies only to the sample taken from the population of Mayfield High School. It gives me a general picture of the school's intelligence and achievements.
The Median
The median is simply the middle value of the data. You find it by arranging the data in a row or column in order of magnitude, smallest first or vice versa. The value in the middle will therefore be intermediate in size and gives a general idea of the size of the data.
IQ median
To help find the median, I drew a stem-and-leaf diagram of the IQ values in my sample. This puts my data in order of size and helps me find the median. This is my stem-and-leaf diagram.
The stem and leaf diagram of IQ
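A stem-and-leaf diagram like this one could be built as in the following Python sketch. The scores are invented for illustration and are not my sample; the function name `stem_and_leaf` is also an assumption. The stem is the tens part of each value and the leaf is the units digit.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted values by stem (tens) with their leaves (units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

# Hypothetical IQ scores, invented for illustration only.
sample = [76, 88, 92, 97, 100, 100, 103, 108, 112, 119]

diagram = stem_and_leaf(sample)
for stem, leaves in diagram.items():
    # Key: 10 | 3 means 103
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted first, the leaves in each row appear in order, so the middle value can be found by counting along the rows.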
Since there is an even number of values in my sample, there is no single middle value. Instead, I have to take the two middle numbers. By convention, the median is then taken as the average of the two numbers.
From my stem and leaf diagram I know that the two middle values of my sample data are:
101 and 101
To get the median, I have to find the mean of the two values. So, the median is:
(101+101)/2 = 101
The median of the IQ is 101
SAT Results median
Now I have to find the median of the English SAT results. I do not need to put my SAT results into a stem-and-leaf diagram, because the SAT data has only a few different values. I cannot read the median directly from my frequency table, but the table can help me find it, because it tells me how many of each value there are in my sample. I have decided to put all of the values in order, smallest first. From the ordered data it won't be hard to find the median.
1,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
Since there are 100 values in my sample, there is no single middle value. Instead I took the two middle values and found their mean. This gave me the median of the results.
The middle two values of the data are 4 and 4. So, the median of the set of data is:
(4+4)/2 = 4
The median of the SAT Results is Level 4.
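The rule used for both medians (take the middle value, or the mean of the two middle values when the count is even) can be written as a short Python sketch. The two small lists are made up for illustration, and the function name `median` is my own.

```python
import statistics

def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data, invented for illustration only.
even_sample = [98, 100, 101, 101, 104, 110]  # six values: average the 3rd and 4th
odd_sample = [2, 3, 4, 4, 5]                 # five values: take the 3rd

print(median(even_sample))
print(median(odd_sample))
```

Python's built-in `statistics.median` follows the same convention, so it can be used as a cross-check.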
The mode
The next thing I can find out about my data is the mode. The mode is the single value that occurs most in the data. I can use my frequency tables for this.
IQ mode
My frequency table tells me how many times particular values occur. From the frequency table and frequency polygon drawn previously, I know that the values that occur most often are in the class 100-109. However, since the frequency table is grouped, it doesn’t give me the specific value that occurs most frequently. Instead I used the stem-and-leaf diagram drawn previously. From there I found that the single value that occurs most often is an IQ of 100. So:
The mode of the students' IQ is 100
SAT English results mode
I cannot use a stem-and-leaf diagram for my SAT results. However, since my frequency table for SAT results isn’t grouped, I can use it, together with the frequency polygon, to find the mode. From the frequency table, I can see that the value that occurs most often is Level 4. Therefore:
Mode of SAT results is Level 4
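Finding the mode from ungrouped frequencies can be sketched in Python as below. The levels in the list are made up for illustration, and the function name `modes` is an assumption. The function returns a list because a data set can have more than one most frequent value.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)        # value -> number of occurrences
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

# Hypothetical SAT levels, invented for illustration only.
sample_levels = [3, 4, 4, 5, 4, 3, 5, 4]
tied_levels = [1, 1, 2, 2]

print(modes(sample_levels))
print(modes(tied_levels))
```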
Range
The range of a set of numerical data is the difference between the highest and the lowest values. The range is the simplest possible measure of spread. It cannot be found exactly from grouped data, and it is strongly influenced by extreme values. To get the range, all you have to do is take away the smallest value in the data from the largest value.
IQ range
Since the range cannot be found from grouped data, I cannot use my frequency table to find the highest and lowest values. Instead, I will use my stem-and-leaf diagram. From there I know that the highest value for IQ is 124, while the lowest value is 76.
So, to find the range I calculate:
124 – 76 = 48
The range of IQ is 48
SAT results range
The frequency table for my SAT results isn’t grouped. I am able to use it to find out the highest and lowest value. Also, previously I had my SAT results data lined up smallest to largest value. I am able to use both to find out the range. From that data I know that the highest SAT level is 5, while the lowest SAT level is 1.
To find the range I calculate:
5 – 1 = 4
The range of SAT results is 4 levels
Range is not very useful because it can be heavily influenced by the extremes of the data. For example if I have one very high IQ and one very low IQ, the range taken from those two values will be misleading.
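How much one extreme value can change the range can be seen in this short Python sketch. The IQ scores are made up for illustration, and the function name `value_range` is an assumption.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical IQ scores, invented for illustration only.
typical = [95, 98, 100, 101, 104, 107]
with_extreme = typical + [150]  # the same data plus one very high value

print(value_range(typical))
print(value_range(with_extreme))
```

Adding a single extreme value changes the range a great deal even though the rest of the data is unchanged.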
Standard deviation
The range is a very basic measure of spread. Spread is a measure of the dispersion in the data. The standard deviation is the square root of the variance. The formula for standard deviation is:
$\text{Standard deviation} = \sqrt{\frac{1}{n}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)}$
Standard deviation of IQ
I have decided to find the standard deviation of my data to see how it is dispersed. Before I can do so, I must find the sum of the squares of my values and the sum of my values. Since I am dealing with such a large number of values, doing it with a calculator would be impractical and time-consuming. Instead I did my calculations in Excel. This is the result I got.
Now I am able to do my calculation.
∑1072652- 10362 ²
∑1072652 – 107371044
1057276
100
10572.76
Standard deviation = 102.83
Standard Deviation of SAT results
The same process as before is performed.
∑x² – (∑x)²/n = 1787 – 419²/100
= 1787 – 175561/100
= 1787 – 1755.61
= 31.39
Variance = 31.39/100 = 0.3139
Standard deviation = √0.3139 = 0.56
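Calculating a standard deviation from the sum of the values and the sum of their squares can be sketched in Python. This assumes the population form of the formula (dividing by n), and it uses a small made-up data set rather than my sample; the function name `std_from_sums` is also an assumption.

```python
import math
import statistics

def std_from_sums(sum_x, sum_x_squared, n):
    """Population standard deviation from the sum and the sum of squares."""
    s_xx = sum_x_squared - sum_x ** 2 / n  # sum of squared deviations from the mean
    variance = s_xx / n                    # divide by n to get the variance
    return math.sqrt(variance)             # standard deviation = root of variance

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sd = std_from_sums(sum(data), sum(x * x for x in data), len(data))
print(sd)
print(statistics.pstdev(data))  # cross-check with the standard library
```

A useful sanity check is that the standard deviation can never be larger than the range of the data.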
To investigate my data further, I decided to calculate summary statistics for Years 10 and 11. This will give me further knowledge of the population of the school and what their averages are. This will be my subgroup.
Subgroup
Frequency tables of year 10 and 11
I have decided to do everything for Years 10 and 11 as I did for the general data of my sample. First I have to do the frequency tables.
Frequency table of IQ
I cannot make an ungrouped frequency table for the IQ because of the wide range of possible values. Instead I use a grouped frequency table. Everything is done in the same way as for the main data. The only difference is that the data set is much smaller.
This table shows the frequency of each class in my data. As you can see, there are several differences from my general sample data. For one thing, there is no student in Years 10 and 11 with an IQ higher than 120. Also, no single class has a very distinctive majority. These differences can be shown by a frequency polygon.
As you can see, the difference from the original frequency polygon is that it is broader. The frequencies are more evenly distributed between the IQ values.
Frequency table of SAT results
Since there isn’t a large range of values, I can use an ungrouped frequency table for the SAT results data.
My frequency table, like that for the general data, shows that the predominant level is Level 4. However, as with the Y10 and 11 IQ table, the frequencies are more evenly distributed.
This is the frequency polygon of the SAT results
The frequency polygon shows the data in the table and illustrates the difference of distribution from the original sample frequency polygon.
Mean of year 10 and 11
I need to find the mean for IQ and SAT results. I used the same method as before. I will see whether they match the means of the general sample data.
Mean of IQ
For my mean of IQ I decided to do the calculation of sum of values using Excel. It is faster and more convenient than doing it by hand.
Mean = Sum of values/number of values
Mean = 3166/31
Mean = 102.13
This mean is smaller than for the main data, but this could be because I used a smaller number of values, so there is less variety.
Mean of SAT levels
Again I decided to do the calculation in Excel because it is more convenient and practical.
Mean = sum of values/number of values
Mean = 125/31
Mean = 4.03
As with the mean of IQ, the mean of the SAT results is slightly less than for the general sample data. Again this might be due to the smaller number of values.
Median of Y10 and 11
Now I have to find the median of my data. I will use the same data as before. To help me find the median I drew a stem-and-leaf diagram of the IQ.
Stem and leaf diagram
Key: 10 | 3 = 103
Now using the stem and leaf diagram I can find out the median of my IQ
The median of the IQ
The median of the IQ is the middle number from a set of data. Using the stem-and-leaf diagram I am able to see what the middle value is. I have 31 values, so the middle one is the 16th value. The median of this set of data is:
Median = 100
As you can see the median is again smaller than for the general sample data.
Median for the SAT results
I cannot use a stem-and-leaf diagram for my SAT levels because each value has only one digit. Instead I lined up all the values from smallest to largest. This will show me the middle value, and it will also be useful when I have to find my range.
2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5
Since there are 31 values, the 16th will be the middle value. The median of SAT results is:
Median = 4
Here the median is exactly the same as for the main sample data.
Mode of Y10 and 11
I will find the mode by looking at the frequency of each value.
Mode of IQ
The mode is obtained by finding the single value that occurs most often in the data. I already have frequency tables for both IQ and SATs. However, since the frequency table for IQ is grouped, I cannot use it to find the single value that occurs most often. Instead I use my stem-and-leaf diagram. From there I can see that the value that occurs most often is 100.
Mode = 100
This mode is the same as the mode of the entire sample, which was also 100.
Mode of SAT results
While I am unable to use the frequency table for my IQ, my SAT results frequency table isn’t grouped. From there and the frequency polygon I can see that the value which occurs most often is Level 4. So, the mode is:
Mode = Level 4
The mode is exactly the same as for the entire sample.
The range of Y10 and 11
Range is obtained by taking away the smallest number from the largest number. I can use my stem and leaf diagram, and lined up values to help me with finding the range. This is how I did it.
The IQ range
From the stem and leaf diagram, I know that the largest value is 119 and the smallest value is 76. So to calculate the range I:
Range = 119 – 76
Range = 43
This range is smaller than the range of the entire sample. I think this is because in the whole sample data there are two extremes that increase the range. That is a problem with ranges; they can be strongly affected by extreme data.
Range of SAT results
To find the smallest and the largest values, I looked at the values lined up in order of size, which I did when I was finding the median. From that I know that the largest value is 5 while the smallest value is 2. Therefore the range is:
Range = 5 – 2
Range = 3
As you can see the range is slightly smaller than the range of the main sample data. This might be because again the range has been affected by the extreme data, or in this case the lack of extreme data.
I decided that it is unnecessary to find the standard deviation for my subgroup. I don’t believe it could tell me anything important that the calculations already done haven’t told me. From comparing the averages of the entire sample and of the chosen part of the sample (Years 10 and 11), I gather that while most averages aren’t exactly the same, they do not vary too much. I think there are differences because in my main sample there are more values to be observed, so there will be more variation. Those variations and higher values caused the averages to be higher than for only a part of the sample.
However, most of the results are very similar to each other. This is because the averages found in my sample represent the whole school, so theoretically they should represent most of the students there. Of course there will be some cases where the averages will not match a student's results, but that is only normal.
From these summary statistics, I feel confident in saying that the averages obtained from my sample adequately represent the entire school's achievements in IQ and English results. Now I need to analyse my data. I will try to find out what the relationship is between IQ and SAT results in the school that has these particular averages.
Analysing my data
First I sorted my data by IQ. This will make my analysis of the data much easier.
Then I am going to draw a scatter graph to see if there is any relationship between the independent and dependent variables of this investigation. In this investigation I chose both variables, but one always depends on the other. My independent variable will be IQ, while the English SAT result is the dependent variable. In theory, SAT results should depend on the intelligence of the person.
In this graph, the independent variable will be plotted horizontally, while the dependent variable will be plotted vertically. This is because the one depends on the other.
This is my scatter graph. From it I see that there is a relationship between the two variables. The red dots are my samples, showing their SAT results in relation to their IQ. This shows what grade a pupil got for a given IQ. However, I need a measure of the relationship between the two variables. For this I will use the line of best fit, which is also called the regression line.
The Regression line
Regression lines can be used as a way of visually depicting the relationship between the independent (x) and dependent (y) variables in the graph. Once drawn, the regression line can be used to estimate the value of y that corresponds to a given value of x. This means that regression finds a line that predicts y from the values of x.
For my enquiry I will do the linear regression of my results. Linear regression works by minimizing the sum of the squares of the vertical distances of the points from the regression line, hence it is known as the "least squares" method. The calculation effectively minimizes the sizes of squares drawn between the data points and the regression line. A regression line is actually a running series of means of the expected value of y for each value of x.
First, we need to consider the standard equation for a straight line. The form you would generally see in Pure Mathematics is y = mx + c, but in Statistics the equation for a straight line is written y = a + bx.
Since I am trying to find out a linear regression line I use:
Y=a +bx
Where
a = constant known as intercept
b = constant known as gradient
Since the regression line is a series of means of the expected value of y for each value of x, it is sensible for the line to pass through the overall mean of x and y, which lies at the centre of the scatter diagram. We call the point of the means of the x values and y values $(\bar{x}, \bar{y})$. For the line to pass through this point, a and b have to satisfy:
$\bar{y} = a + b\bar{x}$ (1,1)
So that:
$a = \bar{y} - b\bar{x} = \frac{1}{n}\left(\sum y_i - b\sum x_i\right)$
To fix the line we need to find a value for the gradient as well. This method is often more convenient. The least-squares estimate is the best one to use for this purpose:
$b = \frac{S_{xy}}{S_{xx}}$ (1,2)
Where the quantities $S_{xy}$ and $S_{xx}$ (and, for completeness, the quantity $S_{yy}$) are given by:
$S_{xy} = \sum x_i y_i - \frac{\sum x_i \sum y_i}{n}$ (1,3)
$S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$ (1,4)
$S_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n}$ (1,5)
n = number of samples you are using to find the regression line.
Using the values from above for least-squares estimates a and b, the resulting line is described as the estimated regression line of y on x, and b is called the estimated regression coefficient.
Example 1:
We have data on the (approximately) constant speed of cars and their fuel consumption. Here y denotes mpg and x denotes mph. We have:
n = 16, $\sum x_i = 680$, $\sum x_i^2 = 29\,400$, $\sum y_i = 700.7$, $\sum y_i^2 = 30\,828.05$, $\sum x_i y_i = 29\,518.5$
First we calculate $S_{xy}$ and $S_{xx}$:
$S_{xy} = 29\,518.5 - \frac{680 \times 700.7}{16} = -261.25$
$S_{xx} = 29\,400 - \frac{680^2}{16} = 500.00$
Hence, the least-squares estimates b and a are given by:
$b = \frac{S_{xy}}{S_{xx}} = \frac{-261.25}{500.00} = -0.5225$
$a = \bar{y} - b\bar{x} = \frac{1}{16}\left[700.7 - (-0.5225 \times 680)\right] = 66.0$
The estimated regression line is: y = 66.0 – 0.5225x.
If I want to predict the average mpg for a car travelling at a steady 42 mph I use the formula:
66.0 – 0.5225 × 42 = 44.055: approximately 44 mpg.
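The steps of Example 1 can be checked with a short Python sketch that works only from the summary totals given in the example. The function name `least_squares_from_sums` and its parameter names are my own labels, not part of any library.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xx, sum_xy):
    """Return (a, b) for the line y = a + bx using the least-squares estimates."""
    s_xy = sum_xy - sum_x * sum_y / n   # formula for Sxy
    s_xx = sum_xx - sum_x ** 2 / n      # formula for Sxx
    b = s_xy / s_xx                     # gradient
    a = (sum_y - b * sum_x) / n         # intercept, so the line passes through the means
    return a, b

# Summary totals from Example 1 (speed in mph, fuel consumption in mpg).
a, b = least_squares_from_sums(n=16, sum_x=680, sum_y=700.7,
                               sum_xx=29400, sum_xy=29518.5)

predicted_mpg = a + b * 42  # prediction for a steady 42 mph
print(round(a, 1), round(b, 4), round(predicted_mpg))
```

Working from the totals rather than the individual readings means only five numbers are needed, which is why the summary table is so convenient.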
For my data, I could do this equation for five x values and then plot the values I get on to the scatter graph. With this I could make a regression line. I could also predict what average SAT grade I would get for any IQ value.
I did this calculation in Excel, using the Data Analysis tool. It gave me the formula of the regression line:
y = –1.6195 + 0.056x (1,6)
With this I can predict the SAT result for any given IQ.
This is a table of the results: the IQ values, the predicted SAT results and the residuals.
The r² of my graph is 0.3752, which means that while the regression line didn’t help that much, it did improve my ability to predict. With it I can predict the average grade a student will get for a certain IQ. This, admittedly, isn’t too precise, because many other factors are involved in getting grades, such as the amount of time spent studying. Nevertheless, there is a clear positive relationship between my two variables, which signifies that IQ does have an effect on the SAT results. To get a measure of that relationship, I have to find the correlation coefficient between my two variables.
Correlation coefficient
Correlation quantifies how closely two variables are connected; it measures the degree of relationship between them. There are many correlation measures that can be used with different sets of data. I will use the parametric Pearson correlation, or "r value". Its calculation is based on the assumption that both X and Y values are sampled from populations that follow a normal (Gaussian) distribution. It is most appropriate when both X and Y are random variables, which in my case they are. What the correlation (r) actually gives me is:
- A number between -1.00 and 1.00. This signifies the strength of the correlation with -1.00 being the strongest negative correlation, and 1.00 being the strongest positive correlation. 0 is when there is no correlation between two variables. For example, 0.9956 would mean there is strong positive correlation, whereas -0.945 would be a strong negative correlation.
We can write:
$r = \frac{\text{sample covariance}}{\sqrt{(\text{sample variance of } x) \times (\text{sample variance of } y)}}$
This is put in mathematical form in Pearson's formula.
To solve this equation you need a table of summary statistics. Since all of the data will already be in the table, the only thing you need to do is put each value in the correct place in the formula.
With this you can calculate
Since the denominator of the formula is just the standard deviations, there is no need for me to repeat the calculation. I already have the results.
Standard deviation of x = 102.83
Standard deviation of y = 0.56
With this I know what I need to divide the covariance by. Excel had already given me a result for the correlation when I calculated my regression line, so there is no need to repeat the calculation. I wished, however, to show how the formula works and what you need to find out to use it correctly.
This is my result for the correlation coefficient of my data:
r = 0.612513327
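How Pearson's r is obtained from summary totals can be sketched in Python. Since the totals for my own sample are not listed here, the sketch reuses the totals from Example 1 (the car data) purely as an illustration; the function name `pearson_from_sums` is an assumption. It also squares my r value to confirm that it agrees with the r² of 0.3752 given earlier.

```python
import math

def pearson_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    s_xy = sum_xy - sum_x * sum_y / n
    s_xx = sum_xx - sum_x ** 2 / n
    s_yy = sum_yy - sum_y ** 2 / n
    return s_xy / math.sqrt(s_xx * s_yy)

# Totals from Example 1 (car speed and fuel consumption), used only as an illustration.
r_cars = pearson_from_sums(n=16, sum_x=680, sum_y=700.7,
                           sum_xx=29400, sum_yy=30828.05, sum_xy=29518.5)
print(round(r_cars, 2))  # strong negative correlation for the car data

# Squaring the correlation coefficient of my own data gives r squared.
r_squared = 0.612513327 ** 2
print(round(r_squared, 4))
```

The car data gives a value close to –1 because fuel consumption falls steadily as speed rises, whereas my own r is moderate and positive.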
Conclusion
In my hypothesis I stated that the higher the IQ of a person, the higher the English SAT result that person will get. I performed an analysis of the data I obtained and I have found that hypothesis to be true. To see if there is a relationship between the two variables, I drew a scatter graph. From the scatter graph I could see that there is an obvious positive correlation between the two variables. To find the line of best fit I calculated a regression line. This again showed that there is a relationship between the two variables. But to get a measure of the relationship, I had to calculate the correlation coefficient. The correlation coefficient measures the degree of the relationship between two variables, where –1 is the strongest negative correlation, +1 is the strongest positive correlation and 0 means there is no correlation. I got the result that the correlation of my two variables is r = 0.612513327.
This result for my correlation coefficient means that there is a positive correlation between the variables, although it is not that strong. IQ does affect the English SAT results, and the higher the IQ, the higher the chance of getting good SAT results. However, the correlation isn’t stronger because there are many other factors which affect SAT results. While having a high IQ increases your chances, it isn’t the only thing that affects whether you get a good grade. Factors such as the amount of time spent studying or watching TV might also have an impact on the grade a person gets. The hypothesis is true, but there are many other things besides IQ that contribute to a high SAT grade.
I calculated the summary statistics to see what the averages are for the population of this particular school. The results I got, such as the correlation coefficient and regression line, apply only to the particular population that has these averages. It might be that this school isn’t an average school for the country and that you would get different results in other schools. While these results can be applied to this school, it is possible that they cannot be applied to different schools.
Evaluation
I performed a statistical enquiry using mathematical methods to prove or disprove my hypothesis. I have completed my investigation successfully, proving my hypothesis. I used summary statistics to describe a population. I have calculated the regression line and I have found the correlation coefficient. This investigation was done as thoroughly and accurately as I could possibly make it. I believe it was a success.
However, I believe there are certain ways I could improve the investigation. I know that IQ affects SAT results, but now I could try to see what other factors also affect the results. I could perform an investigation to see if the amount of time spent studying or watching TV has any effect on the SAT results.
Also, I could see if this data could be applied to more schools than one. I could get a sample of schools from the entire country and see if their summary statistics match mine. I could find out whether Mayfield High School is an average school for the country.