To detect outliers I had to use a specific formula. If I had just deleted data by hand there is a chance that I might have missed some outliers, which would make my data inaccurate.
For example:
To find any outliers under 29 I used this formula:
=IF(J3<29,"outlier")
This makes the word “outlier” appear next to a figure if it is under 29.
To find any outliers over 69 I used another formula:
=IF(J10>69,"outlier")
This makes the word “outlier” appear next to a figure if it is over 69.
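As a rough sketch, the same check could also be written in Python. The limits 29 and 69 come from the formulas above; the function name and the example values are made up for illustration and are not part of my spreadsheet.

```python
# Cut-offs taken from the spreadsheet formulas: below 29 or above 69.
LOWER_LIMIT = 29
UPPER_LIMIT = 69


def flag_outlier(value, lower=LOWER_LIMIT, upper=UPPER_LIMIT):
    """Return "outlier" if the value is outside the limits, otherwise ""."""
    if value < lower or value > upper:
        return "outlier"
    return ""


# Made-up values standing in for one column of the spreadsheet.
values = [45, 28, 69, 70, 29]
flags = [flag_outlier(v) for v in values]

for value, flag in zip(values, flags):
    print(value, flag)
```

Values of exactly 29 or 69 are not flagged, which matches the "<" and ">" used in the formulas.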
Sampling
We use sampling to help make sure that our data is consistent. The larger the sample size, the more information you will have to compare, and therefore the more representative the statistics will be of the whole data set.
Random Sampling
This is when every item of data has an equal chance of being selected for the sample. Some ways of doing random sampling include:
- Numbering each item and using a computer to choose numbers
- Using a random number generator (assuming that you have numbered each item)
- Putting numbers in a hat and then selecting however many samples are needed (similar to bingo)
Although just writing down any numbers yourself would technically be random sampling, it is not as reliable as a number generator, for instance, because a generator is completely random and a person's choices are not.
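A minimal sketch of random sampling with a computer, assuming the items have already been numbered. The function name, the population of 200 and the sample size of 40 are only illustrative.

```python
import random


def random_sample(items, sample_size, seed=None):
    """Pick sample_size different items, each with an equal chance."""
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    return rng.sample(items, sample_size)


# Items numbered 1 to 200, like numbers put in a hat.
numbered_items = list(range(1, 201))
sample = random_sample(numbered_items, 40, seed=1)
print(sorted(sample))
```

Because `random.sample` never picks the same item twice, this behaves like drawing numbers from a hat without putting them back.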
Stratified Sampling
When doing stratified sampling, the first thing you have to do is divide the data into categories that some of the data have in common, e.g. age, social class, hair colour, religion or gender. Once you have divided the data into groups, you take a random sample from each group. To make it fair, the sample taken from each category must be in proportion to how large that category is, so that the same fraction is sampled from every category.
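A rough sketch of stratified sampling, assuming the sample from each category should be in proportion to the size of that category. The group names and sizes below are made up for illustration.

```python
import random


def stratified_sample(groups, total_sample_size, seed=None):
    """Take a random sample from each group, in proportion to its size."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in groups.values())
    sample = {}
    for name, members in groups.items():
        # Share of the sample for this group; rounding can shift the
        # overall total by one or two when the shares are not whole numbers.
        share = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, share)
    return sample


# Made-up groups: 120 pupils in one category and 80 in another.
groups = {"Group A": list(range(120)), "Group B": list(range(80))}
sample = stratified_sample(groups, 40, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
```

Here Group A makes up 120 out of 200 items, so it gets 24 of the 40 places in the sample, and Group B gets the other 16.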
Systematic sampling
This is done by choosing data to sample at a fixed interval. Although it is similar to random sampling, it is known to be more efficient, and it is more useful for large amounts of data. For example, if you wanted a sample of 40 students out of 200, you would take every 5th student (200 ÷ 40 = 5). To pick the first student you would choose a random number between 1 and 5. So if 3 is picked, the 3rd, 8th, 13th, 18th, 23rd etc. students would be picked.
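A minimal sketch of systematic sampling, assuming the interval is the population size divided by the sample size (so for 40 students out of 200 the interval works out as 200 ÷ 40 = 5). The function name and the numbered list of students are only illustrative.

```python
import random


def systematic_sample(items, sample_size, seed=None):
    """Pick a random starting point, then take every interval-th item."""
    rng = random.Random(seed)
    interval = len(items) // sample_size
    start = rng.randrange(interval)  # position 0 to interval - 1 in the list
    return items[start::interval][:sample_size]


# Students numbered 1 to 200.
students = list(range(1, 201))
sample = systematic_sample(students, 40, seed=1)
print(sample)
```

Only the first student is chosen at random; every other student in the sample is fixed by the interval.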
Cluster Sampling
Putting the data into ‘clusters’ or groups is done by first choosing how much data should be in each cluster. Unless there is a good reason not to, you should make sure that each cluster has the same amount of data. Once all the data is separated, clusters are chosen at random and every item in each chosen cluster is analysed.
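A rough sketch of cluster sampling, assuming the data is simply cut into equal-sized clusters in the order it is listed. The cluster size of 20 and the choice of 2 clusters are made-up numbers for illustration.

```python
import random


def cluster_sample(items, cluster_size, clusters_to_pick, seed=None):
    """Split items into clusters, pick whole clusters at random."""
    rng = random.Random(seed)
    # Cut the list into consecutive clusters of cluster_size items.
    clusters = [
        items[i:i + cluster_size]
        for i in range(0, len(items), cluster_size)
    ]
    chosen = rng.sample(clusters, clusters_to_pick)
    # Every item in each chosen cluster goes into the sample.
    return [item for cluster in chosen for item in cluster]


items = list(range(1, 201))
sample = cluster_sample(items, cluster_size=20, clusters_to_pick=2, seed=1)
print(sample)
```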
Quota sampling
Although this type of sampling has the issue of being biased, it is very popular in market research. This is because it separates the population into groups depending on a characteristic they possess, e.g. height, weight or age. Once the population has been separated into groups for each characteristic, a quota (a set number) is sampled from each group, depending on what the sample is needed for.
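A minimal sketch of quota sampling, assuming people are taken in the order they turn up until the quota for their group is full. The names, age groups and quotas are made up for illustration.

```python
def quota_sample(people, characteristic, quotas):
    """Take people in order until the quota for each group is full."""
    sample = []
    counts = {group: 0 for group in quotas}
    for person in people:
        group = person[characteristic]
        if group in counts and counts[group] < quotas[group]:
            sample.append(person)
            counts[group] += 1
    return sample


# Made-up people, listed in the order they were met.
people = [
    {"name": "A", "age_group": "under 16"},
    {"name": "B", "age_group": "16 and over"},
    {"name": "C", "age_group": "under 16"},
    {"name": "D", "age_group": "under 16"},
    {"name": "E", "age_group": "16 and over"},
]
quotas = {"under 16": 2, "16 and over": 1}
sample = quota_sample(people, "age_group", quotas)
names = [person["name"] for person in sample]
print(names)
```

Nothing here is chosen at random, which is where the bias comes from: whoever happens to come first fills the quota.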
Convenience sampling
This type of sampling is about getting the quickest and easiest sample available. For example, if the task was to pick out 10 cars for an MOT in a garage, it could simply be done by choosing the first 10 cars you see. This type of sampling can lead to the data being unrepresentative and biased, but on the other hand it is quick, easy and convenient.
Graphs and calculations
To prove my hypotheses correct I am going to use certain graphs and calculations.
Each will need a different approach, so I will show my methods for each hypothesis.
Hypothesis 1
Year 7s will not be as tall as the Year 11s
To compare the difference in height between the Year 7s and the Year 11s I am going to use a box and whisker plot. This will enable me to compare the average (median) heights of each year while still looking at any anomalies that are present.
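As a rough sketch, the five numbers that a box and whisker plot is drawn from (minimum, lower quartile, median, upper quartile, maximum) can be worked out in Python. The heights below are made up for illustration and are not my real data, and graphing software may use a slightly different rule for the quartiles.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    lower_quartile, median, upper_quartile = statistics.quantiles(data, n=4)
    return min(data), lower_quartile, median, upper_quartile, max(data)


# Made-up heights in centimetres, only to show the calculation.
example_heights_cm = [142, 148, 150, 153, 155, 159, 165]
summary = five_number_summary(example_heights_cm)
print(summary)
```

The box runs from the lower quartile to the upper quartile with a line at the median, and the whiskers reach out towards the minimum and maximum.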
Analysis
Using a box and whisker plot has been beneficial in proving this hypothesis correct. It clearly shows that the Year 11s are taller than the Year 7s; the Year 7s' heights are plotted on the blue graph and the Year 11s' on the yellow graph. I initially chose this hypothesis because I knew it would help me differentiate between the two sets of data.
Hypothesis 2
Females will have a higher BMI (Body Mass Index) than males
To prove this hypothesis I am going to use two different types of graph so that I can clearly justify it being correct. The types of graph I am going to use are a histogram and a box and whisker plot. As I have shown previously, I am going to use this equation:
BMI = weight (kg) ÷ height (m)²
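A minimal Python sketch of the BMI calculation, assuming weight is measured in kilograms and height in metres. The function name and the example figures are made up for illustration.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight divided by height squared."""
    return weight_kg / (height_m * height_m)


# Made-up example: someone who weighs 50 kg and is 1.6 m tall.
example_bmi = bmi(50, 1.6)
print(round(example_bmi, 1))
```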
Analysis
In the graphs above I showed the BMI of males and females. I did this by presenting a box and whisker plot, which enables me to compare the medians and spread of each set of data. I have also used a histogram on which I have displayed the normal distribution (shown as the blue curved line) and the standard deviation (shown as the purple indents in the data line). This shows how the data is distributed.
Hypothesis 3
The taller you are the more you will weigh; this will result in a strong positive linear correlation
To make sure I get a clear and accurate answer when investigating whether being taller means weighing more, and whether this gives a strong positive linear correlation, I am going to use a scatter graph. I have decided to split my data, due to the vast amount of it, into KS3 and KS4.
KS3
KS4
Analysis
The graphs above show the differences between the linear correlation of KS3 and KS4. Under the graphs I have shown the number of points, the mean of the x data and y data, the standard deviation of the x and y data, the correlation coefficient, Spearman's rank correlation coefficient, the equation of the y on x line of best fit and the equation of the x on y line of best fit. All of these statistics help me support my hypothesis that the taller you are the more you will weigh. I can see this by looking at Spearman's rank correlation coefficient: both graphs show a positive correlation because the coefficient is over 0. However, both graphs indicate only a weak correlation because the coefficient is under 0.5, so the correlation is not as strong as I predicted.
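As a rough sketch, the correlation coefficient and Spearman's rank correlation coefficient could be worked out in Python as shown here. Spearman's coefficient is calculated as the ordinary correlation coefficient of the ranks. The heights and weights are made up for illustration and are not my real data.

```python
import math


def pearson(xs, ys):
    """Product moment correlation coefficient of two lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up heights (m) and weights (kg), only to show the calculation.
example_heights = [1.50, 1.55, 1.60, 1.65, 1.70]
example_weights = [45, 52, 50, 58, 60]
print(pearson(example_heights, example_weights))
print(spearman(example_heights, example_weights))
```

A coefficient above 0 means a positive correlation, and the closer it is to 1 the stronger the correlation.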
Conclusion
For my data I decided not to use sampling, because I thought it would be easier to use all of the data and I knew that the software I was using was able to cope with the large amount of data provided to me.
For my first hypothesis I wrote “Boys will generally weigh more than girls”. This was proved because both my graphs showed positive correlations, but the graph that showed the boys' results had a higher correlation coefficient than the girls'. The problem I encountered while making the graphs was that I had too many outliers that I had not successfully got rid of. Not getting rid of the outliers meant that the x and y axes had to be broadened to fit all the data in, which made the data look cramped. To fix this problem I had to go over my graph data again and get rid of my outliers properly.
My second hypothesis was “Year 7s will not be as tall as the Year 11s”, for which I used a box and whisker plot. Here I used Autograph software, which enabled me to put two box and whisker plots on one page. The hypothesis was proved because the box plots showed that the Year 7s were not, on average, as tall as the Year 11s, with a few exceptions. The problem that occurred was that there was too much data, which made the graph too big, so I had to narrow it down, with outliers being a big factor in this.
My third hypothesis was “Females will have a higher BMI (Body Mass Index) than males”. For this I used a wide range of data interpretation, showing the data in box and whisker diagrams and histograms. The box and whisker diagrams enabled me to show the actual differences in the averages of the data, and the histograms let me show the normal distribution and the standard deviation. I showed the histograms in frequency density, which helped round out all the data. The problem I encountered was using the equation to actually find the BMI. I used a formula in Excel which enabled me to find the BMI quickly and easily; this was “=J5/(I5*I5)”.
My last hypothesis was “The taller you are the more you will weigh; this will result in a strong positive linear correlation”. This was proved by using a scatter graph in the Autograph software, which helped me show all the different statistics that I discussed in my analysis.
I achieved most of my aims, but nothing in statistics is ever proved; the evidence just makes a hypothesis seem more likely.