The next objective was to draw a random sample of 50 from the population. Firstly, the data was numbered from 1 to 106, in the order in which it was collected. A calculator's random number generator was then used to produce numbers less than 1, to 3 decimal places, so that 0.001 corresponded to item 1 and 0.106 to item 106. All numbers above 0.106 were rejected, as were any duplicates. As each random number was generated, the data item whose number in the population list matched it was added to the sample, until 50 sets had been collected. Because every item had an equal chance of being selected, this method of sampling should give a sample that is representative of the overall population.
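The selection procedure described above can be sketched in Python. This is only an illustration of the rejection method, not the calculator routine actually used: the function name `calculator_style_sample`, the `seed` parameter, and the rejection of 0.000 (which matches no item) are assumptions made for the sketch.

```python
import random


def calculator_style_sample(population_size=106, sample_size=50, seed=None):
    """Pick distinct item numbers by rejecting unusable random numbers."""
    rng = random.Random(seed)
    chosen = []
    seen = set()
    while len(chosen) < sample_size:
        # Mimic a 3-decimal-place random number: 0.042 is treated as 42.
        number = rng.randint(0, 999)
        # Reject numbers above 0.106, and 0.000 (assumed: it matches no item).
        if number < 1 or number > population_size:
            continue
        # Reject duplicates, as in the method described.
        if number in seen:
            continue
        seen.add(number)
        chosen.append(number)
    return chosen


if __name__ == "__main__":
    print(sorted(calculator_style_sample(seed=1)))
```

Most generated numbers are rejected (only 106 of the 1000 possible values are usable), which is why the calculator method is slow by hand but still gives every item the same chance of selection.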
The following table shows the results of the sample collected:
Modelling Procedures:
Now the data could be compared to see if there was a correlation. The first step was to draw a scatter diagram, with engine size on the X-axis and insurance group on the Y-axis.
The following graph was plotted on a computer running Autograph 3 and seems to suggest a correlation between the two variables.
Note: Some data points were exactly the same and have been plotted on top of one another, giving the impression that there are fewer than 50 plotted points.
The graph below shows the same data, this time showing that the points fall within an elliptical profile, which indicates a positive correlation. The shape of the ellipse indicates how strong the correlation is: the narrower the ellipse, the stronger the correlation.
As seen above, the ellipse is relatively narrow, suggesting a strong correlation exists. The elliptical profile also suggests that the data (consisting of two variables) can reasonably be modelled as bivariate normal, which is an assumption of the hypothesis test on the correlation coefficient.
One more check was carried out to see if a correlation exists: the quadrant test. This is a simple test that judges correlation from the number of points lying within each quadrant. If there are more points in regions 1 and 3 there is usually a positive correlation, whereas if more points lie in regions 2 and 4 there is usually a negative correlation.
It is seen that there are more points in regions 1 and 3, further suggesting that there is a positive correlation.
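The counting behind the quadrant test can be sketched as follows. The sketch assumes that the quadrants are formed by lines through the mean of each variable, and that the regions are numbered 1 (upper right), 2 (upper left), 3 (lower left) and 4 (lower right); the function name `quadrant_counts` is hypothetical and the four data points are made up purely to show the call.

```python
def quadrant_counts(xs, ys):
    """Count the points in each quadrant around the mean point."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        if dx == 0 or dy == 0:
            continue  # points on a dividing line belong to no quadrant
        if dx > 0 and dy > 0:
            counts[1] += 1  # upper right
        elif dx < 0 and dy > 0:
            counts[2] += 1  # upper left
        elif dx < 0 and dy < 0:
            counts[3] += 1  # lower left
        else:
            counts[4] += 1  # lower right
    return counts


if __name__ == "__main__":
    # Made-up illustrative data, not the sample from this investigation.
    print(quadrant_counts([1, 2, 3, 4], [1, 2, 3, 4]))
```

A clear majority of points in regions 1 and 3 points towards positive correlation; a majority in regions 2 and 4 points towards negative correlation.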
Analysis:
To examine just how strong the correlation is, the ‘Pearson's Product Moment Correlation Coefficient (PPMCC)’ was calculated. This gives a value of ‘r’, which measures the strength of the correlation and lies between -1 and +1. If r = -1, there is a perfect negative correlation, and if r = +1, there is a perfect positive correlation.
Before the correlation coefficient (‘r’) can be found, the mean values for the data need to be found, along with values for the sums of XY, X² and Y². Autograph was used to calculate these values (based on ‘Engine Size’ as X, and ‘Insurance Group’ as Y). The following shows the formula used to calculate the mean:
Once the mean had been found, the PPMCC equation could be used as follows:
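As a cross-check on the Autograph calculation, the coefficient can be computed from the same quantities (the means and the sums of XY, X² and Y²). The sketch below uses one standard form of the formula, r = Sxy / √(Sxx × Syy); this may differ in layout from the form used in the investigation, and the function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient from raw data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sxy, Sxx and Syy built from the sums of XY, X^2 and Y^2.
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * mean_x * mean_y
    s_xx = sum(x * x for x in xs) - n * mean_x ** 2
    s_yy = sum(y * y for y in ys) - n * mean_y ** 2
    return s_xy / math.sqrt(s_xx * s_yy)
```

Calling `pearson_r(engine_sizes, insurance_groups)` on the two columns of the sample (hypothetical variable names) should reproduce the value of r reported by the software, up to rounding.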
Now that a correlation coefficient had been found for the sample data, it needed to be tested to see whether it was significant. Therefore a hypothesis test was used to see whether the sample correlation provided evidence of correlation in the entire population. The data was tested at a 5% significance level. The two hypotheses used were as follows:
Null Hypothesis: H0: ρ = 0 (there is no correlation)
Alternative Hypothesis: H1: ρ > 0 (there is some positive correlation)
Where ‘ρ’ (‘rho’) is the parent population correlation coefficient.
If the value calculated from the PPMCC, ‘r’, is less than the critical value (in this case, the tabulated correlation coefficient) then the null hypothesis would not be rejected. This would mean there was insufficient evidence of correlation at the 5% significance level.
If, however, ‘r’ is greater than the critical value then the null hypothesis would be rejected in favour of the alternative hypothesis, meaning there was significant correlation (and in this case, as H1 is ρ > 0, a positive correlation).
The critical value (from the tabulated correlation coefficient) was 0.2353.
The value from the PPMCC test was 0.8895.
0.8895 > 0.2353
Therefore, the null hypothesis was rejected, and the alternative hypothesis accepted. Hence, this provides further evidence that there is a positive correlation.
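The decision rule for this one-tailed test can be written out as a short sketch. The two numbers are those quoted above; the function name `reject_null` is hypothetical.

```python
def reject_null(r, critical_value):
    """One-tailed test of H0: rho = 0 against H1: rho > 0.

    Returns True when the sample coefficient exceeds the tabulated
    critical value, i.e. when H0 is rejected.
    """
    return r > critical_value


if __name__ == "__main__":
    r_sample = 0.8895    # value from the PPMCC calculation
    r_critical = 0.2353  # tabulated critical value at the 5% level
    print(reject_null(r_sample, r_critical))  # True
```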
Interpretation:
It was seen from the value of the ‘Pearson's Product Moment Correlation Coefficient’ that a strong positive correlation exists between the engine size of a car and the insurance group in which it is placed. This was confirmed by carrying out a hypothesis test, at a 5% significance level, on the sample dataset (a sub-set of the population), which supported this positive correlation.
Referring back to the original aim (and introduction), this investigation was to see if there is a positive correlation between the insurance category of cars and their engine size. From the various correlation techniques and the hypothesis test, there was adequate evidence to suggest, from the sample dataset collected (from the overall population), that the larger the engine size, the higher the insurance group of the car. This suggests that insurance companies are not singling out ‘student cars’ and charging more for a car simply because it fits the student profile (e.g. cheap, sightly aesthetics, small etc).
Accuracy & Refinements:
Firstly, the sample (50 datasets) was selected using random numbers generated by a calculator. Whilst this method does produce numbers that appear random, they are generated by a formula, and so are not truly random. An alternative approach would have been to use a systematic sample, obtained from the parent population (once the data was ordered by a variable, e.g. insurance group) by counting through the sampling frame and selecting at a fixed interval, e.g. every 2nd or 4th dataset.
Secondly, if a larger sample had been collected, the estimate of the correlation would have been more reliable. There would be more points to plot and therefore the correlation would be much more representative of the entire population (e.g. a sample of 500 cars out of 50,000 in Essex), even if there were more outliers.
Thirdly, it was felt that using ‘secondary’ data gave rise to possible bias and errors in data collection. If the data had been ‘primary’, that is, collected by the researchers themselves, it may have been more accurate. With regard to this investigation, it is possible that because the company was selling cars, there may have been some bias in which cars it chose to buy and sell. Cars of a poor standard would not have been purchased for resale.
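The systematic sample suggested in the first refinement could be drawn along the following lines. This is a sketch only: the function name `systematic_sample` is hypothetical, and the random starting point is an assumption rather than part of the description above. The list passed in is assumed to be already ordered by the chosen variable.

```python
import random


def systematic_sample(items, step, seed=None):
    """Take every `step`-th item from an ordered list, from a random start."""
    rng = random.Random(seed)
    start = rng.randrange(step)  # random position within the first interval
    return items[start::step]


if __name__ == "__main__":
    population = list(range(1, 107))  # item numbers 1 to 106
    sample = systematic_sample(population, step=2, seed=0)
    print(len(sample))  # 53
```

With 106 items, taking every 2nd item gives 53 datasets rather than exactly 50, so the interval or the final sample size would need adjusting.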