Bivariate Data Exploration

Maths Coursework Tim Durden

STATISTICS 2:

Aim:

The aim of this investigation is to see if there is a correlation between the engine size of a car and the insurance group in which it is placed.

Introduction:

There is an ever-increasing public demand for value-for-money products and services, especially in the car, shopping and clothing markets. For students this matters even more, as everything they buy (unless they are particularly affluent) can easily add to debt already built up through extensive student loans. Cars are very often an essential means of transport for students, so it is important for a student to get the best deal on their car.

However, insurance companies and car dealers are very much aware of the student situation and have classified certain cars as ‘student cars’. These include models from Peugeot (106, 306), Renault (Clio), Citroen (Saxo) and Vauxhall (Nova), to name but a few.

These cars all appear to have relatively small engines, commonly ranging from 900 to 1800cc, and are all placed in relatively low insurance groups (and therefore have lower insurance costs). This may not be the case for all cars, however, especially those with larger engines.

This investigation will examine data from a range of cars, varying in both engine size and insurance group. If a positive correlation is found between insurance group and engine size, then the concept of ‘student cars’ need not be such a worry when a student goes to buy their first car. If there is no correlation, however, then it is entirely possible that insurance companies are charging too much for cars in the ‘student car’ category.

Data Collection:

To start the investigation, data needed to be collected before any conclusions could be made. A local used car showroom was approached, and data from all cars on their forecourt (and in the showroom) was collected. This data was taken directly from the records that the company keeps for each car it attempts to sell, which meant the data was ‘secondary’. The data collected was the engine size and insurance group of each car. The population consisted of all cars on the company’s current sales list (regardless of age, mileage, fuel type etc). The population size was 106 (the maximum number of cars the company could fit on their land).

The next objective was to create a random sample of 50 cars from the population. Firstly, the data was numbered from 1 to 106, in the order that it was collected. The random number generator on a calculator was then used to produce numbers that were less than 1, given to 3 decimal places. All numbers above 0.106 were rejected, as were any duplicates. Each accepted number selected the car with the matching number in the population list (for example, 0.057 selects car 57), and this continued until 50 pairs of data had been collected. This ...
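The selection procedure described above can be sketched in Python. This is only an illustration of the rejection method, not the calculator routine actually used: the function name `draw_sample`, the `seed` parameter and the rejection of 0.000 (there is no car numbered 0) are assumptions made for the example.

```python
import random


def draw_sample(population_size=106, sample_size=50, seed=None):
    """Pick distinct car numbers by rejecting unusable random numbers.

    Mimics a calculator that returns 3-decimal numbers below 1:
    anything above 0.106 is rejected, as is any duplicate.
    Rejecting 0.000 is an assumption (no car is numbered 0).
    """
    rng = random.Random(seed)  # seed only makes a run repeatable
    chosen = []
    while len(chosen) < sample_size:
        thousandths = rng.randrange(1000)  # stands for 0.000 ... 0.999
        car_number = thousandths           # e.g. 0.057 -> car 57
        if car_number < 1 or car_number > population_size:
            continue  # no car carries this number, so reject it
        if car_number in chosen:
            continue  # duplicate, so reject it
        chosen.append(car_number)
    return chosen


if __name__ == "__main__":
    print(sorted(draw_sample(seed=1)))
```

Because roughly nine out of every ten generated numbers are rejected, this method needs several hundred random numbers to fill a sample of 50.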


The following table shows the results of the sample collected:

Modelling Procedures:

Now the data could be compared to see if there was a correlation. The first step was to draw a scatter diagram, with engine size on the X-axis and insurance group on the Y-axis.

The following graph was plotted on a computer running Autograph 3 and seems to suggest a correlation between the data.

Note: Some data pairs were exactly the same and have been plotted at the same point, giving the impression that there are fewer than 50 plotted points.

The graph below shows the same data with an ellipse drawn around the points. The elliptical profile indicates a positive correlation, and its width indicates the strength of that correlation: the narrower the ellipse, the stronger the correlation.

As seen above, the ellipse is relatively narrow, suggesting a strong correlation exists. The elliptical shape of the scatter also suggests that the data (consisting of two variables) can be modelled by a bivariate normal distribution, which is the assumption underlying the hypothesis test on the correlation coefficient.

One more check was carried out to see if a correlation exists: the quadrant test. This simple test divides the scatter diagram into four regions, typically using lines drawn through the mean of each variable, and judges the correlation by the number of points in each region. If there are more points in regions 1 and 3 there is usually a positive correlation, whereas if more points lie in regions 2 and 4 there is usually a negative correlation.

It is seen that there are more points in regions 1 and 3, further suggesting that there is a positive correlation.
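The quadrant count can be sketched as follows. The sketch assumes that the regions are formed by lines through the two means and numbered anticlockwise from the top right (region 1 is above both means, region 3 below both), and that points lying exactly on a mean line are left out. The data values are invented purely for illustration and are not the sample collected.

```python
def quadrant_counts(xs, ys):
    """Count points in each region formed by the lines x = mean_x, y = mean_y.

    Assumed numbering: 1 = top right, 2 = top left,
    3 = bottom left, 4 = bottom right.
    Points lying on either mean line are not counted.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for x, y in zip(xs, ys):
        dx, dy = x - mean_x, y - mean_y
        if dx == 0 or dy == 0:
            continue
        if dx > 0 and dy > 0:
            counts[1] += 1
        elif dx < 0 and dy > 0:
            counts[2] += 1
        elif dx < 0 and dy < 0:
            counts[3] += 1
        else:
            counts[4] += 1
    return counts


# Hypothetical engine sizes (cc) and insurance groups, for illustration only.
engine_cc = [1000, 1200, 1400, 1600, 2000, 2500]
insurance_group = [3, 5, 4, 8, 12, 15]
counts = quadrant_counts(engine_cc, insurance_group)
print(counts)  # {1: 2, 2: 1, 3: 3, 4: 0}
```

In this made-up example five of the six points fall in regions 1 and 3, which is the pattern that points towards a positive correlation.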

Analysis:

To examine how strong the correlation is, Pearson’s Product Moment Correlation Coefficient (PPMCC) was calculated. This gives a value of ‘r’ between -1 and +1, which measures the strength of the linear correlation. If r = -1, there is a perfect negative correlation, and if r = +1, there is a perfect positive correlation; a value close to 0 indicates little or no linear correlation.

Before the correlation coefficient (‘r’) can be found, the mean values for the data need to be found, along with values for the sums of XY, X² and Y². Autograph was used to calculate these values (based on ‘Engine Size’ as X, and ‘Insurance Group’ as Y). The following shows the formula used to calculate the mean:
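The formula itself is not reproduced in this copy of the essay. The standard definition of the sample mean, which is presumably what was used, is given here with n = 50 as the sample size, x as engine size and y as insurance group:

```latex
\bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}
```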

Once the mean had been found, the PPMCC equation could be used as follows:
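The equation is not reproduced in this copy of the essay. A common form of the PPMCC, written in terms of the sums mentioned above, is r = Sxy / √(Sxx × Syy), where Sxy = Σxy − (Σx)(Σy)/n, Sxx = Σx² − (Σx)²/n and Syy = Σy² − (Σy)²/n. The sketch below assumes this form; the function name `pmcc` and the small data set are illustrative only and are not the coursework data.

```python
from math import sqrt


def pmcc(xs, ys):
    """Product moment correlation coefficient from the summary sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    s_xy = sum_xy - sum_x * sum_y / n  # Sxy
    s_xx = sum_x2 - sum_x ** 2 / n     # Sxx
    s_yy = sum_y2 - sum_y ** 2 / n     # Syy
    return s_xy / sqrt(s_xx * s_yy)


# Tiny made-up data set: Sxy = 4, Sxx = 5, Syy = 5, so r = 0.8.
print(pmcc([1, 2, 3, 4], [1, 3, 2, 4]))
```

The function fails with a division by zero if every x value (or every y value) is identical, since the correlation is undefined in that case.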

Now that a correlation coefficient had been found for the sample data, it needed to be tested to see whether it provided significant evidence of a correlation in the entire population. A hypothesis test was therefore carried out at the 5% significance level. The two hypotheses used were as follows:

Null Hypothesis: H0: ρ = 0 (there is no correlation)

Alternative Hypothesis: H1: ρ > 0 (there is some positive correlation)

Here ‘ρ’ is the correlation coefficient of the parent population.

If the value calculated from the sample, ‘r’, is less than the critical value (in this case, the tabulated correlation coefficient) then the null hypothesis is not rejected. This would mean there was insufficient evidence of correlation at the 5% significance level.

If, however, ‘r’ is greater than the critical value then the null hypothesis is rejected in favour of the alternative hypothesis, meaning there is significant evidence of correlation (and in this case, as the alternative hypothesis is ρ > 0, a positive correlation).

The critical value (from the tabulated correlation coefficient) was 0.2353.

The value from the PPMCC test was 0.8895.

0.8895 > 0.2353

Therefore, the null hypothesis was rejected in favour of the alternative hypothesis. Hence, there is significant evidence at the 5% level of a positive correlation in the population.
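The comparison above can be cross-checked with a short sketch. Under the null hypothesis, and assuming bivariate normal data, the statistic t = r√(n − 2) / √(1 − r²) follows a t-distribution with n − 2 degrees of freedom, so a critical value for r can be recovered from a critical value for t. The figure 1.6772 (one-tailed 5% point of t with 48 degrees of freedom) is taken from standard tables and is an assumption of this sketch, as are the function names.

```python
from math import sqrt


def critical_r(t_crit, n):
    """Critical correlation coefficient implied by a critical t value."""
    return t_crit / sqrt(t_crit ** 2 + n - 2)


def t_statistic(r, n):
    """Test statistic for H0: rho = 0, based on a sample coefficient r."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


T_CRIT_48_DF = 1.6772  # assumed table value: one-tailed 5%, 48 degrees of freedom

r, n = 0.8895, 50  # sample coefficient and sample size from the investigation
r_crit = critical_r(T_CRIT_48_DF, n)
reject_h0 = r > r_crit

print(round(r_crit, 4))   # agrees with the tabulated 0.2353
print(t_statistic(r, n))  # far above 1.6772
print(reject_h0)          # True
```

Both routes give the same decision: the sample coefficient lies well beyond the critical value, so the null hypothesis is rejected.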

Interpretation:

It was seen from the value obtained for Pearson’s Product Moment Correlation Coefficient that a strong positive correlation exists between the engine size of a car and the insurance group in which it is placed. This was confirmed by carrying out a hypothesis test, at a 5% significance level, on the sample dataset (a sub-set of the population), which supported this positive correlation.

Referring back to the original aim (and introduction), this investigation set out to see if there is a positive correlation between the insurance group of cars and their engine size. The various correlation techniques and the hypothesis test gave adequate evidence, based on the sample dataset collected from the overall population, that the larger the engine size, the higher the insurance group of the car. This suggests that insurance companies are not singling out ‘student cars’ and charging more for a car simply because it fits the student profile (e.g. cheap, sightly aesthetics, small etc).

Accuracy & Refinements:

Firstly, the sample (50 data pairs) was selected using random numbers generated by a calculator. Whilst this method does produce random-looking numbers, they are generated by a formula, and so are not completely random. An alternative approach would have been to use a systematic sample, obtained from the parent population (once the data was ordered by a variable, e.g. insurance group) by counting through the sampling frame and selecting, for example, every 2nd or 4th data pair.
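A systematic sample of this kind can be sketched as follows. The function name `systematic_sample`, the record layout (engine size, insurance group) and the generated records are all hypothetical and serve only to show the mechanics of ordering and then counting through the list.

```python
def systematic_sample(records, key, step, start=0):
    """Order the records by `key`, then take every `step`-th one."""
    ordered = sorted(records, key=key)
    return ordered[start::step]


# Hypothetical population of 106 (engine_cc, insurance_group) pairs.
population = [(1000 + 10 * i, i % 20 + 1) for i in range(106)]

# Order by insurance group and take every 2nd record.
sample = systematic_sample(population, key=lambda car: car[1], step=2)
print(len(sample))  # 53
```

Taking every 2nd record from a population of 106 gives 53 data pairs rather than exactly 50, so the step or starting point would need adjusting to hit a chosen sample size.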

Secondly, if a larger sample had been collected, the estimate of the correlation would have been more reliable. There would be more points to plot and therefore the correlation would be much more representative of the entire population (e.g. a sample of 500 cars out of 50,000 in Essex), even if there were more outliers.

Thirdly, it was felt that using ‘secondary’ data may have given rise to bias and errors in data collection. If the data had been ‘primary’, that is collected by the researchers themselves, it may have been more accurate. With regard to this investigation, it is possible that because the company was selling cars, there was some bias in which cars it chose to buy and sell. Cars that were of a poor standard would not have been purchased for resale.
