I will draw scatter graphs, histograms and cumulative frequency curves (for cost compared with age across the whole population, followed by individual Makes)
to try to distinguish any correlations (patterns) in the cost distribution.
Fortunately, the median is not affected by extreme values so the box plots will be fine, as will any other cumulative frequency curves.
Most variables would affect the cost, but by how much remains to be seen.
I am now satisfied that results from my sample will represent the population as a whole.
1st Hypothesis
I have said that age is the most important factor and so I will try to explore it further.
What I think will happen is:
As age increases, the cost will decrease.
This is because I expect there to be a very strong correlation between these factors. I will look at one model as an example because, if I look at a group of similar cars, I can be confident that any differences in cost I observe are because of age and not anything else. For this model, I will draw a scatter graph of cost against age.
For this scatter graph, I put Age on the x-axis and Cost on the y-axis, as Cost depends on Age. This makes Cost the dependent variable. I also included a line of best fit and an R² measure of correlation, where R² = 1 is a perfect correlation.
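As a rough sketch of how the line of best fit and the R² value could be worked out by hand rather than read from a spreadsheet, the calculation below uses the usual least-squares formulas. The ages and costs are made up purely for illustration and are not taken from my data set, and the function name is my own.

```python
# Illustrative values only: these are NOT taken from the data set.
ages = [1, 2, 3, 4, 5, 6]                      # age in years (x-axis)
costs = [9000, 7800, 6500, 5600, 4200, 3500]   # cost in pounds (y-axis)


def best_fit_and_r2(xs, ys):
    """Return (gradient, intercept, r, r squared) for the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    r = sxy / (sxx * syy) ** 0.5   # correlation coefficient, between -1 and 1
    return gradient, intercept, r, r * r


gradient, intercept, r, r2 = best_fit_and_r2(ages, costs)
print(f"cost = {gradient:.0f} x age + {intercept:.0f}")
print(f"R = {r:.3f}, R squared = {r2:.3f}")
```

A negative gradient and a negative R show that cost falls as age rises, while R² (always between 0 and 1) shows how closely the points follow the line.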
On examining this graph, I can see a very strong negative correlation between cost and age. Based on these findings, I can easily make my first hypothesis:
There is a fairly strong negative correlation between Cost and Age.
2nd Hypothesis
I am now going to investigate the variable that I believe would be the next most important in determining the second-hand cost. This is Mileage. I think that, like the age variable, there may be a negative correlation between Mileage and Cost. Like before, I will make scatter graphs based on each make in my sample, except this time they will be for Mileage and Cost.
I believe that this will clearly show a strong negative correlation. Again, to allow me to see this more clearly (in case of any missing or anomalous values), I will add a line of best fit and a calculation for R² (this being the measure of correlation, where R² = 1 is a perfect correlation). Based on this analysis, I am going to make my second hypothesis:
There is a fairly strong negative correlation between Mileage and Cost.
3rd Hypothesis
From my analysis of Cost against Age, I believe the second-hand price must be affected by other variables and I think that Engine Size should be considered. Whilst the price of a second-hand car is mostly determined by its Age, this can be distorted by its Mileage and its Engine Size. I have already determined that Mileage affects Cost, but I am now going to see if the cars that go against the general pattern are affected by Engine Size. I will group the cars into three intervals of Engine Size. I shall also mark these groups on new copies of the Cost against Age scatter graphs.
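To show what grouping into three intervals might look like, here is a small sketch. The boundaries of 1.4 and 1.8 litres and the group names are assumptions chosen only for illustration; they are not the intervals used in my graphs.

```python
# Hypothetical boundaries (in litres), assumed for illustration only.
SMALL_LIMIT = 1.4
MEDIUM_LIMIT = 1.8


def engine_group(size_litres):
    """Place an engine size into one of three intervals."""
    if size_litres < SMALL_LIMIT:
        return "small"
    if size_litres < MEDIUM_LIMIT:
        return "medium"
    return "large"


# Made-up engine sizes, not taken from the data set.
for size in [1.0, 1.4, 1.6, 2.0]:
    print(size, engine_group(size))
```

Each point on the Cost against Age scatter graph could then be labelled or coloured by its group.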
Engine Size
I am satisfied on the basis of these observations that my 3rd hypothesis is correct. That is to say:
The Cost is influenced by Age, Mileage and Engine Size.
Testing The Hypothesis.
Random Sampling
I am going to take a sample of the data and so I need to select a sampling technique. There are a number of possibilities to use.
Random Sampling: I could generate a set of random numbers between 1 and 198 and use these to select my sample, but this might not represent the population very well because, by chance, some Makes or Models could be over- or under-represented, leaving the sample skewed (biased).
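For comparison, a simple random sample of this kind could be generated as in the sketch below. The sample size of 60 is an assumption here, and the seed is only included so that the same numbers come out each time the code is run.

```python
import random

POPULATION_SIZE = 198   # cars are numbered 1 to 198
SAMPLE_SIZE = 60        # assumed sample size


def simple_random_sample(population_size, sample_size, seed=None):
    """Pick distinct car numbers from 1..population_size at random."""
    rng = random.Random(seed)
    # random.sample never picks the same number twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


chosen = simple_random_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=1)
print(chosen)
```

Nothing in this method forces each Make to appear in proportion, which is why the balance between Makes is left to chance.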
Systematic Sampling
If I ordered my data set by Make and Model and then chose cars at regular intervals (e.g. every 4th one), I think this would be an improvement on a random sample, as every Make and Model would be represented in a fairly balanced way, but I still think that there may be a better way to achieve my sample.
Stratified Sampling
If I decided from the beginning that the number of cars for each Make in the sample should be proportional to the number of cars of each Make in the total population, I think that this would achieve the best balance possible for the Make column. In addition, if each Model was fairly proportionally represented, then I would be confident that my sample was a good representation of the total population. Therefore, this is what I will do.
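The proportional calculation can be sketched as follows: each Make's share of the sample is (number of that Make in the population ÷ total population) × sample size. The Make names and counts below are invented for illustration (they only add up to 198 to match the population size), and the sample size of 60 is an assumption. Because the shares rarely come out as whole numbers, the sketch rounds down and then gives any spare places to the Makes with the largest fractions left over.

```python
# Hypothetical counts per Make, NOT the real figures from the data set.
population_counts = {"Make A": 66, "Make B": 55, "Make C": 44, "Make D": 33}


def proportional_allocation(counts, sample_size):
    """Share sample_size between groups in proportion to their sizes."""
    total = sum(counts.values())
    exact = {make: n * sample_size / total for make, n in counts.items()}
    # Round every share down first.
    allocation = {make: int(share) for make, share in exact.items()}
    leftover = sample_size - sum(allocation.values())
    # Give the remaining places to the largest fractional parts.
    by_fraction = sorted(exact, key=lambda m: exact[m] - allocation[m],
                         reverse=True)
    for make in by_fraction[:leftover]:
        allocation[make] += 1
    return allocation


allocation = proportional_allocation(population_counts, 60)
print(allocation)
```

With these made-up counts the exact shares are 20, 16.67, 13.33 and 10, so the one spare place goes to Make B.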
Number Of Each Make In Population
I will choose a sample of 60 because it is small enough to be manageable but large enough to represent the population fairly fully, and also to be broken up into sub-groups. As the result is 15, I will sample 15 from each group.
However, in order to get a good balance of Models in each make, I will order my data by Make and Model, and use systematic sampling to select 15 from each Make. As there are 8 places I could start for each Model, I shall choose a random number between 1 and 8 to start so that the cars at the beginning and end of each Make are equally likely to be chosen. In other words, I am not biasing my sample towards the cars at the start of the lists.
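A minimal sketch of this systematic selection with a random starting point is given below. The list of 120 cars is hypothetical and is only chosen so that picking 15 gives an interval of 8; the function name is my own.

```python
import random


def systematic_sample(items, n, rng=None):
    """Pick n items at a fixed interval, starting from a random position."""
    rng = rng or random.Random()
    step = len(items) // n          # interval between chosen cars
    if step < 1:
        raise ValueError("not enough items to pick n of them")
    start = rng.randint(1, step)    # random start from 1 to step inclusive
    return [items[start - 1 + i * step] for i in range(n)]


# Hypothetical Make with 120 cars, already ordered by Model.
cars = [f"car {i}" for i in range(1, 121)]
sample = systematic_sample(cars, 15, random.Random(1))
print(sample)
```

Because the start is random, every position within the first interval has the same chance of being used, so the cars at the top of the list are not favoured.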
Plan
In General
- I will draw all scatter graphs for age against cost/mileage. Then I will examine the correlation (if any) for one make and one model.
- I will take a sample to make the data more manageable. I need a good representation of the total population, which is not biased in favour of any particular make or model.
- I will compare histograms for cost and age etc. I will also draw a cumulative frequency curve for cost to find the median, quartiles and the inter-quartile range.
- I will draw box plots for cost and age for each make to see if there is any truth in my observations, such as any make's cars being too old or too new on average.
- I will use scatter diagrams to look at the cost against age and mileage against age for each make. I will use trend lines to provide estimates for missing data.
- I will then consider mileage from the same sample as for age. I will amend the scatter graphs to see if there is any pattern and show other features such as the equation and the R² value.
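The median, quartiles and inter-quartile range mentioned in the plan can also be calculated directly from the values, as a check on what is read from a cumulative frequency curve. The sketch below uses Python's statistics module with made-up costs (not from my sample). There are several conventions for quartiles, so values read from a curve may differ slightly from these.

```python
import statistics

# Made-up costs in pounds, for illustration only.
costs = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]

# quantiles(..., n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(costs, n=4, method="inclusive")
iqr = q3 - q1

# The five values needed to draw a box plot.
five_number_summary = (min(costs), q1, median, q3, max(costs))
print(five_number_summary, "IQR =", iqr)
```

The inter-quartile range measures the spread of the middle half of the data, so it is not pulled about by one very cheap or very expensive car.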
In Moderate Detail
I am now going to investigate the relationship between cost and make in my sample. I will begin by looking at the distribution of cost against make using box plots. From the age box plots, I can see that the Ford cars are slightly older than the others on average, but the oldest cars are Peugeots.
In terms of age, the Vauxhall cars are the most spread out because they have the largest inter-quartile range, while the Ford cars have the smallest inter-quartile range. From the cost box plots, I can see that Ford and Vauxhall have the joint lowest prices on average, with a low median of £3000. However, the most expensive car is a seven-year-old Mondeo with low mileage. In terms of cost, the Ford cars are the most spread out, with a large inter-quartile range.
From the scatter graphs, I can see there is a fairly strong negative correlation between cost and age, meaning there is a definite trend. (However, as more differences are introduced, the correlation, as measured by the R² value, grows weaker.) That is to say, for one model the correlation is strongest, and for one make it is still quite good, but for the whole sample the correlation is low. As the correlation for makes is quite good, I am going to use the trend line to estimate the missing values in my sample.
I will round to the nearest £50 because most of the other costs are rounded this way. Also, it would not be appropriate to give a higher degree of accuracy as the correlation is not strong enough to allow for this.
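As a sketch of how an estimate could be produced and rounded, the code below puts an age into a trend-line equation and rounds the answer to the nearest £50. The gradient and intercept are invented for illustration and are not the equation from my graphs.

```python
import math

# Hypothetical trend line: cost = GRADIENT x age + INTERCEPT (made-up values).
GRADIENT = -1100    # pounds lost per year of age
INTERCEPT = 9950    # estimated cost at age 0, in pounds


def round_to_nearest_50(pounds):
    """Round a positive amount to the nearest 50, with halves rounding up."""
    return 50 * math.floor(pounds / 50 + 0.5)


def estimate_cost(age_years):
    """Estimate a missing cost from the trend line, to the nearest 50 pounds."""
    return round_to_nearest_50(GRADIENT * age_years + INTERCEPT)


print(estimate_cost(4))     # 9950 - 4400 = 5550
print(estimate_cost(3.5))   # 9950 - 3850 = 6100
```

I have used floor(x + 0.5) rather than Python's built-in round because round sends exact halves to the nearest even number, which would turn £4325 into £4300 instead of £4350.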
These estimates are only moderately reliable as the correlation between cost and age is only moderately strong. Based on this analysis of my sample (which was a reasonably good representation), I believe I have evidence to support the first hypothesis.
There are clearly other factors influencing the cost, so I will now move on to mileage. Firstly, I will draw box plots for mileage. From these I can see that Ford cars have the lowest median mileage. This seems strange as they were the oldest on average.
Vauxhalls are the most spread out, which fits in with their ages being spread out. I will now draw a scatter graph for the sample and for each make. I believe that, on the whole, these broadly confirm my hypothesis. The whole sample shows a weak to moderate negative correlation.
It seems to me that I was correct. My initial prediction, that age is the most important variable but mileage is also important, was true. The next thing I am going to do is indicate the mileage and the engine size for each point on the cost against age scatter graphs for each make, to see if this gives me any insight into the influence of mileage or engine size.
Having looked at these, I don't see a very clear pattern. On average, cars with a lower mileage have higher prices but, as always, there are exceptions.