I will exclude any cars for which there are relevant values missing or corrupt (i.e. if there are missing values in a column which I am not examining I will not exclude the car). I will also exclude a car if its original price is over £25000 as I will then regard it as a luxury car.
Hypothesis 2
The greater the age of a car the lower the percentage of original price will be.
I think this because a car will wear out throughout the years and will therefore become less valuable, as people will perceive it as being a less worthy buy. Its appearance will also get worse, which may affect the amount a buyer is prepared to offer for it.
I have chosen to test this factor because it was one that many people chose as an important one in the survey I did.
To test this hypothesis I will first create a new column in the data set giving the second hand price as a percentage of the original price. If, as I expect for Hypothesis 1, there is a link between original and second hand price, I will be able to use this information to help me analyse the effect of age. To calculate this column I will divide the second hand price by the original price and multiply by 100 to convert it to a percentage.
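The calculation of this column could be sketched in Python as below. This is only an illustration: the function name is an assumption, and the example uses the prices quoted for car 69 (original price £19530, second hand price £14999).

```python
def percentage_of_original(second_hand_price, original_price):
    """Return the second hand price as a percentage of the original price."""
    # Cars with a missing price are excluded, so reject them here.
    if second_hand_price is None or original_price is None:
        raise ValueError("missing price")
    if original_price <= 0:
        raise ValueError("original price must be positive")
    return second_hand_price / original_price * 100


# Car 69: second hand price divided by original price, as a percentage.
print(round(percentage_of_original(14999, 19530), 1))
```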
I will exclude any cars for which there are relevant values missing or corrupt (i.e. if there are missing values in a column which I am not examining I will not exclude the car). I will also exclude a car if its original price is over £25000 as I will then regard it as a luxury car.
Then I will sort the table by age of car so that cars of the same age are all grouped together. I will then create box plots for one, two and three year old cars to see whether the data is skewed.
If it is not I will calculate the mean % of original price for each data set and the standard deviation and compare the values. I would expect the mean to decrease as the age increases.
If the data is skewed then I will make box plots for each age group. To do this I will need to calculate the lower quartile, median and upper quartile. I would expect the general position of the box plot to move lower down the scale, i.e. towards 0 and away from 100.
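The summary values needed for either approach could be calculated along the following lines. This is a rough sketch only: the percentages are invented for illustration and are not taken from the data set, the function name is an assumption, and Python's default quartile method may give slightly different quartiles from a hand calculation.

```python
import statistics


def summarise(values):
    """Return the mean, standard deviation and quartiles for one age group."""
    # quantiles(..., n=4) returns the lower quartile, median and upper quartile.
    lower_q, median, upper_q = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.pstdev(values),
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
    }


# Invented percentages of original price, keyed by age in years.
groups = {
    1: [80, 75, 70, 68, 72],
    2: [62, 58, 66, 55, 60],
    3: [50, 47, 52, 44, 49],
}

for age, values in sorted(groups.items()):
    summary = summarise(values)
    print(age, summary["lower_quartile"], summary["median"], summary["upper_quartile"])
```

If the hypothesis holds, the median and quartiles printed for each age should fall as the age rises.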
Hypothesis 3
Colour will have no effect on the price of a car.
I think this because although people would have preferred colours, which they might choose if buying a new car, when they buy a second hand car they will not have this option, as there is only one car the seller will have of that type. There is a chance that somebody might pay slightly more if they liked the colour, but as many people have different favourite colours this will probably have no effect.
One possible anomaly will be for green cars which some people regard as unlucky. They may therefore have a lower % of original price.
To test this hypothesis I will sort the data according to colour and exclude any cars which have very unusual colours. I will also exclude any cars for which there are relevant values missing or corrupt (i.e. if there are missing values in a column which I am not examining I will not exclude the car). I will also exclude a car if its original price is over £25000 as I will then regard it as a luxury car.
I will then create a box plot for the most common colour to see whether the data is skewed.
If it is not I will calculate the mean % of original price for each data set and the standard deviation and compare the values. I would expect the mean to be fairly constant. I will plot the means on a bar chart.
If the data is skewed then I will make box plots for each colour. To do this I will need to calculate the lower quartile, median and upper quartile. I would expect the median not to move much in relation to the lower quartile.
Collect, Process and Represent Data
Hypothesis 1
The cars I have excluded are 13, 54, 56, 71, 72, 73 and 95 as they all have original prices of over £25000. I have also excluded number 29 because it does not have an original price. This is probably because the car is so old (15 years old – the oldest car) that the value for original price could not be found out. I have also not shown numbers 69, 74 and 77 on the graph as they do not fit my scale. I will examine them individually after the graph to see if they fit the pattern.
As can be seen from the graph there is a positive correlation, albeit quite a weak one and with several anomalies, mostly in cars which are particularly expensive. This could be because when people buy a nicer model of car they will want to buy it new rather than second hand. The three cars above the £20000 mark included on this graph are all Rovers. The peculiarity of these results could also be due to something about people who buy Rovers.
Due to the weak correlation I am unable to draw an accurate line of best fit.
The three cars which I left off the graph are 69 (Original Price £19530, Second Hand Price £14999), 74 (Original Price £14950, Second Hand Price £13500) and 77 (Original Price £17915, Second Hand Price £11750). Of these cars, 69 and 77 seem to fit the pattern and show depreciation in line with what would be expected. 74, however, shows surprisingly low depreciation. This is an anomaly, although the car is only one year old and has a low mileage, which could explain why it has not depreciated much.
Hypothesis 2
Box plots of percentage of original price for Age 1, Age 2 and Age 3.
From these three box plots it can be seen that the data is skewed. I will therefore use box plots, rather than means and standard deviations, to compare the age groups.
From this chart it is obvious that age does have the effect which I predicted in my hypothesis.
As the correlation between age and percentage of original price seems strong, I am going to try to draw a line which I can use to calculate an expected percentage of original price for each year of age. If the line is curved I will approximate it with straight lines in order to work this out.
The gradient of the first line is (y2 – y1) ÷ (x2 – x1) = (56 – 68) ÷ (1.667 – 0.667)
= -12 ÷ 1
= -12
Its y-intercept is 75.5
Therefore its equation is y = -12x + 75.5
The gradient of the second line is (y4 – y3) ÷ (x4 – x3) = (24 – 48) ÷ (6.667 – 2.667)
= -24 ÷ 4
= -6
Its y-intercept is 64
Therefore its equation is y = -6x + 64
The gradient of the third line is (y6 – y5) ÷ (x6 – x5) = (16 – 21) ÷ (10.667 – 7.667)
= -5 ÷ 3
= -1.667
Its y-intercept is 34
Therefore its equation is y = -1.667x + 34
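The gradient and y-intercept working can be checked with a short Python sketch. The function name is an assumption made for illustration, and the example uses the two points quoted for the line through (2.667, 48) and (6.667, 24). An intercept calculated this way may differ slightly from one read off a graph.

```python
def line_through(point_a, point_b):
    """Return (gradient, y_intercept) of the straight line through two points."""
    x1, y1 = point_a
    x2, y2 = point_b
    gradient = (y2 - y1) / (x2 - x1)
    # Rearranging y = mx + c gives c = y - mx at either point.
    intercept = y1 - gradient * x1
    return gradient, intercept


gradient, intercept = line_through((2.667, 48), (6.667, 24))
print(round(gradient, 3), round(intercept))
```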
These equations can be used as follows to find the percentage of original price that a second hand car would be expected to fetch at a given age. Let a represent age.
If I want to negate the effect of age on a car I can remove the depreciation it causes from my data. To do this I will divide the actual percentage of original price of the car by the expected percentage of original price of the car. This will then give me a value which shows how the car has lost value in relation to what I would expect. It also means I can analyse other factors without them being distorted by age.
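As a rough illustration of this division, the sketch below works out an expected percentage from a straight line equation and then divides the actual percentage by it. The function names are assumptions. The example also assumes that a four year old car falls on the line y = -6x + 64, and the actual percentage of 50 is invented.

```python
def expected_percentage(age, gradient, intercept):
    """Expected % of original price at a given age, from y = gradient * age + intercept."""
    return gradient * age + intercept


def discount_age(actual_percentage, expected):
    """Divide the actual % of original price by the expected % for the car's age."""
    return actual_percentage / expected


# Assumed example: a four year old car on the line y = -6x + 64.
expected = expected_percentage(4, -6, 64)
print(expected, discount_age(50, expected))
```

A result above 1 would mean the car has kept more of its value than expected for its age, and a result below 1 would mean it has kept less.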
Therefore to find the % of original price of a car discounting the effect of age I use this table. Let a represent age and o represent the actual percentage of original price.
I can use this formula to work out percentage of Original Price Discounting Age for all of the cars.