Source:
Method
The data I had at first was only a sample of the used cars population in the country, taken from recent adverts and reputable guides to the motor trade. For the pilot study, I deleted the unnecessary factors, and left the one’s which I was to work upon. For my sample I used a =INT(RAND()*204)+2 formula.
I clicked on the cell again and moved the cursor to the bottom right of the cell until it changed to a black cross. I dragged down until I reached the bottom of the data.
Here is the data with the random sample numbers on the side.
Selected data:
As you can see I have selected a random sample from my database, I chose the first 30 numbers that my random sample formula provided me and notified and replaced and errors which it had chosen.
Here is the original random sample of numbers and their data that I had (In ascending order):
Random Numbers
Random Data
As you can see I have notified data which the random sample had repeated. Moreover I have also highlighted a very significant error which may affect my results. This field had an empty cell, with no details of the price used. This is the problem with secondary data, some mistakes may occur with the data and they need to be ignored. Here are the details of car No. 15.
As mistakes can not be included in the sample, because it will cause an error and an anomaly in my graph, I decided to get rid of the car numbers that had errors. I decided to replace them through choosing random numbers myself, by using the calculator.
The scientific calculator has a random number generation capability, Ran# which can be used to generate the random numbers. The command generates a random number larger than zero and less than one. Eventually, the random numbers produced are spread evenly over the whole interval from zero to one.
The following calculator command will be used to generate the random numbers I need for my sample:
- Enter 204Ran# to generate a random number between 0 and 204
After the Ran# command is executed with key, a new random number will be produced with every press of the key.
The numbers chosen were:
30.992 → 31
18.88 → 19
116.048 → 116
Now I consider my collected and sampled data as being mainly reliable and not biased, due to the sampling method I chose and the errors which I became aware of and removed.
Final random samples of numbers and data:
Analysing and Calculating
To analyse my data I will use a “XY scatter graph” as I will easily be able to analyse them and find out what type of correlation there is and whether it is a strong, weak or even no correlation. A line of best fit will allow further analysis to make the study more efficient.
“A line of best fit should only be drawn on a scatter graph if the correlation coefficient is > 0.6 or -0.6 <”
Correlation Coefficient: A more accurate method to compare correlation is through this method. It uses the mean of each set of data and looks at the distance away from the mean of each point. The formula, which is known as the Product Moment Correlation or r is:
(Where and are the means of x and y values respectively)
The value of r determines the correlation; moreover the value is always between -1 and 1.
Here is a table which will guide me in not resulting to a bias conclusion:
-1 = Perfect Negative Correlation 1 = Perfect Positive Correlation
-0.8 = Good Negative Correlation 0.8 = Good Positive Correlation
-0.5 = Some Negative Correlation 0.5 = Some Positive Correlation
0 = No Correlation
The standard deviation of the data will also need to be calculated. The standard deviation specifies how spread out data is; so the higher the standard deviation the more spread out the data is. The standard deviation, s, of a set of data is given by the formula:
s = Σ(X-x) 2
n
(Where x is the mean)
Below there is a scatter graph showing the results that I got with the data I have chosen above.
The line of best fit shows a weak negative correlation between the data which shows that as the mileage of a car decreases the value of the car increases. This is shown in this section of the graph. (I have drawn an arrow to show the slope and where it occurs).
This proves that my theory/assumption was slightly correct; since there are anomalous results in my graph. (Here are examples which I have circled)
Anomalous
This may be due to the fact; I did not separate the makes of the car, I used all of them in the sample, allowing every make to get an equal chance. However if the factor of make was to be left aside, I believe that my prediction would be entirely correct. As it is a weak correlation one would disagree that the mileage of a car does not affect the value of a car a lot, and has to have other factors included to prove it right. As a result I will develop this result in order to investigate whether I need to control more variables.
I had to work out the correlation coefficient of the data so that I could find out whether my correlation was negative or positive and weak or strong. To successfully work out the correlation coefficient I had to organise my information into a table;
Where ‘Y’ is the mean of ‘y’ and ‘X’ is the mean of ‘x’
I used the following formula to work out the correlation coefficient;
r = -498037950
1022834768 x 21813883000
The correlation coefficient (product moment correlation) of the data I had chosen is
-0.105 (to 3 d.p) According to the determination of a correlation, (scale between -1 and 1), it implies that it is mostly negative correlation. This confirms my hypothesis to be correct as the negative and weak correlation shows that the higher the mileage the lower the car value. However there are a few outliers, which are probably there because some of the cars may have been beyond the price range of others, and maybe prestigious.
The standard deviation of the user car price is;
s = Σ(X-x) 2
n
s = 1022834768 ÷ 30
s = 5839.048918
s = 5839.05 (2.d.p)
I also worked out the standard deviation for the mileage; here is what I did to get the standard deviation;
s = Σ(X-x) 2
n
s = 21813883000 ÷ 30
s = 26965.33763
s = 26965.34 (2.d.p)
Since my data seemed to be of a weak correlation, due to the conclusion given by the standard deviation, I have decided to expand my variable from mileage and price used, to age and price used. I assume that the higher the age of the car, the lower the price.
I have adjusted my data and have added the ages of all the chosen cars. Here is the data that I am going to use:
Below there is a scatter graph showing the results that I got with the data I have shown above.
The line of best fit shows a medium negative correlation between the data which shows that as the age of a car increases the value of the car decreases. This is shown in this section of the graph.
This proves that my theory/assumption was correct; moreover the anomalies may have affected the results in my graph.
I have worked out the correlation coefficient;
r = -0.555694
r2 = -0.3087958216
= -30.87958216%
The correlation coefficient (product moment correlation) of the data for age and price used is -0.555. So according to the scale, this indicates my correlation as being a medium and weak correlation.
I have expanded the working out of the correlation by using Spearman’s rank. Using this method, I am going to rank the price used and the age in an ascending order to its specific ranking. From here I will take the rankings away from each other and square them. Subsequent to this, I will add all of the squared numbers and then multiply by 6, to give me the value of the numerator. To work out the denominator I am going to square the sample size and then take 1 away. Following this I will multiply by 30, since that is my sample size, which will give me the value of the denominator. I will then divide the numerator by the denominator and then take the answer away from 1. This will give me my final value of the correlation.
Spearman’s Rank Formula
42783
1 –
26970
1 – 1.586318
= 0.630391
Spearman’s rank formula has allowed me to validate my coefficient correlation, since it reflects the same answer. As the age increases the price of the car depreciates. This scatter graph has successfully allowed me to develop my hypothesis, so for my main study, I will have to work out the average price depreciation between a Mercedes and Ford cars.
Main Study
Hypothesis
The make of a car is one of the most important factors when considering the depreciation a used car. As I have previously stated, standard cars are to a great extent affected by other factors, though premium or prestigious cars are practically not affected by some factors. So therefore the higher rank a used car is the higher the price. Moreover the higher the mileage of a prestigious car, the higher the price is, but only compared to a less prestigious car. I assume that the difference between models of cars will not take a major role in the depreciation of a used car. In addition I feel that the average depreciation of a car will be at a less rate with prestigious cars.
Aim
What influences the price of a second hand car? I will investigate the following statement: Do Mercedes depreciate faster than Fords? I will carry out the investigation by seeing which car (out of Mercedes and fords) depreciates the fastest, by comparing numerous factors that affect deprecation of cars. I am to collate data of Mercedes and Fords as evidence and statistics to show the rate of depreciation of a number of cars per yeah. I intend to investigate the factors that affect the price of a second hand car. During this investigation I am to focus on the age and the average percentage depreciation of the car, as I believe age is the main factor to car depreciation.
I will use cluster sampling, in order to obtain my sample from the two car makes. Here is a table showing how many cars, I will be using;
For the Mercedes and Ford comparison;
Data - Ford
Data - Mercedes
Analysing and Calculating
Below is a XY scatter graph that allows me to view the correlation between price and mileages of Ford cars, so that I can later, compare this with the price and mileage of Mercedes on a line graph.
The correlation co-efficient of -0.294743, according to the correlation scale, shows very little negative correlation. This means that the higher the price, the lower the mileage; however my chart shows anomalous results, which I will have to remove for further investigation.
As you can see there are anomalous results in my graph, this is because models play apart in the depreciation of the price of a car.
The correlation co-efficient of -0.27397, according to the correlation scale, again shows very little negative correlation for the Mercedes cars. This means that the higher the price, the lower the mileage; however my chart shows a collection of anomalous results together, which I will have to remove for further investigation.
The graph is very spread out, due to the factor of models of the car. As you can see I have pointed out a cluster of anomalous results that have been produced due to this factor.
I have removed the anomalous data from my sample, to judge whether data was anomalous or not, I had to examine my data closely and delete any data that seemed be too high or too low for the average data.
I had to make sure that the number of cars I used for both makes were equal, so that I do not get a biased result at the end. Eventually I ended up deleting about 50% of the original data for each car, leaving me with the following:
Here is the data that I will continue to explore with and provide spearman’s rank with:
Ford Data
Mercedes Data
I decided to work out spearman’s rank before I produced my scatter graphs. I will rank the age and the mileage of both sets of data. I did this so that I could find out whether my data was suitable enough to be developed to scatter graphs. I could then find the depreciation of a group of cars which I will group according to their ranking or if necessary the make of the car.
Spearman’s Rank Formula
Mercedes – Spearman’s Rank
Total Sum = 3943.75
Method – Spearman’s
6 x 3943.75 = 23622.5
19 squared = 361
19 x 361 = 6859
23622.5
1 –
6859
1 – 3.449847
Rank Correlation = -2.44985
A value of -2.44985 means there is slightly great strong negative correlation. Spearman’s rank formula will now allow me to produce successful scatter graphs, that will turn out to become a more reliable and accurate outcome.
Ford – Spearman’s Rank
Total Sum = 2528.75
Method – Spearman’s
6 x 2528.75 = 15172.5
19 squared = 361
19 x 361 = 6859
15172.5
1 –
6859
1 – 2.2120572
Rank Correlation = -1.2120572
A value of -1.2120572 means there is near perfect strong negative correlation. This was lower than that of Mercedes, the reason to this maybe be because of the prestigious-ness of the Mercedes make. This can also be proven by looking at the highest ranking on mileage and age for both Mercedes and Ford. For the Mercedes make the highest mileage was 32 and for the age it was 34. Where as for Ford, the highest mileage ranking was 26, and for the age it was 31.5. Clearly, this illustrates a vast difference between these two makes.
Below are the scatter graphs which I generated from the sets of data, showing the relationship between the price when old and the mileage with both, Mercedes and Ford.
It is evident from the above graph that the data I had collected and had then changed, due to the anomalous results, is now proving my hypothesis even more accurate, with less or no outliers, making my data now more reliable. This is because the line of best fit passes through most of the data and there are more or less equal points on both sides of the line. Moreover a correlation co-efficient of -0.169919 shows a little negative correlation.
I had found out that the results of Mercedes cars do not change much due to the car being prestigious. As you can see from the graph above there are more cars which have higher mileage, which still cost more than the higher mileage cars of Ford or even the lesser mileage ones. I have shown a closer comparison of this below.
Represents the points being compared.
Ford Mercedes
As you can see I have selected a Ford car which has a lesser mileage than the Mercedes car. The mileage of the Mercedes car is nearly double the mileage of the Ford. As you can markedly see that the price of the Mercedes car is still much higher than the Ford. Here is another example, but this time I selected two points where the prices of the cars are near the same.
Fords
Mercedes
I have selected prices at about 10000, and clearly you can see that the mileage of the Mercedes is nearly 4 times as more as Fords, and the price is still slightly more expensive than ford.
From this we can see how much the make affects the price of a car. Here is the correlation co-efficient -0.293468051, this shows that there is little negative correlation, however I can conclude to say that my data is now slightly more reliable since there are no anomalies and my correlation is slightly higher from that which I got before.
Now I will work out the standard deviation, upper and lower quartile, and inter-quartile range. From this I will be able to find out how spread my data is. The higher the stand deviation the more spread out the data is. I will draw box plots to represent this data.
Mercedes
Price
Mean old price: 17523.68421
STD DEV old price: 5605.580514
Lower quartile price: Mean – STD DEV
= 17523.68421 – 5605.580514 = 11918.1
Upper quartile price: Mean + STD DEV
= 17523.68421 + 5605.580514 = 23129.26
IQR: Upper quartile – lower price
= 23129.26 – 11918.1 = 11211.16
Min Val = 10075
Max Val = 30000
Mileage
Mean mileage: 39342
STD DEV mileage: 28598
LQ Mileage: 10744.29
UQ Mileage: 67939.92
IQR: 57195.62
Min Val = 11000
Max Val = 90000
The above results show that the mileage is more spread out than the price, since the standard deviation of the mileage is way higher than that of the price. I believe that the reason for this is, since Mercedes cars are less likely to depreciate in price, even if the mileage is quite high, and then the prices will more or less stay the same or differ slightly. As a result, we have a standard deviation of 5605.580514 for price, and 28598 for mileage, this goes to prove my hypothesis, correct since I assumed that even if the mileage is high for Mercedes the prices will still be fairly the same and there is less depreciation. When I work out the standard deviation for Fords, I think that the standard deviation of the price and mileage would reflect each other equally.
Ford
Price
Mean old price: 5310.789474
STD DEV old price: 2278.659001
Lower quartile price: 3032.1305
Upper quartile price: 7589.4485
IQR: 4557.318
Min Val = 2100
Max Val = 9715
Mileage
Mean mileage: 11157.89
STD DEV mileage: 2823.771
LQ Mileage: 8334.1235
UQ Mileage: 13981.666
IQR: 5647.5425
Min Val = 10000
Max Val = 22000
A previous hypothesis that I had just made has now been proven correctly: ‘When I work out the standard deviation for Fords, I think that the standard deviation of the price and mileage would reflect each other equally.’ This can clearly be seen, since the standard deviation of the price is 2278.659001 and of the mileage it is 2823.771, and as you can see, it is relevantly equal. To further prove my hypothesis, I took away the standard deviation of the price from the mileage for Ford, and then did the same with Mercedes and compared them.
Ford: 2823.771 – 2278.659001 = 545.11227
Mercedes: 28598 - 5605.580514 = 22992.23
From this it is now clear that the data of Mercedes is more spread out, since the cars do not depreciate much, and the data of Ford is less spread, according to my answer, therefore this proves that Ford cars are less likely to be sold at higher prices, with a high mileage.
On the next sheet I have included box plots, to represent the data above and also to show the contrast of the two makes.
New Price/Old Price Factor Comparison
The graph above indicates that there are a higher majority of higher second hand priced Mercedes cars rather than Ford cars. The graph also shows that Ford cars can clearly not be similar to Mercedes cars, with any factor that you plan to investigate, I can tell this because of the difference between the first point of the Mercedes Car and the Ford. Also the difference between the last point of the Mercedes car and the Ford too.
First two points:.
The first two points shown are very far apart, and this carries on increasing invariably as you move higher up the graph. As you reach to the end of the graph there is a huge difference between the two points, which clearly specifies that the price of a second hand car of Mercedes will always exceed that of the Ford, even if other factors were to be involved.
The reason the two makes can not be similar in any way is because of one primary/main factor along with many others, which have a slight effect on the depreciation. The main factor being:
- The Mercedes make is considered to be a very prestigious make than Ford.
This is followed with the other secondary factors e.g.:
- How many doors the car has
- The service history of the cars
- The style of the car (e.g. saloon, convertible etc)
- The number of seats
- How many owners
This graph indicates that the prices of new Ford cars differ hugely from Mercedes. However the difference is less than the other line graph to when you compare the new prices. This proves that Ford cars depreciate faster than Mercedes cars, since before their difference was much higher than it is now. On my graph there is a section where tow points look as if they meet; and there is also a point on the graph where one of the Mercedes cars, exceed the Ford greatly. This may be because of the model of the Mercedes car. This now indicates the importance of two factors: make and model, when comparing and contrasting two sets of data or more.
Price depreciation
The percentage price will be calculated using this formula:
Price when new
Percentage Price = x 100
Price when old
From the previous results I had got, I have now come upon a new factor to investigate, which is the model of car. I will separate the data into their car models, and then find the price depreciation using the formula above. However I may not be able to do this to all the models, since I will need equal amount of cars for each model. I am considering on picking 3 cars for each model, and if a specific model does not contain 3 cars, then I will eliminate it from my investigation. This is because, if I included it, the data will not be reliable or accurate and I will have a biased outcome.
Mercedes
Depreciation = 174.1595697
Depreciation = 201.4127082
Ford
Depreciation = 282.15233
Depreciation = 294.93126
As you can see, there isn’t a major difference between the two Mercedes car I compared, The same applies to ford. We can stop the investigation of models from here and conclude to say that the model of a car, is not a variable that is strong enough to point out differences or similarities between two sets of data. Therefore model is a weak factor to continue investigating.
Since model wasn’t a good factor, I am now left with make, which I have already proved is a very strong factor that affects the depreciation, but now I am going to try and find the average depreciation of the 19 Mercedes cars and the 19 Fords. I will then compare them to see if there is a very big difference.
Here is the formula I used to find this out:
=(SUM(D2:D20)/SUM(E2:E20))*100
‘D’ being a column (new price) of data, and ‘E’ being the other column (old price).
Mercedes
Ford
From the end result which I had received above, I can without a doubt say that the most contributing factor in my investigation was make, because the depreciation of Mercedes was 227.9219102 where as Ford was 290.71503, there is a great difference between them, which proves my hypothesis to be correct.
Conclusion
In my hypothesis I felt that the age and mileage will be a contributing factor to the price, and I proved this in my results. As the mileage increased so did the age. Moreover as I developed, more hypotheses were introduced, due to the factors that I had to investigate. The factor that was the most effective to the price was ‘make’ according to my outcomes.
I received a good correlation between the age and the mileage using the correlation coefficient and the graph that suggests their relationship. From my graphs it was conclusive that as the age went up so did the mileage to a certain extent I also plotted graphs for the mileage and price, for which it was clear that as the price decrease so does the mileage. Furthermore I feel that the engine size doesn’t affect the depreciation more than the price and age. From the depreciation I observed that the rate of depreciation is less with a prestigious car. In conclusion I felt that the most contributing factors was ‘make’ because it played a huge part in my data and showed a close relationship throughout.
Evaluation
To evaluate, I believe that the data I used was only a portion of the population that exists. Moreover it was also secondary data, which had negative and positive aspects about it. I think that if I had more time I would have researched for primary data to supplement my secondary data. This will make my results even more reliable then they are. Also increasing the amount of cars for my main study to about 100-150 cars to make my results more accurate. Also including other makes of cars, such as Renault, Jaguar, etc. I would have to make sure that I had equal numbers of each make for my investigation so that it does not become biased.
The thing I found easy was collecting the data, since it was already provided from Edexcel. I found it difficult to attempt in collecting primary data, since it is very time consuming. Another difficulty aspect of the study was sorting out the data, and choosing a sample, without having biased results, this lead on to analysing the results, which was quite hard, since the formulas/methods was complicated to use for some data.
Next time, I believe that I should consider the factor of ‘colour’ because:
Colour: Colours differ in Middle Eastern countries. If the colour of a car is ‘white’ it is known to be quite costly. However in European countries white is known to be rather cheap. To sum up I believe rich colour coated cars are more expensive.
This would be a great thing to investigate, since it is interesting and I would like to see the results that I obtain. Overall I feel that I have proven my hypotheses to be correct, and I am not surprised about with result I obtained.