This secondary data, gathered by unknown sources, will help me in my investigation to test my hypotheses. However, it is impossible to tell when it was collected, so it may be unreliable and could yield false results. I shall now explain the different methods of sampling and say which I will use in my investigation.
Sampling: The process of selecting a smaller set of items from a large set of data, so that the selection can stand in for the whole. This has its advantages: if I have a lot of data and want to find a general trend, I can condense it down to a smaller set of data, or “sample”, which is intended to be representative of the original large batch of data. This makes it easier to work with.
However, a disadvantage is that the sample may be skewed and not fit the original pattern, so it will give inaccurate results because it is not representative of the original data.
There are many forms of sampling, though three basic methods will be shown here; I will then apply my chosen method later on in the coursework to my data to test my hypotheses.
Random, Stratified and Systematic sampling are the three that I will define here:
Random: all data have an equal chance of being chosen; there is no system in choosing them.
Stratified: each datum is put into a group, and the number selected from each group is proportional to that group’s share of the whole data set. I will not be choosing this, as there are too many variables in the spreadsheet to set up this kind of sampling successfully.
Systematic: taking data in an ordered way, for example every third or fourth value. However, if the data has already been ordered, this may not work and may yield inaccurate results.
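The three definitions above can be sketched in Python as follows. This is only an illustration: the function names, the `(make, price)` records and the 50% fraction are assumptions made for the example, not values from the used cars spreadsheet.

```python
import random

def random_sample(rows, k, seed=None):
    """Simple random sample: every row has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def systematic_sample(rows, step, start=0):
    """Systematic sample: take every `step`-th row, beginning at `start`."""
    return rows[start::step]

def stratified_sample(rows, key, fraction, seed=None):
    """Stratified sample: draw the same fraction from every group."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)  # group's proportional share
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical records: (make, price in pounds)
cars = [("Ford", 5000), ("Ford", 4200), ("Ford", 6100), ("Ford", 3900),
        ("Fiat", 3000), ("Fiat", 2800)]

picked_random = random_sample(cars, 3, seed=1)
picked_systematic = systematic_sample(cars, 3)
picked_stratified = stratified_sample(cars, key=lambda c: c[0],
                                      fraction=0.5, seed=1)
```

In this sketch the stratified sample always contains two Fords and one Fiat, because each make contributes half of its cars, whereas the random sample could by chance contain no Fiats at all.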
Since there are no “groups” as such in my used cars database, I cannot use stratified sampling in my investigation, and the data provided to me may already be ordered in some way, which rules out systematic sampling. Thus, I have chosen random sampling in my investigation. I reordered the data into a random mix in Excel and then deleted half of it, as each datum has an equal probability of falling into the first, as opposed to the second, half.
Because the data is randomly ordered, the sample should show the general trend without bias. However, I have 4 pieces of missing data; to solve this problem I will remove the rows that contain them. To find out if there are any outliers, I will use the standard deviation to find upper and lower bounds. (The upper quartile is the value below which 75% of the data lie, and the lower quartile is the value below which 25% of the data lie.) The formulae for the bounds in terms of standard deviation are as follows:
Upper Bound = Mean + 2 × Standard Deviation
Lower Bound = Mean – 2 × Standard Deviation
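As a rough sketch of how these two bounds can be used to flag outliers, the Python below applies them to a short list of prices. The prices are made up for illustration and the function names are assumptions; `statistics.stdev` is the sample standard deviation (it divides by one less than the sample size).

```python
import statistics

def outlier_bounds(values):
    """Return (lower, upper) as mean -/+ 2 sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, divides by n - 1
    return mean - 2 * sd, mean + 2 * sd

def find_outliers(values):
    """List every value that falls outside the two bounds."""
    lower, upper = outlier_bounds(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical prices in pounds; the last one is deliberately extreme.
prices = [4000, 4500, 5000, 5200, 4800, 4600, 5100, 4900, 4700, 20000]
lower, upper = outlier_bounds(prices)
outliers = find_outliers(prices)
```

With these made-up prices only the 20000 value lies above the upper bound, so it is the only one flagged.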
One value for the Porsche lies outside the upper bound: it is approximately £6,000 higher than the upper bound, so it is an outlier. However, the effect is not drastic and will not distort my results. If I identify any huge outliers, I will remove them, though this will not have much of an impact. Because of the 4 missing values, I will need to delete the rows for the cars concerned, as a missing value would distort an average. One example is the Lexus: with no mileage, it is impractical for me to include it in my investigation, because it cannot be used for my 3rd hypothesis. After removing all the other cars which lack data, I have a remaining sample of 47.
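Removing the cars with missing values can be sketched as below. The records, field names and the use of `None` for a blank cell are all assumptions made for the example; the real spreadsheet may mark missing values differently.

```python
def drop_incomplete(rows, required):
    """Keep only the rows where every required field has a value."""
    return [row for row in rows
            if all(row.get(field) is not None for field in required)]

# Hypothetical records; None stands for a blank cell.
cars = [
    {"make": "Lexus", "price": 9000, "mileage": None},
    {"make": "Ford", "price": 4500, "mileage": 42000},
    {"make": "Fiat", "price": None, "mileage": 30000},
    {"make": "Rover", "price": 3800, "mileage": 51000},
]

complete = drop_incomplete(cars, ["price", "mileage"])
kept_makes = [car["make"] for car in complete]
```

Here the Lexus is dropped because it has no mileage, and the Fiat because it has no price, leaving only the rows that can be used for every hypothesis.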
I have constructed a table to show the range of data and to see how the data compare with each other.
From this table it appears that the ranges are very large; however, this does not mean that my graphs will necessarily have weak correlations. That depends on how many data points lie more than 2 standard deviations from the mean, and the graphs will show whether my hypotheses are true.
The standard deviation is a measure of how spread out my data are. To calculate it, I first compute the mean of my set of data and then subtract the mean from each value in the set. Next, I square each individual deviation and add all the squared deviations together. I then divide by one less than the sample size and take the square root of the result.
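The steps just described can be written out one line per step. This is a minimal sketch; the function name and the list of owner counts are assumptions for illustration only.

```python
import math

def sample_standard_deviation(values):
    """Sample standard deviation, following the steps one at a time."""
    n = len(values)
    mean = sum(values) / n                    # 1. compute the mean
    deviations = [v - mean for v in values]   # 2. subtract the mean from each value
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. add the squared deviations
    variance = total / (n - 1)                # 5. divide by one less than the sample size
    return math.sqrt(variance)                # 6. take the square root

# Hypothetical numbers of previous owners
owners = [1, 2, 2, 3, 1, 2, 1]
spread = sample_standard_deviation(owners)
```

Dividing by one less than the sample size, rather than by the sample size itself, is what makes this the sample (not population) standard deviation.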
I will try to demonstrate this using the Porsche example, for number of owners:
Sum of values = 77
Mean = 77/46 = 1.67
Sum of squared deviations = 17.4
17.4/(46 – 1) = 17.4/45 = 0.38666… = x
Square root of x = 0.622
This is the standard deviation: the mean + 2 × the standard deviation is the upper bound, and the mean – 2 × the standard deviation is the lower bound.
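The arithmetic in this worked example can be checked with a few lines of Python. The three input figures (77, 46 and 17.4) are taken from the working above; the variable names are assumptions.

```python
import math

total_owners = 77        # sum of the values, from the working above
n = 46                   # number of cars in the worked example
sum_squared_dev = 17.4   # sum of squared deviations, from the working above

mean = total_owners / n
sd = math.sqrt(sum_squared_dev / (n - 1))
lower_bound = mean - 2 * sd
upper_bound = mean + 2 * sd

print(round(mean, 2), round(sd, 3))
print(round(lower_bound, 2), round(upper_bound, 2))
```

On these figures the bounds come out at roughly 0.43 and 2.92 owners.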
I am now going to produce a graph to try to spot a trend and test my hypotheses. The first graph is for my 1st hypothesis: percentage depreciation vs. mileage. I expect to see a positive correlation, with the percentage depreciation increasing the more miles the car has done. If this is what the graph shows, then my 1st hypothesis will be correct. As my data is quite grouped, I expect my points to be close to the line of best fit.
Analysis of Data
I shall now produce graphs to see if I can find any relevant trends. These will mostly be scatter graphs, one for each of my hypotheses, to see how closely the data follow the line of best fit. The following graph is percentage depreciation vs. mileage. If my theory is correct, I will see a positive correlation, with the percentage depreciation increasing as the mileage the car has done increases. If this is right, my first hypothesis will be correct, and I will proceed to analyze the other graphs for my hypotheses. My data is quite grouped, so I expect to see some form of positive correlation.
My hypothesis is correct: my graph has a positive correlation. The gradient, 0.0006 (6 x 10^-4), means that for every additional mile the percentage depreciation goes up by 0.0006%. According to my data, the line crosses the y axis at 35.664%, meaning that as soon as a car is bought, before any mileage has been added, the price depreciates by approximately 36%. I think this is the case because as soon as a car is bought, its status changes to “second hand” and its value goes down sharply. A good correlation is displayed here: R² = 0.5424, and I am treating any R² greater than 0.3 as a strong correlation. In theory, a perfect correlation would give R² = 1, but the likelihood of this happening with real data is extremely small. The R² for this graph describes how much of the variation in percentage depreciation can be explained by mileage. Therefore, 54% of the variation can be explained by mileage, a major influential factor in the depreciation of a car’s price.
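The gradient, intercept and R² quoted for this graph are the kind of figures an ordinary least-squares line of best fit produces. The sketch below shows that calculation in Python; it is an assumption that the spreadsheet trendline was computed in this way, and the mileage and depreciation values are made up for illustration, so they will not reproduce the figures above.

```python
def fit_line(xs, ys):
    """Least-squares line of best fit: returns (gradient, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

def r_squared(xs, ys):
    """Share of the variation in ys that the fitted line explains."""
    gradient, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (gradient * x + intercept)) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: mileage and percentage depreciation
mileage = [10000, 25000, 40000, 60000, 80000]
depreciation = [40, 52, 58, 75, 82]

gradient, intercept = fit_line(mileage, depreciation)
fit_quality = r_squared(mileage, depreciation)
```

An R² of 1 would mean every point lies exactly on the line, and a value near 0 would mean the line explains almost none of the variation.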
A second graph will now be plotted to see if my second hypothesis is correct. If it is, there should be a positive correlation between age and percentage depreciation, showing that the car’s value depreciates as years are added. It should also depreciate as soon as it is bought.
According to the data, my reasoning was correct. As before, my graph has a strong positive correlation. Much like the previous graph, this shows that the older a car is, the greater the percentage depreciation. The gradient is 6.2048: for each extra year, the percentage depreciation goes up by a further 6.2048%. The line crosses the y axis at 29.716, so the car’s value depreciates by approximately 30% when first bought, much like in my first hypothesis, which now cannot be refuted.
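To show how this line of best fit can be read, the sketch below turns the quoted gradient (6.2048) and intercept (29.716) into a small Python function. The function name is an assumption, and the straight line is only a model fitted to this sample.

```python
GRADIENT = 6.2048    # percentage depreciation added per year (from the graph)
INTERCEPT = 29.716   # percentage depreciation at age zero (from the graph)

def depreciation_from_age(years):
    """Predicted percentage depreciation for a car of the given age."""
    return GRADIENT * years + INTERCEPT

at_purchase = depreciation_from_age(0)
after_five_years = depreciation_from_age(5)

# Age at which the straight line would reach 100% depreciation
years_to_full = (100 - INTERCEPT) / GRADIENT
```

On these figures a five-year-old car is predicted to have lost about 61% of its value, and the line reaches 100% at a little over 11 years, which suggests the straight-line model should only be trusted within the range of ages in the sample.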
My final graph will now be plotted to spot a general trend. I hypothesized that the more previous owners a car has had, the greater its mileage will be. Much like the last two graphs, this should appear as a positive correlation.
The last hypothesis is shown to be correct in this graph, though not as strongly as I would have liked: it has a weak correlation of R² = 0.2645. As the gradient is 16,257, every additional owner adds approximately 16,300 miles to the car. The line of best fit intercepts the y axis at 19,310. This does not make sense: a car with no previous owners has not been driven, so a reading of 19,310 miles is invalid.
The weak correlation is also down to other factors. If people have bought a car brand new, they may want to get the best out of it and do as many miles as possible; when a car is bought first hand, owners will do as many miles as they see fit. Others may replace a car before adding much mileage to it. Either way, all the mileage done is passed on to the next owner. An owner may also have bought a car to suit their financial circumstances (e.g. low income). If those circumstances change abruptly, they may consider selling the car for a less or more expensive model, and it will pass to people in circumstances similar to the owner’s original ones. As the vehicle is passed on, essential parts will become less reliable. For this reason, people may drive the car less in order to keep down repairs, or to keep up with the demands of the MOT etc., and may then sell it on again. The line on the graph would therefore rise steeply at first and then increase more slowly, rather than follow a straight line.
Conclusion:
The graphs are now finished, and I will summarize each aspect of my investigation and what I would do with better data or different hypotheses. All of my theories were correct, although the 3rd hypothesis had a much weaker correlation than the 1st or 2nd, so it has not been proven entirely. As the first 2 hypotheses had good correlation, these cannot be refuted, and are therefore proven, according to the data that I have been given. If I had more time to make my investigation more reliable, I would probably use 60%, rather than 50%, of the data to get a bigger sample and therefore more reliable results. My data was also not evenly spread: while I randomly ordered it, my filtering of half the data may still be flawed. I would also use a larger total population, to get more variety.
Some new questions would probably be investigated if I had the chance to do it again. For example, the third scatter graph, of number of previous owners vs. mileage, did not give me strong results: it had a weak correlation and a poor trend. In place of this, I would test a new theory: ‘the older a car, the greater the mileage it will have gained.’ This would be an improvement: instead of comparing the number of owners to the mileage, using the age would allow me to see how the mileage built up over time. Some flaws in the original graph have been spotted: for example, some people may have owned a car for a very short time and sold it shortly after buying it. With age included, I can see how far the car traveled in relation to time, rather than the number of people who drove it. I believe this hypothesis would give me a strong correlation and provide me with more reliable results.
Perhaps if I had more time, I would use multiple regression to see how the different influential factors affect each other, rather than only how each affects depreciation.