Maths Statistics Coursework

I have been given the task of finding what affects the price of a used car, using a spreadsheet containing data on a hundred cars. The data recorded for each car were: (See Spreadsheet 1)

Make, Model, Price When New
Used Price, Age, Colour
Engine Size, Fuel Type, MPG
Mileage, Service, Owners
Length of MOT, Tax (Months left), Insurance Group
Doors (Amount), Style, Central Locking
Seats, Gearbox, Air Conditioning
Airbags

Immediately from looking at those categories I omitted colour, fuel type, service, doors, style, central locking, seats, gearbox, air conditioning and airbags. I omitted this data because it either has a small range or consists of words. These would be hard to show on graphs and would give me little evidence of what affects a used car price.

E.g. Colour: Cannot produce a scatter graph as it uses words.

Seats: Has a range of only 2-5, so it would produce poor scatter graphs and it would be hard to find a direct relationship from them.

Then from the remaining categories I picked age, insurance group, MPG, mileage and of course used price, as this is what I was investigating. It then dawned on me that I could use the depreciation, that is the price when new minus the used price. This could perhaps give a more accurate look at the data, as some cars depreciate quicker than others. Looking further into it I decided against this, as it would take longer and time was of the essence, but it could be added as an extension at the end.

Reasons Why

Age: Has a large range and it would be interesting to see what sort of relationship there is.
Insurance Group: Again a wide range.
MPG: Has quite a large range, and the data could be grouped and used on a cumulative frequency graph.
Mileage: Huge range and definitely a factor in used price, but it would be interesting to see exactly how much.

Sample

I was given 100 cars but to investigate this would be very time consuming so I would have to bring that number down. In the end I chose to do a 40 car sample as it is a round number, lower than 100 but still big enough to display a fair representation of the data supplied.

Sampling Method

Now that I’ve decided how big my sample needs to be, I have to decide how I will sample. There are two main methods, random and stratified. Eventually I want to try both, but for now I will use a random sample. To do this I will use the random number function on my calculator.

I pressed the random number button and a number with 3 decimal places was displayed. I then took the first 2 digits after the decimal point and used them as the car number. If a number was repeated I ignored it and chose again.

EG.

Random produced number 0.311 so I chose car number 31

Random produced number 0.981 so I chose car number 98
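
As a rough sketch, the calculator method described above could be written in Python. This is only an illustration and not the method actually used: the function names are made up, Python's random module stands in for the calculator button, and it is assumed that a display starting with 00 is simply redrawn because there is no car number 0.

```python
import random


def car_number_from_display(display: str) -> int:
    """Return the first two digits after the decimal point, e.g. '0.311' -> 31."""
    decimals = display.split(".")[1]
    return int(decimals[:2])


def draw_random_sample(sample_size: int, rng: random.Random) -> list[int]:
    """Keep drawing 3 decimal place numbers until sample_size different cars are chosen."""
    if sample_size > 99:
        raise ValueError("two digits can only give car numbers 1 to 99")
    chosen: list[int] = []
    while len(chosen) < sample_size:
        # Stand-in for the calculator's random button: a display from 0.000 to 0.999.
        display = f"0.{rng.randrange(1000):03d}"
        car = car_number_from_display(display)
        # Assumption: 00 is redrawn because there is no car number 0.
        # Repeated car numbers are ignored and drawn again, as in the text.
        if car == 0 or car in chosen:
            continue
        chosen.append(car)
    return sorted(chosen)


sample = draw_random_sample(40, random.Random())
print(sample)
```

Under this reading of the method only cars 1 to 99 can come up, so car 100 would never be chosen.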

Using this sampling method I chose my first group of cars. They ended up being numbers:

1 2 4 5 7 8 15 16 17 18 21 22 24 26 27 31 32 35 37 38 44 51 53 63 65 67 68 70 71 73 76 77 83 86 91 95 96 97 98 98

From these car numbers I made a table with all the data I needed on the cars above, such as used price, MPG and mileage. (See Spreadsheet 2)

From this data I compiled four scatter graphs:

Age against used price
MPG against used price
Mileage against used price
Insurance group against used price

I used scatter graphs as they display relationships between two sets of data, which is why used price is in every one. A scatter graph also lets me draw a line of best fit, which I can use to predict other values.
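
The strength of a relationship on a scatter graph, and the line of best fit drawn through it, can also be put into numbers. The sketch below is one possible way to do this, using the standard correlation coefficient r and a least-squares line. The function name and the ages and prices are invented for illustration and are not taken from the spreadsheet.

```python
from math import sqrt


def correlation_and_best_fit(xs: list[float], ys: list[float]) -> tuple[float, float, float]:
    """Return (r, gradient, intercept) for the least-squares line y = gradient * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return r, gradient, intercept


# Made-up ages (years) and used prices (pounds), NOT taken from the spreadsheet.
ages = [1, 2, 4, 5, 7, 9]
prices = [9500, 8700, 6900, 6400, 4100, 3000]

r, gradient, intercept = correlation_and_best_fit(ages, prices)
print(f"r = {r:.2f}")
print(f"predicted price at age 6: {gradient * 6 + intercept:.0f}")
```

A value of r close to -1 or +1 means a strong correlation, while a value close to 0 means a weak one or none at all.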

Predictions

For age I believe there will be a very strong negative correlation, as the older the car gets the lower the price.
For MPG I believe there will be a weak positive correlation, as the higher the MPG the higher the price, but I don’t believe it affects it that much.
For mileage I believe there will be a very strong negative correlation, as when the mileage increases the price will decrease.
For insurance group I believe there will be a weak negative correlation, as the higher the insurance group the lower the price, but not by much.

As you can see from my predictions, I believe that, of the categories I chose, mileage will affect used price the most while insurance group will affect it the least.

See scatter graphs 1, 2, 3 and 4.

Conclusions of Random Sampling.

As you can see some of my predictions were right while others weren’t.

Age had a big effect on price and had quite a strong negative correlation, as I predicted.
MPG had a very strong negative correlation, showing that it did affect price a lot. I predicted this wrongly, as I expected a weak positive correlation.
Mileage had quite a strong negative correlation, but not a very strong one as I had predicted. It shows mileage affects price, but only to a degree. From the shape of the graph it appears a curved line of best fit would suit it better, but I shall leave it at that.
Insurance group had a positive correlation, and quite a strong one at that, showing that as the insurance group went up so did used price. This was the opposite of the weak negative correlation I predicted.

Observations

As you can see, on all of the graphs there are pieces of data that are way off the lines of best fit and away from the rest of the data. These can be called anomalies, as they differ from the rest of the data. I purposely kept them in as they give me a valid reason to try another sampling method. I could cut them out to make the sample fairer, but then it wouldn’t be a true random sample.

With these observations made I can say a few things about what affects used car prices, but now I shall move on and use a stratified sample to see if the data is more reliable.

Stratified

A stratified sample is one where all the data is put into order, split into groups, and then a share is picked out of each group. For my stratified sample I ordered the cars by mileage, grouped the mileages and picked 40% from each group. This should give me 40 cars again, so I can compare the random and stratified samples evenly.

The mileage groups were:
0-5,000
5,000-10,000
10,000-20,000
20,000-40,000
40,000-70,000
70,000-110,000

With these sorted I took 40% at random from each group. I ensured it was random by drawing numbers out of a hat, each number corresponding to a car. I noted the number drawn and placed it back in, so that each time the chance of drawing any single number was equal and didn’t change. If I drew the same number twice I simply ignored it, placed it back in and redrew. (See Spreadsheet 3)

If you actually count them there are 41 cars. As 40 and 41 are very close, rather than tamper with the results, which could make them biased, I simply left them.
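
The stratified method can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are made up, the mileages are invented rather than taken from the spreadsheet, a mileage that falls exactly on a boundary is assumed to go in the higher group, and `random.sample` stands in for drawing from the hat and redrawing repeats.

```python
import random

# Boundaries of the six mileage groups from the text.
GROUP_EDGES = [0, 5_000, 10_000, 20_000, 40_000, 70_000, 110_000]


def group_of(mileage: int) -> int:
    """Return the index of the mileage group (assumption: a boundary value goes in the higher group)."""
    for i in range(len(GROUP_EDGES) - 1):
        if GROUP_EDGES[i] <= mileage < GROUP_EDGES[i + 1]:
            return i
    raise ValueError(f"mileage {mileage} is outside the groups")


def stratified_sample(mileages: dict[int, int], fraction: float, rng: random.Random) -> list[int]:
    """Pick `fraction` of the cars in each mileage group at random, without repeats."""
    groups: dict[int, list[int]] = {}
    for car, mileage in mileages.items():
        groups.setdefault(group_of(mileage), []).append(car)
    chosen: list[int] = []
    for i in sorted(groups):
        # A group cannot give a fraction of a car, so round to a whole number.
        how_many = round(fraction * len(groups[i]))
        chosen.extend(rng.sample(groups[i], how_many))
    return sorted(chosen)


# Made-up mileages for cars 1 to 100, NOT the real spreadsheet values.
rng = random.Random()
example_mileages = {car: rng.randrange(0, 110_000) for car in range(1, 101)}
picked = stratified_sample(example_mileages, 0.4, rng)
print(len(picked), picked)
```

Because each group is rounded separately, the total can come out slightly above or below 40. That might be how a count of 41 cars arose, although this is only a guess.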

From this data I then compiled scatter graphs just as before.

Predictions

Age: I believe that there will be a strong negative correlation as there was before, but as this is supposedly a more reliable sample it should be more evident.
MPG: I believe there will be a strong negative correlation as there was before, but it should be more evident due to the sample being more reliable.
Mileage: There should be a strong negative correlation, for the reasons above.
Insurance group: There should be a strong positive correlation, for the reasons above.

See graphs 5, 6, 7 and 8.

Conclusions on Stratified Sampling.

As you can see some very strange results came up.

Age showed the very strong negative correlation I said there would be.
MPG showed a strong negative correlation as well, as I said.
Mileage proved very weird. The data was basically in two groups, one showing high mileage and low price and the other low mileage and low price. From this I can deduce that mileage is a limiting factor of used price.
Insurance group showed no correlation, with data all over the place. This shows that perhaps my random sample was a mishap and in fact insurance group has no relationship, or very little, with used price.

Observations

Correlations were generally a lot tighter, showing that stratified sampling reduces anomalous data but can produce strange results, such as for mileage. This result, however, may not be wrong; it may in fact be right and the random results wrong. To find out, I shall become more specific and look at another way of representing the data.

Histograms

After some thought, I decided a great way of comparing two sets of data visually would be a histogram.

To make a histogram I would have to group the mileages. This however was easy, as I shall use the groups I made when stratifying the data.

The mileage groups were:
0-5,000
5,000-10,000
10,000-20,000
20,000-40,000
40,000-70,000
70,000-110,000

I then made a tally chart with the groups and both random and stratified data.

(Tally charts: Random, Stratified)

Then to construct a histogram I would have to work out the frequency density to go on the vertical axis. This is worked out by:

Frequency Density = Frequency / Group Width

So I ended up with this.

(Frequency density tables: Random, Stratified)
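
As a worked illustration of the frequency density calculation, the sketch below divides each group's frequency by its group width. The group boundaries come from the text, but the function name and the frequencies are made up for the example and are not the real tally chart totals.

```python
# The six mileage groups from the text as (lower, upper) bounds in miles.
GROUPS = [(0, 5_000), (5_000, 10_000), (10_000, 20_000),
          (20_000, 40_000), (40_000, 70_000), (70_000, 110_000)]


def frequency_densities(frequencies: list[int]) -> list[float]:
    """Divide each group's frequency by its group width."""
    if len(frequencies) != len(GROUPS):
        raise ValueError("need one frequency per mileage group")
    return [freq / (upper - lower) for freq, (lower, upper) in zip(frequencies, GROUPS)]


# Made-up tally totals for a 40 car sample, NOT the real tally chart.
example_frequencies = [2, 3, 6, 14, 10, 5]
for (lower, upper), density in zip(GROUPS, frequency_densities(example_frequencies)):
    print(f"{lower}-{upper}: {density:.5f} cars per mile")
```

Because the groups have different widths, it is the area of each bar (frequency density times group width), not its height, that gives the frequency.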

Predictions

I predict that the random histogram will have a much more erratic distribution of car mileage, while the stratified distribution will be more of a bell shape, with the majority in the mid range and few or no extreme values.

I then proceeded to draw the graphs.

See Graphs 9, 10 and 11

Results

As seen on the two histograms there are some slight differences. The spread of the random sample is a little more erratic and uneven than the more bell-shaped graph of the stratified data. From this you could deduce that the stratified sample is a more reliable source of data than the random sample.
From the individual graphs you can see that the majority of the cars are in the 20,000 to 60,000 miles range in both the random and stratified samples. Standard deviation could perhaps tell me which sample is more accurate, so that could be an extension to the work done.
I mentioned a bell-shaped graph before. By this I mean one which slowly goes up to a peak and then comes back down, with the majority of the data in the middle and little or no data in the highest and lowest areas.
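
The standard deviation mentioned as a possible extension could be worked out along the following lines. This is a minimal sketch, assuming the population form of standard deviation from Python's statistics module. The function name and the mileages are made up for illustration and are not the real samples.

```python
from statistics import mean, pstdev


def spread_summary(values: list[float]) -> tuple[float, float]:
    """Return (mean, standard deviation), treating the values as the whole data set."""
    return mean(values), pstdev(values)


# Made-up mileages for two small samples, NOT the real samples.
random_sample = [3_000, 12_000, 25_000, 48_000, 95_000, 104_000]
stratified_sample = [8_000, 18_000, 31_000, 36_000, 52_000, 74_000]

for name, values in [("random", random_sample), ("stratified", stratified_sample)]:
    average, deviation = spread_summary(values)
    print(f"{name}: mean {average:.0f} miles, standard deviation {deviation:.0f} miles")
```

A smaller standard deviation means the mileages are bunched more closely around the mean. On its own it would not prove which sample is more accurate, but it would give a number to compare the two spreads with.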

However, from the histograms I did not find any reason for the weird shape and correlation of the stratified mileage scatter graph. Further investigation into this could prove interesting.

Overall Conclusion

From all the work carried out above you can clearly see that many different things affect used car prices, some more than others. You could say that the different categories are limiting factors, and a combination of these results in the depreciation of a car’s price.

As a further investigation I would look into the strange scatter graph produced by my stratified mileage sample. Perhaps using standard deviation or other data representation methods I could find out why it is so peculiar. I could also look at how one category affects another, such as engine size and mileage or engine size and MPG, and find a relationship between those. There are many more aspects that I could have considered, but from the work I’ve done some things are certainly clear.