Data collection and sampling
I will be collecting secondary data, that is, data that has already been collected, and I will get this from the internet. The reason for collecting secondary data is that there is not enough time to collect primary data, and also because the secondary data is reliable as well as up to date. I used several websites. One of these gave the postcodes for specific counties, so I could type a postcode into Rightmove, which has a specific search feature. I entered ‘three bedroom detached house’, which narrows the search down to three bedroom detached houses within that area. I also used Wikipedia to find which counties are in the North and South of England. The advantage of Rightmove was that the specific search meant only the necessary prices came up.
Systematic sampling
I used a systematic sample to determine which counties I would use to find the postcodes. A systematic sample works like this:
Suppose the population size was 200 (this is an example) and you needed a sample size of 50. You would divide 200 / 50 = 4, then start at a random number and use every fourth item in the population.
In my case I numbered the counties in the North and South of England. There were approximately 23 in the South, so I did the following calculation: 30 (the sample size I needed) / 23 (approximate number of counties) ≈ 1.3. This meant I chose 1 postcode from each of my numbered counties, and from every 5th county I took an extra postcode. For the North I did the same calculation but with approximately 20 counties: 30 (sample size needed) / 20 (approximate number of counties) = 1.5. This meant I took 1 postcode from every county and an extra postcode from every second county.
To choose the house prices within each search, I used a random sample, generating the numbers with the random number generator on a calculator. I used the systematic sample for choosing the counties as it is an easy, quick and fair way of sampling; I used it instead of stratified sampling as it is simpler and less time consuming. I used a random sample to get the house prices as this is the fairest sampling method: each house price within the search had an equal chance of being represented in my data set.
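The worked example of 200 items and a sample of 50 can be sketched in Python. This is only a minimal illustration: the function name systematic_sample, the seed and the population numbered 1 to 200 are assumptions made for the example, not part of my actual data collection.

```python
import random


def systematic_sample(items, sample_size, seed=None):
    """Take every k-th item after a random start, where k = len(items) // sample_size."""
    if sample_size <= 0 or sample_size > len(items):
        raise ValueError("sample_size must be between 1 and len(items)")
    step = len(items) // sample_size      # 200 // 50 = 4 in the worked example
    rng = random.Random(seed)             # seeded only so the example is repeatable
    start = rng.randrange(step)           # random starting position in 0..step-1
    return [items[start + i * step] for i in range(sample_size)]


# Hypothetical population numbered 1 to 200, as in the worked example.
population = list(range(1, 201))
sample = systematic_sample(population, 50, seed=1)
print(len(sample))
```

Every item in the sample is 4 places after the previous one, so the only random choice is the starting position.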
Histograms
I used histograms because they are a good way to see the spread as well as the skewness of the data, and they give a good visual aid.
House prices for the North
House prices for the South
As you can see from the histograms above, my North histogram appears to show quite a strong positive skew, while my South histogram appears to show a weak positive skew. The range of my South data is larger than the range of my North data. South: 799,950 - 285,000 = 514,950. North: 550,000 - 87,000 = 463,000. This supports my hypothesis, as the South has a larger range of prices. Also, the South’s modal class is 300,000 – 450,000, as opposed to the North’s, which is 200,000 – 300,000. This shows that more of the South’s prices fall in a higher price band, which also supports my hypothesis that the South is a more expensive place to live.
Pearson’s measure of skewness
Pearson’s measure of skewness is calculated as:
3(mean - median) / standard deviation
If the result is a negative number, it shows a negative skew.
If it is a positive number, it shows a positive skew.
South
3(482,958 - 487,500) / 135,231.5664 = -0.10. This actually shows a slight negative skew for the South of England house prices. This may not have shown up on my histogram because the class interval widths were different.
North
3(236,959.83 - 219,498) / 104,055.7943 = 0.50. This shows a positive skew for house prices in the North of England, as my histogram suggested.
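As a check on the arithmetic, the two skewness calculations can be repeated in Python. This is a minimal sketch: the function name pearson_skewness is chosen for illustration, and the means, medians and standard deviations are the ones quoted above.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's measure of skewness: 3(mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


# Summary values taken from the South and North calculations above.
south_skew = pearson_skewness(482_958, 487_500, 135_231.5664)
north_skew = pearson_skewness(236_959.83, 219_498, 104_055.7943)

print(round(south_skew, 2))  # negative, so a (slight) negative skew
print(round(north_skew, 2))  # positive, so a positive skew
```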
Box plots
North
LQ=£176,237.50
UQ=£288,750
MEDIAN=£219,498
IQR=£112,512.50
South
LQ=£380,000
UQ=£570,000
MEDIAN=£487,500
IQR=£190,000
I used box plots because my Pearson’s measure of skewness calculation showed a skewed distribution, and a box plot displays the skew clearly. It also shows the median, UQ, LQ and range very clearly. From the above diagrams you can clearly see that house prices in the South are higher: the range is bigger, the median is clearly larger, and the LQ of the South is higher than the UQ of the North. This supports my hypothesis that the South is a more expensive place to live. However, the box plots also show that house prices in the North are closer together than those in the South, as the IQR of the North is smaller than the IQR of the South.
Outliers in my data set
South
Now I am going to calculate whether there are any outliers within my data set. To find out, I will perform the calculations:
Xi: LQ - (1.5*IQR) and Xi: UQ + (1.5*IQR). Any pieces of data that fall outside the range given by these calculations are outliers within my data set. I am using these calculations because they are the specific outlier calculations for a skewed distribution, and my box and whisker plots indicated that I have a skewed distribution.
Outliers in the south dataset:
Xi: 380,000 - (1.5*190,000) = 95,000. This means anything below 95,000 in my data set would be classed as an outlier, but there are no pieces of data below this limit.
Xi: 570,000 + (1.5*190,000) = 855,000. This means anything over 855,000 in my data set would be classed as an outlier. Again, there are no pieces of data outside this boundary.
To conclude this section, there are no outliers in my South house prices data set.
North
Again I am going to calculate whether there are any outliers, but this time within my North house prices data set.
Xi: 176,237.50 - (1.5*112,512.50) = 7,468.75. This means anything below this number would be classed as an outlier in my data set, but there are no pieces of data lower than this value.
Xi: 288,750 + (1.5*112,512.50) = 457,518.75. This means anything over this price would be classed as an outlier in my data set. There is one piece of data above this limit: a 3 bedroom detached house costing 550,000 in Cheshire.
To conclude this section, there is one outlier within my North house prices data set.
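The outlier limits for both regions can be checked with a short Python sketch. The function names outlier_fences and is_outlier are chosen for illustration. Because the full data sets are not reproduced here, the sketch only tests the lowest and highest price in each region (taken from the ranges in the histogram section), which is an assumption that the extremes are enough to show whether any value crosses a limit.

```python
def outlier_fences(lq, uq):
    """Return the (lower, upper) limits LQ - 1.5*IQR and UQ + 1.5*IQR."""
    iqr = uq - lq
    return lq - 1.5 * iqr, uq + 1.5 * iqr


def is_outlier(price, lq, uq):
    """True if the price lies below the lower limit or above the upper limit."""
    lower, upper = outlier_fences(lq, uq)
    return price < lower or price > upper


# Quartiles quoted in the box plot section.
south_limits = outlier_fences(380_000, 570_000)
north_limits = outlier_fences(176_237.50, 288_750)
print(south_limits)
print(north_limits)

# Only the minimum and maximum of each region are checked here.
south_extremes = [285_000, 799_950]
north_extremes = [87_000, 550_000]
south_outliers = [p for p in south_extremes if is_outlier(p, 380_000, 570_000)]
north_outliers = [p for p in north_extremes if is_outlier(p, 176_237.50, 288_750)]
print(south_outliers, north_outliers)
```

The North IQR used inside the function is UQ - LQ = 288,750 - 176,237.50 = 112,512.50, the same value used in the calculations above.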
Median and IQR
I am going to use the median and the IQR that I got from the box plots, as these are the measures of average and spread you should use if you have a skewed distribution. My IQR indicates that the variation of house prices in the North is lower than that in the South. The South’s median is higher than the North’s, which shows that house prices in the North are lower than those in the South, and this supports my hypothesis.