Comparative Newspaper Project
Statistics Coursework
In this investigation I am going to look at the difference between two types of newspaper: tabloids and broadsheets. I could compare the number of letters in a word, the proportion of text to images, or perhaps the number of words with three or more syllables, but I have chosen to compare the lengths of sentences. This is because I think broadsheets will have longer sentences on average, as they are more 'intellectual' newspapers. Tabloids are easy to dip into for news, which suits busy working-class people, whereas broadsheets are aimed at people who want to, and have the time to, read the news fully and in more depth. In addition, sentence length is not too complicated to measure, whereas a measure such as the proportion of text to images is more open to error.
For this investigation I am going to take a sample of 175 sentences from each of two newspapers, one national tabloid and one national broadsheet, the parent population being sentence lengths in national daily newspapers across the country. I'm assuming that all broadsheets are similar to one another, and likewise all tabloids. I've used a sample size of 175 because it is large enough to be reasonably accurate, but not so large that the data would take too long to collect. It is also a sensible number, as I am collecting data in a group of 7, so everyone can count 25 sentences from each newspaper.
To make the sample more reliable, each sentence is going to be selected at random, but first I'm going to choose the two newspapers at random using the random number generator on my calculator. (The newspapers were listed alphabetically to ensure fairness.)
I also used this method (Ran# * 7) to generate the day on which to buy the relevant newspapers (including numbers less than 1 this time).
The way in which I selected each sentence was as follows:
E.g. Ran# = 0 . 1 2 3 4 5
Disregarding any number generated that does not work.
E.g. 0 . 9 4 3 1 2 page number too high
0 . 1 6 9 0 8 column number too high
0 . 3 4 2 9 9 line number too high
0 . 0 7 0 4 1 no column selected
See separate sheets for lists of raw data
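The selection rule can be sketched in Python. This is only an illustration: the digit layout (first two digits for the page, third for the column, last two for the line) is inferred from the rejected examples above, and the limits MAX_PAGE, MAX_COLUMN and MAX_LINE are hypothetical, because the actual page, column and line counts are not stated.

```python
import random

# Hypothetical limits: the real page, column and line counts are not stated.
MAX_PAGE = 40
MAX_COLUMN = 8
MAX_LINE = 60


def decode(digits, max_page=MAX_PAGE, max_column=MAX_COLUMN, max_line=MAX_LINE):
    """Split five random digits into (page, column, line), or give the reason for rejection.

    Assumed layout, inferred from the rejected examples: digits 1-2 give the
    page, digit 3 the column and digits 4-5 the line.
    """
    page, column, line = int(digits[:2]), int(digits[2]), int(digits[3:])
    if page == 0:
        return None, "no page selected"
    if page > max_page:
        return None, "page number too high"
    if column == 0:
        return None, "no column selected"
    if column > max_column:
        return None, "column number too high"
    if line == 0:
        return None, "no line selected"
    if line > max_line:
        return None, "line number too high"
    return (page, column, line), "ok"


def draw_positions(count, rng):
    """Keep drawing five-digit random numbers until `count` usable positions are found."""
    positions = []
    while len(positions) < count:
        digits = f"{rng.randrange(100000):05d}"  # stands in for the calculator's Ran#
        position, _reason = decode(digits)
        if position is not None:
            positions.append(position)
    return positions


# The worked example followed by the four rejected examples.
for example in ["12345", "94312", "16908", "34299", "07041"]:
    print(example, decode(example))

# One group member's share: 25 positions for one newspaper.
print(draw_positions(25, random.Random(1)))
```

Discarding unusable numbers and drawing again, rather than adjusting them, keeps every valid position equally likely.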
Another sampling method I considered was to count all the sentences in one particular article for each paper, e.g. the front-page story. A disadvantage of this is that it wouldn't be random, but I would be certain that a topic of the same importance and subject was measured. However, this would pose problems: for example, there may be more sentences in the broadsheet article than in the tabloid article. My method ensures that the same number of sentences is counted from each paper, and the randomness ensures fairness in which articles and adverts are looked at and which are not.
Calculating Distribution Measurements:
See separate sheet - Distribution for Telegraph
See separate sheet - Distribution for Sun
Mean (Telegraph):
=21.5
Mean (Sun):
= 18.2
Variance (Telegraph):
= 141.7127
= 141.71
Variance (Sun):
= 57.24428
= 57.24
Standard Deviation (Telegraph):
= 11.90431
= 11.90
Standard Deviation (Sun):
= 7.565995
= 7.57
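As a rough sketch of how these measures are calculated, the Python below uses the divisor-n form of the variance (the mean of the squares minus the square of the mean). The list of sentence lengths is hypothetical and is not the raw data from the separate sheets. The last line only checks that each quoted standard deviation is the square root of the quoted variance.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Variance with divisor n: the mean of the squares minus the square of the mean."""
    m = mean(values)
    return sum(v * v for v in values) / len(values) - m * m


def standard_deviation(values):
    """Square root of the divisor-n variance."""
    return math.sqrt(variance(values))


# Hypothetical sentence lengths, for illustration only (not the raw data).
lengths = [12, 25, 18, 31, 9, 22, 17, 40]
print(mean(lengths), variance(lengths), standard_deviation(lengths))

# Consistency check on the quoted figures: sd = sqrt(variance).
print(math.sqrt(141.7127), math.sqrt(57.24428))
```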
It is clear that the sample of sentences from the Telegraph has a larger mean than the sample from the Sun, suggesting that sentences are, on average, longer in the Telegraph. However, this is not conclusive, so further calculations will have to be made. The Telegraph also has a larger standard deviation and variance, meaning that its data is more spread out from the mean, while the Sun's sentences are more consistent in length; this is further evidenced in the diagram below.
Box and Whisker Plots: (using medians, upper quartiles and lower quartiles calculated from separate stem and leaf diagrams)
Telegraph:
Sun:
Stem and Leaf Diagrams:
See separate sheet - Stem and Leaf Diagrams
These diagrams again show that there are longer sentences in the Telegraph, but they also bring new information to light. As I had expected, the data has a roughly normal distribution in both cases. However, the data from both newspapers is slightly positively skewed. To look at this in more detail, I will draw frequency density graphs.
Frequency Density Graphs:
Telegraph:
Class Interval    f     From    To      Class Width    F.D.
0 - 9             35    -0.5    9.5     10             35 / 10 = 3.50
10 - 14           23    9.5     14.5    5              23 / 5 = 4.60
15 - 19           26    14.5    19.5    5              26 / 5 = 5.20
20 - 24           23    19.5    24.5    5              23 / 5 = 4.60
25 - 29           25    24.5    29.5    5              25 / 5 = 5.00
30 - 34           18    29.5    34.5    5              18 / 5 = 3.60
35 - 54           25    34.5    54.5    20             25 / 20 = 1.25
Sun:
Class Interval    f     From    To      Class Width    F.D.
0 - 9             24    -0.5    9.5     10             24 / 10 = 2.40
10 - 14           21    9.5     14.5    5              21 / 5 = 4.20
15 - 19           55    14.5    19.5    5              55 / 5 = 11.00
20 - 24           41    19.5    24.5    5              41 / 5 = 8.20
25 - 29           23    24.5    29.5    5              23 / 5 = 4.60
30 - 34           7     29.5    34.5    5              7 / 5 = 1.40
35 - 54           4     34.5    54.5    20             4 / 20 = 0.20
See separate sheet - Frequency Density Graphs
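The frequency density calculation (frequency divided by class width) can be sketched as follows, using the Sun frequencies. The boundary between the 15 - 19 and 20 - 24 classes is taken as 19.5 here, which is an assumption made so that both classes have the stated width of 5.

```python
# Class boundaries for the Sun table; 19.5 is an assumption chosen so that
# the 15-19 and 20-24 classes both have width 5.
boundaries = [-0.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 54.5]
sun_frequencies = [24, 21, 55, 41, 23, 7, 4]


def frequency_densities(boundaries, frequencies):
    """Return frequency / class width for each class."""
    widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]
    return [f / w for f, w in zip(frequencies, widths)]


densities = frequency_densities(boundaries, sun_frequencies)
for lower, upper, f, fd in zip(boundaries, boundaries[1:], sun_frequencies, densities):
    print(f"{lower:>5} to {upper:<5} f = {f:>2}  F.D. = {fd:.2f}")

# On a frequency density graph the area of each bar (width x height) is the
# frequency, so the areas should add up to the sample size.
total_area = sum(fd * (upper - lower)
                 for fd, lower, upper in zip(densities, boundaries, boundaries[1:]))
print(total_area)
```

Dividing by class width is what allows the wide 35 - 54 class to be drawn on the same axes as the narrower classes without exaggerating it.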
These frequency density graphs show that... The graphs do look quite sensible, partly due to the sample size being so large.
Parent Populations:
To make these results more definite, I need to use what I have already found to estimate the mean, standard deviation and variance of the parent population, British national newspapers.
The sample mean ($\bar{x}$) is a good, unbiased estimator of the mean of the parent population ($\mu$). Therefore I can estimate that the means of the parent populations are as follows:
Mean:
Broadsheets (Telegraph) 21.486
Tabloids (Sun) 18.217
However, the sample variance ($s^2$) is not an unbiased estimator of the variance of the parent population ($\sigma^2$). As $\sigma^2$ is not known, an estimate is used instead.
Estimate of $\sigma$ of parent population: $\hat{\sigma} = s\sqrt{\dfrac{n}{n-1}}$, with $n = 175$
Standard deviation:
Broadsheets (Telegraph) 11.9384711
Tabloids (Sun) 7.587710231
Variance:
Broadsheets (Telegraph) 142.53 (11.9384711 squared, to 2 d.p.)
Tabloids (Sun) 57.57 (7.587710231 squared, to 2 d.p.)
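A short sketch of the correction, assuming the sample standard deviations quoted earlier (11.90431 and 7.565995) were calculated with divisor n. Multiplying each by the square root of n / (n - 1) reproduces the estimates above to about five decimal places; the small difference comes from the rounding of the inputs.

```python
import math

N = 175  # sentences sampled from each newspaper


def estimate_population_sd(sample_sd, n):
    """Scale the divisor-n sample standard deviation by sqrt(n / (n - 1))."""
    return sample_sd * math.sqrt(n / (n - 1))


telegraph_sd = estimate_population_sd(11.90431, N)
sun_sd = estimate_population_sd(7.565995, N)

# Each line prints the estimated standard deviation, then the estimated variance.
print(telegraph_sd, telegraph_sd ** 2)
print(sun_sd, sun_sd ** 2)
```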
Confidence Intervals:
This graph shows the distribution of the sample means, which I can assume is approximately normal because of the central limit theorem (see later). A 95% confidence interval, for example, means that I can be 95% confident that the mean of the parent population lies between the value on the left and the value on the right.
Telegraph:
Estimate of parent population standard deviation:
$\hat{\sigma} = 11.9384711$
Standard error:
$\hat{\sigma} / \sqrt{n} = 11.9384711 / \sqrt{175} = 0.902$
95% interval: $21.486 \pm 1.96 \times 0.902 = 21.486 \pm 1.76792$
I am 95% confident that the mean of the parent population lies between 19.72 and 23.25.
97% interval: $21.486 \pm 2.17 \times 0.902 = 21.486 \pm 1.95734$
I am 97% confident that the mean of the parent population lies between 19.53 and 23.44.
98% interval: $21.486 \pm 2.326 \times 0.902 = 21.486 \pm 2.098052$
I am 98% confident that the mean of the parent population lies between 19.39 and 23.58.
Sun:
Estimate of parent population standard deviation:
$\hat{\sigma} = 7.587710231$
Standard error:
$\hat{\sigma} / \sqrt{n} = 7.587710231 / \sqrt{175} = 0.574$
95% interval: $18.217 \pm 1.96 \times 0.574 = 18.217 \pm 1.12504$
I am 95% confident that the mean of the parent population lies between 17.09 and 19.34.
97% interval: $18.217 \pm 2.17 \times 0.574 = 18.217 \pm 1.24558$
I am 97% confident that the mean of the parent population lies between 16.97 and 19.46.
98% interval: $18.217 \pm 2.326 \times 0.574 = 18.217 \pm 1.335124$
I am 98% confident that the mean of the parent population lies between 16.88 and 19.55.
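The interval calculations can be checked with a short Python sketch. It takes the z values from the standard library's NormalDist rather than from tables and does not round the standard error, so its limits may differ from those above in the last decimal place.

```python
import math
from statistics import NormalDist

N = 175  # sentences sampled from each newspaper


def confidence_interval(mean, sd_estimate, n, level):
    """Two-sided interval: mean +/- z * sd_estimate / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    margin = z * sd_estimate / math.sqrt(n)
    return mean - margin, mean + margin


overlaps = []
for level in (0.95, 0.97, 0.98):
    telegraph = confidence_interval(21.486, 11.9384711, N, level)
    sun = confidence_interval(18.217, 7.587710231, N, level)
    # The intervals overlap when the Sun's upper limit reaches the Telegraph's lower limit.
    overlaps.append(sun[1] >= telegraph[0])
    print(level, telegraph, sun, overlaps[-1])
```

Raising the confidence level widens both intervals, which is why the two intervals move closer together at each step.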
The 95% confidence intervals do not overlap. I then calculated intervals at higher confidence levels to find the highest level at which they still do not overlap, as non-overlapping intervals would mean that broadsheet sentences are clearly longer on average than tabloid sentences. The 97% intervals do not overlap, but the 98% intervals do. As a result, I can be 97% confident that the broadsheet mean is longer than the tabloid mean. Therefore I can conclude that the mean sentence length of British national broadsheets is nearly certain to be longer than that of the national tabloids. This supports my initial hypothesis that broadsheets have longer sentences on average. Because... Also, the data was positively skewed. This was because...
This investigation, although the data was collected randomly and fairly, may not be 100% accurate, for several reasons. Seven people were responsible for data collection, and although we discussed beforehand how we were going to do this, I cannot be sure that every person collected the data in the same way. Another limitation is that I only looked at one tabloid and one broadsheet. The newspapers that we selected may not be typical of those kinds of paper, so it would have been an advantage to sample more papers. If I were to repeat or extend this investigation I would sample more newspapers, but it was not possible this time because it would have been so time-consuming. If it were feasible to collect many samples like this, then I'd plot an accurate graph of the sample means, which would be approximately normally distributed as long as the samples were large enough. The Central Limit Theorem states that 'If the sample size is large enough then the distribution of the sample means is approximately Normal, irrespective of the distribution of the parent population.' It would then be easier to predict the means of the parent populations more accurately.
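The Central Limit Theorem can be illustrated with a small simulation. This is only a sketch under stated assumptions: the parent population is taken to be exponential with mean 18 (a hypothetical choice, made because it is positively skewed), and the number of repeated samples is arbitrary. It does not use any data from the newspapers.

```python
import math
import random
import statistics

SAMPLE_SIZE = 175          # same size as the samples in this investigation
NUMBER_OF_SAMPLES = 2000   # hypothetical number of repeated samples
POPULATION_MEAN = 18.0     # hypothetical mean of a positively skewed population

rng = random.Random(0)     # fixed seed so the run is repeatable


def sample_mean(rng, size, population_mean):
    """Mean of one random sample drawn from an exponential population."""
    values = [rng.expovariate(1 / population_mean) for _ in range(size)]
    return statistics.fmean(values)


means = [sample_mean(rng, SAMPLE_SIZE, POPULATION_MEAN) for _ in range(NUMBER_OF_SAMPLES)]

# An exponential distribution's standard deviation equals its mean, so the
# predicted standard error is POPULATION_MEAN / sqrt(SAMPLE_SIZE).
predicted_standard_error = POPULATION_MEAN / math.sqrt(SAMPLE_SIZE)
print(statistics.fmean(means), statistics.stdev(means), predicted_standard_error)
```

The sample means cluster around the population mean with a spread close to the predicted standard error, even though the individual values are far from normally distributed.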
To develop this investigation, I can use the data already collected to find out other information, such as how many sentences from a sample of, say, 100 chosen from a tabloid newspaper at random are 24 lines long or more. To do this I am assuming that the population is normal.
$X \sim N(18.217,\ 7.587710231^2)$
$Z = \dfrac{23.5 - 18.217}{7.587710231} = 0.696257$
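A sketch of how this calculation could be completed, assuming the normal model above. Taking 23.5 as the boundary for "24 or more" (a continuity correction, since the lengths are whole numbers) is an assumption, but it reproduces the Z value given.

```python
from statistics import NormalDist

MEAN = 18.217        # estimated mean for tabloids
SD = 7.587710231     # estimated standard deviation for tabloids
THRESHOLD = 24       # "24 or more"
SAMPLE = 100         # size of the imagined sample

# Lengths are whole numbers, so "24 or more" starts at 23.5
# (a continuity correction).
boundary = THRESHOLD - 0.5
z = (boundary - MEAN) / SD
probability = 1 - NormalDist(MEAN, SD).cdf(boundary)
expected_count = SAMPLE * probability
print(z, probability, expected_count)
```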
Sarah Ruston