Range = highest value - lowest value
Range of total type = highest value - lowest value across all 3 papers of that type.
Standard Deviation =
Standard Deviation of total type =
I calculated each of the above for the total of that type of newspaper, as it will make comparing ‘broadsheet’ with ‘tabloid’ easier than having three papers of each type to compare.
I have used the above formulas to calculate the mean, mode, median, range and standard deviation for the total types. This was important because if I had just averaged the results from the three newspapers I would have had an average of an average and not taken account of the varying number of sentences or paragraphs in each newspaper.
The results of all the calculations for both words per sentence and sentences per paragraph are shown on Table 4. I have shown my calculations for The Times on the following pages and calculated the rest using a spreadsheet.
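The difference between pooling the data and averaging the three per-paper averages can be illustrated with a minimal Python sketch. The counts, the paper names and the function names below are hypothetical, and the population form of the standard deviation (`pstdev`) is an assumption, since the formula itself is not reproduced in this section.

```python
from statistics import mean, pstdev

# Hypothetical words-per-sentence counts for three papers of one type
# (illustrative values only, not the data recorded on Table 4).
papers = {
    "Paper A": [12, 30, 25, 18],
    "Paper B": [40, 22],
    "Paper C": [15, 28, 33, 20, 27, 19],
}


def pooled_summary(samples):
    """Pool every sentence from every paper, then summarise once."""
    pooled = [value for counts in samples.values() for value in counts]
    return {
        "range": max(pooled) - min(pooled),
        "mean": mean(pooled),
        "sd": pstdev(pooled),  # assumption: population standard deviation
    }


def mean_of_means(samples):
    """Average the per-paper means, ignoring how many sentences each paper has."""
    return mean(mean(counts) for counts in samples.values())


summary = pooled_summary(papers)
print(summary["range"], round(summary["mean"], 3), round(mean_of_means(papers), 3))
```

Because Paper B has only two sentences, its mean carries too much weight in the average of averages, which is why the two results differ.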
Interpretation
Article
After I had worked out all these calculations, I decided to begin by analysing the length of the articles.
I analysed it according to sentences per article and paragraphs per article. I have displayed these as stacked bar charts so that it is easy to see what proportion of each bar is made up by each newspaper. See Graph 1. I considered using proportional pie charts but decided on this method because it is clearer to compare when the scales on the axes are the same. As you can see from the bar charts, the broadsheet articles are longer than the tabloid articles, whether measured in sentences or in paragraphs.
Using a stacked bar chart immediately shows that the article in The Independent was by far the longest and the article in The Sun was the shortest.
Words Per Sentence
Range
There are three methods of measuring spread, or how close the data is to the average value. These are the range, interquartile range, and the standard deviation.
The simple range score of the broadsheet at 47 and tabloid at 35 shows us that broadsheets have more variation in the number of words in each sentence than tabloids. See Table 4.
I drew the cumulative frequency curve (see Graph 2) to identify and illustrate where the median, upper quartile and lower quartile can be found. The median is also calculated using the formula on page 2; this is shown on Table 4. I then used this information to plot box and whisker plots. See Graph 3.
These box and whisker plots are very useful to compare the different newspapers of each type, to check the reliability of the sample, and to compare total broadsheet with total tabloids.
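The five values that a box and whisker plot displays can also be computed directly rather than read from a cumulative frequency curve. The sketch below is a hedged illustration: the function name and sample values are hypothetical, and the "inclusive" quartile method is an assumption, because different conventions give slightly different quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the values needed to draw a box and whisker plot."""
    ordered = sorted(values)
    # assumption: "inclusive" interpolation between data points
    lower_q, median, upper_q = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "max": ordered[-1],
        "iqr": upper_q - lower_q,      # length of the box
        "range": ordered[-1] - ordered[0],  # whisker tip to whisker tip
    }


# Hypothetical words-per-sentence counts, not taken from Table 4.
sample = [5, 9, 14, 18, 22, 25, 27, 31, 40]
print(five_number_summary(sample))
```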
From this diagram we can see:
Broadsheets – The Independent has a range that is comparable to that of The Guardian; however, its interquartile range (the box) is a lot larger. The Times has a smaller range than the others, but its interquartile range is much nearer the small end of the range.
Tabloids – The Daily Express and The Daily Mail are very similar in their spreads whereas The Sun has a range that is a lot smaller and the interquartile range is also considerably smaller.
When we compare the two box and whisker plots (see Graph 4) for the total types, we can see that the interquartile ranges are very similar; however, the overall range of the tabloids is a lot smaller than that of the broadsheets. This shows that broadsheet newspapers have more long sentences than tabloids.
The standard deviation also shows the spread, but by its distance from the mean. The bigger the standard deviation, the bigger the spread of values from the mean, hence the less consistent the data. The standard deviation is calculated by the formula on page 3 and recorded on Table 4.
I have calculated below the percentage of the data that is within one standard deviation of the mean. For a normal distribution, which forms a bell-shaped frequency curve, this percentage is about 68. I have taken a percentage above 68 as a sign of positive skew and a percentage below 68 as a sign of negative skew.
Broadsheets –
s.d. = 11.409, mean = 24.454
24.454 + 11.409 = 35.863 and 24.454 – 11.409 = 13.045
79 out of 130 sentences have between 13.045 and 35.863 words (i.e. 14 to 35).
Therefore (79 / 130) x 100 = 60.8% of the data is within 1 standard deviation of the mean.
Tabloids -
s.d. = 8.462, mean = 23.416
23.416 + 8.462 = 31.878 and 23.416 – 8.462 = 14.954
59 out of 89 sentences have between 14.954 and 31.878 words (i.e. 15 to 31).
Therefore (59 / 89) x 100 = 66.3% of the data is within 1 standard deviation of the mean.
These are both less than the 68% expected for a normal distribution, so the graphs they would produce are negatively skewed.
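The percentage calculations above can be checked with a short Python sketch. The function names and the small list are hypothetical; the two counts (79 of 130 and 59 of 89) are the ones quoted above, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def percent_within_one_sd(values):
    """Percentage of raw values lying within one standard deviation of the mean."""
    centre = mean(values)
    spread = pstdev(values)  # assumption: population standard deviation
    lower, upper = centre - spread, centre + spread
    inside = sum(1 for value in values if lower <= value <= upper)
    return 100 * inside / len(values)


def percent_from_counts(inside, total):
    """Percentage when the number of values inside the band is already known."""
    return 100 * inside / total


# Counts quoted in the text for the words-per-sentence data.
print(round(percent_from_counts(79, 130), 1))  # broadsheets
print(round(percent_from_counts(59, 89), 1))   # tabloids

# Hypothetical raw data, to show the first function working from a list.
print(percent_within_one_sd([1, 2, 3, 4, 5]))
```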
I now know that these are skewed, but using Pearson’s coefficient of skewness I can calculate just how skewed they are. This skew is illustrated on Graphs 6 and 7.
Skewness = (mean – mode) / standard deviation
Broadsheet -
(24.454 – 31) / 11.409 = -0.57
Tabloid -
(23.416 – 29) / 8.462 = -0.66
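The skewness calculation above can be reproduced with a minimal Python sketch. The function name is hypothetical; the inputs are the means, modes and standard deviations quoted for the total types.

```python
def pearson_mode_skewness(mean, mode, sd):
    """Pearson's coefficient of skewness: (mean - mode) / standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (mean - mode) / sd


# Words per sentence, total types, using the figures quoted above.
broadsheet = pearson_mode_skewness(24.454, 31, 11.409)
tabloid = pearson_mode_skewness(23.416, 29, 8.462)
print(round(broadsheet, 2), round(tabloid, 2))
```

A negative result means the mean lies below the mode, and a positive result means it lies above.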
Averages
There are three different types of average: mean, median and mode. Their definitions are on page 2. I have calculated each of these averages. See Table 4.
I have plotted these calculations on Graphs 6 and 7.
The mode is the most frequent number of words in a sentence. For the broadsheets this is 31, whereas for the tabloids it is 29. Both The Times and The Independent are bi-modal, as is The Sun.
The median is the middle number of words per sentence when all the numbers of words per sentence are put in numerical order. For the broadsheets this is 25, whereas for the tabloids it is 24.
The mean is often referred to simply as the ‘average’ because it is the most commonly used type of average and gives a ‘typical’ value. It is the sum of all the words divided by the number of sentences. For the broadsheets this is 24.454 words per sentence, whereas for the tabloids it is 23.416. See Graph 5.
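As an illustration of the three averages, the following Python sketch uses the standard `statistics` module. The function name and the sample counts are hypothetical; `multimode` is used so that a bi-modal set of sentences reports both modes.

```python
from statistics import mean, median, multimode


def averages(words_per_sentence):
    """Return the mean, the median and every mode of the data."""
    return {
        "mean": mean(words_per_sentence),
        "median": median(words_per_sentence),
        "modes": sorted(multimode(words_per_sentence)),  # two entries if bi-modal
    }


# Hypothetical words-per-sentence counts chosen to be bi-modal.
sample = [12, 18, 18, 24, 29, 29, 31]
print(averages(sample))
```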
To discover whether this small difference is statistically significant, some significance testing would have to be carried out.
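One possible test, which was not carried out in this investigation, is Welch's t-test for two samples with unequal spreads. The sketch below only computes the test statistic from the summary figures quoted above; the function name is hypothetical, and treating the quoted standard deviations as sample standard deviations is an assumption.

```python
from math import sqrt


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic for the difference between two means."""
    # standard error of the difference, allowing unequal variances
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Broadsheet and tabloid words-per-sentence figures quoted in the text.
t_statistic = welch_t(24.454, 11.409, 130, 23.416, 8.462, 89)
print(round(t_statistic, 3))
```

The statistic would then be compared with a critical value from t tables before deciding whether the difference is significant.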
Secondary Data
I have used secondary data from a published book to discover whether or not my data is reliable. See Table 5. This data was not ideal, for the reasons given below, but it was all I was able to find in the time available.
The data has been taken from the sports pages of newspapers, so it will not make a direct comparison with my data, but it will give an indication of the reliability of my results. The broadsheet newspaper is the Daily Telegraph, for which I have no other data, and I do not know which day of the week this data was collected on, which could also make a difference. I would expect the sports pages to have fewer words per sentence, if there is any difference at all, as the audience of the sports pages is different to that of hard-hitting news stories. As the broadsheet newspaper is not one I have analysed, I will be comparing this secondary data to my total type data.
The simple range score of the broadsheet at 52 and tabloid at 34 reinforces my data in showing that broadsheets have more variation in the number of words in each sentence than tabloids. See Table 5.
I have drawn the cumulative frequency curve (see Graph 8) to identify and illustrate where the median, upper quartile and lower quartile can be found. The median is also calculated using the formula on page 2; this is shown on Table 5. I then used this information to plot box and whisker plots. See Graph 9. In order to compare this with my data, I have put both sets of data on the same graph.
From this diagram we can see:
Broadsheets – The Telegraph is fairly comparable to the total broadsheet plot although the range is slightly larger and the interquartile range (the box) is also larger. The median is far more central both in the interquartile range and the overall range.
Tabloids – The Sun is very comparable to the total tabloid plot. The range is very similar, however the interquartile range is considerably smaller and lower down in the range.
From this comparison, I can see that my data is comparable to the secondary data and therefore reliable.
By looking at the averages (see Table 5) I can see that the Telegraph is comparable to the total broadsheets. The Sun has averages that are a lot smaller than those of the total tabloids, but the two sets of data taken from the Sun are more comparable with each other.
Overall, I conclude that my data is fairly reliable based on the secondary data I have compared it with.
Conclusion – Words per Sentence
Following my analysis of the number of words per sentence in both individual broadsheet and tabloid newspapers and in the total of each type, I have drawn the following conclusion:
Broadsheets have a wider range of sentence length; they use both very short and very long sentences (3 to 65 words).
Tabloids, by contrast, tend to use sentences of a more standard length, rarely going over 40 words (5 to 41 words).
I conclude that my hypothesis for number of words per sentence was correct, as on average broadsheets have longer sentences.
Sentences per Paragraph
Range
The range for the number of sentences per paragraph was much less varied than for words per sentence, as a paragraph rarely had more than 3 sentences. When analysing the total type range (broadsheet – 3, tabloid – 2), it appears that broadsheets have more variation. However, if you look at the range for each individual paper (see Table 4), you can see that two out of three broadsheets have a range of 2 and two out of three tabloids have a range of 2, so the difference is perhaps not as significant as it first seems.
For this data I could not use cumulative frequency or box and whisker plots as I did with the words per sentence because the range of possible values is so small that it would not work effectively.
I have calculated the standard deviation using the formula on page 3 and recorded it on Table 4.
I have also calculated the percentage of the data that is within one standard deviation of the mean below.
Broadsheet -
s.d. = 0.842, mean = 1.618
1.618 + 0.842 = 2.460 and 1.618 – 0.842 = 0.776
67 out of 80 paragraphs have between 0.776 and 2.460 sentences (i.e. 1 or 2).
Therefore (67 / 80) x 100 = 83.8% of the data is within 1 standard deviation of the mean.
Tabloids -
s.d. = 0.541, mean = 1.277
1.277 + 0.541 = 1.818 and 1.277 – 0.541 = 0.736
50 out of 65 paragraphs have between 0.736 and 1.818 sentences (i.e. 1).
Therefore (50 / 65) x 100 = 76.9% of the data is within 1 standard deviation of the mean.
These are both more than the 68% expected for a normal distribution, so the graphs they would produce are positively skewed.
Using Pearson’s coefficient of skewness (see formula on page 5) I can calculate just how skewed they are. This skew is illustrated on Graphs 6 and 7.
Broadsheet -
(1.618 – 1) / 0.842 = +0.73
Tabloid -
(1.277 – 1) / 0.541 = +0.51
Averages
All three averages for both broadsheets and tabloids are shown on Table 4.
They are all also plotted on Graphs 6 and 7.
The mode and the median for both broadsheets and tabloids are 1, so these are not good averages to compare. The mean for the broadsheets is 1.618, whereas for the tabloids it is 1.277. This is a difference of only 0.341; however, since the range is so small, this may be significant. Some statistical significance testing would have to be undertaken to decide this.
Conclusion – Sentences per Paragraph
Following my analysis of the number of sentences in both individual broadsheet and tabloid newspapers and in the total of each type, I have discovered that a firm conclusion cannot be made.
My data has shown that broadsheets have longer paragraphs than tabloids, but only by a very small degree. This may or may not be true in general, as my data covers too few newspapers to tell; to confirm it, more newspapers would have to be tested.
Readability Level
I have used the ‘SMOG Readability Test’ to calculate the required reading age to be able to read each newspaper.
The readability level is calculated in the following way, using The Times as an example:
1. Number of words with 3 or more syllables in the first 10 sentences = 20
2. Multiply by 3: 20 x 3 = 60
3. Nearest square number: 64
4. Square root of nearest square number: 8
5. Add 8: 8 + 8 = 16
Therefore the required reading age for The Times is 16.
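The five steps above can be sketched in Python as follows. The function names are hypothetical, and the sketch follows only the simplified version of the test described here (10 sentences, multiply by 3, add 8).

```python
from math import isqrt


def nearest_square_root(number):
    """Return the whole number whose square is closest to `number`."""
    root = isqrt(number)  # largest whole number with root * root <= number
    # pick root + 1 when its square is closer than the square of root
    if (root + 1) ** 2 - number < number - root ** 2:
        return root + 1
    return root


def smog_reading_age(polysyllable_count):
    """Reading age from the count of words with 3 or more syllables."""
    tripled = polysyllable_count * 3          # step 2
    return nearest_square_root(tripled) + 8   # steps 3 to 5


# The Times: 20 words with 3 or more syllables in the first 10 sentences.
print(smog_reading_age(20))
```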
The other results are on Table 6.
Conclusion – Readability Level
The readability level is not a very reliable statistic, as it only takes into account the first 10 sentences of the article and only looks at words with many syllables rather than long words. As this analysis is of an article on a medical theme, it is highly likely that the articles contain many medical words with 3 or more syllables, so this does not give a fair indication of each newspaper’s typical language use.
Therefore there is insufficient evidence to draw a firm conclusion, although a tentative conclusion that broadsheets generally have a higher reading age could be made.
To be able to draw a firm conclusion the reading ages of different types of articles would have to be calculated.
Evaluation
Looking back, I did not encounter too many problems during this project; however, there were many things I did which could have been improved.
The data collection was a very lengthy process, as every word, sentence and paragraph had to be counted; this was part of the reason I only covered 6 newspapers. It would have been desirable to look at a wider range of newspapers, as this would have given a clearer conclusion, and to look at letters or syllables per word, as I think this would give a clearer indication of the different uses of language between broadsheets and tabloids. It would be preferable if the data collection could be carried out automatically by computer to cut down the time it takes.
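A rough idea of how the counting could be automated is sketched below in Python. The function name and sample text are hypothetical, and the rules are simplifying assumptions: paragraphs are separated by blank lines, and a sentence ends at a full stop, question mark or exclamation mark followed by a space, so abbreviations such as "Dr." would be miscounted.

```python
import re


def count_article(text):
    """Return (words per sentence, sentences per paragraph) for an article."""
    # assumption: paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    words_per_sentence = []
    sentences_per_paragraph = []
    for paragraph in paragraphs:
        # assumption: a sentence ends with . ! or ? followed by whitespace
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
        sentences_per_paragraph.append(len(sentences))
        words_per_sentence.extend(len(s.split()) for s in sentences)
    return words_per_sentence, sentences_per_paragraph


# Hypothetical two-paragraph article.
article = "The cat sat. It purred loudly!\n\nA new day began."
print(count_article(article))
```

The two lists returned could then be passed straight to the range, average and standard deviation calculations used in this investigation.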
I found that, especially for sentences per paragraph, it was fairly useless calculating the mode and the median as they do not provide the ‘typical’ value and therefore the mean was much more useful in analysis.
My investigation was far less straightforward than that of other pupils and hence it was quite difficult to find statistical analyses that could be carried out!
I was quite surprised to find that there was not much published secondary data on newspaper comparisons. I was expecting to find lots on the Internet but none was found. As you can see I had to settle for data that was not wholly relevant as it was all I could find. This meant that the secondary data did not back up my data very well.
If I were to re-do or extend the investigation, I would collect the required information using a computer, collect data on word length and perhaps syllables per word, and collect data from more sources. I could also analyse this data not only for serious articles but also for something light-hearted such as sport or reviews. It might also be possible to see how the conclusion I obtain relates to the sales of each newspaper and to who the actual audience is.