Statistical Investigation into Newspapers
Introduction
There are many different kinds of newspapers for sale, catering for a diverse range of readers' preferences. These range from the more descriptive and complicated broadsheet newspapers to the less serious styles of the tabloids. The style varies accordingly between the levels of newspaper: it is commonly believed that the broadsheets are more descriptive and harder to understand because of their longer words, while tabloids are easier to read because they use shorter words and report topics in a lighter style and depth.
Specify and Plan
Aim
My aim is based on the hypothesis that: a British national broadsheet newspaper will, on average, have more letters per word than a tabloid newspaper covering the same topic on the same day.
The aim is to test the common belief that a broadsheet uses longer words than a tabloid, which may lead to the conclusion that longer words produce a more complex vocabulary and sentence structure, and therefore a harder read, while a tabloid uses shorter words to get the information across. This also indicates the readability of the different newspapers.
In my further investigation (once I have finished my primary objectives), I will be analysing any correlation between the average lengths of words and sentences within articles. This stems from an intriguing theory, which I would like to prove or disprove: do longer sentences contain longer words, or simply a larger number of short words?
Original Aim Objectives
* To collect data (100 randomly selected words) on the number of letters per word in both a tabloid and a broadsheet paper, from a similar article. The random selection of words will use the calculator's random number generator: one random number gives a starting point (e.g. the fourth word), and the first significant figure of each subsequent random number gives the number of words to skip to the next sample. This will be done at least 100 times per article, to get a genuinely random sample of words and reduce bias, and will be repeated on the other newspaper.
* To present data in a meaningful way, using frequency/tally tables.
* To interpret and analyse results and draw suitable diagrams.
* To draw conclusions on analysis, see whether the hypothesis is correct.
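The word-selection procedure in the objectives above can be sketched in Python. This is only an illustration of the stated method; the example article, function name and parameters are my own, not from the coursework.

```python
import random

def sample_word_lengths(words, n_samples=100, seed=None):
    """Sample word lengths using the method described above: a random
    starting word, then repeated jumps equal to the first significant
    figure of a calculator-style random number, wrapping around the
    article (i.e. sampling with replacement)."""
    rng = random.Random(seed)
    pos = rng.randrange(len(words))      # random starting point
    lengths = [len(words[pos])]
    while len(lengths) < n_samples:
        r = rng.randint(1, 999)          # mimics the RAN# range 0.001-0.999
        step = int(str(r)[0])            # first significant figure (1-9)
        pos = (pos + step) % len(words)  # wrap to the start if needed
        lengths.append(len(words[pos]))
    return lengths

article = "Thousands of drivers faced long delays on the roads this Easter".split()
sample = sample_word_lengths(article, n_samples=20, seed=42)
```

Because the position wraps around, a word can be re-selected, which matches the sampling-with-replacement decision made later in the plan.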
Original Hypothesis Problems
* What is a tabloid and what is a broadsheet paper? It is general knowledge that The Mirror is a widely bought tabloid newspaper, while The Times is a well-respected broadsheet, so these two will be used in the investigation. However, other papers such as The Daily Mail and The Express are in a grey area, a level of their own, the "broadloids", where no clear distinction is made. For this investigation, two newspapers of clearly different levels are sufficient, i.e. The Mirror and The Times.
* The letter count may be affected by the style of the writing and the topic written about.
* Do headlines/subtitles/names count in the investigation?
* Do accidents (such as typing errors) in the newspaper count, and what about hyphenated words?
* What happens if I get to the end of the article without finishing the survey?
* Time is very limited
Original Hypothesis Plan
I will allow for the expected (and more) problems as follows:
* The papers used in the investigation will be of the same date; the date has been drawn from the random number generator. In this case it is the 29th of March; this is to reduce bias.
* A similar article will be selected, reducing the chance that a scientific article containing long words (e.g. medical terminology) is unfairly compared with a normal text, thereby reducing bias.
* Hyphenated words will be treated as one word.
* Headlines will be omitted, as they are often not proper sentences and not strictly part of the passage; they are mostly written by the editor rather than the writer, and might affect the result.
* Subtitles (in the passage itself) will be included as they are part of the passage.
* No local papers will be used, to avoid local words that might affect the results.
* No adverts as the style of writing is different in every advert.
* All dates and numbers in figure form (e.g. 4) are to be ignored, unless they are written out as words.
* The name of the reporter is omitted, as it would affect the result and is not part of the passage.
* Typing errors and accidentally joined words (e.g. "accidentaljoining...") will be a valid part of the sample, as they are common in newspapers.
* The names of the people and place names in the article will count as a valid sample.
* Apostrophes and punctuation marks will not be counted as letters; abbreviations will not be counted, as they are not proper words.
* If I get to the end of the article before the 100-word random sample is finished, I will simply continue from the beginning of the article again.
Original Hypothesis Data
* SAMPLING
I will be sampling because it would take too long to collect and process every word of all the broadsheet and tabloid newspapers; sampling gives an accurate cross-section of the newspaper population. I will sample two different types of newspaper on the same day, on a similar article.
The sample size will be 100 words because of the time and resources available. This is an accurate cross-section of the parent article: not so small that no conclusions (or only dubious ones) can be drawn from it, yet not so big that it could never be processed to reach a conclusion.
* BIAS
I will be sampling a similar article from each paper because I want no bias from the terminology of, say, a medical article being compared with a normal one. I will use a simple sampling technique in which every word of the article has a chance of being in the sample, and I will sample with replacement: I might not have enough words in the article for the sample, so I must start from the beginning again (meaning a word can be re-selected).
I will use the random number generator on my calculator, by pressing SHIFT and then the RAN# button. The numbers produced are three-digit numbers between 0.001 and 0.999. I will ignore the decimal point and take the first digit as my random figure. (Though the method of producing the "random" number is not truly random, it is adequate for school use.) This type of random sampling is non-judgemental and therefore reduces bias.
I will also be doing all the sampling myself, limiting any operator errors which could lead to bias.
* OTHER SAMPLING TECHNIQUES AND METHODS
There were three sampling methods I could have chosen. I chose random sampling with replacement over stratified and systematic sampling because I feel it involves less bias and is better for reaching a clear conclusion. Stratified sampling requires judgemental decisions about the article, which produces unpredictable results and therefore bias (it would also take too long to work out the percentage of words of each length and sample that way). Systematic sampling does not contain enough randomness to guarantee a bias-free answer (taking every xth word until the end of the article is fast, but not random). I therefore chose the truly random selection of words, as it is the most random and involves the least bias.
Collect and Process
Original Hypothesis-Pilot sample
The samples are from an article on the Easter gridlock; the articles are inserted in the appendix section, with the sampled words highlighted in yellow, attached at the end of this section. Below is a record of word lengths in The Mirror. This pilot sample will determine, using a formula, the minimum sample size needed for a given accuracy relative to the whole article population.
The Mirror - Gridlock article - pilot sample
[Table: pilot sample of 30 word lengths from The Mirror; columns: sample no., word length (x), x².]
Sigma (x) = 122
Sigma (x²) = 604
I then found the mean of the two values:
Mean = Σx / no. of words = 122/30 = 4.067 (3dp)
Mean² = (4.067)² = 16.538 (3dp)
I then found the standard deviation from the mean values:
Standard deviation (s) = √(Σx² / no. of words − mean²) = 3.596 (3dp)
Margin of error (d) = 0.08
The margin of error is how close to the result for the entire article we would like the sample result to be.
Level of certainty = 98%/200 = 0.49
= Z score of 2.3
The level of certainty is the percentage of results (98%) that lie within the margin of error. It is divided by 200 and the answer looked up in the table of Z values published by Lindley and Booth: find the 0.49 figure and read off the Z value that corresponds to it.
Smallest sample size = (Zs/d)² = (2.3 × 3.596 / 0.08)² = 102 (3sf)
As you can see, having done 30 already, I will need about 72 more to reach the minimum required. However, to keep the numbers even, I decided on the original 100 samples, which is not far off the smallest sample size; this calculation allows me to confirm the minimum sample size.
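The minimum-sample-size formula above can be written as a short function. This is a sketch of the general formula only; the figures passed in below are illustrative round numbers, not the coursework's own.

```python
import math

def minimum_sample_size(z, s, d):
    """Smallest n such that the sample mean is within margin d of the
    true mean at the certainty level implied by the Z score:
    n = (z * s / d) ** 2, rounded up to a whole number of samples."""
    return math.ceil((z * s / d) ** 2)

# Illustrative values: z = 2.5, s = 4, d = 1 gives (2.5 * 4 / 1)^2 = 100
n_min = minimum_sample_size(2.5, 4, 1)
```

Rounding up is deliberate: a fractional sample cannot be taken, and rounding down would fall short of the required accuracy.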
Results
Below is the frequency chart for the number of letters in each word in the Easter gridlock articles in The Mirror and The Times.
[Table: letter frequency for the two papers; columns: number of letters per word (x), frequency (f) and (xf) for THE TIMES and THE MIRROR, 100 sampled words from each.]
Sigma (xf), THE TIMES = 687
Sigma (xf), THE MIRROR = 433
* MEAN
To find the mean word length for the two newspapers, so as to compare them and test whether the original hypothesis is correct, I divide the (xf) total by (n); since there are 100 samples from each newspaper, I simply divide the Σ(xf) figure by 100. Therefore:
Mean word length for The Times: 6.87 letters per word (687/100)
Mean word length for The Mirror: 4.33 letters per word (433/100)
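The calculation above comes straight from the frequency-table totals, and can be sketched in a couple of lines (the figures are the Σ(xf) totals quoted in the text; the function name is my own):

```python
def mean_word_length(total_xf, n):
    """Mean word length from a frequency table: x-bar = sum(x * f) / n."""
    return total_xf / n

times_mean = mean_word_length(687, 100)   # The Times: 6.87
mirror_mean = mean_word_length(433, 100)  # The Mirror: 4.33
```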
Information from the frequency tables was used to plot the frequency polygon graph (next page). The graph clearly shows that THE MIRROR had a bunching of letters in the 2-6 letter word range (it is skewed to the left). THE TIMES, however, was more evenly spread out and is roughly uni-modal. I used this technique because I felt it was the best summary value to use.
* RANGE
The range of the MIRROR is 9 − 1 = 8, which suggests the figures are all bunched within an 8-letter spread between the shortest and longest words. THE TIMES has a range of 12, which tells us it contains long words, longer than the MIRROR's.
I have also plotted a set of box plots on the frequency polygon. These are useful because they show that for the MIRROR, most words consist of fewer than 6 letters, whereas in THE TIMES the majority (75%) are over 4 letters.
CONFIDENCE
To calculate the confidence limit (to see how close the sample mean is to the real mean):
100 samples = (n)
Mean word length (x̄) = 4.33 (the MIRROR)
Standard deviation (s) = √(Σx² / no. of words − mean²) = 3.6 (1dp)
Standard error of the sample mean: SE(x̄) = s / √n = 3.596/10 = 0.36 ≈ 0.4 (one standard error)
Therefore, we can conclude that there is a 68% chance that the real mean lies within one standard error of our sample mean (4.0 to 4.7), a 95% chance that it lies within two standard errors, and a 99.7% chance that it lies within three standard errors (the percentages are fixed to the error levels).
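The standard-error working above can be checked with a few lines, using the Mirror figures quoted in the text (the function name is my own):

```python
import math

def standard_error(s, n):
    """Standard error of the sample mean: SE = s / sqrt(n)."""
    return s / math.sqrt(n)

se = standard_error(3.596, 100)       # 3.596 / 10 = 0.3596, about 0.4
lower, upper = 4.33 - se, 4.33 + se   # the ~68% interval around the sample mean
```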
Analysis
Looking at the frequency polygon and box plots for both papers, THE TIMES has a larger range (12 compared with the MIRROR's 8). The interquartile ranges differ as well, with THE TIMES being higher, and the upper and lower quartiles of THE TIMES are 2-3 letters above the MIRROR's. This is supported by the medians, with THE TIMES 3 letters bigger than the MIRROR.
Mean word length for The Times: 6.87 letters per word
Mean word length for The Mirror: 4.33 letters per word
The mean word length also supports it, with THE TIMES 2.54 letters more than the MIRROR (however, we should note that the figures are not accurate enough to merit two decimal places, as they are based on rounded-off figures themselves!).
On the frequency polygon graph, it is clearly visible that the MIRROR's words are bunched at 3-6 letters per word with a slight tail towards 10 letters, but THE TIMES is clearly a varied text, with almost as many 11-letter words as 3-letter ones.
Conclusion/Evaluation
My original hypothesis is supported: in conclusion, the results suggest that THE TIMES has more letters per word on average than the tabloid paper. But this does not mean the result is conclusive.
Possible sources of error include mistakes in the way the data was collected and processed, such as miscounting and choosing the wrong word to count. These all reduce the accuracy of the result.
The sample size was also small: in my original significance calculation, the minimum number of words needed was 102. The more samples I took, and the more I varied the pool of newspapers, the higher the probability that my hypothesis would be proved conclusively.
However, I have to say that the investigation worked, though there is still room for improvement, such as using a larger sample size. If I had more time I would have sampled 300 words from each newspaper, covering different sections (e.g. financial, sport), and it would be interesting to see how these sections compare with each other.
My results are significant, as I feel the conclusions inferred from the data are highly probable and not due to some accident or mistake. This is partly due to the random method of collecting the data and the elimination of bias from most of the possible avenues, and partly due to the sample size of 200 words altogether. However, I should repeat the experiment on different days and on different newspapers to be sure the results are not just a coincidence.
If I had the time, I could build on these results by going to websites and comparing USA tabloid and broadsheet papers, to see whether they behave the same as the UK ones. My hypothesis may be right, but I need more time and more data to investigate.
My further investigation
My Hypothesis
A longer sentence will, on average, have a greater mean word length than a shorter sentence; there will be a strong correlation between the two.
For more details of this, see start of project.
DATA COLLECTION
Since I needed a lot of news articles, I thought it appropriate to use the Internet to search for news websites and use an archive as the population. I went to The Independent newspaper website and took similar numbers of articles from the politics, sport, financial and news sections. The dates of publication are all from May to June. In the archive, I numbered all the articles in chronological order, then used a random number generator, just as in the first investigation, to pick a starting article; from there I generated another random number and moved on that many articles (ignoring the decimal point and taking the first significant figure), and so on.
This counted as a type of random non-replacement method: once I had used an article, I ignored it if I came back to it, and carried on until I had my 30 sample articles.
Headings, names, lists, adverts and so on were omitted from each article to prevent bias, and each article was then assigned a number; see the first investigation above for details.
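The article-selection procedure can be sketched as follows. The jump-by-a-random-digit rule mirrors the word sampling in the first investigation; the function name, archive size and seed below are illustrative assumptions, not from the coursework.

```python
import random

def sample_articles(n_articles, n_wanted=30, seed=None):
    """Choose n_wanted distinct article numbers by random jumps through a
    numbered archive, ignoring any article already chosen (random
    non-replacement sampling)."""
    rng = random.Random(seed)
    chosen = []
    pos = rng.randrange(n_articles)      # random starting article
    while len(chosen) < n_wanted:
        if pos not in chosen:            # skip repeats: non-replacement
            chosen.append(pos)
        pos = (pos + rng.randint(1, 9)) % n_articles  # jump by a random digit
    return chosen

articles = sample_articles(120, n_wanted=30, seed=7)
```

Unlike the word sampling, a repeated position here contributes nothing, so every chosen article is distinct.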
DATA
n | Mean word length (x) | Mean sentence length (y) | x² | y² | xy
1 | 4.02 | 17.6 | 16.16 | 309.76 | 70.75
2 | 4.19 | 17.3 | 17.56 | 299.29 | 72.49
3 | 4.23 | 17.5 | 17.89 | 306.25 | 74.03
4 | 4.29 | 18.6 | 18.40 | 345.96 | 79.79
5 | 4.33 | 18.5 | 18.75 | 342.25 | 80.11
6 | 4.34 | 18.0 | 18.84 | 324.00 | 78.12
7 | 4.41 | 21.1 | 19.45 | 445.21 | 93.05
8 | 4.50 | 18.3 | 20.25 | 334.89 | 82.35
9 | 4.59 | 18.2 | 21.07 | 331.24 | 83.54
10 | 4.66 | 18.6 | 21.72 | 345.96 | 86.68
11 | 4.69 | 22.6 | 22.00 | 510.76 | 105.99
12 | 4.71 | 18.6 | 22.18 | 345.96 | 87.61
13 | 4.71 | 18.9 | 22.18 | 357.21 | 89.02
14 | 4.76 | 20.0 | 22.66 | 400.00 | 95.20
15 | 4.79 | 22.2 | 22.94 | 492.84 | 106.34
16 | 4.82 | 23.0 | 23.23 | 529.00 | 110.86
17 | 4.83 | 21.8 | 23.33 | 475.24 | 105.29
18 | 4.86 | 22.6 | 23.62 | 510.76 | 109.84
19 | 4.96 | 23.6 | 24.60 | 556.96 | 117.06
20 | 4.96 | 23.1 | 24.60 | 533.61 | 114.58
21 | 4.98 | 23.7 | 24.80 | 561.69 | 118.03
22 | 5.00 | 23.4 | 25.00 | 547.56 | 117.00
23 | 5.02 | 23.6 | 25.20 | 556.96 | 118.47
24 | 5.07 | 25.0 | 25.70 | 625.00 | 126.75
25 | 5.10 | 24.1 | 26.01 | 580.81 | 122.91
26 | 5.16 | 26.2 | 26.63 | 686.44 | 135.19
27 | 5.18 | 25.0 | 26.83 | 625.00 | 129.50
28 | 5.19 | 24.1 | 26.94 | 580.81 | 125.08
29 | 5.19 | 26.2 | 26.94 | 686.44 | 135.98
30 | 5.21 | 23.6 | 27.14 | 556.96 | 122.96
Σ | 142.75 | 645.0 | 682.62 | 14104.82 | 3094.54
* MEAN
The mean of the values (which are in fact means themselves!) can be calculated:
x̄ = 142.75/30 = 4.76 (2dp)
ȳ = 645/30 = 21.5
FINDING BEST FIT LINE-see scatter graph:
LSA (lower semi-average): add all the (y) figures below 21.5 and all the (x) figures below 4.76, and divide each total by (n), the number of participating values.
USA (upper semi-average): add all the (y) figures above 21.5 and all the (x) figures above 4.76, then divide each total by (n).
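The semi-average construction described above can be written out as a short function (a sketch of the method as described, on an illustrative set of points, not the coursework's actual working):

```python
def semi_average_line(points):
    """Fit a line by semi-averages: average the points at or below the
    mean of x, average those above it, and join the two semi-average
    points; returns (gradient, intercept)."""
    mean_x = sum(x for x, _ in points) / len(points)
    lower = [p for p in points if p[0] <= mean_x]
    upper = [p for p in points if p[0] > mean_x]
    lx = sum(x for x, _ in lower) / len(lower)
    ly = sum(y for _, y in lower) / len(lower)
    ux = sum(x for x, _ in upper) / len(upper)
    uy = sum(y for _, y in upper) / len(upper)
    gradient = (uy - ly) / (ux - lx)
    return gradient, ly - gradient * lx

# Points lying exactly on y = 2x recover gradient 2 and intercept 0
m, c = semi_average_line([(1, 2), (2, 4), (3, 6), (4, 8)])
```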
Draw the line through the mean point, joining the two semi-averages. As you can see, there is a very strong positive correlation, and I have worked out Pearson's product-moment correlation coefficient:
Standard deviation of x: Sx = √((1/n)Σx² − x̄²)
I then used Excel to work out the coefficient, which comes out as 0.9362, which shows that there is a strong correlation between the two variables.
BAR CHARTS:
The mean word length is slightly negatively skewed and the mean sentence length appears to be bimodal. However, with no values equal to 19, this can be regarded as an anomaly in the results, despite the large number of samples.
CONCLUSION
There are several connections in the way the articles are written. The correlation suggests that the reporters' training must be similar, or that the editorial team is very consistent in weeding words out and adding them in. Such correlations can even be used to identify torn-off pieces of articles; indeed similar analysis has been applied to Shakespeare, to test whether what is attributed to him is actually his. However, it is too crude an instrument to determine writers' styles exactly.
But there is an unmistakable relationship: a longer sentence will, on average, have a greater mean word length than a shorter one, and the correlation between them is strong, as shown by the calculations and the graph.
As shown on the bar charts, the 19-word sentence length is empty: no article produced a mean sentence length of 19 words, which seems strange. Random sampling is all about chance, and if by chance there are no 19-word sentence lengths, it could be an unwritten rule among writers, or it could be that I do not have enough samples. In my keenness to gather lots of articles, I had missed a random article that should have been included the first time (and no, it does not contain a 19-word mean sentence length either), but I checked the percentage error it introduces, which is 100 × 0.10/24.7 = 0.4%. This error is low, showing that the correlation between word and sentence length is not unfounded.
EVALUATION
As with all sampling methods, inconsistencies and errors are almost guaranteed to introduce inaccuracies into the data. Some data could also have been lost or corrupted in transmission from the web servers, which adds further inaccuracies.
To improve, I could use a wider sample base (at least 100 different articles), as suggested by the failure to find a 19-word sentence length, which shows that randomness doesn't always work; I might therefore consider stratified sampling, so as to get a part of everything. It might also be good to sample by subject content instead of by source (the source sampling suggested that, at the Independent, the reporters might have trained together or the editors are very consistent; see above for details). This would cover more texts and could support a new hypothesis, if I have time to pursue it, on whether there is a style specific to a type of profession. It would also be interesting to look at freelance reporters who work for different newspapers: do they change their style with the newspaper, and if so, how?
I could also do a project on the changing writing style of a newspaper over the years, through the archive systems on newspaper websites, to see how word lengths and sentence lengths vary over time.
Notice that there is an anomaly at the 21st mean sentence length, which deviates from the main correlation by about 3 words on the mean sentence length axis, or 0.25 of a letter on the mean word length axis. Although this is a small deviation, it shows how strong the overall correlation is.
However, a possible reason for no 19-word value appearing is that the sentence lengths are all means: even if an article contains a 19-word sentence, the mean for the whole article need not be 19.
Finally, I should make the investigation fairer by using tabloids as well as broadsheet papers; so if I am going to improve this investigation, I will try different kinds of newspapers, along with the other improvements described above.
Appendix
The Independent website for the 30+ articles that I sampled is at: www.theindependent.co.uk
Philip Xiu GCSE Maths Statistics Coursework