The investigation of the average number of letters per word in a broadsheet newspaper compared to a tabloid paper.
Data Handling Coursework 1 - Lewis Harris
Introduction
The aim of this investigation is to see if there is a relationship between the type of newspaper and the length of the words in that paper.
In this coursework I will be using two newspapers. One newspaper is The Times, which is a broadsheet. The other newspaper is The Sun, which is a tabloid. I am assuming that The Times is representative of all broadsheet newspapers, and my theory is that broadsheet newspapers are aimed at the higher-class, more intellectual reader. I predict that the average length of a word from the broadsheet newspaper will be longer than the average length of a word from the tabloid. I am assuming that The Sun is a good representative of tabloid newspapers, and since it is aimed at a lower-class audience it will have shorter words because the audience will find these easier to read.
The Hypothesis
The broadsheet will contain, on average, more letters per word.
The Plan
I will choose an article at random from one newspaper, and then I will look in the other newspaper for an article about the same topic. I think that if the articles are about the same subject it will ensure that the data is not biased. I am going to assume that these articles are representative of other articles in the respective newspapers.
I will call the article from The Times article 1, and the article from The Sun article 2. I will count the number of words in each article, but I will exclude words with two letters or fewer, and I will not include the title. This will make my result more accurate.
Article 1 contains 450 words and Article 2 contains 479 words. The population in each article is almost the same, but I think that using every word would make the sample too large, because it will take me too long to collect the data from over 400 words. I prefer to save time and spend more of it on the analysis of the data. I think that a good sample size will be about 40 words, because this will be enough to give me a good result. I will use systematic sampling to obtain 40 words from each article, because this way I can obtain my data from across the whole article.
Systematic sampling is a method of taking a fixed number of items (here, words) at regular intervals from the whole population of words in the article. The sample will then provide the data which I will analyse.
Article 1 contains 450 words
450 ÷ 40 = 13.25
This means that if I count every thirteenth word, I will end up with a sample of 40 words, which is the number of words that I need for my sample. I will then count the number of letters per word.
I will have to take precautions to make sure that the data is reliable. First, as I am reading through the article, I will cross out all the words with one or two letters, which will help me to count every thirteenth word. Then I will go through the article and underline every thirteenth word and write it down, so that the words of my sample are written clearly. I can then easily count the number of letters in each word, and it will be easy for me to check my readings.
I will use exactly the same method for article 2, which is also a good precaution because it will make the statistical analysis more accurate.
Article 2 contains 479 words
479 ÷ 40 = 11.25
In this case I will count every eleventh word, and this will give me a sample of 40 words.
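The systematic sampling described above can be sketched in a few lines of Python. This is only an illustration: the function name `systematic_sample` and the placeholder word list are assumptions made for the example, and the interval is taken as the population size divided by the sample size, rounded down.

```python
def systematic_sample(items, sample_size):
    """Return every k-th item, where k = len(items) // sample_size."""
    if sample_size <= 0 or sample_size > len(items):
        raise ValueError("sample_size must be between 1 and len(items)")
    step = len(items) // sample_size  # sampling interval, rounded down
    # Start at the k-th item (index step - 1) and take every k-th item after it.
    return items[step - 1::step][:sample_size]


# Placeholder "article" of 479 numbered words (not the real article text).
words = [f"word{i}" for i in range(1, 480)]
sample = systematic_sample(words, 40)

print(len(words) // 40)  # the sampling interval
print(len(sample))       # how many words were selected
print(sample[:3])        # the first few selected words
```

Rounding the interval down means the count can reach the end of the article slightly early, so the sketch simply stops once 40 words have been taken.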
Collection Of Data for article 1
Here are the words from article 1
Protest
The
Rude
Tripled
Corrected
Afternoon
Insisting
Straw
Thought
Decidedly
That
Cancelled
Lasted
Two
Exposed
Wonderful
Them
Herr
Differences
The
And
Was
Irritated
Outmanoeuvred
Should
Them
Talks
Agreement
Present
Night
Against
Said
Farming
Said
Disagree
Giscard
Presenting
Said
Because
Already
Europe
Talks
Recording Of The Data
I will record the numerical data in a tally chart because this will help me to show the results more clearly, and it will also be easier to analyse the results.
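As a rough sketch of how such a tally could be built automatically, the Python below counts the letters in each word and groups the words by length. The function name `tally_word_lengths` is an assumption made for this example, and only the first six sample words from article 1 are used, purely as an illustration.

```python
from collections import Counter


def tally_word_lengths(words):
    """Count how many words there are of each length (letters only)."""
    lengths = [sum(ch.isalpha() for ch in word) for word in words]
    return Counter(lengths)


# First six sample words from article 1, used only as an illustration.
sample_words = ["Protest", "The", "Rude", "Tripled", "Corrected", "Afternoon"]
tally = tally_word_lengths(sample_words)

# One row per word length: length, frequency, total number of letters.
for length in sorted(tally):
    frequency = tally[length]
    print(length, frequency, length * frequency)

total_words = sum(tally.values())
total_letters = sum(length * freq for length, freq in tally.items())
print(total_words, total_letters)
```

Each printed row corresponds to one row of a tally chart, and the two totals are the figures needed later to work out the mean.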
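Tally Chart for article 1

| Number of letters per word | Tally | Frequency | Total number of letters |
|---|---|---|---|
| 3 | | 3 | 3 x 3 = 9 |
| 4 | | 8 | 4 x 8 = 32 |
| 5 | | 3 | 5 x 3 = 15 |
| 6 | | 6 | 6 x 6 = 36 |
| 7 | | 9 | 7 x 9 = 63 |
| 8 | | 3 | 8 x 3 = 24 |
| 9 | | 5 | 9 x 5 = 45 |
| 10 | | 2 | 10 x 2 = 20 |
| 11 | | 1 | 11 x 1 = 11 |
| 12 | | | |
| 13 | | 1 | 13 x 1 = 13 |

Total number of words in the sample = 41
Total number of letters in the sample = 268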
Collection of data for article 2
Here are the words from article 2
His
Bust
Before
The
Last
The
Friday
Insisted
France
Scored
Told
Common
Rebate
And
His
Decision
The
With
Biting
The
But
Will
And
Axed
With
The
Massive
Revised
The
Blair
Failing
The
Old
This
Agreement
Deal
You
Formal
The
Draft
Was
The
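Tally chart for article 2

| Number of letters per word | Tally | Frequency | Total number of letters |
|---|---|---|---|
| 3 | | 16 | 3 x 16 = 48 |
| 4 | | 10 | 4 x 10 = 40 |
| 5 | | 2 | 5 x 2 = 10 |
| 6 | | 9 | 6 x 9 = 54 |
| 7 | | 2 | 7 x 2 = 14 |
| 8 | | 1 | 8 x 1 = 8 |
| 9 | | 2 | 9 x 2 = 18 |
| 10 | | 0 | |
| 11 | | 0 | |
| 12 | | 0 | |
| 13 | | 0 | |

Total number of words in the sample = 42
Total number of letters in the sample = 192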
Presentation Of Data
The data is not continuous, but it is discrete, and so I can present this in a more convenient form to clearly show the results by using a frequency bar chart.
From the overall shape of the graph I will be able to investigate the peak and spread of the data.
For example, if the two sets peak in different places and the results for one set of data are further across than the other, then this will be evidence that the words in the broadsheet have, on average, more letters than those in the tabloid.
I will also be able to investigate the spread of the data. The wider the spread, the more likely it is that an individual word length lies far from the mean.
Interpretation of the bar graphs
The bar graphs show me that I have taken a reasonable sample of the population. This is important if my results are going to be valid.
The peak in the graph shows me the item of data that occurs the most often; this is the mode.
The peak for article 1 occurs at 7.
The peak for article 2 occurs at 3.
Seven-letter words are the most frequent in the broadsheet, compared with three-letter words in the tabloid.
I can see that my hypothesis seems to be correct, as the bar graphs seem to indicate that the words in the broadsheet have more letters than those in the tabloid.
There are some other statistical values, which are important for this investigation.
* The median is the item of data in the middle once all the items have been put in order of size, from lowest to highest. It can be found from the tally chart above by adding up the frequencies in order until the 20th item is reached.
* The mean is the sum of all the items of data divided by the number of items. It is calculated by counting the total number of letters and then dividing by the total number of words.
* The range is the difference between the highest item of data and the lowest item of data.
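The three definitions above, together with the mode, can be checked with a short Python sketch. It assumes the frequencies read from the article 2 (The Sun) tally chart, namely 16 three-letter words, 10 four-letter words and so on; the variable names are chosen for this example only. Note that `statistics.median` averages the two middle items when the number of items is even, which is slightly different from picking out the 20th item by hand.

```python
import statistics

# Frequencies as read from the article 2 (The Sun) tally chart:
# word length -> number of words of that length.
sun_tally = {3: 16, 4: 10, 5: 2, 6: 9, 7: 2, 8: 1, 9: 2}

# Expand the table back into one value per word, in ascending order.
lengths = sorted(
    length for length, freq in sun_tally.items() for _ in range(freq)
)

mode = statistics.mode(lengths)            # most frequent word length
median = statistics.median(lengths)        # middle of the ordered data
mean = sum(lengths) / len(lengths)         # total letters / total words
data_range = max(lengths) - min(lengths)   # highest minus lowest

print(mode, median, round(mean, 2), data_range)
```

The same approach should work for article 1 by swapping in its tally chart frequencies.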
Table of Data

| | THE TIMES (Article 1) | THE SUN (Article 2) |
|---|---|---|
| Mode | 7 | 3 |
| Median | 6 | 4 |
| Mean | 6.56 | 4.57 |
| Range | 10 | 6 |
From the table above I can see that the range of data for The Times is much larger than that of The Sun.
So I am going to represent the numbers by creating a box plot. This is a good way of displaying data for comparison.
This requires five pieces of data:
* The lowest value (the lowest number of letters in a word)
* The lower quartile Q1 (the number that is taken at the first quarter of the data)
* The median Q2 (the number which is the middle number of the data)
* The upper quartile Q3 (the number that is taken at three quarters of the data)
* The highest value (the highest number of letters in a word)
These data are always placed against a scale so that their values are easily plotted.
The scale will be the number of letters per word
This plot shows the interquartile range as a box. The interquartile range is the difference between the upper and lower quartiles.
This is a vital statistic because it eliminates the extreme values and shows us the middle 50% of the data. In my investigation this is necessary because in Article 1 there was a large range.
My investigation used samples of 40 words, so the lower quartile is Q1 = 40 ÷ 4 = 10th item of data.
The median is Q2 = 40 ÷ 2 = 20th item of data.
The upper quartile is Q3 = (40 x 3) ÷ 4 = 30th item of data.
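The quartile positions above can be applied to an ordered list of word lengths as in the following sketch. It assumes the article 2 (The Sun) frequencies as read from its tally chart and uses the 10th, 20th and 30th items as stated in the text; the helper `item_at` is a name made up for this example.

```python
def item_at(sorted_values, position):
    """Return the item at a 1-based position in an ordered list."""
    return sorted_values[position - 1]


# Article 2 (The Sun) word lengths, expanded from its tally chart.
sun_tally = {3: 16, 4: 10, 5: 2, 6: 9, 7: 2, 8: 1, 9: 2}
lengths = sorted(
    length for length, freq in sun_tally.items() for _ in range(freq)
)

n = 40  # nominal sample size used in the text
q1 = item_at(lengths, n // 4)      # 10th item of data
q2 = item_at(lengths, n // 2)      # 20th item of data
q3 = item_at(lengths, 3 * n // 4)  # 30th item of data

five_numbers = (min(lengths), q1, q2, q3, max(lengths))
iqr = q3 - q1  # interquartile range: the width of the box

print(five_numbers, iqr)
```

The five numbers are the values needed to draw one box plot, and the interquartile range is the length of its box.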
| | THE TIMES (Article 1) | THE SUN (Article 2) |
|---|---|---|
| Lowest value | 3 | 3 |
| Q1 | 5 | 3 |
| Q2 | 6 | 4 |
| Q3 | 8 | 6 |
| Highest value | 13 | 9 |
Now I can create two box plots on the same grid; this will allow me to compare the data.
MY CONCLUSIONS
My hypothesis that the broadsheet will contain on average more letters per word is supported by my results.
I can't be sure that the relationship I found is genuine because the data I got could be due to chance.
I feel that to improve the reliability, I could have taken more samples from the article. Also, I only looked at one article from each paper, whereas investigating a larger number of articles would be more accurate.
Another improvement could be to look at more newspapers, for example three tabloids and three broadsheets.
To find out if my data gives strong enough evidence, I'd have to do a statistical test.
I think there may have been some errors in my results, because I included proper nouns. I think it would have been less biased if I had not included these in my sample, although I did try to reduce the bias by crossing out all the one- and two-letter words.
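One possible form of the statistical test mentioned above is a permutation test, sketched below in Python. This is an assumption about which test could be used, not the test the coursework had in mind. It takes the frequencies read from the two tally charts, shuffles all the word lengths together many times, and counts how often a random split gives a difference in means at least as large as the one observed. The function names, the 2000 shuffles and the fixed seed are all choices made for this example.

```python
import random


def expand(tally):
    """Turn a {word length: frequency} table into a list of word lengths."""
    return [length for length, freq in tally.items() for _ in range(freq)]


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a one-sided p-value for mean(group_a) > mean(group_b)."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = group_a + group_b
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if mean(shuffled_a) - mean(shuffled_b) >= observed:
            at_least_as_large += 1
    # Add one to top and bottom so the estimate is never exactly zero.
    return observed, (at_least_as_large + 1) / (n_permutations + 1)


# Frequencies as read from the two tally charts (assumed to be correct).
times_tally = {3: 3, 4: 8, 5: 3, 6: 6, 7: 9, 8: 3, 9: 5, 10: 2, 11: 1, 13: 1}
sun_tally = {3: 16, 4: 10, 5: 2, 6: 9, 7: 2, 8: 1, 9: 2}

observed, p_value = permutation_test(expand(times_tally), expand(sun_tally))
print(round(observed, 2), p_value)
```

A small p-value would suggest that a difference this large is unlikely to be due to chance alone, although it would still only apply to these two articles.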
DATA HANDLING COURSEWORK 2 - LEWIS HARRIS
Hypothesis 2 - The broadsheet will contain more words per sentence than the tabloid paper
PREDICTION
I predict that, as The Times is aimed at a higher class of people, there will be more words per sentence than in The Sun, whose readers would prefer shorter sentences because they need less concentration and so are easier to read.
THE PLAN
I will use the same articles as in coursework one, because this will be less biased.
I have counted the number of sentences in the article from The Sun, and I found it contains 30 sentences. This is a reasonable sample size, because it will not be too time-consuming but it will give me enough data to investigate. I will use all the sentences for my data population.
The article from The Times, however, contains 34 sentences, which is a problem, because if I compare 34 sentences to 30 sentences it will be biased.
I would like to use 30 sentences for my sample size. I can do this by using systematic sampling.
34 ÷ 30 = 1.13, so I will miss out one sentence from every ten sentences in The Times article.
I will now have an equal number of sentences in both samples.
There is no bias because all the sentences in The Sun article have been used, and I used systematic sampling for the other article to make the samples equal, i.e. 30 sentences each. The articles are about the same subject, so again this reduces the bias.
COLLECTING THE DATA
I will count the words in each sentence, and record the data on a tally chart.
If I put these results in a table it will be clear.
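Counting the words in each sentence could be sketched in Python as follows. The sentence-splitting rule (a sentence ends with a full stop, exclamation mark or question mark followed by a space) is a simplifying assumption, the function name `words_per_sentence` is made up for this example, and the text is invented rather than taken from either article.

```python
import re


def words_per_sentence(text):
    """Split text into sentences and count the words in each one."""
    # Assumption: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(sentence.split()) for sentence in sentences]


# Made-up text for illustration; it is not taken from either article.
text = "The talks lasted two days. Nobody agreed! Will they meet again next week?"
counts = words_per_sentence(text)

print(counts)                     # number of words in each sentence
print(max(counts) - min(counts))  # range of the sentence lengths
```

The resulting list of counts can then be tallied into a frequency table in the same way as the word lengths in coursework one.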
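FREQUENCY TABLE FOR THE TIMES

| Number of words per sentence | Tally | Frequency |
|---|---|---|
| 8 | | |
| 9 | | |
| 10 | | |
| 11 | | |
| 12 | | |
| 13 | | |
| 14 | | |
| 15 | | |
| 16 | | 2 |
| 17 | | |
| 18 | | |
| 19 | | 3 |
| 20 | | |
| 21 | | |
| 22 | | 2 |
| 23 | | 3 |
| 24 | | 4 |
| 25 | | |
| 26 | | |
| 27 | | 3 |
| 28 | | |
| 29 | | |
| 30 | | |
| 31 | | 2 |
| 34 | | |

Range = 34 - 8 = 26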
I will collect the data from The Sun using the same technique.
FREQUENCY TABLE FOR THE SUN
Range = 12
At the bottom of each table I have included the range to help me see more clearly how to further the investigation.
I can see that the range is wide for The Times, so I think the best way to show my results will be to create a box plot. This will eliminate the extreme values and give me a more accurate comparison.
DATA FOR THE BOX PLOTS

| | THE SUN | THE TIMES |
|---|---|---|
| Q1 | 11 | 19 |
| Q2 | 14 | 23 |
| Q3 | 16 | 27 |
| Lowest value | 7 | 8 |
| Highest value | 23 | 34 |
I will draw the box plots on the same grid to make them easier to compare.
INTERPRETATION OF RESULTS
It is clear that the box plot for The Times is much further to the right, which shows that there are more words per sentence in The Times, whose interquartile range runs from 19 to 27 words per sentence. The interquartile range for The Sun runs from 11 to 16 words per sentence.
CONCLUSION
The results of the investigation seem to agree with my hypothesis, and there is a correlation between the type of newspaper and the length of the sentences in it, because the broadsheet contains more words per sentence than the tabloid.
I should really have studied more than one broadsheet and tabloid, but I did not have enough time. Also, I could have studied more than one article in each newspaper.
A significance test would show me how likely it is that there is a correlation between the length of the sentence and the type of paper. This should have been carried out to test the hypothesis.
Lewis Harris