Introduction

I have always found it fascinating how the English language is built up and whether aspects of writing have an effect on other parts of the text. Using bivariate data analysis, I hope to use statistics to highlight the connections in English writing.

In this investigation I will be analysing any correlation between the average lengths of words and sentences within articles written by reporters working for The Independent newspaper. It is an intriguing theory that I would really like to prove or disprove: do longer sentences mean that people use longer words and, if so, is there a strong correlation? If there is a negative correlation, does it indicate that the writer has to use a large number of short words to get the same point across as fewer, longer words would?
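As a rough illustration of how the strength of such a correlation could be measured, here is a minimal sketch in Python. It assumes the product-moment (Pearson) correlation coefficient is the measure used, which the text does not name. The function name `pearson_r` and the sample values are hypothetical and are not data from this investigation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)


# Made-up illustrative values only, NOT data from this investigation.
avg_sentence_lengths = [18.2, 22.5, 25.1, 20.0, 27.3]  # words per sentence
avg_word_lengths = [4.6, 4.8, 5.1, 4.7, 5.0]           # characters per word
r = pearson_r(avg_sentence_lengths, avg_word_lengths)
```

A value of `r` near +1 would suggest that longer sentences go with longer words, a value near -1 would suggest the opposite, and a value near zero would suggest no linear relationship.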

 

Data Collection

The data needed for this investigation was collected from The Independent newspaper website. The first three headlines under UK News and World News were taken for analysis. As I wanted to balance the types of articles, I endeavoured to analyse articles from different topical areas, such as transport, health and politics, if the headlines were biased towards a certain subject area. The parent population of the data is all news articles published online by the paper in October and November. Half of the data collected comes from weekday editions and the remainder from weekend news on the site. This sampling process should represent the parent population and can be treated as an acceptably random sampling process.

The Independent newspaper was chosen for the ease of collecting data from its well laid-out site and for its consistent style of reporting. Unlike other newspapers, The Independent relies heavily on its own reporters rather than secondary sources. This should help to show whether a correlation exists within the same style of writing. If a range of newspapers were used, a correlation would be much less likely to appear, because writing styles change from paper to paper. I have seen this in practice: when I analysed data from 60 articles taken from tabloid and broadsheet newspapers, the correlation was very close to zero and the scatter chart showed no apparent line of best fit. The data source was the same as that of a previous piece of coursework looking at newspaper readability (see: ), another example of my statistical research into newspaper article prose.

The sole reason for using the internet is to save an inordinate amount of time copying prose from the printed Independent newspaper, and to avoid the hassle and inaccuracy of OCR (optical character recognition) using a scanner. On the internet it is relatively easy to find articles on well-known newspaper web sites, and it is quite a simple process to copy, organise and process the texts to give a list of statistics. I used Internet Explorer 5 to browse the web for the sample, and Word 2000 produced an array of statistics on each of the articles. Sources are listed in the appendix. Rounding errors will not occur, as Excel 2000 will refer to the original data for each calculation. Numbers printed in this investigation will be rounded to 4 s.f.
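For illustration, the two statistics taken from each article could be computed along the following lines. This is only a sketch: the rules for what counts as a word or a sentence are my own assumptions and will not match Word 2000 exactly. The names `text_statistics` and `round_sig` are hypothetical. Rounding is applied only when a number is displayed, so calculations still use the unrounded values.

```python
import re
from math import floor, log10


def text_statistics(text):
    """Return (average word length in characters, average sentence length in words)."""
    # Assumption: a sentence ends at '.', '!' or '?'. Abbreviations and
    # decimal numbers would be split wrongly by this simple rule.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters, optionally joined by internal
    # apostrophes or hyphens (which are counted as characters of the word).
    words = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    if not words or not sentences:
        raise ValueError("text contains no words or sentences")
    avg_word_length = sum(len(w) for w in words) / len(words)
    avg_sentence_length = len(words) / len(sentences)
    return avg_word_length, avg_sentence_length


def round_sig(value, figures=4):
    """Round a number to the given count of significant figures (display only)."""
    if value == 0:
        return 0.0
    return round(value, figures - 1 - floor(log10(abs(value))))


# A made-up two-sentence example, not taken from the sample.
word_len, sentence_len = text_statistics("The cat sat. A dog barked loudly!")
```

Each article would then contribute one pair of values (average sentence length, average word length) to the bivariate data set.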

 

Assumptions

The length of sentences and words can be considered as having random values. For any article read in the newspaper, it is quite difficult to give a reasonable average word or sentence length at a glance. There will not be any obvious link between the two variables from the careful study of a few articles; therefore I will study this in more depth, using a larger sample of 50 or more articles taken using a process which should be both random and representative of the parent population of English newspaper articles as a whole. In the hope of using a representative sample, headings, dates, names of reporters and listed points will be omitted from the data analysis. Including them would have affected the sample by producing an unfair bias towards longer words and shorter sentences.

The date on which the data was acquired should not have a negative effect on the data; it should continue to be equally representative of the parent population. Spelling of words will not be corrected in the data sample, mainly due to the problem of recognising uncommon names and terminology specific to a certain situation.

I am conscious that incorrect spelling will have an effect on the data; the joining of words appears to be worryingly consistent in some newspapers. Typos (typing errors) are certainly inevitable and are a familiar part of newspapers, so I can consider them a valid part of the sample. However, I ...
