Statistics Coursework – Football

Introduction

I have chosen to base my project on football statistics because they are both readily available and interesting enough for deep analysis. As a starting point I decided to look at the generally accepted theory of ‘Home Advantage’.

Home advantage, or the tendency for the home team to do better than they would away, could have several causes. It could be partly psychological – the home team would almost always have the majority of the crowd behind them, cheering them on. It could also be to do with the condition of the pitch – Premiership teams sometimes find it hard to play on muddy, waterlogged pitches of some lower-division teams.

Another factor is the attitudes of referees and officials. If they are intimidated by the home crowd they may give more decisions in favour of the home team, meaning teams may also have a worse disciplinary record when playing away.

Hypotheses:

1. Teams have a worse disciplinary record away than at home
2. Better attended teams have a greater home advantage
3. More successful teams have a better disciplinary record

Collecting Data

I found that football statistics were easy to find on the internet. I obtained mine from two main sites:

http://soccer-stats.football365.com

http://www.bettingzone.co.uk

There is a very small risk that some of the data I collected could be incorrect. However, I have found alternative sites for the Premiership statistics (such as www.4thegame.com) which gave the same results. I also think that a betting site must give accurate statistics because they are such an important part of gambling.

Using Software

I chose to input my data into Microsoft Excel because it makes it much quicker and easier to manipulate the data.

Hypothesis 1 – Teams have a worse disciplinary record away than at home

Discipline ‘points’ system

On the internet I was able to find out the numbers of red and yellow cards for each team at home and away. However, in order to give an overall impression of how good or bad the team’s discipline was I needed to turn these two pieces of data into one measurement. I decided to use the points system (as on www.4thegame.com). Under this system a yellow card counts for one point whereas a red card is more severe and counts for three.

To make this easier to calculate I used formulae in Excel.

Because some divisions have different numbers of teams than others, some teams played more games than others. This means their players had slightly more opportunities to get booked or sent off, so their points totals might be higher. To correct for this I divided the points scores by the number of games each team had to play to give a ‘Disciplinary Points Per Game’ score. This can then be compared to any other team in any division.

To give a measure of how much better or worse the team’s disciplinary record is away and at home I decided to divide the away points per game score by the home. I subtracted one from this and expressed it as a percentage. This gives a positive percentage if the team has a worse disciplinary record away and a negative one if it is worse at home.
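The calculations described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Excel formulae actually used; the function names and the card counts for the example team are made up.

```python
def discipline_points(yellows, reds):
    """Total disciplinary points: 1 per yellow card, 3 per red card."""
    return yellows * 1 + reds * 3


def points_per_game(points, games):
    """Scale a points total by the number of games played."""
    return points / games


def away_vs_home_percent(away_ppg, home_ppg):
    """Percentage by which the away score exceeds the home score.

    Positive means a worse record away; negative means worse at home.
    """
    return (away_ppg / home_ppg - 1) * 100


# Made-up figures for one hypothetical team (not taken from the data set).
home_ppg = points_per_game(discipline_points(yellows=20, reds=1), games=23)
away_ppg = points_per_game(discipline_points(yellows=30, reds=2), games=23)
difference = away_vs_home_percent(away_ppg, home_ppg)
```

Note that the percentage cannot be calculated for a team with no disciplinary points at home, because that would mean dividing by zero.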

Pilot Study

In order to find out how well my data would support my hypothesis about teams having a worse disciplinary record away than at home I made a bar chart using Excel to show the difference between disciplinary points per game away and at home.

As you can see most teams have a considerably worse disciplinary record away than at home, as shown by the taller red bars. For this bar chart I simply ranked the teams in the Premiership and the First Division from the top of the Premiership (1) to the bottom of Division 1 (44). The names of these teams can be found in the appendix at the back.

Stratified random sampling

In order to better represent football at other levels of the game I also collected data for lower divisions (Division 2 and Division 3). However this gave me far too much data – a total of 92 teams – to perform statistical tests such as the Wilcoxon Signed Rank Test. In order to cut down on this I decided to use random sampling to lower the number of teams involved.

However, if I just randomly selected teams from all of the divisions put together I might over-represent some divisions over others, affecting the results. To make this fairer I decided to use stratified random sampling, with the different divisions as the strata. This way I was sure to get proportionate numbers of teams from each division.

I chose to take 25% of the teams in each division, to give me 23 sets of data – a much more manageable figure! I chose the teams by writing the numbers of the teams in each division e.g. 1-24 on small pieces of paper. I folded these up, shuffled them and picked them at random until I had the right number.
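The same selection could be done on a computer instead of with pieces of paper. The sketch below is an assumption about how that might look in Python: the function name and the fixed seed are made up, and the division sizes (20, 24, 24 and 24) are inferred from the total of 92 teams.

```python
import random


def stratified_sample(division_sizes, fraction=0.25, seed=None):
    """Pick `fraction` of the team numbers from each division at random."""
    rng = random.Random(seed)
    sample = {}
    for division, size in division_sizes.items():
        team_numbers = range(1, size + 1)   # teams numbered by league position
        k = round(size * fraction)          # 25% of 20 is 5, 25% of 24 is 6
        sample[division] = sorted(rng.sample(team_numbers, k))
    return sample


# Division sizes inferred from the text: 20 + 24 + 24 + 24 = 92 teams.
sizes = {"Premiership": 20, "Division 1": 24, "Division 2": 24, "Division 3": 24}
chosen = stratified_sample(sizes, seed=1)
total = sum(len(teams) for teams in chosen.values())  # 5 + 6 + 6 + 6 = 23
```

Because each division is sampled separately, every division contributes the same proportion of its teams, which is the point of using the divisions as strata.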

Once I had chosen the teams I put them in a new spreadsheet. I produced another bar chart similar to the one I had produced for the preliminary test. This illustrates how well my randomly sampled data supports my hypothesis.

As you can see the pattern I noticed in the pilot study is continued with the data from the other divisions. The teams’ away disciplinary record is in almost all cases worse than at home.

As further evidence of this I found the mean disciplinary points per game at home and away. At home this was about 1.71 compared to about 2.28 away (to 3 significant figures). This shows a 33% difference between the two. I will now test whether or not this difference is statistically significant. I chose to compare the means of the two sets because this gives more weight to big differences between two scores than small differences.

Wilcoxon signed-rank test

Although graphs and charts can illustrate trends in data they cannot show whether those trends are statistically significant. In order to test my hypothesis properly I will have to use a statistical test. Because I have no reason to believe my data follows a normal distribution I need a nonparametric test, and because I am comparing pairs of data from two categories I will use the Wilcoxon signed-rank test.

Method:

First I found the difference between the home and away disciplinary points per game for each team by subtracting one from the other using Excel.
Because some of the differences were negative I used the abs() function in Excel to find the absolute values of the differences.
I sorted the data by the absolute differences between the home and away disciplinary points per game. Ignoring the teams where the difference was zero, I ranked them in order from the lowest to the highest. Where several were the same I found the mean between them.
I then looked to see where the differences had originally been negative and I added the negative sign in front of the rank for those differences. This gave me the signed rank.
Finally I found the greatest absolute sum of the signed rank (in this case the negative ranks), which is the ‘W’ value. The number of teams where the difference is not equal to zero gives the ‘N’ value.
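The steps above can be sketched in Python as follows. This is a minimal illustration rather than the spreadsheet actually used; the function names and the example figures are made up. It returns both rank sums, since the method above takes the larger one while published tables are usually entered with the smaller.

```python
def mean_ranks(values):
    """Rank values from smallest (1) upwards; tied values share their mean rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1        # lowest rank held by this value
        count = ordered.count(v)            # how many values tie with it
        ranks.append(first + (count - 1) / 2)
    return ranks


def wilcoxon_rank_sums(home, away):
    """Return (sum of positive ranks, sum of negative ranks, N)."""
    # Rounding avoids tiny floating-point errors hiding genuine ties or zeros.
    diffs = [round(h - a, 6) for h, a in zip(home, away)]
    diffs = [d for d in diffs if d != 0]    # ignore teams with no difference
    ranks = mean_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)


# Made-up points-per-game figures for five hypothetical teams.
home = [1.0, 2.0, 1.5, 2.0, 1.0]
away = [1.5, 2.0, 2.5, 1.75, 1.5]
w_plus, w_minus, n = wilcoxon_rank_sums(home, away)
```

The two sums always add up to N(N + 1)/2, so either one can be found from the other.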

I found that the value of W was 195, and that N, the number of teams where the difference was not equal to zero, was 20. Since the ranks 1 to 20 add up to 210, the sum of the positive ranks was only 15. Looking these up in a table of critical values (OCR AS/A Level MEI Structured Mathematics Examination Formulae and Tables, October 2000) I found that the result is significant at the 5% level: if there were really no difference between home and away records, a result this extreme would occur less than 5% of the time. The difference between disciplinary record at home and away is therefore unlikely to be due to chance alone, and my hypothesis is supported by the data.

Hypothesis 2 – Better attended teams have a greater home advantage

I proposed this hypothesis because a better attended team would have more of the crowd behind them when playing at home, giving them a psychological advantage over their opponents.

As with the disciplinary points system, I used Excel to find the points per game score for each team both at home and away. This time I divided the home points per game score by the away and subtracted one from this, expressing it as a percentage.

A problem arises because some teams have much bigger stadiums than others. For example, 20,000 might be considered good attendance for a First Division club, but very poor for a Premiership team. Because of this I divided the average number of home supporters by the total capacity of each football ground to give the average attendance percentage. I plotted this against the home advantage percentage in a scatter graph.
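The two percentages plotted on the scatter graph can be sketched in Python as below. The function names and the example club are made up, and the attendance percentage is assumed to mean average attendance as a fraction of ground capacity.

```python
def home_advantage_percent(home_ppg, away_ppg):
    """Percentage by which league points per game at home exceed those away."""
    return (home_ppg / away_ppg - 1) * 100


def attendance_percent(average_attendance, capacity):
    """Average home attendance as a percentage of the ground's capacity."""
    return average_attendance / capacity * 100


# A made-up club: 30,000 average attendance in a 40,000-capacity ground,
# winning 2.0 points per game at home and 1.25 away.
filled = attendance_percent(30000, 40000)
advantage = home_advantage_percent(2.0, 1.25)
```

Expressing attendance as a percentage of capacity means a small, full ground and a large, full ground are treated alike.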

Pilot Study

The scatter graph is a useful way of looking for correlation between two variables. As with the first hypothesis I used the data for the Premiership and the First Division as a pilot test.

As you can see there is no strong correlation between these two variables. There may be a slight trend for the higher home advantage percentages to be towards the higher percentages of stadium capacity. I decided to continue investigating this hypothesis because there might be clearer correlation in the data from the other divisions.

Spearman’s Rank

In order to test whether or not there is correlation between home advantage and attendance I will need to use a statistical test. Because this data is also nonparametric I will use Spearman’s Rank Correlation Coefficient.

Method:

The first step was to rank the teams by both % Home Advantage and Average % Capacity. As with the Wilcoxon test I found the mean of tied ranks.
I found the difference between these two ranks by subtracting one from the other using Excel.
I then squared the differences between the two ranks.
I used the formula below to find rs, the Spearman’s Rank Correlation Coefficient. My workings are illustrated in the table overleaf.

$$ r_s = 1 - \frac{6 \sum d^2}{n^3 - n} $$
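As a check on the method, the same calculation can be sketched in Python. This is an illustration only, with made-up function names and data; it follows the formula above, which is strictly an approximation when there are tied ranks.

```python
def mean_ranks(values):
    """Rank values from smallest (1) upwards; tied values share their mean rank."""
    ordered = sorted(values)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1        # lowest rank held by this value
        count = ordered.count(v)            # how many values tie with it
        ranks.append(first + (count - 1) / 2)
    return ranks


def spearman_rs(x, y):
    """Spearman's rank correlation coefficient: 1 - 6*sum(d^2) / (n^3 - n)."""
    n = len(x)
    rank_x, rank_y = mean_ranks(x), mean_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


# Made-up values for five hypothetical teams.
capacity_pct = [10, 20, 30, 40, 50]
advantage_pct = [1, 3, 2, 5, 4]
rs = spearman_rs(capacity_pct, advantage_pct)   # sum of d^2 is 4, n is 5
```

A value near +1 or -1 indicates strong positive or negative correlation between the two rankings, while a value near 0 indicates little or none.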

I found that rs = 0.0609, and that the critical value for rs at 10% was 0.2456 (OCR AS/A Level MEI Structured Mathematics Examination Formulae and Tables, October 2000). This means that the data fails the test for correlation at 10%: a coefficient this small would arise more than 10% of the time even if there were no real correlation.

This is no great surprise to me, as the pilot test showed little or no correlation. Unfortunately my hypothesis does not seem to be correct. Perhaps the fact that away supporters are not included might have made a difference – if a team is well-supported away from home it might reverse the disadvantage I predicted. I could not find any data on away supporters so I am unable to investigate this possibility.

Hypothesis 3 – More successful teams have a better disciplinary record

Pilot Study

My idea for a third hypothesis was that a team struggling at the bottom of the table facing relegation would lose confidence and become desperate, causing the players to commit more fouls. On the other hand, a team near the top of the table would be confident and more relaxed, and would not feel the need for desperate challenges etc.

As a pilot test I decided to plot a scatter graph to look for a relationship between the position of a team within its division and its disciplinary points per game. As with the other tests I used only the data for the Premiership and the First Division.

This graph doesn’t show an obvious trend, but there is a slight tendency for the disciplinary points to rise further down the table, especially in the First Division. The second team in Division 1 (Leicester, shown circled) is clearly an outlier, and perhaps if I continued the study on the other divisions a clearer pattern would emerge.

In order to test this hypothesis further I decided to take all of the data from the Football League and randomly select 3 teams from the top 25% and 3 teams from the bottom 25% of each division. This means the data is collected using stratified random sampling. However, as the Premiership has only 20 teams instead of 24 it is slightly over-represented compared to divisions 1-3.

Most importantly I am not using the data from the middle 50% of the divisions, so any possible patterns there will be lost. However, there are two good reasons to sacrifice this data. Firstly, any differences between successful and unsuccessful teams would be most apparent at the top and bottom of each division. Secondly I need a more manageable sample size which I can perform statistical tests on.

I produced two histograms to show any difference between top and bottom teams.

As you can see, slightly more teams in the lower quarters of the divisions have higher disciplinary points per game, while slightly more teams in the upper quarters of the divisions have lower disciplinary points per game. The easiest way to tell this is that the histogram for the bottom 25% is shifted slightly to the right compared to the one for the top 25%.

I calculated the median for each set of data to give an idea of the central tendency for each distribution. I used the median because I am comparing the ‘average team’ in the top 25% with the ‘average team’ in the bottom 25%. The median for the upper quarters is 2.12 and for the lower quarters, 2.41 (answers to 2 decimal places), meaning there is a 14% difference between the two. This suggests that the disciplinary points per game for the lower teams are generally higher than those of the upper teams.

In order to tell for certain whether or not there is a significant difference between the lower and upper quarters of the divisions I would have to perform a statistical test. In this case I will use the Mann-Whitney U-Test.

Mann-Whitney U-Test

This is a non-parametric statistical test to show whether or not two groups of samples are from different populations. In this case it will show whether or not there is a statistically significant difference between teams in the top and bottom 25% of each division, comparing their average disciplinary points per game.

Method:

First I ranked the data from both groups in increasing order of size (see column B in the table overleaf).
Next, for each team in group b, I counted how many teams in group a had a smaller disciplinary points per game total. Teams with equal disciplinary points per game scored ½. I did the same for group a. See column C in the table.
I found the total of the column C values for both group a and group b. I called these two totals Ua and Ub.
I chose the smaller value of U and I looked up the critical values of U at the 5% significance level.
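The counting method above can be sketched in Python. The function name and the example figures are made up, and which total is labelled Ua and which Ub is an assumption, since only the smaller of the two matters for the test.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) by direct counting, scoring 1/2 for each tie."""

    def count_smaller(group, other):
        total = 0.0
        for x in group:
            for y in other:
                if y < x:
                    total += 1      # a team in the other group scored lower
                elif y == x:
                    total += 0.5    # equal scores count as a half
        return total

    u_a = count_smaller(group_a, group_b)
    u_b = count_smaller(group_b, group_a)
    return u_a, u_b


# Made-up points-per-game figures for two small hypothetical groups.
top = [1.8, 2.0, 2.2]
bottom = [2.0, 2.4, 2.6]
u_a, u_b = mann_whitney_u(top, bottom)
u = min(u_a, u_b)                   # the value compared with the critical value
```

The two totals always add up to the number of teams in one group multiplied by the number in the other, which is a useful check on the counting.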

Results

I found that the lower value of U was Ua (57.5). The critical value for U at the 5% significance level was 37 (Advanced Biology Study Guide by C J Clegg & D G MacKean, 1996). This meant that Ua was larger than the critical value of U at the 5% significance level. Therefore the difference between teams in the top and bottom 25% of each division, comparing the average disciplinary points per game, is not significant. A difference of this size would occur by chance more than 5% of the time even if there were no real difference between the two groups.

Again this result is hardly surprising considering the lack of strong correlation in the pilot test. There could be several reasons why this hypothesis failed. Perhaps certain teams do well whilst still playing dirty – maybe this is even a valid tactic for success!

It might also be the case that the disciplinary points scores for some teams are disproportionately increased by certain players who are frequently booked or sent off – Patrick Vieira of Arsenal for example. I am unable to find data on individual players so I cannot investigate this further.

Evaluation

I am quite pleased with the way my investigation went. Although hypotheses 2 and 3 were not statistically supported by my data, these raised other interesting questions, which could be investigated. Of course there are certain limitations to my study. The data I used came from complete, published tables, and its authenticity is not in doubt. However, there is nothing to say that the 2002-2003 season was a typical one, and my results might have been different for a different year.

Another important point to consider is that the data for different teams is not independent. For example, because Manchester United was top of the Premiership, no other team could possibly be top as well. In fact, even the points totals of the teams are interdependent – a team can only be judged in comparison to the other teams it plays. It is possible that every team played worse in the 2002-2003 season than in previous or subsequent seasons – it is impossible to tell if this is true as the points totals for each team are relative to those of the other teams. Therefore there can be no stand-alone measure of how good a team is.

It is also important to remember that football is a sport played at many levels, in hundreds of countries and by many age and social groups. The English Football League is only a tiny part of this, and if I conducted my study on different aspects of the game I might obtain very different results.

Appendix – Team numbers
