If X and Y have a geometric distribution, the distribution should look like this:
The sample size shall be 80 as a large sample size makes the geometric distribution as accurate as possible for testing purposes. It also allows me to use the chi squared test on the model to check if there is any evidence to suggest that one thrower is better than the other at various critical levels.
Assumptions that I am making to allow the model to work are that the trials are:
- Identical: The factors are exactly the same. This provides a fair test and is a property of the geometric distribution.
- Independent: The trials are not affected by the previous trial. The geometric model states that the events must be independent. No distribution could possibly account for the infinite amount of variables/influences that could occur e.g. improving skill as more shots are scored, fatigue etc. The variable would be different in each case.
The five practice shots will make the distribution more geometric as it will ‘warm’ up the performer beforehand so that they “get used to the feel of shooting.”
- Have two outcomes – score a basket or no score.
- Repeated to gain the sample size
Modelling the situation with a geometric distribution
Let X be the number of attempts Lee takes to score a basket, counting the successful shot:
Probability of scoring a basket: P(score) = sample size/total number of shots
= 80/269
= 0.2973977695
This implies X~G( 0.297 )
X can be modelled as a geometric distribution with a probability of scoring first time equal to 29.7% (1 d.p.)
Finding Prob(X=r)
Therefore P (no score) = 1 – P (score)
= 1- 0.2973977695
= 0.7026022305
Using the formula: P(X = r) = q^(r-1) x p where r = 1, 2, 3……:
q = probability of not scoring
p = probability of scoring
P( X = 2) = 0.7026022305 x 0.2973977695 = 0.2089523362
P( X = 3) = 0.7026022305^(3-1) x 0.2973977695 = 0.1468103775
Finding Expected Frequency
Expected Frequency for (X = r) = Prob (X=r) x sample size
Therefore Expected Frequency for (X = 1) = 0.2973977695 x 80
= 23.791821
Expected Frequency for (X = 2) = 0.2089523362 x 80
= 16.7161869
Let Y be the number of attempts Dom takes to score a basket, counting the successful shot:
Probability of scoring a basket: P(score) = sample size/total number of shots
= 80/345
= 0.231884058
This implies Y~G ( 0.232 )
Y can be modelled as a geometric distribution with a probability of scoring first time equal to 0.232 (3 d.p.), i.e. a 23.2% chance of scoring on the first attempt.
Therefore P(no score) = 1 – 0.231884058
= 0.768115942
Therefore for Dom: P (Y = 2) = 0.768115942 x 0.231884058
= 0.1781138416
P (Y = 3) = 0.768115942^(3-1) x 0.231884058
= 0.1368120813
Expected Frequencies will be: (Y = 1) = 0.231884058 x 80
= 18.55072464
(Y = 2) = 0.1781138416 x 80
= 14.24910733
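The calculations for both throwers can be reproduced with a short script. This is a minimal sketch rather than the method actually used for the coursework; the function names `geometric_pmf` and `expected_frequencies` are made up for illustration, and only the figures 80, 269 and 345 come from the data above.

```python
def geometric_pmf(p, r):
    """P(X = r) = q^(r-1) * p for r = 1, 2, 3, ... where q = 1 - p."""
    return (1 - p) ** (r - 1) * p


def expected_frequencies(p, sample_size, max_r):
    """Expected frequency of each value r = 1..max_r in a sample of the given size."""
    return [geometric_pmf(p, r) * sample_size for r in range(1, max_r + 1)]


# Estimated probability of scoring = baskets scored / total shots taken
throwers = {"Lee": 80 / 269, "Dom": 80 / 345}

for name, p in throwers.items():
    probabilities = [round(geometric_pmf(p, r), 6) for r in (1, 2, 3)]
    frequencies = [round(f, 4) for f in expected_frequencies(p, 80, 3)]
    print(name, "P(r = 1, 2, 3):", probabilities, "expected:", frequencies)
```

Raising `max_r` extends the table to as many values of r as are needed before cells are combined for the chi-squared test.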
Chi Squared Distribution
The chi-squared distribution can be applied to measure the ‘goodness of fit’ for the geometric models. It will examine the ‘goodness’ of the model by considering the number of possible outcomes of the events and will analyse the validity of the assumptions.
The χ² value will be expected to be small to suggest that the model fits the real distribution. A large value would suggest that the model is unlikely to be correct so I will use a 10% critical region to test it.
- If the χ² value lies within the critical region then, assuming the model is correct, it would mean that there is less than a 10% chance of a result as high as this occurring. We reject the model as a consequence and conclude insufficient sampling etc.
- Alternatively, if the value lies outside the critical region, there is no evidence at the 10% level to reject the model, so the model is accepted. The conclusion would be to state that the statistical model is appropriate to the situation and the assumptions are reasonable.
In the tables, the expected and observed frequencies were calculated but how close together are the values? The closer the observed value to the expected value the more accurate the geometric model will be.
The goodness of fit statistic is:
χ² = Σ (O – E)² / E
where O = Observed Frequency
E = Expected Frequency
To find the best measure of goodness of fit, add up the values of (O – E)² / E for all cells and compare the total with the χ² probability distribution tables.
The chi squared test should only be used if the expected frequency of a cell is more than five which means some of the groups are going to have to be combined. This enables the chi-squared distribution to be better approximated. The total frequency of expected frequencies should also be over 50. This makes the chi squared test work at a more accurate level.
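The grouping rule and the statistic can be sketched in code. The pooling rule below is an assumption, not the grouping recorded in the coursework tables: it merges neighbouring values of r until each cell has an expected frequency of at least five, and stops splitting once the remaining tail is too small to form two such cells. Under this assumed rule the model gives seven cells for Lee and eight for Dom. The names `pooled_cells` and `chi_squared` are hypothetical.

```python
def pooled_cells(p, n, minimum=5.0):
    """Group r = 1, 2, 3, ... into cells with expected frequency >= minimum.

    Returns a list of (first_r, last_r, expected); last_r is None for the
    open-ended final cell. The grouping rule is an illustrative assumption.
    """
    q = 1 - p
    cells = []
    r = 1
    tail = float(n)  # expected frequency of all values not yet assigned
    # Keep splitting while the tail could still form two valid cells.
    while tail >= 2 * minimum:
        start, total = r, 0.0
        while total < minimum:
            total += n * q ** (r - 1) * p
            r += 1
        cells.append((start, r - 1, total))
        tail = n * q ** (r - 1)  # P(X >= r) = q^(r-1)
    if cells and tail < minimum:
        # Tail too small to stand alone: merge it into the previous cell.
        start, _, total = cells.pop()
        cells.append((start, None, total + tail))
    else:
        cells.append((r, None, tail))
    return cells


def chi_squared(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


for name, p in (("Lee", 80 / 269), ("Dom", 80 / 345)):
    cells = pooled_cells(p, 80)
    print(name, len(cells), "cells:", [(a, b, round(e, 2)) for a, b, e in cells])
```

To obtain the statistic, the observed frequencies from the raw data would be grouped into the same cells and passed to `chi_squared` together with the expected values.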
Lee’s chi squared test
Using the equation χ² = Σ (O – E)² / E on Lee’s table, the result is χ² = 7.478504913
To analyse the result with the chi squared test the number of degrees of freedom has to be established following this procedure:
Degrees of Freedom = Number of Cells – Number of Constraints
In Lee’s table there are seven cells. The number of constraints is two because:
- A sample size of eighty is one constraint: The sample has to be eighty.
- The probability is another constraint: The mean of the model has to equal the mean of the data so we used the data to work this value out.
- Therefore: Degrees of Freedom = 7 – 2 = 5
- At the 10% critical level with 5 degrees of freedom the critical value is 9.236, i.e. prob (χ² < 9.236) = 0.9 (result taken from probability distribution table)
- The observed value of χ² = 7.478504913
- 7.478…… is less than 9.236
- Therefore, the value is not in the critical region
The value is not in the critical region, so at the 10% level there is no evidence to reject the model. Lee’s results fit the geometric distribution, which is therefore a good model for Lee’s data. There is no evidence against the assumptions, so we accept them as part of the geometric model. See graph above for explanation of what the results show.
Dom’s Chi Squared Test
Using the equation χ² = Σ (O – E)² / E on Dom’s table, the result is χ² = 5.694287179
- Degrees of Freedom = 8 – 2 = 6
- At the 10% critical level with 6 degrees of freedom the critical value is 10.645, i.e. prob (χ² < 10.645) = 0.9 (result taken from probability distribution table)
- The observed value of χ² = 5.694287179
- 5.694…… is less than 10.645
- Therefore, the value is not in the critical region
Dom’s results fit the geometric model, as the value is not in the 10% critical region. We can assume that the geometric model was a good model to use for his results. We can again accept the assumptions as there is no evidence to suggest they do not hold. See graph above for an explanation of what the results show.
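The two table look-ups can be cross-checked numerically. The sketch below is not part of the original working: it compares each statistic with the critical value quoted from the tables and also computes the upper-tail probability directly, using a standard series for the incomplete gamma function so that only the Python standard library is needed. The function names are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-squared with df degrees of freedom > x)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # Series for the regularised lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


def in_critical_region(statistic, critical_value):
    """True when the statistic exceeds the tabulated critical value."""
    return statistic > critical_value


# (observed chi-squared, degrees of freedom, 10% critical value from the tables)
tests = {"Lee": (7.478504913, 5, 9.236), "Dom": (5.694287179, 6, 10.645)}

for name, (statistic, df, critical) in tests.items():
    print(name, "reject model:", in_critical_region(statistic, critical),
          "tail probability:", round(chi2_sf(statistic, df), 3))
```

A tail probability above 0.10 corresponds to a statistic outside the 10% critical region, so both checks should agree with the table comparison.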
Both results are comfortably outside the critical region, which supports the view that the models are reliable and the assumptions made are valid. We can adapt Dom’s model so that five degrees of freedom are used, matching Lee’s test. I predict that this would not affect the conclusion, because the value would need to increase dramatically to become significant.
Both performers have had their results analysed at the same number of degrees of freedom and there was no significant difference. It shows no alteration for the final conclusion and still no evidence is available to reject the models.
Both results have shown X and Y can be modelled by the geometric distribution. By knowing this I could produce confidence intervals for any parameters I estimate from the distributions. However at this stage I will calculate the relevant parameter for this piece of coursework.
I will estimate the expected number of shots required by Lee and Dom to score a basket.
Expected Mean Values
The expected mean value of a geometric distribution is defined as the sum to infinity of each value of X (in Lee’s case) or Y (in Dom’s case) multiplied by its probability. This simplifies conveniently to 1/p, where p is the probability of scoring on any one attempt.
For Lee the expected mean value would be E[X] = 1/p = 269/80 = 3.3625 (4 d.p.)
For Dom the expected mean value would be E[Y] = 1/p = 345/80 = 4.3125 (4 d.p.)
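The simplification to 1/p is a standard result. It can be derived by differentiating a geometric series, sketched here for X with q = 1 – p and 0 < q < 1 (the same steps apply to Y):

```latex
\begin{aligned}
E[X] &= \sum_{r=1}^{\infty} r \, q^{r-1} p
      = p \, \frac{d}{dq} \sum_{r=0}^{\infty} q^{r}
      = p \, \frac{d}{dq} \left( \frac{1}{1-q} \right) \\
     &= \frac{p}{(1-q)^{2}}
      = \frac{p}{p^{2}}
      = \frac{1}{p}
\end{aligned}
```

With p = 80/269 this gives 269/80 for Lee, and with p = 80/345 it gives 345/80 for Dom.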
These results demonstrate the average number of shots it takes until the performer scores. Lee, having a lower expected mean value than Dom, is shown to be the better free-thrower as he takes an average of approximately three shots to score, compared with approximately four in Dom’s case.
The total number of shots can be a very rough indicator of who seems to be the better free thrower. Lee took 269 shots and Dom accomplished 345 shots to score 80 baskets. Does this imply that Lee is more accurate? According to the expected mean values and the probabilities of scoring for each model it reinforces Lee’s success where all three tests are in his favour. There is a much higher chance now of Lee being picked for the game on Saturday.
A factor of the investigation was whether taking constant shots at the basket improved performance. This may happen because training has occurred and the brain is learning from past mistakes. The question being asked is, were the five practice shots enough practice to enable an independent model to be produced or should it not have occurred?
Raw data results were recorded in two stages, the first 40 and the second 40, and they suggest small decreases in many of the cells for the second 40, especially in Dom’s case. Lower values of X or Y become more frequent in the second 40. This complicates results and so is a factor to consider if the coursework is completed again. The decreases in the higher X or Y values and the increases in the smaller X or Y values suggest evidence of fatigue, boredom, frustration etc. I can say now that skill level did not increase during the collection of the sample size but what is more likely to have occurred is the opposite. The explanation for Dom being more tired, bored or frustrated is probably that he took a total of 345 shots whereas Lee completed his in 269 shots.
Two parent populations (X and Y) have been tested against geometric probability models and they fit them well. Therefore, we can apply the knowledge that the number of attempts taken to score a basket is modelled very well by a geometric distribution. There may be only two populations but they both show noticeable differences in their results and remain well within the statistical model. I will assume that it is highly probable for most other populations to fit the geometric distribution on the basis that my models are very appropriate for the investigation. I have modelled the basketball situation in a real life atmosphere and the model was successful. Even though the situation is based on a theoretical distribution it was modelled appropriately. The club should now prepare for Lee having the role of free-thrower in this Saturday’s cup final and accept the fact that Dom is on the subs bench for the start of the game.
The data sampling was very organised and strict but not random. To have taken a random sample would mean:
- Watching a random sample of club games throughout the season
- Watching a sample of free-throws made by the performers from the game
- Calculating who is most accurate
A problem with this is time, as it would take a year to go through just one season, so it is impractical. The physical form of the player would also vary throughout the season, so a random sample of more than one season would have to be taken.
A much better way is to watch all training sessions and take a general overview of who supplies the most points in miniature matches from free throws. This gives more of a view of consistency than “on the day” performance but during game situations the performer will be thinking more logically. A sample of eighty straight baskets is tedious and will affect performance.
Modifications
- Use a longer time period. The performers were rushed to collect their sample size within two hours as a result of school timetabling and so one of them had to rush his last twenty shots.
- Use the same time period. One performer did it one day and the other completed it the next day, so conditions may have been different and morale, energy etc. may have varied for both Dom and Lee.
- Use foot-mats on the floor so that it indicates an exact position for the feet to stand instead of just using the line. This may be an insignificant difference but to improve the coursework it is better than no difference at all.
- Use the same basketball. Halfway through the sample collection the basketball was lost, leaving us to use another basketball, maybe of different weight, age etc., possibly affecting the results.
Improvements
- I would like to calculate confidence intervals for both expected values (E[X] and E[Y]) to determine my degree of confidence in Lee being a better free-thrower.
- I would also like to be able to see if the difference between E[X] and E[Y] is statistically significant.
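One possible way to approach both improvements is sketched below. It rests on assumptions that were not tested in this coursework: the sample mean of 80 geometric observations is treated as approximately normal, the standard deviation is taken from the fitted model (√q / p) rather than from the raw data, and the two performers' samples are treated as independent. The function names are hypothetical and the 95% level is only an example.

```python
import math
from statistics import NormalDist


def mean_and_standard_error(baskets, shots):
    """Estimated mean shots per basket and its standard error under the model."""
    p = baskets / shots
    mean = 1 / p
    sd = math.sqrt(1 - p) / p  # geometric model: variance = q / p^2
    return mean, sd / math.sqrt(baskets)


def mean_confidence_interval(baskets, shots, level=0.95):
    """Approximate confidence interval for the expected number of shots."""
    mean, se = mean_and_standard_error(baskets, shots)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * se, mean + z * se


def difference_z(baskets_1, shots_1, baskets_2, shots_2):
    """z statistic for the difference between the two expected values."""
    mean_1, se_1 = mean_and_standard_error(baskets_1, shots_1)
    mean_2, se_2 = mean_and_standard_error(baskets_2, shots_2)
    return (mean_2 - mean_1) / math.sqrt(se_1 ** 2 + se_2 ** 2)


print("Lee:", mean_confidence_interval(80, 269))
print("Dom:", mean_confidence_interval(80, 345))
print("z for E[Y] - E[X]:", difference_z(80, 269, 80, 345))
```

The z value would then be compared with the normal distribution tables at whatever significance level is chosen.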