we couldn’t stop half way through a word.
This is the data that I have gathered.
These graphs show the difference in data between each letter.
Rules
As mentioned earlier, I gathered approximately 15,000 letters. The rules I applied to this sample were:
- No proper nouns – place names, names of people, etc.
- No words with fewer than 4 letters
The first rule is easy to explain; in an article about a man from Southampton, for example, it is likely that the word 'Southampton' will come up far more often than in common English usage. The same applies for an article about a certain brand; words which are more likely to come up in the article than in normal usage should be ignored. Because of this, all proper nouns were deleted from the sample.
The second rule is slightly harder to justify. I believe that it makes the points values more representative of the language as it is used in the game. Words such as 'the' and 'and' are far more common than any other words, and without them the letters 'h' and 'n' are far less likely to occur. This matters because the aim of the game is to score highly with your letters; a player is unlikely to play such short words, so including them in the sample would devalue letters which are otherwise fairly uncommon.
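As a minimal sketch, the two rules could be applied to a passage of text as shown here. It assumes the proper nouns are supplied by hand as a list, since the text above does not describe an automatic way of spotting them; the function name `letter_counts` and the sample sentence are made up for illustration and are not part of the original data.

```python
import re
from collections import Counter


def letter_counts(text, proper_nouns=frozenset(), min_length=4):
    """Count the letters in the words that pass both sampling rules."""
    counts = Counter()
    for word in re.findall(r"[A-Za-z]+", text):
        lowered = word.lower()
        if lowered in proper_nouns:   # rule 1: skip the listed proper nouns
            continue
        if len(lowered) < min_length:  # rule 2: skip words with fewer than 4 letters
            continue
        counts.update(lowered)         # add one count per letter in the word
    return counts


# Illustrative sentence only; 'southampton' is listed by hand as a proper noun.
sample = "The man from Southampton walked along the harbour"
counts = letter_counts(sample, proper_nouns={"southampton"})
```

Here 'The', 'man' and 'the' are dropped by the length rule and 'Southampton' by the proper noun rule, so only 'from', 'walked', 'along' and 'harbour' are counted.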
Testing of SPELL score
To test the SPELL score I needed a procedure that would be quick and specific, so I searched the internet for one. I eventually found Spearman's Rank Correlation on the website www.mathworld.wolfram.com/SpearmanRankCorrelationCoefficient.html. Spearman's Rank Correlation is a technique used to test the direction and strength of the relationship between two variables; in other words, it shows whether one set of numbers rises or falls in step with another set of numbers.
I first checked how accurate the SPELL score was by finding out whether it had a negative correlation with the mean value of each letter under Spearman's Rank Correlation. The technique is an easy way to obtain a clear and precise result.
Spearman's Rank Correlation works by converting each variable to ranks. Thus, if you are doing a Spearman's Rank Correlation of blood pressure vs. body weight, the lightest person would get a rank of 1, the second-lightest a rank of 2, etc. The lowest blood pressure would get a rank of 1, the second lowest a rank of 2, etc. When two or more observations are equal, the average rank is used. For example, if two observations are tied for the second-highest rank, they would each get a rank of 2.5 (the average of 2 and 3). In this case, however, the ranks are applied to the score and the mean value of each letter, so two tied values can simply be given the same rank.
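The average-rank rule for ties can be sketched as follows. This is only an illustration of the general convention described above, with the smallest value given rank 1; the function name `average_ranks` and the example numbers are assumptions, not values from the project.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward, sharing the average rank among ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values equal to the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Tied values occupy 1-based positions pos+1 .. end+1; share their mean.
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


example_ranks = average_ranks([10, 20, 20, 30])
print(example_ranks)  # [1.0, 2.5, 2.5, 4.0]
```

The two values of 20 would have taken ranks 2 and 3, so each receives 2.5, matching the example in the paragraph above.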
Once the two variables are converted to ranks, a linear regression is done on the ranks. The coefficient of determination (r²) is calculated for the two columns of ranks, and the significance of this is tested in the same way as the r² for a regression or correlation. The P-value from the regression of ranks is the P-value of the Spearman's Rank Correlation. The ranks are rarely graphed against each other, and a line is rarely used for either predictive or illustrative purposes, so the distinction between correlation and linear regression doesn't matter here.
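A short sketch of this step: given two columns that have already been converted to ranks, the ordinary correlation coefficient r and its square can be calculated directly. The rank columns below are hypothetical (five letters in exactly opposite order) and are not the project's results; the P-value is left out of the sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Ordinary correlation coefficient of two equal-length columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical rank columns for five letters (not the project's data).
frequency_rank = [1, 2, 3, 4, 5]
score_rank = [5, 4, 3, 2, 1]

r = pearson_r(frequency_rank, score_rank)  # correlation of the two rank columns
r_squared = r ** 2                         # coefficient of determination
```

Because the second column is the exact reverse of the first, r comes out at -1, a perfect negative correlation, and r² at 1.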
To work this out I used the formula:
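The form of Spearman's formula most commonly given, and assumed here to be the one meant, is written below. The symbols are the usual ones rather than ones defined in this report: d_i is the difference between the two ranks of letter i, and n is the number of letters being ranked. This short form is exact only when there are no tied ranks.

```latex
r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
```

With the 26 letters of the alphabet, n = 26, so the denominator n(n² - 1) is 26 × 675 = 17550.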
The results of the Spearman's Rank Correlation on the SPELL score are shown below.
Analysis of SPELL score
At first glance the SPELL score looked wrong. As I have said before, I suspected the scoring system was faulty: to my knowledge “T” is a very common letter, so awarding it 10 points looked out of place, while “V” is awarded only 2 points even though it is an infrequent letter. The Spearman's Rank Correlation above shows that there is no correlation, and the null hypothesis is rejected because of this. To add more evidence I created a relative frequency graph, which clearly shows that there is no correlation.
The SPELL score should be related to the mean value of each letter, but it is not: the relationship between the two variables has neither direction nor strength. This means that the SPELL score is incorrect and is very far from a proper scoring system.
My SPELL score
The current SPELL score is incorrect, so I have constructed a new score system to make the game valid overall. To create it I first sorted all the letters in order of their mean value. From this I devised a procedure that separates the letters into groups depending on their mean value, or in other words how common they are. After doing this I realised that there weren't enough groups to give scores from 1 to 10, so I divided them further. I then had 10 groups, in which the most common letters were given a value of 1, down to the least common, which were given a value of 10. With these values I could rank the letters and take them through the Spearman's Rank Correlation technique. The results are shown below.
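One way the grouping could be sketched is shown here. The exact procedure used to form the groups is not spelled out above, so this sketch makes an assumption: the letters are sorted from most to least common and cut into 10 bands of nearly equal size. The name `band_scores` and the counts are placeholders, not the gathered data.

```python
from string import ascii_lowercase


def band_scores(counts, groups=10):
    """Score 1 for the most common band of letters, up to `groups` for the rarest."""
    ordered = sorted(counts, key=counts.get, reverse=True)  # most common first
    n = len(ordered)
    # Position i in the ordering falls into band i * groups // n (0-based).
    return {letter: i * groups // n + 1 for i, letter in enumerate(ordered)}


# Placeholder counts: 'a' is simply made the most common and 'z' the rarest.
placeholder_counts = {letter: 100 - 3 * i for i, letter in enumerate(ascii_lowercase)}
scores = band_scores(placeholder_counts)
```

With 26 letters each band holds two or three letters, so every score from 1 to 10 is used. Grouping by gaps in the mean values instead of by equal-sized bands would give different groups.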
The results show that I had a very strong negative correlation, which in turn means that my new scoring system is much improved compared to the current SPELL scoring system. The graph above clearly shows the negative correlation, and this is what we are looking for, as it means that the more common a letter is, the lower its score. I assume that in most research and investigations the 3 most common letters are “E”, “T” and “A”, and that the 3 least common letters are “Q”, “Z” and “J”. The reason for “E” being one of the most common letters is that it is a vowel and is used, often several times, in the vast majority of words in the English language. The reason for “Q” being one of the least common letters is that it more or less has to be used in conjunction with “U”, which brings its appearance rate down.
Difference between my score and SPELL's score
The difference between my score system and SPELL's score system is clear. First of all, as I have said before, the points SPELL awards to the letters appear to be random, as they have no reasoning behind them. Secondly, because of this the ranks are unlike each other, causing large changes in the later stages of the procedure. Lastly, the outcome for the null hypothesis is completely different, and it is clear to understand why.
Comparison of data with another
The graphs and data below were taken from Wikipedia, which used the same number of letters but not the same genres. The graph showing the relative frequency of each letter is very similar in values to ours, and so is the order from most common to least common. I could not compare scores directly, as Wikipedia publishes only the relative frequency of each letter and not a score. Instead I have obtained another person's score and applied the Spearman's Rank Correlation technique to it. From this I hope to show that the score system I created is valid and improved. The comparison will also let me see the difference in the final correlation, and how different ranks for the letters can change the outcome for the null hypothesis.
Score comparison
I have chosen to compare my score with another person's so that I can check the validity of my new score system. The screenshot below shows the results of their score once entered into the Spearman's Rank Correlation technique.
The correlation for the score is fairly negative and shows us that the score that I have given to SPELL is valid and accurate.
Conclusion
The finding was that the current SPELL score had no correlation with the actual probability of the letters occurring in standard words. Once I created my own score for the letters, I obtained a strong negative correlation. I established this by collecting a large amount of data showing the frequency of each letter from a variety of sources, and using it to work out the probability of each letter and to create a relative frequency scatter diagram. Combined, these allowed me to use the Spearman's Rank Correlation technique to show that the SPELL score has no correlation, and in the same way to set my own score.
If I had more time I would have repeated the procedure for other languages such as German, Spanish and French. From these I would be able to find out which letters are most and least common in each language, and from this I would have devised a score system for the game of SPELL in those languages.