1.7 Background of Colloquial Singapore English (SgE)
Colloquial Singapore English (SgE) is a variety of English spoken in Singapore which has been significantly influenced by other languages, notably Hokkien and Malay. Features of SgE have been incorporated into computer-mediated communication (CMC) in the Singaporean context. Hence, this research considers how SgE might affect the type of English teenagers use for communication. It examines which SgE features occur in teenagers’ language on Facebook, and where they occur, to gain insight into how SgE as a local variety is adapted to Facebook as a mode of CMC.
Certain features of SgE phonology affect the orthography of the CMC used by Singaporean teenagers on Facebook, distinguishing it from the varieties used on Facebook by speakers of other major English varieties. For example, speakers of SgE pronounce the consonant [θ] in coda position as [f], and the consonant [ð] in initial position as [d], giving rise to re-spellings such as <wif> for the standard spelling <with>, and <dat> for <that>. Unreleased plosive consonants such as [t] in SgE have also led to spellings such as <don> or <dun>.
SgE features are also borrowed into the CMC language used by Singaporeans at the syntactic level, for example the discourse particles “lah”, “leh”, “lor” and “meh”, which are placed at the end of a sentence to serve a pragmatic function. Because such particles had not been written down before the arrival of the electronic medium, their spelling varies (e.g. <la> vs <lah>). This is another example of spoken language being adapted to the electronic medium within the constraints of the keyboard. Finally, lexical items unique to the Singaporean variety, such as “roti prata”, “handphone” and “hawker centres”, may be used in the CMC language of Singaporean teenagers.
However, while the research proceeds on the assumption that it is SgE that is adapted to the Facebook medium, the evidence for teenagers aged 13 to 16 may challenge this assumption. Because the written medium is associated with Standard English, teenagers may not reproduce the basilectal form of SgE used in face-to-face conversation, but may instead use a variety closer to the standard written variety (to an extent which this research seeks to ascertain). In this sense, SgE can be said to be reformulated by the electronic medium and, along with other features of CMC, to form a new variety of language used by this particular age group of Singaporeans.
2. Methodology
Unlike the data for most sociolinguistic research on spoken varieties of language, the data for this research could be conveniently obtained from the Internet. Furthermore, obtaining data from Facebook poses fewer problems than collecting spoken language, closed conversations such as emails, or synchronous exchanges such as Internet chat.
The participants of the research were secondary school students aged 13 to 16. As the focus of the research was on the sociolinguistic variable of gender, the number of male and female participants, as well as the amount of text in the corpus for each gender, was balanced as far as possible, since such a corpus should reflect the participation of both genders equally (Ooi, Tan & Chiang, 2007). To draw participants from a variety of backgrounds within this age group, students of different schools were included in the research as far as possible.
Various accounts, organised according to a random choice of secondary schools, were set up to make “friends” with other users and hence obtain data. Sentences, known as status messages in Facebook terminology, were collected from the “walls” of the users who posted them. The data were organised as a collection of discourse, separated by the sociolinguistic variables under consideration. The gathered data were stored in three MS Word files: two files for analysis of the sociolinguistic variable of gender, and a combined file for a general analysis of teenage language. The final combined file contained approximately 22,000 words of text. As the corpus software supported only ASCII text, the files were saved in plain text format and uploaded.
The tool used for data synthesis and analysis was WMatrix2 (Rayson, 2008), a software tool for corpus analysis and comparison. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains. While the corpus linguistic software was built mainly to handle standard written texts without CMC or SgE features, it can still be adapted for use, as shown by Ooi, Tan & Chiang (2007):
“In applying Wmatrix, the objective should not be taken as one that seeks to highlight its weaknesses in handling CMC aspects but one which instead focuses on some of the future challenges and attendant issues for the compilation and analysis of such texts.
[Wmatrix] is an online integrated corpus linguistic software environment in which texts can be loaded and analyzed for word frequency profiles and concordances, annotated in terms of part-of-speech (using the well-known [CLAWS] tagger) and word-sense (semantic content and word sense tagger), and analyzed in terms of complex lexical frequency profiles that statistically compare them against standard corpora samplers (e.g. the British National Corpus 1-million word sampler). CLAWS is said to achieve 96-97% accuracy for standard written texts. The part-of-speech component uses categories such as 'N' for nouns, 'V' for verbs, 'J' for adjectives, 'R' for adverbs, and 'A' for articles and determiners. The semantic content component, named the UCREL Semantic Analysis System (or [USAS]), contains a multi-tier structure with 21 major discourse categories (see Figure 1):
Figure 1: Semantic categories in Wmatrix (after , ).
These 21 categories are further refined and categorized. A particular refinement within the 'Z' category worth noting is that unmatched items (or those items not recognized by the system) are categorized as 'Z99'.”
The majority of these unmatched items could be typographical errors, or lexical items that are of interest for the purposes of this research. The latter could be either conventions in computer-mediated communication (CMC) shared among the major varieties of English, such as emoticons, discourse markers, abbreviations, creative re-spellings, new or popular proper nouns and informal English, or features which characterise SgE as used on the Internet, such as discourse particles or re-spellings based on the phonology of SgE (see chapter 1, Introduction, for more information on features of CMC and SgE).
3. Results and analysis
Three corpora were analysed with the Wmatrix software, one for males (M), one for females (F), and a combined (C) corpus to represent the teenage population in Singapore.
3.1 Analysis of corpus C on teenagers’ language on Facebook
A central feature of Wmatrix is the statistical profiling of a corpus against a conventional baseline corpus, i.e. the corpus is compared with conventional speech or writing, represented in this research by the British National Corpus (BNC) Spoken/Written Sampler. The BNC Sampler consists of one-fiftieth of the well-known British National Corpus of 100 million words, standing at 2 million words (Ooi, Tan & Chiang, 2007). The statistical ranking is done in terms of the log-likelihood (LL) value, which shows the overuse (marked with a plus sign) or underuse (marked with a minus sign) of an item in the first corpus (i.e. the Facebook corpus) against the reference corpus (i.e. the BNC Sampler). In Wmatrix, statistical significance begins with an LL value of approximately 7, since 6.63 is the cut-off for 99% confidence of significance (Rayson 2003, Rayson 2005). Refer to for more details as to the calculation of LL value.
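The log-likelihood comparison described above can be illustrated with a short Python sketch. It uses the two-corpus log-likelihood statistic as it is commonly defined for corpus comparison, contrasting observed frequencies with the frequencies expected if the item were equally common in both corpora. It is not the Wmatrix implementation; the function name `log_likelihood` and the example counts are hypothetical and are not taken from the data of this research.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Signed log-likelihood for one item across two corpora.

    Positive values indicate overuse in the study corpus relative to the
    reference corpus; negative values indicate underuse.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the item were equally common in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll *= 2

    # Attach a sign by comparing relative frequencies.
    overused = freq_study / size_study >= freq_ref / size_ref
    return ll if overused else -ll


# Hypothetical counts (assumption): an item seen 150 times in a
# 22,000-word corpus and 400 times in a 1,000,000-word reference corpus.
score = log_likelihood(150, 22_000, 400, 1_000_000)
significant = abs(score) > 6.63  # cut-off for 99% confidence
```

With these invented counts the item is far more frequent, relative to corpus size, in the smaller corpus, so the score is positive and well above the 6.63 cut-off.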
The following figures show the top 10 lexical items of the C corpus when compared with the BNC Spoken and Written Samplers.
Figure 2: The top 10 lexical items of the C corpus, compared with the BNC Spoken Sampler.
The item <d> appears at the top of the list. This represents the overuse of the various emoticons which contain the character <D>, such as <: D> and <XD> (more on emoticons in section 3.3). Interestingly, the next two items are pronouns, whose overuse can be regarded as one of the linguistic features characteristic of Facebook as a medium. Note the use of the abbreviated form <u> to represent “you”. The items <LOL>, <haha> and <omg> can be considered notable features of CMC language. The overuse of the item <status> is interesting, for it is a feature which may be exclusive to Facebook as a medium. The overuse of CMC features seems to be much greater than the overuse of SgE features, especially since every one of the top ten items listed above is a feature of CMC recognisable to speakers of major English varieties.
Figure 3: The top 10 lexical items of the C corpus, compared with the BNC Written Sampler.
The above frequency listing gives somewhat different results. The majority of the top items here are pronouns such as <I>, <you>, <u>, <my>, <me> and <im>, probably because first and second person pronouns are used less in standard written texts. As mentioned in section 2, Wmatrix contains a semantic tagger which sorts each lexical item into one of 21 major categories; an item it does not recognise is relegated to a broad “Z99” category. Figure 4 shows the C corpus benchmarked against the BNC Spoken Sampler for such semantic information:
Figure 4: The top 15 semantic categories of the C corpus, compared with the BNC Spoken Sampler.
The large overuse of the unmatched category can be attributed to the relatively low proportion of unmatched items in standard written texts, as addressed above in section 2. Items such as <LOL>, <happy>, <fun> and <smile> are tagged under E4.1+. The overuse of terms associated with education, including lexical items such as <school>, <teacher> and <class>, reflects the age of the teenagers as a significant sociolinguistic variable. The majority of words tagged as S7.1 “Power, organising” are occurrences of the item <status>. There is an interesting overuse of items tagged as B1 “Anatomy and physiology”, such as <sleep>, <eyes> and <tired>. Some items under this tag have been wrongly tagged, such as <shit>, <suck> and <profile>, with the first two being used in the interjectory sense. It should be noted that the overuse of S9 “Religion and the supernatural” is due mainly to the fact that <nt>, which accounted for 48 of the 98 items tagged as S9, was wrongly sorted. This item was split off from items such as <didnt> and <dont>. The following figure shows the top ten part-of-speech markers compared with the BNC Spoken Sampler.
Figure 5: A comparison of the top ten Wmatrix part-of-speech markers for the C-corpus with the BNC Spoken Sampler.
Note that <i> was wrongly classed as MC1 because it was written in lower case; it should have been classed as PPIS1. It is interesting to note the high overuse of items tagged as “NNU”, the unit of measurement category. Many of these items are abbreviations, often three-letter words which comprise consonants, which misled the software into categorising them as units of measurement (see Figure 6). The following two tags, FO and ZZ1, also contain mainly mismatched items.
Figure 6: The top 25 items classed as “NNU” (units of measurement) in the C corpus with explanatory annotation besides items.
Many of these items are abbreviations of commonly used lexical items which, according to background research, are also prominent in other modes of CMC, such as <tmr>, <ppl> and <nvm>. The high occurrence of <sch> as an abbreviation for “school” emphasises that the participants are students. It is worth noting again the relatively low occurrence of SgE discourse particles. The two figures following show the occurrences of SgE and CMC features.
Figure 7: List of SgE features.
The above figure demonstrates the large variation in spelling for SgE features, with variants such as <sia> and <siah>, or <lorh> and <lor>. Again, owing to the insufficient amount of text in the corpus, the number of occurrences of such discourse particles is small.
Figure 8: List of emoticons.
It can generally be said that there is a large use of certain emoticons such as <:d>, <:)> and <:D>, and that those conveying happiness tend to occur more frequently than those connoting sadness or anger. Judging simply by the number of occurrences in each category, the number of items deriving from SgE seems to fall far short of the number of CMC features, especially as only emoticons (features meant to convey paralinguistic meanings lost through the electronic medium) are shown here. However, this is insufficient evidence to claim that CMC features are more popular or more commonly used than SgE features among teenagers, especially since the two have very different functions.
It was suggested earlier that, owing to the influence of the written medium, teenagers may tend to utilise CMC features and leave behind certain SgE features which would have been used if the message were spoken. Qualitative analysis seems to suggest that this may be true. Five messages which demonstrate this are shown below:
Quantitative evidence also seems to suggest this. Two pages of text were examined and the number of sentences not exhibiting any SgE features was manually counted. On preliminary examination, 70% of the sentences showed no SgE features. Though this may not be fully conclusive, given that much of what characterises SgE is found in the phonology of the language, it does seem to suggest that the language of teenagers becomes more neutral or standardised owing to the primarily textual nature of the electronic medium. Also, contrary to the preliminary hypothesis, many of the presupposed re-spellings based on SgE phonology were hardly used. For example, there were 132 occurrences of <that> against a single occurrence of <dat>, and 100 occurrences of <with> against a single <wif>. The full spelling <want> occurred 48 times, compared with 32 occurrences of <wan>. As a whole, SgE may not be used as much as Standard English when it is written down, because Singaporean teenagers use SgE mainly in the spoken medium. Conversely, CMC features, which are the only way of expressing prosodic and paralinguistic information, are used in this variety.
However, certain sentences maintain the flavour of SgE syntax. The word “already” or “liao” in SgE is a particle added to the end of a clause to indicate the perfective aspect, or a change of state.
Figure 9: Concordance showing use of SgE aspectual “already” or “liao” particle.
The 14 occurrences of this particle outnumber the 11 sentences written in the perfective aspect using the Standard English auxiliary verbs “has”, “have” or “had”. Another good example of how the flavour of SgE has been retained, in specific senses which would be harder to convey in Standard English, is the SgE use of “right” or “rite” to mean “isn’t it”, seeking confirmation of what has been said in the sentence.
Figure 10: Concordance showing use of “right” or “rite” SgE particle.
As mentioned in section 2, the Wmatrix software, whose lexicon currently contains 54,776 single words and 18,823 multi-word expressions, is built to handle standard written texts and not CMC language. In data analysis, unmatched items not recognised by the system, which could include features of CMC or SgE, are sorted under the semantic tag Z99. The top 7 lexical items under this semantic tag are listed below; the listing excludes emoticons such as <-.-> and <=d>, which also fall under this tag:
Figure 2: Top 7 unmatched items in Z99 semantic tag in order of descending ranking, for males, females and combined corpora
As items such as <lol> and emoticons are screened out of this listing, it is clear that the top few SgE features used by teenagers are <sia> and <hor>.
3.2 Comparison of corpora M and F
Figure 9: The top 10 lexical items of the M corpus, compared with the BNC Spoken Sampler.
Figure 10: The top 10 lexical items of the F corpus, compared with the BNC Spoken Sampler.
The two figures above show the top 10 lexical items of the M and F corpora compared against the BNC Spoken Sampler. The two corpora were not compared against the Written Sampler, to avoid repeating the findings shown above, where personal pronouns ranked as more overused than features of CMC and SgE. Note that the overuse of <d>, <=> and <-.> is due to these symbols forming part of emoticons. As the above figures yield little that has not already been addressed, the M and F corpora were compared against each other.
Figure 11: The top 10 lexical items of the F corpus compared with the M corpus.
Figure 12: The top 10 lexical items of the M corpus compared with the F corpus.
Figures 11 and 12 show the top 10 lexical items of the F and M corpora when compared against each other. There seems to be a greater use of pronouns among females, such as the overuse of <i>, <she> and <they>, an observation supported by background research on possible differences between the genders. It is difficult to make any substantial inferences from the figures above, partly because of the scarcity of text in the relatively small corpora. However, more substantial findings about the words used by each gender can be drawn using the semantic tagger.
Figure 13: The top 10 semantic categories of the M corpus compared with the F corpus.
The conclusions which can be drawn from the above figure are immediately obvious, and reinforce common stereotypes regarding the nature of the topics discussed by the two genders. For males, this is substantiated by the overuse of lexical items tagged as K5.1 (Sports), G3 (Warfare), K5 (Sports and games) and S2.2 (People: Male).
Figure 14: The top 10 semantic categories of the F corpus compared with the M corpus.
Similarly, conclusions can be drawn about the lexical items used by females, given the overuse of words tagged under O4.2+ (Judgement of appearances: Beautiful) and S4 (Kin). A conclusion of a more linguistic nature is that females tend to use pronouns more often than males. Females also tend to discuss personal safety and relationships more, as seen from A15+ and S4. With greater use of items tagged under Q2.1 (Speech: Communicative), such as <talk>, <say> and <comment>, females can be perceived as being more communicative. It was also observed during data collection that females tend to have longer status messages in general.
Using the same methodology as in section 3.1, sentences in the M and F corpora were manually counted to find the proportion that exhibited SgE features, and a notable difference was observed. It was hypothesised that, since females would usually be more expressive and hence more casual in the way they chat and make Facebook posts, a higher percentage of their sentences would contain SgE particles. The percentage of sentences in the F corpus that contain at least one SgE particle is 37.5%, while that of the M corpus is 22.7%. This provides evidence suggesting that females do indeed use more SgE particles on Facebook than males.
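The sentence-level count described above was done manually in this research. As a rough illustration only, the same measure (the percentage of sentences containing at least one SgE particle) could be approximated automatically as in the following Python sketch. The particle list is limited to spellings mentioned in this paper, the function names are hypothetical, and the example sentences are invented rather than taken from the corpus. Forms such as “already” and “right” are deliberately left out, since they also occur in Standard English and would need manual checking.

```python
import re

# Illustrative particle list (assumption): spellings mentioned in this paper.
# A real study would need a fuller, manually checked list.
SGE_PARTICLES = {
    "lah", "la", "leh", "lor", "lorh", "meh", "sia", "siah", "hor", "liao",
}


def has_sge_particle(sentence):
    """Return True if any alphabetic token in the sentence is a listed particle."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in SGE_PARTICLES for token in tokens)


def particle_rate(sentences):
    """Percentage of sentences containing at least one listed particle."""
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if has_sge_particle(s))
    return 100.0 * hits / len(sentences)


# Invented example sentences, not taken from the corpus.
sample = [
    "so tired sia",
    "going to school tmr",
    "ok lor",
    "I love this song",
]
rate = particle_rate(sample)  # two of the four sentences contain a particle
```

Matching whole tokens rather than substrings avoids counting a particle that merely appears inside a longer word, although short forms such as <la> can still produce false matches that only manual inspection would catch.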
4. Research Assessment
4.1 Limitations of techniques in obtaining information
In terms of research methodology, some limitations were encountered in the process of obtaining data. Firstly, to ensure that the data were sufficiently varied, “friends” of different schools and genders were required. This posed a challenge, as gathering “friends” was difficult: the percentage of people who accepted “friend requests” was low, which explains the relatively small corpus size of about 22,000 words.
Because the data required came mainly from the posts of “friends” on Facebook, and there was no software to date that could extract all of these posts, data had to be copied manually and pasted into a Word document as part of the data collation process. In addition, owing to the lack of manpower, gathering data required much more time than usual. The small size of the corpus can mainly be attributed to this, for the number of people involved in gathering data for this research does not compare with the manpower and resources available to large-scale projects that gather texts to create representative corpora. Furthermore, the average status message posted on Facebook is short in comparison with the large amounts of text available to the researcher in other modes of CMC such as chats and blogs. Although this limits the amount of text the corpus contains, the restricted length of each string of text has given rise to some interesting linguistic features which may be unique to the variety of language used on Facebook.
4.2 Limitations of methods of data analysis
WMatrix 2.0 software was used to aid analysis of the data collated from Facebook posts. The software allowed for faster and more convenient counting of word occurrences, part-of-speech tagging with the CLAWS tagger and semantic category tagging with USAS, and statistical comparison of corpus samples. However, the software itself is limited. CLAWS is said to be 96-97% accurate for standard written texts. At times, certain lexical items may be wrongly assigned to a semantic category. For example, the software separates the words <dont> and <didnt> into <do> and <did> respectively plus <nt>, and these <nt>s are wrongly sorted into tag S9, “Religion and the Supernatural”.
The WMatrix 2.0 software is also unable to cope with non-standard written text, such as the SgE in the corpus. Features of SgE and CMC were generally tagged under the Z99 category (unmatched lexical items), making it convenient to adapt the software to deal with CMC or SgE features. However, the USAS semantic tagger wrongly tagged some items, such that certain SgE particles were not tagged under Z99. The lack of corpora representative of SgE or CMC limited findings which would have distinguished features unique to this particular variety. Besides SgE particles, the software was also unable to identify certain emoticons that contain a full stop (e.g. <-.->, <T.T>). Instead of identifying these as whole emoticons, the software cut off any symbols after each full stop. Such problems were easily overcome by referring to the concordances; for example, this is how the overuse of <D> in the frequency listing was traced to emoticons (refer to section 3.1). Another recurrent problem was that the program was unable to disregard the capitalisation of word-initial letters, leading to errors in the frequency listing, such as counting no occurrences of a particular item without capitalisation when comparing it against the same item with capitalisation.
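The two tokenisation problems described above (emoticons split at the full stop, and capitalised and uncapitalised forms counted separately) could in principle be reduced by pre-processing the text before frequency counting. The following Python sketch illustrates one possible approach under stated assumptions; it is not part of Wmatrix and was not used in this research. The emoticon list and the function name `tokenize` are hypothetical.

```python
import re
from collections import Counter

# Emoticons are matched first so that full stops inside them are not
# treated as punctuation. The list is illustrative only (assumption).
EMOTICONS = ["-.-", "T.T", ":D", "XD", ":)", "=D"]
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(e) for e in EMOTICONS) + r"|[A-Za-z']+"
)


def tokenize(text):
    """Split text into emoticons and words, lower-casing words only."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        token = match.group(0)
        # Keep emoticons as typed; fold case for ordinary words so that
        # "Haha" and "haha" are counted as the same item.
        tokens.append(token if token in EMOTICONS else token.lower())
    return tokens


# Invented example message, not taken from the corpus.
counts = Counter(tokenize("Haha so tired -.- haha T.T see u tmr :D"))
```

In this invented message, <-.-> and <T.T> each survive as a single token, and the two spellings of <haha> are merged into one count. Because the emoticon alternatives are tried before the word pattern, a word that happens to begin with an emoticon string (for example one starting with "XD") would be split, so a fuller treatment would need additional boundary checks.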
4.3 Suggestions for further research
Increasing the corpus size would, if possible, yield a corpus which better represents the teenage population in Singapore, and would increase the reliability of the findings. Future research projects could verify whether the findings are consistent by acquiring data from a larger number of schools, especially since the data were not collected from every school in Singapore. Also, a similar study could be done to investigate differences in the language usage of teenagers from co-ed and single-sex schools. This is of interest because it is debated whether the differences found to distinguish the language used by the genders can in fact be attributed to social environment and to differences in language across social networks.
Before embarking on this research project, the sociolinguistic variable of ethnicity was also considered. However, owing to time constraints, it was decided that the focus would be only on gender. Hence, another possible area of research would be to investigate how ethnicity affects the usage of English by teenagers, or Singaporeans in general, over the medium of Facebook. In all, research on sociolinguistic variables other than gender could further the understanding of how Facebook, a relatively new mode of CMC, can reformulate the use of English.
6. Bibliography
Crystal, D. (2003). The Cambridge Encyclopedia of the English Language (2nd edition). Cambridge: Cambridge University Press.
Crystal, D. (2006). Language and the Internet (2nd edition). Cambridge: Cambridge University Press.
Herring, S.C. (ed.) (1996). Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives. Amsterdam: John Benjamins.
Herring, S.C. (2000). “Gender differences in CMC: Findings and implications”. Computer Professionals for Social Responsibility Journal (formerly Computer Professionals for Social Responsibility Newsletter) 18(1). 29 Apr 2007.
Ooi, V., Tan, P. & Chiang, A. (2007). “Analyzing Personal Weblogs in Singapore English: the Wmatrix Approach”. eVariEng (Journal of the Research Unit for Variation, Contacts and Change in English) Vol. 2: Towards Multimedia in Corpus Studies. Finland: University of Helsinki. 31 Aug 2008.
Rayson, P. (2003). Matrix: A Statistical Method and Software Tool for Linguistic Analysis through Corpus Comparison. Ph.D. thesis, Lancaster University. 29 Apr 2007.
Rayson, P. (2005). Wmatrix: A Web-based Corpus Processing Environment. Computing Department, Lancaster University. 29 Apr 2007.