A sociolinguistic study on Singaporean teenagers’ use of language on Facebook

a research on gender as a sociolinguistic variable in teenagers’ use of English through the social networking platform of Facebook

By Team Members:

Chow Keng Ji (Leader)

Bryan Ang Wei-En

Pethuel Ho

For Expert mentor:

A/P Vincent B Y Ooi,

Department of English Language and Literature

NUS

For Teacher Mentor:

Mr. Desmond Lim

Raffles Institution Research Education

2011

ii. Acknowledgements

Our group would like to thank our teacher mentor Mr Desmond Lim for his continuous guidance throughout the project. We would especially like to thank our expert mentor from NUS, Department of English Language and Literature, A/P Vincent B Y Ooi, for taking time off his schedule to help us and provide valuable insights on our research topic.

iii. Abstract

iv. Contents

1. Introduction

1.1 Research objective

This research aims to analyse the use of the English language on Facebook by teenagers aged 13 to 16, considering the influence of Singapore Colloquial English (SgE) as well as the sociolinguistic variable of gender. According to Herring (2000), gender can be seen as a significant sociolinguistic variable online in general. This research examines whether it is also relevant in the context of SgE used on Facebook.

1.2 General context of topic

From a sociologist’s perspective, the Internet has drastically affected the lifestyle of the people who have access to it, opening new modes of communication that speed up work processes and social interaction. From a linguist’s perspective, the Internet is essentially a new medium of communication, and everyone who has access to it contributes to the development of new varieties of language through that medium. Since the start of the Internet, new varieties of language, in this case of English, have been arising, with the computer keyboard as the factor mediating conventional speech and writing conventions, thus leading to the existence of computer-mediated communication (CMC) (Ooi, Tan & Chiang, 2007). CMC is commonly regarded as a distinct medium of communication, unlike the conventional mediums of speech and writing; the electronic medium is almost a combination of both.

This study aims to analyse how the variety of spoken English used in Singapore, known as Singapore Colloquial English, has been modified for use in the electronic medium by teenagers, with special emphasis on the sociolinguistic variable of gender. It is anticipated, from general experience as well as preliminary observation, that the impact of the electronic medium on the use of the English language should not differ too greatly from its impact on other varieties of global English. This project will aim to quantify the observations and contribute to the currently available sociolinguistic research on the SgE variety.

1.3 Research framework

While the overall focus of the research is on a variety of language that contains features of CMC while also reflecting the original SgE spoken by the users of the social networking platform, emphasis will also be placed on the sociolinguistic variable of gender and how it affects language use on Facebook by a set age group. The gathered data will be analysed at a number of linguistic levels, namely the orthographic, syntactic and lexical.

1.4 Significance of research

Since its creation, Facebook has given rise to a unique new form of language. Facebook, a social networking website, is arguably more interesting and innovative than any other Internet-based communication medium. New mediums, especially electronic ones with their numerous features, tend to change language in interesting ways. In this case, communication is asynchronous and posted on a personal “wall”, where it may be left open to free response, directed at a specific target, addressed to the entire community of “friends” the user is connected to or to no one in particular, or simply used as a means of self-expression. Furthermore, Facebook is a fast-growing medium for informal social interaction and has already become a large part of the lives of many teenagers in Singapore. The project aims to add to the research on these new emerging varieties. Studying the language used on Facebook by speakers of SgE should therefore further the overall understanding of new varieties of English, given its position as a lingua franca in the global world. It adds to the information currently available on the subject, or perhaps renews it, for language is always shifting, especially slang and informal language, which falls in and out of use. By focusing on a specific group, a better understanding of how English affected by SgE is evolving through an electronic medium can be achieved.

1.5 Facebook

Facebook is a relatively new mode of social networking that has emerged as the most popular social networking site in the past few years, especially in the Singaporean context. It has thus developed into a significant platform for social interaction among teenagers on the World Wide Web, and it is very likely that a significant new variety of English has emerged which can be analysed.

This social networking medium is based on the concept of “walls”, one of which is owned by every member of the Facebook community. On the walls, status messages, which are messages on many subjects, are posted. These status messages are the text from which the corpus will be collected. A basic term which will be constantly referred to in the course of the methodology is the Facebook neologism of “friend”. The term “friend” bears a similar concept to that of friends in real life, and refers to other users whose wall, photos or personal information can be accessed by those who are their “friends”. A user can view the walls of any other user of whom he/she is a friend. These “friends” are necessary for the purpose of the research, as they contribute to the amount of text that can be collected from the walls of the participants.

This social networking platform can overall be considered a new mode of communication on the Internet, where users often direct their status messages either to a particular recipient or to every other user (“friend”) who has permission to view his/her wall, or use them as a mode of self-expression. This function of posting “status messages” is thus the main focus of this research. It should be noted that other functions which Facebook provides, such as private messages and chats, are outside the scope of this research study.

1.6 General background of computer mediated communication (CMC)

Herring (1996) defines computer-mediated communication (CMC) as communication that takes place between human beings via the computerised medium. CMC involves one or more of the following:

One-to-one asynchronous communication, e.g. email
Many-to-many asynchronous communication, e.g. electronic bulletin boards, online forums
Synchronous communication that can be one-to-one or many-to-many, e.g. Internet Relay Chat (IRC), online games, or chatrooms on commercial services or email services

David Crystal (2006), also a leading authority on language use on the Internet, coins the term “Netspeak” and notes that Netspeak and face-to-face conversation differ in the following ways:

There is no simultaneous feedback in both asynchronous and synchronous CMC
Rhythm of interaction on the Internet is slower than in normal speech
Difference in use of prosodic and paralinguistic features

Due to the constraints of the electronic keyboard, the prosodic and paralinguistic features which would have been conveyed in face-to-face conversation through intonation, stress or body language are lost in the electronic medium. As such, CMC is adapted to the keyboard, and such features are conveyed through orthographic conventions. It is these features which make CMC text linguistically rich and a significant part of linguistic research today. Examples of CMC orthographic conventions which seem to be used universally by English-speaking Internet users are repeated letters and punctuation marks, capitalisation, abbreviations, respellings and emoticons.

Because the keyboard is its primary medium, CMC is typed and studied mainly as a written (text-based) phenomenon. However, it is important to keep in mind that CMC is not just a re-representation of conventional written language, but more of a blend of both spoken and written features of language, and the language used in a particular context may be more spoken (in the case of chats) or more written (in the case of blogs) (Ooi, ed. Kawaguchi & Minegishi, 2009).

Facebook is interesting as a medium for computer-mediated communication, as its interface functions differently from the other modes where CMC takes place. It can in general be characterised as asynchronous, since users leave their status messages on their walls. Whether it is one-to-one or one-to-many is difficult to establish, although the recipient of communication is rather similar to that of a weblog, where messages are left for a particular person, for many other people in the user’s network of friends (similar to the “blog rings” that bloggers are a part of, since friends on Facebook also receive updates of messages posted by the user), or as a platform for personal expression. The interface, however, restricts the Facebook user to shorter messages than a blog does, and this may influence the nature of the language used on this particular mode of CMC, such as a greater use of short and abbreviated forms similar to those used in chatrooms or in text messages.

1.7 Background of Singapore Colloquial English (SgE)

Colloquial Singapore English (SgE) is a variant of Standard English, spoken in Singapore. It should be noted that this variety of English has had significant influence from other languages, notably Hokkien and Malay. It can be said that features of SgE have been incorporated into CMCs in the Singaporean context. Hence, this research considers how SgE might affect the type of English teenagers use for communication, by examining the type of SgE features occurring in teenagers’ language used on Facebook as well as their location of occurrence, for more insights regarding how SgE ...

Certain features of SgE phonology affect the orthography of the variety of CMC used by Singaporean teenagers on Facebook, distinguishing it from the varieties used on Facebook by speakers of other major English varieties. For example, speakers of SgE pronounce the consonant [θ] in coda position as [f], and [ð] in initial position as [d], giving rise to re-spellings such as <wif> for the standard spelling <with>, and <dat> for <that>. Unreleased plosive consonants such as [t] in SgE have also led to spellings such as <don> or <dun>.

SgE features can be said to be borrowed into the CMC language used by Singaporeans at the syntactic level as well, such as the discourse particles “lah”, “leh”, “lor” and “meh”, which are placed at the end of the sentence to serve a pragmatic function. Because such particles had never been written before the arrival of the electronic medium, their spelling varies (e.g. <la> vs <lah>). This is another example of spoken language being adapted to the electronic medium within the constraints of the keyboard. Finally, lexical items unique to the Singaporean variety such as “roti prata”, “handphone” and “hawker centres” may be used in the CMC language of Singaporean teenagers.

However, while the research proceeds on the assumption that it is SgE that is adapted to the Facebook medium, this assumption can be challenged in the case of teenagers aged 13 to 16. The evidence may suggest that teenagers, influenced by the fact that writing is normally done in standard English, do not reproduce the same basilectal form of SgE used in face-to-face conversation, but instead a variety closer to the standard written variety (to an extent which this research can ascertain). In this sense it can be said that SgE is being reformulated by the electronic medium and, along with other features of CMC, forms a new variety of language used by this particular age group of Singaporeans.

2. Methodology

Unlike in other sociolinguistic research on spoken varieties of language, data for this study can be conveniently obtained from the Internet. Furthermore, obtaining data from Facebook poses fewer problems than research on spoken language, on closed conversations such as emails, or on synchronous ones such as Internet chat.

The participants of the research were secondary school students aged 13 to 16. As the focus of the research was on the sociolinguistic variable of gender, the number of male and female participants, as well as the amount of text in the corpora for both genders, was balanced as much as possible, since such a corpus should reflect the participation of both genders equally (Ooi, Tan & Chiang, 2007). These participants also came from a variety of backgrounds: as far as possible, students of different schools were included in the research.

Various accounts, organised by a random choice of secondary schools, were set up to make “friends” with other users and hence obtain data. Sentences, known as status messages in Facebook terminology, were collected from the “walls” of the users who posted them. The data was organised as a collection of discourse, separated by the sociolinguistic variables under consideration. It was stored in three MS Word files: two files for analysis of the sociolinguistic variable of gender, and a combined file for a general analysis of teenage language. The final combined file contained approximately 22,000 words of text. As the corpus software supported only ASCII text, the files were saved in plain text format and uploaded.

The tool used for data synthesis and analysis was WMatrix2 (Rayson, 2008), a software tool for corpus analysis and comparison. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains. While the corpus linguistic software was built mainly to handle standard written texts without CMC or SgE features, it can still be adapted for use, as shown by Ooi, Tan & Chiang (2007):

“In applying Wmatrix, the objective should not be taken as one that seeks to highlight its weaknesses in handling CMC aspects but one which instead focuses on some of the future challenges and attendant issues for the compilation and analysis of such texts.

Wmatrix is an online integrated corpus linguistic software environment in which texts can be loaded and analyzed for word frequency profiles and concordances, annotated in terms of part-of-speech (using the well-known CLAWS tagger) and word-sense (semantic content and word sense tagger), and analyzed in terms of complex lexical frequency profiles that statistically compare them against standard corpora samplers (e.g. the British National Corpus 1-million word sampler). CLAWS is said to achieve 96-97% accuracy for standard written texts. The part-of-speech component uses categories such as 'N' for nouns, 'V' for verbs, 'J' for adjectives, 'R' for adverbs, and 'A' for articles and determiners. The semantic content component, named the UCREL Semantic Analysis System (or USAS), contains a multi-tier structure with 21 major discourse categories (see Figure 1):

Figure 1: Semantic categories in Wmatrix.

These 21 categories are further refined and categorized. A particular refinement within the 'Z' category worth noting is that unmatched items (or those items not recognized by the system) are categorized as 'Z99'.”

The majority of these unmatched items could be typographical errors, or lexical items that are of interest for the purposes of this research. The latter could be either conventions of computer-mediated communication (CMC) shared among the major varieties of English, such as emoticons, discourse markers, abbreviations, creative re-spellings, new or popular proper nouns and informal English, or features which characterise SgE used on the Internet, such as discourse particles or respellings based on the phonology of SgE (see chapter 1, Introduction, for more information on features of CMC and SgE).

3. Results and analysis

Three corpora were analysed with the Wmatrix software, one for males (M), one for females (F), and a combined (C) corpus to represent the teenage population in Singapore.

3.1 Analysis of corpus C on teenagers’ language on Facebook

A central feature of Wmatrix is the statistical profiling of a corpus against a conventional baseline corpus, i.e. the corpus is compared with conventional speech or writing, represented in this research by the British National Corpus (BNC) Spoken and Written Samplers. The BNC Sampler consists of one-fiftieth of the well-known British National Corpus of 100 million words, standing at 2 million words (Ooi, Tan & Chiang, 2007). The statistical ranking is done in terms of the log-likelihood (LL) value, which shows the overuse (with the plus sign) or underuse (with the minus sign) of an item in the first corpus (i.e. the Facebook corpus) against the reference corpus (i.e. the BNC Sampler). In Wmatrix, statistical significance begins with an LL value of approximately 7, since 6.63 is the cut-off for 99% confidence of significance (Rayson 2003, Rayson 2005).
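The calculation behind the LL value can be illustrated with a short sketch. It assumes the two-corpus log-likelihood formulation commonly used for keyword comparison, in which expected frequencies are derived from the sizes of the two corpora; it is not taken from Wmatrix itself. The function name and the item frequencies in the usage note are hypothetical, while the corpus sizes of about 22,000 and 2 million words come from this study.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Signed log-likelihood of one item in a study corpus vs a reference corpus.

    Assumption: the sign convention follows the text above, i.e. positive for
    overuse in the study corpus and negative for underuse.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref

    # Expected frequencies if the item were spread evenly across both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll *= 2

    # Compare relative frequencies to decide between overuse and underuse.
    overused = freq_study / size_study >= freq_ref / size_ref
    return ll if overused else -ll
```

For instance, under these assumptions a hypothetical item occurring 50 times in the 22,000-word corpus and 100 times in the 2-million-word sampler, `log_likelihood(50, 22_000, 100, 2_000_000)`, yields a positive value well above the 6.63 cut-off, marking it as overused.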

The following figures show the top 10 lexical items of the C corpus when compared with the BNC Spoken and Written Samplers.

Figure 2: The top 10 lexical items of the C corpus, compared with the BNC Spoken Sampler.

The item <d> appears at the top of the list. This represents the overuse of the various emoticons which contain the character <D>, such as <:D> and <XD> (more on emoticons in section 3.3). Interestingly, the next two items are pronouns, whose frequency can be regarded as a linguistic feature characteristic of Facebook as a medium. Note the use of the abbreviated form <u> to represent “you”. The items <LOL>, <haha> and <omg> can be considered notable features of CMC language. The overuse of the item <status> is interesting, for it is a feature which may be exclusive to Facebook as a medium. It can be further emphasised that CMC features seem to be overused much more than SgE features, especially since every one of the top ten items listed above is a feature of CMC recognisable to speakers of major English varieties.

Figure 3: The top 10 lexical items of the C corpus, compared with the BNC Written Sampler.

The above frequency listing brings some differing results. The majority of the top items here are pronouns such as <i>, <you>, <u>, <my>, <me> and <im>, probably due to the lower use of first and second person pronouns in standard written texts. As mentioned in chapter 2, Wmatrix contains a semantic tagger which assigns each lexical item to one of 21 major categories; if it is unable to do so, it relegates the item to a broad "Z99" category. Figure 4 shows the C corpus benchmarked against the BNC Spoken Sampler for such semantic information:

Figure 4: The top 15 semantic categories of the C corpus, compared with the BNC Spoken Sampler.

The large overuse of the unmatched category can be attributed to the relatively low proportion of unmatched items in standard written texts. This has been addressed above in section 2. Items such as <LOL>, <happy>, <fun> and <smile> are tagged under E4.1+. The overuse of terms associated with education, including lexical items such as <school>, <teacher> and <class>, reflects the age of the teenagers as a significant sociolinguistic variable. The majority of words tagged as S7.1 “Power, organising” are occurrences of the item <status>. There is an interesting overuse of items tagged as B1 “Anatomy and physiology”, such as <sleep>, <eyes> and <tired>. Some of these items have been wrongly tagged, such as <shit>, <suck> and <profile>, with the first two being used in the interjectory sense. Note that the overuse of S9 “Religion and the supernatural” is due mainly to the fact that <nt>, which accounted for 48 of the 98 items tagged in S9, was wrongly sorted. This item was cut from items such as <didnt> and <dont>. The following figure shows the top ten part-of-speech markers compared with the BNC Spoken Sampler.

Figure 5: A comparison of the top ten Wmatrix part-of-speech markers for the C-corpus with the BNC Spoken Sampler.

Note that <i> was wrongly classed as MC1 because it was written in lower case, and should have been classed as PPIS1. It is interesting to note the high overuse of items tagged as “NNU”, the unit of measurement category. Many of these items are abbreviations, often three-letter words made up of consonants, which mislead the software into categorising them as units of measurement (see figure 6). The following two tags, FO and ZZ1, also contain mainly mismatched items.

Figure 6: The top 25 items classed as “NNU” (units of measurement) in the C corpus, with explanatory annotation beside items.

Many of these are commonly used abbreviations for common lexical items which, according to background research, are also prominent in other modes of CMC, such as <tmr>, <ppl> and <nvm>. The high occurrence of <sch> as an abbreviation for “school” emphasises the age of the participants as students. It would be good to note again the relatively low occurrence of SgE discourse particles. The two figures following show the occurrences of SgE and CMC features.

Figure 7: List of SgE features.

The above figure demonstrates the large variation in spellings for SgE features, with differences such as <sia> and <siah>, or <lorh> and <lor>. Again, due to the insufficient amount of text in the corpus, the number of occurrences of such discourse particles is small.

Figure 8: List of emoticons.

It can be generally said that there is a large use of certain emoticons such as <:d>, <:)> and <:D>, and that those conveying happiness tend to occur more frequently than those connoting sadness or anger. Simply looking at the number of occurrences in each category, the number of items deriving from SgE seems to fall far short of the number of CMC features, especially as only emoticons (features meant to convey paralinguistic meanings lost through the electronic medium) are shown here. However, this is insufficient evidence to claim that CMC features are more popular or more commonly used than SgE features among teenagers, especially since the two have very different functions.

It was mentioned earlier that, due to the influence of the written medium, teenagers may tend to utilise CMC features while leaving behind certain SgE features which would have been used if the message were spoken. Qualitative analysis seems to suggest that this may be true. Five messages which demonstrate this are shown below:

Quantitative evidence also seems to suggest this. Two pages of text were examined and the number of sentences not exhibiting any SgE features was manually counted. The findings showed that, on preliminary examination, 70% of the sentences contained no SgE features. Though this may not be fully conclusive given that much of what characterises SgE is found in the phonology of the language, it does seem to suggest that the language of teenagers becomes more neutral or standardised due to the primarily textual nature of the electronic medium. Also, contrary to the preliminary hypothesis, many of the presupposed respellings based on SgE phonology were hardly used. For example, there were 132 occurrences of <that> against a single occurrence of <dat>, and 100 occurrences of <with> against a single <wif>. The complete spelling <want> occurred 48 times compared to 32 occurrences of <wan>. It can be said that, as a whole, SgE may not be used as much as Standard English when it is written down, because Singaporean teenagers use SgE mainly in the spoken medium. Conversely, CMC features, which are the only way of expressing prosodic and paralinguistic information, are used in this variety. However, certain sentences maintain the flavour of SgE syntax. The word “already”, or “liao”, in SgE is a particle added to the end of a clause to indicate the perfective aspect or a change of state.

Figure 9: Concordance showing use of SgE aspectual “already” or “liao” particle.

A total of 14 occurrences of this particle outweighs the 11 sentences written in the perfective aspect by use of standard English auxiliary verbs “has”, “have” or “had”. Another good example of how the flavour of SgE has been retained in specific senses which would be harder to convey in standard English, is the SgE use of “right” or “rite”, to mean “isn’t it” to confirm whatever has been said in the sentence.

Figure 10: Concordance showing use of “right” or “rite” SgE particle.

It is mentioned in section 2 that the Wmatrix software, whose lexicon currently contains 54776 single words and 18823 multi-word expressions, is built to handle standard written texts and not CMC language. In data analysis, unmatched items not recognised by the system, which could include features of CMC or SgE, are sorted under the semantic tag Z99. The top 7 lexical items of this semantic tag are listed below; the listing does not include emoticons such as <-.-> and <=d>:

Figure 2: Top 7 unmatched items in Z99 semantic tag in order of descending ranking, for males, females and combined corpora

As items such as <lol> and emoticons are screened out of this listing, it is clear that the top SgE features used by teenagers are <sia> and <hor>.

3.2 Comparison of corpora M and F

Figure 9: The top 10 lexical items of the M corpus, compared with the BNC Spoken Sampler.

Figure 10: The top 10 lexical items of the F corpus, compared with the BNC Spoken Sampler.

The two figures above show the top 10 lexical items of the M and F corpora compared against the BNC Spoken Sampler. The two corpora were not compared against the Written Sampler, to avoid findings similar to those shown above, where personal pronouns were ranked as more overused than features of CMC and SgE. Note that the overuse of <d>, <=> and <-.> is due to these symbols being part of emoticons. As little that has not yet been addressed can be inferred from the above figures, the M and F corpora were compared against each other.

Figure 11: The top 10 lexical items of the F corpus compared with the M corpus.

Figure 12: The top 10 lexical items of the M corpus compared with the F corpus.

Figures 11 and 12 show the top 10 lexical items of the M and F corpora when compared against each other. There seems to be a greater use of pronouns among females, such as the overuse of <she> and <they> (an observation supported by background research concerning the possible differences between genders). It is difficult to make any substantial inferences from the figures above, partly due to the scarcity of text in the relatively small corpora. However, more substantial findings about the words used by both genders can be reached using the semantic tagger.

Figure 13: The top 10 semantic categories of the M corpus compared with the F corpus.

The conclusions which can be drawn from the above figure are immediately obvious, and reinforce common stereotypes regarding the nature of the topics discussed by the two genders. For males, this can be substantiated by the overuse of lexical items tagged as K5.1 (Sports), G3 (Warfare), K5 (Sports and games) and S2.2 (People: Male).

Figure 14: The top 10 semantic categories of the F corpus compared with the M corpus.

Similarly, conclusions about the lexical items used by females can be drawn, with the overuse of words tagged under O4.2+ (Judgement of appearances: Beautiful) and S4 (Kin). A conclusion of a more linguistic nature is that females tend to use pronouns more often than males. Females tend to discuss personal safety and relationships more, as seen from A15+ and S4. With more use of items tagged under Q2.1 (Speech: Communicative), females can be perceived as being more communicative. Some items sorted under this tag include <talk>, <say> and <comment>. It was also observed during data collection that females tend to have longer status messages in general.

Using the same methodology as in section 3.1, the sentences in the M and F corpora that exhibited SgE features were manually counted for each corpus, and a notable difference can be observed. It was hypothesised that since females would usually be more expressive, and hence more casual in the way they chat and make Facebook posts, a higher percentage of their sentences would contain SgE particles. The percentage of sentences in the F corpus that contain at least one SgE particle is 37.5%, while that of the M corpus is 22.7%. This statistic suggests that females do use more SgE particles on Facebook than males.
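The manual count described above could be approximated automatically. The sketch below is only an illustration and is not the procedure used in this study: it treats each status message as one unit, uses a hypothetical particle list assembled from the forms mentioned in this report, and would still need manual checking, since a short form such as <la> can also occur in non-SgE contexts.

```python
import re

# Hypothetical particle list, assembled from spellings mentioned in this report.
SGE_PARTICLES = {
    "lah", "la", "leh", "lor", "lorh", "meh", "sia", "siah", "hor", "liao",
}


def share_with_particle(messages):
    """Return the fraction of messages containing at least one listed particle."""
    if not messages:
        return 0.0
    hits = 0
    for message in messages:
        # Lower-case the message and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", message.lower())
        if any(token in SGE_PARTICLES for token in tokens):
            hits += 1
    return hits / len(messages)


# Invented sample messages, for illustration only.
sample = ["so tired sia", "going to school now", "ok lor", "see you tomorrow"]
print(f"{share_with_particle(sample):.1%}")
```

Running the same function separately over the messages of each gender would give figures comparable in kind, though not necessarily in value, to the manually obtained percentages.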

4. Research Assessment

4.1 Limitations of techniques in obtaining information

In terms of research methodology, some limitations were encountered in the process of obtaining data. Firstly, to ensure that the data was sufficiently varied, “friends” of different schools and genders were required. This posed a challenge due to the need for a wide range of participants from different backgrounds. Gathering “friends” was a difficult step, as the percentage of people who accepted “friend requests” was low, which explains the relatively small corpus size of about 22,000 words.

Because the data required came mainly from the “friends’” posts on Facebook, and there has been no software to date that can extract all the posts of the “friends”, data had to be copied manually and pasted into a Word document as part of the data collation process. In addition, due to the lack of manpower, gathering data required much more time than usual. The small size of the corpus can mainly be attributed to this, for the number of people involved in gathering data for this research is not comparable to the manpower and resources available to large-scale projects that create representative corpora. Furthermore, the average status message posted on Facebook is relatively short in comparison to the large amounts of text available to the researcher in other modes of CMC such as chats and blogs. Although this is presented as a limitation on the amount of text the corpus contains, it can be said that the restricted length of each string of text has given rise to relatively interesting linguistic features which may be unique to the variety of language used on Facebook.

4.2 Limitations of methods of data analysis

WMatrix 2.0 software was used to aid analysis of the collated data from Facebook posts. The software allowed for more convenient and faster counting of word occurrences, part-of-speech tagging with the CLAWS tagger and semantic category tagging with USAS, and statistical comparison of corpora samples. However, the software itself is limited. CLAWS is said to be 96-97% accurate for standard written texts. At times, lexical items may be wrongly assigned to a semantic category. For example, the software separates the words <dont> and <didnt> into <do> and <did> respectively plus <nt>, and these <nt>s are wrongly sorted into tag S9 for “Religion and the Supernatural”.

The WMatrix 2.0 software is also unable to cope with non-standard written text, such as the SgE texts in the corpus. Features of SgE and CMC were by default tagged under the Z99 category (unmatched lexical items), which made it convenient to adapt the software to deal with CMC or SgE features. However, the USAS semantic tagger wrongly tagged some items, such that certain SgE particles were not tagged under Z99. The lack of corpora representative of SgE or CMC limited findings which would have distinguished features unique to this particular variety. Besides SgE particles, the WMatrix software was also unable to identify certain emoticons that contain a full stop (e.g., -.-, T.T). Instead of identifying these as whole emoticons, the software cut off any symbols after each full stop. Such problems were easily overcome by referring to the concordances; for example, this is how the overuse of <D> in the frequency listing was identified (refer to section 3.1). Another recurrent problem was that the program did not disregard the capitalisation of word-initial letters, leading to errors in the frequency listing: an item could be counted as having no occurrences without capitalisation when compared against the same item with capitalisation.
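The two tokenisation problems described above (emoticons split at the full stop, and capitalised forms counted separately) could in principle be handled by pre-processing the posts before frequency counting. The following is a minimal sketch, not part of WMatrix and not the procedure used in this study; the emoticon list, the function names and the sample posts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical emoticon list; a fuller list would be built from the concordances.
EMOTICONS = {"-.-", "T.T", ":D", ":)", ":("}

# Match whole emoticons first (longest first), then words with an optional
# apostrophe, then numbers, then any other single non-space symbol.
TOKEN_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
    + r"|[A-Za-z]+(?:'[A-Za-z]+)?|\d+|\S"
)


def tokenize(text):
    """Split a post into tokens while keeping listed emoticons intact."""
    return TOKEN_PATTERN.findall(text)


def frequency_list(posts):
    """Count tokens across posts, ignoring capitalisation of words."""
    counts = Counter()
    for post in posts:
        for token in tokenize(post):
            # Emoticons are case-sensitive (T.T, :D); everything else is lowercased
            key = token if token in EMOTICONS else token.lower()
            counts[key] += 1
    return counts
```

With this sketch, the invented posts "Lah lah :D" and "LAH T.T" give a count of 3 for <lah> and 1 each for the emoticons :D and T.T, so neither the full stop nor the capitalisation distorts the frequency listing.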

4.3 Suggestions for further research

If possible, increasing the corpus size would yield a corpus that better represents the teenage population in Singapore and would increase the reliability of the findings. Future research projects could verify whether our findings are consistent by acquiring data from a larger number of schools. A similar study could also investigate differences in the language usage of teenagers from co-ed and single-sex schools. This is of interest because it remains debatable whether the differences found to distinguish the language used by the genders can in fact be attributed to social environment and to differences in language across social networks. This is especially so since the data was not collected from every school in Singapore.

Before embarking on this research project, the sociolinguistic variable of ethnicity was also considered. However, due to time constraints, it was decided that the focus would be only on gender. Hence, another possible area of research would be to investigate how ethnicity affects the usage of English by teenagers, or by Singaporeans in general, over the medium of Facebook. In all, there is scope for research on sociolinguistic variables other than gender to further the understanding of how Facebook, a relatively new mode of CMC, can reformulate the use of English.

6. Bibliography

Crystal, D. (2003). The Cambridge Encyclopedia of the English Language. (2nd edition) Cambridge: Cambridge University Press.
Crystal, D. (2006). Language and the Internet. (2nd edition) Cambridge: Cambridge University Press.
Herring, S.C. (2000) "Gender differences in CMC: Findings and implications". Computer Professionals for Social Responsibility Journal (formerly Computer Professionals for Social Responsibility Newsletter) 18(1). 29 Apr 2007.
Herring, S.C. (ed.) (1996). Computer-Mediated Communication: Linguistic, Social and Cross-Cultural Perspectives. Amsterdam: John Benjamins.
Ooi, V., Tan P. & Chiang A. (2007). “Analyzing Personal Weblogs in Singapore English: the Wmatrix Approach”. eVariEng (Journal of the Research Unit for Variation, Contacts and Change in English) Vol. 2: Towards Multimedia in Corpus Studies. Finland: University of Helsinki. 31 Aug 2008.
Rayson, P. (2003) Matrix: A Statistical Method and Software Tool for Linguistic Analysis through Corpus Comparison. Ph.D. thesis, Lancaster University. 29 Apr 2007.
Rayson, P. (2005) Wmatrix: A Web-based Corpus Processing Environment. Computing Department, Lancaster University. 29 Apr 2007.