Rule-based vs. SMT: Idiomatic expressions and collocations

by bala23 (student)

17 June 2011

        Abstract

The complexity of languages is reinforced by elements such as collocations and idiomatic expressions. Since the creation of Machine Translation (MT) systems, collocations and idioms have complicated translation because of their syntax as well as their omnipresence. In this paper, we describe how two commercial MT systems handle them and the results their methods produce. The approaches examined are the rule-based one (Systran) and the statistical one (Google), for the language pair French-English. The results show that the adjacency of elements and the insertion of alien elements into segments influence output quality, as does the presence of colorful and metaphorical elements.

INTRODUCTION

E. Wehrli and D. Anastasiou pointed out that current Machine Translation systems, both rule-based and statistical, have difficulty handling idioms and collocations. The key to a proper translation is identifying those elements and extracting them from the source text so that a correct output can be proposed.

When systems fail to do so, the target text is often too literal and compositional (the meanings of the individual terms are simply summed to form a sentence), and it sounds unnatural.

As a result, bilingual resources are highly needed. Section 2 briefly explains what collocations and idioms are and which problems they pose, Section 3 runs our set of idiomatic expressions and collocations through a rule-based MT tool and a statistical one, and Section 4 presents the conclusions.

COLLOCATIONS AND IDIOMS

Collocations and idioms are subclasses of multiword expressions whose elements stand in a given syntactic relation. The father of collocations, the British linguist J.R. Firth, loosely defined them with the following statement: “You shall know a word by the company it keeps”. The idea conveyed by the term and its definition was later specified by Lehr, who defined collocations as “some conventional way of saying things”.

Collocations are arbitrary and recurrent (Benson 1990). Non-native speakers of a given language often have difficulty expressing their ideas appropriately: in most cases, they formulate an idea by translating word for word what they would say in their native language. Using collocations makes a speaker’s language more natural and easier to understand. Learning them also matters because they are not exceptions, and because it is easier for the brain to memorize chunks.

Collocations are pairs of words with a strong tendency to co-occur, where co-occurrence is defined in terms of:

proximity in space (e.g. distance of 5 words between two terms forming a collocation)

syntactic relation (e.g. adj+noun, verb+noun)

textual segments (e.g. sentence, alexandrines, paragraph, page, paper…)
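The distance-based criterion above can be illustrated with a minimal sketch. The function name `window_cooccurrences`, the window size of 5 and the sample sentence are assumptions chosen for illustration, not part of the systems studied here; the code simply counts unordered word pairs that appear within a fixed number of tokens of each other.

```python
from collections import Counter

def window_cooccurrences(tokens, window=5):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most `window` positions from the current token.
        for right in tokens[i + 1 : i + 1 + window]:
            counts[tuple(sorted((left, right)))] += 1
    return counts

# Hypothetical sample sentence, tokenized on whitespace.
tokens = "he is a heavy smoker and she is a very heavy smoker".split()
pairs = window_cooccurrences(tokens, window=5)
print(pairs[("heavy", "smoker")])
```

A segment-based variant would count pairs inside the same sentence or paragraph instead of inside a sliding window; a syntactic variant would need a parser to restrict the pairs to relations such as adj+noun.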

They can be idioms with an opaque meaning (a couch potato, to go Dutch…), compounds (short story, figure skating, life support machine), lexical collocations (commit suicide, ring a bell, roaring lions), grammatical collocations (rely on, easy about, hopeless at)…

The fact that co-occurrences can be either segment-based or distance-based constitutes another problem for an MT tool, which can be confused when other words are inserted between the elements of a distance-based co-occurrence.

Collocations are domain-dependent: some of them are hard for the layman to understand. For example, economists distinguish between “government spending” on newly produced goods and services, such as paying a company to build a new highway, and government spending on transfer payments, such as welfare payments, which are intended to redistribute income equitably. For the layman, government spending covers goods and services plus transfer payments.

Collocations are seen as composed word pairs: one element, the base, is said to be “free” and retains its meaning, while the other, the collocate, is lexically determined and contributes a meaning that neither word has when standing alone. Examples include a school of fish, heavy smoker, a pride of lions…

As collocations and idioms do not translate well literally, proper bilingual resources are necessary: they must be adequately identified to produce a fluent target text, otherwise the consequences are likely to be dramatic. Depending on the case, the result may be understandable or incomprehensible.
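A small sketch may help show why inserted words are a problem. It assumes a lookup that stores a collocation as two words; the helper names `find_adjacent` and `find_within_distance`, the `max_gap` parameter and the sample sentences are hypothetical and do not describe how Systran or Google actually work. A strictly adjacent match misses the collocation as soon as a word is inserted, whereas a distance-tolerant match still finds it.

```python
def find_adjacent(tokens, first, second):
    """Return indices where `first` is immediately followed by `second`."""
    return [i for i in range(len(tokens) - 1)
            if tokens[i] == first and tokens[i + 1] == second]

def find_within_distance(tokens, first, second, max_gap=3):
    """Return (i, j) pairs where `second` follows `first` with at most `max_gap` words between them."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        # j runs over the positions after i, allowing up to `max_gap` inserted words.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == second:
                hits.append((i, j))
    return hits

# Hypothetical sentences: the second one inserts a word inside the collocation.
plain = "he is a heavy smoker".split()
inserted = "he is a heavy cigar smoker".split()

print(find_adjacent(plain, "heavy", "smoker"))            # adjacent form is found
print(find_adjacent(inserted, "heavy", "smoker"))         # inserted word breaks the match
print(find_within_distance(inserted, "heavy", "smoker"))  # gap-tolerant match still succeeds
```

The trade-off is that a larger `max_gap` recovers more interrupted collocations but also pairs words that merely happen to be near each other.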

EXPERIMENTS

GOOGLE TRANSLATION TOOL

The first ideas of statistical machine translation were introduced in 1949 by Warren Weaver, a pioneer of machine translation. Google Translate is a free statistical machine translation system; it is corpus-based and data-driven, and it is widely used by corporations, internet portals and individuals. It relies mainly on European Parliament and United Nations bilingual resources and is trained on large amounts of text to align words and phrases automatically. Its basic principles lie in two factors:

faithfulness to the source language

fluency in the target language
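These two factors correspond to the two terms of the classic noisy-channel formulation of statistical machine translation. The notation below is an assumption added for illustration (f is the French source sentence, e a candidate English translation), and the source does not state that Google Translate uses exactly this form:

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \underbrace{P(f \mid e)}_{\text{translation model (faithfulness)}}
              \;\cdot\; \underbrace{P(e)}_{\text{language model (fluency)}}
```

The second equality follows from Bayes' rule, since P(f) is constant for a given source sentence. The translation model is estimated from bilingual parallel text and the language model from monolingual text in the target language.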

The quality of the output depends on the language pair: Spanish-French or Dutch-German will be easier to handle than French-German. As Google Translate relies on statistical matching rather than on a dictionary and grammar approach, oddities can occur: swapped terms, obvious errors, nonsensical sentences… Such a system requires abundant high-quality parallel corpora, and such data is notably expensive. "Solid base for a usable statistical machine translation system: bilingual text corpus of more than 1 million words + two monolingual corpora of each more than 1 billion words"

However, if ...
