Title: Where Interlingua Can Make a Difference
Authors: David M. Zajic (MITRE Corporation and University of Maryland)

Phone: (410) 850-5815

Fax: (410) 850-5459

Email: dzajic@mitre.org

Mailing Address:

The MITRE Corporation

Commons Corporate Center

7467 Ridge Rd, Suite #140

Hanover, MD 21076

Keith J. Miller (MITRE Corporation and Georgetown University)

Phone: (703) 883-6920

Fax: (703) 883-1279

Email: keith@mitre.org

Mailing Address:

The MITRE Corporation

Washington C3 Center

1820 Dolley Madison Boulevard

McLean, VA 22102-3481

Abstract: Furnished with an English text and its equivalent in 13 foreign languages, we set out to determine the potential for improving the quality of translation results by an Interlingual (IL) approach to machine translation (MT) as compared to transfer-based systems. In this paper, we analyze the errors made by two commercial transfer-based MT systems, provide an observational classification of the errors, and group the errors according to whether or not an Interlingual approach would improve system output. We describe an existing IL, Composed Lexical Conceptual Structures, and illustrate with examples how some of the observed errors might be corrected using this IL representation.
Subject area keywords: Interlingual Machine Translation, LCS, Machine Translation




Introduction

Though the debate between proponents of transfer-based machine translation (MT) and interlingual (IL) MT is far from settled, it is not a point of contention that a successful IL system would provide certain advantages. From an engineering perspective, the principal benefit of an IL system is the reduction in the number of modules necessary to enable translation between new language pairs. From a research perspective, attempts to design a workable IL are desirable in that they provide a more psychologically explanatory view of the relationship between a language's surface forms and the underlying meaning representations of those forms. Research IL systems also provide an environment for testing theories regarding the universality of semantic structure in the world's languages.
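The engineering argument can be made concrete with a small count. The sketch below assumes the usual textbook accounting, which is not spelled out in this paper: a transfer architecture needs one pair-specific transfer module per ordered language pair, while an IL architecture needs one analyzer and one generator per language. The function names are illustrative only.

```python
def transfer_modules(n_languages: int) -> int:
    """Pair-specific modules: one per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages: int) -> int:
    """One analyzer (text -> IL) plus one generator (IL -> text) per language."""
    return 2 * n_languages


if __name__ == "__main__":
    # Compare the two counts for a few inventory sizes.
    for n in (2, 5, 13):
        print(n, transfer_modules(n), interlingua_modules(n))
```

Under these assumptions the transfer count grows quadratically with the number of languages while the IL count grows linearly, which is the sense in which adding a new language is cheaper for an IL system.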

 

Our interest in this workshop stems from practical concerns. We are currently involved in an MT project, CyberTrans, which provides translation services to a wide community of users. By providing a common interface to two large commercial transfer-based systems and several smaller translation systems, CyberTrans gives this community easy access to a considerable suite of MT tools. The maintenance of separate domain-specific lexicons for each of these systems, however, is a large task. In an effort to make this task more manageable, we have undertaken a project known as the Lexicon Service Bureau (LSB). As described in (Miller & Zajic, 1998), the goal of the LSB project is to provide a unified lexical resource from which the system-specific lexicons can be generated for the transfer-based systems. The design for the LSB calls for a single monolingual lexicon for each language, containing morphological, syntactic and semantic information common to the supported MT systems. Links between these lexicons provide the information necessary to generate the system-specific bilingual lexicons. Since source language (SL) to target language (TL) mappings are not all one-to-one, the design calls for links to be made by means of a pivot language (or IL). Thus, we are interested in determining what such a pivot must entail. Furthermore, we are interested in discovering which types of MT problems are particularly suited to an IL approach. That is, in which areas could an IL approach to MT solve problems encountered by the state-of-the-art transfer-based MT systems that we currently provide to our users?
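As a rough illustration of linking monolingual lexicons through a pivot, the sketch below joins two toy lexicons on shared concept identifiers to derive bilingual entries. It is not the LSB design itself: the concept identifiers (FISH_ANIMAL, FISH_FOOD), the dictionary layout, and the function name are all assumptions made for the example, which borrows the Spanish pez/pescado contrast cited from Dorr (1993) later in this paper.

```python
from collections import defaultdict

# Hypothetical monolingual lexicons: each word is linked to pivot concepts.
# The concept identifiers are invented for illustration.
SPANISH = {"pez": {"FISH_ANIMAL"}, "pescado": {"FISH_FOOD"}}
ENGLISH = {"fish": {"FISH_ANIMAL", "FISH_FOOD"}}


def bilingual_lexicon(source, target):
    """Derive source-word -> {concept: target words} by joining on pivot concepts."""
    by_concept = defaultdict(set)
    for word, concepts in target.items():
        for concept in concepts:
            by_concept[concept].add(word)
    return {
        word: {c: sorted(by_concept.get(c, ())) for c in sorted(concepts)}
        for word, concepts in source.items()
    }


if __name__ == "__main__":
    print(bilingual_lexicon(ENGLISH, SPANISH))
    print(bilingual_lexicon(SPANISH, ENGLISH))
```

Keeping the concept as the key of each derived entry preserves the information needed to choose between pez and pescado when translating from English, which a direct word-to-word link would lose.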

Goals, Resources and Procedures

In the operational setting motivating our work, analysts require translations into English of documents in a variety of languages. Thus, our observations are oriented towards improving the quality of translations in this setting. Our goal is to identify weaknesses inherent in transfer-based MT systems which can be addressed by IL-based MT systems.
The source material for our investigation is an article from the UNESCO Courier (Otero, 1997). This article is available in human translations into thirteen languages. No one version is distinguished as the original version. We chose to focus on the English, French, Spanish, Russian, and German versions, as these are the languages for which we have access to MT and human resources. The MT resources consist of two transfer-based MT systems (Systran and an old version of Globalink (GTS)), and a system that provides English glosses of foreign language texts (i.e., substitution-based MT). The human translators provided word-for-word translations of the non-English texts.

The first step of our investigation was to identify classes of errors typical of transfer-based MT systems. We submitted each of the four non-English texts to the two transfer-based MT systems for translation into English. Comparing the output of the MT systems with the original texts and with the native English article, we generated a list of errors. From this list, we developed an observational classification of the error types. In the second phase of our investigation, we partitioned the error classes into three broad categories: those which will not be ameliorated by IL-based MT because they are caused by lack of resources, those which will not be ameliorated because they involve problems beyond the scope of MT, and those which will be ameliorated by IL-based MT. Finally, we chose example sentences from the source text and created IL representations of them, loosely based on Dorr's Lexical Conceptual Structures (Dorr, 1993), to demonstrate how an IL representation could improve translation quality. These examples will serve as a reference point for discussion during the present IL workshop.

Classification of Errors

 

We submitted each of the foreign language documents to the two transfer-based systems for translation into English. We regarded as errors any text in the MT output which did not properly reflect the semantic content of the original document as we understood it, or which we, as native English speakers, did not accept as well-formed English. For each error encountered in the output, we entered the following information into a grid:
· the original source language text
· the output of each of the MT systems (For each segment containing an error by either MT system, the output of both systems was entered into the grid. It is interesting to note that the two systems often made errors on the same segment; sometimes the errors were the same, but often they were not.)
· the word-for-word gloss of the segment
· the corresponding segment from the native English version of the document
· remarks concerning the classification of the error.

Classification of the errors was based on all of the factors listed above, as well as the authors' knowledge of the systems in question. The result is the observational classification of errors below. We do not claim that this is a complete classification of errors made by transfer-based MT systems, that these errors are inherent to transfer-based systems, or even that this is the correct classification. The purpose of the classification is merely to separate those problems that are amenable to an IL treatment from those that are not. Table 1 shows the categories of errors which will be described in the next section.

Table 1

Error Classes                                      Amenable to IL
Incorrect Word Sense                               Yes
Generation Problems from Transfer Rules            Yes
Thematic Roles                                     Yes
Morphologic Ambiguity in Source Text               Yes
Aspect, Tense, Mood: Sentence-Level Phenomena      Yes
Compositional Phrase Units                         Yes
Lexical Mismatch                                   Yes
Conflation                                         Yes
Errors in Source Data                              No - Lack of Resource
Gaps in Lexicon                                    No - Lack of Resource
Failure in Morphological Analysis                  No - Lack of Resource
Parsing Errors                                     No - Lack of Resource
Named Entities                                     No - Lack of Resource
Non-Compositional Phrase Units                     No - Lack of Resource
Target Word Sense Override                         No - AI Complete
Implications, Entailments                          No - AI Complete
Discourse-Level Phenomena                          No - AI Complete
Humor                                              No - AI Complete
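The error grid and the classification in Table 1 can be pictured as a simple record type plus a lookup table. The following is only a sketch of how such data might be held. The class name ErrorRecord, its field names, and the category labels are assumptions made for illustration, not a description of the tooling actually used. The sample record reuses a Spanish segment discussed later in the paper; its word-for-word gloss is left empty because the gloss is not reproduced in the text.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict

# Error classes from Table 1, mapped to the three broad categories.
ERROR_CLASSES = {
    "Incorrect Word Sense": "amenable to IL",
    "Generation Problems from Transfer Rules": "amenable to IL",
    "Thematic Roles": "amenable to IL",
    "Morphologic Ambiguity in Source Text": "amenable to IL",
    "Aspect, Tense, Mood: Sentence-Level Phenomena": "amenable to IL",
    "Compositional Phrase Units": "amenable to IL",
    "Lexical Mismatch": "amenable to IL",
    "Conflation": "amenable to IL",
    "Errors in Source Data": "lack of resource",
    "Gaps in Lexicon": "lack of resource",
    "Failure in Morphological Analysis": "lack of resource",
    "Parsing Errors": "lack of resource",
    "Named Entities": "lack of resource",
    "Non-Compositional Phrase Units": "lack of resource",
    "Target Word Sense Override": "AI-complete",
    "Implications, Entailments": "AI-complete",
    "Discourse-Level Phenomena": "AI-complete",
    "Humor": "AI-complete",
}


@dataclass
class ErrorRecord:
    """One row of the error grid (field names are hypothetical)."""
    source_text: str
    mt_output: Dict[str, str]  # MT system name -> its output for the segment
    gloss: str                 # word-for-word gloss of the segment
    reference: str             # segment from the native English version
    error_class: str           # one of the keys of ERROR_CLASSES
    remarks: str = ""

    def category(self) -> str:
        return ERROR_CLASSES[self.error_class]


record = ErrorRecord(
    source_text="préstamos de una cuantía media inferior a quinientos dólares.",
    mt_output={"GTS": "loans of an inferior mean quantity to five hundred dollars."},
    gloss="",  # not reproduced in the paper
    reference="loans averaging less than five hundred dollars.",
    error_class="Compositional Phrase Units",
)

if __name__ == "__main__":
    print(record.category())
    print(Counter(ERROR_CLASSES.values()))
```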

Where IL Will Make a Difference

We have discovered that eight of our classes will benefit from IL-based MT. We regard these errors as inherent (though probably not unique) problems of transfer-based MT.

1. Incorrect Word Sense. Since an IL representation of a source language sentence is intended to be unambiguous and meaning-preserving, cases involving problems with word sense will benefit from an IL treatment. This is a vast class, and includes issues such as polysemy, semantic content of prepositions, and fine-grained sense distinctions. For example, the French document contains the subheading "Un succès éclatant" (lit. "a success bursting"), which was translated as "A Vivid Success", "A Bright Success", and "A Stunning Success" by Globalink, Systran, and a human translator, respectively. None of these is necessarily incorrect, but perhaps one captures the meaning of the original more precisely than the others. As a more mundane example, the noun portefeuille has the multiple senses "wallet" and "portfolio", both of which are potentially valid in a financial domain. A clear (IL) meaning representation of the source sentence will enable the generation of the correct form in the target language.

2. Generation Problems Attributable to Transfer Rules. Errors such as French "spécialisée dans la microfinance" translating to "specialized in the microfinance" seem to be the results of specific transfer rules. In the most extreme case, transfer-based systems default to substitution systems, often resulting in so-called "word salad".

3. Thematic Roles. A verb has a set of argument structures or thematic grids in which it may appear. Arguments are associated with these thematic roles, and this association is reflected in the content of IL representations. This will help avoid mistranslations such as Spanish "brinda... asistencia técnica a" to English "offers to technical attendance to...". Aside from the word sense problem (asistencia should be assistance in this case), the more serious problem is that attendance appears to be the recipient of some unspecified direct object.

4. Morphologic Ambiguity in the Source Text. There are many instances in human language where the surface form of a word and its context are not sufficient to derive the information normally imparted by morphology. For instance, in the German "die in den vergangenen fünf Jahren Darlehen für mehr als 1 Milliarde Dollar (durchschnittlich weniger als 500 Dollar) an Kleinstunternehmen vergeben haben", Darlehen ("loan") is ambiguous with respect to number. GTS produced "which have distributed in the past five years loan for more than 1 billion dollar," incorrectly guessing singular. As humans with world knowledge, we understand that banks make more than one loan every five years, and the parenthetical comment about the average loan amount reinforces this. Assuming an analysis component which is powerful enough to figure out cases like this, an IL-based approach will address this type of error, since the IL representation will contain information about number, gender, thematic role, etc.

5. Sentence-Level Phenomena. In this category we consider such phenomena as tense, mood, and aspect. These phenomena associated with verbs express important semantic content. As before, we assume that substantial problems of analysis can be solved, and that IL representations contain representations of the semantic content encoded by tense, mood and aspect. Consider this example, from the Spanish version:

Spanish: La transición... iba a durar dos años...

GTS: The transition... was going to last two years...

Systran: The transition... were going to last two years...

H: The transition required two years' work...

An IL representation of the original Spanish would represent the transition as something which happened and did last two years. This would avoid the unclear English translations, in which it is not clear whether the transition lasted two years, or was ever accomplished.

6. Compositional Phrasal Units. Non-compositional phrases must appear in a lexicon in order to be successfully translated, because their meaning cannot be derived from their components, e.g. "kick the bucket." However, many compositional phrases should be quite intelligible. Consider the Spanish:

Spanish: préstamos de una cuantía media inferior a quinientos dólares.

GTS: loans of an inferior mean quantity to five hundred dollars.

H: loans averaging less than five hundred dollars.

In an IL representation, this property should appear as a modifier to loans. In the generation phase, it will be realized using the appropriate lexical elements from the target language.

7. Lexical Mismatch. As in the example of Spanish pescado (fish as a food) versus pez (fish as an animal) given in Dorr (1993), it is common for a word in a source language to contain more or less information than its counterpart in a target language. In English, there are three lexical items, who, that and which, that correspond to French qui. Without information concerning the referent (human or not) and the context (restrictive or not), transfer-based MT systems regularly make errors in choosing the correct English word.

8. Lexical Incorporation / Conflation. Conflation refers to the inclusion of an argument within a word's meaning. As an example, the English verb telephone can be thought of as call using a telephone, and llamar por teléfono is a good Spanish translation of this verb. Although we did not find errors of this type in the sample data, we include it here as a situation in which an IL-based approach will be particularly helpful.
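The lexical mismatch involving French qui can be illustrated with a toy selection rule that consumes exactly the two pieces of information mentioned above: whether the referent is human and whether the clause is restrictive. The rule below is a deliberately simplified assumption about English usage (actual usage is more flexible), and the function name is hypothetical. It only shows how an IL representation that carries these features makes the choice mechanical at generation time.

```python
def english_relative_pronoun(referent_is_human: bool, restrictive: bool) -> str:
    """Pick an English counterpart of French 'qui' (simplified, assumed rule)."""
    if referent_is_human:
        return "who"
    # Non-human referent: 'that' for restrictive clauses, 'which' otherwise.
    return "that" if restrictive else "which"


if __name__ == "__main__":
    for human in (True, False):
        for restrictive in (True, False):
            print(human, restrictive, english_relative_pronoun(human, restrictive))
```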

Lack of Resources

We identify six categories of errors that we do not believe will be solved by IL-based MT systems, because the error arises from a general lack of resources. The factor which disrupts successful translation occurs before transfer rules are applied, or before the IL representation would be generated. We do not regard these errors as inherent to transfer-based systems.

1. Errors in source data. Some words are not recognized because of errors in the source data. The Spanish text contained hyphenated words in the middle of a line of running text (e.g. infe - rior for inferior), presumably an artifact of the conversion of the document from a previous format. The missing resource in this case is appropriate pre-processing tools to recognize and correct such errors. Other types of errors which are best addressed by pre-processing include spelling correction, diacritic reinsertion, and encoding recognition. These and other pre-processing tasks are among the primary services provided by CyberTrans.

2. Gaps in the lexicon. Lexicon acquisition is a problem for all types of MT systems. For example, neither MT system knew the German verb unterstützt.

3. Failure in morphological analysis. MT systems commonly try to perform morphological analysis on unknown words, or break a word into smaller chunks in order to match them with known lexical entries. Neither system was able to analyze the unusual German Kleinstunternehmer as klein ("small") + st (?) + unternehmer ("businessman"). In contrast, French micro-enterprise was inappropriately separated into micro ("microphone") + enterprise ("company").

4. Parsing errors. Robust parsers are another scarce resource. Consider the following example:

Spanish: En cinco años BancoSol se ha convertido...

GTS: In five years Bancosol itself has converted...

Systran: Into five BancoSol years one has become...

H: In five years, BancoSol has become...

We assume that Systran analyzed BancoSol as a modifier of años and, finding no subject for the reflexive verb, inserted the generic pronoun one.

5. Named entities. The author's name was often not recognized as an element not to be translated. It was rendered variously as "Marrying Otero," "Maria Hill" and "Maria Knoll". We have observed similar problems occurring with e-mail addresses. Comprehensive gazetteers are also scarce resources.

6. Non-compositional Phrasal Units. This is similar to class #2 in that it is a problem that points to gaps in the MT systems' lexicons. For instance, Systran translated the French phrase à but non-lucratif as "with nonlucrative goal", rather than as "non-profit". IL representations will not improve this situation unless they include a substantial amount of real-world knowledge. In this case, for a system to generate "non-profit" from the IL, it would need to know that "non-profit" describes an entity's goals, rather than, e.g., its size, location or properties when submerged in water. However, another writer might have rendered the concept "non-profit" as "with nonlucrative methodology" or "with nonlucrative status." A robust representation of what it means for an entity to be "non-profit" seems well beyond the scope of existing interlingual representations.

AI-Complete Systems Are Not Yet Available.

The last class contrasts a resource problem (gaps in the lexicon) with tasks that we do not expect MT systems to perform (figuring out the meaning of non-compositional phrases on the fly). The following error classes are also not expected to benefit from IL-based MT, because they lie well beyond the current state of the art in human language processing.

1. Target word sense override. This is a rather subtle case in which an appropriate target language word is chosen to express some element of the internal representation; however, in the final surface form, some other sense of the target word seems to override the intended sense. Consider:

German: die Institutionen... technisch unterstützt

GTS: [the] institution... technically supports

H: the institution... provides technical support.

In this case, technically in the sense of "having specialized knowledge of a scientific topic" is certainly the correct target word; however, it seems to be overshadowed by the sense of "being marked by a strict or legalistic interpretation."

2. Implications or Entailments. A translation may bear implications or entailments not borne by the original. Consider:

Spanish: que brinda actualmente...

GTS: that offers currently...

Systran: that at the moment offers...

The second translation implies that the offer will eventually be rescinded, while the first translation and the original do not make this implication.

3. Discourse-level Phenomena. Although we have not (yet) encountered any discourse-level translation problems in the text in question, this class is a natural progression from the sentence-level phenomena. Given a large enough body of text, discourse-level phenomena would certainly present themselves. For a discussion of IL treatment of discourse-level phenomena, see (LuperFoy & Miller, 1997).

4. Humor. The titles of both the English and Russian articles make puns on the entity name ACCION International, while the French, Spanish and German titles do not. Fortunately, automatic generation of cute journalistic headline puns is not currently a major goal of the MT research community.

An Interlingua: Lexical Conceptual Structures

We will represent some sentences from the source text in the style of lexical conceptual structures (LCS). A sentence is represented by a composed LCS (CLCS) and a lexical entry is represented by a root LCS (RLCS). At present, verbs, prepositions and nouns which denote events or states are assigned lexical entries; nouns which do not represent events or states are treated as atomic entities.

Table 2

FIELD               THEME                     REFERENCE OBJECTS
no field            {thing, event, state}     {event, state}
locational          {thing, event, state}     {thing, event}
possessional        {thing, event, state}     {thing}
temporal            {event, state}            {time, event, state}
identificational    {thing, event, state}     {thing, event, property}
existential         {thing, event, state}     EXIST
circumstantial      {thing}                   {event, state}
perceptual          {thing}                   {thing, event, state}

The core insight which allows CLCSs to represent a wide range of sentences is that the semantics of motion and location serve as an analogy for many other areas of human understanding (Gruber, 1965). For instance, by regarding time as a one-dimensional space, events can be positioned in time (sentence 1), just as things or events are positioned in space (sentence 2).

(1) The meeting is at 3:30.

(2) The meeting is in Dr. Dorr's office.

The distinction here is among semantic fields, temporal and locational respectively. A semantic field can be described as an analogy between physical motion and a motion analog and between physical position and a position analog. Semantic fields can be described by three parameters (Jackendoff, 1983):

(1) what type of entities can assume the role of theme, i.e., undergo the motion analog or be located by the position analog,

(2) what type of entities can serve as reference objects to define the path of the motion analog or location of the position analog, and

(3) what concept is represented by analogy to motion and location.

In order to understand the different semantic fields, we must first know what values can fill in the parameters described above. The types of entities, or semantic types, which will serve to restrict themes and reference objects with respect to fields are as follows. Things include all physical objects (e.g. table, telephone, James Garfield) as well as non-physical human concepts (e.g. idea, hope, intelligence, Calculus, "Romeo and Juliet") and human conceptual organizations applied to physical objects (e.g. the faculty of the Computer Science department, MITRE Corporation). States are sets of conditions which apply to some universe over time. Events are changes in those conditions: in an event, something happens. Times are points or intervals along the one-dimensional timeline. Properties are characteristics which can apply to things.

In (Dorr, 1993) nine semantic fields are identified: locational, possessional, identificational, perceptual, existential, circumstantial, intentional, instrumental and temporal. In addition, some causative events may exist without any state. Instrumental and intentional fields do not apply to events and states, but are used in adding additional information about an event or state to a CLCS. Table 2 shows the restrictions on semantic types as they serve as parameters of themes.

For the locational field, the relevant issues are where a thing is located, where an event takes place, and where a state holds. If a reference object is a thing, the location of the theme is given relative to the location of the reference object. If the reference object is an event, the location of the theme is given relative to where the event takes place. In the possessional field, the theme is possessed by the reference object. More recent work has downplayed the use of the temporal field in favor of representing temporal information in modifiers. Themes of the identificational field are instances of or identical to a reference object, or bear a property. In the existential field, the theme is understood to exist for things, to have happened for events, or to hold true for states. In the circumstantial field, the theme is a character in a reference situation (event or state), as in "Ludwig continues to compose string quartets" or "Xavier avoids being rude." Lastly, the theme of the perceptual field perceives a reference thing or situation.
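The type restrictions listed in Table 2 can be read as a lookup from a field to the semantic types permitted for its theme and its reference object. The sketch below encodes the table in that form and checks a theme/reference pairing against it. The dictionary layout and the function name are assumptions made for illustration, and the existential reference object is treated here as a single special marker, EXIST.

```python
# Table 2 as a lookup: field -> (allowed theme types, allowed reference types).
FIELD_RESTRICTIONS = {
    "no field": ({"thing", "event", "state"}, {"event", "state"}),
    "locational": ({"thing", "event", "state"}, {"thing", "event"}),
    "possessional": ({"thing", "event", "state"}, {"thing"}),
    "temporal": ({"event", "state"}, {"time", "event", "state"}),
    "identificational": ({"thing", "event", "state"}, {"thing", "event", "property"}),
    "existential": ({"thing", "event", "state"}, {"EXIST"}),  # special marker
    "circumstantial": ({"thing"}, {"event", "state"}),
    "perceptual": ({"thing"}, {"thing", "event", "state"}),
}


def is_well_typed(field: str, theme_type: str, reference_type: str) -> bool:
    """Check a theme/reference pairing against the restrictions for a field."""
    themes, references = FIELD_RESTRICTIONS[field]
    return theme_type in themes and reference_type in references


if __name__ == "__main__":
    # "The meeting is at 3:30": an event located at a time.
    print(is_well_typed("temporal", "event", "time"))
    # "The meeting is in Dr. Dorr's office": an event located at a thing.
    print(is_well_typed("locational", "event", "thing"))
    # A thing cannot be the theme of the temporal field.
    print(is_well_typed("temporal", "thing", "time"))
```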

A place is a prepositional relation to one or more reference objects. The reference object can be any of the semantic types, and the prepositional relation describes the position of a theme with respect to the reference object(s) (e.g., above, at, between). A prepositional relation may make use of zero or more reference objects, as in

"here"

"above the printer"

"between World War I and World War II"

A path is a path operation applied to one or more places.

We will not attempt to enumerate the primitives of the LCS system here; however, we will make use of representative primitives in our examples.

CLCSs take the form of trees with each node of the tree an instance of a common node structure. In our examples we will make use of an informal adaptation of the LCS short-form notation. Moreover, the process of encoding sentences as CLCSs is, at present, something of a craft, and quite a non-trivial craft for complex sentences, such as those found in the article by Otero.

Applying the Craft: Mapping Text into Composed Lexical Conceptual Structures

We will present an example CLCS representation of a fragment from the Spanish text that was translated incorrectly by the transfer-based MT systems. We will show how the representation is created and suggest that better results can be obtained using the IL representation.

Consider the note about the author: "María Otero, estadounidense de origen boliviano, es vicepresidenta de la asociación Acción Internacional." This is translated by GTS as "Maria Hill, American of Bolivian origin, it is vice president of the association International Action." Our focus will be on the incorrect addition of the default subject it in the English translation. This is an error of the class Generation Problems Attributable to Transfer Rules: it seems that the appositive prevents the system from recognizing María Otero as the subject of ser.

The first step is to find the LCS definition of the verb ser. It is given here in a simplified form. The first element is the type of the node; the second is an entity type. Then follows either a primitive and a field, or an indication of the theta role of the variable to be filled in when the definition is composed to represent actual text. Theta roles are shown in italics.
 

(ROOT: STATE BE IDENTIFICATIONAL
  (SUBJECT: THING THEME)
  (ARGUMENT: POSITION AT IDENTIFICATIONAL
    (SUBJECT: THING THEME)
    (ARGUMENT: THING/PROPERTY IDENTIFICATIONAL PREDICATE)))

Not shown here are the variable specifications that put constraints, such as +HUMAN, on the entities which may fill a variable.
Now we will fill in the variables in the definition with text from the sample text.
 
(ROOT: STATE BE IDENTIFICATIONAL
  (SUBJECT: "Maria Otero")
  (ARGUMENT: POSITION AT IDENTIFICATIONAL
    (SUBJECT: "Maria Otero")
    (ARGUMENT: "vicepresidenta"
      (MODIFIER: "de la asociación Acción Internacional")))
  (MODIFIER: "estadounidense de origen boliviano"))

Notice that the appositive describing Otero and the prepositional phrase modifying vicepresidenta are encoded for the moment in unanalyzed modifier nodes. The LCS system does have more sophisticated ways of representing prepositions, and future work may include qualia-based analysis of nouns, as in Pustejovsky (1995).

Finally, we will realize the CLCS in English. A verb is sought whose definition is compatible with the structure of the CLCS. In this case be is the unsurprising candidate. Its LCS definition is identical to that of ser. A generation system based on this CLCS might produce the English surface form: "Maria Otero, a Bolivian-born American, is the vice president of Acción Internacional."
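The search for a compatible target verb can be pictured, in a much reduced form, as a lookup over root-LCS signatures. The sketch below compares only the root triple (type, primitive, field), whereas a real generator would have to match the whole structure and its variable constraints. The toy lexicon, including the entry for go, and the function name are assumptions introduced purely for illustration.

```python
# Hypothetical, heavily simplified root-LCS signatures: (type, primitive, field).
ENGLISH_VERBS = {
    "be": ("STATE", "BE", "IDENTIFICATIONAL"),
    "go": ("EVENT", "GO", "LOCATIONAL"),
}


def candidate_verbs(clcs_root, lexicon):
    """Return the verbs whose root signature equals the root of the CLCS."""
    return sorted(verb for verb, signature in lexicon.items() if signature == clcs_root)


if __name__ == "__main__":
    # Root of the CLCS built for the sentence about the author.
    print(candidate_verbs(("STATE", "BE", "IDENTIFICATIONAL"), ENGLISH_VERBS))
```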

Future Work

Regarding automatic generation of CLCSs from parsed text, much work remains to be done. In particular, there needs to be a more specific representation of the relationship between modifiers and the entities they modify.

Regarding our analysis of errors in existing MT systems, a larger corpus of text should be analyzed, and the resulting data should be used to generate frequency statistics organized by MT system and by language pair. It may be shown that certain types of errors are typical of a particular system, and not characteristic of transfer-based MT in general. We might also discover that certain language pairs cause certain types of errors across MT systems.

References
Dorr, Bonnie. Machine Translation: A View from the Lexicon. MIT Press, 1993.

Gruber, Jeffrey S. Studies in Lexical Relations. Doctoral dissertation, MIT, Cambridge, MA, 1965.

Jackendoff, Ray S. Semantics and Cognition. MIT Press, Cambridge, MA, 1983.

LuperFoy, Susann and Keith Miller. The Use of Pegs Computational Discourse Framework as an Interlingua Representation. AMTA SIG-IL First Workshop on Interlinguas (held at MT Summit VI), San Diego, California, 1997.

Miller, Keith and David Zajic. Lexicons as Gold: Mining, Embellishment, and Reuse. AMTA-98 (to appear).

Otero, María. "Latin America: ACCION Speaks Louder Than Words." UNESCO Courier, January 1997.

Pustejovsky, James. The Generative Lexicon. MIT Press, 1995.

Reeder, Florence and Dan Loehr. Finding the Right Words. AMTA-98 (to appear).