XML Europe 2004

xml:tm - A Radical New Approach to Translating XML Documents

Translation Memory: Possessing the Holy Cow

Abstract

xml:tm is a revolutionary approach to tackling the problem of translating XML documents. It offers substantial advantages over traditional translation processes.

This paper addresses why and how you should manage your XML document translations natively using xml:tm. An overview of the advantages that XML has as a translation format will be provided. The range of XML based translation standards will be introduced. Additionally, a theoretical overview of how xml:tm works will be given, and possible tools mentioned. This will be followed by a brief discussion on how information creators can take more control of the translation process and the costs associated with it.


Table of Contents

1. Introduction
2. Possessing the Holy Cow
3. XML Based translation standards
4. xml:tm - XML based Text Memory
4.1. Translation process using xml:tm
4.2. Translation Interface
Bibliography
Biography

1. Introduction

The adoption of XML as a standard for the storage, retrieval and delivery of information has meant that many enterprises have large corpora in this format. Very often, information components in these corpora require translation. Such enterprises have normally enjoyed all of the benefits of XML on the information creation side, but very often fail to maximize the benefits that XML based translation can provide.

There are two possible approaches to automating the translation process:

  • Machine translation:

    Machine translation has been a “holy grail” of the IT industry for more than 40 years. There have been significant advances in language technology over this period and we all benefit from these on a day to day basis when we use spelling and grammar checkers and ever more sophisticated search engines.

    One of the fundamental reasons why machine translation has not so far produced convincing results is that language is more than mere words and grammar. Language conveys meaning and until you can clearly define and understand what is being conveyed you cannot hope to translate it[1]. A good test of a Machine Translation system is to translate the text into the target language and then back again - the results can be quite comical[2].

    The goal of completely automated free format machine translation is still far away. There has nevertheless been some good progress when tackling very tight domains with controlled terminology and grammar such as the Canadian METEO system for translating weather forecasts into French and English.

    Machine Translation can also provide useful “gisting” information about a target language text.

  • Translation Memory:

    Translation memory works by aligning previously translated text in a target language with the source language. This is accomplished either by the use of a manual tool, or automatically by using a controlled environment for the translation process. Alignment is usually done at a sentence level. This affords the best level of usable granularity. The aligned source and target text is held in a repository. The next time the document is updated the repository is searched in order to locate any text that has not changed. Where such a sentence is identified the source language text can be replaced with the target language text.

    This relatively low tech method can nevertheless provide benefits in terms of translation consistency and reduced costs. Until now, there have been no real advances in translation memory technology in the past 20 years. The advent of XML has changed the way we can treat text for translation.

    The weakness of this approach is that a given translation is often dependent on the surrounding context. When text is pulled in from a translation memory repository it does not possess any notion of the context within which it existed in the original document. Because there is no contextual information regarding the target language text, a translator is still required to proofread the matched text and adapt it if required. The proofreading process, although less expensive than straightforward translation, still consumes time and money.
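
The basic translation memory mechanism described above can be illustrated with a minimal sketch. It assumes a naive sentence splitter and strict one-to-one positional alignment of source and target sentences; the function names and sample sentences are invented for illustration and do not come from any particular translation memory product.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_memory(source_sentences, target_sentences):
    """Align sentences pairwise by position and store them as source -> target."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must contain the same number of sentences")
    return dict(zip(source_sentences, target_sentences))

def pretranslate(text, memory):
    """Return (source, target, matched) for each sentence of an updated text."""
    result = []
    for sentence in split_sentences(text):
        if sentence in memory:
            # Unchanged sentence: reuse the stored translation.
            result.append((sentence, memory[sentence], True))
        else:
            # New or changed sentence: still needs a translator.
            result.append((sentence, None, False))
    return result

memory = build_memory(
    ["Open the bonnet.", "Check the oil level."],
    ["Ouvrez le capot.", "Vérifiez le niveau d'huile."],
)
updated = "Open the bonnet. Check the coolant level."
for source, target, matched in pretranslate(updated, memory):
    print(matched, source, "->", target)
```

Note that the lookup key is the sentence alone: nothing in the repository records where the sentence occurred, which is exactly the loss of context discussed above.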

2. Possessing the Holy Cow

Translation memory is central to translation quality and cost control. Indeed it can be seen as the “Holy Cow” of the entire translation process. Those who control the translation memory are in a key position to dictate translation cost, turnaround time and to control accuracy and consistency.

The traditional translation route is arduous and protracted. Usually, it involves the customer shipping raw XML to the translation supplier. The supplier then extracts all of the textual units for translation, hence removing any of the contextual metadata in the XML source. The supplier then pushes the extracted text through a proprietary translation memory to match the new source against any previous translation of similar text units. Over time, the supplier will probably have built up an extensive translation memory and will derive benefit in terms of turnaround time and consistency from using this memory.

The next step in the process involves the supplier preparing the text as it comes out of the memory into a format that a translator can use. The text is then translated using a proprietary package (often in MS Word). This necessitates the re-merging of the translated text back into the original XML source format for supply to the customer. This re-merging process, while automated, may have pitfalls such as character corruption and normally requires a quality assurance process.

Generally speaking, the entire process is detailed and thus costly. The customer pays for the use of a supplier's translation memory. Over time, a particular supplier's translation memory will grow and become more valuable to the translation process. This is a double-edged sword for the customer. On the one hand, the customer avails of all of the benefits of the memory in terms of consistency, accuracy etc. On the other hand, a customer may get locked into a particular supplier because of the possession of an extensive memory. Regardless of a changing business climate, such as the need for cost reduction, a customer may be reluctant to shop for the most appropriate supplier.

According to research by the Localisation Research Centre, the amount actually paid to translators represents only 25% of the overall cost of translation.

The supplier translation memory is inevitably in a proprietary non-XML aware format. The memory is very often the intellectual property of the supplier, not the customer. Even if the customer had ownership of the memory, little real benefit would be derived as it may be usable only with proprietary tools. Additionally, many translation memory tools are devoid of contextual and morphological information. Context information, such as the previous and next text units, and morphological reduction (stemming) greatly enhance the translation process. These key functions are missing from most proprietary translation memories; with an XML based translation memory, however, you get them for free.

3. XML Based translation standards

The translation industry has been an enthusiastic creator and adopter of XML based standards. Special mention must be made here of LISA (Localization Industry Standards Association – www.lisa.org) which has been responsible through its OSCAR standards body for the following standards:

TMX

Translation Memory Exchange format[3]

TBX

Termbase Exchange format[4]

SRX

Segmentation Rules Exchange format[5]

The other body involved in translation based XML formats is OASIS (Organization for the Advancement of Structured Information Standards – www.oasis-open.org) which is responsible for many XML standards. The most relevant OASIS technical committees regarding translation are:

XLIFF

XML Localisation Interchange File Format[6]

TransWS

Translation Web Services[7]

All of these excellent standards relate to the exchange of translation data using XML as the interchange format. They do not address the issue of the actual translation of XML documents themselves. This is where xml:tm comes in.

xml:tm deals with the issue of how to simplify and reduce the costs of translating XML based documents. It introduces the concept of "text memory" which is maintained automatically and transparently in the source and target versions of the same document. It automates the task of managing memory and provides the first real advances in the realms of translation memory in 20 years.

xml:tm was created by XML Intl based on 12 years of in depth experience in designing enterprise level SGML and XML translation memory systems. xml:tm has been offered to LISA for consideration as an open OSCAR standard. A full specification of xml:tm is available online[8].

4. xml:tm - XML based Text Memory

Information encoded in XML has many benefits in terms of translation:

  • Human readable - easy to understand

  • Pure content - a translator receives just what needs to be translated without any cumbersome formatting

  • Unicode - all multilingual characters can be easily represented

  • Variety of output media - XML encoded translations can be easily formatted for a variety of media: web, CD, WAP.

The creation and maintenance of an XML based translation memory (TM) has similar advantages for creators of large corpora of XML encoded information. All XML based corpora have the potential to compile their own translation memory over time.

Whereas traditional translation memory systems have only concentrated on the translation aspect of the document lifecycle, xml:tm goes deeper into the lifecycle process establishing the concept of “text memory”. Each sentence (text unit) in the document is given a unique identifier. This identifier remains immutable for the life of the document. xml:tm uses the XML namespace mechanism to achieve this.

The following diagram shows how the xml:tm namespace coexists within an XML document:

[Diagram: the xml:tm namespace coexisting within an XML document]
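
As a rough illustration of how a namespace can carry text unit identifiers alongside the document's own markup, the sketch below wraps each sentence of a paragraph in a namespaced element with a unique `id`. The namespace URI, the `tu` element name and the `para` element are assumptions made for this example; the actual names are defined in the xml:tm specification[8]. The sketch also assumes paragraphs contain plain text with no inline child elements.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, chosen for illustration only.
TM_NS = "urn:example:text-memory"
ET.register_namespace("tm", TM_NS)

def segment(root, tag="para"):
    """Wrap each sentence of every <para> in a tm:tu element with a unique id."""
    counter = 0
    # Copy the iterator into a list because the tree is modified in the loop.
    for para in list(root.iter(tag)):
        text = (para.text or "").strip()
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
        para.text = None
        for sentence in sentences:
            counter += 1
            unit = ET.SubElement(para, f"{{{TM_NS}}}tu", {"id": f"u{counter}"})
            unit.text = sentence
    return root

doc = ET.fromstring("<doc><para>Open the bonnet. Check the oil level.</para></doc>")
segmented = segment(doc)
print(ET.tostring(segmented, encoding="unicode"))
```

The original elements are left in place; the text memory view is simply layered on top of them. This sketch only covers the first segmentation pass. Keeping identifiers stable across later revisions of the document, as described above, would need additional logic.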

4.1. Translation process using xml:tm

At the core of xml:tm is the concept of “text memory”. Text memory is made up of two components:

  1. Author Memory

  2. Translation Memory

The XML namespace mechanism is used to map a text memory view onto a document. This process is called segmentation. The text memory works at the sentence level of granularity – the text unit. Each individual xml:tm text unit is allocated a unique identifier. This unique identifier is immutable for the life of the document. As a document goes through its life cycle the unique identifiers are maintained and new ones are allocated as required. This aspect of text memory is called author memory. It can be used to build author memory systems which can be used to simplify and improve the consistency of authoring. A detailed technical article about xml:tm has been published on O'Reilly's xml.com web site[9].

The use of xml:tm greatly improves upon the traditional translation route. Each text unit in a document has a unique identifier. When the document is translated the target version of the file has the same identifiers. The source and target documents are therefore perfectly aligned at the text unit level. The following diagram shows how perfect matching is achieved:

[Diagram: perfect matching of source and target text units through shared identifiers]
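
Because source and target share identifiers, alignment reduces to a lookup by identifier. The following sketch assumes each document version has already been reduced to a dictionary of `{identifier: text}`; the function name and the sample data are hypothetical. As a safeguard it also compares the source text, so that a unit whose wording changed under the same identifier is not reused.

```python
def perfect_matches(old_source, old_target, new_source):
    """Return {id: target_text} for units whose id and source text are unchanged."""
    matches = {}
    for unit_id, text in new_source.items():
        if old_source.get(unit_id) == text and unit_id in old_target:
            matches[unit_id] = old_target[unit_id]
    return matches

# Previous version of the document and its translation, keyed by text unit id.
old_source = {"u1": "Open the bonnet.", "u2": "Check the oil level."}
old_target = {"u1": "Ouvrez le capot.", "u2": "Vérifiez le niveau d'huile."}

# Updated source: u1 unchanged, u2 reworded, u3 newly added.
new_source = {
    "u1": "Open the bonnet.",
    "u2": "Check the coolant level.",
    "u3": "Close the bonnet.",
}

matches = perfect_matches(old_source, old_target, new_source)
to_translate = [uid for uid in new_source if uid not in matches]
print(matches)
print(to_translate)
```

Only the units listed in `to_translate` would need to go through further matching or be sent to a translator.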

The first step in the translation process is that all the raw XML data is extracted to an XML standard for translation (e.g. XLIFF). This extraction process uses the xml:tm namespace to identify all of the text that requires translation. For those text units that have not changed since the last update, the target language text can be automatically inserted. This process is called perfect matching and does not require any translator intervention thereby reducing the cost.
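
The extraction step might look roughly like the sketch below, which writes text units into a simplified XLIFF 1.1 style structure and pre-populates the target for perfect matches. It is not a complete or validated XLIFF document: the XLIFF namespace declaration is omitted for brevity, the file name `manual.xml` and the language codes are placeholders, and marking perfect matches with `approved="yes"` is one possible convention rather than something the paper prescribes.

```python
import xml.etree.ElementTree as ET

def to_xliff(units, matches, source_lang="en", target_lang="fr", original="manual.xml"):
    """Build a minimal XLIFF-style tree with one trans-unit per text unit.

    units:   {id: source_text} extracted from the segmented document
    matches: {id: target_text} for units that are perfect matches
    """
    xliff = ET.Element("xliff", {"version": "1.1"})
    file_el = ET.SubElement(xliff, "file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "xml",
    })
    body = ET.SubElement(file_el, "body")
    for unit_id, text in units.items():
        trans_unit = ET.SubElement(body, "trans-unit", {"id": unit_id})
        ET.SubElement(trans_unit, "source").text = text
        if unit_id in matches:
            # Perfect match: carry the existing translation over untouched.
            trans_unit.set("approved", "yes")
            ET.SubElement(trans_unit, "target").text = matches[unit_id]
    return xliff

tree = to_xliff(
    {"u1": "Open the bonnet.", "u2": "Check the coolant level."},
    {"u1": "Ouvrez le capot."},
)
print(ET.tostring(tree, encoding="unicode"))
```

Because each `trans-unit` carries the same identifier as the text unit it came from, the translated text can later be merged back into the target document by identifier.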

xml:tm can be used to perform text unit matching within the same document looking for previous translations of the new text which may be the same as existing translated text. This type of matching is called in-document leveraged matching and requires proofing for context acceptance by a translator.

xml:tm can be used to perform text unit matching across a customer's entire corpus in the traditional translation memory way. The original translated source and target text can be loaded into a translation memory database at the text unit level. This type of matching is called leveraged matching and requires proofing for context acceptance by a translator.

Unlike a supplier whose memory is only able to match against those objects the customer has supplied for translation, a customer can match against anything that has previously been translated regardless of translation supplier. xml:tm can identify a number of different categories of translated material:

  • Perfect match – a text unit that is unchanged since the previous version of the document, so its existing translation is reused under the same identifier without translator intervention

  • Leveraged match – a 100% identical text unit that was translated in a different context, either elsewhere in the same document or in another document held in the translation memory database.

  • Fuzzy match – 90% of the key words in a unit are found in the required order in a previously translated text unit. Fuzzy matching can be based on text memory information from within the same document or from the translation memory database.

  • Non-translatable text – text units that should not be translated e.g. numeric only, alphanumeric only, measurements text units etc.
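
The categories above can be read as a decision cascade. The sketch below is one possible interpretation, not the xml:tm algorithm itself. Its assumptions: every word counts as a key word, "found in the required order" is approximated by the longest common subsequence of words, the 90% threshold is applied to the longer of the two units, and only the purely numeric case of non-translatable text is detected. All function names and the data layout are hypothetical.

```python
import re

NUMERIC_ONLY = re.compile(r"[\d\s.,:/+-]+")

def words(text):
    """Lower-cased word tokens of a text unit."""
    return re.findall(r"\w+", text.lower())

def ordered_overlap(new_words, old_words):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(old_words) + 1)
    for nw in new_words:
        curr = [0]
        for j, ow in enumerate(old_words, start=1):
            curr.append(prev[j - 1] + 1 if nw == ow else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def classify(unit_id, text, previous_version, memory, threshold=0.9):
    """Return (category, suggested_target).

    previous_version: {id: (source, target)} from the last translated version
    memory:           {source: target} drawn from the document and the database
    """
    if NUMERIC_ONLY.fullmatch(text):
        return "non-translatable", text
    prior = previous_version.get(unit_id)
    if prior and prior[0] == text:
        return "perfect", prior[1]
    if text in memory:
        return "leveraged", memory[text]
    new_words = words(text)
    best, best_score = None, 0.0
    for source, target in memory.items():
        old_words = words(source)
        score = ordered_overlap(new_words, old_words) / max(len(new_words), len(old_words), 1)
        if score > best_score:
            best, best_score = target, score
    if best_score >= threshold:
        return "fuzzy", best
    return "new", None

previous = {"u1": ("Open the bonnet.", "Ouvrez le capot.")}
memory = {
    "Close the bonnet.": "Fermez le capot.",
    "Check the oil level before you start the engine each day.":
        "Vérifiez le niveau d'huile avant de démarrer le moteur chaque jour.",
}
print(classify("u1", "Open the bonnet.", previous, memory))
print(classify("u7", "Close the bonnet.", previous, memory))
print(classify("u8", "Check the oil level before you start the engine every day.", previous, memory))
print(classify("u9", "12.5", previous, memory))
```

Only perfect matches and non-translatable units bypass the translator; leveraged and fuzzy suggestions are returned for proofing, in line with the descriptions above.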

xml:tm provides a much more cost effective and better tuned matching mechanism than traditional translation memory systems.

The advantages of xml:tm for a customer are enormous:

  • Independence, a business is no longer tied to a particular translation supplier. The knowledge resource embodied by xml:tm is owned by the customer. Additionally, it means that a customer is free to switch supplier at will without having to consider the translation memory.

  • Cost reduction – it means that a customer can send just what needs to be translated to a translator. Therefore, the customer does not pay for the management of non-translatable text or any matching.

  • Quality – xml:tm reduces the degree of manual intervention required in file preparation. It is no longer necessary to manually prepare translation files and add translation memory information. The files that are sent to the translator are auto-populated with all of the contextual and morphological information.

    Furthermore, xml:tm can avail of linguistic tools to do morphological reduction (stemming) that are built in to native XML databases such as Oracle 9i. These functions greatly enhance the matching process – a key feature that is missing in several proprietary translation memory systems.

  • Automation – xml:tm provides the architecture for fully automated translation memory management and use. No human intervention is required.

  • Protection – the XML document is protected from accidental damage. Because the text is translated from the extracted XLIFF format the original XML document is protected and cannot be damaged or corrupted during the translation process.

  • Enterprise level scalability – there are no architectural limitations on the size or quantity of XML documents that can be handled by xml:tm.

  • Translation interface – as the XML integrity of the information requiring translation is preserved, it is possible to present it for translation in a web-interface. This means that a translator can perform a translation using a browser, without the need for expensive, proprietary translation tools.

4.2. Translation Interface

xml:tm preserves pre-populated matched files in a native XML format. This has significant advantages for translation. It means that a translator will not need either a translation memory or specific translation tools to perform a translation. Rather the pre-matched files can be delivered via a web application for translation either online or offline. Cognitran Limited has developed such an interface for browser-based translation. This will be demonstrated at the end of the talk. However, such an application is predicated upon the use of xml:tm and the XML family of translation standards such as XLIFF.

Allowing translators direct access to the text via the internet provides the opportunity to reduce translation costs significantly in itself.

Bibliography

[1] Stephen Budiansky - Lost in translation (http://www.theatlantic.com/issues/98dec/computer.htm)

[2] Tim Oren, Pacifica Fund - The State of Machine Translation (http://www.pacificavc.com/blog/2004/01/27.html#a517)

[3] TMX - Translation Memory eXchange format (http://www.lisa.org/tmx/)

[4] TBX - TermBase eXchange format (http://www.lisa.org/tbx/)

[5] SRX - Segmentation Rules eXchange format (http://www.lisa.org/oscar/seg/drafts/srx/srx03-20030724/srx.htm)

[6] XLIFF - XML Localisation Interchange File Format (http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff)

[7] Translation Web Services (http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=trans-ws)

[8] xml:tm - detailed specification (http://www.xml-intl.com/docs/specification/xml-tm.html)

[9] Translating XML Documents with xml:tm - detailed xml.com technical article about xml:tm (http://www.xml.com/pub/a/2004/01/07/xmltm.html)

Biography

Andrzej Zydron is the CTO of XML Intl. Educated in France, he started working in IT in 1976. His experience has covered all aspects of computing, with in depth knowledge of Software Engineering, SGML, XML, encoding methodologies and translation memory. Highlights of his career include:

  1. The design and architecture of the European Patent Office patent data capture system for Xerox Business Services.

  2. The design and architecture of the Xerox Language Services XTM translation memory system.

  3. Writing the XML and SGML filters for SDL International's SDLX Translation Suite.

  4. Assisting the Oxford University Press, the British Council and Oxford University in work on the New Dictionary of National Biography IT systems.

He is currently engaged in developing the next generation of XML based "text memory" systems.

Andrzej Zydron is a member of the British Computer Society. He also sits on the OASIS technical committees for Translation Web Services and XLIFF (XML Localization Interchange File Format).

Andrzej is fluent in Polish, English and French.

Mavis Cournane is a Principal Consultant with Cognitran Limited. She has a PhD in History and Computer Science from the National University of Ireland. The focus of her dissertation was “The Application of SGML/TEI to Multilingual Text Processing”. The dissertation involved processing Irish, Hebrew, Greek, Norse, Norman French and German texts.

Mavis has previously worked for The European Foundation for the Improvement of Living and Working Conditions, an EU Institute, where she managed the project to get all of the multilingual information into an XML repository.

At Cognitran Limited, Mavis is heavily involved in XML Automotive projects involving the translation of XML based technical information into a variety of languages. She also leads the Cognitran Translation Services Agent (TSA) project. This is a web-based application that allows translators to perform translations in a browser-based environment, while availing of the benefits of a translation memory.

Mavis has a particular interest in the development of standards for e-Business. She currently co-chairs the OASIS Universal Business Language (UBL) subcommittees for Naming and Design Rules (NDR) and Code Lists (CLSC).

Mavis is fluent in Irish, German and English.