Keywords: XML, Publishing, Unicode, Translation, Standards
Biography
Andrzej Zydroń was born in England. Educated in France, he started working in IT in 1976. His experience has covered all aspects of computing. He started working with SGML in 1985, writing complex photocomposition filters for SGML for Xerox. He is an expert in software engineering, SGML, XML, encoding methodologies, translation memory and document image processing. Highlights of his career include:
The design and architecture of the European Patent Office patent data capture system for Xerox Business Services.
Writing a system for the automated optimal typographical formatting of generically encoded tables (1989).
The design and architecture of the Xerox Language Services XTM translation memory system.
Writing the XML and SGML filters for SDL International's SDLX Translation Suite.
Assisting the Oxford University Press, the British Council and Oxford University in work on the New Dictionary of the National Biography.
He is currently CTO of XML Intl, and the technical architect of the XML based “text memory” system - xml:tm, a revolutionary new approach to the authoring and translation of XML based documents.
Andrzej is a member of the Localization Industry Standards Association (LISA) OSCAR steering committee and technical architect of the proposed OSCAR GILT Metrics specification, as well as editor of the proposed OSCAR TBX-link specification. He is also an active member of the OASIS XLIFF and Translation Web Services technical committees and a member of the British Computer Society.
Andrzej is fluent in Polish, English and French.
Translating XML documents presents many opportunities as well as challenges. There are clear do's and don'ts when it comes to designing your documents for translation. You can also use XML to your advantage to reduce costs and increase quality. One of the most exciting ways to do this is via the use of the XML Text Memory Namespace - xml:tm.
There are also other translation industry XML based standards (TBX, SRX, TMX, XLIFF, OLIF, GMX) to help along the way covering everything from the interchange of terminology and translation memories, to segmentation and Translation Web Services. XML also comes with fundamental built-in mechanisms that can help you to control authoring and reduce the cost of translation. Details of all XML related translation standards are provided along with advice on how to make best use of them.
1. Introduction
2. Designing XML documents for translation
2.1 Avoid the use of specially defined entity references
2.2 Avoid translatable attributes
2.3 Avoid using CDATA sections that may contain translatable text
2.4 Avoid the use of infinite naming schemes
2.5 Avoid Processing Instructions (PIs) in translatable text
2.6 Avoid the use of text in bitmap graphics
2.7 Never make any assumptions about text length sizes in your design
2.8 Always use UTF-8 (or alternatively UTF-16) encoding throughout your process
2.9 Never break a linguistically complete text unit over more than one non-inline element
2.10 Avoid the use of "typographical" elements
2.11 Do not mix translatable and non-translatable text in the same elements
2.12 Avoid holding source and target PCDATA in the same document
2.13 Clearly define text that requires translation
2.14 Suggested Further reading
2.15 Finally – please invest time and effort in the quality of the source text.
3. XML Based translation standards
4. xml:tm - XML based Text Memory
4.1 Translation process using xml:tm
4.2 Translation Interface
Bibliography
The adoption of XML as a standard for the storage, retrieval and delivery of information has meant that many enterprises have large corpora in this format. Very often information components in these corpora require translation. Such enterprises have normally enjoyed all of the benefits of XML on the information creation side, but very often fail to maximize the benefits that XML based translation can provide.
The separation of form and content which is inherent within the concept of XML makes XML documents easier to localize than documents held in traditional proprietary text processing or composition systems. Nevertheless, decisions made during the creation of the XML structure and the authoring of documents can have a significant effect on the ease with which the source language text can be localized into other languages. The difficulties introduced into XML documents through inappropriate use of syntactical tools can have a profound effect on translatability and cost. It may even require complete re-authoring of documents in order to make them translatable. This is worth noting as a very high proportion of XML documents are candidates for translation into other languages.
A key concept in the treatment of translatable text within XML documents is that of the "text unit". A text unit is defined as being the content of an XML element, or the subdivision thereof into recognizable sentences that are linguistically complete as far as translation is concerned.
Before we go any further we need to look at the current technologies used in automating translation.
The following technologies can be applied to the translation process to make it more effective:
It is very important to consider the implications for localization when designing an XML document. Wrong decisions can cause considerable problems for the translation process, thus increasing costs. All of the following examples assume that the text to be translated is to be extracted into an intermediate form such as XLIFF (XML Localization Interchange File Format). Anyone planning to provide an XML document directly to translators will soon be disabused of this idea after the first attempt. The intermediate format protects the original file format and guarantees that you get back a target language document that is equivalent to the original source. An additional concept which is important regarding the localization of XML documents is that of the 'inline' element. Inline elements are those that can exist within normal text (PCDATA - Parsed Character Data). They do not cause a linguistic or structural break in the text being extracted, but are part of the PCDATA content.
The following is a list of guidelines based on (often bitter) experience. Most of the problems are caused by not following the fundamental principles of XML and good XML practice. It is nevertheless surprising how often you can come across instances of the following types of problem. Please note that this list is not prescriptive; there may be special circumstances in which the proposed rules have to be broken:
Although entity references can look like a 'slick' technique for substituting variable text such as a model name or feature in a publication, they can cause more problems than they resolve.
<para>Use a &tool; to release the catch.</para>
Example 1: Incorrect use of Entity References
Entities can cause the following problems:
It is generally better to use alternative techniques rather than entity references:
<para>
Use a <tool id="a1098">claw hammer</tool>
to release the CPU retention catch.
</para>
Example 2: Proposed solution
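One of the practical difficulties can be illustrated with a minimal Python sketch. It uses the standard library's xml.etree.ElementTree purely for illustration; the helper name extract_text is hypothetical and not part of any translation tool. The assumption is that the extraction tool sees only the fragment and not the DTD that declares the entity, in which case the entity form cannot even be parsed, while the element form yields its text directly.

```python
import xml.etree.ElementTree as ET

# Example 1 (entity reference) and Example 2 (inline element) from the text.
WITH_ENTITY = "<para>Use a &tool; to release the catch.</para>"
WITH_ELEMENT = (
    '<para>Use a <tool id="a1098">claw hammer</tool> '
    "to release the CPU retention catch.</para>"
)


def extract_text(xml_source):
    """Return the full text of the fragment, or None if it cannot be parsed."""
    try:
        root = ET.fromstring(xml_source)
    except ET.ParseError:
        # The entity is declared in a DTD that the parser has not been given.
        return None
    return "".join(root.itertext())


print(extract_text(WITH_ENTITY))
print(extract_text(WITH_ELEMENT))
```

With the element form the translator also sees the complete sentence, including the text that the entity would otherwise have hidden.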
One area where entities CAN be used to great effect is that of boilerplate text. The technique here is to use parameter entities to store the text. The text must always be linguistically complete in that it cannot rely on positional dependencies with regard to other entities etc. Boilerplate text is used solely within a DTD. There need to be parallel target language versions of the DTD for this technique to be used, which can add to the maintenance cost, although judicious use of INCLUDE directives and DTD design can mitigate this.
Translatable attributes can also look like a smart way of embedding variable information in an element.
<para>
Use a <tool id="a1098" name="claw hammer"/>
to release the CPU retention catch.
</para>
Example 3: Incorrect use of translatable attributes:
Unfortunately, they present the translation process with the following difficulties:
<para> Use a <tool id="a1098">claw hammer</tool> to release the CPU retention catch. </para>
Example 4: Proposed solution
There is a good rough rule of thumb that if text has more than one word then it should not be used in attributes. As a syntactical instrument attributes are much more limited than elements. For a start you can only have one attribute of a given name. The use of attributes should be reserved for single "word" values that qualify in a meaningful way an aspect of their element.
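The difference between Examples 3 and 4 can be sketched in a few lines of Python. This is only an illustration using the standard xml.etree.ElementTree module; the function name text_content is hypothetical, and real extraction tools are configurable in ways this sketch is not. It shows that an extractor which collects character data never visits attribute values, so the translatable words are lost unless a special rule is written for that attribute.

```python
import xml.etree.ElementTree as ET

# Example 3 (translatable attribute) and Example 4 (element content).
ATTRIBUTE_FORM = (
    '<para>Use a <tool id="a1098" name="claw hammer"/> '
    "to release the CPU retention catch.</para>"
)
ELEMENT_FORM = (
    '<para>Use a <tool id="a1098">claw hammer</tool> '
    "to release the CPU retention catch.</para>"
)


def text_content(xml_source):
    """Collect character data only; attribute values are not visited."""
    return "".join(ET.fromstring(xml_source).itertext())


print(text_content(ATTRIBUTE_FORM))
print(text_content(ELEMENT_FORM))
```

In the attribute form the sentence reaches the translator with a hole where the tool name should be.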
CDATA sections are typically used as a means of escaping multiple '<' and '&' characters. Unfortunately they pose particular problems for tools that are extracting such text. The problem is not one of the escaped characters, but how to treat the CDATA text.
<TEMPLATE><![CDATA[<p>Please refer to the <em>index page </em> page for further information</p>]]> </TEMPLATE>
Example 5: CDATA section problems:
The problem is a similar one to that posed by translatable attributes. Is the text to be treated as 'inline' to the surrounding text? What of the escaped characters. Are they to be replaced on translation with the appropriate characters that were originally escaped, or are they to be left in their escaped form. How is the software to know?
I have come across whole XML documents being embedded as CDATA within an encompassing XML document. This poses significant problems regarding the treatment of the CDATA text. It must first be extracted and then re-parsed before it can be extracted for translation.
Unless the text within CDATA sections is specifically never to be translated, please avoid using CDATA sections and use the standard built-in entity references (&lt; and &amp;) to escape the text.
<TEMPLATE>
&lt;p&gt;Please refer to the &lt;em&gt;index page
&lt;/em&gt; page for further information&lt;/p&gt;
</TEMPLATE>
Example 6: Proposed solution:
<TEMPLATE xlink="ftp://ftp.xml-intl.com/res/ex1.xml"/>
Example 7: Or alternatively use a link to an external resource:
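The two-pass problem with CDATA described above can be sketched as follows. This is a minimal illustration in Python using the standard xml.etree.ElementTree module, applied to a slightly simplified version of Example 5; the variable names are illustrative only. The parser hands back the CDATA content as one opaque string with no child elements, so the inline markup only becomes visible after the string has been parsed a second time.

```python
import xml.etree.ElementTree as ET

# A simplified version of Example 5.
SOURCE = (
    "<TEMPLATE><![CDATA[<p>Please refer to the <em>index page</em> "
    "for further information</p>]]></TEMPLATE>"
)

template = ET.fromstring(SOURCE)

# First pass: the CDATA content is reported as plain text, not as elements.
embedded = template.text
child_count = len(template)

# Second pass: the string must be parsed again to expose its inline markup.
inner = ET.fromstring(embedded)
inline_tags = [child.tag for child in inner]

print(child_count, inline_tags)
```

The second pass only works here because the embedded text happens to be well-formed; an extraction tool cannot assume that in general.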
Do not use open-ended, numbered element names such as elm001, elm002, elm003 in well-formed documents.
<?xml version="1.0" ?>
<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist.</hint001>
<err002>Incorrect value.</err002>
<hint002>Hint: value must be between $1 and $2.</hint002>
<err003>Connection timeout.</err003>
.
.
</resources>
Example 8: Example of infinite naming scheme usage:
This presents problems for extraction programs and is not regarded as good XML practice. A much better way of doing this is to use the ID and IDREF attribute mechanisms to link elements together.
<?xml version="1.0" ?>
<resources xml:lang="en">
<error id="001">
<caption>Cannot open file $1.</caption>
<hint>Does file $1 exist.</hint>
</error>
<error id="002">
<caption>Incorrect value.</caption>
<hint>Value must be between $1 and $2.</hint>
</error>
.
.
</resources>
Example 9: Proposed solution:
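The effect on extraction programs can be sketched in Python. This is an illustration only, using the standard re and xml.etree.ElementTree modules on shortened versions of Examples 8 and 9; the function names are hypothetical. With numbered element names the extractor has to pattern-match tag names, whereas the structured form needs only a fixed element name, with the numbering carried by the id attribute.

```python
import re
import xml.etree.ElementTree as ET

NUMBERED = """<resources xml:lang="en">
<err001>Cannot open file $1.</err001>
<hint001>Hint: does file $1 exist.</hint001>
<err002>Incorrect value.</err002>
</resources>"""

STRUCTURED = """<resources xml:lang="en">
<error id="001"><caption>Cannot open file $1.</caption></error>
<error id="002"><caption>Incorrect value.</caption></error>
</resources>"""


def captions_numbered(xml_source):
    # Element names are open-ended, so they have to be pattern-matched.
    root = ET.fromstring(xml_source)
    return [el.text for el in root if re.fullmatch(r"err\d+", el.tag)]


def captions_structured(xml_source):
    # A fixed element name is enough; the id attribute carries the numbering.
    root = ET.fromstring(xml_source)
    return {err.get("id"): err.findtext("caption") for err in root.iter("error")}


print(captions_numbered(NUMBERED))
print(captions_structured(STRUCTURED))
```

The structured form can also be described by a finite DTD or schema, which the numbered form cannot.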
Processing instructions are a very 'weak' syntactical instrument in XML. There is no built in mechanism in XML to assist syntactically in the preservation of Processing Instructions. Above all avoid translatable text in PIs.
<para> Use a <?tool name="claw hammer"?> to release the CPU retention catch. </para>
Example 10: Incorrect use of translatable text in PIs:
<para> Use a <tool id="a1098">claw hammer</tool> to release the CPU retention catch. </para>
Example 11: Proposed solution
It is generally not a good idea to have any PIs present within translatable text. There is no guarantee that they will survive the translation process unless special processing is carried out to preserve them. The problem is deciding whether the PIs are significant or not. Because PIs are syntactically weak, they do not have the handling capabilities of elements, and it is not easy for off-the-shelf extraction software to parameterize their handling. They can also cause problems with translation memory systems: the insertion of a PI can cause otherwise linguistically identical text to fail TM matching. It is better to strip out any PIs prior to translation.
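The matching failure can be sketched in Python. This is a rough illustration only: the segments are treated as serialized strings, the PI name pagebreak is invented for the example, and a regular expression stands in for whatever PI handling a real extraction tool provides. It shows an exact comparison failing because of an embedded PI and succeeding once the PI has been stripped.

```python
import re

# Matches any processing instruction in a serialized segment.
PI_PATTERN = re.compile(r"<\?.*?\?>", re.DOTALL)


def strip_pis(segment):
    """Remove processing instructions and collapse the whitespace they leave."""
    without_pis = PI_PATTERN.sub("", segment)
    return re.sub(r"\s+", " ", without_pis).strip()


# Text already held in the translation memory.
stored = "Use a claw hammer to release the CPU retention catch."
# The same sentence after a (hypothetical) PI has been inserted by a tool.
incoming = "Use a claw hammer <?pagebreak?>to release the CPU retention catch."

print(incoming == stored)
print(strip_pis(incoming) == stored)
```

Stripping is only safe when the PIs carry nothing that the target document needs, which is exactly the decision that is hard to automate.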
With SVG available there should be no excuse for using bitmapped graphics that contain text. They pose particular problems in that the original bitmap will need to be recreated for the target language with the translated text. This is usually a very costly and error-prone process, and it requires the person editing the graphics to have appropriate knowledge of the target language.
Always allow for the fact that the target language text may be significantly longer than the source. For example "Welcome" becomes "шчыра запрашаем" in Belarusian and "maligayang pagdating" in Tagalog. Design your output with flexibility in mind.
With English source text it is tempting to use 7-bit ASCII or ISO 8859-1 encoding. As soon as you are required to translate into a language that is not covered by ISO 8859-1, you will find that maintaining documents in different encoding schemes is a real problem. Always use UTF-8 from the start. It gives you immediate access to commonly used punctuation characters such as the em dash and en dash. It also significantly simplifies your document processing. All XML parsers are required to support UTF-8 and UTF-16. UTF-8 is more economical in terms of space for most European languages whose scripts are based on the Latin alphabet.
Never start a sentence in one non-inline element and continue it in another. You cannot rely on the translated text having the same word order as the source. It also makes the job of translation much more difficult as the translator does not see the whole sentence.
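The points about text length and encoding can be checked with a short Python sketch. The helper names encoded_size and fits_latin1 are illustrative only; the strings are the "Welcome" example used above. It compares the size of the source and target strings in UTF-8 and UTF-16, and shows that the Belarusian text cannot be represented in ISO 8859-1 at all.

```python
source = "Welcome"
target = "шчыра запрашаем"  # the Belarusian translation quoted in the text


def encoded_size(text, encoding):
    """Number of bytes needed to store the text in the given encoding."""
    return len(text.encode(encoding))


def fits_latin1(text):
    """True if every character can be represented in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


for text in (source, target):
    print(
        len(text),
        encoded_size(text, "utf-8"),
        encoded_size(text, "utf-16-le"),
        fits_latin1(text),
    )
```

Here the target is roughly twice as long as the source in characters (15 against 7), and in UTF-8 it needs 29 bytes against 7, so any fixed-width field or buffer sized for the English text would overflow.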
<para>
<line>This text should not be</line>
<line>broken this way – the translated
text may well be in a different order.</line>
</para>
Example 12: Example of a sentence broken over more than one element:
Use logical elements instead that encompass the text.
<para><b>Do not use</b> '<br/>' type elements. </para>
Example 13: Example of typographical element usage:
Use emph instead of bold.
Encompass any text that must appear on a line of its own with line elements.
<para>
<emph>Do not use</emph> 'br' type elements.
</para>
Example 14: Suggested correct usage:
Avoid at all cost introducing any line breaks into the text stream. You can unconditionally guarantee that this will cause problems in some if not all of the target languages.
Keep non-translatable PCDATA in different elements than translatable PCDATA.
<data-items>
<data id="class">
com.xmlintl.data.dataDefinition
</data>
<data id="text">
Replace generic data
definitions with specific instances.
</data>
</data-items>
Example 15: Example of mixed PCDATA:
Most XML translation tools will have problems with this type of construct. It is only when inspecting the 'id' attribute that a decision can be made as to whether the PCDATA should be extracted or not.
<data-items>
<class id="com.xmlintl.data.dataDefinition">
<text>
Replace generic data definitions with specific instances.
</text>
</class>
</data-items>
Example 16: Suggested solution:
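The difference between the two constructs can be sketched from the point of view of an extraction rule. This is a Python illustration only, using the standard xml.etree.ElementTree module on compact versions of Examples 15 and 16; the function names are hypothetical. In the mixed form the rule has to test an attribute value, whereas in the separated form the element name alone identifies the translatable content.

```python
import xml.etree.ElementTree as ET

MIXED = """<data-items>
<data id="class">com.xmlintl.data.dataDefinition</data>
<data id="text">Replace generic data definitions with specific instances.</data>
</data-items>"""

SEPARATED = """<data-items>
<class id="com.xmlintl.data.dataDefinition">
<text>Replace generic data definitions with specific instances.</text>
</class>
</data-items>"""


def translatable_mixed(xml_source):
    # The element name says nothing; the rule must inspect an attribute value.
    root = ET.fromstring(xml_source)
    return [d.text.strip() for d in root.iter("data") if d.get("id") == "text"]


def translatable_separated(xml_source):
    # The element name alone identifies the translatable content.
    root = ET.fromstring(xml_source)
    return [t.text.strip() for t in root.iter("text")]


print(translatable_mixed(MIXED))
print(translatable_separated(SEPARATED))
```

Both functions return the same sentence, but only the second rule can be expressed as a simple list of translatable element names.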
This can cause all manner of problems for processing and extraction tools.
<para>
<text xml:lang="en">
My hovercraft is full of eels.
</text>
<text xml:lang="fr">
Mon aéroglisseur est plein d'anguilles.
</text>
<text xml:lang="hu">
Légpárnás hajóm tele van angolnákkal.
</text>
<text xml:lang="ja">
私のホバークラフトは鰻で一杯です。
</text>
<text xml:lang="pl">
Mój poduszkowiec jest pełen węgorzy.
</text>
<text xml:lang="es">
Mi aerodeslizador está lleno de anguilas.
</text>
<text xml:lang="zh-CH">
我隻氣墊船裝滿晒鱔.
</text>
<text xml:lang="zh-TW">
我的氣墊船充滿了鱔魚 [我的气垫船充满了鳝鱼]
</text>
</para>
Example 17: Example of mixed source and target PCDATA:
Unless your document requires mixed language content use a separate document instance to store each target language version. If you store both source and target data in the same document it will become unwieldy, overly large and cumbersome to process.
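Where a multilingual document of the kind shown in Example 17 already exists, it can be split into one instance per language. The following Python sketch is a minimal illustration using the standard xml.etree.ElementTree module on a shortened version of the example; the function name split_by_language is hypothetical, and the sketch assumes that every direct child carries an xml:lang attribute.

```python
import copy
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

MULTILINGUAL = """<para>
<text xml:lang="en">My hovercraft is full of eels.</text>
<text xml:lang="fr">Mon aéroglisseur est plein d'anguilles.</text>
<text xml:lang="pl">Mój poduszkowiec jest pełen węgorzy.</text>
</para>"""


def split_by_language(xml_source):
    """Return a dict mapping each language code to its own document."""
    root = ET.fromstring(xml_source)
    documents = {}
    for child in root:
        lang = child.get(XML_LANG)
        # Create the per-language document on first sight of the language.
        doc = documents.setdefault(lang, ET.Element(root.tag, {XML_LANG: lang}))
        doc.append(copy.deepcopy(child))
    return documents


docs = split_by_language(MULTILINGUAL)
for lang, doc in docs.items():
    print(lang, ET.tostring(doc, encoding="unicode"))
```

Each resulting document can then be processed, validated and versioned independently of the other languages.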
Keep any PCDATA that requires translation in different elements from PCDATA that does not require translation. Use special elements for text within PCDATA that is specifically not to be translated.
<para>
The following part of this sentence should
<notrans>not be translated</notrans>
at all.
</para>
Example 18: Suggested solution:
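A dedicated element such as notrans gives the extraction software something unambiguous to act on. The Python sketch below is an illustration only: the function name to_segment and the {0} placeholder syntax are assumptions made for this example, not the behaviour of any particular tool or of XLIFF. It replaces protected text with a numbered placeholder in the segment sent for translation and keeps the protected text aside for reinsertion.

```python
import xml.etree.ElementTree as ET

# Example 18 on a single line.
SOURCE = (
    "<para>The following part of this sentence should "
    "<notrans>not be translated</notrans> at all.</para>"
)


def to_segment(xml_source):
    """Return (segment, protected); protected text is replaced by {n}."""
    root = ET.fromstring(xml_source)
    parts = [root.text or ""]
    protected = []
    for child in root:
        if child.tag == "notrans":
            # Hide the protected text behind a numbered placeholder.
            parts.append("{%d}" % len(protected))
            protected.append("".join(child.itertext()))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts), protected


segment, protected = to_segment(SOURCE)
print(segment)
print(segment.format(*protected))
```

The translator can move the placeholder to wherever the target language requires it, but cannot alter the protected text itself.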
Yves Savourel, who has done so much good work in the field of localizing XML, has an excellent web page dedicated to the subject: the XML Internationalization and Localization FAQ. Another very good reference is the paper by Richard Ishida of the W3C, Localisation Considerations in DTD Design.
If the source text is properly written in a clear and understandable manner, then it will be easier to read and easier to localize. It is worth investing in tools that will check the grammar and terminology in your source text. Without tools, your authors do not have a benchmark to test themselves against and it is all too easy for poorly written text to get into your documents.
The translation industry has been an enthusiastic creator and adopter of XML based standards. Special mention must be made here of LISA (Localization Industry Standards Association) which has been responsible through its OSCAR (Open Standards for Container/Content Allowing Re-use) committee for the following standards:
The OLIF Consortium was set up for the interchange of Lexicons and terminology specifically aimed at machine translation systems.
The other body involved in translation based XML formats is OASIS (Organization for the Advancement of Structured Information Standards) which is responsible for many XML standards. The most relevant OASIS technical committees regarding translation are:
All of these excellent standards relate to the exchange of translation data using XML as the interchange format. They do not address the issue of the actual translation of XML documents themselves. This is where xml:tm comes in.
xml:tm deals with the issue of how to simplify and reduce the costs of translating XML based documents. It introduces the concept of "text memory" which is maintained automatically and transparently in the source and target versions of the same document. It automates the task of managing memory and provides the first real advances in the realms of translation memory in 20 years.
xml:tm was created by XML Intl based on 12 years of in depth experience in designing enterprise level SGML and XML translation memory systems. xml:tm has been offered to LISA for consideration as an open OSCAR standard. A full specification of xml:tm is available online[10].
Information encoded in XML has many benefits in terms of translation:
The creation and maintenance of an XML based translation memory (TM) has similar advantages for creators of large corpora of XML encoded information. All XML based corpora have the potential to compile their own translation memory over time.
Whereas traditional translation memory systems have only concentrated on the translation aspect of the document lifecycle, xml:tm goes deeper into the lifecycle process establishing the concept of “text memory”. Each sentence (text unit) in the document is given a unique identifier. This identifier remains immutable for the life of the document. xml:tm uses the XML namespace mechanism to achieve this.
The following diagram shows how the xml:tm namespace coexists within an XML document:

At the core of xml:tm is the concept of “text memory”. Text memory is made up of two components:
The XML namespace mechanism is used to map a text memory view onto a document. This process is called segmentation. The text memory works at the sentence level of granularity - the text unit. Each individual xml:tm text unit is allocated a unique identifier. This unique identifier is immutable for the life of the document. As a document goes through its life cycle the unique identifiers are maintained and new ones are allocated as required. This aspect of text memory is called author memory. It can be used to build author memory systems which simplify authoring and improve its consistency. A detailed technical article about xml:tm has been published on O'Reilly's xml.com web site[11].
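The idea of segmentation can be sketched in Python. This is a loose illustration and not the xml:tm specification: the namespace URI urn:example:text-memory, the element name tu, the identifier format and the naive sentence-splitting rule are all assumptions made for the example. It wraps each sentence of a text-only element in a namespaced element carrying a unique identifier.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; the real one is defined by the xml:tm specification.
TM_NS = "urn:example:text-memory"
ET.register_namespace("tm", TM_NS)


def segment(para, next_id=1):
    """Wrap each sentence of a text-only element in a tm:tu element.

    Returns the next free identifier number so that numbering can continue
    across elements. Sentence detection is deliberately naive.
    """
    sentences = re.findall(r"[^.!?]+[.!?]", para.text or "")
    para.text = None
    for offset, sentence in enumerate(sentences):
        unit = ET.SubElement(
            para, "{%s}tu" % TM_NS, {"id": "u%d" % (next_id + offset)}
        )
        unit.text = sentence.strip()
    return next_id + len(sentences)


para = ET.fromstring("<para>Remove the cover. Release the catch.</para>")
following_id = segment(para)
print(ET.tostring(para, encoding="unicode"))
```

Because the added elements live in their own namespace, they can be stripped out again to recover the document in its original vocabulary. A real implementation would also need to preserve inter-sentence whitespace and inline elements, which this sketch ignores.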
The use of xml:tm greatly improves upon the traditional translation route. Each text unit in a document has a unique identifier. When the document is translated the target version of the file has the same identifiers. The source and target documents are therefore perfectly aligned at the text unit level. The following diagram shows how perfect matching is achieved:

The first step in the translation process is that all the raw XML data is extracted to an XML standard for translation (e.g. XLIFF). This extraction process uses the xml:tm namespace to identify all of the text that requires translation. For those text units that have not changed since the last update, the target language text can be automatically inserted. This process is called perfect matching and does not require any translator intervention thereby reducing the cost.
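Perfect matching as described above can be sketched in Python. This is a simplified model rather than the xml:tm implementation: documents are reduced to dictionaries mapping text unit identifiers to text, the function name perfect_match is hypothetical, and the sample sentences and French translations are invented for the example. A unit is reused only when its identifier exists in the previous version and its source text is unchanged.

```python
def perfect_match(previous_source, previous_target, current_source):
    """Split the current text units into reusable translations and new work.

    Each argument maps a text unit identifier to its text.
    """
    matched, to_translate = {}, {}
    for unit_id, text in current_source.items():
        unchanged = previous_source.get(unit_id) == text
        if unchanged and unit_id in previous_target:
            # Same identifier, same source text: reuse without intervention.
            matched[unit_id] = previous_target[unit_id]
        else:
            to_translate[unit_id] = text
    return matched, to_translate


previous_source = {"u1": "Remove the cover.", "u2": "Release the catch."}
previous_target = {"u1": "Retirez le capot.", "u2": "Libérez le loquet."}
current_source = {
    "u1": "Remove the cover.",
    "u2": "Release the retention catch.",  # edited since the last version
    "u3": "Replace the cover.",  # newly authored
}

matched, to_translate = perfect_match(previous_source, previous_target, current_source)
print(matched)
print(to_translate)
```

Only the edited and the new unit are passed on for translation; the unchanged unit is filled in automatically.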
xml:tm can be used to perform text unit matching within the same document, looking for previous translations of the new text which may be the same as existing translated text. This type of matching is called in-document leveraged matching and requires proofing for context acceptance by a translator.
xml:tm can be used to perform text unit matching across a customer's entire corpus in the traditional translation memory way. The original translated source and target text can be loaded into a translation memory database at the text unit level. This type of matching is called leveraged matching and requires proofing for context acceptance by a translator.
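Leveraged matching differs from perfect matching in that the lookup is by source text rather than by identifier. The Python sketch below is a simplified model only: the translation memory is reduced to a dictionary of exact source and target pairs, the function name leveraged_match is hypothetical, and the sample data is invented. Proposals found this way still need to be proofed in context by a translator.

```python
def leveraged_match(translation_memory, to_translate):
    """Propose stored translations for units whose source text is already known.

    translation_memory maps source text to target text; to_translate maps
    text unit identifiers to source text.
    """
    proposals, remaining = {}, {}
    for unit_id, text in to_translate.items():
        if text in translation_memory:
            # Same wording seen elsewhere: propose it, subject to review.
            proposals[unit_id] = translation_memory[text]
        else:
            remaining[unit_id] = text
    return proposals, remaining


translation_memory = {"Replace the cover.": "Remettez le capot en place."}
pending = {
    "u2": "Release the retention catch.",
    "u3": "Replace the cover.",
}

proposals, remaining = leveraged_match(translation_memory, pending)
print(proposals)
print(remaining)
```

The same function serves for in-document leveraged matching if the dictionary is built from the text units of the current document instead of the whole corpus.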
Unlike a supplier whose memory is only able to match against those objects the customer has supplied for translation, a customer can match against anything that has previously been translated regardless of translation supplier. xml:tm can identify a number of different categories of translated material:
xml:tm provides a much more cost effective and better tuned matching mechanism than traditional translation memory systems.
The advantages of xml:tm for a customer are enormous:
Translation interface - as the XML integrity of the information requiring translation is preserved, it is possible to present it for translation in a web-interface. This means that a translator can perform a translation using a browser, without the need for expensive, proprietary translation tools.
xml:tm preserves pre-populated matched files in a native XML format. This has significant advantages for translation. It means that a translator will not need either a translation memory or specific translation tools to perform a translation. Rather the pre-matched files can be delivered via a web application for translation either online or offline. However, such an application is predicated upon the use of xml:tm and the XML family of translation standards such as XLIFF.
Allowing translators direct access to the text via the Internet provides the opportunity to reduce translation costs significantly in itself.