XML Europe 2003

Using XSL Formatting Objects for Production-Quality Internationalized Document Printing

Abstract

The XSL Formatting Objects (XSL-FO) specification was designed from the start to be locale and language neutral. This makes XSL-FO well suited to the task of composing internationalized documents for print, and in particular, documents in non-Western languages. However, users of XSL-FO depend both on the implementation of XSL-FO internationalization features, such as support for different writing modes, and on the implementation of locale-specific composition functions, such as line breaking, hyphenation, and glyph shaping. Users of XSL-FO that use XSLT to generate FO instances are also dependent on the ability to do locale-specific processing within the XSLT transform, including the localization of generated text strings and locale-specific sorting for back-of-the-book indexes. This paper evaluates the XSL 1.0 specification and the currently-available implementations against the composition requirements of internationalized documents, including Arabic, Hebrew, Thai, and other Asian languages. It discusses the various challenges inherent in the production of documents in non-Western languages and the XSLT and XSL-FO facilities available for meeting those challenges. We then report our experience in using XSL-FO with commercial and open source tools to produce localized hardware user manuals for a line of consumer computer peripherals. We discuss the XSLT and XSL issues, as well as XSL-FO extensions that may be required to satisfy typical print production requirements. Finally, we provide a set of recommendations, based on the current state of the XSL specification and the current state of tools, as to when the use of XSL-FO is appropriate and which XSL-FO implementations are best suited to which tasks or ruled out by certain sets of requirements.

Table of Contents

1. Introduction
2. Background
2.1. A Historical Perspective on XSL and Internationalized Documents
2.2. Internationalization and Localization Challenges for XSL-FO
3. XSL Formatting Objects Overview
3.1. XSL FO Basics
3.2. Typical FO-Based Production System
4. Internationalization and Localization Basics
4.1. Writing Systems, Languages, and Locales
4.2. Overview of Character Sets, Encodings, and Fonts
4.2.1. Characters and Bytes
4.2.2. Characters, Glyphs, and Fonts
4.2.3. Collation
5. Layout and Formatting Issues
5.1. Controlling Writing Mode
5.2. Characters, Glyphs, and Fonts
5.3. Line Composition
6. XSLT Processing Issues
6.1. Generated Text Management
6.2. Collation and Sorting
7. Current FO Implementations
7.1. Overview of FO Implementations
7.2. Support for Non-Western Languages
7.3. Support for Graphics and Mathematics
7.4. Support for Non-RGB Color Models
8. Future Directions
9. Conclusions
Bibliography
Biography

1. Introduction

Support for all human languages has always been one of the key requirements for XML and the W3C has made internationalization a focus of its activities, helping to ensure that the entire family of XML-related and Web-enabling specifications support and enable internationalization and localization as much as possible.

The goal of the XSL Formatting Objects[XSL-FO] specification is to enable the automatic creation of sophisticated renditions of XML-based data, primarily, but not limited to, the production of printed pages. Thus the FO specification provides a number of features that support the layout requirements of different writing systems and languages, including providing different writing modes, precise control of glyph placement, control of bi-directional text, and so on. Coupled with features of processing languages such as XSLT and Java it is possible to create a single document processing system that can create appropriate renditions of documents in a large percentage of, if not all, modern languages.

Of course there are limitations in the technology and some inherent difficulties in satisfying the requirements of some of the more challenging languages. The purpose of this paper is to highlight the internationalization features of XSL Formatting Objects and its supporting technologies and to discuss some of the practical challenges inherent in the task of producing high-quality pages from non-Western languages. The challenges and solutions presented in this paper largely reflect ISOGEN's experience in developing FO-based publishing systems for companies that sell consumer products to a world market, with requirements for delivering documents in more than 50 different national languages, including Middle Eastern (right-to-left) and Asian languages, some of which, such as Thai, use challenging writing systems. This paper also reflects what we have learned from FO implementors and localization service suppliers about the challenges of localization and localized document production.

For XSL FO and XSLT, both internationalization and localization are required. The processing that generates the FO instances must be internationalized so that it can generate FO instances that are appropriately localized for the target language or languages they reflect.

Internationalization normally means adapting information systems to support a variety of different national languages, and usually implies support for non-Western languages (that is, languages that do not use a Latin-based script). An internationalized document is normally one that can contain content in any number of languages, for example, a technical document that contains both the original-language content and the translations of that content.

Localization normally means adapting an information system to a specific national language. The act of translation from one language to the other is the act of localizing the document to the target language. By the same token, software and hardware are localized by translating their user interface components to reflect a specific language and writing system.

The term internationalization is often abbreviated “i18n” (the letter “i”, the number 18, the letter “n”). The term localization is often abbreviated “l10n” (the letter el, the number 10, the letter “n”).
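Both abbreviations follow the same rule: keep the first and last letters and replace the letters in between with their count. A minimal Python sketch of that rule (the function name `numeronym` is ours, chosen for illustration):

```python
def numeronym(word: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(word) < 4:
        return word  # too short to be worth abbreviating
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for term in ("internationalization", "localization"):
    print(term, "->", numeronym(term))
```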

2. Background

This section gives a brief historical perspective on how we got into the mess we are in with national language support in computer systems, followed by an overview of the internationalization challenges inherent in the use of XSL-FO for document composition.

2.1. A Historical Perspective on XSL and Internationalized Documents

Until very recently it was difficult, bordering on impossible, to have a single information system that could manage the authoring, storage, and rendition of documents in both Western and non-Western languages. It is an accident of history that the first computer systems were built by peoples whose writing systems could be represented by the simplest possible scheme: one byte per character. It is also an accident of history that the language group with the largest single constituency of native speakers,[1] Chinese in its various forms, was physically, culturally, politically, and economically isolated during the first two decades of the computer revolution. As a result, a group that should otherwise have had a tremendous influence on the design of computer systems was almost entirely ignored: Western computer scientists and engineers could largely disregard the needs of ideographic languages because there was little or no market pressure to address them. The other peoples with ideographic writing systems, such as the Japanese, also had the technological capacity to satisfy their requirements locally. Because communication between people and enterprises was still primarily via paper, rather than electronic data interchange, communication and commerce were not significantly impeded by the fact that computer systems adapted for non-Western languages produced data sets that could not easily be used by computer systems adapted for Western languages. As long as one could produce paper, communication happened.

The eventual result was a world in which each country or language group had its own private scheme for representing data, using tricks like code-page swapping and language-specific encoding schemes to enable the representation of writing systems that have tens of thousands of unique characters. Software and hardware systems that had been developed with the assumption that a character was always exactly one byte were expensive to adapt to languages that required two or more bytes to represent a single character.

The advent of electronic data interchange coupled with advances in global transportation of goods meant that paper was no longer sufficient to enable communication for commerce. Now data, not paper, flowed between enterprises. Western companies and governments could mandate English or other Western languages as the language for data, but clearly that would not work for commerce entirely within Asia, for example. As China, in particular, reformed its economic relationship with the world, opening itself to greater commercial interaction, the need for smoother electronic communication became more pressing.

It started to become clear that computer systems based on a Western assumption about writing systems were not going to provide a workable solution in the long term. Efforts such as the Unicode standard and ISO 10646 addressed this issue by attempting to define a single character encoding system that could accommodate all the world's languages, paving the way for computer systems that could be used as-is with any national language.

At the same time, truly global companies and truly global markets emerged. Huge companies like IBM, Sony, and Hewlett-Packard became suppliers to the world, selling products everywhere. This naturally required that the documentation for these products had to be localized for each market. In the 1980's and 1990's the cost of this localization was quite high, often representing the bulk of the cost of developing and maintaining the information. Computer systems for doing document publishing were almost exclusively Western and did not support non-Western languages well. Publishing documents in Arabic or Traditional Chinese or Thai was technically difficult and expensive because it required specialized versions of publishing systems (usually desktop publishing systems or word processors). Often specific languages would require the use of a publishing tool different from the one used for the base language, further complicating the localization process.

In the late 1980's, the SGML standard provided a standard way to represent character-based data, and in particular the documents that support the kind of products that IBM, Sony, and HP sell. SGML seemed to offer an obvious solution to some of the problems with data representation and information management that were leading to the high costs of localization. Asian countries found SGML particularly attractive because it was a standard that provided a way to standardize not just the semantic identification of information components, but the details of how that data was encoded as sequences of characters, something they had never had before. Unfortunately, SGML still reflected a Western bias—while it provided facilities for supporting non-Western languages, its facilities were somewhat cumbersome. SGML was definitely better than anything that had come before, but there was still a lot of room for improvement.

In the 1990's the advent of the World Wide Web made global electronic commerce not only a reality but an unavoidable imperative. Suddenly everybody, not just EDI wonks, saw that the world not only could be completely connected but in fact was completely connected. At the same time, enterprises such as Wal-Mart and Dell, which sold products produced all over the world and based their business in large part on keeping all operational costs as low as possible, grew to the point where they could demand that suppliers use the most effective and efficient electronic commerce technology available. Meanwhile, the speed of the marketplace increased as communication became more efficient. Suddenly electronic commerce was not just a luxury but required to compete at all.

When the XML Working Group started developing XML we understood the limitations of the then-current technologies with respect to localization and internationalization and vowed to do something about it. We made “support for all human languages” one of the top requirements for XML. Our focus, at least initially, was enabling the publishing of high-quality renditions of marked up documents on the Web. And we wanted it to be as easy to publish a Chinese or Thai or Arabic document as it was to publish an English or French or Italian document.

Now, more than 5 years after the publication of XML we have started to realize this goal with the advent of XSL FO and FO implementations that implement the needed internationalization features.

2.2. Internationalization and Localization Challenges for XSL-FO

The XML recommendation enables internationalization and localization in two key ways: requiring the use of Unicode for character representation and providing a built-in attribute for binding elements to national languages. These two features of XML are necessary to do internationalized document publishing but are not sufficient. For a complete solution, we must have a presentation specification mechanism that can describe the layout and formatting needs of any writing system, language, and culture, as well as implementations of it. This is the goal of XSL Formatting Objects.[2]

The XSL FO specification addresses internationalization requirements by providing an architecture and model for doing complex layouts that is explicitly designed to support any writing mode and any system of glyph construction and placement. In particular XSL-FO does not privilege left-to-right writing modes or Latin-based writing systems. It attempts to provide the layout facilities needed to express the typographic and layout conventions of different writing systems, languages, and cultures (although it doesn't always succeed; for example, there are some requirements of traditional Japanese typography that XSL FO cannot currently satisfy without extensions; I don't know what these are but I have it on good authority).

However, there is more to rendering documents than typography and layout. In particular, the details of line composition are highly dependent on the details of a given writing system and language, on typographic preference, and on typographic implementation capabilities. Key variables are the use or non-use of hyphenation, which is always language dependent, the accurate determination of appropriate line breaking points, and details of glyph placement, which in some languages and writing systems are quite complex. All of these fall outside the scope of what can be standardized by the FO specification and are therefore up to implementations to support and implement.

Another challenge that falls outside the scope of the FO specification is collation (sorting), of which the most obvious application is back-of-the-book indexing, but which may be needed for other things, such as generating glossaries, parts lists, and so on. Because collation rules are highly variable, even within a single language or locale, it is impossible to standardize collation at the FO or XSLT level, leaving collation specification and implementation as part of the local configuration of FO-based publishing systems.
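To make the collation problem concrete, the following Python sketch sorts three index entries twice: once by raw Unicode code point, and once with a deliberately crude key that strips combining marks and folds case. The key function `naive_collation_key` is our own illustration, not a real collation algorithm; actual locale rules are more involved (German dictionary and phone-book orderings, for instance, treat umlauts differently from each other).

```python
import unicodedata


def naive_collation_key(s: str) -> str:
    """Crude sort key: drop combining marks, then fold case.

    This is an illustrative stand-in, not a locale-aware collation.
    """
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


entries = ["zebra", "\u00c4pfel", "apple"]  # \u00c4 is A with diaeresis

by_code_point = sorted(entries)
by_naive_key = sorted(entries, key=naive_collation_key)

print(by_code_point)  # the accented entry lands after "zebra"
print(by_naive_key)   # the accented entry sorts with the other "a" words
```

Even this toy example shows why an index sorted by character code alone is not acceptable to readers, and why the sort key has to be supplied per language as part of the local configuration.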

Individually, none of the localization challenges faced by an FO-based publishing system is that difficult to address, but they must all be addressed in order to have a complete publishing system. No out-of-the-box solution, and certainly no standard by itself, is going to address these challenges completely, if at all. In all cases, part of the task of implementing an FO-based publishing solution for localized documents will be to define and implement solutions to the various challenges identified in this paper.

Fortunately there are a number of resources to assist with that task.

3. XSL Formatting Objects Overview

Fundamentally, XSL Formatting Objects is an XML document type that defines elements that represent the physical components of laid-out pages, such as page sequences, blocks, inlines, footnotes, floats, and so on. It is analogous to HTML in that it is a generalized markup language that is intended to be interpreted by FO implementations in order to produce final form renditions, such as PDF files, directly-printed pages, or scrollable online displays. Like HTML, FO document instances can be created directly by authors but it is much more likely that they will be created from other sources, such as more specialized XML documents.

3.1. XSL FO Basics

The FO specification reflects the lessons learned over the last 40 years of doing computer-based document layout and typesetting. The basic FO constructs will be familiar to anyone who has used a desktop publishing system like FrameMaker or Interleaf or a markup-based layout system like GML or LaTeX. Its basic architecture presumes a two-stage process in which the FO document instance is produced from some input, e.g., by an XSLT transform applied to an XML document, say a technical manual or purchase order. The FO instance is then processed by an FO implementation to generate (abstractly) an “area tree”, which represents the paginated result of processing the formatting objects in terms of the semantics defined in the FO specification. Thus the abstract processing model for XSL-FO is the placement of a hierarchy of rectangular areas to build up sequences of pages (note that implementations need not literally implement an area tree and at least two widely-used FO implementations do not).

This architecture is quite powerful and enables fairly sophisticated formatting from a well-structured, easy-to-generate markup structure. One important aspect of this model is that it is not biased toward a particular writing mode. That is, the geometry model used in XSL-FO is defined entirely in terms of writing direction. Except for the physical page itself, all positioning in XSL-FO is relative to the directions in which lines are laid out and blocks are placed on the page. This has the effect of making it just as easy to define the layout for a right-to-left or top-to-bottom language as it is for Western left-to-right languages. It also means that most of the formatting for a document will “just work” when you change writing mode—low-level formatting objects do not normally need to be aware of what the writing mode is. Even when shorthands inherited from the CSS[CSS] specification use terms like “left” and “top”, they are mapped to the more generic concepts “start” and “before”. For example, in a right-to-left layout, the CSS value “left” maps to “start”, which would in fact be on the right side of the physical page (because right-to-left lines start on the right, not the left).
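The relationship between writing-mode-relative directions and physical page sides can be sketched as a simple lookup table. The Python below is only an illustration of the idea; the table name and the function `physical_side` are ours, and the three writing-mode names (`lr-tb`, `rl-tb`, `tb-rl`) are the values defined for the XSL 1.0 `writing-mode` property.

```python
# Physical side for each writing-mode-relative direction.
# "start"/"end" follow the direction in which lines are laid out;
# "before"/"after" follow the direction in which blocks are stacked.
RELATIVE_TO_PHYSICAL = {
    "lr-tb": {"start": "left", "end": "right", "before": "top", "after": "bottom"},
    "rl-tb": {"start": "right", "end": "left", "before": "top", "after": "bottom"},
    "tb-rl": {"start": "top", "end": "bottom", "before": "right", "after": "left"},
}


def physical_side(writing_mode: str, relative: str) -> str:
    """Resolve a relative direction to a physical page side."""
    return RELATIVE_TO_PHYSICAL[writing_mode][relative]


for mode in RELATIVE_TO_PHYSICAL:
    print(mode, "start ->", physical_side(mode, "start"))
```

A style sheet written in terms of “start” and “before” therefore needs no changes when the writing mode changes; only the lookup performed by the formatter differs.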

XSL FO also provides a number of options for positioning glyphs with respect to different baselines to better accommodate the needs of different writing systems.

There are some layout limitations in XSL 1.0. For a discussion of these limitations see [Prod FO].

An FO instance consists of two mandatory parts: page layout master definitions and one or more page sequences. The page layout masters define the geometry of the different pages needed for a document, including the physical page size and the size of the different page areas (headers, footers, body, etc.).

The page sequences contain the actual document content (the data to be presented on the rendered pages), represented as a hierarchy of “flow objects” of various types. A page sequence consists of a “flow”, which is the flow objects that are flowed from page to page, and any “static content”, that is, content that is either invariant from page to page, reflects the paginated result (e.g., page numbers), or reflects data pulled from a particular page (for example, section titles reflected in a running foot or running head).

Both the flow and static content consist of trees of block flow objects which then contain the remaining structuring elements: inlines, floats, tables, and so on. In XSL 1.0 there can be only one flow per page sequence.

Each flow object has a set of properties or characteristics that define the details of its intended rendition result: font details, placement details, spacing before and after, indentions, borders, colors, etc. Properties that are defined as “inheritable” are automatically propagated from ancestors to descendants within the XML element hierarchy by the FO processor according to the FO-defined rules for property refinement.

Figure 1 shows a minimal FO document instance.

<?xml version="1.0"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="mypage"
        page-height="11in"
        page-width="8.5in">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="mypage">
    <fo:flow flow-name="xsl-region-body"
        font-family="sans-serif">
      <fo:block
        ><fo:inline
            font-weight="bold"
            font-size="42pt"
        >Hello world!</fo:inline
      ></fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>

Figure 1. A minimal FO document instance
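As a rough illustration of the structure and of property inheritance, the following Python sketch parses an instance equivalent to the one in Figure 1 and looks up property values by walking from an element toward the root. The helper `inherited` is our own simplification: it only finds the nearest ancestor that specifies the property, and does not attempt the full property refinement rules of the FO specification.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"

FO_INSTANCE = """<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="mypage"
        page-height="11in" page-width="8.5in">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="mypage">
    <fo:flow flow-name="xsl-region-body" font-family="sans-serif">
      <fo:block><fo:inline font-weight="bold"
          font-size="42pt">Hello world!</fo:inline></fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>"""


def inherited(element, prop, parents):
    """Return the value of prop from the element or its nearest ancestor."""
    while element is not None:
        if prop in element.attrib:
            return element.attrib[prop]
        element = parents.get(element)
    return None


root = ET.fromstring(FO_INSTANCE)
# ElementTree has no parent pointers, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}

# The two mandatory parts: layout masters and page sequences.
top_level = [child.tag.split("}")[1] for child in root]
print(top_level)

inline = root.find(f".//{{{FO_NS}}}inline")
print(inherited(inline, "font-family", parents))  # specified on fo:flow
print(inherited(inline, "font-size", parents))    # specified on fo:inline
```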

3.2. Typical FO-Based Production System

While FO document instances can of course be generated by any process, the typical use of XSL-FO is to generate renditions from XML documents using XSLT transforms. Figure 2 shows the typical XSLT-based production system.

Figure 2. Typical XSLT-based production system

In this two-stage process the XSLT style sheet for generating the XSL-FO instance, including doing any sorting that may be required by auto-generated document content, such as back-of-the-book indexes, is used with an XSLT engine to produce an FO instance document. It is also in the XSLT processing that other content processing, such as inserting word breaks in Thai content, may be applied (although this task may be performed by the FO implementation). The FO instance, along with any external graphics it might reference, is processed by an FO implementation to produce the paginated result. The paginated output can take any number of forms, although the most common output formats are PDF and PostScript (which may then be distilled into PDF). FO implementations may also print directly to printers or produce other printable formats.

In addition to the FO instance itself, FO implementations typically provide facilities for configuring fonts and hyphenation, usually through some sort of configuration file.

The XSLT process may be extended to do things like pull data from external databases, provide custom collation functionality, or provide extension functions that support the specific needs of a particular input document type.
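The first stage of this process can be sketched without an XSLT engine. In the Python below, a hand-written function stands in for the XSLT transform: it reads a tiny source document and emits an FO instance. Everything here is an assumption made for illustration (the `manual` and `para` element names, the helper `fo`, the function `source_to_fo`); a real system would use an XSLT style sheet, and the resulting FO instance would then be handed to an FO implementation for the second stage, which is not shown.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)


def fo(tag, parent=None, **attrs):
    """Create an FO element; underscores in keyword names become hyphens."""
    attrs = {key.replace("_", "-"): value for key, value in attrs.items()}
    name = f"{{{FO_NS}}}{tag}"
    if parent is None:
        return ET.Element(name, attrs)
    return ET.SubElement(parent, name, attrs)


def source_to_fo(source_xml: str) -> str:
    """Stage 1 stand-in: map each <para> of the source to an fo:block."""
    doc = ET.fromstring(source_xml)
    root = fo("root")
    masters = fo("layout-master-set", root)
    page = fo("simple-page-master", masters, master_name="page",
              page_height="11in", page_width="8.5in")
    fo("region-body", page)
    sequence = fo("page-sequence", root, master_reference="page")
    flow = fo("flow", sequence, flow_name="xsl-region-body")
    for para in doc.iter("para"):
        fo("block", flow).text = para.text
    return ET.tostring(root, encoding="unicode")


fo_instance = source_to_fo("<manual><para>Hello</para><para>World</para></manual>")
print(fo_instance)
```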

4. Internationalization and Localization Basics

In order to understand the XSL-FO and XSLT facilities for localization and internationalization it is necessary to understand a few basic concepts that can usually be ignored when working only with Western-language documents.

4.1. Writing Systems, Languages, and Locales

Informally we often use the term “language” when what we really mean is “script” or “writing system.” It's an important distinction when working with documents in a variety of national languages.

A writing system is “the set of glyphs used for representing a given human language in written form, generally along with their conventions for use”[3]. That is, a set of characters, ideographs, or other symbols that are used to write down statements in a language. The symbols that make up a writing system represent a script. For example, the Latin script defines all the characters used by the languages of most of the countries in Western Europe, North America, and Africa, but different languages may use different parts of the script (for example, English does not normally use any characters with diacritical marks). Thus the same script may be used by several writing systems.

A language, in this context, is:[4]

  1. Communication of thoughts and feelings through a system of arbitrary signals, such as voice sounds, gestures, or written symbols.

  2. Such a system including its rules for combining its components, such as words.

  3. Such a system as used by a nation, people, or other distinct community.

The same writing system may be used by a number of different languages. By the same token, a single language may have several different writing systems by which it can be written, such as Tagalog, which has both a Latin-based writing system and a historical Indic-script-based writing system used before Spanish colonization of the Philippine islands.

In practice a given national language implies a particular writing system and script, but a script does not imply a language. For example, the Arabic language uses the Arabic script, but so do a number of other languages. Thus you must also be careful to indicate whether you mean the language or the script when discussing such things.

When investigating whether or not a given publishing system will support the languages you need or the effort required to support a given set of languages, it is usually the script and writing system that determines the cost, not the language. That is, the same challenges and solutions will generally apply to all the languages that use a particular script, except for those things that are dependent on the actual vocabulary and conventions of a given language, such as hyphenation. Most of the limitations in tools stem from limitations in their abilities to support specific scripts, such as glyph shaping for Arabic or glyph composition for Thai, rather than limitations in language-specific functions like hyphenation, for which most tools will provide a general configuration mechanism.

The term locale as used in this context usually refers to a specific geographic or political entity, i.e., a country or region within a country. In XML, locales are formally identified using the two- or three-letter language codes defined in ISO 639[ISO 639], optionally qualified with a country code. For example, the language code “zh” indicates the language Chinese in general. To indicate Chinese as spoken in Hong Kong, you would add the country code “HK”: zh-HK. By convention, language codes are written in lowercase and country codes in uppercase.

FO implementations use locale codes, usually specified on the xml:lang attribute, to apply locale-specific processing to formatting objects, such as font selection and hyphenation algorithms. The Java language also uses ISO locale codes to enable locale-specific processing within Java programs.
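A minimal Python sketch of how a processing step might split such a code into its parts before choosing locale-specific behavior. The function `parse_locale` is hypothetical and handles only the simple language-plus-optional-country form described above, not longer tags with additional subtags.

```python
def parse_locale(code: str):
    """Split a code such as 'zh-HK' into (language, country).

    The country part is optional; case is normalized to the usual
    convention of lowercase language and uppercase country.
    """
    language, _, country = code.partition("-")
    return language.lower(), (country.upper() or None)


for code in ("zh", "zh-HK", "ZH-hk"):
    print(code, "->", parse_locale(code))
```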

4.2. Overview of Character Sets, Encodings, and Fonts

When working with XML data it is critical to understand the difference between characters, bytes, and glyphs, and in particular, how these concepts are applied when using Unicode for data representation.

Unicode largely solves the problem of character representation for national languages, which makes it much easier to solve problems in collation, which requires the ability to both define collation orders in terms of character codes and do comparisons of strings of characters. However, Unicode does not completely solve problems of character display.

4.2.1. Characters and Bytes

The character data in an XML document is exactly that: characters. In the abstract a character is a logical thing, e.g., the idea of the Chinese ideograph for “road or way” (道, Unicode character \u9053). A given logical character may be represented in any number of ways in a data file or within a running software system.

In XML, which is a data representation standard, a character is a sequence of bytes that corresponds to a character in a character set. A character set is nothing more than a table of byte sequences and the logical or semantic characters they correspond to. For example, in the ASCII character set, the byte with the code 0x41 corresponds to the logical character “A”. Over the years, a number of different character sets have been developed to support the needs of different writing systems and national languages.

The ASCII character set, the one made ubiquitous by its use in Unix systems and personal computers, uses only 7 bits for its character codes, meaning that it can represent at most 128 distinct characters. This was sufficient for representing English and other languages that use essentially the 26 letters of the English alphabet. To handle most other European languages, and to provide a number of graphic characters, ASCII was extended to 8 bits (one byte) per character code in character sets such as Latin-1 (ISO 8859-1), allowing up to 256 distinct characters.

This had the result that most software expected characters to only be one byte long. If you wanted to handle languages that required more than 256 characters, such as ideographic languages, you had to either define new character sets with multi-byte characters or have many single-byte character sets and some way to indicate which character set a particular character was to be drawn from. Needless to say this complicated things quite a bit.
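The ambiguity of single-byte character sets is easy to demonstrate. In the Python sketch below, the same byte value is decoded under three different single-byte character sets (Latin-1, Windows code page 1251 for Cyrillic, and ISO 8859-7 for Greek) and yields three different characters; without external information about which character set applies, the byte alone is meaningless.

```python
raw = bytes([0xE9])  # one byte, with no character set information attached

decoded = {charset: raw.decode(charset)
           for charset in ("latin-1", "cp1251", "iso8859_7")}

for charset, character in decoded.items():
    print(f"{charset}: U+{ord(character):04X} {character}")
```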

To try to solve these problems the Unicode character set was developed[Unicode]. The Unicode character set is a single character set whose ultimate goal is to provide all the characters needed by all the world's national languages, past, present and future. Unicode encodings use up to 4 bytes for each character, more than enough for all the known languages and any languages yet to be defined or discovered. Unicode subsumes ASCII and Latin-1, so that all the characters in those character sets have the same character code in Unicode. The core of Unicode is a two-byte character set, the Basic Multilingual Plane, which provides the characters for nearly all modern languages.

Because Unicode provides all characters in a single character set there is no need for Unicode-based processing to do any character set switching or escaping. This simplifies that aspect of document processing.

However, because Unicode is a “multi-byte” character set it does present the problem of how to encode the characters when written out as a sequence of bytes. For ASCII it's relatively easy: data is simply a sequence of 8-bit bytes and each byte corresponds to exactly one character. For two-byte character sets, such as some of the Japanese and Chinese character sets, the data is stored as a sequence of pairs of 8-bit bytes where each pair equals one character.

But Unicode characters can be represented by one to four bytes, depending on a character's position in the character set. It would be highly inefficient to store all characters as four-byte sequences, especially when the vast majority of data would use at most two bytes and most Western-language data would only use one byte. Thus the Unicode specification provides for a number of different encoding schemes, allowing data to be stored in the way that is most efficient for a given data set.

This means that it is not sufficient to simply talk about “Unicode data.” Rather, you must distinguish between Unicode data as stored as byte sequences and Unicode data as held in memory by processing programs.

The two most common Unicode encodings are UTF-8 and UTF-16. UTF-8 uses variable-length sequences of 8-bit bytes to encode characters. In UTF-8 all ASCII characters are encoded exactly as in an ASCII document, but characters above 127 require two to four bytes (a lead byte that marks the length of the sequence followed by one to three continuation bytes). In UTF-16, each character is represented by a two-byte sequence or a pair of two-byte sequences. Thus, the same sequence of Unicode characters may be stored on disk in at least two different, equivalent ways. The XML specification requires that all conforming XML processors support both UTF-8 and UTF-16. In this case, “support” means at least being able to read data in either encoding, if not being able to write data in either encoding.
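The difference between the two encodings can be seen by encoding a few sample characters both ways. The Python sketch below uses four characters chosen by us for illustration: an ASCII letter, an accented Latin letter, the ideograph 道 mentioned earlier, and a musical symbol from outside the two-byte range. The big-endian form of UTF-16 is used so that no byte order mark is added to the output.

```python
samples = ["A", "\u00e9", "\u9053", "\U0001d11e"]


def encoded_lengths(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in samples:
    utf8 = ch.encode("utf-8").hex(" ")
    utf16 = ch.encode("utf-16-be").hex(" ")
    print(f"U+{ord(ch):04X}  UTF-8: {utf8:<12} UTF-16: {utf16}")
```

For mostly Western text UTF-8 is the more compact choice, while for text that is mostly ideographic UTF-16 typically needs two bytes per character where UTF-8 needs three.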

Note that both the Windows operating system and the Java programming language are natively Unicode based. For example, in Java, all strings are represented internally as Unicode strings.

There are two canonical ways to refer to Unicode characters: by character name and by character code, using the syntax “\u0000” or “U+0000”, where the digits give the character code as a hexadecimal number. Each character in the Unicode character set has a unique descriptive name, e.g. "GURMUKHI LETTER CHA" or "LATIN CAPITAL LETTER A." In XML, Unicode characters can be referenced directly using numeric character references of the form “&#x00A0;” (hexadecimal format) or “&#160;” (decimal format).
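Both ways of referring to a character can be checked programmatically. The sketch below, which assumes only the Python standard library, looks a character up by name and confirms that the hexadecimal and decimal forms of a numeric character reference parse to the same character. The element name p is a placeholder.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Look a character up by its Unicode name, then report its code.
cha = unicodedata.lookup("GURMUKHI LETTER CHA")
print(f"U+{ord(cha):04X}", unicodedata.name(cha))

# Hexadecimal and decimal character references denote the same character.
from_hex = ET.fromstring("<p>&#x00A0;</p>").text
from_dec = ET.fromstring("<p>&#160;</p>").text
print(from_hex == from_dec, unicodedata.name(from_hex))
```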

4.2.2. Characters, Glyphs, and Fonts

A character set defines characters, which are logical things (the idea of the letter “A”). To display a given character it must be mapped to a glyph or set of glyphs in a particular font.

A glyph is a graphical representation of the character. A set of related glyphs is a font. There are essentially two types of fonts: bitmapped fonts, where each glyph is defined as a bitmapped image, and outline fonts, where each glyph is defined using vector drawing instructions. For example, SVG fonts use SVG drawing primitives to define the shapes of the glyphs[SVG]. Usually a font is organized so that each glyph has a code point corresponding to the character code for the character it is intended to represent.

Fortunately the character-to-font-and-glyph task is usually handled transparently by either the underlying operating system or by rendition software, such as Web browsers and print composition tools. Most rendering tools provide the appropriate character-to-glyph-in-font mappings out of the box, although many can be configured if necessary, for example, to accommodate a national language the tool supplier didn't provide for. Most new fonts are Unicode fonts in that they have glyphs for characters in the Unicode character set above the ASCII range. However, few, if any, Unicode fonts have glyphs for all characters. This means that you may need several different Unicode fonts to provide full glyph coverage for a document. For example, the CJK fonts provided with Microsoft Windows do not include a number of glyphs most documents use, such as symbols for bullets. Thus you must use both the Asian language fonts and a Western font, such as Arial Unicode, to render all the characters in a Chinese-language document.

Figure 3 shows the relationship between data as encoded, characters as represented internally by programs, and glyphs generated from the characters.

Figure 3. 

In this diagram, the same logical character, CJK-Unified Ideograph-9053 (“road or way”, dao4 in pin-yin) is stored on disk as byte sequences in both UTF-8 and UTF-16 encodings. For UTF-8 three bytes are required. For UTF-16 two bytes are required. When either of these byte sequences is read by a data reader the result is the same abstract character in the processing system's memory, that is, an internal program object that represents the character “CJK-Unified Ideograph-9053” (the name of the character as defined in the Unicode database).
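The round trip described for this character can be reproduced in a few lines. This is an illustrative sketch only: it encodes U+9053 both ways, prints the bytes, and confirms that decoding either sequence yields the same abstract character.

```python
import unicodedata

dao = "\u9053"  # the ideograph discussed above

utf8_bytes = dao.encode("utf-8")
utf16_bytes = dao.encode("utf-16-be")  # big-endian, no byte order mark
print(utf8_bytes.hex(" "), "|", utf16_bytes.hex(" "))

# Decoding either byte sequence yields the same abstract character.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == dao
print(unicodedata.name(dao))
```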

When this character is rendered by a renderer a glyph for the abstract character is pulled from whatever font has been associated with the data (i.e., through the font-family property on the containing formatting objects). In this case the two fonts SimHei and PMingLiU have noticeably different glyphs for this character. For this particular character many fonts would not have a glyph at all, resulting in either a square box in the rendered result (a common fallback strategy) or nothing at all.

When rendering documents that use characters outside the ASCII range, if you are not seeing the characters you expected and you know the data is being read correctly into your processor (which it almost always will be if your data is XML using UTF-8 or UTF-16) then the problem is almost always a font specification problem. Unfortunately, because font configuration is local to individual machines, FO implementations, and final-form renderers (e.g., Adobe Acrobat), there are several places where the font configuration could go wrong. For example, you may have the necessary font on your system but not defined for your FO implementation or not embedded into generated PDFs or you may have simply failed to specify it for the appropriate formatting object in your FO instance.

Also, different locales may use different fonts for the same characters. For example, while Japanese uses Chinese ideographs as part of its writing system, it uses different glyph forms for many of them. These different forms are very important to Japanese readers but may not be obvious to casual Western observers unschooled in the details of Japanese and Chinese typography. Depending on how your FO implementation manages its font configuration it may handle these locale distinctions for you. For example, the XSL Formatter product from Antenna House comes out of the box with the appropriate configuration for selecting the appropriate ideographic font for different locales. However, this requires that you specify the locale correctly in your FO instances.

4.2.3. Collation

Collation is the act of sorting character strings into some specified order. For languages like English, collation is entirely a function of the spelling of words and the order of letters in the alphabet: A, B, C, D, etc., such that “baboon” collates after “ape” and before “chimpanzee.” In most other languages, collation is defined by more complex rules, such as stroke count in Traditional Chinese. Ideographic languages have no natural ordering in the way that alphabetic languages do. The characters in any character set are inherently ordered, i.e., in increasing numeric order of the character codes themselves. However, the order of characters in a character set does not necessarily correspond to the collation order for those characters, and in most languages, will not correspond to any particular desired collation order.

5. Layout and Formatting Issues

The layout and formatting of national languages requires control of three main aspects of the rendered result: writing mode, glyphs and fonts, and line composition.

Writing mode determines the way that lines and blocks are placed. The rendering of non-Latin writing systems usually requires careful control and configuration of fonts and glyphs, including issues of glyph shaping and composition. Line composition requires control of how the renderer breaks lines to lay out flowed text within areas, and includes control of hyphenation and word boundaries.

5.1. Controlling Writing Mode

In XSL-FO, writing mode determines both the direction that glyphs are placed in lines and the direction that lines and blocks are placed. The three most common writing modes for modern languages are left-to-right, top-to-bottom (lr-tb), right-to-left, top-to-bottom (rl-tb), and top-to-bottom, right-to-left (tb-rl).[5] The writing mode keyword gives first the inline progression direction and then the block progression direction.

For lr-tb, glyphs are placed into lines from left to right and lines and blocks are placed from top to bottom, as in most Western languages. Many Asian languages are also laid out using lr-tb writing mode in technical and business documents, even though they are traditionally written top to bottom. For rl-tb, glyphs are placed into lines from right to left and lines and blocks are placed from top to bottom, as in Hebrew and languages that use Arabic script. For tb-rl, glyphs are placed into lines from top to bottom and lines and blocks are placed from right to left, as in traditionally set Chinese and Japanese. Figure 4 shows these different writing modes.

Figure 4. 
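Because the keyword is simply the inline progression direction followed by the block progression direction, it can be decoded mechanically. The helper below is a hypothetical illustration, not part of any FO implementation, and covers only the three directions used by the writing modes discussed here.

```python
# Direction abbreviations used in the three writing modes discussed above.
DIRECTIONS = {
    "lr": "left-to-right",
    "rl": "right-to-left",
    "tb": "top-to-bottom",
}

def describe_writing_mode(keyword):
    """Split a keyword such as 'rl-tb' into its two progression directions."""
    inline, block = keyword.split("-")
    return {
        "inline-progression": DIRECTIONS[inline],
        "block-progression": DIRECTIONS[block],
    }

for mode in ("lr-tb", "rl-tb", "tb-rl"):
    print(mode, describe_writing_mode(mode))
```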

Writing mode can be set on any formatting object that establishes a reference area: fo:simple-page-master, fo:region-*, fo:block-container, fo:inline-container, and fo:table. For a document that is primarily in a language that uses a writing mode other than lr-tb, you would normally specify the writing mode on simple-page-master, as this also defines the relative placement of the page areas, as well as the writing mode for all content that uses that page master.

If a document may contain content in different writing modes on the same page, you must set writing mode locally using fo:block-container or fo:table. For example, a document that contained side-by-side Hebrew and Spanish would require this type of approach:

<fo:page-sequence
    master-reference="two-column-page">
  <fo:flow flow-name="xsl-region-body">
    <fo:block-container
        xml:lang="es"
        writing-mode="lr-tb">
      <fo:block>....</fo:block>
    </fo:block-container>
    <fo:block-container
        xml:lang="he"
        writing-mode="rl-tb"
        break-before="column">
      <fo:block>....</fo:block>
    </fo:block-container>
    <fo:block span="all">
      <!-- Force columns to balance -->
      <fo:leader leader-pattern="space"/>
    </fo:block>
  </fo:flow>
</fo:page-sequence>

Figure 5. 

If you have right-to-left content it is likely that you will have left-to-right content embedded in it, so-called bi-directional text. When the content is just a sequence of words and unpaired punctuation, most FO processors that support right-to-left writing modes at all should apply the Unicode bidirectional algorithm so that the left-to-right content is rendered correctly within the right-to-left content, that is, with the left-to-right words presented in the correct left-to-right order: .CIBARA NI SDROW.this is English .CIBARA NI SDROW

If the left-to-right text includes paired punctuation, such as parentheses, you can use either the Unicode control characters in the range \u202A-\u202E or the fo:bidi-override element. For identifying embedded bidirectional text, use fo:bidi-override instead of the Unicode Left-to-Right Embedding and Right-to-Left Embedding control characters (\u202A and \u202B), as the FO markup is easier to work with than simple pairs of marker characters. To control the placement of individual characters use the Unicode Left-to-Right Override and Right-to-Left Override characters (\u202D and \u202E). These control characters are normally entered as part of the input XML data if they occur in authored content; if they are part of generated content, they must be generated by whatever generates the FO instance.

For example, to render a sequence of Arabic characters followed by parenthesized numbers you must do something to make the opening parenthesis render at the end of the Arabic string and not at the beginning. By default, the opening parenthesis is rendered at the beginning of the Arabic sequence instead of at the end, as shown in the first example in Figure 6. The second example shows the effect of using fo:bidi-override around the parenthesized text in order to define its placement relative to the Arabic text. The markup for the second example is shown in Figure 7.

Figure 6. 

<fo:block
    space-before="1em"
  >Arabic text followed by 
parenthesized digits w/in bidi-override unicode-bidi="override" direction="ltr":
 <fo:inline color="green" >&#x0627;&#x0644;&#x0625;&#x062E;&#x062A;&#x0628;&#x0627;&#x0631;
 &#x0627;&#x0644;&#x0639;&#x0631;&#x0628;&#x064A;<fo:bidi-override
direction="ltr" unicode-bidi="bidi-override">(12)</fo:bidi-override></fo:inline></fo:block>

Figure 7. 
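When diagnosing placement problems like the one above, it can help to inspect the bidirectional class that the Unicode character database assigns to each character. The sketch below, assuming only Python's standard unicodedata module, prints the class for a few representative characters and for the control characters mentioned earlier. Note that the parenthesis is reported as ON (other neutral), so its placement depends on the surrounding text.

```python
import unicodedata

# Representative characters; the labels are informal descriptions.
samples = {
    "Latin letter": "A",
    "Hebrew letter": "\u05d0",
    "Arabic letter": "\u0627",
    "European digit": "1",
    "opening parenthesis": "(",
    "Left-to-Right Embedding": "\u202a",
    "Right-to-Left Embedding": "\u202b",
    "Left-to-Right Override": "\u202d",
    "Right-to-Left Override": "\u202e",
}

bidi_classes = {label: unicodedata.bidirectional(ch)
                for label, ch in samples.items()}

for label, bidi_class in bidi_classes.items():
    print(f"{label}: U+{ord(samples[label]):04X} -> {bidi_class}")
```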

Another thing to keep in mind when working with right-to-left content is that the characters occur in the XML document in their logical order, that is, with the first character of a word first within the XML content. However, depending on whether and how your editor supports right-to-left text, the data may appear to be in a different order with respect to the markup, because the editor may present right-to-left text from right to left. In all cases, the characters should actually be stored in the XML document in logical order.

5.2. Characters, Glyphs, and Fonts

Once you have writing mode under control, the next challenge you will probably face is managing the rendering of glyphs for non-ASCII characters. Here the main challenge is understanding the relationship between characters and fonts so that you can ensure that the right fonts are applied to the right formatting objects.

For a given script or writing system you must first determine which font or fonts are required to render the characters in that script. Which fonts you use will depend on a number of factors, including:

  • The operating system you are using and the localization or internationalization configuration for the machine on which the rendition will be created. For example, on Windows, the various regional support packages include fonts for the different scripts. If you install all of the Windows regional support you will have font coverage for most, if not all, of the scripts in modern usage. On Unix or Linux you may need to acquire fonts. For many writing systems there are free or low-cost fonts available, although they may or may not be of acceptable quality for print production.

  • Document rendition design, which may specify specific fonts for a given script.

  • The font technology your FO implementation supports, which may determine the cost or availability of specialized fonts.

  • The font configuration needed for rendition itself. If you are rendering to PDF, you need to understand how PDF manages fonts, which fonts are available, and, for example, whether the PDFs you're generating need to have embedded fonts.

Once you know what the fonts are, you must configure your FO implementation to make these fonts available. For example, RenderX's XEP product includes a font configuration file (etc/fonts.xml) by which you bind font names to font files. Antenna House's XSL Formatter product under Windows can access Windows fonts directly by name with no additional configuration. It also provides a user interface for defining default per-language mappings from the built-in FO font names to specific fonts.

The XSL-FO specification defines a set of generic font names that all FO processors must support: "serif", "sans-serif", "cursive", "fantasy", and "monospace". FO implementations must map these generic font names to appropriate fonts, e.g., mapping “serif” to “Times New Roman”. See the CSS2 specification for a detailed discussion of the generic font families[CSS].

When specifying fonts in FO instances, you use the font-family= attribute. The font-family= attribute takes a list of comma-delimited font names, either the generic font names or font names for local fonts. Generic font names are keywords and therefore are never quoted. Other font names should be quoted if they contain white space. Fonts are tried in the order they are specified in the attribute. The font-family property is inherited so you can specify it anywhere within the FO document hierarchy.

To define a default set of fonts for an entire document, specify font-family= on the fo:root element. You can then specify it on specific formatting objects to override the default font setting:

<fo:root
    font-family="Garamond, 'Open Symbol', serif">
    ...
    <fo:block
        font-family="'Courier New', monospace">
       ...
    </fo:block>
    ...
</fo:root>

It is good practice to always specify a generic font family as the last font in the list—this makes the design intent clearer and provides a fallback in the case when a font is not available. However this can sometimes make it difficult to debug font configuration problems. If you're not sure if your FO implementation is failing to find a font and therefore falling back to a generic font, one trick is to reconfigure your FO implementation to map the generic font to something like Wingdings so that it's painfully obvious that the fallback is being used.

Keep in mind that not all fonts have all glyphs. For example, the common Windows fonts for Chinese and Japanese ideographs do not contain glyphs for basic symbols like bullets (\u2022), so you must include a font that includes these symbols, such as the OpenOffice Open Symbol font, part of the Open Office package[Open Office]. Some FO implementations may include a built-in fallback behavior for common characters such as the bullet symbol, but you can't depend on all FO implementations doing that. For example, to ensure the correct rendering of Chinese bulleted lists, you would need a font specification such as font-family="SimHei, 'Open Symbol', serif", where “SimHei” is the standard font for simplified Chinese and Open Symbol provides symbol characters. Figure 8 shows the formatted result using these fonts.

Figure 8. 
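The effect of trying fonts in the order listed can be modeled in a few lines. The sketch below is a simplified illustration, not how any particular FO implementation resolves fonts: the coverage tables are invented for the example, and real font files would have to be inspected to obtain actual glyph coverage.

```python
# Hypothetical coverage tables: the code points each font has glyphs for.
FONT_COVERAGE = {
    "SimHei": {0x9053, 0x8DEF},
    "Open Symbol": {0x2022},
    "serif": set(range(0x20, 0x7F)),
}

def parse_font_family(value):
    """Split a font-family value into names, dropping quotes and padding."""
    return [name.strip().strip("'\"") for name in value.split(",")]

def pick_font(ch, font_family):
    """Return the first listed font with a glyph for ch, or None."""
    for name in parse_font_family(font_family):
        if ord(ch) in FONT_COVERAGE.get(name, set()):
            return name
    return None

family = "SimHei, 'Open Symbol', serif"
for ch in ("\u9053", "\u2022", "A", "\u0e01"):
    print(f"U+{ord(ch):04X} -> {pick_font(ch, family)}")
```

A result of None corresponds to the missing-glyph case, where a renderer typically shows a square box or nothing at all.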

Another issue with characters and glyphs is glyph shaping and glyph composition. Some scripts, such as Arabic, use different forms of the same abstract character depending on where it occurs in a word. Other scripts, such as Thai, have complex rules for composing glyphs from multiple components. Normally the content for these scripts is entered using the base characters, not their final presented forms. Glyph shaping is the process of applying the rules for the language and script to transform the base characters into their appropriate rendition forms. It is therefore not always sufficient for an FO engine simply to be able to render a particular Unicode character: it may also need to do glyph shaping or glyph composition in order to produce a usable rendition of the language.

For most scripts Unicode provides both the generic and specialized forms of characters, so it is possible to preprocess the input to do the glyph transformation before giving it to the FO engine, but this would add significant complexity to a typical processing system. Glyph composition is more difficult to do outside of the composition engine because it involves the relative placement of several glyphs to form a final result glyph.

5.3. Line Composition

Line composition is one of the areas in which FO implementations have lots of room for differentiation. There is a lot of art to the composition of flowing lines of text and different FO implementations will implement different algorithms. To flow lines you must know where it is appropriate to break lines for a given writing system and language. Some languages, such as Thai, do not have well-defined word boundaries. If you are using hyphenation, the composition system must know how to do hyphenation for a particular language.

For Thai, if your FO implementation does not itself implement a Thai word breaking algorithm, you can apply a preprocessor to the XML data to insert zero-width spaces (\u200B) into the Thai data to indicate appropriate break points. The IBM ICU package[ICU], which provides a number of internationalization utilities, includes a Thai word breaker that is adequate for most technical documentation.
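The marking step of such a preprocessor is simple once the words have been found. The sketch below assumes the segmentation has already been done by a word breaker (the word list is supplied by hand here) and joins the words with U+200B ZERO WIDTH SPACE, the character that marks a permitted break without adding visible space. The function name is hypothetical.

```python
ZERO_WIDTH_SPACE = "\u200b"

def mark_break_points(words):
    """Join already-segmented words with U+200B to mark permitted breaks."""
    return ZERO_WIDTH_SPACE.join(words)

# Assumed output of a word breaker for the phrase meaning "Thai language".
segmented = ["ภาษา", "ไทย"]
marked = mark_break_points(segmented)

print(marked.count(ZERO_WIDTH_SPACE), "break point(s) inserted")
```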

The XSL-FO specification provides a number of hyphenation-related properties. However, it is up to each FO implementation to actually implement the hyphenation algorithm for a given language. FO implementations must provide some way to configure hyphenation algorithms on a per-language and locale basis in order to be useful for general purpose composition of localized documents. In order for an FO processor to do hyphenation it must know at least the language of the content, normally specified using the xml:lang= attribute.

By default hyphenation is set to “false” for all formatting objects. To turn hyphenation on you set the value of the hyphenate= attribute to “true” on the appropriate formatting object. The hyphenate property is inherited so you can set it on the fo:root element to turn hyphenation on globally. You can also use the hyphenation-character= attribute to set the hyphenation character, the character to be used when a word is hyphenated, if you need something other than the default (\u2010, “‐”). If your FO processor requires more information than the language as specified by xml:lang=, you can use the country=, language=, and script= attributes to specify precisely what hyphenation rules to apply in a given context.

Two additional attributes, hyphenation-remain-character-count and hyphenation-push-character-count, allow you to define the minimum number of characters that must occur before and after the hyphen, respectively, when a word is hyphenated. The default for both attributes is “2”, meaning that by default there should always be at least two characters before the hyphen and at least two after it.
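The interaction of the two counts can be illustrated with a small filter over candidate break positions. This is a sketch of the constraint only, not of a hyphenation algorithm: the candidate offsets are supplied by hand and would in practice come from the FO implementation's language-specific hyphenation rules.

```python
def allowed_hyphenation_points(word, candidates, remain=2, push=2):
    """Keep candidate offsets that satisfy both minimum character counts.

    An offset i places the hyphen between word[:i] and word[i:].
    remain is the minimum number of characters left before the hyphen;
    push is the minimum number pushed to the next line.
    """
    return [i for i in candidates if i >= remain and len(word) - i >= push]

word = "hyphenation"
candidates = [2, 6, 7]  # hand-picked offsets: hy-phen-a-tion

print(allowed_hyphenation_points(word, candidates))
print(allowed_hyphenation_points(word, candidates, remain=3, push=5))
```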

6. XSLT Processing Issues

When generating FO instances from more generic XML documents you will almost always be generating text, such as the word “Note:” for a <note> element or the word “Chapter” for a top-level division. If your documents include generated things that are sorted, such as back-of-the-book indexes and parts lists, then you must do sorting within the XSLT process. Both of these tasks require internationalization and localization in order to support the needs of documents in different national languages.

6.1. Generated Text Management

If you are producing documents in just one language, regardless of what it is, then there is no unique challenge to creating generated text from XSLT: you just put the literal text in your XSLT templates. However, if you need to render documents in a number of different national languages to the same FO target structures, then you must also have the generated text strings for each of those languages. In most cases, the only difference between the processing for different languages is the generated text, so it doesn't make sense to have different XSLT style sheets for each language.

The normal engineering approach to this type of problem would be to have a single style sheet with some mechanism for selecting the appropriate text strings based on the active language. There are a number of ways this can be done in XSLT, but the best solutions will generally be those in which the text strings are managed completely separate from the XSLT templates. This helps to keep the style sheet independent of the details of the generated text and makes it easier for translators to maintain the generated text strings without having to modify the XSLT style sheet directly.

One technique is to create a separate XSLT module that contains named templates with the generated text for different contexts, which are then called from the templates where the generated text is needed. These named templates can be nothing more than xsl:choose elements with one xsl:when element for each distinct language. While this is easy enough to implement in XSLT it may not be a completely satisfactory solution.

In many cases the same generated text strings are needed in a number of processing contexts, such as in an authoring environment or task-specific user interface as well as in XSLT transforms. In that case, it makes more sense to keep the generated text strings completely separate from any given processor and then provide libraries for use by different processors that can access those text strings. ISOGEN has implemented such an approach in our Internationalization Support Library[Kimber Bltm], a Java library that provides a number of facilities to help with managing generated text in a general way that can be used from any Java-capable processor.

The package defines a relatively simple XML document type for binding text strings to contexts or string lookup keys. These documents represent simple generated text databases. Because they are XML the generated text documents can be managed and translated using the same localization tools used for other XML content.

The Java library provides an API for requesting strings from generated text documents. Essentially you pass in a context string or lookup key and the target language and get back the appropriate string.

Note that this library is just doing simple lookup of predefined strings. It is not doing automatic or dynamic translation of text.
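The lookup idea can be sketched independently of the library itself. The example below is not the ISOGEN API or document type: the element and attribute names, the lookup keys, the function names, and the fallback-to-English behavior are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical generated-text document; the real document type may differ.
GENERATED_TEXT = """
<generated-text>
  <string key="note.label">
    <text lang="en">Note:</text>
    <text lang="de">Hinweis:</text>
  </string>
  <string key="chapter.label">
    <text lang="en">Chapter</text>
    <text lang="de">Kapitel</text>
  </string>
</generated-text>
"""

def load_strings(xml_text):
    """Build a {(key, language): string} table from the XML document."""
    table = {}
    for string in ET.fromstring(xml_text.strip()).iter("string"):
        for text in string.iter("text"):
            table[(string.get("key"), text.get("lang"))] = text.text
    return table

def get_generated_text(table, key, lang, fallback_lang="en"):
    """Return the string for key in lang, else in the fallback language."""
    return table.get((key, lang), table.get((key, fallback_lang)))

strings = load_strings(GENERATED_TEXT)
print(get_generated_text(strings, "chapter.label", "de"))
print(get_generated_text(strings, "note.label", "fr"))
```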

The library provides facilities for managing translations for strings as would be needed for typical text-before/text-after style rules, as well as facilities for managing the translations of attribute values to display strings.

The library has been integrated with the Saxon and xsltc XSLT engines, as well as with Epic Editor for in-editor generated text. The XSLT integrations provide a set of extension functions that style sheet authors can use to request generated text, e.g., “get-generated-text-before()”, which would be used in place of the usual xsl:text element:

<xsl:template match="chapter/title">
  <fo:block 
      font-size="24pt"
      font-weight="bold"
    ><xsl:value-of 
      select="isotrans:get-generated-text-before()"
    /><xsl:apply-templates
  /></fo:block>
</xsl:template>

6.2. Collation and Sorting

The XSLT 1.0 specification provides a simple mechanism for sorting nodes: xsl:sort. However, because of the limitless variation in which a given set of text strings might be sorted, there is no way that XSLT could standardize the sorting rules. Some programming languages, such as Java, provide default collation rules for a number of different locales, but even those defaults may not be appropriate for all applications.

The term collation usually refers to the task of sorting strings based on some formal collation rules. Java, for example, defines a base Collator object which can take a collation rule as an argument in order to configure it. Collator objects are then used to implement sorting algorithms. The formal definition of how characters and strings in a given national language sort is often referred to as a collation sequence.

One obvious candidate for the collation sequence for a language is the order that the characters in that language occur in the Unicode character set. A character set is, by definition, an ordered sequence of characters. However, the Unicode character sequence will almost never be appropriate for collation and for many languages is explicitly not the appropriate order. For example, in alphabetic languages you will usually want uppercase and lowercase letters to sort together. In ideographic languages, where there may be several different, equally reasonable ways to do collation, you may need different collation sequences for different countries, audiences, or editorial preferences.
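The uppercase and lowercase problem is easy to demonstrate. The sketch below sorts three words first by raw character code and then with a case-folding sort key. Case folding stands in here for a real collator, which would handle far more than case.

```python
words = ["baboon", "Chimpanzee", "ape"]

# Raw code point order: every uppercase letter precedes every lowercase one.
by_code_point = sorted(words)

# A simple sort key that makes upper- and lowercase letters sort together.
case_insensitive = sorted(words, key=str.casefold)

print(by_code_point)
print(case_insensitive)
```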

The end result is that in almost every case you will need to be able to completely control the details of the collating sequence used for a given locale and, in some cases, for a particular application of sorting. For example, there are different editorial practices for the sorting of index entries that contain spaces: some editors prefer to ignore spaces when comparing words, others include them. Thus, even for the same input document type and output presentation style, you might need different sorting rules for index entries.
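The two editorial practices for spaces can be expressed as two sort keys over the same entries. This is a minimal sketch: the entries and function names are invented for the example, and a production index would use a configured collator rather than simple string keys.

```python
entries = ["New Zealand", "Newark", "New York"]

def with_spaces(entry):
    """Sort key that keeps spaces, so a space sorts before any letter."""
    return entry.casefold()

def ignoring_spaces(entry):
    """Sort key that drops spaces before comparing."""
    return entry.casefold().replace(" ", "")

print(sorted(entries, key=with_spaces))
print(sorted(entries, key=ignoring_spaces))
```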

The only way to configure sorting in XSLT 1.0 is to integrate custom collators with your XSLT engine. Unfortunately, not all XSLT implementations provide facilities for integrating custom collators, so be sure that your XSLT implementation provides a way for you to add the sorting rules you need. The Saxon[Saxon] XSLT engine does include sorting configuration features and the ISOGEN I18N library includes examples of integrating custom collators with Saxon.

The sorting configuration features of XSLT will be significantly extended in XSLT 2.0, but it will not eliminate the need to actually implement collation business logic.

The ISOGEN I18N library defines an XML document type for specifying index configurations, including index groups and collation rules (using Java's collator configuration syntax).

7. Current FO Implementations

Note

All of the FO implementations are being actively and rapidly improved. This roundup of tools reflects their state or anticipated state as of mid-March 2003. In many cases limitations documented here will have been addressed by the time you read this.

All of the FO implementations clearly document their support for the various FO features. You can use this documentation to determine if a particular product will support the specific requirements of your documents and processing system. This section does not restate the feature support details provided by each implementation. Rather, this section talks in general terms about the key features and characteristics that distinguish the various FO implementations.

Of the four full-featured FO implementations currently available, XSL Formatter implements the most FO features, although XEP version 3 almost matches it. FOP, as an open-source, volunteer product, is the least feature complete and is not generally suitable for production use at this time. Epic is slightly limited by the fact that some FO constructs, mostly to do with page geometry and page master sequences, have no direct mapping into Epic's underlying formatting engine, which was originally engineered to support the FOSI style language. However, Arbortext has for the most part provided extensions that work around these limitations. Epic has been used for years to do high-quality production of technical documentation, so its support for features that are actually needed by most technical documents is quite good (in other words, the features of FO that it doesn't support are features that you probably don't need anyway if you are producing typical technical manuals).

The Sun xmlroff FO implementation, while still fairly thin in terms of features supported, has been explicitly designed to support internationalization requirements from the start.

7.1. Overview of FO Implementations

Epic from Arbortext is a Windows- and Unix-based FO implementation built around the FOSI-based Epic Publisher composition engine. Epic can be used as a standalone composition engine, run as a command-line tool, or used interactively through its integration with the Epic Editor SGML and XML editor. The version at the time of writing is 4.3. Arbortext has announced the development of version 5, slated for mid-2003 release, and promises a more complete FO implementation at that time.

The FOP FO implementation is implemented in pure Java as Apache-licensed open source. It lacks a number of important FO features but is actively being developed. It can be used either as a command-line tool or integrated with other Java tools using its Java API. FOP is also integrated with the eXcelon Stylus Studio XSLT development environment.

The XEP product from RenderX is implemented in pure Java and can be used with any JVM. It can be used as a command-line tool or integrated with other Java tools using its Java API. RenderX also licenses a version of XEP integrated with the XML Spy editor.

The XSL Formatter product is currently Windows-only, although a Unix/Linux version has been announced. XSL Formatter exposes both COM and Java APIs. It can be used as a command line tool, as an interactive tool using its graphical user interface, or integrated with other tools using its COM or Java APIs.

7.2. Support for Non-Western Languages

Of all the FO implementations, XSL Formatter[AHXF] has the best support for non-Western languages, including built-in locale-specific font configurations, complete Thai glyph construction, and full support for bi-directional text. RenderX's XEP is close, supporting bi-directional text and the Unicode bidirectional algorithm, but requires more effort to configure locale-specific fonts. As of XEP version 3.2 Thai glyph composition is not supported. XEP also does not support top-to-bottom writing modes.

7.3. Support for Graphics and Mathematics

All the FO implementations support the common bitmap formats GIF, JPEG, and TIFF (although XSL Formatter requires a separate, nominally-priced license for GIF rendering). XEP and Epic support scaling of bit-mapped graphics, although the quality of the scaled result may be poor in some instances. RenderX recommends scaling graphics to the appropriate size before including them in the FO instance to avoid any problems with dynamic scaling, either by the FO renderer or by the presentation device (i.e., PDF).

All the FO implementations support EPS graphics. XSL Formatter and XEP only support interpreted EPS graphics when using their PostScript output options (as opposed to their direct-to-PDF options). For direct PDF generation, they both use the EPS preview image, if present. Note that both XSL Formatter and XEP include PDFMark in the PostScript they generate, meaning that you can create “online” PDFs from both tools using a PostScript-to-Distiller process instead of the direct-to-PDF process. Epic does not have a separate direct-to-PDF option, instead requiring the use of Distiller to create PDFs. Thus interpreted EPS graphics are always supported by Epic.

Epic supports CGM graphics, as does XSL Formatter when using the free ISOView CGM viewer plug-in on Windows.

FOP and XSL Formatter support the use of embedded SVG graphics. XEP had partial SVG support in version 2 but removed it in version 3, although RenderX has announced the intent to restore SVG support in the near future.

XSL Formatter supports embedded MathML through the use of a Windows MathML rendering plug-in. Epic supports the use of TeX for mathematics (the underlying Epic composition engine is TeX based).

XSL Formatter supports the Windows WMF (Windows Metafile) format.

7.4. Support for Non-RGB Color Models

XEP supports both RGB and CMYK color. All of the other FO implementations implement RGB color exclusively (although XSL Formatter may soon have CMYK support). With those implementations, generating PostScript or PDF that uses CMYK or another color model requires post-processing. There are a number of RGB-to-CMYK post-processors available for both PostScript and PDF, including a number of PDF plug-ins.
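As a rough illustration of what an RGB-to-CMYK conversion step computes, the sketch below applies the common naive arithmetic conversion. It is not the method used by any of the post-processors or products mentioned here: production tools normally rely on color profiles and press-specific settings, which this sketch ignores entirely. The class name NaiveCmyk and the method name rgbToCmyk are hypothetical.

```java
final class NaiveCmyk {

    /**
     * Converts 8-bit RGB components (0..255) to naive CMYK fractions (0.0..1.0).
     * Returns {cyan, magenta, yellow, black}. No color management is applied.
     */
    static double[] rgbToCmyk(int red, int green, int blue) {
        double r = red / 255.0;
        double g = green / 255.0;
        double b = blue / 255.0;

        // Black is whatever the brightest channel leaves uncovered.
        double k = 1.0 - Math.max(r, Math.max(g, b));
        if (k >= 1.0) {
            // Pure black: avoid dividing by zero and use black ink only.
            return new double[] {0.0, 0.0, 0.0, 1.0};
        }

        // Remaining ink amounts, rescaled to the range not already covered by black.
        double scale = 1.0 - k;
        double c = (scale - r) / scale;
        double m = (scale - g) / scale;
        double y = (scale - b) / scale;
        return new double[] {c, m, y, k};
    }

    public static void main(String[] args) {
        double[] cmyk = rgbToCmyk(255, 0, 0); // pure red
        System.out.printf("C=%.2f M=%.2f Y=%.2f K=%.2f%n", cmyk[0], cmyk[1], cmyk[2], cmyk[3]);
    }
}
```

Because the result of such a conversion depends on the target press and paper, a profile-based post-processor is generally the safer choice for production work.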

8. Future Directions

Both the XSL and XSLT specifications are being updated. XSLT 2.0 will add sorting features that should make it easier to configure collation within XSLT.
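To illustrate why configurable collation matters for tasks such as index generation, the following sketch (in Java rather than XSLT) compares sorting by raw character value with sorting through the JDK's java.text.Collator for a given locale. It does not show the XSLT 2.0 sorting features themselves or any XSLT processor's collator integration. The class name LocaleSortSketch, its method names, and the sample terms are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

final class LocaleSortSketch {

    /** Sorts by raw UTF-16 character value, which is what a naive sort does. */
    static List<String> sortByCharValue(List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted);
        return sorted;
    }

    /** Sorts using the collation rules the JDK provides for the given locale. */
    static List<String> sortForLocale(List<String> terms, Locale locale) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(Collator.getInstance(locale));
        return sorted;
    }

    public static void main(String[] args) {
        // \u00C4 is LATIN CAPITAL LETTER A WITH DIAERESIS.
        List<String> terms = Arrays.asList("Zebra", "apple", "\u00C4pfel");

        // Raw order puts uppercase before lowercase and the accented letter last.
        System.out.println("raw:     " + sortByCharValue(terms));

        // German collation treats the accented letter as a variant of "a".
        System.out.println("German:  " + sortForLocale(terms, Locale.GERMAN));

        // Other locales may order the same terms differently.
        System.out.println("Swedish: " + sortForLocale(terms, new Locale("sv")));
    }
}
```

The same list of index terms can therefore need a different order for each target language, which is why the collation used for sorting has to be selectable per locale rather than fixed.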

Antenna House is currently working on defining the additional layout requirements represented by various Asian languages, including Japanese. These requirements will certainly be input into any XSL revision activity.

We can also expect to see better support for non-Western languages in the main open source FO implementations.

9. Conclusions

ISOGEN has now accumulated over a year of practical experience using XSL-FO to produce both localized single-language and internationalized multi-language customer documentation for consumer electronics. It is clear from this experience that XSL-FO and XSLT provide, for the first time, an affordable, practical, and maintainable system for producing documents in almost any modern national language. The primary limiting factor to the use of XSL-FO for a given language is the ability of a given FO implementation to properly render the language's writing system. Supporting a given script is purely an engineering problem, meaning that there are no architectural barriers to supporting a particular language, only engineering resource constraints. The only other limiting factor is the set of layout requirements of the documents you are producing. However, for most business and technical documents, XSL-FO, coupled with commonly-available extensions, is more than capable of satisfying the layout requirements.

XSL-FO is a relatively new technology, but it reflects more than three decades of practical experience with computer-based page composition. XSL-FO can be expected to improve over the next few years as the W3C extends the layout features supported by the language and as FO implementors add features and refine their implementations to provide better typographic results with ever greater performance.

Bibliography

[AHXF] Antenna House XSL Formatter product. A Windows-based FO implementation. Version current at time of writing is 2.4. See http://www.antennahouse.com for more information. Free evaluation version available.

[CSS] Cascading Style Sheets, level 2, Recommendation of the W3C. http://www.w3.org/TR/REC-CSS2/.

[DSSSL] ISO/IEC 10179:1996, Document Style Semantics and Specification Language (DSSSL). See http://www.jclark.com/dsssl for more information.

[EPIC] Epic page composition system (an optional feature of the Epic SGML/XML editor). See http://www.arbortext.com for more information. Available on Windows and Unix platforms (but not Linux).

[EXSLT] A set of community-defined extensions to XSLT 1.0. See http://www.exslt.org.

[FOP] Apache Project's FO implementation. Open source, volunteer-developed FO implementation. See http://www.apache.org for more information. Implemented in Java.

[FOSI] Formatting Output Specification Instance, defined in U.S. Department of Defense standard MIL-PRF-28001. See http://navycals.dt.navy.mil/28001/28001c.pdf.

[ICU] International Components for Unicode, IBM Corp. An open source collection of libraries providing a number of internationalization facilities for both the C and Java languages. See http://oss.software.ibm.com/icu/index.html.

[ISO 639] International Organization for Standardization (ISO). ISO 639:1988 (E/F). Code for the Representation of Names of Languages. First edition, 1988-04-01. Reference number: ISO 639:1988 (E/F). Geneva: International Organization for Standardization, 1988. iii + 17 pages.

[Kimber Extr] Kimber, W. Eliot. Internationalized Back-of-the-Book Indexes for XSL Formatting Objects. Presented at Extreme Markup, 2002, Montreal, Canada. Available online at http://www.isogen.com/papers/botb-index-i18n.pdf.

[Kimber Bltm] ISOGEN's Internationalization support library. An open-source Java library that supports the internationalization of text strings (generated text) and back-of-the-book index configuration and generation (including custom collators for Saxon). http://www.isogen.com/downloads/cool_tools/i18n_support.jsp.

[Open Office] Open-source office software suite. Managed by OpenOffice.org, http://www.openoffice.org/.

[Prod FO] Kimber, W. Eliot, Using XSL Formatting Objects for Production-Quality Document Printing. Presented at XML 2002, Baltimore, USA. Available online at http://www.isogen.com/papers/production-quality-xsl-fo.pdf.

[PSVTEX] PassiveTex. A TeX-based FO implementation developed by Sebastian Rahtz. See http://www.tei-c.org.uk/Software/passivetex/.

[Saxon] Saxon XSLT processor. Open-source XSLT implementation by Mike Kay. http://www.saxon.org. Provides most complete support for integrating custom collators, making it the only XSLT engine that can currently fully support the requirements of documents that must do locale-specific collation.

[SVG] Scalable Vector Graphics (SVG) 1.0 Specification, W3C Recommendation 04 September 2001. Available online at http://www.w3.org/TR/SVG/.

[THAI] Thai word breaker SAX filter, written in Java. Available from the ISOGEN Web site, http://www.isogen.com/downloads. Open source under LGPL license. Based on IBM ICU4J Thai word breaker.

[Unicode] The Unicode® Standard. http://www.unicode.org.

[Unipad] Unipad Unicode-based text editor product. http://www.unipad.org.

[XEP] RenderX XEP product. A Java-based FO implementation. Version at time of writing is 3.2.2. See http://www.renderx.com for more information. Free evaluation version available.

[XFC] IBM XSL Formatting Objects Composer. Developed by IBM alphaWorks. See http://www.alphaworks.ibm.com/tech/xfc.

[XMLROFF] xmlroff FO implementation. A C-based open-source FO implementation from Sun Microsystems. Developed and managed by Tony Graham. Focus of xmlroff is explicitly on internationalization features as part of initial feature set. See http://xmlroff.sourceforge.com.

[XSL-FO] XSL 1.0 Recommendation ("Formatting Objects"), published by the W3C October 2001. See http://www.w3.org/TR/xsl.

[XSLT] XSL Transformations (XSLT) 1.0 Recommendation, published by the W3C November 1999. See http://www.w3.org/TR/xslt.

Biography

W. Eliot Kimber is a Consultant at ISOGEN. Eliot is a founding member of the XML Working Group, Co-editor of ISO/IEC 10744:1997 (HyTime), and Co-Editor of ISO/IEC 10743, Standard Music Description Language. Eliot is a member of the W3C XSL Working Group. Eliot writes and speaks frequently on the subject of SGML, XML, hyperlinking, and related topics. When not trying to wrestle chaotic data into orderly structures, Eliot enjoys swimming, biking and guitar playing. Eliot is a devoted husband and dog owner.



[1] As opposed to peoples for whom English is a second but ubiquitous language.

[2] One interesting historical note was highlighted by Jon Bosak in his closing keynote at XML 2002 in Baltimore, Maryland, USA. Even though XSL-FO was originally developed primarily with the needs of technical publishers in mind, its ability to produce documents in non-Western languages turns out to be of vital importance to world commerce. The reason is simple: people in China or Cambodia or Pakistan will need to print invoices, purchase orders, or sales contracts that are in XML in some business markup language but that are written in their national language. If the technology to do that printing is low cost or free, as XML technology largely is, it goes a long way towards enabling anyone to participate in global electronic commerce. This could have profound implications for third-world countries struggling to compete in the world economy. It is largely for this reason that Sun has developed the xmlroff FO implementation[XMLROFF], which has focused initially on satisfying the requirements of internationalized business documents rather than technical documentation.

[3] [The Free On-line Dictionary of Computing].

[4] Source: [The American Heritage® Dictionary of the English Language], Fourth Edition Copyright © 2000 by Houghton Mifflin Company.

[5] The writing mode tb-lr is not part of the base set of XSL-FO writing modes. However it is one of the additional writing modes defined in Appendix A.1 of the XSL-FO specification.