Keywords: DocBook, XSLT, XSL-FO, Electronic Publishing, Publishing
Biography
Jirka Kosek is a freelance writer, consultant, trainer, university teacher, open source developer and PhD student. He has written several books about XML, HTML and PHP. He is a member of the DocBook developers team, which develops and improves the DSSSL and XSL stylesheets for formatting DocBook documents. He is also a member of the OASIS DocBook Technical Committee.
Many electronic publishing systems built on top of XML (e.g. DocBook) use XSLT to convert a source XML document into target formats such as HTML or XSL-FO (for print output). During the transformation, a back-of-the-book index can be generated and populated with the index entries spread over the document. Creating an index basically means sorting the index entries and grouping them by their first letter. However, this solution is appropriate only for some languages, English included. For other Latin-based languages such as Czech, Hungarian or Spanish the grouping method is more sophisticated and cannot be expressed in standard XSLT 1.0. The task is even more challenging if we want internationalized indexes in a general stylesheet package such as the DocBook XSL stylesheets. These stylesheets should support as many XSLT implementations as possible, which rules out vendor extensions.
This paper shows how support for non-English index generation was implemented in the DocBook XSL stylesheets, which problems were overcome, and which functionality is missing in XSLT 1.0 but can be added using EXSLT extensions. To deal with grouping problems such as different accented letters belonging to the same group or multi-letter sequences denoting one group, a solution based on XSLT keys over a user-defined function is provided. This function uses external localization files to look up the values that drive index generation and grouping.
The method presented up to this point is sufficient for indexes in HTML output. Print output brings new problems. Because the transformation and formatting phases in XSL are separated, there is no direct support for merging duplicate page numbers in XSL-FO. Fortunately, many FO engine vendors provide custom extensions to deal with this issue. The integration of these extensions into the DocBook XSL stylesheets will be presented.
The article also includes an evaluation of the XSLT 2.0 features available for index generation and proposals for further improvement of the indexing method so that it can handle CJKV languages.
1. Introduction
2. Marking up index entries in the DocBook
3. Using XSLT to generate index
4. DocBook stylesheets and localization
5. Internationalized index
6. Integration with DocBook stylesheets and compatibility
7. Print output issues
8. World after XSLT 2.0 and XSL-FO 1.1
9. Further work
10. Related work
11. Conclusion
Bibliography
Footnotes
The usability of a document, especially a printed one, depends on a good back-of-the-book index. Creating an index is laborious and demanding work, often performed by specialists. These days indexes are not built manually after the final layout of a book is known; instead, index terms are marked directly in the manuscript and the index is built automatically when the document is formatted.
In the era of XML publishing, several document types suitable for large documents have emerged. Probably the best known and most widely used is DocBook. The DocBook DTD defines several elements for marking up index terms. These terms are assembled into an index when the document is processed by the DocBook XSL stylesheets. However, getting a proper index by means of XSLT 1.0 and XSL-FO 1.0 alone is almost impossible. In the following text I show where the limits of current XSL lie with regard to index generation, how they were overcome in the DocBook stylesheets, and how new versions of XSLT and XSL-FO will make this task easier.
The most difficult part of creating an index must be done
manually and consists of marking up index entries in a document. In
DocBook, this is done by placing the indexterm
elements everywhere where you write about the given topic. The content
of the indexterm is not displayed as a part of a
document flow; it is used later when building the index.
<para>Wealth of modern societies is built upon information<indexterm><primary>information</primary></indexterm>.</para>
The indexterm element can also hold multilevel
entries:
<indexterm>
  <primary>information</primary>
</indexterm>
...
<indexterm>
  <primary>information</primary>
  <secondary>retrieval</secondary>
</indexterm>
...
<indexterm>
  <primary>information</primary>
  <secondary>dissemination</secondary>
</indexterm>
...
<indexterm>
  <primary>information</primary>
  <secondary>dissemination</secondary>
  <tertiary>oral</tertiary>
</indexterm>
Such index terms result in the following index output (the page numbers are of course for illustration only):
information, 13
  dissemination, 17
    oral, 25
  retrieval, 15
DocBook markup offers several other, more advanced methods for marking up index entries. You can read about them in [TDG] or [DBIDX]. In the following text I assume only single-level entries, because multilevel entries do not add any significant processing complexity from the internationalization point of view.
Generating the index consists of grouping the index terms by their initial letter and then sorting the entries alphabetically within each letter group. The stylesheets implement exactly this algorithm.
The most common design pattern for grouping in XSLT is the so-called
Muenchian method. To use this method we must define a key
that indexes all elements to be grouped by their group key. This
means that we must create a key that covers all
indexterm elements and is based on the
first letter of an index term.
<xsl:key name="letter" match="indexterm" use="substring(primary,1,1)"/>
Moreover, we want to group index terms regardless of their case, so the key should be built over lower-cased or upper-cased letters. The DocBook stylesheets use the latter approach. The final definition of the key is
<xsl:key name="letter" match="indexterm"
         use="translate(substring(primary,1,1),
                        'abcdefghijklmnopqrstuvwxyz',
                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
We need to select each letter that is used as the first letter of an index term just once. We do this by selecting the first index term starting with that letter and processing it in a special mode that is responsible for producing one index group.
<xsl:apply-templates select="//indexterm[count(.|
                               key('letter',
                                   translate(substring(primary,1,1),
                                             'abcdefghijklmnopqrstuvwxyz',
                                             'ABCDEFGHIJKLMNOPQRSTUVWXYZ'))[1])
                             = 1]"
                     mode="index-div">
  <xsl:sort select="translate(primary, 'abcdefghijklmnopqrstuvwxyz',
                                       'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
</xsl:apply-templates>
When processing an index group we must emit the group label and
process all entries belonging to this particular group. Entries in the
group must also be sorted and grouped together. For this purpose
a second xsl:key is defined and the Muenchian method is
used once more.
<xsl:key name="primary" match="indexterm" use="primary"/>
<xsl:template match="indexterm" mode="index-div">
  <!-- Get the group key (i.e. the first letter of index terms in this group) -->
  <xsl:variable name="key"
                select="translate(substring(primary,1,1),
                                  'abcdefghijklmnopqrstuvwxyz',
                                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  <!-- Output label for current index group -->
  <xsl:value-of select="$key"/>
  <xsl:apply-templates select="key('letter', $key)
                               [count(.|key('primary',primary)[1])=1]"
                       mode="index-primary">
    <xsl:sort select="translate(primary,
                                'abcdefghijklmnopqrstuvwxyz',
                                'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  </xsl:apply-templates>
</xsl:template>
The following template is called for each entry in the group. It outputs the index term followed by all its page references. If any secondary terms are defined, they are processed next.
<xsl:template match="indexterm" mode="index-primary">
  <!-- Find all occurrences of index term -->
  <xsl:variable name="refs" select="key('primary', primary)"/>
  <!-- Output text of index term -->
  <xsl:value-of select="primary"/>
  <xsl:for-each select="$refs[not(secondary)]">
    <!-- Create page number reference (print) or link with back
         reference (HTML) to each occurrence of the index term -->
  </xsl:for-each>
  <xsl:if test="$refs/secondary">
    <!-- Process secondary level entries -->
  </xsl:if>
</xsl:template>
As we can see, the indexing code is not very complex once one knows tricks such as Muenchian grouping or the translate()-based way to implement an upper-case function. The actual code in the DocBook XSL stylesheets is more complex because it deals with all the ways in which index terms can be expressed, but the principal method is the same.
The described method generates a satisfactory back-of-the-book index for English documents, but it is clearly inappropriate for other languages. Accented letters are not grouped together with unaccented ones.[1] Other Latin-based languages have separate groups for letters composed of two characters.[2] The biggest problem is that these rules are unique to each language and as such should be localized.
The DocBook XSL stylesheets adapt their output to the document
language. This means that automatically generated texts such as
“Table of Contents” or “Figure”, and the shape of
quotation marks, are different for each language. The document language
can be specified by using the lang attribute.
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.3//EN'
'http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd'>
<book lang="de">
... German book ...
</book>
How does the DocBook localization work? In the stylesheets
distribution you can find many files with names like
en.xml, cs.xml or
pt_br.xml in the common
directory. The files are named after the corresponding ISO 639 language code.
Each of these files contains translations of text phrases into the
target language. At the time of writing the stylesheets supported 44
different locales. The following fragment shows how the translation of
the phrase “Table of Contents” into Russian is recorded in the
localization file.
<l:gentext key="TableofContents" text="Содержание" />
All locale files are merged into one XML document,
l10n.xml, using entities.[3] This document is then used to look up translations or, if
there is no corresponding translation, to get the default English
value. The code that finds the correct translation is placed in the
l10n.xsl stylesheet. The two most important
templates here are the named templates gentext and
gentext.template. Their usage is very simple. If
you want to get the translation of “Table of Contents” in the
current document language, you can simply call the
gentext template with the appropriate parameter.
<xsl:call-template name="gentext">
  <xsl:with-param name="key">TableofContents</xsl:with-param>
</xsl:call-template>
It seems natural to use the same localization mechanism to get a
localized index. For many Latin-based languages it should be
sufficient to treat some accented characters as unaccented in order
to place them in the same index group. This can be accomplished by
changing the parameters of the translate() function, which
is used to convert index terms to uppercase during grouping. For
example, in German umlauted characters should be grouped as if they had
no umlaut. This can be done by using the following parameters for
translate() in the XSLT indexing code.
translate(substring(primary,1,1),
          'abcdefghijklmnopqrstuvwxyzäÄöÖüÜ',
          'ABCDEFGHIJKLMNOPQRSTUVWXYZAAOOUU')
This replaces the original:
translate(substring(primary,1,1),
          'abcdefghijklmnopqrstuvwxyz',
          'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
The problem is that we cannot call the gentext
template directly from an XPath expression, because XSLT named templates
are not exposed as XPath extension functions. Instead we must capture the
result of calling the template in a variable and then use this variable
in an expression.
<xsl:variable name="index.lowercase">
  <xsl:call-template name="gentext">
    <xsl:with-param name="key">index.lowercase</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
<xsl:variable name="index.uppercase">
  <xsl:call-template name="gentext">
    <xsl:with-param name="key">index.uppercase</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
...
translate(substring(primary,1,1), $index.lowercase, $index.uppercase)
This approach seems viable at first glance. Unfortunately, it does not work, because the Muenchian method used for grouping depends on a key. The localized definition of the key would be
<xsl:key name="letter" match="indexterm"
         use="translate(substring(primary,1,1),
                        $index.lowercase, $index.uppercase)"/>
But any decent XSLT processor should signal an error at this instruction, because the XSLT Recommendation forbids variable references inside key definitions.
So if we stay within the limits of XSLT 1.0 we cannot make indexing
locale aware. The only solution is to manually edit the lowercase and uppercase
string constants in the autoidx.xsl file
for use with a language other than English. This is quite easy, as
these strings are defined just once as internal text entities and
then used in many places. But this solution does not scale well, so
another solution was needed.
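The manual edit can be pictured as follows. This is only a sketch: the entity names lowercase and uppercase and the reduced stylesheet around them are assumptions made for illustration, not a copy of autoidx.xsl. It shows the two string constants declared once as internal entities, here already extended with the German umlauts used in the translate() example earlier, and reused in the key definition.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE xsl:stylesheet [
  <!-- Hypothetical entity names; each letter in the first string is
       mapped to the letter at the same position in the second one -->
  <!ENTITY lowercase "'abcdefghijklmnopqrstuvwxyzäÄöÖüÜ'">
  <!ENTITY uppercase "'ABCDEFGHIJKLMNOPQRSTUVWXYZAAOOUU'">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Entity references are expanded by the XML parser, so the XSLT
       processor sees plain string literals and no variable reference -->
  <xsl:key name="letter" match="indexterm"
           use="translate(substring(primary,1,1), &lowercase;, &uppercase;)"/>

</xsl:stylesheet>
```

Because the substitution happens at parse time, the restriction on variables in key definitions is not violated, but the strings are fixed for the whole stylesheet and cannot follow the document language.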
The demand for internationalized indexing was rising, and one day
I needed a perfect Czech index. I considered
several approaches, but the most viable one was based on
user-defined functions. Such a function can label each word with an ordinal
number that identifies both the group in the index and the position of this group
within the index. The function can use external localization files to
handle various languages in different ways. The value of this function
can be used inside an xsl:key definition, and thus we
can use the Muenchian grouping method.
The current implementation of this function is designed so that it correctly handles characters with diacritics and letters that are composed of two characters. The latter is needed because some languages treat “ch” as a single letter, which sorts between “c” and “d” in traditional Spanish or between “h” and “i” in Czech.
All information necessary for correct grouping of index terms and collation of these groups is stored together with other localized texts in the localization files. A special structure is used, as shown in Example 1.
<l:letters>
  <l:l i="-1" />
  <l:l i="0">Symboly</l:l>
  <l:l i="1">A</l:l> <l:l i="1">a</l:l> <l:l i="1">Á</l:l> <l:l i="1">á</l:l>
  <l:l i="2">B</l:l> <l:l i="2">b</l:l>
  <l:l i="3">C</l:l> <l:l i="3">c</l:l>
  <l:l i="4">Č</l:l> <l:l i="4">č</l:l>
  <l:l i="5">D</l:l> <l:l i="5">d</l:l> <l:l i="5">Ď</l:l> <l:l i="5">ď</l:l>
  <l:l i="7">E</l:l> <l:l i="7">e</l:l> <l:l i="7">É</l:l> <l:l i="7">é</l:l> <l:l i="7">Ě</l:l> <l:l i="7">ě</l:l> <l:l i="7">Ë</l:l> <l:l i="7">ë</l:l>
  <l:l i="8">F</l:l> <l:l i="8">f</l:l>
  <l:l i="9">G</l:l> <l:l i="9">g</l:l>
  <l:l i="10">H</l:l> <l:l i="10">h</l:l>
  <l:l i="11">Ch</l:l> <l:l i="11">ch</l:l> <l:l i="11">cH</l:l> <l:l i="11">CH</l:l>
  <l:l i="12">I</l:l> <l:l i="12">i</l:l> <l:l i="12">Í</l:l> <l:l i="12">í</l:l>
  <l:l i="13">J</l:l> <l:l i="13">j</l:l>
  <l:l i="14">K</l:l> <l:l i="14">k</l:l>
  <l:l i="15">L</l:l> <l:l i="15">l</l:l>
  <l:l i="16">M</l:l> <l:l i="16">m</l:l>
  <l:l i="17">N</l:l> <l:l i="17">n</l:l> <l:l i="17">Ň</l:l> <l:l i="17">ň</l:l>
  <l:l i="19">O</l:l> <l:l i="19">o</l:l> <l:l i="19">Ó</l:l> <l:l i="19">ó</l:l> <l:l i="19">Ö</l:l> <l:l i="19">ö</l:l>
  <l:l i="20">P</l:l> <l:l i="20">p</l:l>
  <l:l i="21">Q</l:l> <l:l i="21">q</l:l>
  <l:l i="22">R</l:l> <l:l i="22">r</l:l>
  <l:l i="23">Ř</l:l> <l:l i="23">ř</l:l>
  <l:l i="24">S</l:l> <l:l i="24">s</l:l>
  <l:l i="25">Š</l:l> <l:l i="25">š</l:l>
  <l:l i="26">T</l:l> <l:l i="26">t</l:l> <l:l i="26">Ť</l:l> <l:l i="26">ť</l:l>
  <l:l i="28">U</l:l> <l:l i="28">u</l:l> <l:l i="28">Ú</l:l> <l:l i="28">ú</l:l> <l:l i="28">Ů</l:l> <l:l i="28">ů</l:l> <l:l i="28">Ü</l:l> <l:l i="28">ü</l:l>
  <l:l i="29">V</l:l> <l:l i="29">v</l:l>
  <l:l i="30">W</l:l> <l:l i="30">w</l:l>
  <l:l i="31">X</l:l> <l:l i="31">x</l:l>
  <l:l i="32">Y</l:l> <l:l i="32">y</l:l> <l:l i="32">Ý</l:l> <l:l i="32">ý</l:l>
  <l:l i="33">Z</l:l> <l:l i="33">z</l:l>
  <l:l i="34">Ž</l:l> <l:l i="34">ž</l:l>
</l:letters>
Example 1: Index localization data for Czech language
As you can see, there is a separate l element
for each letter that can occur at the start of an index term.
The attribute i assigns a group number
to this letter. Letters that should appear within the same index group
thus have the same value in this attribute. For example, index terms
starting with “A”, “a”,
“Á” or “á” will be placed in the same group.
The label of the group that appears in the index is taken from the
first l element with the corresponding group
number. This means that words starting with variants of the letter
“a” will be in the group labeled
“A”.
The localization table is also able to cope with two-character
letters. The following definition ensures that terms starting with
“ch” will be in a separate index group after
“h” and before “i”. The order is defined by the
value stored in the attribute i.
<l:l i="10">H</l:l> <l:l i="10">h</l:l>
<l:l i="11">Ch</l:l> <l:l i="11">ch</l:l> <l:l i="11">cH</l:l> <l:l i="11">CH</l:l>
<l:l i="12">I</l:l> <l:l i="12">i</l:l>
One of the goals of the DocBook XSL stylesheets is to be as portable as possible. For that reason I decided to implement the indexing function described above as an EXSLT function instead of a proprietary Java/C/Python/… extension for a particular XSLT processor. The most widely used XSLT implementations for DocBook processing are probably Saxon, xsltproc and Xalan. All of these programs claim support for EXSLT user-defined functions. Such a user-defined function is very similar to a named XSLT template. The biggest difference is that a user-defined function can be called directly from an XPath expression like any other XPath function.
<func:function name="i:group-index">
<xsl:param name="term"/>
<xsl:variable name="letters-rtf">
<xsl:variable name="lang">
<xsl:call-template name="l10n.language"/>
</xsl:variable>
<xsl:variable name="local.l10n.letters"
select="($local.l10n.xml//l:i18n/l:l10n[@language=$lang]/
l:letters)[1]"/>
<xsl:variable name="l10n.letters"
select="($l10n.xml/l:i18n/l:l10n[@language=$lang]/
l:letters)[1]"/>
<xsl:choose>
<xsl:when test="count($local.l10n.letters) > 0">
<xsl:copy-of select="$local.l10n.letters"/>
</xsl:when>
<xsl:when test="count($l10n.letters) > 0">
<xsl:copy-of select="$l10n.letters"/>
</xsl:when>
<xsl:otherwise>
<xsl:message>
<xsl:text>No "</xsl:text>
<xsl:value-of select="$lang"/>
<xsl:text>" localization of index grouping
letters exists</xsl:text>
<xsl:choose>
<xsl:when test="$lang = 'en'">
<xsl:text>.</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>; using "en".</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:message>
<xsl:copy-of select="($l10n.xml/l:i18n/l:l10n[@language='en']/
l:letters)[1]"/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<xsl:variable name="letters"
select="exslt:node-set($letters-rtf)/*"/>
<xsl:variable name="long-letter-index"
select="$letters/l:l[. = substring($term,1,2)]/@i"/>
<xsl:variable name="short-letter-index"
select="$letters/l:l[. = substring($term,1,1)]/@i"/>
<xsl:variable name="letter-index">
<xsl:choose>
<xsl:when test="$long-letter-index">
<xsl:value-of select="$long-letter-index"/>
</xsl:when>
<xsl:when test="$short-letter-index">
<xsl:value-of select="$short-letter-index"/>
</xsl:when>
<xsl:otherwise>0</xsl:otherwise>
</xsl:choose>
</xsl:variable>
<func:result select="number($letter-index)"/>
</func:function>
Example 2: Function that returns index group number for a given term
The largest part of the function i:group-index()
is code that loads the localization data from the correct place. The
function then tries to find a match for the first two characters of the term
(e.g. to handle “ch”). If there is no two-letter match,
a one-letter match is used. The function returns the number of the
corresponding index group. If no letter match is found, the function
returns zero, which represents the group for symbols and other terms not
starting with a letter.
Using this function we can define a supplementary key for grouping.
<xsl:key name="group-code"
         match="indexterm"
         use="i:group-index(primary)"/>
Now it is quite easy to modify the original code for grouping index entries. We use our user-defined function to form groups based on the number taken from the localization file. The sort order of the groups is also defined by this number.
<xsl:apply-templates
    select="//indexterm[count(.|key('group-code',
                                    i:group-index(primary))[1]) = 1]"
    mode="index-div">
  <xsl:sort select="i:group-index(primary)" data-type="number"/>
</xsl:apply-templates>
We must also slightly modify the code for handling an index group. This code emits the group label and processes each term belonging to this group just once.
<xsl:template match="indexterm" mode="index-div">
  <!-- Get the group index -->
  <xsl:variable name="key" select="i:group-index(primary)"/>
  <!-- Get the current language -->
  <xsl:variable name="lang">
    <xsl:call-template name="l10n.language"/>
  </xsl:variable>
  <!-- Output label for current index group -->
  <xsl:value-of select="i:group-letter($key)"/>
  <xsl:apply-templates select="key('group-code', $key)
                               [count(.|key('primary', primary)[1])=1]"
                       mode="index-primary">
    <xsl:sort select="primary" lang="{$lang}"/>
  </xsl:apply-templates>
</xsl:template>
We are using the supplementary function
i:group-letter() to return the label of the group with a
particular number.
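A function of this kind can be quite short. What follows is only a minimal sketch, not the code shipped in the stylesheets: it assumes that the letter table of the current language is already available in a global variable, here the hypothetical $index.letters loaded from a hypothetical file letters-cs.xml whose root element is l:letters, and the namespace URIs bound to the l and i prefixes are likewise assumptions. The real function would have to locate the localization data in the same way as i:group-index() in Example 2.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:func="http://exslt.org/functions"
    xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
    xmlns:i="urn:cz-kosek:functions:index"
    extension-element-prefixes="func"
    exclude-result-prefixes="l i">

  <!-- Assumption: the letter table for the current language, read from
       a hypothetical file located next to this stylesheet -->
  <xsl:variable name="index.letters"
                select="document('letters-cs.xml')/l:letters"/>

  <!-- Return the label of the index group with the given number -->
  <func:function name="i:group-letter">
    <xsl:param name="index"/>
    <!-- The first l element carrying the requested group number
         supplies the label; an unknown number gives an empty string -->
    <func:result select="string($index.letters/l:l[@i = $index][1])"/>
  </func:function>

</xsl:stylesheet>
```

With the Czech data from Example 1, a call such as i:group-letter(11) would yield “Ch”, because the first l element with i="11" contains that string.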
The code presented up to this point is able to group index terms properly and to order these groups according to language-specific rules. One problem is still left unresolved: the sorting of terms inside each index group. I decided to leave this task to the XSLT processors. Some of them can use the underlying implementation provided by the virtual machine or operating system to do proper collation. Other processors allow you to specify your own collating sequence. For example, if you want the correct sort order for Czech in Saxon[4], you must create a simple Java class, compile it and add it to the CLASSPATH. Other processors may provide similar functionality.
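package com.icl.saxon.sort;
import java.text.Collator;
import java.util.Locale;
// Comparer that orders strings according to Czech collation rules
public class Compare_cs extends TextComparer
{
  // Collator for the Czech locale, created once and reused for every comparison
  private final Collator csCollator = Collator.getInstance(
      new Locale("cs", "CZ"));
  int caseOrder = UPPERCASE_FIRST;
  // Delegate the comparison of two sort keys to the Czech collator
  public int compare(Object a, Object b)
  {
    return csCollator.compare(a, b);
  }
  // Store the requested case order; compare() itself does not consult it
  public Comparer setCaseOrder(int caseOrder)
  {
    this.caseOrder = caseOrder;
    return this;
  }
}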
Example 3: Sample class that adds proper Czech collating support into Saxon
The code presented in this paper shows only the most important parts
and is simplified a little. If you want to study the complete code,
you can look at the files common/autoidx-ng.xsl,
html/autoidx-ng.xsl and
fo/autoidx-ng.xsl in the stylesheets
distribution.

Figure 1: Sample index processed with different configurations
EXSLT is not supported in all XSLT implementations, and thus we cannot add internationalized indexing to the stock stylesheets. This would break compatibility for people who are not interested in internationalized indexes. A different deployment method was therefore selected. The internationalized stylesheets are part of the distribution but are not included by default.
If we want to use internationalized indexing features of the
stylesheets we must create a customization layer that overrides
default index generating templates by including a small
autoidx-ng.xsl stylesheet.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:import
href="http://docbook.sf.net/release/xsl/current/fo/docbook.xsl"/>
<xsl:include
href="http://docbook.sf.net/release/xsl/current/fo/autoidx-ng.xsl"/>
<!-- Parameter settings and other modifications of stylesheet -->
</xsl:stylesheet>
This customization layer is then used instead of stock stylesheets.
The internationalized indexing method is known to work with Saxon
6.5.3 and Xalan 2.6.0. I put a substantial amount of time into porting
this code to xsltproc, but I have been unsuccessful so far. The authors
of xsltproc interpret the EXSLT specification very restrictively from the
point of view of XSLT 1.0 and do not allow the use of variables inside
user-defined functions that are used in keys. For that reason I created
an alternative implementation of internationalized indexing that does not
use the Muenchian grouping method, which depends on
xsl:key. But grouping is very slow without
keys: xsltproc is 40 times slower than Saxon, which can use keys for this
task. Moreover, xsltproc has very poor support for specifying
user-defined collation sequences, so I have given up on xsltproc support until these
problems are resolved.
Generating a printed back-of-the-book index in XSL is a two-phase process. The first phase is an XSLT transformation that converts a source DocBook document into a set of abstract formatting objects. Page numbers for the index entries are not known at this moment. The actual rendering and page number evaluation take place during the second, formatting phase, which is performed by an FO processor such as FOP, XEP or XSL Formatter. A problem arises when one index term occurs twice within a page: the index then contains duplicate page numbers for this entry. This serious drawback can be overcome in two ways. The first solution relies on an FO processor that implements a vendor extension for index generation. The other possibility is to use multiple passes over the document to detect and remove the duplicates.
Vendor extensions are supported in the two best-known commercial FO processors, XEP and XSL Formatter. The DocBook XSL stylesheets contain support for these FO implementations and are able to add special indexing elements and attributes to the FO output. For each FO processor there is a parameter that turns on the special indexing features. For XEP, for instance, the xep.extensions parameter is set to 1 when the stylesheet is invoked from the command line.
In the real world we usually change the behaviour of the stylesheets by customizing more than one parameter. The best practice is then to create a customization layer that imports the stock stylesheets and sets all necessary parameters.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:import
href="http://docbook.sf.net/release/xsl/current/fo/docbook.xsl"/>
<xsl:param name="paper.type" select="'A4'"/>
<xsl:param name="xep.extensions" select="1"/>
</xsl:stylesheet>
If you prefer XSL Formatter over XEP, you can use a similar
parameter axf.extensions to turn on the XSL
Formatter support. Use of these parameters results in removing
duplicate page numbers and in creating a page range for continuous
sequences of page numbers. For example if a single index entry occurs
on the following pages:
5, 5, 8, 9, 10, 37
the output is condensed into the following, more readable form:
5, 8–10, 37
When we are using an FO processor that does not support an indexing
extension, we must employ a more difficult procedure. This is also the
case with the open-source FOP
processor. We must process the document twice. The first pass
is done with the make.index.markup parameter
set. The resulting PDF will contain XML markup for the index entries and
page numbers. This PDF can be converted to plain text, from which the
XML markup is extracted. The duplicates are then removed, and the
modified XML fragment of the index is used to produce the proper PDF.
This process is quite a hack, and it does not work very well for
languages that use characters outside ISO Latin 1, as FOP does
not insert the proper Unicode mapping vector for embedded fonts. This
technique was
invented by G. Ken Holman.
We have seen that it is probably impossible to implement
internationalized indexing using only XSLT 1.0 features. Our
implementation used two EXSLT extensions: user-defined functions and the
node-set() function. Fortunately, the authors of the
upcoming XSLT 2.0 listen very carefully to user needs, and XSLT
2.0 offers many new features that can significantly simplify our
indexing task.
The new xsl:function instruction offers
a standard way of declaring user-defined functions. Such a function can
be used in any XPath expression. There is no need for the Muenchian
grouping method, as XSLT 2.0 offers the new instruction
xsl:for-each-group. Likewise, there is no need for
the node-set() function, because XSLT 2.0 does
away with result tree fragments.
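As a rough illustration of how the EXSLT function from Example 2 might be restated with xsl:function, consider the following sketch. It is not taken from the stylesheets: the variable $index.letters, the file name letters-cs.xml and the namespace URIs for the l and i prefixes are assumptions, and the loading of localization data is reduced to a single document() call.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:l="http://docbook.sourceforge.net/xmlns/l10n/1.0"
    xmlns:i="urn:cz-kosek:functions:index"
    exclude-result-prefixes="xs l i">

  <!-- Assumption: the letter table for the current language, read from
       a hypothetical file whose root element is l:letters -->
  <xsl:variable name="index.letters"
                select="document('letters-cs.xml')/l:letters"/>

  <!-- Return the number of the index group a term belongs to -->
  <xsl:function name="i:group-index" as="xs:integer">
    <xsl:param name="term" as="xs:string?"/>
    <!-- Try a two-character letter such as "ch" first -->
    <xsl:variable name="long"
        select="$index.letters/l:l[. = substring($term, 1, 2)]/@i"/>
    <!-- Then fall back to a single-character letter -->
    <xsl:variable name="short"
        select="$index.letters/l:l[. = substring($term, 1, 1)]/@i"/>
    <!-- The first available value wins; 0 is the group for symbols -->
    <xsl:sequence select="xs:integer(($long, $short, 0)[1])"/>
  </xsl:function>

</xsl:stylesheet>
```

Because the function is declared with xsl:function, an expression such as i:group-index(primary) needs neither an extension namespace nor a node-set() conversion.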
Creating an index in XSLT 2.0 is pretty straightforward using the new grouping facilities.
<!-- Get the current language -->
<xsl:variable name="lang">
  <xsl:call-template name="l10n.language"/>
</xsl:variable>
<!-- Create index groups -->
<xsl:for-each-group select="//indexterm"
                    group-by="i:group-index(primary)">
  <xsl:sort select="i:group-index(primary)"/>
  <!-- Output label for current index group -->
  <xsl:value-of select="i:group-letter(current-grouping-key())"/>
  <!-- Group index terms in one group -->
  <xsl:for-each-group select="current-group()" group-by="primary">
    <xsl:sort select="primary" lang="{$lang}"/>
    <xsl:apply-templates select="." mode="index-primary"/>
  </xsl:for-each-group>
</xsl:for-each-group>
XSLT 2.0 also offers a better interface for specifying user-defined collations, which is important for proper sorting of terms inside an index group. Saxon 7/8, the only reasonable implementation of the XSLT 2.0 Working Draft available, supports all collations provided by the underlying Java VM.
To summarize: XSLT 2.0 offers great new features that make generating internationalized indexes very easy compared to the XSLT 1.0 + EXSLT solution. The new solution does not need any extensions and is thus more portable. The only drawback of XSLT 2.0 is its slow standardization process. For the last four years the answer to the question “When will XSLT 2.0 be finished?” has been “Probably next year.” I hope this is the last year in which this answer is correct. It can be expected that more implementations will come onto the market once the specification is finalized. But if you are an early adopter, you can use XSLT 2.0 with Saxon 8 right now. Saxon 8 is mature enough for production use.
The development of the new version of XSL-FO is not at as advanced a stage as XSLT 2.0, but there is a Working Draft of XSL-FO 1.1. It introduces new formatting objects for dealing with indexes and for suppressing duplicate page numbers in an index. These new constructs replace the vendor extensions or multi-pass processing that are needed today.
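To give an impression of these constructs, here is a small sketch of index markup in the style of XSL-FO 1.1. The element and property names (index-key, fo:index-page-citation-list, fo:index-key-reference, ref-index-key, merge-sequential-page-numbers) follow the XSL 1.1 specification as later published, so they are an assumption with respect to the Working Draft discussed here and should be checked against the specification and the formatter in use. The page geometry and the sample text are purely illustrative.

```xml
<?xml version="1.0" encoding="utf-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <fo:simple-page-master master-name="page"
                           page-height="297mm" page-width="210mm"
                           margin="20mm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>
  <fo:page-sequence master-reference="page">
    <fo:flow flow-name="xsl-region-body">
      <!-- Each occurrence of the term is tagged with an index key -->
      <fo:block>Wealth is built upon
        <fo:wrapper index-key="information">information</fo:wrapper>.</fo:block>
      <fo:block break-before="page">More about
        <fo:wrapper index-key="information">information</fo:wrapper>.</fo:block>
      <!-- Index entry: the formatter collects the page numbers for the
           key and is asked to merge runs of consecutive pages -->
      <fo:block break-before="page">information,
        <fo:index-page-citation-list merge-sequential-page-numbers="merge">
          <fo:index-key-reference ref-index-key="information"/>
        </fo:index-page-citation-list>
      </fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```

The point of the design is that the stylesheet only names the key; collecting, deduplicating and merging the page numbers is left to the formatter, which is the only component that knows them.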
The current implementation of internationalized indexing is known to work in Saxon and in Xalan. Index localization data are available for seven languages: Czech, Danish, German, English, Spanish, French and Turkish. All of these files were created manually except the English one, which was generated from the Unicode character database and contains all accented variants of the 26 letters. These variants are always placed in the same index group as the unaccented letter.
Currently we are planning to add support for indexing CJKV languages. This is an especially challenging task, as there are several thousand such characters and my knowledge of CJKV languages is equal to zero. But with the help of other resources things are becoming clearer. One thing that is obvious right now is the need for a different layout of the index localization files. The requirements for CJKV languages are very different. Index terms are grouped by the number of strokes or by the radicals in a glyph. The group label is not just the first letter of an index term as in Latin-based languages. We will probably use an alternative layout of index localization data that is more appropriate for CJKV languages, and the stylesheets will be adapted to handle data in both formats.
In the long term the stylesheets will migrate to XSLT 2.0, where internationalized indexing will be on by default. However, this new implementation will probably not start before XSLT 2.0 reaches at least Candidate Recommendation status.
Meanwhile, work on compatibility with other XSLT implementations is expected, as well as support for new languages. If your preferred language is currently not supported in internationalized indexing, we would appreciate it if you could provide index localization data to the DocBook XSL stylesheets project.
The original indexing code in the DocBook XSL stylesheets comes from Jeni Tennison. It was later modified to support several new features, but the overall design of the code has remained intact.
After I finished the first prototype of internationalized indexing I came across the paper by Kimber and Reynolds [FOIDX], which deals with the same problem. Their solution is implemented as a Java extension for Saxon and it is ready to support CJK languages. The advantage of the solution used in the DocBook XSL stylesheets is better compatibility: at least two processors (Saxon and Xalan) are able to do internationalized indexing.
In this article I have shown a general method for creating internationalized back-of-the-book indexes using XSLT 1.0 with EXSLT extensions. The advantage of this method is that it works in more than one XSLT implementation. The integration of this method into the DocBook XSL stylesheets was presented. Internationalized indexing is not turned on by default, in order to maintain compatibility with XSLT processors that do not support EXSLT. Emerging standards such as XSLT 2.0 and XSL-FO 1.1 will make internationalized index generation easier and will allow it to become a default feature.
[1] For example, in German the letter “ö” should be in the same group as “o”, and within this group it should be sorted as “oe”.
[2] For example, in Czech “ch” is treated as one letter; it should have a separate group in the index, placed between “h” and “i”.
[3] The merged document is quite large, about two megabytes. If
you are really concerned about performance you can comment out the entity
references for unused
languages in l10n.xml. Such an arrangement of the information
may be seen as very inefficient, but please bear in mind
that it was designed several years ago, when there were just a few
localizations and far fewer items to translate.
[4] Saxon version 6.5.x is used with the DocBook XSL stylesheets. Newer versions of Saxon can use JVM collation automatically, without the need for user-defined classes.