XML Europe 2003

Multilingual Markup of Digital Library Texts Using XML, TEI and XSLT

Abstract

In this paper we show and defend the benefits of using multilingual markup schemes for large digitization projects, such as the Miguel de Cervantes Digital Library, and the resulting increase in production when encoders use markup tags in their own language. We also describe how we designed the scheme and developed the programs that generate Extensible Stylesheet Language (XSL) transformations and C++ parsers to convert English markup and Document Type Definitions (DTDs) to Spanish. Finally, we present the conclusions drawn from implementing this project.

Table of Contents

1. Introduction
2. Markup, meaning and multilinguality.
3. Automatic generation of markup translators
4. Explanation of usage
5. Conclusions
6. Future work
Bibliography
Glossary
Biography

1. Introduction

The Miguel de Cervantes Digital Library (http://cervantesvirtual.com/) is the biggest electronic publishing project in Spain, and perhaps the biggest digital library of Spanish texts on the Internet, currently with more than 10,000 entries in its catalogue [Bia-Pedreno:2001]. It produces an average of 150 Extensible Markup Language (XML) digital texts per month, most of which are Spanish classics from the 12th century to the present day, covering a wide variety of subjects and styles such as poetry, narrative, drama, history, geography and law. These texts are used both by casual readers and by specialized researchers who take advantage of the power of complex structural markup.

In a project like this, the amount and quality of production depend heavily on technology. Even slight changes in critical aspects of the production technology produce considerable changes in the time, cost and quality of the final output. We have carried out many projects to improve the production of XML texts: building specialized spell-checking tools [Bia-Sanchez:2002], designing approaches and tools to simplify DTDs [Bia-Carrasco-Sanchez:2002] [Bia-Carrasco:2002], developing parsing tools to convert files to and from XML, developing software for production workflow control and document management, and devising algorithms for production cost estimates [Bia:2002].

2. Markup, meaning and multilinguality.

In 1998 Robin Cover wrote: [How does XML help with the encoding of information at the semantic level? Or does it? New users sometimes refer to XML as semantic markup, and may be heard to praise XML for its ability to express semantic clarity through markup. ... Someone who uses a text editor to examine an XML document -- comparing it to an ancient WordStar file, to a comma-delimited text file, to Postscript, or to any document using a procedural or presentational markup language -- will readily judge the XML document more meaningful with respect to the information objects represented by text. The markup itself is a form of 'metadata', explaining to us what the constituent elements are (by name), and how these information objects are structured into larger coherent units.] [Cover:1998]

Although they currently prefer predicate logic as more suitable than conventional DTDs for semantic purposes [Dubin-SperbergMcQueen-Renear-Huitfeldt:2002], Sperberg-McQueen et al. argued in 2000 for the usefulness of markup as a source of meaning: [The function of markup is not random. Markup has meaning. What does it mean to have meaning? How does markup have meaning? Why worry about this question?: For better markup language documentation, for better QA (verification), for better automated processes (translation, normalization, query), to provide a way to survey current practice (relevance for software developers) ... and because it's interesting.] [How does markup mean? Because markup means something, ... we know certain things. I.e. because we see certain markup, we are allowed (licensed) to make certain inferences]. They concluded that: [the meaning of markup is the set of inferences it licenses.] [SperbergMcQueen-Huitfeldt-Renear:2000]

So one of the key aspects of structural markup is the meaning it conveys, which depends on our ability to understand it. Understanding XML tags is key to correctly delimiting complex text structures for further automated processing. This understanding may be compromised when tag names (elements, attributes and attribute values) are in a foreign language.

Our digital library is a multidisciplinary project where specialists from different fields of study (philologists, computer scientists, librarians, sociologists, etc.) work together. The largest group of specialists in the library is the proofreading and markup team (about 40 people), composed of specialists from different humanities fields, none of them related to the English language. It is in this area that the need for, and importance of, translating the original English markup into the encoders' own language (Spanish) becomes evident.

We learned from practice that using a tagset in a foreign language, compared with using a tagset in our own language, increases learning time and reduces the quality and amount of digital text production, since tag names are mnemonics that may sound familiar to English speakers but are hard for users of other languages to understand and memorize. Giving our encoders the possibility of applying tags in Spanish has increased the amount and quality of digital text production.

Convinced as we are of the value and advantages of using standards, we chose the Text Encoding Initiative (TEI) tagset, which is a de facto standard, at least within the community of English-literature scholars. After using it successfully for some time, we embarked on the project of translating TEI element names, attribute names and attribute values into Spanish. Finally, we developed translation tools that provide automatic conversion to and from the main English TEI core. These automatic conversion programs translate not only the markup of XML documents but also the corresponding DTDs.

We are now in the process of building other TEI tagsets and translations for several other languages. The purpose is to have many official translations of the TEI tagset, but a single core version (the original English one). Automating the translation of the tags is vital to ensure easy interchange of documents among projects using different languages. In this way, from the structural and semantic point of view, the tagset is the same; only the names change.

We also believe that multilingual versions of a given tagset, like TEI, can facilitate its introduction in many parts of the world, such as Latin America, where the use of XML for electronic publishing is still uncommon. This may be of special interest for digital libraries and digital publishers worldwide.

The main reason for this initiative is that markup schemes are usually defined in English, while a large community of users is not fluent in English and therefore loses the meaning the tag names carry. If the markup scheme is translated into the users' language, they will learn and master it faster, and the production of marked-up texts will increase, with a corresponding reduction in costs.

3. Automatic generation of markup translators

We started by defining the set of possible translations of element names, attribute names, and attribute values to the different target languages. We stored this information in an XML multilingual translation mapping document. An example of this document and its DTD follow.

TRANSLATION MAPPING DOCUMENT FOR ENGLISH, SPANISH AND FRENCH (SAMPLE):

<TAGMAP>
...
  <ELEMENT en="body" sp="cuerpo" fr="corps">
  </ELEMENT>
...
  <ELEMENT en="div0" sp="div0" fr="div0">
     <ATTR en="lang" sp="lengua" fr="langue">
     </ATTR>
     <ATTR en="type" sp="tipo" fr="type">
       <VALUE en="news" sp="noticias" fr="nouvelles"/>
       <VALUE en="suggestions" sp="sugerencias" fr="suggestions"/>
       <VALUE en="biblnews" sp="novedades" fr="publications"/>
     </ATTR>
  </ELEMENT>
...
  <ELEMENT en="p" sp="parrafo" fr="paragraphe">
     <ATTR en="align" sp="alinear" fr="aligne">
       <VALUE en="left" sp="izq" fr="gauche"/>
       <VALUE en="right" sp="der" fr="droite"/>
       <VALUE en="center" sp="centro" fr="centre"/>
       <VALUE en="justify" sp="justificar" fr="justifie"/>
     </ATTR>
     <ATTR en="indent" sp="sangria" fr="retraitpositif">
       <VALUE en="left" sp="izq" fr="gauche"/>
       <VALUE en="right" sp="der" fr="droite"/>
       <VALUE en="both" sp="ambas" fr="lesDeux"/>
       <VALUE en="none" sp="ninguna" fr="aucune"/>
     </ATTR>
     <ATTR en="specialindent" sp="sangriaespecial" fr="retraitnegatif">
       <VALUE en="none" sp="ninguna" fr="aucune"/>
       <VALUE en="firstline" sp="primeralinea" fr="premiereLigne"/>
       <VALUE en="french" sp="francesa" fr="francaise"/>
     </ATTR>
  </ELEMENT>
...
</TAGMAP> 

DTD FOR THE ABOVE FILE:

<!ELEMENT TAGMAP (ELEMENT)+ >

<!ELEMENT ELEMENT (ATTR)* >

<!ATTLIST ELEMENT
     en CDATA #REQUIRED
     sp CDATA #REQUIRED
     fr CDATA #REQUIRED>

<!ELEMENT ATTR (VALUE)* >

<!ATTLIST ATTR
     en CDATA #REQUIRED
     sp CDATA #REQUIRED
     fr CDATA #REQUIRED>

<!ELEMENT VALUE EMPTY >

<!ATTLIST VALUE
     en CDATA #REQUIRED
     sp CDATA #REQUIRED
     fr CDATA #REQUIRED>
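
As an illustration only, the following Python sketch shows how a mapping document of this form could be read into lookup tables for one language pair. It is not the generator used in the project, which was written in XSLT. The function name load_tagmap and the cut-down TAGMAP_XML sample are assumptions made for the example; the attribute names en, sp and fr come from the DTD above.

```python
import xml.etree.ElementTree as ET

# A cut-down mapping document in the format shown above (illustrative subset).
TAGMAP_XML = """
<TAGMAP>
  <ELEMENT en="body" sp="cuerpo" fr="corps"/>
  <ELEMENT en="p" sp="parrafo" fr="paragraphe">
    <ATTR en="align" sp="alinear" fr="aligne">
      <VALUE en="left" sp="izq" fr="gauche"/>
      <VALUE en="right" sp="der" fr="droite"/>
    </ATTR>
  </ELEMENT>
</TAGMAP>
"""

def load_tagmap(xml_text, src, dst):
    """Return (elements, attrs, values) lookup tables for src -> dst.

    elements: {src element name: dst element name}
    attrs:    {(src element, src attribute): dst attribute}
    values:   {(src element, src attribute, src value): dst value}
    """
    elements, attrs, values = {}, {}, {}
    for el in ET.fromstring(xml_text).findall("ELEMENT"):
        elements[el.get(src)] = el.get(dst)
        for at in el.findall("ATTR"):
            attrs[(el.get(src), at.get(src))] = at.get(dst)
            for val in at.findall("VALUE"):
                key = (el.get(src), at.get(src), val.get(src))
                values[key] = val.get(dst)
    return elements, attrs, values

elements, attrs, values = load_tagmap(TAGMAP_XML, "en", "sp")
print(elements["p"])                    # parrafo
print(attrs[("p", "align")])            # alinear
print(values[("p", "align", "left")])   # izq
```

Attribute names and values are keyed by their enclosing element because, as in the sample, the same English value (for instance "left") may appear under several attributes and need not be translated identically everywhere.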

This mapping document, which contains all the structural information needed to build the language converters, is read by the transformations generator, which was built as an XSLT stylesheet [Kay:2000]. XSLT can process XML documents to produce other XML documents or plain text. Since XSLT stylesheets are themselves XML, they can be generated as the output of another stylesheet. In this way, for each language in the multilingual translation mapping file, we produced both an English-to-local-language transformation and a local-language-to-English transformation, which ensures two-way convertibility of XML documents.
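
To make the idea of generating a stylesheet from the mapping concrete, here is a minimal Python sketch that emits an XSLT 1.0 stylesheet for one direction of one language pair. It stands in for the project's generator, which was itself an XSLT stylesheet and is not reproduced here. The function generate_xslt, the helper x and the layout of the generated templates are assumptions for the example. Applying the result to a document requires an XSLT processor, which the sketch does not include.

```python
import xml.etree.ElementTree as ET

XSL = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL)

# Illustrative subset of the mapping document described above.
TAGMAP_XML = """
<TAGMAP>
  <ELEMENT en="body" sp="cuerpo" fr="corps"/>
  <ELEMENT en="p" sp="parrafo" fr="paragraphe">
    <ATTR en="align" sp="alinear" fr="aligne">
      <VALUE en="left" sp="izq" fr="gauche"/>
      <VALUE en="right" sp="der" fr="droite"/>
    </ATTR>
  </ELEMENT>
</TAGMAP>
"""

def x(parent, tag, **attrs):
    """Append an xsl:* child element to parent and return it."""
    return ET.SubElement(parent, "{%s}%s" % (XSL, tag), attrs)

def generate_xslt(tagmap_xml, src, dst):
    """Build a stylesheet that renames src-language markup to dst."""
    sheet = ET.Element("{%s}stylesheet" % XSL, {"version": "1.0"})
    # Identity rule: copy anything the mapping does not mention.
    ident = x(sheet, "template", match="@*|node()")
    x(x(ident, "copy"), "apply-templates", select="@*|node()")
    for el in ET.fromstring(tagmap_xml).findall("ELEMENT"):
        # Rename the element, then process its attributes and children.
        tmpl = x(sheet, "template", match=el.get(src))
        out = x(tmpl, "element", name=el.get(dst))
        x(out, "apply-templates", select="@*|node()")
        for at in el.findall("ATTR"):
            pattern = "%s/@%s" % (el.get(src), at.get(src))
            tmpl = x(sheet, "template", match=pattern)
            attr = x(tmpl, "attribute", name=at.get(dst))
            vals = at.findall("VALUE")
            if not vals:
                x(attr, "value-of", select=".")  # free-text value: keep it
                continue
            choose = x(attr, "choose")
            for val in vals:
                # Assumes values contain no quote characters.
                when = x(choose, "when", test=".='%s'" % val.get(src))
                when.text = val.get(dst)
            x(x(choose, "otherwise"), "value-of", select=".")
    return ET.tostring(sheet, encoding="unicode")

en_to_sp = generate_xslt(TAGMAP_XML, "en", "sp")
sp_to_en = generate_xslt(TAGMAP_XML, "sp", "en")
print(en_to_sp)
```

Calling the function twice with the language codes swapped yields the two directions of a pair. The identity rule keeps unmapped markup unchanged, and the more specific patterns for mapped elements and attributes take precedence over it under the default XSLT priority rules.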

We also generate, for each language, a DTD translator in the form of a parser written in C++ and Lex. DTDs are not written in XML syntax and hence cannot be transformed with XSLT. For this reason we used the ability of XSLT to produce plain text, in this case the source of a C++ program. We only considered one-way translation from the English DTD to a local-language DTD, since we assumed that the DTD would first be built in the original language of the XML vocabulary (English) and then translated to the local language, not the other way around. We saw no need to translate the local-language DTD back to English (dashed line in Figure 1), but this transformation could easily be generated in the same way if the need arose, allowing maintenance and modifications to be done in the local language and then translated to English. Just as the English DTD can be used to validate the set of XML documents marked up in English, the local-language DTD can be used to validate the set of files marked up in the local language.
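
The project's DTD translators are generated C++ and Lex programs, which are not shown in this paper. Purely as an illustration of the task they perform, the following Python sketch translates element names, attribute names and enumerated attribute values in a small English DTD. The lookup tables are hand-copied from the sample mapping, and translate_dtd and the regular expressions are assumptions for the example. The sketch ignores comments, parameter entities and conditional sections, and assumes that quoted defaults contain no ">" character.

```python
import re

# English -> Spanish lookup tables, hand-copied from the sample mapping.
ELEMENTS = {"body": "cuerpo", "p": "parrafo"}
ATTRS = {("p", "align"): "alinear"}
VALUES = {("p", "align", "left"): "izq", ("p", "align", "right"): "der"}

NAME = r"[A-Za-z_][\w.:-]*"
# One ELEMENT or ATTLIST declaration: keyword, declared name, remainder.
DECL = re.compile(r"<!(ELEMENT|ATTLIST)\s+(%s)(.*?)>" % NAME, re.S)
# One attribute definition: name, type (keyword or enumeration), default.
ATTDEF = re.compile(
    r'(%s)(\s+)(\([^)]*\)|[A-Z]+)(\s+)'
    r'(#REQUIRED|#IMPLIED|(?:#FIXED\s+)?"[^"]*")' % NAME
)

def translate_dtd(dtd):
    """Return dtd with element, attribute and value names translated."""

    def attdef(el, m):
        name, sp1, typ, sp2, default = m.groups()

        def value(v):
            return VALUES.get((el, name, v), v)

        if typ.startswith("("):  # enumerated type: translate each choice
            choices = typ[1:-1].split("|")
            typ = "(" + "|".join(value(c.strip()) for c in choices) + ")"
        if default.endswith('"'):  # quoted default value
            head, literal, _ = default.rsplit('"', 2)
            default = '%s"%s"' % (head, value(literal))
        return ATTRS.get((el, name), name) + sp1 + typ + sp2 + default

    def decl(m):
        kind, el, rest = m.groups()
        if kind == "ELEMENT":
            # Translate names in the content model; the lookbehind
            # skips keywords such as #PCDATA.
            rest = re.sub(r"(?<![#\w])%s" % NAME,
                          lambda n: ELEMENTS.get(n.group(), n.group()),
                          rest)
        else:
            rest = ATTDEF.sub(lambda a: attdef(el, a), rest)
        return "<!%s %s%s>" % (kind, ELEMENTS.get(el, el), rest)

    return DECL.sub(decl, dtd)

DTD_EN = """<!ELEMENT body (p)+ >
<!ELEMENT p (#PCDATA) >
<!ATTLIST p
     align (left|right) "left">
"""

print(translate_dtd(DTD_EN))
```

Names that are absent from the tables, including DTD keywords such as EMPTY or CDATA, pass through unchanged, so a partially translated tagset still yields a usable DTD.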

This transformation generation process is shown in Figure 1 with Spanish as the target language. Markup translators for many other languages can be built in the same way. In our tests we experimented with English, Spanish and French, and were able to generate transformations to translate to and from any pair of these languages, although the idea is to translate to and from the language of the original tagset (English), which should be used as the standard interchange language among projects.
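
The effect of such a pair of transformations on a document can be imitated directly in Python, which makes the round trip easy to check. The sketch below is a stand-in for the generated XSLT, not the project's code. The names tables and translate, the sample document and the cut-down mapping are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of the mapping document described above.
TAGMAP_XML = """
<TAGMAP>
  <ELEMENT en="body" sp="cuerpo" fr="corps"/>
  <ELEMENT en="p" sp="parrafo" fr="paragraphe">
    <ATTR en="align" sp="alinear" fr="aligne">
      <VALUE en="left" sp="izq" fr="gauche"/>
      <VALUE en="right" sp="der" fr="droite"/>
    </ATTR>
  </ELEMENT>
</TAGMAP>
"""

def tables(tagmap_xml, src, dst):
    """Flatten the mapping into scoped name and value dictionaries."""
    names, values = {}, {}
    for el in ET.fromstring(tagmap_xml).iter("ELEMENT"):
        names[(el.get(src),)] = el.get(dst)
        for at in el.iter("ATTR"):
            names[(el.get(src), at.get(src))] = at.get(dst)
            for val in at.iter("VALUE"):
                key = (el.get(src), at.get(src), val.get(src))
                values[key] = val.get(dst)
    return names, values

def translate(doc_xml, tagmap_xml, src, dst):
    """Rename elements, attributes and attribute values from src to dst."""
    names, values = tables(tagmap_xml, src, dst)
    root = ET.fromstring(doc_xml)
    for node in root.iter():
        tag = node.tag
        # Unmapped names and values are kept as they are.
        renamed = {
            names.get((tag, attr), attr): values.get((tag, attr, val), val)
            for attr, val in node.attrib.items()
        }
        node.attrib.clear()
        node.attrib.update(renamed)
        node.tag = names.get((tag,), tag)
    return ET.tostring(root, encoding="unicode")

english = '<body><p align="left">Hola</p></body>'
spanish = translate(english, TAGMAP_XML, "en", "sp")
print(spanish)  # <cuerpo><parrafo alinear="izq">Hola</parrafo></cuerpo>
print(translate(spanish, TAGMAP_XML, "sp", "en") == english)  # True
```

The round trip only returns the original document if no two names in the same scope share a translation, so a mapping file intended for two-way conversion should be checked for such collisions.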

Figure 1. Automatic generation of markup translators

4. Explanation of usage

Taking a minimalist approach, we think local-language markup should be used almost exclusively for creation and maintenance purposes (i.e. to tag, edit, and correct the XML documents). There are other cases where it is also useful, such as XML searches where the search-engine interface lets the user enter queries using markup elements, attributes and attribute values, which are better understood in the local language. If the user has to use tag names in a foreign language, the semantic advantage is lost. A translating interface can also be used for multilingual XML search engines.

For automated processing and document interchange, we think it is more convenient to use markup in the language of the original standard. In this way, stylesheets need not be translated to the local language; instead, the document is translated to the original tagset.

This approach is debatable, and some users may prefer to translate everything (including stylesheets) to the local language. Although not strictly necessary, this can easily be done.

A risk to be avoided, if the advantages of using a standard widely accepted markup scheme are to be preserved, is the development of alternative markup schemes in different languages which may evolve independently from the central standard.

5. Conclusions

- Learning times were noticeably reduced.

- Production times were also reduced, and markup quality increased. Encoders were satisfied and more confident in their task.

- When using markup in one's own language, the meaning of markup is not lost.

- Cooperative multilingual projects may benefit from the possibility of easily translating the markup to each encoder's language.

- Sometimes new non-standard vocabularies are developed just because doing so seems easier than learning a standard vocabulary in a foreign language. Being able to use a standard vocabulary in one's own language removes much of the incentive to develop a new custom vocabulary to fulfil a local markup requirement. This may help spread the use of XML vocabularies like TEI or DocBook in non-English-speaking countries.

- Spreading the use of standard markup vocabularies is good for document interchangeability.

6. Future work

The same set of tools can be re-engineered in a language like Java or C++ to provide faster performance and a nicer interface. It could also be implemented as a client-server Web service.

A similar or different strategy may be developed for Schema-based XML projects.

Bibliography

[Bia:2002] Alejandro Bia, DiCoMo: A cost estimation model for digitization projects, in ACH/ALLC 2002: New Directions in Humanities Computing, The 14th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, 24-28 July 2002, University of Tuebingen, Germany, pages 11-15.

[Bia-Carrasco:2002] Alejandro Bia and Rafael Carrasco, Generation of Simplified DTDs From a Set of XML Sample Files, in XML Europe 2002 Conference and Exposition, 20-23 May 2002, Hotel Princesa Sofia Inter-Continental, Barcelona, Spain, page 80, http://www.xmleurope.com/.

[Bia-Carrasco-Sanchez:2002] Alejandro Bia, Rafael C. Carrasco, and Manuel Sanchez-Quero, A Markup Simplification Model to Boost Productivity of XML Documents, in Digital Resources for the Humanities 2002 Conference, pages 13-16, University of Edinburgh, George Square, Edinburgh EH8 9LD, Scotland, UK, 8-11 September 2002.

[Bia-Pedreno:2001] Alejandro Bia and Andres Pedreno, The Miguel de Cervantes Digital Library: the Hispanic voice on the Web, in LLC (Literary and Linguistic Computing) journal, Oxford University Press, 2001, vol. 16, n. 2, pages 161-177, ISSN: 0268-1145.

[Bia-Sanchez:2002] Alejandro Bia and Manuel Sanchez-Quero, Building ancient Spanish dictionaries for spell-checking of DL texts, in LREC 2002: Third International Conference on Language Resources and Evaluation (Manuel Gonzalez-Rodriguez and Carmen Paz Suarez-Araujo, eds.), vol. VI, pages 1832-1837, Las Palmas de Gran Canaria, Spain, 29-31 May 2002.

[Cover:1998] Robin Cover, XML and Semantic Transparency, Cover Pages, October 23, 1998. Revised November 24, 1998. http://www.oasis-open.org/cover/xmlAndSemantics.html

[Dubin-SperbergMcQueen-Renear-Huitfeldt:2002] David Dubin, Michael Sperberg-McQueen, Allen Renear and Claus Huitfeldt, A logic programming environment for document semantics and inference, in ACH/ALLC 2002: New Directions in Humanities Computing, The 14th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, 24-28 July 2002, University of Tuebingen, Germany.

[Kay:2000] Michael Kay, XSLT Programmer's Reference, 1st ed., Wrox Press, 1102 Warwick Road, Acocks Green, Birmingham, B27 6BH, UK, 2000, ISBN 1-861003-12-9.

[SperbergMcQueen-Huitfeldt-Renear:2000] C. M. Sperberg-McQueen, Claus Huitfeldt and Allen Renear, Meaning and Interpretation of Markup not as simple as you think, in Extreme Markup Languages, Montreal, 15 August 2000.

Glossary

ACM

Association for Computing Machinery

DTDs

Document Type Definitions

TEI

Text Encoding Initiative

XML

Extensible Markup Language

XSL

Extensible Stylesheet Language

Biography

Alejandro G. Bia is the Head of Research and Development at the Miguel de Cervantes Digital Library in Alicante, Spain.

He has a BS and an MS degree in Computer Sciences from ORT University, a Diploma in Computing and Information Systems from Oxford University and is finishing his PhD thesis on Computing Methods to Automate the Production of Digital Resources in Digital Libraries at the University of Alicante. Currently he is working as Head of Research and Development at the Miguel de Cervantes Digital Library of the University of Alicante, where the results of his ongoing research are being put into practice. He also works as Associate Professor of the Department of Fundamentals of Economic Analysis of the University School of Entrepreneurial Sciences of the University of Alicante. In the past he has worked as Special-Projects Manager at NetGate (1996), Documentation Editor of the GeneXus project at Advanced Research and Technology (ARTech) (1991-1994), and worked at the Telephone-Traffic Data Processing Unit of ANTEL (1994-1989). He has been a lecturer on Operating Systems, Computer Organization, Computer Networks and English for Computer Sciences at ORT University (1990-1996). His current interests are digitisation automation by computer methods, digital preservation, digitisation metrics and cost estimates, text structuring and markup languages. He is an active member of the TEI Consortium and of the Association for Computing Machinery (ACM).

Manuel Sanchez-Quero is the Head of XML Markup at the Miguel de Cervantes Digital Library in Alicante, Spain.

He has a B.A. in English Philology from the University of Alicante. Currently he is working as Head of XML Markup at the Miguel de Cervantes Digital Library (U. of Alicante). He has also made several contributions to conferences on digital libraries and on computing and the humanities. His current interests are digital edition and publication, text structuring and markup languages.

Regis Deau is currently carrying out an internship at the Miguel de Cervantes Digital Library of the University of Alicante, Spain, in order to obtain an Engineering Diploma in Telecommunication and Networks from the Engineering School of Advanced Sciences of Saint-Etienne (ISTASE), France. He has a DEUG STPI (2-year degree) in engineering sciences, a D.U.T. in Electrical Engineering and Industrial Data Processing Systems, specialising in LAN (Local Area Networks), from the University Institute of Technology of Nimes, France (equivalent to an HND, a 2-year qualification in technical subjects), and a BAC STIGE (Science and Industrial Techniques, specialising in Electrical Engineering), obtained with honours (French equivalent of 3 GCE A levels specialising in Electro-Technology).