XML Europe 2003

Dictionaries for all: XML to Final Product

Abstract

This paper looks at the implementation of a new Dictionary Publishing System at the UK publisher Longman Dictionaries, and explains some of the technical and production aspects of the new system.

Table of Contents

1. Introduction
2. Longman DPS Architecture Overview
3. Conclusion
Glossary
Biography

1. Introduction

Dictionaries play an important part in everyone’s life. From early childhood to adolescence and beyond, they are a helpful tool for finding out a word’s meaning, checking a difficult spelling, or simply understanding words and their usage. This is particularly important given that, each year, half of the one billion language students worldwide who learn English as a second language do so with the help of a dictionary.

Longman English Language Training (ELT), part of Pearson Education plc, have been producing and publishing dictionaries since 1727 (titles have included Dr Johnson's Dictionary in 1755 and the original Roget's Thesaurus in 1852) and have witnessed many changes in dictionary production methods over the years.

Like many dictionary publishers, Longman have embraced many different technologies to assist in the process of information gathering, production control, publishing and distribution. With the need to be even more efficient, to continue publishing traditional paper-based dictionaries, and to increase its market presence with new electronic and online products, Longman recently revamped its entire dictionary production methodology, with resulting dramatic improvements in editorial, production and publishing time frames.

Longman needed to streamline the dictionary publishing process by allowing lexicographers to write dictionary entries directly as XML documents, with the entries held in a centralised database and managed by powerful workflow tools, and to have a fast and efficient method for in-house proofing and publishing of the finished product.

Building and deploying an integrated dictionary production system in such a context is a challenge: users expect at least 10 years of service from their dictionary system.

In 1989, the team in charge of the previous implementation of a dictionary system at Longman chose SGML as its native format. This was implemented on an OS/2 platform, with an off-the-shelf dictionary editor, Gestorlex.

In 2000, at the decision point for selecting a new dictionary system, SGML, in the guise of its more widely supported XML subset, again seemed the most obvious choice, for several reasons:

  • Conversion from SGML, the established meta-standard for professional publishers, would be as straightforward as could reasonably be hoped for.

  • XML, being less ‘proprietary’, had become more mainstream and was supported by all the major vendors (Microsoft, Sun, IBM). Today XML is widely accepted and covers many aspects of the computing landscape. Indeed, it is the one thing those fierce competitors all seem to agree on.

  • XML offered true support for all character sets and languages. That was very important to Longman, as bilingual dictionaries with their complex character set requirements are an essential growth market for them.

  • XML, a meta-format rather than a format in the accepted sense, provides nearly unlimited flexibility in what you can do with data structured this way, as we will see during this presentation.

Once the decision was made to go ahead with the procurement of a new system, in addition to the system being XML compliant, a number of other technical requirements were added to the list including:

  • SQL/XML Database support.

  • An XML compliant dictionary definitions Editor.

  • Workflow Control Mechanism.

  • Ability to export valid XML data to multimedia devices, for example to CD-ROM versions of the dictionaries and to Web Servers for Internet Publishing.

  • Automatic proofing to paper to measure the ‘extent’ of the entries; that is, a dictionary publisher needs to know how many final paper pages a given number of dictionary entries will make.

There were also a number of important business issues to be addressed:

  • A reduction in time to market.

  • More efficient use of resources, both in-house and contractor.

  • Reduced Production costs.

  • Ability to repurpose dictionary content so that new titles can be created from existing dictionary data.

  • To streamline the production of electronic dictionaries for CD-ROM, online and for licensing.

  • To facilitate collaborative working between editors working on the same projects but who are based in different countries/continents including the USA, South America and Asia.

A number of software vendors were then approached. During these discussions it became clear that no single vendor could meet all of the requirements, in particular the automatic proofing system for ‘extent’ testing. It was important that the vendors selected provide products that were technically compatible and, just as importantly, that the vendors themselves were able to work with each other.

After Longman had compared and tested a number of products, Ingénierie Diffusion Multimédia s.a. (IDM) (http://www.idm.fr), a Paris-based software developer with experience in developing dictionary and encyclopaedia solutions, was chosen to develop and implement the new Dictionary Publishing System (DPS) for Longman, and XyEnterprise Inc. (http://www.xyenterprise.com/xpp.asp), a Boston-based software company, was selected to provide the implementation of the automatic proofing system. IDM would be the lead developer, liaising with XyEnterprise on the development of the ‘extent’ proofing system.

The DPS project started in early 2000 with the development of product programs and their delivery for onsite testing. As more developed code was delivered for the project, the DPS system began to take shape.

The final shape of the DPS is discussed in the following paragraphs.

2. Longman DPS Architecture Overview

The Longman DPS platform is built around a number of seamlessly integrated components:

  • XML Dictionary Database Server.

  • Application Server.

  • Administration & Workflow application.

  • XML Dictionary Definition Editor.

  • Multi-lingual corpora.

  • Corpus Manager.

  • Composition Engine.

As can be seen in the figure below:

The dictionary database server holds the dictionary data in a hybrid SQL/XML database repository (the current implementation sits on top of MS SQL Server 7.0). The SQL/XML database centralises all definitions for a project. It can handle an unlimited number of DTDs and supports sophisticated searches based on both content and dictionary-specific structures; for example, tracking various project statuses, or extracting “all entries that contain more than 10 senses” if an editor wants to review ‘difficult’ definitions.
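
A structural search of this kind can be illustrated outside the database. The following is a minimal sketch in Python, assuming the `dicEntry`, `headword` and `sense` tag names used in the examples later in this paper; the function name `entries_with_more_senses_than` is hypothetical, and the DPS itself runs such searches inside its SQL/XML repository rather than in client code.

```python
import xml.etree.ElementTree as ET

def entries_with_more_senses_than(xml_text, threshold, entry_tag="dicEntry",
                                  headword_tag="headword", sense_tag="sense"):
    """Return the headwords of entries holding more than `threshold` senses."""
    root = ET.fromstring(xml_text)
    selected = []
    for entry in root.iter(entry_tag):
        # Count only senses that are direct children of the entry.
        if len(entry.findall(sense_tag)) > threshold:
            selected.append(entry.findtext(headword_tag, default=""))
    return selected

# Illustrative data only (not taken from a Longman dictionary).
sample = """<dictionary>
  <dicEntry><headword>set</headword><sense/><sense/><sense/></dicEntry>
  <dicEntry><headword>cheeseburger</headword><sense/></dicEntry>
</dictionary>"""

print(entries_with_more_senses_than(sample, 2))  # ['set']
```

An editor's "more than 10 senses" review list would correspond to calling the function with a threshold of 10.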

The Application Server holds the management web site, the Input/Output (I/O) engine for batch processing of sets of definitions, and the composition engine used to produce Adobe Portable Document Format (PDF) ‘extent’ proofs.

The Administration & Workflow application allows project managers and editors to supervise the editorial work. The managing editors can dispatch specific tasks to the lexicographers via their specific project home page. As work progresses, each task moves through several stages (for example new, dispatched, in progress, available for review, validated), which are tracked and managed by the workflow system. The technical status of the definitions, and specifically DTD conformance, is also tracked by the system. All relevant project metrics, including progress, definition status and work assignments, are displayed online using a project ‘dashboard’ tool. The system design and capacity allow the administrators to track and manage an average of 25 simultaneous distinct projects.
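
The stage tracking described above can be pictured as a small state machine. The sketch below is illustrative only: the stage names come from the text, but the strictly linear ordering, the `Task` class and the `dashboard` function are assumptions made for the example, not a description of the DPS workflow engine.

```python
# Stage names as listed in the text; the linear order is an assumption.
STAGES = ["new", "dispatched", "in progress", "available for review", "validated"]

class Task:
    """A hypothetical editorial task that remembers the stages it has passed through."""

    def __init__(self, name):
        self.name = name
        self.stage = STAGES[0]
        self.history = [self.stage]

    def advance(self):
        """Move the task to the next stage; a validated task cannot advance."""
        position = STAGES.index(self.stage)
        if position == len(STAGES) - 1:
            raise ValueError(f"task {self.name!r} is already validated")
        self.stage = STAGES[position + 1]
        self.history.append(self.stage)
        return self.stage

def dashboard(tasks):
    """Count tasks per stage, as a project 'dashboard' might summarise them."""
    counts = {stage: 0 for stage in STAGES}
    for task in tasks:
        counts[task.stage] += 1
    return counts

letter_c = Task("letter C")
letter_h = Task("letter H")
letter_c.advance()  # new -> dispatched
letter_c.advance()  # dispatched -> in progress
print(dashboard([letter_c, letter_h]))
```

Keeping the history on each task is what allows progress to be reported per project as well as per lexicographer.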

The administrator workstation is configured to access both the database server and the application server directly. It hosts the scripting and XML manipulation tools used to configure dictionary projects and to process batches of definitions. While the application server provides the sophisticated I/O and check-in/check-out mechanisms needed to work on dictionary data without disrupting the day-to-day workflow, the administrator selects the tools to work with, such as Omnimark™, Awk, etc.

Editors and project managers use the Project Management Console, which is essentially the administrative part of the web site hosted on the application server. From there, they can query the dictionary database for specific definitions, move those definitions into a task file and assign the file to a lexicographer. They can merge definitions back from an incoming task file, and track overall project progress via lexicography-specific indicators. Lexicographers communicate with their editors and project managers using the Lexicographer Console. From there, they retrieve new work, ask questions, and follow up on definitions already produced.

The XML Dictionary Definitions Editor is based on Microsoft’s Internet Explorer 6.x components and JavaScript programs, and is available to all lexicographers worldwide over Longman's intranet and the Internet. The XML editor automatically adapts itself to a specific project’s DTD. It is updated with the latest version and project configuration files every time the lexicographer goes online to download a task file, allowing authors to structure their definitions accordingly. Once the download has completed, the lexicographer has the option of working offline.

The Composition engine/‘extent’ proofing uses XyEnterprise's XML Professional Publisher (XPP) composition tools to convert exported XML dictionary entries to seamless hardcopy renderings of dictionary pages, in final page style and layout, in PDF format. These renderings are fed back to the relevant author in ‘real time’ to allow for instant decisions on which entries will be included in the final pages of a paper dictionary.

This technical partnering and integration of IDM and XyEnterprise technologies has considerably shrunk the time-to-market delays inherent in traditional proofing cycles, as well as reducing the associated costs of proofing dictionaries.

The Longman Corpus Viewer allows an author to query a number of huge databases of usage in many of the world's languages, including English, Spanish, Portuguese, Italian, Polish and Japanese, for words or phrases, according to country of origin, subject matter or level of speech. The corpus database is updated daily with hundreds of articles of very diverse origin.

Let's take a deeper look at some of the underlying technologies used in the DPS.

From content to layout: Displaying and rendering XML data.

As mentioned previously, XML is the native format for dictionary definitions. Project specific configuration files are used to define how this XML content is rendered on screen or on paper.

A key aspect of those configuration files is that they are not proprietary: The DPS only uses standard, W3C-sanctioned scripting languages.

A DTD is used to specify how project entries are structured. At design time, XML schemas were fully standardised in the system, but not implemented across all the tools. Future versions of the DPS may use schemas instead of DTDs.

XSL files are used to specify standard transformations and extractions from dictionary definitions (most notably how definitions or parts of definitions should be translated for display in the editor and on the administration web site).

CSS files are used to further refine the appearance of definitions on the editor's screen.

Standard XML files are used to describe resources provided by the DPS, such as abbreviations lists, definition vocabulary specifications, user and system templates.

The following architecture is used over and over again in the DPS:

This two-layered approach to rendering allows us to separate content selection and organisation from what are strictly layout issues.

XSL scripts take care of:

  • Filtering out or translating content that is not meant to be displayed.

  • Ordering entries and XML elements inside entries, where necessary.

  • Adding delimiters and counters between elements.

  • Translating cross-references in definitions, which are stored as system ids, to a meaningful representation.

CSS style sheets take care of presentation issues, such as fonts, colours, capitalisation, layout and spacing.
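
The division of labour between the two layers can be sketched as follows. In the DPS the content layer is written in XSL; Python stands in for it here only so that the example is easy to run. The `note` element, the class names and the CSS rules are hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

HIDDEN_TAGS = {"note"}  # hypothetical editorial element not meant for display

def render_entry(entry):
    """Content layer: select, order and number content; no styling decisions."""
    div = ET.Element("div", {"class": "entry"})
    head = ET.SubElement(div, "span", {"class": "headword"})
    head.text = entry.findtext("headword", default="")
    for number, sense in enumerate(entry.findall("sense"), start=1):
        block = ET.SubElement(div, "div", {"class": "sense"})
        counter = ET.SubElement(block, "span", {"class": "sense-number"})
        counter.text = f"{number}. "  # counter added between elements
        for child in sense:
            if child.tag in HIDDEN_TAGS:
                continue  # filtered out by the content layer
            span = ET.SubElement(block, "span", {"class": child.tag})
            span.text = child.text
    return ET.tostring(div, encoding="unicode")

# Layout layer: appearance only, keyed on the class names emitted above.
STYLE = """
.headword { font-weight: bold; }
.sense-number { color: gray; }
.example { font-style: italic; }
"""

entry = ET.fromstring(
    "<dicEntry><headword>Cheeseburger</headword>"
    "<sense><def>A hamburger, with added cheese</def>"
    "<note>check with editor</note></sense></dicEntry>"
)
html = render_entry(entry)
print(html)
```

Because the first layer emits only structure and class names, the same output can be restyled for screen or for another project by swapping the style sheet alone.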

XML for definition storage

XML in the DPS must adhere to the constraints described in the following paragraphs:

The top-level “Entry” element

The XML data making up the dictionary must consist of a collection of identically named top-level elements that constitute dictionary entries. For example, the following XML fragment can be interpreted as a valid set of entries:

<dictionary>
	<dicEntry>
		<headword>Cheeseburger</headword>
		<sense>
			<def>A hamburger, with added cheese</def>
		</sense>
	</dicEntry>
	<dicEntry>
		<headword>Hamburger</headword>
		<sense>
			<def>A cheeseburger, without the cheese</def>
		</sense>
	</dicEntry>
</dictionary>
			

…But this one cannot, because of the “header” and “letter” elements:

<dictionary>
	<header><startDate>20020101</startDate><version>1.00</version></header>
	<letter>C
		<dicEntry>
			<headword>Cheeseburger</headword>
			<sense>
				<def>A hamburger, with added cheese</def>
			</sense>
		</dicEntry>
	</letter>
	<letter>H 
		<dicEntry>
			<headword>Hamburger</headword>
			<sense>
				<def>A cheeseburger, without the cheese</def>
			</sense>
		</dicEntry>
	</letter>
</dictionary>
			

Note that the “Entry” element can have any valid XML name, as the DPS platform provides the configuration options to map a tag name to the concept of “dictionary entry”.
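
The constraint can be expressed as a simple check. The sketch below is a hypothetical illustration, not part of the DPS: it assumes the entry tag name is passed in as configuration (here `dicEntry`) and accepts a document only if every child of the root element carries that name.

```python
import xml.etree.ElementTree as ET

def is_valid_entry_collection(xml_text, entry_tag="dicEntry"):
    """True if the root holds at least one child and all children are entries."""
    root = ET.fromstring(xml_text)
    children = list(root)
    return bool(children) and all(child.tag == entry_tag for child in children)

flat = ("<dictionary><dicEntry><headword>Cheeseburger</headword></dicEntry>"
        "<dicEntry><headword>Hamburger</headword></dicEntry></dictionary>")
grouped = ("<dictionary><header><version>1.00</version></header>"
           "<letter>C<dicEntry><headword>Cheeseburger</headword></dicEntry></letter>"
           "</dictionary>")

print(is_valid_entry_collection(flat))     # True
print(is_valid_entry_collection(grouped))  # False
```

The second document fails because its top-level children are `header` and `letter` rather than entries, mirroring the two fragments above.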

The “Headword” element

Each “Entry” element must include a single distinct element that maps to the concept of an “Entry Headword”. The content of this element is used by the DPS to present entry lists and for elementary online dictionary searches.

The content of that element does not have to be unique across the dictionary, as the DPS fully understands (and manages) homonyms. For example, the following XML fragment defines a valid entry, thanks to the <headword> element:

<dicEntry>
	<headword>Cheeseburger</headword>
	<sense>
		<def>A hamburger, with added cheese</def>
	</sense>
</dicEntry>
			

...But this one doesn't, as there is not a clearly identified element that maps to the notion of a “Headword”:

<dicEntry>Cheeseburger
	<sense>
		<def>A hamburger, with added cheese</def>
	</sense>
</dicEntry>
			

Note that the “Headword” element can have any valid XML name, as the DPS platform provides the configuration options to map a tag name to the concept of “entry headword”.
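
As an illustration of the headword rule, the following sketch checks that an entry has exactly one headword element and groups homonyms by headword text. It assumes that the headword is a direct child of the entry, as in the examples above; the function names `headword_of` and `homonym_groups` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def headword_of(entry, headword_tag="headword"):
    """Return the entry's headword text, or raise if it is missing or repeated."""
    matches = entry.findall(headword_tag)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one <{headword_tag}>, found {len(matches)}")
    return (matches[0].text or "").strip()

def homonym_groups(entries, headword_tag="headword"):
    """Map each headword shared by several entries to the positions of those entries."""
    groups = defaultdict(list)
    for position, entry in enumerate(entries):
        groups[headword_of(entry, headword_tag)].append(position)
    return {word: places for word, places in groups.items() if len(places) > 1}

# Illustrative entries only.
entries = [ET.fromstring(text) for text in (
    "<dicEntry><headword>bank</headword></dicEntry>",
    "<dicEntry><headword>Cheeseburger</headword></dicEntry>",
    "<dicEntry><headword>bank</headword></dicEntry>",
)]
print(homonym_groups(entries))  # {'bank': [0, 2]}
```

An entry such as `<dicEntry>Cheeseburger</dicEntry>`, with the headword as bare text, makes `headword_of` raise a `ValueError`.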

The “Sense” element

Each “Entry” element can include one or several “Sense” elements. The content of those elements is used by the DPS to track work progress and entry complexity metrics. Defining a tag name that maps to the lexicographic notion of “Sense” is important if you want to take advantage of the progress management screens of the DPS and of the automated entry layout facilities of the editor (such as automated sense numbering).

For example, the following entry layout will allow you to take full advantage of all the metrics available in the DPS:

<dicEntry>
	<headword>Cheeseburger</headword>
	<sense>
		<def>A hamburger, with added cheese</def>
		<example>John likes to have a cheeseburger for lunch</example>
	</sense>
</dicEntry>
			

...But this one typically won't:

<dicEntry>
	<headword>Cheeseburger</headword>
	<def>A hamburger, with added cheese</def>
	<example>John likes to have a cheeseburger for lunch</example>
</dicEntry>
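
To see why the second layout defeats the metrics, consider a sketch of a sense-based measure. The `entry_metrics` function and the choice of metrics are assumptions made for illustration; the DPS's actual indicators are not described in this paper.

```python
import xml.etree.ElementTree as ET

def entry_metrics(entry, sense_tag="sense", example_tag="example"):
    """Count senses, and the examples found inside those senses."""
    senses = entry.findall(sense_tag)
    return {
        "senses": len(senses),
        "examples": sum(len(sense.findall(example_tag)) for sense in senses),
    }

with_sense = ET.fromstring(
    "<dicEntry><headword>Cheeseburger</headword><sense>"
    "<def>A hamburger, with added cheese</def>"
    "<example>John likes to have a cheeseburger for lunch</example>"
    "</sense></dicEntry>")
without_sense = ET.fromstring(
    "<dicEntry><headword>Cheeseburger</headword>"
    "<def>A hamburger, with added cheese</def>"
    "<example>John likes to have a cheeseburger for lunch</example>"
    "</dicEntry>")

print(entry_metrics(with_sense))     # {'senses': 1, 'examples': 1}
print(entry_metrics(without_sense))  # {'senses': 0, 'examples': 0}
```

Without a sense wrapper, the same content yields zero counts, so any progress or complexity figure built on senses has nothing to measure.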
			

Attributes are not allowed

The use of XML attributes inside dictionary entries is not allowed. The use of attributes is reserved for the system, to manage meta-data and uniquely identify elements.

Meta-data items managed by the system include:

  • Entry author(s).

  • Entry version information.

  • Entry editorial status.

  • Unique element ids, as managed by the database.

As all those items are directly managed by the DPS, there is no need to include them in the XML.

There are several justifications for this restriction of the XML allowed by the system:

  • It is of little practical consequence, as child elements can replace attributes with no loss of semantics and via a straightforward, syntactic transformation.

  • It clearly partitions meta-data managed by the system as opposed to semantic dictionary data.

  • The XML syntax for attributes does not support extended character sets. By allowing semantic information to be stored in attributes, we might jeopardize the multilingual support of the platform.
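
The syntactic transformation mentioned in the first justification might look like the sketch below, which could be applied when importing legacy data. It is an illustration under stated assumptions: the `pos` and `register` attributes are invented for the example, and the function `attributes_to_children` is hypothetical rather than a DPS tool.

```python
import xml.etree.ElementTree as ET

def attributes_to_children(element):
    """Recursively replace each attribute with a leading child element."""
    for child in list(element):
        attributes_to_children(child)
    # Insert in reverse so the new children keep the attributes' original order.
    for name, value in reversed(list(element.attrib.items())):
        child = ET.Element(name)
        child.text = value
        element.insert(0, child)
    element.attrib.clear()
    return element

sense = ET.fromstring(
    '<sense pos="noun" register="informal"><def>A hamburger</def></sense>')
converted = ET.tostring(attributes_to_children(sense), encoding="unicode")
print(converted)
# <sense><pos>noun</pos><register>informal</register><def>A hamburger</def></sense>
```

After the conversion the attribute slots are free, which is what lets the system reserve them for its own meta-data.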

The encoding of choice is Unicode/UTF-16

The DPS definitions editor expects XML encoded as UTF-8, which is converted from the UTF-16 stored in the SQL database. As UTF-16 can represent the full Unicode character set and translation from one encoding to the other is straightforward, this is not seen as much of a constraint.
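
The conversion between the two encodings is lossless and short to express. The sketch below is a minimal illustration in Python, assuming the stored bytes begin with a byte order mark; the function name `utf16_to_utf8` and the sample headword are hypothetical.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes (with a byte order mark) as UTF-8."""
    return data.decode("utf-16").encode("utf-8")

original = "<headword>café 辞書</headword>"
stored = original.encode("utf-16")   # as held in the database (assumed, with BOM)
delivered = utf16_to_utf8(stored)    # as expected by the editor

print(delivered.decode("utf-8") == original)  # True
```

If a document carries an XML declaration naming its encoding, that declaration has to be updated as part of the same step.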

Each project has its own DTD

The DPS server is designed to host several dictionary projects at once. Each project has one DTD. Several versions of the same DTD can coexist for a given project, as long as those versions remain reasonably similar: their main characteristics, especially the mapping to the dictionary concepts of entry, headword and sense, cannot change.

If more extensive changes to the DTD are mandated, the DPS provides specific features to allow a project to transition from one DTD to the next, via the administration workstation and its batch process engine.

Cross Referencing is an important System addition

In the DPS, a cross-reference is a link between two elements of a dictionary entry, the source and the target, and is simply stored in an XML element.

Handling a cross-reference involves two steps:

  • Cross-reference creation.

  • Cross-reference resolution.

Cross-reference editing and modification are carried out by the lexicographer within the XML Dictionary entry editor.

Interestingly, the source of the cross-reference needs to exist, but not necessarily the target: the lexicographer only needs to type the headword corresponding to the target entry. Cross-reference resolution is a server-side process that happens near the end of the production cycle, when the dictionary content is being validated. This somewhat time-consuming process has been optimised in the DPS by a help ‘wizard’ that assists task managers in resolving cross-references across a project. As all the links for the cross-references are already stored in the database, resolution performance is very high and the results are 100% accurate.

To identify cross-reference sources, the search can be done either by task or by headword range. As the result of the search, the resolved and unresolved cross-references are listed in a table. The destination headword then needs to be completed to select a destination entry, thereby resolving any homonym ambiguities. It is also possible to select a target element within the entry.
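
The resolution step can be sketched as a lookup of typed headwords against the project's entries. This is an illustration only: the data shapes, the case-insensitive matching and the function name `resolve_cross_references` are assumptions, and the DPS performs this work server-side against its database.

```python
from collections import defaultdict

def resolve_cross_references(entries, references):
    """Classify each reference as resolved, ambiguous or unresolved.

    entries:    list of (entry_id, headword) pairs
    references: list of (source_id, target_headword) pairs
    """
    index = defaultdict(list)
    for entry_id, headword in entries:
        index[headword.lower()].append(entry_id)  # case-insensitive (assumed)
    report = []
    for source_id, target_headword in references:
        candidates = list(index.get(target_headword.lower(), []))
        if len(candidates) == 1:
            status = "resolved"
        elif candidates:
            status = "ambiguous"   # homonyms: a task manager must choose
        else:
            status = "unresolved"  # the target entry does not exist yet
        report.append((source_id, target_headword, status, candidates))
    return report

# Illustrative identifiers and headwords only.
entries = [(1, "bank"), (2, "bank"), (3, "Hamburger")]
references = [(10, "hamburger"), (11, "bank"), (12, "cheeseburger")]
for row in resolve_cross_references(entries, references):
    print(row)
```

The "ambiguous" rows correspond to the homonym cases in which the destination entry has to be selected by hand.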

An example of a cross reference source and target displayed simultaneously can be seen in the figure below.

Other technologies used within the system include Omnimark, Perl, XPP and other general MS Windows-based system tools, as can be seen in the following figure.

Multimedia Publishing – the next step.

As a further example of the DPS’s capability, Longman are directly exporting XML dictionary data to a new CD-ROM mastering production system, again developed by IDM. The new mastering system enables new dictionary CD-ROMs to contain many more features than have previously been found on Longman dictionary CD-ROMs.

For example, the Longman Dictionary of Contemporary English (LDOCE) CD-ROM contains 76,000 more dictionary examples than the equivalent paper dictionary; it also has 150,000 extra word combinations to help readers understand how words are used. Over 1 million extra sentences direct from the corpus show the reader more words and phrases in context than in previous Longman ELT CD-ROMs. Using the new DPS and CD-ROM mastering system, it has also been possible to include other Longman dictionary publications to further enhance the CD-ROM.

The CD-ROM now also incorporates natural language examples, based on real language from corpus research, to enable users to learn how to use the 3000 most frequently used words in both spoken and written English. Using links from the core XML database, words and phrases can be accurately highlighted, and the incorporated word frequency information shows users which words are the most important to learn.

Over 25,000 fixed phrases and word collocations show users how words work together, and so help develop fluency; the inclusion of over 1,500 illustrations also helps make the meanings clearer. LDOCE also has a built-in facility that lets users, at the click of a button or by typing in the relevant words, hear perfect pronunciations of the words in British English and link straight to the meaning of any word, as can be seen from this example:

3. Conclusion

As you will have seen, the new DPS presents itself as a combination of practical technical solutions built on today’s standards, and provides Longman with a technically 'future-proof' and versatile production platform for the next 10-plus years.

The project began in April 2000, was in production by May 2001, and delivered its first bilingual paper dictionary in July 2002, followed by its first CD-ROM in January 2003. The decision to invest in a new DPS was clearly a major strategic one.

The DPS has subsequently delivered significant increases in productivity, reduced production times, and many cost-saving business benefits to the Longman division of Pearson. Finally, the new DPS has placed Longman in a very strong position in today’s very competitive ELT marketplace.

Glossary

DPS

Dictionary Publishing System

ELT

English Language Training

IDM

Ingénierie Diffusion Multimédia s.a.

LDOCE

Longman Dictionary of Contemporary English

PDF

Portable Document Format

XPP

XML Professional Publisher

Biography

Mike McNamara has been involved in IT for over 30 years with a variety of software and hardware vendors, including over 15 years with XyEnterprise Inc. where he was Director of International Operations and was responsible for providing SGML/XML Content Management and Publishing production solutions to a wide variety of high profile UK and European commercial and aviation publishers.

Mike is a founder member of Araman Consulting Ltd., a UK-based XML consultancy promoting the adoption of XML content and XML-enabled applications across organisations of all sizes. Araman Consulting Ltd. provides a wide variety of XML services, including Content Management product needs analysis, Workflow & DTD analysis & creation, and system implementation & project delivery management.