With the growth of computer use, the invention of the World Wide Web, and the increased pace of scientific research, publishers realized that making research available electronically was vital both to researchers and to their own business interests.
Over the past decade, scholarly publishers have designed DTDs and implemented SGML systems. Despite efforts to create a common DTD for scholarly publishing, publishers have developed proprietary DTDs and SGML implementations that diverge from efforts toward standardization on a single DTD because publishers required solutions that best suited their business requirements.
The recognition that individual organizations within an industry segment may have unique business requirements is important for all parties involved in the design and implementation of SGML systems, especially those who work on content or data interchange projects.
We examine differences among the DTDs and SGML of ten publishers, explore some reasons for these differences, and discuss the issues they raise, especially when content must be transformed from one proprietary DTD to another.
Keywords: Transforming; Publishing; Mapping; Interoperability
From 1665, with the advent of Oldenburg's Philosophical Transactions, until recently, the primary method of scholarly communication was the printed journal [Eisenstein, 1979]. Easily read on paper and archived on long library shelves, the printed journal serves as the "social registry of scientific innovations" [Guédon, 2001].
With the growth of computer use, the invention of the World Wide Web, and the increased pace of scientific research, publishers realized that making research available electronically was vital both to researchers and to their own business interests. To fulfill this requirement, publishers of scholarly journals turned to two formats, SGML and PDF [Bide, 2000].
Many journal publishers recognized in the early days of SGML that because journal content is highly structured, consistently styled, and has unique archival requirements, SGML is ideally suited to encode it. These publishers were among the first to implement large-scale SGML systems. Over the past decade, scholarly publishers have designed DTDs and implemented SGML systems that take advantage of SGML's strengths [Owens, 1996].
Even before publishers were making daily use of SGML, there were efforts to create a common DTD for scholarly publishing by the Electronic Manuscript Project of the AAP [Association of American Publishers] [Goldfarb, 1993] and the ISO 12083 working group [NISO, 1995]. However, publishers have developed proprietary DTDs and SGML implementations that diverge from efforts toward standardization on a single DTD because publishers required solutions that best suited their business requirements.
In this paper, we will examine differences among the DTDs and SGML of ten publishers, explore some reasons for these differences, and discuss the issues they raise, especially when content must be transformed from one proprietary DTD to another. Finally, we will discuss some issues that arise when developing industry-standard and interchange DTDs.
Most journal publisher DTDs owe their legacy, directly or indirectly, to the work of the AAP from 1983 to 1987 (Z39.59), and the ISO 12083 Serial DTD that was derived from this work. The 12083 DTD was released as an ANSI standard in 1988 and became an ISO standard in 1993. The 12083 DTD was last updated in 1995 [NISO, 1995].
Table 1 shows the legacy of ten publishers' DTDs. In some cases, DTDs were directly derived from ISO 12083 and only minimal changes were made. In other cases, the DTD author(s) reviewed the structural foundation of 12083, but built a DTD from scratch using a unique set of elements and attributes.
Even those publishers who use 12083 with only minimal modification were unable to use it verbatim. In a 1998 survey conducted by the ISO 12083 working group, the comments of one publisher captured the feelings of many: "It is way too complicated, yet it is not flexible enough to represent the things I need to have in the journal I publish on the Internet" [Kennedy, 1998].
| DTD | Legacy |
|---|---|
| AIP | Derived from 12083 Serial DTD |
| BioOne | Derived from 12083 Serial DTD |
| Blackwell | Developed by Blackwell Science. Significant earlier versions are 2.2 and 3.0 |
| Elsevier | Developed by Elsevier Science. Earlier versions are 1.1.0, 2.1.1, 3.0.0, 4.1.0, and 4.2.0 |
| Highwire | Derived from Elsevier DTD 4.1 |
| IEEE | Derived from 12083 Serial DTD |
| Nature | First developed by Alden Press |
| PMC | Derived from the Keton Full Text DTD, which is based on the CJS [Cadmus Journal Services ] 2.1 DTD. The CJS DTD is derived from the Elsevier 3.0.0 DTD |
| UCP | Derived from AAP Article DTD Z39.59 with modifications based on 12083 Serial DTD |
| Wiley | Based on the Elsevier DTD 3.0.0, as modified from analysis of Wiley journals |
Publishers did not adopt ISO 12083 verbatim because it is too generalized. It was designed to work for everyone, but it does not really work for anyone. Every organization has specific needs that must be met in a DTD, and each publisher made changes (major or minor) to meet their requirements.
Most scholarly publishers started to archive their content in a structured form prior to the creation of XML. As a result, most publishers today use SGML DTDs rather than XML (see Table 2).
| DTD | Type | Definition | Version | Last Revised |
|---|---|---|---|---|
| AIP | SGML | DTD | 3.0.2 | August 14, 2001 |
| BioOne | SGML | DTD | 1.0.1 | October 16, 2000 |
| Blackwell | XML | DTD | 4.0 | October 2000 |
| Elsevier | SGML | DTD | 4.3.1 | April 2001 |
| Highwire | SGML | DTD | 4.2.14 | July 2001 |
| IEEE | SGML | DTD | 2.0 | February 2, 2000 |
| Nature | SGML | DTD | 3.29 | July 27, 2001 |
| PMC | XML | DTD | 1.13 | Sept 10, 2001 |
| UCP | SGML | DTD | Version 6 | Sept 19, 2001 |
| Wiley | SGML | DTD | 3.4 | July 10, 2000 |
Over time, we expect most publishers to switch to XML. This change will occur in part because there is a wider array of tools for XML than SGML. The time frame for the SGML to XML transition will vary by publisher.
We are not familiar with any publisher that plans to use a model other than a DTD. This may be in part because XML Schema only became a W3C recommendation in May 2001 and other DTD alternatives are still being developed (RELAX NG, Schematron, DSDL). The fact that the XML DTD models for CALS tables and MathML have not been converted to W3C XML Schema (or other models, e.g., RELAX NG and Schematron) in widely accepted form may also account for some lack of movement away from DTDs by journal publishers. Alex Brown [Brown, 2002] provides a more complete overview of the issues in the DTD versus alternatives discussion in his paper presented at XML Europe, 2002.
Please note that because most publishers still use SGML today, we use the term SGML generically to refer to content tagged in a markup language (either SGML or XML). In cases where the distinction is important, specific reference may be made to XML.
Some publishers create only PDF because PDF is inexpensive relative to SGML and can typically be produced as a by-product of the traditional print process. SGML and PDF each have technological advantages and disadvantages. Some key differences are:
Because PDF and SGML solve different problems, publishers often create both SGML and PDF manifestations of their content. PDF allows all files to be viewed with the same application, independent of the publisher that created the PDF file.
However, in SGML, the same content will be tagged differently by each publisher because each one uses a different DTD. While this strategy serves the needs of individual publishers, it creates a Tower of Babel rather than a consistently accessible repository when SGML is viewed by anyone outside of the publisher's organization. A unique application must be built to render each publisher's SGML.
Different philosophies arose as publishers began to design DTDs and implement SGML systems around their unique business and technical requirements.
SGML enables publishers to separate format instructions, which are often proprietary, and structural information by tagging content for semantic meaning rather than format. Format can then be applied to the structural elements with a style sheet when the content is rendered. Steve DeRose comments, "Strong separation of formatting from structure is the hallmark of good SGML use" [DeRose, 1997].
In an ideal SGML application, a complete separation of formatting and structure will be preserved. In journal publishing, the degree of separation varies by publisher. To discuss these differences, we need to define two terms, generated text and boilerplate text.
We define generated text as inconsequential, formulaic, or stereotypical text and formatting omitted from an SGML file, which is applied to content by a style sheet when an SGML file is rendered. The style sheet generates text based on the structure information provided by the markup elements and attributes. For example, generated text includes spacing, punctuation and face markup (e.g., emphasis in the form of bold, italic, etc.) added with a style sheet to the presentation when the <author> element is rendered from SGML according to the 12083 DTD.
Boilerplate text is inconsequential, formulaic, or stereotypical text and formatting that has been included in the SGML file even though it could have been omitted.
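The distinction between the two definitions can be illustrated with a minimal Python sketch. It assumes hypothetical element names (`authors`, `author`, `fname`, `surname`) and a hypothetical house style (comma-separated names with a final "and"); none of these come from any surveyed DTD. The first function plays the role of a style sheet that generates text, and the second simply reads back boilerplate stored in the file.

```python
import xml.etree.ElementTree as ET

# Structure-only markup: no punctuation or connecting text is stored.
# All element names here are hypothetical, chosen for illustration only.
STRUCTURED = (
    "<authors>"
    "<author><fname>A.</fname><surname>Smith</surname></author>"
    "<author><fname>B.</fname><surname>Jones</surname></author>"
    "<author><fname>C.</fname><surname>Lee</surname></author>"
    "</authors>"
)

# Boilerplate markup: the same punctuation is stored in the file itself.
BOILERPLATE = (
    "<authors>"
    "<author>A. Smith</author>, <author>B. Jones</author>, and "
    "<author>C. Lee</author>"
    "</authors>"
)


def render_generated(xml_text: str) -> str:
    """Act as a tiny style sheet: supply spaces, commas, and 'and'."""
    root = ET.fromstring(xml_text)
    names = [
        f"{a.findtext('fname')} {a.findtext('surname')}"
        for a in root.findall("author")
    ]
    if len(names) <= 1:
        return "".join(names)
    if len(names) == 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def render_boilerplate(xml_text: str) -> str:
    """No rules needed: concatenate the text exactly as stored."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    print(render_generated(STRUCTURED))
    print(render_boilerplate(BOILERPLATE))
```

Both calls yield the same printed line, but only the first depends on rules held outside the document; changing the house style means editing `render_generated` in one case and every stored file in the other.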
Some publishers, such as Elsevier, follow DeRose's philosophy:
In order to separate structure and presentation one applies the concept of generic markup: generic codes (or tags) are placed around most – or all – elements in a document. These elements could be a paragraph, a title, an abstract etc. The tags usually indicate the structure of the document. They do not indicate the style or format of the document, such as fonts, column widths etc. For each different style a style sheet is required to translate the logical structure into a presentation on paper, for example. The set of tags and their mutual relations comprise the ‘generic markup language’ [Poppelier, et al., 1997].
SGML created according to Elsevier standards excludes almost all boilerplate text and face markup. To render content and apply generated text, a sophisticated style sheet, which is separate from the SGML document instance, is applied to the SGML content. This separation allows the style of presentation to be modified easily, meeting a key goal of Elsevier's electronic workflow requirements. However, because Elsevier does not archive style sheets with SGML files, the style information must be recreated to render SGML in a new environment.
Other publishers, such as the University of Chicago Press, take a different view on the issue of generated text:
Our overriding concern in our SGML implementation was to accurately preserve the entire text as published.
As an example of this design philosophy at work, consider the issue of generated text. Many DTDs, including ISO 12083, either assume or allow for the possibility that the formatting system will generate text such as counters, labels, or the punctuation and connecting text around a list of author names. However, if one uses generated text, then one must also archive the generation rules with the text in order to accurately recreate the original text. We know from experience that journal styles evolve over time; it seemed to us a much better solution to dispense with generated text entirely [Owens, 1996].
The University of Chicago Press approach shows a keen insight into the problems faced by archivists such as libraries. By including more boilerplate text and format information in the SGML file and avoiding generated text, the print version of the article can more easily and faithfully be reproduced from an archive. Blackwell, in the context of explaining their format for structured references, gives further insight into this issue:
Many SGML and XML DTDs consider punctuation to be generated text i.e., the punctuation required is generated by stylesheet rules and is not stored in the document. The disadvantages of relying on stylesheet rules to create generated text are:
1. The XML document is no longer a 'standalone' document
2. The generation rules need to be stored along with the document throughout its life
3. The document cannot be read without applying a process which applies the punctuation rules to the XML document
4. It can be quite inefficient if different rules/templates have to be created to reflect differing punctuation styles across a store of documents
Storing the generated text in an x element in the XML document means that the XML fragment can easily be converted to, for example, simple text, a typesetting format or HTML without the need for complicated templates or rules. If existing rules for generating punctuation are already in place, or if more 'abstract' XML is required, then the contents of the <x> element can be ignored and the rules applied [Blackwell Publishing, 2001].
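The two readings Blackwell describes can be sketched in a few lines of Python. Only the `x` element name is taken from the quotation; the sample reference and every other element name (`ref`, `author`, `year`, `title`) are assumptions made for illustration and are not the Blackwell DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference fragment. Only <x> comes from the quoted text;
# the remaining element names are illustrative assumptions.
REFERENCE = (
    "<ref><author>Smith</author><x>, </x><author>Jones</author><x> (</x>"
    "<year>2001</year><x>) </x><title>An example title</title><x>.</x></ref>"
)


def to_plain_text(xml_text: str) -> str:
    """Standalone reading: keep the stored punctuation, drop only the tags."""
    return "".join(ET.fromstring(xml_text).itertext())


def to_abstract_fields(xml_text: str) -> list:
    """'Abstract' reading: skip <x> so that external rules can punctuate."""
    root = ET.fromstring(xml_text)
    return [(child.tag, child.text or "") for child in root if child.tag != "x"]


if __name__ == "__main__":
    print(to_plain_text(REFERENCE))
    print(to_abstract_fields(REFERENCE))
```

The plain-text conversion needs no template at all, while the second function recovers the purely structural view by discarding the `x` content.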
The range of boilerplate text and formatting publishers include in SGML varies widely. However, inclusion or exclusion is based on publisher policy rather than DTD design because DTDs are not powerful enough to enforce most of these policies. In many cases, even schemas may not be able to enforce them.
Even if policies for structure/format separation could be enforced in a DTD or schema, a tightly defined structural framework would become unnecessarily restrictive for some publishers:
A narrowly targeted DTD can enforce some of these restrictions, but a broadly targeted one cannot, since it must be adaptable to different house styles if required [Megginson, 1998].
The problem is multiplied when a DTD must be broadly targeted to accommodate different house styles:
The problem is much more complicated than simply choosing names: authors at the two companies are accustomed to thinking about the structure of their information in very different ways, and a DTD that is well suited to one will work very poorly for another. If you are a DTD designer, there are four broad approaches that you might chose in this situation:
1. you can create a DTD that uses a new, neutral structure, different from either of the existing ones
2. you can impose one of the two structures arbitrarily
3. you can create a DTD that allows either of the two structures as alternatives
4. you can create a less-restrictive DTD that can be adapted to any appropriate structure
Most industry-standard DTDs use the fourth method more often than the others, because the DTDs need to be useful for a wide range of applications within a single industry; but, as a result, the DTDs provide a lower level of guidance for authors, validation for processing, and context-sensitivity for searching [Megginson, 1998].
Elsevier Science production specifications require most journals to apply one of five standard style sheets to all content. As a reflection of this production model, the Elsevier DTD is somewhat restrictive. Content must conform to Elsevier's structural requirements so that it can be rendered in one of the standard styles. Content that does not conform must be changed so that it conforms to the DTD.
Blackwell Science takes a different approach. Individual journals are permitted to retain their unique styles. In order to accommodate these variations, the DTD is correspondingly less restrictive than the Elsevier DTD, and a greater portion of boilerplate text and formatting is retained in XML files.
The University of Chicago Press chose to retain all boilerplate text and formatting in SGML files. This was a conscious decision driven in part by their requirements to archive all text, exactly as it appeared in print, without the need to archive an accompanying style sheet. As a result, the University of Chicago Press DTD is the least restrictive of these three.
These differences in DTD structure are not necessarily dictated by technical choice. The decisions have been made for pragmatic rather than philosophical reasons and are often driven by publishers' production requirements and business imperatives. In effect, each journal publisher has incorporated its specific production and business requirements into its DTD structure and its SGML files.
All of the DTDs we surveyed use a similar high-level structural approach for major article elements, in part because of the shared AAP/12083 heritage. However, the implementation details of SGML document instances vary widely. The largest variance is found in items that have the greatest granularity, such as article history and reference citations. Often the differences revolve around the use of boilerplate versus generated text and formatting. The following sections highlight varying methods of different publishers to tag the same textual content.
The degree of granularity that is required by publishers for certain similar structural elements varies significantly. Article history (which typically includes the received, revised, and accepted dates for the manuscript) is an example where publishers differ widely on how the content should be tagged.
The publishers we surveyed use several models to tag the article history:
Each of these models represents similar data in different ways, based on the requirements for how each publisher will use the data.
Figure numbers, which appear at the start of figure captions, are typical of the range of implementations for generated text. Table 3 shows the differences in tagging figure numbers by surveyed publishers.
| Publisher | Print Example | Corresponding SGML |
|---|---|---|
| AIP | FIG. 1. | <figgrp id="F1"> |
| BioOne | Fig. 1. | <TITLE>Fig. 1. Two recent…</TITLE> |
| Blackwell | Figure 1. | <num id="leg-f1">Figure 1.  </num> |
| Elsevier | Figure 1. | <no>Figure 1</no> |
| Highwire | Figure 1. | <no><b>Figure 1.</b> </no> |
| IEEE | Fig. 1. | <title just="just" autonum="off">Fig. 1. The…</title> |
| Nature | Fig. 1 | <fig id="f1" entname="figf1"> |
| PMC | Figure 1 | <title><p>Figure 1</p></title> |
| UCP | Figure 1: | <LABEL>Figure </LABEL><NO>1: </NO> |
| Wiley | Figure 1. | <FIG ID="fig1" LOC="FLOAT"><GRAPHIC NAME="fig001"></GRAPHIC><NUMBER>1</NUMBER> |
Several different approaches have been used:
The result in this case is that ten different publishers have ten different ways to tag substantially identical content for a relatively simple element.
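To give a sense of what this variety means for a transformation, the following Python sketch recovers the figure number from several of the fragments in Table 3. It is an illustration under stated assumptions, not any publisher's conversion code: it treats the first integer in the element text as the number, and, where the number is generated rather than stored (AIP, Nature), it assumes the number can be read from the trailing digits of the `id` attribute. Regular expressions are used because the fragments are SGML and not necessarily well-formed XML.

```python
import re
from typing import Optional

TAG = re.compile(r"<[^>]+>")
FIRST_INT = re.compile(r"\d+")
# Assumption: when no number is stored as text, the id attribute ends in it.
ID_ATTR = re.compile(r'\bid="[A-Za-z-]*(\d+)"', re.IGNORECASE)

# Fragments copied from Table 3.
SAMPLES = {
    "AIP": '<figgrp id="F1">',
    "BioOne": "<TITLE>Fig. 1. Two recent…</TITLE>",
    "Blackwell": '<num id="leg-f1">Figure 1.  </num>',
    "Elsevier": "<no>Figure 1</no>",
    "Nature": '<fig id="f1" entname="figf1">',
    "UCP": "<LABEL>Figure </LABEL><NO>1: </NO>",
    "Wiley": '<FIG ID="fig1" LOC="FLOAT"><GRAPHIC NAME="fig001">'
             "</GRAPHIC><NUMBER>1</NUMBER>",
}


def figure_number(fragment: str) -> Optional[int]:
    """Return the figure number carried by a publisher-specific fragment."""
    text = TAG.sub("", fragment)          # keep only the stored text
    match = FIRST_INT.search(text)
    if match:                             # number stored as boilerplate text
        return int(match.group())
    match = ID_ATTR.search(fragment)      # number must be generated
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for publisher, fragment in SAMPLES.items():
        print(publisher, figure_number(fragment))
```

Even this small normalizer needs a fallback rule for the publishers that generate the label, which is the kind of per-DTD knowledge a mapping between proprietary DTDs has to encode.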
The variances in implementations become greater and more complex with a review of citation links. Table 4 illustrates the variety of tagging used for numbered ("Vancouver" style) citations by surveyed publishers.1
| Publisher | Citation | SGML Example |
|---|---|---|
| AIP | superscript.1,6 superscript.3–5 | superscript.<citeref rid="r1" style="superior">1</citeref> <citeref rid="r6" style="superior">6</citeref> superscript.<citeref rid="r3" style="superior">3</citeref> <citeref rid="r4" style="superior">4</citeref><citeref rid="r5" style="superior">5</citeref> |
| BioOne | (1,6) (3–5) | <CITEREF RID="i0031-8655-071-01-0001-b1">(1,6)</CITEREF> <CITEREF RID="i0031-8655-071-01-0001-b3">(3–5)</CITEREF> |
| Blackwell | [1,6] [3–5] | [<link rid="b1 b6">1,6</link>] [<link rid="b3 b4 b5">3–5</link>] |
| Elsevier | [1,6] [3–5] | <cross-ref refid="bib1 bib6">[1,6]</cross-ref> <cross-ref refid="bib3 bib4 bib5">[3–5]</cross-ref> |
| Highwire | (1, 6) (3–5) | (<cross-ref refid="bib1" type="bib">1</cross-ref>, <cross-ref refid="bib6" type="bib">6</cross-ref>) (<cross-ref refid="bib3" type="bib">3</cross-ref>– <cross-ref refid="bib5" type="bib">5</cross-ref>) |
| IEEE | [1], [6] [3]–[5] | <citegrp><citeref rid="ref1" type="ref"></citeref></citegrp>, <citegrp><citeref rid="ref6" type="ref"></citeref></citegrp> <citegrp><citeref rid="ref3" type="ref"></citeref><citeref rid="ref4" type="ref"></citeref><citeref rid="ref5" type="ref"></citeref></citegrp> |
| Nature | superscript1,6. superscript3–5. | superscript<bibr rid="b1 b6">. superscript<bibr rid="b3 b4 b5">. |
| PMC a | [1,6] [3-5] | [<abbr bid="B1">1</abbr>,<abbr bid="B6">6</abbr>] [<abbr bid="B3">3</abbr>-<abbr bid="B5">5</abbr>] |
| UCP | [1,6] [3–5] | [<CITEREF RID="rf1">1</CITEREF>,<CITEREF RID="rf6">6</CITEREF>] [<CITEREF RID="rf3">3</CITEREF><CITEREF RID="rf4"></CITEREF>–<CITEREF RID="rf5">5</CITEREF>] |
| Wiley | [1,6] [3–5] | <BIBR HREF="bib1">1</BIBR><BIBR HREF="bib6">6</BIBR> <BIBR HREF="bib3">3–5</BIBR> |
| a PMC allows ranges to be tagged as shown above or as [<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>,<abbr bid="B5">5</abbr>]. The style is determined by the publisher who submits SGML content to PMC | ||
Several points should be noted in the tagging of citations:
The approach taken by Highwire and PMC is designed to allow easy rendering while retaining the general look of the original text. Both exclude links to the middle references in the range. The resulting user interface in HTML is simplified by creating hyperlinks for the first and last numbers as hyperlink text.
This approach causes no loss of significant functionality when linking to a reference section because references in the middle of the range are easily accessible if the reader clicks on the first or last number. However, if links to middle objects of the range are excluded when this model is used for figure or table citations (e.g., "see Figures 1-4") it may be difficult or impossible to link to some objects.
Some publishers recommend a more sophisticated rendering engine that presents a drop-down menu with a list of possible targets [Pepping & Schrauwen, 2001, pp. 234-235]. A model where text and links to all objects are preserved, similar to that used by Blackwell and Elsevier, facilitates this rendering style. However, this model requires the use of IDREFS in the DTD rather than IDREF. IDREFS may allow SGML that is closer to the printed text, but IDREF is easier to render when converting the SGML to simple HTML web pages.
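The rendering trade-off between IDREF and IDREFS can be sketched in Python as follows. The input uses the `cross-ref` and `refid` names shown in Table 4; the HTML produced (a `span` with class `cite` wrapping a list of targets) is a hypothetical stand-in for a drop-down menu and is not taken from any publisher's rendering engine.

```python
import re
from html import escape

# Matches cross-references of the form shown in Table 4.
CROSS_REF = re.compile(r'<cross-ref refid="([^"]+)"[^>]*>(.*?)</cross-ref>')


def idref_link(rid: str, text: str) -> str:
    """IDREF: one target, so the citation maps directly to one hyperlink."""
    return f'<a href="#{escape(rid, quote=True)}">{escape(text)}</a>'


def idrefs_menu(rids: list, text: str) -> str:
    """IDREFS: one printed label, plus one menu entry per target."""
    items = "".join(
        f'<li><a href="#{escape(r, quote=True)}">{escape(r)}</a></li>'
        for r in rids
    )
    return f'<span class="cite">{escape(text)}<ul>{items}</ul></span>'


def render(sgml: str) -> str:
    """Replace each cross-reference with HTML suited to its target count."""
    def replace(match: re.Match) -> str:
        rids = match.group(1).split()
        text = match.group(2)
        if len(rids) == 1:
            return idref_link(rids[0], text)
        return idrefs_menu(rids, text)

    return CROSS_REF.sub(replace, sgml)


if __name__ == "__main__":
    print(render('(<cross-ref refid="bib1" type="bib">1</cross-ref>)'))
    print(render('<cross-ref refid="bib3 bib4 bib5">[3–5]</cross-ref>'))
```

The single-target case is a one-line substitution, whereas the multi-target case forces a decision about how several destinations share one piece of printed text.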
The devil can cite Scripture for his purpose.2
Without doubt, structuring reference sections is one of the most difficult aspects of SGML production for journal publishers. Table 5 summarizes the types of references for which different publishers create structured SGML.
| DTD | Journal | Book | Conf | Report | Patent | Thesis | Standard | E-ref d | |
|---|---|---|---|---|---|---|---|---|---|
| AIP | F | F | F | F | N | N | N | F | |
| BioOne | F | F | F | N | N | N | N | F | |
| Blackwell | F | F | N | N | N | F | N | F | |
| Elsevier | F | F | F | N | N | N | N | F | |
| Highwire | P a | N | N | N | N | N | N | F | |
| IEEE | F | F | F | F | F | F | F | F | |
| Nature | F | F | N | N | N | N | N | F | |
| PMC | F | F b | F b | N | N | N | N | F | |
| UCP | P c | N | N | N | N | N | N | F | |
| Wiley | F | F | N | N | F | N | F | F | |
| Key: F: Full structure support; N: No structure support. Reference is unstructured text inside a single SGML element; P: Partial structure support | |||||||||
| a The Highwire DTD has elements to tag the journal title, volume, date and first page. Other reference components are not tagged because they are not required for reference linking to databases today. | |||||||||
| b PMC allows fully structured book and edited book references. Some publishers, however, may only partially tag the reference content in their own DTD. The result, when converted to the PMC DTD, may be a partially tagged reference. For example: <bibl id="B12"><title><p>Arenaviruses (Chapter 50).</p></title> <aug><au ca="no" ce="no"><snm>Peters</snm><fnm>CJ</fnm></au> <au ca="no" ce="no"><snm>Buchmeier</snm><fnm>MJ</fnm></au> <au ca="no" ce="no"><snm>Rollin</snm><fnm>PE</fnm></au> <au ca="no" ce="no"><snm>Ksiazek</snm><fnm>TG</fnm></au></aug> <source>In Virology; third Edition. Edited by B.N. Fields, et al. Lippincott-Raven,</source> <pubdate>1996</pubdate> <fpage>1521</fpage><lpage>1551</lpage></bibl> In order to establish a link from this data, it would be necessary to sub-parse the <source> data: book name, edition, editors, and publisher. | |||||||||
| c UCP does not include structural sub-elements in references. Reference linking information to a variety of databases (e.g., ADS, Medline) is stored in reference attributes. When queried, UCP replied they could tag sub-elements based on pattern matching to produce archive SGML. | |||||||||
| d All surveyed DTDs support tagging of external links in references, allowing the link to be completed if the data is correctly tagged. However, in the case of electronic references, many of the other elements may not be fully structured due to the variety of styles used by authors in citing web resources. As a result, E-refs are fully structured in some cases and partially structured in others. | |||||||||
Typically, journal references have a somewhat predictable structure. Most publishers tag these references for production purposes, including the ability to establish external links. However, the tagging approach varies in several aspects from publisher to publisher:
Most of these structuring approaches would also apply to non-journal references if those references were parsed. Many publishers do not structure non-journal references because they are more difficult to parse than journal references, they appear less frequently, and databases to which these references can be linked are less common. As a result, some publishers do not include support for non-journal references in their DTDs (see Table 5).
A closer examination reveals that the DTD structures and SGML tagging are related to each publisher's production processes and business requirements.
The Highwire DTD only has elements to structure journal references and the model only includes those elements required to link citations to articles or abstracts. Because Highwire only establishes journal links, this minimalist tagging satisfies their requirements without placing any extra burden on those who create the SGML. Because Highwire does not create links to non-journal content when they place articles online, they have chosen not to include tags to structure these references in their DTD.
The University of Chicago Press takes a different minimalist approach. Reference attributes contain link IDs to databases such as PubMed. However, within the reference text, only face markup elements are included, not semantic markup elements. During SGML creation, the University of Chicago Press parses each journal reference automatically to look up database link attribute values, but then the granular information is discarded. This decision was made in part because the editorial staff does SGML-based copy editing, and it was decided that editing references would be more difficult if the editorial staff had to edit heavily tagged content.
With one exception, all surveyed DTDs use either the CALS or Elsevier table models, although some publishers still handle tables as graphics rather than as full-text SGML.3 However, all publishers support scanned images for the table body because some tables are too complex to be captured in SGML. Table 6 summarizes table models used by different publishers.
| DTD | Model |
|---|---|
| AIP | CALS |
| BioOne | CALS |
| Blackwell | CALS |
| Elsevier | Elsevier 4.0 |
| Highwire | Elsevier 4.0, although some features have been removed. Highwire also allows submission of CALS tables even though the CALS table model is not part of the Highwire DTD |
| IEEE | ArborText CALS, although by convention IEEE includes only the table title in SGML, and the body of the table is always scanned. |
| Nature | CALS |
| PMC | Elsevier 3.0 |
| UCP | ArborText table model |
| Wiley | CALS (OASIS version) a |
| a The most significant difference between the OASIS CALS model used in the Wiley DTD and the CALS adaptation used by AIP, BioOne, Blackwell and Nature is that table footnotes in the latter DTDs are handled with an element <tfoot> that is part of <tgroup>. Wiley handles table footnotes in a manner similar to regular article footnotes. | |
The CALS model is based on the MIL-M-38784B 910201 DTD originally developed for the US Department of Defense. Over the years a large number of organizations have adopted it. OASIS adopted a simplified version of the SGML CALS table model in the mid-1990s and later modified it for XML use. The OASIS version was created based in part on polling software vendors about which features they supported and potential users about which features they most needed. Because the CALS table model has been widely adopted, a significant number of SGML applications have built-in support for it.
The Elsevier table model was introduced in DTD 3.0. It was modified in DTD 4.1 to handle certain complex tables and table embellishments that were unsupported in the earlier DTD.
The Elsevier and CALS table models can structure most tables found in journal articles. The models are not completely parallel, so some structures that can be tagged in the Elsevier model (e.g., multiple alignment points within a single cell) may be difficult to replicate in the CALS model.
In some cases, however, neither model can adequately represent a table. The most common case is when a graphic appears within a table, and the alignment of surrounding cells to specific parts of the graphic must be carefully set up (such alignment typically requires typographic commands that are the antithesis of good SGML markup and are therefore unlikely to be supported in a DTD). In such cases, most publishers recommend that the table content be incorporated in the SGML as a scanned image rather than tagged SGML. When tables are scanned, most publishers follow the protocol of tagging the table number and table caption in SGML and scanning the table body (including heading cells and table footnotes).
All of the table models support left, right and center alignment of text in table cells. Most DTDs support alignment with a specific character (sometimes called "decimal alignment" because it's most commonly used to align a column of numbers by placing the decimal points in a vertical line).
The PMC DTD and early versions of the Highwire DTD do not support character alignment because HTML lacks this capability. This DTD design decision made by Highwire and PMC directly reflects their business requirements, which are focused on the online presentation of content based on today's technologies.
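For readers unfamiliar with character alignment, the following Python sketch shows what a renderer has to do to honor it. It is a plain-text illustration only; the function name and the choice of padding with spaces are assumptions and do not correspond to any table model's processing rules.

```python
def align_on_char(cells: list, char: str = ".") -> list:
    """Pad cells so the first occurrence of `char` lines up vertically."""
    if not cells:
        return []
    # Split each cell into the part before the alignment character and the
    # part from the alignment character onward (empty if it is absent).
    parts = []
    for cell in cells:
        head, sep, tail = cell.partition(char)
        parts.append((head, sep + tail))
    left = max(len(head) for head, _ in parts)
    right = max(len(rest) for _, rest in parts)
    return [head.rjust(left) + rest.ljust(right) for head, rest in parts]


if __name__ == "__main__":
    for line in align_on_char(["1.5", "12.25", "100", "0.125"]):
        print("|" + line + "|")
```

The renderer must inspect every cell in the column before it can place any of them, which is more than a per-cell left, right, or center attribute requires.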
Published math appears in two basic forms: inline math and display math. Inline math describes simple equations that appear in the running text flow. Display math describes more complex equations that appear on their own line or in their own visual block.
Most inline math is quite simple (e.g., "x² + y² = z²"). It can usually be represented with Unicode character values and face markup (italic, bold, superscript and subscript). Many publishers do not tag such equations as mathematical expressions, although some publishers (Elsevier) request that these expressions be tagged.
In a few cases, inline math may be more complicated. For example, stacked elements, a simple summation, or a radical can appear inline. In these cases, the equation must be tagged using SGML markup rather than simple face markup.
All surveyed DTDs include a model for encoding display math and complex inline math with a stream of text commands.4 Four primary encoding models are used by the surveyed publishers: 12083, Elsevier, MathML, and TeX.5 These models are summarized in Table 7.
| DTD | Model |
|---|---|
| AIP | ISO12083:1993 |
| BioOne | ISO12083:1993 |
| Blackwell | MathML (W3C, 7 April 1998) |
| Elsevier | Elsevier Math |
| Highwire | Elsevier Math |
| IEEE | TeX |
| Nature | ISO12083:1994 |
| PMC | TeX |
| UCP | Based on ArborText's implementation of AAP Math with further changes by UCP. |
| Wiley | TeX or LaTeX in external file |
Because 12083, AAP and Elsevier math are structural cousins, there are actually three primary math models: 12083, MathML, and TeX. In addition, most publishers (but not all, e.g., UCP) support scanned images for math because some equations may be too complex to be captured in SGML.
AAP Math is the original foundation of SGML math markup for most journal publication [van Herwijnen, 1993]. 12083 math, although not directly derived from the AAP DTD, was developed in part based on a review of the AAP DTD. Elsevier's math model is more closely related to AAP although certain semantic constructions, such as explicit integrals and products, have been dropped.
MathML is a newer math model, developed for XML rather than SGML. Unlike 12083 math, which is strictly concerned with presentation markup, MathML can be used for presentation or content markup. "The intent of the content markup in the Mathematical Markup Language is to provide an explicit encoding of the underlying mathematical structure of an expression, rather than any particular rendering for the expression" [Diaz, et al., 2001].
MathML first became a W3C recommendation in April 1998, and version 2.0 became a W3C recommendation on February 21, 2001. Most surveyed publishers do not use MathML because they developed their DTDs prior to the original MathML recommendation, and they have not converted their DTDs to XML. Only one surveyed publisher uses MathML (Blackwell). Several publishers have indicated their long-term intent to migrate from 12083 math to MathML [Pepping & Schrauwen, 2001].
Neither 12083 math nor MathML can be natively displayed in most current browsers.6 As a result, when publishers prepare full-text SGML for online presentation, the equations are converted to an image, usually in GIF format.
TeX [TeX Users Group, 2000] is a powerful ASCII-coded typesetting system created by Donald Knuth of Stanford University in 1981. Leslie Lamport developed LaTeX [LaTeX Project, 2000], a 'dialect' of TeX, in 1985. LaTeX is particularly suited to the production of long articles and books, since it has facilities for the automatic numbering of chapters, sections, theorems, equations, etc., and also has facilities for cross-referencing.
Because TeX and LaTeX have been around so long, and because they provide tremendous typesetting facilities for complex expressions, they are widely used in the mathematics and physics communities. Many publishers have chosen to retain math in TeX or LaTeX rather than convert it to SGML. In addition, some publishers have chosen TeX rather than SGML because all of the subtle presentational nuances of an equation can be completely preserved when the equation is rendered, nuances that might require processing instructions if the equation were tagged in SGML.
There are many tools that convert TeX to GIF images for web presentation. These tools have encouraged some publishers to stick with TeX and bypass SGML for tagging of math. In fact, some organizations prepare SGML for the web by converting SGML equations to TeX and then using a TeX to GIF converter to create a graphic file for each equation.
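The SGML-to-TeX step of such a pipeline can be pictured with a minimal sketch. It assumes the math has already been made available as well-formed XML, and the element names (`math`, `sup`, `sub`, `frac`, `num`, `den`, `sqrt`) are hypothetical illustrations rather than names taken from any surveyed DTD. It covers only a few presentation constructs and does not perform the subsequent TeX-to-GIF rendering.

```python
import xml.etree.ElementTree as ET


def inner(node):
    """Convert an element's mixed content (text plus children) to TeX."""
    parts = [node.text or ""]
    for child in node:
        parts.append(to_tex(child))
        parts.append(child.tail or "")  # text that follows the child
    return "".join(parts).strip()


def to_tex(node):
    """Map a few hypothetical presentation elements to TeX commands."""
    if node.tag == "frac":
        return r"\frac{%s}{%s}" % (inner(node.find("num")), inner(node.find("den")))
    if node.tag == "sup":
        return "^{%s}" % inner(node)
    if node.tag == "sub":
        return "_{%s}" % inner(node)
    if node.tag == "sqrt":
        return r"\sqrt{%s}" % inner(node)
    # Wrapper or unrecognized elements: pass their content through unchanged.
    return inner(node)


equation = ET.fromstring(
    "<math>x<sup>2</sup> + <frac><num>1</num><den>y</den></frac></math>"
)
print(to_tex(equation))  # x^{2} + \frac{1}{y}
```

A production converter would need a rule for every element in the source math model, which is one reason the choice of math model matters so much for interchange.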
Most surveyed publishers also permit equations to be captured as scanned images rather than SGML or TeX encoding. While the scanned image route is usually intended for equations that cannot be captured with the available encoding, some SGML suppliers use scanned images for all equations because they are easier to create than text-encoded equations. This approach preserves the exact visual appearance of an equation, but it precludes the possibility of re-formatting equations at a later date.
If publishers worked completely in isolation and never had to share their SGML files with anyone outside of their own organization, the differences in DTDs and implementation practices would not be important. However, journal publishers do share their SGML files with other organizations, most commonly with content aggregators and content archives. In some cases, the aggregators and archivists can be one and the same organization.
In this section, we examine issues encountered by Highwire Press at Stanford University in their work with different DTDs. The experience of Highwire Press over the past six years is typical of that of any organization that must take SGML designed and implemented for the internal needs of the supplying organization and reuse it under a different set of requirements.7
Highwire Press started to place full-text journal content online in May 1995 with the Journal of Biological Chemistry. Initially, Highwire worked with providers of SGML to develop or modify DTDs that would allow online presentation and full-text indexing. For each DTD, Highwire built a custom parser that converted the SGML to HTML for online presentation.
After several years of building custom parsers for SGML to HTML transformation, Highwire developed their own DTD. Ideally, all content is delivered in this DTD; however, if a customer already creates content in another DTD, Highwire converts the customer's content to the Highwire Press DTD.8
The Highwire DTD was designed to satisfy the following key goals:
Highwire derived their DTD from the Elsevier 4.1.0 DTD.9 The DTD was changed to address the specific requirements that Highwire faced in placing content online. They simplified some element models that did not affect online presentation and linking while they added other elements that aided in tracking articles and creating links to related content. The changes made include:
All content delivered to Highwire is now converted into the Highwire DTD,10 and Highwire encourages new customers without a DTD to adopt the Highwire DTD from the start.
This new approach for Highwire has generally been successful; however, it has not been without problems. The following sections examine some of the issues Highwire has faced.
When a journal first submits SGML to Highwire, a formal validation process is conducted. At least two issues of the journal receive careful proofing. Problems are reported to the publisher, and Highwire works directly with the publisher or SGML provider to facilitate changes that may be required in their SGML production processes. The goal of validation is to ensure that deliveries will be consistent and correct.
Tables and math receive special attention during the startup process. Tables, which may be submitted in either CALS or Elsevier format, sometimes have problems with column headers and spanning cells. Graphics embedded in tables may cause alignment problems, and Highwire recommends scanning those tables with graphics that require alignment to specific cells. In some situations, publishers submit all tables as scans rather than full-text SGML. Highwire discourages this practice because the scan files are large, slow to load in browsers, and cannot be indexed for searching.
Unfortunately, HTML has more limited presentation capabilities than CALS or Elsevier SGML. For example, decimal or character alignment is unsupported in HTML. As a result, Highwire sometimes sacrifices a degree of table formatting in online presentation. Their primary goal is to ensure the table is readable.
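To make concrete what is lost, the following sketch shows what character alignment on the decimal point means for a column of values. It is an illustration only: the function name `align_on_decimal` is hypothetical, the padding works only in a fixed-width font, and it is not presented as a technique that Highwire or any surveyed publisher uses.

```python
def align_on_decimal(values):
    """Pad numeric strings so that their decimal points share one column."""
    split = [v.partition(".") for v in values]
    left = max(len(whole) for whole, _, _ in split)
    right = max(len(frac) for _, _, frac in split)
    rows = []
    for whole, dot, frac in split:
        # Values without a decimal point get a space in the point's column.
        rows.append(whole.rjust(left) + (dot or " ") + frac.ljust(right))
    return rows


for row in align_on_decimal(["1.5", "12.25", "100"]):
    print(repr(row))
```

CALS and Elsevier table models can request this alignment declaratively, whereas the HTML of the time offered no equivalent, so a converter must either emulate it or drop it.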
Math is usually tagged according to the Elsevier or MathML DTDs, or encoded as TeX. Because most browsers are unable to render math, the equations are converted to GIF images. Highwire converts all equations from their SGML format to LaTeX, and then converts the LaTeX to GIF images. Sometimes math presents formatting problems, especially in long equations that require line breaks. As a result, some equations receive hand massaging even during the regular production process.
When validation is complete and all problems have been resolved, Highwire moves the journal to a more automated process for regular issues. The importance of this validation process for quality of the final presentation cannot be overstated. Without this step, the overall quality of journals hosted by Highwire would be much lower.
Highwire faces a number of challenges when converting content from other DTDs to the Highwire DTD. Some of these challenges include:
All of the problems discussed up to this point would exist even in a perfect, high-quality SGML production system. Unfortunately, real-world production of SGML has shown that consistency and high quality are not always achieved [Bide, 2000].
Highwire has found several kinds of quality and consistency challenges:
Minimization of quality and consistency problems requires startup validation and ongoing quality assurance processes. Because content sent to Highwire is put online almost immediately, most of these problems are caught at some stage of the production process. Furthermore, because Highwire is organizationally focused on quality and customer satisfaction, they have created feedback loops to customers to ensure that systemic problems with SGML quality are addressed.
One of the most compelling reasons for electronic journals is linking. The ability to rapidly follow hyperlinks through research materials allows new forms of discovery. For this reason, Highwire pays special attention to creation of links when placing journals online.
The SGML Highwire has received from customers shows that there are consistent issues with linking:
Highwire tries to compensate for incorrect or missing link information whenever possible. However, they deliberately try to undershoot rather than overshoot when automatically creating links because they believe it's better to miss a link than to create an incorrect link. In some cases, they limit their regular expression pattern matching to specific sections of the document. For example, they only search for untagged email addresses in the front matter of an article.
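The section-limited matching described above can be sketched as follows. This is a minimal illustration, not Highwire's implementation: the pattern, the function names, and the representation of an article as separate front-matter and body strings are all assumptions. The pattern is deliberately conservative, in keeping with the preference for missing a link over creating a wrong one.

```python
import re

# Conservative pattern: a local part, "@", and a dotted domain name.
# A trailing sentence period is left outside the match.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def link_emails(text):
    """Wrap each untagged email address in a mailto link."""
    return EMAIL.sub(
        lambda m: '<a href="mailto:%s">%s</a>' % (m.group(0), m.group(0)), text
    )


def link_article(front_matter, body):
    """Apply email linking to the front matter only; leave the body untouched."""
    return link_emails(front_matter), body


front, body = link_article(
    "Correspondence: jsmith@example.edu.",
    "The notation a@b.c denotes a binding in this calculus.",
)
print(front)
print(body)
```

Restricting the search to the front matter avoids false positives such as the body text in the usage example, where an "@" expression is not an address at all.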
Highwire encourages customers to deliver content in the Highwire DTD, but they cannot require publishers who have their own DTD to deliver in Highwire format. When a DTD transformation is required, Highwire does the transformation, rather than the publisher, for several important reasons:
While there are many advantages to converting SGML at Highwire rather than at the publisher, there is one significant disadvantage: if a publisher significantly upgrades their DTD, a full-fledged parser update, coupled with extensive integrity testing, is required at Highwire.
Communication of DTD upgrades to Highwire, whether major or minor, is critical. In many cases, Highwire has only learned of a DTD upgrade through the failure of a file to parse, rather than through proactive communication from a publisher.
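One inexpensive safeguard, offered here as an assumption rather than as a description of Highwire's process, is to inspect the document type declaration before parsing and flag any public identifier that has not been seen before. The identifier in `KNOWN_PUBLIC_IDS` and the function name `dtd_status` are hypothetical. The check cannot detect a DTD that changes without a change to its public identifier.

```python
import re

# Matches declarations of the form: <!DOCTYPE name PUBLIC "public identifier"
DOCTYPE = re.compile(r'<!DOCTYPE\s+(\S+)\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)

# Hypothetical identifiers for DTD versions the receiving parser already supports.
KNOWN_PUBLIC_IDS = {"-//EXAMPLE//DTD Journal Article 1.0//EN"}


def dtd_status(document_text):
    """Classify a delivery as 'known', 'unknown', or 'missing' by its DOCTYPE."""
    match = DOCTYPE.search(document_text)
    if match is None:
        return "missing"
    return "known" if match.group(2) in KNOWN_PUBLIC_IDS else "unknown"


delivery = '<!DOCTYPE article PUBLIC "-//EXAMPLE//DTD Journal Article 2.0//EN">\n<article>'
print(dtd_status(delivery))  # unknown
```

A delivery classified as "unknown" can be routed to a person before it reaches the conversion parser, so that an upgrade is discovered by inspection rather than by a parse failure.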
The Highwire experience illustrates that high quality results can be maintained through clear standards, continual monitoring, feedback mechanisms, and appropriate levels of investment. However, Highwire is not the only organization to have addressed this problem directly. Elsevier Science provides another valuable case study.
When Elsevier Science first required full-text SGML, they provided a DTD and documentation for the DTD. Even with a 90-page reference manual, Elsevier found that the quality of the SGML they received from suppliers was not as high as they had hoped. Files parsed (most of the time), but high quality SGML required more than parser validation.
Elsevier found that the interpretation of the DTD was inconsistent, sometimes because desired interpretations were undocumented, and sometimes because those people creating SGML had not memorized the documentation. Certain policy decisions that could not be encoded in a DTD needed clarification. Some common author mistakes appeared in SGML because editors did not catch them; sometimes editors made mistakes that affected SGML as well.
When these problems were encountered in 1996, Elsevier was archiving much of the SGML; they were not yet placing most of it online. In this regard, the problem Elsevier faced with deferred use of the SGML was similar to the problems that archivists will face.
Elsevier realized that these errors would cause long-term problems if they were left unchecked and unfixed. In order to catch these and a wide range of other errors, Elsevier started to build a quality control application, known simply as "QCTool".13
QCTool validates that a file parses, and then it reports three classes of problems: errors, warnings, and notifications. Errors are problems that indicate misuse of the DTD or violations of Elsevier policies (for example, a non-EMPTY element that has no content). Warnings are lesser issues that probably indicate incorrect SGML (for example, the text "Smith (2001)", if untagged, will cause a warning to be issued that a possible citation needs tagging). Notifications flag conditions that might indicate a problem but are just as likely to be false alarms.
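The two example checks above can be sketched in a few lines. This is not QCTool itself, whose implementation is not described here; it is a minimal illustration that assumes the content is available as well-formed XML, and the names `EMPTY_ELEMENTS`, `CITATION_TAG`, and `check` are hypothetical. No notification rule is given in the text, so that list is left empty.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

EMPTY_ELEMENTS = {"graphic"}  # hypothetical: elements declared EMPTY in the DTD
CITATION_TAG = "cite"         # hypothetical: element that marks a tagged citation
CITATION = re.compile(r"\b[A-Z][a-z]+ \((?:19|20)\d{2}\)")  # e.g. "Smith (2001)"


@dataclass
class Report:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)
    notifications: list = field(default_factory=list)  # no rules sketched here


def scan(text, report):
    """Warn about text that looks like a citation but is not tagged as one."""
    for match in CITATION.finditer(text or ""):
        report.warnings.append("possible untagged citation: " + match.group(0))


def walk(node, in_citation, report):
    in_citation = in_citation or node.tag == CITATION_TAG
    has_content = bool((node.text or "").strip()) or len(node) > 0
    if node.tag not in EMPTY_ELEMENTS and not has_content:
        report.errors.append("<%s> is not EMPTY but has no content" % node.tag)
    if not in_citation:
        scan(node.text, report)
    for child in node:
        walk(child, in_citation, report)
        if not in_citation:
            scan(child.tail, report)  # tail text belongs to this node, not the child


def check(xml_text):
    report = Report()
    walk(ET.fromstring(xml_text), False, report)
    return report


sample = ("<article><p>As shown by Smith (2001), and "
          "<cite>Jones (1999)</cite>.</p><title></title></article>")
result = check(sample)
print(result.errors)
print(result.warnings)
```

The sketch also shows why such checks sit outside the DTD: a parser accepts both the empty `title` and the untagged citation, so only a rule that encodes policy can report them.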
Initially QCTool was used inside Elsevier to examine SGML files for errors. As more errors were caught with the tool, a small team known as "the repair shop" started to fix the most egregious ones. It did not take long to realize that the capacity of this team to fix the errors reported by QCTool would soon be exceeded. In addition, Elsevier decided that the responsibility to fix these errors lay with the supplier, not with Elsevier.
In mid-1997, Elsevier distributed QCTool to suppliers and asked them to run SGML files through it in the hope that suppliers would fix the problems that were reported. In some cases, though, the number of problems reported was so large that suppliers found it more expedient to ignore QCTool. In addition, some errors were the result of author or editorial mistakes, and sometimes it was unclear how they should be fixed.
By mid-1998, most of the early questions about how to use and respond to the QCTool were resolved. But because a large number of errors were still being fixed in the repair shop, Elsevier no longer requested that suppliers run QCTool and fix the problems. New p