Abstract
In March 2003, the National Library of Medicine (NLM) released into the public domain a suite of DTD modules for describing journal literature, books, and many kinds of textual material. The full suite was developed by the National Center for Biotechnology Information (NCBI) and the XML consulting firms Inera, Inc. (funded by the Andrew W. Mellon Foundation) and Mulberry Technologies, Inc. (funded by NCBI). The modular DTD library and several complete DTDs made using the modules are in the public domain for use by any organization or individual without permission from NLM.
The first two public DTDs developed from the suite, the Journal Archiving and Interchange DTD and the Journal Publishing (authoring) DTD, were also released in March 2003 along with tag set documentation. The Publishing DTD defines a common format for the creation of journal content in XML. The advantages of a common format are portability, reusability, and the creation and use of standard tools. The Archiving DTD also defines journal articles, but it was created to provide a common, public format in which publishers, aggregators, and archives can exchange journal content and store it in large commonly-tagged repositories.
The DTD Suite was developed from work begun by NCBI in support of PubMed Central (PMC). The DTDs will form the basis for the PMC (PubMed Central) repository as well as the JSTOR (Journal STORage: The Scholarly Journal Archive) Electronic Archive Project. In the months since their release:
How have the DTDs been accepted and used in the journal publishing community? By electronic archival projects and repositories? By the large publishing conglomerates who have DTDs of their own? By first-time XML publishers who have never had a DTD? By conversion vendors? By the content aggregators and abstract and indexing services who must deal with multiple DTDs and schemas? In the wider publishing world beyond that of STM journals?
What new DTDs are available for public use? What DTDs are planned?
NLM has set up an advisory board of publishers, academics, aggregators, and consultants to oversee the DTD suite and ensure that the direction of growth and modification of the modules and new public DTDs is in everyone's best interests. What direction has come from the advisory board?
When will schema versions of the suite be available and what schema languages will be supported?
What tools have been developed to support the DTDs? What tools are planned?
(Note: Most of the update material is late-breaking and will therefore be given in the presentation but is not present in this paper.)
In March 2003, the National Library of Medicine (NLM) released, for public use, a Suite of XML DTD modules (that describe journal articles and non-article journal material such as editorials, book reviews, letters) and the first two DTDs to be constructed from these modules:
The Journal Archiving and Interchange DTD (Archiving DTD) provides a common format in which publishers and archives can exchange journal content.
The Journal Publishing DTD (Publishing DTD) provides a common format for the creation or initial conversion of journal content into XML.
Both DTDs were created from the Journal Archiving and Interchange DTD Suite, a set of XML modules that define generic elements and attributes for describing journal content. The Suite is a set of building blocks from which journal DTDs can be constructed. The intent of the Suite is to preserve the intellectual content of journals independent of the form in which that content was originally delivered. The Suite has been written as a set of XML DTD fragments, each of which is a separate physical file. No module is an entire DTD by itself, but these modules can be combined into any number of different DTDs. DTDs constructed from the Suite would share basic structural and semantic definitions.
The two DTDs may be used as they are, modified using established Parameter Entity mechanisms, or the Suite can be used to construct new DTDs for authoring and archiving journal articles and for transferring journal articles from publishers to archives and between archives. Although the full Suite was developed to support electronic production, the structures should be adequate to support some print production as well.
The DTDs and documentation are freely available online: http://dtd.nlm.nih.gov
The National Center for Biotechnology Information (NCBI) at the NLM administers the PubMed/MEDLINE system (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed) for medical and other life science journal citations. Through PubMed, anyone can access MEDLINE’s 14,000,000 biomedical journal citations to research biomedical questions.
In 2000, NCBI created PubMed Central (PMC), a digital archive of life sciences journal literature that provides free and unrestricted access to the full text of over 100 life sciences journals.
Although PMC (http://www.pubmedcentral.nih.gov/) started as a way to allow free access to complete articles from participating PubMed journals, it is now the NLM’s digital archive of life sciences journal literature.
Any journal that participates in PMC must supply the full text of articles in an SGML or XML format that conforms to an established DTD for journal articles. The first version of PMC was built using content from Molecular Biology of the Cell, Proceedings of the National Academy of Sciences of the U.S.A., bmj.com, Breast Cancer Research, and Arthritis Research. The content from the first three of these journals was in SGML conforming to the keton.dtd and was supplied by Highwire Press. The others were supplied by BioMed Central as XML in their article.dtd (which was based on the keton.dtd).
The first version of PMC loaded the native SGML or XML into a database. When a user requested an article, it was retrieved from the database and assembled and converted to HTML through DTD- and journal-specific software. The system worked, but there was tremendous overhead creating custom software for getting content into and out of the database. The system was not scalable.
Using the participating journals as samples, NCBI created an XML format (pmc-1.dtd) that all incoming content would be transformed into. The common XML would be loaded and rendered to HTML without the need for DTD- or journal-specific software. The only “custom” work would be the source to pmc-1.dtd transformations.
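The architectural idea (per-source transformations feeding one common format and one renderer) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the source vocabularies, the common tag names, and the functions `to_common` and `render_html` are hypothetical, and real source DTDs differ in far more than element names.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical tag names on both sides of each mapping.
SOURCE_TO_COMMON = {
    "publisher_a": {"art": "article", "ttl": "title", "para": "p"},
    "publisher_b": {"paper": "article", "heading": "title", "text": "p"},
}

def to_common(xml_text, source):
    """Rename elements from one source vocabulary into the common one.

    This is the only source-specific step in the pipeline.
    """
    mapping = SOURCE_TO_COMMON[source]
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return root

def render_html(article):
    """Render common-format content; no source-specific logic is needed here."""
    parts = []
    for element in article.iter():
        if element.tag == "title":
            parts.append("<h1>%s</h1>" % escape(element.text or ""))
        elif element.tag == "p":
            parts.append("<p>%s</p>" % escape(element.text or ""))
    return "".join(parts)

a = to_common("<art><ttl>Cells</ttl><para>Text.</para></art>", "publisher_a")
b = to_common("<paper><heading>Cells</heading><text>Text.</text></paper>", "publisher_b")
```

Both inputs render to the same HTML string, which is the point of the design: adding a journal means adding one mapping, not another loader and renderer.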
In 2001, more journals were participating in PMC, and the target pmc-1.dtd needed to be expanded to accommodate the new content. Also, NLM adopted PMC as its digital archive of life sciences journal literature, but because the pmc-1.dtd was written to be a “simple” DTD for online display, it had to be expanded to maintain as much of the information in the source files as possible. NCBI started a major revision of the pmc-1.dtd with Mulberry Technologies in 2001.
In 2001, Harvard University Library’s Office for Information Systems E-Journal Archiving Project (funded by the Mellon Foundation) hired Inera, Inc. to examine current journal DTD/schema practices (including the pmc-1.dtd) to determine the feasibility of creating one DTD or schema for intellectual content of journal articles[1].
The study concluded that it could be done, but no one existing DTD could do it.
NCBI applied many of the recommendations from the study to the pmc-2.dtd. Inera reviewed the pmc-2.dtd and suggested to Harvard that it could be used as the base for the one DTD/schema for intellectual content of journal articles that they were looking for. In April of 2002, NCBI met with the Harvard Library and representatives from the Mellon Foundation, Inera, and Mulberry to discuss extensions to the pmc-2.dtd so that it could be applied to a wider range of journals and disciplines and used in applications other than PMC.
NCBI worked with Mulberry and Inera to create a modular DTD (for easy customization) that could be used by both PMC and the E-Journal Archiving Project. The DTD philosophy was based on Harvard’s mission statement for the archive project:
The archive's purpose is to preserve the significant intellectual content of journals independent of the form in which that content was originally delivered in order to assure that this content will be available to the scholarly community for the indefinite future. Functionally, the archive is designed to render text and still images and other formats as practical with no significant loss of intellectual content. The archive reserves the right to freely manipulate the internal format of the manifestation over time as long as the plain meaning of the intellectual content is preserved.[1]
Mulberry and Inera examined thousands of articles from hundreds of journals and dozens of journal DTDs to be sure that the content models being defined by the tagset were comprehensive. After this extensive modeling exercise, the consultants worked with NCBI to create the Archiving and Interchange DTD as a general archiving DTD. NCBI and Mulberry then created the Journal Publishing DTD to help publishers who wanted to create content in XML. (The Archiving DTD is too loose for comfort in the creation of new material; it was designed for conversion from another format.)
The DTD Suite provides a set of XML modules that define elements and attributes for describing the textual and graphical content of journal articles. Each module is a separate file. No module is an entire DTD by itself, but these modules can be combined into a number of different DTDs. The Suite is designed as a standard class/mix structure, with most modules containing the declarations for one class of elements. Parameter Entities are used to define classes, and redefine these classes into smaller or larger sets for use by different DTDs. A Customization Module for each new DTD defines the specific differences in classes and attributes for that DTD. One module names all the potential modules in the Suite.
Individual DTDs are assembled using the modules they need from the Suite.
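The class/mix mechanism can be modeled in a few lines of Python. This is a toy model, not the Suite's actual declarations: the class names (`emphasis.class`, `list.class`, `para.mix`) and their members are hypothetical. It relies on one rule of XML itself: when a parameter entity is declared more than once, the first declaration is binding, which is why a customization module read ahead of the base modules can narrow or widen a class.

```python
import re

def resolve_parameter_entities(modules):
    """Collect parameter entity declarations; the first declaration wins."""
    entities = {}
    for module in modules:              # modules are read in order
        for name, value in module:
            entities.setdefault(name, value)
    return entities

def expand(text, entities):
    """Replace %name; references with their replacement text until none remain."""
    pattern = re.compile(r"%([A-Za-z][\w.-]*);")
    while pattern.search(text):
        text = pattern.sub(lambda m: entities[m.group(1)], text)
    return text

# Hypothetical class names and members, for illustration only.
base_module = [
    ("emphasis.class", "bold | italic | sup | sub"),
    ("list.class", "list | def-list"),
    ("para.mix", "%emphasis.class; | %list.class;"),
]
# A customization module is read first, so its narrower class takes precedence.
customization = [
    ("emphasis.class", "bold | italic"),
]

loose = expand("(%para.mix;)*", resolve_parameter_entities([base_module]))
strict = expand("(%para.mix;)*",
                resolve_parameter_entities([customization, base_module]))
```

Here `loose` expands to `(bold | italic | sup | sub | list | def-list)*`, while `strict` expands to `(bold | italic | list | def-list)*`: the same mix, redefined by a single overriding declaration.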
Modules in the DTD Suite include
Article metadata (elements such as article title, publication date, and keywords)
Display elements (such as figures, graphics, tables)
List elements (lists and definition lists)
Math elements (MathML)
Link elements (cross-reference, footnote, related-article)
Paragraph-level elements (paragraph, speech, display quote)
Phrase-level elements (bold, superscript, named-content)
Bibliographic reference elements (publisher name, access date, series)
DTDs for books and online help are planned, but the first two DTDs, which were released with the Suite and which are available for use now, are for journal articles. The article DTDs are generic enough to cover much of the non-article material in a journal, such as errata, letters to the editor, editorials, commentary, and short news pieces. The DTDs do not, by design, cover full journals, or such journal material as advertising, job announcements, and author’s guidelines.
The purpose of the Archiving DTD is to provide archives with the structural and semantic models to preserve a journal’s intellectual content over time. Journal publishers were among the first to realize the usefulness of SGML (XML’s parent language) and many journal publishers are currently using either XML or SGML DTDs. An archive can therefore expect to receive tagged data in:
Publisher-written DTDs (such as Elsevier, Wiley, Blackwell, et al.);
Consortium-written DTDs (such as AAP, ISO 12083, DocBook-Lite, et al.); and
Repository and aggregator-written DTDs (such as Ovid, Keystone, Highwire, et al.).
Many of these DTDs are proprietary. Some of the ones available for public use (e.g., ISO 12083) have not been updated to meet modern practice. There are dozens that a publisher could choose. The intent of the new Archiving and Interchange DTD is to make a single DTD that reflects the current practices of the XML journal publishing community. It is intended as a:
Translation target for other DTDs;
Base DTD for XML repositories; and
Interchange XML tagset for communication between publishers, archives, aggregators, service vendors.
It would not be possible to capture all the structural and semantic variety in these varied DTDs, and the Archiving DTD does not try. The Archiving DTD was written for ease of conversion from other DTDs. It captures the most common and most useful structures and provides sufficient generic models to transform the rest relatively cleanly. The idea is that it be easy to convert from the XML and SGML that journal publishers have now into a single repository format.
Therefore the Archiving DTD is descriptive (designed to tag what is in the original content), inclusive (preserves as much of the original tagging as possible), and non-enforcing (because there is no one right way to tag). In the content models: almost nothing is required; there are few required sequences (metadata is ordered, but little else is); and the distinguishing characteristic is many large OR groups (to make an easy target for transformation).
The purpose of the Publishing DTD is to provide guidance in creating new journal material. It provides:
Tagset and content models for the initial creation of journal content in XML; and
Conversion target for backfile.
This DTD was written to help users create consistent structures when authoring article content or during initial tagging of non-XML content.
It differs from the Archiving DTD because it:
is smaller (not as many elements needed);
is prescriptive (fewer choices simplify tagging decisions);
is enforcing (there is only one way to do many things); and
has more required elements.
(See hierarchical diagrams in Section 10, “Appendix A NLM Journal Publishing DTD”.)
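The contrast between a non-enforcing and an enforcing content model can be made concrete with a small Python sketch. The two models below are hypothetical and are not taken from either DTD; they are written as regular expressions over a sequence of child tag names, which mirrors how a DTD content model constrains the children of an element. The element names `sec`, `title`, `p`, `list`, and `fig` are likewise assumptions.

```python
import re
import xml.etree.ElementTree as ET

def child_sequence(element):
    """Return the element's child tags as a string of space-terminated tokens."""
    return "".join(child.tag + " " for child in element)

# Archiving-style (hypothetical): one large OR group, any order, nothing required.
ARCHIVING_STYLE = re.compile(r"(?:(?:title|p|list|fig) )*")
# Publishing-style (hypothetical): a required title first, then at least one
# paragraph, list, or figure.
PUBLISHING_STYLE = re.compile(r"title (?:(?:p|list|fig) )+")

def conforms(element, model):
    """True if the element's children match the whole content model."""
    return model.fullmatch(child_sequence(element)) is not None

# Content converted from another format may arrive in an unexpected order.
converted = ET.fromstring("<sec><p>Text first.</p><title>Late title</title></sec>")
# Newly authored content can be held to the stricter model.
authored = ET.fromstring("<sec><title>Methods</title><p>Text.</p></sec>")
```

Under these assumed models, `converted` satisfies only the loose model, while `authored` satisfies both. That is the intended relationship: the loose model is an easy conversion target, and the strict one guides new tagging.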
The document type for both DTDs is <article>. The general structure for journal articles is:
Journal-level metadata (such as the journal title and publisher), followed by
Article-level metadata (such as the article title and author), followed by
Full text and graphics of the body of the article, including:
structural items (sections, paragraphs, lists)
figures and tables
content items (such as genus-species, gene)
typographical highlighting (bold, small caps)
sidebars and text boxes
pointers to related material such as databases
bibliographic references
appendices and responses to the original article.
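As a rough illustration of this ordering, the following Python sketch builds an empty article skeleton. The container names `<front>`, `<body>`, and `<back>` are those described in the appendix to this paper; the child names `journal-meta`, `journal-title`, `article-meta`, and `article-title`, and their flat arrangement, are simplifying assumptions, so the Tag Library remains the authority on the real models.

```python
import xml.etree.ElementTree as ET

def build_article_skeleton(journal_title, article_title):
    """Build a minimal <article> with metadata first, then body, then back matter."""
    article = ET.Element("article")

    # Metadata: journal-level first, then article-level (child names assumed).
    front = ET.SubElement(article, "front")
    journal_meta = ET.SubElement(front, "journal-meta")
    ET.SubElement(journal_meta, "journal-title").text = journal_title
    article_meta = ET.SubElement(front, "article-meta")
    ET.SubElement(article_meta, "article-title").text = article_title

    # Full text and graphics, followed by references and appendices.
    ET.SubElement(article, "body")
    ET.SubElement(article, "back")
    return article

skeleton = build_article_skeleton("Example Journal", "An Example Article")
serialized = ET.tostring(skeleton, encoding="unicode")
```

The serialized result is a single `<article>` element whose children appear in the order front, body, back.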
A Tag Library is the user documentation for a DTD — a reference guide to tagging for the people creating the XML documents. Each DTD has its own Tag Library available online as linked HTML files. The Tag Libraries provide, in natural language, the same information as the DTD, as well as tagged examples, structural diagrams, element and attribute definitions, and usage explanations that could not be expressed easily in a DTD (or even in schema) syntax. Each Tag Library includes:
Element description pages (one per element);
Attribute description pages (one per attribute);
Context table (where an element can be used);
Hierarchy diagrams (tree-structure diagrams for significant structures);
Tagged Article samples (with PDF of each full sample also available);
Index to elements by tag name (element type name);
Index to elements by longer descriptive element name; and
The full text of all DTD modules.
Each Element Page contains:
Element name (tag name and long name) and definition;
Usage notes;
Element’s attributes (linked to attribute descriptive pages);
Explanation of related elements;
Content of the element (what can be inside it);
Tagged example(s) for each distinct context;
Which module the element is defined in; and
Here is a sample of an Element Page for the element Abbreviated Journal Title.
We considered writing the “master” version of these tag sets in W3C XML Schema before we wrote the DTD, but we chose to express the “master” version in DTD form for several reasons, both technical and social.
Modularity: The DTDs are modularized in ways that allow, for example, multiple table models to be used. The table models are in self-contained modules and may be swapped in and out; you may use the CALS table model, the XHTML table model, or any third table model by adding the appropriate module. Although the CALS table model, for example, exists as a W3C XML Schema, it cannot be modularized for swap-out in a similar way in W3C Schema syntax.
Preserving Intellectual Content: Some journals have tagged, executable math. MathML is currently available only as DTD modules, not as schema modules. If we had written a schema, we would have needed to exclude math (or to convert MathML to a schema ourselves, a very complex endeavor that would require extensive mathematical knowledge). (Note: By the time you read this, the MathML working group will have released the W3C Schema for MathML, but it was not available during our development.)
Little need for W3C Schema's Strengths: Although all of the models in these DTDs can be expressed in most (perhaps all) of the extant schema languages, the real strengths of schema languages are not relevant to this content:
Data Types - For b2b or transaction data, both the data types and the data typing of schemas are useful. There is very little in a journal textual DTD that can benefit from either. There are almost no small, type-identifiable components, even in the metadata.
Data Typing - Most journal structures are textual, with highly diverse content models. W3C Schema can only type when the content models are the same or can be derived.
Enforcement - Our goal is preservation, not enforcement; types and typing enforce and help to exclude through error-checking, which we do not wish to do.
The social and political reasons are, perhaps, even more important than the technical rationale. Current journal publishing is done with XML and SGML DTDs, not Schemas. Current production and tools are set up to use DTDs and not schemas. The targeted user community of this DTD is:
Journal Publishers (STM, society, all journal fields);
Aggregators and Repositories; and
Conversion vendors.
That user community is using:
XML DTDs (almost never schemas);
SGML DTDs (still a large minority);
XML tools that do not support all the schema formats; and
Word processors.
The intent was to provide journal publishers and archives with XML material that they could understand, interpret, and modify as they needed. The idea was to provide consensus and consolidation, not to drive or lead the community into new territory.
Because of the need to accommodate schema-based XML authoring tools, NCBI is creating a W3C XML Schema (and possibly other schema versions) of the DTDs. The DTDs will remain the definitive versions, and the schemas won’t increase DTD functionality with data types or derived typing.
These DTDs and the Tagset are in the public domain; they are not “Open Source”. This means that NLM will retain control over changes and additions to the Tagset and DTDs, but anyone may create a new DTD from the Suite or use the DTDs without permission from NLM.
To maintain consistency of the DTDs, NLM asks:
If you create a DTD from the Archiving and Interchange DTD Suite and intend it to stay compatible with the Suite, then please include the following statement as a comment in all of your DTD modules: "Created from, and fully compatible with, the Archiving and Interchange DTD Suite."
If you alter one or more modules of the suite, then please rename your version and all its modules to avoid any confusion with the original Suite. Also, please include the following statement as a comment in all your DTD modules: "Based in part on, but not fully compatible with, the Archiving and Interchange DTD Suite." [2]
Any DTD that is not being changed is not being used. Journal publication requirements change over time, and new structures may be discovered anytime a new DTD is converted to the Archiving DTD. To keep the DTD relevant to the publishing and archiving communities, NLM has created the Interchange Structure Working Group, made a public comment form available, and contracted with Mulberry Technologies, Inc. of Rockville, MD to act as Archiving and Interchange Tagset Secretariat. Anyone may comment on the public list. Anyone may file a formal comment. The Working Group will meet in face-to-face sessions, on conference calls, and by email to discuss the public requests and to recommend changes in and/or additions to the tagset. The Secretariat will collect the feedback to be discussed by the Working Group and will maintain the files and documentation.
Based on the first Working Group meeting in August 2003 and suggestions from users, version 1.1 will be released on November 1, 2003.
One of the major initial recommendations of the Working Group was to create tools for use with the DTDs/schemas. Such tools will be available for download or will be pointed to from the NCBI site. As of October 2003, the following tools were available:
HTML preview/rendering tool (an XSLT stylesheet that makes XHTML from an article tagged according to these DTDs); and
XSLT transforms from other selected journal DTDs to the tag set.
The following tools are being developed, and will be announced as they become available:
XSL-FO stylesheet to produce a PDF version of an article;
XML authoring tool customizations specific to the DTDs.
By December, as you read this, we hope to be able to mention many more organizations and publishers, but for the printed proceedings, there are at least the following developments.
As stated already, these DTDs will be the basis for the NLM PubMed Central archive.
The Archiving DTD will also be used in the initial pilot implementations of the JSTOR Electronic Archive. JSTOR has made agreements with a number of publishers to include their back content. For details, see the JSTOR website.
The United States Department of Health and Human Services, Centers for Medicare and Medicaid Services, Office of Strategic Planning has chosen the Publishing DTD for use with their Health Care Financing Review, their 2004 CMS Statistics guide, and other publications. Data Conversion Laboratories is converting back issues of the review to the new tagset.
Commonwealth Scientific and Industrial Research Organization (CSIRO) in Australia, one of the world's largest and most diverse scientific global research organizations, has chosen the DTD for use in their publications. Inera, Inc. has implemented their product eXtyles to both clean up and convert CSIRO’s Microsoft Word sources into the tagset.
The home page for the Tagset and DTDs: http://dtd.nlm.nih.gov
Discussion list for Archiving DTD: http://www.ncbi.nlm.nih.gov/mailman/listinfo/archive-dtd
Discussion list for Publishing DTD: http://www.ncbi.nlm.nih.gov/mailman/listinfo/publishing-dtd
This appendix contains selected structures from the National Library of Medicine's Journal Publishing DTD expressed as structural diagrams created by Near & Far® Designer. The conventions of the Near & Far® format are described at the end of this appendix.
The National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM) created two DTDs (Archiving and Interchange DTD and Publishing DTD), with the intent of providing a common format in which publishers and archives can exchange journal content. These DTDs were created from the Journal Archiving and Interchange DTD Suite, which provides a set of XML modules that define elements and attributes for describing the textual and graphical content of journal articles as well as some non-article material such as letters, editorials, and book and product reviews.
The intent of the full Suite is to preserve the intellectual content of journals independent of the form in which that content was originally delivered. The Suite is a set of modules, each a separate file. These modules can be combined into a number of different DTDs, for authoring and archiving journal articles, transferring journal articles from publishers to archives and between archives, and print production.
The document element is an <article>, which may be a traditional research article or one of many common non-article structures such as a book review, product report, editorial, commentary, or erratum. The top-level structure of an article may include subarticles and replies or responses to the article. The metadata concerning the journal, issue, and article is inside the <front> element. The text and graphics of the article are inside the <body> element. The <back> element contains bibliographic references, appendices, and other structures.
While some older DTDs divide journal metadata into three components: journal-specific metadata, issue-specific metadata, and article-specific metadata, current practice in online updates and electronic journals has made the definition of an “issue” a debatable point. Therefore the Publishing DTD clusters issue-like metadata with the article metadata.
Includes both article-specific and issue-specific metadata.
The body of an article is a large OR group of all the paragraph-level elements followed by optional sections.
Sections are the hierarchical divisions of articles, and they are recursive.
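Because sections may contain sections, software that processes an article body is naturally recursive. A minimal Python sketch follows, assuming the element names `sec`, `title`, and `p` (assumptions here, to be checked against the Tag Library) and a made-up sample body that follows the pattern described above: paragraph-level content first, then sections.

```python
import xml.etree.ElementTree as ET

# Made-up sample content for illustration.
SAMPLE = """<body>
  <p>Opening paragraph.</p>
  <sec><title>Methods</title>
    <p>Overview.</p>
    <sec><title>Sampling</title><p>Details.</p></sec>
  </sec>
  <sec><title>Results</title><p>Findings.</p></sec>
</body>"""

def outline(parent, depth=1):
    """Yield (depth, title) for every section, descending into nested sections."""
    for sec in parent.findall("sec"):       # direct child sections only
        yield depth, sec.findtext("title", default="")
        yield from outline(sec, depth + 1)  # recurse into subsections

sections = list(outline(ET.fromstring(SAMPLE)))
```

For the sample above, `sections` lists Methods at depth 1, Sampling at depth 2, and Results at depth 1, in document order.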
Table 1. Element Occurrences
Required
Optional. May occur zero or one time.
Required, repeatable. May occur one or more times.
Optional, repeatable. May occur zero or more times.
Table 2. Additional Symbols Associated with Elements
Element content is expanded elsewhere in the diagram.
The element has attributes.
One or more elements collapsed for clarity
Element is the root element of the model.
Text, numbers, and special characters
Excluded elements that are prohibited from being within the current element
Table 3. Grouping Elements
Required sequence: Element1 followed by Element2 followed by Element3
Choice of sequence: Element1 or Element2 or Element3
Table 4. Compound Examples
Zero or more repetitions of Element1 or Element2 or Element3 followed by zero or one occurrences of Element4
Zero or one occurrence of either Element1 followed by Element2 followed by Element3 or one or more occurrences of Element4
Either one or more repetitions of Element1 followed by Element2 or one or more occurrences of Element4
[1] Harvard University Library Office for Information Systems E-Journal Archiving Project. E-Journal Archive DTD Feasibility Study (2001) http://www.diglib.org/preserve/hadtdfs.pdf.
[2] Archiving and Interchange DTD home page. http://dtd.nlm.nih.gov.