XML Europe 2002

Generation of Simplified DTDs From a Set of XML Sample Files

Abstract

We describe a method, and the related software, for the automatic generation of simplified DTDs from a source DTD and a set of sample marked-up files. The purpose is to create the minimum DTD that the sample set of files complies with. In this way, new files can be created and parsed using this simplified DTD while still complying with the original, more general one. The simplified DTD can be used to make the task of markup easier, especially for inexperienced XML writers.

Our approach is to automatically select only those DTD features that are used by a set of valid documents (validated against the more general DTD) and eliminate the rest, obtaining a narrow-scope DTD that defines a subset of the original markup scheme. This "pruned" DTD can be used to build new documents of the same markup subclass, which in turn still comply with the original general DTD.
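The selection step can be illustrated with a minimal Python sketch. It is not the software described in this paper: it assumes a toy model of a DTD (a dictionary mapping each element name to the sets of child elements and attributes it allows) instead of a parsed DTD, and it ignores content-model ordering and occurrence indicators. The names `collect_usage` and `prune_dtd` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_usage(xml_documents):
    """Record which child elements and attributes each element uses in the samples."""
    children_used = defaultdict(set)
    attributes_used = defaultdict(set)
    for text in xml_documents:
        root = ET.fromstring(text)
        for element in root.iter():
            # Touch the key so that elements without children are registered too.
            children_used[element.tag]
            attributes_used[element.tag].update(element.attrib)
            for child in element:
                children_used[element.tag].add(child.tag)
    return children_used, attributes_used


def prune_dtd(general_dtd, children_used, attributes_used):
    """Keep only the declarations that the samples actually use.

    general_dtd is an assumed toy model, not a real DTD:
    {element_name: {"children": set_of_names, "attributes": set_of_names}}
    """
    pruned = {}
    for name, declaration in general_dtd.items():
        if name not in children_used:
            continue  # element never appears in the samples: drop its declaration
        pruned[name] = {
            "children": declaration["children"] & children_used[name],
            "attributes": declaration["attributes"] & attributes_used[name],
        }
    return pruned
```

Because the samples are assumed to be valid against the general DTD, everything observed in them is already allowed by it, so intersecting each declaration with the observed usage can only narrow the markup scheme. A real implementation would also have to rewrite content models and attribute-list declarations so that the pruned DTD remains a valid DTD.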

Using this automated method, the simplified DTD can be updated immediately whenever features are added to (or removed from) the sample set of XML files; any modification to the sample files must be validated against the general DTD. This process can be repeated to produce the final narrow-scope DTD incrementally.

In this way, we use a complex DTD as a general markup-design frame to build a simpler working-DTD that suits a specific project's markup needs.

Another use of this technique is to build a one-document DTD, i.e. the minimum DTD, derived from the general DTD, that a given XML document complies with.

A further benefit of this tool is that it produces statistical data that may help markup designers improve their markup schemes, such as the frequency with which certain elements occur within others. These figures help to detect unusual structures that could reflect markup mistakes, misuse of the DTD, or DTD features that allow unwanted generalization.
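As a rough illustration of this kind of statistic, the following Python sketch counts how often each element occurs directly inside each other element and lists the nestings seen at most a given number of times. It is only an assumption about how such figures could be gathered, not the tool's actual output; `nesting_frequencies`, `rare_nestings` and the `threshold` parameter are hypothetical names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def nesting_frequencies(xml_documents):
    """Count every (parent, child) element pair found in the sample documents."""
    counts = Counter()
    for text in xml_documents:
        root = ET.fromstring(text)
        for parent in root.iter():
            for child in parent:
                counts[(parent.tag, child.tag)] += 1
    return counts


def rare_nestings(counts, threshold=1):
    """Return the nestings seen at most `threshold` times, as candidates for review."""
    return sorted(pair for pair, seen in counts.items() if seen <= threshold)
```

A nesting that appears only once or twice across a large sample set is not necessarily wrong, but it is a reasonable place to start looking for markup mistakes or for declarations that are more permissive than intended.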

This tool was used at the Miguel de Cervantes digital library of the University of Alicante to obtain simplified versions of the TEI.DTD (Sperberg-McQueen and Burnard, 1994). This work is part of a larger project in the field of text markup and derived applications.

Keywords


The full paper was not available at the time the proceedings were created. Please check the conference web site, http://www.xmleurope.com, to find an updated version of this paper.

Biography

Alejandro G. Bia has a BS and an MS degree in Computer Sciences from ORT University and a Diploma in Computing and Information Systems from Oxford University, and is finishing his PhD thesis on Computing Methods to Automate the Production of Digital Resources in Digital Libraries at the University of Alicante. He currently works as Subdirector of Research and Development at the Miguel de Cervantes Digital Library of the University of Alicante, where the results of his ongoing research are being put into practice. He has also worked as Special-Projects Manager at NetGate (1996), as Documentation Editor of the GeneXus project at Advanced Research and Technology (ARTech) (1991-1994), and at the Telephonic Traffic Processing Unit of ANTEL (1994-1989). He was a lecturer on Operating Systems, Computer Organization, Computer Networks and English for Computer Sciences at ORT University (1990-1996). His current interests are the automation of digitisation by computer methods, digital preservation, digitisation metrics and cost estimates, text structuring and markup languages. He is an active member of the TEI Consortium.