Abstract
Choosing an XML vocabulary today is a very challenging task, quite similar to finding a web page before the development of the big web search engines and directories on which we rely today. The main difference is that it is much more dangerous to choose a "wrong" vocabulary than a "wrong" page!
If the bad news is that none of the tools which would help are available, the good news is that most, if not all, of the information I need is available somewhere on the web. The purpose of this project is to propose a solution to retrieve and present this information.
I have recently been involved in many discussions and projects around a simple and generic question: "how do I create an XML vocabulary?". The formulation often differed, including "how do I create a namespace?" or "how do I publish an XML schema?", but the central issue was always the organization to put in place and the ways to advertise the newly created vocabulary.
Analyzing the various organizational, technical and marketing facets of this question, I have become convinced that the development and publication of "XML vocabularies" (or namespaces or schemas if you prefer) is just a variation on the generic (and better known) issue of web publishing, and that web publishing tools and techniques should be used and adapted to the publication of XML vocabularies.
Among these technologies, web crawlers and search engines are probably those which the XML community lacks the most, and the purpose of this presentation is to show a proof of concept of what such tools might be.
Choosing an XML vocabulary today is a very challenging task, quite similar to finding a web page before the development of the big web search engines and directories on which we rely today. The main difference is that it is much more dangerous to choose a "wrong" vocabulary than a "wrong" page!
When I need to choose an XML vocabulary, I first want a comprehensive list of vocabularies which could meet my needs. Ideally, I would use a search engine such as Google or AltaVista, but unfortunately there is no specialized search engine for XML vocabularies.
To choose between those candidates, I need as much information as possible, and a directory such as DMOZ or Yahoo would be of great value. Unfortunately, there are lots of "schema repositories" covering vocabularies developed by a number of disjoint communities, and this really doesn't help to compare those vocabularies. Furthermore, these repositories often publish the descriptions provided by the authors, which usually lack the critical touch brought by the DMOZ or Yahoo editors.
Finally, I find it very difficult to judge the dynamics behind the vocabularies and to distinguish between a two-year-old specification abandoned by its authors, whose usage is slowly declining, and a brand new one with sharply rising market adoption. Ideally again, I sorely miss statistics such as those provided by the Netcraft surveys.
If the bad news is that none of the tools I have mentioned are available, the good news is that most, if not all, of the information I need is available somewhere on the web. The purpose of this project is to propose a solution to retrieve and present this information.
I have opted for a very simple data model with only two classes (Document and Namespace) linked by a variety of relations:
quotes: indicates that a namespace is "quoted" in a document but not declared or used in a way which conforms to the XML 1.0 and Namespaces in XML recommendations.
declares: indicates that a namespace is declared as specified in Namespaces in XML without being used to qualify any element or attribute.
uses: indicates a namespace declared and used to qualify elements or attributes but not used to qualify the root element.
usesAsRoot: indicates that the namespace is declared and used to qualify the root element (and possibly other elements and attributes).
Other relations such as isSchemaFor or isTransformationFor will be defined in future releases.
The number of properties has been kept minimal too: the Document class has only two properties (wellFormedXml and lastVisit) and the Namespace class has none.
The RDF schema reflects this pursuit of simplicity: the relations are translated as sets of RDF properties included directly within their domain classes, without using any containers, which would have added triples and made the RDF queries more complicated than they should be.
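The difference between declares, uses and usesAsRoot can be illustrated with a short Python sketch. This is not the code of the proof of concept: the function name classify_namespaces is hypothetical, and the standard xml.etree.ElementTree module is used only because it keeps the example self-contained. The quotes relation is left out since it applies to documents that cannot be parsed at all.

```python
import io
import xml.etree.ElementTree as ET


def classify_namespaces(xml_text):
    """Map each namespace URI of a well-formed document to its relation.

    Hypothetical helper: returns 'usesAsRoot', 'uses' or 'declares'.
    """
    declared, used = set(), set()
    root_ns = None
    seen_root = False
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            _prefix, uri = payload
            if uri:  # xmlns="" only undeclares the default namespace
                declared.add(uri)
            continue
        # ElementTree writes qualified names as "{uri}local".
        names = [payload.tag] + list(payload.attrib)
        used.update(n[1:].split("}", 1)[0] for n in names if n.startswith("{"))
        if not seen_root:
            seen_root = True
            if payload.tag.startswith("{"):
                root_ns = payload.tag[1:].split("}", 1)[0]

    relations = {}
    for uri in declared | used:
        if uri == root_ns:
            relations[uri] = "usesAsRoot"
        elif uri in used:
            relations[uri] = "uses"
        else:
            relations[uri] = "declares"
    return relations
```

Note that the xml: prefix is never declared explicitly, so a namespace can be reported as used without appearing among the declarations.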
As an example, the RDF describing the latest RDDL specification is as simple as:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<document xmlns="http://xmlns.info/descriptions/"
rdf:about="http://www.openhealth.org/RDDL/20020218/rddl-20020218.html">
<lastVisit>2002-03-26T10:28:35Z</lastVisit>
<wellFormedXml>yes</wellFormedXml>
<usesAsRoot rdf:resource="http://www.w3.org/1999/xhtml"/>
<uses rdf:resource="http://www.w3.org/XML/1998/namespace"/>
<uses rdf:resource="http://www.rddl.org/"/>
<uses rdf:resource="http://www.w3.org/1999/xlink"/>
</document>
</rdf:RDF>
The engine used to retrieve the information on the web is Larbin, a multi-threaded open source crawler written in C++. Larbin makes sure we behave as a good web citizen, takes care of the crawling itself (including link detection and duplicate management), and calls back a user routine when a page has been retrieved.
The central piece of this process is the actual namespace discovery, which is done in two steps. The documents are first processed through a regular expression which detects constructs that look like namespace declarations, even in documents which are not well formed. When such occurrences are found, an attempt is made to parse the document using libxml and, if this attempt succeeds, an XSLT transformation is run using libxslt to do a finer analysis of the document.
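The two steps can be sketched as follows. This is only an illustration under stated assumptions: the prototype uses libxml and libxslt from the crawler's callback, whereas this sketch swaps in Python's standard re and xml.etree.ElementTree modules, and both the pattern and the function name discover are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Step 1: anything that looks like xmlns="..." or xmlns:prefix='...'.
# Hypothetical pattern, not the one used by the prototype.
XMLNS_PATTERN = re.compile(r"""xmlns(?::[\w.-]+)?\s*=\s*(["'])(.*?)\1""", re.DOTALL)


def discover(page_text):
    """Return None when no namespace is mentioned, else a small report."""
    candidates = sorted(
        {m.group(2) for m in XMLNS_PATTERN.finditer(page_text) if m.group(2)}
    )
    if not candidates:
        return None
    # Step 2: only pages that passed step 1 are handed to the XML parser.
    try:
        ET.fromstring(page_text)
        well_formed = True
    except ET.ParseError:
        well_formed = False
    return {"wellFormedXml": well_formed, "candidates": candidates}
```

When wellFormedXml is false, the candidates can only be recorded as quoted. When it is true, a finer analysis of the parsed document is needed to tell declared namespaces from used ones.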
In both cases, when namespaces are found, an RDF document similar to the one above is generated and stored, containing the RDF description of the document and its relations to the namespaces discovered in it.
These documents can then be loaded into an RDF database or repository such as 4Suite for later use.
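Generating such a description takes only a few lines. The sketch below is an assumption-laden illustration rather than the prototype's code: describe_document is a hypothetical name, the info prefix is chosen arbitrarily (the example above uses a default namespace instead), and the value "no" for documents which are not well formed is a guess, since only "yes" appears in the example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
INFO_NS = "http://xmlns.info/descriptions/"


def describe_document(url, last_visit, well_formed, relations):
    """Serialize one crawled document as RDF/XML.

    relations is a list of (relation name, namespace URI) pairs, for
    instance ("usesAsRoot", "http://www.w3.org/1999/xhtml").
    """
    # register_namespace changes module-wide state; it only affects the
    # prefixes chosen at serialization time.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("info", INFO_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    doc = ET.SubElement(
        root, f"{{{INFO_NS}}}document", {f"{{{RDF_NS}}}about": url}
    )
    ET.SubElement(doc, f"{{{INFO_NS}}}lastVisit").text = last_visit
    # "no" is an assumed value; the source only shows "yes".
    ET.SubElement(doc, f"{{{INFO_NS}}}wellFormedXml").text = (
        "yes" if well_formed else "no"
    )
    for relation, uri in relations:
        ET.SubElement(
            doc, f"{{{INFO_NS}}}{relation}", {f"{{{RDF_NS}}}resource": uri}
        )
    return ET.tostring(root, encoding="unicode")
```

Each relation becomes a direct child property of the document, which matches the container-free layout described earlier.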
We have already mentioned the RDF schema used for this proof of concept. The information gathered by the crawler is almost ready for use; however, I have preferred not to describe the namespaces discovered in the documents, to avoid creating redundant descriptions in the multiple documents mentioning the same namespace. As a consequence, the namespaces discovered are not yet typed as namespaces after we've loaded the documents into an RDF database, and a batch job needs to be written to add a type to untyped namespaces.
For this proof of concept, I have been using 4Suite and its query language Versa. Some features of this language, such as its ability to define aggregates, were missing from the RDF query languages which I have used in the past, and they are badly needed for our project.
Versa supports RDF Schema and can be used to provide a list of untyped namespaces:
filter(traverse(type(info:document), info:mentions, vtrav:any, vtrav:forward, vtrav:transitive), "not(. - rdf:type -> *)")
In this Versa query, the first argument of the filter function (traverse(type(info:document), info:mentions, vtrav:any, vtrav:forward, vtrav:transitive)) relies on our RDF schema to give a list of the objects linked to subjects of type info:document through any predicate which is a sub-property of info:mentions. The second argument keeps only the resources which have no type.
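For readers unfamiliar with Versa, the same selection can be mimicked over a plain list of triples. The sketch below is not Versa and does not use the 4Suite API; the function name, the prefixed-string representation of resources, and the assumption that the four relations of the data model are the sub-properties of info:mentions are all illustrative.

```python
RDF_TYPE = "rdf:type"
# Assumed to be the sub-properties of info:mentions in the RDF schema.
MENTIONS_SUBPROPERTIES = {
    "info:quotes",
    "info:declares",
    "info:uses",
    "info:usesAsRoot",
}


def untyped_namespaces(triples):
    """Return the resources mentioned by documents which have no rdf:type.

    triples is an iterable of (subject, predicate, object) strings.
    """
    triples = list(triples)
    documents = {
        s for s, p, o in triples if p == RDF_TYPE and o == "info:document"
    }
    typed = {s for s, p, o in triples if p == RDF_TYPE}
    # One hop is enough here because namespaces have no outgoing relations.
    mentioned = {
        o for s, p, o in triples
        if s in documents and p in MENTIONS_SUBPROPERTIES
    }
    return sorted(mentioned - typed)
```

The result is the work list of the batch job that adds a type to each untyped namespace.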
This is already quite useful; however, the main application of an RDF query language in this project is of course to retrieve the data and present it to the users.
Statistics are the first result which can be drawn from this proof of concept.
The first trial run retrieved 7693 documents, a number too low to be significant, especially since the starting point given to the crawler was http://xmlfr.org, a specialized site which should lead to an overestimated proportion of "XML namespace aware" pages. 241 of these documents contained a mention of an XML namespace, and 85 different namespaces were found.
These statistics should thus be considered as an example of the kind of conclusions which could be drawn rather than representative.
                  Documents   Proportion of total   Proportion of namespace aware
Total documents        7693                100.0%
Namespace aware         241                  3.1%                          100.0%
Well formed              74                  1.0%                           30.7%
Not well formed         167                  2.2%                           69.3%
XHTML 1.0               161                  2.1%                           66.8%
MS Office                23                  0.3%                            9.5%
HTML 4.0                 20                  0.3%                            8.3%
VML                      13                  0.2%                            5.4%
RDF                      12                  0.2%                            5.0%
XLink                    12                  0.2%                            5.0%
MS Word                  11                  0.1%                            4.6%
XSLT                     11                  0.1%                            4.6%
Saxon                    10                  0.1%                            4.1%
Uuid (*)                  9                  0.1%                            3.7%
A "namespace aware" document is any document where the text xmlns[:xxx]='anything' or one of its variations has been found. Those documents are passed through an XML parser and can be well formed or not.
(*) Uuid is the namespace "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" mainly found in association with the MS Office namespace in the http://www.omg.org website.
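The two proportion columns are simple ratios: the first is taken over all retrieved documents and the second over the namespace aware ones. A minimal sketch of the arithmetic, with a hypothetical function name and the counts taken from the table above:

```python
def proportions(count, total_documents, namespace_aware):
    """Return (share of all documents, share of namespace aware ones) in %."""
    return (
        round(100 * count / total_documents, 1),
        round(100 * count / namespace_aware, 1),
    )


TOTAL, AWARE = 7693, 241
for label, count in [("Well formed", 74), ("Not well formed", 167), ("XHTML 1.0", 161)]:
    of_total, of_aware = proportions(count, TOTAL, AWARE)
    print(f"{label:<16}{count:>5}{of_total:>7.1f}%{of_aware:>7.1f}%")
```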
XHTML 1.0     161   100.0%
Uses           72    44.7%
Quotes         89    55.3%
As expected, XHTML 1.0 (http://www.w3.org/1999/xhtml) is the top namespace found during our crawl. The documents which are just quoting the namespace are not well formed, and their huge proportion (55%) clearly shows that publishing well formed XML is far from easy with the tools available today. Interestingly enough, even specialized and professional sites such as the Dutch web site of the W3C, IBM services, Dublin Core, Infoteria, O'Reilly, xmlhack and my own XMLfr (to name a few) have been seen serving XHTML pages which are not well formed.
MS Office      23   100.0%
Uses            0     0.0%
Quotes         23   100.0%
MS Office (urn:schemas-microsoft-com:office:office) is well known to be "an ugly mix of XML, ill-formed HTML, scripts, and if statements inside of square braces" (Robert DuCharme) and it's no surprise that none of those documents are well formed. Sites exposing this namespace include institutional sites such as the District Court of Maine, the European Commission, the United States Department of Agriculture or the French Ministere de l'Education.
HTML 4.0       20   100.0%
Uses            0     0.0%
Quotes         20   100.0%
Using the location of the HTML 4.0 recommendation (http://www.w3.org/TR/REC-html40) as a namespace to identify "well formed HTML" used to be a common practice before the publication of XHTML 1.0 and some sites such as Zvon, different US District Courts or crossref.org expose this namespace. This is probably a "leak" during XSLT transformations generating HTML documents which are, by definition, not well formed XML. Generally speaking, these namespace leaks may prove quite useful to guess the namespaces used internally to construct web pages.
VML            13   100.0%
Uses            0     0.0%
Quotes         13   100.0%
VML (urn:schemas-microsoft-com:vml) is another namespace used by Microsoft, frequently in association with others such as MS Office or MS Word. Sites exposing this namespace include the National Defense Industrial Association and the United Nations Environment Program.
RDF            12   100.0%
Uses            0     0.0%
Quotes         12   100.0%
The documents exposing RDF (http://www.w3.org/1999/02/22-rdf-syntax-ns#) found during this crawl are either HTML documents exposing it as a leak, such as pages from the well known Zvon XSLT Tutorial or CNN Arabic, or RDF islands in HTML documents, such as pages from the U.S. Equal Employment Opportunity Commission. These documents are either HTML (and thus not well formed) or just quoting the RDF namespace.
XLink          12   100.0%
Uses            6    50.0%
Quotes          6    50.0%
All the documents simply quoting the XLink namespace (http://www.w3.org/1999/xlink) found during this crawl happen to be documentation mentioning XLink, such as Tim Bray's XNRL, Sean Palmer's XNGloss, or the University of Bath's guidelines for implementing Dublin Core. The six well formed documents using XLink are RDDL documents, including rddl.org itself and my own Examplotron and XSLTUnit.
MS Word        11   100.0%
Uses            0     0.0%
Quotes         11   100.0%
The MS Word namespace (urn:schemas-microsoft-com:office:word) is found associated with the MS Office namespace mentioned above, and the sites exposing it are pretty much the same.
XSLT           11   100.0%
Uses            2    18.2%
Quotes          9    81.8%
The two well formed documents exposing the XSLT namespace (http://www.w3.org/1999/XSL/Transform) are XSLT transformations, while the other documents are mostly documentation mentioning XSLT, such as the netcrucible FAQ or James Clark's XT page.
Saxon          10   100.0%
Uses            0     0.0%
Quotes         10   100.0%
The Saxon namespace (http://icl.com/saxon) is most of the time a namespace leak in HTML documents, such as on the Systinet web site or in David Carlisle's TEX manual, the remaining references being documentation mentioning the namespace, such as my Examplotron.
Uuid            9   100.0%
Uses            0     0.0%
Quotes          9   100.0%
Last of our top 10, uuid:C2F41010-65B3-11d1-A29F-00AA00C14882 is used in FrontPage web pages such as those of the OMG web site.
Initializing the directory with the information gathered on the web and the statistics mentioned above would already be valuable. A simple XSLT transformation can present the information known about a namespace as an RDDL document, readable both by humans and by computer agents.
The first stage, using the data collected by the first version of the crawler, is pretty minimal: without any manual addition, our system would be able to present the namespace URI, its statistics and a list of resources using the namespace.
In its simplest version, the table of contents for a document could be:
With the statistics and usage showing the information retrieved on the web:
If we had run this crawl several times, trends could be given as well, which would help to evaluate the dynamics behind the namespace. For the moment, our first step to improve the document can simply be to include the comments made on the statistics:
From there, there are two main directions in which the description of the namespace can be improved: adding more related resources (such as schemas, existing RDDL documents or stylesheets) and adding more textual information.
Part of these additions could be found by an improved version of the crawler doing a specific analysis of well known document types such as the different schema flavors, XSLT transformations or RDDL documents to minimize the amount of manual work remaining to be done.
Another encouraging factor is that the number of namespaces is several orders of magnitude smaller than the number of pages on the web and that the amount of work is in no way comparable to what has been done by, let's say, the ODP and its 3,274,639 sites and 47,324 editors...
And, with a minimal amount of research and editing, our table of contents becomes:
With the addition of a simple "description" section:
And a short list of resources:
Our document now contains most of what an XLink newbie needs to start working on the subject. What can we add? Who said news? That's trivial, assuming we can find a syndication channel such as the one available on xmlhack (http://xmlhack.com/rss10.php?cat=14)!
This gives us a new section for our document:
With the latest news:
We now have a single point of entry giving a huge amount of information on XLink, leveraging what we've found on the web, existing resources such as xmlhack and a minimal amount of human intervention to glue all this together.
Search is a part for which I have no concrete material to show. However, my own experience on XMLfr is that a standard search engine on a specialized technical site gives pretty good results (and this should be the case here again), especially when it is complemented by a Topic Map to classify the resources available and help to navigate among the different topics.
In our domain (XML), such a site could leverage the work done by the OASIS XMLvoc (Vocabulary for XML Standards and Technologies) Technical Committee, whose goal is to "define a vocabulary for the domain of XML standards and technologies, which will provide a reference set of topics, topic types, and association types that will enable common access layers and thus improved findability for all types of information relating to XML, related standards, and the XML community. The vocabulary items will be defined as Published Subjects, following the recommendations of the OASIS Topic Maps Published Subjects Technical Committee", i.e., in short, to create the Topic Map we need.
Both the technology and the information are available to fix the current shortage of relevant and coherent information about XML vocabularies used on the Web.
I firmly believe that this prototype and the ideas behind it may be a foundation for a very useful site and observation platform for the use of XML on the web, and I would welcome sponsors who could help me make this happen.