XML 2001

RDF: The XML Pattern That Works for Me

Nigel W O Hutchison <Nigel.Hutchison@softwareag.com>

1. Introduction

The W3C Semantic Web Activity has the ambitious goal of designing the infrastructure for a machine-understandable, networked description of the contents of the World Wide Web. The Semantic Web is envisaged as a distributed and queryable repository of descriptive information about World Wide Web resources.

The descriptive information in the Semantic Web is, in general, not a part of the resources themselves, but references the resource. Just as on the World Wide Web an HTML document may reference another document by a hyperlink while the target document is "unaware" that it has been linked to, so on the Semantic Web descriptive information references a resource, but the resource is "unaware" that it has been described. Many HTML documents may reference the same document independently, and correspondingly many sets of descriptive data about a specific resource may exist independently of each other.

2. The Resource Description Framework

The content of the Semantic Web descriptive information is described by the World Wide Web Consortium Resource Description Framework (RDF) specifications. Two of these are particularly relevant in this context.

RDF provides primarily a logical definition of the structure of Semantic Web descriptive information and, secondarily, an XML-based exchange (or serialization) format for Semantic Web information. RDF was invented before XML, but XML appeared on the scene in time to rescue us from the original RDF serialization syntax, which was apparently derived from Lots of Irritating and Superfluous Parentheses (LISP).

Essentially, RDF is used to assign properties to resources. The property names are drawn from a domain-specific vocabulary, which is described by an RDF Schema. The values of the properties are scalar values, resource references, or structured information. Each assignment is a statement in the sense that it has a subject (the resource), a predicate (the property name), and an object (the property value, which might be another resource reference). Property names are prevented from clashing by the use of namespaces. Thus "title" in a document library vocabulary shouldn't clash with "title" in an address book vocabulary.
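The subject, predicate, and object structure described above can be sketched in a few lines of Python. This is only an illustration: the two vocabulary namespaces and the resource URIs are hypothetical names invented for the example, not part of any real vocabulary.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str    # the resource being described (a URI)
    predicate: str  # the property name, qualified by its vocabulary namespace
    obj: str        # the property value: a scalar or another resource URI


# Hypothetical vocabulary namespaces, invented for this illustration.
LIBRARY = "http://example.org/library#"
ADDRESS_BOOK = "http://example.org/addressbook#"

statements = [
    Statement("http://example.org/docs/report-42", LIBRARY + "title", "Annual Report"),
    Statement("http://example.org/people/jane", ADDRESS_BOOK + "title", "Dr"),
]

# The local name "title" occurs in both vocabularies, but the
# namespace-qualified property names are distinct.
predicates = {s.predicate for s in statements}
print(len(predicates))
```

Qualifying the property name with its namespace is what lets both vocabularies use the local name "title" without ambiguity.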

A resource reference is typically represented as a Uniform Resource Identifier (URI).

RDF is pitched at a high enough level that a wide variety of implementation architectures is possible. So far there is no standard API, though some very useful Open Source APIs like Jena are available and widely used by the RDF user community. It is reasonable to assume that most RDF implementations would be able to accept the RDF serialization format, and to emit it when required. Deep down in the implementation somewhere, the assignment of properties to resources is realized, and logically it can be perceived as a set of statements.

In practice RDF statements are bundled in Models. You can think of these as RDF databases when they are big, or as extracts, or even query results, when they are small.

I think of the Semantic Web as a network of implementation-independent RDF servers sharing the Internet with a network of World Wide Web servers.

In the years preceding the World Wide Web, queries about resources on the Internet were addressed to servers based on the Gopher technology. These servers were interoperable to the degree that it was possible to build master Gophers, which could reroute their queries to other more specialized or localized Gophers on the Net. Presumably Semantic Web repositories based on RDF implementations will be able to delegate queries to other RDF repositories. Models may subsume other models.

3. An Example

Here is an example of a serialized RDF description of a document from the Resource Description Framework (RDF) Model and Syntax Specification.

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/metadata/dublin_core#">
  <rdf:Description about="http://www.dlib.org">
    <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title>
    <dc:Description>The D-Lib program supports the community of people
     with research interests in digital libraries and electronic
     publishing.</dc:Description>
    <dc:Publisher>Corporation For National Research Initiatives</dc:Publisher>
    <dc:Date>1995-01-07</dc:Date>
    <dc:Subject>
      <rdf:Bag>
        <rdf:li>Research; statistical methods</rdf:li>
        <rdf:li>Education, research, related topics</rdf:li>
        <rdf:li>Library use Studies</rdf:li>
      </rdf:Bag>
    </dc:Subject>
    <dc:Type>World Wide Web Home Page</dc:Type>
    <dc:Format>text/html</dc:Format>
    <dc:Language>en</dc:Language>
  </rdf:Description>
</rdf:RDF>

The RDF serialization describes a resource with the URI http://www.dlib.org, using the about attribute to contain the reference. The properties of the resource belong to the Dublin Core namespace, denoted by http://purl.org/metadata/dublin_core#. Note that all of the properties (Title, Date, etc.) are single valued except one, Subject, which has multiple values, in this case with no ordering. In RDF serializations the order of the properties is not significant. If a property has multiple values, the order of the values may be chosen to be significant.

The attachment of the Title property with the value D-Lib Program - Research in Digital Libraries to the resource http://www.dlib.org corresponds notionally to the statement below.
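As a rough illustration of how a serialized description maps onto statements, the following Python sketch reads a shortened copy of the D-Lib example with the standard library's ElementTree parser and prints one (subject, predicate, object) tuple per simple property. The function name simple_statements is an assumption made for this example, and the sketch deliberately ignores structured values such as rdf:Bag; it is not a general RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A shortened copy of the D-Lib description shown above.
DOC = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/metadata/dublin_core#">
  <rdf:Description about="http://www.dlib.org">
    <dc:Title>D-Lib Program - Research in Digital Libraries</dc:Title>
    <dc:Date>1995-01-07</dc:Date>
  </rdf:Description>
</rdf:RDF>"""


def simple_statements(rdf_xml):
    """Yield (subject, predicate, object) for literal-valued properties only."""
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get("about")
        for prop in description:
            # Skip structured values (elements with children, such as rdf:Bag).
            if len(prop) == 0 and prop.text:
                # ElementTree reports tags as "{namespace}local"; join the parts.
                namespace, local = prop.tag[1:].split("}")
                yield subject, namespace + local, prop.text.strip()


triples = list(simple_statements(DOC))
for triple in triples:
    print(triple)
```

The first tuple pairs the subject http://www.dlib.org with the namespace-qualified Title property and its literal value, which is the statement the preceding paragraph refers to.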

4. What Use is this to the Developer?

Let's look at a plausible scenario and see if we can recognize the RDF paradigm.

Suppose a company receives orders from customers, delivers goods to satisfy these orders, and sends invoices to receive payment. The orders are in the form of XML messages. The messages conform to a standardized and widely used business-oriented XML Schema. In the past, orders used to come in paper format - now they come in XML format. Once they bore a hand signature - now they are electronically signed with respect to origin and content.

In fact not much has changed - the order is a document as before, and the rights and duties of customer and supplier with respect to the civil law, the tax authorities, the shareholders, etc. are the same. Operating in the global economy, the risks may be greater because the customer-supplier relationship may not be as stable and familiar as it was before electronic trading was introduced. As Clausewitz might have said, electronic commerce is the continuation of business by other means.

I would like to refer to these documents as enterprise documents. The order document exists as a concrete concept in the minds of the people who work in the enterprise. Quite a few people might claim to have seen one. On the other hand, some rows in some tables in some database that represent the order's content might mean something only to some programmers in some office. To the customer and his accountants, his order is this order.

As before, order documents are filed, but this time in a database. The original orders may have to be retrieved at any time up to the time the order is satisfied. Subsequently the document may be required again, perhaps for use by a Customer Relationship Management (CRM) system, by the auditors, by the company's lawyers to help them sue for payment, or by the customer's lawyers to show that there is no case to answer. The order-dependent information created during the order/delivery/payment life cycle will have to be recorded for each order.

Let's first look at the order document. It may specify what time the order was sent, and the urgency of the order, among many other things. The document very soon accumulates properties that do not correspond directly to the document content.

In the original business the order was a piece of paper and might have been circulated around the plant, slowly accumulating rubber stamp marks, signatures, and stapled attachments. In the electronic commerce scenario it is essentially the same story.

There are simple new properties that are independent of the content - for example the date and time the document was received, the internal order number, and the URI of the document in the database.

There are emergent properties derived from the content, such as the department to which the order is to be routed and the urgency of the order. But didn't we say that the urgency was represented in the content? Yes, it was, and presumably the committee that hammered out the XML Schema discussed what urgency values really mean. What urgency for this order means to the supplier, however, may be derived from many other inputs, including how important the customer is, what his payment record has been, and how angry the customer's purchasing officer was when he complained about the last delivery. The CRM system might give you some input on that.

The document is being classified by the system that is processing it. In real life, in contrast to Object Orientation, membership of classes is not an intrinsic property of a thing itself but is superimposed by any system that observes it. This is essentially the RDF approach also.

There are properties that the order document acquires, depending on what happened to the order. In one of the simplest scenarios, the order could be rejected because the product code does not relate to a product currently on offer. In the optimal course of events the order would be fulfilled, and the document would accumulate historical processing properties.

At any time during the processing of the order, or even at any indefinite time in the future, someone in the organization may want to ask the question "Tell me everything about that order" - and that would include the (provably) original order and its properties.

In addition, the original order should be locatable not only by content but also by querying the properties the order document has acquired.

5. What to do with the Properties?

In the paper-based system the document might be stamped with a date and a signature to record receipt. In principle one could add the properties to the document schema in the form of new attributes or elements from another namespace. But this introduces a set of problems.

If the document instances are signed, then the signatures may no longer be verifiable. If the order documents correspond to more than one XML Schema, then I may have to repeat this work for each schema. Do I want to give write access to the enterprise documents to various applications in my organization? What if other parts of the organization want to assign their own properties to an order, unbeknown to me?
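The signature problem can be illustrated with a small Python sketch. It uses a plain SHA-256 digest as a stand-in for a real electronic signature, which is a simplification: real XML signatures involve keys and canonicalization, but they share the property that changing the signed content invalidates the check. The order content and the received attribute are invented for the example.

```python
import hashlib

# A hypothetical order document as it was received and "signed".
order = b'<order><item code="A17" qty="3"/></order>'
digest_at_signing = hashlib.sha256(order).hexdigest()

# Embedding a processing property in the document changes its bytes.
annotated = order.replace(b"<order>", b'<order received="2001-01-07">')

unchanged_ok = hashlib.sha256(order).hexdigest() == digest_at_signing
annotated_ok = hashlib.sha256(annotated).hexdigest() == digest_at_signing
print(unchanged_ok, annotated_ok)
```

The untouched document still verifies, while the annotated copy does not, which is one reason to keep acquired properties outside the document.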

The cleanest concept is storing the properties separately. I could design a private XML schema to describe the document properties. I then use this schema to define a database container for the metadata instances. Each metadata instance contains a reference to an order document. I require, at most, read-only access to the original data. I can write a style sheet to repurpose the metadata XML and the document it describes into a report. I can code up XML Queries to retrieve metadata instances, which lead me to the order documents I am seeking.

But why reinvent the wheel? This is what RDF was designed for. The order document corresponds to the RDF subject resource, and the document properties to RDF properties.
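To make the idea of separately stored, queryable properties concrete, here is a minimal in-memory sketch in Python. The Model class and its add and find methods are hypothetical names chosen for this illustration and do not correspond to the API of Jena or any other RDF implementation. The namespace and the first order URI are the ones used in the serialization in section 7; the second order is invented.

```python
class Model:
    """A tiny in-memory bundle of (subject, predicate, object) statements."""

    def __init__(self):
        self._statements = set()

    def add(self, subject, predicate, obj):
        self._statements.add((subject, predicate, obj))

    def find(self, subject=None, predicate=None, obj=None):
        """Return matching statements, sorted; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            s for s in self._statements
            if all(p is None or p == v for p, v in zip(pattern, s))
        )


MC = "http://www.mycompany.com/mydepartment/namespaces#"
ORDERS = "http://mysystem.mycompany.com/myXMLDB/entdocs/order/"

model = Model()
model.add(ORDERS + "1234-5678-8AQBY", MC + "priority", "special")
model.add(ORDERS + "1234-5678-8AQBY", MC + "received", "2001-01-07")
model.add(ORDERS + "0000-0000-DEMO", MC + "priority", "normal")  # hypothetical second order

# Locate order documents by an acquired property ...
special = [s for s, _, _ in model.find(predicate=MC + "priority", obj="special")]
# ... or ask for everything recorded about one order.
everything = model.find(subject=ORDERS + "1234-5678-8AQBY")
print(special)
print(everything)
```

The order documents themselves are never touched: the model holds only their URIs, so read-only access to the originals is sufficient.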

6. What an RDF Implementation would give You

Would I create an RDF implementation to solve this particular technical problem? Probably not. Would I use an RDF implementation if it were available? Yes, if I thought that I could use it again.

7. The Properties of the Order

Here is a possible serialization of the RDF properties of an order.

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:mc="http://www.mycompany.com/mydepartment/namespaces#">
  <rdf:Description
    about="http://mysystem.mycompany.com/myXMLDB/entdocs/order/1234-5678-8AQBY">
    <mc:received>2001-01-07</mc:received>
    <mc:orderNo>1234-5678-8AQBY</mc:orderNo>
    <mc:authenticated>2001-01-07</mc:authenticated>
    <mc:customerNo>CuDE36737</mc:customerNo>
    <mc:priority>special</mc:priority>
    <mc:invoice resource="http://mysystem.mycompany.com/myXMLDB/entdocs/order/1234-5678-8AQBY"/>
    ......
  </rdf:Description>
</rdf:RDF>

It is worth mentioning at this point that an RDF resource reference, being a URI, could reference, using an XPointer extension, an individual element in an order document; for instance an order item. Not only documents but elements can acquire properties.

8. Re-using the Technique

If I can attach properties to order documents, I can attach properties to other document types, for example invoices and customer complaints. I might use a different RDF schema but re-use the same instance of the RDF container. This works because namespaces keep RDF property names from clashing.

Because RDF property values can be resource URIs, RDF properties can be used to relate one document to another. The order document can be related to the invoice for payment or to the customer profile. Here RDF is being used to express relationships or dependencies.

Should others in the organization re-use the technique and implementation, they can operate independently from me even if they are operating on the same documents. Why indeed should Customer Relationship Management tell me that they have assigned properties to orders for their own purposes?
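A short Python sketch of this independence follows. It assumes, purely for illustration, that each department keeps its own set of statements and that the CRM group uses a vocabulary namespace of its own; the CRM namespace and the "key-account" value are invented.

```python
ORDER = "http://mysystem.mycompany.com/myXMLDB/entdocs/order/1234-5678-8AQBY"
MC = "http://www.mycompany.com/mydepartment/namespaces#"
CRM = "http://www.mycompany.com/crm/namespaces#"  # hypothetical CRM vocabulary

# Each group records statements about the same order without coordinating.
order_processing = {(ORDER, MC + "priority", "special")}
customer_relations = {(ORDER, CRM + "priority", "key-account")}

# Combining the two sets loses nothing: the qualified property names differ.
combined = order_processing | customer_relations
about_order = sorted(p for s, p, _ in combined if s == ORDER)
print(about_order)
```

Both groups use the local name "priority", yet the combined set keeps the two statements apart, so neither group needs to know what the other has recorded.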

9. What is the Pattern?

Instances of a document type are persistent, and to most intents and purposes, read only. During its lifetime, each document acquires properties which are not expressible by changes in its content. More than one set of properties may exist independently. Document retrieval by property and values is important. The document collection is large.

Examples of document types that would fit this pattern include orders, invoices, financial reports, minutes of meetings, emails, graphics or documents where retrieving by content is hard and where classification properties are required.

Another situation that is particularly appropriate is one where the document collection is heterogeneous but the emergent properties are similar. A cache of mixed graphic, PDF, HTML, and Microsoft Office documents might need to share properties drawn from the same vocabulary, such as Title, Subject, or Creator, derived from their proprietary internal metadata or imposed and augmented by human classification.

The pattern is extendable to documents which may be frequently updated - like a patent record. But care must be taken that the emergent properties and values don't get out of date. In the RDF scenario the resource doesn't "know" it is being described and, as a consequence, can't notify an RDF implementation that a change has been made.
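One possible way to notice that properties may have gone stale, offered here only as a sketch and not as something the RDF specifications prescribe, is to record a fingerprint of the document content alongside the derived properties and compare it on later access. The function names and the patent record content below are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 digest of the document content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def is_stale(recorded_fingerprint: str, current_content: bytes) -> bool:
    """True if the content no longer matches the fingerprint taken earlier."""
    return fingerprint(current_content) != recorded_fingerprint


original = b"<patent><status>pending</status></patent>"
recorded = fingerprint(original)  # stored with the properties when they were derived

updated = b"<patent><status>granted</status></patent>"
print(is_stale(recorded, original), is_stale(recorded, updated))
```

The check has to be initiated by the side holding the properties, which matches the point above that the resource itself cannot send a notification.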

10. A Suitable RDF Implementation

To be useful in an industrial strength setting, like the one sketched above, an RDF implementation would have to meet some basic criteria.

11. Further Uses for RDF

In this paper I have described a fairly simple pattern for the use of RDF in building XML Applications. I am certain other use patterns will emerge.

One interesting approach would be the definition of RDF schematic information applicable to a document type with its XML schema annotation or Schema Adjunct.

In all the preceding discussion RDF properties are only applied to non-RDF resources. Usage of RDF properties has been essentially "flat". In RDF a statement can be the subject or object of another statement. This has enormous implications - a set of properties and values in an RDF model can have its own identity and not just be "about" a resource. This allows the application to define new structures within an RDF model. These structures can have RDF properties assigned to them which are not part of their information content.

Readers interested in pursuing this might be interested in the activities of the Web Ontology Working Group (http://www.w3.org/2001/sw/WebOnt/).
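The idea of a statement being the subject of another statement can be sketched with the reification vocabulary of the RDF Model and Syntax Specification (rdf:Statement, rdf:subject, rdf:predicate, rdf:object, rdf:type). In the Python sketch below the statement identifier, the assignedBy property, and its value are hypothetical, and plain tuples stand in for a real RDF model.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MC = "http://www.mycompany.com/mydepartment/namespaces#"
ORDER = "http://mysystem.mycompany.com/myXMLDB/entdocs/order/1234-5678-8AQBY"


def reify(statement_id, subject, predicate, obj):
    """Describe one statement as a resource with its own identity."""
    return {
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    }


STATEMENT_ID = "http://mysystem.mycompany.com/statements/1"  # hypothetical identifier

model = reify(STATEMENT_ID, ORDER, MC + "priority", "special")
# The statement itself can now acquire properties, here a hypothetical "assignedBy".
model.add((STATEMENT_ID, MC + "assignedBy", "order-desk"))
print(len(model))
```

The added property describes the act of assigning the priority, not the order document, which is the kind of structure the flat usage in the earlier sections cannot express.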

12. Conclusion

Implementations based on the Resource Description Framework can find a useful place in the XML application builder's toolbox, and very probably in his or her departmental budget. RDF expresses a significant use pattern for the building of XML applications. Use of RDF in various metadata scenarios is increasing, and building RDF implementations does not seem to be very difficult.

World Wide Web technology development has been aided in its success by the fact that it scales down from the Internet to the intranet, departmental and application levels. The World Wide Web was based on well known concepts and technologies re-engineered for the Internet.

It is the author's view that RDF shares many of these characteristics with what has become now traditional World Wide Web Technology.

13. Acknowledgments

I would like to thank Andy Bove and Jonathan Robie, both from Software AG, and Josef Dietl from Mozquito Technologies, for valuable insights which I have incorporated into the text. I would also like to thank all the subscribers to the various RDF mailing lists who fill up my mailbox every night.

Glossary

URI

Uniform Resource Identifier

ACID

Atomic, Consistent, Isolated, and Durable

CRM

Customer Relationship Management

LISP

Lots of Irritating and Superfluous Parentheses

RDF

Resource Description Framework

Biography

Nigel W O Hutchison
Chief Architect
Software AG
Darmstadt
Germany
Email: Nigel.Hutchison@softwareag.com Web: http://www.softwareag.com

I was born and brought up in Edinburgh, Scotland, graduating with a BSc in Chemistry from Edinburgh University. Since then I have worked in computing with G.E.C., Standard Telephones and Cables, Digital Applications International, and, from 1989, Software AG Germany. My entry into the SGML/XML world was via Software AG's Text Retrieval System, which is widely used by large commercial and governmental institutions. I was one of the group that started the project which has appeared on the market as Tamino, the XML Server.