On mapping from colloquial XML to RDF using XSLT

C. M. Sperberg-McQueen
Eric Miller

Abstract

XML vocabularies can be characterized as those designed for the convenience of authors or software developers, called colloquial, and those designed to have a trivial mapping to a non-XML data structure, which we call non-colloquial. Mapping colloquial vocabularies into other formats (e.g., symbolic logic or RDF) is a powerful tool for making colloquial XML tractable. Specifying this mapping is a way of documenting what the elements and attributes are supposed to mean and how they are to be used. If this is done only in English prose, humans can make use of it, but not machines. If machine-readable syntax is used to specify a mapping from the XML vocabulary into some well-known target syntax, the mapping can benefit both humans and machines. Simple examples illustrate how mappings can be defined using XSLT and how they can be attached to the schema defining the XML vocabulary.

Keywords: RDF; XSLT; Mapping

C. M. Sperberg-McQueen

C. M. Sperberg-McQueen is a member of the technical staff at the World Wide Web Consortium; he chairs the W3C XML Schema Working Group and XML Coordination Group.

Eric Miller

Eric Miller is the Activity Lead for the World Wide Web Consortium's (W3C) Semantic Web Initiative.

His responsibilities include architectural and technical leadership in the design and evolution of Semantic Web infrastructure. They also include working with W3C Working Group members so that both working groups in the Semantic Web activity, as well as other W3C activities, produce Web standards that support Semantic Web requirements; building support among user and vendor communities for the Semantic Web by illustrating the benefits to those communities and the means of participating in the creation of a metadata-ready Web; and establishing liaisons with other technical standards bodies involved in Web-related technology to ensure compliance with existing Semantic Web standards and to collect requirements for future W3C work.

Before joining the W3C, Eric was a Senior Research Scientist at OCLC Online Computer Library Center, Inc. and the co-founder and Associate Director of the Dublin Core Metadata Initiative, an open forum engaged in the development of interoperable online metadata standards that support a broad range of purposes and business models.

Eric holds a Research Scientist appointment at MIT's Laboratory for Computer Science.

C. M. Sperberg-McQueen [World Wide Web Consortium, MIT Computer Science and AI Laboratory]
Eric Miller [World Wide Web Consortium, MIT Computer Science and AI Laboratory]

Extreme Markup Languages 2004® (Montréal, Québec)

Copyright © 2004 C. M. Sperberg-McQueen and Eric Miller. Reproduced with permission.

Introduction

Let us begin by making explicit some of our assumptions, which may or may not be fully shared by others.

The mapping problem

An application of XML or SGML defines what some people call a markup language, and other people would prefer to refer to as a markup vocabulary or namespace. Since some people prefer to reserve the term markup language for meta-languages like XML and SGML, the following discussion will use the term vocabulary — without, however, intending to obscure the fact that the XML-based applications in question do have rules that go beyond the provision of names and may be captured in whole or in part by syntactic formalisms.

Many people (vocabulary designers, schema and DTD authors, application developers, people trying to make it easier to work with documents in markup languages designed by others, and no doubt others, too) wish to say, for specific constructs in a vocabulary, what they mean. By the constructs of a vocabulary we mean primarily the element (type)s, attributes, notations, processing-instruction targets, and entities defined in that vocabulary; in some cases, it is convenient also to include simple or complex datatypes and substitution groups (as in XML Schema 1.0), non-terminals (as in Relax and Relax NG), classes (as in the ODD system used to generate the Text Encoding Initiative DTDs), or other abstractions under this term.

Some of those who wish to say what markup constructs mean wish to do so using some machine-processable notation; others would be happy with better tools for human-understandable documentation. We are here concerned mostly with the former, though good rules for machine-processable specification of meaning may also help make meaning clear to humans.

Two difficulties attend any effort to say what markup constructs mean. First of all, different people have very different ideas of what would be involved. And second, if such an attempt is not to remain a purely individual mental exercise, the results must be written down or spoken in some language with its own syntactic rules. What may have started as an attempt to focus on semantics to the exclusion of syntax thus concludes by looking like just another translation from one syntax to another. Let us examine these two difficulties in more detail.

First, different people have very different ideas of what it would mean, for the constructs of a vocabulary, to say what they mean. For purposes of discussion, we identify five. The first four are all mapping problems in one way or another:

  • Some people mean by this that they wish to be able to specify how data structures internal to some application software are serialized as XML, or how XML is de-serialized into data structures; questions like "When does an element become an object of class Foo, and when does it become an object of class Foobar?", asked with reference to some set of object classes defined in some programming language, are central to their concerns. Call this the concrete data-structure mapping problem. An example of this approach is [Krupnikov/Thompson 2001].
  • Others wish to specify how to map XML document instances into columns, rows, and tables in some SQL database management system; sometimes they wish to specify a mapping into new rows of existing tables, and sometimes what is needed is a mapping which would specify which new tables to create. Call this the abstract data structure mapping problem. It differs from the concrete data structure mapping problem as the abstraction of a SQL table differs from the various programming-language constructs which might be used to implement the abstraction. See [Vorthmann/Buck 2000a], [Vorthmann/Buck 2000b].
  • Still others wish to specify a mapping into first-order predicate calculus as a way of defining the correct interpretation of markup. Call this the FOPC mapping problem. Cf. [Sperberg-McQueen/Huitfeldt/Renear 2001a].
  • Some wish to map arbitrary XML into RDF.[1] Call this the RDF mapping problem. See for example [Hazaël-Massieux/Connolly 2004].

These four mapping problems seem to cover the most frequently discussed ground among those interested primarily in machine-processable descriptions of meaning, but we have no proof that the classification is necessarily exhaustive, and nothing in the further argument requires that it be exhaustive.

Some people believe that the four mapping problems described above do not necessarily have much in common. Others believe that all of them are at root ‘the same thing’. Mostly, they seem to mean by this that if a formalism is provided for what they wish to do, they believe that everyone else's requirements will be met. They do not, in general, seem to mean that if anyone else's requirements are met, they will be able to do what they wish to do.

The four mapping problems identified above do have in common that they involve defining a meaning-preserving mapping from XML notation into some other model. Let us call this other model the target model. If the target model has a syntax in which it can be serialized, let us call that syntax the target formalism. We will be concerned only with target models which can be serialized in this way; it may be possible to extend our proposals to some models with non-serial notations, but not to ineffable models (those to which no notation at all is adequate).

If the target model has a corresponding target formalism, then all four of the mapping problems can be conceived of as involving the translation of information from one syntax (XML) into some other syntax. A mapping problem may thus be conceived of as a syntax-to-syntax translation even if, in practice, the result desired is not a string of characters denoting some abstraction, but some other representation of the abstraction (such as an in-memory data structure).
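
As an illustration only, the following Python sketch maps one small XML fragment into three different target formalisms: an in-memory object, a predicate-calculus-like string, and a list of triples. Everything specific in it is a hypothetical stand-in chosen for the illustration: the `meeting` element and its attributes, the predicate names, and the `http://example.org/ns#` prefix. It makes no claim to be a complete or meaning-preserving mapping; it only shows that each target can be produced by a syntax-driven translation of the same source.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical source fragment; this vocabulary is invented for the example.
SOURCE = '<meeting id="m1" date="2004-03-23" chair="AB"/>'


@dataclass
class Meeting:
    """Target 1: an in-memory data structure (concrete data-structure mapping)."""
    id: str
    date: str
    chair: str


def to_object(elem):
    """Map the element's attributes onto the fields of a Meeting object."""
    return Meeting(elem.get("id"), elem.get("date"), elem.get("chair"))


def to_fopc(elem):
    """Target 2: a conjunction of predicates, serialized as plain text."""
    i = elem.get("id")
    date = elem.get("date")
    chair = elem.get("chair")
    return f"meeting({i}) & date({i}, '{date}') & chair({i}, '{chair}')"


def to_triples(elem, ns="http://example.org/ns#"):
    """Target 3: RDF-style (subject, predicate, object) triples."""
    subject = ns + elem.get("id")
    return [(subject, ns + name, value)
            for name, value in elem.attrib.items()
            if name != "id"]


elem = ET.fromstring(SOURCE)
as_object = to_object(elem)
as_fopc = to_fopc(elem)
as_triples = to_triples(elem)
```

All three functions read the same parsed element and differ only in the notation they emit, which is the sense in which the mapping problems can be treated as translations between syntaxes.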

A fifth idea of what it means to describe the meaning of a vocabulary should also be mentioned:

  • Some wish to communicate enough information about each construct in a vocabulary to other human beings to enable them to recognize and use the elements and attributes correctly. Call this the documentation problem.

The documentation problem is known to be soluble, but the solution is not easy: intelligent humans must write clear natural-language descriptions of the vocabulary, and attentive humans must read them and interpret them correctly. This is straightforward but not automatable. Numerous vocabularies for describing markup vocabularies have been developed and used over the years; their use may make the construction of useful, fairly complete documentation easier, but they cannot make it mechanical. Nothing in this paper reduces the importance of the documentation problem or makes it any easier to solve.[2]

Any solution to the documentation problem produces, by definition, correct understanding on the part of a hearer or reader. Since such understanding is a prerequisite to the creation of any meaning-preserving transformation or mapping, it may be noted that the solution of the documentation problem appears to be a prerequisite to any solution of any of the mapping problems, except where the mapping problems are solved by the original specifier of a vocabulary without the need for communication with any other humans. The converse is not true: solutions to the mapping problems are neither prerequisites nor necessary consequences of solutions to the documentation problem. Solving the mapping problems, on the other hand, would make it possible to perform more useful work with marked up data without involving the need for quite so many attentive human programmers. This might be advantageous because attentive human programmers are commonly in short supply.

The fifth idea of meaning brings us to the second difficulty identified above. Reducing the mapping problem to a translation problem operating at the level of syntax may trouble some, particularly those interested in the documentation problem. If we regard the realm of meanings as distinct in some way from that of utterances or syntax, we are bound to be disappointed in a mapping-oriented solution, because the solution seems to shift ground from the ethereal to the mundane. But strictly speaking, any useful formulation of semantics is reducible in this way to a problem operating at the level of syntax. Any attempt to say what markup means necessarily involves constructing some utterance in some perceivable form. That utterance can only be described and interpreted in terms of some syntax: without syntax, all meanings are ineffable.

The involvement of the syntactic layer does not (pace some reviewers of this paper) render the mapping problems mentioned above meaningless, nor does it divorce them from meaning. The mapping problems are not solved by arbitrary mappings from XML into the target models, but only by mappings which retain the meaning of the original.[3] (In specialized cases, it may suffice for practical purposes to capture only part of the meaning.) It is easy to dismiss these as merely pushing a bump in the rug from one location to another: having translated from XML into some other notation, are we not still faced with the task of specifying the meaning of that other notation? In cases where the target notation is as opaque to us as the original notation, the criticism has some justice. When the target notation is well understood, however, the translation does precisely what is needed. And we stress again: every successful explanation takes the form of translation from one syntax into another. The documentation problem is also a mapping problem and differs from the others only in substituting the syntax of English or French or some other natural language for the machine-processable target syntaxes of the other views. On the positive side, the fact that all specification of semantics is thus reducible to a problem in specifying syntactic transformations means that we can directly exploit without embarrassment the long history of work on mechanisms for syntax-driven transformations of marked up data.

Since in the long run every notation must be explained to be useful, it is an inescapable prerequisite for any useful work with marked up data that the documentation problem be solved for some notation or other. Since in the long run one of the main reasons for using markup is to reduce the need for human intervention in routine information processing, however, solving the documentation problem alone will not suffice to allow us to exploit markup to full advantage. Hence this paper's emphasis on machine-processable target notations.

Colloquial XML and non-colloquial XML

Some applications of XML in use today obey strict rules for mapping XML constructs into constructs in some underlying data model non-isomorphic to XML. RDF is an example: every XML construct in an RDF data stream maps into an RDF triple, a part of a triple, or a set of triples, using relatively straightforward rules. Similarly, every XML construct in TEI feature system markup maps into a feature structure, a feature, a value of a feature, or a set of feature structures, following simple and unvarying rules (see [ACH/ACL/ALLC 1994] or [Langendoen/Simons 1995]). XML in Layman normal form, or in any of the normal forms distinguished by Henry Thompson [Thompson 2001], has a simple mapping into labeled graph structures.

Many applications of XML in use today emphasize convenience for authors or software developers over simplicity of the mapping to any underlying data model. Some applications do not specify any underlying model different from the basic XML data model of nodes in a tree, with arbitrary links expressed by ID/IDREF links or by application-level information. TEI, HTML, and DocBook are examples of such applications. Following Noah Mendelsohn, we refer to the XML used by applications of this sort as colloquial XML. This allows us to suggest the term non-colloquial XML for XML whose structure is dictated by the desire to have a trivial mapping to a non-XML data structure.

Even in the case of non-colloquial XML, the mapping problems outlined above may be worth contemplation. Obviously, one will seldom need to solve the RDF mapping problem for information already in RDF (although it may be interesting to consider the problem of mapping RDF into different RDF with similar or identical semantics, e.g. as part of normalization). But mapping RDF into concrete or abstract data structures or into predicate calculus may easily become complex enough that it will be convenient to have tools to make the mapping easier to understand and specify. Similar considerations apply to all the output formats.

In the remainder of this paper, however, we focus on mapping colloquial XML into other formats, specifically logical form and RDF. Defining such mappings for colloquial XML helps clarify the intended semantics of the markup (at least for readers who can understand the target notations) and encourages vocabulary designers to be explicit about distinctions which matter for such semantically based mappings. And by mapping XML documents from specialized vocabularies into a common underlying data model, we can make it easier to merge information from multiple sources and to reuse the information represented in XML documents. At the moment, such merger and reuse requires intervention by humans who understand the semantics of the source markup, which may or may not be well documented and may or may not have been followed correctly by the data provider. Every step we can take toward making it easier to capture the meaning of markup vocabularies in machine-tractable form is a step toward better tools for performing such mergers, for managing evolution of vocabularies and data, and for building more robust systems. Mapping from arbitrary XML into semantically equivalent logical form or RDF is one such step.

The mapping problem as a schema annotation problem

Some people may take the view that the mapping problem is an illusion which only arises because XML vocabulary designers have fallen into bad habits owing to XML's lack of any binding semantic model. If designers would refrain from using colloquial idioms in XML, this line of reasoning goes, and if they would instead simply use a particular non-colloquial vocabulary or design their vocabularies within a particular non-colloquial meta-vocabulary, then there would be no mapping problem. Some (early) discussions of RDF seem to take this view fairly explicitly.

We do not share this view, because we believe that colloquial and non-colloquial XML have different strengths and are suitable for different uses. In particular, we note that many users of colloquial vocabularies are empirically unwilling to abandon them in favor of any of the non-colloquial vocabularies currently on offer.

Others may believe that the mapping problem (or, strictly speaking, its solution) is fundamentally a problem of language design and use. What is needed, on this account of things, is a language in which to describe the mapping from XML to the target model; the problem, in this view, is to design and use a language for describing mappings. Traces of this view can be found in the so-called “Cambridge Communiqué” [Swick/Thompson 1999] and in some discussions of it. From this point of view, the problem is simply: to design a language for use inside an XML Schema xsd:annotation element to specify the mapping of an XML vocabulary into a preferred semantic notation.

We are sympathetic to this view, but without being wholly committed to it. We note:

  • Since the mapping problems can be reduced to problems of translating information from one syntax to another, it is strictly speaking unnecessary to design a new language: existing languages for transforming XML documents into new XML vocabularies or into non-XML syntaxes can be used to specify mappings. [Ogbuji 2001] and [Hazaël-Massieux/Connolly 2004] illustrate this point.
  • Even if existing languages are more verbose, more general, and more complex than necessary or desirable for a pure mapping language, any design of a mapping language should be based on the analysis of actual mappings.

Getting a better grip on our information, knowing what it means and what it is and is not plausible to do with it, is an important part of building better information systems. The ability to make explicit more of the meaning of a vocabulary is useful in allowing members of distributed communities to work independently of each other, adding information to common resources and changing the form in which information is represented without requiring that all existing information stores be retrofitted with the new form of representation. More explicit semantics is also important in maintaining the integrity of information. The ability of a vocabulary designer or schema author to ‘annotate’ a schema document with information about how to map from the vocabulary into one or more chosen target syntaxes is the focus of this paper.

In this paper we walk through one simple example of mapping from a colloquial XML vocabulary into two non-colloquial notations: first-order predicate logic and RDF. We explore the use of XSLT as a language for specifying mappings from colloquial XML to logic and RDF, describe some methods for associating such mappings with XML Schema documents, and provide input into the design requirements of any future mapping language.

A simple example

In order to keep things simple, we choose first a very simple example of colloquial XML.

The vocabulary

The DTD

The source material we are interested in is an XML representation of a time log. The vocabulary is very simple; in its entirety, the DTD reads:

Figure 1: DTD for time log data [File timelog.dtd]
<!--* Timelog.dtd: record time periods spent, by project and category.
    * 
    * Revisions:
    * 2002-01-04 : made DTD to move existing data into XML
    *-->
<!--* To do:
    * allow paragraphs of prose annotation in TEI Lite or HTML.
    *   (this would require DTD modules for paragraphs and phrase-level
    *   elements. Might be a good test of DTD modularization)
    *-->

<!ENTITY % kw.DATE "NMTOKEN">
<!ENTITY % kw.TIME "NMTOKEN">
<!ENTITY % kw.N    "NMTOKEN">

<!ENTITY % a.daycounts "
          workdays %kw.N;    #IMPLIED
          holidays %kw.N;    #IMPLIED
          satsuns  %kw.N;    #IMPLIED
">

<!ELEMENT timelog (p*, period+) >
<!ATTLIST timelog  %a.daycounts;
          label    CDATA   #IMPLIED
          xmlns    CDATA   "http://example.org/mcxrx/timelog#">
<!ELEMENT period  (p*, (logentry* | period*)) >
<!ATTLIST period   %a.daycounts;
          label    CDATA   #IMPLIED>
<!ELEMENT logentry (#PCDATA) >
<!ATTLIST logentry
          date     %kw.DATE; #REQUIRED
          start    %kw.TIME; #REQUIRED
          end      %kw.TIME; #REQUIRED
          dur      %kw.N;    #REQUIRED
          project  NMTOKEN   #REQUIRED
          category NMTOKEN   #REQUIRED
>
<!ELEMENT p  (#PCDATA) >

For purposes of the discussion, we imagine (counterfactually) that this DTD is made available on the Web at the URI http://example.org/mcxrx/timelog.dtd .[4]

The schema is equally simple:

Figure 2: XML Schema document for time log data [File timelog.xsd]
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://example.org/mcxrx/timelog#"
  xmlns="http://example.org/mcxrx/timelog#"
  elementFormDefault="qualified" 
>
 <xsd:annotation>
  <xsd:documentation>
   A simple schema for timelog data: record time periods spent, by
   project and category.

   Revisions:
   2004-03-24: specify schema
   2002-01-04: made DTD to move existing data into XML
  </xsd:documentation>
 </xsd:annotation>

 <xsd:annotation>
  <xsd:documentation>
    To do:
    * allow paragraphs of prose annotation in TEI Lite or HTML.
      (this would require schema modules for paragraphs and phrase-level
      elements. Might be a good demo of schema modularization)
  </xsd:documentation>
 </xsd:annotation>

 <xsd:complexType name="container">
  <xsd:sequence>
   <xsd:element ref="p" minOccurs="0" maxOccurs="unbounded"/>
   <xsd:choice>
    <xsd:element ref="period" minOccurs="0" maxOccurs="unbounded"/>
    <xsd:element ref="logentry" minOccurs="0" maxOccurs="unbounded"/>
   </xsd:choice>
  </xsd:sequence>

  <xsd:attribute name="label" type="xsd:string"/>
  <xsd:attribute name="workdays" type="xsd:nonNegativeInteger"/>
  <xsd:attribute name="holidays" type="xsd:nonNegativeInteger"/>
  <xsd:attribute name="satsuns"  type="xsd:nonNegativeInteger"/>
 </xsd:complexType>

 <xsd:complexType name="toplevelContainer">
  <xsd:complexContent>
   <xsd:restriction base="container">
    <xsd:sequence>
     <xsd:element ref="p" minOccurs="0" maxOccurs="unbounded"/>
     <xsd:choice>
      <xsd:element ref="period" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element ref="logentry" minOccurs="0" maxOccurs="0"/>
     </xsd:choice>
    </xsd:sequence>
   </xsd:restriction>
  </xsd:complexContent>
 </xsd:complexType>

 <xsd:complexType name="logentry" mixed="true">
  <xsd:sequence/>
  <xsd:attribute name="date" type="xsd:date"/>
  <xsd:attribute name="start" type="xsd:token"/>
  <xsd:attribute name="end" type="xsd:token"/>
  <xsd:attribute name="dur" type="xsd:nonNegativeInteger"/>
  <xsd:attribute name="project" type="xsd:token"/>
  <xsd:attribute name="category" type="xsd:token"/>
 </xsd:complexType>

 <xsd:element name="timelog" type="toplevelContainer"/>
 <xsd:element name="period"  type="container"/>
 <xsd:element name="p">
  <xsd:complexType mixed="true"/>
 </xsd:element>
 <xsd:element name="logentry"  type="logentry"/>
</xsd:schema>
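
The content model of the `container` and `toplevelContainer` types in Figure 2 can be checked by hand in a few lines. The Python sketch below is an illustration under stated assumptions, not a substitute for a schema validator: it uses only the standard library (which performs no XSD validation), it looks only at element order and nesting, and it ignores attributes and datatypes. The function name `check_container` and the two test documents are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://example.org/mcxrx/timelog#"  # target namespace from Figure 2
# ElementTree writes namespace-qualified names as "{namespace}local".
P, PERIOD, LOGENTRY = (f"{{{NS}}}{name}" for name in ("p", "period", "logentry"))


def check_container(elem, top_level=False):
    """Return a list of content-model problems in elem and nested periods."""
    problems = []
    tags = [child.tag for child in elem]
    # The sequence starts with zero or more p elements.
    while tags and tags[0] == P:
        tags.pop(0)
    # What remains must be all period or all logentry (the xsd:choice).
    kinds = set(tags)
    extra = sorted(kinds - {PERIOD, LOGENTRY})
    if extra:
        problems.append(f"{elem.tag}: unexpected or misplaced children {extra}")
    elif len(kinds) == 2:
        problems.append(f"{elem.tag}: mixes period and logentry children")
    elif top_level and LOGENTRY in kinds:
        # toplevelContainer restricts logentry to maxOccurs="0".
        problems.append(f"{elem.tag}: logentry not allowed directly in timelog")
    for child in elem:
        if child.tag == PERIOD:
            problems.extend(check_container(child))
    return problems


# Hypothetical documents: one that follows the content model, one that does not.
GOOD = (f'<timelog xmlns="{NS}"><p>March</p><period><period>'
        '<logentry date="2004-03-23" start="17.01" end="17.32" dur="31" '
        'project="xmlschema" category="agenda">revising ten-week plan</logentry>'
        '</period></period></timelog>')
BAD = (f'<timelog xmlns="{NS}"><period><period/>'
       '<logentry date="2004-03-23" start="17.01" end="17.32" dur="31" '
       'project="xmlschema" category="agenda"/></period></timelog>')

good_problems = check_container(ET.fromstring(GOOD), top_level=True)
bad_problems = check_container(ET.fromstring(BAD), top_level=True)
```

Because the schema sets elementFormDefault="qualified", the sketch compares namespace-qualified names; elements outside the timelog namespace are reported as unexpected children.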

The meanings of these constructs can be paraphrased informally thus:
timelog

a quantity of timelog data; in practice, in this vocabulary this is usually a log for a single calendar month, and invariably a log for a single individual. Attributes include:
label

a human-readable label identifying the period in question; not guaranteed machine-parseable

workdays

the number of working days in the period; this is used for calculating the average hours per work week spent on given tasks. In some sense this is redundant: a sufficiently intelligent system could examine the date labels occurring and figure out how many of them are workdays. On the other hand, if a work day passes without any work being done, and thus without any entries in the time log, that work day nevertheless needs to be counted in the arithmetic. In practice, this attribute is often not specified, and inaccurate when specified.

holidays

the number of holidays in the period

satsuns

the number of weekend days in the period

period

an arbitrary subunit of a time log; in practice, this often groups together the log entries for a given week or day; periods can nest. Attributes are as for timelog; there is no conceptual difference between the two, the outermost period being identified as a timelog rather than a period solely for convenience in processing.

p

a paragraph of human-readable prose describing or commenting on the work done in a given period

logentry

a single entry in the time log, describing a period of time (hereinafter a chunk of time) dedicated to a single task (or, in degenerate cases, a period of time treated as a chunk despite being devoted to multiple tasks, or despite lacking any useful information about what the time actually went to). Legal attributes are:
date

the date on which the chunk occurred, in ISO 8601 (yyyy-mm-dd) form; each chunk occurs on exactly one date. Date boundaries are in local time.

start

the time at which the chunk started, in local time. In practice, this is normally the same as the end time of the preceding entry in the log; the default display styles for this vocabulary check for this condition and flag discontinuities visually, since they may indicate errors in the data. The value is not in ISO 8601 format because it lacks a seconds field; also, in practice the existing application uses a full stop rather than a colon as an hour-minutes separator.

end

the time at which the chunk ended, in local time. In practice, usually the same as the start time of the following entry in the log.

dur

the number of minutes in the chunk; normally this can be calculated by subtracting start from end, but in some cases this is not so. A time zone shift may be recorded thus:

<logentry 
      date="2004-01-10" 
      start="15.56" end="16.56" 
      dur="0" 
      project="other:" 
      category="other:">time 
zone shift</logentry>
(No machine-processable information is included about which time zone, exactly, is local time, and time zone shifts are not required by the semantics of the markup. They are included in order to reduce the number of places at which the sanity checking in the default style sheets suspects a possible error.)

project

the project (or notional account) to which this time chunk is allocated for purposes of time accounting (the vocabulary allows any value here; data entry software is used to ensure a reasonably consistent set of values); typical values are
xmlschema

work on the XML Schema Working Group

xsl

work on the XSL Working Group

xmlcg

work related to the XML Coordination Group

arch

work related to the W3C Architecture Domain

w3c

W3C work other than XML Schema, XML CG, or Architecture Domain

mep

work on the Model Editions Partnership

extreme

work on the conference Extreme Markup Languages

prof

professional work on projects other than those currently being identified more specifically

personal

the chunk went for personal, not work-related, activities

ovhd

work classed as ‘overhead’: not directly related to any particular project or account, but work-related nonetheless (e.g. cleaning spam filters, maintaining time logs, upgrading operating system)

other

unclassifiable

category

the kind of activity to which the chunk was devoted: telcon, making an agenda, attending a face-to-face meeting, drafting or revising minutes, phone call with an individual, reading email, writing email, mix of reading and writing email, giving a talk, chatting on IRC, reading a paper, learning a piece of software, drafting or revising a paper, working on software or spec requirements, various other development activities, down time.
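
The descriptions of start, end, and dur above suggest two simple sanity checks: whether dur equals end minus start, and whether an entry starts where the preceding one ended. The Python sketch below shows one way such checks might look. It is not the logic of the default style sheets, which this paper does not reproduce. The function names are hypothetical, and comparing only consecutive entries that share a date is an assumption made here to avoid flagging overnight gaps. Entries that cross midnight are not handled. The sample uses unqualified element names for brevity.

```python
import xml.etree.ElementTree as ET


def minutes(value):
    """Convert 'hh.mm' (full stop separator, optional leading space)
    to minutes since midnight."""
    hours, mins = value.strip().split(".")
    return int(hours) * 60 + int(mins)


def check_entries(entries):
    """Return (index, message) pairs for entries worth a second look."""
    findings = []
    previous = None
    for index, entry in enumerate(entries):
        start = minutes(entry.get("start"))
        end = minutes(entry.get("end"))
        dur = int(entry.get("dur"))
        if end - start != dur:
            findings.append((index, "dur differs from end - start"))
        # Assumption: only compare with the previous entry on the same date.
        if (previous is not None
                and previous.get("date") == entry.get("date")
                and minutes(previous.get("end")) != start):
            findings.append((index, "start differs from previous end"))
        previous = entry
    return findings


# Two entries from the sample data plus the time zone shift example above,
# wrapped in a period element for parsing.
SAMPLE = """<period>
  <logentry date="2004-03-23" start="17.01" end="17.32" dur="31"
    project="xmlschema" category="agenda">revising ten-week plan</logentry>
  <logentry date="2004-03-23" start="17.32" end="17.53" dur="21"
    project="xmlschema" category="phone">NI, discuss ten-week plan</logentry>
  <logentry date="2004-01-10" start="15.56" end="16.56" dur="0"
    project="other:" category="other:">time zone shift</logentry>
</period>"""

entries = ET.fromstring(SAMPLE).findall("logentry")
findings = check_entries(entries)
```

The time zone shift entry is reported because its dur of 0 differs from the 60 minutes between start and end. As the description of dur notes, such a difference is legitimate, so findings of this kind are prompts for human review rather than errors.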

Sample data

Some sample data may help illustrate the usage of the vocabulary:

    <logentry date="2004-03-23" start="17.01" end="17.32" dur="31"
      project="xmlschema" category="agenda">revising ten-week
      plan</logentry>
    <logentry date="2004-03-23" start="17.32" end="17.53" dur="21"
      project="xmlschema" category="phone">NI, discuss
      ten-week plan and doc dates</logentry>
    <logentry date="2004-03-23" start="17.53" end="18.42" dur="49"
      project="xmlschema" category="agenda">revising ten-week plan,
      send to WG</logentry>
    <logentry date="2004-03-23" start="18.42" end="19.06" dur="24"
      project="personal" category="other:">ironing</logentry>
    <logentry date="2004-03-23" start="19.06" end="19.10" dur="4"
      project="overhead" category="email">Email sorting</logentry>
    <logentry date="2004-03-23" start="19.10" end="19.47" dur="37"
      project="personal" category="dogfeed">Feeding dogs</logentry>
    <logentry date="2004-03-23" start="19.47" end="19.59" dur="12"
      project="w3c" category="think">trying to find server to
      look at simile data</logentry>
    <logentry date="2004-03-23" start="19.59" end="20.10" dur="11"
      project="overhead" category="implementation">revising
      timelog.rexx to provide XSL category (and some other
      changes)</logentry>
    <logentry date="2004-03-23" start="20.10" end="20.59" dur="49"
      project="prof" category="other:">print out Lisbon papers for
      review</logentry>
    <logentry date="2004-03-23" start="20.59" end="21.33" dur="34"
      project="overhead" category="readmail"></logentry>
    <logentry date="2004-03-24" start=" 6.30" end=" 6.33" dur="3"
      project="overhead" category="other">startup time</logentry>
    <logentry date="2004-03-24" start=" 6.33" end="07.21" dur="48"
      project="w3c" category="docdraft">working on time log example
      for EM</logentry>
    <logentry date="2004-03-24" start="07.21" end=" 8.08" dur="47"
      project="personal" category="meal">breakfast</logentry>
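
Since the project attribute exists for purposes of time accounting, one natural use of such data is to total the dur values by project or by category. The Python sketch below does this for the first four entries of the sample. It is an illustration only: the function name `minutes_by` is hypothetical, the enclosing period element is added so that the fragment parses as a document, and element names are left unqualified for brevity.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# First four entries of the sample data, wrapped in a period element.
SAMPLE = """<period>
  <logentry date="2004-03-23" start="17.01" end="17.32" dur="31"
    project="xmlschema" category="agenda">revising ten-week
    plan</logentry>
  <logentry date="2004-03-23" start="17.32" end="17.53" dur="21"
    project="xmlschema" category="phone">NI, discuss
    ten-week plan and doc dates</logentry>
  <logentry date="2004-03-23" start="17.53" end="18.42" dur="49"
    project="xmlschema" category="agenda">revising ten-week plan,
    send to WG</logentry>
  <logentry date="2004-03-23" start="18.42" end="19.06" dur="24"
    project="personal" category="other:">ironing</logentry>
</period>"""


def minutes_by(root, attribute):
    """Sum the dur values of all logentry descendants, grouped by the
    value of the named attribute (for example 'project' or 'category')."""
    totals = Counter()
    for entry in root.iter("logentry"):
        totals[entry.get(attribute)] += int(entry.get("dur"))
    return totals


root = ET.fromstring(SAMPLE)
by_project = minutes_by(root, "project")
by_category = minutes_by(root, "category")
```

Using `iter` rather than `findall` means the totals include entries at any depth of period nesting, in keeping with the statement that periods can nest.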

A prose transcription captures the meaning of the marked up data fairly concisely; later, we'll try to formalize this better.

  • On 2004-03-23, from 5:01 to 5:32 p.m., NN [5] spent 31 minutes revising ‘the ten-week plan’. This should be accounted for as time devoted to the project “xmlschema”. The category of activity is “agenda”.
  • On 2004-03-23, from 5:32 to 5:53 p.m., NN spent 21 minutes on the phone with NI, to discuss the ten-week plan and document publication dates. This should be accounted for as time devoted to the project “xmlschema”. The category of activity is “phone”.
  • On 2004-03-23, from 5:53 to 6:42 p.m., NN spent 49 minutes revising the ten-week plan and sending it to the WG. This should be accounted for as time devoted to the project “xmlschema”. The category of activity is “agenda”.
  • On 2004-03-23, from 18:42 to 19:06, NN spent 24 minutes ironing shirts. This should be accounted for as time devoted to the project “personal”. The category of activity is “other:”.
  • On 2004-03-23, from 19:06 to 19:10, NN spent 4 minutes sorting email. This should be accounted for as time devoted to the project “overhead”. The category of activity is “email”.
  • On 2004-03-23, from 19:10 to 19:47, NN spent 37 minutes feeding the dogs. This should be accounted for as time devoted to the project “personal”. The category of activity is “dogfeed”.
  • On 2004-03-23, from 19:47 to 19:59, NN spent 12 minutes trying to find the server to look at the sample Simile data. This should be accounted for as time devoted to the project “w3c”. The category of activity is “think”.
  • On 2004-03-23, from 19:59 to 20:10, NN spent 11 minutes revising timelog.rexx to provide an XSL category (and some other changes). This should be accounted for as time devoted to the project “overhead”. The category of activity is “implementation”.
  • On 2004-03-23, from 20:10 to 20:59, NN spent 49 minutes printing out papers to review for a conference. This should be accounted for as time devoted to the project “prof”. The category of activity is “other:”.
  • On 2004-03-23, from 20:59 to 21:33, NN spent 34 minutes reading email. This should be accounted for as time devoted to the project “overhead”. The category of activity is “readmail”.
  • On 2004-03-24, from 6:30 to 6:33, NN spent 3 minutes starting up his machine. This should be accounted for as time devoted to the project “overhead”. The category of activity is “other”.
  • On 2004-03-24, from 6:33 to 07:21, NN spent 48 minutes working on the time log example for EM. This should be accounted for as time devoted to the project “w3c”. The category of activity is “docdraft”.
  • On 2004-03-24, from 07:21 to 8:08, NN spent 47 minutes eating breakfast. This should be accounted for as time devoted to the project “personal”. The category of activity is “meal”.

It should be noted at the outset that some information in the paraphrases above is not explicit in the XML markup or content: two obvious examples are the fact that the individual whose time is being logged is NN and the fact that when NN spends time ironing, what he irons is shirts. To understand these facts from the marked up data, it is necessary and sufficient to understand some basic facts about the context in which the marked up data is designed to be created and used. A good inference engine could make the appropriate inferences, given the relevant additional facts.

The successful paraphrase of the activity descriptions from their sometimes telegraphic style into full English sentences also requires more knowledge than is explicit in the marked up data: it requires a command of English. This is not something any current software can be expected to have or acquire soon.

Notes on the vocabulary

The vocabulary presented is in private use by a single individual; the semi-controlled vocabulary used for projects and categories changes only slowly, and the vocabularies are controlled only by the data entry form, not by the DTD. Knowledge of the vocabularies is built into the processing software: for example, the fact that “work” minutes are the sum of schema minutes, w3c minutes, professional minutes, and overhead minutes, or that w3c minutes are the sum of the minutes spent on the “xmlcg”, “arch”, and “w3c” projects, and so on.
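The roll-up knowledge described above can be sketched in a few lines of Python. This is only an illustration, not the actual processing software: the table GROUPS, the group names, and the assignment of the “xmlschema” and “prof” projects to the schema and professional groups are assumptions made for the example. The minute counts are those of the thirteen sample entries.

```python
from collections import defaultdict

# (project, minutes) pairs read off the thirteen sample log entries.
ENTRIES = [
    ("xmlschema", 31), ("xmlschema", 21), ("xmlschema", 49),
    ("personal", 24), ("overhead", 4), ("personal", 37),
    ("w3c", 12), ("overhead", 11), ("prof", 49), ("overhead", 34),
    ("overhead", 3), ("w3c", 48), ("personal", 47),
]

# Hypothetical roll-up table: which projects feed which reporting group.
GROUPS = {
    "schema": ["xmlschema"],
    "w3c": ["xmlcg", "arch", "w3c"],
    "professional": ["prof"],
    "overhead": ["overhead"],
}
# "Work" minutes are the sum of these four groups.
WORK_GROUPS = ["schema", "w3c", "professional", "overhead"]


def minutes_by_project(entries):
    """Total the minutes logged against each project."""
    totals = defaultdict(int)
    for project, minutes in entries:
        totals[project] += minutes
    return totals


def group_minutes(entries, group):
    """Sum the minutes of every project that belongs to one group."""
    totals = minutes_by_project(entries)
    return sum(totals[project] for project in GROUPS[group])


def work_minutes(entries):
    """Sum the minutes of all groups that count as work."""
    return sum(group_minutes(entries, group) for group in WORK_GROUPS)


print(work_minutes(ENTRIES))
```

None of this knowledge is recoverable from the markup itself, which is why it has to live in the software (or, in a formalized mapping, in explicit rules).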

The vocabulary could easily be adapted for use by a group of people, whose log entries might be merged; in that case, (a) different methods of vocabulary control would be needed, and (b) the individual (or other entity) whose time is being logged would have to be made explicit in the XML. Also (c) it would make sense to have explicit records of the relations among various projects, and constraints (such as that the “meal” category is only allowed for the “personal” project).

The target syntax

A simple representation in logical form

Two kinds of object need to be distinguished in the logical form: chunks and time periods. For convenience in establishing inter-object links, we'll supply an arbitrary identifier (unique within a particular data stream) for each chunk or period.

A chunk is represented by a member of the chunk relation. One straightforward representation is

chunk(Date, Start_time, End_time, Duration, 
      Project, Category, Description)

in which each attribute of the logentry element is represented as an argument to the predicate, as is the #PCDATA content. Another representation adds a unique identifier for the chunk, for use in relating chunks to the time periods which contain them.

chunk(ID, Date, Start_time, End_time, Duration, 
      Project, Category, Description)

It seems a more plausible representation of the actual meaning of the log data, however, to treat the date, start, end, and dur attributes as identifying a specific time interval, with timestamps for the starting and ending points in time or with a starting timestamp and a duration:

chunk(ID, interval(Start, Duration), 
      Project, Category, Description)
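As a rough illustration of the interval view, the following Python sketch derives a starting timestamp and a duration from the date, start, and end values of one log entry. The function name to_interval is hypothetical, and the sketch assumes that times are written in the log's “HH.MM” notation (possibly with a leading blank) and that no entry crosses midnight.

```python
from datetime import datetime, timedelta


def to_interval(date, start, end):
    """Return (start timestamp, duration) for one log entry."""
    def stamp(clock):
        # "07.21" or " 8.08" -> a datetime on the given date.
        hours, minutes = clock.strip().split(".")
        midnight = datetime.strptime(date, "%Y-%m-%d")
        return midnight + timedelta(hours=int(hours), minutes=int(minutes))

    begin, finish = stamp(start), stamp(end)
    return begin, finish - begin


begin, duration = to_interval("2004-03-24", "07.21", " 8.08")
print(begin.isoformat(), duration)
```

For the breakfast entry the computed duration agrees with the dur value of 47 recorded in the markup, which suggests that dur is redundant once the interval is known.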

A time period is represented by a member of the period relation:

period(ID, Label, Workdays, Holidays, Weekenddays)

The links between periods are represented by members of the contains relation:

contains(Outer_period_id,Inner_period_id)

The fact that a chunk occurs within a given period is represented by a member of the includes relation:

includes(Period_id,Chunk_id)

Note that other logical representations are possible. Some of the more obvious variations include:

  • A different ontology might be used, which postulates the existence of different kinds of objects.
  • The project and category to which a chunk of time is assigned might be represented not as strings of characters but as entities in their own right; in particular, they might be associated with well known (or other) URIs, as is usual in RDF.

Ways of modeling an n-tuple

Note that there are several ways of translating from the n-tuples above into the set of binary relations required by RDF.

If each log entry is viewed as a tuple of the form

chunk(Date,Start,End,Dur,Proj,Cat,Desc) 

and no identifier is assigned to the chunk (e.g. because our ontological scruples do not allow us to postulate the existence of chunks as individuals in the sense sometimes used in formal logic), then the problem of reducing the tuple to RDF resembles the problem of reducing an n-ary function to a unary function. That unary function takes one of the n arguments and returns a function of arity n-1, which accepts the other arguments and returns the desired result (or, more frequently, which is itself reduced to a unary function which takes one argument and returns a function of lesser arity). This technique is well understood under the name currying. [6]
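For readers who have not met currying, here is a minimal Python sketch of the idea: a seven-argument function is turned into a chain of one-argument functions. The helper curry and the stand-in function chunk are written for this illustration only.

```python
def curry(func, arity):
    """Turn an n-ary function into a chain of unary functions."""
    def collect(args):
        if len(args) == arity:
            return func(*args)
        # Not enough arguments yet: return a function awaiting the next one.
        return lambda arg: collect(args + (arg,))
    return collect(())


def chunk(date, start, end, dur, proj, cat, desc):
    """Stand-in for the chunk/7 relation: just return the tuple."""
    return (date, start, end, dur, proj, cat, desc)


curried = curry(chunk, 7)
result = curried("2004-03-24")("07.21")("8.08")(47)("personal")("meal")("breakfast")
print(result)
```

Each intermediate value, such as curried("2004-03-24"), is itself a function of lesser arity, which is the property the RDF reduction borrows.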

Applying one currying-inspired approach to the tuple above, we would have, for some particular date, start-time, end-time, etc.:

       (dsedpcd_tuple(A)
        & sedpcd_tuple(B)
        & edpcd_tuple(C)
        & dpcd_tuple(D)
        & pcd_tuple(E)
        & cd_tuple(F)
        & d_tuple(G)
        & date(A,"2004-03-24") 
        & start(B,"07.21") 
        & end(C,"8.08") 
        & dur(D,"47") 
        & project(E,"personal") 
        & category(F,"meal")
        & desc(G,"breakfast")
        & r1(A,B)
        & r2(B,C)
        & r3(C,D)
        & r4(D,E)
        & r5(E,F)
        & r6(F,G))

Here, the first seven lines assert the types of various tuples and sub-tuples A, B, C, ... G, the next seven associate the various literal values given for the date, etc., with them, while the last six lines link the tuples up in an appropriate chain.
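The value and link assertions of such a chain can be generated mechanically. The Python sketch below, with the hypothetical function curried_facts, produces the binary facts for any ordered list of fields; the unary type assertions are omitted for brevity, and the node names A, B, C, ... follow the example.

```python
def curried_facts(fields):
    """fields: ordered (name, value) pairs.
    Returns binary facts as (predicate, subject, object) triples."""
    nodes = [chr(ord("A") + i) for i in range(len(fields))]
    # One value fact per node: date(A, ...), start(B, ...), and so on.
    facts = [(name, node, value) for node, (name, value) in zip(nodes, fields)]
    # Link facts r1(A,B), r2(B,C), ... chain the nodes together.
    facts += [("r%d" % (i + 1), nodes[i], nodes[i + 1])
              for i in range(len(nodes) - 1)]
    return facts


FIELDS = [
    ("date", "2004-03-24"), ("start", "07.21"), ("end", "8.08"),
    ("dur", "47"), ("project", "personal"), ("category", "meal"),
    ("desc", "breakfast"),
]
facts = curried_facts(FIELDS)
for predicate, subject, obj in facts:
    print("%s(%s, %r)" % (predicate, subject, obj))
```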

The approach just shown requires us to postulate several types of tuple, which may itself be ontologically troubling to us, so we may prefer a slightly different method, which makes use of a single tuple predicate:

          tuple(A,dsedpcd)
        & tuple(B,sedpcd)
        & tuple(C,edpcd)
        & tuple(D,dpcd)
        & tuple(E,pcd)
        & tuple(F,cd)
        & tuple(G,d)
        ...

The remainder of the translation is as before.

Since either of these translations could in theory take the arguments in any order, there are 5040 (7!) variations of each. In fact we could also group arguments into subtuples, so there are actually more ways to reduce the chunk predicate to a set of binary relations.
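The count of 5040 is easy to confirm by brute force. The short Python check below enumerates every ordering of the seven arguments; the list ARGUMENTS simply names them.

```python
from itertools import permutations
from math import factorial

ARGUMENTS = ["date", "start", "end", "dur", "project", "category", "desc"]

# Every possible order in which the seven arguments could be consumed.
orderings = list(permutations(ARGUMENTS))
print(len(orderings), factorial(len(ARGUMENTS)))
```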

The translation into RDF is much simpler if we overcome whatever philosophical hesitation we might have about assuming the existence of time chunks; that allows us to represent the logentry elements in a more straightforward way in RDF (or using binary predicates):

(exists L)(logentry(L)
    & date(L,"2004-03-24") 
    & start(L,"07.21") 
    & end(L,"8.08") 
    & dur(L,"47") 
    & project(L,"personal") 
    & category(L,"meal")
    & desc(L,"breakfast"))

This is easier for humans to understand, and it is unlikely that any human designer would curry the predicate instead of postulating the existence of time chunks. But the choice does exist and must be made when specifying a mapping from colloquial XML into logical form or into RDF.
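In triple terms, postulating the chunk amounts to minting one node and hanging every property off it. The Python sketch below illustrates this under stated assumptions: the function reify, the blank-node labels of the form _:n1, and the predicate name type are all invented for the example.

```python
from itertools import count

_fresh = count(1)  # source of fresh node labels


def reify(entry_type, properties):
    """Return (subject, predicate, object) triples for one entry,
    all sharing a single freshly minted subject node."""
    node = "_:n%d" % next(_fresh)
    triples = [(node, "type", entry_type)]
    triples += [(node, name, value) for name, value in properties.items()]
    return triples


triples = reify("logentry", {
    "date": "2004-03-24", "start": "07.21", "end": "8.08", "dur": "47",
    "project": "personal", "category": "meal", "desc": "breakfast",
})
for triple in triples:
    print(triple)
```

Eight triples with one shared subject replace the twenty assertions of the curried form.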

RDF

The simplest RDF translation of the sample closely resembles the form given above:

<Chunk rdf:ID="NN_2004-03-23T17.01/17.32">
      <who>NN</who>
      <date>2004-03-23</date>
      <start>17.01</start>
      <end>17.32</end>
      <dur>31</dur>
      <project>xmlschema</project>
      <category>agenda</category>
      <description>revising ten-week
      plan</description>
</Chunk>

A slightly more interesting translation assumes we may wish to associate further information with the projects and categories, and so assigns a URI to each of them.

Figure 3: [File sample.rdf]
<?xml version="1.0"?>
<!DOCTYPE rdf:RDF [
    <!ENTITY rdfns 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
    <!ENTITY timelogns 'http://example.org/mcxrx/timelog#'>
    <!ENTITY people 'http://example.org/mcxrx/people#'>
    <!ENTITY projects 'http://example.org/mcxrx/projects#'>
    <!ENTITY categories 'http://example.org/mcxrx/categories#'>
]>

<rdf:RDF
    xmlns:rdf="&rdfns;" 
    xmlns="&timelogns;">

<Period rdf:ID="id2589187">
      <label>Sample</label>
</Period>

<Period rdf:ID="id2588575">
      <label>A short sample</label>
</Period>

<Period rdf:ID="id2591870">
      <label>Tue 23 Mar</label>
</Period>

<Chunk rdf:ID="NN_2004-03-23T17.01/17.32">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>17.01</start>
      <end>17.32</end>
      <dur>31</dur>
      <project rdf:resource="&projects;xmlschema" />
      <category rdf:resource="&categories;agenda" />
      <description>revising ten-week
      plan</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T17.32/17.53">
     <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>17.32</start>
      <end>17.53</end>
      <dur>21</dur>
      <project rdf:resource="&projects;xmlschema" />
      <category rdf:resource="&categories;phone" />
      <description>M Holstege, discuss
      ten-week plan and SCD dates</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T17.53/18.42">
     <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>17.53</start>
      <end>18.42</end>
      <dur>49</dur>
      <project rdf:resource="&projects;xmlschema" />
      <category rdf:resource="&categories;agenda" />
      <description>revising ten-week plan,
      send to WG</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T18.42/19.06">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>18.42</start>
      <end>19.06</end>
      <dur>24</dur>
      <project rdf:resource="&projects;personal" />
      <category rdf:resource="&categories;other:" />
      <description>ironing</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T19.06/19.10">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>19.06</start>
      <end>19.10</end>
      <dur>4</dur>
      <project rdf:resource="&projects;overhead" />
      <category rdf:resource="&categories;email" />
      <description>Email sorting</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T19.10/19.47">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>19.10</start>
      <end>19.47</end>
      <dur>37</dur>
      <project rdf:resource="&projects;personal" />
      <category rdf:resource="&categories;dogfeed" />
      <description>Feeding dogs</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T19.47/19.59">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>19.47</start>
      <end>19.59</end>
      <dur>12</dur>
      <project rdf:resource="&projects;w3c" />
      <category rdf:resource="&categories;think" />
      <description>trying to find server to
      look at simile data</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T19.59/20.10">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>19.59</start>
      <end>20.10</end>
      <dur>11</dur>
      <project rdf:resource="&projects;overhead" />
      <category rdf:resource="&categories;implementation" />
      <description>revising
      timelog.rexx to provide XSL category (and some other
      changes)</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T20.10/20.59">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>20.10</start>
      <end>20.59</end>
      <dur>49</dur>
      <project rdf:resource="&projects;prof" />
      <category rdf:resource="&categories;other:" />
      <description>print out witt papers for
      review</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-23T20.59/21.33">
      <who rdf:resource="&people;NN" />
      <date>2004-03-23</date>
      <start>20.59</start>
      <end>21.33</end>
      <dur>34</dur>
      <project rdf:resource="&projects;overhead" />
      <category rdf:resource="&categories;readmail" />
</Chunk>

<Period rdf:ID="id259208">
      <label>Wed 24 Mar</label>
</Period>

<Chunk rdf:ID="NN_2004-03-24T6.30/6.33">
      <who rdf:resource="&people;NN" />
      <date>2004-03-24</date>
      <start>6.30</start>
      <end>6.33</end>
      <dur>3</dur>
      <project rdf:resource="&projects;overhead" />
      <category rdf:resource="&categories;other" />
      <description>startup time</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-24T6.33/07.21">
      <who rdf:resource="&people;NN" />
      <date>2004-03-24</date>
      <start>6.33</start>
      <end>07.21</end>
      <dur>48</dur>
      <project rdf:resource="&projects;w3c" />
      <category rdf:resource="&categories;docdraft" />
      <description>working on time log example
      for EM</description>
</Chunk>

<Chunk rdf:ID="NN_2004-03-24T07.21/8.08">
      <who rdf:resource="&people;NN" />
      <date>2004-03-24</date>
      <start>07.21</start>
      <end>8.08</end>
      <dur>47</dur>
      <project rdf:resource="&projects;personal" />
      <category rdf:resource="&categories;meal" />
      <description>breakfast</description>
</Chunk>

</rdf:RDF>

The XSLT

We'll build the stylesheet up in phases, first doing just the simple things and gradually elaborating until we have a transformation which produces the full logical representation shown above; then we'll make a companion stylesheet to generate RDF.

Version 0.1: handling chunks

The stylesheet

We'll start by just extracting the instances of the chunk/7 relation chunk(Date, Start, End, Duration, Project, Category, Description).

The stylesheet framework is the usual one:

The beginning of the stylesheet will have pretty much exactly the same form in all the versions we describe in this paper:

Figure 5: Stylesheet DTD and start-tag
<?xml version='1.0'?>
<!DOCTYPE xsl:stylesheet PUBLIC 'http://www.w3.org/1999/XSL/Transform'
      '../../../People/cmsmcq/lib/xslt10.dtd' [
<!ENTITY nl "&#xA;">
]>
<xsl:stylesheet 
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>

<!--* A stylesheet which translates timelog data into logical form.
    * Revisions:
    * 2004-04-16 : v 0.6, finish for Extreme submission
    * 2004-03-24 : v 0.1, just do chunk/7 predicate.
    * 
    * To do:
    * v 0.1 chunk/7
    * v 0.2 chunk/8 + period
    * v 0.3 add contains_p_p/2, contains_p_c/2
    * v 0.4 add some implicit information, standard facts
    * v 0.5 add existential quantification
    * v 0.6 add some static inference rules
    *-->

This code is used in < “Stylesheet 0.1: chunks [File log.logic.01.xsl] ” 4 >< “Stylesheet 0.2: chunks with IDs [File log.logic.02.xsl] ” 12 >< “Stylesheet 0.6: logical form [File log.logic.06.xsl] ” 25 >

So will the end of the stylesheet:

Figure 6: Stylesheet end-tag
</xsl:stylesheet>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:"/SGML/Public/Emacs/xslt.ced"
sgml-omittag:t
sgml-shorttag:t
sgml-indent-data:t
sgml-indent-step:1
End:
-->

This code is used in < “Stylesheet 0.1: chunks [File log.logic.01.xsl] ” 4 >< “Stylesheet 0.2: chunks with IDs [File log.logic.02.xsl] ” 12 >< “Stylesheet 0.6: logical form [File log.logic.06.xsl] ” 25 >< “Stylesheet R: RDF export [File log.rdf.xsl] ” 27 >

The document root doesn't actually require any special handling, but in the interests of labeling the output, we emit a comment saying what the output actually is and where it came from:

Figure 7: Handling the document root
 <xsl:template match="/">
  <xsl:text>/* Logical representation of time log data &nl;</xsl:text>
  <xsl:text> * generated by timelog.to.logic.xsl &nl;</xsl:text>
  <xsl:text> */&nl;&nl;</xsl:text>
  <xsl:apply-templates/>
  <xsl:text>&nl;</xsl:text>
 </xsl:template>

This code is used in < “Stylesheet 0.1: chunks [File log.logic.01.xsl] ” 4 >< “Stylesheet 0.2: chunks with IDs [File log.logic.02.xsl] ” 12 >

Because we want to produce output in plain text format (suitable for loading into, say, a Prolog system), we need to tell the XSLT processor to produce text output rather than XML:

Figure 8: Output declaration
 <xsl:output method="text" media-type="text/plain"/>

This code is used in < “Stylesheet 0.1: chunks [File log.logic.01.xsl] ” 4 >< “Stylesheet 0.2: chunks with IDs [File log.logic.02.xsl] ” 12 >< “Stylesheet 0.6: logical form [File log.logic.06.xsl] ” 25 >

The heart of the translation is the handling of the log entry. For each logentry element, we want to generate one member of the chunk relation, with the appropriate arguments. Using the notation of XSLT's attribute value templates, we might say schematically that the string we want to generate is

chunk({@date}, {@start}, {@end}, {@dur}, 
   {@project}, {@category}, "{string(.)}")

When a given logentry element is the current element, the XPath expression @date denotes the (value of the) date attribute; in attribute value templates, the braces around XPath expressions indicate that the expressions are to be evaluated, rather than becoming a literal part of the string. The last argument (string(.)) denotes the string value of the element's content; note the quotation marks around it, which ensure that it is syntactically recognizable as a string.

In practice, if we want to load the output into a Prolog system, more quotation marks are useful: single quotes around the date, time, project, and category values will ensure that the values are read as single atoms even if they contain decimal points, colons, hyphens, etc.

chunk('{@date}', '{@start}', '{@end}', {@dur}, 
   '{@project}', '{@category}', "{string(.)}")

We do not quote the duration, since we want it read as an integer; the description keeps its double quotes, since we want it read as a string.

If we write this string into an XSLT template, however, we won't get quite the results we wish: while the attribute value template notation is convenient, it is used, as the name implies, only for certain attribute values. To generate the appropriate strings, we will need to use a slightly more cumbersome notation:

Figure 9: Handling a single log entry
<xsl:template match="logentry">
chunk('<xsl:value-of select="@date"/>', 
      '<xsl:value-of select="@start"/>', 
      '<xsl:value-of select="@end"/>', 
      <xsl:value-of select="@dur"/>, 
      '<xsl:value-of select="@project"/>', 
      '<xsl:value-of select="@category"/>', 
      "<xsl:value-of select="string(.)"/>").
</xsl:template>

This code is used in < “Stylesheet 0.1: chunks [File log.logic.01.xsl] ” 4 >
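For readers more at home in a general-purpose language, the effect of this template on a single log entry can be imitated in Python. This is a sketch of the mapping, not a replacement for the stylesheet: the function chunk_clause is hypothetical, white space in attribute values and content is trimmed explicitly (an assumption about how the values should be normalized), and, like the template, the sketch does not escape quotation marks that might occur inside the description.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<logentry date="2004-03-24" start="07.21" end=" 8.08" dur="47"
  project="personal" category="meal">breakfast</logentry>"""


def chunk_clause(entry):
    """Format one logentry element as a chunk/7 clause."""
    def get(name):
        return entry.get(name, "").strip()

    # String value of the element, with runs of white space collapsed.
    description = " ".join("".join(entry.itertext()).split())
    return "chunk('%s', '%s', '%s', %d, '%s', '%s', \"%s\")." % (
        get("date"), get("start"), get("end"), int(get("dur")),
        get("project"), get("category"), description)


print(chunk_clause(ET.fromstring(SAMPLE)))
```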

Variations on this pattern are possible, in order to vary the layout of the output and of the XSLT stylesheet.

Once we have this template for logentry elements, we have most of what we need. The default rules for XSLT will process other elements by recurring on their children, which means we will effectively ignore all timelog and period elements and find all logentry elements at whatever depth.

On the other hand, the default rule for text nodes is to write them out to the output. We don't want the content of the p element in the output. So we suppress the processing of the p element:

The white space in the source document, however, still gets copied to the output; this is not a serious problem, but it is unsightly. So we suppress all text nodes in the input document:

Figure 11: [continues 10 Suppressing character data ]
 <xsl:template match="text()"/>

The output

The stylesheet given above produces the following output (reformatted for compactness):

chunk('2004-03-23', '17.01', '17.32', 31, 
      'xmlschema', 'agenda', "revising ten-week plan").
chunk('2004-03-23', '17.32', '17.53', 21, 
      'xmlschema', 'phone', "NI, discuss ten-week plan and doc dates").
chunk('2004-03-23', '17.53', '18.42', 49, 
      'xmlschema', 'agenda', "revising ten-week plan, send to WG").
chunk('2004-03-23', '18.42', '19.06', 24, 
      'personal', 'other:', "ironing").
chunk('2004-03-23', '19.06', '19.10', 4, 
      'overhead', 'email', "Email sorting").
chunk('2004-03-23', '19.10', '19.47', 37, 
      'personal', 'dogfeed', "Feeding dogs").
chunk('2004-03-23', '19.47', '19.59', 12, 
      'w3c', 'think', "trying to find server to look at simile data").
chunk('2004-03-23', '19.59', '20.10', 11, 
      'overhead', 'implementation', 
      "revising timelog.rexx to provide XSL category (and some other changes)").
chunk('2004-03-23', '20.10', '20.59', 49, 
      'prof', 'other:', "print out Lisbon papers for review").
chunk('2004-03-23', '20.59', '21.33', 34, 
      'overhead', 'readmail', "").
chunk('2004-03-24', '6.30', '6.33', 3, 
      'overhead', 'other', "startup time").
chunk('2004-03-24', '6.33', '07.21', 48, 
      'w3c', 'docdraft', "working on time log example for EM").
chunk('2004-03-24', '07.21', '8.08', 47, 
      'personal', 'meal', "breakfast").

Version 0.2: giving chunks unique identifiers

The second version of the XSLT transformation will provide a unique identifier for each chunk. It doesn't much matter which method we use to generate the identifier, as long as the result is unique to the chunk.

We'll also generate instances of the period relation, again with unique identifiers.

The stylesheet

Adding unique identifiers

The stylesheet framework is almost the same as before:

The only difference is in handling the log entries. One simple way to generate a unique identifier for each chunk is to use the XSLT generate-id function. This is guaranteed to generate a distinct identifier for each element in the input document. And since each chunk corresponds to a distinct element in the input, it will give us an identifier for each chunk which is unique in the data stream:

Figure 13: Handling a single log entry (version 2a)
<xsl:template match="logentry">
chunk('<xsl:value-of select="generate-id()"/>', 
      '<xsl:value-of select="@date"/>', 
      '<xsl:value-of select="@start"/>', 
      '<xsl:value-of select="@end"/>', 
      <xsl:value-of select="@dur"/>, 
      '<xsl:value-of select="@project"/>', 
      '<xsl:value-of select="@category"/>', 
      "<xsl:value-of select='string(.)'/>").
</xsl:template>

This code is not used elsewhere.

The identifiers generated by generate-id for chunks contained in different XML files, however, are not guaranteed to be distinct. Nor are they guaranteed to be unique if we ever wish to merge log data for multiple individuals. For any one individual, however, we know that there should be at most one chunk with a given date and given start and end times: the individual can be in only one place at a time, and billing the same time chunk to two different accounts violates the principles of time accounting. [7] So we can construct an identifier for the chunk from those values. And while we're worrying about possible merger with other data later, we'll add an argument identifying the individual.

Figure 14: Handling a single log entry (version 2b)
<xsl:template match="logentry">
chunk('<xsl:value-of select="concat('NN_',@date,'T',
           normalize-space(@start),'/',
           normalize-space(@end))"/>', 
      'NN',
      '<xsl:value-of select="@date"/>', 
      '<xsl:value-of select="@start"/>', 
      '<xsl:value-of select="@end"/>', 
      <xsl:value-of select="@dur"/>, 
      '<xsl:value-of select="@project"/>', 
      '<xsl:value-of select="@category"/>', 
      "<xsl:value-of select='string(.)'/>").
</xsl:template>

This code is used in < “Stylesheet 0.2: chunks with IDs [File log.logic.02.xsl] ” 12 >< “Stylesheet 0.6: logical form [File log.logic.06.xsl] ” 25 >
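The identifier scheme is simple enough to restate in Python, which also makes it easy to check the uniqueness assumption on real data. In this sketch the names normalize_space, chunk_id, and duplicate_ids are hypothetical; normalize_space imitates the XPath function of the same name.

```python
def normalize_space(text):
    """Imitate XPath normalize-space(): trim and collapse white space."""
    return " ".join(text.split())


def chunk_id(who, date, start, end):
    """Build an identifier of the form WHO_DATEtSTART/END."""
    return "%s_%sT%s/%s" % (who, date,
                            normalize_space(start), normalize_space(end))


def duplicate_ids(entries, who="NN"):
    """Return identifiers that occur more than once in (date, start, end) data."""
    seen, duplicates = set(), []
    for date, start, end in entries:
        ident = chunk_id(who, date, start, end)
        if ident in seen:
            duplicates.append(ident)
        seen.add(ident)
    return duplicates


print(chunk_id("NN", "2004-03-24", " 6.33", "07.21"))
```

A non-empty result from duplicate_ids would indicate that the same stretch of time has been logged twice, which the accounting principle above rules out.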

Period data

Generating a clause for each period in the input is straightforward. We just add a template to handle the timelog and period elements. The only complication is that since in practice the workdays and other attributes are often supplied manually, they are not always present, and when present not always accurate. Since our task is to interpret the markup, not to clean up the meaning, we do nothing here about inaccurate values. But if the value is not supplied, we either need to invent yet another relation with a different arity, or we need to allow a special value to which we assign the meaning ‘unknown’.

Figure 15: Handling a period
<xsl:template match="period|timelog">
period(<xsl:value-of select="generate-id()"/>,
      '<xsl:value-of select="@label"/>', 
      <xsl:choose>
        <xsl:when test="@workdays">
          <xsl:value-of select="@workdays"/>
        </xsl:when>
        <xsl:otherwise>'unknown'</xsl:otherwise>
      </xsl:choose>, 
      <xsl:choose>
        <xsl:when test="@holidays">
          <xsl:value-of select="@holidays"/>
        </xsl:when>
        <xsl:otherwise>'unknown'</xsl:otherwise>
      </xsl:choose>, 
      <xsl:choose>
        <xsl:when test="@satsuns">
          <xsl:value-of select="@satsuns"/>
        </xsl:when>
        <xsl:otherwise>'unknown'</xsl:otherwise>
      </xsl:choose>).
<xsl:apply-templates/>
</xsl:template>

This code is used in < “Stylesheet 0.2: chunks with IDs [File log.logic.02.xsl] ” 12 >< “Stylesheet 0.6: logical form [File log.logic.06.xsl] ” 25 >
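The choice made here, a fixed arity with a special value meaning ‘unknown’, can be restated compactly in Python. The function period_clause is a hypothetical stand-in for the template, and the identifier is passed in rather than generated.

```python
def period_clause(period_id, attributes):
    """Format a period/5 clause, writing 'unknown' for missing counts."""
    def day_count(name):
        # Supplied values are copied as they stand, accurate or not.
        return attributes[name] if name in attributes else "'unknown'"

    return "period(%s, '%s', %s, %s, %s)." % (
        period_id, attributes.get("label", ""),
        day_count("workdays"), day_count("holidays"), day_count("satsuns"))


print(period_clause("id2589187", {"label": "Sample"}))
print(period_clause("p1", {"label": "Week", "workdays": "5"}))
```

The alternative mentioned above, a second relation of smaller arity, would avoid the special value but would force every consumer of the data to query two relations.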

Sample output data

Version 0.2 of the stylesheet produces the following output for the short sample (reformatted for compactness):

/* Logical representation of time log data 
 * generated by timelog.to.logic.xsl 
 */


period(id2589187, 'Sample', 
       'unknown', 'unknown', 'unknown').
period(id2588575, 'A short sample', 
       'unknown', 'unknown', 'unknown').
period(id2591870, 'Tue 23 Mar', 
       'unknown', 'unknown', 'unknown').

chunk('NN_2004-03-23T17.01/17.32', 
      'NN', '2004-03-23', '17.01', '17.32', 31, 
      'xmlschema', 'agenda', "revising ten-week plan").
chunk('NN_2004-03-23T17.32/17.53', 
      'NN', '2004-03-23', '17.32', '17.53', 21, 
      'xmlschema', 'phone', 
      "NI, discuss ten-week plan and doc dates").
chunk('NN_2004-03-23T17.53/18.42', 
      'NN', '2004-03-23', '17.53', '18.42', 49, 
      'xmlschema', 'agenda', 
      "revising ten-week plan, send to WG").
chunk('NN_2004-03-23T18.42/19.06', 
      'NN', '2004-03-23', '18.42', '19.06', 24, 
      'personal', 'other:', "ironing").
chunk('NN_2004-03-23T19.06/19.10', 
      'NN', '2004-03-23', '19.06', '19.10', 4, 
      'overhead', 'email', "Email sorting").
... etc. 

Version 0.6: containment, existential quantification, inference rules

The final version of the log-entry-to-logic transformation extends the foregoing in a few simple ways.

Capturing containment relations

First, we add rules to capture the containment relations between time periods and between time periods and the log entries in them. Since every period element is the child either of a timelog element or of another period element, and since we are using the standard function generate-id to make unique identifiers for the periods, all we need to do is to write out a contains clause with the IDs generated for the parent and for the current element.

Figure 16: Capturing containment relations
 <xsl:template match="period" mode="containment">
contains(<xsl:value-of select="generate-id(..)"/>,
         <xsl:value-of select="generate-id()"/>).
  <xsl:apply-templates mode="containment"/>
 </xsl:template>

Continued in 17, 18

This code is used in < “Stylesheet 0.6: logical form [File log.logic.06.xsl] ” 25 >

Log entries are very similar. Each logentry element is contained in a period, and we must write a clause for the includes relation with the ID of the enclosing period and the ID we generate for the log entry itself. The only complication comes from the more complex form of identifier we have chosen for log entries.

Figure 17: [continues 16 Capturing containment relations ]
 <xsl:template match="logentry" mode="containment">
includes(<xsl:value-of select="generate-id(..)"/>,
         '<xsl:value-of select="concat('NN_',@date,'T',
           normalize-space(@start),'/',
           normalize-space(@end))"/>').
 </xsl:template>

And because we are generating containment clauses in a separate mode, we need to suppress the p element and text nodes again.

Figure 18: [continues 16 Capturing containment relations ]
 <xsl:template match="p" mode="containment"/>
 <xsl:template match="text()" mode="containment"/>
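The same parent-to-child walk can be sketched in Python with the standard ElementTree library. Everything specific to the sketch is an assumption: the shortened SAMPLE document, the function containment_facts, and the labels p1, p2, ..., which stand in for the identifiers that generate-id would supply.

```python
import xml.etree.ElementTree as ET

# A hypothetical, much-shortened log with the same nesting as the sample.
SAMPLE = """<timelog label="Sample">
 <period label="A short sample">
  <p>Prose notes are ignored here.</p>
  <period label="Tue 23 Mar">
   <logentry date="2004-03-23" start="19.06" end="19.10" dur="4"
     project="overhead" category="email">Email sorting</logentry>
  </period>
 </period>
</timelog>"""


def containment_facts(root, who="NN"):
    """Return contains/includes facts as (relation, outer, inner) triples."""
    ids = {}

    def ident(node):
        # Stand-in for generate-id(): one stable label per element.
        if node not in ids:
            ids[node] = "p%d" % (len(ids) + 1)
        return ids[node]

    facts = []

    def walk(parent):
        for child in parent:
            if child.tag == "period":
                facts.append(("contains", ident(parent), ident(child)))
                walk(child)
            elif child.tag == "logentry":
                chunk = "%s_%sT%s/%s" % (who, child.get("date"),
                                          child.get("start").strip(),
                                          child.get("end").strip())
                facts.append(("includes", ident(parent), chunk))
            # Any other element (such as p) is skipped, as in the stylesheet.

    walk(root)
    return facts


facts = containment_facts(ET.fromstring(SAMPLE))
for fact in facts:
    print("%s(%s, %s)." % fact)
```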

The templates just added produce the following clauses for our standard sample:

contains(id2588993,
         id2590277).
  
contains(id2590277,
         id2590286).
  
includes(id2590286,
         'NN_2004-03-23T17.01/17.32').

includes(id2590286,
         'NN_2004-03-23T17.32/17.53').

includes(id2590286,
         'NN_2004-03-23T17.53/18.42').

includes(id2590286,
         'NN_2004-03-23T18.42/19.06').

includes(id2590286,
         'NN_2004-03-23T19.06/19.10').

includes(id2590286,
         'NN_2004-03-23T19.10/19.47').

includes(id2590286,
         'NN_2004-03-23T19.47/19.59').

includes(id2590286,
         'NN_2004-03-23T19.59/20.10').

includes(id2590286,
         'NN_2004-03-23T20.10/20.59').

includes(id2590286,
         'NN_2004-03-23T20.59/21.33').

contains(id2590277,
         id2592222).

includes(id2592222,
         'NN_2004-03-24T6.30/6.33').

includes(id2592222,
         'NN_2004-03-24T6.33/07.21').

includes(id2592222,
         'NN_2004-03-24T07.21/8.08').

Existential quantification

Next, we add some rules to generate explicit statements about the existence of certain things. If a log entry says a certain chunk of time was spent on a particular project P, doing work of a particular category C, then unless the markup is faulty we can infer that there is a project P and a category C. [8] This seems to be worth capturing; we can do this with a simple rule that fires for each log entry:

Figure 19: Existential quantification of time chunks, projects, and categories
 <xsl:template match="logentry" mode="existence">
time_chunk('<xsl:value-of select="concat('NN_',@date,'T',
           normalize-space(@start),'/',
           normalize-space(@end))"/>').
project(<xsl:value-of select="@project"/>).
category(<xsl:value-of select="@category"/>).
 </xsl:template>

This code is not used elsewhere.

This generates statements of the form:

time_chunk('NN_2004-03-23T20.59/21.33').
project(overhead).
category(readmail).

which assign categories or types (in the informal sense, not in that of any specific schema language) to the individuals named.
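A rule that fires once per log entry will repeat the same project and category statements many times. The Python sketch below shows one way the statements might be generated with duplicates dropped. The function existence_facts is hypothetical, and the sketch departs from the template in one respect: it wraps project and category names in single quotes, on the assumption that values such as “other:” should be read as single atoms.

```python
def existence_facts(entries, who="NN"):
    """Return existence statements, each emitted only once.
    entries: (date, start, end, project, category) tuples."""
    facts, seen = [], set()
    for date, start, end, project, category in entries:
        chunk = "%s_%sT%s/%s" % (who, date, start.strip(), end.strip())
        for fact in ("time_chunk('%s')." % chunk,
                     "project('%s')." % project,
                     "category('%s')." % category):
            if fact not in seen:
                seen.add(fact)
                facts.append(fact)
    return facts


ENTRIES = [
    ("2004-03-23", "20.59", "21.33", "overhead", "readmail"),
    ("2004-03-24", " 6.30", " 6.33", "overhead", "other"),
]
facts = existence_facts(ENTRIES)
print("\n".join(facts))
```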

Some might prefer to generate explicit statements that the things named