Henry S. Thompson
15 November 2005
- My thanks to the members of the W3C Technical Architecture
Group and to Richard Tobin, who have contributed to my understanding of the
issues addressed here in many patient conversations. The misunderstandings
which remain are, of course, solely my own responsibility.
- Versioning XML languages is a hot topic
- Before we can understand how to manage versioning
- We need to understand what a version is
- This in
turn means exploring what is needed to document an XML language
- And this turns into a matter of clarifying just what names in
an XML language are and what they name
- Since XML languages are constituents of the
Web
- We approach our task by trying to construct URIs for the
named constituents of XML languages
- What is a namespace?
- a syntactic mechanism
- for distinguishing one class of uses of a particular simple name from all others
- A namespace is just an infinite set of expanded names
- By 'expanded name', I mean a pair of a namespace name (or nothing) and
a local name, as defined by XML Namespaces 1.1
- In other words, a namespace is not a finite set of names.
- Nor a more complex
structured object as suggested by the (in)famous now-deleted non-normative
Appendix A: The
Internal Structure of XML Namespaces.
- This minimalist reading is the only one consistent with actual usage
- People mint new namespaces by simply using them in an expanded name
or namespace declaration.
- without thereby incurring any obligation to define
the boundaries of some set
- Does this mean that a namespace springs into life
the first time anyone uses a URI as a namespace name?
- That objectifies namespaces as such in an unhelpful way.
- Better to not reify namespaces as such at all.
- So "[some name] in the [some URI] namespace" doesn't literally make sense.
- It's just a convenient way of referring to "the expanded name
< some_URI, some_name >".
- It makes sense to ask questions about namespace names:
- E.g. What
namespace name will XSLT 2.0 use?
- Or about expanded names:
- E.g. Does the definition of the element type named
< http://www.w3.org/1999/XSL/Transform, output > change between XSLT
1.0 and XSLT 2.0?
- But
questions about namespaces as such are rarely if ever useful
- Unless of course
they're understood as questions about namespace names
- Or about
some otherwise-defined set of expanded names with a namespace name in common.
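As a rough illustration of the minimalist reading, an expanded name can be modelled as a plain pair, and "being in a namespace" as nothing more than a test on the first member of that pair. The namespace URI and local names used here are hypothetical, chosen only for the sketch.

```python
from typing import NamedTuple, Optional


class ExpandedName(NamedTuple):
    """A pair of a namespace name (or nothing) and a local name."""
    namespace_name: Optional[str]
    local_name: str


def in_namespace(name: ExpandedName, namespace_name: Optional[str]) -> bool:
    """Test whether an expanded name has the given namespace name.

    No set of names is ever enumerated or stored, so the namespace
    itself never has to exist as an object.
    """
    return name.namespace_name == namespace_name


# Hypothetical namespace name, for illustration only.
EXAMPLE_NS = "http://example.org/ns"

entry = ExpandedName(EXAMPLE_NS, "entry")
# A new expanded name is minted simply by using it.
made_up = ExpandedName(EXAMPLE_NS, "anything-at-all")
# The same local name with no namespace name is a different expanded name.
bare = ExpandedName(None, "entry")
```

On this model, a question about "the namespace" can only be asked as a question about the namespace name, or about some set of pairs collected by other means.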
- Consider statements such as
"Such-and-such an element type is defined in the XSLT namespace"
- It follows from the minimalist position that this does
not mean that the
XSLT namespace contains element types (or element declarations).
- Rather it means that the XSLT
language uses the expanded name
< http://www.w3.org/1999/XSL/Transform, such-and-such > for some element type for some purpose.
- This
fits well with W3C XML
Schema:
- A schema document
for a particular target namespace corresponds to a schema
- That schema assigns element declarations, type definitions, etc. to expanded names.
- Those expanded names all share that target namespace as their namespace name.
- But any schema language which
supports namespaces at all must in the end function to associate
definitions with expanded names.
- So it's languages which provide definitions
for expanded names.
- or as we used to say,
applications, in the SGML sense.
- It's important to distinguish three uses of expanded names:
- As names for (classes of) infoitems in XML languages in general,
without explicit formal definitions
- So every well-formed XML document has named element types, even
without a DTD or other form of schema
- And many have attribute types, and some have named anchors.
- As names for things in application data models in general
- Even from the XML-is-for-human-documents perspective, there's an
application data model distinct from the 'raw' infoitem model
- As names for things in the application data models of what we might call XML
definition languages
- such as W3C XML Schema, Relax NG, OWL, SVG or
WSDL
- XML definition languages have as an important part of their overall semantics the
assignment of names to constructs in their domain or data model, for example
- simple type definitions
for W3C XML Schema
- patterns for Relax NG
- classes for OWL
- views for SVG
- interfaces for WSDL
- It's the last of these three cases which raises the hard questions
- So languages such as W3C XML Schema, Relax NG and WSDL are the
primary focus hereafter.
- A language might provide one and only one definition for a particular expanded name
- But
evidently in many cases a particular expanded name may have more than one
definition
- because it gets used in the language in more than one sort of way
- E.g. to name an element type and an attribute type.
- Note I'm using 'definition' here to cover everything a language has to
say about a particular use of an expanded name.
- Syntax, semantics, whatever.
- I'll come back to this later.
- When a word has more than one meaning in a natural language, we say
it's ambiguous.
- The same thing can happen with respect to an XML language
- Ambiguous expanded names
are a problem for Web Architecture
-
AWWW says "A specification in which QNames serve as resource identifiers MUST provide a mapping to URIs".
- One concrete goal of the analysis given here is, then, to try to address this problem.
- Any given XML definition language only gives names to certain sorts of things.
- Some
languages only give names to one sort of thing
- Others give names
to more than one sort of thing.
- For example, in the implicit data model of XML DTDs we find
- Definitions/declarations of element types, attribute types, notations and two
types of entity, all given names.
- As far as my imagination stretches, there are no XML
definition languages which name only one kind of thing
- Indeed very few XML languages of any kind, which name anything, name only one
kind of thing
- I haven't been able to find one!
- But we could imagine one without difficulty: perhaps AddressBookML would
only provide for naming address book entries.
- Some languages, although naming
more than one sort of thing, constrain their use of names to be unambiguous
- Typically because they use IDs for names in their XML representations
as e.g. SVG and XML Encryption.
- But sometimes simply by fiat, as e.g. RDFS and OWL
- In the latter cases, just an expanded name is sufficient to identify something
- and constructing a URI for it is therefore straightforward
- provided there's a functional mapping from namespace name to language
- But in the former, where there are multiple sorts of things being
named, and no uniqueness constraint, expanded names alone don't do the job
- For example,
type is five-ways ambiguous as an
attribute in the XHTML
language as defined by the transitional DTD.
- And
style is ambiguous between element and attribute in
the same language.
- Looking more closely at XML as defined by a DTD, there are in principle an
unbounded number of things which might share a name, only distinguishable by
context
- we have element declarations (max. one per expanded
name)
- and attribute declarations (max. as many as there are
element declarations)
- W3C XML Schema also has a substantial set of what it
calls "symbol spaces"
- In general I'll call these different "sorts".
- There are seven things whose definitions can be named:
- Types, attribute and element groups, notations and identity-constraints alongside elements and attributes
- Elements as well as attributes may differ in different contexts
- In general this means any general approach to providing
unambiguous names will have to accommodate some means for
distinguishing between contexts.
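One way to picture the problem is as a lookup table keyed by sort and context as well as by name. The sketch below is illustrative only: the table layout, the `"(top level)"` context marker and the definition strings are assumptions, not part of any schema language.

```python
from collections import defaultdict

# name -> {(sort, context): definition}
definitions = defaultdict(dict)


def define(name, sort, context, definition):
    """Record one definition of a name, for one sort in one context."""
    definitions[name][(sort, context)] = definition


def lookup(name, sort=None, context=None):
    """Return every definition matching the dimensions that were supplied.

    Leaving sort or context as None means "not specified", so the
    result may contain more than one definition.
    """
    return [
        definition
        for (s, c), definition in definitions[name].items()
        if (sort is None or s == sort) and (context is None or c == context)
    ]


# Placeholder definitions for the unqualified name 'style' in XHTML.
define("style", "element", "(top level)", "the style element")
define("style", "attribute", "p", "the style attribute on p")
define("style", "attribute", "div", "the style attribute on div")
```

Here `lookup("style")` returns three candidates, and only supplying both the sort and the context narrows the answer to a single definition.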
- So far we've considered ambiguity regarding expanded names with respect to a single language.
- That is, a single language as defined by an XML definition language.
- But languages vary over time
- as new versions of a
language are released
- And some languages encompass alternative variants which
are all current at the same time
- For example the HTML
P element has a long and complex history
- Even the XHTML
p element has three distinct variants in version 1.0 (strict,
transitional and basic)
- none of which is exactly the same as the one in version 1.1.
- Sometimes we may want names which abstract over such differences
- Other times we may need to be very precise
- None of this should come as a surprise.
- Ordinary language uses names in
ways which are both ambiguous and context-determined
- whose use changes
over time
- But its consequences for the Web are serious
- Particularly when we consider the use of names for things on the Web intended for automatic
processing
- where appeal to context for disambiguation may not be
straightforward
- It is not at all
obvious how to specify an approach to
constructing URIs for things which will cover all the cases just discussed.
- We've identified five dimensions
which have to be fixed to identify a named thing:
- Language
- Variant
- Sort
- E.g. template rule vs. attribute set (as in XSLT)
- Context gets covered here too
- Namespace name
- Distinct from language? We'll come back to this later
- Local name
- Since it's languages which are the locus of definitions
- We can try starting with a simple URI for the
language
- And add a fragment identifier for the rest
- We've clearly got several
too many dimensions to make this possible straight away.
- Broadly speaking there are three ways one could respond to the
dimensionality problem:
- You only win in the simple case
- That is, you only get URIs for things named in a language when
- Local names are unambiguous (as e.g. for SVG)
- You don't care about variants
- And when the language itself is one-to-one with a namespace
- So you can just use the namespace name plus the local name as the
fragment identifier
- You always win, but you pay a high price
- Use a complex fragment syntax to encode the sort, variant and,
possibly, namespace name dimensions as well as the local name
- Either via a new XPointer scheme
- Or a new media type
- XML Schema Component Designators and IRI References for WSDL 2.0
Components are examples of this approach
- Look for a middle ground
- Adopt the first, simple, approach wherever possible
- Otherwise an approximation to it which ignores
variants and as much application-specific detail as possible
- Fall back to the second, complex, approach when necessary.
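The first, simple, approach amounts to very little code. The following is a minimal sketch, assuming that a namespace name ending in `#` already expects the local name to be appended directly; the function name is hypothetical.

```python
def simple_uri(namespace_name: str, local_name: str) -> str:
    """Build a URI from a namespace name with the local name as fragment.

    Only valid in the simple case: unambiguous local names, no variants,
    and a language that is one-to-one with its namespace.
    """
    base, _, rest = namespace_name.partition("#")
    if rest:
        # The namespace name already carries a non-empty fragment,
        # so there is nowhere left to put the local name.
        raise ValueError("namespace name already has a fragment: " + namespace_name)
    return base + "#" + local_name
```

For example, `simple_uri("http://www.w3.org/2000/svg", "rect")` gives `http://www.w3.org/2000/svg#rect`. Nothing in the result records a sort or a variant, which is exactly why this approach only wins in the simple case.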
- At least the simple approach does not allow for the possibility that two distinct languages or
language variants might
use the same expanded name for two evidently distinct things.
- This is intimately bound up with another assumption with respect to variation
- It's possible to tell reliably when a change in something counts as
a variation
- as opposed to a fundamental change of identity
- If I change the
named definition of a type by nudging its min or max a bit
- That pretty
clearly just produces a variant of the same type.
- But if I change the
definition assigned to a name from being an integer to being a date
- That's no longer the same type at all.
- I expect that
both of these assumptions will want to be recast as Good Practice notes going
forward:
- Don't use the same expanded name for two different things of the
same sort in different languages under your control.
- As a language evolves,
use new expanded names for new things, don't recycle old ones.
- There are at least four things that "ignore variants" might mean:
-
Pick a variant and stay with it.
- The constructed URI names something in a
distinguished variant
- For example the first variant
- Collect all variants.
- The constructed URI names the set of
things named across
all variants in which the name is used.
- Abstract over all variants.
- The constructed URI names
whatever is common across all the members of the above set.
- Accept that the name is contingent.
- The
constructed URI will name different things at different times.
- Requires
imposing an order, typically temporal, across all possible variants.
- Interpreting the URI to mean the largest member of the above set with respect to that order.
- The last of the above alternatives is, of course, similar to the way
most URIs already function.
- The resource identified by
http://www.guardian.co.uk/ is time-varying.
- If you want a
particular edition of the Guardian newspaper, you have to use a much more
complex URI.
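The four readings of "ignore variants" can be contrasted on a toy example. In this sketch each variant's definition of one named thing is reduced to a set of property labels; the variant identifiers and the labels are invented for illustration.

```python
# Variants in release order (hypothetical identifiers).
order = ["1.0", "1.1", "2.0"]

# For one name: what each variant's definition says, as a set of labels.
defs_by_variant = {
    "1.0": {"optional", "integer"},
    "1.1": {"optional", "integer", "bounded"},
    "2.0": {"integer", "bounded"},
}


def present(order, defs):
    """Variants, in order, in which the name is actually defined."""
    return [v for v in order if v in defs]


def pick_distinguished(order, defs):
    """Reading 1: the thing named in one fixed variant (here, the first)."""
    return defs[present(order, defs)[0]]


def collect_all(order, defs):
    """Reading 2: the set of things named across all variants."""
    return [defs[v] for v in present(order, defs)]


def abstract_over(order, defs):
    """Reading 3: whatever is common to all members of that set."""
    return set.intersection(*collect_all(order, defs))


def latest(order, defs):
    """Reading 4: the largest member with respect to the imposed order."""
    return defs[max(present(order, defs), key=order.index)]
```

Only the fourth reading changes its answer when a new variant is appended to `order`, which is the sense in which the name is contingent.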
- What is the starting point for URI construction?
- When there is a
one-to-one relation between language and namespace name (ignoring variants)
- That's the starting point.
- What other cases are there?
-
There is no namespace name.
-
DocBook
and Specprod are widely used languages for document markup
- Their definitions consist of elements and attributes in no namespace.
- The obvious choice of starting point in such cases is the URI of the official language definition.
-
There are multiple namespace names, all specific to the language.
- Many languages defined using W3C XML Schema are in this category,
e.g. UBL, JDF, EuroferXML, . . .
- Fortunately, in all but one pathological case there is
a functional mapping from namespace name to language.
- All this adds up to saying there is a single starting-point URI we can use
for all names.
- Whether the above story leads us to a namespace name or a language
definition URI
-
Sometimes this URI will also encode some
variant information
- It would still be a good idea to have a single unchanging URI which names
a language independent of variation. . .
- We've already established that there are five dimensions along which
the constituents of a language need to be identified: language, variant, sort,
namespace name, local name.
- The previous section effectively covers language
and namespace name, leaving variant, sort and local name.
- I will assume
without argument that
http: URIs are the goal.
- This gives us
three syntactic mechanisms to exploit to produce a name from our starting
point, which we'll schematise as http://starting/point/:
-
Additional path components, i.e.
http://starting/point/more/goes/here/
-
Parameters, i.e.
http://starting/point/?more=this&other=that
-
Fragment identifiers. This case sub-divides based on whether we use
a new media type for the representations retrievable via our constructed URIs
or not:
-
Existing XML media type(s)
- Either a shorthand pointer, i.e.
http://starting/point/#ncname or an
XPointer using a new scheme, i.e. http://starting/point/#more(goes,here)
-
New media type(s)
- Wide-open, only subject to
http: syntax rules, i.e. http://starting/point/#more;goes~here
- To cover the full complexity discussed above, we must pick from the above
resources to encode all three of variant, sort and local name.
- The compromise approach sometimes needs all three, sometimes variant or sort or both are not needed, but it always needs the local name.
- So choosing a syntax which allows the encoding of variant or sort to be easily elided would be a good thing.
- In cases such as XML attributes or elements whose identity is
determined by context, the space of sorts is open-ended
- So for the full complexity case some form of path-based component seems inescapable.
- Insofar as it makes sense to describe a generic solution, independent
of the details of particular languages, then I like something along the
following lines:
-
variant
- Encode as a numeric+optional alphabetic path component, e.g.
http://www.w3.org/1999/xhtml/1.1/, http://www.w3.org/1999/XSL/Transform/2.0/
-
simple sorts
- Encode as an alphabetic path component, e.g.
http://www.w3.org/2001/XMLSchema/simpleType/, http://www.w3.org/1999/XSL/Transform/attribute/
-
local name
- Encode as a shorthand fragment identifier, e.g.
http://www.w3.org/1999/xlink/#href, http://www.w3.org/2005/xpath-functions/#tokenize
-
context-specific sorts
- Encode as a fragment identifier using an XPointer scheme, existing
if possible, otherwise new, e.g.
http://www.w3.org/1999/xhtml/#xpath(//hr/@align) (the align attribute of the hr element in XHTML), http://www.w3.org/2001/04/xmlenc/#xscd(/~EncryptedType/EncryptedMethod) (the EncryptedMethod element as defined by the W3C XML Schema schema for the XML Encryption language).
- The intention is that where necessary and/or appropriate, these can
all be combined.
- When anything necessary to uniquely identify something is
left out, the alternative abstractions come in to play.
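The combination rule described above can be sketched as a small function. This is an illustration rather than a specification: the function and parameter names are hypothetical, and no escaping of the fragment is attempted.

```python
from typing import Optional


def component_uri(start: str,
                  local_name: Optional[str] = None,
                  *,
                  variant: Optional[str] = None,
                  sort: Optional[str] = None,
                  pointer: Optional[str] = None) -> str:
    """Combine a starting point with optional variant, sort and a fragment.

    variant and sort become path components and may be elided.
    The fragment is either a bare local name (shorthand pointer) or a
    scheme-based pointer for context-specific sorts, but never both.
    """
    if (local_name is None) == (pointer is None):
        raise ValueError("give exactly one of local_name or pointer")
    parts = [start.rstrip("/")]
    if variant:
        parts.append(variant)  # e.g. "1.1"
    if sort:
        parts.append(sort)     # e.g. "simpleType"
    fragment = local_name if local_name is not None else pointer
    return "/".join(parts) + "/#" + fragment
```

With this sketch, `component_uri("http://www.w3.org/1999/xlink", "href")` yields `http://www.w3.org/1999/xlink/#href`, and passing `pointer="xpath(//hr/@align)"` with the XHTML starting point yields the context-specific form shown above.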
- Consider the case of the W3C XML Schema language itself.
- The expanded name
< http://www.w3.org/2001/XMLSchema, attribute > has definitions therein as four different sorts of things:
- a key
- a complex type
- a top-level element type
- a context-specific element type
- Not all of these definitions stayed the same between the original Recommendation and the second edition.
- Accordingly we could establish a naming convention which yielded all the following URIs for things with that expanded name:
-
http://www.w3.org/2001/XMLSchema/key/#attribute
- The key, no variant specified
-
http://www.w3.org/2001/XMLSchema/1.0.2/complexType/#attribute
- The complex type, as defined in the second edition
-
http://www.w3.org/2001/XMLSchema/1.0/element/#attribute
- The top-level element type, as defined in the first edition
-
http://www.w3.org/2001/XMLSchema/csElement/#xscd(/group::attrDecls/attribute)
- The context-specific element type, as it appears in the
attrDecls model group definition, no variant specified.
-
http://www.w3.org/2001/XMLSchema/#attribute
- If we chose to, we could say this was the most
recent top-level element. That is, for each dimension, it's open to us to say
how to disambiguate in ambiguous cases.
- The proposal outlined above results in a large number of URIs for any
given language with S sorts and V variants, perhaps as many as
(S × V) + S + V + 1
- That's 1
for the language, abstracting over variants and sorts, one for each sort,
abstracting over variants, one for each variant, abstracting over sorts, and one
for each sort with respect to each variant.
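The count can be checked by enumeration: treating "abstracted over" as one extra choice on each dimension gives (S + 1) × (V + 1) slots, which expands to the same total. The sort and variant labels in this sketch are placeholders.

```python
from itertools import product


def uri_slots(sorts, variants):
    """Enumerate (variant, sort) slots; None means 'abstracted over'."""
    return list(product([None, *variants], [None, *sorts]))


# Placeholder labels: S = 3 sorts, V = 2 variants.
sorts = ["element", "attribute", "simpleType"]
variants = ["1.0", "1.1"]
slots = uri_slots(sorts, variants)

S, V = len(sorts), len(variants)
expected = (S * V) + S + V + 1  # the total given in the text
```

The slot `(None, None)` is the URI for the language as a whole, and `len(slots)` equals `expected`.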
- What kind of resource should be
identified by each such URI?
- Clearly, at least, one with anchors for all the
barenames appropriate to the URI.
- And what should each such anchor correspond
to?
- A definition of the thing named, of course.
- Presumably
that should include connections to any published definitions, either formal or
in natural language.
- But the details of how this should be done, and what
further information should be provided, that is, the design of a generic
language definition information document, although the original goal of this
work, will have to be left for another day.
- One thing we'd like to find in a definition is a set of variants with
respect to which it's valid.
- This in turn would support a minimal coherence
condition
- given that the discussion above implies the existence of a partial function from name × sort × variant to
definitions (partial because not all sorts exist in all variants, and not
all names name things of all sorts).
- Informally, we would then hope that
defn(name,sort,variant)=defn implies variant in defn.variants.
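The coherence condition is easy to state as a check over a table. In this sketch the partial function is modelled as a dictionary, so missing keys are simply undefined; the `Definition` class and its fields are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Definition:
    """A definition together with the variants for which it is valid."""
    text: str
    variants: FrozenSet[str]


Key = Tuple[str, str, str]  # (name, sort, variant)


def incoherent(defn: Dict[Key, Definition]) -> List[Key]:
    """Return the keys whose definition does not list the key's variant."""
    return [key for key, d in defn.items() if key[2] not in d.variants]


# Illustrative data: one definition shared by two variants.
shared = Definition("top-level element", frozenset({"1.0", "1.0.2"}))
table = {
    ("attribute", "element", "1.0"): shared,
    ("attribute", "element", "1.0.2"): shared,
}
```

An empty result from `incoherent(table)` means the minimal condition holds. It says nothing about whether the definitions themselves are correct.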
- This has been an exploration of the structure of names in XML languages
- As defined by other XML languages
- And identified by URIs
- It's been driven both bottom-up
- What's actually out there
- And top-down
- What requirements do we think we have for names?
- There's still lots more work to do!