XML 2003

Python Paradigms for XML

Abstract

Python is a very popular language for processing XML because of its flexibility and the work of many open-source developers in the XML-SIG and elsewhere. The most popular XML processing models, DOM and SAX, are well represented, as are many other important technologies, from XSLT and XPath through RDF.

But the real strength of Python for XML processing emerges only with more specialized models. The dynamism of Python allows for next-generation data bindings that use declarative forms for mirroring XML vocabularies. The combination of Python and XML core strengths could provide even more power, registering XPatterns for dispatch during parsing and XPaths for triggering of processing. Python's strengths in introspection allow for simplified serialization to XML. Its variety of polymorphic hooks allows for pluggable datatype libraries, among other advantages. These strengths add up to especially rich forms of data binding that do not depend on an object or relational view of XML data.

This paper presents an overview of the many XML processing tools and techniques available for Python, and focuses especially on tools that make the most of Python's strengths in XML processing.

Keywords

Data Binding, DOM, Parser, Python, RDF, RELAX NG, SAX, Schema, WXS, XML, XPath, XSLT.

Table of Contents

1. Choosing Python for XML processing
2. Simple API for XML
3. Document Object Model
3.1. cDomlette
4. Pull processing models
4.1. pulldom
4.2. libxml's XmlTextReader
5. Specialized Python APIs
5.1. elementtree
6. Python/XML data bindings
6.1. Gnosis Utilities
6.2. Anobind
7. Closing
Bibliography
Glossary
Biography

1. Choosing Python for XML processing

Python combines excellent readability with excellent flexibility. It has enjoyed steady growth since the early nineties and has recently won over as converts some of the best minds in the software development profession. It achieves a remarkable balance by remaining very accessible to beginners while treating everyone as a "consenting adult", and allowing experts innovative expression without clumsy encumbrances. Python has several technical strengths that make it especially strong for XML processing, including:

  • Well-designed Unicode support, and extensive and efficient built-in text-processing libraries

  • A variety of Internet protocol libraries that support the many aspects of XML that are designed for the Web

  • The highly-tuned dictionary type, which allows for efficient associative arrays of text structures, which facilitates many XML processing tasks

  • Generators and iterators, which provide for expressive and efficient code for manipulating the sort of tree structures that lie at the heart of XML

  • A core set of built-in XML and HTML processing libraries

One of the best things about Python/XML is the active community of practitioners and contributors. From introductory texts to references to mailing lists, these resources will provide answers to most questions worth asking about Python and XML. The [XML-SIG] is the primary focus of Python work for XML, and the [XML-SIG-ML] is a good place for discussion. The XML SIG has also produced some important general XML work such as the XML Bookmark Exchange Language (XBEL), which is now used in several Web browsers. As I've explored in my [PYTHON-XML] column, there are over 60 third-party packages for Python/XML processing, and constant growth in the development of new packages and capabilities.

There are innumerable tools and techniques for processing XML in Python. There is certainly no room in this paper to cover all of them, but I shall present a selection of those with which I'm familiar.

2. Simple API for XML

Simple API for XML (SAX) ([SAX]) is the dominant push processing API for XML. A SAX library (the xml.sax module) comes built into Python 2.0 and later. One can usually get recent updates and bug fixes by installing the latest PyXML, which provides a drop-in replacement for Python's built-in SAX. The following is a small SAX program that draws a crude graph of the tree structure of an XML document.


# This is a special form of string literal that can span multiple lines
xml_source = """<?xml version='1.0'?>
<memo>
  <title>With Usura Hath no Man a House of Good Stone</title>
  <date form="ISO-8601">1936-04-03</date>
  <to>The Art World</to>
  <body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
  </body>
</memo>
"""


from xml import sax

#We subclass from ContentHandler in order to gain default behaviors
class TreePrintHandler(sax.ContentHandler):
    #A class that handles XML events and draws a tree from the document structure

    def __init__(self):
        #In Python, self is conventionally used as a name of the value
        #That represents the instance on which the method is being invoked
        #This means that you handle instance variables as attributes of self
        self.depth = 0
        self.increment = "  "
        return

    def startDocument(self):
        print "--- [Document]"
        self.depth = self.depth + 1
        return

    def startElement(self, name, attributes):
        #In Python multiplying a string by an Integer repeats the string that
        #many times
        print self.increment*self.depth + "+-- [Element <" + name + ">]"
        self.depth = self.depth + 1
        return

    def endElement(self, name):
        self.depth = self.depth - 1
        return

    def characters(self, text):
        #Only print out any information on this string
        #if it has non-space characters
        if not text.isspace():
            print self.increment*self.depth + "+-- [Text <" + text + ">]"
        return

#A little Python magic that allows this program to run as a stand-alone script
if __name__ == '__main__':
    #Create an instance of our handler class, which will be registered
    #to receive SAX events
    handler = TreePrintHandler()
    #Pass a string to be parsed by the default XML engine SAX provides,
    #and pass the handler to be registered to receive SAX events.
    sax.parseString(xml_source, handler)
    #At this point, the parser has completed processing, and all events
    #have been dispatched.  We're done.


      

SAX was designed with Java in mind, and it has never really fit comfortably into Python shoes. Namespaces are an especially ugly influence on the API. And regardless of the language, SAX has a reputation for difficulty. Managing SAX is all in the state machine technique, but the reality is that XML processing is usually an adjunct to broader data processing, and the complexities of state management in call-back APIs can become impenetrable for even the most practiced hands.
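The state machine technique can be sketched as follows for even a very small task: printing the text of the third line element of a short verse document. This is only an illustration, assuming a modern Python 3 interpreter (the listings in this paper use Python 2 syntax); the handler name NthLineHandler and its attributes are hypothetical.

```python
from xml import sax

VERSE = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

class NthLineHandler(sax.ContentHandler):
    """Hypothetical handler that collects the text of the Nth line element."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted      # which line element we are after (1-based)
        self.line_count = 0       # state: how many line elements seen so far
        self.in_target = False    # state: are we inside the wanted element?
        self.chunks = []          # text may arrive in several pieces

    def startElement(self, name, attributes):
        if name == "line":
            self.line_count += 1
            self.in_target = (self.line_count == self.wanted)

    def endElement(self, name):
        if name == "line":
            self.in_target = False

    def characters(self, text):
        if self.in_target:
            self.chunks.append(text)

    def result(self):
        return "".join(self.chunks)

handler = NthLineHandler(3)
# Passing bytes works across Python 3 versions
sax.parseString(VERSE.encode("utf-8"), handler)
print(handler.result())
```

Three separate pieces of state are needed just to locate one element, which hints at how quickly the bookkeeping grows in realistic applications.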

3. Document Object Model

Document Object Model (DOM) ([DOM]) also comes built into Python in the form of minidom. Again the Python API is an approximation of the official W3C binding, which is provided in Common Object Request Broker Architecture (CORBA) Interface Definition Language (IDL). Again it doesn't make the most comfortable fit with Python, but at least the Python DOMs are easier to use for many purposes than SAX.


#The import
from xml.dom import minidom

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

#Create a minidom document node parsed from XML in a string
#You would use just parse() to parse a file
doc_node = minidom.parseString(DOC)

#You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement
print verse_element
#And you can even use "Pythonic" shortcuts for things like
#Node lists and named node maps
#The first child of the verse element is a white space text node
#The second is the attribution element
attribution_element = verse_element.childNodes[1]
print attribution_element
#attribution_string becomes "Christopher Okibgo"
attribution_string = attribution_element.firstChild.data
print attribution_string

print

#Print the third line
#Use list comprehensions to isolate line elements
lines = [ node for node in verse_element.childNodes if node.nodeName == u'line' ]
third_line = lines[2]
#Normalization required to be sure adjacent text nodes are merged
third_line.normalize()
print third_line.firstChild.data

#Write the same XML back out to stdout
print doc_node.toxml(encoding="utf-8")

      

You can see how much clumsy maneuvering is required to pluck the third line element's content out using the DOM interface. Many Python users dislike this interface. Fredrik Lundh, who has been a prolific contributor to Python and Python/XML tools, recently commented that "the DOM API is designed for consultants, not for humans or computers", and Guido van Rossum, the brains behind Python, famously yelled out at a Python conference session "DOM sucks". Many of people's problems with DOM stem from its very rigid and unfriendly interface, and this actually causes disenchantment among users in other languages as well. But Python is especially subject to DOM bloat, the problem that DOM keeps the whole document tree in memory. Python objects are well known for taking up a lot of memory (the cost of dynamism), and the many small objects that compose the document add up.

3.1. cDomlette

Several tools seek to minimize DOM bloat and improve parsing and mutation speed by moving the expensive stuff into C data structures. Some, such as PyRXP, maintain DOM's tree-like basics but offer a more specialized API. One example that tries to stick to the DOM API where practical is cDomlette, available as part of 4Suite ([4SUITE]). cDomlette is optimized for XPath operations, speed, and relatively low memory overhead, at least when compared to 4DOM and minidom. It is not fully DOM compliant, but it does provide an interface very close to DOM Level 2. In Domlette, where DOM and XPath disagree, XPath wins. A translation of the above listing to cDomlette:


#NonvalidatingReader is a reader singleton object for parsing XML
#Print is a function for serializing nodes back to XML strings
from Ft.Xml.Domlette import NonvalidatingReader, Print

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

#4Suite is pretty insistent about wanting base URIs for source documents
doc_node = NonvalidatingReader.parseString(DOC, 'urn:bogus:dummy')

verse_element = doc_node.documentElement
print verse_element

#1 index rather than 0 to account for the white space text node
attribution_element = verse_element.childNodes[1]
print attribution_element

attribution_string = attribution_element.firstChild.data
print attribution_string

print

#Print the third line
#Use list comprehensions to isolate line elements
lines = [ node for node in verse_element.childNodes if node.nodeName == u'line' ]
third_line = lines[2]
#Normalization not needed because cDomlette normalizes all text nodes on parse
print third_line.firstChild.data

#Write the same XML back out to stdout
Print(doc_node, encoding="utf-8")

      

4. Pull processing models

Pull processing is a model for processing that seeks to combine the efficiency of SAX's approach -- stream through and process a particular window of markup -- and the ease of DOM's -- walk through the hierarchy and manipulate nodes in place. In pull processing, you track the stream of markup events, and when you get to a particular window of interest, you convert it to a hierarchy of nodes within that window that is available for random access.

4.1. pulldom

Python 2.0 and later versions come with a pull-based DOM variant, pulldom, built in. The following listing demonstrates this API:


from xml.dom import pulldom
 
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""
 
#Print the text of the third line element
events = pulldom.parseString(DOC)
line_counter = 0
for (event, node) in events:
    if event == pulldom.START_ELEMENT:
        if node.tagName == "line":
            line_counter += 1
            #Switch from streaming mode to random-access mode
            if line_counter == 3:
                events.expandNode(node)
                #Traditional DOM processing starts here
                #Print the text data of the text node
                #of the current (line) element
                print node.firstChild.data

      

4.2. libxml's XmlTextReader

libxml is the popular C library for XML processing that came from the GNOME project. Its Python bindings ([LIBXML]) include a pull API similar to one pioneered in Microsoft's msxml. The following listing is the equivalent of the last one:


import cStringIO
import libxml2

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

XMLREADER_START_ELEMENT_NODE_TYPE = 1

#Print the text of the third line element
stream = cStringIO.StringIO(DOC)
input_source = libxml2.inputBuffer(stream)
#Use a bogus base URI for the input source since nothing in the
#XML requires relative resolution
reader = input_source.newTextReader("urn:bogus")

line_counter = 0
while reader.Read():
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        if reader.Name() == "line":
            line_counter += 1
            if line_counter == 3:
                #Switch from streaming mode to random-access mode
                node = reader.Expand()
                print node.children.content
                #Skip the subtree, since we just expanded it
                if reader.Next() != 1:
                    break

      

These pull APIs certainly provide the efficiency of SAX, but looking at the above examples, it's hard to argue that they meet the additional goal of simplicity.

5. Specialized Python APIs

With concerns that DOM and SAX are too alien to the Python way of life, Python developers have come up with a variety of more specialized APIs for XML processing. Using flexible object attributes, iterators, the sequence and mapping protocols, and other Python idioms, these packages allow users to get things done with fewer lines of code and fewer arcane methods and functions to memorize.

5.1. elementtree

One of the most mature examples of a pure Pythonic API is elementtree, which turns XML into a simple Python data structure that focuses on elements.


import cStringIO

DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

stream = cStringIO.StringIO(DOC)

from elementtree.ElementTree import ElementTree
root = ElementTree(file=stream)

third_line = root.findall('line')[2]

print third_line.text

      

The simplicity gained by adapting Python's sequence protocol is immediately clear in this example.
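For readers on later Python versions, the same idea can be sketched with xml.etree.ElementTree, the implementation of this API that has shipped in the standard library since Python 2.5. The sketch below assumes a modern Python 3 interpreter rather than the Python 2 syntax and separate elementtree package used in the listing above; the variable names are illustrative only.

```python
import io
from xml.etree import ElementTree

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

# parse() accepts a file-like object and returns an ElementTree instance
tree = ElementTree.parse(io.BytesIO(DOC))
root = tree.getroot()

# findall() returns a plain Python list, so ordinary indexing applies
lines = [line.text for line in root.findall("line")]
third_line = lines[2]
print(third_line)

# findtext() returns the text of the first matching child element
attribution = root.findtext("attribution")
print(attribution)
```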

6. Python/XML data bindings

A data binding is a system for viewing XML documents as databases or as programming-language data structures, and vice versa. There are many aspects of data bindings, including rules for converting XML into specialized Python data structures and the reverse (unmarshalling and marshalling), using schemata to provide hints and intended data constructs to marshalling and unmarshalling systems, mapping XML data patterns to Python functions, and controlling Python data structures with native XML technologies such as XPath. A data binding essentially serves as a very Pythonic API, but in this paper, the main distinction made in calling a system a data binding lies in the basics of marshalling and unmarshalling. In data bindings, the very shape of the resulting Python data structure is set by the XML vocabulary. elementtree, for example, uses generic Python attributes and methods that treat the XML information item names as parameters. Data bindings take the more direct approach of reflecting information item names in Python attribute and method names.
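The distinction between passing item names as parameters and reflecting them as attribute names can be made concrete with a toy binding. This is only a minimal sketch, assuming a modern Python 3 interpreter and the standard library's xml.etree.ElementTree; the class BoundElement is hypothetical and is not the design of any package discussed in this paper.

```python
from xml.etree import ElementTree

class BoundElement:
    """Hypothetical wrapper that exposes child elements as attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails
        if name.startswith("_"):
            raise AttributeError(name)
        children = self._element.findall(name)
        if not children:
            raise AttributeError(name)
        # Always return a list, so repeated elements index naturally
        return [BoundElement(child) for child in children]

    @property
    def text(self):
        # Note: this shadows any child element that happens to be named "text"
        return self._element.text or ""

DOC = """<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""

root = ElementTree.fromstring(DOC)

# Generic API: the element name is a parameter
generic_third = root.findall("line")[2].text

# Binding style: the element name is a Python attribute
verse = BoundElement(root)
bound_third = verse.line[2].text
print(bound_third)
```

Real data bindings must also deal with attributes, namespaces, mixed content and names that are not valid Python identifiers, all of which this sketch ignores.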

6.1. Gnosis Utilities

Gnosis XML Utilities is a Python package with a variety of utility classes for data management, especially utility classes for XML processing. The module gnosis.xml.objectify allows you to convert arbitrary XML documents to Python objects. At its most basic, it implements simple marshalling and unmarshalling, but it's also a sophisticated data binding tool. The following listing demonstrates:


import gnosis.xml.objectify
import cStringIO
 
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""
 
stream = cStringIO.StringIO(DOC)
#Set up a preparatory object with a DOM tree
#from which the Python structure is created
dom_obj = gnosis.xml.objectify.XML_Objectify(stream)
#make_instance method does the actual work of generating the Python structure
verse = dom_obj.make_instance()
 
print verse.line[2].PCDATA

      

Again you can see how directly one accesses the parts of the XML document using Python. The nuance in this case is that the XML document is accessed using Python constructs that use the same vocabulary as the XML document itself. This is the hallmark of a good data binding tool.

6.2. Anobind

elementtree and gnosis.xml.objectify bring the power and expressiveness of Python to XML processing in a very direct fashion, but in a manner of speaking they are more partial to Python in their design than to XML. This is not necessarily a bad thing, but it does leave a small opening for another very Pythonic API to emerge, one that is built as much from the view of the characteristics of XML as of Python. Anobind is such a data binding. The primary characteristics for which it is designed are:

  • Natural default binding (i.e. when given an XML file with no hints or customization)

  • Well-defined mapping from XML information item names to Python identifiers

  • Declarative, rules-based system for fine-tuning the binding

  • Support for RELAX NG schemata in providing hints for the binding

  • XPattern support for rules definition

  • Clean but compliant XML namespaces support

  • Strong support for document-style XML (especially with regard to mixed content)

  • Reasonable support for unbinding back to XML

  • Some flexibility in trading off between efficiency and features in the resulting binding

The following listing shows the equivalent code to the gnosis.xml.objectify example.


import anobind
from Ft.Xml import InputSource
 
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>For he was a shrub among the poplars,</line>
  <line>Needing more roots</line>
  <line>More sap to grow to sunlight,</line>
  <line>Thirsting for sunlight</line>
</verse>
"""
 
#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
isrc = isrc_factory.fromString(DOC, 'urn:bogus')
 
#Now bind from the XML given in the input source
binder = anobind.binder()
binding = binder.read_xml(isrc)
 
print binding.verse.line[2].text_content()

      

One difference is that Anobind explicitly models the root node of the XML document (binding in the listing). In elementtree and gnosis.xml.objectify, the top-level bound object is the top-level element.

7. Closing

The most important thing to note is the variety and flexibility of XML processing tools for Python. There are tools and techniques for every taste and need. This paper covers only a few of the tools available, and within the scope of those tools it covers only a few of the techniques. Also available to developers are SAX filters, XPath bindings for minidom, cDomlette, elementtree and Anobind, generator idioms for DOM, and more. There will likely be even more varieties of Python/XML tools, because there are so many nuances one can meld from the Python and the XML worlds. XML processing tools for Python have been available for over four years now, and the accumulated body of experience regarding what works where means that we can expect to see even more remarkable things in this space as Python and XML develop.
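As one illustration of the SAX filter technique mentioned above, the following is a minimal sketch built on XMLFilterBase from the standard library's xml.sax.saxutils module. It assumes a modern Python 3 interpreter; the classes ElementSkipFilter and TextCollector are hypothetical names used only here.

```python
import io
from xml import sax
from xml.sax.saxutils import XMLFilterBase

class ElementSkipFilter(XMLFilterBase):
    """Hypothetical filter that hides one element and everything inside it."""

    def __init__(self, parent, skip_name):
        super().__init__(parent)
        self.skip_name = skip_name
        self.skip_depth = 0   # greater than zero while inside a skipped subtree

    def startElement(self, name, attrs):
        if self.skip_depth or name == self.skip_name:
            self.skip_depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self.skip_depth:
            super().characters(content)

class TextCollector(sax.ContentHandler):
    """Downstream handler that simply gathers character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)

VERSE = b"""<verse>
  <attribution>Christopher Okibgo</attribution>
  <line>Needing more roots</line>
</verse>
"""

# Chain: parser -> filter -> collector
reader = ElementSkipFilter(sax.make_parser(), "attribution")
collector = TextCollector()
reader.setContentHandler(collector)
reader.parse(io.BytesIO(VERSE))

text = "".join(collector.chunks)
print(text.strip())
```

Because the filter presents the same interface as a parser, the downstream handler needs no knowledge that part of the document has been removed.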

Bibliography

[XML-SIG] Python XML SIG

[XML-SIG-ML] Python XML SIG mailing list

[PYTHON-XML] "Python & XML" xml.com column http://www.xml.com/pub/q/pyxml

[SAX] Simple API for XML Project http://www.saxproject.org/

[DOM] Document Object Model http://www.w3.org/DOM/

[AKARA-SAX] U. Ogbuji, Basic SAX processing http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/sax

[4SUITE] 4Suite http://4Suite.org

[XBEL] The XML Bookmark Exchange Language (XBEL), http://xml.coverpages.org/xbel.html

[AKARA-DOMLETTE] U. Ogbuji, M. Brown, 4Suite's Domlettes http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes

[LIBXML] Python/libxml http://xmlsoft.org/python.html

[UOGBUJI-PULLDOM] U. Ogbuji, Using pull-based DOMs http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html

[ELEMENTTREE] elementtree http://effbot.org/downloads/

[GNOSIS] Gnosis Utilities http://gnosis.cx/download/

Glossary

CORBA

Common Object Request Broker Architecture

DOM

Document Object Model

IDL

Interface Definition Language

SAX

Simple API for XML

Biography

Uche Ogbuji is a Computer Engineer, co-founder and CEO of Fourthought, Inc., a software vendor and consultancy specializing in open, standards-based XML solutions, especially as applicable to problems of knowledge management. He has worked with XML for several years, co-developing 4Suite, an open-source platform for XML and RDF applications, written in Python and C. He writes articles on XML for XML.com, ITWorld, IBM developerWorks, Intel Developer Services, Application Development Trends, and elsewhere. He also speaks extensively at various conferences.

Mr. Ogbuji is a Nigerian immigrant with a B.S. in Computer Engineering from Milwaukee School of Engineering. He currently resides in Boulder, Colorado where he enjoys playing soccer in the summer and snowboarding in the winter. His main interest is literature, and poetry in particular.