Abstract
Python is a very popular language for processing XML, thanks to its flexibility and to the work of many open-source developers in the XML-SIG and elsewhere. The most popular XML processing models, DOM and SAX, are well represented, as are many other important technologies, from XSLT and XPath through RDF.
But realizing the full strength of Python for XML processing requires more specialized models to emerge. The dynamism of Python allows for next-generation data bindings that use declarative forms for mirroring XML vocabularies. Combining the core strengths of Python and XML could provide even more power, for example by registering XPatterns for dispatch during parsing and XPaths for triggering processing. Python's strengths in introspection allow for simplified serialization to XML. Its variety of polymorphic hooks allows for pluggable datatype libraries, among other advantages. These strengths add up to especially rich forms of data binding that do not depend on an object or relational view of XML data.
This paper presents an overview of the many XML processing tools and techniques available for Python, and focuses especially on tools that make the most of Python's strengths in XML processing.
Python combines excellent readability with excellent flexibility. It has enjoyed steady growth since the early nineties and has recently won over as converts some of the best minds in the software development profession. It achieves a remarkable balance by remaining very accessible to beginners while treating everyone as a "consenting adult" and allowing experts innovative expression without clumsy encumbrances. Python has several technical strengths that make it especially strong for XML processing, including:
Well-designed Unicode support and extensive, efficient built-in text-processing libraries
A variety of Internet protocol libraries that support the many aspects of XML that are designed for the Web
The highly tuned dictionary type, which allows for efficient associative arrays of text structures and so facilitates many XML processing tasks
Generators and iterators, which provide for expressive and efficient code for manipulating the sort of tree structures that lie at the heart of XML
A core set of built-in XML and HTML processing libraries
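As a rough illustration of the point about generators and trees, the following sketch walks a DOM tree lazily in document order. It is written for Python 3 (unlike the Python 2 listings in this paper) and uses only the standard-library minidom; the function name walk and the tiny sample document are assumptions made for illustration, not part of any package discussed here.

```python
from xml.dom import minidom

def walk(node):
    """Yield node, then each of its descendants, in document order."""
    yield node
    for child in node.childNodes:
        # Delegate to the recursive generator for each subtree
        yield from walk(child)

doc = minidom.parseString("<memo><title>Usura</title><to>The Art World</to></memo>")

# Filter the lazy stream of nodes down to element names
element_names = [n.nodeName for n in walk(doc) if n.nodeType == n.ELEMENT_NODE]
print(element_names)
```

Because walk is a generator, a caller that only needs the first matching node can stop iterating early without visiting the rest of the tree.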
One of the best things about Python/XML is the active community of practitioners and contributors. From introductory texts to references to mailing lists, these resources provide answers to most questions worth asking about Python and XML. The [XML-SIG] is the primary focus of Python work for XML, and the [XML-SIG-ML] is a good place for discussion. The XML-SIG has also produced some important general XML work such as the XML Bookmark Exchange Language (XBEL), which is now used in several Web browsers. As I've explored in my [PYTHON-XML] column, there are over 60 third-party packages for Python/XML processing, and constant growth in the development of new packages and capabilities.
There are innumerable tools and techniques for processing XML in Python. There is certainly no room in this paper to cover all of them, but I shall present a selection of those with which I'm familiar.
Simple API for XML (SAX) ([SAX]) is the dominant push processing API for XML. A SAX library comes built into Python 2.0 and later (the xml.sax module). One can usually get recent updates and bug fixes by installing the latest PyXML, which provides a drop-in replacement for Python's built-in SAX. The following is a small SAX program that draws a crude graph of the tree structure of an XML document.
# This is a special form of string literal that can span multiple lines
xml_source = """<?xml version='1.0'?>
<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
<body>
It has come to our attention that the basis for art production
Has shifted from keen patronage to vulgar commercial measure.
Management is concerned this will erode the lasting value of the age's works.
</body>
</memo>
"""
import string
from xml import sax
#We subclass from ContentHandler in order to gain default behaviors
class TreePrintHandler(sax.ContentHandler):
    #A class that handles XML events and draws a tree from the document structure
    def __init__(self):
        #In Python, self is conventionally used as a name of the value
        #That represents the instance on which the method is being invoked
        #This means that you handle instance variables as attributes of self
        self.depth = 0
        self.increment = " "
        return
    def startDocument(self):
        print "--- [Document]"
        self.depth = self.depth + 1
        return
    def startElement(self, name, attributes):
        #In Python multiplying a string by an Integer repeats the string that
        #many times
        print self.increment*self.depth + "+-- [Element <" + name + ">]"
        self.depth = self.depth + 1
        return
    def endElement(self, name):
        self.depth = self.depth - 1
        return
    def characters(self, text):
        #Only print out any information on this string
        #if it has non-space characters
        if not text.isspace():
            print self.increment*self.depth + "+-- [Text <" + text + ">]"
        return
#A little Python magic that allows this program to run as a stand-alone script
if __name__ == '__main__':
    #Create a new parser instance, using the default XML engine SAX provides
    parser = sax.make_parser()
    #Create an instance of our handler class, which will be registered
    #to receive SAX events
    handler = TreePrintHandler()
    #Pass a string to be parsed, and pass the handler to be registered
    #to receive SAX events.
    sax.parseString(xml_source, handler)
    #At this point, the parser has completed processing, and all events
    #have been dispatched. We're done.
SAX was designed with Java in mind, and it has never really fit comfortably into Python shoes. Namespaces are an especially ugly influence on the API. And regardless of the language, SAX has a reputation for difficulty. Managing SAX is all in the state machine technique, but the reality is that XML processing is usually an adjunct to broader data processing, and the complexities of state management in callback APIs can become impenetrable for even the most practiced hands.
Document Object Model (DOM) ([DOM]) also comes built into Python, in the form of minidom. Again the Python API is an approximation of the official W3C binding, which is specified in Common Object Request Broker Architecture (CORBA) Interface Definition Language (IDL). Again it doesn't make the most comfortable fit with Python, but at least the Python DOMs are easier to use for many purposes than SAX.
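To give a feel for the state management involved, here is a minimal sketch, written for Python 3 rather than the Python 2 of the listing above, that pulls three fields out of a shortened version of the memo. The handler name MemoFieldsHandler, its FIELDS tuple, and its attributes are hypothetical names chosen for illustration. Even this trivial task needs an explicit "where am I?" variable that three separate callbacks must keep consistent.

```python
from xml import sax

MEMO = """<memo>
<title>With Usura Hath no Man a House of Good Stone</title>
<date form="ISO-8601">1936-04-03</date>
<to>The Art World</to>
</memo>"""

class MemoFieldsHandler(sax.ContentHandler):
    """Collect the text of selected elements with a tiny state machine."""
    FIELDS = ("title", "date", "to")  # hypothetical choice of fields

    def __init__(self):
        super().__init__()
        self.current = None  # name of the field we are inside, or None
        self.fields = {}

    def startElement(self, name, attributes):
        if name in self.FIELDS:
            self.current = name
            self.fields[name] = ""

    def endElement(self, name):
        if name == self.current:
            self.current = None

    def characters(self, text):
        # Text may arrive in several chunks, so accumulate rather than assign
        if self.current is not None:
            self.fields[self.current] += text

handler = MemoFieldsHandler()
# Passing bytes works across Python 3 versions
sax.parseString(MEMO.encode("utf-8"), handler)
print(handler.fields["date"])
```

Every new requirement (nested fields, attributes, mixed content) adds more state variables and more transitions, which is where SAX code tends to become hard to follow.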
#The import
from xml.dom import minidom
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
#Create a minidom document node parsed from XML in a string
#You would use just parse() to parse a file
doc_node = minidom.parseString(DOC)
#You can execute regular DOM operations on the document node
verse_element = doc_node.documentElement
print verse_element
#And you can even use "Pythonic" shortcuts for things like
#Node lists and named node maps
#The first child of the verse element is a white space text node
#The second is the attribution element
attribution_element = verse_element.childNodes[1]
print attribution_element
#attribution_string becomes "Christopher Okigbo"
attribution_string = attribution_element.firstChild.data
print attribution_string
print
#Print the third line
#Use list comprehensions to isolate line elements
lines = [ node for node in verse_element.childNodes if node.nodeName == u'line' ]
third_line = lines[2]
#Normalization required to be sure adjacent text nodes are merged
third_line.normalize()
print third_line.firstChild.data
#Write the same XML back out to stdout
print doc_node.toxml(encoding="utf-8")
You can see how much clumsy maneuvering is required to pluck the third line element's content out using the DOM interface. Many Python users dislike this interface. Fredrik Lundh, who has been a prolific contributor to Python and Python/XML tools, recently commented that "the DOM API is designed for consultants, not for humans or computers", and Guido van Rossum, the brains behind Python, famously yelled out "DOM sucks" at a Python conference session. Many of people's problems with DOM stem from its very rigid and unfriendly interface, which causes disenchantment among users of other languages as well. But Python is especially subject to DOM bloat, the problem that DOM keeps the whole document tree in memory. Python objects are well known for taking up a lot of memory (the cost of dynamism), and the many small objects that compose the document add up.
Several tools seek to minimize DOM bloat and improve parsing and mutation speed by moving the expensive parts into C data structures. Some, such as PyRXP, maintain DOM's tree-like basics but offer a more specialized API. One example that tries to stick to the DOM API where practical is cDomlette, available as part of [4SUITE]. cDomlette is optimized for XPath operations, speed, and relatively low memory overhead, at least when compared to 4DOM and minidom. It is not fully DOM compliant, but it does provide an interface very close to DOM Level 2. In Domlette, where DOM and XPath disagree, XPath wins. The following is a translation of the above listing to cDomlette:
#NonvalidatingReader is a reader singleton object for parsing XML
#Print is a function for serializing nodes back to XML strings
from Ft.Xml.Domlette import NonvalidatingReader, Print
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
#4Suite is pretty insistent about wanting base URIs for source documents
doc_node = NonvalidatingReader.parseString(DOC, 'urn:bogus:dummy')
verse_element = doc_node.documentElement
print verse_element
#1 index rather than 0 to account for the white space text node
attribution_element = verse_element.childNodes[1]
print attribution_element
attribution_string = attribution_element.firstChild.data
print attribution_string
print
#Print the third line
#Use list comprehensions to isolate line elements
lines = [ node for node in verse_element.childNodes if node.nodeName == u'line' ]
third_line = lines[2]
#Normalization not needed because cDomlette normalizes all text nodes on parse
print third_line.firstChild.data
#Write the same XML back out to stdout
Print(doc_node, encoding="utf-8")
Pull processing is a model for processing that seeks to combine the efficiency of SAX's approach -- stream through and process a particular window of markup -- and the ease of DOM's -- walk through the hierarchy and manipulate nodes in place. In pull processing, you track the stream of markup events, and when you get to a particular window of interest, you convert it to a hierarchy of nodes within that window that is available for random access.
Python 2.0 and later come with a pull DOM implementation, pulldom, built in. The following listing demonstrates this API:
from xml.dom import pulldom
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
#Print the text of the third line element
events = pulldom.parseString(DOC)
line_counter = 0
for (event, node) in events:
    if event == pulldom.START_ELEMENT:
        if node.tagName == "line":
            line_counter += 1
            #Switch from streaming mode to random-access mode
            if line_counter == 3:
                events.expandNode(node)
                #Traditional DOM processing starts here
                #Print the text data of the text node
                #of the current (line) element
                print node.firstChild.data
libxml is the popular C library for XML processing that came from the GNOME project. [LIBXML] is its Python binding, and it includes a pull API similar to one pioneered in Microsoft's msxml. The following listing is the equivalent of the last one:
import cStringIO
import libxml2
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
XMLREADER_START_ELEMENT_NODE_TYPE = 1
#Print the text of the third line element
stream = cStringIO.StringIO(DOC)
input_source = libxml2.inputBuffer(stream)
#Use a bogus base URI for the input source since nothing in the
#XML requires relative resolution
reader = input_source.newTextReader("urn:bogus")
line_counter = 0
while reader.Read():
    if reader.NodeType() == XMLREADER_START_ELEMENT_NODE_TYPE:
        if reader.Name() == "line":
            line_counter += 1
            if line_counter == 3:
                #Switch from streaming mode to random-access mode
                node = reader.Expand()
                print node.children.content
                #Skip the subtree, since we just expanded it
                if reader.Next() != 1:
                    break
These Pull APIs certainly provide the efficiency of SAX, but looking at the above examples, it's hard to argue that they meet the additional goal of simplicity.
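Much of the ceremony in the pulldom listing can be hidden behind a generator. The following is a minimal sketch under stated assumptions: it targets Python 3 and the standard-library pulldom, and the helper names expanded_elements and text_of are hypothetical, not part of pulldom itself.

```python
from xml.dom import pulldom

DOC = """<verse>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>"""

def expanded_elements(xml_text, tag_name):
    """Yield each element named tag_name as a fully expanded DOM subtree."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag_name:
            # Switch from streaming to random access for this subtree only
            events.expandNode(node)
            yield node

def text_of(element):
    """Concatenate the data of the element's direct text-node children."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)

line_texts = [text_of(line) for line in expanded_elements(DOC, "line")]
print(line_texts[2])
```

The caller now sees only a sequence of small DOM subtrees, while the event loop and the switch between streaming and random access stay inside the helper.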
With concerns that DOM and SAX are too alien to the Python way of life, Python developers have come up with a variety of more specialized APIs for XML processing. Using flexible object attributes, iterators, the sequence and mapping protocols, and other Python idioms, these packages allow users to get things done with fewer lines of code and fewer arcane methods and functions to memorize.
One of the most mature examples of a pure Pythonic API is elementtree, which turns XML into a simple Python data structure that focuses on elements.
import sys
import cStringIO
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
stream = cStringIO.StringIO(DOC)
from elementtree.ElementTree import ElementTree
root = ElementTree(file=stream)
third_line = root.findall('line')[2]
print third_line.text
The simplicity gained by adapting Python's sequence protocol is immediately clear in this example.
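For readers on newer Python versions, elementtree later entered the standard library as xml.etree.ElementTree (from Python 2.5 onward). The sketch below, written for Python 3, shows the sequence protocol more explicitly: an element can be indexed, measured with len(), and iterated like a list of its child elements. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

DOC = """<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>"""

verse = ET.fromstring(DOC)

# Indexing an element returns a child element; white space is not a child
attribution = verse[0]
# len() counts child elements
child_count = len(verse)
# Iterating an element visits its children in document order
tags = [child.tag for child in verse]
# findall() returns a plain Python list that can be indexed directly
third_line_text = verse.findall("line")[2].text

print(attribution.text)
print(third_line_text)
```

Note that, unlike the DOM listings, there is no white space text node to step around: text lives in the text attribute of each element.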
A data binding is a system for viewing XML documents as databases or as programming language data structures, and vice versa. There are many aspects of data bindings, including rules for converting XML into specialized Python data structures and the reverse (unmarshalling and marshalling), using schemata to provide hints and intended data constructs to marshalling and unmarshalling systems, mapping XML data patterns to Python functions, and controlling Python data structures with native XML technologies such as XPath. A data binding essentially serves as a very Pythonic API, but in this paper the main distinction made in calling a system a data binding lies in the basics of marshalling and unmarshalling. In data bindings, the very shape of the resulting Python data structure is set by the XML vocabulary. elementtree, for example, uses generic Python attributes and methods that treat the XML information item names as parameters. Data bindings take the more direct approach of reflecting information item names in Python attribute and method names.
Gnosis XML Utilities is a Python package with a variety of utility classes for data management, especially for XML processing. The gnosis.xml.objectify module allows you to convert arbitrary XML documents to Python objects. At its most basic, it implements simple marshalling and unmarshalling, but it is also a sophisticated data binding tool. The following listing demonstrates:
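To make the distinction concrete, here is a deliberately tiny toy binding, written for Python 3 on top of the standard-library xml.etree.ElementTree. The Bound class and bind function are hypothetical illustrations, not the API of any package discussed in this paper. Element names from the document become Python attribute names, which is the defining trait of a data binding as the term is used here.

```python
import xml.etree.ElementTree as ET

DOC = """<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>"""

class Bound:
    """Toy bound object: child element names become attributes."""
    def __init__(self, element):
        self.text = (element.text or "").strip()
        for child in element:
            # Each child name maps to a list, so repeated elements such as
            # <line> keep their document order. Names that collide with
            # "text" or are not legal identifiers are not handled here.
            self.__dict__.setdefault(child.tag, []).append(Bound(child))

def bind(xml_text):
    """Unmarshal an XML string into a tree of Bound objects."""
    return Bound(ET.fromstring(xml_text))

verse = bind(DOC)
# The XML vocabulary ("line", "attribution") now shapes the Python API
print(verse.line[2].text)
print(verse.attribution[0].text)
```

Compare verse.line[2].text with the generic form findall('line')[2].text: in the former the vocabulary is part of the object's shape, in the latter it is a string parameter.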
import gnosis.xml.objectify
import cStringIO
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
stream = cStringIO.StringIO(DOC)
#Set up a preparatory object with a DOM tree
#from which the Python structure is created
dom_obj = gnosis.xml.objectify.XML_Objectify(stream)
#make_instance method does the actual work of generating the Python structure
verse = dom_obj.make_instance()
print verse.line[2].PCDATA
Again you can see how directly one accesses the parts of the XML document using Python. The nuance in this case is that the XML document is accessed using Python constructs that use the same vocabulary as the XML document itself. This is the hallmark of a good data binding tool.
elementtree and gnosis.xml.objectify bring the power and expressiveness of Python to XML processing in a very direct fashion, but in a manner of speaking they are more partial to Python in their design than to XML. This is not necessarily a bad thing, but it does leave a small opening for another very Pythonic API to emerge, one that is built as much from the view of the characteristics of XML as of Python. Anobind is such a data binding. The primary characteristics for which it is designed are:
Natural default binding (i.e. when given an XML file with no hints or customization)
Well-defined mapping from XML information item names to Python identifiers
Declarative, rules-based system for fine-tuning the binding
Support for RELAX NG schemata in providing hints for the binding
XPattern support for rules definition
Clean but compliant XML namespaces support
Strong support for document-style XML (especially with regard to mixed content)
Reasonable support for unbinding back to XML
Some flexibility in trading off between efficiency and features in the resulting binding
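To illustrate why a well-defined name mapping matters, the sketch below shows one possible rule for turning XML names into Python identifiers. It is written for Python 3 and is purely hypothetical: the function xml_name_to_identifier and its rule are assumptions for illustration, not Anobind's actual mapping, and it only addresses the common cases of punctuation, keywords, and name clashes.

```python
import keyword
import re

def xml_name_to_identifier(name, taken=()):
    """Map an XML name to a usable Python identifier (hypothetical rule)."""
    # Replace characters that cannot appear in identifiers: '-', '.', ':'
    ident = re.sub(r"\W", "_", name)
    # Avoid Python keywords and names already used on the bound object
    while keyword.iskeyword(ident) or ident in taken:
        ident += "_"
    return ident

print(xml_name_to_identifier("first-name"))
print(xml_name_to_identifier("xsl:template"))
print(xml_name_to_identifier("class"))
print(xml_name_to_identifier("line", taken={"line"}))
```

Whatever rule a binding adopts, the important property is that it is deterministic and documented, so that users can predict the attribute name for any element or attribute in their vocabulary.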
The following listing shows the equivalent code to the gnosis.xml.objectify example.
import anobind
from Ft.Xml import InputSource
DOC = """<?xml version="1.0" encoding="UTF-8"?>
<verse>
<attribution>Christopher Okigbo</attribution>
<line>For he was a shrub among the poplars,</line>
<line>Needing more roots</line>
<line>More sap to grow to sunlight,</line>
<line>Thirsting for sunlight</line>
</verse>
"""
#Create an input source for the XML
isrc_factory = InputSource.DefaultFactory
isrc = isrc_factory.fromString(DOC, 'urn:bogus')
#Now bind from the XML given in the input source
binder = anobind.binder()
binding = binder.read_xml(isrc)
print binding.verse.line[2].text_content()
One difference is that Anobind explicitly models the root node of the XML document (binding in the listing). In elementtree and gnosis.xml.objectify, the top-level bound object is the top-level element.
The most important thing to note is the variety and flexibility of XML processing tools for Python. There are tools and techniques for every taste and need. This paper covers only a few of the tools available, and within the scope of those tools it covers only a few of the techniques. Also available to developers are SAX filters, XPath bindings for minidom, cDomlette, elementtree and Anobind, generator idioms for DOM, and more. Even more varieties of Python/XML tools will likely emerge, because there are so many nuances one can meld from the Python and the XML worlds. XML processing tools for Python have been available for over four years now, and the body of experience regarding what works where ensures that we can expect to see even more remarkable things in this space as Python and XML develop.
[PYTHON-XML] "Python & XML" xml.com column http://www.xml.com/pub/q/pyxml
[SAX] Simple API for XML Project http://www.saxproject.org/
[DOM] Document Object Model http://www.w3.org/DOM/
[AKARA-SAX] U. Ogbuji, Basic SAX processing http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/sax
[4SUITE] 4Suite http://4Suite.org
[XBEL] The XML Bookmark Exchange Language (XBEL), http://xml.coverpages.org/xbel.html
[AKARA-DOMLETTE] U. Ogbuji, M. Brown, 4Suite's Domlettes http://uche.ogbuji.net/tech/akara/nodes/2003-01-01/domlettes
[LIBXML] Python/libxml http://xmlsoft.org/python.html
[UOGBUJI-PULLDOM] U. Ogbuji, Using pull-based DOMs http://www-106.ibm.com/developerworks/xml/library/x-tipulldom.html
[ELEMENTTREE] elementtree http://effbot.org/downloads/
[GNOSIS] Gnosis Utilities http://gnosis.cx/download/
[ANOBIND] Anobind http://uche.ogbuji.net/tech/4Suite/anobind/