Keywords: non-extractive parsing, processing mode, incremental update, ASIC implementation
Biography
Jimmy Zhang is the founder of XimpleWare, a provider of high-performance XML processing solutions. Prior to founding XimpleWare, he worked for a few technology companies in Silicon Valley, in fields ranging from EDA (electronic design automation) to VoIP (voice over IP). He holds a BS in EECS and an MSEE from UC Berkeley.
As their first step, most existing XML parsing algorithms create many string objects by extracting tokens from the input XML document. We describe a "non-extractive" way of tokenizing XML that does not take the document apart. Using a binary encoding specification called Virtual Token Descriptor (VTD), we represent tokens exclusively by their starting offset and length. A VTD record is a 64-bit integer that encodes the starting offset, length, type, and nesting depth of a token in an XML document. A processing model based on VTD also requires that the original XML document be kept intact in memory. Because VTD records can be stored in chunk-based buffers, one can potentially achieve both high performance and efficient memory usage when processing XML. Also, because VTD is based entirely on offsets and lengths, it is inherently persistent. Our internal benchmark indicates that VTD [...] As XML becomes more ubiquitous, the new processing model can potentially offer a worthy alternative to DOM and SAX.
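To make the idea of a 64-bit record concrete, the following is a minimal Python sketch of packing and unpacking the four fields the abstract names. The field widths (30-bit offset, 20-bit length, 8-bit depth, 4-bit type, with 2 bits left unused), the token type codes, and the names `pack`, `unpack`, and `token_text` are assumptions made for illustration; the abstract does not give the actual bit layout.

```python
# Assumed field widths, for illustration only (not stated in the abstract).
OFFSET_BITS = 30
LENGTH_BITS = 20
DEPTH_BITS = 8
TYPE_BITS = 4

# Offset sits in the low 30 bits; bits 30-31 are left unused in this sketch.
LENGTH_SHIFT = 32
DEPTH_SHIFT = LENGTH_SHIFT + LENGTH_BITS   # 52
TYPE_SHIFT = DEPTH_SHIFT + DEPTH_BITS      # 60

# Hypothetical token type codes.
START_TAG, ATTR_NAME, ATTR_VALUE, TEXT = range(4)


def pack(offset, length, depth, token_type):
    """Pack the four fields into one 64-bit integer record."""
    fields = (
        ("offset", offset, OFFSET_BITS),
        ("length", length, LENGTH_BITS),
        ("depth", depth, DEPTH_BITS),
        ("token_type", token_type, TYPE_BITS),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value} does not fit in {bits} bits")
    return (
        (token_type << TYPE_SHIFT)
        | (depth << DEPTH_SHIFT)
        | (length << LENGTH_SHIFT)
        | offset
    )


def unpack(record):
    """Return (offset, length, depth, token_type) from a packed record."""
    offset = record & ((1 << OFFSET_BITS) - 1)
    length = (record >> LENGTH_SHIFT) & ((1 << LENGTH_BITS) - 1)
    depth = (record >> DEPTH_SHIFT) & ((1 << DEPTH_BITS) - 1)
    token_type = (record >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1)
    return offset, length, depth, token_type


def token_text(doc, record):
    """Slice the token's bytes out of the intact document on demand."""
    offset, length, _, _ = unpack(record)
    return doc[offset:offset + length]


# The document stays untouched in memory; tokens are only described.
DOC = b'<a><b id="7">hi</b></a>'
RECORDS = [
    pack(1, 1, 0, START_TAG),    # a
    pack(4, 1, 1, START_TAG),    # b
    pack(6, 2, 1, ATTR_NAME),    # id
    pack(10, 1, 1, ATTR_VALUE),  # 7
    pack(13, 2, 1, TEXT),        # hi
]
```

In this sketch no string object is created until `token_text` is called, and because each record holds only integers relative to the start of the document, the record list could be written to disk alongside the document and reused later without re-tokenizing.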
Since this talk was waitlisted, no paper was prepared for the proceedings.