Despite the many advantages of using XML as an interchange format for integrating business applications, we can expect that several legacy formats will continue to be used alongside XML for some time to come. In such an environment it will often be necessary to convert data in these legacy formats to and from XML. Many commercial products offer such capabilities. However, most of them use proprietary approaches that aren't portable to other solutions, and the entry price is still a barrier for many organizations. We would like to have a simple, low- or no-cost, general-purpose file format conversion facility, capable of converting data in legacy formats to and from XML. The primary formats we want to support are:
Comma separated values (CSV) files - These are the familiar delimited files produced by many desktop applications. This format commonly uses only one logical record format per file, with each column having the same semantics.
Flat files - These are typically formats in which a number of different logical record types may be defined, with a record identifier at a fixed location in each record. Fields are of fixed length, although the actual physical records may be either fixed length or variable length with a record terminator character.
Electronic Data Interchange (EDI) - The traditional common standard for passing business data between organizations. This is a delimited format, with generalized grammars and specific message formats defined by national or international standards bodies.
The facility should ideally produce or consume an XML instance document that is an approximately isomorphic representation of the legacy file. Conversion to and from other XML-based languages can be handled by XSLT.
A necessary property of such a general-purpose utility is the ability to accept as a parameter the definition of a specific legacy grammar format. For example, it isn't enough to say that I want to convert any flat file format to and from an XML representation; I want to convert my purchase order extract file to an XML representation.
In addition to our constraint of simplicity, and for reasons that are beyond the scope of this paper, it was a given that the World Wide Web Consortium's (W3C) Document Object Model (DOM)[DOM] would be used for processing XML data. As will be shown, this constraint brought an unexpected advantage when designing algorithms for processing legacy formats.
A general-purpose conversion utility requires generalized, abstract grammars that characterize broad classes of legacy formats. So, we need generalized grammars for CSV files, flat files, and EDI.
But before we talk of grammars at all, we must establish the fundamental notion of what constitutes a "document". In the XML world, we use "document" as it is defined in the XML 1.0 Recommendation[XML], but several of the legacy formats have a somewhat different notion. A definition common to most is that of a logical business document that contains the data concerning a specific business transaction, such as a purchase order or an invoice. There are a few exceptions, and we'll be more specific as we discuss each format. In general terms, though, we are concerned with converting an XML instance document to or from a document as the legacy format defines a document. This is the level at which we will be most concerned when dealing with grammars. A secondary interest is the grammar of records and fields within the legacy formats. While certainly important for detail design and implementation, such details are mostly beyond the scope of this paper. However, a few important attributes of records and fields will be noted.
Let us consider a CSV file format that may contain one or more business documents, with a break on a control field such as invoice number indicating a new document. Expressed in the W3C's Extended Backus-Naur Form (EBNF), the overall structure of a CSV file is:
CSVFile ::= CSVDocument+
The CSVDocument contains one or more records, or rows as in a spreadsheet.
CSVDocument ::= CSVRow+
Fields or columns in a CSV row are delimited by a selected delimiter character, and text within columns may be further delimited by a text delimiter so that the column delimiter may be escaped. The essential column properties are the column number and data type. As with a simple spreadsheet, the semantics of the data in each column are the same across all rows, so there is only one type of row.
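The control-break rule and the two CSV productions can be illustrated with a minimal Python sketch. It assumes the control field is the first column and names column elements by column number (`Column01`, `Column02`, ...); the function name, the element names for columns, and the sample data are all hypothetical and are not taken from the implementation described later.

```python
import csv
import io
from itertools import groupby
import xml.etree.ElementTree as ET


def csv_to_documents(text, control_column=0, delimiter=",", quotechar='"'):
    """Return one CSVDocument element per run of rows sharing a control field."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    documents = []
    # groupby starts a new group each time the control field changes,
    # which is the "break on a control field" described in the text.
    for _, group in groupby(rows, key=lambda row: row[control_column]):
        doc = ET.Element("CSVDocument")
        for row in group:
            row_el = ET.SubElement(doc, "CSVRow")
            for number, value in enumerate(row, start=1):
                # Hypothetical naming: columns are identified by column number.
                ET.SubElement(row_el, f"Column{number:02d}").text = value
        documents.append(doc)
    return documents


# Two invoices; the quoted text in the first row escapes the column delimiter.
sample = 'INV001,"Widget, large",2\nINV001,Gasket,10\nINV002,Bolt,100\n'
docs = csv_to_documents(sample)
```

Because `groupby` only compares adjacent rows, the sketch relies on the rows of a document being contiguous in the file, which is what a control break implies.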
Flat files also generally contain one or more logical business documents. However, they commonly have many different record types. For example, a purchase order might have a header record and one or more detail records for each line item. In addition, types of records may appear in repeating groups. For example, if the items in a purchase order are to be shipped to a number of different locations, the line item detail may be a group consisting of a line item record and a ship to record. Groups may contain other groups. For example, the ship to data may itself be defined in a group of record types that define a physical location. Some flat file formats impose a different record type or identifier for otherwise identical record formats when used in different positions or groups within the file. However, a generalized grammar need only impose the looser restriction that each group begins with a specific mandatory record. Our generalized grammar for describing flat files is therefore:
FlatFile ::= FlatDocument+
FlatDocument ::= FlatGroup+
FlatGroup ::= FlatRecord ( FlatRecord | FlatGroup )+
The essential properties of a field in a flat file record include the data type, the field length and an offset from the beginning of the record.
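As an illustration of how these three field properties drive parsing, the following Python sketch slices a fixed-length record using a list of field descriptions. The `FieldSpec` class, the header record layout, and the two data types are assumptions made for the example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str       # element name to use for the field
    offset: int     # zero-based offset from the beginning of the record
    length: int     # fixed field length in characters
    data_type: str  # "string" or "integer" in this sketch


# Hypothetical layout of a purchase order header record.
HEADER_FIELDS = [
    FieldSpec("RecordId", 0, 2, "string"),
    FieldSpec("PONumber", 2, 10, "string"),
    FieldSpec("LineCount", 12, 4, "integer"),
]


def parse_record(record, fields):
    """Extract each field by offset and length, then convert by data type."""
    values = {}
    for spec in fields:
        raw = record[spec.offset:spec.offset + spec.length]
        if spec.data_type == "integer":
            values[spec.name] = int(raw)
        else:
            values[spec.name] = raw.rstrip()  # drop trailing pad characters
    return values


header = parse_record("HD4445-0323 0002", HEADER_FIELDS)
```

The record identifier is treated here as an ordinary field at a fixed location, which is how a converter could select the matching record description before parsing the remaining fields.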
An EDI message (or, in the ANSI X12 standard, a transaction set) contains a number of different types of records, or segments. Each is identified by a two- or three-character segment identifier at the beginning of the segment. As with flat files, segments may appear in groups. The specific grammar of EDI is very well defined in ISO 9735[ISO 9735] for the international UN/EDIFACT standard and in X12.5[X12.5] and X12.6[X12.6] for the U.S. national ANSI X12 standard. In addition, the ISO 9735 grammar for message structure is sufficiently general that the X12 transaction set grammar can be considered a more restrictive subset. An EDI message usually contains a single business document, but some specific messages don't follow this convention. Since the grammar of messages is well defined, let us consider a message to be a document. EDI messages, like flat files, may define repeating groups of segments that may themselves contain other groups. However, if we consider only the segments that contain business data and not the so-called "control" segments that, among other things, begin and end a message, we note that an EDI message is not a group, since the first segment in an EDI message is not necessarily mandatory. So, the generalized grammar for an EDI message is:
EDIMessage ::= ( EDISegment | EDIGroup )+
EDIGroup ::= EDISegment ( EDISegment | EDIGroup )+
Properties for the fields, or data elements, within an EDI segment include the data type, the position of the data element within the segment, whether or not it is allowed to repeat, and if it is a so-called composite data element or structure that is used to group data elements within a segment.
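The role of element position can be shown with a short Python sketch that splits one segment into its data elements and names each by segment identifier plus position. The `*` and `~` delimiters are assumptions for the example, since the delimiters actually in use vary between interchanges. Composite and repeating data elements are deliberately left out, because recognizing them requires the grammar rather than the data alone.

```python
def parse_segment(segment, element_sep="*", terminator="~"):
    """Return (segment identifier, {positional name: value}) for one segment."""
    body = segment.strip()
    if body.endswith(terminator):
        body = body[:-len(terminator)]
    tag, *elements = body.split(element_sep)
    parsed = {}
    for position, value in enumerate(elements, start=1):
        if value == "":
            continue  # an empty data element still occupies its position
        parsed[f"{tag}{position:02d}"] = value
    return tag, parsed


tag, elements = parse_segment("BEG*00*SA*4445-0323**20030123~")
```

Note that the fourth data element is empty, so the date is named by its position, 05, rather than by the order in which values happen to be present.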
The recursive nature of the group productions for flat files and EDI messages is an important property with implications for the machine-readable representation of the grammar as well as the processing of the grammar.
These observations about legacy grammars lead to two key decisions regarding the representation of legacy semantics in XML instance documents. The first decision is that each document (as defined by the legacy format) is represented by its own XML instance document. For example, when converting a flat purchase order extract file into XML, a separate XML instance document is created for each purchase order in the file. The second decision is not to perform just a direct, isomorphic conversion of legacy data to and from XML. Instead, parent container elements in the XML instance are used to represent grammar and semantic characteristics that may only be implied in legacy file instances. We depict a group in a flat file or EDI message as a parent element with a number of record elements or other group elements as children. For EDI messages, we depict a composite data element as a parent element, with each of the component simple data elements as child elements. Overall, the approach is very similar to that defined in the ANSI ASC X12 Technical Report "An Experimental Methodology for the Representation of X12 Semantics in XML Syntax" [X12-XML].
The two example fragments below illustrate these decisions. The fragment of the X12 EDI purchase order shows a beginning, a date/time, and ship to name segment in the heading area, followed by two line item detail loops consisting of the PO1 baseline item data segment and the PID product description segment.
BEG*00*SA*4445-0323**20030123~
DTM*001*20030206~
N1*ST*BIG BOX - STORE #45*92*001234567S045~
PO1*1*20*CA*30.36**UP*35790000122~
PID*F****Instant Hot Cocoa Mix - Mint flavor~
PO1*2*40*CA*31.08**UP*35790000641~
PID*F****Instant Hot Cocoa Mix - Dutch Chocolate flavor~
CTT*2~
The XML representation shows the explicit grouping of the line item data.
<?xml version="1.0" encoding="UTF-8"?>
<X12PurchaseOrder>
  <BEG>
    <BEG01>00</BEG01>
    <BEG02>SA</BEG02>
    <BEG03>4445-0323</BEG03>
    <BEG05>2003-01-23</BEG05>
  </BEG>
  <DTM>
    <DTM01>001</DTM01>
    <DTM02>2003-02-06</DTM02>
  </DTM>
  <N1Header>
    <N1>
      <N101>ST</N101>
      <N102>BIG BOX - STORE #45</N102>
      <N103>92</N103>
      <N104>001234567S045</N104>
    </N1>
  </N1Header>
  <PO1Group>
    <PO1>
      <PO101>1</PO101>
      <PO102>20</PO102>
      <PO103>CA</PO103>
      <PO104>30.36</PO104>
      <PO106>UP</PO106>
      <PO107>35790000122</PO107>
    </PO1>
    <PID>
      <PID01>F</PID01>
      <PID05>Instant Hot Cocoa Mix - Mint flavor</PID05>
    </PID>
  </PO1Group>
  <PO1Group>
    <PO1>
      <PO101>2</PO101>
      <PO102>40</PO102>
      <PO103>CA</PO103>
      <PO104>31.08</PO104>
      <PO106>UP</PO106>
      <PO107>35790000641</PO107>
    </PO1>
    <PID>
      <PID01>F</PID01>
      <PID05>Instant Hot Cocoa Mix - Dutch Chocolate flavor</PID05>
    </PID>
  </PO1Group>
  <CTT>
    <CTT01>2</CTT01>
  </CTT>
</X12PurchaseOrder>
Due to the differences between the legacy file formats, a slightly different XML-based language is required to depict each of the legacy grammars. The major elements in the languages correspond to the respective EBNF productions, and a schema or DTD can be designed to define each of these three languages. The metadata of a specific legacy grammar is encoded as an XML instance document. For example, the grammar of our X12 EDI purchase order, with segment and element properties included, could look something like the following fragments.
<Grammar ElementName="X12PurchaseOrder" TagValue="BEG">
  <SegmentDescription ElementName="BEG" TagValue="BEG">
    <SimpleElementDescription ElementName="BEG01" FieldNumber="1"
        SubFieldNumber="0" DataType="X12-ID" MinLength="2"
        MaxLength="2"/>
    ...
  </SegmentDescription>
  <GroupDescription ElementName="N1Header" TagValue="N1">
    <SegmentDescription ElementName="N1" TagValue="N1">
      <SimpleElementDescription ElementName="N101" FieldNumber="1"
          SubFieldNumber="0" DataType="X12-ID" MinLength="2"
          MaxLength="3"/>
      ...
    </SegmentDescription>
  </GroupDescription>
  <GroupDescription ElementName="PO1Group" TagValue="PO1">
    <SegmentDescription ElementName="PO1" TagValue="PO1">
      <SimpleElementDescription ElementName="PO101" FieldNumber="1"
          SubFieldNumber="0" DataType="X12-AN" MinLength="1"
          MaxLength="20"/>
      ...
    </SegmentDescription>
    <SegmentDescription ElementName="PID" TagValue="PID">
      <SimpleElementDescription ElementName="PID01" FieldNumber="1"
          SubFieldNumber="0" DataType="X12-ID" MinLength="1"
          MaxLength="1"/>
      ...
    </SegmentDescription>
  </GroupDescription>
  ...
</Grammar>
Representing the legacy grammars as XML instance documents, and using these documents as parameters for the conversion facility, are significant design decisions, but they are not sufficient on their own. Processing the legacy data requires that the metadata of the grammar be stored in program data structures. This is where an unexpected advantage of the DOM was discovered. DOM Level 3 load semantics were already going to be used to load the XML grammar document from disk. Various types of data structures were considered for storing the grammar metadata, and various stack-based algorithms, such as pushdown automata, were considered for processing. However, it was finally decided that the simplest approach would be to walk the input data in a preorder traversal of the tree (the DOM tree with XML as the input, and the logical tree with legacy data as the input), and walk the grammar tree in parallel. Since the grammar metadata was already loaded in a DOM tree, there was no need to design another set of data structures to contain it.
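To make the idea of reusing the DOM as the grammar data structure concrete, here is a minimal Python sketch using the standard library's `xml.dom.minidom`, which stands in for the DOM implementations named later in this paper. It loads a simplified grammar document (field descriptions omitted) and visits it in preorder using only `firstChild` and `nextSibling`. The `preorder` helper is hypothetical.

```python
from xml.dom import minidom

# Simplified grammar document: field descriptions are omitted for brevity.
GRAMMAR_XML = """<Grammar ElementName="X12PurchaseOrder" TagValue="BEG">
  <SegmentDescription ElementName="BEG" TagValue="BEG"/>
  <GroupDescription ElementName="PO1Group" TagValue="PO1">
    <SegmentDescription ElementName="PO1" TagValue="PO1"/>
    <SegmentDescription ElementName="PID" TagValue="PID"/>
  </GroupDescription>
</Grammar>"""


def preorder(node, depth=0):
    """Yield (depth, element) pairs in preorder via firstChild/nextSibling."""
    if node.nodeType == node.ELEMENT_NODE:
        yield depth, node
    child = node.firstChild
    while child is not None:
        # Whitespace text nodes are visited too but yield nothing.
        yield from preorder(child, depth + 1)
        child = child.nextSibling


grammar = minidom.parseString(GRAMMAR_XML).documentElement
outline = [(depth, el.tagName, el.getAttribute("TagValue"))
           for depth, el in preorder(grammar)]
```

The resulting outline lists the grammar elements in the same order in which matching records would be expected in the legacy input, which is what allows the two trees to be walked in parallel.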
The power of this approach is most apparent when considering how groups are processed in flat file and EDI formats. The recursive nature of the group production suggests a recursive algorithm. The pseudocode below shows the general outline of an algorithm for processing a group of records in a flat file or EDI format, and producing a DOM subtree as output.
Preconditions: An input record or segment is loaded in the
  record buffer, and a pointer has been set to the element
  containing the grammar for the group
Get record grammar element as firstChild of group grammar element
Using field characteristics retrieved from the field grammar
  children of the record grammar element, parse input record
  from buffer, and load into DOM output tree
Do until end of file
  Read next legacy format input record into record buffer
  Identify record from tag
  Get nextSibling of the record grammar element until match
    is found
  If match is found
    Set pointer to the found grammar element
    If grammar is for a group
      Make a recursive call to this group processing algorithm
    Else
      Using field characteristics retrieved from the field grammar
        children of the record grammar element, parse input record
        from buffer, and load into DOM output tree
    End if
  Else
    Return
  End if
End do
Return
The recursive case of the algorithm is fairly easy to follow. The termination case of end of file is also evident, but the other termination case may not be. If a record is encountered that is not defined as part of the current group, the routine returns to the previous invocation of the group processing algorithm, which continues with that record. If successive returns take the program back to the base case, then the input record is not part of the grammar and the program terminates abnormally.
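The pseudocode can be turned into a small runnable Python sketch, again with `xml.dom.minidom` standing in for the DOM. Several details are assumptions of the sketch rather than part of the algorithm as published: the grammar is simplified to omit field descriptions, so data elements are named by position; a one-record lookahead buffer (`RecordBuffer`) lets a record that ends a group be re-examined by the caller; and the search for a matching grammar element starts at the current element, so that a segment or group may repeat.

```python
from xml.dom import minidom

GRAMMAR_XML = """<Grammar ElementName="X12PurchaseOrder">
  <SegmentDescription ElementName="BEG" TagValue="BEG"/>
  <GroupDescription ElementName="PO1Group" TagValue="PO1">
    <SegmentDescription ElementName="PO1" TagValue="PO1"/>
    <SegmentDescription ElementName="PID" TagValue="PID"/>
  </GroupDescription>
  <SegmentDescription ElementName="CTT" TagValue="CTT"/>
</Grammar>"""

EDI = ("BEG*00*SA*4445-0323**20030123~"
       "PO1*1*20*CA*30.36**UP*35790000122~"
       "PID*F****Instant Hot Cocoa Mix - Mint flavor~"
       "PO1*2*40*CA*31.08**UP*35790000641~"
       "PID*F****Instant Hot Cocoa Mix - Dutch Chocolate flavor~"
       "CTT*2~")


class RecordBuffer:
    """Holds the input segments with a one-record lookahead."""
    def __init__(self, segments):
        self._segments = list(segments)
        self._index = 0

    def peek(self):
        if self._index < len(self._segments):
            return self._segments[self._index]
        return None  # end of file

    def advance(self):
        self._index += 1


def element_children(node):
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def emit_record(doc, parent, grammar, record):
    """Parse one segment and append it to the DOM output tree."""
    tag, *values = record.split("*")
    record_el = doc.createElement(grammar.getAttribute("ElementName"))
    for position, value in enumerate(values, start=1):
        if value:
            field = doc.createElement(f"{tag}{position:02d}")
            field.appendChild(doc.createTextNode(value))
            record_el.appendChild(field)
    parent.appendChild(record_el)


def process(doc, parent, grammar, records, is_group=True):
    members = element_children(grammar)
    position = 0
    if is_group:
        # Precondition: the group's mandatory first record is in the buffer.
        emit_record(doc, parent, members[0], records.peek())
        records.advance()
        position = 1
    while (record := records.peek()) is not None:
        tag = record.split("*")[0]
        match = next((i for i in range(position, len(members))
                      if members[i].getAttribute("TagValue") == tag), None)
        if match is None:
            return  # not part of this group: hand the record back to the caller
        position = match
        if members[match].tagName == "GroupDescription":
            group_el = doc.createElement(members[match].getAttribute("ElementName"))
            parent.appendChild(group_el)
            process(doc, group_el, members[match], records)  # recursive case
        else:
            emit_record(doc, parent, members[match], record)
            records.advance()


def convert(grammar_xml, edi_text):
    grammar = minidom.parseString(grammar_xml).documentElement
    doc = minidom.Document()
    root = doc.createElement(grammar.getAttribute("ElementName"))
    doc.appendChild(root)
    records = RecordBuffer(s for s in edi_text.split("~") if s)
    process(doc, root, grammar, records, is_group=False)
    if records.peek() is not None:  # returned to the base case with input left
        raise ValueError(f"segment not defined in grammar: {records.peek()}")
    return root


root = convert(GRAMMAR_XML, EDI)
```

The top-level call passes `is_group=False` because an EDI message, unlike a group, does not necessarily begin with a mandatory segment. The check in `convert` corresponds to the abnormal termination described above.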
The algorithm for processing a DOM subtree as input and producing a group of legacy format records or segments as output is very similar, being a mirror image of this algorithm.
The approach and algorithms described here were implemented and proven in the open source Babel Blaster project, hosted at SourceForge.net. This facility currently consists of six stand-alone utilities, with a "to XML" and "from XML" program for each of the three legacy formats. The project also has the following functional and design features:
The utilities are implemented as console mode programs suitable for use in shell scripts.
Implementations are available in both Java with Xerces and C++ on Win32 with MSXML.
"File description documents" enable users to specify legacy grammars, physical characteristics of legacy formats such as delimiters and record lengths, and the element names used in the XML representation.
Legacy file data types are converted to and from native schema language data types to simplify processing by other XML-aware programs, such as XSLT transformation engines.
The project uses a modular, object oriented design. This enables a great deal of reuse of essential processing algorithms by the converters for each of the legacy formats and allows for relatively easy addition of new legacy data types and file formats.
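As a small illustration of the data type conversion feature listed above, the following Python sketch converts a date between the eight-digit form used in the X12 example (`20030123`) and the ISO 8601 form that appears in the XML example (`2003-01-23`). The function names are hypothetical, and the sketch assumes eight-digit dates only.

```python
from datetime import datetime


def x12_date_to_xml(value):
    """Convert an eight-digit CCYYMMDD date to ISO 8601 (CCYY-MM-DD)."""
    return datetime.strptime(value, "%Y%m%d").date().isoformat()


def xml_date_to_x12(value):
    """Convert an ISO 8601 date (CCYY-MM-DD) back to CCYYMMDD."""
    return datetime.strptime(value, "%Y-%m-%d").strftime("%Y%m%d")


xml_date = x12_date_to_xml("20030123")
edi_date = xml_date_to_x12("2003-02-06")
```

Parsing through `datetime` rather than slicing the string means that an invalid date such as `20031345` raises an error instead of being passed through silently.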
[DOM] Document Object Model (DOM) Level 2 Core Specification, World Wide Web Consortium, November 2000, http://www.w3.org/DOM/DOMTR#dom2
[ISO 9735] ISO 9735, Electronic data interchange for administration, commerce and transport (EDIFACT) -- Application level syntax rules, International Organization for Standardization
[Using XML] Using XML with Legacy Business Applications, Michael C. Rawlins, 2003, Addison-Wesley Professional. X12 and XML fragment examples in sections 3 and 4 are from this work, and used by permission.
[X12.5] Interchange Control Structures (X12.5), Release 004010, Accredited Standards Committee X12 of the American National Standards Institute, December 1997, published by the Data Interchange Standards Association.
[X12.6] Application Control Structure (X12.6), Release 004010, Accredited Standards Committee X12 of the American National Standards Institute, December 1997, published by the Data Interchange Standards Association.
[X12-XML] X12-XML: An Experimental Methodology for the Representation of X12 Semantics in XML Syntax (ASC X12C/99-184), Accredited Standards Committee X12 of the American National Standards Institute, October 1999, published by the Data Interchange Standards Association.
[XML] Extensible Markup Language (XML) 1.0 (Second Edition), World Wide Web Consortium, October 2000, http://www.w3.org/TR/2000/REC-xml-20001006