XML 2001 logo

Merging XML Technologies

Karen Fox <kfox@mitre.org>
Matt Keen <mkeen@mitre.org>

ABSTRACT

This effort is a case study into how to integrate products from competing vendors, such as Oracle and Microsoft©. To achieve the goal of modernizing the Air Force Technical Training archive, it was determined that a collection of information technology products from various vendors was needed. We discovered that while they can be made to work, there is a lot of trial and error involved. Currently, the Air Education and Training Command (AETC) Headquarters receives data monthly from each of its training bases and uses this data to compile a variety of reports. However, this data is not stored long-term, which necessitated the creation of an information management solution designed to provide a training data repository for the next 20-25 years. This new archival system is required to interface to the legacy systems and processes, as well as deliver new capabilities, such as data validation and enhanced report generation. With the use of eXtensible Markup Language (XML) technologies, the legacy data was parsed into XML using Java, validated using a SAX2 validator in a Java class file, and inserted into a Microsoft© Structured Query Language (SQL) Server database using Oracle’s XSQL/XML-SQL Utiltity (XSU) technology. Using Oracle products with a Microsoft© database required the use of a Java Database Connectivity (JDBC)-Open Database Connectivity (ODBC) translator, and we selected JSQLConnect. XSQL was also used with Java to extract the data from the database as XML and transform it back into the legacy file format for subsequent delivery to the end user. By performing data validation prior to the insertion of data, we improved the integrity of the stored data and saved the end user time by reducing errors. Additionally, this archive allows for the generation of reports using XSL stylesheets and XSQL over the saved data. It has provided access to the erred data and has been expanded to include even more types of training data.

Table of Contents

1. Overview

The United States AETC has the responsibility of aggregating and reporting on training data collected by training centers across the country. The individual training centers have systems that maintain these local data stores in a variety of Data Base Management Systems (DBMS). Although similar, the data structures and legal values are also different. These differences are the result of the procurement process in which each center’s system was built to meet the operational needs of the production systems. In spite of the need for an enterprise data scheme, the maintenance is performed at various locations by multiple owners with multiple production requirements.

The goal of an AETC Enterprise Data Management System (EDMS) is to provide central, long-term storage to be used for reporting, as well as for providing data for forecasting and trend analysis. It will also be used to feed both a data warehouse and production database systems. In addition, the EDMS will store metadata that describes the characteristics of the legacy data.

This document will describe the approach used to develop the AETC EDMS, which involved using a combination of technologies to take data from the wide variety of sources and get it into the EDMS. Section 2 will explain the overall design of the EDMS and go more in depth on each of the technologies used. Section 3 will go into detail on problems encountered in merging these technologies and lessons learned, while Section 4 will list plans for future development.

2. Design Process

The Phase I plans for the EDMS included receiving a training data text file in a comma-separated values (csv) format from the outlying AETC training bases. These files had to then be transformed into XML, validated and inserted into a database, and retrieved and passed onto the AETC warehouse in the csv format they are used to receiving. For the purposes of Phase I, the data had to be selected from the database and transformed back into the original format to be transparent to the change. This process is displayed graphically in Figure 1.

Figure 1: Phase I Architecture

Each AETC training center generated a csv file containing the previous month’s data on student training and transferred it via File Transfer Protcol to a folder on a computer at the AETC Headquarters. This file was then transformed via Java servlets into XML and validated against an XML schema using Xerces. The valid record equivalents were inserted into the SQL Server database using XSQL. Records that were not valid were transferred to another folder and an error report was generated. XSQL was also used to extract report data from the Air Education and Training Management Ssytem (AETMS) database and Xalan and eXtensible Stylesheet Language (XSL) were used to display the extracted data in the same csv format as originally received. This data was forwarded to AETC Decision Support System (ADSS) as one report containing the validated AETC training centers information, providing an improved csv file, while remaining an invisible addition to the current process.

2.1. Technologies

The original plans for this proposal called for using the XML technologies to process and display the archive data in order to maintain the integrity of the thousands of entries. Additionally, as developers we wanted to use Java to make this application platform independent. In addition, direction from the AETC required that we use Microsoft© products, such as Internet Information Services (IIS) 5.0 and SQL Server.

2.1.1. Java

Java is a platform-independent programming language and provides many free or low-cost add-ons. We used Sun's© Java Development Kit (JDK) 1.3 and created several servlets to provide the classes necessary for our prototype. Each servlet and its application are listed below.

XMLFileElementValue Used in conjunction with ConfigFileElement to determine the path to where data is accessed and stored. This servlet grabs the path from a designated document.
XMLFileElementValue Used in conjunction with ConfigFileElement to determine the path to where data is accessed and stored. This servlet grabs the path from a designated document.
ConfigFileElement Wrapper to XMLFileElementValue. Passes in XML document containing path and receives path element.
SAX2Count (from Apache) Used with Xerces to validate XML document against a designated schema. We modified this class to work with our application.
InsertXMLtoDB Used to insert validated XML files in a SQL Server database via Oracle XSU and JSQLConnect driver.
SimpleTransform2 Xalan sample used to transform XML with eXtensible Stylesheet Language XSL.
ParseTTMS Main servlet to parse csv data into XML document.
ParseTPSCourse Used to parse Training Planning System (TPS) course data into XML document.
ParseTPSClass Used to parse TPS class data into XML document.
RunTTMS Wrapper servlet for processing Tech Training Management System (TTMS) data by parsing and validating data.
Run TPS Wrapper servlet for processing TPS Class data by parsing and validating data.
Run TPS2 Wrapper servlet for processing TPS Course data by parsing and validating data.
getParam Servlet to connect to web graphic user interface (GUI), gather information, and pass it to RunTTMS. Calls other needed servlets to process data.
getTPS Servlet to connect to web GUI, gather information, and pass it to the appropriate RunTPS servlet. Calls other needed servlets to process data.
GetTTMS Servlet to connect to web GUI to pull TTMS data out of database and deliver as the original csv file.

2.1.2. JRun

Because we were using Java servlets, we needed a Java servlet engine and decided to use Allaire©’s JRun v2.2.3, since it interacted smoothly with IIS. However, we ran into several problems using JRun, which will be detailed further in Section 3. We even tried upgrading to the latest JRun Version 3.1, but that ended up creating more problems than it solved.

The order of jar files in the classpath for JRun became very important during development. In order to use the Xerces processor, the Xerces jar file had to be listed first in the classpath; otherwise, the JAXP and parser jar files overrode it.

2.1.3. Xerces

This free java library from Apache offers a built-in validator for use with Document Type Definitions (DTDs) or XML schemas. We began with Version 1.3 and ended with Version 1.4.2. We modified code in the SAX2Count class file from Apache by adding the following method so it would report back on the errors found in order to provide an error report and separate our validated data from nonvalidated.

			 public boolean begin(String arg){
	       System.out.println("In validation");
	       // vars
          String  parserName = DEFAULT_PARSER_NAME;
          print(parserName, arg,  true);
			 //the primary SAX2Count method
          //System.out.println(flag);
	       return flag;
          } 

This process worked smoothly when we tested it from the DOS command prompt, but later when we developed the web interface and connected the servlets to it, we had problems in validation. This will be discussed more in Section 3.

2.1.4. Xalan

Another free Apache java library, Xalan Version 1.2.2 contains XSL processing code, such as the sample SimpleTransform that we copied and modified to fit our needs. It allowed us to take our XML documents and transform them into the format necessary for insertion into the SQL Server database.

2.1.5. JSQL Connect

We decided on a low-cost JDBC-ODBC connector in order to talk to SQL Server with our Java code. JSQLConnect Version 2.21 by NetDirect was easy to install and easy to use. While Oracle does provide a JDBC-ODBC tool, JSQLConnect proved easier to set up and use.

2.1.6. SQL Server

The Air Force required that we use SQL Server 2000 for our database because they already had a copy at the AETC. This new version of SQL Server contained XML tools, but we did not use them because we could not bulk insert data into columns by matching column names to XML elements via a Java servlet

2.1.7. Oracle XDK

Oracle’s XSU and XSQL products are contained in the XML Developer’s Kit (XDK) Version 9.0.2.0.0B. This Beta version is the only one currently that will work with SQL Server. The XSU enabled us to push XML documents into SQL Server, automatically matching element tags with column names, and pull data from SQL Server as an XML document. The code to insert the data follows. Note that the file name and table name were passed into the servlet.

			   Connection conn = null;
      Document doc;
      DOMParser d = new DOMParser();
      Statement stmt = null;
      ResultSet rset = null;
      FileReader freader = new FileReader(FileName);
      BufferedReader breader = new BufferedReader(freader);
      try{
       DriverManager.registerDriver(new com.jnetdirect.jsql.JSQLDriver());
       // Insert XML data from file (filename) into database table (tabName)  
        try{
          d.parse(breader);
        }
				  catch (Exception e) {
	       System.out.println(e.toString());
				  }
      doc = d.getDocument();
      try{ 
       OracleXMLSave savedoc = new OracleXMLSave(conn, tabName);
       savedoc.insertXML(doc);
       savedoc.close();
      } 
			   catch(oracle.xml.sql.OracleXMLSQLException ex) 
      {
	      //Get the original exception
	      Exception parent = ex.getParentException();
	      if (parent instanceof java.sql.SQLException)
	      {
	        //perform other stuff. Here we simply print it out
	        System.out.println("Caught SQL Exception: "+parent.getMessage());
	      }
	      else
	      {	
	        System.out.println("Exception caught...!"+ex.getMessage());
	        System.out.println("SQL Error Code: " +ex.getErrorCode());
  	      System.out.println(ex.toString());
	      }
	     }
      System.out.println(" successfully inserted rows into "+ tabName);

2.2. Design

The csv file is uploaded to a folder on a server each month, so we retrieved the monthly Technical Training Management System (TTMS) data from there. The Training Planning System (TPS) data is sent out on an annual basis and had to be loaded into the database first, since the TTMS data is related to it. The application is accessed via a web interface and log in screen (Figure 2).

Figure 2: Log in Screen

After verifying the user’s identity, the user is sent to the initial screen and given options of importing or exporting data to the archive (Figure 3).

Figure 3: Initial Screen

The user can select the type of data to import. The TPS data is received annually, and it only needs to be imported once. However, this data is broken into two separate pieces–the course data and class data (Figure 4), and each must be imported separately. Because of the many to one relationship between the class and course data, the course data must be imported first so links can be created to it while importing the class data.

Figure 4: Import TPS Data

Importing TTMS data follows a similar process (Figure 5) in which a csv file is selected from a folder on the computer, pulled by Java servlets to be parsed into XML, validated against a schema, transformed for insertion into the database, and inserted into the database.

Figure 5: Import TTMS Data

Once the servlets finish, the information on records parsed and validated is returned to the browser (Figure 6). This screen also provides links to the bad data, which not only contains theXML document, but the reason the record was designated as bad (Appendix A).

Figure 6: TTMS Insert Results

For Phase I, we also had to pull the data out of the archive database and pass it onto the ADSS in the format in which we originally received it (the comma-separated value file), since that is what they are used to receiving at this point in time. ADSS then imports this file into their database and uses it for data calculations. This also proved that our system was interoperable and seamless to the present operations and that the data was not changed despite the transformation, validation, and aggregation processes we used. Again, we used a web interface screen to determine which data to process (Figure 7).

Figure 7: Export TTMS Data

The code behind this screen passes the form information into a servlet, which uses the Oracle XSU to pull the requested data from the database in XML form. The servlet then transforms it into a csv file via XSL and provides a link on the processed web screen to the data (Figure 8) so it can be viewed or saved as a file.

Figure 8: Selected TTMS Data

3. Lessons Learned

The lessons learned fall into three general categories: 1) the maturity and 2) interoperability of the technologies and 3) the experience of the developer. Although it would be more comfortable to address only those dealing with technology, we felt that lessons learned should be mentioned in case other programmers encounter them.

3.1. Java

One problem we encountered in writing the servlet to parse the data into XML was a developer error was easily fixed. When using the StringTokenizer to read in and parse the data, it would not find the end-of-line characters no matter what code or symbol we used.

The BufferedReader we used to pull the csv file into a FileReader only read in a line at a time, not including the end-of-line character. To fix this, we added the end-of-line character (/n) to each line while reading in. The revised code snippet from ParseTTMS that covers this is below. Whole and str were previously declared as strings, and InputFile was passed into this servlet.

		    //Read in the file and parse with string tokenizer
	     c=1;
	     FileReader freader = new FileReader(InputFile);
	     BufferedReader breader = new BufferedReader(freader);
	     System.out.println(“Reading in data”);
	     While ((str = breader.readLine()) != null)
	     {
		     whole = whole + str + “/n”; //adding /n here made the difference
		     c++;
	     }
	     freader.close();
	     StringTokenizer ST = new StringTokenizer(whole, “,”, true);

We also encountered a “null pointer” error several times because of source data inconsistency. During parsing, the servlet would attempt to grab data from a field, which was sometimes empty and sometimes contained data. We had to modify the parsing code to check for null and enter an empty space in that case, as shown below.

		    INSTL_BLK_ID=ST.nextToken(",");
      if (INSTL_BLK_ID.equals(","))
      {
	      INSTL_BLK_ID = "";
      }
      else
      {
	      ST.nextToken(",");
      }

3.2. JRun

The JRun Java Servlet Engine produced several problems during development with the worst being that it times out when processing an incoming file any larger than 100 kilobytes. To date, we have been unable to resolve this problem, even though we tried using the latest version of JRun. To make the application run, we broke the incoming large file into smaller files. This is a temporary fix at best and must be corrected.

We also discovered that the classpath entered into the JRun administrator must be set with the Xerces jar file first in order to use that parser and not the one built into JRun. When trying to use the latest version, we encountered even more difficulties in setting the classpath; both Xerces and JSQLConnect had to be first, and JRun still would not work correctly.

We have also encountered a recurring problem of JRun stopping for no apparent reason and not starting again. Each time meant uninstalling, then reinstalling JRun in order to proceed. We believe using Enterprise Java Beans (EJB) and an application server would greatly improve performance and solve these difficulties.

3.3. Xerces

Most of the problems encountered in using Xerces were easily fixed, though one problem stymied us for several weeks. Fortunately, we could post to discussion groups available with folks very willing to share their vast knowledge, such as the Xerces-J User list.

One notable problem involved the SAX2Count code from Apache used in validating an XML document against a schema. We consistently upgraded the Xerces application during development to achieve more capability. In the beginning, the validation process worked perfectly when invoked via the DOS prompt. However, after adding the web interface and connecting the Java servlets to it, the validation process quit working. It appeared that the code could not find the schema; therefore, it declared every element as invalid.

We tried referencing the schema in a variety of ways from the XML document, to include hardcoded path, full URI, and relative Uniform Resource Identifier (URI) . We posted the problem on several loops and received some responses, but none of them solved this error. Finally, an unrelated post on an email loop suggested a fix, which turned out to be embarrassingly simple.

We began development using Xerces 1.3.1 and progressed through Version 1.4.2. The original schema namespace was listed as: xmlns:xsi=http://www.w3.org/2000/10/XMLSchema. Despite the release of the XML Schema Recommendation from the W3C in May 2001, it never occurred to us to change this reference to reflect the latest schema as xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance. Once we read a loop email mentioning the change in schema reference namespace, the light came on. We made this change and the validation process worked perfectly.

We also used XML Spy Version 3.5 to write the schemas. Although it is a good tool for writing and testing schemas, IBM’s XML Schema Validator or the XSV tool are more accurate in relation to the latest W3C Schema Recommendation.

3.4. Xalan

The only difficulty encountered using Xalan came when we tried to use JRun 3.1. No matter where we put the Xalan jar files in the JRun classpath, the JRun parsing files would override the Xalan and break when trying to do a transformation.

3.5. JSQL Connect

The only problem with JSQLConnect came with trying to get the jar files to work with JRun 3.1. By putting JSQLConnect as the first jar file in the classpath, we got the application to access the connection driver.

3.6. SQL Server

At this point in time, SQL Server just does not have robust XML capabilities. Aside from the learning curve in using the SQL Server application, we discovered that the built-in XML tools were not conducive to integrating with our Java servlets, which made us examine the Oracle XSQL tool for use in importing and exporting XML data to and from the database. While this tool does work, we discovered that SQL Server will not implement an autonumber field during a bulk insert, so the ID numbers and relationship fields have to be set up in the XML document prior to transformation and insert. Sample Java code used to create one of these index fields follows:

public int getEventID() throws java.sql.SQLException
  Connection conn = null;
	 Statement stmt = null;
	 ResultSet rsetEvent = null;
	 String query2;
	 try{
    DriverManager.registerDriver(new com.jnetdirect.jsql.JSQLDriver());
	   //Initialize a JDBC connection
	   conn = DriverManager.getConnection("jdbc:JSQLConnect://mm97220:1433/database=archive/user=demo/password=demo");
	   query2="SELECT MAX(STUDENT_CLASS_ASSIGNMENT_EVENT_ID) AS ID from STUDENT_CLASS_ASSIGNMENT_EVENT";
	   //System.out.println (query2);
    stmt = conn.createStatement();
	   rsetEvent = stmt.executeQuery(query2);
	   rsetEvent.next();
	   if (rsetEvent != null)
	   {
     EventID = rsetEvent.getInt("ID");
	   }
	   else
	   {
		   EventID = 0;
	   }
	   //System.out.println ("Event is " + EventID);
	 }
	 catch(java.sql.SQLException e){ 
   System.out.println(e.toString());
	 }
	 rsetEvent.close();
	 stmt.close();
	 conn.close();
	 return(EventID);
  }//end method getEventID

Because of this, we reduced the number of tables in the database, thereby, changing from our initial database normalized schema. We believe there is a workaround for this problem, but have not encountered a fix to it.

Finally, there is no update/add feature, which means extensive coding will need to be done to address those records coming in that may be updates rather than new data.

3.7. Oracle XDK

The Oracle XML products are designed to work with Java servlets, which made them excellent candidates for this project. Information we encountered said that XSU would now allow connection to ODBC databases, such as SQL Server, but when we tried it, we could only pull data, not insert. We finally discovered that only the current Beta version of the XDK 9.0.2.0.0B would implement the insertion process.

4. Future Plans

We have several things we want to try with this application to include 1) using an Apache web server and Tomcat Java engine to run it. Perhaps this will resolve some of the problems encountered in using JRun. We also want to 2)streamline the code and make it as platform independent as possible.

We also want to 3) explore new tools. We have seen demonstrations of XML tools that may provide useful as well. Neocore produces a database designed solely to hold XML, which is supposed to be faster and easier to use in relation to XML. X-Aware has a virtual XML database, which acts as a wrapper to a regular relational database, but handles XML data in and out. We would like to investigate these products further, especially in relation to retrieving the TTMS data from the training centers. Currently, this data is sent monthly by the training centers. Using the X-Aware tool would allow us to pull data every month in an XML format. Our recommendation has been to implement more XML awareness on the training center end of this process.

5. Conclusion

While intermixing technologies from different vendors presents a unique set of problems, overall, none of these problems are insurmountable, due mostly to the sharing nature of theInternet Technologies ( IT) community and the proliferation of new XML tools. Worthy of note is Oracle for not only creating excellent XML database tools, but for extending their capability beyond the reach of their own database.

6. Appendix A

6.1. Validated Bad Data Report

<?xml version='1.0' encoding='UTF-8'?>

<TTMS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:noNamespaceSchemaLocation="http://space-xml-mitre.org/aetc/schemas/ttms.xsd">

<STUDENT>

<PRSN_DSGNTR_SSN_ID>3333333333</PRSN_DSGNTR_SSN_ID>

<PERSON_ID>16</PERSON_ID>

<STUDENT_CLASS_ASSIGNMENT_EVENT_ID>20</STUDENT_CLASS_ASSIGNMENT_EVENT_ID>

<COURSEID>

<ORGZN_ID>L</ORGZN_ID>

<STDNT_TRNG_RQMT_TRNG_TYP_CD>6</STDNT_TRNG_RQMT_TRNG_TYP_CD>

<STDNT_TRNG_RQMT_STDNT_TYP_CD>A</STDNT_TRNG_RQMT_STDNT_TYP_CD>

<INSTL_UNIT_TYP_CD>G</INSTL_UNIT_TYP_CD>

<ORGZN_INSTR_UNIT_RL_CD>J</ORGZN_INSTR_UNIT_RL_CD>

<OCPTN_DSGNTR_ASGND_ID>3P031</OCPTN_DSGNTR_ASGND_ID>

<REVISION>001</REVISION>

</COURSEID>

<INSTL_CLS_ID>0</INSTL_CLS_ID>

<INSTL_CLS_ID_FK>1</INSTL_CLS_ID_FK>

<INSTL_BLK_ID>L6AGU3P031 001</INSTL_BLK_ID>

<STDNT_CLS_ASGN_EVNT_STTS_CD>ETE</STDNT_CLS_ASGN_EVNT_STTS_CD>

<STDNT_CLS_ASGN_EVNT_CLNDR_DTTM>20001006</STDNT_CLS_ASGN_EVNT_CLDNR_DTTM>

<UNIF_SVC_RNK_CD>B</UNIF_SVC_RNK_CD>

<PRSN_DTY_STTS_CD xsi:nil="true"> </PRSN_DTY_STTS_CD>

<PRSN_ACSSN_TYP_CD>true</PRSN_ACSSN_TYP_CD>

<TRNG_RQMT_RQST_ID>Aj10</TRNG_RQMT_RQST_ID>

<USRNK_CD> 31</USRNK_CD>

<SEX_CTGRY_CD>M</SEX_CTGRY_CD>

<STDNT_TRNG_REQ_STTS_CD>ZBJ2021230</STDNT_TRNG_REQ_STTS_CD>

<SUBORD_ORGZN_ID>342</SUBORD_ORGZN_ID>

<STDNT_SRC_ORGZN_TYP>TTMS</STDNT_SRC_ORGZN_TYP>

<DATA_DATE>09172001</DATADATE>

</STUDENT>

<Error_Description>test.xml:10:56 Datatype error: In element 'ORGZN_INSTR_UNIT_RL_CD": Value 'J' must be one of [A, B, C, D, E, F, G, H, K, L, M, N, O, P, Q, R, S, T, U, W, X, Y, Z]. </Error_Description>

</TTMS>

Glossary

IT

Internet Technologies

ADSS

AETC Decision Support System

AETC

Air Education and Training Command

AETMS

Air Education and Training Management Ssytem

csv

comma-separated values

DBMS

Data Base Management Systems

DTDs

Document Type Definitions

EDMS

Enterprise Data Management System

EJB

Enterprise Java Beans

IIS

Internet Information Services

JDBC

Java Database Connectivity

JDK

Java Development Kit

ODBC

Open Database Connectivity

SQL

Structured Query Language

TPS

Training Planning System

TTMS

Technical Training Management System

URI

Uniform Resource Identifier

XDK

XML Developer’s Kit

XML

eXtensible Markup Language

XSL

eXtensible Stylesheet Language

XSU

XSQL/XML-SQL Utiltity

Biography

Karen Fox
Technical Project Support
The MITRE Corporation
Colorado Springs
Colorado
U.S.A.
Email: kfox@mitre.org

Karen Fox is a Technical Project Assistant with The MITRE Corporation in Colorado Springs, where she has worked for ten years. However, her foray into web programming is relatively new, beginning three years ago when she started teaching herself VBScript and Active Server Pages. At the beginning of the AETC project, she knew nothing of Java and very little about databases and XML. Her presentation is an attempt to help others avoid all the stupid mistakes she made in combining a variety of XML technologies.

Matt Keen
Technical Project Support
The MITRE Corporation
Colorado Springs
Colorado
U.S.A.
Email: mkeen@mitre.org

Matthew Keen works for Mitre Corporation, where he holds the post of programmer for web services and distributed web applications. He has worked with XML for 2 years. Recent projects include work in SOAP based rule engines. Other experience includes Java, ASP, WML, SVG and PHP.