Merging XML Technologies
ABSTRACT
This effort is a case study into how to integrate products from competing vendors, such as Oracle and Microsoft©. To achieve the goal of modernizing the Air Force Technical Training archive, it was determined that a collection of information technology products from various vendors was needed. We discovered that while they can be made to work, there is a lot of trial and error involved. Currently, the Air Education and Training Command (AETC) Headquarters receives data monthly from each of its training bases and uses this data to compile a variety of reports. However, this data is not stored long-term, which necessitated the creation of an information management solution designed to provide a training data repository for the next 20-25 years. This new archival system is required to interface to the legacy systems and processes, as well as deliver new capabilities, such as data validation and enhanced report generation. With the use of eXtensible Markup Language (XML) technologies, the legacy data was parsed into XML using Java, validated using a SAX2 validator in a Java class file, and inserted into a Microsoft© Structured Query Language (SQL) Server database using Oracle’s XSQL/XML-SQL Utiltity (XSU) technology. Using Oracle products with a Microsoft© database required the use of a Java Database Connectivity (JDBC)-Open Database Connectivity (ODBC) translator, and we selected JSQLConnect. XSQL was also used with Java to extract the data from the database as XML and transform it back into the legacy file format for subsequent delivery to the end user. By performing data validation prior to the insertion of data, we improved the integrity of the stored data and saved the end user time by reducing errors. Additionally, this archive allows for the generation of reports using XSL stylesheets and XSQL over the saved data. It has provided access to the erred data and has been expanded to include even more types of training data.
Table of Contents
1. Overview
The United States AETC has the responsibility of aggregating and reporting on training data collected by training centers across the country. The individual training centers have systems that maintain these local data stores in a variety of Data Base Management Systems (DBMS). Although similar, the data structures and legal values are also different. These differences are the result of the procurement process in which each center’s system was built to meet the operational needs of the production systems. In spite of the need for an enterprise data scheme, the maintenance is performed at various locations by multiple owners with multiple production requirements.
The goal of an AETC Enterprise Data Management System (EDMS) is to provide central, long-term storage to be used for reporting, as well as for providing data for forecasting and trend analysis. It will also be used to feed both a data warehouse and production database systems. In addition, the EDMS will store metadata that describes the characteristics of the legacy data.
This document will describe the approach used to develop the AETC EDMS, which involved using a combination of technologies to take data from the wide variety of sources and get it into the EDMS. Section 2 will explain the overall design of the EDMS and go more in depth on each of the technologies used. Section 3 will go into detail on problems encountered in merging these technologies and lessons learned, while Section 4 will list plans for future development.
2. Design Process
The Phase I plans for the EDMS included receiving a training data text file in a comma-separated values (csv) format from the outlying AETC training bases. These files had to then be transformed into XML, validated and inserted into a database, and retrieved and passed onto the AETC warehouse in the csv format they are used to receiving. For the purposes of Phase I, the data had to be selected from the database and transformed back into the original format to be transparent to the change. This process is displayed graphically in Figure 1.
Each AETC training center generated a csv file containing the previous month’s data on student training and transferred it via File Transfer Protcol to a folder on a computer at the AETC Headquarters. This file was then transformed via Java servlets into XML and validated against an XML schema using Xerces. The valid record equivalents were inserted into the SQL Server database using XSQL. Records that were not valid were transferred to another folder and an error report was generated. XSQL was also used to extract report data from the Air Education and Training Management Ssytem (AETMS) database and Xalan and eXtensible Stylesheet Language (XSL) were used to display the extracted data in the same csv format as originally received. This data was forwarded to AETC Decision Support System (ADSS) as one report containing the validated AETC training centers information, providing an improved csv file, while remaining an invisible addition to the current process.
2.1. Technologies
The original plans for this proposal called for using the XML technologies to process and display the archive data in order to maintain the integrity of the thousands of entries. Additionally, as developers we wanted to use Java to make this application platform independent. In addition, direction from the AETC required that we use Microsoft© products, such as Internet Information Services (IIS) 5.0 and SQL Server.
2.1.1. Java
Java is a platform-independent programming language and provides many free or low-cost add-ons. We used Sun's© Java Development Kit (JDK) 1.3 and created several servlets to provide the classes necessary for our prototype. Each servlet and its application are listed below.
| XMLFileElementValue | Used in conjunction with ConfigFileElement to determine the path to where data is accessed and stored. This servlet grabs the path from a designated document. |
|---|---|
| XMLFileElementValue | Used in conjunction with ConfigFileElement to determine the path to where data is accessed and stored. This servlet grabs the path from a designated document. |
| ConfigFileElement | Wrapper to XMLFileElementValue. Passes in XML document containing path and receives path element. |
| SAX2Count (from Apache) | Used with Xerces to validate XML document against a designated schema. We modified this class to work with our application. |
| InsertXMLtoDB | Used to insert validated XML files in a SQL Server database via Oracle XSU and JSQLConnect driver. |
| SimpleTransform2 | Xalan sample used to transform XML with eXtensible Stylesheet Language XSL. |
| ParseTTMS | Main servlet to parse csv data into XML document. |
| ParseTPSCourse | Used to parse Training Planning System (TPS) course data into XML document. |
| ParseTPSClass | Used to parse TPS class data into XML document. |
| RunTTMS | Wrapper servlet for processing Tech Training Management System (TTMS) data by parsing and validating data. |
| Run TPS | Wrapper servlet for processing TPS Class data by parsing and validating data. |
| Run TPS2 | Wrapper servlet for processing TPS Course data by parsing and validating data. |
| getParam | Servlet to connect to web graphic user interface (GUI), gather information, and pass it to RunTTMS. Calls other needed servlets to process data. |
| getTPS | Servlet to connect to web GUI, gather information, and pass it to the appropriate RunTPS servlet. Calls other needed servlets to process data. |
| GetTTMS | Servlet to connect to web GUI to pull TTMS data out of database and deliver as the original csv file. |
2.1.2. JRun
Because we were using Java servlets, we needed a Java servlet engine and decided to use Allaire©’s JRun v2.2.3, since it interacted smoothly with IIS. However, we ran into several problems using JRun, which will be detailed further in Section 3. We even tried upgrading to the latest JRun Version 3.1, but that ended up creating more problems than it solved.
The order of jar files in the classpath for JRun became very important during development. In order to use the Xerces processor, the Xerces jar file had to be listed first in the classpath; otherwise, the JAXP and parser jar files overrode it.
2.1.3. Xerces
This free java library from Apache offers a built-in validator for use with Document Type Definitions (DTDs) or XML schemas. We began with Version 1.3 and ended with Version 1.4.2. We modified code in the SAX2Count class file from Apache by adding the following method so it would report back on the errors found in order to provide an error report and separate our validated data from nonvalidated.
public boolean begin(String arg){
System.out.println("In validation");
// vars
String parserName = DEFAULT_PARSER_NAME;
print(parserName, arg, true);
//the primary SAX2Count method
//System.out.println(flag);
return flag;
}
This process worked smoothly when we tested it from the DOS command prompt, but later when we developed the web interface and connected the servlets to it, we had problems in validation. This will be discussed more in Section 3.
2.1.4. Xalan
Another free Apache java library, Xalan Version 1.2.2 contains XSL processing code, such as the sample SimpleTransform that we copied and modified to fit our needs. It allowed us to take our XML documents and transform them into the format necessary for insertion into the SQL Server database.
2.1.5. JSQL Connect
We decided on a low-cost JDBC-ODBC connector in order to talk to SQL Server with our Java code. JSQLConnect Version 2.21 by NetDirect was easy to install and easy to use. While Oracle does provide a JDBC-ODBC tool, JSQLConnect proved easier to set up and use.
2.1.6. SQL Server
The Air Force required that we use SQL Server 2000 for our database because they already had a copy at the AETC. This new version of SQL Server contained XML tools, but we did not use them because we could not bulk insert data into columns by matching column names to XML elements via a Java servlet
2.1.7. Oracle XDK
Oracle’s XSU and XSQL products are contained in the XML Developer’s Kit (XDK) Version 9.0.2.0.0B. This Beta version is the only one currently that will work with SQL Server. The XSU enabled us to push XML documents into SQL Server, automatically matching element tags with column names, and pull data from SQL Server as an XML document. The code to insert the data follows. Note that the file name and table name were passed into the servlet.
Connection conn = null;
Document doc;
DOMParser d = new DOMParser();
Statement stmt = null;
ResultSet rset = null;
FileReader freader = new FileReader(FileName);
BufferedReader breader = new BufferedReader(freader);
try{
DriverManager.registerDriver(new com.jnetdirect.jsql.JSQLDriver());
// Insert XML data from file (filename) into database table (tabName)
try{
d.parse(breader);
}
catch (Exception e) {
System.out.println(e.toString());
}
doc = d.getDocument();
try{
OracleXMLSave savedoc = new OracleXMLSave(conn, tabName);
savedoc.insertXML(doc);
savedoc.close();
}
catch(oracle.xml.sql.OracleXMLSQLException ex)
{
//Get the original exception
Exception parent = ex.getParentException();
if (parent instanceof java.sql.SQLException)
{
//perform other stuff. Here we simply print it out
System.out.println("Caught SQL Exception: "+parent.getMessage());
}
else
{
System.out.println("Exception caught...!"+ex.getMessage());
System.out.println("SQL Error Code: " +ex.getErrorCode());
System.out.println(ex.toString());
}
}
System.out.println(" successfully inserted rows into "+ tabName);
2.2. Design
The csv file is uploaded to a folder on a server each month, so we retrieved the monthly Technical Training Management System (TTMS) data from there. The Training Planning System (TPS) data is sent out on an annual basis and had to be loaded into the database first, since the TTMS data is related to it. The application is accessed via a web interface and log in screen (Figure 2).
After verifying the user’s identity, the user is sent to the initial screen and given options of importing or exporting data to the archive (Figure 3).
The user can select the type of data to import. The TPS data is received annually, and it only needs to be imported once. However, this data is broken into two separate pieces–the course data and class data (Figure 4), and each must be imported separately. Because of the many to one relationship between the class and course data, the course data must be imported first so links can be created to it while importing the class data.
Importing TTMS data follows a similar process (Figure 5) in which a csv file is selected from a folder on the computer, pulled by Java servlets to be parsed into XML, validated against a schema, transformed for insertion into the database, and inserted into the database.
Once the servlets finish, the information on records parsed and validated is returned to the browser (Figure 6). This screen also provides links to the bad data, which not only contains theXML document, but the reason the record was designated as bad (Appendix A).
For Phase I, we also had to pull the data out of the archive database and pass it onto the ADSS in the format in which we originally received it (the comma-separated value file), since that is what they are used to receiving at this point in time. ADSS then imports this file into their database and uses it for data calculations. This also proved that our system was interoperable and seamless to the present operations and that the data was not changed despite the transformation, validation, and aggregation processes we used. Again, we used a web interface screen to determine which data to process (Figure 7).
The code behind this screen passes the form information into a servlet, which uses the Oracle XSU to pull the requested data from the database in XML form. The servlet then transforms it into a csv file via XSL and provides a link on the processed web screen to the data (Figure 8) so it can be viewed or saved as a file.
3. Lessons Learned
The lessons learned fall into three general categories: 1) the maturity and 2) interoperability of the technologies and 3) the experience of the developer. Although it would be more comfortable to address only those dealing with technology, we felt that lessons learned should be mentioned in case other programmers encounter them.
3.1. Java
One problem we encountered in writing the servlet to parse the data into XML was a developer error was easily fixed. When using the StringTokenizer to read in and parse the data, it would not find the end-of-line characters no matter what code or symbol we used.
The BufferedReader we used to pull the csv file into a FileReader only read in a line at a time, not including the end-of-line character. To fix this, we added the end-of-line character (/n) to each line while reading in. The revised code snippet from ParseTTMS that covers this is below. Whole and str were previously declared as strings, and InputFile was passed into this servlet.
//Read in the file and parse with string tokenizer
c=1;
FileReader freader = new FileReader(InputFile);
BufferedReader breader = new BufferedReader(freader);
System.out.println(“Reading in data”);
While ((str = breader.readLine()) != null)
{
whole = whole + str + “/n”; //adding /n here made the difference
c++;
}
freader.close();
StringTokenizer ST = new StringTokenizer(whole, “,”, true);
We also encountered a “null pointer” error several times because of source data inconsistency. During parsing, the servlet would attempt to grab data from a field, which was sometimes empty and sometimes contained data. We had to modify the parsing code to check for null and enter an empty space in that case, as shown below.
INSTL_BLK_ID=ST.nextToken(",");
if (INSTL_BLK_ID.equals(","))
{
INSTL_BLK_ID = "";
}
else
{
ST.nextToken(",");
}
3.2. JRun
The JRun Java Servlet Engine produced several problems during development with the worst being that it times out when processing an incoming file any larger than 100 kilobytes. To date, we have been unable to resolve this problem, even though we tried using the latest version of JRun. To make the application run, we broke the incoming large file into smaller files. This is a temporary fix at best and must be corrected.
We also discovered that the classpath entered into the JRun administrator must be set with the Xerces jar file first in order to use that parser and not the one built into JRun. When trying to use the latest version, we encountered even more difficulties in setting the classpath; both Xerces and JSQLConnect had to be first, and JRun still would not work correctly.
We have also encountered a recurring problem of JRun stopping for no apparent reason and not starting again. Each time meant uninstalling, then reinstalling JRun in order to proceed. We believe using Enterprise Java Beans (EJB) and an application server would greatly improve performance and solve these difficulties.
3.3. Xerces
Most of the problems encountered in using Xerces were easily fixed, though one problem stymied us for several weeks. Fortunately, we could post to discussion groups available with folks very willing to share their vast knowledge, such as the Xerces-J User list.
One notable problem involved the SAX2Count code from Apache used in validating an XML document against a schema. We consistently upgraded the Xerces application during development to achieve more capability. In the beginning, the validation process worked perfectly when invoked via the DOS prompt. However, after adding the web interface and connecting the Java servlets to it, the validation process quit working. It appeared that the code could not find the schema; therefore, it declared every element as invalid.
We tried referencing the schema in a variety of ways from the XML document, to include hardcoded path, full URI, and relative Uniform Resource Identifier (URI) . We posted the problem on several loops and received some responses, but none of them solved this error. Finally, an unrelated post on an email loop suggested a fix, which turned out to be embarrassingly simple.
We began development using Xerces 1.3.1 and progressed through Version 1.4.2. The original schema namespace was listed as: xmlns:xsi=http://www.w3.org/2000/10/XMLSchema. Despite the release of the XML Schema Recommendation from the W3C in May 2001, it never occurred to us to change this reference to reflect the latest schema as xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance. Once we read a loop email mentioning the change in schema reference namespace, the light came on. We made this change and the validation process worked perfectly.
We also used XML Spy Version 3.5 to write the schemas. Although it is a good tool for writing and testing schemas, IBM’s XML Schema Validator or the XSV tool are more accurate in relation to the latest W3C Schema Recommendation.
3.4. Xalan
The only difficulty encountered using Xalan came when we tried to use JRun 3.1. No matter where we put the Xalan jar files in the JRun classpath, the JRun parsing files would override the Xalan and break when trying to do a transformation.
3.5. JSQL Connect
The only problem with JSQLConnect came with trying to get the jar files to work with JRun 3.1. By putting JSQLConnect as the first jar file in the classpath, we got the application to access the connection driver.
3.6. SQL Server
At this point in time, SQL Server just does not have robust XML capabilities. Aside from the learning curve in using the SQL Server application, we discovered that the built-in XML tools were not conducive to integrating with our Java servlets, which made us examine the Oracle XSQL tool for use in importing and exporting XML data to and from the database. While this tool does work, we discovered that SQL Server will not implement an autonumber field during a bulk insert, so the ID numbers and relationship fields have to be set up in the XML document prior to transformation and insert. Sample Java code used to create one of these index fields follows:
public int getEventID() throws java.sql.SQLException
Connection conn = null;
Statement stmt = null;
ResultSet rsetEvent = null;
String query2;
try{
DriverManager.registerDriver(new com.jnetdirect.jsql.JSQLDriver());
//Initialize a JDBC connection
conn = DriverManager.getConnection("jdbc:JSQLConnect://mm97220:1433/database=archive/user=demo/password=demo");
query2="SELECT MAX(STUDENT_CLASS_ASSIGNMENT_EVENT_ID) AS ID from STUDENT_CLASS_ASSIGNMENT_EVENT";
//System.out.println (query2);
stmt = conn.createStatement();
rsetEvent = stmt.executeQuery(query2);
rsetEvent.next();
if (rsetEvent != null)
{
EventID = rsetEvent.getInt("ID");
}
else
{
EventID = 0;
}
//System.out.println ("Event is " + EventID);
}
catch(java.sql.SQLException e){
System.out.println(e.toString());
}
rsetEvent.close();
stmt.close();
conn.close();
return(EventID);
}//end method getEventID
Because of this, we reduced the number of tables in the database, thereby, changing from our initial database normalized schema. We believe there is a workaround for this problem, but have not encountered a fix to it.
Finally, there is no update/add feature, which means extensive coding will need to be done to address those records coming in that may be updates rather than new data.
3.7. Oracle XDK
The Oracle XML products are designed to work with Java servlets, which made them excellent candidates for this project. Information we encountered said that XSU would now allow connection to ODBC databases, such as SQL Server, but when we tried it, we could only pull data, not insert. We finally discovered that only the current Beta version of the XDK 9.0.2.0.0B would implement the insertion process.
4. Future Plans
We have several things we want to try with this application to include 1) using an Apache web server and Tomcat Java engine to run it. Perhaps this will resolve some of the problems encountered in using JRun. We also want to 2)streamline the code and make it as platform independent as possible.
We also want to 3) explore new tools. We have seen demonstrations of XML tools that may provide useful as well. Neocore produces a database designed solely to hold XML, which is supposed to be faster and easier to use in relation to XML. X-Aware has a virtual XML database, which acts as a wrapper to a regular relational database, but handles XML data in and out. We would like to investigate these products further, especially in relation to retrieving the TTMS data from the training centers. Currently, this data is sent monthly by the training centers. Using the X-Aware tool would allow us to pull data every month in an XML format. Our recommendation has been to implement more XML awareness on the training center end of this process.
5. Conclusion
While intermixing technologies from different vendors presents a unique set of problems, overall, none of these problems are insurmountable, due mostly to the sharing nature of theInternet Technologies ( IT) community and the proliferation of new XML tools. Worthy of note is Oracle for not only creating excellent XML database tools, but for extending their capability beyond the reach of their own database.
6. Appendix A
6.1. Validated Bad Data Report
<?xml version='1.0' encoding='UTF-8'?>
<TTMS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="http://space-xml-mitre.org/aetc/schemas/ttms.xsd">
<STUDENT>
<PRSN_DSGNTR_SSN_ID>3333333333</PRSN_DSGNTR_SSN_ID>
<PERSON_ID>16</PERSON_ID>
<STUDENT_CLASS_ASSIGNMENT_EVENT_ID>20</STUDENT_CLASS_ASSIGNMENT_EVENT_ID>
<COURSEID>
<ORGZN_ID>L</ORGZN_ID>
<STDNT_TRNG_RQMT_TRNG_TYP_CD>6</STDNT_TRNG_RQMT_TRNG_TYP_CD>
<STDNT_TRNG_RQMT_STDNT_TYP_CD>A</STDNT_TRNG_RQMT_STDNT_TYP_CD>
<INSTL_UNIT_TYP_CD>G</INSTL_UNIT_TYP_CD>
<ORGZN_INSTR_UNIT_RL_CD>J</ORGZN_INSTR_UNIT_RL_CD>
<OCPTN_DSGNTR_ASGND_ID>3P031</OCPTN_DSGNTR_ASGND_ID>
<REVISION>001</REVISION>
</COURSEID>
<INSTL_CLS_ID>0</INSTL_CLS_ID>
<INSTL_CLS_ID_FK>1</INSTL_CLS_ID_FK>
<INSTL_BLK_ID>L6AGU3P031 001</INSTL_BLK_ID>
<STDNT_CLS_ASGN_EVNT_STTS_CD>ETE</STDNT_CLS_ASGN_EVNT_STTS_CD>
<STDNT_CLS_ASGN_EVNT_CLNDR_DTTM>20001006</STDNT_CLS_ASGN_EVNT_CLDNR_DTTM>
<UNIF_SVC_RNK_CD>B</UNIF_SVC_RNK_CD>
<PRSN_DTY_STTS_CD xsi:nil="true"> </PRSN_DTY_STTS_CD>
<PRSN_ACSSN_TYP_CD>true</PRSN_ACSSN_TYP_CD>
<TRNG_RQMT_RQST_ID>Aj10</TRNG_RQMT_RQST_ID>
<USRNK_CD> 31</USRNK_CD>
<SEX_CTGRY_CD>M</SEX_CTGRY_CD>
<STDNT_TRNG_REQ_STTS_CD>ZBJ2021230</STDNT_TRNG_REQ_STTS_CD>
<SUBORD_ORGZN_ID>342</SUBORD_ORGZN_ID>
<STDNT_SRC_ORGZN_TYP>TTMS</STDNT_SRC_ORGZN_TYP>
<DATA_DATE>09172001</DATADATE>
</STUDENT>
<Error_Description>test.xml:10:56 Datatype error: In element 'ORGZN_INSTR_UNIT_RL_CD": Value 'J' must be one of [A, B, C, D, E, F, G, H, K, L, M, N, O, P, Q, R, S, T, U, W, X, Y, Z]. </Error_Description>
</TTMS>
Glossary
- IT
-
Internet Technologies
- ADSS
-
AETC Decision Support System
- AETC
-
Air Education and Training Command
- AETMS
-
Air Education and Training Management Ssytem
- csv
-
comma-separated values
- DBMS
-
Data Base Management Systems
- DTDs
-
Document Type Definitions
- EDMS
-
Enterprise Data Management System
- EJB
-
Enterprise Java Beans
- IIS
-
Internet Information Services
- JDBC
-
Java Database Connectivity
- JDK
-
Java Development Kit
- ODBC
-
Open Database Connectivity
- SQL
-
Structured Query Language
- TPS
-
Training Planning System
- TTMS
-
Technical Training Management System
- URI
-
Uniform Resource Identifier
- XDK
-
XML Developer’s Kit
- XML
-
eXtensible Markup Language
- XSL
-
eXtensible Stylesheet Language
- XSU
-
XSQL/XML-SQL Utiltity

