Abstract
Sigma-Aldrich, the leading supplier of biochemical and organic chemical products for Life Science and High Technology, uses its print and electronic catalogs as its primary sales and marketing vehicle. A recent market research study by BioInformatics, LLC (a leading provider of market research to the biotechnology industry) ranked Sigma-Aldrich Number 1 among life science companies for its catalogs and website.
Sigma-Aldrich publishes nearly 1.5 million catalogs annually including content in 5 languages and 17 currencies. In order to achieve this volume, a centralized product information database called Corpbase was designed to be the company's repository for product information stored in an output independent format. As such, it can be repurposed to provide information for package labels, printed catalogs, brochures, and the online catalog.
Corpbase uses a blend of RDBMS (Oracle), Java and XML technologies to provide:
storage and maintenance of fielded information for over 100,000 individual chemical and equipment products, managed across 5 corporate divisions;
multiple language and currency versions for each publication;
management of the publication outline including entities like chapters, sections, product listings and graphic objects;
a method for excluding content that is not suitable for the current publication both globally and at the individual listing level;
support for rendering listings as either text prose or CALS tables;
a flexible phrase-builder engine for presenting atomized data in a variety of formats;
a technique to save space and reduce redundancy by moving duplicate information from products into higher-level groupings (deduping); and
DTDs that provide consistent input to both high-end desktop publishing software and the corporate website.
This case study will describe the major commercial products and technologies used in the project, the design decisions and tradeoffs that were made, and the overall results including lessons learned for the future. Specifically we will highlight the key role XML played in supporting filtered content, CALS table construction, deduping, and transforming the data stream into the various publishing DTDs. The technical information in this case study is directed towards an audience familiar with the basic use of XML, XSL, XPath, ERDs, and Java.
Keywords
Table of Contents
Sigma-Aldrich Corporation’s electronic publishing system (Corpbase) uses a client server architecture. The server is an Oracle 8i database running on an IBM 6H1 under AIX with an EMC disk array and 6GBs of RAM. The database is currently 12GB. The client applications are hosted on an NT based Citrix server. A Citrix client is then installed on the end user workstations, greatly reducing the client’s bandwidth demands on the corporate WAN. This allows the server to be housed in the US while supporting users located in 6 countries. Data is entered and maintained using a VB6 client application called CBClient. Publishing project outlines are defined and maintained using an MSAccess client called CBManager. CBContentWorkbench, an IIS ASP client application, is used to filter product and substance content by publication. The extraction of a publication uses JAVA, PL/SQL, and XML technologies to produce content appropriate for the target media, be it print, web, or some other destination.
This section describes the overall architecture of the system, and the major commercial and custom products and components used.
The system uses the following major commercial (off-the-shelf) components:
IBM 6H1 Server with with AIX 4.3.3 and 6GB RAM
Win2000 Server(client Apps) with Citrix XP
Oracle 8i
Oracle PL/SQL
Oracle Java
Oracle XDK
3B2 Composition Engine
The Corpbase publishing system integrates with our corporate ERP system SAP to reduce duplication in data entry and management. SAP is the master source for both EH&S(Environmental Health & Safety) information and pricing/packaging information. Before a project is extracted, this information is batch updated into Corpbase from a snapshot table on SAP using Oracle’s data replication feature.
A custom data model was designed and built using the standard Oracle design tools to store product information. A major feature of the data model is that all published texts are stored in tables with language_id as part of their compound primary key allowing a publication to be extracted in more than one language.
The outline of a publication is created by assembling the building blocks in a hierarchical tree starting from a section that represents the publication as a whole, working down through subsections until finally arriving at products, images and cross references etc. The three levels of organization in the publication are sections (entire collections of products of a similar type), followed by groups (loosely related products/substances) and substances (tightly related products). The tree leaves are products, images, text paragraphs (which allow for, for example, introductory text at the start of a section), and cross-references.
One of the unique issues faced in the development of Corpbase was how to deal with all the product attributes that were atomized in nature but were assembled differently for different publications. The solution was to employ a transform engine. The transform engine uses a set of Boolean rules to output text constants and variable data from the database as a meaningful text string. These rules allow for things like putting parentheses around a comment text only if the comment is present or putting punctuation between multiple items in a list. The transforms can also handle issues with different sentence structures in different languages. Typically a catalog will have 100,000-400,000 transform strings generated during the extraction process. This was implemented as a PL/SQL package rather than using XML due to the complex data structures needed to complete this task.
Customers don't all speak English as their primary languages. So one of the features of the publishing system is that all database texts are maintained as translation sets so that a given project outline can be extracted using any available language. Primarily that is English, French, German and Spanish. However, UTF8 encoded Unicode as the database character set we are also able to support our Japanese customers and others as future needs dictate.
Another issue we face is that many of our products have sales restrictions in one country or another. So a catalog that will be distributed into several countries needs to be flexible to accommodate any restrictions. We handle those in two ways. The first is by what we call "cloning a project". Once a publishing outline has been completed and the data is proofed the entire project can be replicated. We call the original the "master version" and any replicas/clones a "base version". Products are removed from each clone based on sales restrictions in its target countries. Currently we have US, Europe and ROW (rest of world). Each base version is replicated swapping out pricing, covers and intro text to create custom editions for each country. To allow for a finer grain in the restrictions we can suppress pricing in a particular currency and put in a disclaimer statement saying a product is not available in a particular country. For example some products will say "Not available for sale in Japan" in the Yen edition of the ROW base version.
This section discusses the processing steps used in the system that are common to all target media. A brief outline of these steps is as follows:
Pre Extract Preparation
Database/Publication Synchronization (Cleanup)
CLOB Generation
Exclusions and Filtering
Each of these steps is discussed in more detail below.
Before a publishing project can be extracted, some setup is necessary. First, the data must be entered into Corpbase using CBClient for all products, substances and groups that are needed in the publication. Once this is done, CBManager is used to build the publication’s outline by creating any required sections and adding in the products, substances and groups to the correct locations. Associated text and images can also be added along with generating any useful x-refs. After this step is complete, any unnecessary listing information can be suppressed using CBContentWorkbench. This web-based application displays product and substance level information with on/off toggles next to each data element. Any data that is marked off causes an XPath statement to be generated and stored in the database for later use in filtering out that piece of data from the listing’s XML representation. Since XPath is being used on hierarchical XML, it is easy for CBContentWorkbench to allow for the suppression of all instances of a type of information like descriptions or only an instance of the type. Transforms are assigned to the project to render attributes into meaningful text strings.
The first step in the extraction process is to ensure the database and the publication project(outline) are in sync. An example of this is adding a synonym on a product in Corpbase can cause an x-ref to automatically be generated for it from the project.
Once the database and the publishing project are in sync, the system generates a set of small XML documents for each node in the publishing project’s tree. These documents are generated by SQL statements stored in a table. Each statement is run through Oracle’s XMLGEN package to produce a small nested XML document for that type of information (for example, descriptions or citations). While these are being generated the system runs a parallel process that generates all the transform strings needed for each listing node and then uses the same method to turn the transform results into a small XML document. The XML generated at this point is in the database DTD form and must be transformed by an XSLT into a target DTD appropriate for the target output media.
After all of the individual XML sections of a listing node have been generated, they are combined into one XML document for that node. The XPath statements created by CBContentWorkbench are used to delete any information for the XML document that needs to be excluded from the publication currently being extracted. The XML sections can be cached for use in other publications being worked on concurrently.
This section discusses the additional processing steps used for print media. A brief outline of these steps is as follows:
De-duping
Tag replacement
XML File Generation
Pricing
Page Composition
Index Generation
Final Production Steps
The following diagram describes this processing.
Whenever common information from amongst a listing's children can be moved up to the parent de-duping occurs. For example if all of the products within a substance have a common storage temperature, then that temperature need only be shown for the parent substance, and not repeated each time. This can save a considerable amount of space in the catalog.
De-duping one generation of listings can affect its parents. For example, de-duping product listings might promote some duplicate data to their substance level. That data becomes a candidate for de-duping to the group parent of the substances. This would not work correctly if de-duping were done from the top down (from grandparent to parent to child).
The combination of using individual Oracle CLOBs combined with an XML format was found to be an excellent architecture for our needs. Since fielded information such as "storage temperature" is identified explicitly by the XML tags, identifying this information and moving it between listings is straightforward (given a DOM to operate on). Had we performed this operation solely at the database level, in SQL, we might have had to change and maintain that data in the original Oracle tables, instead of just changing the listing information being published at that time in the XML. Had we performed this operation within the typesetting system, we would have quickly encountered memory issues trying to build an XPath-searchable structure (such as a DOM) for the entire document. The partitioning of listings into XML CLOBs provided a very manageable level to operate on.
The de-duping logic was implemented in Java, and calls the database using JDBC prepared statements.
During implementation we discovered an added complication regarding related data. Sometimes, if one type of information was to be deduped, a related subtree of the XML should be moved that isn't involved in the duplicate comparison (such as all related citations when a description is moved). This proved difficult to implement generically; we implemented it using special Java logic.
A major design choice was to publish the printed catalogs using a DTD that more closely modeled the structure of the catalog. Given this, it was necessary to translate the simple Oracle row-and-column based tagging in the CLOBs into a richer tagset suitable for processing into pages. The algorithm we implemented (using Java and JDBC) was to visit each node in the project tree, obtain its Oracle XML CLOB, and pass this CLOB thru an XSLT template that contains the translation rules from the original XML into the publishing XML. The XSLT template changes tag names into a more readable format than Oracle uses (for example, a field named S_EFORM in the database will be translated into the publishing tag <EmpiricalFormula>), and also changes the form of the tagging, in that some information that appears as multiple tags in the CLOBs can appear as XML attributes, or concatenated together into one publishing element. An example of the tagging we designed is shown below for a simple ProductListing:
During this step, the children for each parent listing are sorted into the correct display order for the catalog. Sorting is by one of two different sort criteria, either alphabetical based on a pregenerated sortkey or manually based on a user defined sequence. Once sorted, a "marker" is inserted into the proper place in the parent’s publishing XML, using a processing instruction with the instruction type of "insertnode" (for example, <?insertnode N?> where N is the node ID). This tells the later assembly process exactly where to merge the XML for each child into its parent’s listing.
A special part of the tag replacement step is the assembly of information from a group or substance into a CALS table. A table can be a much more concise and direct way of presenting listings, especially when there are a few distinguishing fields of information between otherwise similar products. For example, such a table might look like the following:
| Range | platform, diam. (mm) | No. | AC | Cat. No. |
|---|---|---|---|---|
| weighing capacity 210 g, 1 mg intervals | 121 | EP213C | 230 V | <ProductNumber>Z66,028-0</ProductNumber> |
| weighing capacity 410 g, 1 mg intervals | 121 | EP413C | 230 V | <ProductNumber>Z66,029-9</ProductNumber> |
| weighing capacity 610 g, 1 mg intervals | 121 | AP613C | 230 V | <ProductNumber>Z66,030-2</ProductNumber> |
| weighing capacity 100/410 g, 0.01/0.01 g intervals | 121 | EP413DC | 230 V | <ProductNumber>Z66,031-0</ProductNumber> |
| weighing capacity 610 g, 10 mg intervals | 172 x 172 | EP612C | 230 V | <ProductNumber>Z66,032-9</ProductNumber> |
| weighing capacity 2100 g, 10 mg intervals | 172x172 | EP2102C | 230 V | <ProductNumber>Z66,033-7</ProductNumber> |
| weighing capacity 4100 g, 10 mg intervals | 172 x 172 | EP4102C | 230 V | <ProductNumber>Z66,034-5</ProductNumber> |
Table 1.
In general, the system stores predefined "table types" containing a set of table column definitions (each column can contain the results of one of the transforms described earlier in this document). Parent listings are then assigned a "table type" if their children are to be shown as a CALS table instead of as normal listings. During the tag replace procedure, if the table type is non-null, two separate SQL queries are dynamically assembled from the table type column definitions.
The first SQL query returns one row of data consisting of the column titles for the CALS table. The second returns a result set of table cell data. This row/column information can contain special XML tags whose content can "bubble up" into the column title in a deduping-like scenario. These tags are <HDR1>, <HDR2>, <HDR3>, and <HDRP>. The content of each cell tag is examined independently within a column. For every cell in a column (except the header row), if the content of the<HDR1> tags are all the same or null, then the non-null value bubbles up.
After package sizes and a link to the pricing file are merged in, a flexible set of post-processing functions known as "widgets" are optionally applied to the table at the product analyst's discretion. The widgets that we implemented include:
Groupbreak: converts a column into a series of group break rows above groups having the same value in that column.
Subfooter: similar to the groupbreak widget, but instead generates a row underneath a listing or group of listings
Spanrows: merges successive cells in a column when values in the cells are the same.
Crosstab: similar to the MS Access crosstab function, where some data values become column headers.
This step is quite straightforward and uses a simple "walk the tree" algorithm to obtain the tag-replaced CLOBs from the database, and output them to an XML file.
Initially we implemented this step using a DOM parser, but switched to SAX for a significant speed improvement. This also simplified the processing, in that processing a DOM node required printing only part of the parent, then printing all its children, then printing the remaining content of the parent followed by its end tag. The event-driven nature of the SAX approach was a much more natural fit to this method of assembly.
We also found that we needed to be very careful in how we wrote the output file in order to preserve special characters. We used an OutputStreamWriter, tied to a FileOutputStream with an explicitly declared encoding (utf-8):
new OutputStreamWriter(new FileOutputStream("catalog.xml"),"utf-8");
As was mentioned earlier, pricing is batch loaded into Corpbase via a feed from the Corporate Business Warehouse (an SAP module called BW). This data is extracted using Oracle’s XMLGEN PL/SQL package which converts a query into XML. The XML representation of the pricing information for a publication in a particular currency is streamed out to a file on the server’s file system. One price file is generated for each currency version of the publication. In addition, a TAB setter file is produced for each base version. This file contains the longest price for each SKU and is used by the typesetting engine to provide a frame(tab) sized correctly to hold any of the desired currency values for that SKU.
Once the generic XML file has been extracted from Corpbase for a particular publication, it is post processed with a re-ordering XSLT. We have found that the closer the content stream is to the desired order of information on the page, the easier it is to write the 3B2 typesetting template. 3B2 uses a proprietary macro language, supplemented with PERL script functions, to typeset the page shell contents. Overall 3B2 is very flexible and can handle various conditional situations, differences in right and left hand pages, break rules, keeps, widows, orphans, line breaks etc. Currently it takes 3B2 about 5 hours to generate a 3,000 page catalog as a 3D file (viewable pages in 3B2’s internal format) from the 70MB XML file. It takes another 5:00 hours to produce PDFs for proofing. A page number lookup file is generated during processing of the body of the publication for cross-reference and index generation.
To ensure that the indices of a publication match the contents exactly, the same source XML file is used to generate the indices as was used to generate the body of the publication. The standard XML file is run though an XSLT to trim out any big blocks of unused content. This greatly reduces the size of the XML file being processed and thus greatly speeds up the processing times. A second XSLT is used to convert the input XML into the index DTD format which is then run through the appropriate 3B2 template to produce the 3D file and then PDFs.
Generally, it takes several rounds of producing pages and proofing them before the publication content is approved. Issues related to data entry, transformations, exclusions, sequencing, and page template presentation must all be addressed. Once the proofing process is complete and final PDFs for the publication body and indexes are produced, these are sent to the printer. The publication body and indexes are assembled with front and back pages and bound with the correct cover for that base version and currency combination. The finished catalogs are cartoned for distribution.
The web catalog takes the excluded product XML documents and passes it through an XSLT to convert it into an XML representation of the file format used to load our web catalog product. Then ORACLE’s XMLGEN package is used to automatically insert the XML into a table as rows. This is then output and sent to our web catalog product which, once loaded, renders the data via a JSP template into a product listing on the website.
In a similar fashion to the generation of the feed for our web catalog product, the data is also converted into another output table in a more generic format. The major difference is that the data in this table retains unicode entities rather then converting them into web markup. This table is used as a standard feeder file for our other downstream systems and users. Currently we are using this method to supply SAP with data for product labels and to provide data to external agencies where required by legal or contractual agreements.
By making use of Oracle’s JAVA, PL/SQL and XML technologies we were successful in developing a flexible publishing system that supports various target media. Going forward we plan to do extensive studies on Oracle 9i’s new XML datatype and the potential for storing some information directly as XML documents rather than fully normalized relational data. Another issue we will look at is performance. Currently the same data is converted into a DOM and processed multiple times during the extraction process. It may be possible to rework the data flow to consolidate this processing and reduce the overhead of repeated DOM creation and XML serialization. However, having the intermediate XML stored in tables during the extraction process has proved very valuable in trouble shooting both data and application errors. Also, it has allowed us to correct downstream processing without the need to restart an extract from the beginning.
![]() ![]() |
Design & Development by deepX Ltd. |