XML Europe 2004

Cost Effective XML Processing in the Datacenter

Abstract

Keywords


1. XML in the Datacenter

The enterprise IT world is moving inexorably towards Service-Oriented Architectures that will allow rapid development of applications that provide real differentiated value to their businesses. The goal is a virtualized, real-time, extensible enterprise architecture that can quickly offer new functionality, yet integrates easily with legacy assets. This architecture must be reliable, extensible and manageable. It must offer the highest performance for peak loads, yet not be over-sized leaving assets under-utilized for typical workloads. It should offer the highest availability without duplication of expensive components.

Is it possible to build such an architecture in today's budget-constrained environments? Are standards maturing quickly enough to settle on, at least, a plan forward? Can an enterprise embark on such a quest in an evolutionary manner?

The industry answer has been: yes, probably. Although there are many emerging standards and systems vendor strategies, there is agreement in at least two areas: first, a service-oriented architecture is a "Good Thing" and, second, the foundation for loosely coupled applications using self-describing data is XML.

The momentum behind service orientation, web services and XML in recent years has been stunning. Today, the majority of Fortune 1000 enterprises use XML-formatted data both for business-to-business communication and for datacenter applications. ZapThink, an industry analyst firm focused on service orientation, estimates that by 2005, 25% of all LAN traffic will be XML-based, indicating massive growth in datacenter XML and service orientation. It is this very growth of XML in the datacenter that presents an emerging concern to datacenter architects and to those responsible for providing a high-performance infrastructure to support service orientation: how do we parse, validate and transform all this XML data?

It is starting to become clear that XML processing has a significant overhead associated with it. Large documents require manipulation at several points in the data lifecycle. If done inefficiently, up to 80% of application server processing can be consumed manipulating and reformatting XML.

The traditional approach, add more application servers

The traditional solution to application server performance issues is to increase the amount of processing power on tap. If the server farm is already part of a tiered architecture, it can be relatively easy to scale the farm by simply adding another server. Many of the costs of doing this, however, are hidden beyond the primary hardware acquisition expenses and include:

- server options such as memory, PCI controllers and drives

- network infrastructure (cables, switch ports etc.)

- software licenses for the application software

- database software licenses

- management software licenses

- deployment and configuration time

- test time in a simulated environment

- implementation and benchmarking

- ongoing management costs

These costs can quickly become prohibitive as the deployment of web services and associated applications continues to grow. If industry estimates around the growth of datacenter XML LAN traffic are true, the average application server farm size will almost double by 2006. At some point, a more architecturally appropriate approach to solving this problem is required.

Another approach - upgrade existing servers

Moore's Law is popularly summarized as processing power doubling roughly every 18 months. Surely this provides an answer to the XML processing problem? Unfortunately, it does not. While servers are indeed adopting new architectures and processors that increase the raw processing power available, it is the type of processing, in addition to the sheer amount of datacenter XML traffic, that makes the XML problem at once unique and ubiquitous.

Processing of XML documents is most efficient using powerful document processing languages like XSLT. XSLT allows the programmer to parse the XML document, transform data into other formats and perform complex business logic dependent on the nature and content of the document. The tree structure of XML means that this is best done utilizing parallel processing methodologies that can traverse many variables at the same time. At any point in the tree, other processes may be spawned to perform logic on the data or on other tree elements. Complex style sheets representing real-world business logic and data transformations can consume significant processing cycles from an application server pool.

These complex, highly parallel-processing tasks cannot be solved efficiently using a general-purpose architecture.

Architecturally Advantaged - the appliance approach

At some point in the technology lifecycle, customers look to solve problems with a dedicated, custom-built device. This is the reason every household (with owners that enjoy consuming hot bread products) has a toaster instead of making toast in the oven (a general-purpose device that also roasts beef and bakes cakes).

In the same way, when general-purpose servers become inefficient at solving a particular problem, server appliances become the favored approach. Routers, switches, firewalls and load balancers are all particular instances of dedicated solutions focused on performing one task, or a range of related tasks, in the most efficient manner.

Server appliances offer a range of advantages over the equivalent general-purpose server approach:

Simple and rapid deployment:

Since these appliances are dedicated to a particular task, the setup, configuration and deployment of the solution become more turnkey in nature. The solution vendor knows exactly what the device is to be used for and, as such, can develop installation and management tools dedicated to that purpose.

Integration into existing environments:

Regardless of the type of hardware, flavor of operating system, or nature of software and tools on the system, server appliances should seamlessly fit into any given enterprise environment. Appliances should be viewed as black box environments that are accessible via standard interfaces (TCP/IP, SOAP etc.) and manageable with standard tools.

The highest availability

Any dedicated server appliance should offer higher availability than its general-purpose equivalent. In a dedicated device, the usage conditions and scenarios can be pre-defined, pre-configured and tested. Most server appliances offer redundant components within a single box and provide load-balancing and failover for even higher availability.

Orders of magnitude performance gains

However, the major reason for implementing an appliance approach to a given problem is that it just does a better job of it. It's architecturally the right thing to do. In many cases, server purchases can be deferred, or existing general-purpose servers redeployed, because the appliance solution offers significant performance advantages.

XML in Hardware - More than 50X performance gains?

One example of a dedicated appliance for web services and XML processing is the Conformative Systems CSXi XML Appliance. This solution lets enterprises deploy and run datacenter web services and XML applications on a highly scalable, high-performance, enterprise-ready platform that is built on custom parallel-processing hardware. It offloads the headache of parsing, validating and transforming data, particularly XML documents, from application servers. It can also be used to host complex business processing logic or web services.

Deploying an XML appliance can significantly reduce the bloat of additional application servers in an enterprise, increasing application performance and reducing the cost of XML data processing by a factor of 10-15X. In addition, cost savings continue throughout the life of the implementation as management, sparing, infrastructure and availability advantages begin to accumulate. The very fact that a single two-node cluster of appliances can replace dozens of general purpose servers makes these savings apparent.

So, how are these performance gains realized? These incredible efficiencies are delivered through the appropriate mix of hardware and software technologies specifically designed and built for the unique processing of declarative data.

A) The engine

These hardware solutions may be built upon custom ASICs that perform parallel processing of XML and compiled XSLT in hardware. As can be seen from the diagram below the solution interfaces to the outside world via standard networking interfaces and protocols such as HTTP, TCP/IP, SOAP etc.

The data flows through the system as follows:

1. XML documents are passed to the solution via one of these networking methods

2. The document is then parsed and validated in parallel using dedicated processing hardware

3. A custom transformation engine manages the processing of a pre-compiled style sheet using dedicated parallel application engines that process the document.

4. The document is then reconstituted into the output format and passed out of the solution.
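The four steps above can be sketched in software with the standard JAXP calls. This is only an illustration of the data flow, not a description of how any appliance implements it in hardware: the class name `PipelineSketch`, its methods, and the toy stylesheet and document are all assumptions made up for this example. Schema validation is omitted for brevity, so step 2 only checks well-formedness.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PipelineSketch {
    // Hypothetical stylesheet: rewrites <order id="..."/> as <ack ref="..."/>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/order'><ack ref='{@id}'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Compile the stylesheet once, ahead of time. A Templates object can be
    // reused for many documents and shared between threads.
    static Templates compile(String xslt) {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (Exception e) {
            throw new IllegalStateException("stylesheet did not compile", e);
        }
    }

    // Steps 1-4 for a single document that has already arrived as a string.
    static String process(Templates compiled, String xml) {
        try {
            // Step 2: parse the document (well-formedness check only in this sketch).
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            // Step 3: run the pre-compiled stylesheet over the parsed tree.
            StringWriter out = new StringWriter();
            compiled.newTransformer().transform(new DOMSource(doc), new StreamResult(out));

            // Step 4: hand back the serialized output document.
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("document could not be processed", e);
        }
    }

    public static void main(String[] args) {
        Templates compiled = compile(XSLT);
        System.out.println(process(compiled, "<order id='42'/>"));
    }
}
```

Compiling the stylesheet once and reusing it per document mirrors the "pre-compiled style sheet" in step 3; everything else here runs sequentially on a general-purpose processor.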

Each one of these functions is accelerated by the use of dedicated hardware where appropriate, providing dramatic performance advantages over traditional sequential processing methodologies.

B) The compiler

Any parallel processing engine has to be coupled with highly efficient compiler technologies to ensure that the efficiency of the processing engine is realized. These compiler technologies expose the available parallel processing opportunities of a declarative programming model, allowing architecturally specific hardware engines to process data for web services with more throughput than a general purpose processor.
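A small software analogy may help show why a declarative model exposes parallelism. In the sketch below, a "template" is modeled as a pure function from one input item to one output fragment, so the runtime is free to evaluate the items concurrently and still assemble the output in document order. The class `ParallelTemplates` and its methods are hypothetical and are not taken from any vendor's compiler; Java parallel streams stand in for the dedicated hardware engines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ParallelTemplates {
    // Hypothetical "template": a side-effect-free function of its input item.
    static String applyTemplate(String item) {
        return "<line>" + item.toUpperCase() + "</line>";
    }

    // Because applyTemplate has no side effects, the items can be transformed
    // concurrently. The list-backed stream is ordered, so joining() still
    // concatenates the fragments in the original (document) order.
    static String transformAll(List<String> items) {
        return items.parallelStream()
                    .map(ParallelTemplates::applyTemplate)
                    .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.println(transformAll(Arrays.asList("alpha", "beta", "gamma")));
    }
}
```

The programmer states what each item becomes, not the order in which items are visited; that freedom is what a compiler can exploit when mapping the work onto parallel engines.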

In the appliance case, the compiler software is fully aware of the hardware architecture beneath it, ensuring that the maximum processing efficiencies can be gained.

Enterprise-class solution components

Any solution designed for business-critical applications within the enterprise datacenter also demands reliability, ease of use, availability and scalability features. These features should include:

- software tools for rapid deployment

- configuration tools for web services

- interfaces to standard SNMP-based management utilities

- fully redundant single server platform

- support for failover and load balancing

- standard software APIs for integration with web services

- adherence to datacenter industry standards

Ease of Integration and Deployment

Integration with standard web services, data from databases, and data from other applications that run on other platforms are all critical in today's data center environment. This integration is best done using standard APIs that are being supported in the marketplace. Hardware-based platforms will be required to interface to these APIs and provide key features and functions. This will be necessary to take advantage of the price performance advantages offered by hardware that includes configurable parsing, validation with schemas or DTDs, and transformation using XSLTs. All functions should be available using standard APIs.

For example, the Java platform aids the developer building XML-based applications with APIs that include:

§ The Java API for XML Processing (JAXP) that allows applications to parse and transform XML documents using an API that is independent of any particular XML processor implementation.

§ The Java Architecture for XML Binding (JAXB)

§ Long-Term Persistence for JavaBeans (the XMLEncoder and XMLDecoder classes in java.beans)
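As an illustration of what "validation with schemas" looks like through a standard, implementation-independent API, the following minimal sketch uses the `javax.xml.validation` package that is part of JAXP 1.3 and later. The class `ValidationSketch`, its methods, and the tiny schema are assumptions invented for this example; a hardware-backed processor would sit behind the same factory calls only if its vendor supplied a JAXP implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class ValidationSketch {
    // Hypothetical schema: an <order> element with a required integer "id" attribute.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='order'><xs:complexType>"
      + "<xs:attribute name='id' type='xs:int' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    // Compile the schema once; a Schema object can be reused for many documents.
    static Schema compile(String xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema did not compile", e);
        }
    }

    // Returns true when the document is well-formed and valid against the schema.
    static boolean isValid(Schema schema, String xml) {
        try {
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // with no ErrorHandler set, validation errors are thrown
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = compile(XSD);
        System.out.println(isValid(schema, "<order id='42'/>"));
        System.out.println(isValid(schema, "<order/>"));
    }
}
```

The application code names no particular parser or validator; the factory lookup decides which implementation does the work.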

Control and Management

As a high-availability, high-performance platform, the XML hardware device should provide the highest levels of robustness possible. The data path (the web services and documents being processed) should therefore be kept separate from the control path (configuration information, monitoring processes and device logging).

The separate data path and management ports should include:

§ Separate network data connections for data and control paths

§ Separate processor subsystems for data and control paths

§ Encryption support for both paths

The hardware configuration and management capabilities should include:

§ Non-intrusive secure management port to retrieve log files, error reports, statistical analysis etc.

A Management Information Base (MIB), accessed over TCP/IP via the Simple Network Management Protocol (SNMP), allows network administrators to monitor and manage devices connected to the network and to check statistics such as bytes sent and received, fragmented packets and dropped packets. Important considerations here include:

§ GUI to establish communication between client, database and application server

§ Configuration as an application-server-like device when an application is running on the hardware platform

Robustness

In addition to being able to observe system behavior, error logging, and even predictive error heuristics, the system should be able to operate with transitory faults or even with failing hardware with degraded, but reasonable performance. Important hardware system attributes here include:

§ ECC protected memory for detecting and correcting errors

§ Fail-over mechanism to detect when one or more memory modules fail, with the ability to redirect data to another memory module and run with a limited number of memory banks

§ Redundant architecture that allows de-rated performance and/or fail-over mechanism to re-direct the data to another network port to keep processing data

§ Software monitor to determine health of server and report or alert of hard failures and soft failure trends that suggest impending hard failures
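The last item, spotting soft-failure trends before they become hard failures, can be sketched as a sliding-window counter. This is a minimal illustration under stated assumptions: the class `SoftErrorMonitor`, the window length and the threshold are all hypothetical, and a real monitor would take its events from the platform's ECC and port error counters.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SoftErrorMonitor {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final long windowMillis; // how far back to look (assumed value)
    private final int threshold;     // soft errors in the window that trigger an alert (assumed value)

    SoftErrorMonitor(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    // Record one corrected (soft) error at the given time. Returns true when the
    // number of soft errors inside the window has reached the threshold, which
    // this sketch treats as a trend suggesting an impending hard failure.
    boolean record(long nowMillis) {
        timestamps.addLast(nowMillis);
        // Drop events that have aged out of the window.
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() > windowMillis) {
            timestamps.removeFirst();
        }
        return timestamps.size() >= threshold;
    }

    public static void main(String[] args) {
        SoftErrorMonitor monitor = new SoftErrorMonitor(1000, 3);
        long[] errorTimes = {0, 400, 800};
        for (long t : errorTimes) {
            System.out.println("t=" + t + " alert=" + monitor.record(t));
        }
    }
}
```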

Cost of Ownership Benefits

Enterprises deploying these server appliance-based systems and fully offloading the parsing, validation and transformation of XML-based data can realize significant TCO benefits.

For example, a rapidly-growing regional bank is undergoing client-system migration to a .NET architecture in which each of the bank branches will be totally redesigned and new IT infrastructure implemented.

Amongst other functions, the new branch systems transmit transaction logs, both in real time and in batch, to the corporate office via XML, with each message potentially processed by up to 15 different applications at the bank headquarters, all based on different server and application architectures.

With message volumes currently running at approximately 20 million per month and growing at 40% per year, the bank anticipates that significant additional servers will be required to manage the growth in XML messaging traffic.
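To make the growth figure concrete, the compound-growth arithmetic can be sketched as follows. The starting volume and growth rate are the ones quoted above; the three-year horizon and the class name `VolumeProjection` are assumptions chosen only for illustration.

```java
class VolumeProjection {
    // Monthly message volume after a number of whole years of compound growth.
    static double monthlyVolume(double baseVolume, double annualGrowth, int years) {
        return baseVolume * Math.pow(1.0 + annualGrowth, years);
    }

    public static void main(String[] args) {
        double base = 20_000_000.0; // messages per month, from the text
        double growth = 0.40;       // 40% per year, from the text
        for (int year = 0; year <= 3; year++) { // horizon is an assumption
            System.out.printf("year %d: %.1f million messages per month%n",
                              year, monthlyVolume(base, growth, year) / 1_000_000.0);
        }
    }
}
```

At 40% per year the monthly volume roughly doubles in about two years (20 million becomes 39.2 million), which is why capacity added today is quickly outgrown.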

The solution deployment to process this increased workload involves significant expense in the following areas:

- Additional servers per application

- Additional storage costs

- Additional software licenses per application

- Additional network infrastructure costs

- Additional disaster recovery and backup costs

- Additional power

- Additional datacenter space

- Additional installation and testing

- Additional management costs

- Additional sparing

In addition, the bank expects that the transformation of XML to a number of proprietary formats will involve significant programming efforts with a number of different applications requiring modification to utilize the new data formats.

Over the lifetime of the additional systems, the bank expects the additional cost of server infrastructure, solutions development and management to approach $2,000,000.

The alternative is to deploy an architectural model that offloads this processing from costly application servers and will allow the enterprise to seamlessly manage this data transformation.

The deployment of such a solution can immediately lower acquisition costs, reduce deployment time, simplify management and add availability. For this customer, it is anticipated that savings over the lifetime of the system will exceed 90%, with the TCO of the optimized server appliance below $200,000.
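The savings percentage follows directly from the two lifetime figures quoted for this customer. The sketch below just restates that arithmetic; the class name `TcoComparison` is hypothetical, and the appliance figure is the upper bound given in the text, so the computed saving is a lower bound.

```java
class TcoComparison {
    // Fraction of the baseline cost saved by choosing the alternative.
    static double savingsFraction(double baselineCost, double alternativeCost) {
        return 1.0 - alternativeCost / baselineCost;
    }

    public static void main(String[] args) {
        double serverFarm = 2_000_000.0; // lifetime cost of the additional servers, from the text
        double appliance = 200_000.0;    // upper bound on appliance TCO, from the text
        System.out.printf("Savings: at least %.0f%%%n",
                          100.0 * savingsFraction(serverFarm, appliance));
    }
}
```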

These massive savings in acquisition and lifetime costs are being repeatedly demonstrated as customers deploy these solutions in the XML and web services-enabled datacenter.

Summary - Server Appliances offer a better way

As enterprise customers continue to search for architectures that can quickly adapt to the rapidly morphing business environment, XML (as the data standard) and web services (as the processing standard) will become the foundation on which the majority of business logic is built.

Most enterprise customers cannot flip a switch to full Service Orientation. Legacy systems and applications abound in today's datacenter environment and will continue to remain critical to the health and success of the business. However, by basing the architecture on Web Services and XML, legacy systems and data can be viewed as an asset, and the organization can evolve to a fully enabled Service-Oriented Architecture in a controlled, step-wise fashion.

Web Service appliances built specifically for the processing and manipulation of XML data can utilize architecturally advantaged methods to address both performance and cost challenges. This places extensible web services and the organizations that use them on a very solid platform for growth and expansion for years to come.

Biography

As CEO of Conformative Systems, John is dedicated to driving the adoption of Web Services using XML and building industry-leading solutions that allow enterprises to drive down the cost of implementing high-performance XML-based applications. John’s most recent venture was as founder and CEO of Chicory Systems, a semiconductor IP startup. Chicory received $5.3M in venture funding and exited for approximately $50M in cash and stock within 18 months of being started. Previously, John held a number of senior engineering positions at IBM, where he developed servers, server chipsets and processors. John has been issued 19 patents in various fields including data processing, processor architecture, and systems optimization. He has a similar number that have been filed and are under review. However, his real skills are in identifying enterprise needs in emerging technologies, and building and managing teams and solutions to fulfill those needs. John holds a BSEE from Iowa State University and has performed focused graduate studies at ISU, FAU, NTU (UMass-Amherst, UC-Berkeley) and other sources. Areas of study included hardware technology, software technology, project management, and analysis.