Keywords: Enterprise Information Integration, Virtual Data Federation, data integration, metadata repository, business intelligence, customer relationship management, portal, data aggregation, mapping and transformation, data warehouse, operational data store, enterprise application integration, extract, transform, load
Biography
Stephen Gantz is an enterprise architect with Blueprint Technologies, a professional services firm specializing in architecture solutions for the federal and commercial markets. Blueprint provides teams of expert enterprise and solution architects who integrate best practices, leading technologies, and disciplined approaches to consistently deliver outstanding, best-value results. Steve has 13 years of experience in technology-related professional services and software development, primarily as an IT architect designing e-commerce, enterprise application integration, customer relationship management, and security systems and infrastructures. He is a Certified Information Systems Security Professional (CISSP) and is also certified in vulnerability assessment and ethical hacking. Steve’s industry expertise includes federal civilian and state government, defense and intelligence, financial services, insurance, retail, telecommunications, and higher education. His areas of technical expertise include customer relationship management (CRM) and enterprise resource planning (ERP) applications, middleware technologies, security and e-commerce systems architecture, and data transport and exchange using EDI and XML. He is a regular speaker at industry events on enterprise application integration, security, and XML. He holds a master’s degree in technology policy from the John F. Kennedy School of Government at Harvard University, as well as a bachelor’s degree in applied mathematics and statistics from Harvard.
Among the popular emerging integration needs in the market today is information aggregation, normalization, and presentation from multiple back-end data sources to front-end applications. Termed Enterprise Information Integration by some vendors in the market, this type of solution relies on a centralized common object model to provide a data access interface to client applications. Applications can use this common interface to request data from one or more data sources in a single query, with the intricate details of resolving the query left to the integration tool. This session will explain the architecture of an enterprise information integration solution in general, highlight some of the vendors and their approaches in this market space, and explain the use of such a solution through a real-world example with a large financial services organization.
This sort of solution has four primary components: the data source systems that own the original data; a central normalization layer; a transformation and formatting capability; and the presentation or front-end application. Formats of data sources are mapped to the central normalization layer’s common object model during the design phase, with the resulting mappings often stored and managed in a metadata repository. Depending on requirements – such as data volume, query frequency, and the type of data involved – the central normalization layer can be implemented as a metadata repository alone (with the data remaining in its origin system and format), or as a data warehouse (with the source data copied and normalized into the central repository).
A recent integration initiative at a large financial services company in New York highlights the leading role XML-based technologies can play in this type of solution. The project requirements included a blend of traditional ETL, metadata management, mapping and transformation, and transactional application access using web services-based interfaces. This company put together an end-to-end solution using a set of best-of-breed technologies for each of its core requirements, and examined in detail the relative benefits of adopting a metadata-repository-only approach versus a more conventional intermediate-data-warehouse architecture.
The mapping and transformation component of this type of solution has most commonly been delivered using a commercial mapping and transformation engine, either as a standalone component or as part of an enterprise application integration (EAI) product. Many of the leading products on the market offer script-based or programmatic mapping and transformation logic, which customers can use to build the necessary functionality into their solution. With the rise of XML as a preferred intermediate data storage format in the integration market, in more recent releases many tools have begun to support or, in a few cases, rely on XSLT to provide mapping and transformation logic.
1. Introduction
2. Data Integration Solution Space
3. Case Study
This paper focuses on the role of XML tools and technologies to enable a class of data integration solutions becoming increasingly popular in the marketplace. In addition, it will address the high-level characteristics of business problems to which the solutions can be applied, and present examples of the specific types of problems that are particularly well-suited to this type of solution. The description of the solution architecture will include all its major elements, but for this audience, the emphasis of the paper is on the components and features of the solution architecture that either depend on or are facilitated by XML technology. Finally, in an effort to illustrate the effective use of this type of solution, the paper will present a case study, based on the experience of a large financial services company, to design an overall solution architecture and select the technical components necessary to implement it.
While many technical aspects of this class of solution are not new, its emergence is accompanied by new labels: vendors seem to prefer the term “enterprise information integration” or “EII” to describe this type of solution, while the analyst community and other observers favor “virtual data federation,” a phrase which describes the typical architecture of such a solution. The point of this type of solution is to provide normalized access to data from multiple disparate data sources in an aggregated or integrated view of the data – something that is hardly a new data integration requirement. EII is differentiated from other integration approaches by leaving the data in their systems of record and executing distributed queries against those data sources. In contrast, more conventional data integration solutions – such as data warehouses and operational data stores – usually involve creating a centralized physical data store and copying source data into the central database. By leaving the data in place, EII solutions aim to provide real-time data aggregation, and as such vendors marketing EII solutions tend to emphasize their use for time-sensitive applications such as business intelligence and customer relationship management.
This paper is structured in two sections: the first provides an overview of the data integration solution space, to place the EII approach in proper context, and a brief description of the four primary components of an EII solution architecture. The second section presents a case study from the financial services sector, including a selection of appropriate technologies and overall solution design.
Enterprise Information Integration (EII) is the latest popular categorization of data integration across multiple disparate sources. It is important to note at the outset that despite vendor claims and industry marketing hype, EII is not a technology, but rather a functional objective. Moreover, it can be misleading to attempt to make technical distinctions between EII and actual integration tools such as enterprise application integration (EAI) or extract, transform, load (ETL) products. EII has many potential uses, but is often positioned as particularly appropriate for data aggregation needs to support front-end applications such as business intelligence tools; presentation-layer aggregation interfaces such as enterprise portals; and consolidated views across an enterprise of key entities, such as customers, products, and orders.
As described later in this section, EII solutions often comprise a variety of technical components from different toolsets, including mapping and transformation tools such as those commonly sold by EAI and ETL vendors. The key point highlighted in this section is that, regardless of the solution architecture involved, data integration solutions consist of the same basic set of functional tiers:
Complementary components for virtual data federation solutions, including those marketed as EII packages, deliver data access capabilities somewhat similar to those seen in conventional data integration or EAI architectures. These include adapters and related interface mechanisms to provide access to data in different kinds of systems, including databases, applications, legacy systems, and file formats. Virtual data federation solutions also depend on distributed query handling capabilities, to break incoming queries into sub-queries according to the source systems containing the data requested by the query.
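The distributed query handling described above can be illustrated with a minimal sketch. It assumes a hypothetical catalog (`FIELD_SOURCES`) recording which source system owns each field of the common model, and hypothetical in-memory adapters standing in for real database, application, or file interfaces; none of these names come from any particular EII product.

```python
from collections import defaultdict

# Hypothetical catalog: which source system owns each common-model field.
FIELD_SOURCES = {
    "customer_name": "crm",
    "customer_email": "crm",
    "account_balance": "accounts_db",
    "open_orders": "order_system",
}

def split_query(requested_fields, key):
    """Group requested fields by owning source: one sub-query per source."""
    by_source = defaultdict(list)
    for field in requested_fields:
        by_source[FIELD_SOURCES[field]].append(field)
    return [{"source": s, "fields": f, "key": key} for s, f in by_source.items()]

def run_federated(requested_fields, key, adapters):
    """Send each sub-query to its adapter and merge the partial results."""
    merged = {}
    for sub in split_query(requested_fields, key):
        merged.update(adapters[sub["source"]](sub["fields"], sub["key"]))
    return merged

def make_adapter(rows):
    """Hypothetical adapter backed by an in-memory dict instead of a live system."""
    def adapter(fields, key):
        return {field: rows[key][field] for field in fields}
    return adapter

ADAPTERS = {
    "crm": make_adapter(
        {"C1": {"customer_name": "Ada Example", "customer_email": "ada@example.com"}}
    ),
    "accounts_db": make_adapter({"C1": {"account_balance": 250.0}}),
    "order_system": make_adapter({"C1": {"open_orders": 2}}),
}

result = run_federated(["customer_name", "account_balance"], "C1", ADAPTERS)
```

A production query handler would also have to deal with joins across sources, partial failures, and time-outs, which this sketch deliberately leaves out.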
Data integration for purposes of aggregation and analysis is a long-standing need in many organizations, and the solution space for data integration technology is quite mature. While there are a variety of different approaches, architectures, and technologies applicable for data integration requirements, the fundamental requirement of a data integration solution is the same: query, retrieval, and aggregation of data stored in multiple sources for the purpose of unified access, analysis, or consumption by a user or system. Because the basic technical needs for data integration are common to many approaches, it is often easier to distinguish various solution alternatives based on differences in their architecture, rather than their function. The most conventional approach to data integration involves creating a new, physically distinct repository and copying appropriate data into it, to provide a single data source instead of many. In contrast, the virtual data federation approach espoused with EII executes distributed queries across multiple databases, aggregating the results into a virtual database or unified view. The obvious difference is the presence (with data warehousing) or absence (with EII) of a physical database to house the aggregated data. Leaving the data in its source databases, rather than copying it to a central repository, offers advantages in certain situations, which is precisely the point: neither approach is necessarily better than the other, but each may be optimal for addressing specific business problems or data integration use cases.
Among the most common approaches to data integration – particularly for large-scale integration needs – is data warehousing, which involves standing up a central data storage capability to provide the unified access, and populating it with data from as many source systems as required. Data warehouses rely on capabilities to inspect source databases, and map and convert the source data into a normalized data structure stored in the central repository. An entire class of software exists to provide these extract, transform, and load capabilities. ETL tools, to generalize somewhat, are intended and optimized to work with large amounts of data in batches, and to rebuild the data warehouse entirely each time it is refreshed. Due to the volume and the full rebuild characteristics, the data residing in data warehouses is almost always historical, and rarely more recent than 12 hours old.
The need by many organizations to aggregate and centralize large amounts of data, but to include more current data in the central repository, gave rise to the operational data store (ODS) type of solution, which has many characteristics in common with a data warehouse but which is updated in a different way. In contrast to the use of batch ETL tools to populate data in the repository, an ODS typically uses alternate data mapping and transformation tools, such as message brokers sold by EAI vendors, to refresh the database much more frequently with new data. So to simplify the comparison somewhat, data warehouse and operational data store solution architectures are often comparable in terms of their data, aggregation, and application tiers, but tend to differ in their mapping and transformation tiers.
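The difference in refresh style between the two architectures can be sketched in a few lines. The example below is an illustration only: the record layout, the `normalize` transformation, and the in-memory dictionary standing in for the central repository are all assumptions, not features of any particular ETL tool or message broker.

```python
def normalize(row):
    """Hypothetical transformation from a source layout to the central layout."""
    return {"id": row["ID"], "balance": float(row["BAL"])}

def full_rebuild(sources, transform):
    """Data warehouse style: discard the old store and reload every row in a batch."""
    store = {}
    for rows in sources:
        for row in rows:
            record = transform(row)
            store[record["id"]] = record
    return store

def apply_change(store, message, transform):
    """ODS style: apply one change message to the existing store as it arrives."""
    record = transform(message)
    store[record["id"]] = record
    return store

# Nightly batch load from two hypothetical source extracts.
store = full_rebuild([[{"ID": 1, "BAL": "10.5"}], [{"ID": 2, "BAL": "3"}]], normalize)
# Later, a single update arrives and is applied without rebuilding the store.
apply_change(store, {"ID": 1, "BAL": "99"}, normalize)
```

Both functions share the same transformation step, which mirrors the point made above: the two architectures differ mainly in how and how often that step is driven.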
This brings the discussion again to virtual data federation. The “virtual” in the label refers to the fact that the solution architecture for this approach has no central physical repository, but relies instead on building purpose-specific views as the result of real-time distributed queries. Being able to execute simultaneous distributed queries in real time across multiple data sources is hardly a trivial feat, however. To make this functionality available to data consumers in the application tier, solution developers may need to invest significant effort in data mapping and normalization, to allow the aggregation tier to present disparate data in a common format. While there is more than one way to accomplish this task, the most popular approach is the use of a metadata repository, which contains all relevant information required about the data contained in source systems, as well as the configuration necessary to access those systems and retrieve the appropriate data to fulfill a query request. Metadata repositories are common adjuncts to many technical architectures and data warehousing solutions as well – these also benefit from having a single record of all to-be-integrated data and its characteristics – but the emergence of EII solutions has brought to market XML-based metadata repositories optimized both for data source- and field-level definition, and for facilitating single queries executed against multiple databases. Similarly, the suitability of XML as the common format available to front-end applications within a virtual data federation solution makes the use of web services engines a natural complement for providing XML transformation and publication capabilities.
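As a rough illustration of how a metadata repository supports normalization into a common XML format, the sketch below keeps field-level mappings in a dictionary and uses them to rename source fields before publishing a unified view. The source names, field names, and the `customerView` element are hypothetical; a commercial repository would hold far richer definitions, including connection configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata repository: for each source system, how its native
# field names map onto the common object model.
METADATA = {
    "crm": {"cust_id": "customerId", "cust_nm": "name"},
    "accounts_db": {"CUST": "customerId", "ACCT_BAL": "balance"},
}

def normalize(source, record):
    """Rename a source record's fields to the common model; drop unmapped fields."""
    mapping = METADATA[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

def build_view(records_by_source):
    """Merge normalized records from several sources into one XML document."""
    merged = {}
    for source, record in records_by_source.items():
        merged.update(normalize(source, record))
    root = ET.Element("customerView")
    for name in sorted(merged):
        ET.SubElement(root, name).text = str(merged[name])
    return ET.tostring(root, encoding="unicode")

view = build_view({
    "crm": {"cust_id": "C1", "cust_nm": "Ada Example", "internal_flag": "x"},
    "accounts_db": {"CUST": "C1", "ACCT_BAL": 250.0},
})
```

Because the mappings live in data rather than in code, adding a source in this sketch means adding an entry to `METADATA`, which is the main appeal of the repository approach.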
Virtual data federation’s lack of a persistent central physical data store, along with EII’s emphasis on real-time query execution, limits the scope of business problems to which these solutions can be applied. In particular, the fact alone that EII executes queries against live production systems may preclude its use in some situations. In addition, EII solutions are only designed to provide read access to the data they retrieve; if the ability to change data within the aggregate view is needed, compensating integration transactions must be developed to write data back to the source systems. While the ultimate suitability to task of any given type of solution depends on the functional and technical requirements of the specific business problem to be addressed, there are several key characteristics that can help determine if a virtual data federation solution is appropriate, as summarized in the table below.
| Attribute | Virtual Data Federation | Data Warehouse/ODS |
| --- | --- | --- |
| Data Volume per Transaction | Small | Large |
| Transaction Volume | Small to Medium | Large |
| Data Currency | Real-time | Historical/Near real-time |
| Impact to Production Systems | At run-time | Only during refresh |
| Aggregation Persistence | No | Yes |
| Data Access | Read-only | Read/Write |
| Level of Access Control | Source Systems | Central Repository |
As should be clear at this point, virtual data federation is not a new concept, nor are the enabling technologies in question. The ease of use and applicability of EII solutions, however, has been greatly enhanced by the emergence and maturation of various XML technologies and standards, including XML Metadata Interchange (XMI) and XQuery. An increasing number of vendors are not just supporting XML as a normalized data format, but requiring it or building it into their solutions. For example, BEA’s Liquid Data EII offering uses XML in the aggregation tier, and also relies on XML queries written into the application tier. As evidenced in the case study at the end of this paper, the emphasis on XML by EII vendors is well-suited to financial organizations, which are frequent users of XML for internally focused application integration and data aggregation and distribution purposes, and are also leading the dissemination of XML standards for external data sharing and exchange.
Beginning early in 2004, a leading financial services company in New York began developing an architecture, known as the XML Portal, to provide customers electronic access to their account information. As many of the company’s customers had more than one account with the firm, but were interested in seeing a single view of their holdings, the XML Portal was designed to extract information contained in multiple, disparate systems and present that information to externally and internally accessible systems and sites in a consolidated view.
Although the firm recognized that customers had an interest in direct self-service access to their account and holding information, the primary intended users of the XML Portal were the company’s employees. These employees include both investment advisors and securities brokers, whose access to customer data is not the same in every source system, so appropriate filters needed to be put in place to control the information returned to each type of user.
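The kind of filter described here can be sketched as a lookup keyed by both user type and source system, since access differs per source. The role names follow the text, but the source system names, field names, and the specific rules in `ACCESS_RULES` are purely hypothetical and do not reflect the company's actual policy.

```python
# Hypothetical rules: which fields each user type may see from each source system.
ACCESS_RULES = {
    ("investment_advisor", "portfolio_system"): {"holdings", "risk_profile"},
    ("investment_advisor", "trading_system"): set(),
    ("securities_broker", "portfolio_system"): {"holdings"},
    ("securities_broker", "trading_system"): {"open_orders"},
}

def filter_record(role, source, record):
    """Return only the fields this role may see from this source (deny by default)."""
    allowed = ACCESS_RULES.get((role, source), set())
    return {field: value for field, value in record.items() if field in allowed}

def build_user_view(role, records_by_source):
    """Apply the per-source filter before merging records into one view."""
    view = {}
    for source, record in records_by_source.items():
        view.update(filter_record(role, source, record))
    return view

RECORDS = {
    "portfolio_system": {"holdings": ["ABC", "XYZ"], "risk_profile": "moderate"},
    "trading_system": {"open_orders": 3},
}
advisor_view = build_user_view("investment_advisor", RECORDS)
broker_view = build_user_view("securities_broker", RECORDS)
```

Filtering before the merge, rather than after, keeps restricted fields out of the aggregated view entirely, which is one reasonable design choice among several.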
Information delivered to front-end applications and users was specified to be delivered in three ways:
Although the XML Portal was a new application, the company wanted to avoid the time and resource expenditure of building a custom solution from scratch. While they believed that no single commercially available offering met their needs, they thought they could combine the capabilities of several technologies to build a complete solution. The company evaluated several technical components of EII solutions, although ultimately they chose not to implement a virtual data federation architecture, due in part to the large volume of information involved in the XML Portal. The solution architecture design was further influenced by existing technical capabilities and systems in place at the company, including an intermediate central database used with custom-developed batch routines to perform ETL functions. A high-level view of the problem domain for the XML Portal appears in the accompanying diagram.

Major project activities undertaken for the XML Portal initiative included the following:
Technology Evaluation: The company performed a market research study and vendor analysis to identify, evaluate, and define the technology components envisioned for the XML Portal architecture stack, including XML transformation tools, metadata repository and tagging tools, and ETL tools. Key vendors evaluated included BEA, MetaMatrix, and Composite Software in the metadata repository area, and Informatica and Ascential for ETL.
Metadata Layer Development: To help facilitate the dynamic transformations and extractions, the company needed to define a metadata schema that would better organize and categorize the information within the intermediate data store and other systems, leveraging that information to create XML transformations and documents that can be presented through a portlet or web services-based interfaces.
Data Mapping and Transformation: Although a version of the ETL layer exists in the current environment, the company needed to review the existing data mappings between the existing data sources and the intermediate data store to ensure that they are optimized for the new architecture and metadata strategy. Additionally, a second set of transformation logic needs to be developed within the mapping and transformation tier that reacts to dynamic requests generated by the web services interface or a portlet exposed on the intranet and extranet sites.
Web Services Interface Development: In order to provide a dynamic, automated interface to key users and clients, the company sought to develop an XML-RPC web services-based interface that will expose a finite set of queries to a pre-defined customer set, with appropriate access control provisions and within approved performance guidelines.
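An interface of this shape, a fixed set of named queries with an access check on each call, can be sketched with the XML-RPC support in Python's standard library. The method name `portal.getHoldings`, the caller identifiers, and the in-memory data are hypothetical. The sketch drives the dispatcher directly through its internal `_marshaled_dispatch` method so that it runs without opening a network socket; a deployed service would normally use `SimpleXMLRPCServer` instead.

```python
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCDispatcher

# Hypothetical stand-in for the intermediate data store.
HOLDINGS = {"C1": [{"symbol": "ABC", "quantity": 100}]}
# Hypothetical access list: which customers each caller may query.
ALLOWED_CUSTOMERS = {"advisor-7": {"C1"}}

def get_holdings(caller_id, customer_id):
    """One of a finite set of exposed queries, with an access check."""
    if customer_id not in ALLOWED_CUSTOMERS.get(caller_id, set()):
        raise PermissionError("caller may not view this customer")
    return HOLDINGS.get(customer_id, [])

# Only explicitly registered functions are callable; anything else is rejected.
dispatcher = SimpleXMLRPCDispatcher(allow_none=True)
dispatcher.register_function(get_holdings, "portal.getHoldings")

def call(method, *params):
    """Round-trip a request through XML-RPC marshalling without a network."""
    request = xmlrpc.client.dumps(params, methodname=method).encode("utf-8")
    response = dispatcher._marshaled_dispatch(request)
    return xmlrpc.client.loads(response)[0][0]

holdings = call("portal.getHoldings", "advisor-7", "C1")
```

Calls to unregistered methods, or calls that fail the access check, come back as XML-RPC faults, which `xmlrpc.client.loads` raises as `xmlrpc.client.Fault`.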
The result of the evaluation phase, perhaps unsurprisingly, was a determination that the company could use the querying, XML transformation and publishing, and metadata repository components of a virtual data federation solution to great advantage. The majority of the data aggregation activity, however, remained in a central intermediate data store, due primarily to concern about the negative performance impact on source systems of exposing them to unpredictable volumes of real-time query access. The company has emphasized the metadata repository in its detailed design and early development phases, to provide front-end application developers the unified data view they sought while leaving open the option of adding additional data sources and types to the integrated view, either through direct access or augmentation of the central repository.