XML 2003

XML Unifies Content Migration

Abstract

Content and document management systems are among the earliest applications of Extensible Markup Language (XML), which was a natural evolution from the SGML roots of many electronic publishing systems. Content management systems drove the adoption and use of XML metadata to describe content stored in these systems, as well as object repositories and object-relational databases to store content described in XML formats. As content management systems have been extended and optimized for content delivery to the web, web content management system (CMS) tools have become familiar adjuncts to web sites, portals, and other web applications. Efforts to select and implement content management systems often emphasize XML handling capabilities for authoring, categorization through metadata tagging, and multi-purpose content delivery.

Less often addressed are capabilities to support content migration from existing sources into a CMS and corresponding content conversion or transformation between formats. Many organizations quickly realize two things: first, converting unstructured content such as Word documents or even static hypertext markup language (HTML) pages into a more structured format for the purposes of storage in a CMS can be very challenging, since translating from little structure to lots of structure requires explicit guidelines and categorization rules. Second, and related to the first, effective content conversion presumes, and in fact demands, the existence of metadata standards or specified tagging structures. Too many organizations intend to adopt centralized content management architectures without having explicit information architectures or metadata standards in place.

For organizations in this type of situation (implementing a CMS with the intention of moving content from multiple sites, divisions, business units, and other sources into a single repository structure), available XML technologies and supporting processes can greatly facilitate the implementation. Content migration and conversion can cover so diverse a range of sources that they threaten effective enterprise-wide rollout. One approach to mitigating this challenge is the establishment and use of XML metadata standards in the context of a content "migration factory" process that covers content inventory and analysis, content categorization, content conversion and transformation, metadata tagging, and formal migration into the CMS.

This concept can best be explained through the use of real-world examples. This presentation will highlight two case studies from the public sector — one at the state level, one federal — involving large-scale content migration efforts during the implementation of content management systems. Attendees should expect to learn some of the major challenges involved in standing up enterprise-wide CMS solutions, and several ways that standards can facilitate the process. While not all standards need to be based on XML, relevant successful experiences leveraging XML metadata standards for information architecture, taxonomy, and tagging will demonstrate the benefits of XML in addressing this need.

Table of Contents

1. Introduction
2. Centralized Content Management Approaches
3. Real-world Content Migration Examples
4. Conclusion
Glossary
Biography

1. Introduction

As content management systems have been extended and optimized for content delivery to the web, web CMS tools have become familiar adjuncts to web sites, portals, and other web applications. Efforts to select and implement content management systems often emphasize XML handling capabilities for authoring, categorization through metadata tagging, and multi-purpose content delivery. Less often addressed are capabilities to support content migration from existing sources into a CMS and corresponding content conversion or transformation between formats.

Many organizations quickly realize two things: first, converting unstructured content such as Word documents or even static HTML pages into a more structured format for the purposes of storage in a CMS can be very challenging, since translating from little structure to lots of structure requires explicit guidelines and categorization rules. Second, and related to the first, effective content conversion presumes, and in fact demands, the existence of metadata standards or specified tagging structures. Too many organizations intend to adopt centralized content management architectures without having explicit information architectures or metadata standards in place.

Both of these challenges can be difficult to overcome, from either a time or a level-of-effort standpoint, but characteristics inherent to XML can help, particularly as implemented by content management tool vendors. As usual where XML solutions are concerned, much of the difficulty in overcoming content migration challenges stems from process or governance rather than technology. Initiatives to implement centralized content management face many non-technical decision points, including decisions about data formats, metadata, and whether to make use of existing standards. The examples cited in this paper highlight two different approaches to content migration related to a centralized content management initiative. The approaches used in each case were completely different, but in both cases the decision to adopt XML-based content standards was an important factor in their success.

2. Centralized Content Management Approaches

Many organizations fail to realize the need for, or benefits of, metadata standards until they are already in the midst of a content management initiative. Still others may understand the utility of metadata standards, but are either unwilling or unable to develop or select them, or to reach any level of consensus as to what metadata should be used. Several technology tools have emerged to help this type of organization, including offerings from content management vendors such as Interwoven and Documentum, and search vendors such as Autonomy and Verity. While the details vary from product to product, the core proposition is the same: direct the tool at a variety of content, and the tool suggests appropriate metadata based on automated analysis of the content. This is essentially a bottom-up approach, in contrast to the top-down, metadata-first approach, and is attractive to many organizations because it applies appropriate metadata to the content they already have. Of course, “appropriate” is in the eye of the tool, and its suggestions may or may not meet actual organizational requirements. Of the two typical approaches to centralized content management, this approach is the more common.

The optimal approach to centralized content management begins not with content management systems, or even XML, but with metadata. Metadata standards can be specific to the organization, which is especially helpful in industry-specific content environments, or can be broader and more general. Similarly, metadata standards can range from just a few standard elements and attributes to hundreds or thousands. Whether metadata standards are developed in house or adapted from externally available sources, XML has emerged as the most suitable means for expressing metadata.
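To make the idea of expressing metadata in XML concrete, the following is a minimal sketch that builds a small metadata record using element names from the Dublin Core element set, one example of an externally available standard. The `<metadata>` wrapper element, the `build_metadata_record` helper, and all sample values are illustrative assumptions and are not taken from either project described in this paper.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_metadata_record(fields):
    """Return a <metadata> element with one dc:* child per field.

    `fields` maps Dublin Core element names (such as "title") to values.
    The <metadata> wrapper is an illustrative choice, not part of a standard.
    """
    record = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical sample values, invented for this sketch.
record = build_metadata_record({
    "title": "Renewing a Driver's License",
    "creator": "Department of Public Safety",
    "subject": "driver licensing",
    "date": "2003-10-01",
    "format": "text/html",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

An organization-specific standard would follow the same pattern with its own namespace and element names; the point is only that a handful of agreed elements is enough to give every content item a consistent, machine-readable description.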

There is of course a third content management approach, which is quite common among organizations today: decentralized content management. In this model each business unit or user community handles its own content, in whatever formats it prefers, using its own system or systems. To provide content access (if not management per se) across an enterprise using such a decentralized approach, content is aggregated, assembled, or otherwise integrated in another application, such as a portal. The broad support enjoyed by XML among current CMS vendors raises the possibility of using enterprise-wide content standards, even if the content management systems are different.

3. Real-world Content Migration Examples

The State of Minnesota embarked on a state-wide portal (the North Star Portal) effort over two years ago, intended to provide a single point of access to all state government information and services. This initiative first addressed access to existing content, online applications, and state agency websites, with an eventual plan to provide centralized content management as well as centralized application components to drive state agency transactions. As is typical in such projects, the original portal framework put in place could only become valuable when it integrated or provided access to a wide variety of content and services. The state chose to implement an XML-based content management system to support the North Star Portal and to provide a common destination for content migrated from other locations to the portal.

For a variety of reasons, the State of Minnesota selected BroadVision's One-to-One Content Management System, technology once known as Blade Runner and acquired with Interleaf. The BroadVision CMS, although built on a specialized version of Oracle, is a repository that not only supports XML but requires it, storing content and related metadata only as XML. As the state determined the content types it wanted to use for its portal, the project team developed XML Document Type Definitions (DTDs) to correspond to each content type. This set of DTDs provided a content format library for state agencies and their content managers to use when migrating content to the portal. The state implemented a process and organizational capability known as a “content migration factory,” which orchestrated the conversion of existing content to the new environment, including placing content within the site's information architecture, assigning appropriate metadata, and checking converted content into the CMS repository to enable versioning and other library services. Content that was not migrated to the centralized repository (either because it did not fit into the current information architecture or because the content migration teams simply had not gotten to it yet) was still presented and accessed through the portal along with CMS content, through indexing and retrieval using a search engine. This allows the state to come closer to delivering on its vision of a single point of access to information without limiting available information to the content that has already been migrated.
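The convert, check, and check-in steps of a migration factory can be sketched in a few lines. This is an illustration under stated assumptions, not the state's actual process: the content types, element names, and sample values are hypothetical, a plain dictionary stands in for the DTD library (the Python standard library does not validate against DTDs), and a list stands in for the CMS repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical content-type library: required child elements per type.
# A dictionary stands in for the per-type DTDs described in the text.
CONTENT_TYPES = {
    "press_release": ["title", "agency", "release_date", "body"],
    "service_description": ["title", "agency", "audience", "body"],
}


def convert(content_type, fields):
    """Wrap already-extracted field values in an XML document of the given type."""
    root = ET.Element(content_type)
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return root


def missing_elements(doc):
    """List required elements that are absent or empty for the document's type."""
    required = CONTENT_TYPES[doc.tag]
    return [name for name in required
            if not (doc.findtext(name) or "").strip()]


def migrate(content_type, fields, repository):
    """Convert and check one item; store it only if nothing is missing."""
    doc = convert(content_type, fields)
    problems = missing_elements(doc)
    if not problems:
        repository.append(ET.tostring(doc, encoding="unicode"))
    return problems


repository = []  # stand-in for the CMS repository
ok = migrate("press_release", {
    "title": "Sample Announcement",
    "agency": "Sample Agency",
    "release_date": "2003-10-01",
    "body": "Sample body text.",
}, repository)
bad = migrate("service_description", {
    "title": "Sample Service",
    "body": "Sample body text.",
}, repository)
print(ok, bad, len(repository))
```

The second item is rejected because its agency and audience elements are missing. Reporting that gap back to the content owner, rather than silently storing incomplete content, is what a shared content format library makes possible.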

In most content management system implementations, the time and effort required to migrate content into the new system far exceeds the time and effort required to deploy the new system and make it available to content-presenting applications. In situations where a content management system is implemented to replace existing content storage, it is important for user acceptance that the content presented is not limited to the content that is stored in the new system. For organizations that combine a new content management system initiative with a reorganization or restructuring of their enterprise content, content migration becomes much more of a prerequisite to launching the new system.

At the Centers for Disease Control and Prevention (CDC), the decision to implement a new content management system infrastructure came with a requirement to migrate thousands of pages of static HTML. Complicating matters, the HTML content included no meta-tags, while the new content management system was intended to incorporate a complex medical and public health thesaurus. The CDC initiative designed the information architecture, taxonomy, and enterprise thesaurus before beginning implementation of the content management system. One of the criteria used to select the content management system was its ability to import an externally developed taxonomy. Another differentiating factor in the selection was the capability to automatically tag incoming content with metadata according to the enterprise taxonomy.

An organization such as the CDC looks to enterprise content management for handling document-centric content as well as web content. Much of this unstructured content exists in current form in desktop application file formats such as Microsoft Word and Excel. A major challenge with converting this type of unstructured content into a structured format such as XML is that the source content may not provide enough meaningful information to guide conversion tools. A common example is Word documents created with the Normal template, and using only presentation-related style definitions (bold, bullet, etc.) instead of structure or context definitions (heading levels, footers, etc.). For content migration of unstructured content, auto-tagging tools are often the only alternative to manually intensive conversion efforts, which in practice become large-scale cut-and-paste exercises.
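The difference between presentation-only and structural styles can be shown with a small sketch. It assumes paragraphs have already been extracted from a document as (style name, text) pairs; the extraction step, the style-to-level mapping, and the output element names are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from structural style names to section depth.
HEADING_LEVELS = {"Heading 1": 1, "Heading 2": 2}


def paragraphs_to_xml(paragraphs):
    """Build nested <section> elements from (style_name, text) pairs.

    Paragraphs whose style is not a known heading style become <para>
    elements inside the most recently opened section.
    """
    root = ET.Element("document")
    stack = [(0, root)]  # (heading level, element), outermost first
    for style, text in paragraphs:
        level = HEADING_LEVELS.get(style)
        if level is None:
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # Close sections at the same or a deeper level before opening a new one.
        while stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section")
        ET.SubElement(section, "title").text = text
        stack.append((level, section))
    return root


styled = [
    ("Heading 1", "Overview"),
    ("Normal", "Background text."),
    ("Heading 2", "Scope"),
    ("Normal", "Scope text."),
]
# The same text with every paragraph left in the Normal style.
unstyled = [("Normal", text) for _, text in styled]

structured_doc = paragraphs_to_xml(styled)
flat_doc = paragraphs_to_xml(unstyled)
print(ET.tostring(structured_doc, encoding="unicode"))
print(ET.tostring(flat_doc, encoding="unicode"))
```

With structural styles the converter recovers nested sections and titles; with only the Normal style the same text comes out as a flat run of paragraphs, and any structure has to be supplied by people or by content-analysis tools.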

Content management at the CDC is widely distributed, with hundreds of content contributors and publishers among the organization's units. The newly centralized content management infrastructure needs to accommodate dozens of disparate content handling processes, although most of those share a complete absence of metadata in their content. In contrast to the content migration factory approach used in Minnesota, where content owners took primary responsibility for making conversion happen, the CDC centralized the core content migration capability by implementing automated tagging tools as part of the CMS infrastructure. In the CDC's Documentum CMS implementation, this capability corresponded to the Content Intelligence Services (CIS) module, which parses incoming content and determines the appropriate metadata tags to apply according to the taxonomy loaded into the CMS. For similar content files, this capability can be extended with scripting or mapping tools to reduce the manual effort required to make sure that content gets converted into the right fields with the right attributes. This approach is particularly well suited to converting static HTML to XML formats stored in the CMS.
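The general idea of tagging static HTML against a taxonomy and emitting XML can be sketched as follows. This is not how the Documentum CIS module works internally, which the paper does not describe; it is a deliberately simple keyword-matching illustration, and the taxonomy terms, the output element names, and the sample page are all invented for the example.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical taxonomy: preferred term -> single words that indicate it.
TAXONOMY = {
    "Influenza": ["influenza", "flu"],
    "Immunization": ["vaccine", "vaccination", "immunization"],
    "Food Safety": ["foodborne", "salmonella"],
}


class TextExtractor(HTMLParser):
    """Collect the <title> text and all other visible text from a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.chunks.append(data.strip())


def suggest_terms(text):
    """Return taxonomy terms with an indicator word present as a whole word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [term for term, indicators in TAXONOMY.items()
            if any(word in words for word in indicators)]


def html_to_xml(html):
    """Convert a static HTML page into a <content> element with metadata."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    body_text = " ".join(parser.chunks)
    root = ET.Element("content")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "title").text = parser.title.strip()
    for term in suggest_terms(parser.title + " " + body_text):
        ET.SubElement(meta, "subject").text = term
    ET.SubElement(root, "body").text = body_text
    return root


# Invented sample page with no meta-tags.
page = ("<html><head><title>Flu Season</title></head><body>"
        "<h1>Get Ready</h1><p>A yearly vaccine is the best protection.</p>"
        "</body></html>")
doc = html_to_xml(page)
print(ET.tostring(doc, encoding="unicode"))
```

Even this crude matcher shows why the taxonomy has to exist before migration begins: the tagger can only assign terms it has been given, and the quality of the result depends on the vocabulary loaded into it.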

4. Conclusion

What should be clear from the above examples is that metadata standards support content management initiatives in general, and content conversion efforts in particular, in many ways. The term “standard” in this context connotes little more than consistency in use within the enterprise, although there are certainly cross-industry metadata standards available, such as the Dublin Core. The point is that implementing content management in the absence of metadata standards provides content migrators no guidance, no commonality, and little opportunity to automate the process. To turn these negatives into positives, organizations should begin with the development or acquisition of metadata standards, which simplify many content management and migration activities.

Glossary

CISSP

Certified Information Systems Security Professional

CMS

content management system

DTD

Document Type Definition

HTML

hypertext markup language

XML

Extensible Markup Language

Biography

Stephen Gantz is the senior architect for Roundarch, a systems integrator focused on enterprise portals, content management, and integration. He also leads their security practice. Steve has 12 years of experience in technology-related professional services and software development, primarily as an IT architect designing e-commerce, enterprise application integration, customer relationship management, and security systems and infrastructures. He also holds the Certified Information Systems Security Professional (CISSP) certification.

Steve’s industry expertise includes federal civilian and state government, financial services, insurance, retail, telecommunications, and higher education. His areas of technical expertise include customer relationship management (CRM) and enterprise resource planning (ERP) applications, middleware technologies, security and e-commerce systems architecture, and data transport and exchange using EDI and XML. He is a regular speaker at industry events on enterprise application integration, security, and XML. He holds a master's degree in technology policy from the John F. Kennedy School of Government at Harvard University, as well as a bachelor's degree in applied mathematics and statistics from Harvard.