XML Europe 2002

A New Face for Each Show: Make Up Your Content by Effective Variants Engineering

Abstract

This paper presents an approach to variants engineering that addresses use cases found in the production of information types such as reference works and legal publications, where composing variant fragments into products demands concrete, practicable information technology concepts. It then describes two use cases, one from reference works and one from legal publishing; their real-world solutions will be the focus of the slide presentation at the conference.

In the solutions presented we will show proven strategies for the multiple re-use of a piece of information for various products and under given aspects.

Table of Contents

1. Intro
2. Terms and Definitions
3. General Assumptions on Variants Engineering
3.1. Motivation and Characterization of Variants
3.2. What Happens Logically when Variants are Created
3.3. *Content-Managing* Variants on Object Level
3.4. *Content-Managing* Collections of Managed Objects
4. Serving Multiple Input and Output Channels
4.1. A Reference Works Publisher's Real World...
4.2. ... Several Steps to Keep it Going
4.2.1. Project Goals
4.2.2. Use Cases
4.3. ... and the Real Show...
5. Time Variants in Legal Publishing
5.1. The Proof-of-concept Prototype at ZGR, Wolters Kluwer, Germany
5.1.1. Import
5.1.2. Variant Engine
5.1.3. Extended Linking
5.1.4. Structure Search
5.1.5. Computer Aided Update (CAU)
5.1.6. Collection and Publication
5.1.7. Export
5.2. The Show...
6. Conclusion
Acknowledgements
Bibliography
Glossary
Biography

1. Intro

Existing business cases and the ever broadening vision of publishing markets put increasingly complex demands on the production of non-fictional information. To achieve the business targets, it is necessary to identify and satisfy these demands in a seamless editorial and content management environment. How can this be achieved considering the multiple expectations on our products we want to meet?

One crucial feature of an effective production is certainly the minimization of redundant information. Information re-use is the buzzword. To achieve this, we structure our content using concepts like XML thus enabling it for fragmentation and modularisation. Applied consistently, this has a direct positive effect on production and maintenance cost.

But there’s more to the production of non-fiction: some crucial characteristics of non-fictional information are that it is purpose-driven, market-oriented, and user-dependent. Here, re-purpose is the buzzword (I re-use a term I found in John F. Terris [TE01]). The occurrence of a piece of information - be it text, images, animations, references… - varies with the multifaceted demands of user expectations and capabilities, publishing media, time, and money. We call each new occurrence of an (existing) piece of information - for a certain purpose and under certain conditions - a variant.

An XML-based, variant-aware content production brings the company another step closer to the promised advantages of SGML/XML, all of which rest on the creation of smart content.

on the production level:

  • single-source and multi-channel publishing

  • improved information quality

  • enhanced functionality of electronic output

on the business level:

  • new products quickly to market

  • products that fit the individual expectations

  • possibility to anticipate what makes the customer happy and produce that in reasonable amount of time; which finally leads to ⇒

  • increased customer loyalty.

We define smart content as a

  • managed collection of product neutral, reusable, linked and well structured information objects

  • plus their associated metadata,

  • stored in standards-based media neutral format,

  • which can be navigated, searched and combined,

  • to provide timely, targeted, user adapted and traceable information to the end user in the required medium.

2. Terms and Definitions

Since the interpretation of the keywords used in this paper is often as ambiguous as the subject matter itself, let us first introduce our understanding of them.

information object

An information object is the smallest content unit managed by a content management system.

link object

A link connecting two or more information objects; link objects carry information (e.g. metadata) which can be modified without modifying any information objects being associated to them.

metadata

Information about a managed object (for example validity, status, administrative issues, etc.)

managed object

In this context a managed object is an object handled with versioning by the content management system. There are two managed object types: information objects and link objects. Metadata are not considered managed objects in this context, because they are seen as an integral part of a managed object and are therefore implicitly versioned.

collection of managed objects

A collection of managed objects is an arbitrary set of managed objects. A collection may have a defined sequence and graph representation (e.g. container structure) of its members.

occurrence

An information object or link object which contains content or other information on an abstract subject.

production

A production is an editorial process instance to produce a specific version and variant of a publication.

publication

An arbitrary collection of managed objects for the purpose of delivery as a product to a target group. A publication runs through a workflow and is released for delivery. Publications may be regularly published in new versions or in different variants.

re-use

“using the EXACT same information object (image or text) in more than one ‘document’” [TE01]

re-purpose

“extracting and/or formatting the same piece of information in many different ways, usually producing a different document type targeted for a different user and/or purpose” [TE01]

subject

everything that has been assigned a *name*

variant

A variant is a managed object or collection of such with alternative content to another managed object or a collection of such. Two associated variants may have different content and related data (such as link arcs and metadata). Variants are independently modifiable by editing. Variants are created based on decisions (human triggered). Variants know about their relationship to the other variants. Variants have some attached validity metadata, which define the context in which the variant is valid/effective (e.g. language, target medium).

variant set

All the variant occurrences to a given subject/topic. A set of managed objects making up all defined variants to a managed object.

version

A version is a self-contained persistent snapshot of the data and related data (such as link arcs and metadata) of a managed object or a collection of managed objects at a given time. All versions of the same origin (managed object or collection) build up a clearly defined (temporal) sequence. Versions except the current version are read only and can’t be modified. The creation of versions is normally rule/event based (e.g. check-in or publication).

3. General Assumptions on Variants Engineering

Imagining the constant creation of new variants combined with their re-use, re-purposing, and versioning issues, we see the complexity of an information production process grow quite rapidly: it is no longer two-dimensional (as it is for each single production line) but can rather be seen as a non-linear production network, with variants and their relations to each other as the objects to be managed therein. While it is easier to run parallel, specialized databases of the same information base for special functions, the consistency problems become exponentially more difficult to handle as the databases and their complexity grow.

3.1. Motivation and Characterization of Variants

As described in Section 2, Terms and Definitions, a variant in our sense of the word can be an (ideally XML-structured) information object, a link object, or a collection of them. Variants of the same variant set describe one and the same subject in different ways. The motivation for creating a new variant comes from criteria such as a new production channel, a user profile, a new publication (of a law, a spin-off product), a new variant of the product described (technical documentation), or an additional language; it is not triggered by production workflow issues (these result in versions).

Typically, a variant is created by a copy⇒edit⇒characterize action on a given object performed by an editor who intellectually decides which information object (fragment) qualifies for a new variant, at what granularity a variant should be created, which content changes are necessary for the new purpose, and which are its characteristics/validity constraints. This new variant now has its own lifecycle, and can itself be the basis for a new variant.

The variants in a variant set are equally valid and useful in their specific context, according to their validity constraints. These constraints characterize each variant according to its occurrence — i.e. in which product, for which user group, at which point in time, it should be published. As these criteria are information about information, we call them metadata. In our model, both information objects and link objects can have metadata. These are an inherent part of the managed objects and are stored at version level.

Figure 1. (XML-)Managed Objects, Links, Metadata and Link Metadata: the Building Blocks of Variant Handling

  • A variant set is a set of managed objects making up all defined variants to a managed object. In the relationship of variants to the variant set, validity data is needed to identify which variant is valid in which context. It is illegal to have variants within one variant set with overlapping validity. A given managed object can only take part in one variant set. All managed objects of a variant set belong to the same type of managed objects: either information object or link object.

  • A collection of managed objects consists of managed objects (information and link objects). The collection may have an internal structure (sequence or graph representation, e.g. container structure).

  • A managed object consists of an ordered list of versions. There is always exactly one current version available.

  • A managed object has an optional association to another managed object of the same type indicating where it was created from by a copy operation.

  • Metadata of managed objects are stored at version level.

  • Link objects have relationships to versions of information objects. A link has a source and a target version. A link carries its metadata like information objects on version level.

With the given data model the following operations are possible:

  • Finding the variant set in which a given information object is a member.

  • Retrieving the matching information object for a given validity constraint with a given start information object (independently from the validity of the start information object)

  • Traversing the origin chain of information objects.
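The data model and the three operations above can be illustrated with a minimal Python sketch. All class, field and function names here are hypothetical, and validity is simplified to exact-match key/value pairs in the metadata of the current version; the paper does not prescribe a concrete representation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity comparison, because the object graph is cyclic
class Version:
    content: str
    metadata: dict = field(default_factory=dict)  # metadata lives at version level


@dataclass(eq=False)
class InformationObject:
    object_id: str
    versions: list  # ordered; the last entry is the current version
    origin: Optional[InformationObject] = None    # object it was copied from
    variant_set: Optional[VariantSet] = None      # at most one variant set

    @property
    def current(self) -> Version:
        return self.versions[-1]


@dataclass(eq=False)
class VariantSet:
    members: list = field(default_factory=list)

    def add(self, obj: InformationObject) -> None:
        if obj.variant_set is not None and obj.variant_set is not self:
            raise ValueError(f"{obj.object_id} already belongs to another variant set")
        if obj.variant_set is None:
            obj.variant_set = self
            self.members.append(obj)


def find_variant(start: InformationObject, constraint: dict) -> Optional[InformationObject]:
    """Return the member of start's variant set that matches the constraint."""
    candidates = start.variant_set.members if start.variant_set else [start]
    matches = [o for o in candidates
               if all(o.current.metadata.get(k) == v for k, v in constraint.items())]
    if len(matches) > 1:
        raise ValueError("overlapping validity within one variant set")
    return matches[0] if matches else None


def origin_chain(obj: InformationObject) -> list:
    """Follow the 'copied from' association back to the first ancestor."""
    chain = []
    while obj.origin is not None:
        obj = obj.origin
        chain.append(obj.object_id)
    return chain


# Assumed example data: a print article and a web variant copied from it.
io1 = InformationObject("IO1", [Version("City article", {"medium": "print"})])
io4 = InformationObject("IO4", [Version("City article, shortened", {"medium": "web"})], origin=io1)
variants = VariantSet()
variants.add(io1)
variants.add(io4)
```

The lookup starts from any member, independently of that member's own validity, so `find_variant(io1, {"medium": "web"})` returns `io4`, and `origin_chain(io4)` yields `["IO1"]`.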

3.2. What Happens Logically when Variants are Created

The following initial situation is given:

  • Two information objects: IO1.1 and IO2.1

  • Both information objects are linked by a link object LO3.1


A new variant of IO1.1 should be created now.

Generating the variant results in two new objects: a new information object, IO4, which forms a new variant set together with IO1, and a new link, LO5. The two links make up a variant set of their own.

In the next step a variant of IO4 is created. Since an information object can only be part of one variant set, the new information object becomes part of the same variant set.

In the next step a variant is added to IO2.

Assume we would like to create a publication whose validity constraints fit information objects IO1.1 in V1 and IO8.1 in V2 (remember: overlapping validity is not allowed, so only one information object of each variant set will be valid). The resulting collection of information objects of the publication then consists of exactly these two objects.
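The walkthrough in this section can be replayed with a small Python sketch. It rests on assumptions not stated in the text: identifiers are drawn from one shared counter (which reproduces the names IO4, LO5 and IO8 used above), and creating a variant copies every link that touches the source object. The `Repository` class and its methods are hypothetical.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Link:
    link_id: str
    source: str
    target: str


class Repository:
    """Toy store that hands out IDs from one shared counter (an assumption)."""

    def __init__(self):
        self._ids = count(1)
        self.objects = []          # information object IDs, in creation order
        self.links = []            # Link records, in creation order
        self.variant_set_of = {}   # object ID -> variant set label

    def new_object(self, variant_set):
        object_id = f"IO{next(self._ids)}"
        self.objects.append(object_id)
        self.variant_set_of[object_id] = variant_set
        return object_id

    def link(self, source, target):
        new_link = Link(f"LO{next(self._ids)}", source, target)
        self.links.append(new_link)
        return new_link

    def create_variant(self, object_id):
        """Copy an object into its variant set and duplicate its links."""
        variant = self.new_object(self.variant_set_of[object_id])
        for existing in list(self.links):  # iterate over a snapshot
            if existing.source == object_id:
                self.link(variant, existing.target)
            elif existing.target == object_id:
                self.link(existing.source, variant)
        return variant


repo = Repository()
io1 = repo.new_object("V1")       # IO1
io2 = repo.new_object("V2")       # IO2
repo.link(io1, io2)               # LO3
io4 = repo.create_variant(io1)    # IO4 plus link LO5
io6 = repo.create_variant(io4)    # joins the same variant set V1
io8 = repo.create_variant(io2)    # IO8 in V2

members_v1 = [o for o in repo.objects if repo.variant_set_of[o] == "V1"]
members_v2 = [o for o in repo.objects if repo.variant_set_of[o] == "V2"]
```

Under these assumptions V1 ends up holding IO1, IO4 and IO6, and V2 holds IO2 and IO8. The number of links created in the last step depends entirely on the assumed copy rule and may differ from the original figures.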

3.3. *Content-Managing* Variants on Object Level

A system supporting variant handling should provide efficient means for handling variants of managed objects (information objects and link objects). It must provide means to:

  • Create a variant of an information object by copying it and adding it to the variant set (this may include creating additional links). The newly created object must store information about the information object it was copied from.

  • Edit the validity data of a variant.

  • Add an existing information object to a variant set.

  • Remove an information object from a variant set

  • Create a report on all variants within a variant set including their validity data

  • Check for overlapping validities within a variant set with a given validity constraint

  • Check that at least one object within a variant set “is valid” according to a given validity constraint.
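The two checks at the end of this list can be sketched as follows. The representation is an assumption made for illustration: a validity is a mapping from a facet (such as language or medium) to the set of values it admits, and a facet that is not mentioned admits any value. The function names are hypothetical.

```python
from itertools import combinations


def overlaps(a, b):
    """Two validities overlap unless some shared facet has disjoint value sets."""
    return all(a[facet] & b[facet] for facet in a.keys() & b.keys())


def overlapping_pairs(variant_set):
    """Report every pair of variants whose validities overlap."""
    return [(x, y)
            for (x, vx), (y, vy) in combinations(variant_set.items(), 2)
            if overlaps(vx, vy)]


def valid_members(variant_set, context):
    """Return the variants that are valid in a concrete publishing context."""
    return [object_id
            for object_id, validity in variant_set.items()
            if all(context[facet] in allowed
                   for facet, allowed in validity.items()
                   if facet in context)]


# Assumed example: two variants separated by the target medium.
clean_set = {
    "IO1": {"language": {"de"}, "medium": {"print"}},
    "IO4": {"language": {"de"}, "medium": {"web", "cd-rom"}},
}
# A third variant without a medium restriction collides with both.
broken_set = dict(clean_set, IO6={"language": {"de"}})
```

With this data, `overlapping_pairs(clean_set)` is empty, while `broken_set` yields the pairs involving IO6. A non-empty result from `valid_members` answers the "at least one object is valid" check.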

3.4. *Content-Managing* Collections of Managed Objects

Collections have editable metadata for classification and retrieval. Collections may have an internally defined sequence of their managed objects. Collections can be assembled by different methods:

  • Searching

  • Individual selection

  • Link traversing

  • Navigating views of managed objects (e.g. container structures)

  • Combinations of the above.

The system should support bulk operations on the members of a collection:

  • Editing metadata

  • Setting workflow states

  • Check-in and Check-out

  • Moving and Copying

  • Re-using

  • Export

  • Deleting

4. Serving Multiple Input and Output Channels

Reference works (RW) publishing is a typical information business that depends entirely on users' satisfaction with the information published. Its overall business case is serving multiple existing expectations and anticipating future ones. Information is the product, either explicit or implicit (explicit = the texts; implicit = hyper/between/beneath-text).

4.1. A Reference Works Publisher's Real World...

Our reference works publisher (RWP) wants to market his content by generating various products out of his database on different media for different target groups and by syndicating this database. The RWP partially generates the database himself and partially buys content from other publishers. He works with internal and external editors.

Figure 2. RW production environment


The RWP is embedded between suppliers and consumers of content. External resources for content are external authors and syndication suppliers. External authors write or review articles for given lists of lemmas. External syndication suppliers deliver whole databases of articles or assets (pictures) either on demand or based on their own content databases. The RWP also has internal authors. They basically have the same tasks as the external authors.

The RWP has several content management teams for different content building and maintenance tasks. Each team uses additional internal and external authors as well as syndication suppliers to perform these tasks. Each team has an editor in chief and about 3 to 6 team members.

For the actual building of products, there are several product management teams. They decide on the creation of new products, select content from the different databases, modify and adapt it as needed, produce the product via the production department and maintain the product probably over several years. They use several sales channels to sell the product.

Figure 3. RW publication channels


RWP has basically two reference works content databases: the "medium reference works (RW)" and the "big RW". They evolved separately, but their lists of lemmas have since been synchronized, and there is an RWP-wide list of lemmas. In contrast, the chronicles content database is a database of historical articles on different topics (cities, politics, countries). RWP also has a database of electronic pictures and an archive for multimedia assets. The archive is organized electronically, but some assets (e.g. films) are only available offline because of high conversion costs and limited electronic storage capacity.

The products themselves are first of all a variety of reference works. There are small and big versions and special variants (e.g. physics) for special topics or target groups. The big version is also published on CD-ROM.

A separately edited multimedia version is published on CD-ROM. The web site publishes the "big RW" plus a lot of pictures and even multimedia data. The web content additionally contains internal links for different lemmas. Users are even able to add their own content to different lemmas. This additional information is also stored in the web content database. By semi-automatic procedures, a topic map is maintained on the articles in the website and is interactively accessible for the user. The web editorial team also writes "topical articles" on current topics. These topical articles are also stored in the web content database.

4.2. ... Several Steps to Keep it Going

4.2.1. Project Goals

The main goal of dealing with variants, versions and configuration management in this project is to arrive at a content centric (as opposed to a product centric) content management. In order to achieve this it is necessary to decouple the creation of content from the formation of products and their production. This holds for the content structure as well as for the workflow.

4.2.2. Use Cases

Acquisition of information

  1. New or modified pieces of information pass through a product independent editorial process. Creation or modification has no direct effect on existing products.

  2. Additional or new information is to be acquired. An author is identified to (re)write an article.

  3. Articles of a certain class (say all articles about German cities) need to be updated (e.g. a paragraph about their breweries is to be added). All relevant articles are identified, either by selecting them manually or by retrieving them through a metadata search. The articles are passed through an edit / review process to ensure that all articles contain the required information.

  4. New articles can be created and passed through the same edit / review process, if necessary.

  5. If the group of articles intersects with another group which is already in a rework process, the collision has to be evaluated by an editor and the project may have to be postponed.

  6. All articles of the class have successfully passed the editorial process. I.e. all articles contain the required information and have been reviewed. They are now ready to be used in any product.

  7. Creation, modification (and even deletion) of information objects must not have any direct effect on existing products. This means the products must not adopt the modifications unless the responsible editor decides so.

  8. Remarks: These “up-to-date” information objects are constantly changing at a rather quick pace.

Release of information

  1. When a class of articles has been modified successfully, the objects are made available for products (released). This released information is self contained and unchangeable except through overwriting by the next release.

  2. A class of articles has successfully passed a modification process (see use case 1: Acquisition of information).

  3. The responsible editor acknowledges the successful acquisition of information and releases the corresponding class of articles. The relevant articles are identified manually or through a metadata search. The current content of all articles, along with all links and designated metadata, is set to a defined status to become the released version(s) of the information (read-only). A previously existing released version will be overwritten by the “new” release (the old released version(s) are discarded). All links between two information objects in the editorial process are translated into links between the released versions of these information objects.

  4. In some cases it was decided that objects are to be released automatically as soon as they had passed the editorial process of information acquisition. This can be regarded merely as a convenience improvement for the editors.

  5. If a link cannot be established in the pool of released versions (i.e. one of the objects is not available yet), it will be ignored. However, the link will automatically be established as soon as the missing released object is available (i.e. when the corresponding article is released).

  6. The information is now available for the creation / modification of products. All relations (links) that are valid within the released information have been established. All designated metadata have been copied.

  7. The pool of released objects must be self-contained. All elements (objects, links, metadata) must be updateable through releasing the corresponding objects again.

  8. Remarks: The released objects basically form a (named) version. The one exception is that the next release (i.e. the next version) will overwrite the existing one.
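The release behaviour described in this use case (overwriting the previous release, and links that only become effective once both ends are released) can be sketched in a few lines of Python. The `ReleasePool` class is hypothetical; content is reduced to plain strings and links to (source, target) pairs.

```python
class ReleasePool:
    """Toy pool of released versions; a new release overwrites the old one."""

    def __init__(self):
        self.released = {}         # object ID -> released (read-only) content
        self._known_links = set()  # all links seen in the editorial process

    def release(self, objects, links=()):
        """Release a class of articles together with their editorial links."""
        for object_id, content in objects.items():
            self.released[object_id] = content  # previous release is discarded
        self._known_links.update(links)

    @property
    def links(self):
        """Links are effective only when both ends have been released."""
        return {(source, target)
                for source, target in self._known_links
                if source in self.released and target in self.released}


pool = ReleasePool()
pool.release({"Berlin": "Berlin, rev. 1"}, links=[("Berlin", "Munich")])
links_before = pool.links          # Munich is not released yet: link ignored

pool.release({"Munich": "Munich, rev. 1"})
links_after = pool.links           # link established automatically

pool.release({"Berlin": "Berlin, rev. 2"})  # overwrites the earlier release
```

Deriving the effective links on demand is one simple way to get the "established as soon as the missing object is available" behaviour without a separate repair step; a real system might store them explicitly instead.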

Generation of Products

  1. An editor selects a set of information objects from the pool of released versions to form a product. Variants of these objects are made available exclusively for the processes of this product.

    Precondition: Released Versions of information objects available. The structure (storage, workflow etc.) for the product has been defined.

    The editor identifies all relevant articles for his product among the released versions. This is mostly done through metadata searches, as the collection is usually very big. The selection is stored in a so-called shopping cart (basically a storage of object identifiers in combination with the identification of the selection). A process is started to generate variants of the selected articles (and the links and metadata). The set of variants (the product) is self-contained. The objects can be modified without any interference to other products.

    Post Condition: All necessary objects and links are available for the product and can be modified.

    The product must be self contained and modifiable. All elements must be updateable through adopting the corresponding released version again.

    Remark: The product is derived as variants from released versions.

Maintaining Products

  1. Maintaining products covers local modification of information as well as updating / recovering information from the pool of released versions.

    Precondition: The product has been created and needs to be modified.

    In the course of processing the product along its special workflow (e.g. structural adjustments, proof reading, printing …) there is the need to modify the content. This can be done locally to adjust the content with respect to the publication media (i.e. to manipulate page breaks) or with respect to the designated audience. Another means of maintaining a product is to update the content from the pool of released versions. This can be split into two major possibilities:

    1. Product driven update: A set of articles is selected in the product (by means of shopping carts). The objects in the set are updated from the pool of released versions. Optionally, objects can be deleted if their correspondences are no longer available among the released objects.

    2. Released versions driven update: A new set of articles is selected in the pool of released objects (by means of a shopping cart). The objects are inserted into the product, or the corresponding product objects are updated.

    Note that a product can contain objects that have no correspondences in the pool of released versions. This can be the case if the objects were originally created in the product or if the corresponding released version has been deleted.

    Result: The product remains a self-contained set of variants.

    All links that can be derived from links in the pool of released versions are established in the product.

    Remark: The updating process can be seen as a creation of new variants in the product while the old variants are discarded.
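A minimal sketch of product generation and the two update directions, assuming that a shopping cart is simply a list of object identifiers and that content is a plain string. The `Product` class and its method names are hypothetical.

```python
class Product:
    """Toy product: a self-contained set of variants keyed by object ID."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # object ID -> product-local variant content

    def generate(self, cart, pool):
        """Create variants of the selected released versions."""
        for object_id in cart:
            self.objects[object_id] = pool[object_id]

    def product_driven_update(self, cart, pool, delete_missing=False):
        """Refresh objects selected in the product from the released pool."""
        for object_id in cart:
            if object_id in pool:
                self.objects[object_id] = pool[object_id]
            elif delete_missing:
                self.objects.pop(object_id, None)

    def release_driven_update(self, cart, pool):
        """Insert or update objects selected in the released pool."""
        for object_id in cart:
            self.objects[object_id] = pool[object_id]


pool = {"Berlin": "Berlin, rev. 1", "Munich": "Munich, rev. 1"}
product = Product("City guide")
product.generate(["Berlin", "Munich"], pool)

# Local changes do not touch the released pool.
product.objects["Berlin"] = "Berlin, rev. 1, shortened for print"
product.objects["Preface"] = "Written in the product only"
pool_after_local_edit = dict(pool)

# The pool moves on: Berlin is re-released, Munich deleted, Hamburg added.
pool["Berlin"] = "Berlin, rev. 2"
del pool["Munich"]
pool["Hamburg"] = "Hamburg, rev. 1"

product.product_driven_update(["Berlin", "Munich"], pool, delete_missing=True)
product.release_driven_update(["Hamburg"], pool)
```

After both updates the product holds the new Berlin release, the new Hamburg article and its own Preface, which has no correspondence in the pool. The local edit to Berlin is discarded, matching the remark that updating replaces the old variant with a new one.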

4.3. ... and the Real Show...

...will be given and discussed in Barcelona.

5. Time Variants in Legal Publishing

The semantics of time plays a crucial role in the publishing of products based on legal documents: it is the criterion for the validity of the legal texts and of references to them. If we talk about the laws themselves, we would rather think that it is a versioning issue we have to address. But in managing laws (and their fragments, down to paragraphs) as information units, we think they must be treated as variants for validity reasons: the state of each production effective at a certain point in time (or time range) must be kept retrievable.

5.1. The Proof-of-concept Prototype at ZGR, Wolters Kluwer, Germany

The ZGR (central legislation editorial office) is a pilot sigmalink installation at Wolters Kluwer Germany. ZGR can be considered the proof of concept project with the following targets:

  • enable editorial staff to use a CMS

  • gain experience with a CMS

  • proof of general CM concepts

  • proof of DTD concept

  • get user feedback/usability issues

  • explore basic customizing facilities

  • find out about cost and efforts

  • comparison to former way of working

  • export system neutral data for further processing

Fundamental reasons for choosing the ZGR for the pilot were:

  • Central law editorial is a complex task that gives a good example for solving challenging situations (esp. variant handling) with a CMS.

  • Central law editorial is the “heart” of services a legal publisher in Germany may give to his customers; other data types (comments, decisions) may build up on this central base.

  • There was a need for a high quality output format to be given to production.

The basic idea has been to build a data structure context “from heart to periphery”, beginning with central law editorial, going further with court decisions (which are referring to laws), and finishing with works of comment character, that may refer to everything else.

The implemented processes will be shown at the conference.

5.1.1. Import

Existing data resources (a collection of law objects) are imported into the system through a defined import format. The format carries all time-dependent information, as well as metadata and links. This information is kept as elements and attributes in the SGML instances before importing them in the system. During the import,

  • the documents are split into fragments according to structures defined in the DTD,

  • status and other information about the information objects are extracted as metadata to be managed by the system and used as basis for searches and retrieval actions in the editing process,

  • link information like valid-from is extracted as link metadata, cf. Section 5.1.2, Variant Engine.

5.1.2. Variant Engine

The Variant Engine stores and manages all time-dependent variants of a legal document marked up in SGML/XML. Every time-dependent fragment carries a valid-from and a valid-until date. The date could either be a “real” date or a logical date like “to be defined by the government” or “valid until further notice”. Because the differences between the time versions of a legal document might be small – compared to the size of the document – the variant engine reduces the size of the copied text fragments to a minimum. Each fragment is under the control of the variant engine. The access to the variants for editing, reading, linking, and publishing is done through the engine. It also ensures that the proper documents are assembled from all the time-dependent text fragments stored in the database.
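How time-dependent fragments might be stored and assembled can be sketched as follows. This is an illustration under stated assumptions, not the Variant Engine's actual design: the class and function names are hypothetical, both dates are treated as inclusive, and an open end ("valid until further notice") is modelled as `None`.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class FragmentVariant:
    fragment_id: str
    text: str
    valid_from: date
    valid_until: Optional[date] = None  # None: valid until further notice

    def is_valid_on(self, day: date) -> bool:
        return self.valid_from <= day and (
            self.valid_until is None or day <= self.valid_until)


def assemble(variants, order, day):
    """Build the document text as it was in force on the given day."""
    result = []
    for fragment_id in order:
        matches = [v for v in variants
                   if v.fragment_id == fragment_id and v.is_valid_on(day)]
        if len(matches) > 1:
            raise ValueError(f"overlapping validity for {fragment_id}")
        if matches:
            result.append(matches[0].text)
    return result


# Assumed example: only the changed paragraph is stored twice.
variants = [
    FragmentVariant("§1", "§1 (unchanged)", date(1990, 1, 1)),
    FragmentVariant("§2", "§2 (old wording)", date(1990, 1, 1), date(2001, 12, 31)),
    FragmentVariant("§2", "§2 (new wording)", date(2002, 1, 1)),
]
law_2001 = assemble(variants, ["§1", "§2"], date(2001, 6, 1))
law_2002 = assemble(variants, ["§1", "§2"], date(2002, 5, 1))
```

Because variants are kept per fragment, the unchanged paragraph is stored once and shared by both states of the law. Logical dates such as "to be defined by the government" are not covered by this sketch.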

5.1.3. Extended Linking

Legal publishers sell added value to public laws. Maintaining legal texts is one task, but assigning comments and connecting related texts with links is another, more valuable task. The link networks make use of a common information pool with all the legal texts. The publishers can create different link layers or networks over the same information objects, for example for particular customer groups or for specific purposes such as representing the state of a legal situation ten years ago. Thus, it is an important precondition that the links can be established without changing the data and that they can be collected and managed as different networks. The system’s extended linking feature provides this functionality and is also integrated into the variant engine, offering time-dependent links.

5.1.4. Structure Search

Verbal cross references to legal documents make use of the strong hierarchy of the texts, e.g. directive XYZ, §2, item 1, sentence 4. Those laws which change other laws make heavy use of such references to identify the text portions to be changed, and it is one of the most important tasks for the publisher to keep track of all the changes in his law database. Thus, easy following of these verbal references is an essential requirement of the system.
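Following such a verbal reference amounts to turning it into a path through the document hierarchy. The sketch below assumes a fixed reference pattern modelled on the example above and a nested dictionary standing in for the fragmented law; the names and the sample text are illustrative only.

```python
import re

# Hypothetical pattern for references like "directive XYZ, §2, item 1, sentence 4".
REFERENCE = re.compile(
    r"directive\s+(?P<law>\w+),\s*§\s*(?P<section>\d+)"
    r"(?:,\s*item\s+(?P<item>\d+))?"
    r"(?:,\s*sentence\s+(?P<sentence>\d+))?"
)

# Assumed sample data: law -> section -> item -> sentence -> text.
LAWS = {
    "XYZ": {
        "2": {
            "1": {
                "3": "Third sentence of item 1.",
                "4": "Fourth sentence of item 1.",
            },
        },
    },
}


def resolve(reference, laws):
    """Walk the hierarchy as far as the reference is specific."""
    match = REFERENCE.fullmatch(reference.strip())
    if match is None:
        raise ValueError(f"unrecognized reference: {reference!r}")
    node = laws[match["law"]]
    for level in ("section", "item", "sentence"):
        key = match[level]
        if key is None:
            break  # the reference stops at a higher level
        node = node[key]
    return node


sentence = resolve("directive XYZ, §2, item 1, sentence 4", LAWS)
item = resolve("directive XYZ, §2, item 1", LAWS)
```

A reference that stops at a higher level returns the whole subtree, here the item with all its sentences. Real legal citations are far more varied than this single pattern.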

5.1.5. Computer Aided Update (CAU)

The maintenance of laws means applying all the changes from the changed laws into the publisher’s database. This important but laborious task should be supported by the CAU. The changed laws have to be transferred into an electronic form which controls the automatic update program. Every in-between step can be checked manually before the automatic process continues. Thus, the CAU simplifies and speeds up the maintenance of law texts.

5.1.6. Collection and Publication

A published legal text might consist of a number of separate documents, decisions, comments, graphics, etc., connected to each other by links. This makes electronic publications on CD-ROM or on the Web more useful than the printed versions; moreover, if the links have metadata, they themselves can be treated as variant objects. The collection function of the system packs all valid information objects belonging to a specific publication, with all their metadata and all their links, into one publication. The publication acts like an archive which can later be extracted completely from the database at any point in time.

5.1.7. Export

The export process exports all the publication data from the database to the file system for use in other tools like typesetting engines or CD-ROM applications.

5.2. The Show...

...will be given and discussed in Barcelona.

6. Conclusion

With information modelling concepts like XML plus complementary concepts for modelling meta-information like metadata and inherent knowledge structures expressed in the link objects, we have the methods we need for handling the constantly growing number of variants to meet the known and still unknown challenges in the information producing industries.

At the heart of our approach to variant handling is smart content in an abstract format that is independent of the tools it is produced with and of the media it will be published in, and free from product-specific information. It is thus smart enough to serve them all in its varying occurrences, which can be created on demand.

Acknowledgements

I would like to thank my colleagues, especially Dr. Martin Kreutzer, Johannes Müller, and Dr. Franz Weber, for their most valuable contributions concerning the logic of variant handling, and Mr. Németh for his creative cooperation in the ZGR project and his consent (“Zustimmung”) to showing it as a use case at XML Europe 2002.

Bibliography

[TE01] Terris, John F., IT Program Manager, Allen Park, Valley Forge Technical Information Services, “Re-use, Re-purpose, Re-package: A General Engine Products, Inc. Case Study”, http://www.idealliance.org/papers/xml2001papers/tm/WEB/04-01-04/04-01-04.htm, XML 2001

Glossary

RW

reference works

RWP

reference works publisher

ZGR

“Zentrale Gesetzesredaktion”

Biography

Martina Hemrich works as a project manager and IT consultant at empolis GmbH (formerly STEP Electronic Publishing Solutions GmbH), her main focus lying on consulting in XML and related standards and managing the implementation of XML-based solutions realized with the empolis product family. She started at STEP in 1995 specializing in DTD design and the rendering of structured data (both layout and linking concepts).

The focus of her conceptual work lies on the analysis and design of information architectures targeting the management of structured content and the workflows fit for achieving and creating new business goals of empolis' customers. The main aspects of her consulting work are the analysis of publication and workflow processes, Information Process Reengineering, and creating new concepts for the production of information that meet the requirements of content, stakeholders, audience, and medium alike. She coaches the introduction of new publication processes and trains on XML concepts, information design and workflow strategies. The projects she has been involved in over recent years mostly deal with legal content and reference works.