XML 2003

Using Open Source Software for Content Management in Publishing

Abstract

As XML workflows become more established within the publishing industry, the need has arisen within publishing organisations to develop electronic systems for storing and processing their digital content holdings: so-called content management systems (CMSs).

Yet though the underlying technologies of XML and other digital file formats are themselves well-established, the selection, installation and use of a CMS for managing them are still activities that can be fraught with technical and commercial uncertainty.

A recent trend within publishing is the 'self-build' CMS, in which a publisher decides to assemble a bespoke system out of available (often Open Source) components. Based on real-world experience of a number of publishers' plans and experiences, this paper will examine the reasons for this trend and discuss the commercial and technical arguments for and against such initiatives.

One of the most immediate barriers for publishers is a mismatch between what is expected of a CMS in publishing and what is expected in many other environments. In the wider world, a CMS can sometimes be seen as little more than a back-end to a web site; yet for a publisher a CMS is a system marshalling large amounts of content through complicated workflows. Though the CMS marketplace is a crowded one, there are good commercial reasons why CMS vendors can eschew the publishing market, as simpler CMSs can sell better into less demanding (and richer) markets.

Despite the rise of XML, published information is still much more than just structured data. A CMS for publishers must offer good control of unstructured data as well as XML, particularly in the difficult area of digital rights management (DRM). Again, like CMSs themselves, DRM often means something different to the publishing world than to the non-publishing world - in particular any publishing CMS should be aware of the issues surrounding permissions management.

Though technically intriguing, the ability of many CMSs to explode XML data into fragments for storage is little needed in publishing, where data transactions are typically applied to larger units of content relatively infrequently. The expense of acquiring and deploying a 'true' fragmenting XML repository is one which publishers can often avoid, particularly when they often already have effective storage systems within their businesses.

Though avoidance of 'vendor lock-in' has long been one of the promises of structured markup technologies, this means little if a publishing workflow becomes practically dependent on a particular commercial CMS. However, the costs of maintaining a bespoke system are not negligible - how do they stack up against the costs and benefits of acquiring a CMS from a commercial vendor?

The paper will conclude by suggesting that, for the first time, the crystallisation of a number of information standards, together with the availability of a number of high-quality commercial and Open Source components, gives publishers good reason to consider a 'self-build' Open Source based CMS when they are acquiring such a system.

Table of Contents

1. What is a CMS?
2. The Publishing Domain
3. The Current Position
4. CMS Choices
5. Horror Stories
6. A CMS Software Cost Model (1)
7. An Operating System Cost Model
8. A CMS Software Cost Model (2)
9. The Technical Model
10. Pro and Contra Open Source
11. Platform/Development Language
12. Component-based Software
13. XML ...
14. ... And Not XML
15. Storage Models
16. Deployment
17. Conclusion
Bibliography
Biography

1. What is a CMS?

The acronym 'CMS' (Content Management System) means different things to different people. The rise of the Web and of XML over the past few years has seen a corresponding increase in the availability of self-styled content management systems: the Intranet Journal Web site has a (no doubt incomplete) list of available CMSs, and mentions 147 systems[IJ]. This list encompasses a broad range of types of CMS, from established 'big ticket' systems such as Astoria CMS[Astoria], Documentum[Documentum] or SigmaLink[SigmaLink] (to pick three from this category at random), to more recent open source offerings like Mason[Mason].

The range of functionality offered by these systems is so great that the term 'CMS' is arguably too general to be meaningful. While the bigger systems tend to offer storage mechanisms, version control and multi-user access - and thus almost mirror and extend the functionality of entire enterprise operating systems - many recent offerings are much lighter-weight. Mason, for example, describes itself as 'a powerful Perl-based web site development and delivery engine'. These products all have their strengths and weaknesses, but to say that something is a 'CMS' is to say almost nothing about it. If we take CMS to mean something as generic as a system that offers storage, hierarchical placement of content, and fast index-based searching of content, then the out-of-the-box Windows operating system could be described as a 'CMS'!

To add to the problem, some vendors and developers are differentiating their offerings by adopting different acronyms. For example a publisher might be offered a DAMS (Digital Asset Management System) by one vendor; another might insist that in fact a Workflow System is required; yet another might offer workflow management as an integral part of their CMS offering.

2. The Publishing Domain

In an attempt to cut through these problems of what is meant by a 'CMS', I am going to base this paper on a consideration of requirements. Since this is the publishing track they will be the likely requirements of an organization carrying out large volume publishing activities; since this is an XML conference I am going to assume that this imaginary publisher is making extensive use of XML.

There are, I suggest, three main characteristics of such organizations that affect what they might mean when talking of a 'CMS'.

First, they have complex workflows - workflows that have many processes and which involve many participants. The act of turning authored content into distributable products requires many iterations of work on content, for example at the editorial and proofing stages. In a modern publishing business it is likely that suppliers of production services will be many, and often these suppliers are remote from the core publishing operation. This model has led some publishers to adduce an 'air traffic control' metaphor for publishing production, in which work items are directed to different locations under control of a central coordinator - the publisher.

Secondly, publishers - apart from some database publishers - typically have complex heterogeneous content. The sort of XML found in large scale document modelling has a structure far more complex, and more various, than that found in the simpler database-like formats that characterize the bulk of data in other industries.

Thirdly, for publishers the validity and quality of content are key concerns. It is not enough for the content being managed simply to work well enough, as it might be for content being managed as the back end for a web site - where working well on current browsers might be the chief criterion for success. Publishers typically have carefully prepared DTDs and Schemas and care that their content properly validates against these to ensure that it can be repurposed effectively should the need arise.

3. The Current Position

The current position for publishers who have not adopted a CMS product is what is sometimes referred to as the 'Drive D' approach: they store their content on the file system - often on a file server with shared access. Many publishers - especially small and medium sized ones, but some of the big players too - simply do not have a formal CMS.

Part of the reason for this is that for the developers and vendors of CMSs, publishing can be seen as a Cinderella industry. Many CMSs are developed by companies that are venture-backed or which themselves (quite properly) have commercial objectives. When making the business case for developing a CMS the question of target markets inevitably arises - and publishing is not as attractive a target market as (to take three examples) aerospace, financial sectors or government. Thus the peculiar needs of publishing are seldom conspicuously at the forefront of many CMS vendors' plans.

Related to this is price. Publishing is not a rich industry and IT-based solutions for publishing often need to be low cost. Many of the most capable CMSs that are potentially well-suited to publishing applications are simply too expensive for publishers (unlike, perhaps, for aerospace, financial sectors and government).

4. CMS Choices

For those publishers looking to put a CMS in place, there are three main options.

One is the adoption of a low-tech solution. For publishers already using the 'Drive D' approach (that is, the file system), a thin layer of intelligence over that can add a lot of utility for very little cost. One approach I have seen adopted is a simple intranet application to give user access and automation to a number of production tasks. The user was presented with a reasonably nice looking web front end; behind the scenes a number of Perl scripts did the work - unzipping, parsing, ftp'ing, email alerting, etc. If things went wrong, then the content was all there on the file system, so manual intervention could always correct matters when the low-tech automated solution failed, and the system could be supported and maintained in-house without specialist skills.

Another option is the 'self-build' CMS, in which a publisher develops their own system. While this approach has the advantage of producing a system that, if properly specified, can be the best fit for a particular publisher's activities, the disadvantage is that it will commit the publisher to a major software development project. Such major projects are notoriously difficult to bring off successfully, even for companies who do nothing but software development - for a publisher the risks are higher still.

In part, the riskiness of the self-build approach may lead some publishers to buy a commercial CMS - and indeed a number of the big publishers have between them a number of installations of commercial CMSs. Because of the risk and expense of the software development required for customizing such systems, an idea that has gained some currency within publishing is that it is easier to fit a company around a CMS than it is to customize a CMS to a company - publishers are, after all, well used to being flexible after the seismic upheavals in their industry over the last 20 years or so.

5. Horror Stories

Every industry and every technology has its share of horror stories. But the phenomenon of CMSs in publishing seems to generate a particularly rich crop.

Brandon Jockman's paper at XMLEurope 2003, Hunting XML CMS AntiPatterns - Found in the Wild [Jockman], is an entire paper devoted to cataloguing some of the technical, architectural and commercial shortcomings of some existing XML CMSs. As further evidence of such shortcomings I offer these stories...

The first is perhaps apocryphal, but offers a powerful parable for all intersections of publishing and technology. It concerns the demonstration, in the 1970s, of one of the first computer typesetting systems. The demonstration was given by a US vendor in Paris. Some sample content was provided, some sample pages set, and proudly offered for inspection. The French audience objected - 'where are the accents?'. This query sent the programming team into a hurried conference before they emerged with the question, 'can't you do without the accents?'

The message of this tale is far from being purely historical - even today publishers grapple with the shortcomings of systems for storing and rendering non-ASCII characters - but behind the particular problem lies, I think, a general principle, that the type of problems that characterize many publishing activities are not the kind of problems that interest technologists. There are some notable exceptions to this principle (Don Knuth, for example), but in general the technologist more interested in, say, character spacing than in 3D graphics rendering, is rare indeed.

The other stories are true ...

The 38 day export. This concerns the use of a well-known commercial CMS for storing a multi-volume reference work. The system dictated that all the files be stored as RTF; XML was only an export format. As the publication deadline approached, the time came to export the whole encyclopedia as XML. 38 days later, it was finished.

Self-nesting elements? Another reference publisher had just gone to the time and expense of commissioning a shiny new DTD to model its content. Everyone was pleased with the lean and clean design which featured a nestable <section> element, making content extraction and re-combination so much easier. As the CMS vendors customized the system a distraught query came back - the system didn't support elements that nested inside themselves - couldn't the publisher do without it?

Too many zeroes. This is the tale of the publishing director who had been given a vendor presentation of a new CMS. Although he needed to purchase something he felt too embarrassed to call the CMS sales team back, as the only way the conversation could progress, he said, was if their opening statement was 'sorry - we accidentally added an extra zero to the price'.

Put another way, the overall message of these stories might be that publishers want systems that perform ten times better, but at a tenth of the price, than what is currently available.

6. A CMS Software Cost Model (1)

The cost problem faced by publishers seeking to acquire a CMS can be seen as being composed of two complementary problems. One is the fixed cost of the product licenses, and any associated hardware and software. The other is the variable cost of customization, maintenance and support.

The conundrum publishers face is that the two cost components tend to offset each other. Highly-capable systems that service their demanding requirements 'out of the box' have high fixed costs. The 'self-build' option might be initially cheap, but the system development costs are high. It is difficult to know which option will be less expensive, and so in general many buyers either stomach the high fixed costs of high end systems, or do not buy at all rather than risk the self-build approach.

7. An Operating System Cost Model

To look away from publishing for a moment, an analogue for this cost model can be found in the operating system market.

At one extreme it is possible to buy more expensive commercial operating systems that have high user convenience and proven capabilities. At the other it is possible to ftp an entire enterprise operating system free (Linux), and get the same functionality if you are prepared for more effort getting the system established, and the lack of commercial backing. By and large though (despite what Linux enthusiasts will tell you), there is no compelling overall cost case for adopting Linux over, say, Windows or Solaris as an enterprise operating system.

The breakthrough change in this model has come from the growth of Linux distributions. A distribution is not quite free, but it is low cost - and the assembler of the Linux distribution has selected its components, tested interoperability, and wrapped the components with user friendly installation applications. Manuals are provided and support can be purchased. The result is that now a purchaser can break free from the cost conundrum when buying an operating system.

8. A CMS Software Cost Model (2)

My prediction is that this model will emerge in the CMS market too. Packages will emerge that combine open source components in such a way that they offer big-hitting CMS functionality cheaper than all-out commercial systems, but with less risk and expense than a self-assembled system.

I am so convinced of this (and by way of a declaration of interest), that this is where I am positioning my own company in expectation of this market shift.

9. The Technical Model

The technical model posits a number of components. Like Linux operating system distributions, in this model a number of open source components are selected, tested for interoperability, and combined, perhaps with an integrating software layer to give the components a coherent look and feel. In addition, support and documentation may be provided.

To meet the particular requirements we have identified for publishers, the components will necessarily be wide ranging in their functionality. Some sort of storage is needed, obviously, to store content; some form of validation framework is required for validating content; and multiple users and their actions must be tracked to have any hope of modelling workflows adequately.

The phenomenon that allows software systems to interoperate is standards conformance. The XML world has defined a number of standard APIs and so the expectation is that by adopting these generic APIs, XML software can be truly 'plug and play' - in practical terms this should mean that it will not matter whether you prefer, say, Saxon or Xalan as your XSLT engine - it is the fact that both systems comply with the TrAX API[TrAX] that makes this irrelevant.
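
As an illustration of this kind of engine independence, the following is a minimal sketch that uses only the standard javax.xml.transform interfaces, so that no particular XSLT engine is named in the code. The class name TraxSketch and the sample stylesheet are hypothetical and chosen purely for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TraxSketch {
    // Hypothetical sample stylesheet: outputs the text of /article/title
    static final String TITLE_ONLY =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:value-of select='/article/title'/></xsl:template>"
      + "</xsl:stylesheet>";

    // Apply a stylesheet to an XML string through the generic interfaces only
    static String transform(String xslt, String xml) {
        try {
            // The factory lookup returns whichever XSLT engine is configured
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            transform(TITLE_ONLY, "<article><title>Hello</title></article>"));
    }
}
```

Because the code depends only on the factory lookup, a different engine can in principle be substituted by changing the classpath or the javax.xml.transform.TransformerFactory system property, without touching the calling code.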

10. Pro and Contra Open Source

Not all open source software is good. Looking at an open source hub like sourceforge.net it is apparent that for every project that is well-conceived and successful, a large number are stalled. This is all a natural (and perhaps desirable) phenomenon of the open source development model, in which a sort of principle of natural selection applies.

Fortunately, for the application we have in mind some excellent open source software is available. Web serving and XML processing software, in particular, have open source implementations that are the most conformant and best performing of all. And good open source storage technologies exist.

When selecting open source components, however, licensing can play an important role. Some of the more viral licensing agreements, like the full GPL, tend - by design - to restrict how vendors can commercialize such software, since derived works must be distributed under the same terms. In practice the more liberal Apache license is likely to be of more interest to those exploiting open source software commercially.

11. Platform/Development Language

When developing or selecting an open source CMS solution, the question of which platform and development language to use is important. The chosen solution must offer reasonable performance, adequately support the processes a CMS is required to support, and must be proven and supportable.

There are some unavoidable difficulties inherent in the task of developing any useful CMS. In particular the need for multi-user access to the limited resources at the heart of any CMS (for example storage) means in practice that the underlying design should be multi-threaded.

This fact alone might promote Java as the number one choice of programming language, but a number of other factors reinforce the case for Java: it is OS-agnostic, freely available, stable, reasonably well-performing and has a natural affinity with XML. Many open source components are themselves developed in Java, making a Java-based solution the natural choice.
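
The point about multi-user access to a shared resource can be sketched in Java as follows. This is only an illustration under simple assumptions: the class SharedContentStore is hypothetical, content is held in memory rather than in a real store, and a read-write lock is just one of several ways of coordinating access.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedContentStore {
    private final Map<String, String> documents = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Many readers may hold the read lock at the same time
    String read(String id) {
        lock.readLock().lock();
        try {
            return documents.get(id);
        } finally {
            lock.readLock().unlock();
        }
    }

    // A writer holds the lock exclusively, so updates do not interleave
    void write(String id, String content) {
        lock.writeLock().lock();
        try {
            documents.put(id, content);
        } finally {
            lock.writeLock().unlock();
        }
    }

    int size() {
        lock.readLock().lock();
        try {
            return documents.size();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Simulate several users each saving one document concurrently
    static int concurrentWrites(int users) {
        SharedContentStore store = new SharedContentStore();
        Thread[] threads = new Thread[users];
        for (int i = 0; i < users; i++) {
            final String id = "doc-" + i;
            threads[i] = new Thread(() -> store.write(id, "<doc/>"));
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for writers", e);
        }
        return store.size();
    }

    public static void main(String[] args) {
        System.out.println(concurrentWrites(8) + " documents stored");
    }
}
```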

12. Component-based Software

The idea of component-based software has been buzzing for several years now. Technologies like COM and CORBA, and more recently Web Services, have made possible the notion of a kind of software Lego, in which small modular components are assembled into larger applications.

To a large degree the component software revolution has been a failure in this respect, much like the object revolution that preceded it. In practice it has proved as difficult to assemble a large application from small components, as it was to assemble it by combining the classes from libraries using object-oriented coding techniques.

Interestingly it tends to have been standards, like SQL and the various XML APIs, which have made components interoperable - not the underlying technologies like COM and CORBA.

In one respect, however, component based software has been a success, and this is where the components are big. Butler Lampson makes the case that 'big components work ... they are so huge that you only use 3 of them'[Lampson], and, by a happy coincidence, an open source CMS solution for publishers can, I believe, be assembled from such 'big' components: a database, an XML suite, and a web server - all else is glue to choreograph these components into offering useful CMS functionality.

13. XML ...

The storage of XML presents some special difficulties for any CMS implementation.

As identified earlier, publishers care about XML data quality in the broadest sense, and so any publishing CMS must offer strong support for ensuring the validity of XML content. Often checking content against a DTD or Schema will not be enough - the CMS must expose some way to express business rules that can be used to check that XML content is correct in every required way.
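
To give a flavour of what such a business rule might look like, here is a minimal sketch that expresses one rule as an XPath expression and counts the places where content breaks it. The rule itself (every section must have a non-empty title), the element names and the class name BusinessRuleCheck are all hypothetical; the sketch uses the XPath API bundled with later Java releases rather than any particular rule language.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class BusinessRuleCheck {
    // Hypothetical rule: every <section> must have a non-empty <title> child.
    // The expression counts the sections that break the rule.
    private static final String RULE =
        "count(//section[not(normalize-space(title))])";

    static int violations(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            Double count = (Double) xpath.evaluate(
                RULE, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
            return count.intValue();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate rule", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><section><title>Intro</title></section>"
                   + "<section><para/></section></doc>";
        System.out.println(violations(xml) + " section(s) without a title");
    }
}
```

A rule of this kind can hold for content that is perfectly valid against its DTD, which is why it has to be checked separately.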

There is a certain minimal level of functionality necessary for a CMS to offer in its XML handling - and for publishing applications the bar is set high. Full Unicode support, support for deeply nested content, and support for large and complex data models will all be required for any moderately involved publishing activity.

14. ... And Not XML

By volume, however, most publisher content is not XML. It is graphical. Any publishing CMS must be as comfortable storing TIFF, PDF, etc. files as it is storing XML.

Data quality issues affect these non-XML files as much as they affect XML. A corrupt graphic can be as damaging to a production process as invalid XML. Again, it is the role of the CMS to allow rules to be expressed that can check such things, and alert users if necessary.
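
One very simple rule of this kind is to check that a file really begins with the signature bytes of the format it claims to be. The sketch below does this for PDF and TIFF; the class name AssetSniffer is hypothetical, and a header check of this sort only catches mislabelled or badly truncated files, not corruption deeper in the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AssetSniffer {
    // True if the data begins with the given signature bytes
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // PDF files begin with the ASCII characters "%PDF-"
    static boolean looksLikePdf(byte[] data) {
        return startsWith(data, "%PDF-".getBytes(StandardCharsets.US_ASCII));
    }

    // TIFF files begin with "II" then 42, 0 (little-endian)
    // or "MM" then 0, 42 (big-endian)
    static boolean looksLikeTiff(byte[] data) {
        return startsWith(data, new byte[] {'I', 'I', 42, 0})
            || startsWith(data, new byte[] {'M', 'M', 0, 42});
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println("PDF? " + looksLikePdf(sample));
        System.out.println("TIFF? " + looksLikeTiff(sample));
    }
}
```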

15. Storage Models

The term 'native XML storage' is perhaps as meaningless as the term 'CMS', though it enjoys much currency in the promotional material of some CMS vendors. What can this mean? Is the XML in such systems byte-for-byte identical to the XML that was stored? And if so, is this a good thing?

The storage model used for storing XML has excited much debate. Several broadly opposing schools of thought use different database technologies as the underlying store for XML content.

Relational database adherents prefer systems that decompose the XML into database tables.

Object database adherents prefer systems that decompose XML into hierarchical structures that mirror object maps such as are typically stored in such databases.

Still another approach is to keep the XML intact in BLOBs and use some form of stand-off indexing to make it amenable to high-speed access.

In practice these implementation details should not be apparent to users of a CMS. If (to refer back to one of our horror stories), a certain relational database approach means content with nested elements of the same name cannot be stored, then something is wrong.
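
To show that decomposition into rows need not rule out self-nesting elements, here is a minimal sketch in which each element becomes a row that simply points at its parent row. The row layout (id, parent id, element name) and the class name ElementRows are assumptions made for illustration; text content and attributes are ignored, and no claim is made that any real product works this way.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementRows {
    // Flatten the element tree into rows of the form "id|parentId|name".
    // The root element is given parentId 0.
    static List<String> toRows(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            List<String> rows = new ArrayList<>();
            walk(root, 0, rows);
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk; an element nested inside one of the same name is
    // just another row whose parentId refers to the enclosing row
    private static void walk(Element element, int parentId, List<String> rows) {
        int id = rows.size() + 1;
        rows.add(id + "|" + parentId + "|" + element.getTagName());
        for (Node child = element.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child instanceof Element) {
                walk((Element) child, id, rows);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<section><title/><section><section/></section></section>";
        for (String row : toRows(xml)) {
            System.out.println(row);
        }
    }
}
```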

16. Deployment

Publishers prefer to avoid desktop installations if at all possible. Bespoke client applications inevitably come with support costs, and the universality of the web browser as a desktop client has - by and large - obviated the need for such bespoke clients.

Another happy side effect of the use of web clients is that remote deployment becomes much easier. Home workers and new suppliers - no matter where they are - only need Internet connectivity to have access to a web-hosted CMS.

17. Conclusion

In conclusion, in the varied world of the CMS, publishers stand out as having particularly demanding requirements. Yet many existing systems are unable to fulfil their requirements at a price that is right.

Historically, self-building a CMS may not have been an attractive option, as it could prove as costly as purchasing an equivalent system from a vendor. Yet the likely shift in the market, whereby distributions of open source software become available, may break the existing model and allow publishers to acquire CMSs for less.

Technically, such distributions will be centred around open source storage (databases), XML software (parsers and XSLT engines), and a web platform (typically, Java-based).

Commercially, they should bring about a shake-up in the CMS marketplace and the happy situation in which publishers benefit from technology that has a real application in their industry.

Bibliography

[Documentum] Documentum. http://www.documentum.com/

[IJ] Intranet Journal Web site 'Tools: Content Management' page. http://www.intranetjournal.com/tools/cm/

[Jockman] Jockman, Brandon, Hunting XML CMS AntiPatterns - Found in the Wild. http://www.idealliance.org/papers/dx_xmle03/papers/03-01-08/03-01-08.html

[Lampson] Lampson, Butler, How Software Components Grew Up and Conquered the World. http://research.microsoft.com/lampson/ReusableComponentsAbstract.htm

[Mason] Mason. http://www.masonhq.com/

Biography

Alex first became interested in structured markup when analyzing literary texts for his Ph.D. (on Shakespeare editions) in the late 1980s. Following the award of his Ph.D. his interest grew to such an extent that he aborted a career as an English Literature lecturer and moved into the software industry, where he spent four interesting years working on a heavily object-oriented C++ application framework for cross-platform multimedia products, at the height of the CD-ROM boom. In 1997 Alex was one of the founding directors of Griffin Brown Digital Publishing Ltd, a company which provides XML-based components and tools - chiefly to the publishing industry. He is responsible for leading the company's XML consulting and implementation, and his work includes advising clients on XML/IT strategy, mentoring clients' staff, writing DTDs and Schemas, and designing and developing XML software systems in Java, C++ and other languages. In 2002, Alex was invited to join the British Standards Institution (BSI) Technical Committee IST/41, where he contributes to ISO/IEC JTC1/SC34 in its formation of the DSDL ISO standard, among other things. Alex writes and speaks regularly on structured markup technologies and their application to publishing and document processing.