Daniel, Ron, Jr.
Standards Architect, Interwoven, Inc.
California
U.S.A.
Email: rdaniel@interwoven.com
Ron Daniel Jr. is a Standards Architect at Interwoven, and an active member of the XML and metadata standardization communities. He chairs the PRISM metadata working group and is a member of the RDF Core working group. In the past he has chaired the XML Linking Interest Group, and co-edited numerous specifications including XPointer, PRISM, three IETF RFCs, and the first two Dublin Core reports.
Before joining Interwoven, Ron was a Senior Information Scientist at Metacode, a startup specializing in metadata and taxonomies. Prior to that, Ron was a technical staff member at Los Alamos National Laboratory, where he worked on a variety of projects focused on the lab's need for a large-scale, long-duration, information infrastructure. He received his Ph.D. in Electrical Engineering from Oklahoma State University, and was a postdoctoral researcher at Cambridge University and Los Alamos.
Burman, Linda A.
CEO, L. A. Burman Associates Inc.
Ontario
Canada
Email: linda@laburman.com
Linda Burman is President and CEO of L. A. Burman Associates Inc., a consulting company whose services include the formation and facilitation of industry standards groups, industry analysis, business development, and due diligence for potential investors. She is the founder of the PRISM (Publishing Requirements for Industry Standard Metadata) Working Group and co-author of Mastering XML.
With the exception of the past year when she took a tour of duty in the dot.com world as Vice President of Standards and Evangelism at Kinecta Inc, a company developing ICE-based software, she has been developing her consulting business since 1995. Directly prior to starting her own company, Ms. Burman was director of worldwide marketing at SoftQuad Inc., and before that, she was the publishing evangelist at Apple Computer. Previously, she held business and technical positions in the data communications and publishing industries.
Ms. Burman is recognized as an industry expert, is an advisor to IDEAlliance and also sits on the advisory councils of Foundry Ventures and of the Baycrest Hospital for Geriatric Care's Day Care program.
Defining metadata as "data about data" hides more than it reveals. The definition, while perfectly true, is too general to be of much use. Many different groups can, and do, lay claim to the term. To a statistician, metadata means information about a set of measurements. How were the samples prepared? How were the recording instruments calibrated? To a database administrator, metadata means data about the information in a database application. What are the various tables? What are the columns and their datatypes? How are entries related to one another? A third group are the managers of very large data storage systems. For them, metadata means information about how a large dataset was split across multiple tapes. Which tapes are online? What part of the dataset is currently on disc and what part is not?
While these are all legitimate meanings of the word, none of them are what we are talking about in this paper. In our view, metadata means descriptive information about the content being created, managed, and distributed throughout an organization's content infrastructure, or in a commercial venture such as magazine or reference publishing.
Now why is that an earth-shattering concept?
Let's look at the corporate organization. The bulk of the content in a large enterprise is intended for people. Press releases, PowerPoint presentations, sales brochures, white papers, product repair manuals, and developer documentation are just a few of the different types of content being produced. All of these are intended for a person to see, understand, and act upon. But the flood of information pouring into an organization's websites is too much for a person to manage. We need our machines to filter out the irrelevant and highlight the key pieces of content needed in any situation.
There is only one problem. Filtering out the irrelevant noise requires an understanding of the content, and the current context of its use. Machines have historically been very bad at understanding content intended for people. Inferring the purpose, context, and audience for any particular item in a repository is a difficult, error-prone, and computationally intensive process. Even now, machines are oblivious to metaphors and irony, and have a lot of problems with negatives. Fortunately, most business documents are meant to clearly communicate a point, so the use of those literary devices is minimized. Nevertheless, rather than making every application go through all the work of inferring the subject, context, and audience of every asset, it makes more sense to do it once and store the results in a fashion that is fast and easy to process. That extra data about the human-readable content, which is easy for machines to use, is what we mean by descriptive metadata.
Most organizations are not publishers, but do perform a variety of publishing-like activities. To carry out those activities, they have established, and are extending, an infrastructure for the creation, management, and release of content. Across such a content infrastructure, there are many places where additional automation can help people do a better job, at lower cost, and for greater rewards. To automate those processes will require software that can act as if it understands the content to some degree. So, rather than being a purely academic concern, metadata is key to the next generation of productivity improvements in handling content. We cannot see all the opportunities for increased automation, but there are a number of tasks customers would like to address now. Better search is the most common. Better personalization and browsing are also major areas for use. All of those are primarily seen as uses for customers outside the organization. There are many uses inside as well. Faster and more complete research, tracking of usage rights and permissions, and monitoring editorial trends are only a few of those internal uses.
All of these considerations and benefits also apply to commercial and reference publishing. However, in these cases, metadata not only creates efficiencies that reduce costs but can also provide the support for new business opportunities.
Having decided that a metadata system is needed, how does one go about implementing it? While the details differ from one project to the next, there are some overall steps that apply to many efforts. Those steps require you to answer, in order, the following questions:
How will you get that data? What will it cost to buy or create? What data standards should you follow?
How should you store and operate on the content and metadata?
What are the additional opportunities for exploiting that data?
The first and most important step is to know what problem you are trying to solve. While essentially every business problem can be abstracted to either cutting costs or growing revenues, that abstraction does not help one decide what to do.
Focus on specifics. Software development is the opposite of archery; the smaller the bull's-eye, the easier it is to hit. Groups should consider their individual situations. Is there something in the current work process that is killing them with repetitive manual effort, which they need to automate? What could a publisher offer to advertisers to help them reach their specific target audiences, taking the unique characteristics of the online medium into account? How can a central IT group improve searching on corporate web sites they do not control, and do so at a minimal cost? Knowing the problem is the first step, and its importance cannot be overemphasized. Trying to figure out what a system should do and tweaking it while it is being developed is a very slow and expensive proposition.
Of course, it is very easy to say that you need a laser-sharp focus on the problem to be solved. Unfortunately, achieving that focus is difficult. Most of us have general goals, such as growing revenues and/or cutting costs, and have to figure out what project to do next to achieve those goals.
One way to focus on a problem is through a simple thought experiment. Ask the stakeholders and implementation team to imagine that all the metadata they could possibly want has magically been applied to their content. What would they do first? Is better search the #1 priority? Better personalization? A syndication and redistribution strategy? Niche products? Or is the job to reduce repetitive and error-prone operations?
Seed their imaginations with some examples and suggestions. A publisher, for example, might want to increase ad revenue. Tagging the subject of a story so that only the most relevant ads will be shown is one way metadata might be put to use. Assuming that is of interest, the metadata implementation team will need to know how to tag the ads and the stories. They will also need a good way for advertisers to pick the slices they want. Always keep in mind that there is more than one way to solve most problems. Ad insertion will not be fully automated for a very long time. Machines have a hard time deciding if a story is a good or a bad mention of a subject. For example, a wine ad next to a restaurant review could be good. Running it next to an article on anti-drunk driving efforts would be bad.
Our experience has been that it is better to tag according to multiple 'facets' than to rely on one, all-encompassing taxonomy. In other words, make provisions to tag articles as being about particular people, companies, movies and shows, places, events, industries, and so on, as well as less easily segregated subjects. Advertisers can then mix and match those criteria to get what they want. A knock-on benefit is that websites could also use that information in alerting regular readers about new information, or in personalizing the pages to be shown to them.
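As a minimal sketch of how faceted tags let advertisers mix and match criteria, consider the wine advertisement example above. The facet names, the sample stories, and the `matches` helper are all illustrative assumptions, not part of any product or standard. An exclusion list is also a much cruder test than an editor's judgment about whether a mention is good or bad.

```python
# Illustrative only: facet names and sample stories are assumptions.
stories = {
    "review-101": {"place": {"Toronto"}, "industry": {"restaurants"}, "subject": {"dining"}},
    "news-202": {"place": {"Toronto"}, "industry": {"automotive"}, "subject": {"drunk driving"}},
}

def matches(facets, require=None, exclude=None):
    """Return True if a story has every required facet value and no excluded one."""
    require = require or {}
    exclude = exclude or {}
    for facet, values in require.items():
        # Every required value must be present under that facet.
        if not values <= facets.get(facet, set()):
            return False
    for facet, values in exclude.items():
        # Any overlap with an excluded value disqualifies the story.
        if values & facets.get(facet, set()):
            return False
    return True

# A hypothetical wine advertiser: restaurant coverage, but never drunk-driving stories.
wine_ad = {"require": {"industry": {"restaurants"}}, "exclude": {"subject": {"drunk driving"}}}
eligible = sorted(sid for sid, f in stories.items() if matches(f, **wine_ad))
print(eligible)
```

Because each facet is tested independently, a new facet can be added later without disturbing the criteria advertisers have already defined.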
A second task that a publisher might want to tackle is how to generate more ad impressions by getting readers to click on more pages within the websites of the magazines. If an article, such as a piece on a TV show, came into the website all tagged up, the site could dynamically generate sidebars of links to archived stories such as:
You might use who/what/where/when/why/how as a means for thinking about what kinds of sidebars to generate. These sidebars could become more and more in tune with an individual reader's needs by examining the facets on the articles the reader has been looking at to see which factors are common. (Caveat: such personalization software is not an off-the-shelf item at this time.)
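The sidebar idea can be sketched in a few lines of Python. The archive contents, the facet names, and both helper functions are hypothetical; the sketch only shows that shared facet values are enough to group related stories and to tally what a reader tends to look at.

```python
from collections import Counter

# Hypothetical archive: story id -> facet -> set of values.
archive = {
    "a1": {"person": {"Jane Doe"}, "show": {"Example Show"}},
    "a2": {"person": {"Jane Doe"}, "place": {"Chicago"}},
    "a3": {"show": {"Example Show"}, "place": {"Chicago"}},
    "a4": {"person": {"John Roe"}},
}

def sidebars(current, archive, per_facet=3):
    """Group archived stories by the facet values they share with the current story."""
    result = {}
    for facet, values in current.items():
        for value in sorted(values):
            hits = [sid for sid, f in sorted(archive.items()) if value in f.get(facet, set())]
            if hits:
                result[(facet, value)] = hits[:per_facet]
    return result

def reader_interests(history, archive):
    """Count the facet values across the stories a reader has viewed."""
    counts = Counter()
    for sid in history:
        for facet, values in archive[sid].items():
            counts.update((facet, v) for v in values)
    return counts

current_story = {"person": {"Jane Doe"}, "show": {"Example Show"}}
print(sidebars(current_story, archive))
print(reader_interests(["a2", "a3"], archive).most_common(1))
```

A real personalization system would need far more than a frequency count, but the count shows where the common factors would come from.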
A third task is similar to the second - to keep people looking at articles from the magazine's archives (or archives of partner magazines) by making links from within the story itself. For example, we might link from the first mention of an actor on the show to a profile on the actor. The profile, in turn, has links to lots of related content. The point here is that 'metadata' does not have to be just author/title/subject information about the whole article. Any sort of data about the content, or portions of the content, is fair game. Fragments of the stories themselves can be marked. Companies, people, places, and things are starting points. It would also be possible to sell ads that are really link destinations. For example, a sports web site might link from a mention of a sports team to the team's box office and apparel store. However, editorial groups will probably want tight control over such in-content links so they can preserve the audience's trust. I know I would. :-) So links to other editorial content, or items that clearly look like ads, seem safest.
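Linking from the first mention of a name might look like the following sketch. The profile table and the `link_first_mentions` function are assumptions made for illustration. The input is assumed to be plain text; marked-up stories would need a real parser rather than a regular expression.

```python
import re

# Hypothetical lookup of entity names to profile pages.
profiles = {
    "Jane Doe": "/profiles/jane-doe",
    "Example Show": "/shows/example-show",
}

def link_first_mentions(text, profiles):
    """Wrap only the first mention of each known name in a link to its profile."""
    if not profiles:
        return text
    # Try longer names first so a short name cannot pre-empt a longer one.
    names = sorted(profiles, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    seen = set()

    def replace(match):
        name = match.group(1)
        if name in seen:
            return name  # later mentions stay as plain text
        seen.add(name)
        return '<a href="{}">{}</a>'.format(profiles[name], name)

    return pattern.sub(replace, text)

story = "Jane Doe stars in Example Show. Jane Doe also directs."
print(link_first_mentions(story, profiles))
```

Keeping the lookup table under editorial control is one way to preserve the tight control over in-content links discussed above.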
A fourth task would be to use metadata to reduce the time involved with rights tracking by eliminating some of the manual searches. There are many possibilities.
The point of the thought experiment is to get people thinking about things they would like to do if there were no constraints on the additional data available that machines would need to automate the tasks.
Then, once you've determined where metadata would add value, you need to prioritize according to your company's business imperatives.
Once some desired functions have been identified, what kinds of metadata must be created to enable them?
For the sorts of functions mentioned above, you will need to be able to tag new content according to the selected facets. You will probably also want to tag archival content so that there will be links into the archives. To do that tagging, the most cost-effective facets must be chosen. While it is fun in the brainstorming process to imagine a stork dropping off little bundles of metadata joy, you can't rely on that happening. So think a little about which fields are really needed and which are just nice to have. In addition to determining the facets, you also need to determine where values will come from. For example, if you have decided to mark the company names in some content, where will you get the list of companies to match against? Many organizations already have lists of companies - their customer and supplier databases if nowhere else. Using those databases has the advantage that no extra work has to be done to keep the information up to date.
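To make the question of where values come from concrete, here is a small sketch that tags company names by matching text against a list. The company names, the aliases, and the `tag_companies` function are invented for illustration; in practice the list might be exported from a customer or supplier database.

```python
import re

# Hypothetical extract from a customer database: canonical name -> known aliases.
companies = {
    "Example Widgets Inc.": ["Example Widgets", "ExWidgets"],
    "Sample Motors Ltd.": ["Sample Motors"],
}

def tag_companies(text, companies):
    """Return the sorted canonical names of companies mentioned in the text."""
    found = set()
    for canonical, aliases in companies.items():
        for name in [canonical] + aliases:
            # Lookarounds rather than \b, because names may end in punctuation.
            if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", text, re.IGNORECASE):
                found.add(canonical)
                break
    return sorted(found)

print(tag_companies("ExWidgets signed a supply deal with Sample Motors.", companies))
```

Recording the canonical name rather than the alias keeps the tags stable even when the list of aliases grows.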
Having selected the fields to fill and the values to put in them, you then have to figure out how to do it. At this point, you have to get very serious about operational realities. How will you get the data you need? What parts will you buy, what parts will you create through software, and what parts must be created through human involvement? For example, if you want to mark company names, consider licensing a list from a mailing list vendor or other supplier rather than creating your own. Keeping such a list current as companies are created, merge, split, move, and go out of business is a big job.
Don't expect a simple answer to this question. Instead, you will have to make a cost-benefit tradeoff between the effort needed to create the metadata and the accuracy needed for the task.
Many of us have the dream of being able to use a program to automatically add completely accurate metadata to our content. That is, of course, a fantasy. While there are programs which will create metadata automatically, they do make errors. No surprise, people make errors as well. Given a collection of about 100 books, trained library catalogers will give them the same metadata about 80% of the time. The other 20% of the time, the books will be described differently. With considerable effort, the error rates for machine processing can approach those of trained professionals. However, there is a qualitative difference between the results. The errors made by machines are very different from the errors people make. When we look at the results of human catalogers, we can typically understand why they categorized something the way they did, even if we think it should have been done differently. When machines make an error, they are likely to categorize things in ways no sane person would consider. Since 'sanity' is not a concept that applies to machines, this is not surprising. Another way of understanding this is to consider spell checking. On its surface, spell checking seems easy. Given a list of words, go through a story and make sure all its words are on the list. Any that are not on the list should be fixed up so they are. Of course, we all know better. None of us would dream of turning a spelling checker loose on our content without some human supervision.
At the same time, you probably can't afford to have trained professionals manually tag all your content. Depending on the data quality you need to solve your problem, a completely automatic approach might be acceptable. Or you may want to use a semi-automatic tagging approach, where a human reviews the machine's suggestions.
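One way to picture the semi-automatic approach is a triage step that routes machine suggestions by confidence. Everything in this sketch is assumed for illustration: the suggestion format, the confidence scores, and the two thresholds, which would have to be tuned against the data quality a given problem requires.

```python
# Hypothetical tagger output: (story id, suggested tag, confidence between 0 and 1).
suggestions = [
    ("s1", "company:Example Widgets Inc.", 0.97),
    ("s1", "subject:mergers", 0.55),
    ("s2", "place:Toronto", 0.91),
    ("s2", "subject:irony", 0.20),
]

def triage(suggestions, accept_at=0.9, reject_below=0.3):
    """Split suggestions into auto-accepted, human-review, and discarded groups."""
    accepted, review, discarded = [], [], []
    for story, tag, confidence in suggestions:
        if confidence >= accept_at:
            accepted.append((story, tag))
        elif confidence < reject_below:
            discarded.append((story, tag))
        else:
            review.append((story, tag))  # a person confirms or corrects these
    return accepted, review, discarded

accepted, review, discarded = triage(suggestions)
print(len(accepted), "accepted,", len(review), "for review,", len(discarded), "discarded")
```

Setting `accept_at` to 0 makes the process fully automatic, while raising it above 1 sends every surviving suggestion to a reviewer, so the same structure covers both ends of the cost and accuracy tradeoff.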
Keep in mind that adding metadata is not a one-shot process. Your applications will have to deal with new content all the time, and the underlying vocabularies will change as well. So consider how this system will be maintained and operated in the future. Five years might be a good planning horizon. When you think about this, also consider whether some of the metadata could be created earlier in the editorial process: at the time when photos are taken and stories are written, or even earlier, when story assignments are made. Early creation of metadata does not have to be the incredibly odious task it might at first seem to the editorial group. Products are evolving which will make it possible for metadata to be added mostly automatically during content creation. This process could also ensure that corporate editorial standards are followed, which would be an editorial benefit on its own, as well as opening up immense opportunities for putting that information to use later. Images and other media could also be tagged at creation. For example, imagine a digital camera with a plug-in card that holds the appropriate metadata for a shoot. Many products are beginning to support metadata frameworks and standards. The process of adding metadata should become much less burdensome as we move away from wholesale metadata creation and into a mode of successive enhancement.
Now that you know what information you need and have some ideas about what you would do with your metadata, you need to decide which metadata standards you want to follow. There are a number of standards you might use. One we recommend you consider is PRISM (Publishing Requirements for Industry Standard Metadata, www.prismstandard.org). It provides elements for basic author/title/subject information. It has elements for marking the names of people, places, and things, either inline or as an indicator of a subject of the article as a whole. Those elements can be used when generating the dynamic sidebars mentioned earlier. There are elements that link a story to related articles. There are elements to indicate what type of article this is: an interview, a press release, a profile, a financial statement, etc. There are also elements to facilitate tracking the rights and permissions on content. We recommend the PRISM specification, not only because we helped develop it, but because we believe it solves a real problem while remaining technically feasible. It recommends the use of many existing standards, and adds only the new elements needed by common publishing scenarios.
No matter which standards you find, it is quite likely that you will want to add a few things and not use some others. That's to be expected. Try to find other standards where those additional elements are defined. There are a multitude of vertical standards such as XBRL for financial publishing that can extend the horizontal metadata supplied by PRISM. Mixing and matching elements from various specifications in order to meet the needs of a specific application was a prime motivation for the XML namespaces specification. Since many of the new standards are being defined in XML, the outlook for such mixing is rosy.
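The mixing that XML namespaces allow can be sketched with Python's standard `xml.etree.ElementTree` module. The Dublin Core namespace URI is the real one for the Dublin Core element set. The `record` wrapper element, the `http://example.com/ns/house-terms` namespace, and its `desk` element are hypothetical, and the sketch does not attempt to reproduce PRISM's own element set or syntax.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element set
EX = "http://example.com/ns/house-terms"  # hypothetical in-house extension

ET.register_namespace("dc", DC)
ET.register_namespace("ex", EX)

def describe(title, creator, subject, desk):
    """Build one metadata record that mixes elements from two namespaces."""
    record = ET.Element("record")
    ET.SubElement(record, "{%s}title" % DC).text = title
    ET.SubElement(record, "{%s}creator" % DC).text = creator
    ET.SubElement(record, "{%s}subject" % DC).text = subject
    ET.SubElement(record, "{%s}desk" % EX).text = desk  # not a standard element
    return record

record = describe("Sample Story", "A. Writer", "restaurants", "Lifestyle")
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# A consumer that only understands Dublin Core can still read its part.
parsed = ET.fromstring(xml_text)
print(parsed.find("{%s}title" % DC).text)
```

Because each element is qualified by its namespace, software that knows only one vocabulary can skip the elements it does not recognize instead of failing on them.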
Choosing a storage system is not the first step in an implementation plan. Until you have determined what you need to do, what your priorities are, and what kind of content you're going to operate on, you can't possibly make the right decision. But once you have figured those things out, you probably want to actually do something with the content and metadata. The characteristics of the problem you solve will be a major factor in determining the storage system you get. Do you need to combine multiple types of rich media content in a production process for a relatively small number of users? Such a scenario would be best served by a Digital Asset Management system. If the production group is targeting the web as their channel, the web site testing requirements would indicate the need for a dedicated web content management system. If you are going to be serving massive numbers of catalog search requests, you need a heavy-duty database workhorse. Do you need to access tagged content at a very granular level? In that case you might need an object-oriented database solution. Or perhaps you need a combination of all of the above tied together by a unified metadata strategy. And don't forget to think about how to back up and restore all that data.
Now that you've completed your implementation and satisfied your highest priorities, kick back and relax for a bit while you look for "knock-on" opportunities. You've already marked up your content with metadata such as company names to help people easily find what they need. Now you can use that content for other projects that may not be at the top of your list but that can be implemented essentially for free. For instance, you may have content that originated in print and has been abstracted to send to a WAP-enabled cell phone. Cell phones are not the only place with limited screen real estate. If your website has dynamically generated sidebars that link into the archives, you may want to take advantage of those condensed titles and descriptions for the limited space available in those sidebars. Or, as in the ad sales example above, you may be able to help a different department solve a real business problem in a very cost-effective way, since the original project has paid for itself. If you can think about these opportunities at the beginning, when you're trying to figure out what problems you're trying to solve, you may want to factor them into the decision about what problem to tackle first.
Once your implementation is accomplishing its task, and you have taken advantage of any quick additional projects, it will be time to start the process again. Focus on the next problem you can solve by applying rich metadata, then repeat the process.