Technical Writing and XML: Reconciling Editorial License with Structured Markup

Keywords: authoring, editor, markup, XML, publishing

Douglas Rudder
XML Analyst
Facts and Comparisons
St. Louis
Missouri
United States of America
drudder@drugfacts.com

Biography

Douglas Rudder is the XML Analyst at Facts and Comparisons. In this role, his primary responsibility is content analysis and SGML/XML DTD design, development, and management. In addition, he is involved in development and support of XML/SGML publishing technologies and often functions as the XML technical liaison with other WKH companies and outside customers. He has eleven years of experience in publishing with SGML/XML, both as a Technical Writer/Editor and as an SGML/XML developer for media-neutral information from which both print and electronic products are generated.


Abstract


In writing reference material, consistency of organization and presentation is key. If the same information is presented in a consistent order and style throughout the publication or information set, it enhances the readability and usability of the material for the consumer. Ease of use is vital. XML provides a means to assist in the standardization of reference material from both an organizational and a semantic/content-oriented standpoint. Standardization based on structure and content enhances the potential for reuse of the XML-tagged information for both print and electronic delivery.

But while there can be a strong relationship between the authoring and editing of content and structured markup, all too often conflicts arise between technical writers and DTD/schema designers and programmers. The perceived need for editorial license and creative freedom by many authors/editors clashes with the need for rigid structure to facilitate ease of programming for markup technologists and programmers. The disagreements are commonly between format and structure, looseness and rigidity, and are often more philosophical than practical.

So how do we close the gap between the focus on format and the need for structural, content-based tagging? The first step is to understand what technical writing is and the strong relationship between the concepts of technical writing and the purpose of semantic XML. The correlation between technical writing and structured markup needs to be clearly defined and emphasized to those involved in the process of content creation and delivery.

This presentation will address the relationship between structured markup and the authoring/editing of reference material, with discussion of potential conflicts and techniques for resolving those conflicts.


Table of Contents


1. Introduction
2. An Artifact of Mindset
3. Potential Mindset Issues
4. Technical Writers are Valuable Assets
5. Bridging the Conceptual Abyss
6. Technical Writing in a Nutshell
7. Content-based Markup in a Nutshell
8. The Relationship between Technical Writing and XML
9. Final Comments
Bibliography

1. Introduction

In writing reference material, consistency of organization and presentation is key. If the same information is presented in a consistent order and style throughout the publication or information set, it enhances the readability and usability of the material for the consumer. Reference materials may include such documents as encyclopedias, dictionaries, parts supplies lists, maintenance manuals, computer manuals, and drug information texts, among others. Ease of use is vital. XML provides a means to assist in the standardization of reference material from both an organizational and semantic/content-oriented standpoint. Standardization based on structure and content enhances the potential for reuse of the XML-tagged information for both print and electronic delivery.

But while there can be a strong relationship between the authoring and editing of content and structured markup, all too often conflicts arise between technical writers and DTD/schema designers and programmers. The perceived need for “editorial license” and “creative freedom” by many authors/editors clashes with the need for “rigid structure” to facilitate “ease of programming” for markup technologists and programmers. The disagreements are commonly between format and structure, looseness and rigidity, and are often more philosophical than practical.

2. An Artifact of Mindset

Often, it is the mindset of the individuals involved that leads to conflict. That is, a lack of experience or understanding in a given area (whether it is an editor who has never heard of XML before or a programmer who is not familiar with the content being developed) can be the catalyst for disagreement and even animosity. The following discussion focuses on the difficulty that technical writers and editors unfamiliar with XML have in adjusting to a structured authoring/editing environment. Many examples come from my own experience, but they also reflect conversations with other technical writers and DTD developers throughout the industry.

3. Potential Mindset Issues

Many writers come from a desktop publishing environment, where formatting of the page is a key part of their responsibilities. For some editors, formatting is intertwined with content because of their experience and training. Structural and content-based markup devoid of format is a foreign concept for them. The transition from thinking of documents as rendered views of information to thinking of them as pieces of data, and to understanding the logical interrelations of those pieces, is a difficult task, to say the least.

For example, in 1993, Facts and Comparisons (Facts), a drug information publisher, shifted from our old typesetting system to an SGML-based publishing system. Consultants on site helped to convert legacy data into SGML data. The technical writers received a brief introduction to markup languages by practicing with memos and other simple documents. This did not prepare us for the complexities of detailed drug information data.

Up to this time, Facts had primarily been a print publishing house, with loose-leaf our most common product. Format had been an integral part of the thought processes with regard to document creation. The move to SGML was driven largely by the fact that Facts needed to extend its market into the electronic world. We wanted to establish a media-neutral database, built on structure- and content-based semantic DTDs, from which both print and electronic products could be produced and in which the same pieces of data could be shared among a variety of titles. The first step was obviously to convert our legacy data to SGML and continue to produce the current print products as the database was being populated. As a result, when pages were composed and formatting difficulties arose, the DTDs began to be conformed to "print-groups" and quite a few format-specific tags were added to drive print production.

This format-oriented mindset, inherited not only from the old system but also from the desktop-publishing culture of the time, led to a number of issues. Previewing a composed page became a method for checking the tagging, the assumption being that if the page looked okay, the tagging must be okay. Unfortunately, this practice masked creative tagging and led to significant difficulties down the road.

The most common example of creative tagging came in the use of a para and an emph tag instead of the section, title, paragraph construction to indicate a section title. In print products, section titles are usually rendered as italic text; it was not until the development of a CD-ROM product in which section titles were formatted as red, bold text that this tagging problem became evident. The errant structure was not caught earlier because the printed pages formatted correctly, so it was assumed that the tagging was correct. Because of a lack of training, the emph tag was often misused to circumvent the intended tagging structure of the document. It became the catch-all when there was confusion about how an item should be tagged.
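The para-plus-emph pattern described above is regular enough to search for mechanically. The following is a minimal sketch, assuming element names of para, emph, section, and title as used in this discussion; the sample document and the function name are hypothetical, and the check is a heuristic rather than the procedure used at Facts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the "Warnings" heading is faked with para + emph.
SAMPLE = """<monograph>
  <section>
    <title>Indications</title>
    <para>Used for <emph>mild</emph> pain.</para>
  </section>
  <para><emph>Warnings</emph></para>
  <para>Use with caution in renal impairment.</para>
</monograph>"""

def find_pseudo_titles(root):
    """Return the text of each para whose only content is a single emph child."""
    hits = []
    for para in root.iter("para"):
        children = list(para)
        if len(children) != 1 or children[0].tag != "emph":
            continue
        emph = children[0]
        # No text before or after the emph means the whole paragraph is emphasized.
        has_text_around = (para.text or "").strip() or (emph.tail or "").strip()
        if not has_text_around:
            hits.append("".join(emph.itertext()).strip())
    return hits

pseudo_titles = find_pseudo_titles(ET.fromstring(SAMPLE))
print(pseudo_titles)
```

A paragraph that merely contains emphasized words, such as the first one in the sample, is left alone; only a paragraph that is nothing but emphasis is reported for review.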

Other common issues included the addition of break tags to introduce line breaks into the data, the failure to report the need for DTD modifications, and requests for DTD modifications because an outside author or a document presented the same piece of information in a different way. The line breaks were an issue for Facts because our goal was a media-neutral database to be utilized for a variety of print and electronic formats; the line breaks were geared primarily toward the loose-leaf product and often occurred in the form of a hard hyphen followed by a break tag inserted in the middle of a word. There also was no consistency in the way this was accomplished: Sometimes there would be a hyphen followed by a space, then a break; other times a hyphen was followed by a break, then a space; or a space was on either side of the break, and so forth. This wreaked havoc with attempts to reuse the data in different formats. Isn't it amazing how one little empty tag can lead to so much consternation?
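To illustrate why the inconsistent variants are so troublesome, the sketch below normalizes several of them with regular expressions. The element name break and the sample sentences are assumptions made for illustration. The sketch also treats every hyphen next to a break as end-of-line hyphenation, which would wrongly join a genuine compound such as "long-term", so output of this kind would still need editorial review.

```python
import re

# Assumed element name: the text only says "break tags"; <break/> is a guess.
HYPHEN_BREAK = re.compile(r"-\s*<break\s*/>\s*")
PLAIN_BREAK = re.compile(r"\s*<break\s*/>\s*")

def strip_print_breaks(markup: str) -> str:
    """Remove print-oriented line breaks from serialized markup."""
    # A hyphen next to a break is treated as end-of-line hyphenation: rejoin the word.
    markup = HYPHEN_BREAK.sub("", markup)
    # Any remaining break becomes a single space.
    return PLAIN_BREAK.sub(" ", markup)

# Four of the inconsistent forms described above, all meaning the same thing.
variants = [
    "<para>Take the medi- <break/>cation daily.</para>",
    "<para>Take the medi-<break/> cation daily.</para>",
    "<para>Take the medi-<break/>cation daily.</para>",
    "<para>Take the <break/> medication daily.</para>",
]
cleaned = {strip_print_breaks(v) for v in variants}
print(cleaned)
```

All four variants collapse to the same media-neutral sentence, which is the point: the differences between them carried no information, only print formatting.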

The conflicts that arise because of requests or lack of requests for DTD modifications are among the most difficult to address. If a number of requests are turned down because it is determined that they are simply attempts to account for the same type of content in a different way, writers may stop making pertinent requests and start fudging the tagging instead. If writers assume that their requests will automatically be refused or they fear appearing foolish, legitimate requests for DTD modification based on new content may be withheld and the integrity of the XML/SGML can be compromised.

Some requests for DTD modifications come as a result of organizational inconsistency in the original document. For example, at one point an editor requested that two sub-elements of a particular DTD be made reversible. When asked why, she commented that they appeared in reverse order in one drug monograph in that data set and that it "reads better" that way. At length it was discovered that the definition of "reads better" came down to the idea that since the author wrote it that way, that is the way it must be. It was a simple inconsistency in organization by an author who was not writing in an XML/SGML environment. While the standard section headers were used, the mental parsing of the document had failed to catch a mild breach in the ordering of the contents. The DTD was not modified; the document was.

Not all requests for unnecessary changes come from a fear of changing the original document, however. Sometimes a subtle variation on a theme can occur in a document that makes it look like a legitimate need for modification. One of our products includes sections at the end of each monograph for references and suggested readings. One monograph received from an outside author had an additional section sandwiched between the two standard sections called Additional Readings in Safety Issues in Children. On the face of it, that appeared to be a rather important section to include. However, further content analysis indicated that a section on safety issues was standard in the body of all monographs, that works cited dealing with safety issues in children were part of the standard references in that monograph, and that similar monographs dealing with safety issues in children included the additional readings in the suggested readings section already. The creation of this new section in that specific monograph had to do entirely with the mindset of the author at the time of writing, not because this was a new area of content that needed to be accounted for. It was simply a different presentation of the same information. Again, the DTD did not change; the contents of that section were moved to the suggested readings section instead. The key here is that the decision was made based on content, not on DTD restrictions.

A more serious problem, often difficult to identify until attempts are made to reuse the data, arises when a need for DTD modification is not reported and existing semantic elements are misused in order to get a document to parse and still produce the desired format on output. For example, in one product we have a sub-element called age-rd that refers to drug dosage information related to age groups. In the main dosage element, the DTD subdivided the content between product, condition, and age, but none of these sub-elements allowed generic sections inside it. Therefore, when the age-related information contained further sections about initial doses and maintenance doses, these sections were tagged either with another age element or in a more creative fashion. Sometimes the title would simply be text in a paragraph followed by a colon, close paragraph, and another paragraph containing the body text. Other times there would be a single paragraph with the supposed title text followed by a colon and then the body text. In other places the old paragraph-emph construction would format the text for output. All of these instances apparently were acceptable on the formatted page (though sometimes the title would be italicized and other times it would not be), but the functionality related to structure was lost for electronic and database usage.
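The two colon-based workarounds can be approximated with a simple heuristic. The sketch below is an assumption-laden illustration: the age-rd and para element names come from the discussion above, but the sample content, the 40-character limit, and the function name are hypothetical, and a short capitalized phrase followed by a colon is only a hint that a title was intended.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample showing both colon-based workarounds.
SAMPLE = """<age-rd>
  <para>Initial dose:</para>
  <para>5 mg once daily.</para>
  <para>Maintenance dose: 10 mg once daily.</para>
  <para>Adjust the dose in renal impairment.</para>
</age-rd>"""

# Heuristic, not a rule from the source: a capitalized label of at most
# 40 characters, with no sentence punctuation, followed by a colon.
LABEL = re.compile(r"^([A-Z][^.:;]{0,39}):(?:\s|$)")

def find_colon_labels(root):
    """Return (label, kind) for each para that opens with a title-like label."""
    found = []
    for para in root.iter("para"):
        text = "".join(para.itertext()).strip()
        match = LABEL.match(text)
        if match:
            kind = "label only" if match.end() >= len(text) else "label plus body"
            found.append((match.group(1), kind))
    return found

labels = find_colon_labels(ET.fromstring(SAMPLE))
print(labels)
```

Each hit marks a place where a section with a real title element was probably needed, which in turn suggests a DTD change request that was never made.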

4. Technical Writers are Valuable Assets

It may sound like I am picking on technical writers and editors thus far; this is not the case. I spent four years at Facts as a technical writer before spending the last six years as a content analyst, DTD developer, and programmer.

Several years ago, we had an individual who worked in electronic product development who tried to promote the idea that the editorial staff was unnecessary. His reasoning was that the healthcare providers we worked with knew the technical information much better than our editors, so they should both write and edit the documents themselves. After all, "anyone can edit text." That attitude is as misinformed as the statement from the non-technical side that "it's just programming." Both of these ideas demonstrate a fundamental lack of understanding of what the other person is doing.

A good, well-trained technical writer or editor is every bit as skilled a professional as a good programmer. While it is true that the healthcare specialist, physicist, software developer, and so forth may have a deeper technical understanding of the information, the ability to present that information in a clear, well-written document geared toward a given audience is a skill requiring training in language and communication that many technical specialists have not attained. That is why the field of technical writing came into being to begin with.

5. Bridging the Conceptual Abyss

So how do we reconcile the conflict between editorial license and structured markup? How do we close the gap between the focus on format and the need for structural, content-based tagging? The first step is to understand what technical writing is and the strong relationship between the concepts of technical writing and the purpose of semantic XML.

6. Technical Writing in a Nutshell

A technical writer is one who interprets and communicates specialized information in a way that is "reader oriented and efficient." [TechWrit] In his book, Technical Writing, John Lannon comments that:

...data rarely materializes or thinking rarely occurs in a neat, predictable sequence. We cannot merely report ideas or data in the same random order they occur. Instead we shape this material into an organized unit of meaning. [TechWrit]

Technical writers often take disparate pieces of documentation on a given topic and organize the content into a logical structure to enhance readability and usability of that information for the reader. Part of the process entails identifying the relationship between the data and the sequence in which the reader is likely to approach the information. [TechWrit]

A major part of technical writing is analysis: Breaking down and categorizing the content into relevant pieces of information, often both with regard to the internal components of the data as well as the category in which the information belongs. [TechWrit] For instance, information about a given drug may be broken down by components of the drug (e.g., active ingredients, doseforms, strengths, uses, side effects, how supplied) and by therapeutic class (e.g., analgesics, muscle relaxants). By organizing this information the same way in each drug monograph, it makes it easier for the reader to find the information being sought. That is, if side effects always comes after indications and before how supplied, the user can simply scan the document to find the information quickly instead of having to hunt and peck to find the information needed.

This concept can be applied to a variety of technical documents, such as computer manuals, encyclopedias, dictionaries, parts lists, reports, proposals, and aircraft manuals, to name a few. A simple example would be a dictionary. The components of a dictionary entry (such as terms, pronunciations, etymologies, and definitions) are always organized in the same way. While it is true that not every entry contains every possible component, the parts that are present are always ordered consistently.

Technical writers and editors should always bear in mind the three Cs of technical writing: Clear, concise, and consistent. Make your meaning and interpretation clear; state things concisely, using concrete, active language to get your point or information across; and be consistent, both in organization and in format. Yes, even format plays a role. If a major section header appears in bold face followed by a hard return, its first-level subsection headers appear in italic face followed by a colon, and its second-level subsection headers appear indented in italic face followed by a dash, this format should be applied to all such major sections in the document. Utilizing the same format across the board enhances readability in much the same manner as consistent organization does.

7. Content-based Markup in a Nutshell

Within a media-neutral publishing environment, XML is used to identify and maintain the structure and content of information, independent of formatting specifications. By maintaining a consistent organization within the data, the information can be reused across formats and publications with a minimum of effort. XML can serve as an aid in the editorial process by providing a standard methodology for describing the "meaning, structure, and other properties" of the data. [XMLHand]

XML defines what the content is, not what it looks like. To reiterate a previous example, a section title that is to be rendered in italics in a given product is not tagged as italics, but rather as a title. The title tag defines what it is; output tools will render its appearance, whether it is print, CD, web, or some other format. In The XML Handbook, Goldfarb and Prescod refer to the "ambiguity of formatting", stating that "formatting information would merely clutter up" an abstract document. [XMLHand] By keeping presentational instructions out of the data, a higher level of portability is maintained, making the XML database far more powerful and reusable.
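The separation of what a title is from how it looks can be shown in a few lines. This is a minimal sketch, not a description of any real composition tool: the style tables, the HTML-like output strings, and the function name are hypothetical, with the italic and red-bold title styles borrowed from the print and CD-ROM products mentioned earlier.

```python
import xml.etree.ElementTree as ET

# The data says only what each piece is: a title and a paragraph.
SECTION = ET.fromstring(
    "<section><title>Warnings</title><para>Use with caution.</para></section>"
)

# Hypothetical per-medium style tables, kept outside the data.
STYLES = {
    "print": {"title": "<i>{}</i>", "para": "<p>{}</p>"},
    "cdrom": {"title": '<b style="color:red">{}</b>', "para": "<p>{}</p>"},
}

def render(section, medium):
    """Render each child of a section using the style table for one medium."""
    style = STYLES[medium]
    return "".join(style[child.tag].format(child.text or "") for child in section)

print_output = render(SECTION, "print")
cdrom_output = render(SECTION, "cdrom")
print(print_output)
print(cdrom_output)
```

Because the title is tagged as a title, adding a third medium means adding a third style table; the data itself does not change.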

The organizational structure of XML also establishes relationships between parts of information. The relationships may be structural (e.g., a section element may contain a title and a paragraph or subsection) or semantic (e.g., a drug monograph may have a section called Warnings, with standard subsections of Pregnancy, Elderly, and Children, among others). These content models are part of the overall tree structure of the information set which describes the relationship of all the elements in the data.

8. The Relationship between Technical Writing and XML

There is a clear connection between the concepts of technical writing and structured markup. The process of organizing and categorizing relevant pieces of information into a consistent pattern to produce a well-defined, easy to follow document is very similar to the data-modeling that is part of DTD development in XML. In fact, the technical writer and semantic DTD developer look at many of the same things in performing their respective analyses. Both analyze the organizational structure of the content and try to identify standard headings and section types, the writer for consistent structure to enhance readability, the developer for consistent structure and semantic definition to enhance functionality and granularity. Perhaps it would be helpful for the technical writer to view the DTD developer's role as one of extending the intended value of the content.

Even the format of the original manuscript can play a role in content analysis. According to Trevor Alyn:

The tags in an abstract XML document and the styling in a rendered print document do the same thing - they just do it differently. XML communicates structure literally, using element types and nesting. Print publications communicate structure visually, using formatting and arrangement. [XMLHand]

DTD developers can key on organization, font variants, placement of headers, indenting, and so forth as indicators of structure. [XMLHand] As described in the earlier discussion of technical writing, the presentation of section headers and subsection headers in the formatted instance can indicate to the DTD developer levels of nesting that need to be accounted for.

Sounds easy, doesn't it? The concept is easy; reality is not. Sometimes the restrictions placed on the writer by the DTD diminish the overall quality of the content. Other times the nature of the information diminishes the quality of the tagging. Case in point: Ideally, the content model of each element should define exactly what that element can contain and what it cannot, and there should always be some sort of content required. However, in data as complex as drug information, it is not always possible to require content. Within warnings content of drug monographs, there are a number of common subsections worthy of semantic tagging, so we created standard markup to cover these sections as well as leaving room for generic sections that apply only to specific drugs. The problem is that while the standard subsections occur in the majority of drug monographs, none of them occur in every monograph, and there is not always drug-specific information in each monograph. Therefore, the entire content model must be optional; that is, if the information does appear, it must appear in a certain order, but no specific piece is required in all monographs. This makes the content model seem ambiguous, but the looseness is a necessary consequence of the nature of drug information.
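An all-optional but ordered content model can be pictured as follows. The element names and the DTD declaration in the comment are hypothetical stand-ins, not the actual Facts DTD, and the regular expression over child names is only an approximation of what a validating parser would do.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical element names. In DTD notation the model would read:
#   <!ELEMENT warnings (pregnancy?, elderly?, children?, section*)>
CONTENT_MODEL = re.compile(r"(pregnancy )?(elderly )?(children )?(section )*")

def matches_model(warnings):
    """True when the children appear in the standard order; none is required."""
    child_tags = "".join(child.tag + " " for child in warnings)
    return CONTENT_MODEL.fullmatch(child_tags) is not None

in_order = ET.fromstring("<warnings><pregnancy/><children/><section/></warnings>")
empty = ET.fromstring("<warnings/>")
out_of_order = ET.fromstring("<warnings><children/><pregnancy/></warnings>")

results = [matches_model(doc) for doc in (in_order, empty, out_of_order)]
print(results)
```

The empty element passes, which is exactly the weakness described above: the model can enforce order but cannot insist that anything be present.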

Striking a balance between truly important flexibility and the need for consistent structure in a data model is problematic to say the least. Unfortunately, there is no magical silver bullet to slay this demon. There will usually be some level of compromise mandated, depending on the complexity level of the data set. Steps must be taken to keep the need for compromise minimal, but it will happen.

The keys to peaceful coexistence between technical writing and XML are communication, education, and cooperation. This may sound trite, but it is true nonetheless. Open lines of communication between the writers and developers will help identify problems before they become significantly entrenched. Education, both for the technical writers and editors with regard to the purpose behind XML and for the developers with regard to understanding the content, will help in the development of cleaner, tighter data. Finally, close cooperation in the analysis, design, and implementation of DTDs/schemas will help expand the editors' ability to comprehend and function within the XML environment.

For instance, we need to encourage technical writers and editors to send requests for changes to DTDs/schemas, even if many of those requests are turned down. When requests are turned down, always include the rationale used to reach that decision. The first goal when a situation is encountered where the information does not match the DTD should be to see if that content can be edited to fit the existing standard structures without impacting the value of the information. If a logical reason for modifying the DTD can be demonstrated, then make it so. But we must make sure it is not simply because of inconsistent authoring or organization in a given source document. There must be a definitive reason for such deviation from the established standard.

Inconsistency of organization in authoring can be the result of a number of variables: Multiple authors working on the same project, an author's shifting mindset when writing at different times, no established protocol to guide the author, and so forth. The technical writer's job is to interpret these inconsistencies and reorganize them into an efficient, easy-to-read document. Protocols and procedures must be established to guide the writing and editing of content to assist in adherence to standards. The analysis that leads to the development of these standards applies to both editorial processes and DTD development. This is where a carefully defined DTD can aid the writers in maintaining consistency. The DTD is a tool, not an impediment.

Another valuable practice is to involve the editors in the repurposing of data. For example, when we decided to develop an electronic version of our flagship product, the programming staff ran into serious difficulty trying to repurpose the data. It was not until one of our editors was assigned to the project that the format-specificity of our data was clearly identified. Our DTD was oriented toward print-groups and had no upper level structure to facilitate parsing and manipulating the entire data set as a single document. Much of the aforementioned creative tagging also became apparent as we tried to reconcile the new stylesheet and intended functionality with the markup structure. A custom DTD was put together for the electronic version and the upper level tagging was added by hand, which led to many late nights in an effort to hit the deadline. It is unfortunate that it took a reformatting of the print product a couple of years later to give us an opportunity to redesign the DTD and clean up the entire data set to make the content media-neutral, but at least we were able to reach that point over time. It was a tremendous learning experience for us, and all subsequent DTDs and work on a media-neutral repository were enhanced greatly by this experience. But until the technical writers actually participated in the repurposing process, the impact of format-specific tagging was not truly understood. The idea that the tagging must be correct if the printed page output correctly got blown out of the water; it made us look more carefully at the structure of our content.

Ideally, it would be a great asset to be able to split the authoring and editing staff between content/markup specialists and format specialists. The content specialists would focus on developing and tagging the content without consideration of format; the format specialists would then take that content, apply stylesheets as needed, and perform any format-related edits in the composition tool, as opposed to the data. However, for many companies, budget constraints make it financially unfeasible to maintain two editing groups; therefore, the content editors and format editors are often the same people.

One way to cover this situation is to institute a quality assurance (QA) process specific to markup. At Facts, in addition to the standard content QA our information receives, we have added a markup QA process as the last phase in our workflow before posting to the database. Each document or document set passes through a markup QA specialist who first parses the data several times at different granular levels, then reviews the markup for anomalies and instances of creative tagging. The markup specialist has compiled a list of tagging issues to look for, documenting new trends as they come up to nip them in the bud. The QA specialist will correct minor tagging issues; more significant issues will be identified and sent back to the technical writer or editor with an explanation of the problem and method for resolving it. This gives the responsibility back to the editors and encourages them to focus on the integrity of the data, rather than on formatting.
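Part of such a markup QA pass lends itself to automation. The sketch below is a hypothetical illustration, not the process used at Facts: the two checks, their severity labels, and the element names break, para, and emph are assumptions. It parses the document, runs each check from a list of known tagging issues, and sorts the findings into those the QA specialist would fix and those that would go back to the writer.

```python
import xml.etree.ElementTree as ET

def stray_breaks(root):
    """One finding per break element left in the data."""
    return ["break tag in data"] * len(root.findall(".//break"))

def emph_only_paras(root):
    """Paragraphs that consist of nothing but one emph child."""
    return [
        "para used as title: " + (p[0].text or "")
        for p in root.iter("para")
        if len(p) == 1 and p[0].tag == "emph"
        and not (p.text or "").strip() and not (p[0].tail or "").strip()
    ]

# Hypothetical checklist: (check, severity). "minor" issues are fixed by the
# QA specialist; "major" ones go back to the writer with an explanation.
CHECKLIST = [(stray_breaks, "minor"), (emph_only_paras, "major")]

def review(xml_text):
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    report = {"minor": [], "major": []}
    for check, severity in CHECKLIST:
        report[severity].extend(check(root))
    return report

report = review(
    "<doc><para><emph>Dosage</emph></para>"
    "<para>Take 5 mg<break/>daily.</para></doc>"
)
print(report)
```

Keeping the checks in a list mirrors the specialist's running list of tagging issues: when a new trend appears, a new check is appended rather than the review logic being rewritten.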

The QA specialist is also tasked with training new editors in markup concepts and in how to use the editing tools. As an integral member of the DTD development team and an experienced editor, the QA specialist has a deep understanding of both worlds and can help new technical writers adjust to the intricacies and philosophies of the new publishing environment.

9. Final Comments

Technical writing and XML can not only exist in the same universe, but they also serve a very similar function. Both try to understand and define the relationships between pieces of information and organize them into a cohesive unit of meaning. The two disciplines complement each other: Technical writing is intended to structure information in a consistent, easily human-readable manner, while XML assists in maintaining consistency and extends the content through semantic and structural tagging to be machine-readable as well. The reusability of XML-tagged content greatly reduces the need for writers and editors to rehash the same material for a variety of products; they write it once and repurpose the existing content as needed. This does not mean we need fewer editors; on the contrary, now the writers and editors can produce more material faster and more efficiently.

The relationship between technical writing and structured markup needs to be clearly defined and emphasized to those involved in the process of content creation and delivery. It is imperative that protocols and workflows be established for content creation and XML development. A quality assurance program for both content and markup is critical to success. Training needs to be instituted to introduce the writing and editing professionals to the new publishing paradigm, and must be supplemented on an ongoing basis to keep skills current and revisited periodically if old format-oriented habits reappear. In order to produce useful, media-neutral XML, the content specialists and DTD developers and programmers cannot each live in their own little worlds, operating in a vacuum. The development of content and DTDs must be a coordinated effort, and lines of communication must be open and utilized.

Bibliography

[TechWrit]
Technical Writing. Lannon, John. 7th ed. Addison-Wesley Educational Publisher Inc, 1997.
[XMLHand]
The XML Handbook. Goldfarb, Charles F., Prescod, Paul. 3rd ed. Prentice Hall PTR, 2001.
