Technical Writing and XML
Reconciling Editorial License with Structured Markup
Featured Paper from XML 2001
Conference Proceedings
1. Introduction
In writing reference material, consistency of organization and
presentation is key. If the same information is presented in a consistent order
and style throughout the publication or information set, it enhances the
readability and usability of the material for the consumer. Reference materials
may include such documents as encyclopedias, dictionaries, parts supply lists,
maintenance manuals, computer manuals, and drug information texts, among many
others. Ease of use is vital. XML provides a means to assist in the
standardization of reference material from both an organizational and a
semantic/content-oriented standpoint. Standardization based on structure and
content enhances the potential for reuse of the XML-tagged information for both
print and electronic delivery.
But while there can be a strong relationship between the authoring
and editing of content and structured markup, all too often conflicts arise
between technical writers and DTD/schema designers and programmers. The
perceived need for “editorial license” and “creative
freedom” by many authors/editors clashes with the need for “rigid
structure” to facilitate “ease of programming” for markup
technologists and programmers. The battles are commonly between format and
structure, looseness and rigidity, and are often more philosophical than
practical.
2. An Artifact of Mindset
Often, it is the mindset of the individuals that leads to conflict.
That is, a lack of experience or understanding in a given area — whether
it is an editor who has never heard of XML before or a programmer who is not
familiar with the content being developed — can be the catalyst for
disagreement and even animosity. The following discussion focuses on issues
related to the difficulty of non-XML-aware technical writers and editors
adjusting to a structured authoring/editing environment. Many examples come
from my own experience, but also reflect conversations I have had with other
DTD developers throughout the industry.
2.1. Potential Mindset Issues
An exasperated manager at a conference in Boston this year asked
this question: “How can you get an editor to tell the difference between
format and content?” She illustrated the problem by describing instances
where XML documents had been printed out as simple text files for content
editing, but the editors not only edited the content, they also marked up poor
line and page breaks, even though this was not intended to be a composed or
formatted document.
Many editors come from a print or desktop-publishing environment,
where formatting of the page is a key part of their responsibilities. For some
editors, formatting is intertwined with content because of their experience and
training. Structural and content-based tagging devoid of format is a foreign
concept for them. Moving from thinking of a document as a rendered view of
information to thinking of it as pieces of data, and understanding how those
pieces relate logically, is a difficult transition, to say the least.
For example, in 1993, Facts and Comparisons (Facts), a
drug information publisher, shifted from our old typesetting system to an
SGML-based publishing system. Consultants on site helped to convert legacy data
into SGML data. The technical writers received a brief introduction to markup
languages by practicing with memos and other simple documents. This did not
prepare us for the complexities of detailed drug information data.
Up to this time, Facts had primarily been a print publishing
house, with loose-leaf our most common product. Format had been an integral
part of the thought processes with regard to document creation. The move to
SGML was driven largely by the fact that Facts needed to extend its market into
the electronic world and wanted to establish a media-neutral database from
which both print and electronic products could be produced and the same pieces
of data could be shared among a variety of titles by utilizing structure- and
content-based semantic DTDs. The first step was obviously to convert our legacy
data to SGML and continue to produce the current print products as the database
was being populated. As a result, when pages were composed and formatting
difficulties arose, the DTDs began to be conformed to
“print-groups” and quite a few format-specific tags were added to
drive print production.
This format-oriented mindset, inherited not only from the old
system but also from the desktop-publishing culture of the time, led to a
number of issues. Previewing a composed page became a method for checking the
tagging, the assumption being that if the page looked okay, the tagging must be
okay. Unfortunately, this practice masked creative tagging and led to
significant difficulties down the road.
The most common example of creative tagging came in the use of a
para and an emph tag instead of the section, title,
paragraph construction to indicate a section title. In print products,
section titles are usually rendered as italic text; it was not until the
development of a CD-ROM product in which section titles were formatted as red,
bold text that this tagging problem became evident. The errant structure was
not caught earlier because the printed pages formatted correctly, so it was
assumed that the tagging was correct. Because of a lack of training, the emph
tag was often misused to circumvent the intended tagging structure of the
document. It became the catch-all when there was confusion about how an item
should be tagged.
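A check for this pattern can be automated. The following is a minimal sketch in Python, not the tooling we used at the time. It assumes hypothetical element names (para, emph, section, title) taken from the description above rather than from the actual DTD, and it flags any paragraph whose entire content is a single emphasized run, which is the signature of a title tagged for appearance rather than meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names follow the text, not the actual DTD.
SAMPLE = """
<section>
  <title>Warnings</title>
  <para><emph>Pregnancy</emph></para>
  <para>Use only when <emph>clearly</emph> needed.</para>
</section>
"""

def find_pseudo_titles(root):
    """Return the text of each para whose only content is one emph child."""
    hits = []
    for para in root.iter("para"):
        children = list(para)
        if len(children) != 1 or children[0].tag != "emph":
            continue
        emph = children[0]
        # No text may sit before or after the emph inside the para.
        leading = (para.text or "").strip()
        trailing = (emph.tail or "").strip()
        if not leading and not trailing:
            hits.append("".join(emph.itertext()).strip())
    return hits

pseudo_titles = find_pseudo_titles(ET.fromstring(SAMPLE))
print(pseudo_titles)
```

Emphasis used inside running text, as in the second paragraph of the sample, is left alone; only the paragraph that consists of nothing but emphasized text is reported.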
Other common issues included the addition of break tags to
introduce line breaks into the data, the failure to report the need for DTD
modifications, or requests for DTD modifications because an outside author or a
document presented the same piece of information in a different way. The line
breaks were an issue for Facts because our goal was a media-neutral database to
be utilized for a variety of print and electronic formats; the line breaks were
geared primarily toward the loose-leaf product and often occurred in the form
of a hard hyphen followed by a break tag inserted in the middle of a word.
There also was no consistency in the way this was accomplished: Sometimes there
would be a hyphen followed by a space, then a break; other times a hyphen was
followed by a break, then a space; or a space was on either side of the break,
and so forth. This wrought havoc with attempts to reuse the data in different
formats. Isn't it amazing how one little empty tag can lead to so much
consternation?
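To illustrate why these variations are hard to clean up, here is a small Python sketch that inventories the hyphen and space pattern around each break tag. The tag spelling `<break/>` and the sample sentence are assumptions for illustration only. The sketch reports the patterns rather than repairing them, because a script cannot tell a hard hyphen inserted for a line break from a hyphen that belongs to the word.

```python
import re

# Hypothetical fragment; <break/> stands in for the empty break tag described above.
SAMPLE = (
    "<para>Take with food to reduce gastro- <break/>intestinal upset. "
    "Avoid in hepat-<break/> ic impairment. "
    "Monitor renal <break/> function.</para>"
)

# A break tag plus whatever hyphen and spaces surround it.
BREAK_CONTEXT = re.compile(r"(?P<before>-?\s*)<break\s*/>(?P<after>\s*)")

def classify_breaks(text):
    """Describe the hyphen/space pattern around each break tag."""
    patterns = []
    for m in BREAK_CONTEXT.finditer(text):
        before, after = m.group("before"), m.group("after")
        parts = []
        if before.startswith("-"):
            parts.append("hyphen")
        if before.lstrip("-"):          # whitespace between hyphen (if any) and tag
            parts.append("space")
        parts.append("break")
        if after:                       # whitespace after the tag
            parts.append("space")
        patterns.append("+".join(parts))
    return patterns

print(classify_breaks(SAMPLE))
```

Each distinct pattern in the result is one more case that every downstream output process would otherwise have to handle.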
The conflicts that arise because of requests or lack of requests
for DTD modifications are among the most difficult to address. If a number of
requests are turned down because they are judged to be simply attempts to
account for the same type of content in a different way, writers may stop
making pertinent requests and start fudging the tagging instead. If
writers assume that their requests will automatically be refused or they fear
appearing foolish, legitimate requests for DTD modification based on new
content may be withheld and the integrity of the XML/SGML can be
compromised.
Some requests for DTD modifications come as a result of
organizational inconsistency in the original document. For example, at one
point an editor requested that two sub-elements of a particular DTD be made
reversible. When asked why, she commented that they appeared in reverse order
in one drug monograph in that data set and that it “reads better”
that way. At length it was discovered that the definition of “reads
better” came down to the idea that since the author wrote it that way,
that is the way it must be. It was a simple inconsistency in organization by an
author who was not writing in an XML/SGML environment. While the standard
section headers were used, the mental parsing of the document had failed to
catch a mild breach in the ordering of the contents. The DTD was not modified;
the document was.
Not all requests for unnecessary changes come from a fear of
changing the original document, however. Sometimes a subtle variation on a
theme can occur in a document that makes it look like a legitimate need for
modification. One of our products includes sections at the end of each
monograph for references and suggested readings. One monograph
received from an outside author had an additional section, called Additional
Readings in Safety Issues in Children, sandwiched between the two standard
sections. On the face of it, that appeared to be a rather important section
to include. However, further content analysis indicated that a section on
safety issues was standard in the body of all monographs, that works cited
dealing with safety issues in children were part of the standard references in
that monograph, and that similar monographs dealing with safety issues in
children included the additional readings in the suggested readings section
already. The creation of this new section in that specific monograph had to do
entirely with the mindset of the author at the time of writing, not with any
new area of content that needed to be accounted for. It was simply a
different presentation of the same information. Again, the DTD did not change;
the contents of that section were moved to the suggested readings section
instead. The key here is that the decision was made based on content, not on
DTD restrictions.
A more serious problem, often difficult to identify until
attempts are made to reuse the data, is when a need for DTD modification is not
reported and existing semantic elements are misused in order to get a document
to parse and still produce the desired format on output. For example, in one
product we have a sub-element called age-rd that refers to drug dosage
information related to age groups. In the main dosage element, the DTD
subdivided the content between product, condition, and age, but none of these
sub-elements allowed multiple generic sections inside it. Therefore, when the
age-related information contained further sections about initial doses and
maintenance doses, these sections were tagged either using another age element
or in a more creative fashion. Sometimes the title would simply be text
in a paragraph followed by a colon, close paragraph, and another paragraph
containing the body text. Other times there would be a single paragraph with
the supposed title text followed by a colon and then the body text. In other
places the old paragraph-emph construction would format the text for output.
All of these instances apparently were acceptable on the formatted page —
though sometimes the title would be italicized and other times it would not be
— but the functionality related to structure was lost for electronic and
database usage.
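The colon-based variants can be found with a simple heuristic. The Python sketch below is only an illustration under stated assumptions: the sample content is invented, para is a hypothetical element name, and the rule (a capitalized label of at most four words followed by a colon at the start of a paragraph) is my own guess at a workable pattern, so it will produce both false hits and misses on real data.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample showing both colon variants described above.
SAMPLE = """
<age-rd>
  <para>Initial dose:</para>
  <para>Start at the lowest strength.</para>
  <para>Maintenance dose: Adjust according to response.</para>
  <para>Doses are given once daily.</para>
</age-rd>
"""

# Heuristic: a capitalized label of one to four words, then a colon,
# at the very start of the paragraph text.
LABEL = re.compile(r"^\s*([A-Z][\w-]*(?: [\w-]+){0,3}):")

def find_colon_titles(root):
    """Return the label of each para that opens with 'Label:'."""
    hits = []
    for para in root.iter("para"):
        m = LABEL.match(para.text or "")
        if m:
            hits.append(m.group(1))
    return hits

colon_titles = find_colon_titles(ET.fromstring(SAMPLE))
print(colon_titles)
```

A hit does not prove misuse; it marks a place where a person should ask whether the DTD lacks a structure the content needs.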
2.2. Technical Writers are Valuable Assets
It may sound like I am picking on technical writers and editors
thus far; this is not the case. I spent four years at Facts as a technical
writer before spending the last four and a half years as a content analyst, DTD
developer, and programmer.
Several years ago, we had an individual who worked in electronic
product development who tried to promote the idea that the editorial staff was
unnecessary. His reasoning was that the healthcare providers we worked with
knew the technical information much better than our editors, so they should
both write and edit the documents themselves. After all, “anyone can edit
text.” That attitude is as misinformed as the statement from the
non-technical side that “it's just programming.” Both of these
ideas demonstrate a fundamental lack of understanding of what the other person
is doing.
A good, well-trained technical writer or editor is every bit as
skilled a professional as a good programmer. While it is true that the
healthcare specialist, physicist, software developer, and so forth may have a
deeper technical understanding of the information, the ability to present that
information in a clear, well-written document geared toward a given audience is
a skill requiring training in language and communication that many technical
specialists have not attained. That is why the field of technical writing came
into being to begin with. As an example, my uncle is a physicist who once
commented that his company used technical writers because while their
scientists were brilliant in research and development, their ability to produce
detailed, well-crafted, readable reports was atrocious. Hence the need to have
trained writing professionals working with the scientists to get the job
done.
3. Bridging the Conceptual Abyss
So how do we reconcile the conflict between editorial license and
structured markup? How do we close the gap between the focus on format and the
need for structural, content-based tagging? The first step is to understand
what technical writing is and the strong relationship between the concepts of
technical writing and the purpose of semantic XML.
3.1. Technical Writing in a Nutshell
A technical writer is one who interprets and communicates
specialized information in a way that is “reader oriented and
efficient.”[1] In his book, Technical
Writing, John Lannon comments that:
...data rarely materializes or thinking rarely occurs in
a neat, predictable sequence. We cannot merely report ideas or data in the same
random order they occur. Instead we shape this material into an organized unit
of meaning.[1]
Technical writers often take disparate pieces of documentation on
a given topic and organize the content into a logical structure to enhance
readability and usability of that information for the reader. Part of the
process entails identifying the relationship between the data and the sequence
in which the reader is likely to approach the information.[1]
A major part of technical writing is analysis: Breaking down and
categorizing the content into relevant pieces of information, often both with
regard to the internal components of the data as well as the category in which
the information belongs.[1] For instance, information
about a given drug may be broken down by components of the drug (e.g.,
active ingredients, doseforms, strengths, uses, side effects, how
supplied) and by therapeutic class (e.g., analgesics, muscle
relaxants). Organizing this information the same way in each drug
monograph makes it easier for the reader to find the information being
sought. That is, if side effects always comes after indications
and before how supplied, the user can simply scan the document to find
the information quickly instead of having to hunt for it.
This concept can be applied to a variety of technical documents,
such as computer manuals, encyclopedias, dictionaries, parts lists, reports,
proposals, and aircraft manuals, to name a few. A simple example would be a
dictionary. The components of a dictionary entry (such as terms,
pronunciations, etymologies, and definitions) are always organized in the
same way. While it is true that not every entry contains every possible
component, the parts that are present are always ordered consistently.
Technical writers and editors should always bear in mind the
three Cs of technical writing: Clear, concise, and consistent. Make your
meaning and interpretation clear; state things concisely, using concrete,
active language to get your point or information across; and be consistent,
both in organization and in format. Yes, even format plays a role. If a major
section header appears in bold face followed by a hard return, its first-level
subsection headers appear in italic face followed by a colon, and its
second-level subsection headers appear indented in italic face followed by a
dash, this format should be applied to all such major sections in the document.
Utilizing the same format across the board enhances readability in much the
same manner as consistent organization does.
3.2. Content-based Markup in a Nutshell
Within a media-neutral publishing environment, XML is used to
identify and maintain the structure and content of information, independent of
formatting specifications. By maintaining a consistent organization within the
data, the information can be reused across formats and publications with a
minimum of effort. XML can serve as an aid in the editorial process by
providing a standard methodology for describing the “meaning, structure,
and other properties” of the data.[2]
XML defines what the content is, not what it looks like. To
reiterate a previous example, a section title that is to be rendered in italics
in a given product is not tagged as italics, but rather as a title. The title
tag defines what it is; output tools will render its appearance, whether it is
print, CD, web, or some other format. In The XML Handbook, Goldfarb
and Prescod refer to the “ambiguity of formatting”, stating that
“formatting information would merely clutter up” an abstract
document.[2] By keeping presentational instructions out of
the data, a higher level of portability is maintained, making the XML database
far more powerful and reusable.
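As a rough illustration of this separation, the Python sketch below renders one and the same title element two ways. The element names, the style table, and the output notation are all assumptions made for the example; the only detail taken from our experience is that the print product used italic titles and the CD-ROM product used red, bold titles.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the title is tagged for what it is, not how it looks.
SECTION = "<section><title>Warnings</title><para>Body text.</para></section>"

# Hypothetical per-medium styles; the output notation is illustrative only.
STYLES = {
    "print": "<i>{}</i>",                      # italic in the print product
    "cdrom": '<b style="color:red">{}</b>',    # red, bold on the CD-ROM
}

def render_title(section_xml, medium):
    """Render the section title for one medium without touching the data."""
    title = ET.fromstring(section_xml).find("title")
    return STYLES[medium].format(title.text)

print(render_title(SECTION, "print"))
print(render_title(SECTION, "cdrom"))
```

Adding a new output format means adding one entry to the style table; the tagged content itself does not change.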
The organizational structure of XML also establishes
relationships between parts of information. The relationships may be structural
(e.g., a section element may contain a title and a paragraph or
subsection) or semantic (e.g., a drug monograph may have a section
called Warnings, with standard subsections of Pregnancy, Elderly, and Children,
among others). These content models are part of the overall tree structure
of the information set which describes the relationship of all the elements in
the data.
3.3. The Relationship between Technical Writing and XML
There is a clear connection between the concepts of technical
writing and structured markup. The process of organizing and categorizing
relevant pieces of information into a consistent pattern to produce a
well-defined, easy-to-follow document is very similar to the data modeling that
is part of DTD development in XML. In fact, the technical writer and semantic
DTD developer look at many of the same things in performing their respective
analyses. Both analyze the organizational structure of the content and try to
identify standard headings and section types, the writer for consistent
structure to enhance readability, the developer for consistent structure and
semantic definition to enhance functionality and granularity. Perhaps it would
be helpful for the technical writer to view the DTD developer's role as
one of extending the intended value of the content.
Even the format of the original manuscript can play a role in
content analysis. According to Trevor Alyn:
The tags in an abstract XML document and the styling in a
rendered print document do the same thing - they just do it differently. XML
communicates structure literally, using element types and nesting. Print
publications communicate structure visually, using formatting and
arrangement.[2]
DTD developers can key on organization, font variants, placement
of headers, indenting, and so forth as indicators of structure.[2] As
described in the earlier discussion of technical writing,
the presentation of section headers and subsection headers in the formatted
instance can indicate to the DTD developer levels of nesting that need to be
accounted for.
Sounds easy, doesn't it? The concept is easy; reality is not.
Sometimes the restrictions placed on the writer by the DTD diminish the overall
quality of the content. Other times the nature of the information diminishes
the quality of the tagging. Case in point: Ideally, the content model of each
element should define exactly what that element can contain and what it cannot,
and there should always be some sort of content required. However, in data as
complex as drug information, it is not always possible to require content.
Within warnings content of drug monographs, there are a number of common
subsections worthy of semantic tagging, so we created standard markup to cover
these sections as well as leaving room for generic sections that apply only to
specific drugs. The problem is that while the standard subsections occur in the
majority of drug monographs, none of them occur in every monograph, and there
is not always drug-specific information in each monograph. Therefore, the
entire content model must be optional; that is, if the information does appear,
it must appear in a certain order, but no specific piece is required in all
monographs. This makes the content model seem ambiguous, but that looseness
is a necessary consequence of the nature of drug information.
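An "everything optional, but in a fixed order" model of this kind can be sketched as follows. The element names and the exact model are assumptions for illustration, not our production DTD; in DTD notation the model would look roughly like (pregnancy?, elderly?, children?, section*). The Python function checks only the ordering rule, since the standard library does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model, in DTD terms roughly:
#   <!ELEMENT warnings (pregnancy?, elderly?, children?, section*)>
STANDARD_ORDER = ["pregnancy", "elderly", "children"]
GENERIC = "section"

def follows_model(warnings):
    """True if the children appear in model order; none is required."""
    last_rank = -1
    for child in warnings:
        if child.tag == GENERIC:
            # Generic sections come last and may repeat.
            last_rank = len(STANDARD_ORDER)
        elif child.tag in STANDARD_ORDER:
            rank = STANDARD_ORDER.index(child.tag)
            if rank <= last_rank:      # out of order, repeated, or after a generic section
                return False
            last_rank = rank
        else:
            return False               # element not in the model
    return True

print(follows_model(ET.fromstring("<warnings><elderly/><section/></warnings>")))
print(follows_model(ET.fromstring("<warnings><children/><pregnancy/></warnings>")))
```

Note that an empty warnings element also passes, which is exactly the compromise described above: the model constrains order but cannot require content.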
Striking a balance between truly important flexibility and the
need for consistent structure in a data model is problematic to say the least.
Unfortunately, there is no magical silver bullet to slay this demon. There will
usually be some level of compromise mandated, depending on the complexity level
of the data set. Steps must be taken to keep the need for compromise minimal,
but it will happen.
The keys to peaceful coexistence between technical writing and
XML are communication, education, and cooperation. This may sound trite, but it
is true nonetheless. Open lines of communication between the writers and
developers will help identify problems before they become significantly
entrenched. Education, both for the technical writers and editors with regard
to the purpose behind XML and for the developers with regard to understanding
the content, will help in the development of cleaner, tighter data. Finally,
close cooperation in the analysis, design, and implementation of DTDs will help
expand the editors' ability to comprehend and function within the XML
environment.
For instance, we need to encourage technical writers and editors
to send requests for change to DTDs, even if many of those requests are turned
down. When requests are turned down, always include the rationale used to reach
that decision. The first goal when a situation is encountered where the
information does not match the DTD should be to see if that content can be
edited to fit the existing standard structures without impacting the value of
the information. If a logical reason for modifying the DTD can be demonstrated,
then make it so. But we must make sure it is not simply because of inconsistent
authoring or organization in a given source document. There must be a
definitive reason for such deviation from the established standard.
Inconsistency of organization in authoring can be the result of a
number of variables: Multiple authors working on the same project, an
author's shifting mindset when writing at different times, no established
protocol to guide the author, and so forth. The technical writer's job is
to interpret these inconsistencies and reorganize them into an efficient,
easy-to-read document. Protocols and procedures must be established to guide
the writing and editing of content to assist in adherence to standards. The
analysis that leads to the development of these standards applies to both
editorial processes and DTD development. This is where a carefully defined DTD
can aid the writers in maintaining consistency. Emphasize that the DTD is a
tool, not an impediment.
Another valuable practice is to involve the editors in the
repurposing of data. For example, when we decided to develop an electronic
version of our flagship product, the programming staff ran into serious
difficulty trying to repurpose the data. It was not until one of our editors
was assigned to the project that the format-specificity of our data was clearly
identified. Our DTD was oriented toward print-groups and had no upper level
structure to facilitate parsing and manipulating the entire data set as a
single document. Much of the aforementioned creative tagging also became
apparent as we tried to reconcile the new stylesheet and intended functionality
with the markup structure. A custom DTD was put together for the electronic
version and the upper level tagging was added by hand, which led to many late
nights in an effort to hit the deadline. It is unfortunate that it took a
reformatting of the print product a couple of years later to give us an
opportunity to redesign the DTD and clean up the entire data set to make the
content media-neutral, but at least we were able to reach that point over time.
It was a tremendous learning experience for us, and all subsequent DTDs and
work on a media-neutral repository were enhanced greatly by this experience.
But until the technical writers actually participated in the repurposing
process, the impact of format-specific tagging was not truly understood. The
idea that the tagging must be correct if the printed page came out correctly
got blown out of the water; it made us look more carefully at the structure of
our content.
Ideally, it would be a great asset to be able to split the
authoring and editing staff between content/markup specialists and format
specialists. The content specialists would focus on developing and tagging the
data without consideration of format; the format specialists would then take
that content, apply stylesheets as needed, and perform any format-related edits
in the composition tool, as opposed to the data. However, for many companies,
budget constraints make it financially unfeasible to maintain two editing
groups; therefore, the content editors and format editors are often the same
people.
One way to cover this situation is to institute a quality
assurance (QA) process specific to markup. At Facts, in addition to
the standard content QA our information receives, we have added a markup QA
process as the last phase in our workflow before posting to the database. Each
document or document set passes through a markup QA specialist who first parses
the data several times at different granular levels, then reviews the markup
for anomalies and instances of creative tagging. The markup specialist has
compiled a list of tagging issues to look for, documenting new trends as they
come up to nip them in the bud. The QA specialist will correct minor tagging
issues; more significant issues will be identified and sent back to the
technical writer or editor with an explanation of the problem and method for
resolving it. This gives the responsibility back to the editors and encourages
them to focus on the integrity of the data, rather than on formatting.
The QA specialist is also tasked with training new editors in
markup concepts and in how to use the editing tools. As an integral member of
the DTD development team and an experienced editor, the QA specialist has a deep
understanding of both worlds and can help new technical writers adjust to the
intricacies and philosophies of the new publishing environment.
4. Final Comments
Technical writing and XML not only can exist in the same universe;
they also serve a very similar function. Both try to understand and define
the relationships between pieces of information and organize them into a
cohesive unit of meaning. The two disciplines complement each other: Technical
writing is intended to structure information in a consistent, easily
human-readable manner, while XML assists in maintaining consistency and extends
the content through semantic and structural tagging to be machine-readable as
well. The reusability of XML-tagged content greatly reduces the need for
writers and editors to rehash the same material for a variety of products; they
write it once and repurpose the existing content as needed. This does not mean
we need fewer editors; on the contrary, now the writers and editors can produce
more material faster and more efficiently.
The relationship between technical writing and structured markup
needs to be clearly defined and emphasized to those involved in the process of
content creation and delivery. It is imperative that protocols and workflows be
established for content creation and XML development. A quality assurance
program for both content and markup is critical to success. Training needs to
be instituted to introduce the writing and editing professionals to the new
publishing paradigm; it must be supplemented on an ongoing basis to keep skills
current and revisited whenever old format-oriented habits reappear. In
order to produce useful, media-neutral XML, the content specialists and DTD
developers and programmers cannot each live in their own little worlds,
operating in a vacuum. The development of content and DTDs must be a
coordinated effort, and lines of communication must be open and
utilized.