Friday, March 14, 2008
By Michael Gross, DCLNews
In the early days of digitizing information, five years ago, it was enough to just make more and more content electronic. With the ever-growing mounds of data out there, creating more ‘electronic paper’ is no longer enough.
There’s a tremendous need to enhance the information so it can be more readily found, more easily accessed, and more easily reorganized. Content tagging in XML and SGML is key in this effort.
This article discusses content tagging and how one might incorporate this enhanced information into legacy documents that were not written with these tags in mind.
What is Content Tagging?
When documents are converted to XML (or SGML), part of the conversion process is to create the tags that make XML so useful. Most of the created XML tags are there to replicate the printed structure of the document. These “appearance” tags describe constructs such as sections, lists, captions, paragraphs, and tables. “Content tagging,” on the other hand, refers to tagging that is based on the semantic meaning of the content. For example, a content tag in a maintenance manual might identify a word or a phrase such as a tool or a part number. In a life sciences technical document, there may be a tag
Two recently popular technical markup schemes, the Darwin Information Typing Architecture (DITA) and S1000D, both require content tagging at a fairly high level. Going further, creating S1000D documents requires that the content be decomposed into modules, with each module requiring a Data Module Code that identifies where that module sits within the overall documentation set. This Data Module Code is also a form of content tagging, since it describes what the module is about. Similarly, in DITA, content is decomposed into topics, and those topics are identified by type (e.g. tasks, concepts, and references). Deciding which topic type to apply to a particular DITA topic requires content knowledge, since a topic's type is determined by the semantic information it contains.
Finally, content tagging can often mean having to take content tables (tables that have a particular appearance) and decompose that appearance into a particular set of content tags designed to hold that information. As an example, a technical maintenance document may contain a table of required tools. Converting this type of document requires all of the information in those tables to be broken down into content tags such as
Legacy Document Content Tagging Conversion
The issue with content tags, as useful as they are, is that they don’t usually exist explicitly in the document; rather, they need to be inferred. This process usually requires a combination of automated tools that can get you much of the way there, a manual review process (because tools cannot effectively deal with all situations), and occasional use of a Subject Matter Expert (SME) for some level of review. Since SME time is usually quite expensive, limiting SME involvement is an important strategy.
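As a rough illustration of how automated tagging and manual review can be combined, here is a minimal Python sketch. The tag names (`tool`, `partnumber`), the regular expressions, and the cue words are all hypothetical; real patterns would have to come from the conventions of the document set being converted. Paragraphs with a pattern match are tagged automatically, while paragraphs that mention a cue word but match no pattern are queued for a person to look at.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical patterns, for illustration only.
PATTERNS = {
    "partnumber": re.compile(r"\b\d{3}-\d{4}\b"),
    "tool": re.compile(r"\b(?:torque wrench|feeler gauge)\b", re.IGNORECASE),
}
# Hypothetical cue words hinting that a paragraph may need content tags.
CUES = re.compile(r"\b(?:tool|part)\b", re.IGNORECASE)


def find_candidates(paragraph):
    """Return (start, end, tag) tuples for every pattern match, in text order."""
    found = []
    for tag, pattern in PATTERNS.items():
        for m in pattern.finditer(paragraph):
            found.append((m.start(), m.end(), tag))
    return sorted(found)


def apply_tags(paragraph):
    """Wrap each matched span in its content tag, escaping the rest for XML."""
    out, pos = [], 0
    for start, end, tag in find_candidates(paragraph):
        if start < pos:  # skip a match that overlaps one already tagged
            continue
        out.append(escape(paragraph[pos:start]))
        out.append(f"<{tag}>{escape(paragraph[start:end])}</{tag}>")
        pos = end
    out.append(escape(paragraph[pos:]))
    return "".join(out)


def triage(paragraphs):
    """Split paragraphs into auto-tagged text and a manual review queue."""
    tagged, review = [], []
    for p in paragraphs:
        if find_candidates(p):
            tagged.append(apply_tags(p))
        elif CUES.search(p):
            review.append(p)  # a cue word but no match: a person should look
        else:
            tagged.append(escape(p))
    return tagged, review
```

Only the review queue needs human attention, and only the items a reviewer cannot resolve would go on to an SME.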
To help illuminate successful approaches we’ve used, the following revisits the examples referred to earlier:
Example 1: Converting information into tags such as the
Example 2: With S1000D modules and DITA topics, fully identifying module types and topic types requires a thorough understanding of what the text is about. Automated tools and text patterns can help highlight clues within the text, but a person usually needs to review the text and select the proper classifications. This can be done fairly easily by supplying the client with a table of all section headings in the legacy documents. The client can then review the table and complete it as needed. The review typically calls for someone familiar with the materials, but not necessarily a Subject Matter Expert (SME); an SME needs to review only those items that remain ambiguous.
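The heading table described in Example 2 could be produced along the following lines. This is only a sketch: the cue words and the column names (`heading`, `suggested_type`, `confirmed_type`) are assumptions made for illustration, not part of DITA or of any particular tool. A suggestion is filled in only when exactly one topic type matches, so ambiguous headings are left blank for the reviewer.

```python
import csv
import io

# Hypothetical cue words for each DITA topic type; tune per document set.
CUES = {
    "task": ("removing", "installing", "replacing", "how to", "procedure"),
    "reference": ("specifications", "parts list", "codes"),
    "concept": ("overview", "introduction", "about", "theory"),
}


def suggest_type(heading):
    """Suggest a topic type only when exactly one type's cue words match."""
    lowered = heading.lower()
    matches = [
        topic_type
        for topic_type, words in CUES.items()
        if any(word in lowered for word in words)
    ]
    return matches[0] if len(matches) == 1 else ""


def build_review_table(headings):
    """Return CSV text with one row per heading for the client to complete."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["heading", "suggested_type", "confirmed_type"])
    for heading in headings:
        # confirmed_type is left empty for the reviewer to fill in.
        writer.writerow([heading, suggest_type(heading), ""])
    return buf.getvalue()
```

The resulting file opens in any spreadsheet program, which keeps the review step accessible to someone who knows the material but not the markup.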
Example 3: With our final example, the decomposition of content tables, the required level of effort can vary widely. In the best case, all tables were set up using the same table template. Converting these tables to content tagging can be rather simple because this type of table can be identified by examining the column headings, and each column heading will typically translate into a tag (or series of tags).
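For the best case, where every table follows one template, the heading-to-tag translation can be sketched as below. The column headings and the element names (`requiredtools`, `tool`, `toolname`, `partnumber`, `quantity`) are hypothetical and would be replaced by whatever the target schema defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: column heading -> content tag.
COLUMN_TAGS = {
    "Tool Name": "toolname",
    "Part Number": "partnumber",
    "Quantity": "quantity",
}


def table_to_content_xml(header, rows):
    """Convert a template-conforming table into content-tagged XML text."""
    headings = [h.strip() for h in header]
    if set(headings) != set(COLUMN_TAGS):
        raise ValueError("table does not match the expected template")
    root = ET.Element("requiredtools")
    for row in rows:
        item = ET.SubElement(root, "tool")
        # Each cell becomes an element named after its column's content tag.
        for heading, cell in zip(headings, row):
            ET.SubElement(item, COLUMN_TAGS[heading]).text = cell.strip()
    return ET.tostring(root, encoding="unicode")
```

Raising an error on any heading mismatch is a deliberate choice in this sketch: a table that deviates from the template is better sent to review than converted on a guess.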
More typically, tables were created using “similar” column headings, with some columns added or deleted as needed by the author of the table, without following a rigorous template. In those cases it becomes important to identify which column headings actually identify the table type, and which ones are optional. Even in the best case, there will be exceptions such as notes (or occasional extra cells) used to convey additional information that the author felt was necessary to include, but for which there is no simple content tag. For those cases, a strategy will need to be worked out with the client on a case-by-case basis.
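When headings are only "similar", one possible approach is to declare, for each table type, which headings are required to identify it and which are optional, and to report any leftover columns for case-by-case handling. The table types, headings, and aliases in this sketch are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableType:
    name: str
    required: frozenset  # headings that identify the table type
    optional: frozenset = frozenset()  # headings that may or may not appear


# Hypothetical table types and heading aliases.
TABLE_TYPES = [
    TableType("required_tools",
              frozenset({"tool", "part number"}),
              frozenset({"quantity", "supplier"})),
    TableType("consumables",
              frozenset({"material", "specification"}),
              frozenset({"quantity"})),
]
ALIASES = {"qty": "quantity", "part no": "part number"}


def normalize(heading):
    """Lowercase, drop periods, collapse whitespace, then resolve aliases."""
    cleaned = " ".join(heading.lower().replace(".", "").split())
    return ALIASES.get(cleaned, cleaned)


def classify_table(header):
    """Return (table type name or None, headings that have no content tag)."""
    seen = {normalize(h) for h in header}
    for table_type in TABLE_TYPES:
        if table_type.required <= seen:
            unmapped = sorted(seen - table_type.required - table_type.optional)
            return table_type.name, unmapped
    return None, sorted(seen)
```

A non-empty second value (for instance a "Notes" column) marks the table for the kind of client discussion described above instead of silently dropping the extra information.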
Conclusion
As our discussion illustrates, content tagging covers a broad range of issues that may need to be addressed in implementing a successful legacy conversion to a modern XML markup scheme. These issues often have to be dealt with on a case-by-case basis, automating when possible, but when it comes to content tagging, a manual review is often needed to detect anomalies that are just not practical to handle via automated methods.
About the Author
Michael Gross is the Chief Technology Officer and ...