Monday, November 19, 2007
Repurposed with permission from Data Conversion Laboratory, DCLNews
If everyone formatted documents “correctly,” following all the rules, document conversion would be a “piece of cake,” as Michael Gross likes to say. But that’s not the real world. People don’t always read manuals, they take shortcuts, and sometimes they use software in ways its developers never conceived. In this interview, Michael Gross tells us about the hidden traps that can ambush a conversion effort.
Q: I recently saw a website advertising fully automated legacy document conversion to XML. Can this be true?
A: While it’s possible for tools or web-based software to perform a completely automated conversion, that doesn’t necessarily mean that the documents produced are “ready-to-use” XML. Assuming that the documents you want to convert are authored in a word-processing or a desktop publishing environment, the bulk of the document could certainly be converted to an XML representation. The primary challenge, however, is that electronic publishing is mostly about appearance (how the document will look) while XML is more about structure and content, so in many cases, the structure must be inferred from visual clues in the source documents (e.g. if something “looks” like a heading then it’s probably a heading, but not always).
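To make the “looks like a heading” inference concrete, here is a minimal Python sketch of an appearance-based heuristic. Everything in it is an assumption made for illustration: the `Paragraph` record, the 10-point body size, the 12-word limit, and the `title`/`para` tag names are hypothetical and do not come from any particular conversion tool.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Paragraph:
    # Hypothetical record of what a source format might expose per paragraph.
    text: str
    font_size: float
    bold: bool


def looks_like_heading(p: Paragraph, body_size: float = 10.0) -> bool:
    """Guess from appearance alone whether a paragraph is a heading."""
    short = len(p.text.split()) <= 12                     # assumed word limit
    no_final_period = not p.text.rstrip().endswith(".")
    emphasized = p.bold or p.font_size > body_size
    return short and no_final_period and emphasized


def to_xml(paragraphs):
    """Emit one element per paragraph; 'title' and 'para' are assumed tag names."""
    lines = []
    for p in paragraphs:
        tag = "title" if looks_like_heading(p) else "para"
        lines.append(f"<{tag}>{escape(p.text)}</{tag}>")
    return lines


doc = [
    Paragraph("Installing the Pump", 14.0, True),
    Paragraph("Remove the four screws that hold the cover.", 10.0, False),
    Paragraph("Do not open the case", 10.0, True),  # a bold warning, not a heading
]
for line in to_xml(doc):
    print(line)
```

The bold warning in the last paragraph is tagged as a title even though it is not one, which is the “but not always” case: a rule keyed to appearance cannot tell emphasis from structure.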
The adage that 95% of the effort is expended on the last 5% of a project applies to document conversion as well. Typically 95% of the document will convert properly, but the remaining 5% that automated conversion can’t get right will take 95% of the effort to clean up. In reality, the accuracy of a conversion is partially a function of how well the documents were authored to begin with. Modern electronic authoring environments have powerful features that let users set up sophisticated stylesheets and generate links and cross-references automatically, so if those features are used properly, an automated conversion can come close to perfect. But that is only if (and that’s a pretty big IF) the document authoring rules are closely followed and enforced for all document authors.
In the real world, product technical documentation is often completed in a rush, immediately before a product ships, and no one takes the time at that point to make sure a document is authored in a way that makes conversion easier. To make things worse, the people making changes often don’t know the authoring environment very well, so the authoring is done sloppily… but it looks okay on paper, and that’s what counts at that moment. An XML industry pundit once told me that instead of calling today’s electronic publishing WYSIWYG (What You See Is What You Get) authoring, it would be more appropriate to call it WYSIATYG (What You See Is All That You Got), because very often it looks good on paper, but is hard to modify, and even harder to convert to a structured markup.
Q: What are the document elements that make totally automated conversion to XML difficult?
A: In conversion to XML, some tags relate to document structure, and others relate to document content.
Content Tags. Document content tagging is usually the most challenging, since content tags usually refer to the meaning, or semantics, of what they contain. Computers are just not that good at understanding “meaning.” For instance, if the XML tagging requires a tag around each repair procedure, the source publishing documents usually do not contain that type of information, so even in the best case you need to infer it as best you can. This is often done by looking for specific word patterns (words such as “Repair” or “Fix”). This type of approach will usually require a human review, since the automated process is bound to get it wrong at least some of the time (even humans don’t always agree on these). So you can expect to review many of your content tags.
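The word-pattern approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern, the `repair-procedure` and `section` tag names, and the `suggest_content_tag` helper are hypothetical, and real conversion rules would be tuned to the document set.

```python
import re

# Assumed pattern: "repair" or "fix" plus a few common endings, as whole words.
PROCEDURE_HINT = re.compile(r"\b(repair|fix)(e[sd]|ing|s)?\b", re.IGNORECASE)


def suggest_content_tag(title: str):
    """Return (suggested_tag, needs_review) for a section title.

    Tag names are hypothetical. Every pattern hit is flagged for human
    review because the match says nothing about what the section means.
    """
    if PROCEDURE_HINT.search(title):
        return ("repair-procedure", True)
    return ("section", False)


for title in (
    "Repairing the Pump Seal",         # correct hit
    "Fixed-Mount Bracket Dimensions",  # false positive: not a procedure
    "Replacing the Drive Belt",        # miss: a repair with no keyword
):
    print(title, "->", suggest_content_tag(title))
```

The second and third titles show the two ways the pattern fails: it tags a dimensions table as a repair procedure, and it silently misses a repair that happens not to use either keyword. Flagging the hits catches the first kind of error; the second kind is only found by reviewing the sections that were not tagged.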
Structure Tags. Regarding automated conversion of structure tagging, here are some examples of document structures that might trip up the automated conversion:
Q: Can I assume that if I already have structured documents in an XML or SGML format, conversion to another XML structure can be completely automated?
A: While this is a reasonable expectation, the reality is that not all XML documents are created equal, so manual intervention is often required even in XML-to-XML conversion. Remember that the degree of content tagging within XML markup schemes varies greatly. If the target markup scheme requires a higher level of content markup than the source documents provide, then you will likely need some manual intervention. For instance, if you are converting to DITA (an increasingly popular XML markup scheme used for technical documentation), you need to break documents down into individual reusable topics, and define a type for each topic. This information doesn’t exist in most XML source documents, so some form of manual intervention is required.
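As a rough illustration of where the gap appears, the following Python sketch splits a generic XML document into topic stubs. The source element names (`doc`, `section`, `list`, `item`) and the “ordered list means task” rule are assumptions made for this example; `task` is one of the standard DITA topic types, alongside `concept` and `reference`.

```python
import xml.etree.ElementTree as ET

# Hypothetical source markup; it carries no topic-type information.
SOURCE = """<doc>
  <section><title>About the Pump</title><para>The pump moves coolant.</para></section>
  <section><title>Replacing the Seal</title>
    <list type="ordered"><item>Drain the pump.</item><item>Remove the seal.</item></list>
  </section>
</doc>"""


def split_into_topics(xml_text):
    """Turn each <section> into a topic stub and guess its DITA topic type."""
    root = ET.fromstring(xml_text)
    topics = []
    for number, section in enumerate(root.iter("section"), start=1):
        title = section.findtext("title", default="")
        # Assumed rule: a numbered list of steps suggests a task topic.
        has_steps = section.find(".//list[@type='ordered']") is not None
        topic_type = "task" if has_steps else "UNKNOWN"  # UNKNOWN: a person decides
        topics.append({"id": f"topic_{number}", "title": title, "type": topic_type})
    return topics


for topic in split_into_topics(SOURCE):
    print(topic)
```

Splitting on section boundaries is mechanical, but the first section comes out as `UNKNOWN` because nothing in the source says whether it is a concept or a reference topic. That decision is the manual intervention described above.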
Q: I’ve already converted my documentation to HTML webpages. Now I’d like to elevate these pages to an XML version. Can this be done with an automated conversion?
A: Because HTML is closely related to the SGML and XML markup standards, people assume that conversion to XML should be easy. However, HTML can often be one of the most difficult formats to convert. First, the source HTML is often not well-formed, because web browsers tend to be quite forgiving and don’t enforce very much structure. Second, HTML markup often uses convoluted tagging designed simply to produce a certain rendering in a web browser. For example, HTML pages typically contain table structures that are not really meant to be tables (in an XML sense), but were used to position elements on a webpage. So to convert to XML, you’ve got to differentiate between true tables (which should remain in the XML) and positional tables (which should be stripped from the HTML). In addition, HTML pages tend to be cluttered with navigational aids, JavaScript code, and advertisements, all of which need to be removed to produce correct XML. As with other automated legacy document conversion, if the HTML pages were authored in a highly consistent fashion, there is a greater possibility that you can build an automated conversion that will produce accurate XML.
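One crude way to separate true tables from positional ones is to check for header cells or a caption. The Python sketch below uses the standard library’s `html.parser`, which tolerates markup that is not well-formed. The heuristic itself is an assumption for illustration, not a rule from any conversion product, and the `TableClassifier` and `classify_tables` names are hypothetical.

```python
from html.parser import HTMLParser


class TableClassifier(HTMLParser):
    """Label each <table> as 'data' or 'layout' with a crude heuristic."""

    def __init__(self):
        super().__init__()
        self._open_tables = []  # one dict per table that is currently open
        self.results = []       # labels, in the order the tables are closed

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._open_tables.append({"has_header": False})
        elif tag in ("th", "caption") and self._open_tables:
            # Assumed rule: header cells or a caption suggest a real table.
            # Only the innermost open table is credited.
            self._open_tables[-1]["has_header"] = True

    def handle_endtag(self, tag):
        if tag == "table" and self._open_tables:
            info = self._open_tables.pop()
            self.results.append("data" if info["has_header"] else "layout")


def classify_tables(html):
    parser = TableClassifier()
    parser.feed(html)
    parser.close()
    return parser.results


PAGE = """
<table><tr><td><a href="/">Home</a></td><td>
  <table><caption>Torque values</caption>
    <tr><th>Bolt</th><th>Nm</th></tr><tr><td>M6</td><td>10</td></tr></table>
</td></tr></table>
"""
print(classify_tables(PAGE))  # the inner table closes first
```

Here the inner table is labelled `data` and the outer one `layout`. The rule breaks down on pages where real tables were authored without `th` or `caption` elements, or where a table is never closed, which is why consistent authoring matters so much for an automated pass.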
So whether you are converting from electronic publishing formats or from markup formats, if you know your target format’s requirements in advance, and can plan the way you author your documentation, developing rigorous standards that are followed precisely, you’ll have a better chance at converting it in an automated fashion. If this is not the case (and it rarely is), for the reasons that we have outlined, you should expect to put in a fair amount of manual effort to check the results of your automated conversion and to address the issues that the conversion could not handle properly.
About Michael Gross
Michael Gross is the Chief Technology Officer and Director of Research and Development for Data Conversion Laboratory. He is responsible for all software-related issues, including product evaluations, feasibility studies, technical client support, and management of in-house software development. He has been solving digital publishing conversion problems at DCL for twenty years and has overseen thousands of legacy conversion projects.
