Friday, January 05, 2007
TCW: Jason, thanks for agreeing to chat with us today. For our readers who don’t know who you are, please tell us a little about yourself, your past experience, and your role at Mark Logic.
JH: Thanks for having me here, Scott. Or do people call you Mr. Wrangler?
Some people may know me for my work in Java, where I wrote Java Servlet Programming (O’Reilly), helped develop Tomcat and Ant, created the JDOM open source library for XML manipulation, and worked as Apache’s representative to the Java Community Process Executive Committee. For the last few years though I’ve been concentrating on XQuery and its ability to support large-scale XML content manipulation, and as part of that I joined Mark Logic and work here as Principal Technologist.
TCW: Can you tell us a little about Mark Logic? What types of solutions do you sell and to whom?
JH: Mark Logic sells an XML Content Server (called MarkLogic Server) that acts as a platform for people creating content applications. We use XQuery as the language for interacting with the server, with extensions for advanced text search, transactional updates, and other useful features. While most XQuery engines focus on handling XML data (such as purchase orders) we focus on XML content (such as books, articles, references, web pages, and blogs). XML content, unlike data, is more textual, ordered, hierarchically structured, and diverse in its construct. We’ve enjoyed a lot of success selling in the publishing and government verticals.
My focus within the company is primarily on publishing. We help publishers as they move beyond simple “aggregation of content” to what you might call “interpretation of content”. By understanding XML natively and providing an efficient mechanism to load, query, manipulate, and render XML content, publishers can raise the bar on what they deliver and how fast they can deliver it.
In my discussions with publishers I’ve identified a few trends in web publishing, trends I think we’ve helped advance:
PathCONSULT “differential diagnosis” feature.
Screenshot: Oxford African American Studies website.
Screenshot: “At A Glance” listing for Harriet Ross Tubman, Oxford African American Studies.
Screenshot: Congressional Quarterly website.
There are several other trends. I give a presentation “Web Publishing 2.0” where I explain the top ten trends I’m seeing in the publishing industry.
TCW: Can you help our readers understand what the XQuery standard is, why it was needed, how Mark Logic uses it, and why it’s a better approach than some others?
JH: XQuery is a World Wide Web Consortium (W3C) standard language designed to query XML. In some ways it is to the XML data model what SQL is to the relational data model. XQuery is currently a proposed recommendation, the last step before its formal 1.0 release.
Mark Logic uses XQuery because it provides a powerful mechanism to interact with our XML Content Server. Perhaps because of its name, some think of it as just a query language, but there’s a real programming language in there. You can do some amazing things with it. If the job involves large-scale querying, manipulating, and/or rendering of XML, XQuery produces a solution far simpler than something like Java and JDOM. XQuery can be easier to program than XSLT and, with an indexed store like MarkLogic, run much more efficiently as well.
Of course, XQuery has its limitations. The XQuery 1.0 release will lack a few important features, such as text search, the ability to modify documents, and error capturing. These will someday be standard, but in the meantime Mark Logic has had to add them to the language as extensions.
TCW: Can you tell us a little bit about one of your implementations? For example, what is SafariU and what problems does it attempt to solve?
JH: SafariU is a web site from O’Reilly Media and Pearson that helps college professors create custom books for their classes. Instead of violating copyright and photocopying parts of books for students, professors can mix and match book pieces, articles, or even their own uploaded content into a new custom book. It’s “rip, mix, burn” but for books instead of songs. The end result is a professionally bound book delivered to the college bookstore, sold for 16 cents a page. The SafariU back-end holds the content as XML (about 5 gigabytes worth) and uses that XML in conjunction with Mark Logic and XQuery to perform advanced search to find materials, HTML rendering for on-screen display, and PDF rendering for the printer-ready copies. In the final PDF the dynamically constructed table of contents and back-of-the-book index make it seem as if the content were always a single book, but in fact they’re just created on the fly with XQuery pulling out section titles and index term elements.
Screenshot: SafariU generated index.
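The on-the-fly table of contents and index described above can be illustrated with a minimal sketch. SafariU does this in XQuery against MarkLogic Server; the Python below only stands in for the idea, and the element names (`section`, `title`, `indexterm`) and the sample content are assumptions made for illustration, not SafariU's actual schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sections chosen from different sources for one custom book.
# The element names are assumptions, not SafariU's real markup.
SECTIONS = [
    "<section><title>Servlet Basics</title>"
    "<para>A <indexterm>servlet</indexterm> handles a "
    "<indexterm>request</indexterm>.</para></section>",
    "<section><title>Parsing XML</title>"
    "<para>Use a <indexterm>parser</indexterm> for each "
    "<indexterm>request</indexterm>.</para></section>",
]


def build_toc_and_index(section_sources):
    """Return (toc, index) for the sections in their chosen order."""
    toc = []                    # list of (section number, title)
    index = defaultdict(list)   # term -> section numbers where it appears
    for number, source in enumerate(section_sources, start=1):
        section = ET.fromstring(source)
        toc.append((number, section.findtext("title")))
        for term in section.iter("indexterm"):
            if number not in index[term.text]:
                index[term.text].append(number)
    # Sort terms alphabetically, as a printed index would be.
    return toc, dict(sorted(index.items()))


toc, index = build_toc_and_index(SECTIONS)
for number, title in toc:
    print(number, title)
for term, sections in index.items():
    print(term, sections)
```

Because both structures are derived from the markup each time the book is assembled, reordering or swapping sections needs no manual upkeep of the contents page or index.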
O’Reilly recently set up a Labs site to experiment with the different alternate applications the SafariU system could support. There you’ll find a code search, an image search, a couple of quiz games, and a content statistics application to learn all about what books O’Reilly sells. Before using Mark Logic, O’Reilly hosted their content on a simple shared NFS mount and had very little visibility into their most important assets. Now with the Labs site, we can all learn that they have over 303 million words in print and up for sale, including about 2.5 million lines of (now searchable) code.
Screenshot: A results page from O’Reilly Labs.
TCW: That seems like a very beneficial use of XQuery. What other services does SafariU hope to introduce?
JH: The list is long. One thing I’d like to see is a “Search My Bookshelf” feature where you register the books in your office library and search via the site to find which books have the answer you’re looking for. I’ll be happy if I never again use a back-of-the-book index, even as I enjoy having paper copies of books to read. If I do use a classic index, I’d want it to be a dynamically generated index, a concordance putting together all the indexes of all my purchased books, delivered to me as my own personal PDF to print. That’s actually quite an easy “query” to write.
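To give a feel for why merging indexes is an easy query, here is a small sketch in Python rather than XQuery. The book names, terms, and page numbers are hypothetical placeholders, and the dictionary layout is an assumption chosen only to keep the example short.

```python
from collections import defaultdict

# Hypothetical per-book indexes: term -> page numbers in that book.
BOOKSHELF = {
    "Book A": {"cookies": [210, 203], "sessions": [198]},
    "Book B": {"FLWOR": [45], "sessions": [12]},
}


def merge_indexes(bookshelf):
    """Combine every book's index into one alphabetical concordance."""
    merged = defaultdict(list)
    for book, book_index in bookshelf.items():
        for term, pages in book_index.items():
            merged[term].append((book, sorted(pages)))
    # Order terms case-insensitively, and books alphabetically under each term.
    return {term: sorted(merged[term]) for term in sorted(merged, key=str.lower)}


concordance = merge_indexes(BOOKSHELF)
for term, entries in concordance.items():
    print(term, entries)
```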
TCW: This is the kind of solution that is more impressive when you see it in action. Is it possible to test drive SafariU?
JH: You can learn about SafariU and view a Flash demo. I’m afraid you won’t be able to test drive unless you’re a professor, as users have unrestricted access to all of O’Reilly’s content, as well as much of Pearson’s.
TCW: Can you tell us a little bit about Oxford University Press? How are they using XQuery and what types of problems are they solving?
JH: First, let me say Oxford University Press (OUP) has some of the most beautiful XML I’ve ever seen. They invested heavily over the years in creating semantically rich XML, and it’s been a real pleasure to work with them to realize the value of this markup by helping them create a platform to host their online sites. You can see XQuery in action on their sample “At a Glance” page. This page includes content pulled from up to 10 different sources, more or less on the fly, and reconciled so that the user can get an instant overview of a particular topic.
TCW: That’s very impressive. What else can you tell us about the Oxford University Press project?
JH: With Oxford’s publishing platform, the platform underlying the African American Studies Center, they’re going to be able to roll out new sites faster than one per quarter. Mark Logic literature says, “We accelerate the creation of information products.” I think their platform, built on Mark Logic and XQuery, proves that true. One reason is there’s no “impedance mismatch” between the XML Content Server and the XML content being manipulated. It’s the power of using the right tool for the job.
TCW: It seems like this is the kind of solution that is more impressive when you see it in action. Is it possible to test drive Oxford’s solution?
JH: Yes, on the home page (oxfordaasc.com) you can request a free trial login. If you’re just curious about what they offer, the Flash demo walks you through the main feature set.
TCW: There’s always a “Wow!” factor project in every information technology professional’s arsenal. We use these examples to help folks understand the power of a technology in ways that are meaningful to them. What are some of the cool and useful things one could do with XQuery? Do you have an example you like to use that would help our readers understand what other problems XQuery could be used to solve?
JH: One time I was asked to create a “press view” of any arbitrary medical journal article. The journal publisher wanted to help reporters understand things when they came to report on a new study with interesting consequences. The challenge of course is that these reporters weren’t medical experts. I created the “press view” with XQuery and the query, manipulate, render sequence of actions.
I determined that more than anything else reporters would want to see the charts and images from the article, the eye candy. Since I too wasn’t a medical expert, I knew the figures were what I looked at to try to understand what the article was about. My first query thus was: find me all figures and figure captions in the article. People need context, so for each figure I did another query to find any paragraph that mentioned the figure. These I included after the figure. (You can see a similar contextual display in O’Reilly’s Labs Image Search offering.) To explain the purpose of the article I decided to query for the table of contents page of the journal issue containing the article and place at the top of the “press view” the article’s blurb from the TOC. At the end of the “press view” I then queried all the later journal issues for any letters to the editor pertaining to this article, and I placed those in threaded order. That way reporters could see any controversy the article stirred up.
I think the creation of this synthetic document shows the value of content applications. It’s not just styling for output. It’s querying, manipulating, and rendering. That you can use the same scripting style language to search, extract, and style the content is really powerful.
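The figure-plus-context step of the “press view” can be sketched as follows. The original was written in XQuery; this Python version only mirrors the logic, covers just the figures and their mentioning paragraphs (not the TOC blurb or the letters), and assumes a hypothetical article markup in which paragraphs cite figures through `xref` elements with a `ref` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup; the element and attribute names are assumptions.
ARTICLE = """
<article id="a1">
  <para>Dosage was varied across groups (see <xref ref="f1">Figure 1</xref>).</para>
  <para>Methods are described in the appendix.</para>
  <figure id="f1"><caption>Response by dosage</caption></figure>
  <para>Recovery times in <xref ref="f2">Figure 2</xref> were shorter,
  consistent with <xref ref="f1">Figure 1</xref>.</para>
  <figure id="f2"><caption>Recovery time</caption></figure>
</article>
"""


def text_of(element):
    """Flatten an element to plain text with normalized whitespace."""
    return " ".join("".join(element.itertext()).split())


def press_view(article_xml):
    """For each figure, collect its caption and the paragraphs that cite it."""
    article = ET.fromstring(article_xml)
    paragraphs = article.findall("para")
    view = []
    for figure in article.iter("figure"):
        figure_id = figure.get("id")
        context = [
            text_of(para)
            for para in paragraphs
            if any(x.get("ref") == figure_id for x in para.iter("xref"))
        ]
        view.append({
            "figure": figure_id,
            "caption": figure.findtext("caption"),
            "context": context,
        })
    return view


view = press_view(ARTICLE)
for entry in view:
    print(entry["figure"], entry["caption"], entry["context"])
```

The paragraph that never cites a figure is left out, which is the point of the query: reporters see each figure followed only by the text that explains it.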
TCW: Are there any questions you wish we had asked you? If so, now is your time to ask them.
JH: You could ask where people should go for more information. Here are a few resources:
Thanks again, Scott. It’s always fun when someone asks about the thing you’re passionate about.
Filed under: Search : XML : XQuery