
Feature Article

Tuesday, April 15, 2008

Automated Intelligent Document Classification, Data Extraction and Search Tools for Legal Pros

[Note: Although this article is a bit heavy on the marketing hype, we’ve included it here for your review. Automated document classification is an important concept for those who work on large scale enterprise content management initiatives to understand. Additionally, the methods used to translate hand-written content (like medical records and all types of forms) into machine-readable content are critically important techniques that all content professionals should be aware of. The focus of this article is the legal industry, but the technologies discussed are applicable in virtually any content-heavy industry.—The Content Wrangler]

Legal teams worldwide often face the prospect of having to sort through mounds of paper documents, looking for particular relevant pieces of information. The need comes up across the board in legal practice: during the process of discovery, in performing due diligence in advance of mergers or acquisitions, in investigatory proceedings and in the redaction of privileged information prior to the release of documents.

Traditionally, scouring these documents has been a long and exhaustive manual process, often requiring engaging paralegals on an hourly basis, diverting crucial internal resources or outsourcing the task to a third party. Great strides have been made in similar searches of electronic original documents, as the data is by nature simple to index by competent e-discovery or enterprise search systems. However, even these advances hit the wall when presented with images of paper documents, particularly those that do not follow a predetermined, standardized format or contain handwritten information.

In response to these needs, A2iA has made its A2iA DocumentReader software available to legal professionals and the third parties that serve them. By utilizing advanced intelligent word recognition (IWR) and character recognition technologies, combined with robust document identification/classification, data extraction and search capabilities, DocumentReader is successfully navigating the last mile in legal searches. Able to handle structured, semi-structured and even unstructured, handwritten documents, DocumentReader represents a much more efficient and cost-effective way for legal teams to identify, classify and search through paper documents, evidence and other materials.

An Overview of The Problems
Every day, legal professionals worldwide are faced with the onerous task of sorting through masses of documents of every type. The Federal Rules of Civil Procedure (FRCP) clearly state that all pertinent documents, no matter their form, are subject to the same identification, preservation, disclosure and production requirements. Whether machine-printed or handwritten, standardized forms or free-form blocks of text, any and all relevant documents are required by the FRCP to be produced for inspection or other purposes, without exemption.

Even a small civil case might require a firm to search out pertinent information from boxes of documents: official forms, invoices, letters or hand-written notes. Large corporate mergers requiring extensive due diligence can involve the culling of literally roomfuls of documentation, searching for key data elements, names or phrases among individual pages. Corporate and government redaction efforts are likewise huge in scope, often examining thousands of pages of documentation for individual occurrences of privileged information.

Whether acquiring information for purposes of discovery, due diligence or redaction, documents must be sorted, classified and searched for key data elements. In real world practice today, such documents come in many forms: as data on a hard drive or server, imaged files of previously scanned documents or paper originals needing to be digitized. The wide variety of source materials has hindered many attempts to address these needs. The computerized automation that works exceedingly well for searching electronic data may be incapable of doing the same for the contents of scanned images of paper originals.

For years, the only solution was to perform this work manually, and many firms and legal departments still do so to this day. That’s not to say there haven’t been steps toward a better, more effective method. An entire industry has sprung up around accomplishing this goal, offering everything from dedicated third-party manual (often off-shore) outsourcing of document handling to the complex means of searching through electronic original data known as e-discovery. Each of these has its benefits and limitations.

Many of these electronic solutions hit the wall when it comes to dealing with unstructured documents. Unless documents conform to standardized, recognized forms and consist of printed text, finding and extracting the necessary data from them can be problematic. Digitized images of loosely structured documents or free-form handwritten blocks of text have been particularly difficult to work into an automated solution. Time-intensive manual searches of these documents can often be less accurate than results from systems set up to similarly comb through digital originals and purely electronic data. Additionally, there is still some concern among attorneys that shipping documents to a third party for analysis opens up the possibility of violating privilege.

DocumentReader bridges this gap in functionality and, when integrated into existing systems for handling e-discovery, redaction or due diligence, gives legal professionals far greater capabilities when it comes to managing their document needs. Costs are lowered, efficiencies boosted, times shortened and, perhaps most importantly, accuracy is significantly enhanced. Able to recognize and transcribe free-form and even handwritten documents with as much ease as standardized forms, DocumentReader represents a much more effective way to search through imaged documents than any of the traditional methods available to legal professionals.

Incorporating DocumentReader into a larger enterprise search system yields a more powerful method of searching documents, one that increases efficiency and accuracy while expanding the range of documents and information that can be recognized, categorized, indexed and searched. No longer must firms manually review thousands of pages of paper documentation for specific data elements. DocumentReader performs these searches with ease, quickly eliminating documents that do not contain predefined elements and zeroing in on those that do.

Through a combination of leading-edge recognition technology and the flexibility to adapt to a range of task-specific configurations, DocumentReader fits perfectly into an overall enterprise search set-up and meets the real needs of legal teams.

About IWR and ICR
Intelligent word recognition (IWR) recognizes entire handwritten or printed words, matching them to a user-defined dictionary and significantly reducing the number of character errors associated with more typical character-recognition engines. Instead of looking at words letter-by-letter, IWR performs a deeper analysis. The system breaks each word it analyzes into a sequence of graphemes: the subparts of letters, such as the curves, shapes and lines that make them up. IWR considers various shape and letter groupings in order to calculate a confidence value associated with the word in question. When individual characters are relatively well-formed and distinctly separated, intelligent character recognition (ICR) is used instead, drawing on extracted features such as curves, loops and lines indicative of particular characters to identify them.
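
The dictionary-matching idea described above can be illustrated with a minimal Python sketch. It is not A2iA's algorithm: it swaps grapheme-level analysis for a plain string-similarity ratio from Python's standard difflib module, and the function name best_lexicon_match and the sample lexicon are hypothetical. The point is only to show how scoring a noisy reading against a user-defined dictionary produces both a corrected word and a confidence value.

```python
from difflib import SequenceMatcher


def best_lexicon_match(raw_word, lexicon):
    """Return (entry, confidence) for the lexicon entry closest to raw_word.

    Confidence is difflib's similarity ratio in [0, 1]. This stands in for
    the grapheme-based scoring an IWR engine would use (an assumption made
    for illustration only).
    """
    if not lexicon:
        raise ValueError("lexicon must contain at least one word")
    scored = [
        (entry, SequenceMatcher(None, raw_word.lower(), entry.lower()).ratio())
        for entry in lexicon
    ]
    # Keep the entry with the highest similarity score.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical user-defined dictionary of expected terms.
lexicon = ["plaintiff", "defendant", "affidavit", "subpoena"]

# A noisy character-level reading with two letters transposed.
word, confidence = best_lexicon_match("defendnat", lexicon)
print(word, round(confidence, 2))
```

In a sketch like this, a low confidence value would be the signal to route the word to a human reviewer rather than accept the match.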

Automating Document Data Capture and Classification
DocumentReader leverages intelligent word and character recognition technology and advanced document classification capabilities to greatly expand automated discovery of key information within documents. Once scanned, digitized versions of paper documents are sorted according to characteristics and document type. The documents are transcribed and can be searched for predefined data elements. Results can then be incorporated into the firm's pre-existing e-discovery or document management systems, providing unparalleled access to extracted data.
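
The sort-then-search flow described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not DocumentReader's implementation: it assumes the documents have already been transcribed to text, classifies them by simple keyword overlap, and searches by case-insensitive substring. The names TYPE_KEYWORDS, classify and find_elements, and the keyword sets themselves, are hypothetical.

```python
import re

# Hypothetical keyword sets that characterize each document type.
TYPE_KEYWORDS = {
    "invoice": {"invoice", "amount", "due"},
    "letter": {"dear", "sincerely"},
}


def classify(text):
    """Assign the document type whose keywords overlap the text the most."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {
        doc_type: len(words & keywords)
        for doc_type, keywords in TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # With no keyword hits at all, leave the document unclassified.
    return best if scores[best] > 0 else "unclassified"


def find_elements(text, elements):
    """Return the predefined data elements that occur in the text."""
    lowered = text.lower()
    return [element for element in elements if element.lower() in lowered]


# Transcribed text of three hypothetical scanned pages.
pages = [
    "Invoice 1234: amount due $500, payable to Acme Corp",
    "Dear Ms. Smith, thank you for your time. Sincerely, J. Doe",
    "Meeting notes from Tuesday",
]
search_terms = ["Acme Corp", "Globex"]

for page in pages:
    print(classify(page), find_elements(page, search_terms))
```

Pages whose find_elements result is empty could be set aside, which mirrors the idea of eliminating documents that contain none of the predefined elements.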

Intelligent searching allows the user to define any number of search items, while repeated occurrences of these items hone further recognition. Many outsourcing ...


Filed under: Intelligent Character Recognition, Intelligent Word Recognition

News & Notes
(updated daily. almost.)

Call for Presenters: Intelligent Content 2009

Thursday, September 18, 2008

Intelligent Content 2009 has announced a call for presenters. The event, to be held January 29-30, 2009, at Le Parker Méridien Palm Springs, needs presenters who are creating, managing, and delivering intelligent content and who can present on such topics as:

  • Adaptive content
  • Content mining
  • Context aware behavior
  • Dynamic device adaptation/multichannel publishing
  • Integrated social software and semantics
  • Personalized content
  • Semantic retrieval

The organizers are seeking submissions (presentations, case studies, panel sessions, workshops and interactive demonstrations) that are visionary and practical. But, more than anything, the organizers are seeking sessions that will help attendees learn something useful, something they can use when they return to the office. Case studies of content projects (web, print and/or mobile) are highly desired, as are presentations on content problems solved by social networks or via mashups; anything goes. If you are doing some really forward-looking work, let the organizers know.
