
Feature Article

Tuesday, April 15, 2008

Automated Intelligent Document Classification, Data Extraction and Search Tools for Legal Pros

[Note: Although this article is a bit heavy on the marketing hype, we’ve included it here for your review. Automated document classification is an important concept for those who work on large scale enterprise content management initiatives to understand. Additionally, the methods used to translate hand-written content (like medical records and all types of forms) into machine-readable content are critically important techniques that all content professionals should be aware of. The focus of this article is the legal industry, but the technologies discussed are applicable in virtually any content-heavy industry.—The Content Wrangler]

Legal teams worldwide often face the prospect of having to sort through mounds of paper documents, looking for particular relevant pieces of information. The need comes up across the board in legal practice: during the process of discovery, in performing due diligence in advance of mergers or acquisitions, in investigatory proceedings and in the redaction of privileged information prior to the release of documents.

Traditionally, scouring these documents has been a long and exhaustive manual process, often requiring engaging paralegals on an hourly basis, diverting crucial internal resources or outsourcing the task to a third party. Great strides have been made in similar searches of electronic original documents, as the data is by nature simple to index by competent e-discovery or enterprise search systems. However, even these advances hit the wall when presented with images of paper documents, particularly those that do not follow a predetermined, standardized format or that contain handwritten information.

In response to these needs, A2iA has made its A2iA DocumentReader software available to legal professionals and the third parties that serve them. By combining advanced intelligent word recognition (IWR) and character recognition technologies with robust document identification/classification, data extraction and search capabilities, DocumentReader addresses the last mile in legal searches. Able to handle structured, semi-structured and even unstructured, handwritten documents, DocumentReader represents a much more efficient and cost-effective way for legal teams to identify, classify and search through paper documents, evidence and other materials.

An Overview of the Problems
Every day, legal professionals worldwide are faced with the onerous task of sorting through masses of documents of every type. The Federal Rules of Civil Procedure (FRCP) clearly state that all pertinent documents, no matter their form, are subject to the same identification, preservation, disclosure and production requirements. Whether machine-printed or handwritten, standardized forms or free-form blocks of text, any and all relevant documents are required by the FRCP to be produced for inspection or other purposes, without exemption.

Even a small civil case might require a firm to search out pertinent information from boxes of documents: official forms, invoices, letters or hand-written notes. Large corporate mergers requiring extensive due diligence can involve the culling of literally roomfuls of documentation, searching for key data elements, names or phrases among individual pages. Corporate and government redaction efforts are likewise huge in scope, often examining thousands of pages of documentation for individual occurrences of privileged information.

Whether the goal is discovery, due diligence or redaction, documents must be sorted, classified and searched for key data elements. In real-world practice today, such documents come in many forms: as data on a hard drive or server, as image files of previously scanned documents or as paper originals that still need to be digitized. The wide variety of source materials has hindered many attempts to address these needs. The computerized automation that works exceedingly well for searching electronic data may be incapable of doing the same for the contents of scanned images of paper originals.

For years, the only solution was to perform this work manually, and many firms and legal departments still do so to this day. That’s not to say there haven’t been steps toward a better, more effective method. An entire industry has sprung up around accomplishing this goal, offering everything from dedicated third-party manual (often off-shore) outsourcing of document handling to the complex means of searching through electronic original data known as e-discovery. Each of these has its benefits and limitations.

Many of these electronic solutions hit the wall when it comes to dealing with unstructured documents. Unless documents conform to standardized, recognized forms and consist of printed text, finding and extracting the necessary data from them can be problematic. Digitized images of loosely structured documents or free-form handwritten blocks of text have been particularly difficult to work into an automated solution. Time-intensive manual searches of these documents can often be less accurate than results from systems set up to comb through digital originals and purely electronic data. Additionally, there is still some concern among attorneys that shipping documents to a third party for analysis opens up the possibility of violating privilege.

DocumentReader bridges this gap in functionality and, when integrated into existing systems for handling e-discovery, redaction or due diligence, gives legal professionals far greater capabilities when it comes to managing their document needs. Costs are lowered, efficiencies boosted, times shortened and, perhaps most importantly, accuracy is significantly enhanced. Able to recognize and transcribe free-form and even handwritten documents with as much ease as standardized forms, DocumentReader represents a much more effective way to search through imaged documents than any of the traditional methods available to legal professionals.

Incorporating DocumentReader into a larger enterprise search system yields a more powerful method of searching documents, one that increases efficiency and accuracy while expanding the range of documents and information that can be recognized, categorized, indexed and searched. No longer must firms manually review thousands of pages of paper documentation for specific data elements. DocumentReader performs these searches with ease, quickly eliminating documents that do not contain predefined elements and zeroing in on those that do.

Through a combination of leading-edge recognition technology and the flexibility to adapt to a range of task-specific configurations, DocumentReader fits perfectly into an overall enterprise search set-up and meets the real needs of legal teams.

About IWR and ICR
Intelligent word recognition (IWR) recognizes entire handwritten or printed words, matching them to a user-defined dictionary and significantly reducing the number of character errors associated with more typical character-recognition engines. Instead of looking at words letter by letter, IWR performs a deeper analysis. The system breaks each word down into a sequence of graphemes, which in this context are the subparts of letters: the various curves, shapes and lines that can make up letters. IWR considers various shape and letter groupings in order to calculate a confidence value associated with the word in question. When individual characters are relatively well-formed and distinctly separated, intelligent character recognition (ICR) is used instead, drawing on extracted features such as curves, loops and lines indicative of particular characters to identify them.
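To make the idea of dictionary matching with a confidence value concrete, here is a minimal sketch in Python. It is not A2iA's algorithm: it works on an already-transcribed (and possibly misread) string rather than on graphemes, and it uses plain string similarity from the standard library as a stand-in for a recognition score. The dictionary contents, the function name and the threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical user-defined dictionary; a real deployment would load
# a much larger legal vocabulary.
DICTIONARY = ["plaintiff", "defendant", "affidavit", "subpoena", "deposition"]


def best_dictionary_match(raw_word, dictionary=DICTIONARY, threshold=0.75):
    """Return (word, confidence) for the closest dictionary entry.

    The confidence is a string-similarity ratio between 0 and 1, used
    here only as a stand-in for a real recognition score. If no entry
    reaches the threshold, the word is returned as None.
    """
    scored = [
        (SequenceMatcher(None, raw_word.lower(), entry).ratio(), entry)
        for entry in dictionary
    ]
    confidence, word = max(scored)
    if confidence >= threshold:
        return word, confidence
    return None, confidence


# A misread word is pulled back to its dictionary entry.
word, confidence = best_dictionary_match("plaintitf")
print(word, round(confidence, 2))
```

Matching whole words against a restricted vocabulary is what lets a single misread character be absorbed instead of surfacing as an error, which is the advantage the paragraph above attributes to IWR.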

Automating Document Data Capture and Classification
DocumentReader leverages intelligent word and character recognition technology and advanced document classification capabilities to greatly expand automated discovery of key information within documents. Once scanned, digitized versions of paper documents are sorted according to characteristics and document type. The documents are transcribed and can be searched for predefined data elements. Results can then be incorporated into the firm’s pre-existing e-discovery or document management systems, providing access to the extracted data.

Intelligent searching allows the user to define any number of search items, while repeated occurrences of these items hone further recognition. Many outsourcing services may indeed scan paper documents, but these solutions ultimately stop at the point of imaging. Such services generally provide very limited metadata, usually representing some key information about the document, but stop far short of allowing detailed searches of the actual contents.

Any automated solution must be at least as accurate and error-free as the manual processes it replaces. As accuracy is equally important to successful discovery, due diligence, investigatory or redaction efforts, this is an important advantage of DocumentReader’s automated solution. By recognizing and transcribing documents word by word and converting the information to data, DocumentReader searches for pertinent data with an ease, accuracy and wide-reaching capability other solutions lack. Drilling down to the word, grapheme and character level allows DocumentReader’s recognition technology to support a full range of imaged documents, whether machine-printed or handwritten.

DocumentReader is a powerful data extraction tool, but its document flexibility, classification and searching abilities truly set it apart from seemingly similar solutions. First, using both a general dictionary and a legal-specific, user-defined trade vocabulary, the software performs a literal transcription of handwritten and/or typed areas. The software then classifies digitized documents into basic categories (letters, identity papers, tax forms, regulatory reports, contracts, invoices, etc.) based upon an analysis of both the geometry and content of the document. By extracting predefined key words from the transcription, A2iA DocumentReader determines the specific category of the document and provides an index of found data elements.
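The content-based half of that classification step can be illustrated with a small Python sketch. It is an assumption-laden toy, not the product's method: the categories, keyword sets and function name are invented for the example, and the geometric analysis of the page mentioned above is not modeled at all.

```python
import re
from collections import Counter

# Hypothetical keyword sets per category, for illustration only.
CATEGORY_KEYWORDS = {
    "invoice": {"invoice", "amount", "due", "total"},
    "contract": {"agreement", "party", "hereby", "term"},
    "letter": {"dear", "sincerely", "regards"},
}


def classify(transcription):
    """Pick the category whose keywords occur most often in the text.

    Returns (category, found_keywords); the second item serves as a
    simple index of the data elements that drove the decision.
    """
    words = Counter(re.findall(r"[a-z]+", transcription.lower()))
    scores = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "unknown", []
    found = sorted(k for k in CATEGORY_KEYWORDS[best] if words[k])
    return best, found


print(classify("Invoice 1042: total amount due on receipt"))
```

A production system would weigh far more evidence than keyword counts, but the shape is the same: transcribe first, then let extracted key words decide the category.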

DocumentReader is flexible and powerful enough to handle the full range of paper documents that legal teams must search for particular information:

  • Structured Documents: Within a given organization, specific forms may be frequently encountered. The recognition of common, standardized forms and documents can be preconfigured in DocumentReader according to the user’s specific needs. This is the simplest form of extraction, since DocumentReader immediately identifies documents based on their structure, and knows in advance the format and location of pertinent data to be extracted.
  • Semi-Structured Documents: Many documents, while not following a fixed, standardized form, contain the same data which appear in nonstandard locations. Checks, for example, may vary greatly from one to another, but in all their forms these documents contain certain common elements (names and addresses, amount, payee name, and check number) which DocumentReader recognizes, allowing data to be extracted from these locations within the document. Invoices, contracts, employment and human resources forms, passports and corporate reports are just a few more examples of the sort of semi-structured documents DocumentReader can be configured to recognize.
  • Unstructured Documents: DocumentReader can also be configured to search for predefined data elements (names, social security numbers, dates, etc.) within unstructured, handwritten documents. Documents that neither conform to any expected structure nor fit the model of known purposes and/or relevance are nonetheless scanned, transcribed and searched to determine whether the information sought appears anywhere within the contents.

When performing due diligence, for example, a legal team will have a wide variety of information to comb through, including digital data in a variety of forms as well as scanned image files of reams of paper documents. Leading e-discovery services and software can handle the purely electronic data part of that equation, but stumble when confronted with the contents of imaged documents, particularly those that are unstructured and/or contain handwritten information. Even those solutions and service bureaus that do scan and make available imaged versions of documents are usually very limited in the type and detail of information they can provide from the contents.
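Once a document has been transcribed, searching it for predefined data elements such as social security numbers or dates is ordinary text processing. The Python sketch below shows one way that step might look; the patterns, their US-style formats and the function name are assumptions for illustration and say nothing about how DocumentReader itself is configured.

```python
import re

# Hypothetical patterns for two predefined data elements
# (US-style formats assumed).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def find_elements(transcription):
    """Return (element_name, matched_text, offset) tuples in reading order."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(transcription):
            hits.append((name, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


text = "Signed 4/15/2008 by J. Doe, SSN 123-45-6789."
for name, value, offset in find_elements(text):
    print(name, value, offset)
```

The hard part in practice is the transcription that precedes this step; the search itself is only as reliable as the text it is given.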

DocumentReader bridges that gap when carefully configured, including any specific customization called for by the difficulty of the task. The software can sort each document by type, transcribe handwritten and/or typed text, and index the particular data elements searched for. DocumentReader provides document type identification, which allows groups of documents to be verified as having all necessary elements. When the software is preconfigured to recognize a particular grouping, information is extracted even more confidently because the documents within the group are related, increasing the accuracy of the results.
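Verifying that a group of documents has all necessary elements reduces, after classification, to a set comparison. A minimal Python sketch follows, assuming each document in the package has already been assigned a type; the required types and the function name are hypothetical.

```python
# Hypothetical list of document types a complete package must contain.
REQUIRED_TYPES = {"application", "identity_paper", "tax_form"}


def missing_documents(classified_types, required=REQUIRED_TYPES):
    """Return the required document types absent from a package, sorted."""
    return sorted(required - set(classified_types))


package = ["application", "tax_form", "letter"]
print(missing_documents(package))
```

An empty result means the package is complete; anything else is a list of what still needs to be located.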

When integrated with a firm’s existing search systems, DocumentReader covers the last mile in providing a more complete solution for document-related tasks.

Real World Benefits
With DocumentReader as part of their operations, law firms will find that quicker, more accurate and wider-reaching access to information provides an immediate edge over opposing counsel during discovery. Finding every occurrence faster, with results that are more easily categorized and searched, frees legal teams to use the extra time and resources to construct a stronger case. The ability to scour the imaged contents of paper original documents means legal teams won’t stumble when presented with varied document sources, and can move efficiently through the process.

Massive due diligence review procedures become much more manageable, with the added confidence accurate data extraction affords. Through its ability to classify and confirm the completeness of packages of documents, DocumentReader is well suited to complex tasks such as searching through an organization’s HR or business records, verifying completeness and adherence to all folder requirements. Once digitized, analyzed and extracted, the data is easily imported by the embedding application into larger search systems, giving firms access to all pertinent information. The larger the volume of paper source documents, the greater the savings in time and costs associated with the task.

Redaction becomes a matter of defining the elements to be found and awaiting results. Whether seeking particular elements within a given set of formally structured documents, or searching through images of loosely structured correspondence, invoices, memos and other paper documents, DocumentReader significantly streamlines the process, making it more efficient and manageable. When dealing with large redaction efforts, DocumentReader’s ability to quickly eliminate the large number of documents that do not require any further review saves considerable time and allows legal professionals to focus solely on relevant information.
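The two redaction ideas in this paragraph, masking defined elements and setting aside documents that contain none of them, can be sketched in a few lines of Python. This is an illustration over already-transcribed text, not a description of the product; the pattern, mask string and function names are assumptions.

```python
import re

# Illustrative pattern for one privileged element (US-style SSN assumed).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text, patterns, mask="[REDACTED]"):
    """Replace every match of every pattern; return (new_text, match_count)."""
    total = 0
    for pattern in patterns:
        text, count = pattern.subn(mask, text)
        total += count
    return text, total


def triage(documents, patterns):
    """Split documents into those with redactions and those with no matches."""
    needs_review, cleared = {}, []
    for name, text in documents.items():
        redacted, count = redact(text, patterns)
        if count:
            needs_review[name] = redacted
        else:
            cleared.append(name)
    return needs_review, cleared


docs = {"memo.txt": "Lunch at noon.", "form.txt": "SSN 123-45-6789 on file"}
needs_review, cleared = triage(docs, [SSN])
print(needs_review)
print(cleared)
```

In a real workflow the documents flagged for review would still go to a person; the time saved comes from the pile that never needs to be opened.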

There are any number of other situations in which DocumentReader can assist the firm, even to the point of sorting and categorizing all incoming mail. The power of A2iA’s project management and specific development capabilities lies in its flexibility in meeting client-specific needs. From case-specific operations to back office administration, DocumentReader can make a significant difference in the efficiency and professional effectiveness of a law firm or legal department. Tasks that were traditionally performed manually, with all of the attendant resource costs, chances for errors or omissions, and time taken away from more pressing affairs, easily become part of an electronic workflow.

When integrated with service bureau outsourcing and/or advanced e-discovery and enterprise search systems, DocumentReader provides the missing link in many set-ups. Embedding applications can import data into document management, knowledge management or enterprise search systems, allowing A2iA-extracted information to become part of a structured database, making it searchable and reportable based on complex parameters with the same level of flexibility as purely digital data. End-users of these systems gain the ability to classify, analyze and extract data from a range of scanned documents with the ease and accuracy the legal field has come to expect from purely digital e-discovery methods.

Conclusion
By incorporating DocumentReader into existing discovery, due diligence, redaction or other document management operations, the forward-looking legal team can handle the massive amounts of documentation that even smaller civil cases can present. For larger corporate or governmental needs, such as examining or redacting confidential information from many years’ worth of varied documents and image files of paper originals, the benefits are even more pronounced. Investigatory teams can also benefit from the software, specifying names, phone numbers, addresses or particular keywords to be searched.

Whether used as a standalone solution or integrated into other systems, DocumentReader holds wide applicability throughout many facets, tasks and specialties of the legal profession. Above and beyond the significant cost savings and increased efficiencies of an automated solution, firms and service providers relying on DocumentReader will also benefit from better data accuracy, more thorough extraction and searches, and far more manageable access to data once found.

DocumentReader can bring significant advantages to legal teams across the entire spectrum of the profession, to accomplish any of a myriad of tasks. If your organization is looking for a faster, better, more accurate and efficient way to gather or remove information from large amounts of paper documentation and/or evidence, you owe it to yourself to investigate DocumentReader.

About A2iA
A2iA is the worldwide leading developer of natural handwriting recognition, Intelligent Word Recognition (IWR) and Intelligent Character Recognition (ICR) technologies and products for the payment, mail, document and forms processing markets.
