Information Discovery: Everything new is old again

To be truthful, I have been quite surprised at all of the attention that predictive coding has been receiving lately, from the usual legal blogs to the New York Times to Forbes Magazine. It’s not a particularly new technology. Its mathematical roots go back to 1763, when Thomas Bayes’s famous theorem was first published. It has been used in document decision making since about 1961. But when I tried to convince people in 2002 and 2003 that it was a useful tool, my arguments fell on deaf ears. Lawyers just were not interested. It never went anywhere. Times have certainly changed.

Concept search took about six years to get into the legal mainstream. Predictive coding, by whatever name, seems to have taken about 18 months. I’m told by some of my lawyer colleagues that it has now become a necessary part of many statements of work.

The first paper that I know of concerning what we would today call predictive coding was published by Maron in 1961. He called it “automatic indexing”: “The term ‘automatic indexing’ denotes the problem of deciding in a mechanical way to which category (subject or field of knowledge) a given document belongs. It concerns the problem of deciding automatically what a given document is ‘about’.” He recognized that “To correctly classify an object or event is a mark of intelligence,” and found that, even in 1961, computers could be quite accurate at assigning documents to appropriate categories.

Predictive coding is a family of evidence-based document categorization technologies. The evidence is the word or words in the documents. Predictive coding systems do not replace human judgment, but rather augment it. They do not decide autonomously what is relevant, but take judgments about responsiveness from a relatively small set of documents (or other sources) and extend these judgments to additional documents. In the OrcaTec system, a user trains the system by reviewing a sample of documents. The computer observes each document as it is reviewed, along with the decision the reviewer assigns to it. As the computer gains experience, it also predicts the appropriate tag for each document, which makes the reviewer more consistent and brings the computer’s predictions closer to the decision rules used by the reviewer.

Although there are a number of computational algorithms that are used to compute these evidence-based categorizers, deep inside they all address the problem of finding the highest probability category or categories for a document, given its content.
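To make "the highest probability category for a document, given its content" concrete, here is a minimal sketch of one of the simplest members of this family, a naive Bayes categorizer. It is an illustration only, not the algorithm used by OrcaTec or any other product. The function names, the two tags, and the tiny set of reviewed documents are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    doc_counts = Counter()              # category -> number of documents
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def predict(text, model):
    """Return the category with the highest probability given the words."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        # Start from the log prior: how common the category is overall.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[category][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Hypothetical reviewer decisions on a small training sample.
reviewed = [
    ("merger pricing agreement draft", "responsive"),
    ("pricing terms for the merger", "responsive"),
    ("lunch menu for friday", "not responsive"),
    ("friday office party lunch", "not responsive"),
]
model = train(reviewed)
print(predict("draft merger pricing", model))  # responsive
print(predict("party lunch friday", model))    # not responsive
```

The reviewer's decisions on a few documents are the only source of judgment here. The code simply extends those decisions to new documents by asking which tag makes the observed words most probable. Real systems differ in how they weigh the evidence, but the question they answer is the same.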

The interest in predictive coding stems, I think, from two factors. The first is that the volume of documents that must be considered continues to grow exponentially, but the resources to deal with them do not. The cost of review in eDiscovery frequently overwhelms the amount at risk. There is widespread recognition that something has to be done. The second is the emergence of studies and investigations that examine both the quality of human review and the comparative ability of computational systems. For a number of years, TREC, the Text Retrieval Conference, has been investigating information retrieval techniques. For the last several years, they have applied the same methodology to document categorization in eDiscovery. The Electronic Discovery Institute conducted a similar study (I was the lead author). The evidence available from these studies is consistent in showing that people are only moderately accurate in categorizing documents and that computers can be at least as accurate, and often more accurate. The general idea is that if computer systems can at least replicate the quality of human review at a much lower cost, then it would be reasonable to use these systems to do first-pass review.

In my next post, I will discuss the skepticism that some lawyers have expressed and offer some suggestions for resolving that skepticism.

Monday, April 4, 2011
