Wednesday, April 6, 2011

Is predictive coding defensible?

Predictive coding is a family of evidence-based document categorization technologies that are used to put documents or ESI into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. The evidence is quite clear from a number of studies, including TREC and the eDiscovery Institute, that predictive coding is at least as accurate as having teams of reviewers read every document. Despite these studies, there is still some skepticism about using predictive coding in eDiscovery. I would like to address some of the issues that these skeptics have raised. The two biggest concerns are whether predictive coding is reasonable and whether it is defensible. I cannot pretend to offer a legal analysis of reasonableness and defensibility, but I do have some background information and opinions that may be useful in making the legal arguments.

Some of the resistance to using predictive coding is the fear that the opposing party will object to the methods used to find responsive documents, for example. By my reading, the courts do not usually support objections based on the supposition that something might have been missed. The opposing party has to respond with some particularity (e.g., Ford v Edgewood Properties). In In re Exxon, the trial court refused to compel a party to present a deponent to testify as to the efforts taken to search for documents. Multiven, Inc. v. Cisco Systems might also be relevant in that the court ordered the party to use a vendor with better technology, rather than having someone read each document. In a recent decision in Datel v. Microsoft, Judge Laporte quoted the Federal Rules of Evidence: “Further, ‘[d]epending on the circumstances, a party that uses advanced analytical software applications and linguistic tools in screening for privilege and work product may be found to have taken “reasonable steps” to prevent inadvertent disclosure.’ Fed. R. Evid. 502 advisory committee’s note.”

The OrcaTec system, at least, draws repeated random samples from the total population of documents so that we can always extrapolate from the latest sample to the population as a whole. Because the sample is random, it is representative of the whole collection, so the effectiveness of the system on the sample is a statistical prediction of how well the tool would do with the whole collection. You can also sample among the documents that have been and have not been selected for production (as Judge Grimm recommends in Creative Pipe).
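As a sketch of how this kind of sampling-based validation might work, the code below draws a simple random sample from a collection and extrapolates the responsive rate to the whole population with a 95% confidence interval. The collection, sample size, and responsiveness rate are invented for illustration; this is not OrcaTec's actual implementation.

```python
import math
import random

def sample_and_estimate(population, sample_size, seed=42):
    """Draw a simple random sample and extrapolate the responsive rate
    to the whole collection, with a 95% confidence interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    hits = sum(1 for doc in sample if doc["responsive"])
    p = hits / sample_size
    # Normal-approximation 95% margin of error for a proportion
    margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical collection of 10,000 documents, about 10% responsive
population = [{"responsive": i % 10 == 0} for i in range(10_000)]
estimate, low, high = sample_and_estimate(population, 400)
```

Because the sample is random, the interval around the estimate tells you how far the true population rate is likely to be from what you measured, which is exactly the kind of explicit statement a court can evaluate.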

In general, predictive coding systems work by finding documents that are similar to those which have already been recognized as responsive (or nonresponsive) by some person of authority. The machines do not make up the decisions, they merely implement them. If there is a black-box argument that applies to predictive coding, it applies even more to hiring teams of human readers. No one knows what is going on in their heads, their judgments change over time, they lose attention after half an hour, etc.

If you cannot tell the difference between documents picked by human reviewers and documents picked by machines, then why should you pay for the difference?

Predictive coding costs a few cents per document. Human review typically costs a dollar or more per document. Proportionality and FRCP Rule 1 would suggest, I think, that a cheaper alternative with demonstrable quality should be preferred by the courts. The burden should be on the opposing party to prove that something was wrong and that engaging a more expensive process of lower quality (people) is somehow justified.

The workflow I suggest is to start with some exploratory data analysis, then set up the predictive coding. In the OrcaTec system, the system selects documents to learn from. Some other systems use slightly different methods. Once the predictive coding is done, you can take the documents that the computer recognizes as responsive and review those. This is the result of your first-pass review. They will be a small subset of the original collection, the size of which depends on the richness of the collection. I would not recommend producing any documents that have not been reviewed by trusted people. But now, you're not wasting time reading documents that will never get produced.

Predictive coding is great for first pass review, but it can also be used as a quality check on human review. Even if you do not want to rely on predictive coding to do your first-pass review, you can exploit it to check up on your reviewers after they are done.

Some attorneys worry about a potential Daubert challenge to predictive coding. I’m not convinced that such a challenge is even pertinent to the discussion, but even if it were, predictive coding would easily stand up to it. Predictive coding is mathematically and statistically based. It is mainstream science that has existed in one form or another since the 18th century. There is no voodoo magic in predictive coding, only mathematics. I think that the facts supporting its accuracy are certainly substantial (peer-reviewed, mainstream science, etc.) and the systems are transparent enough that there should be no (rational) argument about the facts. Many attorneys happily use keyword searching, which has long been known to be rather ineffective. There has never been a Daubert challenge to using keywords to identify responsive documents. Seldom has there been any measurement done to justify the use of keyword searches as a reasonable way to limit the scope of documents that must be reviewed. But if the weak method of keyword searching is acceptable, then a more sophisticated and powerful process should also be acceptable.

Another concern that I hear raised frequently is that lawyers would have a hard time explaining predictive coding, if challenged. I don’t think that the ideas behind predictive coding are very difficult to explain. Predictive coding works by identifying documents that are similar to those that an authoritative person has identified as responsive (or as a member of another category). Systems differ somewhat in how they compute that similarity, but they share the idea that the content of the document, the words and phrases in it, are the evidence to use when measuring this similarity.
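A toy sketch may make the similarity idea concrete. The code below labels a new document with the category of its most similar seed document, using word-count vectors and cosine similarity. It is illustrative only: the seed documents and labels are invented, and real predictive coding systems use far richer representations and similarity measures.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors (0 to 1)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(doc, seeds):
    """Label a document with the category of its most similar seed."""
    vec = vectorize(doc)
    best = max(seeds, key=lambda s: cosine(vec, vectorize(s["text"])))
    return best["label"]

# Hypothetical seed documents already judged by an authoritative reviewer
seeds = [
    {"text": "merger agreement purchase price negotiation", "label": "responsive"},
    {"text": "lunch menu birthday party invitation", "label": "nonresponsive"},
]
label = classify("draft of the merger agreement with revised purchase price", seeds)
# label == "responsive"
```

The point of the sketch is that the "evidence" is nothing mysterious: the words the documents share determine the similarity, and the authoritative reviewer's judgments determine the categories.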

A document (or more generally, ESI, electronically stored information) consists of words, phrases, sentences, and paragraphs. These textual units constitute the evidence that a document presents. For simplicity, I will focus on words as the unit, but the same ideas apply to using other text units. For further simplification, we will consider only two categories, responsive and nonresponsive. Again, the same rules apply if we want to include other categories, or if we want to distinguish privileged from nonprivileged, etc.

Based on the words in a document, then, the task is to decide whether this document is more likely to be responsive or more likely to be nonresponsive. This is the task that a reviewer performs when reading a document and it is the task that any predictive coding system performs as well. Both categorize the document as indicated by its content.

When we rely exclusively on human reviewers we have no transparency into how they make their decisions. We can provide them with instructions that are as detailed as we may like, but we do not have direct access to their thought processes. We may ask them, after the fact, why they decided that a particular document was responsive, but their answer is almost always a reconstruction of what they “must have thought” rather than a true explanation. Keyword searching, on the other hand, is very transparent—we can point to the presence of specific keywords—but the presence of a specific keyword does not necessarily mean that a document is automatically responsive. One recent matter I worked on used keyword and custodian culling and still only 6% of the selected documents ended up being tagged responsive.

It seems to me that saying, “these documents were chosen as responsive because they resembled documents that were judged by our expert to be responsive,” is a pretty straightforward explanation of predictive coding, whatever technology is used to perform it. Couple that with explicit measurement of the system’s performance, and I think that you have a good case for a transparent process.

Predictive coding would appear to be an efficient, effective, and defensible process for winnowing down your document collection for first-pass review, and beyond.

Monday, April 4, 2011

Everything new is old again

To be truthful, I have been quite surprised at all of the attention that predictive coding has been receiving lately, from the usual legal blogs to the New York Times to Forbes Magazine. It’s not a particularly new technology. It’s actually been around since 1763, when Thomas Bayes first proposed his famous theorem. It’s been used in document decision making since about 1961. But when I tried to convince people that it was a useful tool in 2002 and 2003, my arguments fell on deaf ears. Lawyers just were not interested. It never went anywhere. Times have certainly changed.

Concept search took about six years to get into the legal mainstream. Predictive coding, by whatever name, seems to have taken about 18 months. I’m told by some of my lawyer colleagues that it has now become a necessary part of many statements of work.

The first paper that I know of concerning what we would today call predictive coding is by Maron, and published in 1961. He called it “automatic indexing.” “The term ‘automatic indexing’ denotes the problem of deciding in a mechanical way to which category (subject or field of knowledge) a given document belongs. It concerns the problem of deciding automatically what a given document is ‘about’.” He recognized that “To correctly classify an object or event is a mark of intelligence,” and found that, even in 1961, computers could be quite accurate at assigning documents to appropriate categories.

Predictive coding is a family of evidence-based document categorization technologies. The evidence is the word or words in the documents. Predictive coding systems do not replace human judgment, but rather augment it. They do not decide autonomously what is relevant, but take judgments about responsiveness from a relatively small set of documents (or other sources) and extend these judgments to additional documents. In the OrcaTec system, a user trains the system by reviewing a sample of documents. The computer observes the documents as they are reviewed and the decisions the reviewer assigns to them. At the same time, as the computer gains experience, it predicts the appropriate tag to be applied to each document, making the reviewer more consistent while bringing the computer closer to the decision rules used by the reviewer.

Although there are a number of computational algorithms that are used to compute these evidence-based categorizers, deep inside they all address the problem of finding the highest probability category or categories for a document, given its content.
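In the spirit of Maron's 1961 automatic indexing, a minimal naive Bayes categorizer illustrates the "highest probability category given the content" idea. The training documents and labels below are invented for illustration, and production systems are considerably more sophisticated than this sketch.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy Bayesian categorizer: picks the category with the highest
    posterior probability given the document's words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> training doc count

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab = len(set().union(*self.word_counts.values()))
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # log prior + sum of log likelihoods, with Laplace smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            total = sum(self.word_counts[label].values())
            for w in words:
                score += math.log((self.word_counts[label][w] + 1) / (total + vocab))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes()
nb.train("quarterly earnings forecast revenue", "responsive")
nb.train("fantasy football league standings", "nonresponsive")
predicted = nb.classify("revised revenue forecast attached")
# predicted == "responsive"
```

Whatever the particular algorithm, the structure is the same: the words are the evidence, the reviewer's judgments supply the categories, and the system assigns each new document to the category its evidence makes most probable.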

The interest in predictive coding stems, I think, from two factors. First, the volume of documents that must be considered continues to grow exponentially, but the resources to deal with them do not. The cost of review in eDiscovery frequently overwhelms the amount at risk. There is widespread recognition that something has to be done. The second is the emergence of studies and investigations that examine both the quality of human review and the comparative ability of computational systems. For a number of years, TREC, the Text Retrieval Conference, has been investigating information retrieval techniques. For the last several years, they have applied the same methodology to document categorization in eDiscovery. The Electronic Discovery Institute conducted a similar study (I was the lead author). The evidence available from these studies is consistent in showing that people are only moderately accurate in categorizing documents and that computers can be at least as accurate, often more accurate. The general idea is that if computer systems can at least replicate the quality of human review at a much lower cost, then it would be reasonable to use these systems to do first-pass review.

In my next post, I will discuss the skepticism that some lawyers have expressed and offer some suggestions for resolving that skepticism.