Wednesday, April 6, 2011

Is predictive coding defensible?

Predictive coding is a family of evidence-based document categorization technologies used to put documents or ESI into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. The evidence from a number of studies, including TREC and the eDiscovery Institute, is quite clear that predictive coding is at least as accurate as having teams of reviewers read every document. Despite these studies, there is still some skepticism about using predictive coding in eDiscovery. I would like to address some of the issues that these skeptics have raised. The two biggest concerns are whether predictive coding is reasonable and whether it is defensible. I cannot pretend to offer a legal analysis of reasonableness and defensibility, but I do have some background information and opinions that may be useful in making the legal arguments.

Some of the resistance to using predictive coding is the fear that the opposing party will object to the methods used to find responsive documents, for example. By my reading, the courts do not usually support objections based on the supposition that something might have been missed; the opposing party has to respond with some particularity (e.g., Ford v. Edgewood Properties). In In re Exxon, the trial court refused to compel a party to present a deponent to testify as to the efforts taken to search for documents. Multiven, Inc. v. Cisco Systems might also be relevant, in that the court ordered the party to use a vendor with better technology rather than having someone read each document. In a recent decision in Datel v. Microsoft, Judge Laporte quoted the Federal Rules of Evidence: “Further, ‘[d]epending on the circumstances, a party that uses advanced analytical software applications and linguistic tools in screening for privilege and work product may be found to have taken “reasonable steps” to prevent inadvertent disclosure.’ Fed. R. Evid. 502 advisory committee’s note.”

The OrcaTec system, at least, draws repeated random samples from the total population of documents so that we can always extrapolate from the latest sample to the population as a whole. Because the sample is random, it is representative of the whole collection, so the effectiveness of the system on the sample is a statistical prediction of how well the tool would do with the whole collection. You can also sample among the documents that have been and have not been selected for production (as Judge Grimm recommends in Creative Pipe).
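
To make the sampling idea concrete, here is a minimal sketch of how a random sample lets you extrapolate to the whole collection. It is an illustration only; the sample size, the reviewer function, and the normal-approximation 95% margin of error are my assumptions, not any vendor's actual method.

    import math
    import random

    def estimate_from_sample(collection, sample_size, is_responsive):
        """Draw a simple random sample, have it reviewed, and extrapolate the
        responsive rate to the whole collection with a 95% margin of error."""
        sample = random.sample(collection, sample_size)
        hits = sum(1 for doc in sample if is_responsive(doc))
        rate = hits / sample_size              # proportion responsive in the sample
        margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
        return rate, margin

    # Hypothetical usage: a 400-document sample from a million-document collection.
    # rate, moe = estimate_from_sample(collection, 400, reviewer_says_responsive)
    # print("Estimated %.1f%% responsive, +/- %.1f%%" % (100 * rate, 100 * moe))

Because the sample is random, the same arithmetic applies whether you sample the whole collection, the set slated for production, or the set left behind.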

In general, predictive coding systems work by finding documents that are similar to those which have already been recognized as responsive (or nonresponsive) by some person of authority. The machines do not make up the decisions; they merely implement them. If there is a black-box argument that applies to predictive coding, it applies even more strongly to hiring teams of human readers. No one knows what is going on in their heads, their judgments change over time, they lose attention after half an hour, and so on.

If you cannot tell the difference between documents picked by human reviewers and documents picked by machines, then why should you pay for the difference?

Predictive coding costs a few cents per document. Human review typically costs a dollar or more per document. Proportionality and FRCP Rule 1 would suggest, I think, that a cheaper alternative with demonstrable quality should be preferred by the courts. The burden should be on the opposing party to prove that something was wrong and that engaging a more expensive process of lower quality (people) is somehow justified.
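
To put hypothetical numbers on the proportionality point: for a one-million-document collection, eyes-on review at a dollar per document runs on the order of $1,000,000, while predictive coding at, say, five cents per document runs roughly $50,000, and that is before you account for the much smaller predicted-responsive set that still gets human review.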

The workflow I suggest is to start with some exploratory data analysis and then set up the predictive coding. In the OrcaTec system, the software selects the documents to learn from; some other systems use slightly different methods. Once the predictive coding is done, you can take the documents that the computer recognizes as responsive and review those. This is the result of your first-pass review. They will be a small subset of the original collection, the size of which depends on the richness of the collection. I would not recommend producing any documents that have not been reviewed by trusted people. But now, you're not wasting time reading documents that will never get produced.

Predictive coding is great for first pass review, but it can also be used as a quality check on human review. Even if you do not want to rely on predictive coding to do your first-pass review, you can exploit it to check up on your reviewers after they are done.
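
As a sketch of that quality-check idea (the function and tag names here are hypothetical, not any particular vendor's API), the comparison itself is simple: line up the machine's category for each reviewed document against the human tag and send the disagreements back for a second look.

    def find_disagreements(documents, machine_label, human_label):
        """Return the documents where the predictive coding system and the
        human reviewer disagree; these are the ones worth re-reviewing."""
        return [doc for doc in documents
                if machine_label(doc) != human_label(doc)]

    # Hypothetical usage:
    # disputed = find_disagreements(reviewed_docs, model_category, reviewer_category)
    # A senior reviewer then resolves only the disputed documents.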

Some attorneys worry about a potential Daubert challenge to predictive coding. I’m not convinced that such a challenge is even pertinent to the discussion, but even if it were, predictive coding would easily stand up to it. Predictive coding is mathematically and statistically based. It is mainstream science that has existed in one form or another since the 18th century. There is no voodoo magic in predictive coding, only mathematics. I think that the facts supporting its accuracy are certainly substantial (peer-reviewed, mainstream science, etc.), and the systems are transparent enough that there should be no (rational) argument about the facts. Many attorneys happily use keyword searching, which has long been known to be rather ineffective. There has never been a Daubert challenge to using keywords to identify responsive documents, and seldom has any measurement been done to justify the use of keyword searches as a reasonable way to limit the scope of documents that must be reviewed. But if the weak method of keyword searching is acceptable, then a more sophisticated and powerful process should also be acceptable.

Another concern that I hear raised frequently is that lawyers would have a hard time explaining predictive coding if challenged. I don’t think that the ideas behind predictive coding are very difficult to explain. Predictive coding works by identifying documents that are similar to those that an authoritative person has identified as responsive (or as a member of another category). Systems differ somewhat in how they compute that similarity, but they share the idea that the content of the document, the words and phrases in it, is the evidence to use when measuring this similarity.
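
Here is a minimal sketch of that similarity idea, using nothing more than word counts and cosine similarity. It illustrates the general approach, not how any particular product computes it; the example documents and labels are made up.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two word-count vectors."""
        dot = sum(a[w] * b[w] for w in a if w in b)
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def predict(text, judged):
        """judged: (text, label) pairs tagged by an authoritative reviewer.
        The new document gets the label of its most similar judged document."""
        vec = Counter(text.lower().split())
        best = max(judged, key=lambda pair: cosine(vec, Counter(pair[0].lower().split())))
        return best[1]

    # Hypothetical usage:
    # judged = [("pricing agreement with acme", "responsive"),
    #           ("office picnic schedule", "nonresponsive")]
    # predict("revised acme pricing terms", judged)   # -> "responsive"

Real systems use richer features and better-calibrated models, but the underlying question is the same: which already-judged documents does this one most resemble?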

A document (or more generally, ESI, electronically stored information) consists of words, phrases, sentences, and paragraphs. These textual units constitute the evidence that a document presents. For simplicity, I will focus on words as the unit, but the same ideas apply to using other text units. For further simplification, we will consider only two categories, responsive and nonresponsive. Again, the same rules apply if we want to include other categories, or if we want to distinguish privileged from nonprivileged, etc.

Based on the words in a document, then, the task is to decide whether this document is more likely to be responsive or more likely to be nonresponsive. This is the task that a reviewer performs when reading a document and it is the task that any predictive coding system performs as well. Both categorize the document as indicated by its content.
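
One simple way to formalize "more likely responsive than nonresponsive, given its words" is to give each word a weight reflecting how much more often it appears in responsive than in nonresponsive documents and then add up the evidence. The weights below are made up for illustration; in practice they would be estimated from the documents the authoritative reviewer has already tagged.

    # Hypothetical per-word weights: positive favors responsive, negative favors
    # nonresponsive. Real weights would come from the reviewer-tagged training documents.
    WORD_WEIGHTS = {"contract": 1.2, "invoice": 0.8, "lunch": -1.5, "birthday": -2.0}

    def more_likely_responsive(text, prior=0.0):
        """Sum the word evidence; a positive total means responsive is the more likely category."""
        score = prior + sum(WORD_WEIGHTS.get(w, 0.0) for w in text.lower().split())
        return score > 0

    # more_likely_responsive("contract invoice attached")   # -> True
    # more_likely_responsive("birthday lunch on friday")    # -> False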

When we rely exclusively on human reviewers, we have no transparency into how they make their decisions. We can provide them with instructions that are as detailed as we like, but we do not have direct access to their thought processes. We may ask them, after the fact, why they decided that a particular document was responsive, but their answer is almost always a reconstruction of what they “must have thought” rather than a true explanation. Keyword searching, on the other hand, is very transparent (we can point to the presence of specific keywords), but the presence of a specific keyword does not automatically make a document responsive. One recent matter I worked on used keyword and custodian culling, and still only 6% of the selected documents ended up being tagged responsive.

It seems to me that saying, “these documents were chosen as responsive because they resembled documents that were judged by our expert to be responsive,” is a pretty straightforward explanation of predictive coding, whatever technology is used to perform it. Couple that with explicit measurement of the system’s performance, and I think that you have a good case for a transparent process.
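
The usual yardsticks for that measurement are precision (of the documents the system called responsive, how many really were) and recall (of all the truly responsive documents, how many the system found). Both can be estimated from a reviewed random sample; the counts in the sketch below are hypothetical.

    def precision_recall(true_pos, false_pos, false_neg):
        """Compute precision and recall from reviewed sample counts."""
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        return precision, recall

    # Hypothetical sample: 180 correctly flagged, 20 wrongly flagged, 30 missed.
    # precision_recall(180, 20, 30)   # -> (0.90, ~0.86)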

Predictive coding would appear to be an efficient, effective, and defensible process for winnowing down your document collection for first-pass review, and beyond.
