Monday, August 8, 2011

Optimal document decisioning


There seems to be an emerging workflow in eDiscovery where predictive coding and highly professional reviewers are being used in place of large ad hoc groups of temporary attorneys. There is recognition that without high levels of training and good quality-control methods, human review tends to be only moderately accurate. Selecting or training effective reviewers requires an understanding of what makes a reviewer successful and of how to measure that success. We can look to optimal decision theory, and particularly to the branch of optimal decision theory called detection theory to provide insight into training and assessing reviewers.

Work on optimal decision theory began during World War II, with the problem of characterizing, for example, how sensitively radar could detect objects at a distance. It was later applied to human decision making as well, in work published after the war. This type of optimal decision theory is often called detection theory.

Detection theory concerns the question: Based on the available evidence, should we categorize an event as a member of set 1 or as a member of set 2? In radar, the evidence is in the signal reflected from an object, and the sets are whether the reflection is from a plane or from, say, a cloud. In document decisioning, the evidence consists of the words in the document and the sets are, for example, responsive and nonresponsive.

In order to isolate the essence of decisioning, we can simplify the situation further. For the moment, let’s think about a decision where all we have to do is decide whether a tone was played at a particular time or not—a kind of hearing test. Those events when the tone is present are analogous to a document being responsive and those events when the tone is absent are analogous to a document being nonresponsive.

Let’s put on a pair of headphones and listen for the tone. When the tone is present, it is played very softly, so there may be some uncertainty about whether it was there at all. How do we decide whether we hear a tone or not?

At first, it may seem that detecting the tone is not a matter of making a decision at all: the tone is either there or it is not. But one of the insights of detection theory is that detection does require a decision, and that decision is affected by more than just how loud the tone is.

In detection theory, two kinds of factors influence our decisions. The first is the sensitivity of the listener—how well can the listener distinguish between tone and nontone events? The second is bias—how willing is the listener to say that the tone was present?

In our hearing test, we present a series of events or trials. The listener has to decide on each of those events whether she is hearing the tone. Detection theory describes how to combine the level of evidence (e.g., intensity of the tone) and these other factors to come up with the best decision possible.

Some listeners have more sensitive hearing than others. The more sensitive a person is, the softer the tone can be played and still be heard. Some reviewers are more sensitive than others. They can tell whether a document is responsive based on more subtle cues than other reviewers.

Bias concerns the willingness or tendency of the listener to identify an event as a tone event. This willingness can be influenced by a number of factors, including the probability that a given event contains a tone and the consequences of each type of decision. Put simply, if tone events are very rare, then people will be less likely to say that a tone occurred when they are uncertain. If tone events are more common, they will be more likely to say that a tone occurred when they are uncertain. Likewise, reviewers are more likely to categorize a document as responsive if the collection contains more responsive documents.

Similarly, if a person gets paid a dollar for correctly hearing a tone and gets charged 50 cents for an error, then that person will be more likely to say that he or she heard the tone. If we reverse the payment plan so that correctly hearing a tone yields 50 cents, but errors cost a dollar, then that person will be reluctant to say that he or she hears the tone. In the face of uncertainty, the optimal decision depends on the evidence available and the consequences of each type of decision.
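
Detection theory makes this precise with a standard textbook result: to maximize expected payoff, the listener should say “tone” whenever the evidence favors the tone strongly enough, and “strongly enough” depends on the base rate of tones and on the four payoffs. The sketch below is a generic illustration of that result using the hypothetical payments above; the function name is mine, and it assumes a correct rejection pays nothing because the example does not say.

```python
# A minimal sketch of the textbook detection-theory criterion (illustrative only).
# Say "tone" whenever p(evidence | tone) / p(evidence | no tone) >= beta.
def optimal_beta(p_tone, v_hit, c_false_alarm, c_miss, v_correct_rejection=0.0):
    p_no_tone = 1.0 - p_tone
    return (p_no_tone / p_tone) * \
           (v_correct_rejection + c_false_alarm) / (v_hit + c_miss)

# Tone on half the trials, $1 for a hit, 50 cents for any error:
print(optimal_beta(0.5, v_hit=1.00, c_false_alarm=0.50, c_miss=0.50))  # ~0.33, a liberal criterion
# Reverse the payments: 50 cents for a hit, $1 for any error:
print(optimal_beta(0.5, v_hit=0.50, c_false_alarm=1.00, c_miss=1.00))  # ~0.67, noticeably stricter
# Make tones rare (10% of trials) and the criterion rises further:
print(optimal_beta(0.1, v_hit=1.00, c_false_alarm=0.50, c_miss=0.50))  # ~3.0, very conservative
```

The same arithmetic applies to review: rare responsive documents and heavy penalties for over-designation both push the optimal criterion upward.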

The point of this is that you can change the proportion of events that are said to contain the tone not only by making the tone louder or softer, but also by changing the consequences of decisions and the likelihood that the tone is actually present.

Bringing this back to document decisioning, the words in a document constitute the evidence that a document is responsive or not. In the face of uncertainty, decision makers will decide whether a document is responsive based on the degree to which the evidence is consistent with the document being responsive, on their sensitivity to this evidence, on the proportion of responsive documents in a collection, and on the consequences of making each kind of decision. All of these factors play a role in document decisioning.

In the paper by Roitblat, Kershaw, and Oot (2010, JASIST), for example, two teams of reviewers re-examined a sample of documents that had been reviewed by the original Verizon team. In this re-review, Team A identified 24.2% of the documents in their sample as responsive and Team B identified 28.76% as responsive. Although Team B identified significantly more documents as responsive, when the sensitivity of these two teams was measured in the way suggested by detection theory, the two teams did not differ significantly from one another in sensitivity. They did differ in their bias, however, to call an uncertain document responsive. Team B was simply more willing than Team A to categorize documents as responsive without being any better at distinguishing responsive from nonresponsive documents.
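
For readers curious how that separation is done in practice, the standard equal-variance Gaussian model of detection theory computes sensitivity (d′) and bias (the criterion, c) from a reviewer’s hit rate and false-alarm rate against the authoritative judgments. The sketch below is generic; the rates are invented for illustration and are not the figures from the 2010 study.

```python
# A minimal sketch of the standard detection-theory measures (illustrative rates only).
from statistics import NormalDist

def sensitivity_and_bias(hit_rate, false_alarm_rate):
    z = NormalDist().inv_cdf                        # inverse standard-normal CDF
    d_prime = z(hit_rate) - z(false_alarm_rate)     # sensitivity: ability to discriminate
    c = -0.5 * (z(hit_rate) + z(false_alarm_rate))  # bias: 0 is neutral, negative is liberal
    return d_prime, c

# Two hypothetical teams: nearly identical d', very different bias.
print(sensitivity_and_bias(0.70, 0.30))   # d' ~ 1.05, c ~  0.00  (neutral)
print(sensitivity_and_bias(0.80, 0.42))   # d' ~ 1.04, c ~ -0.32  (liberal: more "responsive" calls)
```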

The most useful insight to be derived from an optimal decision theory approach to document decisioning is the separability of sensitivity and bias. Reviewers can differ in how sensitive they are to identifying responsive documents and they can be guided to be more or less biased toward accepting documents as responsive when uncertain.

Presumably, sensitivity will be affected by education. The more that reviewers know about the factors that govern whether a document is responsive, the better they will be at distinguishing responsive from nonresponsive. Their bias can be changed simply by asking them to be more or less fussy. The optimal review needs not only to be maximally sensitive to the difference between responsive and nonresponsive documents, but also to adopt the level of bias that is appropriate to the matter at hand.

When assessing reviewers, optimal decision theory suggests that you separate out the sensitivity from the bias. The quality of a reviewer is represented by his or her sensitivity, not by bias. If all you measure, for example, is the proportion of responsive documents found by a candidate reviewer (where responsive is defined by someone authoritative), then you could easily miss highly competent reviewers because they have a different level of bias from the authoritative reviewer. Equally likely, you could select a candidate who finds many responsive documents just because he or she is biased to call more documents responsive. Although reviewer sensitivity may be difficult to change, bias is very easy to change. You have only to ask the person to be more or less generous. Unless you measure both bias and sensitivity, you won’t be able to make sound judgments about the quality of reviewers, whether those reviewers are machines or people.

Note: Traditional information retrieval science uses precision and recall to measure performance. These two measures recognize that there is a tradeoff between precision and recall. You can increase precision by focusing the retrieval more narrowly, but this usually results in a decrease in recall. You can get the highest recall by retrieving all documents, but then you would have very low precision. Precision and recall measures are affected by both bias and sensitivity, but they do not provide any means to separate one from the other. Sensitivity and bias have been used in information retrieval studies, but not as commonly as precision and recall.
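
To see the note’s point numerically, here is a rough, hypothetical illustration: two reviewers with essentially the same detection-theory sensitivity can produce quite different precision and recall figures simply because one is more willing to tag documents responsive. The confusion-matrix counts below are invented for the example.

```python
# Hypothetical counts showing how bias moves precision and recall even when
# sensitivity (d') stays roughly constant.
from statistics import NormalDist

def precision_recall_dprime(tp, fp, fn, tn):
    z = NormalDist().inv_cdf
    recall = tp / (tp + fn)                 # same as the hit rate
    precision = tp / (tp + fp)
    d_prime = z(recall) - z(fp / (fp + tn))
    return precision, recall, d_prime

# A conservative reviewer and a liberal reviewer of roughly equal skill:
print(precision_recall_dprime(tp=140, fp=60,  fn=60, tn=740))   # P ~ 0.70, R ~ 0.70, d' ~ 1.96
print(precision_recall_dprime(tp=170, fp=141, fn=30, tn=659))   # P ~ 0.55, R ~ 0.85, d' ~ 1.97
```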

Sunday, June 12, 2011

Competitor’s press release about predictive coding patent stretches the truth

[updated June 14, 2011]
[updated June 22, 2011]
One of OrcaTec's competitors, Recommind, has recently been awarded a patent related to predictive coding. In a press release dated June 8, 2011, announcing this award, they make some very grandiose claims with no basis in fact. According to their press release, they actually claim to have patented predictive coding. This claim is a gross exaggeration and unsupported by the details of the patent (No. 7,933,859) or the history of predictive coding. Patents are intended to protect inventions and there is no evidence that Recommind invented predictive coding.

Having examined the patent carefully, I can say that it covers only a very narrow computational approach to predictive coding and is unlikely to have any impact on the ability of any other eDiscovery service provider to continue to offer this game-changing capability. Primarily, it covers the combination of three things: Probabilistic Latent Semantic Analysis (a key part of Recommind's core product), Support Vector Machines (a statistical learning tool), and user feedback.

The scope of a patent is determined by its claims, not by the title of a press release. A valid patent requires that the proposed invention be (among other things) novel and non-obvious. In contrast, what we now call predictive coding has a very long history. I have written about some of this history previously.

Further, the use of predictive coding in eDiscovery predates Recommind’s patent application by many years. At the time that they filed their patent, predictive coding was in wide use, so it could not be considered novel or non-obvious as the Patent Office defines these terms, though some specific methods for implementing it may meet these requirements.

Predictive coding is a family of evidence-based document categorization technologies that are used to put documents or electronically stored information (ESI) into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. The underlying idea of using evidence to categorize objects has been around since the 18th Century. The notion of applying similar ideas to document classification or categorization was described in 1961 by M.E. Maron.

In his paper, Maron noted:

Loosely speaking, it appears that there are two parts to the problem of classifying. The first part concerns the selection of certain relevant aspects of an item as pieces of evidence. The second part of the problem concerns the use of these pieces of evidence to predict the proper category to which the item in question belongs. (p. 404).

The general process of updating a system’s classification rules as a result of user feedback is called “relevance feedback.” It has been in use since at least 1971 (Rocchio, 1971). For example, Lewis and Gale (1994) used relevance feedback and “uncertainty sampling” to predictively categorize news stories. Uncertainty sampling is a method of selecting the specific documents the user should categorize next in order to improve the quality of the predictive categorizer; the documents shown to the user for classification are those that would maximally reduce the uncertainty of the classifier.
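
As a rough illustration of the uncertainty-sampling idea (my own sketch, not Lewis and Gale's method verbatim), the selection step can be as simple as routing to the reviewer the documents whose current predicted probability of responsiveness sits closest to 0.5:

```python
import numpy as np

def pick_most_uncertain(predicted_probabilities, k=10):
    """Return indices of the k unlabeled documents the classifier is least sure about."""
    distance_from_coin_flip = np.abs(np.asarray(predicted_probabilities) - 0.5)
    return np.argsort(distance_from_coin_flip)[:k]

# After the reviewer labels these documents, the classifier is retrained and the
# cycle repeats: relevance feedback driven by the model's own uncertainty.
```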

In 2002, Paul Graham described a Bayesian spam filter in his essay “A Plan for Spam”; that approach, popularized in tools such as SpamBayes, uses techniques similar to those described by Maron to distinguish SPAM from nonSPAM (or HAM) emails. An initial sample of categorized SPAM and HAM emails is analyzed by the program to learn how specific words provide evidence for one category or the other. Subsequent emails are then classified according to these implicit rules and the evidence, which consists of the words in the emails. If the system misclassifies an email as either SPAM or HAM, the user can flag these errors and the classifier will update itself to reflect the reclassified emails. This seems to me to be a very clear example of predictive coding, which again argues that predictive coding, per se, does not meet the novelty or non-obviousness criteria.
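
For the curious, the core of such a word-evidence classifier fits in a few lines. The sketch below is a generic naive Bayes filter in the spirit of Maron’s proposal and Graham’s essay; it is not the actual code of SpamBayes or of any commercial product, and the function names are mine.

```python
import math
from collections import Counter

def train(labeled_emails):
    """labeled_emails: iterable of (set_of_words, 'spam' or 'ham') pairs."""
    word_counts = {'spam': Counter(), 'ham': Counter()}
    email_counts = Counter()
    for words, label in labeled_emails:
        word_counts[label].update(words)
        email_counts[label] += 1
    return word_counts, email_counts

def classify(words, word_counts, email_counts):
    vocabulary = len(set(word_counts['spam']) | set(word_counts['ham']))
    scores = {}
    for label in ('spam', 'ham'):
        # log prior plus summed log likelihoods, with add-one smoothing
        score = math.log(email_counts[label] / sum(email_counts.values()))
        total_words = sum(word_counts[label].values())
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total_words + vocabulary))
        scores[label] = score
    return max(scores, key=scores.get)

# When the user corrects a misclassified message, its words are simply added to
# the counts for the right category, and subsequent predictions reflect the change.
```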

eDiscovery service providers have been doing predictive coding (sometimes called by other names) for many years. In January of 2010, before the Recommind patent was filed, the eDiscovery Institute published a paper (by Roitblat, Kershaw and Oot) in the Journal of the American Society for Information Science and Technology describing two eDiscovery service providers and the accuracy of their predictive coding tools. These tools, obviously, were in existence prior to Recommind’s filing. Other service providers have been providing similar services even before that. So Recommind can hardly be considered to have invented predictive coding, yet none of this prior art was actually included in Recommind’s patent application.

One pending patent, however, was included in their application as evidence of prior art in this field. This pending application, assigned to Bank of America, is called “Predictive Coding of Documents in an Electronic Discovery System.” Therefore, the patent application itself recognizes that Recommind cannot be the inventor of predictive coding.

They also included in their list of prior art an article attributed to “Thorsten” (Thorsten is actually the author’s first name; his last name is Joachims) describing a statistical machine-learning technique called Support Vector Machines (SVM), which is used in their claimed invention. The same paper also describes relevance feedback.

Given all of this prior art, it is very clear that Recommind is in no position to claim to have either invented or to “own” predictive coding. Rather, their patent covers a very specific, very narrow approach to predictive coding involving the use of two very specific statistical procedures and relevance feedback. I will leave it to attorneys to determine whether even this circumscribed application constitutes a valid patent.

Still, even if their patent is valid, it leaves plenty of room for other approaches to predictive coding, including the approach used by OrcaTec. Nothing in the Recommind patent would preclude OrcaTec or any other service provider from offering predictive coding services in eDiscovery or any other area. OrcaTec does not use the statistical procedures described in their patent. We believe that OrcaCategorize is an easier-to-use and more effective product, which can help attorneys achieve cost savings significantly beyond those claimed by Recommind.

Grandiose claims like those in the Recommind press release indicate either a profound lack of understanding of just what is covered by the patent or a deliberate attempt to obfuscate the issues in the industry. Attorneys involved in eDiscovery look to their service providers for open, honest, and effective processes. They are not well served by unnecessary hyperbole.

Bob Tenant has an excellent blog on this topic that I think resolves the issue. I urge you to read it.
---
About the author:

Herbert L. Roitblat, Ph.D. is the CTO and Chief Scientist of OrcaTec. He holds a number of patents in eDiscovery technology and other areas. He is widely considered to be an expert in eDiscovery methodology and technology. He is a member of the Sedona Working Group on Electronic Document Retention and Production, on the Advisory Board of the Georgetown Legal Center Advanced eDiscovery Institute, a member of the 2011 program committee for the Georgetown Legal Center Advanced eDiscovery Institute, and the chair of the Electronic Discovery Institute. He is a member of the Board of Governors of the Organization of Legal Professionals. He is a frequent speaker on eDiscovery, particularly concerning search, categorization, predictive coding, and quality assurance.

Wednesday, April 6, 2011

Is predictive coding defensible?

Predictive coding is a family of evidence-based document categorization technologies that are used to put documents or ESI into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. The evidence is quite clear from a number of studies, including TREC and the eDiscovery Institute, that predictive coding is at least as accurate as having teams of reviewers read every document. Despite these studies, there is still some skepticism about using predictive coding in eDiscovery. I would like to address some of the issues that these skeptics have raised. The two biggest concerns are whether predictive coding is reasonable and whether it is defensible. I cannot pretend to offer a legal analysis of reasonableness and defensibility, but I do have some background information and opinions that may be useful to make the legal arguments.

Some of the resistance to using predictive coding is the fear that the opposing party will object to the methods used to find responsive (for example) documents. By my reading, the courts do not usually support objections based on the supposition that something might have been missed. The opposing party has to respond with some particularity (e.g., Ford v Edgewood Properties). In In re Exxon, the trial court refused to compel a party to present a deponent to testify as to the efforts taken to search for documents. Multiven, Inc. v. Cisco Systems might also be relevant in that the court ordered the party to use a vendor with better technology, rather than having someone read each document. In a recent decision in Datel v. Microsoft, Judge Laporte quoted the Federal Rules of Evidence: “Further, ‘[d]epending on the circumstances, a party that uses advanced analytical software applications and linguistic tools in screening for privilege and work product may be found to have taken “reasonable steps” to prevent inadvertent disclosure.’ Fed. R. Evid. 502 advisory committee’s note.”

The OrcaTec system, at least, draws repeated random samples from the total population of documents so that we can always extrapolate from the latest sample to the population as a whole. Because the sample is random, it is representative of the whole collection, so the effectiveness of the system on the sample is a statistical prediction of how well the tool would do with the whole collection. You can also sample among the documents that have been and have not been selected for production (as Judge Grimm recommends in Creative Pipe).
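
As a sketch of the arithmetic involved (generic sampling statistics, not OrcaTec’s actual implementation), a simple random sample lets you estimate a rate for the whole collection with a quantifiable margin of error, regardless of how large the collection is. The function below is hypothetical and purely illustrative.

```python
import math
import random

def estimate_rate(collection, is_responsive, sample_size=400, z=1.96):
    """Estimate the proportion of responsive documents, with a ~95% margin of error."""
    sample = random.sample(collection, sample_size)
    p = sum(is_responsive(doc) for doc in sample) / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, margin

# With 400 sampled documents the margin of error is at worst about +/- 5
# percentage points, whether the collection holds 10,000 documents or 10 million.
```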

In general, predictive coding systems work by finding documents that are similar to those which have already been recognized as responsive (or nonresponsive) by some person of authority. The machines do not make the decisions; they merely implement them. If there is a black-box argument that applies to predictive coding, it applies even more strongly to hiring teams of human readers: no one knows what is going on in their heads, their judgments change over time, they lose attention after half an hour, and so on.

If you cannot tell the difference between documents picked by human reviewers and documents picked by machines, then why should you pay for the difference?

Predictive coding costs a few cents per document. Human review typically costs a dollar or more per document. Proportionality and FRCP Rule 1 would suggest, I think, that a cheaper alternative with demonstrable quality should be preferred by the courts. The burden should be on the opposing party to prove that something was wrong and that engaging a more expensive process of lower quality (people) is somehow justified.

The workflow I suggest is to start with some exploratory data analysis and then set up the predictive coding. In the OrcaTec system, the software selects the documents to learn from; some other systems use slightly different methods. Once the predictive coding is done, you can take the documents that the computer recognizes as responsive and review those. This is the result of your first-pass review. They will be a small subset of the original collection, the size of which depends on the richness of the collection. I would not recommend producing any documents that have not been reviewed by trusted people. But now, you're not wasting time reading documents that will never get produced.

Predictive coding is great for first pass review, but it can also be used as a quality check on human review. Even if you do not want to rely on predictive coding to do your first-pass review, you can exploit it to check up on your reviewers after they are done.

Some attorneys worry about a potential Daubert challenge to predictive coding. I’m not convinced that such a challenge is even pertinent to the discussion, but even if it were, predictive coding would easily stand up to it. Predictive coding is mathematically and statistically based. It is mainstream science that has been in existence in one form or another since the 18th Century. There is no voodoo magic in predictive coding, only mathematics. I think that the facts supporting its accuracy are certainly substantial (peer reviewed, mainstream science, etc.) and the systems are transparent enough that there should be no (rational) argument about the facts. Many attorneys happily use keyword searching, which has long been known to be rather ineffective. There has never been a Daubert challenge to using keywords to identify responsive documents. Seldom has any measurement been done to justify the use of keyword searches as a reasonable way to limit the scope of documents that must be reviewed. But if the weak method of keyword searching is acceptable, then a more sophisticated and powerful process should also be acceptable.

Another concern that I hear raised frequently is that lawyers would have a hard time explaining predictive coding, if challenged. I don’t think that the ideas behind predictive coding are very difficult to explain. Predictive coding works by identifying documents that are similar to those that an authoritative person has identified as responsive (or as a member of another category). Systems differ somewhat in how they compute that similarity, but they share the idea that the content of the document, the words and phrases in it, are the evidence to use when measuring this similarity.

A document (or more generally, ESI, electronically stored information) consists of words, phrases, sentences, and paragraphs. These textual units constitute the evidence that a document presents. For simplicity, I will focus on words as the unit, but the same ideas apply to using other text units. For further simplification, we will consider only two categories, responsive and nonresponsive. Again, the same rules apply if we want to include other categories, or if we want to distinguish privileged from nonprivileged, etc.

Based on the words in a document, then, the task is to decide whether this document is more likely to be responsive or more likely to be nonresponsive. This is the task that a reviewer performs when reading a document and it is the task that any predictive coding system performs as well. Both categorize the document as indicated by its content.
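
To make that explanation concrete, here is one hedged illustration of the “similar to what the expert tagged” idea: a nearest-centroid classifier over TF-IDF word vectors, built with a standard open-source toolkit. Actual products differ in their algorithms, and the documents and labels below are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid

# Documents the authoritative reviewer has already tagged (placeholders):
reviewed_docs = ["terms of the widget supply contract and pricing",
                 "are we still on for lunch on friday?"]
labels = ["responsive", "nonresponsive"]

# Represent each document by the words it contains, weighted by TF-IDF.
vectorizer = TfidfVectorizer()
X_reviewed = vectorizer.fit_transform(reviewed_docs)

# Each category is summarized by the centroid of its reviewed documents; a new
# document is assigned to the category whose centroid it most resembles.
model = NearestCentroid().fit(X_reviewed, labels)
predictions = model.predict(vectorizer.transform(["widget pricing proposal attached"]))
```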

When we rely exclusively on human reviewers we have no transparency into how they make their decisions. We can provide them with instructions that are as detailed as we may like, but we do not have direct access to their thought processes. We may ask them, after the fact, why they decided that a particular document was responsive, but their answer is almost always a reconstruction of what they “must have thought” rather than a true explanation. Keyword searching, on the other hand, is very transparent—we can point to the presence of specific key words—but, the presence of a specific key word does not necessarily mean that a document is automatically responsive. One recent matter I worked on used keyword and custodian culling and still only 6% of the selected documents ended up being tagged responsive.

It seems to me that saying, “these documents were chosen as responsive because they resembled documents that were judged by our expert to be responsive,” is a pretty straightforward explanation of predictive coding, whatever technology is used to perform it. Couple that with explicit measurement of the system’s performance, and I think that you have a good case for a transparent process.

Predictive coding would appear to be an efficient, effective, and defensible process for winnowing down your document collection for first-pass review, and beyond.

Monday, April 4, 2011

Everything new is old again

To be truthful, I have been quite surprised at all of the attention that predictive coding has been receiving lately, from the usual legal blogs to the New York Times to Forbes Magazine. It’s not a particularly new technology. The underlying idea has been around since 1763, when Thomas Bayes’ famous theorem was first published, and it has been used in document decision making since about 1961. But when I tried to convince people that it was a useful tool in 2002 and 2003, my arguments fell on deaf ears. Lawyers just were not interested. It never went anywhere. Times have certainly changed.

Concept search took about six years to get into the legal mainstream. Predictive coding, by whatever name, seems to have taken about 18 months. I’m told by some of my lawyer colleagues that it has now become a necessary part of many statements of work.

The first paper that I know of concerning what we would today call predictive coding is by Maron, and published in 1961. He called it “automatic indexing.” “The term ‘automatic indexing’ denotes the problem of deciding in a mechanical way to which category (subject or field of knowledge) a given document belongs. It concerns the problem of deciding automatically what a given document is ‘about’.” He recognized that “To correctly classify an object or event is a mark of intelligence,” and found that, even in 1961, computers could be quite accurate at assigning documents to appropriate categories.

Predictive coding is a family of evidence-based document categorization technologies. The evidence is the word or words in the documents. Predictive coding systems do not replace human judgment, but rather augment it. They do not decide autonomously what is relevant; they take judgments about responsiveness from a relatively small set of documents (or other sources) and extend those judgments to additional documents. In the OrcaTec system, a user trains the system by reviewing a sample of documents. The computer watches the documents as they are reviewed and the decisions the reviewer assigns to them. At the same time, as the computer gains experience, it predicts the appropriate tag for each document, making the reviewer more consistent while its predictions come to approximate more closely the decision rules used by the reviewer.

Although there are a number of computational algorithms that are used to compute these evidence-based categorizers, deep inside they all address the problem of finding the highest probability category or categories for a document, given its content.
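
Stated in symbols (a generic statement of that shared problem, not any vendor’s particular formula), the categorizer picks the category with the highest posterior probability given the document, which by Bayes’ rule reduces to comparing, for each category, the prior for that category times the likelihood of the document under it:

```latex
\hat{c} \;=\; \arg\max_{c} \, P(c \mid d)
        \;=\; \arg\max_{c} \, \frac{P(c)\,P(d \mid c)}{P(d)}
        \;=\; \arg\max_{c} \, P(c)\,P(d \mid c)
```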

The interest in predictive coding stems, I think, from two factors. First, the volume of documents that must be considered continues to grow exponentially, but the resources to deal with them do not. The cost of review in eDiscovery frequently overwhelms the amount at risk. There is widespread recognition that something has to be done. The second is the emergence of studies and investigations that examine both the quality of human review and the comparative ability of computational systems. For a number of years, TREC, the Text Retrieval Conference, has been investigating information retrieval techniques. For the last several years, it has applied the same methodology to document categorization in eDiscovery. The Electronic Discovery Institute conducted a similar study (I was the lead author). The evidence available from these studies is consistent in showing that people are only moderately accurate in categorizing documents and that computers can be at least as accurate, often more accurate. The general idea is that if computer systems can at least replicate the quality of human review at a much lower cost, then it would be reasonable to use these systems to do first-pass review.

In my next post, I will discuss the skepticism that some lawyers have expressed and offer some suggestions for resolving that skepticism.

Friday, February 25, 2011

Opening the Black Box of Predictive Coding

Yesterday, we did a very interesting podcast on ESIBytes with Karl Schieneman, Jaime Carbonell, and Vasco Pedro on predictive coding. Jaime and Vasco are well known in the machine learning space, but have not been active in eDiscovery, so it was really interesting to get their perspective on how these advanced technologies could be effective in eDiscovery.

A few points from that conversation deserve emphasis. The first is the consensus among the four of us that machine learning tools are not magic bullets; they must involve humans.

We talked about two ways of using machine learning to organize documents--clustering and categorization. In eDiscovery, categorization is often called predictive coding. Clustering groups together documents that are similar to one another. The computer derives the groupings to be used. Categorization, on the other hand, starts with categories that are specified by people and puts documents into the category that provides the best match.

When using clustering, people have to decide which of the groups of documents are important after the computer organizes them. When using categorization, people have to design the categories before the computer organizes the documents. In neither case do we rely on the computer to make the legal judgments about what is important to the matter. That is a decision that is best made by someone with real-world experience and legal judgment.
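
As a hedged sketch of that difference (illustrative only, using a standard open-source toolkit rather than any vendor’s software, with made-up documents), both approaches can start from the same word vectors; the difference is whether human-chosen labels go in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

docs = ["quarterly revenue forecast attached",
        "are we still on for lunch friday?",
        "widget contract renewal terms",
        "my fantasy football picks this week"]
X = TfidfVectorizer().fit_transform(docs)

# Clustering: no labels go in; the computer proposes groupings, and a person
# then inspects the groups to decide which matter.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Categorization (predictive coding): labels chosen by people go in, and the
# model assigns new documents to the best-matching category.
labels = ["responsive", "nonresponsive", "responsive", "nonresponsive"]
classifier = LogisticRegression().fit(X, labels)
```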

Computers can take the tedium out of implementing human legal judgments, but so far, they are not in a position to make the judgments themselves. These systems do not take the lawyer out of the equation, they simply provide support to reduce the level of effort required to implement the attorney's judgment on ever-growing collections of documents.

That brings me to the second point I want to discuss--machine learning algorithms and their ability to handle large data sets. Jaime Carbonell remarked that most of the machine learning algorithms he was familiar with handled gigabyte-sized problems well, but did not do well on terabyte-sized problems. In general that's true, but he did note that search algorithms are a major exception: Google and others have shown that searching the World Wide Web and its billions of documents is entirely feasible.

Not all machine learning algorithms are subject to the kinds of scaling constraints that Jaime mentioned. For example, if it were possible to transform clustering and categorization into search problems, then we would expect those algorithms to scale to Web-sized problems as well. That, in fact, is what we at OrcaTec have done. We have recast the traditional clustering and categorization algorithms as search-like algorithms that scale to very large collections in reasonable amounts of time. As a result, our software can efficiently and effectively cluster and categorize even very large collections.

Finally, that brings me to my third point from our conversation. There was widespread agreement that assessment of the effectiveness of our processes is an essential component. Whether you know or even care about the content of the black box or the head of the temp attorney hired to do the review, we all agreed that measurement was an essential part of the process. You cannot know that you've done a reasonable job unless you can show that you measured the job that you did. You cannot improve the quality of your process unless you measure that quality.

When human review performance has been measured, it has not been as consistent as one might imagine. When compared with machine learning, the machines come out at least as accurate as people, and often more accurate. From the things that I see and hear, measurement of eDiscovery processes is rapidly becoming the norm, and, in my opinion, it should be. Appropriate quality assessments are neither expensive nor time consuming. They can, however, be invaluable in demonstrating that your review involved a reasonable process at a reasonable level of effectiveness. The Sedona Conference has an excellent paper on achieving and measuring quality in eDiscovery. I highly recommend reading it.