Sunday, March 25, 2012

Da Silva Moore Plaintiffs Slash and Burn their Way Through eDiscovery

The Plaintiffs in the Da Silva Moore case have gone far beyond zealous advocacy in their objection to Judge Peck's order. The Plaintiffs object to the protocol (see the protocol and other documents here) that gives them the keys to the eDiscovery candy store. In return, they propose to burn down the store and eviscerate the landlord.

Da Silva Moore has been generating a lot of attention in eDiscovery circles, first for Judge Peck's decision supporting the use of predictive coding, and then for the challenges to that ruling presented by the Plaintiffs. The eDiscovery issues in this case are undoubtedly important to the legal community so it is critical that we get them right.

The Plaintiffs play loose with the facts in the matter, they fail to recognize that they have already been given the very things that they ask for, and they employ a rash of ad hominem attacks on the judge, the Defendant, and the Defendant's predictive coding vendor, Recommind. Worse still, what they ask for would actually, in my opinion, disadvantage them.

If we boil down this dispute to its essence, the main disagreement seems to be about whether to measure the success of the process using a sample of 2,399 putatively non-responsive documents or a sample of 16,555. The rest is a combination of legal argumentation, which I will let the lawyers dispute, some dubious logical and factual arguments, and personal attacks on the Judge, attorneys, and vendor.

The current disagreement embodied in the challenge to Judge Peck's decision is not about the use of predictive coding per se. The parties agreed to use predictive coding, even if the Plaintiffs now want to claim that that agreement was conditional on having adequate safeguards and measures in place. Judge Peck endorsed the use of predictive coding knowing that the parties had agreed. It was easy to order them to do something that they were already intending to do.

Now, though, the Plaintiffs are complaining that Judge Peck was biased toward predictive coding and that this bias somehow interfered with his rendering an honest decision. Although he has clearly spoken out about his interest in predictive coding, I am not aware of any time that Judge Peck endorsed any specific measurement protocol or method. The parties to the case knew about his views on predictive coding, and, for good measure, he reminded them of these views and provided them the opportunity to object. Neither party did. In any case, the point is moot in that the two sides both stated that they were in favor of using predictive coding. It seems disingenuous to then complain about the fact that he spoke supportively of the technology.

The Plaintiff brief attacking Judge Peck for his support of predictive coding reminds me of the scene from the movie Casablanca where Captain Renault says that he is shocked to find out that gambling is going on in Rick's café just as he is presented with his evening's winnings. If the implication is that judges should remain silent about methodological advances, then that would have a chilling effect on the field and on eDiscovery in particular. A frequent complaint that I hear from lawyers is that the judges don't understand technology. Here is a judge who not only understands the technology of modern eDiscovery, but works to educate his fellow judges and the members of the bar about its value. It would be disastrous for legal education if the Plaintiffs were to succeed in sanctioning the Judge for playing this educational role.

The keys to the candy shop

The protocol gives to the Plaintiffs the final say on whether the production meets quality standards (Protocol, p. 18):
If Plaintiffs object to the proposed review based on the random sample quality control results, or any other valid objection, they shall provide MSL with written notice thereof within five days of the receipt of the random sample. The parties shall then meet and confer in good faith to resolve any difficulties, and failing that shall apply to the Court for relief. MSL shall not be required to proceed with the final search and review described in Paragraph 7 above unless and until objections raised by Plaintiffs have been adjudicated by the Court or resolved by written agreement of the Parties.

They, the Plaintiffs, make a lot of other claims in their brief about things not being specified, when in fact, the protocol gives them the power to specify the criteria as they see fit. They get to define what is relevant. They get to determine whether the results are adequate, so it is not clear why they complain that these things are not clearly specified.

Moreover, the Defendant is sharing with them every document used in training and testing the predictive coding system. The Plaintiffs can object at any point in the process and trigger a meet and confer to resolve any possible dispute. It's not clear, therefore, why they would complain that the criteria are not clearly spelled out when they can object for any valid reason. Any further specificity would simply limit their ability to object. If they don't like the calculations or measures used by the Defendant, they have the documents and can do their own analysis.

The Plaintiffs are being given more data than they could reasonably expect from other defendants or when using other technology. I am not convinced that it should be necessary in general to share the predictive coding training documents with opposing counsel. These training documents provide no information about the operation of the predictive coding system. The documents are only useful for assessing the honesty or competence of the party training the predictive coding system; they presume that the predictive coding system will make good use of the information they contain. I will leave to lawyers any further discussion of whether document sharing is required or legally useful.

Misuse of the "facts"

The Plaintiffs complain that the method described in the protocol risks failing to capture a staggering 65% of the relevant documents in this case. They reach this conclusion based on their claim that Recommind’s “recall,” was very low, averaging only 35%. This is apparently a fundamental misreading or misrepresentation of the TREC (Text Retrieval Conference) 2011 preliminary results (attached to Neale's declaration). Although it may be tempting to use the TREC results for this purpose, TREC was never designed to be a commercial "bakeoff" or certification of commercial products. It is designed as a research project and it imposes special limitations on the systems that participate, limitations that might not be applicable in actual use during discovery. Moreover, Recommind scored much better on recall than the Plaintiffs claim, about twice as well.

The Plaintiffs chose to look at the system's recall level at the point where the measure F1 was maximized. F1 is a measure that combines precision and recall with an equal emphasis on both. In this matter, the parties are obviously much more concerned with recall than precision, so the F1 measure is not the best choice for judging performance. If, rather, we look at the actual recall achieved by the system, while accepting a reasonable number of non-responsive documents, then Recommind's performance was considerably higher, reaching an estimated 70% or more on the three tasks (judging from the gain curves in the Preliminary TREC report). To claim that the data support a recall rate of only 35% is misleading at best.
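To see why the recall at the F1-maximizing point can understate the recall a system is able to reach, consider the following small sketch. The operating points are invented for illustration only; they are not taken from the TREC report or from any vendor's results.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (equal weight on both)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (precision, recall) pairs along one system's gain curve,
# ordered from a strict cutoff to a lenient one. Illustrative numbers only.
operating_points = [
    (0.90, 0.20),
    (0.80, 0.35),
    (0.40, 0.55),
    (0.30, 0.70),
    (0.15, 0.85),
]

best_f1_point = max(operating_points, key=lambda pr: f1(*pr))
highest_recall_point = max(operating_points, key=lambda pr: pr[1])

print("F1-maximizing point (precision, recall):", best_f1_point)
print("Highest-recall point (precision, recall):", highest_recall_point)
```

In this made-up curve the F1-maximizing cutoff sits at 35% recall, even though the same system reaches 70% or more recall if the reviewing party is willing to accept lower precision.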

Methodological issue

The Plaintiffs complain that there are a number of methodological issues that are not fully spelled out in the protocol. Among these are how certain standard statistical properties will be measured (for example, the confidence interval around recall). Because they are standard statistical properties, they should hardly need to be spelled out again in this context. These are routine functions that any competent statistician should be able to compute.

The biggest issue that is raised, and the only one where the Plaintiffs actually have an alternative proposal, concerns how the results of predictive coding are to be evaluated. Because, according to the protocol, the Plaintiffs have the right to object to the quality of the production, it actually falls on them to determine whether it is adequate or not. The dispute revolves around the collection of a sample of non-responsive documents at the end of predictive coding (post-sample) and here the parties both seem to be somewhat confused.

According to the protocol, the Defendant will collect 2,399 documents designated by the predictive coding to be non-responsive. The plaintiffs want them to collect 16,555 of these documents. They never clearly articulate why they want this number of documents. The putative purpose of this sample is to evaluate the system's precision and recall, but in fact, this sample is useless for computing these measures.

Precision concerns the number of correctly identified responsive documents relative to the number of documents identified by the system as responsive. Precision is a measure of the specificity of the result. Recall concerns the number of correctly identified responsive documents relative to the total number of responsive documents. Recall is a measure of the completeness of the result.
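As a minimal sketch of these two definitions, the counts below are hypothetical and are not drawn from the case:

```python
def precision(true_pos, false_pos):
    """Share of documents the system called responsive that really are."""
    return true_pos / (true_pos + false_pos)

def recall(true_pos, false_neg):
    """Share of all truly responsive documents that the system found."""
    return true_pos / (true_pos + false_neg)

# Hypothetical review outcome (illustrative counts only):
true_pos = 700    # responsive, and identified as responsive
false_pos = 300   # non-responsive, but identified as responsive
false_neg = 300   # responsive, but identified as non-responsive

print("precision:", precision(true_pos, false_pos))
print("recall:   ", recall(true_pos, false_neg))
```

Note that both functions need the count of documents that the system identified as responsive and that really are responsive. A sample that contains no documents identified as responsive cannot supply that count.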

The sample that both sides want to draw contains by design no documents that have been identified by the system as responsive so it cannot be used to calculate either precision or recall. Any argument about the size of this sample is meaningless if the sample cannot provide the information that they are seeking.

A better measure to use in this circumstance is elusion. Rather than calculating the percentage of responsive documents that have been found, elusion calculates the percentage of the rejected documents (those classified as non-responsive) that are actually responsive. I have published on this topic in the Sedona Conference Journal, 2007. Elusion can be used to create an accept-on-zero quality control test or one can simply measure it. Measuring elusion would require the same size sample as the original 2,399-document pre-sample used to measure prevalence. The methods for computing the accept-on-zero quality control test are described in the Sedona Conference Journal paper. The parties could apply whatever acceptance criterion they want, without having to sample huge numbers of documents to evaluate success.
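A rough sketch of both uses of elusion follows. The sample-size function uses the textbook zero-defect sampling bound, which is an assumption on my part here and not necessarily the exact procedure given in the Sedona Conference Journal paper. The 1% ceiling in the example is likewise only illustrative.

```python
import math

def elusion(responsive_in_sample, sample_size):
    """Estimated share of rejected documents that are actually responsive."""
    return responsive_in_sample / sample_size

def accept_on_zero_sample_size(max_elusion, confidence=0.95):
    """Smallest random sample of rejected documents such that finding zero
    responsive documents supports, at the given confidence, the conclusion
    that elusion is below max_elusion.

    Uses the standard bound (1 - max_elusion) ** n <= 1 - confidence.
    """
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_elusion))

# Simply measuring elusion from a hypothetical post-sample:
print("measured elusion:", elusion(21, 2399))

# Accept-on-zero: how many rejected documents must be sampled, with none
# found responsive, to conclude that elusion is under an illustrative 1%?
print("accept-on-zero sample size:", accept_on_zero_sample_size(0.01))
```

Under this bound, a stricter acceptance criterion (a lower ceiling on elusion) calls for a larger sample, so the parties can trade sampling effort against the level of assurance they want.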

Another test that could be used is a z-test for proportions. If predictive coding works, then it should decrease the number of responsive documents that are present in the post-sample, relative to the pre-sample. The pre-sample apparently identified 36 responsive documents out of 2,399 in a random sample. A post-sample of 2,399 documents, drawn randomly from the documents identified as non-responsive would have to have 21 or fewer responsive documents for it to be significantly different (by a conservative 2-tailed test) at the 95% confidence level.
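The threshold can be checked with a short calculation. This sketch assumes the usual pooled two-sample z-test for proportions and a post-sample of the same size as the pre-sample; the function names are mine.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic for the difference of two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

def max_significant_post_count(pre_hits, n_pre, n_post, confidence=0.95):
    """Largest post-sample count that is still significantly below the
    pre-sample rate, using a two-tailed critical value."""
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    best = None
    for post_hits in range(pre_hits + 1):
        if two_proportion_z(pre_hits, n_pre, post_hits, n_post) > z_crit:
            best = post_hits
    return best

# Pre-sample: 36 responsive out of 2,399. Post-sample: 2,399 documents.
print(max_significant_post_count(36, 2399, 2399))
```

With these inputs the calculation gives 21: a post-sample with 21 responsive documents yields a z of about 2.0, while 22 yields about 1.85, which falls short of the 1.96 critical value.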

Conclusion

The Parties in this case are not arguing about the appropriateness of using predictive coding. They agreed to its use. The Plaintiffs are objecting to some very specific details of how this predictive coding will be conducted. Along the way they raise every possible objection that they can imagine, most of which are beside the point; they misinterpret or misrepresent data; they fail to realize that they have the very information they are seeking; and they seek data that will not do them any good, all while vilifying the judge, the other party, and the party's predictive coding service provider. It is as if given the keys to the candy store, they are throwing a tantrum because they have not been told whether to eat the red whips or the cinnamon sticks. Their slash and burn approach to negotiation is far beyond zealous advocacy and far from consistent with the pattern of cooperation that has been promoted by the Sedona Conference and by a large number of judges, including Judge Peck.

Disclosures

So that there are no surprises about where I am coming from, let me repeat some perhaps pertinent facts. Certain other bloggers have recently insinuated that there might be some problem with the credibility of the paper that Anne Kershaw, Patrick Oot, and I published in the peer-reviewed Journal of the American Society for Information Science and Technology on predictive coding. Judge Peck mentioned this paper in his opinion. The technology investigated in that study was from two companies with which none of the authors had any financial relationship.

I am the CTO and Chief Scientist for OrcaTec. I designed OrcaTec's predictive coding tools, starting in February of 2010, after the paper mentioned earlier had already been published and after it became clear that there was interest in predictive coding for eDiscovery. OrcaTec is a competitor of Recommind, and uses very different technology. Our goal is not to defend Recommind, but to try to bring some common sense to the process of eDiscovery.

Neither I, nor OrcaTec has any financial interest in this case, though I have had conversations in the past with Paul Neale, Ralph Losey, and Judge Peck about predictive coding.

I have also commented on this case in an ESI-Bytes podcast, where we talk more about the statistics of measurement.

Thanks to Rob Robinson for collecting all of the relevant documents in one easy to access location.

Monday, January 9, 2012

On Some Selected Search Secrets

Ralph Losey recently wrote an important series of blog posts (here, here, and here) describing five secrets of search. He pulled together a substantial array of facts and ideas that should have a powerful impact on eDiscovery and the use of technology in it. He raised so many good points, that it would take up all of my time just to enumerate them. He also highlighted the need for peer review. In that spirit I would like to address a few of his conclusions in the hope of furthering discussions among lawyers, judges, and information scientists about the best ways to pursue eDiscovery.

These are the problematic points I would like to consider:
1. Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.
2. Webber’s analysis shows that human review is better than machine review.
3. Reviewer quality is paramount.
4. Human review is good for small volumes, but not large ones.
5. Random samples with 95% confidence levels +/- 2% confidence intervals are unrealistically high.


Issue: Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.

Losey quotes extensively from a paper written by William Webber, which reanalyzes some results from the TREC Legal Track, 2009, and some other sources. Like Losey’s commentary, this paper also has a lot to recommend it. Some of the conclusions that Losey reaches are fairly attributable to Webber, but some go beyond what Webber would probably be comfortable with. The most significant fact, because important arguments are based on it, is a description of some work by Ellen Voorhees that concluded that 65% recall at 65% precision is the best performance one can expect.

The problem is that this 65% factoid is taken out of context. In the context of the TREC studies and the way that documents are ultimately determined to be relevant or not, this is thought to be the best that can be achieved. The 65% is not a fact of nature. It says, actually, nothing about the accuracy of the predictive coding systems being studied. Losey notes that this limit is due to the inherent uncertainty in human judgments of relevance, but goes on to claim that this is a limit on machine-based or machine assisted categorization. It is not.

Part of the TREC Legal Track process is to distribute sets of documents to ad hoc reviewers, whom they call assessors. Each assessor gets a block or bin of about 500 documents and is asked to categorize them as relevant or not relevant to the topic query. None of the documents in this part of the process is reviewed by more than one assessor. Each assessor typically reviews only one batch. Although information about the topic is provided to each assessor, there is no rigorous effort expended to train them. As you might expect, the assessors can be highly variable. But, generally speaking, we don’t have any assessment of their variability or skill level. This is an important point and I will have to come back to it soon.

Predictive coding systems generally work by applying machine learning to a sample of documents and extrapolating from that sample to the whole collection. Different systems get their samples in different ways, but the performance of the system depends on the quality of the sample. Garbage in – garbage out. More fully, variability in accuracy can come from at least three sources:
1. Variability in the training set
2. Variability due to software algorithms
3. Variability due to the judgment standard against which the system is scored

If the system is trained on an inconsistent set of documents, or if it performs inconsistently, or if it is judged inconsistently, its ultimate level of performance will be poor. Voorhees, in the paper cited by Webber, found that professional information analysts agreed with one another on less than half of the responsive documents. This fact says nothing about any predictive coding system; it speaks only to the agreement of one person with another. One of the assessors she compared was the author of the topic and so could be considered the best available authority on the topic. The second assessor was the author of a different topic.

Under TREC, the variability due to the training set is left uncontrolled. It is up to each team to figure out how to train their systems. The variability due to the judgment standard is consistent across systems, so any variation among systems can be attributed to the training set or the system capabilities. That strategy is perfectly fine for most TREC purposes. We can compare the relative performance of the participating systems. The problem only comes when we want to ascertain how well a system will do in absolute terms. The performance of predictive coding systems in the TREC Legal Track is suppressed by the variability of the judgment standard. This is not a design flaw in TREC; it is only a problem when we want to extrapolate from TREC results to eDiscovery or other real world situations. A TREC score under-estimates how well the system will do with more rigorous training and testing standards. The original TREC methodology was never designed to produce absolute estimates of performance, only relative ones.

Anything that we can do to improve the consistency of the training and testing set of document categorizations will improve the absolute quality of the results. But such quality improvements are typically expensive.

The TREC Legal Track has moved to using a Topic Authority (like Voorhees’s primary assessor). Even an authoritative assessor is not infallible, but it may be the best that we can achieve. It also may be realistic.

I would like to see the Topic Authority (TA) produce an authoritative training set and a second authoritative judgment set. The first set is used to train the predictive coding system, the second is used to test it.

Using a topic authority to provide the training and final assessment sets will substantially reduce the variability of the human judgments. We need two sets because we cannot use the same documents to train the system as we use to test the system. If we used only one, then the performance of the system on the same documents could over-estimate its capabilities. The system could simply memorize the training examples and spit back the same categories it was given. Having separate training and testing sets is standard procedure in most machine learning studies.
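The memorization problem is easy to demonstrate with a toy sketch. Everything here is hypothetical: the documents are just random identifiers with random labels, so there is nothing real to learn, and the "system" does nothing but memorize what it was shown.

```python
import random

random.seed(0)

# Hypothetical documents: IDs with random labels, so any honest accuracy
# estimate should come out near 50%.
documents = [(f"doc{i}", random.choice([True, False])) for i in range(2000)]
training_set, testing_set = documents[:1000], documents[1000:]

# A "system" that only memorizes the labels it was trained on.
memory = dict(training_set)

def predict(doc_id):
    # Documents never seen in training get a default guess of non-responsive.
    return memory.get(doc_id, False)

def accuracy(labeled_docs):
    hits = sum(predict(doc_id) == label for doc_id, label in labeled_docs)
    return hits / len(labeled_docs)

print("accuracy on training set:", accuracy(training_set))
print("accuracy on testing set: ", accuracy(testing_set))
```

Scored on its own training documents, the memorizer looks perfect. Scored on the separate testing set, it does no better than a coin flip, which is the honest estimate of what it has learned.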

When we do a scientific study, we want to know how well the system will do in other, similar, situations. This prediction typically requires a statistical inference, and to make a valid statistical inference the two measurements need to be independent.

To translate this into the eDiscovery process, the training set should be created by the person who knows the most about the case and then evaluated, for example, using a random sample of documents predicted to be responsive and nonresponsive, by the same person. Losey is correct: if you have multiple reviewers, each applying idiosyncratic standards to the review, you will get poor results, even from a highly accurate predictive coding system. On the other hand, with rigorous training and a rigorous evaluation process, high levels of predictive coding accuracy are often achievable.


Issue: Webber’s analysis shows that human review is better than machine review

I have no doubt that human review could sometimes be better than machine-assisted review, but the data discussed by Webber do not say anything one way or the other about this claim.

Webber did, in fact, find that some of the human reviewers showed higher precision and recall than did the best-performing computer system on some tasks. But, because of the methods used, we don’t know whether these differences were due merely to chance, to specific methods used to obtain the scores, or to genuine differences among reviewers. Moreover, the procedure prevents us from making a valid statistical comparison.

The TREC Legal Track results that were analyzed in Webber’s paper involved a three step process. The various predictive coding systems were trained on whatever data their teams thought were appropriate. The results of the multiple teams were combined and sampled along with a number of documents that were not designated responsive by any team. From these, the bins or batches were created and distributed to the assessors. Once the assessors made their decisions, the machine teams were given a second chance to “appeal” any decisions to the Topic Authority. If the TA agreed with the computer system’s judgment, the computer system was then scored as performing better and the assessor was scored as performing worse. The appeals process, in other words, moved the target after the arrow had been shot.

If none of the documents judged by a particular assessor was appealed, then that assessor would have precision and recall of 1.0. Prior to the appeal, the assessors’ judgments were the accuracy standard. The more documents that were appealed, the more likely that assessor would be to have a low score. Their score could not increase from the appeals process. So, whether an assessor scored well or scored poorly was determined almost completely by the number of documents that were appealed—by how much the target was moved.

Because of the (negative) correlation between the performance of the computer system and the performance of the assessor, their performances were not independent. Therefore, a simple statistical comparison between the performance of the assessor and the performance of the computer system is not valid.

Even if the comparison were valid, we still have other problems. The different TREC tasks involved different topics. Some were presumably easier than others. The assessors who made the decisions may have varied in ability, but we have no information about which were more skillful. The bins or batches that were reviewed probably differed among themselves in the number of responsive documents each contained. Because only one assessor judged each document, we don’t know whether the differences in accuracy (as judged by topic authority) were due to differences in the documents being reviewed or to differences in the assessors.


Issue: Reviewer quality is paramount

Webber found that some assessors performed better than others. Continuing the argument of the previous section, though, we cannot infer from this that some assessors were more talented, skilled, or better prepared than others.
It is entirely circular to say that some assessors were more skillful than others and so were more accurate because the only evidence we have that they were more skillful is that they were measured to be more accurate. You cannot explain an observation (higher accuracy) by the fact that you made the observation (higher accuracy). It cannot be both the cause and the effect.

Whether the source of variation in performance among the assessors was due to variation in the number or difficulty of the decisions or was due to differences in assessor talent, you cannot simply pick the best of one thing (the assessor performance) and compare it to the average of another (the computer assisted review). The computer performance is based on all of the documents, whereas each assessor’s performance is based on only about 500 documents. The computer performance was, in effect, the equivalent of an average over all of the assessors’ judgments. Just by chance, some assessors will be higher than others. In fact, about half of the assessors should, just by chance, score above and about half should score below the average. But we have no way to determine whether those selected reviewers scored high because of chance or because of some difference in skill. We measured them only once and we cannot use the fact that they scored well to explain their high score. We need some independent evidence.
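A small simulation makes the point about chance concrete. It is only a sketch under simplifying assumptions that are mine, not TREC's: every simulated assessor has exactly the same underlying accuracy, each judges one bin of 500 documents, and the number of assessors is arbitrary.

```python
import random

random.seed(1)

TRUE_ACCURACY = 0.80   # every simulated assessor has the same skill (assumed)
BIN_SIZE = 500         # documents per assessor, as in the TREC bins
NUM_ASSESSORS = 50     # hypothetical number of assessors

def observed_accuracy():
    """Fraction of one bin that a single assessor happens to judge correctly."""
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(BIN_SIZE))
    return correct / BIN_SIZE

scores = sorted(observed_accuracy() for _ in range(NUM_ASSESSORS))
print("lowest score: ", scores[0])
print("highest score:", scores[-1])
print("share above the true accuracy:",
      sum(s > TRUE_ACCURACY for s in scores) / NUM_ASSESSORS)
```

Even though no simulated assessor is more skilled than any other, the scores spread out around the common value, and the "best" assessor looks noticeably better than the "worst." A single measurement per assessor cannot separate this kind of luck from real skill.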

The best reviewers on each topic could have been the best because they got lucky and got an easy bin, or they got a bin with a large number of responsive documents, or just by chance. Unless we disentangle these three possibilities, we cannot claim that some reviewers were better or that reviewer quality matters. In fact, these data provide no evidence one way or the other relative to these claims.

In any case, the question in practice is not whether some human somewhere could do better than some computer system. The question is whether the computer system did well enough, or whether there is some compelling reason to force parties through the expense of using superior human reviewers. Even if some human reviewers could consistently do better than some machine, this is not the current industry standard.
In some sense, the ideal would be for the senior attorney in the case to read every single document with no effect of fatigue, boredom, distraction, or error. Instead, the current practice is to farm out first pass review to either a team of anonymous, ad hoc, or inexpensive reviewers or to search by keyword. Even if Losey were right, the standard is to use the kind of reviewers that he says are not up to the task.


Issue: Human review is good for small volumes, but not large ones

This claim may also be true, but the present data do not provide any evidence for or against it. The evidence that Losey cites in support of this claim is the same evidence that, I argued, failed to show that human review is better than machine review. It requires the same circular reasoning. We do not know from Webber’s analysis whether some reviewers were actually better than others, only that on this particular task, with these particular documents, they scored higher. Similarly, we don’t know from these data that doing only 500 documents is accurate, whereas doing more leads to inaccuracy. We don’t even know in the tested circumstance whether performance would decrease over that number. All bins were about the same size, so there is no way to test the hypothesis with these data that performance decreases as the set size rises above 500. It just was not tested.

When reviewers are confronted with a large rather than a small (e.g., several hundred) volume of documents, we can expect that fatigue, distraction, and other human factors will decrease accuracy over time. Based on other evidence from psychology and other areas, it is likely that performance will decline somewhat with larger document sets, but there is no evidence here for that.

If this were the only factor, we could arrange the situation so that reviewers only looked at 500 documents at a time before they took a break.


Issue: Random samples with 95% confidence levels +/- 2% confidence intervals are unrealistically high.

It’s not entirely clear what this claim means. On the one hand, there is a common misperception of what it means to have a 95% confidence level. Some people mistakenly assume that the confidence level refers to the accuracy of the results. But the confidence level is not the same thing as the accuracy level.
The confidence level refers to our belief in the measurement’s reliability; it does not tell us what we are measuring. The confidence interval (e.g., ±2%) is a prediction of how precisely our sample estimates the true value of the whole population. Put simply, a 95% confidence level means that in 95% of the experiments run at this confidence level, we expect to find the true population value within the range specified by the experiment’s confidence interval.

For example, a recent CNN/Time Magazine poll found that 37% of likely Republican primary voters in South Carolina supported Mitt Romney, based on a survey sample size of 485 likely Republican primary voters. With a 95% confidence level, these results are accurate to within ±4.5 percentage points (the confidence interval). It does not mean that Romney is supported by 95% of the voters or that Romney has a 95% probability of winning. It means that if the election were held today, the survey predicts that 37% of the voters, give or take 4.5 percentage points, would vote for Romney. I suspect that Losey means something different.

I suspect that he is referring to the relatively weak levels of agreement found by the TREC studies and others. If our measurement is not very precise, then we can hardly expect that our estimates will be more precise. This concern, though, rests on obtaining the measurements in the same way that TREC has traditionally done it. If we can reduce the variability of our training set and our comparison set, we can go a long way toward making our estimates more precise.

In any case, many relevant estimates do not depend on the accuracy of a group of human assessors. In practice, for example, our estimates of such factors as the prevalence of responsive documents can rest on the decisions made by a single authoritative individual, perhaps the attorney responsible for signing the Rule 26(g) declaration. Those estimates can be made precise enough with reasonable sample sizes.
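To give a sense of what "reasonable sample sizes" means, here is a sketch using the standard normal approximation for a proportion. It assumes simple random sampling from a large population and uses the conservative worst case of a 50% proportion; the function names are mine.

```python
import math
from statistics import NormalDist

def z_value(confidence=0.95):
    """Two-tailed critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def margin_of_error(sample_size, proportion=0.5, confidence=0.95):
    """Half-width of the confidence interval for an estimated proportion."""
    return z_value(confidence) * math.sqrt(
        proportion * (1 - proportion) / sample_size)

def sample_size_for_margin(margin, proportion=0.5, confidence=0.95):
    """Smallest sample size whose margin of error is at most `margin`."""
    z = z_value(confidence)
    return math.ceil(z * z * proportion * (1 - proportion) / margin ** 2)

# The poll discussed earlier: 485 respondents at a 95% confidence level.
print("margin for n=485:", margin_of_error(485))

# Sample needed for a +/- 2% interval at a 95% confidence level.
print("n for +/- 2%:", sample_size_for_margin(0.02))
```

The poll's sample of 485 works out to a margin of roughly 4.45 percentage points, consistent with the reported ±4.5, and a ±2% interval at the 95% confidence level needs a sample of about 2,400 documents regardless of how large the collection is.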


Conclusion

The main problem with Losey’s discussion derives from taking the results reported by Webber and Voorhees as an immutable fact of information retrieval. The observation that there is only moderate agreement among independent assessors is a description of the human judgments in these studies; it says nothing about any machine systems used to predict which documents are responsive or not. The variability that leads to this moderate level of agreement can be reduced and, when it is, the performance of machine review can be more accurately measured.

The second problem derives from the difficulty of attributing causation in experiments that were not designed to attribute such causation. Within the data analyzed by Webber, for example, there is no way to distinguish the effects of chance from the effects due to assessor differences.

None of these comments should be interpreted as an indictment of TREC, Webber, or Losey. Science proceeds when people with different perspectives have the chance to critique each other’s work and to raise questions that may not have previously been considered.

None of these comments is intended to argue that predictive coding is fundamentally inaccurate. Rather, my main argument is that the studies from which these data were derived were not designed to answer many of the questions we would like to ask of them. They do not speak against the effectiveness of predictive coding, nor do they speak in favor of it. Other studies will need to be conducted that address these questions specifically and are designed to answer them.

Finally, even if we disagree about the effectiveness of predictive coding relative to human performance, there is little disagreement any more about the effectiveness of a purely human linear review or of using a simple keyword search to identify responsive documents. The cost of human review continues to skyrocket as the volume of documents that must be considered increases. In many cases, human review is simply impractical within the cost and time constraints of the matter. Under those circumstances, something else has to be done to reduce that burden. That something else seems to be predictive coding, and the fact that we can measure its accuracy only adds to its usefulness.

Monday, August 8, 2011

Optimal document decisioning


There seems to be an emerging workflow in eDiscovery where predictive coding and highly professional reviewers are being used in place of large ad hoc groups of temporary attorneys. There is recognition that without high levels of training and good quality-control methods, human review tends to be only moderately accurate. Selecting or training effective reviewers requires an understanding of what makes a reviewer successful and of how to measure that success. We can look to optimal decision theory, and particularly to the branch of optimal decision theory called detection theory, to provide insight into training and assessing reviewers.

The work on optimal decision theory began during World War II to measure and understand, for example, how to characterize the sensitivity of radar to detect objects at a distance. It then came to be applied to human decision making as well, work that was published after the war. This type of optimal decision theory is often called detection theory.

Detection theory concerns the question: Based on the available evidence, should we categorize an event as a member of set 1 or as a member of set 2? In radar, the evidence is in the signal reflected from an object, and the sets are whether the reflection is from a plane or from, say, a cloud. In document decisioning, the evidence consists of the words in the document and the sets are, for example, responsive and nonresponsive.

In order to isolate the essence of decisioning, we can simplify the situation further. For the moment, let’s think about a decision where all we have to do is decide whether a tone was played at a particular time or not—a kind of hearing test. Those events when the tone is present are analogous to a document being responsive and those events when the tone is absent are analogous to a document being nonresponsive.

Let’s put on a pair of headphones and listen for the tone. When the tone is present it is played very softly, so there may be some uncertainty about whether the tone was present or not. How do we decide whether we hear a tone or not?

At first, it may seem that detecting the tone is not a matter of making a decision. It is either there or it is not. But, one of the insights of detection theory is that it does actually require a decision and that decision is affected by more than just how loud the tone is.

In detection theory, two kinds of factors influence our decisions. The first is the sensitivity of the listener: how well can the listener distinguish between tone and nontone events? The second factor is bias: how willing is the listener to say that the tone was present?

In our hearing test, we present a series of events or trials. The listener has to decide on each of those events whether she is hearing the tone. Detection theory describes how to combine the level of evidence (e.g., intensity of the tone) and these other factors to come up with the best decision possible.

Some listeners have more sensitive hearing than others. The more sensitive a person is, the softer the tone can be played and still be heard. Some reviewers are more sensitive than others. They can tell whether a document is responsive based on more subtle cues than other reviewers.

Bias concerns the willingness or tendency of the listener to identify an event as a tone event. This willingness can be influenced by a number of factors, including the probability that a given event contains a tone and by the consequences of each type of decision. Put simply, if tone events are very rare, then people will be less likely to say that a tone occurred when they are uncertain. If tone events are more common, they will be more likely to say that a tone occurred when they are uncertain. Reviewers are more likely to categorize a document as responsive if the collection contains more responsive documents.

Similarly, if a person gets paid a dollar for correctly hearing a tone and gets charged 50 cents for an error, then that person will be more likely to say that he or she heard the tone. If we reverse the payment plan so that correctly hearing a tone yields 50 cents, but errors cost a dollar, then that person will be reluctant to say that he or she hears the tone. In the face of uncertainty, the optimal decision depends on the evidence available and the consequences of each type of decision.
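The effect of the payment plan can be worked out with a little arithmetic. The sketch below makes a simplifying assumption that is not in the example above: only a "yes" response is paid or charged, and saying "no" pays nothing either way. Under that assumption, saying "yes" is worthwhile whenever the probability that a tone was present exceeds the threshold returned by the hypothetical function `yes_threshold`.

```python
def yes_threshold(gain_correct_yes, cost_wrong_yes):
    """Probability of a tone above which saying "yes" has positive expected value.

    Assumption: saying "no" pays nothing, so "yes" is worthwhile when
    p * gain - (1 - p) * cost > 0, that is, when p > cost / (gain + cost).
    """
    return cost_wrong_yes / (gain_correct_yes + cost_wrong_yes)

# A dollar for a correct "yes", 50 cents charged for a wrong one.
generous = yes_threshold(1.00, 0.50)
# The reversed plan: 50 cents for a correct "yes", a dollar for a wrong one.
cautious = yes_threshold(0.50, 1.00)

print(round(generous, 2), round(cautious, 2))
```

With the first plan the listener should say "yes" even when a tone is somewhat less likely than not; with the reversed plan the listener should hold out for much stronger evidence. The tone itself has not changed, only the consequences.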

The point of this is that you can change the proportion of events that are said to contain the tone not only by making the tone louder or softer, but also by changing the consequences of decisions and the likelihood that the tone is actually present.

Bringing this back to document decisioning, the words in a document constitute the evidence that a document is responsive or not. In the face of uncertainty, decision makers will decide whether a document is responsive based on the degree to which the evidence is consistent with the document being responsive, on their sensitivity to this evidence, on the proportion of responsive documents in a collection, and on the consequences of making each kind of decision. All of these factors play a role in document decisioning.

In the paper by Roitblat, Kershaw, and Oot (2010, JASIST), for example, two teams of reviewers re-examined a sample of documents that had been reviewed by the original Verizon team. In this re-review, Team A identified 24.2% of the documents in their sample as responsive and Team B identified 28.76% as responsive. Although Team B identified significantly more documents as responsive, when the sensitivity of these two teams was measured in the way suggested by detection theory, the two teams did not differ significantly from one another in sensitivity. They did differ in their bias, however, to call an uncertain document responsive. Team B was simply more willing than Team A to categorize documents as responsive without being any better at distinguishing responsive from nonresponsive documents.

The most useful insight to be derived from an optimal decision theory approach to document decisioning is the separability of sensitivity and bias. Reviewers can differ in how sensitive they are to identifying responsive documents and they can be guided to be more or less biased toward accepting documents as responsive when uncertain.

Presumably sensitivity will be affected by education. The more that reviewers know about the factors that govern whether a document is responsive, the better they will be at distinguishing responsive from nonresponsive. Their bias can be changed simply by asking them to be more or less fussy. The optimum review needs not only to be maximally sensitive to the difference between responsive and nonresponsive documents, but to adopt the level of bias that is appropriate to the matter at hand.

When assessing reviewers, optimal decision theory suggests that you separate out the sensitivity from the bias. The quality of a reviewer is represented by his or her sensitivity, not by bias. If all you measure, for example, is the proportion of responsive documents found by a candidate reviewer (where responsive is defined by someone authoritative), then you could easily miss highly competent reviewers because they have a different level of bias from the authoritative reviewer. Equally likely, you could select a candidate who finds many responsive documents just because he or she is biased to call more documents responsive. Although reviewer sensitivity may be difficult to change, bias is very easy to change. You have only to ask the person to be more or less generous. Unless you measure both bias and sensitivity, you won’t be able to make sound judgments about the quality of reviewers, whether those reviewers are machines or people.
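One common way to separate the two measures is the equal-variance Gaussian model of detection theory, in which sensitivity is d′ and bias is the criterion c. The sketch below assumes that model; the hit and false-alarm rates for the two reviewers are hypothetical numbers chosen for illustration, not data from any study.

```python
from statistics import NormalDist

def sensitivity_and_bias(hit_rate, false_alarm_rate):
    """Return (d_prime, criterion) under the equal-variance Gaussian model.

    hit_rate: proportion of responsive documents called responsive.
    false_alarm_rate: proportion of nonresponsive documents called responsive.
    Both rates must be strictly between 0 and 1.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(false_alarm_rate)
    d_prime = z_hit - z_fa             # larger means better discrimination
    criterion = -0.5 * (z_hit + z_fa)  # negative means biased toward "responsive"
    return d_prime, criterion

# Hypothetical reviewers: B calls many more documents responsive than A.
d_a, c_a = sensitivity_and_bias(0.69, 0.31)
d_b, c_b = sensitivity_and_bias(0.84, 0.50)
print(round(d_a, 2), round(c_a, 2))
print(round(d_b, 2), round(c_b, 2))
```

In this made-up example the two reviewers have nearly the same d′, so neither is better at telling responsive from nonresponsive documents. Reviewer B simply has a more generous criterion.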

Note: Traditional information retrieval science uses precision and recall to measure performance. These two measures recognize that there is a tradeoff between precision and recall. You can increase precision by focusing the retrieval more narrowly, but this usually results in a decrease in recall. You can get the highest recall by retrieving all documents, but then you would have very low precision. Precision and recall measures are affected by both bias and sensitivity, but they do not provide any means to separate one from the other. Sensitivity and bias have been used in information retrieval studies, but not as commonly as precision and recall.
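The tradeoff can be seen with a small worked example. The counts below are hypothetical: a collection of 1,000 documents, 100 of which are responsive, searched once narrowly and once by retrieving everything.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision: share of retrieved documents that are responsive.
    Recall: share of responsive documents that were retrieved.
    """
    retrieved = true_pos + false_pos
    relevant = true_pos + false_neg
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall

# Narrow retrieval: 50 documents retrieved, 40 of them responsive, 60 missed.
narrow = precision_recall(true_pos=40, false_pos=10, false_neg=60)
# Retrieve everything: all 100 responsive documents plus 900 others.
everything = precision_recall(true_pos=100, false_pos=900, false_neg=0)

print(narrow, everything)
```

Nothing in these two pairs of numbers says whether the difference comes from a better search or just a more generous one, which is the limitation noted above.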

Sunday, June 12, 2011

Competitor’s press release about predictive coding patent stretches the truth

[updated June 14, 2011]
[updated June 22, 2011]
One of OrcaTec's competitors, Recommind, has recently been awarded a patent related to predictive coding. In a press release dated June 8, 2011, announcing this award, they make some very grandiose claims with no basis in fact. According to their press release, they actually claim to have patented predictive coding. This claim is a gross exaggeration and unsupported by the details of the patent (No. 7,933,859) or the history of predictive coding. Patents are intended to protect inventions and there is no evidence that Recommind invented predictive coding.

Having examined the patent carefully, I can say that this patent covers only a very narrow method of computing in predictive coding and is unlikely to have any impact on the ability of any other eDiscovery service provider to continue to offer this game-changing capability. Primarily it involves the combination of using three things: Probabilistic Latent Semantic Analysis (a key part of Recommind's core product), Support Vector Machines (a statistical learning tool), and user feedback.

The scope of a patent is determined by its claims, not by the title of a press release. A valid patent requires that the proposed invention be (among other things) novel and non-obvious. In contrast, what we now call predictive coding has a very long history. I have written about some of this history previously.

Further, the use of predictive coding in eDiscovery predates Recommind’s patent application by many years. At the time that they filed their patent, predictive coding was in wide use, so it could not be considered novel or non-obvious as the Patent Office defines these terms, though some specific methods for implementing it may meet these requirements.

Predictive coding is a family of evidence-based document categorization technologies that are used to put documents or electronically stored information (ESI) into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. The underlying idea of using evidence to categorize objects has been around since the 18th Century. The notion of applying similar ideas to document classification or categorization was described in 1961 by M.E. Maron.

In his paper, Maron noted:

Loosely speaking, it appears that there are two parts to the problem of classifying. The first part concerns the selection of certain relevant aspects of an item as pieces of evidence. The second part of the problem concerns the use of these pieces of evidence to predict the proper category to which the item in question belongs. (p. 404).

The general process of updating a system’s classification rules as a result of user feedback is called “relevance feedback.” It has been in use since at least 1971 (Rocchio, 1971). For example, Lewis and Gale (1994) used relevance feedback and “uncertainty sampling” to predictively categorize news stories. Uncertainty sampling is a method of selecting specific documents to be categorized by the user when improving the quality of the predictive categorizer. The documents shown to the user for classification are those that would maximally reduce the uncertainty of the classifier.
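A minimal sketch of the selection step in uncertainty sampling follows. It assumes the classifier already produces, for each unreviewed document, an estimated probability of being responsive; the document identifiers, the scores, and the function name `most_uncertain` are made up for illustration.

```python
def most_uncertain(scores, how_many):
    """Pick the documents whose estimated probability is closest to 0.5.

    scores: dict mapping a document id to the classifier's estimated
    probability that the document is responsive.
    """
    ranked = sorted(scores, key=lambda doc_id: abs(scores[doc_id] - 0.5))
    return ranked[:how_many]

# Hypothetical classifier output for five unreviewed documents.
scores = {"doc_a": 0.97, "doc_b": 0.52, "doc_c": 0.08, "doc_d": 0.41, "doc_e": 0.75}
next_batch = most_uncertain(scores, 2)
print(next_batch)
```

The documents the classifier is already sure about (here `doc_a` and `doc_c`) are skipped, because a human decision on them would teach the classifier very little.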

In 2002, Paul Graham described a Bayesian spam filter in his essay “A Plan for Spam,” which inspired the open-source SpamBayes project. These filters use techniques similar to those described by Maron to distinguish SPAM from nonSPAM (or HAM) emails. An initial sample of categorized SPAM and HAM emails is analyzed by the program to learn how specific words provide evidence for one category or the other. Subsequent emails are then classified according to these implicit rules and the evidence, which consists of the words in the emails. If the system misclassifies an email as either SPAM or HAM, the user can flag these errors and the classifier will update itself to reflect these reclassified emails. This seems to me to be a very clear example of predictive coding, which again argues that predictive coding, per se, does not meet the novelty or non-obviousness criteria.
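The learn, classify, and correct cycle described here can be sketched in a few lines. This is a generic naive Bayes classifier with add-one smoothing, written for illustration; it is not the algorithm used by any actual spam filter, and the class name `TinyBayesFilter` and the example messages are invented.

```python
import math
from collections import Counter

class TinyBayesFilter:
    """Toy naive Bayes classifier over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = Counter()

    def train(self, text, label):
        # Training and user feedback are the same operation: add labeled evidence.
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def classify(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        best_label, best_score = None, -math.inf
        for label in ("spam", "ham"):
            # Log prior for the category, smoothed so it is never zero.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

spam_filter = TinyBayesFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")

message = "claim your free lunch"
before = spam_filter.classify(message)   # the filter's first guess
spam_filter.train(message, "ham")        # the user flags the guess as wrong
after = spam_filter.classify(message)    # the filter has updated itself
print(before, after)
```

The correction step is nothing more than adding the reclassified message to the training evidence, which is why this kind of feedback loop is so easy to build.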

eDiscovery service providers have been doing predictive coding (sometimes called by other names) for many years. In January of 2010, before the Recommind patent was filed, the eDiscovery Institute published a paper (by Roitblat, Kershaw and Oot) in the Journal of the American Society for Information Science and Technology describing two eDiscovery service providers and the accuracy of their predictive coding tools. These tools, obviously, were in existence prior to Recommind’s filing. Other service providers have been providing similar services even before that. So Recommind can hardly be considered to have invented predictive coding, yet none of this prior art was actually included in Recommind’s patent application.

One pending patent, however, was included in their application as evidence of prior art in this field. This pending application, assigned to Bank of America, is called “Predictive Coding of Documents in an Electronic Discovery System.” Therefore, the patent application itself recognizes that Recommind cannot be the inventor of predictive coding.

They also included in their list of prior art an article attributed to Thorsten (Thorsten is actually the author's first name; his last name is Joachims) describing a statistical machine-learning technique called SVM (Support Vector Machines), which is used in their claimed invention. This same paper also describes relevance feedback.

Given all of this prior art, it is very clear that Recommind is in no position to claim to have either invented or to “own” predictive coding. Rather, their patent covers a very specific, very narrow approach to predictive coding involving the use of two very specific statistical procedures and relevance feedback. I will leave it to attorneys to determine whether even this circumscribed application constitutes a valid patent.

Still, even if their patent is valid, it leaves plenty of room for other approaches to predictive coding, including the approach used by OrcaTec. Nothing in the Recommind patent would preclude OrcaTec or any other service provider from offering predictive coding services in eDiscovery or any other area. OrcaTec does not use the statistical procedures described in their patent. We believe that OrcaCategorize is an easier-to-use and more effective product, which can help attorneys achieve cost savings significantly beyond those claimed by Recommind.

Grandiose claims like those in the Recommind press release indicate either a profound lack of understanding of just what is covered by the patent or a deliberate attempt to obfuscate the issues in the industry. Attorneys involved in eDiscovery look to their service providers to provide open, honest and effective processes. They are not well served by unnecessary hyperbole.

Bob Tenant has an excellent blog on this topic that I think resolves the issue. I urge you to read it.
---
About the author:

Herbert L. Roitblat, Ph.D. is the CTO and Chief Scientist of OrcaTec. He holds a number of patents in eDiscovery technology and other areas. He is widely considered to be an expert in eDiscovery methodology and technology. He is a member of the Sedona Working Group on Electronic Document Retention and Production, on the Advisory Board of the Georgetown Legal Center Advanced eDiscovery Institute, a member of the 2011 program committee for the Georgetown Legal Center Advanced eDiscovery Institute, and the chair of the Electronic Discovery Institute. He is a member of the Board of Governors of the Organization of Legal Professionals. He is a frequent speaker on eDiscovery, particularly concerning search, categorization, predictive coding, and quality assurance.

Wednesday, April 6, 2011

Is predictive coding defensible?

Predictive coding is a family of evidence-based document categorization technologies that are used to put documents or ESI into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. The evidence is quite clear from a number of studies, including TREC and the eDiscovery Institute, that predictive coding is at least as accurate as having teams of reviewers read every document. Despite these studies, there is still some skepticism about using predictive coding in eDiscovery. I would like to address some of the issues that these skeptics have raised. The two biggest concerns are whether predictive coding is reasonable and whether it is defensible. I cannot pretend to offer a legal analysis of reasonableness and defensibility, but I do have some background information and opinions that may be useful to make the legal arguments.

Some of the resistance to using predictive coding is the fear that the opposing party will object to the methods used to find responsive (for example) documents. By my reading, the courts do not usually support objections based on the supposition that something might have been missed. The opposing party has to respond with some particularity (e.g., Ford v Edgewood Properties). In In re Exxon, the trial court refused to compel a party to present a deponent to testify as to the efforts taken to search for documents. Multiven, Inc. v. Cisco Systems might also be relevant in that the court ordered the party to use a vendor with better technology, rather than having someone read each document. In a recent decision in Datel v. Microsoft, Judge Laporte quoted the Federal Rules of Evidence: “Further, ‘[d]epending on the circumstances, a party that uses advanced analytical software applications and linguistic tools in screening for privilege and work product may be found to have taken “reasonable steps” to prevent inadvertent disclosure.’ Fed. R. Evid. 502 advisory committee’s note.”

The OrcaTec system, at least, draws repeated random samples from the total population of documents so that we can always extrapolate from the latest sample to the population as a whole. Because the sample is random, it is representative of the whole collection, so the effectiveness of the system on the sample is a statistical prediction of how well the tool would do with the whole collection. You can also sample among the documents that have been and have not been selected for production (as Judge Grimm recommends in Creative Pipe).

In general, predictive coding systems work by finding documents that are similar to those which have already been recognized as responsive (or nonresponsive) by some person of authority. The machines do not make up the decisions; they merely implement them. If there is a black-box argument that applies to predictive coding, it applies even more to hiring teams of human readers. No one knows what is going on in their heads, their judgments change over time, they lose attention after half an hour, etc.

If you cannot tell the difference between documents picked by human reviewers and documents picked by machines, then why should you pay for the difference?

Predictive coding costs a few cents per document. Human review typically costs a dollar or more per document. Proportionality and FRCP Rule 1 would suggest, I think, that a cheaper alternative with demonstrable quality should be preferred by the courts. The burden should be on the opposing party to prove that something was wrong and that engaging a more expensive process of lower quality (people) is somehow justified.

The workflow I suggest is to start with some exploratory data analysis, then set up the predictive coding. In the OrcaTec system, the system selects documents to learn from. Some other systems use slightly different methods. Once the predictive coding is done, you can take the documents that the computer recognizes as responsive and review those. This is the result of your first-pass review. They will be a small subset of the original collection, the size of which depends on the richness of the collection. I would not recommend producing any documents that have not been reviewed by trusted people. But now, you're not wasting time reading documents that will never get produced.

Predictive coding is great for first pass review, but it can also be used as a quality check on human review. Even if you do not want to rely on predictive coding to do your first-pass review, you can exploit it to check up on your reviewers after they are done.

Some attorneys worry about a potential Daubert challenge to predictive coding. I’m not convinced that it is even pertinent to the discussion; still, predictive coding would easily stand up to such a challenge. Predictive coding is mathematically and statistically based. It is mainstream science that has been in existence in one form or another since the 18th Century. There is no voodoo magic in predictive coding, only mathematics. I think that the facts supporting its accuracy are certainly substantial (peer reviewed, mainstream science, etc.) and the systems are transparent enough that there should be no (rational) argument about the facts. Many attorneys happily use keyword searching, which has long been known to be rather ineffective. There has never been a Daubert challenge to using keywords to identify responsive documents. Seldom has there been any measurement done to justify the use of keyword searches as a reasonable way to limit the scope of documents that must be reviewed. But if the weak method of keyword searching is acceptable, then a more sophisticated and powerful process should also be acceptable.

Another concern that I hear raised frequently is that lawyers would have a hard time explaining predictive coding, if challenged. I don’t think that the ideas behind predictive coding are very difficult to explain. Predictive coding works by identifying documents that are similar to those that an authoritative person has identified as responsive (or as a member of another category). Systems differ somewhat in how they compute that similarity, but they share the idea that the content of the document, the words and phrases in it, are the evidence to use when measuring this similarity.
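To make the idea concrete, here is a minimal sketch of similarity-based categorization. It uses one simple measure, cosine similarity over word counts, and assigns a document the label of its most similar example. This is an illustrative choice, not a description of how any particular product computes similarity, and the example documents are invented.

```python
import math
from collections import Counter

def word_vector(text):
    """Represent a document by the counts of its words."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Similarity between two word-count vectors, from 0 (no overlap) to 1."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def categorize(document, labeled_examples):
    """Return the label of the example most similar to the document."""
    doc_vec = word_vector(document)
    best = max(
        labeled_examples,
        key=lambda example: cosine_similarity(doc_vec, word_vector(example[0])),
    )
    return best[1]

# Decisions already made by the authoritative reviewer (hypothetical).
examples = [
    ("pricing agreement with the distributor", "responsive"),
    ("company picnic on friday", "nonresponsive"),
]
label = categorize("draft of the distributor pricing terms", examples)
print(label)
```

The explanation for any single decision is then easy to state: the new document shares its words with a document the expert already judged, and that overlap can be shown.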

A document (or more generally, ESI, electronically stored information) consists of words, phrases, sentences, and paragraphs. These textual units constitute the evidence that a document presents. For simplicity, I will focus on words as the unit, but the same ideas apply to using other text units. For further simplification, we will consider only two categories, responsive and nonresponsive. Again, the same rules apply if we want to include other categories, or if we want to distinguish privileged from nonprivileged, etc.

Based on the words in a document, then, the task is to decide whether this document is more likely to be responsive or more likely to be nonresponsive. This is the task that a reviewer performs when reading a document and it is the task that any predictive coding system performs as well. Both categorize the document as indicated by its content.

When we rely exclusively on human reviewers we have no transparency into how they make their decisions. We can provide them with instructions that are as detailed as we may like, but we do not have direct access to their thought processes. We may ask them, after the fact, why they decided that a particular document was responsive, but their answer is almost always a reconstruction of what they “must have thought” rather than a true explanation. Keyword searching, on the other hand, is very transparent, since we can point to the presence of specific key words, but the presence of a specific key word does not necessarily mean that a document is automatically responsive. One recent matter I worked on used keyword and custodian culling and still only 6% of the selected documents ended up being tagged responsive.

It seems to me that saying, “these documents were chosen as responsive because they resembled documents that were judged by our expert to be responsive,” is a pretty straightforward explanation of predictive coding, whatever technology is used to perform it. Couple that with explicit measurement of the system’s performance, and I think that you have a good case for a transparent process.

Predictive coding would appear to be an efficient, effective, and defensible process for winnowing down your document collection for first-pass review, and beyond.

Monday, April 4, 2011

Everything new is old again

To be truthful, I have been quite surprised at all of the attention that predictive coding has been receiving lately, from the usual legal blogs to the New York Times to Forbes Magazine. It’s not a particularly new technology. It’s actually been around since 1763, when Thomas Bayes’s famous theorem was first published. It’s been used in document decision making since about 1961. But, when I tried to convince people that it was a useful tool in 2002 and 2003, my arguments fell on deaf ears. Lawyers just were not interested. It never went anywhere. Times have certainly changed.

Concept search took about six years to get into the legal mainstream. Predictive coding, by whatever name, seems to have taken about 18 months. I’m told by some of my lawyer colleagues, that it has now become a necessary part of many statements of work.

The first paper that I know of concerning what we would today call predictive coding is by Maron, and published in 1961. He called it “automatic indexing.” “The term ‘automatic indexing’ denotes the problem of deciding in a mechanical way to which category (subject or field of knowledge) a given document belongs. It concerns the problem of deciding automatically what a given document is ‘about’.” He recognized that “To correctly classify an object or event is a mark of intelligence,” and found that, even in 1961, computers could be quite accurate at assigning documents to appropriate categories.

Predictive coding is a family of evidence-based document categorization technologies. The evidence is the word or words in the documents. Predictive coding systems do not replace human judgment, but rather augment it. They do not decide autonomously what is relevant, but take judgments about responsiveness from a relatively small set of documents (or other sources) and extend these judgments to additional documents. In the OrcaTec system, a user trains the system by reviewing a sample of documents. The computer watches the documents as they are reviewed and the decisions assigned by the reviewer. At the same time, as the computer gains some experience, it predicts the appropriate tag to be applied to the document, making the reviewer more consistent while making the computer more closely approach the decision rules used by the reviewer.

Although there are a number of computational algorithms that are used to compute these evidence-based categorizers, deep inside they all address the problem of finding the highest probability category or categories for a document, given its content.

The interest in predictive coding stems, I think, from two factors. First, the volume of documents that must be considered continues to grow exponentially, but the resources to deal with them do not. The cost of review in eDiscovery frequently overwhelms the amount at risk. There is widespread recognition that something has to be done. The second is the emergence of studies and investigations that examine both the quality of human review and the comparative ability of computational systems. For a number of years, TREC, the Text Retrieval Conference, has been investigating information retrieval techniques. For the last several years, they have applied the same methodology to document categorization in eDiscovery. The Electronic Discovery Institute conducted a similar study (I was the lead author). The evidence available from these studies is consistent in showing that people are only moderately accurate in categorizing documents and that computers can be at least as accurate, often more accurate. The general idea is that if computer systems can at least replicate the quality of human review at a much lower cost, then it would be reasonable to use these systems to do first pass review.

In my next post, I will discuss the skepticism that some lawyers have expressed and offer some suggestions for resolving that skepticism.

Friday, February 25, 2011

Opening the Black Box of Predictive Coding

Yesterday, we did a very interesting podcast on ESIBytes with Karl Schieneman, Jaime Carbonell, and Vasco Pedro on predictive coding. Jaime and Vasco are well known in the machine learning space, but have not been active in eDiscovery, so it was really interesting to get their perspective on how these advanced technologies could be effective in eDiscovery.

A few points that I would really like to emphasize from that conversation include the consensus among the four of us that machine learning tools are not magic bullets, but must involve humans.

We talked about two ways of using machine learning to organize documents--clustering and categorization. In eDiscovery, categorization is often called predictive coding. Clustering groups together documents that are similar to one another. The computer derives the groupings to be used. Categorization, on the other hand, starts with categories that are specified by people and puts documents into the category that provides the best match.

When using clustering, people have to decide which of the groups of documents are important after the computer organizes them. When using categorization, people have to design the categories before the computer organizes them. In neither case do we rely on the computer to make the legal judgments about what is important to the matter. That is a decision that is best made by someone with real-world experience and legal judgment.

Computers can take the tedium out of implementing human legal judgments, but so far, they are not in a position to make the judgments themselves. These systems do not take the lawyer out of the equation; they simply provide support to reduce the level of effort required to implement the attorney's judgment on ever-growing collections of documents.

That brings me to the second point I want to discuss--machine learning algorithms and their ability to handle large data sets. Jaime Carbonell remarked that most of the machine learning algorithms he was familiar with handled gigabyte-sized problems well, but did not do well on terabyte-sized problems. In general that's true, but he did mention that search algorithms were a major exception. Google and others have shown ready capabilities of searching the World-Wide Web and its billions of documents.

Not all machine learning algorithms are subject to the kinds of scaling constraints that Jaime mentioned. For example, if it were possible to transform clustering and categorization into search problems, then we would expect that these algorithms would also scale to Web-sized problems. That, in fact, is what we at OrcaTec have done. We have recast the traditional clustering and categorization algorithms into scalable search-like algorithms that scale directly into very large collections in reasonable amounts of time. As a result, our software can handle even very large collections and efficiently and effectively cluster and categorize the documents.

Finally, that brings me to my third point from our conversation. There was widespread agreement that assessment of the effectiveness of our processes is an essential component. Whether you know or even care about the content of the black box or the head of the temp attorney hired to do the review, we all agreed that measurement was an essential part of the process. You cannot know that you've done a reasonable job unless you can show that you measured the job that you did. You cannot improve the quality of your process unless you measure that quality.

When we have measured human review performance, it was not as consistent as one might have imagined. When compared with machine learning, the machines come out at least as accurate as people do, and often more accurate. From the things that I see and hear, measurement of eDiscovery processes is rapidly becoming the norm, and, in my opinion, should be. Appropriate quality assessments are neither expensive nor time consuming. They can, however, be invaluable in demonstrating that your review involved a reasonable process at a reasonable level of effectiveness. The Sedona Conference has an excellent paper on achieving and measuring quality in eDiscovery. I highly recommend reading it.