Sunday, March 25, 2012

Da Silva Moore Plaintiffs Slash and Burn their Way Through eDiscovery

The Plaintiffs in the Da Silva Moore case have gone far beyond zealous advocacy in their objection to Judge Peck's order. The Plaintiffs object to the protocol (see the protocol and other documents here) that gives them the keys to the eDiscovery candy store. In return, they propose to burn down the store and eviscerate the landlord.

Da Silva Moore has been generating a lot of attention in eDiscovery circles, first for Judge Peck's decision supporting the use of predictive coding, and then for the challenges to that ruling presented by the Plaintiffs. The eDiscovery issues in this case are undoubtedly important to the legal community, so it is critical that we get them right.

The Plaintiffs play fast and loose with the facts in the matter; they fail to recognize that they have already been given the very things that they ask for; and they employ a rash of ad hominem attacks on the judge, the Defendant, and the Defendant's predictive coding vendor, Recommind. Worse still, what they ask for would actually, in my opinion, disadvantage them.

If we boil down this dispute to its essence, the main disagreement seems to be about whether to measure the success of the process using a sample of 2,399 putatively non-responsive documents or a sample of 16,555. The rest is a combination of legal argumentation, which I will let the lawyers dispute, some dubious logical and factual arguments, and personal attacks on the Judge, attorneys, and vendor.

The current disagreement embodied in the challenge to Judge Peck's decision is not about the use of predictive coding per se. The parties agreed to use predictive coding, even if the Plaintiffs now want to claim that that agreement was conditional on having adequate safeguards and measures in place. Judge Peck endorsed the use of predictive coding knowing that the parties had agreed. It was easy to order them to do something that they were already intending to do.

Now, though, the Plaintiffs are complaining that Judge Peck was biased toward predictive coding and that this bias somehow interfered with his rendering an honest decision. Although he has clearly spoken out about his interest in predictive coding, I am not aware of any time that Judge Peck endorsed any specific measurement protocol or method. The parties to the case knew about his views on predictive coding, and, for good measure, he reminded them of these views and provided them the opportunity to object. Neither party did. In any case, the point is moot in that both sides stated that they were in favor of using predictive coding. It seems disingenuous to then complain about the fact that he spoke supportively of the technology.

The Plaintiff brief attacking Judge Peck for his support of predictive coding reminds me of the scene from the movie Casablanca where Captain Renault says that he is shocked to find out that gambling is going on in Rick's café just as he is presented with his evening's winnings. If the implication is that judges should remain silent about methodological advances, then that would have a chilling effect on the field and on eDiscovery in particular. A frequent complaint that I hear from lawyers is that the judges don't understand technology. Here is a judge who not only understands the technology of modern eDiscovery, but works to educate his fellow judges and the members of the bar about its value. It would be disastrous for legal education if the Plaintiffs were to succeed in sanctioning the Judge for playing this educational role.

The keys to the candy shop

The protocol gives to the Plaintiffs the final say on whether the production meets quality standards (Protocol, p. 18):
If Plaintiffs object to the proposed review based on the random sample quality control results, or any other valid objection, they shall provide MSL with written notice thereof within five days of the receipt of the random sample. The parties shall then meet and confer in good faith to resolve any difficulties, and failing that shall apply to the Court for relief. MSL shall not be required to proceed with the final search and review described in Paragraph 7 above unless and until objections raised by Plaintiffs have been adjudicated by the Court or resolved by written agreement of the Parties.

The Plaintiffs make a lot of other claims in their brief about things not being specified when, in fact, the protocol gives them the power to specify the criteria as they see fit. They get to define what is relevant. They get to determine whether the results are adequate, so it is not clear why they complain that these things are not clearly specified.

Moreover, the Defendant is sharing with them every document used in training and testing the predictive coding system. The Plaintiffs can object at any point in the process and trigger a meet and confer to resolve any possible dispute. It's not clear, therefore, why they would complain that the criteria are not clearly spelled out when they can object for any valid reason. Any further specificity would simply limit their ability to object. If they don't like the calculations or measures used by the Defendant, they have the documents and can do their own analysis.

The Plaintiffs are being given more data than they could reasonably expect from other defendants or when using other technology. I am not convinced that it should be necessary in general to share the predictive coding training documents with opposing counsel. These training documents provide no information about the operation of the predictive coding system. The documents are useful only for assessing the honesty or competence of the party training the predictive coding system; relying on them presumes that the predictive coding system will make good use of the information they contain. I will leave to lawyers any further discussion of whether document sharing is required or legally useful.

Misuse of the "facts"

The Plaintiffs complain that the method described in the protocol risks failing to capture a staggering 65% of the relevant documents in this case. They reach this conclusion based on their claim that Recommind’s “recall” was very low, averaging only 35%. This is apparently a fundamental misreading or misrepresentation of the TREC (Text Retrieval Conference) 2011 preliminary results (attached to Neale's declaration). Although it may be tempting to use the TREC results for this purpose, TREC was never designed to be a commercial "bakeoff" or certification of commercial products. It is designed as a research project and it imposes special limitations on the systems that participate, limitations that might not be applicable in actual use during discovery. Moreover, Recommind scored much better on recall than the Plaintiffs claim, about twice as well.

The Plaintiffs chose to look at the system's recall level at the point where the measure F1 was maximized. F1 is a measure that combines precision and recall with an equal emphasis on both. In this matter, the parties are obviously much more concerned with recall than precision, so the F1 measure is not the best choice for judging performance. If, rather, we look at the actual recall achieved by the system, while accepting a reasonable number of non-responsive documents, then Recommind's performance was considerably higher, reaching an estimated 70% or more on the three tasks (judging from the gain curves in the Preliminary TREC report). To claim that the data support a recall rate of only 35% is misleading at best.
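For readers who want to see how the choice of measure shifts the picture, here is a minimal sketch. The precision/recall pairs below are invented for illustration; they are not TREC figures. F1 weights precision and recall equally, while a recall-weighted variant such as F2 rewards the higher-recall operating points that matter more when completeness is the priority.

```python
# Illustrative only: how the F-measure weights precision and recall.
# The operating points are made up for the example, not taken from TREC.

def f_beta(precision, recall, beta=1.0):
    """General F-measure; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical points along a system's precision/recall tradeoff curve.
operating_points = [(0.80, 0.35), (0.60, 0.55), (0.40, 0.70), (0.25, 0.80)]

for precision, recall in operating_points:
    print(f"P={precision:.2f} R={recall:.2f}  "
          f"F1={f_beta(precision, recall, 1):.3f}  "
          f"F2={f_beta(precision, recall, 2):.3f}")
# F1 is maximized at the balanced point, while the recall-weighted F2 favors
# the higher-recall operating points -- exactly the tradeoff at issue here.
```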

Methodological issues

The Plaintiffs complain that there are a number of methodological issues that are not fully spelled out in the protocol. Among these are how certain standard statistical properties will be measured (for example, the confidence interval around recall). Because they are standard statistical properties, they should hardly need to be spelled out again in this context. These are routine functions that any competent statistician should be able to compute.
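To illustrate just how routine such a computation is, here is a minimal sketch of a normal-approximation (Wald) confidence interval around an estimated recall value. The sample counts are hypothetical; nothing here is taken from the protocol.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical: of 400 sampled responsive documents, 300 were retrieved by
# the system, giving an estimated recall of 0.75.
recall, low, high = proportion_confidence_interval(300, 400)
print(f"Estimated recall {recall:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```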

The biggest issue that is raised, and the only one where the Plaintiffs actually have an alternative proposal, concerns how the results of predictive coding are to be evaluated. Because, according to the protocol, the Plaintiffs have the right to object to the quality of the production, it actually falls on them to determine whether it is adequate or not. The dispute revolves around the collection of a sample of non-responsive documents at the end of predictive coding (post-sample) and here the parties both seem to be somewhat confused.

According to the protocol, the Defendant will collect a sample of 2,399 documents designated by the predictive coding system as non-responsive. The Plaintiffs want them to collect 16,555 of these documents. They never clearly articulate why they want this number of documents. The putative purpose of this sample is to evaluate the system's precision and recall, but in fact, this sample is useless for computing these measures.

Precision concerns the number of correctly identified responsive documents relative to the number of documents identified by the system as responsive. Precision is a measure of the specificity of the result. Recall concerns the number of correctly identified responsive documents relative to the total number of responsive documents. Recall is a measure of the completeness of the result.
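A toy computation may make the two definitions concrete. The counts below are invented purely for illustration.

```python
# Minimal sketch of precision and recall, with made-up counts.
true_positives = 800    # documents the system called responsive that really are
false_positives = 200   # documents the system called responsive that are not
false_negatives = 400   # responsive documents the system called non-responsive

precision = true_positives / (true_positives + false_positives)  # 0.80
recall = true_positives / (true_positives + false_negatives)     # ~0.67

print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```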

The sample that both sides want to draw contains by design no documents that have been identified by the system as responsive so it cannot be used to calculate either precision or recall. Any argument about the size of this sample is meaningless if the sample cannot provide the information that they are seeking.

A better measure to use in this circumstance is elusion. Rather than calculate the percentage of responsive documents that have been found, elusion calculates the percentage of documents that were erroneously classified as non-responsive. I have published on this topic in the Sedona Conference Journal, 2007. Elusion is the percentage of the rejected documents that are actually responsive. It can be used to create an accept-on-zero quality control test or one can simply measure it. Measuring elusion would require the same size sample as the original 2,399-document pre-sample used to measure prevalence. The methods for computing the accept-on-zero quality control test are described in the Sedona Conference Journal paper. The parties could apply whatever acceptance criterion they want, without having to sample huge numbers of documents to evaluate success.
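As a rough sketch of what such a calculation looks like, here is generic zero-acceptance sampling arithmetic with invented numbers; the exact procedure and tables are in the Sedona Conference Journal paper, and the 1.25% acceptance threshold below is only an assumed example.

```python
import math

def elusion(responsive_found_in_sample, sample_size):
    """Elusion: fraction of documents classified non-responsive that are actually
    responsive, estimated from a random sample of the rejected documents."""
    return responsive_found_in_sample / sample_size

def accept_on_zero_sample_size(max_acceptable_elusion, confidence=0.95):
    """Smallest sample size such that, if the true elusion rate were at or above
    the threshold, at least one responsive document would appear in the sample
    with the stated confidence. Generic zero-acceptance arithmetic, not the
    paper's exact tables."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_acceptable_elusion))

# Hypothetical: 12 responsive documents turn up in a 2,399-document sample
# of the documents the system rejected as non-responsive.
print(f"Estimated elusion: {elusion(12, 2399):.4f}")

# Hypothetical acceptance threshold of 1.25% elusion.
print(f"Accept-on-zero sample size: {accept_on_zero_sample_size(0.0125)}")
```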

Another test that could be used is a z-test for proportions. If predictive coding works, then it should decrease the number of responsive documents that are present in the post-sample, relative to the pre-sample. The pre-sample apparently identified 36 responsive documents out of 2,399 in a random sample. A post-sample of 2,399 documents, drawn randomly from the documents identified as non-responsive, would have to contain 21 or fewer responsive documents for the difference to be significant (by a conservative two-tailed test) at the 95% confidence level.
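For anyone who wants to check that arithmetic, here is a quick sketch using the standard pooled two-proportion z statistic; a different formulation of the test could shift the cutoff by a document or so.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

pre_responsive, pre_n = 36, 2399   # pre-sample: 36 responsive of 2,399
post_n = 2399                      # post-sample of the same size

# Find the largest post-sample count still significantly lower than the
# pre-sample rate at the 95% level (two-tailed critical value 1.96).
for x in range(pre_responsive, -1, -1):
    if two_proportion_z(pre_responsive, pre_n, x, post_n) > 1.96:
        print(f"Post-sample must contain {x} or fewer responsive documents")
        break
```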

Conclusion

The Parties in this case are not arguing about the appropriateness of using predictive coding. They agreed to its use. The Plaintiffs are objecting to some very specific details of how this predictive coding will be conducted. Along the way they raise every possible objection that they can imagine, most of which are beside the point; they misinterpret or misrepresent data; they fail to realize that they have the very information they are seeking; and they seek data that will not do them any good, all while vilifying the judge, the other party, and the party's predictive coding service provider. It is as if, given the keys to the candy store, they are throwing a tantrum because they have not been told whether to eat the red whips or the cinnamon sticks. Their slash-and-burn approach to negotiation is far beyond zealous advocacy and far from consistent with the pattern of cooperation that has been promoted by the Sedona Conference and by a large number of judges, including Judge Peck.

Disclosures

So that there are no surprises about where I am coming from, let me repeat some perhaps pertinent facts. Certain other bloggers have recently insinuated that there might be some problem with the credibility of the paper that Anne Kershaw, Patrick Oot, and I published in the peer-reviewed Journal of the American Society for Information Science and Technology on predictive coding. Judge Peck mentioned this paper in his opinion. The technology investigated in that study was from two companies with which none of the authors had any financial relationship.

I am the CTO and Chief Scientist for OrcaTec. I designed OrcaTec's predictive coding tools, starting in February of 2010, after the paper mentioned earlier had already been published and after it became clear that there was interest in predictive coding for eDiscovery. OrcaTec is a competitor of Recommind, and uses very different technology. Our goal is not to defend Recommind, but to try to bring some common sense to the process of eDiscovery.

Neither I nor OrcaTec has any financial interest in this case, though I have had conversations in the past with Paul Neale, Ralph Losey, and Judge Peck about predictive coding.

I have also commented on this case in an ESI-Bytes podcast, where we talk more about the statistics of measurement.

Thanks to Rob Robinson for collecting all of the relevant documents in one easy to access location.

Monday, January 9, 2012

On Some Selected Search Secrets

Ralph Losey recently wrote an important series of blog posts (here, here, and here) describing five secrets of search. He pulled together a substantial array of facts and ideas that should have a powerful impact on eDiscovery and the use of technology in it. He raised so many good points, that it would take up all of my time just to enumerate them. He also highlighted the need for peer review. In that spirit I would like to address a few of his conclusions in the hope of furthering discussions among lawyers, judges, and information scientists about the best ways to pursue eDiscovery.

These are the problematic points I would like to consider:
1. Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.
2. Webber’s analysis shows that human review is better than machine review.
3. Reviewer quality is paramount.
4. Human review is good for small volumes, but not large ones.
5. Random samples with 95% confidence levels and +/- 2% confidence intervals are unrealistically high.


Issue: Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.

Losey quotes extensively from a paper written by William Webber, which reanalyzes some results from the TREC Legal Track, 2009, and some other sources. Like Losey’s commentary, this paper also has a lot to recommend it. Some of the conclusions that Losey reaches are fairly attributable to Webber, but some go beyond what Webber would probably be comfortable with. The most significant claim, because important arguments are based on it, is drawn from a description of some work by Ellen Voorhees concluding that 65% recall at 65% precision is the best performance one can expect.

The problem is that this 65% factoid is taken out of context. In the context of the TREC studies and the way that documents are ultimately determined to be relevant or not, this is thought to be the best that can be achieved. The 65% is not a fact of nature. It actually says nothing about the accuracy of the predictive coding systems being studied. Losey notes that this limit is due to the inherent uncertainty in human judgments of relevance, but goes on to claim that this is a limit on machine-based or machine-assisted categorization. It is not.

Part of the TREC Legal Track process is to distribute sets of documents to ad hoc reviewers, whom they call assessors. Each assessor gets a block or bin of about 500 documents and is asked to categorize them as relevant or not relevant to the topic query. None of the documents in this part of the process is reviewed by more than one assessor. Each assessor typically reviews only one batch. Although information about the topic is provided to each assessor, there is no rigorous effort expended to train them. As you might expect, the assessors can be highly variable. But, generally speaking, we don’t have any assessment of their variability or skill level. This is an important point, and I will come back to it shortly.

Predictive coding systems generally work by applying machine learning to a sample of documents and extrapolating from that sample to the whole collection. Different systems get their samples in different ways, but the performance of the system depends on the quality of the sample. Garbage in – garbage out. More fully, variability in accuracy can come from at least three sources:
1. Variability in the training set
2. Variability due to software algorithms
3. Variability due to the judgment standard against which the system is scored

If the system is trained on an inconsistent set of documents, or if it performs inconsistently, or if it is judged inconsistently, its ultimate level of performance will be poor. Voorhees, in the paper cited by Webber, found that professional information analysts agreed with one another on less than half of the responsive documents. This fact says nothing about any predictive coding system; it speaks only to the agreement of one person with another. One of the assessors she compared was the author of the topic and so could be considered the best available authority on the topic. The second assessor was the author of a different topic.
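For concreteness, agreement of this kind is commonly expressed as overlap: the size of the intersection of the two assessors' responsive sets divided by the size of their union. The document IDs below are invented simply to illustrate the computation.

```python
# Toy overlap computation between two assessors' responsive sets
# (intersection over union). The document IDs are invented.
assessor_a = {1, 2, 3, 5, 8, 13, 21, 34}       # documents A marked responsive
assessor_b = {2, 3, 5, 7, 11, 13, 17, 19, 23}  # documents B marked responsive

overlap = len(assessor_a & assessor_b) / len(assessor_a | assessor_b)
print(f"overlap = {overlap:.2f}")   # 4 shared of 13 distinct -> ~0.31
```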

Under TREC, the variability due to the training set is left uncontrolled. It is up to each team to figure out how to train their systems. The variability due to the judgment standard is consistent across systems, so any variation among systems can be attributed to the training set or the system capabilities. That strategy is perfectly fine for most TREC purposes. We can compare the relative performance of the participating systems. The problem only comes when we want to ascertain how well a system will do in absolute terms. The performance of predictive coding systems in the TREC Legal Track is suppressed by the variability of the judgment standard. That is not a design flaw for TREC; it is only a problem when we want to extrapolate from TREC results to eDiscovery or other real-world situations. It underestimates how well the system will do with more rigorous training and testing standards. The original TREC methodology was never designed to produce absolute estimates of performance, only relative ones.

Anything that we can do to improve the consistency of the training and testing set of document categorizations will improve the absolute quality of the results. But such quality improvements are typically expensive.

The TREC Legal Track has moved to using a Topic Authority (like Voorhees’s primary assessor). Even an authoritative assessor is not infallible, but it may be the best that we can achieve. It also may be realistic.

I would like to see the Topic Authority (TA) produce an authoritative training set and a second authoritative judgment set. The first set is used to train the predictive coding system, the second is used to test it.

Using a topic authority to provide the training and final assessment sets will substantially reduce the variability of the human judgments. We need two sets because we cannot use the same documents to train the system as we use to test the system. If we used only one, then the performance of the system on the same documents could over-estimate its capabilities. The system could simply memorize the training examples and spit back the same categories it was given. Having separate training and testing sets is standard procedure in most machine learning studies.
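A minimal sketch of that separation, with arbitrary document IDs and an arbitrary 70/30 split chosen only for illustration:

```python
import random

# Split the Topic Authority's labeled documents into disjoint training and
# testing sets, so no test document was ever seen during training.
labeled_doc_ids = list(range(1000))   # stand-ins for authoritatively judged documents
random.seed(42)
random.shuffle(labeled_doc_ids)

split = int(0.7 * len(labeled_doc_ids))
training_ids = labeled_doc_ids[:split]   # used to train the predictive coding system
testing_ids = labeled_doc_ids[split:]    # used only to measure its performance

assert not set(training_ids) & set(testing_ids)   # no document appears in both
```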

When we do a scientific study, we want to know how well the system will do in other, similar, situations. This prediction typically requires a statistical inference, and to make a valid statistical inference the two measurements need to be independent.

To translate this into an eDiscovery process, the training set should be created by the person who knows the most about the case and then evaluated by that same person, for example, using a random sample of documents predicted to be responsive and nonresponsive. Losey is correct: if you have multiple reviewers, each applying idiosyncratic standards to the review, you will get poor results, even from a highly accurate predictive coding system. On the other hand, with rigorous training and a rigorous evaluation process, high levels of predictive coding accuracy are often achievable.


Issue: Webber’s analysis shows that human review is better than machine review

I have no doubt that human review could sometimes be better than machine-assisted review, but the data discussed by Webber do not say anything one way or the other about this claim.

Webber did, in fact, find that some of the human reviewers showed higher precision and recall than did the best-performing computer system on some tasks. But, because of the methods used, we don’t know whether these differences were due merely to chance, to specific methods used to obtain the scores, or to genuine differences among reviewers. Moreover, the procedure prevents us from making a valid statistical comparison.

The TREC Legal Track results that were analyzed in Webber’s paper involved a three-step process. The various predictive coding systems were trained on whatever data their teams thought were appropriate. The results of the multiple teams were combined and sampled, along with a number of documents that were not designated responsive by any team. From these, the bins or batches were created and distributed to the assessors. Once the assessors made their decisions, the machine teams were given a second chance to “appeal” any decisions to the Topic Authority. If the TA agreed with the computer system’s judgment, the computer system was scored as performing better and the assessor as performing worse. The appeals process, in other words, moved the target after the arrow had been shot.

If none of the documents judged by a particular assessor was appealed, then that assessor would have precision and recall of 1.0. Prior to the appeal, the assessors’ judgments were the accuracy standard. The more documents that were appealed, the more likely that assessor would be to have a low score. Their score could not increase from the appeals process. So, whether an assessor scored well or scored poorly was determined almost completely by the number of documents that were appealed—by how much the target was moved.

Because of the (negative) correlation between the performance of the computer system and the performance of the assessor, their performances were not independent. Therefore, a simple statistical comparison between the performance of the assessor and the performance of the computer system is not valid.

Even if the comparison were valid, we still have other problems. The different TREC tasks involved different topics. Some were presumably easier than others. The assessors who made the decisions may have varied in ability, but we have no information about which were more skillful. The bins or batches that were reviewed probably differed among themselves in the number of responsive documents each contained. Because only one assessor judged each document, we don’t know whether the differences in accuracy (as judged by topic authority) were due to differences in the documents being reviewed or to differences in the assessors.


Issue: Reviewer quality is paramount

Webber found that some assessors performed better than others. Continuing the argument of the previous section, though, we cannot infer from this that some assessors were more talented, skilled, or better prepared than others.
It is entirely circular to say that some assessors were more skillful than others and so were more accurate because the only evidence we have that they were more skillful is that they were measured to be more accurate. You cannot explain an observation (higher accuracy) by the fact that you made the observation (higher accuracy). It cannot be both the cause and the effect.

Whether the source of variation in performance among the assessors was due to variation in the number or difficulty of the decisions or was due to differences in assessor talent, you cannot simply pick the best of one thing (the assessor performance) and compare it to the average of another (the computer-assisted review). The computer's performance is based on all of the documents; each assessor's performance is based on only about 500 documents. The computer's performance was, in effect, the equivalent of an average over all of the assessors' judgments. Just by chance, some assessors will score higher than others. In fact, about half of the assessors should, just by chance, score above the average and about half should score below it. But we have no way to determine whether the selected reviewers scored high because of chance or because of some difference in skill. We measured them only once, and we cannot use the fact that they scored well to explain their high score. We need some independent evidence.
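A small simulation makes the point. Assume, purely for the sake of argument, that every assessor has exactly the same underlying accuracy and that each one judges a single bin of about 500 documents; the measured scores still spread noticeably, and roughly half land above the average by chance alone. All of the numbers below are assumptions chosen for illustration.

```python
import random

random.seed(1)

TRUE_ACCURACY = 0.75   # assume every assessor has identical underlying skill
BIN_SIZE = 500         # each assessor judges one bin of about 500 documents
NUM_ASSESSORS = 50

measured = []
for _ in range(NUM_ASSESSORS):
    correct = sum(random.random() < TRUE_ACCURACY for _ in range(BIN_SIZE))
    measured.append(correct / BIN_SIZE)

print(f"min {min(measured):.3f}, max {max(measured):.3f}, "
      f"above average: {sum(m > TRUE_ACCURACY for m in measured)} of {NUM_ASSESSORS}")
# Even with identical skill, measured scores spread by several percentage
# points, and roughly half the assessors land above the average by chance.
```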

The best reviewers on each topic could have been the best because they got lucky and got an easy bin, or they got a bin with a large number of responsive documents, or just by chance. Unless we disentangle these three possibilities, we cannot claim that some reviewers were better or that reviewer quality matters. In fact, these data provide no evidence one way or the other relative to these claims.

In any case, the question in practice is not whether some human somewhere could do better than some computer system. The question is whether the computer system did well enough, or whether there is some compelling reason to force parties to bear the expense of using superior human reviewers. Even if some human reviewers could consistently do better than some machine, this is not the current industry standard.
In some sense, the ideal would be for the senior attorney in the case to read every single document with no effect of fatigue, boredom, distraction, or error. Instead, the current practice is to farm out first-pass review either to a team of anonymous, ad hoc, or inexpensive reviewers or to rely on keyword search. Even if Losey were right, the standard is to use the kind of reviewers that he says are not up to the task.


Issue: Human review is good for small volumes, but not large ones

This claim may also be true, but the present data do not provide any evidence for or against it. The evidence that Losey cites in support of this claim is the same evidence that, I argued, failed to show that human review is better than machine review. It requires the same circular reasoning. We do not know from Webber’s analysis whether some reviewers were actually better than others, only that on this particular task, with these particular documents, they scored higher. Similarly, we don’t know from these data that doing only 500 documents is accurate, whereas doing more leads to inaccuracy. We don’t even know in the tested circumstance whether performance would decrease over that number. All bins were about the same size, so there is no way to test the hypothesis with these data that performance decreases as the set size rises above 500. It just was not tested.

When confronted with a small (e.g., several hundred) versus a large volume of documents to review, we can expect that fatigue, distraction, and other human factors will decrease accuracy over time. Based on other evidence from psychology and other areas, it is likely that performance will decline somewhat with larger document sets, but there is no evidence here for that.

If this were the only factor, we could arrange the situation so that reviewers only looked at 500 documents at a time before they took a break.


Issue: Random samples with 95% confidence levels and +/- 2% confidence intervals are unrealistically high.

It’s not entirely clear what this claim means. On the one hand, there is a common misperception of what it means to have a 95% confidence level. Some people mistakenly assume that the confidence level refers to the accuracy of the results. But the confidence level is not the same thing as the accuracy level.
The confidence level refers to the reliability of the measurement process; it does not tell us what we are measuring. The confidence interval (e.g., ±2%) is a prediction of how precisely our sample estimates the true value for the whole population. Put simply, a 95% confidence level means that if we repeated the sampling procedure many times, we would expect the true population value to fall within the computed confidence interval in about 95% of those repetitions.

For example, a recent CNN/Time Magazine poll found that 37% of likely Republican primary voters in South Carolina supported Mitt Romney, based on a sample of 485 likely Republican primary voters. With a 95% confidence level, these results are accurate to within ±4.5 percentage points (the confidence interval). That does not mean that Romney is supported by 95% of the voters or that Romney has a 95% probability of winning. It means that if the election were held today, the survey predicts that about 37% of the voters, give or take 4.5 points, would vote for Romney. I suspect that Losey means something different.
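As a side note, the poll arithmetic is easy to check with the standard normal-approximation margin of error for a sample proportion; the reported ±4.5 points corresponds to the conservative worst case of p = 0.5 that pollsters typically quote.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion at 95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

n = 485          # likely Republican primary voters sampled
p_romney = 0.37  # reported support for Romney

print(f"Margin at p=0.37:          +/- {100 * margin_of_error(p_romney, n):.1f} points")
print(f"Worst-case margin (p=0.5): +/- {100 * margin_of_error(0.5, n):.1f} points")
# About 4.3 points at the observed proportion; about 4.5 points at the
# conservative p = 0.5 that pollsters typically report.
```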

I suspect that he is referring to the relatively weak levels of agreement found by the TREC studies and others. If our measurement is not very precise, then we can hardly expect that our estimates will be more precise. This concern, though, rests on obtaining the measurements in the same way that TREC has traditionally done it. If we can reduce the variability of our training set and our comparison set, we can go a long way toward making our estimates more precise.

In any case, many relevant estimates do not depend on the accuracy of a group of human assessors. In practice, for example, our estimates of such factors as the prevalence of responsive documents can rest on the decisions made by a single authoritative individual, perhaps the attorney responsible for signing the Rule 26(g) declaration. Those estimates can be made precise enough with reasonable sample sizes.
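As a sketch of what "reasonable sample sizes" means in practice, here is the standard sample-size calculation for estimating a proportion, using the conservative p = 0.5 assumption.

```python
import math

def sample_size_for_proportion(margin, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion (e.g., prevalence of responsive
    documents) to within +/- margin at 95% confidence, using the conservative p = 0.5."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# A +/- 2% interval at 95% confidence:
print(sample_size_for_proportion(0.02))   # about 2,400 judged documents
```

Roughly 2,400 judged documents buy a ±2% interval at 95% confidence, regardless of how large the collection is.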


Conclusion

The main problem with Losey’s discussion derives from taking the results reported by Webber and Voorhees as an immutable fact of information retrieval. The observation that there is only moderate agreement among independent assessors is a description of the human judgments in these studies; it says nothing about any machine systems used to predict which documents are responsive. The variability that leads to this moderate level of agreement can be reduced, and when it is, the performance of machine review can be measured more accurately.

The second problem derives from the difficulty of attributing causation in experiments that were not designed to attribute such causation. Within the data analyzed by Webber, for example, there is no way to distinguish the effects of chance from the effects due to assessor differences.

None of these comments should be interpreted as an indictment of TREC, Webber, or Losey. Science proceeds when people with different perspectives have the chance to critique each other’s work and to raise questions that may not have previously been considered.

None of these comments is intended to argue that predictive coding is fundamentally inaccurate. Rather, my main argument is that the studies from which these data were derived were not designed to answer many of the questions we would like to ask of them. They do not speak against the effectiveness of predictive coding, nor do they speak in favor of it. Other studies will need to be conducted that address these questions specifically and are designed to answer them.

Finally, even if we disagree about the effectiveness of predictive coding relative to human performance, there is little disagreement any more about the limitations of a purely human linear review or of a simple keyword search for identifying responsive documents. The cost of human review continues to skyrocket as the volume of documents that must be considered increases. In many cases, human review is simply impractical within the cost and time constraints of the matter. Under those circumstances, something else has to be done to reduce that burden. That something else seems to be predictive coding, and the fact that we can measure its accuracy only adds to its usefulness.