Information Discovery: Adventures in Predictive Coding

Monday, July 1, 2013

Adventures in Predictive Coding

Ralph Losey, in a series of blogs, has painstakingly chronicled how he used two different methods, which he called a “multimodal hybrid” approach and a “monomodal hybrid” approach, to identify the responsive documents from the Enron set. He recently summarized his results here and here.

His descriptions are very detailed and often amusing. They provide a great service to the field of eDiscovery. He wants to argue that the multimodal hybrid approach is the better of the two, but his results do not support this conclusion. In fact, his two approaches show no statistically significant differences. I will explain.

The same set of 699,082 documents was considered in both exercises, and both started with a random sample to identify a prevalence rate — the proportion of responsive documents. In both exercises the random sample estimated that about one quarter or less of a percent of the documents were responsive (0.13% vs. 0.25% for the multimodal and monomodal exercises respectively, corresponding to an estimate of 928 vs. 1773 responsive documents in the whole collection). Combining these assessments of prevalence with a third one, Losey estimates that 0.18% of the documents were responsive. That’s less than one fifth of one percent or 1.8 responsive documents per thousand. In my experience, that is a dramatically sparse set of documents.

These are the same documents in the two exercises, so it is not possible that they actually contained different numbers of responsive documents. Was the almost 2:1 prevalence difference between the two exercises due to chance (sampling variation), was it due to changes in Losey’s standards for identifying responsive documents, or was it due to something else? My best guess is that the difference was due to chance.

By chance, different samples from the same population can yield different estimates. If you flip a coin, for example, on average, half of the flips will come out heads and half will come out tails. The population consists of half heads and half tails, but any given series of flips may have more or fewer heads than another. The confidence interval tells us the range of likely proportions. Here are two samples from a series of coin flips.

H T H T H H H T H H

H T T H H T T H T T

The first sample consisted of 7 Heads and 3 Tails. The second sample consisted of 4 Heads and 6 Tails. Were these samples obtained from different coins? One sample is 70% Heads and the other is 40% Heads. I happen to know that the same coin was used for all flips, and that, therefore, we can attribute the difference to chance. With samples of 10 flips, the 95% confidence interval extends from 0.2 (2 Heads) to 0.8 (8 Heads). Although these two samples resulted in numerically different values, we would not be justified in concluding that they were obtained from coins with different biases (different likelihoods of coming up Heads). Statistical hypothesis testing allows us to examine this kind of question more systematically.

Statistical significance means that the difference in results are unlikely to have occurred by chance.

I analyzed the prevalence rates in Losey’s two exercises to see whether the observed difference could reasonably be attributed to chance variation. Both are based on a large sample of documents. Using a statistical hypothesis test called a “Z-test of proportions,” it turns out that the difference is not statistically significant. The difference in prevalence estimates could reasonably have come about by chance. Two random samples from the same population could, with a reasonably high likelihood, produce a difference this large or larger by chance.

By ordinary scientific standards, if we want to conclude that one score is greater than another, we need to show that the difference between the two scores is greater than could be expected by from sampling variation. As we know, scores derived from a sample always have a confidence interval or margin of error around them. With a 95% confidence level, the confidence interval is the range of scores where, 95% of the time, the true population level for that score (the true proportion of responsive documents in the whole collection) would be found. The so-called null hypothesis is that the real difference between these two exercises is 0.0 and the observed difference is due only to sampling error, that is, to chance variations in the samples. The motivated hypothesis is that the difference is real.

Null Hypothesis: There is no reliable difference between scores
Motivated Hypothesis: The two scores differ reliably from one another

Under the null hypothesis, the difference between scores also has a confidence interval around 0.0. If the size of the difference is outside of the confidence interval, then we can say that the difference is (statistically) significant. The probability is less than 5% that the difference was drawn from a distribution centered around 0.0. If this difference is sufficiently large, then we are justified in rejecting the null hypothesis. The difference is unexpected under the null hypothesis. Then we can say that the difference is statistically significant or reliable.

On the other hand, if the magnitude of the difference is within the confidence interval, then we cannot say that the difference is reliable. We fail to reject the null hypothesis, and we may say that we accept the null hypothesis. Differences have to be sufficiently large to reject the null hypothesis or we say that there was no reliable difference. “Sufficiently large” here means “outside the confidence interval.” The bias in most of science is to assume that the null hypothesis is the better explanation unless we can find substantial evidence to the contrary.

Although the difference in estimated prevalences for the two exercises is numerically large (almost double), in fact, my analysis reveals that differences this large could easily have come about by chance due to sampling error. The difference in prevalence proportions is well within the confidence interval assuming that there is really no difference. My analysis does not prove that there was no difference, but it shows that these results do not support the hypothesis that there was a difference. The difference in estimated prevalence between the two exercises is potentially troubling, but the fact that it could have arisen by chance, means that our best guess is that there really was no systematic difference to explain.

We knew from the fact that the same data were used in both exercises that we should not expect a real difference in prevalence between the two assessments, so this failure to find a reliable difference is consistent with our expectations. On the other hand, Losey conducted the exercises with the intention of finding a difference in the accuracy of the two approaches. We can apply the same logic to looking for these differences.

We can assess the accuracy of predictive coding with several different measures. The emphasis in Losey’s approaches is to find as many of the responsive documents as possible. One measure of this goal is called Recall. Of all of the responsive documents in the collection, how many were identified by the combination of user and predictive coding? To assess Recall directly, we would need a sample of responsive documents. This sample would have to be of sufficient size to allow us to compare Recall under each approach. Unfortunately, those data are not available. We would need a sample of, say, 400 responsive documents to estimate Recall directly. We cannot use the predictive coding to find those responsive documents, because that is exactly what we are trying to measure. We need to find an independent way of estimating the total number of responsive documents.

We could try to estimate Recall from a combination of our prevalence estimate and the number of responsive documents identified by the method, but since the estimate of prevalence is so substantially different, it is not immediately obvious how to do so. If the two systems returned the same number of documents, our estimate of recall would be much lower for the monomodal method than for the multimodal method because the prevalence was estimated to be so much higher for the monomodal method.

Instead, I analyzed the Elusion measures for the two exercises. Elusion is a measure of categorization accuracy that is closely (but inversely) related to Recall. Specifically, it is a measure of the proportion of false negatives among the documents that have been classified as non-responsive (documents that should have been classified as responsive, but were incorrectly classified). An effective predictive coding exercise will have very low false negative rates, and therefore very low Elusion, because all or most of the responsive documents have been correctly classified. Low Elusion relative to Prevalence corresponds to high Recall.

Because both exercises involved the same set of documents, their true (as opposed to their observed) Prevalence rates should be the same. If one process was truly more accurate than the other, then they should differ in the proportion of responsive documents that they fail to identify. By his prediction Losey’s multimodal method should have lower Elusion than his monomodal method. That seems not to be the case.

Elusion for the monomodal method was numerically lower than for the multimodal method. A Z-test for the difference in the two Elusion proportions (0.00094 vs. 0.00085) also fails to reach a level of significance. The analysis reveals that the difference between these two Elusion values could also have occurred by chance. The proportion of false negatives in the two exercises was not reliably different from one another. Contrary to Ralph’s assertion, we are not justified to conclude from these exercises that there was a difference in their success rates. So his claim that the multimodal method is better than the monomodal method is unsupported by these data.

Finally, I compared the prevalence in each exercise with its corresponding Elusion proportion, again using a Z-test for proportions. If predictive coding has been effective, then we should observe that Elusion is only a small fraction of prevalence. Prevalence is our measure of the proportion of documents in the whole collection that are responsive. Elusion is our measure of the proportion of actually responsive documents in the set that have been categorized as non-responsive. If we have successfully identified the responsive documents, then they would not be in the Elusion set, so their proportion should be considerably lower in the Elusion set than in the initial random sample drawn from the whole collection.

Losey would not be surprised, I think, to learn that in the monomodal exercise, there was no significant difference between estimated prevalence (0.0025) and estimated Elusion (0.00085). Both proportions could have been drawn from populations with the same proportion of responsive documents. The monomodal method was not successful, according to this analysis, at identifying responsive documents.

What might be more surprising, however, was that there was also no significant difference between prevalence and Elusion in the multimodal exercise (0.0013 vs. 0.00094). Neither exercise found a significant number of responsive documents. There is no evidence that predictive coding added any value in either exercise. Random sampling before and after the exercises could have produced differences larger than the ones observed without employing predictive coding or any other categorization technique in the middle. Predictive coding in these exercises did not remove a significant number of responsive documents from the collection. A random sample was just as likely to contain the same number of responsive documents before predictive coding as after predictive coding.

Far from concluding that the multimodal method was better than the monomodal method, these two exercises cannot point to any reliable effect of either method. Not only did the methods not produce reliably different results from one another, but it looks like they had no effect at all. All of the differences between exercises can be attributed to chance, that is sampling error. We are forced to accept the null hypotheses that there were no differences between methods and no effect of predictive coding. Again, we did not prove that there were no differences, only that there were no reliable differences to be found in these exercises.

These results may not be widely indicative of those that would be found in other predictive coding uses. Other predictive coding exercises do find significant differences of the kind I looked for here.

From my experience, this situation is an outlier. These data may not be representative of predictive coding problems, for example, they are very sparse. Prevalence near zero left little room for Elusion to be lower. Less than a quarter of a percent of the documents in the collection were estimated to be responsive. In the predictive coding matters I have dealt with, Prevalence is typically considerably higher. These exercises may not be indicative of what you can expect in other situations or with other predictive coding methods. These results are not typical. Your results may vary.

Alternatively, it is possible that predictive coding worked well, but that we do not have enough statistical power to detect it. The confidence interval of the difference, just like any other confidence interval, narrows with larger samples. It could be that larger samples might have found a difference. In other words, we cannot conclude that there was no difference, the best we can do is to conclude that there was insufficient evidence to conclude that there was a difference.

But, if we cannot be confident of a difference, we cannot be confident that one method was better than the other. At the same time, we also cannot conclude that some other exercises might not find differences. Accepting the null hypothesis is not the same as proving it.

We cannot conclude that predictive coding or the technology used in these exercises does not work. Many other factors could affect the efficacy of the tools involved.

For the predictive coding algorithms to work, they need to be presented with valid examples of responsive and non-responsive documents. The algorithms do not care how those examples were obtained, provided that they are representative. The most important decisions, then, are the document decisions that go into making the example sets.

Losey’s two methods differ (among other things) in terms of who chooses the examples that are presented to the predictive coding algorithms. Losey’s multimodal method uses a mix of machine and human selection. His monomodal method, which he pejoratively calls the “Borg” method, has the computer select documents for his decision. In both cases, it was Losey making the only real decisions that the algorithms depend on — whether documents are responsive or non-responsive. Losey may find search decisions more interesting than document decisions, but search decisions are secondary and a diversion from the real choices that have to be made. He may feel as though he is being more effective by selecting the document to judge for responsiveness, but that feeling is not supported by these results. Evaluating his hypotheses will have to await a situation where we can point to reliable differences in the occurrence of responsive documents before and after running predictive coding and reliable differences between the results of the two methods.

Predictive coding is not the only way to learn about the documents. eDiscovery often requires exploratory data analysis, to know what we have to work with, what kind of vocabulary people used, who the important participants are, and so on. These are questions that are not easily addressed with predictive coding. We need to engage search and other analytics to address these questions. They are not a substitute for predictive coding, but a necessary part of preparing for eDiscovery. Predictive coding is not designed to replace all forms of engagement with the data, but rather to make document categorization easier, more cost effective, and more accurate.

Not every attorney is as skilled at or as interested in searching as Losey is. However the example documents are chosen, the critical judgments are the legal decisions about whether specific documents are responsive or not. Those judgments may not be glamorous, but they are critical to the justice system and to the performance of predictive coding. Despite rather substantial effort, nothing in his exercises would lead us to conclude that either method was better than the other.

14 comments:

UnknownJuly 1, 2013 at 9:45 AM
Herb,

Your null hypothesis is "prevalence in the null set is equal to prevalence in the collection". Now, if prevalence in the retrieved set is higher than in the collection, it follows by straightforward math that prevalence in the null set must be lower. But Ralph asserts 100% prevalence in the retrieved set (as he has manually reviewed all documents). Therefore, we can reject the null hypothesis.

But really the null hypothesis here is not interesting. If Ralph had found just a single relevant document, then the null hypothesis could be rejected (since removing that one relevant document would leave the null set with lower prevalence than the collection). Unfortunately, the implication is that elusion of the null set compared to the collection is not an interesting property to test for. One should instead test for recall compared to some threshold.

William
ReplyDelete
Replies
gvcJuly 1, 2013 at 9:51 AM
Herb,

You analysis demonstrates that a random sample of the "null set" is inadequate to distinguish search methods when the prevalence is low. Trying to distinguish search methods in this manner is akin to trying to measure the volume of a toilet flush by observing a change in sea level.

A paired test shows that Ralph's multimodal method found slightly more relevant documents, and that the difference was significant (P<0.05). Multimodal found 376 documents that monomodal missed; monomodal found 294 documents that multimodal missed. Put another way, they disagreed on 376+294 = 670 documents. If there was no difference, i.e. you tossed a coin to adjudicate the disagreements, monomodal would prevail on about 335 (50%) of the disagreements. But it did better than that.

If you plug the numbers x=376 and n=670 into a binomial calculator you see that multimodal wins 56% of the time, with a 95% confidence interval of between 52% and 60%. That is, multimodal beats monomodal by more than coin toss.

Gordon
ReplyDelete
Replies
Gerard. J. BrittonJuly 1, 2013 at 12:32 PM
I can't resist asking, is this the Thomas Aquinas Summa Theologica page (just how many angels can a monomodal process identify)? Apologies.

The discussion of whether one method's performance legitimately surpassed another seems in some senses misdirected.

Isn't the clear import of the comparative exercise that the training set of either approach did not come even close to adequate "coverage" (in an information retrieval sense)? If we conducted a third "infinity+1 modal" process, wouldn't we likely have an addition to the Venn diagram with combinations of overlapping and unique documents?

And if, as Pr. Webber indicates, the threshold's the thing in deciding whether the process's result is satisfactory, what is it? I'd argue that relative recall isn't it. That's because no one has established that it measures actual ediscovery performance validity, and there's some really good good arguments supporting the belief that it doesn't in ediscovery.

I'm not quite convinced yet, but PC performance testing for ediscovery may well become an assessment of compliance with known (some yet-to-be implemented) systematic information retrieval steps that mitigate the risk of important discovery information loss, rather than a comparative measure or a simple measurement against a threshold.

As an aside, I found the reference to a three level coding system confusing. Did the system actually train and predict on NR, R and Hot)? Or was a binary relevance classifier involved and the Hot designation only used for counting documents at the end. If the latter, the relative number of Hot documents would not appear to be very indicative of any specific capability to actually classify hot documents per se.

Also, as a point of clarification, under either scenario I believe that the author stated that he did not manually review each document in the "categorized" sets (as Pr. Webber indicates) but rather only a small portion of documents (~2500 or < 2% in the first case; ~12000 or ~25% in the second). The reliability of the automated "bulk tagging" techniques that produced the complete retrieved set from those "samples" was not established, or if it was it is not clear to me.

Gerard
ReplyDelete
Replies
Gerard. J. BrittonJuly 3, 2013 at 12:53 PM
Gordon,

Thanks for the feedback. Just to be clear though, could you explain some of what you have described?
Is there any record in TREC of "relevant" and "highly relevant" codes being used to create separate predicted relevance classes for prediction? (If so, I am confused. The TREC 2009 overview does not indicate that this was the case: "Participant submissions. Each team's final deliverable is a binary classification of the full population for relevance to each target topic in which it has chosen to participate".) TREC only measured binary relevance. Having a process wherein high relevance documents are noted is very much different than requiring that solution providers specifically train for and deliver multi-level predictive relevance values. If you don't measure it, you can't measure it.

In terms of my statements about recall, for purposes here, I argue three things:

1) that absent separate quantification of a critically important characteristic (the importance of retrieved relevant documents) you can’t validly use recall and precision measures to compare human review performance against that of PC-based reviews because the comparison yields no information about the information payload conveyed (especially given the fundamentally different way humans and machines determine which documents to retrieve). (Concretely, you can produce a lot of “meh” and achieve good recall, and retrieve the bulk of the information payload and fail the metaphorical swimming test with a lower score.) Hence we now have in the industry: “humans bad; machines good”, instead of the much more accurate “ humans good but too expensive; machines might be a decent alternative but we’d better be careful”.

2) to validly convey a measure of information retrieval performance (how much of the information payload was retrieved) which is what is important to the litigator’s information need, the measure must at least accurately and reliably convey some information about the distribution of relevant document importance.

3) recall does not measure the relative level of retrieval of information payload; it measures the relative number of documents carrying any level of information payload: the game changing email gets the same value as the 40th instance of the useless responsive invoice that was retrieved 40 times. That’s not telling a litigator what he or she needs to know.

ReplyDelete
Replies
Gerard. J. BrittonJuly 3, 2013 at 12:54 PM
Your argument appears to be that recall is sufficient for three reasons:

1) there is research indicating that there is some (positive) correlation ( “tend”) between the likelihood of relevance and the level of relevance.
Response: first frankly this seems suspect as a general proposition. In every PC solution I’ve observed, retrieved documents that look most (in content) like training documents get the highest likelihood score. If you train on a hot document, it’s near-duplicate siblings will appear high on this list. If the training set is full of marginal crap, the dopplegangers get the highest scores (anyone who has used a PC solution knows they have to first wade through repeated documents that look like the training documents before getting to new stuff. So, this seems like pretty weak tea. Moreover, has some loosey-goosey “tends to positively correlate” now become a measure in math?

2) I (that being me) haven’t demonstrated a “negative correlation” between likelihood/level.
Response: I defer to response 1. I’m not proposing that we rely on the likelihood/level inference as a measure. If you want to rely on it, prove it.

3) I (again me) haven’t provided an alternative.
Response: this is a good argument (if true) that I am a “curse the dark” in lieu of “lighting one candle” type. But that’s not the issue. Assume I can’t. The question that still remains is: is recall alone a valid measure to serve the purposes of testing, comparing or validating predictive coding performance given the information need and goals of ediscovery. My argument is, whether or not I have an alternative, it is not. And I think my argument has merit.
But anyway, I actually believe that there may be workable alternatives which I can discuss, and I’m sure others can develop even better ones. But before they can do that, recall (and all the hyperbole that has sprung from it) needs to be critically objectively examined by stakeholders.

That seems pretty fair, no?
ReplyDelete
Replies
Gerard. J. BrittonJuly 6, 2013 at 7:59 AM
Gordon,
Re: “page 3, http://trec.nist.gov/pubs/trec18/papers/LEGAL09.OVERVIEW.pdf”

Yes, TREC tasked groups to use highly relevant documents, counted them, and calculated and tabled precision and recall numbers for them, but then dismissed them conducted analysis and reporting on only “all relevant” documents using binary relevance codes. See TREC Overview report referenced above:

“[s]o the picture overall looked similar if just counting highly relevant documents. Below, we just focus on the results counting “all relevant" documents, particularly as half of the topics did not actually have training examples distinguishing highly relevant from other relevant documents.” Section 3.10.2

It’s worth noting that the set-based “highly relevant and “all-relevant” recall/precision Topic tables listed in the TREC 2009 Appendix http://trec.nist.gov/pubs/trec18/appendices/legal/app09scores.pdf displays “highly relevant at great variance from “all documents”; many “highly relevant” recall rates were extremely low.

The question arises: why have people code for separate grades of relevance but not analyze and report on it as part of TREC, or include “highly relevant” comparisons in your report in Jolt, Technology-Assisted Review In E-Discovery Can Be More Effective And More Efficient Than Exhaustive Manual Review? After all, they’re “highly relevant”.

As Ralph Losey notes about the report:
“Other flaws and issues they discuss with TREC show that scientific research of legal search is still in its early stages and has far to go. Other flaws were not discussed in the article, including what is to me the biggest, namely the lack of relevancy ranking. Documents are either relevant or not, whereas in practice, highly relevant documents are noted, and for good reason. They are far, far more important than documents of just average relevance.” http://bit.ly/mjtKG2

Attorneys and courts should be provided with that data and that comparative analysis to allow them to decide the “reasonableness” and “proportionality” of e.g. a .5 “all relevant” document recall rate with a .001 “highly relevant” document recall rate. TREC established that the coding and predictions can be done; therefore, reporting should include analysis of “highly relevant” document retrieval performance and proposed protocols should include them.
ReplyDelete
Replies
Gerard. J. BrittonJuly 6, 2013 at 8:00 AM
Cont’d
"My lawyer colleagues tell me that the federal and state civil requires require a reasonable search to find (as nearly as practicable) all responsive documents, not just the important ones."
You might ask them if they've ever encountered (or even heard of) a party being sanctioned for failing to provide sufficient relevant “drek”. Then have them opine on e.g. WOI, LLC v. Monaco Coach Corp., or Quantum Communications Corp. v. Star Broad., Inc. which involve sanction for dropping the ball on important documents.
It can be strongly argued that “reasonable” as contemplated in litigation discovery implies that the reasonable effort requirement mandates diligence in proportion to significance. So discovery obligations do not preclude prioritizing risk of production failure in a practical way based upon the relative importance of documents and the methodology used; it’s likely the approach is more defensible. Under human review, this (rightly or wrongly) was not considered a necessary step (I think there are arguments that the redundancy built into a good manual review process naturally minimizes the risk). However, when using a mechanical method like PC that employs quantifications at its core, there is a solid argument that it becomes explicitly necessary for a deliverable to contain metrics about that risk. To provide that, one of the necessary components is a description of the breakdown of e.g. “highly relevant” in the predicted set vs. a control set as well as a description of systematic efforts to optimize “highly relevant” document detection.
ReplyDelete
Replies
Gerard. J. BrittonJuly 6, 2013 at 8:05 AM
(cont’d)
"If you don't have an alternative measure or an alternative method, what precisely do you suggest my lawyer colleagues should do next time they receive a request for production?"
A bit of game play here. If an attorney has a client who wishes to produce the least discovery possible, propose today’s PC protocols and accept the risk of non-produced documents. If an attorney represents a client who believes an opposing party has material that proves its case (or disproves the opposing party’s position), then here are some alternatives:
Suggestion1: generally, don’t rely very much on baselines based on claims of equivalent (much less) superior PC performance compared to humans , in part based on what’s stated above, in part based upon the different way humans and machines operate in an information retrieval environment, and in part based upon the nature of overlapping documents in ediscovery corpi. Set realistic information retrieval objectives relative to 100% relevant material retrieval for gradient relevance classes.
Suggestion 2: generally reject simple recall as a valid unilateral measure of information retrieval in ediscovery. It was designed for other purposes, and contains assumptions and limitations that make its usefulness as the dispositive measure of ediscovery retrieval performance questionable at best. Request that it be supplemented with metrics about things such as e.g.: the content diversity of produced sets; the relative size of the set of non-produced documents for which the PC system produced weak likelihood scores; and descriptions produced in Solution 5 below.
Suggestion 3: If the set is not too large (< ~500,000), consider arguing for a manually review that uses cost-efficient staff, accelerated process techniques to increase review efficiency, and puts adequate QC procedures in place. The result will be better than PC and possibly less expensive than PC at today’s prices.
Suggestion 4: if the set is large but not too large (~500,000 to several million), agree to have PC used directly. If you are a party who believes that you would benefit more than your opponent from solid discovery (you’ll get more than you’ll give), argue for: larger sample sizes for controls to keep the interval estimate for prevalence narrower than current techniques require; content diversity techniques in training sampling and in some of the testing; gradient relevance metrics (recall, precision, prevalence);
Suggestion 5: request that content-based probabilistic predictive coding protocols be supplemented with analytical techniques that systematically identify patterns in the non-content attributes of “highly relevant” documents, especially communication documents.
Suggestion 6: request specific manual review of all members of a communication thread containing at least one “highly relevant” item, where any, but not all, thread members were produced.
Suggestion 7: where documents sets are very large, argue for culling techniques that reduce relevant item loss like PC focused not on identifying relevant material but on identifying largest near-duplicate clusters of non-relevant items.
ReplyDelete
Replies
0s0-PaDecember 10, 2014 at 2:13 AM
If this is how things like ediscovery software is made then I am definitely lost!
ReplyDelete
Replies

Add comment

Information Discovery

Monday, July 1, 2013

Adventures in Predictive Coding

14 comments:

Blog Archive

About OrcaTec