Ralph Losey, in a series of blogs, has painstakingly chronicled how he used two different methods, which he called a “multimodal hybrid” approach and a “monomodal hybrid” approach, to identify the responsive documents from the Enron set. He recently summarized his results here and here.
His descriptions are very detailed and often amusing. They provide a great service to the field of eDiscovery. He wants to argue that the multimodal hybrid approach is the better of the two, but his results do not support this conclusion. In fact, his two approaches show no statistically significant differences. I will explain.
The same set of 699,082 documents was considered in both exercises, and both started with a random sample to identify a prevalence rate — the proportion of responsive documents. In both exercises the random sample estimated that about a quarter of a percent or less of the documents were responsive (0.13% for the multimodal exercise vs. 0.25% for the monomodal one, corresponding to estimates of 928 vs. 1,773 responsive documents in the whole collection). Combining these assessments of prevalence with a third one, Losey estimates that 0.18% of the documents were responsive. That is less than one fifth of one percent, or 1.8 responsive documents per thousand. In my experience, that is a dramatically sparse set of documents.
These are the same documents in the two exercises, so it is not possible that they actually contained different numbers of responsive documents. Was the almost 2:1 difference in estimated prevalence between the two exercises due to chance (sampling variation), to a change in Losey’s standards for identifying responsive documents, or to something else? My best guess is that the difference was due to chance.
By chance, different samples from the same population can yield different estimates. If you flip a coin, for example, on average, half of the flips will come out heads and half will come out tails. The population consists of half heads and half tails, but any given series of flips may have more or fewer heads than another. The confidence interval tells us the range of likely proportions. Here are two samples from a series of coin flips.
H T H T H H H T H H
H T T H H T T H T T
The first sample consisted of 7 Heads and 3 Tails. The second sample consisted of 4 Heads and 6 Tails. Were these samples obtained from different coins? One sample is 70% Heads and the other is 40% Heads. I happen to know that the same coin was used for all flips, and that, therefore, we can attribute the difference to chance. With samples of 10 flips, the 95% confidence interval extends from 0.2 (2 Heads) to 0.8 (8 Heads). Although these two samples resulted in numerically different values, we would not be justified in concluding that they were obtained from coins with different biases (different likelihoods of coming up Heads). Statistical hypothesis testing allows us to examine this kind of question more systematically.
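The coin-flip confidence interval quoted above can be reproduced with the standard normal approximation to the binomial. This is a minimal sketch, not a claim about how the interval in the text was originally computed:

```python
import math

# 95% confidence interval for the proportion of heads in n flips of a
# fair coin, using the normal approximation to the binomial.
n, p = 10, 0.5
half_width = 1.96 * math.sqrt(p * (1 - p) / n)
lo, hi = p - half_width, p + half_width
print(f"95% CI: {lo:.2f} to {hi:.2f}")  # roughly 0.19 to 0.81

# Both observed samples fall inside this interval, so neither sample
# is evidence of a biased coin.
for heads in (7, 4):  # 7/10 heads and 4/10 heads from the two samples above
    print(heads / n, lo <= heads / n <= hi)
```

Both 0.70 and 0.40 sit comfortably inside the interval, which is why we cannot conclude the two samples came from different coins.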
Statistical significance means that the difference in results is unlikely to have occurred by chance.
I analyzed the prevalence rates in Losey’s two exercises to see whether the observed difference could reasonably be attributed to chance variation. Both are based on a large sample of documents. Using a statistical hypothesis test called a “Z-test of proportions,” it turns out that the difference is not statistically significant. The difference in prevalence estimates could reasonably have come about by chance. Two random samples from the same population could, with a reasonably high likelihood, produce a difference this large or larger by chance.
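A two-proportion Z-test of this kind can be sketched in a few lines. The prevalence figures below come from the exercises; the sample sizes (1,500 documents each) are hypothetical stand-ins, since the actual sample sizes are not reproduced in this post:

```python
import math

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided Z-test statistic for the difference of two proportions."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)  # pooled proportion under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Prevalence estimates from the two exercises; sample sizes are assumed.
z = two_proportion_z(0.0013, 1500, 0.0025, 1500)
print(f"z = {z:.2f}, significant at 95%? {abs(z) > 1.96}")
```

With these assumed sample sizes, |z| falls well below the 1.96 cutoff for significance at the 95% confidence level, consistent with the conclusion above.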
By ordinary scientific standards, if we want to conclude that one score is greater than another, we need to show that the difference between the two scores is greater than could be expected from sampling variation alone. As we know, scores derived from a sample always have a confidence interval or margin of error around them. With a 95% confidence level, the confidence interval is the range of scores that, for 95% of samples, will contain the true population value (here, the true proportion of responsive documents in the whole collection). The so-called null hypothesis is that the real difference between these two exercises is 0.0 and the observed difference is due only to sampling error, that is, to chance variation in the samples. The motivated hypothesis is that the difference is real.
- Null Hypothesis: There is no reliable difference between scores
- Motivated Hypothesis: The two scores differ reliably from one another
Under the null hypothesis, the difference between scores also has a confidence interval, centered around 0.0. If the observed difference falls outside that interval, then the probability is less than 5% that it was drawn from a distribution centered on 0.0, and we are justified in rejecting the null hypothesis. Such a difference is unexpected under the null hypothesis, and we can call it statistically significant or reliable.
On the other hand, if the magnitude of the difference is within the confidence interval, then we cannot say that the difference is reliable. We fail to reject the null hypothesis, and we may say that we accept the null hypothesis. Differences have to be sufficiently large to reject the null hypothesis or we say that there was no reliable difference. “Sufficiently large” here means “outside the confidence interval.” The bias in most of science is to assume that the null hypothesis is the better explanation unless we can find substantial evidence to the contrary.
Although the difference in estimated prevalences for the two exercises is numerically large (almost double), my analysis reveals that differences this large could easily have come about by chance due to sampling error. The difference in prevalence proportions is well within the confidence interval we would expect if there were really no difference. My analysis does not prove that there was no difference, but it shows that these results do not support the hypothesis that there was one. The difference in estimated prevalence between the two exercises is potentially troubling, but the fact that it could have arisen by chance means that our best guess is that there really was no systematic difference to explain.
We knew from the fact that the same data were used in both exercises that we should not expect a real difference in prevalence between the two assessments, so this failure to find a reliable difference is consistent with our expectations. On the other hand, Losey conducted the exercises with the intention of finding a difference in the accuracy of the two approaches. We can apply the same logic to looking for these differences.
We can assess the accuracy of predictive coding with several different measures. The emphasis in Losey’s approaches is to find as many of the responsive documents as possible. One measure of this goal is called Recall. Of all of the responsive documents in the collection, how many were identified by the combination of user and predictive coding? To assess Recall directly, we would need a sample of responsive documents. This sample would have to be of sufficient size to allow us to compare Recall under each approach. Unfortunately, those data are not available. We would need a sample of, say, 400 responsive documents to estimate Recall directly. We cannot use the predictive coding to find those responsive documents, because that is exactly what we are trying to measure. We need to find an independent way of estimating the total number of responsive documents.
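The suggestion of a sample of about 400 responsive documents reflects how the margin of error for an estimated proportion, such as Recall, shrinks with sample size. A sketch of the standard worst-case calculation (this is my illustration, not a calculation from the exercises):

```python
import math

# Worst-case 95% margin of error for an estimated proportion,
# maximized at p = 0.5, for a few candidate sample sizes.
for n in (100, 400, 1600):
    moe = 1.96 * math.sqrt(0.5 * 0.5 / n)
    print(f"n={n:>4}: margin of error ±{moe:.1%}")
```

A sample of 400 brings the worst-case margin of error down to about ±5 percentage points, which is why samples of roughly that size are commonly recommended.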
We could try to estimate Recall from a combination of our prevalence estimate and the number of responsive documents identified by the method, but since the estimate of prevalence is so substantially different, it is not immediately obvious how to do so. If the two systems returned the same number of documents, our estimate of recall would be much lower for the monomodal method than for the multimodal method because the prevalence was estimated to be so much higher for the monomodal method.
Instead, I analyzed the Elusion measures for the two exercises. Elusion is a measure of categorization accuracy that is closely (but inversely) related to Recall. Specifically, it is a measure of the proportion of false negatives among the documents that have been classified as non-responsive (documents that should have been classified as responsive, but were incorrectly classified). An effective predictive coding exercise will have very low false negative rates, and therefore very low Elusion, because all or most of the responsive documents have been correctly classified. Low Elusion relative to Prevalence corresponds to high Recall.
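The relation between Elusion, Prevalence, and Recall can be made concrete with a back-of-the-envelope formula. The function and its numbers below are purely illustrative assumptions of mine, not figures from the exercises:

```python
def approx_recall(prevalence, elusion, discard_frac=1.0):
    """Rough Recall estimate from Prevalence and Elusion.

    Assumes the collection holds prevalence * N responsive documents and
    that false negatives total elusion * (discard_frac * N), where
    discard_frac is the share of the collection classified non-responsive
    (close to 1.0 when prevalence is very low).  An illustration only.
    """
    return 1 - (elusion * discard_frac) / prevalence

# Hypothetical numbers: 1% prevalence, 0.1% elusion
print(approx_recall(0.01, 0.001))  # 0.9 -> roughly 90% recall
```

Note what happens at the extremes: if Elusion equals Prevalence, this estimate of Recall falls to zero, which is exactly the pattern at issue in the comparisons below.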
Because both exercises involved the same set of documents, their true (as opposed to their observed) Prevalence rates should be the same. If one process was truly more accurate than the other, then they should differ in the proportion of responsive documents that they fail to identify. By Losey’s prediction, his multimodal method should have lower Elusion than his monomodal method. That seems not to be the case.
Elusion for the monomodal method was numerically lower than for the multimodal method. A Z-test for the difference in the two Elusion proportions (0.00094 vs. 0.00085) also fails to reach significance. The analysis reveals that the difference between these two Elusion values could also have occurred by chance. The proportions of false negatives in the two exercises were not reliably different from one another. Contrary to Ralph’s assertion, we are not justified in concluding from these exercises that there was a difference in their success rates. So his claim that the multimodal method is better than the monomodal method is unsupported by these data.
Finally, I compared the prevalence in each exercise with its corresponding Elusion proportion, again using a Z-test for proportions. If predictive coding has been effective, then we should observe that Elusion is only a small fraction of prevalence. Prevalence is our measure of the proportion of documents in the whole collection that are responsive. Elusion is our measure of the proportion of actually responsive documents in the set that have been categorized as non-responsive. If we have successfully identified the responsive documents, then they would not be in the Elusion set, so their proportion should be considerably lower in the Elusion set than in the initial random sample drawn from the whole collection.
Losey would not be surprised, I think, to learn that in the monomodal exercise, there was no significant difference between estimated prevalence (0.0025) and estimated Elusion (0.00085). Both proportions could have been drawn from populations with the same proportion of responsive documents. The monomodal method was not successful, according to this analysis, at identifying responsive documents.
What might be more surprising, however, was that there was also no significant difference between prevalence and Elusion in the multimodal exercise (0.0013 vs. 0.00094). Neither exercise found a significant number of responsive documents. There is no evidence that predictive coding added any value in either exercise. Random sampling before and after the exercises could have produced differences larger than the ones observed without employing predictive coding or any other categorization technique in the middle. Predictive coding in these exercises did not remove a significant number of responsive documents from the collection. A random sample was just as likely to contain the same number of responsive documents before predictive coding as after predictive coding.
Far from concluding that the multimodal method was better than the monomodal method, these two exercises cannot point to any reliable effect of either method. Not only did the methods not produce reliably different results from one another, but it looks like they had no effect at all. All of the differences between exercises can be attributed to chance, that is, to sampling error. We are forced to accept the null hypotheses that there were no differences between methods and no effect of predictive coding. Again, we did not prove that there were no differences, only that there were no reliable differences to be found in these exercises.
These results may not be indicative of those that would be found in other uses of predictive coding. Other predictive coding exercises do find significant differences of the kind I looked for here.
From my experience, this situation is an outlier. These data may not be representative of typical predictive coding problems; for example, they are very sparse. Less than a quarter of a percent of the documents in the collection were estimated to be responsive, and Prevalence this near zero left little room for Elusion to be lower. In the predictive coding matters I have dealt with, Prevalence is typically considerably higher. These exercises may not be indicative of what you can expect in other situations or with other predictive coding methods. These results are not typical. Your results may vary.
Alternatively, it is possible that predictive coding worked well, but that we do not have enough statistical power to detect it. The confidence interval of the difference, just like any other confidence interval, narrows with larger samples. It could be that larger samples would have found a difference. In other words, we cannot conclude that there was no difference; the best we can do is to conclude that there was insufficient evidence of one.
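The effect of sample size on power is easy to demonstrate: holding the observed proportions fixed and recomputing the Z-test at progressively larger (entirely hypothetical) sample sizes, the same difference eventually crosses the significance threshold:

```python
import math

# How the same observed difference in proportions becomes significant
# as the standard error shrinks with n.  Sample sizes are hypothetical.
p1, p2 = 0.0013, 0.0025
for n in (1_500, 5_000, 20_000):
    p_pool = (p1 + p2) / 2  # pooled proportion, equal sample sizes
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = abs(p1 - p2) / se
    verdict = "significant" if z > 1.96 else "not significant"
    print(f"n={n:>6} per sample: z = {z:.2f} ({verdict} at 95%)")
```

Under these assumptions, samples on the order of tens of thousands of documents each would be needed before a prevalence difference of this size could be declared reliable.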
But if we cannot be confident of a difference, we cannot be confident that one method was better than the other. At the same time, we cannot rule out the possibility that other exercises would find differences. Accepting the null hypothesis is not the same as proving it.
We cannot conclude that predictive coding or the technology used in these exercises does not work. Many other factors could affect the efficacy of the tools involved.
For the predictive coding algorithms to work, they need to be presented with valid examples of responsive and non-responsive documents. The algorithms do not care how those examples were obtained, provided that they are representative. The most important decisions, then, are the document decisions that go into making the example sets.
Losey’s two methods differ (among other things) in terms of who chooses the examples that are presented to the predictive coding algorithms. Losey’s multimodal method uses a mix of machine and human selection. His monomodal method, which he pejoratively calls the “Borg” method, has the computer select documents for his decision. In both cases, it was Losey making the only real decisions that the algorithms depend on — whether documents are responsive or non-responsive. Losey may find search decisions more interesting than document decisions, but search decisions are secondary and a diversion from the real choices that have to be made. He may feel as though he is being more effective by selecting the documents to judge for responsiveness, but that feeling is not supported by these results. Evaluating his hypotheses will have to await a situation where we can point to reliable differences in the occurrence of responsive documents before and after running predictive coding and reliable differences between the results of the two methods.
Predictive coding is not the only way to learn about the documents. eDiscovery often requires exploratory data analysis, to know what we have to work with, what kind of vocabulary people used, who the important participants are, and so on. These are questions that are not easily addressed with predictive coding. We need to engage search and other analytics to address these questions. They are not a substitute for predictive coding, but a necessary part of preparing for eDiscovery. Predictive coding is not designed to replace all forms of engagement with the data, but rather to make document categorization easier, more cost effective, and more accurate.
Not every attorney is as skilled at or as interested in searching as Losey is. However the example documents are chosen, the critical judgments are the legal decisions about whether specific documents are responsive or not. Those judgments may not be glamorous, but they are critical to the justice system and to the performance of predictive coding. Despite rather substantial effort, nothing in his exercises would lead us to conclude that either method was better than the other.