Tuesday, July 23, 2013

Further Adventures in Predictive Coding

The commentators on my recent post about Adventures in Predictive Coding raise a number of interesting issues that I would like to explore further.  I have already responded to Webber’s comments.  Here I would like to comment further on the statistics suggested by Cormack and to address a few issues raised by Britton.

As a reminder, my post was in response to Ralph Losey’s blogs in which he chronicled how he used two different methods to identify the responsive documents from the Enron set. My post demonstrated how, statistically, despite his substantial effort, nothing in his exercises would lead us to conclude definitively that either method was better than the other.

Cormack


Gordon Cormack has argued that a statistical test with more power than the one I used might be able to find a significant difference between Losey’s two methods.  He is correct that, in theory, some other test might indeed find significance.  He is incorrect, however, about the test he chose to use, which he calls the “sign test,” and about how he applied it.  Although it is a fairly minor point, given that neither method yielded a statistically significant number of responsive documents relative to simply selecting documents at random, it may still be instructive to discuss the statistical methods that were used to compare the processes.

Each statistical hypothesis test is based on a certain set of assumptions.  If those assumptions are not met, then the test will yield misleading or inappropriate inferences.  The numbers may look right, but if the process that generated those numbers is substantially different in a statistical sense from the one assumed by the test, then erroneous conclusions are likely.

According to Wikipedia, the sign test “can be used to test the hypothesis that there is ‘no difference in medians’ between the continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X and Y.”  To translate that, these are the assumptions of the sign test:

  1. Compare two measures, one called X and one called Y.
  2. Compare paired items.  At a minimum, the two measurements have to be of items that can be matched between measurements.  That is, for each item in measurement 1, there is one and only one item in measurement 2 against which it is compared.  This pairing must be specified before the measurements are taken.  Usually these are the same items, measured two times, once before a treatment and once after a treatment. 
  3. Compare the two sets of measurements using their median.  The median is the middle score, where half of the scores are above the median and half are below. 
  4. The two measures must have continuous distributions.  In order to compare medians, some scores have to be higher than others.  For each of the items being measured (twice), we have to be able to decide whether the second measurement is greater than the first, less than the first, or equal to the first.
  5. The sets of measurements must be random variables.  They have to be derived from a random sampling process.  Typically this means that the items that are being measured twice are selected through random sampling. 
  6. The number of observations must be fixed prior to computing the statistic.  The binomial rule applies within a particular sample of a particular size.
   
For each pair of measures, we examine whether the second measurement is less than the first, greater than the first or equal to the first. In other words, we count the number of times the comparison comes up with a positive sign, a negative sign, or no sign.  That’s why the test is called a sign test. 

Under the null hypothesis, the proportion of positive signs should be about 0.5 and we use the binomial distribution to assess that hypothesis.  The probability of a certain number of plus or minus signs can be found by the binomial rule based on a probability equal to 0.5 and a certain number of observations.
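
For readers who want to see the arithmetic, here is a minimal sketch of the sign-test computation as it is normally applied.  The counts are hypothetical and stand in for a properly paired, randomly sampled comparison, which, as discussed below, is exactly what Cormack’s numbers are not.

```python
from math import comb

def sign_test_p_value(n_plus, n_minus):
    """Two-sided sign test.  Under the null hypothesis, plus and minus
    signs are equally likely (p = 0.5); ties are discarded beforehand."""
    n = n_plus + n_minus          # number of untied pairs, fixed by the sampling design
    k = min(n_plus, n_minus)
    # Probability of a split at least this lopsided under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical example: 30 pairs favor one method, 20 favor the other, ties dropped.
print(sign_test_p_value(30, 20))  # ~0.20, not significant at the 0.05 level
```

The calculation itself is trivial; what matters is the fine print in the comments: a fixed number of randomly generated, untied pairs.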

Cormack’s analysis meets the first and second requirements.  The same documents can be compared in both exercises. 

It does not meet the third or fourth requirements.  There is no median measurement to deal with in Cormack’s analysis.  The two measurements (multimodal vs. monomodal) do not yield a score, only a yes/no decision, so we cannot judge whether one is greater than the other.  Without a pair of continuous measurements, the test reduces to a simple binomial.  If the other assumptions of a binomial were met, we could still use that to assess the significance of the comparison.

The analysis does not meet requirement 5.  There is nothing random about the selection of these documents.  They are not a sample from a larger population; they are the entire population of responsive documents about which the two exercises disagreed. 

Cormack’s analysis also fails requirement 6.  There is no sample size.  The binomial distribution assumes that we have a sample consisting of a fixed number of observations.  In Cormack’s analysis, on the other hand, the number of observations is determined by the number of documents observed to be responsive by one method or the other.  This is circular.  It’s not a sample because the documents were not randomly selected, and the binomial distribution does not apply.

Because Cormack's analysis does not meet the prerequisites of the sign test or the binomial, these tests cannot be used to assess significance. The numbers have the right appearance for computing a proportion, but unless the proportion was created in the way assumed by the statistical test, it cannot be used validly to determine whether the pattern of results is different from what you would expect by chance. 

The same kind of statistical process must produce both the observations and the chance predictions.  Nominally, we are testing whether a random process with one probability (0.5) is different from a random process with a different probability (0.56).  In reality, however, we are comparing a random process (like a coin flip) against a different, non-random process.  If it is not a random process generating the numbers, then the binomial rule cannot be used to estimate its distribution.  If it is not a sample drawn from a larger population, then there is no distribution to begin with.

We may know the probability that a fair coin would produce the corresponding proportion of heads, but, because Cormack’s analysis fails to preserve the prerequisites of a binomial, that probability is simply irrelevant.  We know how coin flips should be distributed according to chance, but we do not have a distribution for the numbers of documents identified exclusively by the multimodal and monomodal exercises.  They’re just numbers.

To be sure, the multimodal approach identified numerically more documents as responsive than did the monomodal approach.  We cannot say that that difference is reliable (significant) or not, but if it were, does that mean that the multimodal method was better?  Can we attribute that difference to the methods used? Not necessarily.  A process that claims that more documents are responsive, whether correct or incorrect, would show a similar difference. A process that was more stringent about what should be called a responsive document would also yield a comparable difference.

As Losey noted, his criteria for judging whether a document was responsive may have changed between the multimodal and the monomodal exercises. “The SME [Losey] had narrowed his concept of relevance [between the two exercises].”  The only measure we have of whether a document was truly responsive was Losey’s judgment.  If his standard changed, then fewer documents were relevant in the second exercise, so the fact that fewer were predicted to be responsive could be simply a product of his judgment, not of the method.  So, even if we found that the difference in the number of documents was significant, we still cannot attribute that difference with confidence to the method used to find them. 

Knowing only the number of responsive documents found, we cannot know whether that difference was due to a difference in method or a difference in the standard for categorizing a document as responsive.

Britton


Gerard Britton raises the argument that Recall is not a good measure of the completeness of predictive coding.  He notes, correctly, that documents may differ in importance, but then somehow concludes that, therefore, Recall is uninformative.  He offers no alternative measures, but rather seems to believe that human reviewers somehow will automagically be better than the computer at recognizing more-responsive documents.  He claims without evidence, and, I would argue, contrary to the available evidence, that humans and computers categorize documents in fundamentally different ways.

Although it is possible that humans and computers make systematically different judgments about the responsiveness of documents, there is no evidence to support this claim.  It is a speculation.  Britton treats this purported difference as a fact and then claims that it is unacceptable that this supposed fact is being ignored.  If Britton were right that computers are poorer than human reviewers at finding the “hot” documents, then for computers to achieve the same or higher overall accuracy they would have to be more accurate at finding what he calls the “meh” documents.  TREC has some evidence to the contrary, which Cormack mentioned. 

If the ability of a system to predict hot vs. meh documents is a concern, then one can examine this capability directly.  One could train a predictive coding system on just hot documents, for example, and examine its performance on this set.  One could separately train on documents that are not so hot, but still responsive.  One would also have to study how teams of human reviewers do on these documents, and I think that the results would not support Britton’s suppositions.

The potential difference in standards for accepting whether a document is in the positive or negative group does not invalidate the measure of Recall.  Recall is just a count (a proportion, actually) of the documents that meet one’s criteria.  What those criteria are is completely up to the person doing the measurement.  To say that Recall is invalidated because there can be multiple criteria is a fundamental misunderstanding.

Predictive coding is based on human judgment.  The essence of predictive coding is that a human subject matter expert provides examples of responsive and non-responsive documents.  An authoritative reviewer, such as Losey, determines what standards to apply to categorize the documents.  The computer analyzes the examples and extracts mathematical rules from them, which it then applies to the other documents in its population.  Furthermore, the accuracy, for example, Recall, of the categorization is judged by subject matter experts as well.  If the documents are junk, then they should not – or would not – be categorized as responsive by the subject matter expert during training or during assessment of accuracy. 
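
To make this division of labor concrete, here is a minimal sketch of the general idea using off-the-shelf tools (Python with scikit-learn).  The documents and labels are invented placeholders, and no commercial system is this simple, but the structure is the same: the expert supplies the judgments, and the machine generalizes from them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training examples, each labeled by the subject matter expert.
training_docs = [
    "board meeting minutes discussing the disputed contract",
    "fantasy football picks for this weekend",
    "draft settlement terms for the contract dispute",
    "cafeteria menu for the week of July 4",
]
labels = [1, 0, 1, 0]  # 1 = responsive, 0 = non-responsive (the expert's call)

# The machine extracts a mathematical rule from the expert's examples...
vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(training_docs), labels)

# ...and applies that rule to the rest of the collection.
unreviewed = ["email forwarding the revised contract language"]
scores = model.predict_proba(vectorizer.transform(unreviewed))[:, 1]
print(scores)  # probability-like score that the document is responsive
```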

In a recent matter, we had access to the judgments made by a professional team of reviewers following a keyword culling process.  Of the documents identified by the team as responsive, the primary attorneys on the case selected only 30% as meeting their standard for responsiveness. 

By comparison, in the Global Aerospace matter (Global Aerospace, Inc., et al.  v. Landow Aviation, L.P., et al., Loudon Circuit Court Case #CL 61040 (VA)), we found that about 81% of the documents recommended by the predictive coding system were found to be responsive. 

In recent filings in the Actos matter (In Re: Actos (Pioglitazone) Product Liability Litigation (United States District Court of Louisiana MDL No.6:11-md-2299), available from PACER), an estimated 64% of the predicted responsive documents (over a large score range) were found to be responsive.

Based on evidence like this, it is extremely difficult to claim that humans are good, but only too expensive.  The evidence is to the contrary.  Unless special steps are taken, human review teams are inconsistent and of only moderate accuracy. 

Britton claims that the risk of finding more documents with lower responsiveness and fewer documents with high responsiveness is greater when using quantification than when using human readers. As mentioned, this concern is unfounded and unsupported by any evidence of which I am aware.  A good manual review process may yield a good set of documents, but there is absolutely no evidence that even this good review would be better than a good predictive coding process.  To the contrary, there is evidence that a well-run predictive coding process does better than a well-run manual review process.

If there is no evidence that manual review is superior to predictive coding, then Britton’s suggestions of alternative review methodologies are not likely to be of any use.  They are based on false premises and fundamental misunderstandings.  They are likely to cost clients a great deal more money and yield poorer results for both parties.

eDiscovery practitioners may disagree about some of the methods we use to conduct and analyze predictive coding, but the bottom line seems clear.  Predictive coding, when done well, is effective for identifying responsive and non-responsive documents.  It is less expensive, more accurate, and typically faster than other forms of first-pass review.  The differences among methods may yield differences in accuracy or speed, but these differences are small when compared to manual review.  Predictive coding is a game-changer in eDiscovery.  It is here to stay and likely to take on an ever-increasing portion of the burden of eDiscovery.


Monday, July 1, 2013

Adventures in Predictive Coding



Ralph Losey, in a series of blogs, has painstakingly chronicled how he used two different methods, which he called a “multimodal hybrid” approach and a “monomodal hybrid” approach, to identify the responsive documents from the Enron set.  He recently summarized his results here and here.

His descriptions are very detailed and often amusing.  They provide a great service to the field of eDiscovery.  He wants to argue that the multimodal hybrid approach is the better of the two, but his results do not support this conclusion.  In fact, his two approaches show no statistically significant differences.  I will explain.

The same set of 699,082 documents was considered in both exercises, and both started with a random sample to identify a prevalence rate — the proportion of responsive documents.  In both exercises the random sample estimated that about a quarter of a percent or less of the documents were responsive (0.13% vs. 0.25% for the multimodal and monomodal exercises respectively, corresponding to an estimate of 928 vs. 1773 responsive documents in the whole collection).  Combining these assessments of prevalence with a third one, Losey estimates that 0.18% of the documents were responsive.  That’s less than one-fifth of one percent, or 1.8 responsive documents per thousand.  In my experience, that is a dramatically sparse set of documents.

These are the same documents in the two exercises, so it is not possible that they actually contained different numbers of responsive documents.  Was the almost 2:1 prevalence difference between the two exercises due to chance (sampling variation), was it due to changes in Losey’s standards for identifying responsive documents, or was it due to something else?  My best guess is that the difference was due to chance.

By chance, different samples from the same population can yield different estimates.  If you flip a coin, for example, on average, half of the flips will come out heads and half will come out tails.  The population consists of half heads and half tails, but any given series of flips may have more or fewer heads than another.  The confidence interval tells us the range of likely proportions.  Here are two samples from a series of coin flips. 

H T H T H H H T H H

H T T H H T T H T T

The first sample consisted of 7 Heads and 3 Tails.  The second sample consisted of 4 Heads and 6 Tails.  Were these samples obtained from different coins?  One sample is 70% Heads and the other is 40% Heads.  I happen to know that the same coin was used for all flips, and that, therefore, we can attribute the difference to chance.  With samples of 10 flips, the 95% confidence interval extends from 0.2 (2 Heads) to 0.8 (8 Heads).  Although these two samples resulted in numerically different values, we would not be justified in concluding that they were obtained from coins with different biases (different likelihoods of coming up Heads).  Statistical hypothesis testing allows us to examine this kind of question more systematically.
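
The 0.2-to-0.8 interval is easy to check with a few lines of exact binomial arithmetic; nothing here is specific to documents, it is simply the distribution of a fair coin.

```python
from math import comb

n = 10
pmf = [comb(n, k) / 2 ** n for k in range(n + 1)]  # exact fair-coin probabilities

# Probability of seeing between 2 and 8 heads in 10 flips of a fair coin.
print(sum(pmf[2:9]))      # ~0.98, so 2 to 8 heads covers at least 95% of outcomes
# Probability of a split at least as extreme as 7 heads / 3 tails.
print(2 * sum(pmf[7:]))   # ~0.34, so 7 heads in 10 flips is entirely unremarkable
```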

Statistical significance means that the difference in results is unlikely to have occurred by chance. 

I analyzed the prevalence rates in Losey’s two exercises to see whether the observed difference could reasonably be attributed to chance variation.  Both are based on a large sample of documents.  Using a statistical hypothesis test called a “Z-test of proportions,” it turns out that the difference is not statistically significant.  The difference in prevalence estimates could reasonably have come about by chance.  Two random samples from the same population could, with a reasonably high likelihood, produce a difference this large or larger by chance.
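
For the record, here is a sketch of the two-proportion Z-test computation.  The counts below are placeholders chosen only to illustrate the mechanics; the actual counts come from Losey’s random samples.

```python
from math import erf, sqrt

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided Z-test for the difference between two independent proportions:
    x1 responsive documents in a random sample of n1, x2 responsive in a sample of n2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                        # pooled proportion under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))  # standard error of the difference
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided normal tail area
    return z, p_value

# Placeholder counts (not Losey's): 3 responsive out of 2,000 vs. 5 out of 2,000.
print(two_proportion_z_test(3, 2000, 5, 2000))  # z ~ -0.7, p ~ 0.48: not significant
```

The same calculation applies to the Elusion comparisons later in this post; only the counts change.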

By ordinary scientific standards, if we want to conclude that one score is greater than another, we need to show that the difference between the two scores is greater than could be expected from sampling variation.  As we know, scores derived from a sample always have a confidence interval or margin of error around them.  With a 95% confidence level, the confidence interval is the range of scores where, 95% of the time, the true population level for that score (the true proportion of responsive documents in the whole collection) would be found.  The so-called null hypothesis is that the real difference between these two exercises is 0.0 and the observed difference is due only to sampling error, that is, to chance variations in the samples.  The motivated hypothesis is that the difference is real.

  • Null Hypothesis: There is no reliable difference between scores
  • Motivated Hypothesis: The two scores differ reliably from one another
  
Under the null hypothesis, the difference between scores also has a confidence interval around 0.0.  If the size of the difference is outside of the confidence interval, then we can say that the difference is (statistically) significant.  The probability is less than 5% that the difference was drawn from a distribution centered around 0.0.  If this difference is sufficiently large, then we are justified in rejecting the null hypothesis.  The difference is unexpected under the null hypothesis.  Then we can say that the difference is statistically significant or reliable.

On the other hand, if the magnitude of the difference is within the confidence interval, then we cannot say that the difference is reliable.  We fail to reject the null hypothesis, and we may say that we accept the null hypothesis.  Differences have to be sufficiently large to reject the null hypothesis or we say that there was no reliable difference.  “Sufficiently large” here means “outside the confidence interval.”  The bias in most of science is to assume that the null hypothesis is the better explanation unless we can find substantial evidence to the contrary.
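
The same idea can be expressed as an interval rather than a p-value: under the null hypothesis, how far apart could two sample proportions plausibly be just from sampling error?  A sketch, again with placeholder counts:

```python
from math import sqrt

def null_interval_for_difference(x1, n1, x2, n2, z_crit=1.96):
    """Range of differences (p1 - p2) expected 95% of the time if both random
    samples were drawn from populations with the same underlying proportion."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return -z_crit * se, z_crit * se

# Placeholder counts again: 3 responsive of 2,000 sampled vs. 5 of 2,000.
lo, hi = null_interval_for_difference(3, 2000, 5, 2000)
observed = 3 / 2000 - 5 / 2000
print((lo, hi), observed, lo <= observed <= hi)  # difference falls inside: not significant
```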

Although the difference in estimated prevalences for the two exercises is numerically large (almost double), my analysis reveals that differences this large could easily have come about by chance due to sampling error.  The difference in prevalence proportions is well within the confidence interval, assuming that there is really no difference.  My analysis does not prove that there was no difference, but it shows that these results do not support the hypothesis that there was one.  The difference in estimated prevalence between the two exercises is potentially troubling, but the fact that it could have arisen by chance means that our best guess is that there really was no systematic difference to explain. 

We knew from the fact that the same data were used in both exercises that we should not expect a real difference in prevalence between the two assessments, so this failure to find a reliable difference is consistent with our expectations.  On the other hand, Losey conducted the exercises with the intention of finding a difference in the accuracy of the two approaches.  We can apply the same logic to looking for these differences.

We can assess the accuracy of predictive coding with several different measures.  The emphasis in Losey’s approaches is to find as many of the responsive documents as possible.  One measure of this goal is called Recall.  Of all of the responsive documents in the collection, how many were identified by the combination of user and predictive coding?  To assess Recall directly, we would need a sample of responsive documents.  This sample would have to be of sufficient size to allow us to compare Recall under each approach.  Unfortunately, those data are not available.  We would need a sample of, say, 400 responsive documents to estimate Recall directly.  We cannot use the predictive coding to find those responsive documents, because that is exactly what we are trying to measure. We need to find an independent way of estimating the total number of responsive documents.
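
The “say, 400” figure is just the familiar margin-of-error arithmetic for a simple random sample; here is a back-of-the-envelope version, using the worst-case proportion of 0.5.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z_crit=1.96):
    """Half-width of a 95% confidence interval for a proportion estimated
    from a simple random sample of size n (p = 0.5 is the worst case)."""
    return z_crit * sqrt(p * (1 - p) / n)

print(margin_of_error(400))   # ~0.049: Recall estimated to within about +/- 5 points
print(margin_of_error(1600))  # ~0.0245: quadrupling the sample only halves the margin
```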

We could try to estimate Recall from a combination of our prevalence estimate and the number of responsive documents identified by the method, but since the estimate of prevalence is so substantially different, it is not immediately obvious how to do so.  If the two systems returned the same number of documents, our estimate of recall would be much lower for the monomodal method than for the multimodal method because the prevalence was estimated to be so much higher for the monomodal method.

Instead, I analyzed the Elusion measures for the two exercises.  Elusion is a measure of categorization accuracy that is closely (but inversely) related to Recall.  Specifically, it is a measure of the proportion of false negatives among the documents that have been classified as non-responsive (documents that should have been classified as responsive, but were incorrectly classified).  An effective predictive coding exercise will have very low false negative rates, and therefore very low Elusion, because all or most of the responsive documents have been correctly classified.  Low Elusion relative to Prevalence corresponds to high Recall.
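
The inverse relationship can be written down directly: Prevalence tells us roughly how many responsive documents exist, Elusion tells us how many are hiding among the documents classified as non-responsive, and the two together imply a Recall estimate.  A sketch with invented numbers (not Losey’s):

```python
def recall_from_elusion(prevalence, elusion, collection_size, discard_size):
    """Recall implied by a prevalence estimate and an elusion estimate.
    Responsive documents overall:  prevalence * collection_size
    Responsive documents missed:   elusion * discard_size  (the false negatives)
    Recall = 1 - missed / total responsive."""
    total_responsive = prevalence * collection_size
    missed = elusion * discard_size
    return 1 - missed / total_responsive

# Invented illustration: 0.25% prevalence, 0.05% elusion, 700,000 documents,
# 650,000 of which were classified as non-responsive.
print(recall_from_elusion(0.0025, 0.0005, 700_000, 650_000))  # ~0.81
```

When Elusion is nearly as large as Prevalence, as in the exercises discussed below, the implied Recall is correspondingly low.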

Because both exercises involved the same set of documents, their true (as opposed to their observed) Prevalence rates should be the same.  If one process were truly more accurate than the other, then the two should differ in the proportion of responsive documents that they fail to identify.  By his prediction, Losey’s multimodal method should have lower Elusion than his monomodal method.  That seems not to be the case.

Elusion for the monomodal method was numerically lower than for the multimodal method.  A Z-test for the difference in the two Elusion proportions (0.00094 vs. 0.00085) also fails to reach significance.  The analysis reveals that the difference between these two Elusion values could also have occurred by chance.  The proportions of false negatives in the two exercises were not reliably different from one another.  Contrary to Ralph’s assertion, we are not justified in concluding from these exercises that there was a difference in their success rates.  So his claim that the multimodal method is better than the monomodal method is unsupported by these data.

Finally, I compared the prevalence in each exercise with its corresponding Elusion proportion, again using a Z-test for proportions.  If predictive coding has been effective, then we should observe that Elusion is only a small fraction of prevalence.  Prevalence is our measure of the proportion of documents in the whole collection that are responsive.  Elusion is our measure of the proportion of actually responsive documents in the set that have been categorized as non-responsive.  If we have successfully identified the responsive documents, then they would not be in the Elusion set, so their proportion should be considerably lower in the Elusion set than in the initial random sample drawn from the whole collection.

Losey would not be surprised, I think, to learn that in the monomodal exercise, there was no significant difference between estimated prevalence (0.0025) and estimated Elusion (0.00085).  Both proportions could have been drawn from populations with the same proportion of responsive documents.  The monomodal method was not successful, according to this analysis, at identifying responsive documents.

What might be more surprising, however, was that there was also no significant difference between prevalence and Elusion in the multimodal exercise (0.0013 vs. 0.00094).  Neither exercise found a significant number of responsive documents.  There is no evidence that predictive coding added any value in either exercise.  Random sampling before and after the exercises could have produced differences larger than the ones observed without employing predictive coding or any other categorization technique in the middle.  Predictive coding in these exercises did not remove a significant number of responsive documents from the collection.  A random sample was just as likely to contain the same number of responsive documents before predictive coding as after predictive coding.

Far from concluding that the multimodal method was better than the monomodal method, these two exercises cannot point to any reliable effect of either method.  Not only did the methods not produce reliably different results from one another, but it looks like they had no effect at all.  All of the differences between exercises can be attributed to chance, that is, sampling error.  We are forced to accept the null hypotheses that there were no differences between methods and no effect of predictive coding.  Again, we did not prove that there were no differences, only that there were no reliable differences to be found in these exercises.

These results may not be widely indicative of those that would be found in other predictive coding uses.  Other predictive coding exercises do find significant differences of the kind I looked for here.

From my experience, this situation is an outlier.  These data may not be representative of predictive coding problems; for example, they are very sparse.  Prevalence near zero left little room for Elusion to be lower.  Less than a quarter of a percent of the documents in the collection were estimated to be responsive.  In the predictive coding matters I have dealt with, Prevalence is typically considerably higher.  These exercises may not be indicative of what you can expect in other situations or with other predictive coding methods.  These results are not typical.  Your results may vary.

Alternatively, it is possible that predictive coding worked well, but that we do not have enough statistical power to detect it.  The confidence interval of the difference, just like any other confidence interval, narrows with larger samples.  It could be that larger samples might have found a difference.  In other words, we cannot conclude that there was no difference; the best we can do is to conclude that there was insufficient evidence of a difference. 
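
The narrowing is easy to see numerically.  The proportion below is arbitrary; the point is only the effect of sample size on the width of the interval.

```python
from math import sqrt

# Half-width of a 95% confidence interval for a proportion of 0.002
# (roughly the prevalence range at issue) at several sample sizes.
for n in (500, 1500, 5000, 20000):
    half_width = 1.96 * sqrt(0.002 * 0.998 / n)
    print(n, round(half_width, 5))
# Larger samples -> narrower intervals -> smaller differences become detectable.
```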

But, if we cannot be confident of a difference, we cannot be confident that one method was better than the other.  At the same time, we also cannot conclude that some other exercises might not find differences.  Accepting the null hypothesis is not the same as proving it.

We cannot conclude that predictive coding or the technology used in these exercises does not work.  Many other factors could affect the efficacy of the tools involved. 

For the predictive coding algorithms to work, they need to be presented with valid examples of responsive and non-responsive documents.  The algorithms do not care how those examples were obtained, provided that they are representative.  The most important decisions, then, are the document decisions that go into making the example sets. 

Losey’s two methods differ (among other things) in terms of who chooses the examples that are presented to the predictive coding algorithms.  Losey’s multimodal method uses a mix of machine and human selection.  His monomodal method, which he pejoratively calls the “Borg” method, has the computer select documents for his decision.  In both cases, it was Losey making the only real decisions that the algorithms depend on — whether documents are responsive or non-responsive.  Losey may find search decisions more interesting than document decisions, but search decisions are secondary and a diversion from the real choices that have to be made.  He may feel as though he is being more effective by selecting the document to judge for responsiveness, but that feeling is not supported by these results.  Evaluating his hypotheses will have to await a situation where we can point to reliable differences in the occurrence of responsive documents before and after running predictive coding and reliable differences between the results of the two methods.

Predictive coding is not the only way to learn about the documents.  eDiscovery often requires exploratory data analysis, to know what we have to work with, what kind of vocabulary people used, who the important participants are, and so on.  These are questions that are not easily addressed with predictive coding.  We need to engage search and other analytics to address these questions.  They are not a substitute for predictive coding, but a necessary part of preparing for eDiscovery.  Predictive coding is not designed to replace all forms of engagement with the data, but rather to make document categorization easier, more cost effective, and more accurate.

Not every attorney is as skilled at or as interested in searching as Losey is.  However the example documents are chosen, the critical judgments are the legal decisions about whether specific documents are responsive or not.  Those judgments may not be glamorous, but they are critical to the justice system and to the performance of predictive coding.  Despite rather substantial effort, nothing in his exercises would lead us to conclude that either method was better than the other.