As a reminder, my post was in response to Ralph Losey’s blogs in which he chronicled how he used two different methods to identify the responsive documents from the Enron set. My post demonstrated how, statistically, despite his substantial effort, nothing in his exercises would lead us to conclude definitively that either method was better than the other.

## Cormack

Gordon Cormack has argued that a statistical test with more power than the one I used might be able to find a significant difference between Losey’s two methods. He is correct that, in theory, some other test might, indeed, find significance. He is incorrect, however, about the test that he chose to use, which he says is the “sign test,” and in his application of this test. Although it is a fairly minor point, in light of the fact that neither method yielded a statistically significant number of responsive documents relative to simply randomly selecting documents, it may still be instructive to discuss the statistical methods that were used to compare the processes.

Each statistical hypothesis test is based on a certain set of assumptions. If those assumptions are not met, then the test will yield misleading or inappropriate inferences. The numbers may look right, but if the process that generated those numbers is substantially different in a statistical sense from the one assumed by the test, then erroneous conclusions are likely.

According to Wikipedia, the sign test “can be used to test the hypothesis that there is ‘no difference in medians’ between the continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X and Y.” To translate that, these are the assumptions of the sign test:

- Compare two measures, one called X and one called Y.
- Compare paired items. At a minimum, the two measurements have to be of items that can be matched between measurements. That is, for each item in measurement 1, there is one and only one item in measurement 2 against which it is compared. This pairing must be specified before the measurements are taken. Usually these are the same items, measured two times, once before a treatment and once after a treatment.
- Compare the two sets of measurements using their median. The median is the middle score, where half of the scores are above the median and half are below.
- The two measures must have continuous distributions. In order to compare medians, some scores have to be higher than others. For each of the items being measured (twice), we have to be able to decide whether the second measurement is greater than the first, less than the first, or equal to the first.
- The sets of measurements must be random variables. They have to be derived from a random sampling process. Typically this means that the items that are being measured twice are selected through random sampling.
- The number of observations must be fixed prior to computing the statistic. The binomial rule applies within a particular sample of a particular size.

For each pair of measures, we examine whether the second measurement is less than the first, greater than the first or equal to the first. In other words, we count the number of times the comparison comes up with a positive sign, a negative sign, or no sign. That’s why the test is called a sign test.

Under the null hypothesis, the proportion of positive signs should be about 0.5 and we use the binomial distribution to assess that hypothesis. The probability of a certain number of plus or minus signs can be found by the binomial rule based on a probability equal to 0.5 and a certain number of observations.
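As a minimal sketch of the mechanics just described (not of any analysis in this exchange), the following Python computes a two-sided sign-test p-value from counts of positive and negative signs. The function name `sign_test_p_value` and the example counts are hypothetical, ties are assumed to have been dropped beforehand, and the result is meaningful only when the assumptions listed above actually hold.

```python
from math import comb


def sign_test_p_value(n_plus: int, n_minus: int) -> float:
    """Two-sided sign-test p-value; tied pairs are assumed to be excluded already."""
    n = n_plus + n_minus
    if n == 0:
        raise ValueError("need at least one non-tied pair")
    k = min(n_plus, n_minus)
    # Probability of seeing k or fewer of the rarer sign under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1
    return min(1.0, 2 * tail)


# Hypothetical counts, for illustration only: 8 positive signs, 2 negative signs
print(sign_test_p_value(8, 2))
```

The number of observations `n` enters the calculation directly, which is why it has to be fixed by the sampling design rather than determined by the outcome being tested.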

Cormack’s analysis meets the first and second requirements. The same documents can be compared in both exercises.

It does not meet the third or fourth requirements. There is no median measurement to deal with in Cormack’s analysis. The two measurements (multimodal vs. monomodal) do not yield a score, only a yes/no decision, so we cannot judge whether one is greater than the other. Without a pair of continuous measurements, the test reduces to a simple binomial. If the other assumptions of a binomial were met, we could still use that to assess the significance of the comparison.

The analysis does not meet requirement 5. There is nothing random about the selection of these documents. They are not a sample from a larger population; they are the entire population of responsive documents for which the two exercises disagreed.

Cormack’s analysis also fails requirement 6. There is no sample size. The binomial distribution assumes that we have a sample consisting of a fixed number of observations. In Cormack’s analysis, on the other hand, the number of observations is determined by the number of documents observed to be responsive by one method or the other. This is circular. It’s not a sample because the documents were not randomly selected, and the binomial distribution does not apply.

Because Cormack's analysis does not meet the prerequisites of the sign test or the binomial, these tests cannot be used to assess significance. The numbers have the right appearance for computing a proportion, but unless the proportion was created in the way assumed by the statistical test, it cannot be used validly to determine whether the pattern of results is different from what you would expect by chance.

The same kind of statistical process must produce both the observations and the chance predictions. Nominally, we are testing whether a random process with one probability (0.5) is different from a random process with a different probability (0.56). In reality, however, we are comparing a random process (like a coin flip) against a different, non-random process. If it is not a random process generating the numbers, then the binomial rule cannot be used to estimate its distribution. If it is not a sample drawn from a larger population, then there is no distribution to begin with.

We may know the probability of a fair coin being flipped coming up with the corresponding proportion of heads, but, because Cormack’s analysis fails to preserve the prerequisites of a binomial, this probability is just irrelevant. We know how coin flips should be distributed according to chance, but we do not have a distribution for numbers of documents generated exclusively by the multimodal and monomodal exercises. They’re just numbers.

To be sure, the multimodal approach identified numerically more documents as responsive than did the monomodal approach. We cannot say whether that difference is reliable (significant), but if it were, would that mean that the multimodal method was better? Can we attribute that difference to the methods used? Not necessarily. A process that claims that more documents are responsive, whether correct or incorrect, would show a similar difference. A process that was more stringent about what should be called a responsive document would also yield a comparable difference.

As Losey noted, his criteria for judging whether a document was responsive may have changed between the multimodal and the monomodal exercises. “The SME [Losey] had narrowed his concept of relevance [between the two exercises].” The only measure we have of whether a document was truly responsive was Losey’s judgment. If his standard changed, then fewer documents were relevant in the second exercise, so the fact that fewer were predicted to be responsive could be simply a product of his judgment, not of the method. So, even if we found that the difference in the number of documents was significant, we still cannot attribute that difference with confidence to the method used to find them.

Knowing only the number of responsive documents found, we cannot know whether that difference was due to a difference in method or a difference in the standard for categorizing a document responsive.

## Britton

Gerard Britton raises the argument that Recall is not a good measure of the completeness of predictive coding. He notes, correctly, that documents may differ in importance, but then somehow concludes that, therefore, Recall is uninformative. He offers no alternative measures, but rather seems to believe that human reviewers somehow will automagically be better than the computer at recognizing more-responsive documents. He claims without evidence, and, I would argue, contrary to the available evidence, that humans and computers categorize documents in fundamentally different ways.

Although it is possible that humans and computers make systematically different judgments about the responsiveness of documents, there is no evidence to support this claim. It is a speculation. Britton treats this purported difference as a fact and then claims that it is unacceptable that this supposed fact is then ignored. For computers to match or exceed people in overall accuracy while still doing worse on the important documents, they would have to be more accurate than human reviewers at finding what Britton calls the “meh” documents and poorer at finding the “hot” documents. TREC has some evidence to the contrary, which Cormack mentioned.

If the ability of a system to predict hot vs meh documents is a concern, then one can examine this capability directly. One could train a predictive coding system on just hot documents, for example, and examine its performance on this set. One could separately train on documents that are not so hot, but still responsive. One would have to also study how teams of human reviewers do on these documents and I think that the results would not support Britton’s suppositions.

The potential difference in standards for accepting whether a document is in the positive or negative group does not invalidate the measure of Recall. Recall is just a count (a proportion, actually) of the documents that meet one’s criteria. What those criteria are is completely up to the person doing the measurement. To say that Recall is invalidated because there can be multiple criteria is a fundamental misunderstanding.
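To make the point concrete, here is a minimal sketch of Recall as a proportion, measured against two different criteria. The function name `recall`, the document IDs, and the split into "hot" and merely responsive documents are all hypothetical and chosen only for illustration.

```python
def recall(found: set[str], meets_criteria: set[str]) -> float:
    """Fraction of the documents meeting the chosen criteria that the process found."""
    if not meets_criteria:
        raise ValueError("no documents meet the criteria")
    return len(found & meets_criteria) / len(meets_criteria)


# Hypothetical document IDs, for illustration only
found = {"d1", "d2", "d3", "d5"}           # documents the process marked responsive
responsive = {"d1", "d2", "d3", "d4"}      # broad criterion: any responsive document
hot = {"d1", "d2"}                         # narrow criterion: only the "hot" documents

print(recall(found, responsive))  # 0.75
print(recall(found, hot))         # 1.0
```

The same function serves either criterion. Changing the criterion changes what is being counted, not whether the measure is valid.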

Predictive coding is based on human judgment. The essence of predictive coding is that a human subject matter expert provides examples of responsive and non-responsive documents. An authoritative reviewer, such as Losey, determines what standards to apply to categorize the documents. The computer analyzes the examples and extracts mathematical rules from them, which it then applies to the other documents in its population. Furthermore, the accuracy, for example, Recall, of the categorization is judged by subject matter experts as well. If the documents are junk, then they should not – or would not – be categorized as responsive by the subject matter expert during training or during assessment of accuracy.

In a recent matter, we had access to the judgments made by a professional team of reviewers following a keyword culling process. Of the documents identified by the team as responsive, the primary attorneys on the case selected only 30% as meeting their standard for responsiveness.

By comparison, in the Global Aerospace matter (Global Aerospace, Inc., et al. v. Landow Aviation, L.P., et al., Loudoun Circuit Court Case #CL 61040 (VA)), we found that about 81% of the documents recommended by the predictive coding system were found to be responsive.

In recent filings in the Actos matter (In Re: Actos (Pioglitazone) Product Liability Litigation (United States District Court of Louisiana MDL No. 6:11-md-2299), available from PACER), an estimated 64% of the predicted responsive documents (over a large score range) were found to be responsive.

Based on evidence like this, it is extremely difficult to claim that humans are good, but only too expensive. The evidence is to the contrary. Unless special steps are taken, human review teams are inconsistent and of only moderate accuracy.

Britton claims that the risk of finding more documents with lower responsiveness and fewer documents with high responsiveness is greater when using quantification than when using human readers. As mentioned, this concern is unfounded and unsupported by any evidence of which I am aware. A good manual review process may yield a good set of documents, but there is absolutely no evidence that even this good review would be better than a good predictive coding process. To the contrary, there is evidence that a well-run predictive coding process does better than a well-run manual review process.

If there is no evidence that manual review is superior to predictive coding, then Britton’s suggestions of alternative review methodologies are not likely to be of any use. They are based on false premises and fundamental misunderstandings. They are likely to cost clients a great deal more money and yield poorer results for both parties.

eDiscovery practitioners may disagree about some of the methods we use to conduct and analyze predictive coding, but the bottom line seems clear. Predictive coding, when done well, is effective for identifying responsive and non-responsive documents. It is less expensive, more accurate, and typically faster than other forms of first-pass review. The differences among methods may yield differences in accuracy or speed, but these differences are small when compared to manual review. Predictive coding is a game-changer in eDiscovery. It is here to stay and likely to take on an ever-increasing portion of the burden of eDiscovery.

I suppose that a screaming red headline is the blog equivalent of pounding on the table.

I stand by what I wrote. For more explanation, please refer to Chapter 12, Section 12.3 of my book: http://www.amazon.com/Information-Retrieval-Implementing-Evaluating-Engines/dp/0262026511

Herb,

The question of statistical significance arises when we ask whether a difference observed on a sample can reliably be attributed to the population, too. In Ralph's exercise, either there is no sample, or there is a sample size of one; in neither case is it meaningful to talk of statistical significance.

There is no sample, in that Ralph is working on the full collection, not on a sample of it. If the multi-modal method found more responsive documents than the mono-modal, then that is that. (As you point out, there is disagreement over relevance between the two productions; but that is not a statistical question.)

Alternatively, there is a sample size of one, if we regard Ralph's exercise as a "sample" of all the possible production exercises that could be carried out, comparing multi- and mono-modal retrieval. But, as you'll be aware, one can essentially infer nothing in a statistical sense from a sample size of one.

Ralph's study is best viewed as a case study, into the main findings of which statistical methods do not enter (though they may help illuminate sub-questions).

William