Information Discovery

Tuesday, July 23, 2013

Further Adventures in Predictive Coding

The commentators on my recent post about Adventures in Predictive Coding raise a number of interesting issues that I would like to explore further. I have already responded to Webber’s comments. I would like to comment further on the statistics suggested by Cormack and visit a few issues raised by Britton.

As a reminder, my post was in response to Ralph Losey’s blogs in which he chronicled how he used two different methods to identify the responsive documents from the Enron set. My post demonstrated how, statistically, despite his substantial effort, nothing in his exercises would lead us to conclude definitively that either method was better than the other.

Cormack

Gordon Cormack has argued that a statistical test with more power than the one I used might be able to find a significant difference between Losey’s two methods. He is correct that, in theory, some other test might, indeed, find significance. He is incorrect, however, about the test that he chose to use, which he says is the “sign test,” and in his application of this test. Although it is a fairly minor point, in light of the fact that neither method yielded a statistically significant number of responsive documents relative to simply randomly selecting documents, it may still be instructive to discuss the statistical methods that were used to compare the processes.

Each statistical hypothesis test is based on a certain set of assumptions. If those assumptions are not met, then the test will yield misleading or inappropriate inferences. The numbers may look right, but if the process that generated those numbers is substantially different in a statistical sense from the one assumed by the test, then erroneous conclusions are likely.

According to Wikipedia, the sign test “can be used to test the hypothesis that there is ‘no difference in medians’ between the continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X and Y.” To translate that, these are the assumptions of the sign test:

1. Compare two measures, one called X and one called Y.
2. Compare paired items. At a minimum, the two measurements have to be of items that can be matched between measurements. That is, for each item in measurement 1, there is one and only one item in measurement 2 against which it is compared. This pairing must be specified before the measurements are taken. Usually these are the same items, measured two times, once before a treatment and once after a treatment.
3. Compare the two sets of measurements using their median. The median is the middle score, where half of the scores are above the median and half are below.
4. The two measures must have continuous distributions. In order to compare medians, some scores have to be higher than others. For each of the items being measured (twice), we have to be able to decide whether the second measurement is greater than the first, less than the first, or equal to the first.
5. The sets of measurements must be random variables. They have to be derived from a random sampling process. Typically this means that the items that are being measured twice are selected through random sampling.
6. The number of observations must be fixed prior to computing the statistic. The binomial rule applies within a particular sample of a particular size.

For each pair of measures, we examine whether the second measurement is less than the first, greater than the first or equal to the first. In other words, we count the number of times the comparison comes up with a positive sign, a negative sign, or no sign. That’s why the test is called a sign test.

Under the null hypothesis, the proportion of positive signs should be about 0.5 and we use the binomial distribution to assess that hypothesis. The probability of a certain number of plus or minus signs can be found by the binomial rule based on a probability equal to 0.5 and a certain number of observations.
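To make the binomial rule concrete, here is a minimal sketch of how a sign test p-value is computed when the assumptions above do hold. The function name and the counts are hypothetical, chosen only for illustration; they are not taken from Losey's or Cormack's data.

```python
from math import comb


def sign_test_p_value(n_plus: int, n_minus: int) -> float:
    """Two-sided sign test p-value. Ties are assumed to be dropped already."""
    n = n_plus + n_minus          # number of non-tied pairs, fixed in advance
    k = min(n_plus, n_minus)      # the smaller of the two sign counts
    # Probability, with p = 0.5, of a split at least this lopsided in one direction.
    one_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double it for a two-sided test, capping at 1.0.
    return min(1.0, 2 * one_tail)


# Hypothetical example: 14 pairs favored the second measurement, 6 the first.
print(sign_test_p_value(14, 6))
```

The calculation is only meaningful if the pairs were sampled at random and the number of pairs was fixed before the signs were counted.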

Cormack’s analysis meets the first and second requirements. The same documents can be compared in both exercises.

It does not meet the third or fourth requirements. There is no median measurement to deal with in Cormack’s analysis. The two measurements (multimodal vs. monomodal) do not yield a score, only a yes/no decision, so we cannot judge whether one is greater than the other. Without a pair of continuous measurements, the test reduces to a simple binomial. If the other assumptions of a binomial were met, we could still use that to assess the significance of the comparison.

The analysis does not meet requirement 5. There is nothing random about the selection of these documents. They are not a sample from a larger population; they are the entire population of responsive documents for which the two exercises disagreed.

Cormack’s analysis also fails requirement 6. There is no sample size. The binomial distribution assumes that we have a sample consisting of a fixed number of observations. In Cormack’s analysis, on the other hand, the number of observations is determined by the number of documents observed to be responsive by one method or the other. This is circular. It’s not a sample because the documents were not randomly selected, and the binomial distribution does not apply.

Because Cormack's analysis does not meet the prerequisites of the sign test or the binomial, these tests cannot be used to assess significance. The numbers have the right appearance for computing a proportion, but unless the proportion was created in the way assumed by the statistical test, it cannot be used validly to determine whether the pattern of results is different from what you would expect by chance.

The same kind of statistical process must produce both the observations and the chance predictions. Nominally, we are testing whether a random process with one probability (0.5) is different from a random process with a different probability (0.56). In reality, however, we are comparing a random process (like a coin flip) against a different, non-random process. If it is not a random process generating the numbers, then the binomial rule cannot be used to estimate its distribution. If it is not a sample drawn from a larger population, then there is no distribution to begin with.

We may know the probability of a fair coin being flipped coming up with the corresponding proportion of heads, but, because Cormack’s analysis fails to preserve the prerequisites of a binomial, this probability is just irrelevant. We know how coin flips should be distributed according to chance, but we do not have a distribution for numbers of documents generated exclusively by the multimodal and monomodal exercises. They’re just numbers.

To be sure, the multimodal approach identified numerically more documents as responsive than did the monomodal approach. We cannot say that that difference is reliable (significant) or not, but if it were, does that mean that the multimodal method was better? Can we attribute that difference to the methods used? Not necessarily. A process that claims that more documents are responsive, whether correct or incorrect, would show a similar difference. A process that was more stringent about what should be called a responsive document would also yield a comparable difference.

As Losey noted, his criteria for judging whether a document was responsive may have changed between the multimodal and the monomodal exercises. “The SME [Losey] had narrowed his concept of relevance [between the two exercises].” The only measure we have of whether a document was truly responsive was Losey’s judgment. If his standard changed, then fewer documents were relevant in the second exercise, so the fact that fewer were predicted to be responsive could be simply a product of his judgment, not of the method. So, even if we found that the difference in the number of documents was significant, we still cannot attribute that difference with confidence to the method used to find them.

Knowing only the number of responsive documents found, we cannot know whether that difference was due to a difference in method or a difference in the standard for categorizing a document responsive.

Britton

Gerard Britton raises the argument that Recall is not a good measure of the completeness of predictive coding. He notes, correctly, that documents may differ in importance, but then somehow concludes that, therefore, Recall is uninformative. He offers no alternative measures, but rather seems to believe that human reviewers somehow will automagically be better than the computer at recognizing more-responsive documents. He claims without evidence, and, I would argue, contrary to the available evidence, that humans and computers categorize documents in fundamentally different ways.

Although it is possible that humans and computers make systematically different judgments about the responsiveness of documents, there is no evidence to support this claim. It is a speculation. Britton treats this purported difference as a fact and then claims that it is unacceptable that this supposed fact is then ignored. For computers to be at least as accurate as people overall and still fit Britton’s account, they would have to be more accurate than human reviewers at finding what Britton calls the “meh” documents and poorer at finding the “hot” documents. TREC has some evidence to the contrary, which Cormack mentioned.

If the ability of a system to predict hot vs meh documents is a concern, then one can examine this capability directly. One could train a predictive coding system on just hot documents, for example, and examine its performance on this set. One could separately train on documents that are not so hot, but still responsive. One would have to also study how teams of human reviewers do on these documents and I think that the results would not support Britton’s suppositions.

The potential difference in standards for accepting whether a document is in the positive or negative group does not invalidate the measure of Recall. Recall is just a count (a proportion, actually) of the documents that meet one’s criteria. What those criteria are is completely up to the person doing the measurement. To say that Recall is invalidated because there can be multiple criteria is a fundamental misunderstanding.

Predictive coding is based on human judgment. The essence of predictive coding is that a human subject matter expert provides examples of responsive and non-responsive documents. An authoritative reviewer, such as Losey, determines what standards to apply to categorize the documents. The computer analyzes the examples and extracts mathematical rules from them, which it then applies to the other documents in its population. Furthermore, the accuracy, for example, Recall, of the categorization is judged by subject matter experts as well. If the documents are junk, then they should not – or would not – be categorized as responsive by the subject matter expert during training or during assessment of accuracy.

In a recent matter, we had access to the judgments made by a professional team of reviewers following a keyword culling process. Of the documents identified by the team as responsive, the primary attorneys on the case selected only 30% as meeting their standard for responsiveness.

By comparison, in the Global Aerospace matter (Global Aerospace, Inc., et al. v. Landow Aviation, L.P., et al., Loudoun Circuit Court Case #CL 61040 (VA)), we found that about 81% of the documents recommended by the predictive coding system were found to be responsive.

In recent filings in the Actos matter (In Re: Actos (Pioglitazone) Product Liability Litigation (United States District Court of Louisiana MDL No.6:11-md-2299), available from PACER), an estimated 64% of the predicted responsive documents (over a large score range) were found to be responsive.

Based on evidence like this, it is extremely difficult to claim that humans are good, but only too expensive. The evidence is to the contrary. Unless special steps are taken, human review teams are inconsistent and of only moderate accuracy.

Britton claims that the risk of finding more documents with lower responsiveness and fewer documents with high responsiveness is greater when using quantification than when using human readers. As mentioned, this concern is unfounded and unsupported by any evidence of which I am aware. A good manual review process may yield a good set of documents, but there is absolutely no evidence that even this good review would be better than a good predictive coding process. To the contrary, there is evidence that a well-run predictive coding process does better than a well-run manual review process.

If there is no evidence that manual review is superior to predictive coding, then Britton’s suggestions of alternative review methodologies are not likely to be of any use. They are based on false premises and fundamental misunderstandings. They are likely to cost clients a great deal more money and yield poorer results for both parties.

eDiscovery practitioners may disagree about some of the methods we use to conduct and analyze predictive coding, but the bottom line seems clear. Predictive coding, when done well, is effective for identifying responsive and non-responsive documents. It is less expensive, more accurate, and typically faster than other forms of first-pass review. The differences among methods may yield differences in accuracy or speed, but these differences are small when compared to manual review. Predictive coding is a game-changer in eDiscovery. It is here to stay and likely to take on an ever-increasing portion of the burden of eDiscovery.

Monday, July 1, 2013

Adventures in Predictive Coding

Ralph Losey, in a series of blogs, has painstakingly chronicled how he used two different methods, which he called a “multimodal hybrid” approach and a “monomodal hybrid” approach, to identify the responsive documents from the Enron set. He recently summarized his results here and here.

His descriptions are very detailed and often amusing. They provide a great service to the field of eDiscovery. He wants to argue that the multimodal hybrid approach is the better of the two, but his results do not support this conclusion. In fact, his two approaches show no statistically significant differences. I will explain.

The same set of 699,082 documents was considered in both exercises, and both started with a random sample to identify a prevalence rate — the proportion of responsive documents. In both exercises the random sample estimated that about one quarter or less of a percent of the documents were responsive (0.13% vs. 0.25% for the multimodal and monomodal exercises respectively, corresponding to an estimate of 928 vs. 1773 responsive documents in the whole collection). Combining these assessments of prevalence with a third one, Losey estimates that 0.18% of the documents were responsive. That’s less than one fifth of one percent or 1.8 responsive documents per thousand. In my experience, that is a dramatically sparse set of documents.

These are the same documents in the two exercises, so it is not possible that they actually contained different numbers of responsive documents. Was the almost 2:1 prevalence difference between the two exercises due to chance (sampling variation), was it due to changes in Losey’s standards for identifying responsive documents, or was it due to something else? My best guess is that the difference was due to chance.

By chance, different samples from the same population can yield different estimates. If you flip a coin, for example, on average, half of the flips will come out heads and half will come out tails. The population consists of half heads and half tails, but any given series of flips may have more or fewer heads than another. The confidence interval tells us the range of likely proportions. Here are two samples from a series of coin flips.

H T H T H H H T H H

H T T H H T T H T T

The first sample consisted of 7 Heads and 3 Tails. The second sample consisted of 4 Heads and 6 Tails. Were these samples obtained from different coins? One sample is 70% Heads and the other is 40% Heads. I happen to know that the same coin was used for all flips, and that, therefore, we can attribute the difference to chance. With samples of 10 flips, the 95% confidence interval extends from 0.2 (2 Heads) to 0.8 (8 Heads). Although these two samples resulted in numerically different values, we would not be justified in concluding that they were obtained from coins with different biases (different likelihoods of coming up Heads). Statistical hypothesis testing allows us to examine this kind of question more systematically.
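The 0.2 to 0.8 range can be checked directly from the binomial distribution. The sketch below (the function name is my own) sums the probability of each possible number of Heads in 10 flips of a fair coin. The range from 2 to 8 Heads covers about 98% of outcomes, while the narrower range from 3 to 7 Heads covers only about 89%, so 2 to 8 is the narrowest symmetric range that reaches 95%.

```python
from math import comb


def prob_heads_between(n: int, low: int, high: int, p: float = 0.5) -> float:
    """Probability that n flips yield between low and high Heads, inclusive."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(low, high + 1)
    )


print(prob_heads_between(10, 2, 8))  # range used in the text
print(prob_heads_between(10, 3, 7))  # next narrower symmetric range
```

Both samples above, with 7 Heads and 4 Heads, fall inside the 2 to 8 range.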

Statistical significance means that the differences in results are unlikely to have occurred by chance.

I analyzed the prevalence rates in Losey’s two exercises to see whether the observed difference could reasonably be attributed to chance variation. Both are based on a large sample of documents. Using a statistical hypothesis test called a “Z-test of proportions,” it turns out that the difference is not statistically significant. The difference in prevalence estimates could reasonably have come about by chance. Two random samples from the same population could, with a reasonably high likelihood, produce a difference this large or larger by chance.
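For readers who want to see the mechanics, here is a minimal sketch of a two-sided Z-test for the difference between two proportions, using the pooled standard error. The function name and the counts are hypothetical (2 responsive documents in a sample of 1,500 versus 4 in a sample of 1,500); they are chosen to give proportions similar to those discussed here and are not the actual sample counts from Losey's exercises.

```python
from math import erfc, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between x1/n1 and x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    # Under the null hypothesis both samples share one proportion, so pool them.
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of the standard normal.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(2, 1500, 4, 1500)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these illustrative counts, one proportion is double the other, yet the p-value is far above 0.05. The normal approximation is rough when the counts are this small, so an exact test may be preferable in practice.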

By ordinary scientific standards, if we want to conclude that one score is greater than another, we need to show that the difference between the two scores is greater than could be expected from sampling variation. As we know, scores derived from a sample always have a confidence interval or margin of error around them. With a 95% confidence level, the confidence interval is the range of scores where, 95% of the time, the true population level for that score (the true proportion of responsive documents in the whole collection) would be found. The so-called null hypothesis is that the real difference between these two exercises is 0.0 and the observed difference is due only to sampling error, that is, to chance variations in the samples. The motivated hypothesis is that the difference is real.

Null Hypothesis: There is no reliable difference between scores
Motivated Hypothesis: The two scores differ reliably from one another

Under the null hypothesis, the difference between scores also has a confidence interval around 0.0. If the size of the difference is outside of the confidence interval, then we can say that the difference is (statistically) significant. The probability is less than 5% that the difference was drawn from a distribution centered around 0.0. If this difference is sufficiently large, then we are justified in rejecting the null hypothesis. The difference is unexpected under the null hypothesis. Then we can say that the difference is statistically significant or reliable.

On the other hand, if the magnitude of the difference is within the confidence interval, then we cannot say that the difference is reliable. We fail to reject the null hypothesis, and we may say that we accept the null hypothesis. Differences have to be sufficiently large to reject the null hypothesis or we say that there was no reliable difference. “Sufficiently large” here means “outside the confidence interval.” The bias in most of science is to assume that the null hypothesis is the better explanation unless we can find substantial evidence to the contrary.

Although the difference in estimated prevalences for the two exercises is numerically large (almost double), in fact, my analysis reveals that differences this large could easily have come about by chance due to sampling error. The difference in prevalence proportions is well within the confidence interval assuming that there is really no difference. My analysis does not prove that there was no difference, but it shows that these results do not support the hypothesis that there was a difference. The difference in estimated prevalence between the two exercises is potentially troubling, but the fact that it could have arisen by chance means that our best guess is that there really was no systematic difference to explain.

We knew from the fact that the same data were used in both exercises that we should not expect a real difference in prevalence between the two assessments, so this failure to find a reliable difference is consistent with our expectations. On the other hand, Losey conducted the exercises with the intention of finding a difference in the accuracy of the two approaches. We can apply the same logic to looking for these differences.

We can assess the accuracy of predictive coding with several different measures. The emphasis in Losey’s approaches is to find as many of the responsive documents as possible. One measure of this goal is called Recall. Of all of the responsive documents in the collection, how many were identified by the combination of user and predictive coding? To assess Recall directly, we would need a sample of responsive documents. This sample would have to be of sufficient size to allow us to compare Recall under each approach. Unfortunately, those data are not available. We would need a sample of, say, 400 responsive documents to estimate Recall directly. We cannot use the predictive coding to find those responsive documents, because that is exactly what we are trying to measure. We need to find an independent way of estimating the total number of responsive documents.
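A rough calculation shows why such a sample is hard to come by in a collection this sparse. If responsive documents can only be found by reviewing randomly selected documents, the expected number of documents to review is the target count divided by the prevalence. The sketch below uses the 0.18% combined prevalence estimate mentioned earlier; the function name is my own.

```python
from math import ceil


def expected_reviews(target_responsive: int, prevalence: float) -> int:
    """Expected number of random documents to review to find the target count."""
    return ceil(target_responsive / prevalence)


# 400 responsive documents at an estimated prevalence of 0.18%.
print(expected_reviews(400, 0.0018))
```

At that prevalence, finding 400 responsive documents by random sampling would mean reviewing roughly a third of the 699,082 documents in the collection.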

We could try to estimate Recall from a combination of our prevalence estimate and the number of responsive documents identified by the method, but since the estimate of prevalence is so substantially different, it is not immediately obvious how to do so. If the two systems returned the same number of documents, our estimate of recall would be much lower for the monomodal method than for the multimodal method because the prevalence was estimated to be so much higher for the monomodal method.

Instead, I analyzed the Elusion measures for the two exercises. Elusion is a measure of categorization accuracy that is closely (but inversely) related to Recall. Specifically, it is a measure of the proportion of false negatives among the documents that have been classified as non-responsive (documents that should have been classified as responsive, but were incorrectly classified). An effective predictive coding exercise will have very low false negative rates, and therefore very low Elusion, because all or most of the responsive documents have been correctly classified. Low Elusion relative to Prevalence corresponds to high Recall.
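The relationship between Elusion, Prevalence, and Recall can be sketched as simple arithmetic. The responsive documents that were missed are the Elusion rate times the size of the non-responsive (discard) set, and Recall is one minus the missed share of all responsive documents. The function name, the `discard_fraction` parameter, and the input values below are hypothetical and are not drawn from Losey's exercises.

```python
def recall_from_elusion(prevalence: float, elusion: float, discard_fraction: float) -> float:
    """Estimate Recall from Prevalence, Elusion, and the share of documents discarded.

    prevalence:       proportion of responsive documents in the whole collection
    elusion:          proportion of responsive documents in the discard set
    discard_fraction: discard set size divided by collection size
    """
    # Missed responsive documents, expressed as a proportion of the whole collection.
    missed = elusion * discard_fraction
    return 1.0 - missed / prevalence


# Hypothetical values: 10% prevalence, 1% elusion, 80% of documents discarded.
print(recall_from_elusion(0.10, 0.01, 0.80))
```

When Elusion is as high as Prevalence and nearly everything is discarded, the estimate falls to about zero, which is the situation in which categorization has added nothing.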

Because both exercises involved the same set of documents, their true (as opposed to their observed) Prevalence rates should be the same. If one process was truly more accurate than the other, then they should differ in the proportion of responsive documents that they fail to identify. By his prediction Losey’s multimodal method should have lower Elusion than his monomodal method. That seems not to be the case.

Elusion for the monomodal method was numerically lower than for the multimodal method. A Z-test for the difference in the two Elusion proportions (0.00094 for the multimodal exercise vs. 0.00085 for the monomodal exercise) also fails to reach significance. The analysis reveals that the difference between these two Elusion values could also have occurred by chance. The proportions of false negatives in the two exercises were not reliably different from one another. Contrary to Ralph’s assertion, we are not justified in concluding from these exercises that there was a difference in their success rates. So his claim that the multimodal method is better than the monomodal method is unsupported by these data.

Finally, I compared the prevalence in each exercise with its corresponding Elusion proportion, again using a Z-test for proportions. If predictive coding has been effective, then we should observe that Elusion is only a small fraction of prevalence. Prevalence is our measure of the proportion of documents in the whole collection that are responsive. Elusion is our measure of the proportion of actually responsive documents in the set that have been categorized as non-responsive. If we have successfully identified the responsive documents, then they would not be in the Elusion set, so their proportion should be considerably lower in the Elusion set than in the initial random sample drawn from the whole collection.

Losey would not be surprised, I think, to learn that in the monomodal exercise, there was no significant difference between estimated prevalence (0.0025) and estimated Elusion (0.00085). Both proportions could have been drawn from populations with the same proportion of responsive documents. The monomodal method was not successful, according to this analysis, at identifying responsive documents.

What might be more surprising, however, was that there was also no significant difference between prevalence and Elusion in the multimodal exercise (0.0013 vs. 0.00094). Neither exercise found a significant number of responsive documents. There is no evidence that predictive coding added any value in either exercise. Random sampling before and after the exercises could have produced differences larger than the ones observed without employing predictive coding or any other categorization technique in the middle. Predictive coding in these exercises did not remove a significant number of responsive documents from the collection. A random sample was just as likely to contain the same number of responsive documents before predictive coding as after predictive coding.

Far from concluding that the multimodal method was better than the monomodal method, these two exercises cannot point to any reliable effect of either method. Not only did the methods not produce reliably different results from one another, but it looks like they had no effect at all. All of the differences between exercises can be attributed to chance, that is, to sampling error. We are forced to accept the null hypotheses that there were no differences between methods and no effect of predictive coding. Again, we did not prove that there were no differences, only that there were no reliable differences to be found in these exercises.

These results may not be widely indicative of those that would be found in other predictive coding uses. Other predictive coding exercises do find significant differences of the kind I looked for here.

From my experience, this situation is an outlier. These data may not be representative of predictive coding problems; for example, they are very sparse. Prevalence near zero left little room for Elusion to be lower. Less than a quarter of a percent of the documents in the collection were estimated to be responsive. In the predictive coding matters I have dealt with, Prevalence is typically considerably higher. These exercises may not be indicative of what you can expect in other situations or with other predictive coding methods. These results are not typical. Your results may vary.

Alternatively, it is possible that predictive coding worked well, but that we do not have enough statistical power to detect it. The confidence interval of the difference, just like any other confidence interval, narrows with larger samples. It could be that larger samples might have found a difference. In other words, we cannot conclude that there was no difference; the best we can do is to conclude that there was insufficient evidence of a difference.
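The narrowing of the confidence interval with sample size can be sketched with the usual normal-approximation margin of error for a proportion. The sample sizes below are hypothetical, and the function name is my own; the 0.0018 proportion is the combined prevalence estimate discussed above. The approximation is crude for proportions this close to zero, so treat the output as an illustration of the trend rather than as exact intervals.

```python
from math import sqrt


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p estimated from n items."""
    return z * sqrt(p * (1 - p) / n)


# Hypothetical sample sizes; each step quadruples the sample.
for n in (1_500, 6_000, 24_000):
    print(n, round(margin_of_error(0.0018, n), 5))
```

Because the margin shrinks with the square root of the sample size, quadrupling the sample only halves the margin of error, which is why detecting small differences between small proportions takes very large samples.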

But, if we cannot be confident of a difference, we cannot be confident that one method was better than the other. At the same time, we also cannot conclude that some other exercises might not find differences. Accepting the null hypothesis is not the same as proving it.

We cannot conclude that predictive coding or the technology used in these exercises does not work. Many other factors could affect the efficacy of the tools involved.

For the predictive coding algorithms to work, they need to be presented with valid examples of responsive and non-responsive documents. The algorithms do not care how those examples were obtained, provided that they are representative. The most important decisions, then, are the document decisions that go into making the example sets.

Losey’s two methods differ (among other things) in terms of who chooses the examples that are presented to the predictive coding algorithms. Losey’s multimodal method uses a mix of machine and human selection. His monomodal method, which he pejoratively calls the “Borg” method, has the computer select documents for his decision. In both cases, it was Losey making the only real decisions that the algorithms depend on — whether documents are responsive or non-responsive. Losey may find search decisions more interesting than document decisions, but search decisions are secondary and a diversion from the real choices that have to be made. He may feel as though he is being more effective by selecting the document to judge for responsiveness, but that feeling is not supported by these results. Evaluating his hypotheses will have to await a situation where we can point to reliable differences in the occurrence of responsive documents before and after running predictive coding and reliable differences between the results of the two methods.

Predictive coding is not the only way to learn about the documents. eDiscovery often requires exploratory data analysis, to know what we have to work with, what kind of vocabulary people used, who the important participants are, and so on. These are questions that are not easily addressed with predictive coding. We need to engage search and other analytics to address these questions. They are not a substitute for predictive coding, but a necessary part of preparing for eDiscovery. Predictive coding is not designed to replace all forms of engagement with the data, but rather to make document categorization easier, more cost effective, and more accurate.

Not every attorney is as skilled at or as interested in searching as Losey is. However the example documents are chosen, the critical judgments are the legal decisions about whether specific documents are responsive or not. Those judgments may not be glamorous, but they are critical to the justice system and to the performance of predictive coding. Despite rather substantial effort, nothing in his exercises would lead us to conclude that either method was better than the other.

Friday, June 28, 2013

Predictive Coding: Debunking a Sample of Myths about Random Sampling

To read some of the eDiscovery blogs that have been posted recently, you might think that there has been a lot of discussion about the “right way” to do predictive coding. Much of this analysis is based on the writers’ intuitions about statistical claims. Generally, these are lawyers attempting to make statistical claims as to which methods are better than others. As the Nobel Laureate Daniel Kahneman and Amos Tversky reported some years ago, these subjective analyses don’t always match up with reality. In the context of eDiscovery, the mismatch of intuition and statistical reality can lead to incorrect assumptions or myths about random sampling, a sample of which we can consider here.

Predictive coding systems learn to distinguish documents based on a set of categorized examples. These examples must have three characteristics in order to give accurate results: Validity, consistency, representativeness.

Validity: The example document decisions must be valid—the sample documents identified as responsive vs. non-responsive must actually be responsive and non-responsive respectively.
Consistency: The categorization must be consistent—the same standards must apply from one example to the next.
Representativeness: The documents must be representative of the distinction in the entire population of all documents.

Poor results follow if any of these requirements is unmet.

Random sampling is one of several methods that can be used to generate these example documents and meet these requirements. Other methods are sometimes called judgmental samples, purposive samples, or seed sets. The training examples are developed, usually on an ad hoc basis, using keyword searching or other techniques. Using expertise, search terms, or some other method, one or more attorneys selects sample documents to provide the training examples. These techniques may or may not meet the three requirements of effective predictive coding.

Random sampling means that every document in a collection or population has an equal chance of participating in the sample. Random sampling does not tell you what population to start with, and this is a source of some serious misunderstanding. In the jargon of statistics, the population is called the “sampling frame.” How you choose the sampling frame can have a profound impact on the time and effort it takes to do predictive coding. Random sampling is not a substitute for intelligent document collection.

Random sampling and the statistics that surround it have been in active use for well over a century. Random sampling is the basis for almost all scientific work, because it is the only way to make valid inferences from the properties of the sample to the properties of the population from which that sample was drawn. Only random sampling provides a means of predicting from the sample the accuracy that can be expected when estimating the properties of the population (such as prevalence).
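To make the inference from sample to population concrete, here is a minimal sketch in Python. It assumes a toy collection of document ids with a known 5% prevalence, a hypothetical `is_responsive` callback standing in for the reviewer's judgment, and the usual normal approximation for the margin of error; none of these names or numbers come from a real matter.

```python
import math
import random

def estimate_prevalence(population, sample_size, is_responsive, z=1.96, seed=None):
    """Estimate the share of responsive documents from a simple random sample."""
    rng = random.Random(seed)
    # Sampling without replacement: every document has the same chance of selection.
    sample = rng.sample(population, sample_size)
    hits = sum(1 for doc in sample if is_responsive(doc))
    p_hat = hits / sample_size
    # Normal-approximation margin of error at roughly 95% confidence (z = 1.96).
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, margin

# Toy sampling frame (an assumption): 100,000 ids, the first 5,000 responsive.
population = list(range(100_000))
responsive_ids = set(range(5_000))

p_hat, margin = estimate_prevalence(
    population, 1_500, lambda doc_id: doc_id in responsive_ids, seed=1
)
print(f"estimated prevalence: {p_hat:.3f} +/- {margin:.3f}")
```

The point of the sketch is that the margin of error is computable from the sample itself, which is what a judgmental sample cannot offer.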

Random sampling is recognized by the courts as a useful and cost-effective method to gather information about large groups of items (including documents and people) where it would be impractical or burdensome to examine each one individually, as noted in the Federal Judicial Center’s Reference Manual on Scientific Evidence: Reference Guide on Statistics.

Random sampling is used widely in industrial quality control, in forecasting elections, and, of course, throughout the sciences. A definitive book on sampling is Sampling Techniques by William G. Cochran (1977, 2nd Ed. Wiley).

Random sampling is a critical part of our evaluation of the success of predictive coding and other processes. When we measure the accuracy of predictive coding, we typically assess a randomly selected set of documents for responsiveness. Random selection is the best way to pick an unbiased representative sample, and only with a representative sample can you justify inferences from the sample to the population from which the sample was drawn. Random samples are the best way to convince opposing counsel that your assessment is based on a fair analysis of the available documents. Just as you would not want to play against an opponent with loaded dice, you would not want to base your decisions on a biased sample of documents. Random sampling assures you and your opponents of an unbiased assessment.

Myth: Random sampling requires you to include all of your documents without any constraint

Even on the surface, this myth is obviously wrong. There are always constraints in how you collect documents for a matter. You typically don’t collect documents from the Library of Congress, for example, or from other companies.

Some documents are worthy of being considered for possible responsiveness and others are not. It takes legal judgment and some knowledge of the document universe to determine what documents to include in your sampling frame. Random sampling means that every document in the set being considered (the sampling frame) has an equal chance of appearing in a random sample. That is what makes random samples representative of the population from which they are drawn. Random sampling does not force you to include inappropriate documents in that population. It says nothing about what goes into that population. Instead, random sampling allows you to describe the population as it is.

Random sampling does not control what goes into the sampling frame, but it does affect what you can say about it. The random sample is representative of the documents that have been included in the sampling frame and may or may not be representative of documents outside of the sampling frame. You can generalize from the sample to the rest of the documents in the sampling frame, but not to those documents that were not included. In an election poll, if you sample only from New York City, then you can talk about New York City voters’ opinions, but you cannot say anything useful about voters outside of the city. If you cull down your document collection prior to sampling, you can justifiably talk about the documents that you kept, but not about the documents that were rejected by your culling criteria. So, rather than sampling determining what documents are included, the documents included in the sampling frame determine what you can infer from samples drawn from it.

If you want to make claims about all of the documents held by a particular organization, then your sampling must include all documents held by the organization. If you want to make claims about a certain subset of documents, perhaps those held by certain custodians, then your sampling frame need only include those documents.

Random sampling is the best way to get a representative set of documents from a population, but how you pick that population depends on your needs. Election polls do not interview just any kind of person in the country for every kind of poll. Rather, these polls tend to focus on likely voters or at least eligible voters. The sampling frame is constrained to meet the goals of the poll. If the poll concerns a local election, only those voters who are eligible to vote in that election are polled, not everyone in the whole country and not everyone who walks by in a shopping mall.

Your document collection may be similarly constrained. Document collection may be limited to a certain time frame, custodians, or document types, and so on. It takes intelligence to collect a sensible set of documents for litigation or investigation, just as it always has. If certain kinds of documents, certain custodians, or certain time periods are known to be irrelevant to the matter, for example, they should not be included. Neither random sampling nor legal rules require it. Your random samples will be representative of the population you select and not representative of the population that you don’t select.

The decisions you make about your collection are critically important, but that is not affected by the technology that you use to analyze that collection. Random sampling does not require you to check your intelligence at the door.

Myth: Random sampling must miss important concepts in the document collection

Random sampling can easily be shown to provide a good account of the characteristics of the document population (the sampling frame). If our random sample was as big as our population of documents, then it is obvious that it would find every responsive document in the collection and would leave nothing behind. The characteristics of the sample would match exactly the characteristics of the whole population. It is also obvious that a sample as large as the population is impractical. That is the situation we use predictive coding to avoid. What may not be so obvious is that smaller samples provide approximations to the population. The larger the sample, all other things being equal, the better the approximation. Whatever concepts are present in the population will be present in the sample in approximately the same proportion. Nothing in random sampling causes you to miss anything.

The characteristics of the sample are approximations of the characteristics of the collection, so features or concepts that are more common in the collection will be more common in the sample. Very rare things are, by definition, very rare, so they may not appear in the sample. But they are unlikely to be captured using other approaches either.
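How rare is too rare to show up? A rough sketch: if a concept appears in a fraction p of the documents, the chance that a random sample of n documents contains at least one example is 1 - (1 - p)^n. The function name and the prevalence values below are illustrative assumptions, and the formula treats the collection as large relative to the sample.

```python
def chance_of_seeing(prevalence, sample_size):
    """Probability that a random sample contains at least one document
    from a concept with the given prevalence (large-collection approximation)."""
    return 1.0 - (1.0 - prevalence) ** sample_size

# Illustrative prevalence values, not taken from any real collection.
for prevalence in (0.05, 0.01, 0.001, 0.0001):
    for sample_size in (500, 2_000):
        chance = chance_of_seeing(prevalence, sample_size)
        print(f"prevalence {prevalence:.4f}, sample {sample_size}: {chance:.3f}")
```

A concept present in 1% of documents is almost certain to appear in a 500-document sample, while one present in 0.01% usually will not.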

It is nice to imagine that an educated reader would find the rare smoking gun in a collection if he or she came across it. The evidence is contrary, however. Detecting rare events is difficult no matter what method or system you use. Despite massive security efforts, terrorists still can slip through at airports and other locations, in part because terrorists are so rare and travelers are so common. Millions of people pass through security checkpoints every day and almost none of them is a terrorist.

The same is true in document review. For a reviewer to identify a document as responsive, the reviewer has to see the document and has to recognize that the document actually provides evidence of being responsive. We can affect the probability of a reviewer seeing a document, but there is still the problem of recognizing it when it has been seen. Psychological studies have found that the more often one decides that an item is irrelevant, the harder it is to detect an item that is, in fact, relevant. The TSA officers who look at the X-rays of your carry-on bags change roles every 20 minutes for precisely this reason. Practically none of the suitcases will ever have a terrorist’s weapon in it.

People would like to think that if they saw a smoking gun document, they would immediately recognize it. Here, too, psychological studies indicate otherwise. Documents in isolation or in hindsight may be obvious, but in the fog of review they are easy to miss. Many of us think that we are above average in being able to design searches for what we want or to recognize responsive documents when they are seen, but we rarely actually analyze our performance to test that hypothesis. There is a strong bias, called the “overconfidence effect,” to believe that we are better than our peers. The attorneys in the Blair and Maron study, one of the first examining the efficacy of keyword searching, thought that they had found around 75% of the relevant documents in their collection, when, in fact, they had found 20%.

Non-random judgmental samples give the appearance to the attorneys running them that they are in control, but that control may be illusory. You can achieve a high level of responsive documents in the documents that you find, but that tells you nothing about your success at finding all of the responsive documents. You may feel confident on the basis of finding more or less only the responsive documents, but still fail to find a substantial number of responsive documents that are not retrieved.

With enough work, keyword searches can be effective for finding responsive documents, but few matters get this high-level treatment. For example, in Kleen Products, one of the defendants reported that they spent 1400 hours coming up with a set of keywords. It is often difficult to guess the right keywords that will return useful documents without providing overwhelming quantities of irrelevant documents. For example, we have seen many times in investigations that the attorney will suggest that we search for documents with legal terms of art that most authors would never use.

Not all attorneys can be above average at guessing keywords and not all matters merit the level of effort that would be required to develop effective lists. In one recent matter where the attorneys tried to cull down the document collection with a keyword list, only 2% of the documents selected by the keywords ended up being called responsive by the principal attorneys. And they don’t know what they missed because they did not analyze the documents that were not selected by the keywords (the identified responsive documents constituted only 0.3% of the whole collection).

Like the attorneys in the Blair and Maron study, the attorneys in this matter thought they were effective at identifying the responsive documents that they needed, but they never actually evaluated their success. Without that evaluation, it is impossible to say whether they were effective or merely overconfident. To evaluate their effectiveness, they should have drawn a random sample of the documents not selected by the keywords and evaluated them for responsiveness. On the other hand, if they were willing to draw a random sample, they probably did not have to do the keyword culling at all; they could simply have used the evaluated random sample as part of the training process for predictive coding.
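The evaluation described above can be sketched in a few lines. The percentages (2% of keyword hits responsive, 0.3% of the collection) follow the matter described; the collection size of 1,000,000 and the sample result of 15 responsive documents out of 1,500 are invented purely for illustration, as is the function name.

```python
import math

def estimate_missed(unselected_count, sample_size, responsive_in_sample,
                    found_by_keywords, z=1.96):
    """Project a random sample of the unselected documents onto the whole
    unselected set, then estimate recall of the keyword culling."""
    rate = responsive_in_sample / sample_size
    margin = z * math.sqrt(rate * (1 - rate) / sample_size)
    missed = rate * unselected_count
    recall = found_by_keywords / (found_by_keywords + missed)
    return rate, margin, missed, recall

collection = 1_000_000                 # hypothetical collection size
keyword_hits = 150_000                 # 15% of the collection selected by keywords
found = 3_000                          # 2% of the hits, 0.3% of the collection
unselected = collection - keyword_hits

# Hypothetical review result: 15 of 1,500 sampled unselected documents responsive.
rate, margin, missed, recall = estimate_missed(unselected, 1_500, 15, found)
print(f"responsive rate among unselected: {rate:.3f} +/- {margin:.3f}")
print(f"estimated missed documents: {missed:.0f}, estimated recall: {recall:.2f}")
```

Under these made-up sample results, the keywords would have found only about a quarter of the responsive documents, which is exactly the kind of fact that stays hidden without the sample.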

There have been studies of computer assisted review and of human review efficacy, but proper evaluation of discovery by any method is relatively rare. Without a formal analysis, it is difficult to truly evaluate how well we actually do because it is so easy to over-interpret the evidence we have at hand. This tendency is related to confirmation bias, which is the bias to seek evidence that confirms, rather than tests, our ideas. The studies that are available suggest that human review is not as accurate as many attorneys would like to believe.

Myth: Random sampling relies on the premise of ignorance: no one knows anything about how to find the probative documents so you might as well search randomly

As Will Rogers said, “It isn't what we don't know that gives us trouble, it's what we know that ain't so.” The problem with assuming that we know how to find the responsive documents is that we may actually be mistaken in our belief. In one matter, one of the parties was certain that there was email that was highly pertinent to the matter and that it was sent sometime between two dates. When the email (or one like it) was eventually found, it had actually been sent 18 months earlier.

Random samples do not prevent you from using information you have about the matter and about the location of probative documents, but they can prevent you from being so convinced of your knowledge that you miss the very information that you are looking for. Confirmation bias leads us to look for evidence that confirms our beliefs rather than tests them. You may, in fact, know how to find critical documents, but unless you critically test that belief, you will never know how accurate you were. And, there are powerful psychological biases at work that interfere with our ability to critically evaluate our beliefs.

We do not choose random samples because we are ignorant of the differences among items, but rather to get a representative, unbiased, sample of those differences. (See Reference Manual on Scientific Evidence: Reference Guide on Statistics, p. 90.)

The representativeness of random samples ensures that the training examples are representative of the range of documents in the collection. The reviewer’s decisions about these documents, therefore, will be representative of the decisions that need to be made about the collection. Non-representative samples may, hypothetically, do better at finding certain kinds of responsive documents, but they would then necessarily be poorer at finding other kinds of responsive documents. Without random sampling, we would have no way to evaluate any non-random claim of success because we could not generalize from the specific documents in the non-random sample to the rest of the collection. In other words, non-random sampling may make one feel as though he or she is being more effective, but only while avoiding a fair evaluation.

Predictive coding depends on the representativeness of the training examples. It does not make any assumptions that all documents are equally likely to be responsive. It does not make any assumptions that the user of the predictive coding system is ignorant. Quite the contrary. Predictive coding, in general, requires a high level of expertise about the subject matter to determine which documents are and which are not responsive. It requires legal judgments. It does not rely on your ability as a search engine guru or information scientist.

Predictive coding is a tool. It can be used effectively or ineffectively. If the decisions that are driving its training examples are not valid, the result of predictive coding will not be valid. Random sampling helps to evaluate the thoroughness, consistency, and validity of the decisions.

Moreover, random sampling does not require you to ignore information that you have. If there are documents that are known to be responsive, these can easily be included in the training set. You will still want to use random sampling of additional documents to ensure that you have a representative sample, but there is unlikely to be any harm from starting with known documents, as long as they represent valid decisions about responsiveness.

Using random sampling to train predictive coding means that you don’t have to make an extraordinary effort to find a representative training set. But if you have known documents they should not be ignored. Nothing in random sampling requires you to ignore information that you already have.

Myth: Random sampling prevents lawyers from engaging their professional expertise

Random sampling does not remove the necessity for lawyers to make legal judgments, to guide the collection of data, to prepare for depositions, or to understand the case and the documents relevant to it.

To the contrary, random sampling allows lawyers to focus on using their legal expertise, without requiring them to be sophisticated information scientists. Lawyers’ professional expertise extends to the law, legal process, and the like, but it seldom includes information science. Seeking the documents required to build a seed set often requires the use of complex Boolean expressions, and while some attorneys are quite adept at designing such queries, most are not. It can be quite challenging to construct queries that give all or most of the documents that are wanted and not overwhelm with a lot of unwanted documents.

Traditional kinds of searching, involving keywords and Boolean expressions, require one to know the language that was used by the authors of the documents. Interviews of the available authors can reveal some of this information, but even the authors may not remember the exact terms that they used. And, they may not be available.

Many attorneys appear to think that the document authors tend to talk about things the way that attorneys would talk about them. I have been asked, for example, to search for the word “fraud” in some investigations as if someone would email another person and say, “hey, let’s commit fraud.” Such things can happen, but people are much more likely to use language that is more subtle, colloquial, and idiosyncratic. On top of that, the words may not even be spelled correctly or consistently. The authors may even use code words to talk about important issues. It is much more likely that these words can be recognized than guessed.

The attorneys may not be able to describe how to find the responsive documents, but they are typically very able to recognize that a document is responsive. The success of predictive coding with random sampling depends strongly on the reviewer’s legal acumen to identify responsive documents. It does not require any specific level of information science skill.

Part of the concern with random sampling seems to be a desire to preserve the attorney’s role in selecting documents for consideration by the predictive coding system. As mentioned earlier, all of the decisions about documents being responsive or non-responsive are made by the attorney, not by the computer. The difference between an attorney-provided seed set and a random sample is who chooses the documents for consideration. We should use professional judgment to select the documents for consideration. If that professional judgment is sound and effective, it can contribute to improved predictive coding results. But a word of caution is appropriate here.

Without explicit measurement, we cannot know whether that judgment is good, and that measurement requires the evaluation of a random sample. Further, I am reminded of the work of the psychologist Paul Meehl, who compared what he called clinical vs. statistical judgment. In virtually every case, he found that using statistical models led to better outcomes than intuitive judgments. His approach has been widely adopted in psychology, hiring decisions, and other areas, but there continues to be resistance to his approach, despite the consistent evidence.

“There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing [scores of] investigations, predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion” (Meehl 1986, 372−3).

An attorney’s professional judgment about where to look for responsive documents is an implicit prediction about where these documents are likely to be. It is an intuitive and informal assessment of this likelihood, which, as Meehl, and Kahneman and Tversky have found, is subject to inappropriate biases and does not consistently correspond to reality. For most attorneys, there is simply no evidence to lead us to believe that actively choosing a seed set is necessarily better than using random sampling to pick training examples.

Random sampling allows lawyers to focus on their legal judgments, rather than information science techniques.

Myth: Lawyers use random sampling to hide behind the computer’s decisions

In predictive coding, the computer does not make the decisions; it merely implements the decisions that have been made by a subject matter expert attorney. There is nothing to hide behind. Random sampling chooses the documents to be considered, but the judgment about those documents is all yours. The predictive coding system analyzes the decisions you make and extends those decisions in a consistent way to other documents in the collection.

By itself, the predictive coding system cannot determine which documents are interesting and which are not. Your knowledgeable input provides its only notion of responsiveness. What the system has is an analysis of the evidence in each document that makes that document more like those you have designated as responsive or more like those you have designated as non-responsive. The choice is yours. And, as you update your choices, the system continues to adjust its internal models to more closely approximate your choices.

Conclusion

Predictive coding is a game-changing approach to eDiscovery. It dramatically modifies the workflow process of eDiscovery and speeds it up by orders of magnitude. It helps producing parties to manage the cost of eDiscovery while improving accuracy and the time needed to meet discovery obligations. It helps receiving parties to get more of the information that they are entitled to receive in a shorter time and with less of the junk that typically invades productions.

Random sampling is an essential part of predictive coding. It is universally used in the measurement of its outcome. In addition, training with random sampling increases the transparency of the process—it shows at each step of training how well the system is doing in a form that can be directly compared with the ultimate outcome of the process. Random sampling is the best way that we know to provide the representativeness that is required for effective predictive coding. It minimizes the effort required to seek out potentially responsive documents and provides easily defensible criteria for selecting the training examples.

There are other methods to provide the training examples that predictive coding requires. Random sampling does not have to replace those other methods, but it should be an essential part of any training process to ensure that coverage be as complete as possible. The research is clear that we all have expectations about what information should be present and great confidence that we can find that information. Often this confidence is unfounded and in any case, it should be measured as objectively as possible. Random sampling provides that unbiased objective measurement and an unbiased objective method to identify the documents in a collection that are actually responsive.

The power of detailed statistical analysis to reveal actionable information about the world has long been recognized by statisticians. Recent books, such as Moneyball by Michael Lewis and The Signal and the Noise: Why So Many Predictions Fail — but Some Don't by Nate Silver, have helped to make that value more apparent to people outside of statistics. The very same principles and processes described in these books can be applied to eDiscovery with similar levels of success. An essential ingredient in that success is random sampling.

Sunday, March 25, 2012

Da Silva Moore Plaintiffs Slash and Burn their Way Through eDiscovery

The Plaintiffs in the Da Silva Moore case have gone far beyond zealous advocacy in their objection to Judge Peck's order. The Plaintiffs object to the protocol (see the protocol and other documents here) that gives them the keys to the eDiscovery candy store. In return, they propose to burn down the store and eviscerate the landlord.

Da Silva Moore has been generating a lot of attention in eDiscovery circles, first for Judge Peck's decision supporting the use of predictive coding, and then for the challenges to that ruling presented by the Plaintiffs. The eDiscovery issues in this case are undoubtedly important to the legal community so it is critical that we get them right.

The Plaintiffs play loose with the facts in the matter, they fail to recognize that they have already been given the very things that they ask for, and they employ a rash of ad hominem attacks on the judge, the Defendant, and the Defendant's predictive coding vendor, Recommind. Worse still, what they ask for would actually, in my opinion, disadvantage them.

If we boil down this dispute to its essence, the main disagreement seems to be about whether to measure the success of the process using a sample of 2,399 putatively non-responsive documents or a sample of 16,555. The rest is a combination of legal argumentation, which I will let the lawyers dispute, some dubious logical and factual arguments, and personal attacks on the Judge, attorneys, and vendor.

The current disagreement embodied in the challenge to Judge Peck's decision is not about the use of predictive coding per se. The parties agreed to use predictive coding, even if the Plaintiffs now want to claim that that agreement was conditional on having adequate safeguards and measures in place. Judge Peck endorsed the use of predictive coding knowing that the parties had agreed. It was easy to order them to do something that they were already intending to do.

Now, though, the Plaintiffs are complaining that Judge Peck was biased toward predictive coding and that bias somehow interfered with him rendering an honest decision. Although he has clearly spoken out about his interest in predictive coding, I am not aware of any time that Judge Peck endorsed any specific measurement protocol or method. The parties to the case knew about his views on predictive coding, and, for double measure, he reminded them of these views and provided them the opportunity to object. Neither party did. In any case, the point is moot in that the two sides both stated that they were in favor of using predictive coding. It seems disingenuous to then complain about the fact that he spoke supportively of the technology.

The Plaintiff brief attacking Judge Peck for his support of predictive coding reminds me of the scene from the movie Casablanca where Captain Renault says that he is shocked to find out that gambling is going on in Rick's café just as he is presented with his evening's winnings. If the implication is that judges should remain silent about methodological advances, then that would have a chilling effect on the field and on eDiscovery in particular. A frequent complaint that I hear from lawyers is that the judges don't understand technology. Here is a judge who not only understands the technology of modern eDiscovery, but works to educate his fellow judges and the members of the bar about its value. It would be disastrous for legal education if the Plaintiffs were to succeed in sanctioning the Judge for playing this educational role.

The keys to the candy shop

The protocol gives to the Plaintiffs the final say on whether the production meets quality standards (Protocol, p. 18):

If Plaintiffs object to the proposed review based on the random sample quality control results, or any other valid objection, they shall provide MSL with written notice thereof within five days of the receipt of the random sample. The parties shall then meet and confer in good faith to resolve any difficulties, and failing that shall apply to the Court for relief. MSL shall not be required to proceed with the final search and review described in Paragraph 7 above unless and until objections raised by Plaintiffs have been adjudicated by the Court or resolved by written agreement of the Parties.

They, the Plaintiffs, make a lot of other claims in their brief about things not being specified, when in fact, the protocol gives them the power to specify the criteria as they see fit. They get to define what is relevant. They get to determine whether the results are adequate, so it is not clear why they complain that these things are not clearly specified.

Moreover, the Defendant is sharing with them every document used in training and testing the predictive coding system. The Plaintiffs can object at any point in the process and trigger a meet and confer to resolve any possible dispute. It's not clear, therefore, why they would complain that the criteria are not clearly spelled out when they can object for any valid reason. Any further specificity would simply limit their ability to object. If they don't like the calculations or measures used by the Defendant, they have the documents and can do their own analysis.

The Plaintiffs are being given more data than they could reasonably expect from other defendants or when using other technology. I am not convinced that it should be necessary in general to share the predictive coding training documents with opposing counsel. These training documents provide no information about the operation of the predictive coding system. The documents are only useful for assessing the honesty or competence of the party training the predictive coding system; they presume that the predictive coding system will make good use of the information they contain. I will leave to lawyers any further discussion of whether document sharing is required or legally useful.

Misuse of the "facts"

The Plaintiffs complain that the method described in the protocol risks failing to capture a staggering 65% of the relevant documents in this case. They reach this conclusion based on their claim that Recommind’s “recall” was very low, averaging only 35%. This is apparently a fundamental misreading or misrepresentation of the TREC (Text Retrieval Conference) 2011 preliminary results (attached to Neale's declaration). Although it may be tempting to use the TREC results for this purpose, TREC was never designed to be a commercial "bakeoff" or certification of commercial products. It is designed as a research project and it imposes special limitations on the systems that participate, limitations that might not be applicable in actual use during discovery. Moreover, Recommind scored much better on recall than the Plaintiffs claim, about twice as well.

The Plaintiffs chose to look at the system's recall level at the point where the measure F1 was maximized. F1 is a measure that combines precision and recall with an equal emphasis on both. In this matter, the parties are obviously much more concerned with recall than precision, so the F1 measure is not the best choice for judging performance. If, rather, we look at the actual recall achieved by the system, while accepting a reasonable number of non-responsive documents, then Recommind's performance was considerably higher, reaching an estimated 70% or more on the three tasks (judging from the gain curves in the Preliminary TREC report). To claim that the data support a recall rate of only 35% is misleading at best.
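The difference between "recall where F1 peaks" and "recall you can reach by reviewing further down the ranking" is easy to see with a toy example. The ranking below is invented for illustration (10 responsive documents among 40, at made-up ranks) and has nothing to do with the actual TREC data; the function name is likewise hypothetical.

```python
def metrics_at_cutoffs(ranked_labels):
    """For each cutoff k in a ranked list (1 = responsive, 0 = not),
    return (k, precision, recall, F1) over the top k documents."""
    total_responsive = sum(ranked_labels)
    rows = []
    found = 0
    for k, label in enumerate(ranked_labels, start=1):
        found += label
        precision = found / k
        recall = found / total_responsive
        # F1 is the harmonic mean of precision and recall, weighting both equally.
        f1 = 2 * precision * recall / (precision + recall) if found else 0.0
        rows.append((k, precision, recall, f1))
    return rows

# Invented ranking: responsive documents sit at these ranks out of 40.
responsive_ranks = {1, 2, 3, 4, 6, 12, 18, 25, 32, 40}
ranked_labels = [1 if rank in responsive_ranks else 0 for rank in range(1, 41)]

rows = metrics_at_cutoffs(ranked_labels)
best = max(rows, key=lambda row: row[3])
print(f"F1 peaks at cutoff {best[0]}: recall {best[2]:.2f}, precision {best[1]:.2f}")

k, precision, recall, f1 = rows[24]  # cutoff of 25 documents
print(f"cutoff {k}: recall {recall:.2f}, precision {precision:.2f}")
```

In this toy ranking F1 peaks at a recall of 0.50, yet reviewing further down the list reaches a recall of 0.80 at the cost of lower precision. Reporting only the recall at the F1 peak understates what the system can deliver when recall matters more than precision.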

Methodological issue

The Plaintiffs complain that there are a number of methodological issues that are not fully spelled out in the protocol. Among these are how certain standard statistical properties will be measured (for example, the confidence interval around recall). Because they are standard statistical properties, they should hardly need to be spelled out again in this context. These are routine functions that any competent statistician should be able to compute.

The biggest issue that is raised, and the only one where the Plaintiffs actually have an alternative proposal, concerns how the results of predictive coding are to be evaluated. Because, according to the protocol, the Plaintiffs have the right to object to the quality of the production, it actually falls on them to determine whether it is adequate or not. The dispute revolves around the collection of a sample of non-responsive documents at the end of predictive coding (post-sample) and here the parties both seem to be somewhat confused.

According to the protocol, the Defendant will collect 2,399 documents designated by the predictive coding to be non-responsive. The plaintiffs want them to collect 16,555 of these documents. They never clearly articulate why they want this number of documents. The putative purpose of this sample is to evaluate the system's precision and recall, but in fact, this sample is useless for computing these measures.

Precision concerns the number of correctly identified responsive documents relative to the number of documents identified by the system as responsive. Precision is a measure of the specificity of the result. Recall concerns the number of correctly identified responsive documents relative to the total number of responsive documents. Recall is a measure of the completeness of the result.
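As a minimal sketch of these two definitions, the following computes both measures from counts. The counts in the example are hypothetical, and the function names are mine.

```python
def precision(true_positives, false_positives):
    """Correctly identified responsive docs / all docs the system called responsive."""
    called_responsive = true_positives + false_positives
    # Undefined when the system (or the sample) contains no docs called responsive.
    return true_positives / called_responsive if called_responsive else None

def recall(true_positives, false_negatives):
    """Correctly identified responsive docs / all docs that are truly responsive."""
    truly_responsive = true_positives + false_negatives
    return true_positives / truly_responsive if truly_responsive else None

# Hypothetical counts: 70 responsive docs found, 30 non-responsive docs
# wrongly called responsive, 30 responsive docs missed.
print(precision(70, 30))  # share of the system's "responsive" calls that were right
print(recall(70, 30))     # share of the truly responsive docs that were found
```

Both measures need the count of correctly identified responsive documents, which is why a sample that contains no documents called responsive cannot supply them.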

The sample that both sides want to draw contains by design no documents that have been identified by the system as responsive so it cannot be used to calculate either precision or recall. Any argument about the size of this sample is meaningless if the sample cannot provide the information that they are seeking.

A better measure to use in this circumstance is elusion. Rather than calculate the percentage of responsive documents that have been found, elusion calculates the percentage of documents that were erroneously classified as non-responsive. I have published on this topic in the Sedona Conference Journal, 2007. Elusion is the percentage of the rejected documents that are actually responsive. It can be used to create an accept-on-zero quality control test or one can simply measure it. Measuring elusion would require the same size sample as the original 2,399-document pre-sample used to measure prevalence. The methods for computing the accept-on-zero quality control test are described in the Sedona Conference Journal paper. The parties could apply whatever acceptance criterion they want, without having to sample huge numbers of documents to evaluate success.
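A minimal sketch of measuring elusion from a random sample of the rejected documents follows. The count of 21 responsive documents is hypothetical. The confidence interval uses the ordinary normal approximation, and the accept-on-zero sample size uses the standard zero-acceptance sampling argument; both are my assumptions about a reasonable calculation and are not necessarily the exact procedure in the Sedona Conference Journal paper.

```python
import math

def elusion(responsive_in_sample, sample_size):
    """Fraction of sampled rejected documents that are actually responsive."""
    return responsive_in_sample / sample_size

def normal_confidence_interval(p, n, z=1.96):
    """Normal-approximation interval around a sample proportion (z=1.96 for 95%)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

def accept_on_zero_sample_size(max_elusion, confidence=0.95):
    """Smallest sample in which finding zero responsive documents would be
    unlikely (probability <= 1 - confidence) if true elusion were max_elusion."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_elusion))

# Hypothetical post-sample: 21 responsive documents among 2,399 rejected ones.
e = elusion(21, 2399)
low, high = normal_confidence_interval(e, 2399)
print(f"elusion = {e:.4f}, 95% interval = ({low:.4f}, {high:.4f})")

# Sample needed for an accept-on-zero test against 1% elusion at 95% confidence.
print(accept_on_zero_sample_size(0.01))
```

The accept-on-zero calculation shows why such a test can be inexpensive: the required sample depends on the acceptance criterion, not on the size of the collection.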

Another test that could be used is a z-test for proportions. If predictive coding works, then it should decrease the number of responsive documents that are present in the post-sample, relative to the pre-sample. The pre-sample apparently identified 36 responsive documents out of 2,399 in a random sample. A post-sample of 2,399 documents, drawn randomly from the documents identified as non-responsive would have to have 21 or fewer responsive documents for it to be significantly different (by a conservative 2-tailed test) at the 95% confidence level.
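The threshold above can be checked with a short sketch. It assumes the usual pooled two-proportion z-test and a two-tailed critical value of 1.96; the function and variable names are mine.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for x1/n1 versus x2/n2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error

PRE_RESPONSIVE = 36    # responsive documents found in the pre-sample
SAMPLE_SIZE = 2399     # size of both the pre-sample and the post-sample
Z_CRITICAL = 1.96      # two-tailed test at the 95% confidence level

# Largest post-sample count that is still significantly below the pre-sample.
threshold = max(
    x for x in range(PRE_RESPONSIVE + 1)
    if two_proportion_z(PRE_RESPONSIVE, SAMPLE_SIZE, x, SAMPLE_SIZE) >= Z_CRITICAL
)
print(threshold)
```

Under these assumptions the largest qualifying count is 21: a post-sample with 21 responsive documents gives a z of about 2.0, while 22 gives about 1.85.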

Conclusion

The Parties in this case are not arguing about the appropriateness of using predictive coding. They agreed to its use. The Plaintiffs are objecting to some very specific details of how this predictive coding will be conducted. Along the way they raise every possible objection that they can imagine, most of which are beside the point; they misinterpret or misrepresent data; they fail to realize that they have the very information they are seeking; and they seek data that will not do them any good, all while vilifying the judge, the other party, and the party's predictive coding service provider. It is as if given the keys to the candy store, they are throwing a tantrum because they have not been told whether to eat the red whips or the cinnamon sticks. Their slash and burn approach to negotiation is far beyond zealous advocacy and far from consistent with the pattern of cooperation that has been promoted by the Sedona Conference and by a large number of judges, including Judge Peck.

Disclosures

So that there are no surprises about where I am coming from, let me repeat some perhaps pertinent facts. Certain other bloggers have recently insinuated that there might be some problem with the credibility of the paper that Anne Kershaw, Patrick Oot, and I published in the peer-reviewed Journal of the American Society for Information Science and Technology on predictive coding. Judge Peck mentioned this paper in his opinion. The technology investigated in that study was from two companies with which none of the authors had any financial relationship.

I am the CTO and Chief Scientist for OrcaTec. I designed OrcaTec's predictive coding tools, starting in February of 2010, after the paper mentioned earlier had already been published and after it became clear that there was interest in predictive coding for eDiscovery. OrcaTec is a competitor of Recommind, and uses very different technology. Our goal is not to defend Recommind, but to try to bring some common sense to the process of eDiscovery.

Neither I, nor OrcaTec has any financial interest in this case, though I have had conversations in the past with Paul Neale, Ralph Losey, and Judge Peck about predictive coding.

I have also commented on this case in an ESI-Bytes podcast, where we talk more about the statistics of measurement.

Thanks to Rob Robinson for collecting all of the relevant documents in one easy to access location.

Monday, January 9, 2012

On Some Selected Search Secrets

Ralph Losey recently wrote an important series of blog posts (here, here, and here) describing five secrets of search. He pulled together a substantial array of facts and ideas that should have a powerful impact on eDiscovery and the use of technology in it. He raised so many good points that it would take up all of my time just to enumerate them. He also highlighted the need for peer review. In that spirit I would like to address a few of his conclusions in the hope of furthering discussions among lawyers, judges, and information scientists about the best ways to pursue eDiscovery.

These are the problematic points I would like to consider:
1. Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.
2. Webber’s analysis shows that human review is better than machine review.
3. Reviewer quality is paramount.
4. Human review is good for small volumes, but not large ones.
5. Random samples with 95% confidence levels +/- 2% confidence intervals are unrealistically high.

Issue: Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.

Losey quotes extensively from a paper written by William Webber, which reanalyzes some results from the TREC Legal Track, 2009, and some other sources. Like Losey’s commentary, this paper also has a lot to recommend it. Some of the conclusions that Losey reaches are fairly attributable to Webber, but some go beyond what Webber would probably be comfortable with. The most significant fact, because important arguments are based on it, is a description of some work by Ellen Voorhees that concluded that 65% recall at 65% precision is the best performance one can expect.

The problem is that this 65% factoid is taken out of context. In the context of the TREC studies and the way that documents are ultimately determined to be relevant or not, this is thought to be the best that can be achieved. The 65% is not a fact of nature. It says, actually, nothing about the accuracy of the predictive coding systems being studied. Losey notes that this limit is due to the inherent uncertainty in human judgments of relevance, but goes on to claim that this is a limit on machine-based or machine assisted categorization. It is not.

Part of the TREC Legal Track process is to distribute sets of documents to ad hoc reviewers, whom they call assessors. Each assessor gets a block or bin of about 500 documents and is asked to categorize them as relevant or not relevant to the topic query. None of the documents in this part of the process is reviewed by more than one assessor. Each assessor typically reviews only one batch. Although information about the topic is provided to each assessor, there is no rigorous effort expended to train them. As you might expect, the assessors can be highly variable. But, generally speaking, we don’t have any assessment of their variability or skill level. This is an important point and I will have to come back to it soon.

Predictive coding systems generally work by applying machine learning to a sample of documents and extrapolating from that sample to the whole collection. Different systems get their samples in different ways, but the performance of the system depends on the quality of the sample. Garbage in – garbage out. More fully, variability in accuracy can come from at least three sources:
1. Variability in the training set
2. Variability due to software algorithms
3. Variability due to the judgment standard against which the system is scored

If the system is trained on an inconsistent set of documents, or if it performs inconsistently, or if it is judged inconsistently, its ultimate level of performance will be poor. Voorhees, in the paper cited by Webber, found that professional information analysts agreed with one another on less than half of the responsive documents. This fact says nothing about any predictive coding system; it speaks only to the agreement of one person with another. One of the assessors she compared was the author of the topic and so could be considered the best available authority on the topic. The second assessor was the author of a different topic.

Under TREC, the variability due to the training set is left uncontrolled. It is up to each team to figure out how to train their systems. The variability due to the judgment standard is consistent across systems, so any variation among systems can be attributed to the training set or the system capabilities. That strategy is perfectly fine for most TREC purposes. We can compare the relative performance of the participating systems. The problem only comes when we want to ascertain how well a system will do in absolute terms. The performance of predictive coding systems in the TREC Legal Track is suppressed by the variability of the judgment standard. It is not a design flaw for TREC; it is only a problem when we want to extrapolate from TREC results to eDiscovery or other real world situations. Such an extrapolation under-estimates how well a system will do with more rigorous training and testing standards. The original TREC methodology was never designed to produce absolute estimates of performance, only relative ones.

Anything that we can do to improve the consistency of the training and testing set of document categorizations will improve the absolute quality of the results. But such quality improvements are typically expensive.

The TREC Legal Track has moved to using a Topic Authority (like Voorhees’s primary assessor). Even an authoritative assessor is not infallible, but it may be the best that we can achieve. It also may be realistic.

I would like to see the Topic Authority (TA) produce an authoritative training set and a second authoritative judgment set. The first set is used to train the predictive coding system, the second is used to test it.

Using a topic authority to provide the training and final assessment sets will substantially reduce the variability of the human judgments. We need two sets because we cannot use the same documents to train the system as we use to test the system. If we used only one, then the performance of the system on the same documents could over-estimate its capabilities. The system could simply memorize the training examples and spit back the same categories it was given. Having separate training and testing sets is standard procedure in most machine learning studies.
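A toy sketch shows why the two sets must be separate. The "system" here is a deliberately silly one that only memorizes its training examples; the document IDs and labels are hypothetical.

```python
def train_memorizer(labeled_docs):
    """'Train' by storing each training document's label verbatim."""
    return dict(labeled_docs)

def predict(model, doc_id, default=False):
    """Return the memorized label, or guess non-responsive for unseen docs."""
    return model.get(doc_id, default)

def accuracy(model, labeled_docs):
    """Fraction of documents whose predicted label matches the given label."""
    correct = sum(predict(model, doc) == label for doc, label in labeled_docs)
    return correct / len(labeled_docs)

# Hypothetical labels from a topic authority (True = responsive).
training_set = [("doc1", True), ("doc2", False), ("doc3", True), ("doc4", False)]
testing_set = [("doc5", True), ("doc6", True), ("doc7", False), ("doc8", True)]

model = train_memorizer(training_set)
print("accuracy on training set:", accuracy(model, training_set))
print("accuracy on testing set: ", accuracy(model, testing_set))
```

The memorizer is perfect on the documents it was trained on and poor on new ones, so only the score on the separate testing set says anything about how it would do on the rest of the collection.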

When we do a scientific study, we want to know how well the system will do in other, similar, situations. This prediction typically requires a statistical inference, and to make a valid statistical inference the two measurements need to be independent.

To translate this into eDiscovery process, the training set should be created by the person who knows the most about the case and then evaluated, for example, using a random sample of documents predicted to be responsive and nonresponsive, by the same person. Losey is correct: if you have multiple reviewers, each applying idiosyncratic standards to the review, you will get poor results, even from a highly accurate predictive coding system. On the other hand, with rigorous training and a rigorous evaluation process, high levels of predictive coding accuracy are often achievable.

Issue: Webber’s analysis shows that human review is better than machine review

I have no doubt that human review could sometimes be better than machine-assisted review, but the data discussed by Webber do not say anything one way or the other about this claim.

Webber did, in fact, find that some of the human reviewers showed higher precision and recall than did the best-performing computer system on some tasks. But, because of the methods used, we don’t know whether these differences were due merely to chance, to specific methods used to obtain the scores, or to genuine differences among reviewers. Moreover, the procedure prevents us from making a valid statistical comparison.

The TREC Legal Track results that were analyzed in Webber’s paper involved a three step process. The various predictive coding systems were trained on whatever data their teams thought were appropriate. The results of the multiple teams were combined and sampled along with a number of documents that were not designated responsive by any team. From these, the bins or batches were created and distributed to the assessors. Once the assessors made their decisions, the machine teams were given a second chance to “appeal” any decisions to the Topic Authority. If the TA agreed with the computer system’s judgments the computer system then was measured as performing better and the assessor’s performance was judged as performing worse. The appeals process, in other words, moved the target after the arrow had been shot.

If none of the documents judged by a particular assessor was appealed, then that assessor would have precision and recall of 1.0. Prior to the appeal, the assessors’ judgments were the accuracy standard. The more documents that were appealed, the more likely that assessor would be to have a low score. Their score could not increase from the appeals process. So, whether an assessor scored well or scored poorly was determined almost completely by the number of documents that were appealed—by how much the target was moved.
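The effect of the appeals on an assessor's score can be sketched as follows. This is a simplified model of the scoring described above, not TREC's actual scoring code: it assumes the final standard equals the assessor's own labels except where the Topic Authority overturned them, and the document IDs are hypothetical.

```python
def score_assessor(assessor_labels, overturned):
    """Score an assessor against a standard that equals the assessor's own
    labels except for documents the Topic Authority overturned on appeal.

    assessor_labels: dict of doc_id -> True (responsive) / False
    overturned: set of doc_ids whose label the Topic Authority reversed
    """
    tp = fp = fn = 0
    for doc, said_responsive in assessor_labels.items():
        final = (not said_responsive) if doc in overturned else said_responsive
        if said_responsive and final:
            tp += 1
        elif said_responsive and not final:
            fp += 1
        elif not said_responsive and final:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

# Hypothetical bin: the assessor called d0-d3 responsive and d4-d9 not.
labels = {f"d{i}": i < 4 for i in range(10)}

print(score_assessor(labels, overturned=set()))         # no successful appeals
print(score_assessor(labels, overturned={"d0", "d5"}))  # two successful appeals
```

With no successful appeals the assessor scores 1.0 on both measures by construction; every overturned decision can only pull the score down.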

Because of the (negative) correlation between the performance of the computer system and the performance of the assessor, their performances were not independent. Therefore, a simple statistical comparison between the performance of the assessor and the performance of the computer system is not valid.

Even if the comparison were valid, we still have other problems. The different TREC tasks involved different topics. Some were presumably easier than others. The assessors who made the decisions may have varied in ability, but we have no information about which were more skillful. The bins or batches that were reviewed probably differed among themselves in the number of responsive documents each contained. Because only one assessor judged each document, we don’t know whether the differences in accuracy (as judged by topic authority) were due to differences in the documents being reviewed or to differences in the assessors.

Issue: Reviewer quality is paramount

Webber found that some assessors performed better than others. Continuing the argument of the previous section, though, we cannot infer from this that some assessors were more talented, skilled, or better prepared than others.

It is entirely circular to say that some assessors were more skillful than others and so were more accurate because the only evidence we have that they were more skillful is that they were measured to be more accurate. You cannot explain an observation (higher accuracy) by the fact that you made the observation (higher accuracy). It cannot be both the cause and the effect.

Whether the source of variation in performance among the assessors was due to variation in the number or difficulty of the decisions or was due to differences in assessor talent, you cannot simply pick the best of one thing (the assessor performance) and compare it to the average of another (the computer assisted review). The computer performance is based on all of the documents, each assessor’s performance is based on only 500 documents. The computer performance was a representational equivalent of the average of all assessor judgments. Just by chance, some assessors will be higher than others. In fact, about half of the assessors should, just by chance, score above and about half should score below the average. But, we have no way to determine whether those selected reviewers scored high because of chance or because of some difference in skill. We measured them only once and we cannot use the fact that they scored well to explain their high score. We need some independent evidence.
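A small simulation illustrates how much spread chance alone can produce. Every simulated assessor below has exactly the same underlying accuracy; the 70% accuracy, the 50 assessors, and the random seed are arbitrary assumptions chosen for illustration.

```python
import random

def simulate_assessor_scores(num_assessors=50, docs_per_bin=500,
                             true_accuracy=0.70, seed=0):
    """Simulate equally skilled assessors, each judging one bin of documents.
    Each decision is independently correct with probability true_accuracy."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_assessors):
        correct = sum(rng.random() < true_accuracy for _ in range(docs_per_bin))
        scores.append(correct / docs_per_bin)
    return scores

scores = simulate_assessor_scores()
mean_score = sum(scores) / len(scores)
above_average = sum(score > mean_score for score in scores)

print(f"lowest {min(scores):.3f}, mean {mean_score:.3f}, highest {max(scores):.3f}")
print(f"{above_average} of {len(scores)} assessors scored above the mean")
```

Picking the top scorer from such a run and comparing it with an average would credit that assessor with skill that, by construction, does not differ from anyone else's.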

The best reviewers on each topic could have been the best because they got lucky and got an easy bin, or they got a bin with a large number of responsive documents, or just by chance. Unless we disentangle these three possibilities, we cannot claim that some reviewers were better or that reviewer quality matters. In fact, these data provide no evidence one way or the other relative to these claims.

In any case, the question in practice is not whether some human somewhere could do better than some computer system. The question is whether the computer system did well enough, or whether there is some compelling reason to force parties through the expense of using superior human reviewers. Even if some human reviewers could consistently do better than some machine, this is not the current industry standard.

In some sense, the ideal would be for the senior attorney in the case to read every single document with no effect of fatigue, boredom, distraction, or error. Instead, the current practice is to farm out first pass review to either a team of anonymous, ad hoc, or inexpensive reviewers or to search by keyword. Even if Losey were right, the standard is to use the kind of reviewers that he says are not up to the task.

Issue: Human review is good for small volumes, but not large ones

This claim may also be true, but the present data do not provide any evidence for or against it. The evidence that Losey cites in support of this claim is the same evidence that, I argued, failed to show that human review is better than machine review. It requires the same circular reasoning. We do not know from Webber’s analysis whether some reviewers were actually better than others, only that on this particular task, with these particular documents, they scored higher. Similarly, we don’t know from these data that doing only 500 documents is accurate, whereas doing more leads to inaccuracy. We don’t even know in the tested circumstance whether performance would decrease over that number. All bins were about the same size, so there is no way to test the hypothesis with these data that performance decreases as the set size rises above 500. It just was not tested.

When confronted with a small (e.g., several hundred) versus a large volume of documents to review, we can expect that fatigue, distraction, and other human factors will decrease accuracy over time. Based on other evidence from psychology and other areas, it is likely that performance will decline somewhat with larger document sets, but there is no evidence here for that.

If this were the only factor, we could arrange the situation so that reviewers only looked at 500 documents at a time before they took a break.

Issue: Random samples with 95% confidence levels +/- 2% confidence intervals are unrealistically high.

It’s not entirely clear what this claim means. On the one hand, there is a common misperception of what it means to have a 95% confidence level. Some people mistakenly assume that the confidence level refers to the accuracy of the results. But the confidence level is not the same thing as the accuracy level.

The confidence level refers to our belief in the measurement’s reliability; it does not tell us what we are measuring. The confidence interval (e.g., ±2%) is a prediction of how precisely our sample estimates the true value of the whole population. Put simply, a 95% confidence level means that in 95% of the experiments conducted at this confidence level, we expect to find the true population value within the range specified by the experiment’s confidence interval.

For example, a recent CNN/Time Magazine poll found that 37% of likely Republican primary voters in South Carolina supported Mitt Romney, based on a survey sample size of 485 likely Republican primary voters. With a 95% confidence level, these results are accurate to within ±4.5 percentage points (the confidence interval). It does not mean that Romney is supported by 95% of the voters or that Romney has a 95% probability of winning. It means that if the election were held today, the survey predicts that about 37% of the voters, give or take 4.5 percentage points, would vote for Romney. I suspect that Losey means something different.
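The poll's margin can be approximately reproduced with the usual normal-approximation formula. That the pollsters used the conservative worst-case proportion of 0.5 is my inference from the arithmetic, not something stated in the poll.

```python
import math

def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a
    sample proportion. proportion=0.5 gives the widest (most conservative)
    margin; z=1.96 corresponds to a 95% confidence level."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

worst_case = margin_of_error(485)                    # assumes proportion = 0.5
at_observed = margin_of_error(485, proportion=0.37)  # uses the observed 37%

print(f"worst case:  +/- {worst_case * 100:.2f} points")
print(f"at 37%:      +/- {at_observed * 100:.2f} points")
```

The worst-case figure comes out at about 4.45 points, consistent with the reported ±4.5, and in neither calculation does the 95% confidence level say anything about how many voters support the candidate.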

I suspect that he is referring to the relatively weak levels of agreement found by the TREC studies and others. If our measurement is not very precise, then we can hardly expect that our estimates will be more precise. This concern, though, rests on obtaining the measurements in the same way that TREC has traditionally done it. If we can reduce the variability of our training set and our comparison set, we can go a long way toward making our estimates more precise.

In any case, many relevant estimates do not depend on the accuracy of a group of human assessors. In practice, for example, our estimates of such factors as the prevalence of responsive documents can rest on the decisions made by a single authoritative individual, perhaps the attorney responsible for signing the Rule 26(g) declaration. Those estimates can be made precise enough with reasonable sample sizes.

Conclusion

The main problem with Losey’s discussion derives from taking the results reported by Webber and Voorhees as an immutable fact of information retrieval. The observation that there is only moderate agreement among independent assessors is a description of the human judgments in these studies; it says nothing about any machine systems used to predict which documents are responsive or not. The variability that leads to this moderate level of agreement can be reduced, and when it is, the performance of machine review can be more accurately measured.

The second problem derives from the difficulty of attributing causation in experiments that were not designed to attribute such causation. Within the data analyzed by Webber, for example, there is no way to distinguish the effects of chance from the effects due to assessor differences.

None of these comments should be interpreted as an indictment of TREC, Webber, or Losey. Science proceeds when people with different perspectives have the chance to critique each other’s work and to raise questions that may not have previously been considered.

None of these comments is intended to argue that predictive coding is fundamentally inaccurate. Rather, my main argument is that the studies from which these data were derived were not designed to answer many of the questions we would like to ask of them. They do not speak against the effectiveness of predictive coding, nor do they speak in favor of it. Other studies will need to be conducted that address these questions specifically and are designed to answer them.

Finally, even if we disagree about the effectiveness of predictive coding relative to human performance, there is little disagreement any more about the effectiveness of a purely human linear review or of using a simple keyword search to identify responsive documents. The cost of human review continues to skyrocket as the volume of documents that must be considered increases. In many cases, human review is simply impractical within the cost and time constraints of the matter. Under those circumstances, something else has to be done to reduce that burden. That something else seems to be predictive coding, and the fact that we can measure its accuracy only adds to its usefulness.