Tuesday, July 23, 2013

Further Adventures in Predictive Coding

The commentators on my recent post about Adventures in Predictive Coding raise a number of interesting issues that I would like to explore further.  I have already responded to Webber’s comments.  I would like to comment further on the statistics suggested by Cormack and visit a few issues raised by Britton.

As a reminder, my post was in response to Ralph Losey’s blogs in which he chronicled how he used two different methods to identify the responsive documents from the Enron set. My post demonstrated how, statistically, despite his substantial effort, nothing in his exercises would lead us to conclude definitively that either method was better than the other.

Cormack


Gordon Cormack has argued that a statistical test with more power than the one I used might be able to find a significant difference between Losey’s two methods.  He is correct that, in theory, some other test might indeed find significance.  He is incorrect, however, about the suitability of the test he chose, which he says is the “sign test,” and about how he applied it.  Although it is a fairly minor point, given that neither method yielded a statistically significant number of responsive documents relative to simply selecting documents at random, it may still be instructive to discuss the statistical methods that were used to compare the processes.

Each statistical hypothesis test is based on a certain set of assumptions.  If those assumptions are not met, then the test will yield misleading or inappropriate inferences.  The numbers may look right, but if the process that generated those numbers is substantially different in a statistical sense from the one assumed by the test, then erroneous conclusions are likely.

According to Wikipedia, the sign test “can be used to test the hypothesis that there is ‘no difference in medians’ between the continuous distributions of two random variables X and Y, in the situation when we can draw paired samples from X and Y.”  To translate that, these are the assumptions of the sign test:

  1. Compare two measures, one called X and one called Y.
  2. Compare paired items.  At a minimum, the two measurements have to be of items that can be matched between measurements.  That is, for each item in measurement 1, there is one and only one item in measurement 2 against which it is compared.  This pairing must be specified before the measurements are taken.  Usually these are the same items, measured two times, once before a treatment and once after a treatment. 
  3. Compare the two sets of measurements using their median.  The median is the middle score, where half of the scores are above the median and half are below. 
  4. The two measures must have continuous distributions.  In order to compare medians, some scores have to be higher than others.  For each of the items being measured (twice), we have to be able to decide whether the second measurement is greater than the first, less than the first, or equal to the first.
  5. The sets of measurements must be random variables.  They have to be derived from a random sampling process.  Typically this means that the items that are being measured twice are selected through random sampling.
  6. The number of observations must be fixed prior to computing the statistic.  The binomial rule applies within a particular sample of a particular size.
   
For each pair of measures, we examine whether the second measurement is less than the first, greater than the first or equal to the first. In other words, we count the number of times the comparison comes up with a positive sign, a negative sign, or no sign.  That’s why the test is called a sign test. 

Under the null hypothesis, the proportion of positive signs should be about 0.5 and we use the binomial distribution to assess that hypothesis.  The probability of a certain number of plus or minus signs can be found by the binomial rule based on a probability equal to 0.5 and a certain number of observations.
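
To make the mechanics concrete, here is a minimal sketch of the binomial arithmetic behind a sign test, assuming Python 3.8+ for math.comb.  The plus/minus counts are placeholders, not figures from Losey’s exercises, and, as argued below, the assumptions needed to interpret the resulting p-value are not met in Cormack’s comparison.

    from math import comb

    def sign_test_p_value(plus: int, minus: int) -> float:
        """Two-sided sign-test p-value.  Ties are dropped; the remaining
        plus and minus signs are compared against a binomial(n, 0.5)."""
        n = plus + minus
        k = min(plus, minus)
        # Probability of a split at least this lopsided under p = 0.5,
        # doubled for a two-sided test and capped at 1.0.
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    # Placeholder counts: 56 "plus" signs and 44 "minus" signs.
    print(sign_test_p_value(56, 44))  # about 0.27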

Cormack’s analysis meets the first and second requirements.  The same documents can be compared in both exercises. 

It does not meet the third or fourth requirements.  There is no median measurement to deal with in Cormack’s analysis.  The two measurements (multimodal vs. monomodal) do not yield a score, only a yes/no decision, so we cannot judge whether one is greater than the other.  Without a pair of continuous measurements, the test reduces to a simple binomial.  If the other assumptions of a binomial were met, we could still use that to assess the significance of the comparison.

The analysis does not meet requirement 5.  There is nothing random about the selection of these documents.  They are not a sample from a larger population; they are the entire population of responsive documents on which the two exercises disagreed.

Cormack’s analysis also fails requirement 6.  There is no sample size.  The binomial distribution assumes that we have a sample consisting of a fixed number of observations.  In Cormack’s analysis, on the other hand, the number of observations is determined by the number of documents observed to be responsive by one method or the other.  This is circular.  It’s not a sample because the documents were not randomly selected, and the binomial distribution does not apply.

Because Cormack's analysis does not meet the prerequisites of the sign test or the binomial, these tests cannot be used to assess significance. The numbers have the right appearance for computing a proportion, but unless the proportion was created in the way assumed by the statistical test, it cannot be used validly to determine whether the pattern of results is different from what you would expect by chance. 

The same kind of statistical process must produce both the observations and the chance predictions.  Nominally, we are testing whether a random process with one probability (0.5) is different from a random process with a different probability (0.56).  In reality, however, we are comparing a random process (like a coin flip) against a different, non-random process.  If it is not a random process generating the numbers, then the binomial rule cannot be used to estimate its distribution.  If it is not a sample drawn from a larger population, then there is no distribution to begin with.

We may know the probability that a flipped fair coin would come up with the corresponding proportion of heads, but, because Cormack’s analysis fails to meet the prerequisites of a binomial, that probability is simply irrelevant.  We know how coin flips should be distributed according to chance, but we do not have a distribution for the numbers of documents found exclusively by the multimodal and monomodal exercises.  They’re just numbers.

To be sure, the multimodal approach identified numerically more documents as responsive than did the monomodal approach.  We cannot say that that difference is reliable (significant) or not, but if it were, does that mean that the multimodal method was better?  Can we attribute that difference to the methods used? Not necessarily.  A process that claims that more documents are responsive, whether correct or incorrect, would show a similar difference. A process that was more stringent about what should be called a responsive document would also yield a comparable difference.

As Losey noted, his criteria for judging whether a document was responsive may have changed between the multimodal and the monomodal exercises. “The SME [Losey] had narrowed his concept of relevance [between the two exercises].”  The only measure we have of whether a document was truly responsive was Losey’s judgment.  If his standard changed, then fewer documents were relevant in the second exercise, so the fact that fewer were predicted to be responsive could be simply a product of his judgment, not of the method.  So, even if we found that the difference in the number of documents was significant, we still cannot attribute that difference with confidence to the method used to find them. 

Knowing only the number of responsive documents found, we cannot know whether that difference was due to a difference in method or a difference in the standard for categorizing a document as responsive.

Britton


Gerard Britton raises the argument that Recall is not a good measure of the completeness of predictive coding.  He notes, correctly, that documents may differ in importance, but then somehow concludes that, therefore, Recall is uninformative.  He offers no alternative measures, but rather seems to believe that human reviewers somehow will automagically be better than the computer at recognizing more-responsive documents.  He claims without evidence, and, I would argue, contrary to the available evidence, that humans and computers categorize documents in fundamentally different ways.

Although it is possible that humans and computers make systematically different judgments about the responsiveness of documents, there is no evidence to support this claim.  It is a speculation.  Britton treats this purported difference as a fact and then claims that it is unacceptable that this supposed fact is ignored.  On Britton’s account, for computers to achieve the same or higher accuracy than people, they would have to be more accurate at finding what he calls the “meh” documents and poorer than human reviewers at finding the “hot” documents.  TREC has some evidence to the contrary, which Cormack mentioned. 

If the ability of a system to predict hot vs meh documents is a concern, then one can examine this capability directly.  One could train a predictive coding system on just hot documents, for example, and examine its performance on this set.  One could separately train on documents that are not so hot, but still responsive.  One would have to also study how teams of human reviewers do on these documents and I think that the results would not support Britton’s suppositions.

The potential difference in standards for accepting whether a document is in the positive or negative group does not invalidate the measure of Recall.  Recall is just a count (a proportion, actually) of the documents that meet one’s criteria.  What those criteria are is completely up to the person doing the measurement.  To say that Recall is invalidated because there can be multiple criteria is a fundamental misunderstanding.

Predictive coding is based on human judgment.  The essence of predictive coding is that a human subject matter expert provides examples of responsive and non-responsive documents.  An authoritative reviewer, such as Losey, determines what standards to apply to categorize the documents.  The computer analyzes the examples and extracts mathematical rules from them, which it then applies to the other documents in its population.  Furthermore, the accuracy, for example, Recall, of the categorization is judged by subject matter experts as well.  If the documents are junk, then they should not – or would not – be categorized as responsive by the subject matter expert during training or during assessment of accuracy. 

In a recent matter, we had access to the judgments made by a professional team of reviewers following a keyword culling process.  Of the documents identified by the team as responsive, the primary attorneys on the case selected only 30% as meeting their standard for responsiveness. 

By comparison, in the Global Aerospace matter (Global Aerospace, Inc., et al.  v. Landow Aviation, L.P., et al., Loudon Circuit Court Case #CL 61040 (VA)), we found that about 81% of the documents recommended by the predictive coding system were found to be responsive. 

In recent filings in the Actos matter (In Re: Actos (Pioglitazone) Product Liability Litigation (United States District Court of Louisiana MDL No.6:11-md-2299), available from PACER), an estimated 64% of the predicted responsive documents (over a large score range) were found to be responsive.

Based on evidence like this, it is extremely difficult to claim that humans are good, but only too expensive.  The evidence is to the contrary.  Unless special steps are taken, human review teams are inconsistent and of only moderate accuracy. 

Britton claims that the risk of finding more documents with lower responsiveness and fewer documents with high responsiveness is greater when using quantification than when using human readers. As mentioned, this concern is unfounded and unsupported by any evidence of which I am aware.  A good manual review process may yield a good set of documents, but there is absolutely no evidence that even this good review would be better than a good predictive coding process.  To the contrary, there is evidence that a well-run predictive coding process does better than a well-run manual review process.

If there is no evidence that manual review is superior to predictive coding, then Britton’s suggestions of alternative review methodologies are not likely to be of any use.  They are based on false premises and fundamental misunderstandings.  They are likely to cost clients a great deal more money and yield poorer results for both parties.

eDiscovery practitioners may disagree about some of the methods we use to conduct and analyze predictive coding, but the bottom line seems clear.  Predictive coding, when done well, is effective for identifying responsive and non-responsive documents.  It is less expensive, more accurate, and typically faster than other forms of first-pass review.  The differences among methods may yield differences in accuracy or speed, but these differences are small when compared to manual review.  Predictive coding is a game-changer in eDiscovery.  It is here to stay and likely to take on an ever-increasing portion of the burden of eDiscovery.


Monday, July 1, 2013

Adventures in Predictive Coding



Ralph Losey, in a series of blogs, has painstakingly chronicled how he used two different methods, which he called a “multimodal hybrid” approach and a “monomodal hybrid” approach, to identify the responsive documents from the Enron set.  He recently summarized his results here and here.

His descriptions are very detailed and often amusing.  They provide a great service to the field of eDiscovery.  He wants to argue that the multimodal hybrid approach is the better of the two, but his results do not support this conclusion.  In fact, his two approaches show no statistically significant differences.  I will explain.

The same set of 699,082 documents was considered in both exercises, and both started with a random sample to identify a prevalence rate — the proportion of responsive documents.  In both exercises the random sample estimated that about a quarter of a percent or less of the documents were responsive (0.13% vs. 0.25% for the multimodal and monomodal exercises respectively, corresponding to estimates of 928 vs. 1773 responsive documents in the whole collection).  Combining these assessments of prevalence with a third one, Losey estimates that 0.18% of the documents were responsive.  That’s less than one fifth of one percent, or 1.8 responsive documents per thousand.  In my experience, that is a dramatically sparse set of documents.

These are the same documents in the two exercises, so it is not possible that they actually contained different numbers of responsive documents.  Was the almost 2:1 prevalence difference between the two exercises due to chance (sampling variation), was it due to changes in Losey’s standards for identifying responsive documents, or was it due to something else?  My best guess is that the difference was due to chance.

By chance, different samples from the same population can yield different estimates.  If you flip a coin, for example, on average, half of the flips will come out heads and half will come out tails.  The population consists of half heads and half tails, but any given series of flips may have more or fewer heads than another.  The confidence interval tells us the range of likely proportions.  Here are two samples from a series of coin flips. 

H T H T H H H T H H

H T T H H T T H T T

The first sample consisted of 7 Heads and 3 Tails.  The second sample consisted of 4 Heads and 6 Tails.  Were these samples obtained from different coins?  One sample is 70% Heads and the other is 40% Heads.  I happen to know that the same coin was used for all flips, and that, therefore, we can attribute the difference to chance.  With samples of 10 flips, the 95% confidence interval extends from 0.2 (2 Heads) to 0.8 (8 Heads).  Although these two samples resulted in numerically different values, we would not be justified in concluding that they were obtained from coins with different biases (different likelihoods of coming up Heads).  Statistical hypothesis testing allows us to examine this kind of question more systematically.
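
For readers who want to check the coin-flip arithmetic themselves, here is a minimal sketch (again assuming Python 3.8+ for math.comb); it reproduces the 2-to-8 Heads range quoted above and shows that 7 Heads in 10 flips is unremarkable for a fair coin.

    from math import comb

    def binom_pmf(n: int, k: int, p: float = 0.5) -> float:
        """Probability of exactly k Heads in n flips of a coin with bias p."""
        return comb(n, k) * p ** k * (1 - p) ** (n - k)

    n = 10
    pmf = [binom_pmf(n, k) for k in range(n + 1)]

    print(sum(pmf[2:9]))     # P(2 <= Heads <= 8) ~ 0.979: the 95% range
    print(sum(pmf[3:8]))     # P(3 <= Heads <= 7) ~ 0.891: too narrow for 95%
    print(2 * sum(pmf[7:]))  # two-sided p-value for 7 Heads ~ 0.34, not significant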

Statistical significance means that the difference in results is unlikely to have occurred by chance. 

I analyzed the prevalence rates in Losey’s two exercises to see whether the observed difference could reasonably be attributed to chance variation.  Both are based on a large sample of documents.  Using a statistical hypothesis test called a “Z-test of proportions,” it turns out that the difference is not statistically significant.  The difference in prevalence estimates could reasonably have come about by chance.  Two random samples from the same population could, with a reasonably high likelihood, produce a difference this large or larger by chance.
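
A minimal sketch of a two-proportion Z-test of the kind I used appears below.  The responsive counts and sample sizes are placeholders chosen only to yield prevalences near 0.13% and 0.25%; they are not Losey’s actual sample figures, which readers can substitute from his posts.

    from math import erf, sqrt

    def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
        """Z-test for the difference between two independent proportions,
        using a pooled estimate of the common proportion under the null."""
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
        return z, p_value

    # Placeholder samples: 2 responsive out of 1,500 (~0.13%) versus
    # 4 responsive out of 1,600 (~0.25%).
    print(two_proportion_z_test(2, 1500, 4, 1600))  # z ~ -0.74, p ~ 0.46

With counts this small the normal approximation is rough, and an exact test would be a sensible cross-check, but the same calculation applies equally to the Elusion comparisons discussed below.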

By ordinary scientific standards, if we want to conclude that one score is greater than another, we need to show that the difference between the two scores is greater than could be expected from sampling variation alone.  As we know, scores derived from a sample always have a confidence interval or margin of error around them.  With a 95% confidence level, the confidence interval is the range of scores where, 95% of the time, the true population level for that score (the true proportion of responsive documents in the whole collection) would be found.  The so-called null hypothesis is that the real difference between these two exercises is 0.0 and the observed difference is due only to sampling error, that is, to chance variations in the samples.  The motivated hypothesis is that the difference is real.

  • Null Hypothesis: There is no reliable difference between scores
  • Motivated Hypothesis: The two scores differ reliably from one another
  
Under the null hypothesis, the difference between scores also has a confidence interval around 0.0.  If the size of the difference is outside of the confidence interval, then we can say that the difference is (statistically) significant.  The probability is less than 5% that the difference was drawn from a distribution centered around 0.0.  If this difference is sufficiently large, then we are justified in rejecting the null hypothesis.  The difference is unexpected under the null hypothesis.  Then we can say that the difference is statistically significant or reliable.

On the other hand, if the magnitude of the difference is within the confidence interval, then we cannot say that the difference is reliable.  We fail to reject the null hypothesis, and we may say that we accept the null hypothesis.  Differences have to be sufficiently large to reject the null hypothesis or we say that there was no reliable difference.  “Sufficiently large” here means “outside the confidence interval.”  The bias in most of science is to assume that the null hypothesis is the better explanation unless we can find substantial evidence to the contrary.

Although the difference in estimated prevalences for the two exercises is numerically large (almost double), in fact, my analysis reveals that differences this large could easily have come about by chance due to sampling error.  The difference in prevalence proportions is well within the confidence interval assuming that there is really no difference.  My analysis does not prove that there was no difference, but it shows that these results do not support the hypothesis that there was a difference.  The difference in estimated prevalence between the two exercises is potentially troubling, but the fact that it could have arisen by chance means that our best guess is that there really was no systematic difference to explain. 

We knew from the fact that the same data were used in both exercises that we should not expect a real difference in prevalence between the two assessments, so this failure to find a reliable difference is consistent with our expectations.  On the other hand, Losey conducted the exercises with the intention of finding a difference in the accuracy of the two approaches.  We can apply the same logic to looking for these differences.

We can assess the accuracy of predictive coding with several different measures.  The emphasis in Losey’s approaches is to find as many of the responsive documents as possible.  One measure of this goal is called Recall.  Of all of the responsive documents in the collection, how many were identified by the combination of user and predictive coding?  To assess Recall directly, we would need a sample of responsive documents.  This sample would have to be of sufficient size to allow us to compare Recall under each approach.  Unfortunately, those data are not available.  We would need a sample of, say, 400 responsive documents to estimate Recall directly.  We cannot use the predictive coding to find those responsive documents, because that is exactly what we are trying to measure. We need to find an independent way of estimating the total number of responsive documents.

We could try to estimate Recall from a combination of our prevalence estimate and the number of responsive documents identified by the method, but since the estimate of prevalence is so substantially different, it is not immediately obvious how to do so.  If the two systems returned the same number of documents, our estimate of recall would be much lower for the monomodal method than for the multimodal method because the prevalence was estimated to be so much higher for the monomodal method.

Instead, I analyzed the Elusion measures for the two exercises.  Elusion is a measure of categorization accuracy that is closely (but inversely) related to Recall.  Specifically, it is a measure of the proportion of false negatives among the documents that have been classified as non-responsive (documents that should have been classified as responsive, but were incorrectly classified).  An effective predictive coding exercise will have very low false negative rates, and therefore very low Elusion, because all or most of the responsive documents have been correctly classified.  Low Elusion relative to Prevalence corresponds to high Recall.
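
To spell out that inverse relationship, here is a back-of-the-envelope sketch of how a Recall estimate can be assembled from Prevalence and Elusion.  The figures in the example are invented for illustration, and the calculation deliberately ignores the sampling error in both estimates, which is precisely what is at issue in these exercises.

    def estimated_recall(prevalence: float, elusion: float,
                         total_docs: int, docs_classified_negative: int) -> float:
        """Treats prevalence * total_docs as the number of truly responsive
        documents and elusion * docs_classified_negative as the number of
        false negatives; Recall is the share of responsive documents not missed."""
        responsive_total = prevalence * total_docs
        false_negatives = elusion * docs_classified_negative
        return 1.0 - false_negatives / responsive_total

    # Invented illustration: 1,000,000 documents, 0.5% prevalence,
    # 990,000 documents classified non-responsive, 0.05% Elusion.
    print(estimated_recall(0.005, 0.0005, 1_000_000, 990_000))  # ~0.90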

Because both exercises involved the same set of documents, their true (as opposed to their observed) Prevalence rates should be the same.  If one process was truly more accurate than the other, then they should differ in the proportion of responsive documents that they fail to identify.  By his prediction, Losey’s multimodal method should have lower Elusion than his monomodal method.  That seems not to be the case.

Elusion for the monomodal method was numerically lower than for the multimodal method.  A Z-test for the difference in the two Elusion proportions (0.00094 vs. 0.00085) also fails to reach a level of significance.  The analysis reveals that the difference between these two Elusion values could also have occurred by chance.  The proportions of false negatives in the two exercises were not reliably different from one another.  Contrary to Ralph’s assertion, we are not justified in concluding from these exercises that there was a difference in their success rates.  So his claim that the multimodal method is better than the monomodal method is unsupported by these data.

Finally, I compared the prevalence in each exercise with its corresponding Elusion proportion, again using a Z-test for proportions.  If predictive coding has been effective, then we should observe that Elusion is only a small fraction of prevalence.  Prevalence is our measure of the proportion of documents in the whole collection that are responsive.  Elusion is our measure of the proportion of actually responsive documents in the set that have been categorized as non-responsive.  If we have successfully identified the responsive documents, then they would not be in the Elusion set, so their proportion should be considerably lower in the Elusion set than in the initial random sample drawn from the whole collection.

Losey would not be surprised, I think, to learn that in the monomodal exercise, there was no significant difference between estimated prevalence (0.0025) and estimated Elusion (0.00085).  Both proportions could have been drawn from populations with the same proportion of responsive documents.  The monomodal method was not successful, according to this analysis, at identifying responsive documents.

What might be more surprising, however, was that there was also no significant difference between prevalence and Elusion in the multimodal exercise (0.0013 vs. 0.00094).  Neither exercise found a significant number of responsive documents.  There is no evidence that predictive coding added any value in either exercise.  Random sampling before and after the exercises could have produced differences larger than the ones observed without employing predictive coding or any other categorization technique in the middle.  Predictive coding in these exercises did not remove a significant number of responsive documents from the collection.  A random sample was just as likely to contain the same number of responsive documents before predictive coding as after predictive coding.

Far from concluding that the multimodal method was better than the monomodal method, these two exercises cannot point to any reliable effect of either method.  Not only did the methods not produce reliably different results from one another, but it looks like they had no effect at all.  All of the differences between exercises can be attributed to chance, that is, sampling error.  We are forced to accept the null hypotheses that there were no differences between methods and no effect of predictive coding.  Again, we did not prove that there were no differences, only that there were no reliable differences to be found in these exercises.

These results may not be widely indicative of those that would be found in other predictive coding uses.  Other predictive coding exercises do find significant differences of the kind I looked for here.

From my experience, this situation is an outlier.  These data may not be representative of typical predictive coding problems; for example, the responsive documents here are extremely sparse.  Prevalence near zero left little room for Elusion to be lower.  Less than a quarter of a percent of the documents in the collection were estimated to be responsive.  In the predictive coding matters I have dealt with, Prevalence is typically considerably higher.  These exercises may not be indicative of what you can expect in other situations or with other predictive coding methods.  These results are not typical.  Your results may vary.

Alternatively, it is possible that predictive coding worked well, but that we do not have enough statistical power to detect it.  The confidence interval of the difference, just like any other confidence interval, narrows with larger samples.  It could be that larger samples might have found a difference.  In other words, we cannot conclude that there was no difference; the best we can say is that there was insufficient evidence to conclude that there was one. 

But, if we cannot be confident of a difference, we cannot be confident that one method was better than the other.  At the same time, we also cannot conclude that some other exercises might not find differences.  Accepting the null hypothesis is not the same as proving it.

We cannot conclude that predictive coding or the technology used in these exercises does not work.  Many other factors could affect the efficacy of the tools involved. 

For the predictive coding algorithms to work, they need to be presented with valid examples of responsive and non-responsive documents.  The algorithms do not care how those examples were obtained, provided that they are representative.  The most important decisions, then, are the document decisions that go into making the example sets. 

Losey’s two methods differ (among other things) in terms of who chooses the examples that are presented to the predictive coding algorithms.  Losey’s multimodal method uses a mix of machine and human selection.  His monomodal method, which he pejoratively calls the “Borg” method, has the computer select documents for his decision.  In both cases, it was Losey making the only real decisions that the algorithms depend on — whether documents are responsive or non-responsive.  Losey may find search decisions more interesting than document decisions, but search decisions are secondary and a diversion from the real choices that have to be made.  He may feel as though he is being more effective by selecting the document to judge for responsiveness, but that feeling is not supported by these results.  Evaluating his hypotheses will have to await a situation where we can point to reliable differences in the occurrence of responsive documents before and after running predictive coding and reliable differences between the results of the two methods.

Predictive coding is not the only way to learn about the documents.  eDiscovery often requires exploratory data analysis, to know what we have to work with, what kind of vocabulary people used, who the important participants are, and so on.  These are questions that are not easily addressed with predictive coding.  We need to engage search and other analytics to address these questions.  They are not a substitute for predictive coding, but a necessary part of preparing for eDiscovery.  Predictive coding is not designed to replace all forms of engagement with the data, but rather to make document categorization easier, more cost effective, and more accurate.

Not every attorney is as skilled at or as interested in searching as Losey is.  However the example documents are chosen, the critical judgments are the legal decisions about whether specific documents are responsive or not.  Those judgments may not be glamorous, but they are critical to the justice system and to the performance of predictive coding.  Despite rather substantial effort, nothing in his exercises would lead us to conclude that either method was better than the other.


Friday, June 28, 2013

Predictive Coding: Debunking a Sample of Myths about Random Sampling



To read some of the eDiscovery blogs that have been posted recently, you might think that there has been a lot of discussion about the “right way” to do predictive coding.  Much of this analysis is based on the writers’ intuitions about statistical claims.  Generally, these are lawyers attempting to make statistical claims as to which methods are better than others.  As the Nobel Laureate Daniel Kahneman and Amos Tversky reported some years ago, these subjective analyses don’t always match up with reality. In the context of eDiscovery, the mismatch of intuition and statistical reality can lead to incorrect assumptions or myths about random sampling, a sample of which we can consider here.

Predictive coding systems learn to distinguish documents based on a set of categorized examples.  These examples must have three characteristics in order to give accurate results: validity, consistency, and representativeness. 

  • Validity: The example document decisions must be valid—the sample documents identified as responsive vs. non-responsive must actually be responsive and non-responsive, respectively. 
  • Consistency: The categorization must be consistent—the same standards must apply from one example to the next. 
  • Representativeness: The documents must be representative of the distinction in the entire population of all documents.  
Poor results follow if any of these requirements is unmet.

Random sampling is one of several methods that can be used to generate these example documents and meet these requirements.  Other methods are sometimes called judgmental samples, purposive samples, or seed sets. The training examples are developed, usually on an ad hoc basis, using keyword searching or other techniques.  Using expertise, search terms, or some other method, one or more attorneys selects sample documents to provide the training examples.  These techniques may or may not meet the three requirements of effective predictive coding.

Random sampling means that every document in a collection or population has an equal chance of participating in the sample.  Random sampling does not tell you what population to start with, and this is a source of some serious misunderstanding.  In the jargon of statistics, the population is called the “sampling frame.”  How you choose the sampling frame can have a profound impact on the time and effort it takes to do predictive coding.  Random sampling is not a substitute for intelligent document collection.

Random sampling and the statistics that surround it have been in active use for well over a century.  Random sampling is the basis for almost all scientific work, because it is the only way to make valid inferences from the properties of the sample to the properties of the population from which that sample was drawn.  Only random sampling provides a means of predicting from the sample the accuracy that can be expected when estimating the properties of the population (such as prevalence).

Random sampling is recognized by the courts as a useful and cost-effective method to gather information about large groups of items (including documents and people) where it would be impractical or burdensome to examine each one individually, as noted in the Federal Judicial Center’s Reference Manual on Scientific Evidence: Reference Guide on Statistics.

Random sampling is used widely in industrial quality control, in forecasting elections, and, of course, throughout the sciences.  A definitive book on sampling is Sampling Techniques by William G. Cochran (1977, 2nd Ed. Wiley). 

Random sampling is a critical part of our evaluation of the success of predictive coding and other processes.  When we measure the accuracy of predictive coding, we typically assess a randomly selected set of documents for responsiveness.  Random selection is the best way to pick an unbiased representative sample, and only with a representative sample can you justify inferences from the sample to the population from which the sample was drawn.  Random samples are the best way to convince opposing counsel that your assessment is based on a fair analysis of the available documents.  Just as you would not want to play against an opponent with loaded dice, you would not want to base your decisions on a biased sample of documents.  Random sampling assures you and your opponents of an unbiased assessment.
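
As a concrete illustration of that kind of assessment, here is a minimal sketch of drawing a simple random sample and estimating prevalence with a 95% margin of error.  The names documents and reviewer_judgment are hypothetical stand-ins for the collection and the subject matter expert’s coding decision.

    import random
    from math import sqrt

    def sample_prevalence(collection, sample_size, is_responsive, seed=None):
        """Draw a simple random sample (every document has an equal chance of
        selection), apply the reviewer's judgment to each sampled document,
        and return the estimated prevalence with a 95% margin of error
        (normal approximation)."""
        rng = random.Random(seed)
        sample = rng.sample(collection, sample_size)
        p = sum(1 for doc in sample if is_responsive(doc)) / sample_size
        margin = 1.96 * sqrt(p * (1 - p) / sample_size)
        return p, margin

    # Hypothetical usage:
    # prevalence, margin = sample_prevalence(documents, 400, reviewer_judgment)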

Myth: Random sampling requires you to include all of your documents without any constraint


Even on the surface, this myth is obviously wrong.  There are always constraints in how you collect documents for a matter.  You typically don’t collect documents from the Library of Congress, for example, or from other companies.

Some documents are worthy of being considered for possible responsiveness and others are not.  It takes legal judgment and some knowledge of the document universe to determine what documents to include in your sampling frame.  Random sampling means that every document in the set being considered (the sampling frame) has an equal chance of appearing in a random sample.  That is what makes random samples representative of the population from which they are drawn.  Random sampling does not force you to include inappropriate documents in that population.  It says nothing about what goes into that population.  Instead, random sampling allows you to describe the population as it is. 

Random sampling does not control what goes into the sampling frame, but it does affect what you can say about it.  The random sample is representative of the documents that have been included in the sampling frame and may or may not be representative of documents outside of the sampling frame.  You can generalize from the sample to the rest of the documents in the sampling frame, but not to those documents that were not included. In an election poll, if you sample only from New York City, then you can talk about New York City voters’ opinions, but you cannot say anything useful about voters outside of the city.  If you cull down your document collection prior to sampling, you can justifiably talk about the documents that you kept, but not about the documents that were rejected by your culling criteria.  So, rather than sampling determining what documents are included, the documents included in the sampling frame determine what you can infer from samples drawn from it.

If you want to make claims about all of the documents held by a particular organization, then your sampling must include all documents held by the organization.  If you want to make claims about a certain subset of documents, perhaps those held by certain custodians, then your sampling frame need only include those documents.

Random sampling is the best way to get a representative set of documents from a population, but how you pick that population depends on your needs.  Election polls do not interview just any kind of person in the country for every kind of poll.  Rather, these polls tend to focus on likely voters or at least eligible voters.  The sampling frame is constrained to meet the goals of the poll.  If the poll concerns a local election, only those voters who are eligible to vote in that election are polled, not everyone in the whole country and not everyone who walks by in a shopping mall.

Your document collection may be similarly constrained.  Document collection may be limited to a certain time frame, to certain custodians or document types, and so on.  It takes intelligence to collect a sensible set of documents for litigation or investigation, just as it always has.  If certain kinds of documents, certain custodians, or certain time periods are known to be irrelevant to the matter, for example, they should not be included.  Neither random sampling nor legal rules require it.  Your random samples will be representative of the population you select and not representative of the population that you don’t select. 

The decisions you make about your collection are critically important, but that is not affected by the technology that you use to analyze that collection.  Random sampling does not require you to check your intelligence at the door.

Myth: Random sampling must miss important concepts in the document collection


Random sampling can easily be shown to provide a good account of the characteristics of the document population (the sampling frame).  If our random sample was as big as our population of documents, then it is obvious that it would find every responsive document in the collection and would leave nothing behind.  The characteristics of the sample would match exactly the characteristics of the whole population.   It is also obvious that a sample as large as the population is impractical. That is the situation we use predictive coding to avoid.  What may not be so obvious is that smaller samples provide approximations to the population.  The larger the sample, all other things being equal, the better the approximation.  Whatever concepts are present in the population will be present in the sample in approximately the same proportion.  Nothing in random sampling causes you to miss anything.

The characteristics of the sample are approximations of the characteristics of the collection, so features or concepts that are more common in the collection will be more common in the sample.  Very rare things are, by definition, very rare, so they may not appear in the sample.  But they are unlikely to be captured using other approaches as well.
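
A small simulation with made-up numbers illustrates the point: whatever share a concept has in the sampling frame, a random sample reproduces that share approximately, and the approximation tightens as the sample grows.

    import random

    random.seed(1)

    # Made-up population: 100,000 documents, 2% of which mention a concept.
    population = ["concept"] * 2_000 + ["other"] * 98_000

    for sample_size in (100, 400, 1_600, 6_400):
        sample = random.sample(population, sample_size)
        share = sample.count("concept") / sample_size
        print(sample_size, round(share, 3))
    # The observed share stays close to the true 2%, and larger samples
    # track it more tightly; nothing about random selection systematically
    # misses the concept.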

It is nice to imagine that an educated reader would find the rare smoking gun in a collection if he or she came across it.  The evidence is contrary, however.  Detecting rare events is difficult no matter what method or system you use.  Despite massive security efforts, terrorists still can slip through at airports and other locations, in part because terrorists are so rare and travelers are so common.  Millions of people pass through security checkpoints every day and almost none of them is a terrorist. 

The same is true in document review.  For a reviewer to identify a document as responsive, the reviewer has to see the document and has to recognize that the document actually provides evidence of being responsive.  We can affect the probability of a reviewer seeing a document, but there is still the problem of recognizing it when it has been seen.  Psychological studies have found that the more often one decides that an item is irrelevant, the harder it is to detect an item that is, in fact, relevant.  The TSA officers who look at the X-rays of your carry-on bags change roles every 20 minutes for precisely this reason. Practically none of the suitcases will ever have a terrorist’s weapon in it. 

People would like to think that if they saw a smoking gun document, they would immediately recognize it.  Here, too, psychological studies indicate otherwise.  Documents in isolation or in hindsight may be obvious, but in the fog of review they are easy to miss.  Many of us think that we are above average in being able to design searches for what we want or to recognize responsive documents when they are seen, but we rarely actually analyze our performance to test that hypothesis.  There is a strong bias, called the “overconfidence effect,” to believe that we are better than our peers.  The attorneys in the Blair and Maron study, one of the first examining the efficacy of keyword searching, thought that they had found around 75% of the relevant documents in their collection, when, in fact, they had found 20%. 

Non-random judgmental samples give the appearance to the attorneys running them that they are in control, but that control may be illusory. You can achieve a high level of responsive documents in the documents that you find, but that tells you nothing about your success at finding all of the responsive documents. You may feel confident on the basis of finding more or less only the responsive documents, but still fail to find a substantial number of responsive documents that are not retrieved.

With enough work, keyword searches can be effective for finding responsive documents, but few matters get this high-level treatment.  For example, in Kleen Products, one of the defendants reported that they spent 1400 hours coming up with a set of keywords.  It is often difficult to guess the right keywords that will return useful documents without providing overwhelming quantities of irrelevant documents.  For example, we have seen many times in investigations that the attorney will suggest that we search for documents with legal terms of art that most authors would never use. 

Not all attorneys can be above average at guessing keywords and not all matters merit the level of effort that would be required to develop effective lists.  In one recent matter where the attorneys tried to cull down the document collection with a keyword list, only 2% of the documents selected by the keywords ended up being called responsive by the principal attorneys.  And they don’t know what they missed because they did not analyze the documents that were not selected by the keywords (the identified responsive documents constituted only 0.3% of the whole collection). 

Like the attorneys in the Blair and Maron study, the attorneys in this matter thought they were effective at identifying the responsive documents that they needed, but they never actually evaluated their success.  Without that evaluation, it is impossible to say whether they were effective or merely overconfident.  To evaluate their effectiveness, they should have drawn a random sample of the documents not selected by the keywords and evaluated them for responsiveness.  On the other hand, if they were willing to draw a random sample, they probably did not need to do the keyword culling at all; they could simply have used the evaluated random sample as part of the training process for predictive coding. 

There have been studies of computer assisted review and of human review efficacy, but proper evaluation of discovery by any method is relatively rare.  Without a formal analysis, it is difficult to truly evaluate how well we actually do because it is so easy to over-interpret the evidence we have at hand.  This tendency is related to confirmation bias, which is the bias to seek evidence that confirms, rather than tests, our ideas.  The studies that are available suggest that human review is not as accurate as many attorneys would like to believe.

Myth: Random sampling relies on the premise of ignorance: no one knows anything about how to find the probative documents so you might as well search randomly


As Will Rogers said, “It isn't what we don't know that gives us trouble, it's what we know that ain't so.” The problem with assuming that we know how to find the responsive documents is that we may actually be mistaken in our belief.  In one matter, one of the parties was certain that there was email that was highly pertinent to the matter and that it was sent sometime between two dates.  When the email (or one like it) was eventually found, it had actually been sent 18 months earlier. 

Random samples do not prevent you from using information you have about the matter and about the location of probative documents, but they can prevent you from being so convinced of your knowledge that you miss the very information that you are looking for. Confirmation bias leads us to look for evidence that confirms our beliefs rather than tests them.  You may, in fact, know how to find critical documents, but unless you critically test that belief, you will never know how accurate you were. And, there are powerful psychological biases at work that interfere with our ability to critically evaluate our beliefs. 

We do not choose random samples because we are ignorant of the differences among items, but rather to get a representative, unbiased, sample of those differences.  (See Reference Manual on Scientific Evidence: Reference Guide on Statistics, p. 90.)

The representativeness of random samples ensures that the training examples are representative of the range of documents in the collection.  The reviewer’s decisions about these documents, therefore, will be representative of the decisions that need to be made about the collection.  Non-representative samples may, hypothetically, do better at finding certain kinds of responsive documents, but they would then necessarily be poorer at finding other kinds of responsive documents.  Without random sampling, we would have no way to evaluate any non-random claim of success because we could not generalize from the specific documents in the non-random sample to the rest of the collection.  In other words, non-random sampling may make one feel as though he or she is being more effective, but only while avoiding a fair evaluation.

Predictive coding depends on the representativeness of the training examples.  It does not make any assumptions that all documents are equally likely to be responsive.  It does not make any assumptions that the user of the predictive coding system is ignorant.  Quite the contrary.  Predictive coding, in general, requires a high level of expertise about the subject matter to determine which documents are and which are not responsive.  It requires legal judgments.  It does not rely on your ability as a search engine guru or information scientist. 

Predictive coding is a tool.  It can be used effectively or ineffectively.  If the decisions that are driving its training examples are not valid, the result of predictive coding will not be valid.  Random sampling helps to evaluate the thoroughness, consistency, and validity of the decisions.

Moreover, random sampling does not require you to ignore information that you have.  If there are documents that are known to be responsive, these can easily be included in the training set.  You will still want to use random sampling of additional documents to ensure that you have a representative sample, but there is unlikely to be any harm from starting with known documents, as long as they represent valid decisions about responsiveness.

Using random sampling to train predictive coding means that you don’t have to make an extraordinary effort to find a representative training set.  But if you have known documents they should not be ignored.  Nothing in random sampling requires you to ignore information that you already have.

Myth: Random sampling prevents lawyers from engaging their professional expertise


Random sampling does not remove the necessity for lawyers to make legal judgments, to guide the collection of data, to prepare for depositions, or to understand the case and the documents relevant to it.

To the contrary, random sampling allows lawyers to focus on using their legal expertise, without requiring them to be sophisticated information scientists.  Lawyers’ professional expertise extends to the law, legal process, and the like, but it seldom includes information science.  Seeking the documents required to build a seed set often requires the use of complex Boolean expressions, and while some attorneys are quite adept at designing such queries, most are not.  It can be quite challenging to construct queries that give all or most of the documents that are wanted and not overwhelm with a lot of unwanted documents. 

Traditional kinds of searching, involving keywords and Boolean expressions, require one to know the language that was used by the authors of the documents.  Interviews of the available authors can reveal some of this information, but even the authors may not remember the exact terms that they used.  And, they may not be available.

Many attorneys appear to think that the document authors tend to talk about things the way that attorneys would talk about them.  I have been asked, for example, to search for the word “fraud” in some investigations as if someone would email another person and say, “hey, let’s commit fraud.”  Such things can happen, but people are much more likely to use language that is more subtle, colloquial, and idiosyncratic.  On top of that, the words may not even be spelled correctly or consistently.  The authors may even use code words to talk about important issues.  It is much more likely that these words can be recognized than guessed. 

The attorneys may not be able to describe how to find the responsive documents, but they are typically very able to recognize that a document is responsive.  The success of predictive coding with random sampling depends strongly on the reviewer’s legal acumen to identify responsive documents.  It does not require any specific level of information science skill. 

Part of the concern with random sampling seems to be a desire to preserve the attorney’s role in selecting documents for consideration by the predictive coding system.  As mentioned earlier, all of the decisions about documents being responsive or non-responsive are made by the attorney, not by the computer.  The difference between an attorney-provided seed set and a random sample is who chooses the documents for consideration.  We should use professional judgment to select the documents for consideration; if that professional judgment is sound and effective, it can contribute to improved predictive coding results.  But a word of caution is appropriate here.

Without explicit measurement, we cannot know whether that judgment is good, and that measurement requires the evaluation of a random sample.  Further, I am reminded of the work of the psychologist Paul Meehl, who compared what he called clinical vs. statistical judgment.  In virtually every case, he found that using statistical models led to better outcomes than intuitive judgments.  His approach has been widely adopted in psychology, hiring decisions, and other areas, but there continues to be resistance to it, despite the consistent evidence. 

“There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing [scores of] investigations, predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion” (Meehl 1986, 372−3).

An attorney’s professional judgment about where to look for responsive documents is an implicit prediction about where these documents are likely to be.  It is an intuitive and informal assessment of this likelihood, which, as Meehl, and Kahneman and Tversky have found, is subject to inappropriate biases and does not consistently correspond to reality.  For most attorneys, there is simply no evidence to lead us to believe that actively choosing a seed set is necessarily better than using random sampling to pick training examples.

Random sampling allows lawyers to focus on their legal judgments, rather than information science techniques.

Myth: Lawyers use random sampling to hide behind the computer’s decisions


In predictive coding, the computer does not make the decisions, it merely implements the decisions that have been made by a subject matter expert attorney.  There is nothing to hide behind.  Random sampling chooses the documents to be considered, but the judgment about those documents is all yours.  The predictive coding system analyzes the decisions you make and extends those decisions in a consistent way to other documents in the collection. 

By itself, the predictive coding system cannot determine which documents are interesting and which are not.  Your knowledgeable input provides its only notion of responsiveness.  What the system has is an analysis of the evidence in each document that makes that document more like those you have designated as responsive or more like those you have designated as non-responsive.  The choice is yours.  And, as you update your choices, the system continues to adjust its internal models to more closely approximate your choices.

Conclusion


Predictive coding is a game-changing approach to eDiscovery.  It dramatically modifies the workflow process of eDiscovery and speeds it up by orders of magnitude.  It helps producing parties to manage the cost of eDiscovery while improving accuracy and the time needed to meet discovery obligations.  It helps receiving parties to get more of the information that they are entitled to receive in a shorter time and with less of the junk that typically invades productions. 

Random sampling is an essential part of predictive coding.  It is universally used in the measurement of its outcome.  In addition, training with random sampling increases the transparency of the process—it shows at each step of training how well the system is doing in a form that can be directly compared with the ultimate outcome of the process.  Random sampling is the best way that we know to provide the representativeness that is required for effective predictive coding.  It minimizes the effort required to seek out potentially responsive documents and provides easily defensible criteria for selecting the training examples. 

There are other methods to provide the training examples that predictive coding requires.  Random sampling does not have to replace those other methods, but it should be an essential part of any training process to ensure that coverage is as complete as possible.  The research is clear that we all have expectations about what information should be present and great confidence that we can find that information.  Often this confidence is unfounded, and in any case it should be measured as objectively as possible.  Random sampling provides that unbiased, objective measurement and an unbiased, objective method to identify the documents in a collection that are actually responsive. 

The power of detailed statistical analysis to reveal actionable information about the world has long been recognized by statisticians.  Recent books, such as Moneyball by Michael Lewis and The Signal and the Noise: Why So Many Predictions Fail — but Some Don't by Nate Silver, have helped to make that value more apparent to people outside of statistics.  The very same principles and processes described in these books can be applied to eDiscovery with similar levels of success.  An essential ingredient in that success is random sampling.