Friday, June 28, 2013

Predictive Coding: Debunking a Sample of Myths about Random Sampling

To read some of the eDiscovery blogs that have been posted recently, you might think that there has been a lot of discussion about the “right way” to do predictive coding.  Much of this analysis is based on the writers’ intuitions about statistical claims.  Generally, these are lawyers attempting to make statistical claims as to which methods are better than others.  As Nobel laureate Daniel Kahneman and his colleague Amos Tversky reported some years ago, such intuitive analyses don’t always match up with reality.  In the context of eDiscovery, the mismatch of intuition and statistical reality can lead to incorrect assumptions or myths about random sampling, a sample of which we can consider here.

Predictive coding systems learn to distinguish documents based on a set of categorized examples.  These examples must have three characteristics in order to give accurate results:  Validity, consistency, representativeness. 

  • Validity: The example document decisions must be valid; the sample documents identified as responsive vs. non-responsive must actually be responsive and non-responsive, respectively.
  • Consistency: The categorization must be consistent—the same standards must apply from one example to the next. 
  • Representativeness: The documents must be representative of the distinction in the entire population of all documents.  
Poor results follow if any of these requirements is unmet.

Random sampling is one of several methods that can be used to generate these example documents and meet these requirements.  Other methods are sometimes called judgmental samples, purposive samples, or seed sets. The training examples are developed, usually on an ad hoc basis, using keyword searching or other techniques.  Using expertise, search terms, or some other method, one or more attorneys selects sample documents to provide the training examples.  These techniques may or may not meet the three requirements of effective predictive coding.

Random sampling means that every document in a collection or population has an equal chance of being included in the sample.  Random sampling does not tell you what population to start with, and this is a source of some serious misunderstanding.  In the jargon of statistics, the population is called the “sampling frame.”  How you choose the sampling frame can have a profound impact on the time and effort it takes to do predictive coding.  Random sampling is not a substitute for intelligent document collection.
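The distinction between choosing the sampling frame and drawing the sample can be sketched in a few lines of Python.  This is only an illustration: the collection, the custodian names, and the function draw_random_sample are hypothetical, not part of any particular predictive coding product.

```python
import random

def draw_random_sample(sampling_frame, sample_size, seed=None):
    """Simple random sample without replacement: every document in the
    frame has the same chance of being selected."""
    rng = random.Random(seed)
    return rng.sample(sampling_frame, sample_size)

# Hypothetical collection of 10,000 documents from two custodians.
collection = [
    {"doc_id": i, "custodian": "alice" if i % 2 == 0 else "bob"}
    for i in range(10_000)
]

# Legal judgment, not the sampling method, decides what goes into the frame.
# Here we assume only one custodian's documents are in scope.
sampling_frame = [d for d in collection if d["custodian"] == "alice"]

# The sample is representative of the frame, and only of the frame.
sample = draw_random_sample(sampling_frame, 500, seed=42)
```

Nothing in the sampling step looks at the documents' content or origin.  The constraint on what is considered comes entirely from how the frame was built.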

Random sampling and the statistics that surround it have been in active use for well over a century.  Random sampling is the basis for almost all scientific work, because it is the only way to make valid inferences from the properties of the sample to the properties of the population from which that sample was drawn.  Only random sampling provides a means of predicting from the sample the accuracy that can be expected when estimating the properties of the population (such as prevalence).
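As a rough sketch of what “predicting the accuracy” means, the following Python estimates prevalence from a random sample and attaches a margin of error.  It assumes the simple normal approximation for a proportion; other interval methods exist and may be preferable for small samples or very low prevalence.  The counts are hypothetical.

```python
import math

def estimate_prevalence(responsive_in_sample, sample_size, z=1.96):
    """Estimate prevalence and an approximate 95% confidence interval
    (normal approximation; z=1.96 corresponds to 95% confidence)."""
    p_hat = responsive_in_sample / sample_size
    margin = z * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical review: 150 of 1,500 randomly sampled documents were responsive.
p_hat, low, high = estimate_prevalence(150, 1500)
print(f"Estimated prevalence: {p_hat:.1%} (roughly {low:.1%} to {high:.1%})")
```

The interval is only meaningful because the 1,500 documents were drawn at random.  The same arithmetic applied to a hand-picked set would produce numbers, but not a defensible estimate.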

Random sampling is recognized by the courts as a useful and cost-effective method to gather information about large groups of items (including documents and people) where it would be impractical or burdensome to examine each one individually, as noted in the Federal Judicial Center’s Reference Manual on Scientific Evidence: Reference Guide on Statistics.

Random sampling is used widely in industrial quality control, in forecasting elections, and, of course, throughout the sciences.  A definitive book on sampling is Sampling Techniques by William G. Cochran (1977, 3rd Ed., Wiley).

Random sampling is a critical part of our evaluation of the success of predictive coding and other processes.  When we measure the accuracy of predictive coding, we typically assess a randomly selected set of documents for responsiveness.  Random selection is the best way to pick an unbiased representative sample, and only with a representative sample can you justify inferences from the sample to the population from which the sample was drawn.  Random samples are the best way to convince opposing counsel that your assessment is based on a fair analysis of the available documents.  Just as you would not want to play against an opponent with loaded dice, you would not want to base your decisions on a biased sample of documents.  Random sampling assures you and your opponents of an unbiased assessment.

Myth: Random sampling requires you to include all of your documents without any constraint

Even on the surface, this myth is obviously wrong.  There are always constraints in how you collect documents for a matter.  You typically don’t collect documents from the Library of Congress, for example, or from other companies.

Some documents are worthy of being considered for possible responsiveness and others are not.  It takes legal judgment and some knowledge of the document universe to determine what documents to include in your sampling frame.  Random sampling means that every document in the set being considered (the sampling frame) has an equal chance of appearing in a random sample.  That is what makes random samples representative of the population from which they are drawn.  Random sampling does not force you to include inappropriate documents in that population.  It says nothing about what goes into that population.  Instead, random sampling allows you to describe the population as it is.

Random sampling does not control what goes into the sampling frame, but it does affect what you can say about it.  The random sample is representative of the documents that have been included in the sampling frame and may or may not be representative of documents outside of the sampling frame.  You can generalize from the sample to the rest of the documents in the sampling frame, but not to those documents that were not included. In an election poll, if you sample only from New York City, then you can talk about New York City voters’ opinions, but you cannot say anything useful about voters outside of the city.  If you cull down your document collection prior to sampling, you can justifiably talk about the documents that you kept, but not about the documents that were rejected by your culling criteria.  So, rather than sampling determining what documents are included, the documents included in the sampling frame determine what you can infer from samples drawn from it.

If you want to make claims about all of the documents held by a particular organization, then your sampling must include all documents held by the organization.  If you want to make claims about a certain subset of documents, perhaps those held by certain custodians, then your sampling frame need only include those documents.

Random sampling is the best way to get a representative set of documents from a population, but how you pick that population depends on your needs.  Election polls do not interview just any kind of person in the country for every kind of poll.  Rather, these polls tend to focus on likely voters or at least eligible voters.  The sampling frame is constrained to meet the goals of the poll.  If the poll concerns a local election, only those voters who are eligible to vote in that election are polled, not everyone in the whole country and not everyone who walks by in a shopping mall.

Your document collection may be similarly constrained.  Document collection may be limited to a certain time frame, custodians, or document types, and so on.  It takes intelligence to collect a sensible set of documents for litigation or investigation, just as it always has.  If certain kinds of documents, certain custodians, or certain time periods are known to be irrelevant to the matter, for example, they should not be included.  Neither random sampling nor legal rules require it.  Your random samples will be representative of the population you select and not representative of the population that you don’t select. 

The decisions you make about your collection are critically important, but that is not affected by the technology that you use to analyze that collection.  Random sampling does not require you to check your intelligence at the door.

Myth: Random sampling must miss important concepts in the document collection

Random sampling can easily be shown to provide a good account of the characteristics of the document population (the sampling frame).  If our random sample were as big as our population of documents, then it is obvious that it would find every responsive document in the collection and would leave nothing behind.  The characteristics of the sample would match exactly the characteristics of the whole population.  It is also obvious that a sample as large as the population is impractical.  That is the situation we use predictive coding to avoid.  What may not be so obvious is that smaller samples provide approximations to the population.  The larger the sample, all other things being equal, the better the approximation.  Whatever concepts are present in the population will be present in the sample in approximately the same proportion.  Nothing in random sampling causes you to miss anything.

The characteristics of the sample are approximations of the characteristics of the collection, so features or concepts that are more common in the collection will be more common in the sample.  Very rare things are, by definition, very rare, so they may not appear in the sample.  But they are unlikely to be captured using other approaches either.
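How rare is too rare to show up?  A back-of-the-envelope calculation helps.  The sketch below assumes the collection is much larger than the sample, so each draw can be treated as independent; the prevalence values and sample size are hypothetical.

```python
def chance_of_seeing_concept(prevalence, sample_size):
    """Probability that at least one document containing a concept appears
    in a random sample, given the concept's prevalence in the frame.
    Assumes the frame is large relative to the sample."""
    return 1 - (1 - prevalence) ** sample_size

# A concept found in 1 of every 100 documents, with a sample of 500:
common = chance_of_seeing_concept(0.01, 500)

# A concept found in 1 of every 10,000 documents, same sample size:
rare = chance_of_seeing_concept(0.0001, 500)

print(f"1-in-100 concept:    {common:.1%} chance of appearing")
print(f"1-in-10,000 concept: {rare:.1%} chance of appearing")
```

Under these assumptions the 1-in-100 concept is almost certain to appear in the sample, while the 1-in-10,000 concept usually will not.  Increasing the sample size raises both probabilities.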

It is nice to imagine that an educated reader would find the rare smoking gun in a collection if he or she came across it.  The evidence is contrary, however.  Detecting rare events is difficult no matter what method or system you use.  Despite massive security efforts, terrorists still can slip through at airports and other locations, in part because terrorists are so rare and travelers are so common.  Millions of people pass through security checkpoints every day and almost none of them is a terrorist. 

The same is true in document review.  For a reviewer to identify a document as responsive, the reviewer has to see the document and has to recognize that the document actually provides evidence of being responsive.  We can affect the probability of a reviewer seeing a document, but there is still the problem of recognizing it when it has been seen.  Psychological studies have found that the more often one decides that an item is irrelevant, the harder it is to detect an item that is, in fact, relevant.  The TSA officers who look at the X-rays of your carry-on bags change roles every 20 minutes for precisely this reason. Practically none of the suitcases will ever have a terrorist’s weapon in it. 

People would like to think that if they saw a smoking gun document, they would immediately recognize it.  Here, too, psychological studies indicate otherwise.  Documents in isolation or in hindsight may be obvious, but in the fog of review they are easy to miss.  Many of us think that we are above average in being able to design searches for what we want or to recognize responsive documents when they are seen, but we rarely actually analyze our performance to test that hypothesis.  There is a strong bias, called the “overconfidence effect,” to believe that we are better than our peers.  The attorneys in the Blair and Maron study, one of the first examining the efficacy of keyword searching, thought that they had found around 75% of the relevant documents in their collection, when, in fact, they had found 20%. 

Non-random judgmental samples give the appearance to the attorneys running them that they are in control, but that control may be illusory. You can achieve a high level of responsive documents in the documents that you find, but that tells you nothing about your success at finding all of the responsive documents. You may feel confident on the basis of finding more or less only the responsive documents, but still fail to find a substantial number of responsive documents that are not retrieved.
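The two quantities being contrasted here are usually called precision (how much of what you found is responsive) and recall (how much of what is responsive you found).  A minimal Python sketch, with entirely hypothetical counts, shows how one can be high while the other is low:

```python
def precision_and_recall(retrieved_responsive, retrieved_total, responsive_total):
    """Precision: share of retrieved documents that are responsive.
    Recall: share of all responsive documents that were retrieved."""
    precision = retrieved_responsive / retrieved_total
    recall = retrieved_responsive / responsive_total
    return precision, recall

# Hypothetical search: 1,000 documents retrieved, 900 of them responsive,
# but the collection actually holds 4,500 responsive documents.
precision, recall = precision_and_recall(900, 1000, 4500)
print(f"Precision: {precision:.0%}, recall: {recall:.0%}")
```

The catch is that responsive_total is not known from the retrieved documents alone.  It has to be estimated, and a random sample of the collection is the standard way to do that.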

With enough work, keyword searches can be effective for finding responsive documents, but few matters get this high-level treatment.  For example, in Kleen Products, one of the defendants reported that they spent 1400 hours coming up with a set of keywords.  It is often difficult to guess the right keywords that will return useful documents without providing overwhelming quantities of irrelevant documents.  For example, we have seen many times in investigations that the attorney will suggest that we search for documents with legal terms of art that most authors would never use. 

Not all attorneys can be above average at guessing keywords and not all matters merit the level of effort that would be required to develop effective lists.  In one recent matter where the attorneys tried to cull down the document collection with a keyword list, only 2% of the documents selected by the keywords ended up being called responsive by the principal attorneys.  And they don’t know what they missed because they did not analyze the documents that were not selected by the keywords (the identified responsive documents constituted only 0.3% of the whole collection). 

Like the attorneys in the Blair and Maron study, the attorneys in this matter thought they were effective at identifying the responsive documents that they needed, but they never actually evaluated their success.  Without that evaluation, it is impossible to say whether they were effective or merely overconfident.  To evaluate their effectiveness, they should have drawn a random sample of the documents not selected by the keywords and evaluated them for responsiveness.  On the other hand, if they were willing to draw a random sample, they probably did not have to do the keyword culling at all; they could simply have used the evaluated random sample as part of the training process for predictive coding.
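The evaluation described above, sampling the documents the keywords left behind, is sometimes called an elusion measure.  Here is a sketch of the arithmetic in Python.  All of the counts are hypothetical and are not taken from the matter described; the function name is likewise just for illustration.

```python
def estimate_recall_from_elusion(found_responsive, unselected_total,
                                 sample_size, responsive_in_sample):
    """Estimate how many responsive documents were left behind, using a
    random sample of the documents NOT selected by the keywords."""
    elusion_rate = responsive_in_sample / sample_size
    estimated_missed = elusion_rate * unselected_total
    estimated_recall = found_responsive / (found_responsive + estimated_missed)
    return elusion_rate, estimated_missed, estimated_recall

# Hypothetical matter: keywords led to 3,000 responsive documents and left
# 850,000 documents unselected.  A random sample of 1,500 of the unselected
# documents turned up 6 that were responsive.
elusion_rate, missed, recall = estimate_recall_from_elusion(3000, 850_000, 1500, 6)
print(f"Elusion rate: {elusion_rate:.2%}")
print(f"Estimated responsive documents missed: {missed:,.0f}")
print(f"Estimated recall: {recall:.0%}")
```

Even a small rate of responsive documents in the unselected set can translate into a large number of missed documents when that set is large, which is why the check is worth running.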

There have been studies of computer-assisted review and of human review efficacy, but proper evaluation of discovery by any method is relatively rare.  Without a formal analysis, it is difficult to truly evaluate how well we actually do because it is so easy to over-interpret the evidence we have at hand.  This tendency is related to confirmation bias, which is the bias to seek evidence that confirms, rather than tests, our ideas.  The studies that are available suggest that human review is not as accurate as many attorneys would like to believe.

Myth: Random sampling relies on the premise of ignorance: no one knows anything about how to find the probative documents so you might as well search randomly

As Will Rogers said, “It isn't what we don't know that gives us trouble, it's what we know that ain't so.” The problem with assuming that we know how to find the responsive documents is that we may actually be mistaken in our belief.  In one matter, one of the parties was certain that there was email that was highly pertinent to the matter and that it was sent sometime between two dates.  When the email (or one like it) was eventually found, it had actually been sent 18 months earlier. 

Random samples do not prevent you from using information you have about the matter and about the location of probative documents, but they can prevent you from being so convinced of your knowledge that you miss the very information that you are looking for. Confirmation bias leads us to look for evidence that confirms our beliefs rather than tests them.  You may, in fact, know how to find critical documents, but unless you critically test that belief, you will never know how accurate you were. And, there are powerful psychological biases at work that interfere with our ability to critically evaluate our beliefs. 

We do not choose random samples because we are ignorant of the differences among items, but rather to get a representative, unbiased, sample of those differences.  (See Reference Manual on Scientific Evidence: Reference Guide on Statistics, p. 90.)

The representativeness of random samples ensures that the training examples are representative of the range of documents in the collection.  The reviewer’s decisions about these documents, therefore, will be representative of the decisions that need to be made about the collection.  Non-representative samples may, hypothetically, do better at finding certain kinds of responsive documents, but they would then necessarily be poorer at finding other kinds of responsive documents.  Without random sampling, we would have no way to evaluate any non-random claim of success because we could not generalize from the specific documents in the non-random sample to the rest of the collection.  In other words, non-random sampling may make one feel as though he or she is being more effective, but only while avoiding a fair evaluation.

Predictive coding depends on the representativeness of the training examples.  It does not make any assumptions that all documents are equally likely to be responsive.  It does not make any assumptions that the user of the predictive coding system is ignorant.  Quite the contrary.  Predictive coding, in general, requires a high level of expertise about the subject matter to determine which documents are and which are not responsive.  It requires legal judgments.  It does not rely on your ability as a search engine guru or information scientist.

Predictive coding is a tool.  It can be used effectively or ineffectively.  If the decisions that are driving its training examples are not valid, the result of predictive coding will not be valid.  Random sampling helps to evaluate the thoroughness, consistency, and validity of the decisions.

Moreover, random sampling does not require you to ignore information that you have.  If there are documents that are known to be responsive, these can easily be included in the training set.  You will still want to use random sampling of additional documents to ensure that you have a representative sample, but there is unlikely to be any harm from starting with known documents, as long as they represent valid decisions about responsiveness.
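One way to picture this combination is a training set with two clearly labeled parts.  The Python below is a sketch under assumed names (build_training_set and the document IDs are hypothetical); it is not how any specific predictive coding system stores its training data.

```python
import random

def build_training_set(known_responsive_ids, sampling_frame_ids,
                       random_sample_size, seed=None):
    """Combine documents already known to be responsive with a random
    sample of the remaining documents in the sampling frame."""
    rng = random.Random(seed)
    known = set(known_responsive_ids)
    # Sample only from documents that are not already in the known set.
    remaining = [doc_id for doc_id in sampling_frame_ids if doc_id not in known]
    random_part = rng.sample(remaining, random_sample_size)
    return {"known": sorted(known), "random": random_part}

# Hypothetical frame of 1,000 documents, three of them already known
# to be responsive from earlier work on the matter.
training = build_training_set([3, 7, 11], list(range(1000)), 50, seed=1)
```

Keeping the two parts separate matters for measurement.  Estimates such as prevalence should be computed from the random part only, because the known documents were not chosen at random.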

Using random sampling to train predictive coding means that you don’t have to make an extraordinary effort to find a representative training set.  But if you have known documents they should not be ignored.  Nothing in random sampling requires you to ignore information that you already have.

Myth: Random sampling prevents lawyers from engaging their professional expertise

Random sampling does not remove the necessity for lawyers to make legal judgments, to guide the collection of data, to prepare for depositions, or to understand the case and the documents relevant to it.

To the contrary, random sampling allows lawyers to focus on using their legal expertise, without requiring them to be sophisticated information scientists.  Lawyers’ professional expertise extends to the law, legal process, and the like, but it seldom includes information science.  Seeking the documents required to build a seed set often requires the use of complex Boolean expressions, and while some attorneys are quite adept at designing such queries, most are not.  It can be quite challenging to construct queries that give all or most of the documents that are wanted and not overwhelm with a lot of unwanted documents. 

Traditional kinds of searching, involving keywords and Boolean expressions, require one to know the language that was used by the authors of the documents.  Interviews of the available authors can reveal some of this information, but even the authors may not remember the exact terms that they used.  And, they may not be available.

Many attorneys appear to think that the document authors tend to talk about things the way that attorneys would talk about them.  I have been asked, for example, to search for the word “fraud” in some investigations as if someone would email another person and say, “hey, let’s commit fraud.”  Such things can happen, but people are much more likely to use language that is more subtle, colloquial, and idiosyncratic.  On top of that, the words may not even be spelled correctly or consistently.  The authors may even use code words to talk about important issues.  It is much more likely that these words can be recognized than guessed. 

The attorneys may not be able to describe how to find the responsive documents, but they are typically very able to recognize that a document is responsive.  The success of predictive coding with random sampling depends strongly on the reviewer’s legal acumen to identify responsive documents.  It does not require any specific level of information science skill. 

Part of the concern with random sampling seems to be a desire to preserve the attorney’s role in selecting documents for consideration by the predictive coding system.  As mentioned earlier, all of the decisions about documents being responsive or non-responsive are made by the attorney, not by the computer.  The difference between an attorney-provided seed set and a random sample is who chooses the documents for consideration.  We should use professional judgment to select the documents for consideration; if that professional judgment is sound and effective, it can contribute to improved predictive coding results.  But a word of caution is appropriate here.

Without explicit measurement, we cannot know whether that judgment is good, and that measurement requires the evaluation of a random sample.  Further, I am reminded of the work of the psychologist Paul Meehl, who compared what he called clinical vs. statistical judgment.  In virtually every case, he found that using statistical models led to better outcomes than intuitive judgments.  His approach has been widely adopted in psychology, hiring decisions, and other areas, but there continues to be resistance to his approach, despite the consistent evidence.

“There is no controversy in social science which shows such a large body of qualitatively diverse studies coming out so uniformly in the same direction as this one. When you are pushing [scores of] investigations, predicting everything from the outcomes of football games to the diagnosis of liver disease and when you can hardly come up with a half dozen studies showing even a weak tendency in favor of the clinician, it is time to draw a practical conclusion” (Meehl 1986, 372−3).

An attorney’s professional judgment about where to look for responsive documents is an implicit prediction about where these documents are likely to be.  It is an intuitive and informal assessment of this likelihood, which, as Meehl, and Kahneman and Tversky have found, is subject to inappropriate biases and does not consistently correspond to reality.  For most attorneys, there is simply no evidence to lead us to believe that actively choosing a seed set is necessarily better than using random sampling to pick training examples.

Random sampling allows lawyers to focus on their legal judgments, rather than information science techniques.

Myth: Lawyers use random sampling to hide behind the computer’s decisions

In predictive coding, the computer does not make the decisions; it merely implements the decisions that have been made by a subject matter expert attorney.  There is nothing to hide behind.  Random sampling chooses the documents to be considered, but the judgment about those documents is all yours.  The predictive coding system analyzes the decisions you make and extends those decisions in a consistent way to other documents in the collection.

By itself, the predictive coding system cannot determine which documents are interesting and which are not.  Your knowledgeable input provides its only notion of responsiveness.  What the system has is an analysis of the evidence in each document that makes that document more like those you have designated as responsive or more like those you have designated as non-responsive.  The choice is yours.  And, as you update your choices, the system continues to adjust its internal models to more closely approximate your choices.


Predictive coding is a game-changing approach to eDiscovery.  It dramatically modifies the workflow process of eDiscovery and speeds it up by orders of magnitude.  It helps producing parties to manage the cost of eDiscovery while improving accuracy and the time needed to meet discovery obligations.  It helps receiving parties to get more of the information that they are entitled to receive in a shorter time and with less of the junk that typically invades productions. 

Random sampling is an essential part of predictive coding.  It is universally used in the measurement of its outcome.  In addition, training with random sampling increases the transparency of the process—it shows at each step of training how well the system is doing in a form that can be directly compared with the ultimate outcome of the process.  Random sampling is the best way that we know to provide the representativeness that is required for effective predictive coding.  It minimizes the effort required to seek out potentially responsive documents and provides easily defensible criteria for selecting the training examples. 

There are other methods to provide the training examples that predictive coding requires.  Random sampling does not have to replace those other methods, but it should be an essential part of any training process to ensure that coverage be as complete as possible.  The research is clear that we all have expectations about what information should be present and great confidence that we can find that information.  Often this confidence is unfounded and in any case, it should be measured as objectively as possible.  Random sampling provides that unbiased objective measurement and an unbiased objective method to identify the documents in a collection that are actually responsive. 

The power of detailed statistical analysis to reveal actionable information about the world has long been recognized by statisticians.  Recent books, such as Moneyball by Michael Lewis and The Signal and the Noise: Why So Many Predictions Fail — but Some Don't by Nate Silver, have helped to make that value more apparent to people outside of statistics.  The very same principles and processes described in these books can be applied to eDiscovery with similar levels of success.  An essential ingredient in that success is random sampling.