Monday, January 9, 2012

On Some Selected Search Secrets

Ralph Losey recently wrote an important series of blog posts (here, here, and here) describing five secrets of search. He pulled together a substantial array of facts and ideas that should have a powerful impact on eDiscovery and the use of technology in it. He raised so many good points that it would take up all of my time just to enumerate them. He also highlighted the need for peer review. In that spirit, I would like to address a few of his conclusions in the hope of furthering discussions among lawyers, judges, and information scientists about the best ways to pursue eDiscovery.

These are the problematic points I would like to consider:
1. Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.
2. Webber’s analysis shows that human review is better than machine review.
3. Reviewer quality is paramount.
4. Human review is good for small volumes, but not large ones.
5. Random samples with 95% confidence levels and +/-2% confidence intervals are unrealistically high.


Issue: Machines are not that good at categorizing documents. They are limited to about 65% precision and 65% recall.

Losey quotes extensively from a paper written by William Webber, which reanalyzes some results from the 2009 TREC Legal Track and some other sources. Like Losey’s commentary, this paper has a lot to recommend it. Some of the conclusions that Losey reaches are fairly attributable to Webber, but some go beyond what Webber would probably be comfortable with. The most significant claim, because important arguments are based on it, is a description of work by Ellen Voorhees concluding that 65% recall at 65% precision is the best performance one can expect.

The problem is that this 65% factoid is taken out of context. In the context of the TREC studies and the way that documents are ultimately determined to be relevant or not, this is thought to be the best that can be achieved. The 65% is not a fact of nature. It actually says nothing about the accuracy of the predictive coding systems being studied. Losey notes that this limit is due to the inherent uncertainty in human judgments of relevance, but then goes on to claim that it is a limit on machine-based or machine-assisted categorization. It is not.

Part of the TREC Legal Track process is to distribute sets of documents to ad hoc reviewers, whom they call assessors. Each assessor gets a block or bin of about 500 documents and is asked to categorize them as relevant or not relevant to the topic query. None of the documents in this part of the process is reviewed by more than one assessor. Each assessor typically reviews only one batch. Although information about the topic is provided to each assessor, there is no rigorous effort expended to train them. As you might expect, the assessors can be highly variable. But, generally speaking, we don’t have any assessment of their variability or skill level. This is an important point and I will have to come back to it soon.

Predictive coding systems generally work by applying machine learning to a sample of documents and extrapolating from that sample to the whole collection. Different systems get their samples in different ways, but the performance of the system depends on the quality of the sample. Garbage in – garbage out. More fully, variability in accuracy can come from at least three sources:
1. Variability in the training set
2. Variability due to software algorithms
3. Variability due to the judgment standard against which the system is scored
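
For concreteness, here is roughly what the train-and-extrapolate step looks like in code. This is a toy sketch of my own, written with scikit-learn and invented documents; it stands in for the general approach, not for any particular predictive coding product.

```python
# A minimal sketch of train-and-extrapolate; illustration only, not any vendor's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A small attorney-reviewed training sample (1 = responsive, 0 = not responsive).
training_docs = [
    "draft of the merger agreement",
    "lunch on Friday?",
    "revised merger price terms",
    "fantasy football picks",
]
training_labels = [1, 0, 1, 0]

# The larger collection to which the learned judgments will be extrapolated.
collection = ["comments on the merger agreement", "holiday party invitation"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(training_docs)

model = LogisticRegression()
model.fit(X_train, training_labels)   # learn from the reviewed sample

# Score every document in the collection; higher scores mean more likely responsive.
scores = model.predict_proba(vectorizer.transform(collection))[:, 1]
print(list(zip(collection, scores.round(2))))
```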

If the system is trained on an inconsistent set of documents, or if it performs inconsistently, or if it is judged inconsistently, its ultimate level of performance will be poor. Voorhees, in the paper cited by Webber, found that professional information analysts agreed with one another on less than half of the responsive documents. This fact says nothing about any predictive coding system; it speaks only to the agreement of one person with another. One of the assessors she compared was the author of the topic and so could be considered the best available authority on it. The second assessor was the author of a different topic.

Under TREC, the variability due to the training set is left uncontrolled. It is up to each team to figure out how to train its system. The variability due to the judgment standard is consistent across systems, so any variation among systems can be attributed to the training set or to the system’s capabilities. That strategy is perfectly fine for most TREC purposes: we can compare the relative performance of participating systems. The problem comes only when we want to ascertain how well a system will do in absolute terms. The performance of predictive coding systems in the TREC Legal Track is suppressed by the variability of the judgment standard. That is not a design flaw in TREC; it is only a problem when we want to extrapolate from TREC results to eDiscovery or other real-world situations. It underestimates how well a system will do with more rigorous training and testing standards. The original TREC methodology was never designed to produce absolute estimates of performance, only relative ones.

Anything that we can do to improve the consistency of the training and testing set of document categorizations will improve the absolute quality of the results. But such quality improvements are typically expensive.

The TREC Legal Track has since moved to using a Topic Authority (like Voorhees’s primary assessor). Even an authoritative assessor is not infallible, but a single authoritative standard may be the best that we can achieve, and it may also be a realistic model of how relevance is actually decided in litigation.

I would like to see the Topic Authority (TA) produce an authoritative training set and a second authoritative judgment set. The first set is used to train the predictive coding system; the second is used to test it.

Using a Topic Authority to provide the training and final assessment sets will substantially reduce the variability of the human judgments. We need two sets because we cannot use the same documents to train the system as we use to test it. If we used only one set, then the system’s performance on those same documents could overestimate its capabilities. The system could simply memorize the training examples and spit back the same categories it was given. Having separate training and testing sets is standard procedure in most machine learning studies.
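
A small sketch of why the two sets must be kept separate. The data and classifier here are invented for illustration; the point is only that a model scored against its own training examples can look deceptively good, while the held-out set gives an honest estimate.

```python
# Toy illustration of memorization: an unpruned decision tree can reproduce its
# training labels perfectly, so only the separate test set is informative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

train_docs   = ["merger agreement draft", "lunch plans", "merger price terms", "soccer tickets"]
train_labels = [1, 0, 1, 0]     # first authoritative set (training)
test_docs    = ["revised merger agreement", "holiday party invite"]
test_labels  = [1, 0]           # second authoritative set (testing)

vec = CountVectorizer().fit(train_docs)
clf = DecisionTreeClassifier(random_state=0).fit(vec.transform(train_docs).toarray(), train_labels)

print(recall_score(train_labels, clf.predict(vec.transform(train_docs).toarray())))  # 1.0: memorized
print(recall_score(test_labels,  clf.predict(vec.transform(test_docs).toarray())))   # the honest number
```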

When we do a scientific study, we want to know how well the system will do in other, similar, situations. This prediction typically requires a statistical inference, and to make a valid statistical inference the two measurements need to be independent.

To translate this into an eDiscovery process, the training set should be created by the person who knows the most about the case and then evaluated by that same person, for example using a random sample of documents predicted to be responsive and nonresponsive. Losey is correct that if you have multiple reviewers, each applying idiosyncratic standards to the review, you will get poor results, even from a highly accurate predictive coding system. On the other hand, with rigorous training and a rigorous evaluation process, high levels of predictive coding accuracy are often achievable.
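
To make that evaluation step concrete, here is a stylized sketch of the arithmetic, with numbers I invented for illustration. The knowledgeable reviewer examines a random sample drawn from each side of the system’s predictions, and precision and recall are estimated from those calls.

```python
# Invented numbers, for illustration only.
sample_pos = 200       # random sample from documents the system predicted responsive
agreed_pos = 170       # the reviewer confirms these are responsive
sample_neg = 200       # random sample from documents predicted nonresponsive
missed_neg = 10        # the reviewer finds these were actually responsive (misses)

n_pred_pos = 40_000    # total documents the system predicted responsive (hypothetical)
n_pred_neg = 460_000   # total documents predicted nonresponsive (hypothetical)

precision  = agreed_pos / sample_pos                   # 0.85
est_found  = (agreed_pos / sample_pos) * n_pred_pos    # ~34,000 responsive documents retrieved
est_missed = (missed_neg / sample_neg) * n_pred_neg    # ~23,000 responsive documents missed
recall     = est_found / (est_found + est_missed)      # ~0.60

print(f"precision ~ {precision:.2f}, estimated recall ~ {recall:.2f}")
```

The point of the sketch is that both estimates come from one consistent decision maker, which is exactly the consistency that a pool of ad hoc assessors cannot provide.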


Issue: Webber’s analysis shows that human review is better than machine review

I have no doubt that human review could sometimes be better than machine-assisted review, but the data discussed by Webber do not say anything one way or the other about this claim.

Webber did, in fact, find that some of the human reviewers showed higher precision and recall than did the best-performing computer system on some tasks. But, because of the methods used, we don’t know whether these differences were due merely to chance, to specific methods used to obtain the scores, or to genuine differences among reviewers. Moreover, the procedure prevents us from making a valid statistical comparison.

The TREC Legal Track results analyzed in Webber’s paper involved a three-step process. The various predictive coding systems were trained on whatever data their teams thought appropriate. The results of the multiple teams were combined and sampled, along with a number of documents that were not designated responsive by any team. From these, the bins or batches were created and distributed to the assessors. Once the assessors made their decisions, the machine teams were given a second chance to "appeal" any decisions to the Topic Authority. If the TA agreed with the computer system’s judgment, the computer system was then scored as performing better and the assessor as performing worse. The appeals process, in other words, moved the target after the arrow had been shot.

If none of the documents judged by a particular assessor was appealed, then that assessor would have precision and recall of 1.0, because prior to any appeal the assessor’s own judgments were the accuracy standard. The more of an assessor’s decisions that were appealed and overturned, the lower that assessor’s score; the score could not increase through the appeals process. So whether an assessor scored well or poorly was determined almost completely by the number of documents that were successfully appealed, that is, by how much the target was moved.
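
A stylized numerical illustration of that one-way ratchet, with numbers I made up: before any appeal the assessor’s own judgments are the standard, so agreement is perfect by definition, and every successful appeal can only pull the score down.

```python
bin_size          = 500   # documents judged by one assessor (typical TREC bin)
marked_responsive = 100   # documents the assessor marked responsive (invented)
overturned_to_yes = 30    # successful appeals: assessor said no, TA said yes (invented)
overturned_to_no  = 10    # successful appeals: assessor said yes, TA said no (invented)

# After adjudication, the post-appeal labels are treated as the truth.
true_responsive   = marked_responsive - overturned_to_no + overturned_to_yes  # 120
still_correct_yes = marked_responsive - overturned_to_no                      # 90

precision = still_correct_yes / marked_responsive   # 90 / 100 = 0.90
recall    = still_correct_yes / true_responsive     # 90 / 120 = 0.75
# With zero successful appeals, both numbers would have remained 1.0.
print(precision, recall)
```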

Because of the (negative) correlation between the performance of the computer system and the performance of the assessor, their performances were not independent. Therefore, a simple statistical comparison between the performance of the assessor and the performance of the computer system is not valid.

Even if the comparison were valid, we would still have other problems. The different TREC tasks involved different topics, some presumably easier than others. The assessors who made the decisions may have varied in ability, but we have no information about which were more skillful. The bins or batches that were reviewed probably differed among themselves in the number of responsive documents each contained. Because only one assessor judged each document, we don’t know whether the differences in accuracy (as judged by the Topic Authority) were due to differences in the documents being reviewed or to differences in the assessors.


Issue: Reviewer quality is paramount

Webber found that some assessors performed better than others. Continuing the argument of the previous section, though, we cannot infer from this that some assessors were more talented, skilled, or better prepared than others.
It is entirely circular to say that some assessors were more skillful than others and so were more accurate because the only evidence we have that they were more skillful is that they were measured to be more accurate. You cannot explain an observation (higher accuracy) by the fact that you made the observation (higher accuracy). It cannot be both the cause and the effect.

Whether the source of variation in performance among the assessors was variation in the number or difficulty of the decisions or differences in assessor talent, you cannot simply pick the best of one thing (an individual assessor’s performance) and compare it to the average of another (the computer-assisted review). The computer’s performance is based on all of the documents; each assessor’s performance is based on only about 500 documents. The computer’s performance was, in effect, the equivalent of an average over all of the assessors’ judgments. Just by chance, some assessors will score higher than others. In fact, about half of the assessors should, just by chance, score above the average and about half should score below it. But we have no way to determine whether those selected reviewers scored high because of chance or because of some difference in skill. We measured them only once, and we cannot use the fact that they scored well to explain their high score. We need some independent evidence.
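
A small simulation makes the point. Assume, purely for illustration, that every assessor has exactly the same 75% chance of agreeing with the final standard on any given document. Even then, with sixty assessors each judging a 500-document bin, some of them will land well above the group average by luck alone.

```python
import random

random.seed(0)
AGREEMENT_RATE, BIN_SIZE, N_ASSESSORS = 0.75, 500, 60   # invented parameters

scores = []
for _ in range(N_ASSESSORS):
    agreed = sum(random.random() < AGREEMENT_RATE for _ in range(BIN_SIZE))
    scores.append(agreed / BIN_SIZE)

print("group average:", round(sum(scores) / len(scores), 3))   # close to 0.75
print("best assessor:", max(scores))                           # higher, by chance alone
```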

The best reviewers on each topic could have been the best because they got lucky and got an easy bin, or they got a bin with a large number of responsive documents, or just by chance. Unless we disentangle these three possibilities, we cannot claim that some reviewers were better or that reviewer quality matters. In fact, these data provide no evidence one way or the other relative to these claims.

In any case, the question in practice is not whether some human somewhere could do better than some computer system. The question is whether the computer system did well enough, or whether there is some compelling reason to force parties to bear the expense of using superior human reviewers. Even if some human reviewers could consistently do better than some machine, that is not the current industry standard.

In some sense, the ideal would be for the senior attorney in the case to read every single document with no effect of fatigue, boredom, distraction, or error. Instead, the current practice is to farm out first-pass review either to a team of anonymous, ad hoc, or inexpensive reviewers or to a keyword search. Even if Losey were right, the standard is to use the kind of reviewers that he says are not up to the task.


Issue: Human review is good for small volumes, but not large ones

This claim may also be true, but the present data do not provide any evidence for or against it. The evidence that Losey cites in support of this claim is the same evidence that, I argued, fails to show that human review is better than machine review, and it requires the same circular reasoning. We do not know from Webber’s analysis whether some reviewers were actually better than others, only that on this particular task, with these particular documents, they scored higher. Similarly, we don’t know from these data that reviewing only 500 documents is accurate whereas reviewing more leads to inaccuracy. We don’t even know, in the tested circumstances, whether performance would decrease beyond that number. All bins were about the same size, so there is no way, with these data, to test the hypothesis that performance decreases as the set size rises above 500. It simply was not tested.

When confronted with a small (e.g., several hundred) versus a large volume of documents to review, we can expect that fatigue, distraction, and other human factors will decrease accuracy over time. Based on other evidence from psychology and other areas, it is likely that performance will decline somewhat with larger document sets, but there is no evidence here for that.

If this were the only factor, we could arrange the situation so that reviewers only looked at 500 documents at a time before they took a break.


Issue: Random samples with 95% confidence levels and +/-2% confidence intervals are unrealistically high.

It’s not entirely clear what this claim means. On the one hand, there is a common misperception of what it means to have a 95% confidence level. Some people mistakenly assume that the confidence level refers to the accuracy of the results, but the confidence level is not the same thing as the accuracy level. The confidence level refers to the reliability of the measurement process; it does not tell us what we are measuring. The confidence interval (e.g., ±2%) is a prediction of how precisely our sample estimates the true value for the whole population. Put simply, a 95% confidence level means that if we repeated the sampling procedure many times, we would expect about 95% of the resulting confidence intervals to contain the true population value.

For example, a recent CNN/Time Magazine poll found that 37% of likely Republican primary voters in South Carolina supported Mitt Romney, based on a survey sample of 485 likely Republican primary voters. With a 95% confidence level, these results are accurate to within ±4.5 percentage points (the confidence interval). That does not mean that Romney is supported by 95% of the voters or that Romney has a 95% probability of winning. It means that if the election were held today, the survey predicts that about 37% of those voters, give or take 4.5 percentage points, would vote for Romney. I suspect that Losey means something different.
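
As a quick check of that arithmetic: the reported ±4.5 points follows from the standard normal approximation for a proportion. The worst-case p = 0.5 convention is my assumption about how the pollsters computed it, not something stated in the poll.

```python
import math

n = 485      # likely Republican primary voters sampled
z = 1.96     # z value for a 95% confidence level
p = 0.5      # worst-case proportion (assumed pollster convention)

margin = z * math.sqrt(p * (1 - p) / n)
print(f"margin of error: +/- {100 * margin:.2f} points")   # ~4.45, reported as 4.5
```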

I suspect that he is referring to the relatively weak levels of agreement found by the TREC studies and others. If our measurement is not very precise, then we can hardly expect that our estimates will be more precise. This concern, though, rests on obtaining the measurements in the same way that TREC has traditionally done it. If we can reduce the variability of our training set and our comparison set, we can go a long way toward making our estimates more precise.

In any case, many relevant estimates do not depend on the accuracy of a group of human assessors. In practice, for example, our estimates of such factors as the prevalence of responsive documents can rest on the decisions made by a single authoritative individual, perhaps the attorney responsible for signing the Rule 26(g) declaration. Those estimates can be made precise enough with reasonable sample sizes.
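
Here is a sketch of the sample-size arithmetic behind "reasonable sample sizes," using the same normal approximation and a worst-case 50% prevalence assumption of mine: a 95% confidence level with a ±2% confidence interval calls for roughly 2,400 reviewed documents, regardless of how large the collection is.

```python
z        = 1.96    # 95% confidence level
interval = 0.02    # desired +/- 2% confidence interval
p        = 0.5     # worst-case prevalence (my assumption)

n = z**2 * p * (1 - p) / interval**2
print(round(n))    # about 2,400 documents, whatever the collection size
```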


Conclusion

The main problem with Losey’s discussion derives from taking the results reported by Webber and Voorhees as an immutable fact of information retrieval. The observation that there is only moderate agreement among independent assessors is a description of the human judgments in these studies; it says nothing about any machine system used to predict which documents are responsive. The variability that leads to this moderate level of agreement can be reduced, and when it is, the performance of machine review can be measured more accurately.

The second problem derives from the difficulty of attributing causation in experiments that were not designed to attribute such causation. Within the data analyzed by Webber, for example, there is no way to distinguish the effects of chance from the effects due to assessor differences.

None of these comments should be interpreted as an indictment of TREC, Webber, or Losey. Science proceeds when people with different perspectives have the chance to critique each other’s work and to raise questions that may not have previously been considered.

None of these comments is intended to argue that predictive coding is fundamentally inaccurate. Rather, my main argument is that the studies from which these data were derived were not designed to answer many of the questions we would like to ask of predictive coding. They do not speak against its effectiveness, nor do they speak in favor of it. Other studies will need to be conducted that address these questions specifically and are designed to answer them.

Finally, even if we disagree about the effectiveness of predictive coding relative to human performance, there is little disagreement any more about the limitations of a purely human linear review or of a simple keyword search for identifying responsive documents. The cost of human review continues to skyrocket as the volume of documents that must be considered increases. In many cases, human review is simply impractical within the cost and time constraints of the matter. Under those circumstances, something else has to be done to reduce that burden. That something else seems to be predictive coding, and the fact that we can measure its accuracy only adds to its usefulness.

2 comments:

  1. Herb,

    Very interesting post, and I agree with most of your points. You or your readers might also like to read my blog reply to Ralph's posts:

    http://blog.codalism.com/?p=1549

    To my mind, the issue is this. We can objectively measure how well a production meets one observer's conception of relevance (provided that conception is reasonably consistent), simply by sampling the production and asking the observer to assess it. Text classification or predictive coding runs this process in reverse: the machine learns the classification by observing the assessor's assessments. What we're going to have serious difficulties doing, though, is measuring the reasonableness of that observer's conception of relevance, as embodied in a production, by using third-party assessors, because the level of disagreement between assessors is so high. The interactive task of the TREC Legal Track tackles this problem by having the topic authority (who plays the role of supervising attorney) communicate with participants and write out detailed relevance guidelines; then participants are asked to appeal assessments that violate these guidelines, and the TA adjudicates the appeals. But how can we objectively measure the quality of real-life productions?
