Monday, August 8, 2011
There seems to be an emerging workflow in eDiscovery in which predictive coding and highly professional reviewers are used in place of large ad hoc groups of temporary attorneys. There is growing recognition that without high levels of training and good quality-control methods, human review tends to be only moderately accurate. Selecting or training effective reviewers requires an understanding of what makes a reviewer successful and of how to measure that success. We can look to optimal decision theory, and particularly to the branch of it called detection theory, for insight into training and assessing reviewers.
Work on optimal decision theory began during World War II, for example to characterize how sensitive a radar was at detecting objects at a distance. The theory was later applied to human decision making as well, in work published after the war. This type of optimal decision theory is often called detection theory.
Detection theory concerns the question: Based on the available evidence, should we categorize an event as a member of set 1 or as a member of set 2? In radar, the evidence is in the signal reflected from an object, and the sets are whether the reflection is from a plane or from, say, a cloud. In document decisioning, the evidence consists of the words in the document and the sets are, for example, responsive and nonresponsive.
In order to isolate the essence of decisioning, we can simplify the situation further. For the moment, let’s think about a decision where all we have to do is decide whether a tone was played at a particular time or not—a kind of hearing test. Those events when the tone is present are analogous to a document being responsive and those events when the tone is absent are analogous to a document being nonresponsive.
Let’s put on a pair of headphones and listen for the tone. When the tone is present, it is played very softly, so there may be some uncertainty about whether it was present. How do we decide whether we heard a tone?
At first, it may seem that detecting the tone is not a matter of making a decision: it is either there or it is not. But one of the insights of detection theory is that detection does require a decision, and that decision is affected by more than just how loud the tone is.
In detection theory, two kinds of factors influence our decisions. The first is the sensitivity of the listener: how well can the listener distinguish between tone and nontone events? The second is bias: how willing is the listener to say that the tone was present?
In our hearing test, we present a series of events, or trials. On each trial the listener has to decide whether she is hearing the tone. Detection theory describes how to combine the evidence on each trial (e.g., the intensity of the tone) with these factors to reach the best possible decision.
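To make the two factors concrete, here is a minimal Python sketch of the standard equal-variance Gaussian model of detection theory. The model itself is an assumption, since this post does not commit to a particular one: on each trial the listener's internal evidence is drawn from a normal distribution with mean 0 (no tone) or mean d' (tone), both with standard deviation 1, and the listener says "tone" whenever the evidence exceeds a cutoff. Here d' plays the role of sensitivity and the cutoff plays the role of bias; the helper name `detection_rates` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution. Under the equal-variance assumption, evidence
# on no-tone trials is N(0, 1) and on tone trials N(d_prime, 1).
STD_NORMAL = NormalDist()


def detection_rates(d_prime: float, criterion: float) -> tuple[float, float]:
    """Return (hit_rate, false_alarm_rate) for a listener who says "tone"
    whenever the evidence exceeds `criterion`.

    Hypothetical helper: d_prime is the separation between the two evidence
    distributions (sensitivity); criterion is the cutoff (bias).
    """
    # P(evidence > criterion | tone present)
    hit_rate = 1.0 - STD_NORMAL.cdf(criterion - d_prime)
    # P(evidence > criterion | tone absent)
    false_alarm_rate = 1.0 - STD_NORMAL.cdf(criterion)
    return hit_rate, false_alarm_rate


if __name__ == "__main__":
    # Same sensitivity, three different cutoffs (biases).
    for cutoff in (0.0, 0.5, 1.0):
        h, fa = detection_rates(d_prime=1.0, criterion=cutoff)
        print(f"criterion={cutoff:.1f}  hits={h:.3f}  false alarms={fa:.3f}")
```

With d' held at 1, raising the cutoff from 0 to 1 lowers the hit rate from about 0.84 to 0.50 and the false-alarm rate from 0.50 to about 0.16: the listener's hearing has not changed, only her willingness to say "tone".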
Some listeners have more sensitive hearing than others. The more sensitive a person is, the softer the tone can be played and still be heard. Likewise, some reviewers are more sensitive than others: they can tell whether a document is responsive from subtler cues than other reviewers can.
Bias concerns the willingness or tendency of the listener to identify an event as a tone event. This willingness can be influenced by a number of factors, including the probability that a given event contains a tone and the consequences of each type of decision. Put simply, if tone events are very rare, people will be less likely to say that a tone occurred when they are uncertain. If tone events are more common, they will be more likely to say that a tone occurred when they are uncertain. In the same way, reviewers are more likely to categorize a document as responsive if the collection contains more responsive documents.
Similarly, if a person gets paid a dollar for correctly hearing a tone and gets charged 50 cents for an error, then that person will be more likely to say that he or she heard the tone. If we reverse the payment plan so that correctly hearing a tone yields 50 cents but errors cost a dollar, then that person will be more reluctant to say that he or she heard the tone. In the face of uncertainty, the optimal decision depends on both the evidence available and the consequences of each type of decision.
The point of this is that you can change the proportion of events that are said to contain the tone not only by making the tone louder or softer, but also by changing the consequences of decisions and the likelihood that the tone is actually present.
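The combined effect of probability and payoffs on bias can be expressed with the textbook expected-value rule from detection theory, assuming the listener wants to maximize expected payoff: say "tone" whenever the likelihood ratio P(evidence | tone) / P(evidence | no tone) exceeds β = [P(no tone) / P(tone)] × [(value of a correct rejection + cost of a false alarm) / (value of a hit + cost of a miss)]. The sketch below applies that rule to the payment plans above, under the assumption (not stated in the post) that the "error" being charged is a false alarm and that misses and correct rejections pay nothing. The function name `optimal_beta` is hypothetical.

```python
def optimal_beta(p_tone: float, value_hit: float, value_correct_rejection: float,
                 cost_false_alarm: float, cost_miss: float) -> float:
    """Likelihood-ratio threshold that maximizes expected payoff.

    Say "tone" whenever P(evidence | tone) / P(evidence | no tone) > beta.
    Costs are given as positive amounts. beta < 1 describes a liberal
    (yes-prone) listener; beta > 1 describes a conservative one.
    """
    if not 0.0 < p_tone < 1.0:
        raise ValueError("p_tone must be strictly between 0 and 1")
    prior_odds_against = (1.0 - p_tone) / p_tone
    payoff_ratio = (value_correct_rejection + cost_false_alarm) / (value_hit + cost_miss)
    return prior_odds_against * payoff_ratio


# Plan 1 (assumed reading): $1.00 per hit, $0.50 per false alarm, tone on half the trials.
print(optimal_beta(0.5, value_hit=1.00, value_correct_rejection=0.0,
                   cost_false_alarm=0.50, cost_miss=0.0))   # 0.5 -> liberal
# Plan 2: $0.50 per hit, $1.00 per false alarm.
print(optimal_beta(0.5, value_hit=0.50, value_correct_rejection=0.0,
                   cost_false_alarm=1.00, cost_miss=0.0))   # 2.0 -> conservative
# Symmetric payoffs, but the tone is present on only 10% of trials.
print(optimal_beta(0.1, value_hit=1.0, value_correct_rejection=1.0,
                   cost_false_alarm=0.0, cost_miss=0.0))    # about 9 -> very conservative
```

Reversing the payment plan moves β from 0.5 to 2, and making the tone rare pushes it to about 9, even though the evidence available on any single trial is unchanged.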
Bringing this back to document decisioning, the words in a document constitute the evidence that a document is responsive or not. In the face of uncertainty, decision makers will decide whether a document is responsive based on the degree to which the evidence is consistent with the document being responsive, on their sensitivity to this evidence, on the proportion of responsive documents in a collection, and on the consequences of making each kind of decision. All of these factors play a role in document decisioning.
In the paper by Roitblat, Kershaw, and Oot (2010, JASIST), for example, two teams of reviewers re-examined a sample of documents that had been reviewed by the original Verizon team. In this re-review, Team A identified 24.2% of the documents in their sample as responsive and Team B identified 28.76% as responsive. Although Team B identified significantly more documents as responsive, when the sensitivity of the two teams was measured in the way suggested by detection theory, they did not differ significantly from one another. They did, however, differ in their bias toward calling an uncertain document responsive. Team B was simply more willing than Team A to categorize documents as responsive, without being any better at distinguishing responsive from nonresponsive documents.
The most useful insight to be derived from an optimal decision theory approach to document decisioning is the separability of sensitivity and bias. Reviewers can differ in how sensitive they are in identifying responsive documents, and they can be guided to be more or less biased toward accepting documents as responsive when uncertain.
Presumably, sensitivity will be affected by education: the more reviewers know about the factors that govern whether a document is responsive, the better they will be at distinguishing responsive from nonresponsive documents. Their bias can be changed simply by asking them to be more or less fussy. The optimal review needs not only to be maximally sensitive to the difference between responsive and nonresponsive documents, but also to adopt the level of bias that is appropriate to the matter at hand.
When assessing reviewers, optimal decision theory suggests that you separate sensitivity from bias. The quality of a reviewer is represented by his or her sensitivity, not by bias. If all you measure, for example, is the proportion of responsive documents found by a candidate reviewer (where responsiveness is defined by someone authoritative), you could easily miss highly competent reviewers just because their level of bias differs from the authoritative reviewer's. Conversely, you could select a candidate who finds many responsive documents only because he or she is biased to call more documents responsive. Although reviewer sensitivity may be difficult to change, bias is very easy to change: you have only to ask the person to be more or less generous. Unless you measure both bias and sensitivity, you won’t be able to make sound judgments about the quality of reviewers, whether those reviewers are machines or people.
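One way to act on this advice is to score each candidate against a set of authoritative decisions and convert the results into the two standard equal-variance detection-theory statistics: sensitivity d' = z(H) − z(F) and bias c = −[z(H) + z(F)] / 2. Here H is the proportion of authoritatively responsive documents the candidate called responsive, F is the proportion of authoritatively nonresponsive documents the candidate called responsive, and z is the inverse of the standard normal distribution. This is a sketch under stated assumptions: it assumes the equal-variance Gaussian model, it adds 0.5 to each count (a log-linear correction) so that perfect scores do not produce infinite z values, `assess_reviewer` is a hypothetical name, and the candidate counts are invented for illustration.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # inverse of the standard normal CDF


def assess_reviewer(hits: int, misses: int,
                    false_alarms: int, correct_rejections: int) -> tuple[float, float]:
    """Return (d_prime, c) for a reviewer scored against authoritative decisions.

    hits:               authoritative-responsive docs the reviewer called responsive
    misses:             authoritative-responsive docs the reviewer called nonresponsive
    false_alarms:       authoritative-nonresponsive docs the reviewer called responsive
    correct_rejections: authoritative-nonresponsive docs the reviewer called nonresponsive
    """
    # Log-linear correction: add 0.5 to each cell so rates are never exactly 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = _z(hit_rate) - _z(fa_rate)       # sensitivity
    c = -(_z(hit_rate) + _z(fa_rate)) / 2      # bias: positive = conservative
    return d_prime, c


# Invented counts for two candidates reviewing the same 500 documents,
# 100 of which the authoritative reviewer coded as responsive.
candidates = {
    "A": (80, 20, 40, 360),   # calls 120 documents responsive
    "B": (92, 8, 100, 300),   # calls 192 documents responsive
}
for name, counts in candidates.items():
    d, c = assess_reviewer(*counts)
    print(f"Reviewer {name}: d' = {d:.2f}, c = {c:+.2f}")
```

In this invented example, Reviewer B finds more of the responsive documents (92 versus 80), which would look better on a recall-only test, yet both candidates come out with nearly the same d' (a little over 2). B's negative c shows that the extra finds come from a more liberal bias, not from better discrimination.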
Note: Traditional information retrieval science uses precision and recall to measure performance, and there is a well-known tradeoff between the two. You can increase precision by focusing the retrieval more narrowly, but this usually decreases recall. You can get the highest recall by retrieving all documents, but then precision will be very low. Precision and recall are affected by both bias and sensitivity, but they provide no means to separate one from the other. Sensitivity and bias have been used in information retrieval studies, but not as commonly as precision and recall.
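To see why precision and recall mix the two factors, the sketch below holds sensitivity fixed and varies only the bias. It assumes the equal-variance Gaussian model (a common textbook choice, not something the post commits to), a hypothetical collection in which 10% of documents are responsive, and d' = 2. Recall is then simply the hit rate, and precision is the share of documents called responsive that really are responsive. The helper name `precision_recall` is hypothetical.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF


def precision_recall(d_prime: float, criterion: float,
                     prevalence: float) -> tuple[float, float]:
    """Precision and recall for a reviewer with sensitivity d_prime who calls a
    document responsive when its evidence exceeds `criterion`.

    Evidence is assumed N(0, 1) for nonresponsive and N(d_prime, 1) for
    responsive documents; `prevalence` is the fraction that are responsive.
    """
    recall = 1.0 - PHI(criterion - d_prime)          # hit rate
    false_alarm_rate = 1.0 - PHI(criterion)
    true_pos = prevalence * recall                    # share of collection: responsive and retrieved
    false_pos = (1.0 - prevalence) * false_alarm_rate # share: nonresponsive but retrieved
    precision = true_pos / (true_pos + false_pos)
    return precision, recall


# Same sensitivity (d' = 2), three different biases.
for cutoff in (0.5, 1.0, 1.5):
    p, r = precision_recall(d_prime=2.0, criterion=cutoff, prevalence=0.10)
    print(f"criterion={cutoff:.1f}  precision={p:.2f}  recall={r:.2f}")
```

As the cutoff rises, precision climbs from roughly 0.25 to roughly 0.53 while recall falls from roughly 0.93 to roughly 0.69, even though d' never changes. A difference in precision or recall alone therefore cannot tell a more sensitive reviewer from one who is merely biased differently.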