TREC (Text Retrieval Conference) is a major annual research effort sponsored by the National Institute of Standards and Technology and other government agencies. The current report covers the 17th annual conference and the third year of the legal track. The 2008 legal track was coordinated by a group including Doug Oard, Bruce Hedin, Stephen Tomlinson, and Jason Baron. The goal of the legal track is "evaluation of search technology for discovery of electronically stored information in litigation and regulatory settings."
Three kinds of tasks were evaluated last year:
- Ad hoc retrieval, where each team used its own automated search technology to retrieve documents.
- Relevance feedback, where each team retrieved some documents, received feedback on that first pass, and then modified its searches to take advantage of that feedback.
- Interactive, where each team was allowed to interact with a topic authority and revise its queries based on that interaction. Each team was allowed ten hours of access to the authority. In addition, teams were allowed to appeal reviewer decisions that they thought were inconsistent with the topic authority's instructions.
Two other kinds of searches were also employed. One was a Boolean search negotiated between a "plaintiff" and a "defendant." The second was a search that retrieved all of the documents in the collection.
A sample of the retrieved documents was then judged by volunteer assessors (reviewers) to determine whether they were responsive to the topic. Finally, a random sample of documents that were not retrieved by any of the technologies was assessed in an attempt to find out whether some responsive documents might have been missed by all of the teams.
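To make that last estimation step concrete, here is a minimal sketch in Python. It assumes simple random sampling from the pool of unretrieved documents; the counts and the function itself are my own illustration, not the actual TREC sampling or estimation design.

```python
# Minimal sketch: estimate responsive documents missed by every team,
# assuming a simple random sample drawn from the unretrieved pool.
# All counts below are hypothetical illustrations, not TREC figures.

def estimate_missed(unretrieved_pool_size: int,
                    sample_size: int,
                    responsive_in_sample: int) -> float:
    """Project the sample's responsiveness rate onto the whole pool."""
    rate = responsive_in_sample / sample_size
    return rate * unretrieved_pool_size

# Example: 1,000,000 unretrieved documents, 400 sampled, 6 judged responsive.
missed = estimate_missed(1_000_000, 400, 6)
print(f"Estimated responsive documents missed by all teams: {missed:,.0f}")
# -> roughly 15,000, with wide uncertainty at such a low sampling rate
```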
Some of the more interesting findings in this report concern the levels of agreement seen between assessors. Some of the same topics were used in previous years of the TREC legal track, so it is possible to compare the judgments made during the current year with those made in previous years. For example, the level of agreement between assessors in the 2008 project and those from 2006 and 2007 was reported. Ten documents from each of the repeated topics that were previously judged to be relevant and ten that were previously judged to be non-relevant were assessed by 2008 reviewers. It turns out that "just 58% of previously judged relevant documents were judged relevant again this year." Conversely, "18% of previously judged non-relevant documents were judged relevant this year." Overall, the 2008 assessors agreed with the previous assessors 71.3% of the time. Unfortunately, this is a fairly small sample, but it is consistent with other studies of inter-reviewer agreement. In 2006, the TREC coordinators gave a sample of 25 relevant and 25 non-relevant documents from each topic to a second assessor and measured the agreement between the two. Here they found about 76% agreement. Other studies outside of TREC have found similar levels of (dis)agreement.
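To see how per-class figures like these combine into an overall agreement rate, the sketch below works through the arithmetic on a two-by-two table of paired judgments. The counts are hypothetical and chosen for round numbers; they illustrate the calculation rather than reproduce the report's exact 71.3% figure.

```python
# Minimal sketch of inter-assessor agreement arithmetic.
# Rows: prior-year judgment; columns: 2008 judgment.
# The counts are hypothetical, for illustration only.
prev_rel_now_rel = 58    # previously relevant, judged relevant again
prev_rel_now_non = 42    # previously relevant, judged non-relevant
prev_non_now_rel = 18    # previously non-relevant, judged relevant
prev_non_now_non = 82    # previously non-relevant, judged non-relevant again

total = prev_rel_now_rel + prev_rel_now_non + prev_non_now_rel + prev_non_now_non

# Per-class agreement, mirroring the kind of figures quoted in the report.
agree_on_relevant = prev_rel_now_rel / (prev_rel_now_rel + prev_rel_now_non)
agree_on_nonrelevant = prev_non_now_non / (prev_non_now_rel + prev_non_now_non)

# Overall agreement is the fraction of all pairs where the judgments match.
overall_agreement = (prev_rel_now_rel + prev_non_now_non) / total

print(f"Agreement on previously relevant docs:     {agree_on_relevant:.1%}")    # 58.0%
print(f"Agreement on previously non-relevant docs: {agree_on_nonrelevant:.1%}")  # 82.0%
print(f"Overall agreement:                         {overall_agreement:.1%}")     # 70.0%
```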
The interactive task also allowed the teams to appeal reviewer decisions if they thought that the reviewers had made a mistake. Of the 13,339 documents that were assessed for the interactive task, 966 were appealed to the topic authority. This authority played the role of, for example, the senior litigator on the case, with the ultimate authority to overturn the decisions of the volunteer assessors. In about 80% of these appeals, the topic authority agreed with the appeal and recategorized the document. In one case (Topic 103), the appeal process allowed the team that already had the highest recall (the percentage of relevant documents that were retrieved) to improve its performance by 47%.
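Because recall depends on the full set of documents judged relevant, every assessment the topic authority overturns can change the measured recall of every team, not just the team that filed the appeal. Here is a minimal sketch of that recomputation; the document identifiers, judgments, and function names are hypothetical illustrations, not the track's actual scoring code.

```python
# Minimal sketch: recompute recall after a topic authority overturns
# some assessor judgments. Document identifiers and counts are hypothetical.

def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Fraction of relevant documents that appear in the retrieved set."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Judgments before appeals: doc_id -> True if assessed relevant.
assessments = {"d1": True, "d2": True, "d3": False, "d4": False, "d5": True}

# Appeals the topic authority sustained: doc_id -> corrected judgment.
overrides = {"d3": True, "d5": False}

team_retrieved = {"d1", "d3", "d4"}

relevant_before = {d for d, rel in assessments.items() if rel}
relevant_after = {d for d, rel in {**assessments, **overrides}.items() if rel}

print(f"Recall before appeals: {recall(team_retrieved, relevant_before):.2f}")  # 1 of 3 -> 0.33
print(f"Recall after appeals:  {recall(team_retrieved, relevant_after):.2f}")   # 2 of 3 -> 0.67
```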
How do we interpret these findings?
These levels of (dis)agreement do not appear to be wildly different from those found in other studies. Inter-assessor consistency presents challenges to any study of information retrieval effectiveness. TREC studies have found repeatedly that this inconsistency does not affect the relative ranking of different approaches, but it could affect how we interpret the absolute levels of performance. As a result, TREC may substantially underestimate how well an application could do in a real-world setting, such as discovery of electronically stored information, if performance were measured consistently.
Like most studies of information retrieval, the TREC legal track takes assessor judgments to be the standard against which to judge the performance of various systems and approaches. The legal track used tens of assessors, primarily second- and third-year law students. With the volume of documents involved in the TREC legal track, the limited resources, and so on, there may not be a practical alternative to getting these judgments from many different reviewers. The assessors averaged only 21.5 documents per hour, so the average assessor took about 23.25 hours to review 500 documents, a substantial commitment of time and effort from a volunteer.
The inconsistency in assessor judgments limits the ability of any system to yield reliable results. The appeal process of the interactive task (Topic 103), for example, demonstrates what can be gained by increasing consistency. Practically every system showed an improvement in recall as a result of the appeal process, whether or not it had submitted the appeals. Improving consistency appears to improve the absolute level of performance, sometimes substantially.
The use of multiple assessors matches the standard practice in electronic discovery of distributing documents to multiple reviewers. The results described here, and others, suggest that there are likely to be similar levels of inconsistency in these cases. Taking the prior years' reviews as the standard against which to measure the 2008 assessors, those assessors recognized only 58% of the documents deemed relevant by the prior reviews, which amounts to 58% recall. Similarly, the 2006 study found that the second reviewer recognized as relevant, again, only 58% of the documents deemed relevant by the first reviewer. I do not believe that these results are an artifact of the TREC processes or procedures. Rather, I think that this level of inconsistency is endemic in the process of having multiple reviewers review documents over time.
In the practice of eDiscovery, human review suffers from unknown inconsistencies. There is no reason to think that actual legal review should be any more consistent than that found in the TREC studies. For that reason, standard review practice may be grossly under-delivering responsive documents. At the very least, attorneys should seek to measure the consistency of their reviewers and the effectiveness of their classifications.
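One way to measure consistency, offered here as a suggestion rather than anything prescribed by the TREC protocols, is to have a second reviewer re-code a random sample of already-reviewed documents and compute both the raw agreement rate and a chance-corrected statistic such as Cohen's kappa. The sketch below uses simulated judgments purely to illustrate the calculation.

```python
import random

# Sketch of a consistency check for a document review. The sampling scheme
# and the kappa statistic are my suggested choices, not part of the TREC
# legal track protocol; all judgments below are simulated.

def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Chance-corrected agreement between two reviewers' binary calls."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n          # reviewer A's rate of 'responsive'
    p_b = sum(labels_b) / n          # reviewer B's rate of 'responsive'
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# first_pass: doc_id -> responsive? (hypothetical data for illustration)
first_pass = {f"doc{i}": random.random() < 0.3 for i in range(10_000)}

# Draw a random sample and simulate a second reviewer who disagrees ~20% of the time.
sample_ids = random.sample(list(first_pass), 200)
second_pass = {d: (not first_pass[d]) if random.random() < 0.2 else first_pass[d]
               for d in sample_ids}

a = [first_pass[d] for d in sample_ids]
b = [second_pass[d] for d in sample_ids]
print(f"Raw agreement: {sum(x == y for x, y in zip(a, b)) / len(a):.1%}")
print(f"Cohen's kappa: {cohens_kappa(a, b):.2f}")
```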
The TREC legal track represents a tremendous resource for the legal community and for the information retrieval community as a whole. It is a monumental effort, representing untold hours and uncounted dollars. In future articles I plan to describe other interesting findings to come out of this study.