Monday, January 11, 2010

Computer Assisted Document Categorization in eDiscovery

The January issue of the Journal of the American Society for Information Science and Technology, 61(1):1–11, 2010, has an article by Roitblat, Kershaw, and Oot describing a study that compared computer classification of eDiscovery documents with manual review.  It found that computer classification was at least as consistent as human review was at distinguishing responsive from nonresponsive documents.  If having attorneys review documents is a reasonable approach to identifying responsive documents, then any system that does as well as human review should also be considered a reasonable approach.

The study compared an original categorization, done by contract attorneys in response to a Department of Justice Second Request, with categorizations done by two new human teams and two computer systems.  The two re-review teams were employees of a service provider specializing in conducting legal reviews of this sort.  Each team consisted of five reviewers who were experienced in the subject matter of this collection.  The two teams independently reviewed a random sample of 5,000 documents.  The two computer systems were provided by experienced eDiscovery service providers, one in California and one in Texas.  The authors of the study had no financial relationship with either service provider or with the company providing the re-review.  The companies donated their time and facilities to the study.

The documents used in the study were collected in response to a "Second Request" concerning Verizon's acquisition of MCI.  The documents were collected from 83 employees in 10 US states.  Together they consisted of 1.3 terabytes of electronic files in the form of 2,319,346 documents.  The collection consisted of about 1.5 million email messages, 300,000 loose files, and 600,000 scanned documents.  After eliminating duplicates, 1,600,047 items were submitted for review.  The attorneys spent about four months, seven days a week, and 16 hours per day on the review at a total cost of $13,598,872.61 or about $8.50 per document.  After review, a total of 176,440 items were produced to the Justice Department.
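The per-document cost follows directly from the figures above. Here is a minimal check in Python; the variable names are only illustrative, and the share of reviewed items that were produced is a derived figure that the post itself does not state.

```python
# Figures quoted in the post above.
total_cost = 13_598_872.61   # total cost of the original review, in dollars
reviewed_items = 1_600_047   # items submitted for review after deduplication
produced_items = 176_440     # items produced to the Justice Department

# Cost per reviewed document, as quoted in the post.
cost_per_document = total_cost / reviewed_items

# Derived figure (not stated in the post): share of reviewed items produced.
produced_share = produced_items / reviewed_items

print(f"cost per reviewed document: ${cost_per_document:.2f}")    # $8.50
print(f"share of reviewed items produced: {produced_share:.1%}")  # 11.0%
```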

Accuracy was measured as agreement with the decisions made by the original review team.  The level of agreement between the two human review teams was also measured. 

The two re-review teams identified a greater proportion of the documents as responsive than did the original review.  Overall, their decisions agreed with the original review on 75.6% and 72.0% of the documents, respectively.  The two teams agreed with one another on about 70% of the documents.

About half of the documents that were identified as responsive by the original review were identified as responsive by either of the re-review teams.  Conversely, about a quarter of the documents identified as nonresponsive by the original review were identified as responsive by the new teams. 
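The two kinds of figures reported here, overall agreement and the share of one review's responsive calls that another review confirmed, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the study's code: the function names are made up for this example, and the labels are hypothetical toy data, not documents from the study.

```python
def percent_agreement(review_a, review_b):
    """Share of documents on which two reviews assign the same label."""
    if len(review_a) != len(review_b) or not review_a:
        raise ValueError("reviews must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(review_a, review_b))
    return matches / len(review_a)


def share_confirmed(original, rereview, label=True):
    """Of the documents the original review gave `label`,
    the share to which the re-review gave the same label."""
    selected = [r for o, r in zip(original, rereview) if o == label]
    if not selected:
        raise ValueError("original review has no documents with that label")
    return sum(r == label for r in selected) / len(selected)


# Hypothetical toy labels: True = responsive, False = nonresponsive.
original = [True, True, False, False, False, False, True, False]
team_a = [True, False, False, True, False, False, True, False]

print(percent_agreement(original, team_a))             # 0.75
print(share_confirmed(original, team_a))               # 2 of 3 responsive calls confirmed
print(1 - share_confirmed(original, team_a, False))    # share of nonresponsive calls flipped
```

Note that overall agreement can be fairly high even when the overlap on responsive documents is low, because most documents in a collection like this one are nonresponsive.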

Although the original review and the re-reviews were conducted by comparable people with comparable skills, their level of agreement was only moderate.  We do not know whether this was due to variability in the original review, or was due to some other factor, but these results are comparable to those seen in other situations where people make independent judgments about the categorization of documents (for example, in the TREC studies).  A senior attorney reclassified the documents on which the two teams disagreed.  After this reclassification, the level of agreement between this adjudicated set and the original review rose to 80%. 

The two computer systems identified fewer documents as responsive than did the human review teams, but still slightly more than were identified by the original review.  One system agreed with the original classification on 83.2% of the documents and the other on 83.6%.  As with the human review teams, about half of the documents identified as responsive by the original review were similarly classified by the computer systems.

As legal professionals search for ways to reduce the costs of eDiscovery, this study suggests that it may be reasonable to employ computer-based categorization.  The two computer systems agreed with the original review at least as often as a human team did. 

The computer systems did not create their decisions out of thin air.  One of the systems based its classifications in part on the adjudicated results of the two review teams and the senior attorney.  The other system based its process on an analysis of the Justice Department Request, the training documents given to the reviewers (both the original review and the two review teams), and a proprietary ontology.  These two systems, in other words, implemented a set of human judgments.  These systems succeed to the extent that they can capture and reliably implement these judgments.  The computers and their software do not get tired, cannot be distracted, and are able to work 24 hours a day.  These results imply that using a computer-based classification system is a viable way to produce reasonable eDiscovery document categorization.

Please contact me (herb@ediscoveryinstitute.org) or (herb@orcatec.com) if you would like a copy of the full paper.

39 comments:

  1. "The attorneys spent about four months, seven days a week, and 16 hours per day on the review at a total cost of $13,598,872.61 or about $8.50 per document. After review, a total of 176,440 items were produced to the Justice Department."

    This obscures the information somewhat... how many hours total did the attorney review team put into the project? How many people were on the team?

  2. Interesting article. It's becoming apparent that the more review teams treat data like data instead of paper, the higher the quality of review, the lower the cost and the greater the consistency of the result.

  5. I am confused - at $13M for 1.6M documents, the reviewers were only able to review 7 docs an hour? Also, to be fair, shouldn't the costs of the computer-assisted review be compared? I would like to see more objective information before drawing any conclusions.
