Tuesday, September 28, 2010

OrcaTec Releases the OrcaTec Text Analytics Platform

SDR and OrcaTec Join Forces to Release Leading Text Analytics Tools for Governance, Risk Management, and Compliance (GRC) and eDiscovery


We at OrcaTec announce our first new products under our combined new leadership. The OrcaTec Text Analytics platform delivers unprecedented levels of accuracy and efficiency, enabling users to retrieve, filter, analyze, visualize and act on unstructured information for 90% less than conventional technology.


The biggest expense in responding to government inquiries, investigations, and eDiscovery is the cost of document review. That cost is an increasingly heavy burden on companies, and most of it is wasted on considering irrelevant or nonresponsive documents. We now have tools that can reduce that waste and, with it, the burden.


In January of this year, Anne Kershaw, Patrick Oot, and I published a paper in the Journal of the American Society for Information Science and Technology showing that two computer-assisted systems, neither of which was related to OrcaTec or to the Electronic Discovery Institute, could classify documents into responsive and nonresponsive categories at least as well as new human reviewers. This paper also suggested some of the ways that the quality of categorization could be measured, whether it was done by attorneys or by machines. After seeing the results of that study and how well they were received, I decided that we could create a categorizer that might do even better than the two systems we tested. Part of what we are announcing today is the release of that new software.


By watching how an expert categorizes a sample of documents, the system can induce the implicit rules the expert is using and replicate them. The computer does not make up the decision rules; it merely implements the rules that the expert used. Consistent with the results from the JASIST paper, we have found that the level of accuracy, or agreement between the computer and the expert, is as high as or higher than one would see if another person were doing the review.
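
As a rough illustration of this kind of learning from sample judgments (not the OrcaTec implementation itself), the sketch below trains a simple text classifier on a handful of invented expert calls and then applies it to unreviewed documents. It assumes the scikit-learn library; the documents and labels are made up for the example.

```python
# A minimal sketch of inducing a categorizer from an expert's sample judgments.
# Illustrative only, not the OrcaTec implementation. Assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical expert-reviewed sample: document text plus the expert's call.
sample_docs = [
    "Q3 revenue forecasts attached for the merger model",
    "Lunch at noon? The usual place.",
    "Draft response to the second request, please edit",
]
sample_labels = ["responsive", "nonresponsive", "responsive"]

# The model induces the expert's implicit rules from the sample judgments ...
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(sample_docs, sample_labels)

# ... and then applies those same rules to the unreviewed collection.
unreviewed = ["Updated merger valuation spreadsheet", "Fantasy football picks"]
print(model.predict(unreviewed))
```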


By whatever means a document is categorized, there are two kinds of errors: a nonresponsive document can be falsely categorized as responsive, and a responsive document can be falsely categorized as nonresponsive (a simple way to tally these two error types is sketched just after the list below). Making the correct decision depends on a multistep process. Correct results depend on each of these steps being executed flawlessly (or simply on a lucky guess).
  1. The document has to be considered. If it is never seen by the decision maker, machine or person, then it cannot be judged. 
  2. The document has to contain information that would allow it to be identified as responsive or nonresponsive. For example, an email that said “sounds good,” all by itself, would be unlikely to be identified as responsive absent other information.
  3. The categorizer (machine or human) has to notice that the distinguishing information is present. People’s attention wanders, and they get tired.
  4. The criteria used to make the decision have to be valid and reliable. Validity means roughly that they are using the right rules and reliability means that they are using them consistently. 
  5. They have to make the correct response. For example, they have to push the responsive button when they intend to hit the responsive button.
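
To see how these two error types are usually tallied, here is a small illustrative calculation with invented counts; it is generic and not tied to any particular review or system.

```python
# Illustrative tally of the two error types for any categorizer, human or machine.
# The counts are invented; "responsive" plays the role of the positive class.
true_positives = 900     # responsive documents correctly called responsive
false_negatives = 100    # responsive documents wrongly called nonresponsive
false_positives = 300    # nonresponsive documents wrongly called responsive
true_negatives = 8700    # nonresponsive documents correctly called nonresponsive

recall = true_positives / (true_positives + false_negatives)      # share of responsive docs that were found
precision = true_positives / (true_positives + false_positives)   # share of "responsive" calls that were right
print(f"recall = {recall:.2f}, precision = {precision:.2f}")       # recall = 0.90, precision = 0.75
```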

Humans are still more expert at deciding what criteria distinguish responsive from nonresponsive documents. It is difficult to send a computer to law school, but computers are better at practically everything else on this list. Instead of looking at a hundred documents an hour, a computer can look at thousands. The computer does not take breaks or vacations. It does not think about what it is going to have for lunch. It does not shift its criteria. So if we can tell the computer how to distinguish between responsive and nonresponsive documents, its greater consistency means it will do a better job of implementing those rules.


There are different ways of providing the rules to the computer. Some companies use teams of linguists and lawyers to formulate the rules. We use a set of sample judgments. Responsiveness is like pornography: it is difficult to formulate explicit rules about what makes a document responsive, but we know it when we see it. In our system, there is no need to formulate explicit rules; rather, the computer watches how the documents are classified during review and comes to mimic those decisions. The result of this categorization is a set of documents that are very likely to be responsive. The documents that the computer identifies as responsive can be reviewed further, while the documents identified as nonresponsive can be sampled for review. Rather than taking weeks or months to wade through the documents more or less at random, the legal team can concentrate its efforts from the beginning of the process on those documents that are likely to be responsive.
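
As a rough illustration of that workflow (again, not a description of the OrcaTec product itself), the sketch below takes invented responsiveness scores from some trained categorizer, queues the likely-responsive documents for attorney review, and draws a random sample of the predicted-nonresponsive pile for a quality-control look. The threshold and sample size are placeholders.

```python
# Sketch of prioritized review, using invented scores from some trained categorizer.
# Likely-responsive documents go to the legal team; a random sample of the
# predicted-nonresponsive pile is drawn for a quality-control look.
import random

def split_for_review(docs, scores, threshold=0.5, qc_sample_size=2, seed=7):
    """Split documents by predicted responsiveness and sample the rest for QC."""
    likely_responsive = [d for d, s in zip(docs, scores) if s >= threshold]
    likely_nonresponsive = [d for d, s in zip(docs, scores) if s < threshold]
    rng = random.Random(seed)
    qc_sample = rng.sample(likely_nonresponsive,
                           min(qc_sample_size, len(likely_nonresponsive)))
    return likely_responsive, qc_sample

docs = ["merger model v3", "lunch?", "draft second request response", "vacation photos"]
scores = [0.92, 0.08, 0.85, 0.03]   # invented probabilities of responsiveness
to_review, qc = split_for_review(docs, scores)
print(to_review)   # ['merger model v3', 'draft second request response']
print(qc)          # random sample from the predicted-nonresponsive documents
```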


Couple these processes with a detailed methodology of sampling and quality assurance and you have a result that is reliable, transparent, verifiable, and defensible.
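
One generic form such a sampling check can take, sketched below with invented numbers, is to review a random sample of the documents set aside as nonresponsive and estimate how many responsive documents remain there, along with a rough confidence interval. This is a statistical illustration, not OrcaTec's specific quality-assurance protocol.

```python
# Generic sampling check with invented numbers: estimate the rate of responsive
# documents remaining in the predicted-nonresponsive set, with a rough
# normal-approximation 95% confidence interval.
import math

sample_size = 400        # documents randomly drawn from the predicted-nonresponsive set
responsive_found = 6     # responsive documents the reviewers found in that sample

p = responsive_found / sample_size
margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
print(f"estimated miss rate: {p:.3f} +/- {margin:.3f}")   # 0.015 +/- 0.012
```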


The OrcaTec Text Analytics Platform is available as a hosted SaaS application, or on-site as an appliance. On-site, it can be managed directly by the customer or by OrcaTec personnel.


About OrcaTec: The 2010 merger of OrcaTec with the software engineering company Strategic Data Retention combined the talents of a longtime provider of SaaS eMail archiving and high-speed discovery services with the OrcaTec analytics technology. Come visit us at www.orcatec.com.

Monday, January 11, 2010

Computer Assisted Document Categorization in eDiscovery

The January issue of the Journal of the American Society for Information Science and Technology, 61(1):1–11, 2010, has an article by Roitblat, Kershaw, and Oot describing a study that compared computer classification of eDiscovery documents with manual review. It found that computer classification was at least as consistent as human review at distinguishing responsive from nonresponsive documents. If having attorneys review documents is a reasonable approach to identifying responsive documents, then any system that does as well as human review should also be considered a reasonable approach.

The study compared an original categorization, done by contract attorneys in response to a Department of Justice Second Request, with one done by two new human teams and two computer systems. The two re-review teams were employees of a service provider specializing in conducting legal reviews of this sort. Each team consisted of five reviewers who were experienced in the subject matter of this collection. The two teams independently reviewed a random sample of 5,000 documents. The two computer systems were provided by experienced eDiscovery service providers, one in California and one in Texas. The authors of the study had no financial relationship with either service provider or with the company providing the re-review. The companies donated their time and facilities to the study.

The documents used in the study were collected in response to a "Second Request" concerning Verizon's acquisition of MCI. The documents were collected from 83 employees in 10 US states. Together they consisted of 1.3 terabytes of electronic files in the form of 2,319,346 documents. The collection consisted of about 1.5 million email messages, 300,000 loose files, and 600,000 scanned documents. After eliminating duplicates, 1,600,047 items were submitted for review. The attorneys spent about four months on the review, working seven days a week and 16 hours per day, at a total cost of $13,598,872.61, or about $8.50 per document. After review, a total of 176,440 items were produced to the Justice Department.

Accuracy was measured as agreement with the decisions made by the original review team.  The level of agreement between the two human review teams was also measured. 

The two re-review teams identified a greater proportion of the documents as responsive than did the original review. Overall, their decisions agreed with the original review on 75.6% and 72.0% of the documents. The two teams agreed with one another on about 70% of the documents.
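
For readers who want to see exactly what agreement means here, the short sketch below computes it for two invented sets of review calls; it is simply the fraction of documents on which the two reviews made the same responsive/nonresponsive decision.

```python
# Agreement as used above: the fraction of documents on which two reviews made
# the same responsive/nonresponsive call. The labels below are invented.
original  = ["R", "N", "N", "R", "N", "N", "R", "N"]
re_review = ["R", "N", "R", "R", "N", "N", "N", "N"]

agreement = sum(a == b for a, b in zip(original, re_review)) / len(original)
print(f"agreement: {agreement:.1%}")   # 75.0% for these invented labels
```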

About half of the documents that were identified as responsive by the original review were identified as responsive by either of the re-review teams.  Conversely, about a quarter of the documents identified as nonresponsive by the original review were identified as responsive by the new teams. 

Although the original review and the re-reviews were conducted by comparable people with comparable skills, their level of agreement was only moderate. We do not know whether this was due to variability in the original review or to some other factor, but these results are comparable to those seen in other situations where people make independent judgments about the categorization of documents (for example, in the TREC studies). A senior attorney reclassified the documents on which the two teams disagreed. After this reclassification, the level of agreement between this adjudicated set and the original review rose to 80%.

The two computer systems identified fewer documents as responsive than did the human review teams, but still a bit more than were identified by the original review.  One system agreed with the original classification on 83.2% of the documents and the other on 83.6%.  Like the human review teams, about half of the documents identified as responsive by the original review were similarly classified by the computer systems.

As legal professionals search for ways to reduce the costs of eDiscovery, this study suggests that it may be reasonable to employ computer-based categorization.  The two computer systems agreed with the original review at least as often as a human team did. 

The computer systems did not create their decisions out of thin air. One of the systems based its classifications in part on the adjudicated results of the two review teams and the senior attorney. The other system based its process on an analysis of the Justice Department Request, the training documents given to the reviewers (both the original review and the two review teams), and a proprietary ontology. These two systems, in other words, implemented a set of human judgments. These systems succeed to the extent that they can capture and reliably implement those judgments. The computers and their software do not get tired, do not get distracted, and can work 24 hours a day. These results imply that using a computer-based classification system is a viable way to produce reasonable eDiscovery document categorization.

Please contact me (herb@ediscoveryinstitute.org) or (herb@orcatec.com) if you would like a copy of the full paper.