Tuesday, September 28, 2010

OrcaTec Releases the OrcaTec Text Analytics Platform

SDR and OrcaTec Join Forces to Release the Leading Text Analytics Tools for Governance, Risk Management, and Compliance (GRC) and eDiscovery


We at OrcaTec are pleased to announce our first new products under our new combined leadership. The OrcaTec Text Analytics Platform delivers unprecedented levels of accuracy and efficiency, enabling users to retrieve, filter, analyze, visualize, and act on unstructured information for 90% less than conventional technology.


The biggest expense in responding to government inquiries, investigations, and eDiscovery is the cost of document review. That cost is an increasingly heavy burden on companies, and most of it is wasted on considering irrelevant or nonresponsive documents. We now have tools that can reduce that waste and, with it, the burden.


In January of this year, Anne Kershaw, Patrick Oot, and I published a paper in the Journal of the American Society for Information Science and Technology showing that two computer-assisted systems, neither of which was related to OrcaTec or to the Electronic Discovery Institute, could classify documents into responsive and nonresponsive categories at least as well as a new team of human reviewers. The paper also suggested some of the ways that the quality of categorization could be measured, whether it was done by attorneys or by machines. After seeing the results of that study and their acceptance, I decided that we could create a categorizer that might do even better than the two systems we tested. Part of what we are announcing today is the release of that new software.


By watching how an expert categorizes a sample of documents, the system can induce the implicit rules that the expert is using and replicate them. The computer does not make up the decision rules; it merely implements the rules that the expert used. Consistent with the results from the JASIST paper, we have found that the level of accuracy or agreement between the computer and the expert is as high as or higher than one would see if another person were doing the review.
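To make the general idea concrete, here is a minimal sketch of the supervised-learning approach it describes, not OrcaTec's actual implementation. It uses the open-source scikit-learn library, and the sample documents and labels are hypothetical placeholders.

```python
# A minimal sketch of learning an expert's categorization decisions
# from a labeled sample. Illustrative only; not OrcaTec's technology.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical expert-reviewed sample: document texts and the
# expert's responsive (1) / nonresponsive (0) calls.
sample_docs = [
    "Q3 revenue forecast attached for the merger discussion",
    "Lunch on Friday?",
    "Draft term sheet for the acquisition, please review",
    "Fantasy football picks are due tonight",
]
sample_labels = [1, 0, 1, 0]

# Turn text into term-weight vectors, then fit a classifier that
# induces the expert's implicit decision rules from the examples.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sample_docs)
model = LogisticRegression()
model.fit(X, sample_labels)

# Apply the induced rules to documents the expert never saw.
new_docs = ["Here is the revised merger agreement", "Happy birthday!"]
predictions = model.predict(vectorizer.transform(new_docs))
print(predictions)  # e.g., [1 0]
```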


By whatever means a document is categorized, there are two kinds of errors: a nonresponsive document can be falsely categorized as responsive, and a responsive document can be falsely categorized as nonresponsive (the sketch after the list below shows one way these error rates can be quantified). Making the correct decision depends on a multistep process, and correct results depend on each of these steps being executed flawlessly (or simply on a lucky guess).
  1. The document has to be considered. If it is never seen by the decision maker, machine or person, then it cannot be judged. 
  2. The document has to contain information that would allow it to be identified as responsive or nonresponsive. For example, an email that says only “sounds good” would be unlikely to be identified as responsive absent other information. 
  3. The categorizer (machine or human) has to notice that the distinguishing information is present. People’s attention wanders; they get tired. 
  4. The criteria used to make the decision have to be valid and reliable. Validity means, roughly, that the right rules are being used; reliability means that they are being applied consistently. 
  5. They have to make the correct response. For example, when they intend to mark a document responsive, they have to press the responsive button, not slip and press the nonresponsive one.
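One standard way to quantify the two error types is with precision and recall, computed against an authoritative set of judgments. The sketch below uses made-up numbers and is illustrative only; it is not the specific measurement protocol from the JASIST paper.

```python
# A sketch of the two error types in plain Python. "truth" holds the
# authoritative call for each document; "calls" holds the reviewer's
# (or machine's) decision. 1 = responsive, 0 = nonresponsive.
truth = [1, 1, 0, 0, 1, 0, 0, 1]
calls = [1, 0, 0, 1, 1, 0, 0, 1]

false_positives = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 1)
false_negatives = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 0)
true_positives  = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 1)

# Precision: of the documents called responsive, how many really are?
precision = true_positives / (true_positives + false_positives)
# Recall: of the truly responsive documents, how many were found?
recall = true_positives / (true_positives + false_negatives)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.75, 0.75
```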

Humans are still more expert at deciding on the criteria that distinguish responsive from nonresponsive documents. It is difficult to send a computer to law school, but computers are better at practically everything else on this list. Instead of looking at a hundred documents an hour, a computer can look at thousands. The computer does not take breaks or vacations. It does not think about what it is going to have for lunch. It does not shift its criteria. So, if we can just tell the computer how to distinguish between responsive and nonresponsive documents, its greater consistency will let it do a better job of implementing those rules.


There are different ways of providing the rules to the computer. Some companies use teams of linguists and lawyers to formulate the rules. We use a set of sample judgments. Responsiveness is like pornography: it is difficult to formulate explicit rules about what makes a document responsive, but we know it when we see it. In our system, there is no need to formulate explicit rules; instead, the computer watches how the documents are classified during review and comes to mimic those decisions. The result of this categorization is a set of documents that are very likely to be responsive. The documents that the computer identifies as responsive can be reviewed further, while the documents identified as nonresponsive can be sampled for review. Rather than taking weeks or months to wade through the documents more or less at random, the legal team can concentrate its efforts from the beginning on those documents that are likely to be responsive.
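As a rough illustration of that workflow, and not a description of OrcaTec's actual pipeline, the sketch below routes documents the categorizer marks responsive to the review team and pulls a random sample of the rest for quality control. The classify() function and the 5% sample rate are assumptions for illustration.

```python
import random

# Hypothetical routing of categorizer output: documents the model
# calls responsive go to the review team; a random sample of the
# rest is pulled for quality-control review. classify() and the
# 5% sample rate are illustrative assumptions.
def route_documents(docs, classify, sample_rate=0.05, seed=42):
    responsive = [d for d in docs if classify(d) == 1]
    nonresponsive = [d for d in docs if classify(d) == 0]
    rng = random.Random(seed)
    k = max(1, int(len(nonresponsive) * sample_rate))
    qc_sample = rng.sample(nonresponsive, min(k, len(nonresponsive)))
    return responsive, qc_sample
```

The point of the quality-control sample is to check, on a manageable number of documents, whether anything responsive is slipping into the discard pile.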


Couple these processes with a detailed methodology of sampling and quality assurance, and you have a result that is reliable, transparent, verifiable, and defensible.
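As one illustration of how such sampling can support defensibility (an assumption about how the methodology might be applied, not a description of it), a confidence interval can be placed on the rate at which responsive documents slip into the set marked nonresponsive:

```python
import math

# Sketch: a 95% normal-approximation confidence interval for the
# proportion of responsive documents found in a random sample drawn
# from the set the categorizer marked nonresponsive. The sample
# numbers are hypothetical.
sample_size = 400       # documents sampled from the nonresponsive set
responsive_found = 8    # responsive documents the QC review turned up

p = responsive_found / sample_size
margin = 1.96 * math.sqrt(p * (1 - p) / sample_size)
print(f"estimated miss rate: {p:.1%} +/- {margin:.1%}")
# -> estimated miss rate: 2.0% +/- 1.4%
```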


The OrcaTec Text Analytics Platform is available as a hosted SaaS application or on-site as an appliance. On-site, it can be managed directly by the customer or by OrcaTec personnel.


About OrcaTec: The 2010 merger of OrcaTec with the software engineering company Strategic Data Retention combined the talents of a long-time provider of SaaS eMail archiving and high-speed discovery services with the OrcaTec analytics technology. Come visit us at www.orcatec.com.
