Friday, February 25, 2011

Opening the Black Box of Predictive Coding

Yesterday, we did a very interesting podcast on ESIBytes with Karl Schieneman, Jaime Carbonell, and Vasco Pedro on predictive coding. Jaime and Vasco are well known in the machine learning space, but have not been active in eDiscovery, so it was really interesting to get their perspective on how these advanced technologies could be applied effectively there.

The first point I would like to emphasize from that conversation is the consensus among the four of us that machine learning tools are not magic bullets; they must involve humans.

We talked about two ways of using machine learning to organize documents--clustering and categorization. In eDiscovery, categorization is often called predictive coding. Clustering groups together documents that are similar to one another; the computer derives the groupings to be used. Categorization, on the other hand, starts with categories specified by people and puts each document into the category that best matches it.
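
To make the distinction concrete, here is a rough sketch in Python using scikit-learn. The documents, labels, and parameter choices are invented purely for illustration; they are not from the podcast or from any particular product.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    docs = ["merger agreement draft", "quarterly earnings report",
            "lunch on friday?", "revised merger terms attached"]
    vectors = TfidfVectorizer().fit_transform(docs)

    # Clustering: the computer derives the groupings on its own.
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)

    # Categorization: people supply the categories up front, as labels on
    # example documents, and the computer matches new documents to them.
    labels = ["responsive", "responsive", "not responsive", "responsive"]
    classifier = LogisticRegression().fit(vectors, labels)
    predictions = classifier.predict(vectors)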

When using clustering, people have to decide which groups of documents are important after the computer organizes them. When using categorization, people have to design the categories before the computer organizes the documents. In neither case do we rely on the computer to make the legal judgments about what is important to the matter. That is a decision best made by someone with real-world experience and legal judgment.

Computers can take the tedium out of implementing human legal judgments, but so far they are not in a position to make the judgments themselves. These systems do not take the lawyer out of the equation; they simply reduce the level of effort required to apply the attorney's judgment to ever-growing collections of documents.

That brings me to the second point I want to discuss--machine learning algorithms and their ability to handle large data sets. Jaime Carbonell remarked that most of the machine learning algorithms he was familiar with handled gigabyte-sized problems well, but did not do well on terabyte-sized problems. In general that's true, but he did mention that search algorithms are a major exception. Google and others have demonstrated the ability to search the World Wide Web and its billions of documents.
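
The usual reason search scales so well is the inverted index: a query touches only the posting lists for its own terms, not the whole collection. Here is a toy illustration in Python (my example, not something we discussed on the show):

    from collections import defaultdict

    def build_index(docs):
        # Map each term to the set of document ids that contain it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    def search(index, query):
        # Intersect posting lists; the work is proportional to the size
        # of those lists, not to the size of the whole collection.
        postings = [index[term] for term in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    docs = {1: "merger agreement draft", 2: "draft earnings report"}
    index = build_index(docs)
    print(search(index, "draft merger"))  # {1}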

Not all machine learning algorithms are subject to the kinds of scaling constraints that Jaime mentioned. If clustering and categorization could be transformed into search problems, for example, then we would expect those algorithms to scale to Web-sized problems as well. That, in fact, is what we at OrcaTec have done. We have recast the traditional clustering and categorization algorithms as search-like algorithms that handle very large collections in reasonable amounts of time. As a result, our software can cluster and categorize even very large collections efficiently and effectively.
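
To give a sense of what such a recasting can look like in general terms (this is a generic nearest-neighbor sketch, not a description of OrcaTec's actual algorithms): index a set of labeled exemplar documents, then categorize a new document by retrieving its closest matches and voting over their labels.

    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    exemplars = ["merger agreement draft", "revised merger terms",
                 "fantasy football picks", "lunch on friday?"]
    labels = ["responsive", "responsive", "not responsive", "not responsive"]

    vectorizer = TfidfVectorizer().fit(exemplars)
    index = NearestNeighbors(n_neighbors=3).fit(vectorizer.transform(exemplars))

    def categorize(doc):
        # Retrieval does the heavy lifting; the categorization step is
        # just a vote over the labels of the closest indexed exemplars.
        _, neighbor_ids = index.kneighbors(vectorizer.transform([doc]))
        votes = Counter(labels[i] for i in neighbor_ids[0])
        return votes.most_common(1)[0][0]

    print(categorize("draft of the merger agreement"))  # responsive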

Finally, that brings me to my third point from our conversation. There was widespread agreement that assessing the effectiveness of our processes is essential. Whether or not you know or care what goes on inside the black box, or inside the head of the temp attorney hired to do the review, we all agreed that measurement is an essential part of the process. You cannot know that you have done a reasonable job unless you can show that you measured the job that you did. You cannot improve the quality of your process unless you measure that quality.

When we have measured human review performance, it has not been as consistent as one might imagine. When compared with machine learning, the machines come out at least as accurate as people, and often more accurate. From what I see and hear, measurement of eDiscovery processes is rapidly becoming the norm, and, in my opinion, it should be. Appropriate quality assessments are neither expensive nor time consuming. They can, however, be invaluable in demonstrating that your review involved a reasonable process at a reasonable level of effectiveness. The Sedona Conference has an excellent paper on achieving and measuring quality in eDiscovery; I highly recommend reading it.
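
For concreteness, here is the arithmetic behind a basic assessment, with invented counts: draw a random sample of reviewed documents, have an authoritative reviewer re-judge them, and compute precision and recall.

    # Invented sample counts, purely for illustration.
    true_positives = 90    # coded responsive, confirmed responsive
    false_positives = 10   # coded responsive, actually not responsive
    false_negatives = 30   # coded not responsive, actually responsive

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)

    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
    # precision=0.90 recall=0.75 f1=0.82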

1 comment:

  1. Herbert,

    As usual, excellent points. What's the core difference in the search approach that cracks the scaling challenge?

    "Not all machine learning algorithms are subject to the kinds of scaling constraints that Jaime mentioned. For example, if it were possible to transform clustering and categorization into search problems, then we would expect that these algorithms would also scale to Web-sized problems. That, in fact, is what we at OrcaTec have done. We have recast the traditional clustering and categorization algorithms into scalable search-like algorithms that scale directly into very large collections in reasonable amounts of time. As a result, our software can handle even very large collections and efficiently and effectively cluster and categorize the documents."
