<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3720617976685068531</id><updated>2012-01-16T17:42:22.976-08:00</updated><category term='ediscovery analytics'/><category term='electronic discovery'/><category term='TREC'/><category term='ediscovery'/><category term='semantic search'/><category term='truevert'/><category term='information retrieval'/><category term='concept search'/><category term='TREC legal track'/><category term='Early Case Assessment'/><category term='orcatec'/><title type='text'>Information Discovery</title><subtitle type='html'>A discussion of information discovery in Web 3.0, litigation, intelligence analysis, semantic search and other areas.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>12</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-47796752802630270</id><published>2012-01-09T22:35:00.000-08:00</published><updated>2012-01-10T09:33:42.579-08:00</updated><title type='text'>On Some Selected Search Secrets</title><content type='html'>Ralph Losey recently wrote an important series of blog posts (&lt;a href="http://e-discoveryteam.com/2011/12/11/secrets-of-search-part-one/"&gt;here&lt;/a&gt;, &lt;a href="http://e-discoveryteam.com/2011/12/18/secrets-of-search-part-ii/"&gt;here&lt;/a&gt;, and &lt;a href="http://e-discoveryteam.com/2011/12/29/secrets-of-search-part-iii/"&gt;here&lt;/a&gt;) describing five secrets of search.  He pulled together a substantial array of facts and ideas that should have a powerful impact on eDiscovery and the use of technology in it.  He raised so many good points that it would take up all of my time just to enumerate them.  He also highlighted the need for peer review.  In that spirit I would like to address a few of his conclusions in the hope of furthering discussions among lawyers, judges, and information scientists about the best ways to pursue eDiscovery.&lt;br /&gt;&lt;br /&gt;These are the problematic points I would like to consider:&lt;br /&gt;1. Machines are not that good at categorizing documents.  They are limited to about 65% precision and 65% recall.&lt;br /&gt;2. Webber’s analysis shows that human review is better than machine review.&lt;br /&gt;3. Reviewer quality is paramount.&lt;br /&gt;4. Human review is good for small volumes, but not large ones.&lt;br /&gt;5. Random samples with 95% confidence levels and +/- 2% confidence intervals are unrealistically high.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Issue: Machines are not that good at categorizing documents.  
They are limited to about 65% precision and 65% recall.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Losey quotes extensively from a &lt;a href="http://www.umiacs.umd.edu/~oard/sire11/papers/webber.pdf"&gt;paper &lt;/a&gt;written by William Webber, which reanalyzes some results from the TREC Legal Track, 2009, and some other sources.  Like Losey’s commentary, this paper also has a lot to recommend it.  Some of the conclusions that Losey reaches are fairly attributable to Webber, but some go beyond what Webber would probably be comfortable with.  The most significant fact, because important arguments are based on it, is a description of some work by Ellen Voorhees that concluded that 65% recall at 65% precision is the best performance one can expect.  &lt;br /&gt;&lt;br /&gt;The problem is that this 65% factoid is taken out of context.  In the context of the TREC studies and the way that documents are ultimately determined to be relevant or not, this is thought to be the best that can be achieved.  The 65% is not a fact of nature.  It says, actually, nothing about the accuracy of the predictive coding systems being studied.  Losey notes that this limit is due to the inherent uncertainty in human judgments of relevance, but goes on to claim that this is a limit on machine-based or machine assisted categorization.  It is not.&lt;br /&gt;&lt;br /&gt;Part of the TREC Legal Track process is to distribute sets of documents to ad hoc reviewers, whom they call assessors.  Each assessor gets a block or bin of about 500 documents and is asked to categorize them as relevant or not relevant to the topic query.  None of the documents in this part of the process is reviewed by more than one assessor.  Each assessor typically reviews only one batch.  Although information about the topic is provided to each assessor, there is no rigorous effort expended to train them.  As you might expect, the assessors can be highly variable.  
But, generally speaking, we don’t have any assessment of their variability or skill level.  This is an important point and I will have to come back to it soon.&lt;br /&gt;&lt;br /&gt;Predictive coding systems generally work by applying machine learning to a sample of documents and extrapolating from that sample to the whole collection.  Different systems get their samples in different ways, but the performance of the system depends on the quality of the sample.  Garbage in, garbage out. More fully, variability in accuracy can come from at least three sources:&lt;br /&gt;1. Variability in the training set&lt;br /&gt;2. Variability due to software algorithms&lt;br /&gt;3. Variability due to the judgment standard against which the system is scored&lt;br /&gt;&lt;br /&gt;If the system is trained on an inconsistent set of documents, or if it performs inconsistently, or if it is judged inconsistently, its ultimate level of performance will be poor.  Voorhees, in the paper cited by Webber, found that professional information analysts agreed with one another on less than half of the responsive documents.  This fact says nothing about any predictive coding system; it speaks only to the agreement of one person with another.  One of the assessors she compared was the author of the topic and so could be considered the best available authority on the topic.  The second assessor was the author of a different topic. &lt;br /&gt;&lt;br /&gt;Under TREC, the variability due to the training set is left uncontrolled.  It is up to each team to figure out how to train its system. The variability due to the judgment standard is consistent across systems, so any variation among systems can be attributed to the training set or the system capabilities.  That strategy is perfectly fine for most TREC purposes.  We can compare the relative performance of the participating systems.  The problem only comes when we want to ascertain how well a system will do in absolute terms.  
The performance of predictive coding systems in the TREC Legal Track is suppressed by the variability of the judgment standard.  This is not a design flaw in TREC; it is a problem only when we want to extrapolate from TREC results to eDiscovery or other real world situations.  The TREC scores underestimate how well a system will do with more rigorous training and testing standards.  The original TREC methodology was never designed to produce absolute estimates of performance, only relative ones.&lt;br /&gt;&lt;br /&gt;Anything that we can do to improve the consistency of the training and testing sets of document categorizations will improve the absolute quality of the results.  But such quality improvements are typically expensive.&lt;br /&gt;&lt;br /&gt;The TREC Legal Track has moved to using a Topic Authority (like Voorhees’s primary assessor).  Even an authoritative assessor is not infallible, but it may be the best that we can achieve.  It also may be realistic. &lt;br /&gt;&lt;br /&gt;I would like to see the Topic Authority (TA) produce an authoritative training set and a second authoritative judgment set.  The first set is used to train the predictive coding system; the second is used to test it.  &lt;br /&gt;&lt;br /&gt;Using a topic authority to provide the training and final assessment sets will substantially reduce the variability of the human judgments.  We need two sets because we cannot use the same documents to train the system as we use to test the system.  If we used only one, then the performance of the system on the same documents could overestimate its capabilities.  The system could simply memorize the training examples and spit back the same categories it was given.  Having separate training and testing sets is standard procedure in most machine learning studies. &lt;br /&gt;&lt;br /&gt;When we do a scientific study, we want to know how well the system will do in other, similar, situations.  
This prediction typically requires a statistical inference, and to make a valid statistical inference the two sets of measurements (training and testing) need to be independent. &lt;br /&gt;&lt;br /&gt;To translate this into an eDiscovery process, the training set should be created by the person who knows the most about the case and then evaluated, for example, using a random sample of documents predicted to be responsive and nonresponsive, by the same person.   Losey is correct: if you have multiple reviewers, each applying idiosyncratic standards to the review, you will get poor results, even from a highly accurate predictive coding system. On the other hand, with rigorous training and a rigorous evaluation process, high levels of predictive coding accuracy are often achievable. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Issue: Webber’s analysis shows that human review is better than machine review&lt;/b&gt; &lt;br /&gt;&lt;br /&gt;I have no doubt that human review could sometimes be better than machine-assisted review, but the data discussed by Webber do not say anything one way or the other about this claim.&lt;br /&gt;&lt;br /&gt;Webber did, in fact, find that some of the human reviewers showed higher precision and recall than did the best-performing computer system on some tasks.  But, because of the methods used, we don’t know whether these differences were due merely to chance, to specific methods used to obtain the scores, or to genuine differences among reviewers.  Moreover, the procedure prevents us from making a valid statistical comparison.&lt;br /&gt;&lt;br /&gt;The TREC Legal Track results that were analyzed in Webber’s paper involved a three-step process.  The various predictive coding systems were trained on whatever data their teams thought were appropriate.  The results of the multiple teams were combined and sampled along with a number of documents that were not designated responsive by any team.  From these, the bins or batches were created and distributed to the assessors.  
Once the assessors made their decisions, the machine teams were given a second chance to “appeal” any decisions to the Topic Authority.  If the TA agreed with the computer system’s judgment, the computer system was then scored as performing better and the assessor as performing worse.  The appeals process, in other words, moved the target after the arrow had been shot.&lt;br /&gt;&lt;br /&gt;If none of the documents judged by a particular assessor was appealed, then that assessor would have precision and recall of 1.0.  Prior to the appeal, the assessors’ judgments &lt;b&gt;were&lt;/b&gt; the accuracy standard.  The more documents that were appealed, the more likely that assessor would be to have a low score.  Their score could not increase from the appeals process.  So, whether an assessor scored well or scored poorly was determined almost completely by the number of documents that were appealed—by how much the target was moved.  &lt;br /&gt;&lt;br /&gt;Because of the (negative) correlation between the performance of the computer system and the performance of the assessor, their performances were not independent.  Therefore, a simple statistical comparison between the performance of the assessor and the performance of the computer system is not valid.&lt;br /&gt;&lt;br /&gt;Even if the comparison were valid, we still have other problems. The different TREC tasks involved different topics.  Some were presumably easier than others.  The assessors who made the decisions may have varied in ability, but we have no information about which were more skillful.  The bins or batches that were reviewed probably differed among themselves in the number of responsive documents each contained.  
Because only one assessor judged each document, we don’t know whether the differences in accuracy (as judged by the Topic Authority) were due to differences in the documents being reviewed or to differences in the assessors.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Issue: Reviewer quality is paramount&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Webber found that some assessors performed better than others.  Continuing the argument of the previous section, though, we cannot infer from this that some assessors were more talented, skilled, or better prepared than others.  &lt;br /&gt;It is entirely circular to say that some assessors were more skillful than others and so were more accurate because the only evidence we have that they were more skillful is that they were measured to be more accurate.  You cannot explain an observation (higher accuracy) by the fact that you made the observation (higher accuracy). It cannot be both the cause and the effect.  &lt;br /&gt;&lt;br /&gt;Whether the source of variation in performance among the assessors was due to variation in the number or difficulty of the decisions or was due to differences in assessor talent, you cannot simply pick the best of one thing (the assessor performance) and compare it to the average of another (the computer assisted review).  The computer performance is based on all of the documents; each assessor’s performance is based on only about 500 documents.  The computer performance was a representational equivalent of the average of all assessor judgments.  Just by chance, some assessors will be higher than others.  In fact, about half of the assessors should, just by chance, score above and about half should score below the average.  But we have no way to determine whether those selected reviewers scored high because of chance or because of some difference in skill. We measured them only once and we cannot use the fact that they scored well to explain their high score.  We need some independent evidence. 
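&lt;br /&gt;&lt;br /&gt;The role of chance can be illustrated with a small simulation. This is only a sketch under assumptions that are mine, not TREC’s: 20 hypothetical assessors, one bin of 500 documents each, and every assessor given exactly the same probability (0.8) of judging any document correctly. The function name and all of the numbers are made up for illustration.

```python
import random

def simulate_scores(n_assessors=20, n_docs=500, true_accuracy=0.8, seed=0):
    """Observed accuracy of equally skilled (hypothetical) assessors, one bin each."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_assessors):
        # Every judgment is correct with the same probability for every assessor,
        # so any difference between assessors is produced by chance alone.
        outcomes = rng.choices((1, 0), weights=(true_accuracy, 1.0 - true_accuracy), k=n_docs)
        scores.append(sum(outcomes) / n_docs)
    return scores

scores = simulate_scores()
print("lowest observed accuracy:", min(scores))
print("highest observed accuracy:", max(scores))
```

Even though every simulated assessor has identical skill, the observed scores are spread over a range, and one of them necessarily comes out on top. A single measurement per assessor cannot tell that kind of luck apart from a real difference in ability.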
&lt;br /&gt;&lt;br /&gt;The best reviewers on each topic could have been the best because they got lucky and got an easy bin, or they got a bin with a large number of responsive documents, or just by chance.  Unless we disentangle these three possibilities, we cannot claim that some reviewers were better or that reviewer quality matters.  In fact, these data provide no evidence one way or the other relative to these claims.&lt;br /&gt;&lt;br /&gt;In any case, the question in practice is not whether some human somewhere could do better than some computer system.  The question is whether the computer system did well enough, or whether there is some compelling reason to force parties through the expense of using superior human reviewers.  Even if some human reviewers could consistently do better than some machine, this is not the current industry standard.  &lt;br /&gt;In some sense, the ideal would be for the senior attorney in the case to read every single document with no effect of fatigue, boredom, distraction, or error.  Instead, the current practice is to farm out first pass review to either a team of anonymous, ad hoc, or inexpensive reviewers or to search by keyword.  Even if Losey were right, the standard is to use the kind of reviewers that he says are not up to the task. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Issue: Human review is good for small volumes, but not large ones&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This claim may also be true, but the present data do not provide any evidence for or against it.  The evidence that Losey cites in support of this claim is the same evidence that, I argued, failed to show that human review is better than machine review.  It requires the same circular reasoning.  We do not know from Webber’s analysis whether some reviewers were actually better than others, only that on this particular task, with these particular documents, they scored higher.  
Similarly, we don’t know from these data that doing only 500 documents is accurate, whereas doing more leads to inaccuracy.  We don’t even know in the tested circumstance whether performance would decrease over that number.  All bins were about the same size, so there is no way to test the hypothesis with these data that performance decreases as the set size rises above 500. It just was not tested.&lt;br /&gt;&lt;br /&gt;When reviewers are confronted with a large rather than a small (e.g., several hundred) volume of documents to review, we can expect that fatigue, distraction, and other human factors will decrease accuracy over time.  Based on other evidence from psychology and other areas, it is likely that performance will decline somewhat with larger document sets, but there is no evidence here for that.&lt;br /&gt;&lt;br /&gt;If this were the only factor, we could arrange the situation so that reviewers only looked at 500 documents at a time before they took a break.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Issue: Random samples with 95% confidence levels +/- 2% confidence intervals are unrealistically high.&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It’s not entirely clear what this claim means.  On the one hand, there is a common misperception of what it means to have a 95% confidence level.  Some people mistakenly assume that the confidence level refers to the accuracy of the results. But the confidence level is not the same thing as the accuracy level.  &lt;br /&gt;The confidence level refers to our belief in the measurement’s reliability; it does not tell us what we are measuring.  The confidence interval (e.g., ±2%) is a prediction of how precisely our sample estimates the true value of the whole population.  Put simply, a 95% confidence level means that in 95% of the experiments conducted at this confidence level, we expect to find the true population value within the range specified by the experiment’s confidence interval. 
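&lt;br /&gt;&lt;br /&gt;The link between sample size and confidence interval can be sketched with a little arithmetic. The sketch assumes a simple random sample and the usual normal approximation for a proportion, with the worst case of a 50% true proportion; the function names are hypothetical, chosen only for this illustration.

```python
import math

Z_95 = 1.96  # standard normal quantile used for a 95% confidence level

def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the confidence interval for a proportion estimated from n items."""
    return z * math.sqrt(p * (1.0 - p) / n)

def sample_size(margin, p=0.5, z=Z_95):
    """Smallest simple random sample whose half-width is at most the requested margin."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))

print("margin of error for a sample of 485:", margin_of_error(485))
print("sample size needed for a margin of 0.02:", sample_size(0.02))
```

Under these assumptions, a ±2% interval at a 95% confidence level calls for a sample of roughly 2,400 documents, and that number depends on the desired interval and confidence level rather than on how accurate the review itself turns out to be.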
&lt;br /&gt;&lt;br /&gt;For example, a recent CNN/Time Magazine poll found that 37% of likely Republican primary voters in South Carolina supported Mitt Romney, based on a survey sample size of 485 likely Republican primary voters.  With a 95% confidence level, these results are accurate to within ±4.5 percentage points (the confidence interval).  It does not mean that Romney is supported by 95% of the voters or that Romney has a 95% probability of winning.  It means that the survey’s best estimate is that, if the election were held today, 37% of the voters would vote for Romney, give or take 4.5 percentage points.  I suspect that Losey means something different.&lt;br /&gt;&lt;br /&gt;I suspect that he is referring to the relatively weak levels of agreement found by the TREC studies and others.  If our measurement is not very precise, then we can hardly expect that our estimates will be more precise.  This concern, though, rests on obtaining the measurements in the same way that TREC has traditionally done it.  If we can reduce the variability of our training set and our comparison set, we can go a long way toward making our estimates more precise.  &lt;br /&gt;&lt;br /&gt;In any case, many relevant estimates do not depend on the accuracy of a group of human assessors.  In practice, for example, our estimates of such factors as the prevalence of responsive documents can rest on the decisions made by a single authoritative individual, perhaps the attorney responsible for signing the Rule 26(g) declaration.  Those estimates can be made precise enough with reasonable sample sizes.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The main problem with Losey’s discussion derives from taking the results reported by Webber and Voorhees as an immutable fact of information retrieval.  
The observation that there is only moderate agreement among independent assessors is a description of the human judgments in these studies; it says nothing about any machine systems used to predict which documents are responsive or not.  The variability that leads to this moderate level of agreement can be reduced, and when it is, the performance of machine review can be more accurately measured.&lt;br /&gt;&lt;br /&gt;The second problem derives from the difficulty of attributing causation in experiments that were not designed to attribute such causation.  Within the data analyzed by Webber, for example, there is no way to distinguish the effects of chance from the effects due to assessor differences.  &lt;br /&gt;&lt;br /&gt;None of these comments should be interpreted as an indictment of TREC, Webber, or Losey.  Science proceeds when people with different perspectives have the chance to critique each other’s work and to raise questions that may not have previously been considered.  &lt;br /&gt;&lt;br /&gt;None of these comments is intended to argue that predictive coding is fundamentally inaccurate.  Rather, my main argument is that the studies from which these data were derived were not designed to answer many of the questions we would like to ask of them.  They do not speak against the effectiveness of predictive coding, nor do they speak in favor of it.  Other studies will need to be conducted that address these questions specifically and are designed to answer them.&lt;br /&gt;&lt;br /&gt;Finally, even if we disagree about the effectiveness of predictive coding relative to human performance, there is little disagreement any more about the effectiveness of a purely human linear review or of using a simple keyword search to identify responsive documents.  The cost of human review continues to skyrocket as the volume of documents that must be considered increases.  In many cases, human review is simply impractical within the cost and time constraints of the matter.  
Under those circumstances, something else has to be done to reduce that burden.  That something else seems to be predictive coding, and the fact that we can measure its accuracy only adds to its usefulness.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-47796752802630270?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/47796752802630270/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2012/01/on-some-selected-search-secrets.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/47796752802630270'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/47796752802630270'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2012/01/on-some-selected-search-secrets.html' title='On Some Selected Search Secrets'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-1733153839286988801</id><published>2011-08-08T12:03:00.000-07:00</published><updated>2011-08-08T12:03:52.521-07:00</updated><title type='text'>Optimal document decisioning</title><content type='html'>&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-xq-w_Xmpw20/TkAy1IAuLBI/AAAAAAAAABo/gCffkPtZ2Fc/s1600/iStock_000009374608XSmall.jpg" imageanchor="1" 
style="clear:left; float:left;margin-right:1em; margin-bottom:1em"&gt;&lt;img border="0" height="319" width="320" src="http://1.bp.blogspot.com/-xq-w_Xmpw20/TkAy1IAuLBI/AAAAAAAAABo/gCffkPtZ2Fc/s320/iStock_000009374608XSmall.jpg" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;There seems to be an emerging workflow in eDiscovery where predictive coding and highly professional reviewers are being used in place of large ad hoc groups of temporary attorneys.  There is recognition that without high levels of training and good quality-control methods, human review tends to be only moderately accurate.  Selecting or training effective reviewers requires an understanding of what makes a reviewer successful and of how to measure that success. We can look to optimal decision theory, and particularly to the branch of optimal decision theory called detection theory to provide insight into training and assessing reviewers.&lt;br /&gt;&lt;br /&gt;The work on optimal decision theory began during World War II to measure and understand, for example, how to characterize the sensitivity of radar to detect objects at a distance.  It then came to be applied to human decision making as well, work that was published after the war.  This type of optimal decision theory is often called detection theory.&lt;br /&gt;&lt;br /&gt;Detection theory concerns the question: Based on the available evidence, should we categorize an event as a member of set 1 or as a member of set 2?  In radar, the evidence is in the signal reflected from an object, and the sets are whether the reflection is from a plane or from, say, a cloud.  In document decisioning, the evidence consists of the words in the document and the sets are, for example, responsive and nonresponsive.  &lt;br /&gt;&lt;br /&gt;In order to isolate the essence of decisioning, we can simplify the situation further.  
For the moment, let’s think about a decision where all we have to do is decide whether a tone was played at a particular time or not—a kind of hearing test.  Those events when the tone is present are analogous to a document being responsive and those events when the tone is absent are analogous to a document being nonresponsive.  &lt;br /&gt;&lt;br /&gt;Let’s put on a pair of headphones and listen for the tone.  When the tone is present it is played very softly, so there may be some uncertainty about whether the tone was present or not.  How do we decide whether we hear a tone or not?  &lt;br /&gt;&lt;br /&gt;At first, it may seem that detecting the tone is not a matter of making a decision.  It is either there or it is not.  But, one of the insights of detection theory is that it does actually require a decision and that decision is affected by more than just how loud the tone is.&lt;br /&gt;&lt;br /&gt;In detection theory, two kinds of factors influence our decisions. The first is the sensitivity of the listener—how well can the listener distinguish between tone and nontone events?  The second factor is bias—how willing is the listener to say that the tone was present?  &lt;br /&gt;&lt;br /&gt;In our hearing test, we present a series of events or trials.  The listener has to decide on each of those events whether she is hearing the tone.  Detection theory describes how to combine the level of evidence (e.g., intensity of the tone) and these other factors to come up with the best decision possible.&lt;br /&gt;&lt;br /&gt;Some listeners have more sensitive hearing than others.  The more sensitive a person is, the softer the tone can be played and still be heard.  Some reviewers are more sensitive than others.  They can tell whether a document is responsive based on more subtle cues than other reviewers.&lt;br /&gt;&lt;br /&gt;Bias concerns the willingness or tendency of the listener to identify an event as a tone event.  
This willingness can be influenced by a number of factors, including the probability that a given event contains a tone and by the consequences of each type of decision.  Put simply, if tone events are very rare, then people will be less likely to say that a tone occurred when they are uncertain.  If tone events are more common, they will be more likely to say that a tone occurred when they are uncertain.  Reviewers are more likely to categorize a document as responsive if the collection contains more responsive documents.&lt;br /&gt;&lt;br /&gt;Similarly, if a person gets paid a dollar for correctly hearing a tone and gets charged 50 cents for an error, then that person will be more likely to say that he or she heard the tone.  If we reverse the payment plan so that correctly hearing a tone yields 50 cents, but errors cost a dollar, then that person will be reluctant to say that he or she hears the tone.  In the face of uncertainty, the optimal decision depends on the evidence available and the consequences of each type of decision.&lt;br /&gt;&lt;br /&gt;The point of this is that you can change the proportion of events that are said to contain the tone not only by making the tone louder or softer, but also by changing the consequences of decisions and the likelihood that the tone is actually present.  &lt;br /&gt;&lt;br /&gt;Bringing this back to document decisioning, the words in a document constitute the evidence that a document is responsive or not.  In the face of uncertainty, decision makers will decide whether a document is responsive based on the degree to which the evidence is consistent with the document being responsive, on their sensitivity to this evidence, on the proportion of responsive documents in a collection, and on the consequences of making each kind of decision. 
All of these factors play a role in document decisioning.&lt;br /&gt;&lt;br /&gt;In the paper by Roitblat, Kershaw, and Oot (2010, &lt;a href="http://onlinelibrary.wiley.com/doi/10.1002/asi.21233/abstract?"&gt;JASIST&lt;/a&gt;), for example, two teams of reviewers re-examined a sample of documents that had been reviewed by the original Verizon team.  In this re-review, Team A identified 24.2% of the documents in their sample as responsive and Team B identified 28.76% as responsive.  Although Team B identified significantly more documents as responsive, when the sensitivity of these two teams was measured in the way suggested by detection theory, the two teams did not differ significantly from one another in sensitivity.  They did differ in their bias, however, to call an uncertain document responsive. Team B was simply more willing than Team A to categorize documents as responsive without being any better at distinguishing responsive from nonresponsive documents.&lt;br /&gt;&lt;br /&gt;The most useful insight to be derived from an optimal decision theory approach to document decisioning is the separability of sensitivity and bias.  Reviewers can differ in how sensitive they are to identifying responsive documents and they can be guided to be more or less biased toward accepting documents as responsive when uncertain.  &lt;br /&gt;&lt;br /&gt;Presumably sensitivity will be affected by education.  The more that reviewers know about the factors that govern whether a document is responsive, the better they will be at distinguishing responsive from nonresponsive.  Their bias can be changed simply by asking them to be more or less fussy.  The optimum review needs not only to be maximally sensitive to the difference between responsive and nonresponsive documents, but to adopt the level of bias that is appropriate to the matter at hand.&lt;br /&gt;&lt;br /&gt;When assessing reviewers, optimal decision theory suggests that you separate out the sensitivity from the bias.  
The quality of a reviewer is represented by his or her sensitivity, not by bias.  If all you measure, for example, is the proportion of responsive documents found by a candidate reviewer (where responsive is defined by someone authoritative), then you could easily miss highly competent reviewers because they have a different level of bias from the authoritative reviewer.  Equally likely, you could select a candidate who finds many responsive documents just because he or she is biased to call more documents responsive.  Although reviewer sensitivity may be difficult to change, bias is very easy to change.  You have only to ask the person to be more or less generous.  Unless you measure both bias and sensitivity, you won’t be able to make sound judgments about the quality of reviewers, whether those reviewers are machines or people.&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Note: Traditional information retrieval science uses precision and recall to measure performance.  These two measures recognize that there is a tradeoff between precision and recall.  You can increase precision by focusing the retrieval more narrowly, but this usually results in a decrease in recall.  You can get the highest recall by retrieving all documents, but then you would have very low precision.  Precision and recall measures are affected by both bias and sensitivity, but they do not provide any means to separate one from the other.    
Sensitivity and bias have been used in information retrieval studies, but not as commonly as precision and recall.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-1733153839286988801?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/1733153839286988801/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2011/08/optimal-document-decisioning.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/1733153839286988801'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/1733153839286988801'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2011/08/optimal-document-decisioning.html' title='Optimal document decisioning'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-xq-w_Xmpw20/TkAy1IAuLBI/AAAAAAAAABo/gCffkPtZ2Fc/s72-c/iStock_000009374608XSmall.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-6012732722161858584</id><published>2011-06-12T21:31:00.000-07:00</published><updated>2011-06-22T07:47:02.789-07:00</updated><title type='text'>Competitor’s press release about predictive coding patent stretches the truth</title><content type='html'>[updated June 14, 2011]&lt;br 
/&gt;[updated June 22, 2011]&lt;br /&gt;One of &lt;a href="http://www.orcatec.com"&gt;OrcaTec&lt;/a&gt;'s competitors, Recommind, has recently been awarded a patent related to predictive coding.  In a &lt;a href="http://www.recommind.com/releases/20110608/recommind_patents_predictive_coding"&gt;press release&lt;/a&gt; dated June 8, 2011, announcing this award, they make some very grandiose claims with no basis in fact.  According to their press release, they actually claim to have patented predictive coding.  This claim is a gross exaggeration and unsupported by the details of the patent (No. 7,933,859) or the history of predictive coding.  Patents are intended to protect inventions and there is no evidence that Recommind invented predictive coding.&lt;br /&gt;&lt;br /&gt;Having examined the patent carefully, I can say that this patent covers only a very narrow method of computing in predictive coding and is unlikely to have any impact on the ability of any other eDiscovery service provider to continue to offer this game-changing capability.  Primarily it involves the combination of using three things: Probabilistic Latent Semantic Analysis (a key part of Recommind's core product), Support Vector Machines (a statistical learning tool), and user feedback.&lt;br /&gt;&lt;br /&gt;The scope of a patent is determined by its claims, not by the title of a press release.  A valid patent requires that the proposed invention be (among other things) novel and non-obvious. In contrast, what we now call predictive coding has a very long history.  I have written about some of this history previously.&lt;br /&gt;&lt;br /&gt;Further, the use of predictive coding in eDiscovery predates Recommind’s patent application by many years.  
At the time that they filed their patent, predictive coding was in wide use, so it could not be considered  novel or non-obvious as the Patent Office defines these terms, though some specific methods for implementing it may meet these requirements.&lt;br /&gt;&lt;br /&gt;Predictive coding is a family of evidence-based document categorization technologies that are used to put documents or electronically stored information (ESI) into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged.  The underlying idea of using evidence to categorize objects has been around since the 18th Century.  The notion of applying similar ideas to document classification or categorization was described in 1961 by M.E. Maron.&lt;br /&gt;&lt;br /&gt;In his paper, &lt;a href="http://sci2s.ugr.es/keel/pdf/algorithm/articulo/Maron1961.pdf"&gt;Maron &lt;/a&gt;noted:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;Loosely speaking, it appears that there are two parts to the problem of classifying. The first part concerns the selection of certain relevant aspects of an item as pieces of evidence. The second part of the problem concerns the use of these pieces of evidence to predict the proper category to which the item in question belongs. &lt;/i&gt;(p. 404).&lt;br /&gt;&lt;br /&gt;The general process of updating a system’s classification rules as a result of user feedback is called “relevance feedback.”  It has been in use since at least 1971 (Rocchio, 1971).  For example, Lewis and Gale (1994) used relevance feedback and “uncertainty sampling” to predictively categorize news stories.  Uncertainty sampling is a method of selecting specific documents to be categorized by the user when improving the quality of the predictive categorizer.  
The documents shown to the user for classification are those that would maximally reduce the uncertainty of the classifier.&lt;br /&gt;&lt;br /&gt;In 2002, Paul Graham introduced &lt;a href="http://www.paulgraham.com/spam.html"&gt;SpamBayes&lt;/a&gt;, which uses techniques similar to those described by Maron to distinguish SPAM from nonSPAM (or HAM) emails. An initial sample of categorized SPAM and HAM emails is analyzed by the program to learn how specific words provide evidence for one category or the other.  Subsequent emails are then classified according to these implicit rules and the evidence, which consists of the words in the emails.  If the system misclassifies an email as either SPAM or HAM, the user can flag these errors and the classifier will update itself to reflect these reclassified emails.  This seems to me to be a very clear example of predictive coding, which again argues that predictive coding, per se, does not meet the novelty or non-obviousness criteria.&lt;br /&gt;&lt;br /&gt;eDiscovery service providers have been doing predictive coding (sometimes called by other names) for many years. In January of 2010, before the Recommind patent was filed, the eDiscovery Institute published a paper (by Roitblat, Kershaw and Oot) in the &lt;i&gt;Journal of the American Society for Information Science and Technology&lt;/i&gt; describing two eDiscovery service providers and the accuracy of their predictive coding tools.  These tools, obviously, were in existence prior to Recommind’s filing.  Other service providers have been providing similar services even before that. So Recommind can hardly be considered to have invented predictive coding, yet none of this prior art was actually included in Recommind’s patent application.&lt;br /&gt;&lt;br /&gt;One pending patent, however, was included in their application as evidence of prior art in this field.  
This pending application, assigned to Bank of America, is called “Predictive Coding of Documents in an Electronic Discovery System.”  Therefore, the patent application itself recognizes that Recommind cannot be the inventor of predictive coding.&lt;br /&gt;&lt;br /&gt;They also included in their list of prior art an article attributed to Thorsten (Thorsten is actually the author's first name; his last name is Joachims) describing a statistical machine-learning technique called SVM (Support Vector Machines), which is used in their claimed invention.  This same paper also describes relevance feedback.&lt;br /&gt;&lt;br /&gt;Given all of this prior art, it is very clear that Recommind is in no position to claim either to have invented or to “own” predictive coding.  Rather, their patent covers a very specific, very narrow approach to predictive coding involving the use of two very specific statistical procedures and relevance feedback.  I will leave it to attorneys to determine whether even this circumscribed application constitutes a valid patent.&lt;br /&gt;&lt;br /&gt;Still, even if their patent is valid, it leaves plenty of room for other approaches to predictive coding, including the approach used by &lt;a href="http://www.orcatec.com"&gt;OrcaTec&lt;/a&gt;.  Nothing in the Recommind patent would preclude OrcaTec or any other service provider from offering predictive coding services in eDiscovery or any other area.  OrcaTec does not use the statistical procedures described in their patent.  We believe that &lt;a href="http://orcatec.com/index.php/products"&gt;OrcaCategorize&lt;/a&gt; is an easier-to-use and more effective product, which can help attorneys achieve cost savings significantly beyond those claimed by Recommind.&lt;br /&gt;&lt;br /&gt;Grandiose claims like those in the Recommind press release indicate either a profound lack of understanding of just what is covered by the patent or a deliberate attempt to obfuscate the issues in the industry.  
Attorneys involved in eDiscovery look to their service providers to provide open, honest and effective processes.  They are not well served by unnecessary hyperbole.&lt;br /&gt;&lt;br /&gt;Bob Tenant has an excellent &lt;a href="http://www.recommind.com/blogs/20110616/of_predictive_coding_and_patents"&gt;blog &lt;/a&gt;on this topic that I think resolves the issue.  I urge you to read it.  &lt;br /&gt;---&lt;br /&gt;About the author:&lt;br /&gt;&lt;br /&gt;Herbert L. Roitblat, Ph.D. is the CTO and Chief Scientist of OrcaTec.  He holds a number of patents in eDiscovery technology and other areas.  He is widely considered to be an expert in eDiscovery methodology and technology.  He is a member of the Sedona Working Group on Electronic Document Retention and Production, on the Advisory Board of the Georgetown Legal Center Advanced eDiscovery Institute, a member of the 2011 program committee for the Georgetown Legal Center Advanced eDiscovery Institute, and the chair of the Electronic Discovery Institute. He is a member of the Board of Governors of the Organization of Legal Professionals.  
He is a frequent speaker on eDiscovery, particularly concerning search, categorization, predictive coding, and quality assurance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-6012732722161858584?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/6012732722161858584/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2011/06/competitors-press-release-about.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/6012732722161858584'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/6012732722161858584'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2011/06/competitors-press-release-about.html' title='Competitor’s press release about predictive coding patent stretches the truth'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-6797145123624381306</id><published>2011-04-06T16:08:00.000-07:00</published><updated>2011-04-06T16:08:57.134-07:00</updated><title type='text'>Is predictive coding defensible?</title><content type='html'>Predictive coding  is a family of evidence-based document categorization technologies that are used to put documents or ESI into matter-relevant categories, such as responsive/nonresponsive or privileged/nonprivileged. 
The evidence is quite clear from a number of studies, including &lt;a href="http://trec-legal.umiacs.umd.edu/"&gt;TREC&lt;/a&gt; and the &lt;a href="http://www.ediscoveryinstitute.org"&gt;eDiscovery Institute&lt;/a&gt;, that predictive coding is at least as accurate as having teams of reviewers read every document.  Despite these studies, there is still some skepticism about using predictive coding in eDiscovery.  I would like to address some of the issues that these skeptics have raised.  The two biggest concerns are whether predictive coding is reasonable and whether it is defensible.  I cannot pretend to offer a legal analysis of reasonableness and defensibility, but I do have some background information and opinions that may be useful to make the legal arguments.&lt;br /&gt;&lt;br /&gt;Some of the resistance to using predictive coding is the fear that the opposing party will object to the methods used to find responsive (for example) documents.  By my reading, the courts do not usually support objections based on the supposition that something might have been missed.  The opposing party has to respond with some particularity (e.g., &lt;a href="http://ralphlosey.files.wordpress.com/2009/09/fordmotor-v-edgewood.doc"&gt;Ford v Edgewood Properties&lt;/a&gt;).  In &lt;a href="http://scholar.google.com/scholar_case?case=18142239012809139297"&gt;In re Exxon&lt;/a&gt;, the trial court refused to compel a party to present a deponent to testify as to the efforts taken to search for documents.  &lt;a href="http://www.iediscovery.com/files/Multiven.pdf"&gt;Multiven, Inc. v. Cisco Systems&lt;/a&gt; might also be relevant in that the court ordered the party to use a vendor with better technology, rather than having someone read each document.  In a recent &lt;a href="http://www.iediscovery.com/files/Datel%20Holdings.pdf"&gt;decision &lt;/a&gt;in Datel v. 
Microsoft, Judge Laporte quoted the Federal Rules of Evidence: “Further, ‘[d]epending on the circumstances, a party that uses advanced analytical software applications and linguistic tools in screening for privilege and work product may be found to have taken “reasonable steps” to prevent inadvertent disclosure.’ Fed. R. Evid. 502 advisory committee’s note.” &lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.orcatec.com"&gt;OrcaTec&lt;/a&gt; system, at least, draws repeated random samples from the total population of documents so that we can always extrapolate from the latest sample to the population as a whole.  Because the sample is random, it is representative of the whole collection, so the effectiveness of the system on the sample is a statistical prediction of how well the tool would do with the whole collection.  You can also sample among the documents that have been and have not been selected for production (as Judge Grimm recommends in Creative Pipe).&lt;br /&gt;&lt;br /&gt;In general, predictive coding systems work by finding documents that are similar to those which have already been recognized as responsive (or nonresponsive) by some person of authority.  The machines do not make up the decisions; they merely implement them.  If there is a black-box argument that applies to predictive coding, it applies even more to hiring teams of human readers.  No one knows what is going on in their heads, their judgments change over time, they lose attention after half an hour, etc.&lt;br /&gt;&lt;br /&gt;If you cannot tell the difference between documents picked by human reviewers and documents picked by machines, then why should you pay for the difference?&lt;br /&gt;&lt;br /&gt;Predictive coding costs a few cents per document.  Human review typically costs a dollar or more per document.  Proportionality and FRCP Rule 1 would suggest, I think, that a cheaper alternative with demonstrable quality should be preferred by the courts.  
The burden should be on the opposing party to prove that something was wrong and that engaging a more expensive process of lower quality (people) is somehow justified.&lt;br /&gt;&lt;br /&gt;The workflow I suggest is to start with some exploratory data analysis, then set up the predictive coding.  In the OrcaTec system, the system selects documents to learn from.  Some other systems use slightly different methods.  Once the predictive coding is done, you can take the documents that the computer recognizes as responsive and review those.  These are the result of your first-pass review.  They will be a small subset of the original collection, the size of which depends on the richness of the collection.  I would not recommend producing any documents that have not been reviewed by trusted people. But now, you're not wasting time reading documents that will never get produced.&lt;br /&gt;&lt;br /&gt;Predictive coding is great for first-pass review, but it can also be used as a quality check on human review.  Even if you do not want to rely on predictive coding to do your first-pass review, you can exploit it to check up on your reviewers after they are done.&lt;br /&gt;&lt;br /&gt;Some attorneys worry about a potential Daubert challenge to predictive coding.  I’m not convinced that Daubert is even pertinent to the discussion, but predictive coding would easily stand up to such a challenge.  Predictive coding is mathematically and statistically based.  It is mainstream science that has been in existence in one form or another since the &lt;a href="http://en.wikipedia.org/wiki/Thomas_Bayes"&gt;18th Century&lt;/a&gt;.  There is no voodoo magic in predictive coding, only mathematics.  I think that the facts supporting its accuracy are certainly substantial (peer reviewed, mainstream science, etc.) and the systems are transparent enough that there should be no (rational) argument about the facts. 
Many attorneys happily use keyword searching, which has long been known to be rather ineffective.  There has never been a Daubert challenge to using keywords to identify responsive documents.  Seldom has there been any measurement done to justify the use of keyword searches as a reasonable way to limit the scope of documents that must be reviewed.  But if the weak method of keyword searching is acceptable, then a more sophisticated and powerful process should also be acceptable.&lt;br /&gt;&lt;br /&gt;Another concern that I hear raised frequently is that lawyers would have a hard time explaining predictive coding, if challenged.  I don’t think that the ideas behind predictive coding are very difficult to explain.  Predictive coding works by identifying documents that are similar to those that an authoritative person has identified as responsive (or as a member of another category).  Systems differ somewhat in how they compute that similarity, but they share the idea that the content of the document, the words and phrases in it, are the evidence to use when measuring this similarity.&lt;br /&gt;&lt;br /&gt;A document (or more generally, ESI, electronically stored information) consists of words, phrases, sentences, and paragraphs.  These textual units constitute the evidence that a document presents.  For simplicity, I will focus on words as the unit, but the same ideas apply to using other text units.  For further simplification, we will consider only two categories, responsive and nonresponsive.  Again, the same rules apply if we want to include other categories, or if we want to distinguish privileged from nonprivileged, etc.&lt;br /&gt;&lt;br /&gt;Based on the words in a document, then, the task is to decide whether this document is more likely to be responsive or more likely to be nonresponsive.  This is the task that a reviewer performs when reading a document and it is the task that any predictive coding system performs as well.  
Both categorize the document as indicated by its content.&lt;br /&gt;&lt;br /&gt;When we rely exclusively on human reviewers, we have no transparency into how they make their decisions.  We can provide them with instructions that are as detailed as we may like, but we do not have direct access to their thought processes.  We may ask them, after the fact, why they decided that a particular document was responsive, but their answer is almost always a reconstruction of what they “must have thought” rather than a true explanation.  Keyword searching, on the other hand, is very transparent, because we can point to the presence of specific keywords, but the presence of a specific keyword does not necessarily mean that a document is automatically responsive.  One recent matter I worked on used keyword and custodian culling, and still only 6% of the selected documents ended up being tagged responsive.&lt;br /&gt;&lt;br /&gt;It seems to me that saying, “these documents were chosen as responsive because they resembled documents that were judged by our expert to be responsive,” is a pretty straightforward explanation of predictive coding, whatever technology is used to perform it.  Couple that with explicit measurement of the system’s performance, and I think that you have a good case for a transparent process.  
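&lt;br /&gt;&lt;br /&gt;To make the resemblance idea concrete, here is a minimal sketch in Python.  It is only an illustration under stated assumptions, not the method used by OrcaTec or by any particular product: it assumes that the evidence is a simple count of the words in each document, that similarity is measured by the cosine between word counts, and that a new document takes the tag of the most similar expert-judged example.  The function names and the four example documents are hypothetical.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Turn a document into a bag of lowercase word counts (the evidence)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine between two word-count vectors; 0.0 if either one is empty."""
    shared = set(a).intersection(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(document, labeled_examples):
    """Return the tag of the most similar expert-judged example, with its score."""
    counts = word_counts(document)
    scored = [(cosine_similarity(counts, word_counts(text)), tag)
              for text, tag in labeled_examples]
    score, tag = max(scored)
    return tag, score


# Hypothetical documents already tagged by the authoritative reviewer.
EXAMPLES = [
    ("pricing agreement with the distributor for the northern region", "responsive"),
    ("revised distributor pricing terms attached for signature", "responsive"),
    ("lunch order for the team meeting on friday", "nonresponsive"),
    ("reminder that the parking garage is closed this weekend", "nonresponsive"),
]

label, score = predict(
    "please review the attached distributor pricing agreement", EXAMPLES)
print(label, round(score, 2))
```

In this toy example the new document is tagged responsive because it shares words such as "distributor," "pricing," and "agreement" with the first expert-judged example, and the explanation for the decision is simply that example and its similarity score.  Real systems use far richer evidence and more robust statistics, but the explanation takes the same form.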
&lt;br /&gt;&lt;br /&gt;Predictive coding would appear to be an efficient, effective, and defensible process for winnowing down your document collection for first-pass review, and beyond.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-6797145123624381306?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/6797145123624381306/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2011/04/is-predictive-coding-defensible.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/6797145123624381306'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/6797145123624381306'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2011/04/is-predictive-coding-defensible.html' title='Is predictive coding defensible?'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-2287750955760405844</id><published>2011-04-04T14:06:00.000-07:00</published><updated>2011-04-04T14:06:44.854-07:00</updated><title type='text'>Everything new is old again</title><content type='html'>To be truthful, I have been quite surprised at all of the attention that predictive coding has been receiving lately, from the usual legal blogs to the &lt;a href="http://www.nytimes.com/2011/03/05/science/05legal.html"&gt;New York 
Times&lt;/a&gt; to &lt;a href="http://blogs.forbes.com/benkerschberg/2011/03/23/e-discovery-and-the-rise-of-predictive-coding/"&gt;Forbes &lt;/a&gt;Magazine.  It’s not a particularly new technology.  It’s actually been around since 1763, when &lt;a href="http://en.wikipedia.org/wiki/Thomas_Bayes"&gt;Thomas Bayes&lt;/a&gt;’s famous theorem was first published.  It’s been used in document decision making since about 1961.  But, when I tried to convince people that it was a useful tool in 2002 and 2003, my arguments fell on deaf ears.  Lawyers just were not interested.  It never went anywhere.  Times have certainly changed.&lt;br /&gt;&lt;br /&gt;Concept search took about six years to get into the legal mainstream.  Predictive coding, by whatever name, seems to have taken about 18 months.  I’m told by some of my lawyer colleagues that it has now become a necessary part of many statements of work.   &lt;br /&gt;&lt;br /&gt;The first paper that I know of concerning what we would today call predictive coding is by &lt;a href="http://sci2s.ugr.es/keel/pdf/algorithm/articulo/Maron1961.pdf"&gt;Maron&lt;/a&gt;, published in 1961.  He called it “automatic indexing.”  “The term ‘automatic indexing’ denotes the problem of deciding in a mechanical way to which category (subject or field of knowledge) a given document belongs. It concerns the problem of deciding automatically what a given document is ‘about’.”  He recognized that “To correctly classify an object or event is a mark of intelligence,” and found that, even in 1961, computers could be quite accurate at assigning documents to appropriate categories.&lt;br /&gt;&lt;br /&gt;Predictive coding is a family of evidence-based document categorization technologies. The evidence is the word or words in the documents.  Predictive coding systems do not replace human judgment, but rather augment it.  
They do not decide autonomously what is relevant, but take judgments about responsiveness from a relatively small set of documents (or other sources) and extend these judgments to additional documents.  In the &lt;a href="http://www.orcatec.com"&gt;OrcaTec &lt;/a&gt;system, a user trains the system by reviewing a sample of documents.  The computer watches the documents as they are reviewed and the decisions assigned by the reviewer.  At the same time, as the computer gains some experience, it predicts the appropriate tag to be applied to the document, making the reviewer more consistent while making the computer more closely approach the decision rules used by the reviewer. &lt;br /&gt;&lt;br /&gt;Although there are a number of computational algorithms that are used to compute these evidence-based categorizers, deep inside they all address the problem of finding the highest probability category or categories for a document, given its content.&lt;br /&gt;&lt;br /&gt;The interest in predictive coding stems, I think, from two factors.  First, the volume of documents that must be considered continues to grow exponentially, but the resources to deal with them do not.  The cost of review in eDiscovery frequently overwhelms the amount at risk.  There is widespread recognition that something has to be done.  The second is the emergence of studies and investigations that examine both the quality of human review and the comparative ability of computational systems.  For a number of years, TREC, the Text Retrieval Conference, has been investigating information retrieval techniques.  For the last several years, they have applied the same methodology to document categorization in eDiscovery.  The Electronic Discovery Institute conducted a similar study (I was the lead author).  The evidence available from these studies is consistent in showing that people are only moderately accurate in categorizing documents and that computers can be at least as accurate, often more accurate.  
The general idea is that if computer systems can at least replicate the quality of human review at a much lower cost, then it would be reasonable to use these systems to do first-pass review.&lt;br /&gt;&lt;br /&gt;In my next post, I will discuss the skepticism that some lawyers have expressed and offer some suggestions for resolving that skepticism.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-2287750955760405844?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/2287750955760405844/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2011/04/everything-new-is-old-again.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/2287750955760405844'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/2287750955760405844'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2011/04/everything-new-is-old-again.html' title='Everything new is old again'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-8380815206041618821</id><published>2011-02-25T07:51:00.000-08:00</published><updated>2011-02-25T07:51:43.393-08:00</updated><title type='text'>Opening the Black Box of Predictive Coding</title><content type='html'>Yesterday, we did a very interesting &lt;a 
href="http://www.esibytes.com/?p=1649"&gt;podcast&lt;/a&gt; on &lt;a href="http://www.esibytes.com/"&gt;ESIBytes&lt;/a&gt; with &lt;a href="http://www.jurinnov.com/experts_schieneman.asp"&gt;Karl Schieneman&lt;/a&gt;, &lt;a href="http://www.cs.cmu.edu/~jgc/"&gt;Jaime Carbonell&lt;/a&gt;, and &lt;a href="http://www.cs.cmu.edu/~vasco/"&gt;Vasco Pedro&lt;/a&gt; on predictive coding.  Jaime and Vasco are well known in the machine learning space, but have not been active in eDiscovery, so it was really interesting to get their perspective on how these advanced technologies could be effective in eDiscovery.&lt;br /&gt;&lt;br /&gt;The first point that I would like to emphasize from that conversation is the consensus among the four of us that machine learning tools are not magic bullets; they must involve humans.  &lt;br /&gt;&lt;br /&gt;We talked about two ways of using machine learning to organize documents--clustering and categorization.  In eDiscovery, categorization is often called predictive coding.  Clustering groups together documents that are similar to one another.  The computer derives the groupings to be used.  Categorization, on the other hand, starts with categories that are specified by people and puts documents into the category that provides the best match. &lt;br /&gt;&lt;br /&gt;When using clustering, people have to decide which of the groups of documents are important after the computer organizes them.  When using categorization, people have to design the categories before the computer organizes them. In neither case do we rely on the computer to make the legal judgments about what is important to the matter.  That is a decision that is best made by someone with real-world experience and legal judgment.&lt;br /&gt;&lt;br /&gt;Computers can take the tedium out of implementing human legal judgments, but so far, they are not in a position to make the judgments themselves.  
These systems do not take the lawyer out of the equation; they simply provide support to reduce the level of effort required to implement the attorney's judgment on ever-growing collections of documents.&lt;br /&gt;&lt;br /&gt;That brings me to the second point I want to discuss--machine learning algorithms and their ability to handle large data sets.  Jaime Carbonell remarked that most of the machine learning algorithms he was familiar with handled gigabyte-sized problems well, but did not do well on terabyte-sized problems.  In general that's true, but he did mention that search algorithms were a major exception.  Google and others have shown ready capabilities of searching the World-Wide Web and its billions of documents.&lt;br /&gt;&lt;br /&gt;Not all machine learning algorithms are subject to the kinds of scaling constraints that Jaime mentioned.  For example, if it were possible to transform clustering and categorization into search problems, then we would expect that these algorithms would also scale to Web-sized problems.  That, in fact, is what we at OrcaTec have done.  We have recast the traditional clustering and categorization algorithms into search-like algorithms that scale directly to very large collections in reasonable amounts of time.  As a result, our software can efficiently and effectively cluster and categorize the documents in even very large collections.&lt;br /&gt;&lt;br /&gt;Finally, that brings me to my third point from our conversation.  There was widespread agreement that assessment of the effectiveness of our processes is an essential component.  Whether you know or even care about the content of the black box or the head of the temp attorney hired to do the review, we all agreed that measurement was an essential part of the process.  You cannot know that you've done a reasonable job unless you can show that you measured the job that you did.  
You cannot improve the quality of your process unless you measure that quality.&lt;br /&gt;&lt;br /&gt;When we have measured human review performance, it was not as consistent as one might have imagined.  When compared with machine learning, the machines come out at least as accurate as people do, and often more accurate.  From the things that I see and hear, measurement of eDiscovery processes is rapidly becoming the norm, and, in my opinion, should be.  Appropriate quality assessments are neither expensive nor time consuming.  They can, however, be invaluable in demonstrating that your review involved a reasonable process at a reasonable level of effectiveness.  &lt;a href="http://www.thesedonaconference.org/"&gt;The Sedona Conference&lt;/a&gt; has an excellent paper on achieving and measuring &lt;a href="http://www.thesedonaconference.org/dltForm?did=Achieving_Quality.pdf"&gt;quality in eDiscovery&lt;/a&gt;.  I highly recommend reading it.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-8380815206041618821?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/8380815206041618821/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2011/02/opening-black-box-of-predictive-coding.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8380815206041618821'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8380815206041618821'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2011/02/opening-black-box-of-predictive-coding.html' title='Opening the Black Box of Predictive Coding'/><author><name>Herbert L Roitblat, 
Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-8516523706368365829</id><published>2010-09-28T06:37:00.000-07:00</published><updated>2010-09-28T06:37:17.391-07:00</updated><title type='text'>OrcaTec Releases the OrcaTec Text Analytics Platform</title><content type='html'>&lt;i&gt;SDR and OrcaTec Join Forces to release the leading text analytic tools for Governance, Risk-Management, and Compliance (GRC) and eDiscovery&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We at OrcaTec announce our first new products under our combined new leadership. The OrcaTec Text Analytics platform delivers unprecedented levels of accuracy and efficiency, enabling users to retrieve, filter, analyze, visualize and act on unstructured information for 90% less than conventional technology.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The biggest expense in responding to government inquiries, investigations, and eDiscovery is the cost of document review. That cost is an increasingly heavy burden on companies and most of it is wasted on considering irrelevant or nonresponsive documents. We now have the tools that can reduce that waste and inherently reduce the burden.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In January of this year, Anne Kershaw, Patrick Oot, and I published a paper in the &lt;i&gt;Journal of the American Society for Information Science and Technology&lt;/i&gt;, showing that two computer assisted systems, neither of which was related to OrcaTec or to the Electronic Discovery Institute, could classify documents into responsive and nonresponsive categories at least as well as new human reviewers.  
This paper also suggested some of the ways that the quality of categorization could be measured, whether it was done by attorneys or by machines.  After seeing the results of that study and their acceptance, I decided that we could create a categorizer that might do even better than the two systems that we tested.  Part of what we are announcing today is the release of that new software. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;By watching how an expert categorizes a sample of documents, the system can induce the implicit rules that the expert is using and replicate them.  The computer does not make up the decision rules; it merely implements the rules that the expert used.  Consistent with the results from the &lt;i&gt;JASIST&lt;/i&gt; paper, we have found that the level of accuracy or agreement between the computer and the expert is as high as or higher than one would see if another person were doing the review.  &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;By whatever means a document is categorized, there are two kinds of errors.  A nonresponsive document could be falsely categorized as responsive and a responsive document could be falsely categorized as nonresponsive.  Making the correct decision depends on a multistep process.  Correct results depend on each of these steps being executed flawlessly (or on simply a lucky guess).&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The document has to be considered.  If it is never seen by the decision maker, machine or person, then it cannot be judged.&amp;nbsp;&lt;/li&gt;&lt;li&gt;The document has to contain information that would allow it to be identified as responsive or nonresponsive.  For example, an email that said “sounds good,” all by itself would be unlikely to be identified as responsive absent other information.&amp;nbsp;&lt;/li&gt;&lt;li&gt;The categorizer (machine or human) would have to notice that the distinguishing information was present.  
People’s attention wanders, they get tired.&amp;nbsp;&lt;/li&gt;&lt;li&gt; The criteria used to make the decision have to be valid and reliable.  Validity means roughly that they are using the right rules and reliability means that they are using them consistently.&amp;nbsp;&lt;/li&gt;&lt;li&gt;They have to make the correct response.  For example, they have to push the responsive button when they intend to hit the responsive button.&lt;/li&gt;&lt;/ol&gt;&lt;br /&gt;Humans are more expert, still, at deciding on what the criteria are that distinguish responsive from nonresponsive documents.  It is difficult to send a computer to law school, but computers are better at practically everything else on this list.  Instead of looking at a hundred documents an hour, a computer can look at thousands.  The computer does not take breaks or vacations.  It does not think about what it is going to have for lunch.  It does not shift its criteria.  So, if we could just tell the computer how to distinguish between responsive and nonresponsive documents, by being more consistent, it would do a better job implementing these rules.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;There are different ways of providing the rules to the computer.  Some companies use teams of linguists and lawyers to formulate the rules.  We use a set of sample judgments.  Responsiveness is like pornography.  It is difficult to formulate explicit rules about what makes a document responsive, but we know it when we see it.  
In our system, there is no need to formulate these explicit rules; rather, the computer watches how the documents are classified during review and comes to mimic those decisions.&amp;nbsp;The result of this categorization is a set of documents that are very likely to be responsive.&amp;nbsp; The documents that the computer identifies as responsive can be reviewed further while the documents identified as nonresponsive can be sampled for review.&amp;nbsp; Rather than taking weeks or months to wade through the documents more or less at random, the legal team can concentrate its efforts on those documents that are likely to be responsive at the beginning of the process.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Couple these processes with a detailed methodology of sampling and quality assurance and you have a result that is reliable, transparent, verifiable, and defensible.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;The OrcaTec Text Analytics Platform is available as a hosted SaaS application, or on-site as an appliance. On-site, it can be managed directly by the customer or by OrcaTec personnel. &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;About OrcaTec: The 2010 merger of OrcaTec with the software engineering company Strategic Data Retention combined the talents of a long time provider of SaaS eMail archiving and high speed discovery services with the OrcaTec analytics technology.  
Come visit us at &lt;a href="http://www.orcatec.com/"&gt;www.orcatec.com&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-8516523706368365829?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/8516523706368365829/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2010/09/orcatec-releases-orcatec-text-analytics.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8516523706368365829'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8516523706368365829'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2010/09/orcatec-releases-orcatec-text-analytics.html' title='OrcaTec Releases the OrcaTec Text Analytics Platform'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-8223200998249150120</id><published>2010-01-11T10:18:00.000-08:00</published><updated>2010-01-11T10:18:55.405-08:00</updated><title type='text'>Computer Assisted Document Categorization in eDiscovery</title><content type='html'>The January issue of the &lt;i&gt;Journal of the American Society for Information Science and Technology&lt;/i&gt;, 61(1):1–11, 2010, has an article by Roitblat, Kershaw, and Oot describing a study that compared computer classification of eDiscovery documents with manual 
review.&amp;nbsp; It found that computer classification was at least as consistent as human review was at distinguishing responsive from nonresponsive documents.&amp;nbsp; If having attorneys review documents is a reasonable approach to identifying responsive documents, then any system that does as well as human review should also be considered a reasonable approach.&lt;br /&gt;&lt;br /&gt;The study compared an original categorization, done by contract attorneys in response to a Department of Justice Second Request with one done by two new human teams and two computer systems. The two re-review teams were employees of a service provider specializing in conducting legal reviews of this sort.&amp;nbsp; Each team consisted of 5 reviewers who were experienced in the subject matter of this collection.&amp;nbsp; The two teams independently reviewed a random sample of 5,000 documents.&amp;nbsp; The two computer systems were provided by experienced eDiscovery service providers, one in California, and one in Texas.&amp;nbsp; The authors of the study had no financial relationship with either service provider or with the company providing the re-review.&amp;nbsp; The companies donated their time and facilities to the study.&lt;br /&gt;&lt;br /&gt;The documents used in the study were collected in response to a "Second Request" concerning Verizon's acquisition of MCI.&amp;nbsp; The documents were collected from 83 employees in 10 US states.&amp;nbsp; Together they consisted of 1.3 terabytes of electronic files in the form of 2,319,346 documents.&amp;nbsp; The collection consisted of about 1.5 million email messages, 300,000 loose files, and 600,000 scanned documents.&amp;nbsp; After eliminating duplicates, 1,600,047 items were submitted for review.&amp;nbsp; The attorneys spent about four months, seven days a week, and 16 hours per day on the review at a total cost of $13,598,872.61 or about $8.50 per document.&amp;nbsp; After review, a total of 176,440 items were produced to 
the Justice Department.&lt;br /&gt;&lt;br /&gt;Accuracy was measured as agreement with the decisions made by the original review team.&amp;nbsp; The level of agreement between the two human review teams was also measured.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;The two re-review teams identified a greater proportion of the documents as responsive than did the original review.&amp;nbsp; Overall, their decisions agreed with the original review on 75.6% and 72.0% of the documents.&amp;nbsp; The two teams agreed with one another on about 70% of the documents.&lt;br /&gt;&lt;br /&gt;About half of the documents that were identified as responsive by the original review were identified as responsive by either of the re-review teams.&amp;nbsp; Conversely, about a quarter of the documents identified as nonresponsive by the original review were identified as responsive by the new teams.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;Although the original review and the re-reviews were conducted by comparable people with comparable skills, their level of agreement was only moderate.&amp;nbsp; We do not know whether this was due to variability in the original review, or was due to some other factor, but these results are comparable to those seen in other situations where people make independent judgments about the categorization of documents (for example, in the TREC studies).&amp;nbsp; A senior attorney reclassified the documents on which the two teams disagreed.&amp;nbsp; After this reclassification, the level of agreement between this adjudicated set and the original review rose to 80%.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;The two computer systems identified fewer documents as responsive than did the human review teams, but still a bit more than were identified by the original review.&amp;nbsp; One system agreed with the original classification on 83.2% of the documents and the other on 83.6%.&amp;nbsp; As with the human review teams, about half of the documents identified as responsive by the 
original review were similarly classified by the computer systems.&lt;br /&gt;&lt;br /&gt;As legal professionals search for ways to reduce the costs of eDiscovery, this study suggests that it may be reasonable to employ computer-based categorization.&amp;nbsp; The two computer systems agreed with the original review at least as often as a human team did.&amp;nbsp; &lt;br /&gt;&lt;br /&gt;The computer systems did not create their decisions out of thin air.&amp;nbsp; One of the systems based its classifications in part on the adjudicated results of the two review teams and the senior attorney.&amp;nbsp; The other system based its process on an analysis of the Justice Department Request, the training documents given to the reviewers (both the original review and the two review teams), and on a proprietary ontology.&amp;nbsp; These two systems, in other words, implemented a set of human judgments.&amp;nbsp; These systems succeed to the extent that they can capture and reliably implement these judgments.&amp;nbsp; The computers and their software do not get tired, cannot be distracted, and are able to work 24 hours a day.&amp;nbsp; These results imply that using a computer-based classification system is a viable way to produce reasonable eDiscovery document categorization.&lt;br /&gt;&lt;br /&gt;Please contact me (&lt;a href="mailto:herb@ediscoveryinstitute.org"&gt;herb@ediscoveryinstitute.org&lt;/a&gt;) or (&lt;a href="mailto:herb@orcatec.com"&gt;herb@orcatec.com&lt;/a&gt;) if you would like a copy of the full paper.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-8223200998249150120?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/8223200998249150120/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://orcatec.blogspot.com/2010/01/computer-assisted-document.html#comment-form' title='39 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8223200998249150120'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8223200998249150120'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2010/01/computer-assisted-document.html' title='Computer Assisted Document Categorization in eDiscovery'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>39</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-6861238136064136291</id><published>2009-12-18T13:43:00.000-08:00</published><updated>2009-12-18T14:00:51.657-08:00</updated><title type='text'>Twelve Step Guide to Botching eDiscovery</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_B_ot9Nd0mAw/Syv7E5cIQfI/AAAAAAAAABE/9QpymXkZAgs/s1600-h/Smoke.jpg"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer; width: 144px; height: 200px;" src="http://1.bp.blogspot.com/_B_ot9Nd0mAw/Syv7E5cIQfI/AAAAAAAAABE/9QpymXkZAgs/s200/Smoke.jpg" alt="" id="BLOGGER_PHOTO_ID_5416699038400004594" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-family:verdana;" &gt;&lt;/span&gt;My brother, who is also in the computer business, likes to say that he is interested in artificial stupidity, because there is generally one way to get things right, but real creativity in getting things wrong.  That certainly seems to be the case in eDiscovery. 
There are lots of ways that you can botch your case.  I have compiled a list for those of you who are too lazy to think of your own ways to screw it up.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;1.    Request that your opponent produce documents in paper form, because they'll be easier to read that way.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;2.    Redact PDFs by drawing black rectangles over the privileged text.&lt;br /&gt;&lt;br /&gt;3.    Email privileged documents to the other side after failing to detect that your email program's auto-fill had added the wrong address.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;4.    Leave signs that you used programs like Evidence Eliminator on your laptop after you have been notified of a pending suit.&lt;br /&gt;&lt;br /&gt;5.    Produce fabricated e-mails.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;6.    Don't bother with the requirement to meet and confer and then complain about not getting the ESI you want in the form you want.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;7.    Claim that you don't have documents and then magically find them when threatened with sanctions.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;8.    Delay discovery, ignore court discovery deadlines, tell employees to ignore court orders, and generally disregard warnings from the judge.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;9.    Replace employees' hard drives days before they are scheduled for forensic imaging or simply abandon them when you close your business.  You might also do a cursory search of your former employees' workstations, but don't bother with the server.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;10.    Continue a data destruction process for months after recognizing that the emails covered by a preservation order were being lost and even after the court reminded you to preserve them.&lt;br /&gt;&lt;br /&gt;11.    
Don't bother to search your witnesses' computers or other relevant custodians' for relevant evidence, and don't use information provided by the other side to identify search terms.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Finally, the capper.  This one is almost guaranteed to work.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;12.    Lie to the court about what you are doing and what you have done during eDiscovery and get found out.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-6861238136064136291?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/6861238136064136291/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2009/12/twelve-step-guide-to-botching.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/6861238136064136291'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/6861238136064136291'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2009/12/twelve-step-guide-to-botching.html' title='Twelve Step Guide to Botching eDiscovery'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_B_ot9Nd0mAw/Syv7E5cIQfI/AAAAAAAAABE/9QpymXkZAgs/s72-c/Smoke.jpg' height='72' 
width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-4780842346248008125</id><published>2009-04-13T11:10:00.000-07:00</published><updated>2009-04-13T11:48:13.078-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='information retrieval'/><category scheme='http://www.blogger.com/atom/ns#' term='electronic discovery'/><category scheme='http://www.blogger.com/atom/ns#' term='TREC'/><category scheme='http://www.blogger.com/atom/ns#' term='TREC legal track'/><title type='text'>TREC Legal Track 2008</title><content type='html'>I have been going through some of the results from the &lt;a href="http://trec.nist.gov/pubs/trec17/papers/LEGAL.OVERVIEW08.pdf"&gt;TREC legal track&lt;/a&gt; for 2008.  It is a veritable goldmine of interesting information.&lt;br /&gt;&lt;br /&gt;TREC (Text Retrieval Conference) is a major annual research effort sponsored by the National Institute of Standards and Technology and other government agencies. The current report covers the 17th annual conference, and the third version of the legal track.  The 2008 legal track was coordinated by a group including Doug Oard, Bruce Hedin, Stephen Tomlinson, and Jason Baron.  The goal of the legal track is "evaluation of search technology for discovery of electronically stored information in litigation and regulatory settings."&lt;br /&gt;&lt;br /&gt;There were three kinds of task evaluated last year:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Ad hoc retrieval involving automated search, where each team retrieved documents using its own search technology.&lt;/li&gt;&lt;li&gt;Relevance feedback, where each team retrieved some documents, got feedback after this pass and then modified their searches to take advantage of this feedback.&lt;/li&gt;&lt;li&gt;Interactive, where each team was allowed to interact with a topic authority and revise their queries based on this feedback.  
Each team was allowed ten hours of access to the authority.  In addition, they were allowed to appeal reviewer decisions that the team thought were inconsistent with the instructions from the topic authority.&lt;/li&gt;&lt;/ul&gt;Each team was free to use whatever technology they chose.&lt;br /&gt;&lt;br /&gt;Two other kinds of searches were also employed.  One was a Boolean search negotiated between a "plaintiff" and a "defendant." The second was a search that retrieved all of the documents in the collection.&lt;br /&gt;&lt;br /&gt;A sample of the documents that were retrieved was then judged by volunteer assessors (reviewers) to determine whether these documents were responsive to the topic. Finally, a random set of documents that were not retrieved by any of the technologies was sampled and assessed in an attempt to find out whether some responsive documents might have been missed by all of the teams.&lt;br /&gt;&lt;br /&gt;Some of the more interesting findings in this report concern the levels of agreement seen between assessors.  Some of the same topics were used in previous years of the TREC legal track, so it is possible to compare the judgments made during the current year with those made in previous years.  For example, the level of agreement between assessors in the 2008 project and those from 2006 and 2007 was reported.  Ten documents from each of the repeated topics that were previously judged to be relevant and ten that were previously judged to be non-relevant were assessed by 2008 reviewers.  It turns out that "just 58% of previously judged relevant documents were judged relevant again this year." Conversely, "18% of previously judged non-relevant documents were judged relevant this year."  Overall, the 2008 assessors agreed with the previous assessors 71.3% of the time.  Unfortunately, this is a fairly small sample, but it is consistent with other studies of inter-reviewer agreement.  
In 2006, the TREC coordinators gave a sample of 25 relevant and 25 nonrelevant documents from each topic to a second assessor and measured the &lt;a href="http://cio.nist.gov/esd/emaildir/lists/ireval/msg00012.html"&gt;agreement &lt;/a&gt;between these two.  Here they found about 76% agreement.  Other studies outside of TREC have found similar levels of (dis)agreement.&lt;br /&gt;&lt;br /&gt;The interactive task also allowed the teams to appeal reviewer decisions, if they thought that the reviewers had made a mistake.  Of the 13,339 documents that were assessed for the interactive task, 966 were appealed to the topic authority.  This authority played the role, for example, of the senior litigator on the case, with the ultimate authority to overturn the decisions of the volunteer assessors.  In about 80% of these appeals, the topic authority agreed with the appeal and recategorized the document.  In one case (Topic 103), the appeal allowed the team that already had the highest recall rate (the percentage of relevant documents that were retrieved) to improve its performance by 47%.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;&lt;span style="font-weight: bold;"&gt;How do we interpret these findings?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;These levels of (dis)agreement do not appear to be wildly different from those found in other studies.  Inter-assessor consistency presents challenges to any study of information retrieval effectiveness.  TREC studies have found repeatedly that this inconsistency does not affect the relative ranking of different approaches, but it could affect how we interpret the absolute levels of performance.  
TREC may substantially under-estimate how well an application could do in a real-world setting, such as in discovery of electronically stored information, with consistent measurement.&lt;br /&gt;&lt;br /&gt;Like most studies of information retrieval, the TREC legal track takes assessor judgments to be the standard against which to judge the performance of various systems and approaches.  The legal track used tens of assessors, primarily second- and third-year law students.  With the volume of documents involved in the TREC legal track, the limited resources, and so on, there may not be a practical alternative to getting these judgments from many different reviewers.  The assessors averaged only 21.5 documents per hour, so the average assessor took about 23.3 hours to review 500 documents, a substantial commitment of time and effort from a volunteer.&lt;br /&gt;The inconsistency in assessor judgments limits the ability of any system to yield reliable results.  The appeal process of the interactive task (topic 103), for example, demonstrates what can be gained by increasing consistency.  Practically every system showed an improvement in recall as a result of the appeal process, whether or not they were responsible for submitting the appeal.  Improving consistency appears to improve the absolute level of performance, sometimes substantially.&lt;br /&gt;&lt;br /&gt;The use of multiple assessors matches well the standard practice in electronic discovery of distributing documents to multiple reviewers.  The results described here, and others, suggest that there are likely to be similar levels of inconsistency in these cases.  Taking the prior year reviews as the standard against which to measure the 2008 assessors, they found only 58% of the documents deemed to be relevant by the prior review, that is, 58% recall.  Similarly, the 2006 study found that the second reviewer recognized as relevant, again, only 58% of the documents deemed relevant by the first reviewer.  
I do not believe that these results are an artifact of the TREC processes or procedures.  Rather, I think that this level of inconsistency is endemic in the process of having multiple reviewers review documents over time.&lt;br /&gt;&lt;br /&gt;In the practice of eDiscovery, human review suffers from unknown inconsistencies.  There is no reason to think that actual legal review should be any more consistent than that found in the TREC studies.  For that reason, standard review practice may be grossly under-delivering responsive documents.  At the very least, attorneys should seek to measure the consistency of their reviewers and the effectiveness of their classifications.&lt;br /&gt;&lt;br /&gt;The TREC legal track represents a tremendous resource for the legal community and for the information retrieval community as a whole.  It is a monumental effort, representing untold hours and uncounted dollars.  In future articles I plan to describe other interesting findings to come out of this study.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-4780842346248008125?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/4780842346248008125/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2009/04/trec-legal-track-2008.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/4780842346248008125'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/4780842346248008125'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2009/04/trec-legal-track-2008.html' title='TREC Legal Track 2008'/><author><name>Herbert L Roitblat, 
Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-8477317604000651980</id><published>2009-04-09T10:47:00.000-07:00</published><updated>2009-04-09T11:13:09.013-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ediscovery'/><category scheme='http://www.blogger.com/atom/ns#' term='Early Case Assessment'/><category scheme='http://www.blogger.com/atom/ns#' term='concept search'/><category scheme='http://www.blogger.com/atom/ns#' term='semantic search'/><category scheme='http://www.blogger.com/atom/ns#' term='ediscovery analytics'/><title type='text'>Analytics in Electronic Discovery</title><content type='html'>The goal of eDiscovery analytics is to understand your data: its volume, its content, and its challenges.  This information is critical to evaluating the risks inherent in the case, to estimating the resources that will be needed to advance it, and to winnowing and organizing the data for efficient and effective processing. Said another way, the goal of eDiscovery analytics is to have as much useful knowledge as possible about the documents and other sources of information that are potentially discoverable. A related goal, early in the development of a case, is to have the information needed to prepare for an effective discussion with the other side on discovery plans.  It is difficult to formulate an effective eDiscovery strategy without broad and deep knowledge of the data and of the issues in the case.  And there is typically great pressure to obtain this knowledge as quickly and as inexpensively as possible. 
Analysis is not a substitute for document review, but it can facilitate review and reduce the time, effort, and cost that review requires.&lt;br /&gt;&lt;br /&gt;eDiscovery analytics is a kind of text analytics directed at the kinds of information that attorneys will find useful for managing the case. Fundamentally, it is intended to say what a document collection is about. Are there obvious smoking guns? What proportion of the documents is likely to be responsive? How can we distinguish between documents that are potentially responsive and those that are not? Are there individuals that we have not yet identified who may be important to the case? Are there topics that we have not yet considered that may be important to the case? What documents are likely to be pertinent to each topic?&lt;br /&gt;&lt;br /&gt;A wide range of tools is available for eDiscovery analytics. These include, for example, linguistic tools that identify the nouns, noun phrases, people, places, organizations, and other "entities" in the documents. Conceptual tools identify the concepts that appear in the documents. Clustering tools organize documents into groups based on their similarity. On top of these, there are visualization tools that help to display this information in ways that are easily understood.&lt;br /&gt;&lt;br /&gt;Social network analysis is often used with emails to identify who is "talking" to whom. This tool can be effective for identifying custodians who may have important information. Patterns of communication do not always follow the pattern that one would expect from an organizational chart, and the people with knowledge may not be the same as the people to whom the management structure assigns responsibility.&lt;br /&gt;&lt;br /&gt;Analysis, especially during early case assessment, is often conducted in the context of great uncertainty. After all, it is an effort directed at reducing that uncertainty. 
Data may not be fully collected, and it may not yet be fully decided whether the cost of eDiscovery is justified by the merits of the case and the amount at risk. For this reason, sampling is often used in eDiscovery analysis. If, because of time or cost constraints, you cannot analyze the complete collection, analyze a sample.&lt;br /&gt;&lt;br /&gt;The ideal sample is one that is randomly chosen from all of the documents that could be considered for responsiveness. A random sample is one in which every document or record has an equal probability of being included. A random sample is desirable because statisticians have found that it is the most reliable way to get a representative sample of items and the best way to infer the nature of the population (all of the documents and records) from the sample. In practice, however, especially during early case assessment, a truly random sample may not be possible, so any inferences drawn from a nearly random sample need to be drawn cautiously, though they are often still valuable and useful. The closer you can come to a truly random sample, the more reliable your analysis will be. For a large collection, the sample size needed for a given level of precision depends very little on the size of the population, but the larger the sample, all other things being equal, the better the extrapolation from the sample to the population will be.&lt;br /&gt;&lt;br /&gt;Some specific questions that can be addressed through analysis include:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;     What topics are discussed in this ESI collection?&lt;/li&gt;&lt;li&gt;     Which individuals are likely to have pertinent knowledge?&lt;/li&gt;&lt;li&gt;     What time frame should the collection cover?&lt;/li&gt;&lt;li&gt;     How can we identify those documents that are very likely to be responsive or nonresponsive?&lt;/li&gt;&lt;li&gt;     What search terms should we use to identify potentially responsive documents? 
Are they too vague, yielding too many documents to review, or too specific, missing responsive documents?&lt;/li&gt;&lt;li&gt;     What resources will be needed to process and review the documents (e.g., languages, file types, volume)?&lt;/li&gt;&lt;li&gt;     Is there evidence apparent in the data that would support or compel a rapid settlement of the case?&lt;/li&gt;&lt;li&gt;     How can we respond with particularity to the demands of the other party?&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;Analysis is not a single stage in the processing of electronically stored information; rather, it is an ongoing process that continuously reduces uncertainty. It is a systematic approach to understanding the information you have in the context of specific issues and matters. As Pasteur said, "chance favors the prepared mind." eDiscovery favors the prepared attorney, and analysis is the means by which to be prepared.&lt;br /&gt;&lt;br /&gt;OrcaTec LLC has prepared a report on early case assessment analysis techniques that is available for a fee. This report describes about 35 different reports, analytic techniques, and visualizations, and about 48 questions to address during early case assessment. Its target audience is electronic discovery service providers, but it may also be of interest to attorneys. 
Contact &lt;a href="mailto:info@orcatec.com"&gt;info@orcatec.com&lt;/a&gt; for more information.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Related posts include:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.typepad.com/services/trackback/6a00e008daf3e0883401156f117fcf970c"&gt;Considering Analytics&lt;/a&gt;&lt;br /&gt;&lt;a href="http://www.typepad.com/services/trackback/6a00e008daf3e0883401156ff40ad0970b"&gt;ONCE MORE, IN ENGLISH: ANALYTICS IN ELECTRONIC DISCOVERY&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-8477317604000651980?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/8477317604000651980/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2009/04/analytics-in-electronic-discovery.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8477317604000651980'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8477317604000651980'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2009/04/analytics-in-electronic-discovery.html' title='Analytics in Electronic Discovery'/><author><name>Herbert L Roitblat, Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3720617976685068531.post-8441674347621125192</id><published>2009-04-09T09:57:00.000-07:00</published><updated>2009-04-09T10:25:30.168-07:00</updated><category 
scheme='http://www.blogger.com/atom/ns#' term='truevert'/><category scheme='http://www.blogger.com/atom/ns#' term='concept search'/><category scheme='http://www.blogger.com/atom/ns#' term='semantic search'/><category scheme='http://www.blogger.com/atom/ns#' term='orcatec'/><title type='text'>Welcome to Information Discovery</title><content type='html'>"Information Discovery" is an occasional blog about finding and discovering information.  A major component of what we do, and of information discovery in general, is semantic search.  In the legal space, this is called "concept search."  Information discovery also involves other text analytic tools, including near duplicate detection, semantic clustering, language identification, and others. &lt;br /&gt;&lt;br /&gt;You can find out more about OrcaTec from &lt;a href="http://www.orcatec.com"&gt;www.orcatec.com&lt;/a&gt; or visit our Green Web search engine, Truevert, at &lt;a href="http://www.truevert.com"&gt;www.truevert.com&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3720617976685068531-8441674347621125192?l=orcatec.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://orcatec.blogspot.com/feeds/8441674347621125192/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://orcatec.blogspot.com/2009/04/welcome-to-information-discovery.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8441674347621125192'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3720617976685068531/posts/default/8441674347621125192'/><link rel='alternate' type='text/html' href='http://orcatec.blogspot.com/2009/04/welcome-to-information-discovery.html' title='Welcome to Information Discovery'/><author><name>Herbert L Roitblat, 
Ph.D</name><uri>http://www.blogger.com/profile/00995738843466481632</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://1.bp.blogspot.com/_B_ot9Nd0mAw/SNf9h7i04sI/AAAAAAAAAAM/OBLX-5O0og8/S220/Herb0304-1in.jpg'/></author><thr:total>0</thr:total></entry></feed>
