Thursday, April 9, 2009

Analytics in Electronic Discovery

The goal of eDiscovery analytics is to understand your data: its volume, its content, and its challenges. This information is critical to evaluating the risks inherent in the case and the resources that will be needed to advance it, and to winnowing and organizing the data for efficient and effective processing. Said another way, the goal of eDiscovery analytics is to have as much useful knowledge as possible about the documents and other sources of information that are potentially discoverable. A related goal, early in the development of a case, is to have the information needed to prepare for an effective discussion with the other side on discovery plans. It is difficult to formulate an effective eDiscovery strategy without broad and deep knowledge of the data and of the issues in the case. And there is typically great pressure to obtain this knowledge as quickly and as inexpensively as possible. Analysis is not a substitute for document review, but it can facilitate review and reduce the amount of time, effort, and cost it requires.

eDiscovery analytics is a kind of text analytics directed at the kinds of information that attorneys will find useful for managing the case. Fundamentally, it is intended to say what a document collection is about. Are there obvious smoking guns? What proportion of the documents are likely to be responsive? How can we distinguish between documents that are potentially responsive and those that are not? Are there individuals that we have not yet identified who may be important to the case? Are there topics that we have not yet considered that may be important to the case? What documents are likely to be pertinent to each topic?

A wide range of tools is available for eDiscovery analytics. These include, for example, linguistic tools that identify the nouns, noun phrases, people, places, organizations, and other "entities" in the documents. Conceptual tools identify the concepts that appear in the documents. Clustering tools organize documents into groups based on their similarity. On top of these, visualization tools help to display this information in ways that are easily understood.
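To make the idea of clustering by similarity concrete, here is a minimal sketch. It assumes a very simple similarity measure (Jaccard overlap of word sets) and a greedy grouping rule. The function names, the threshold, and the sample documents are all hypothetical, and commercial clustering tools generally use more sophisticated methods.

```python
def tokens(text):
    """Lowercase a document and split it into a set of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Share of words two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_cluster(docs, threshold=0.3):
    """Put each document in the first cluster whose seed document is
    similar enough; otherwise start a new cluster with it as the seed."""
    clusters = []  # list of (seed word set, [document indices])
    for i, text in enumerate(docs):
        words = tokens(text)
        for seed_words, members in clusters:
            if jaccard(words, seed_words) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]

# Hypothetical documents, for illustration only.
docs = [
    "quarterly sales forecast for europe",
    "sales forecast for asia quarterly",
    "holiday party catering menu",
    "catering menu for holiday party",
]
groups = greedy_cluster(docs)
print(groups)
```

With these four documents, the two sales forecasts fall into one group and the two catering messages into another.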

Social network analysis is often used with emails to identify who is "talking" to whom. This tool can be effective for identifying custodians who may have important information. The patterns of communication do not always follow the pattern that one would expect from an organizational chart, and the people with knowledge may not be the same as the people charged with the responsibility by the management structure.
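As a rough illustration of the "who is talking to whom" idea, the sketch below counts messages between each sender and recipient and then counts how many distinct people each person corresponds with. The email data and function names are hypothetical; a real analysis would start from parsed message headers and would likely use a dedicated graph tool.

```python
from collections import Counter

# Hypothetical email metadata: (sender, list of recipients).
emails = [
    ("alice", ["bob", "carol"]),
    ("bob", ["alice"]),
    ("alice", ["bob"]),
    ("dave", ["alice"]),
]

def communication_counts(emails):
    """Count messages for each directed (sender, recipient) pair."""
    pairs = Counter()
    for sender, recipients in emails:
        for recipient in recipients:
            pairs[(sender, recipient)] += 1
    return pairs

def contact_counts(pairs):
    """Count the distinct people each person exchanged email with."""
    contacts = {}
    for sender, recipient in pairs:
        contacts.setdefault(sender, set()).add(recipient)
        contacts.setdefault(recipient, set()).add(sender)
    return {person: len(others) for person, others in contacts.items()}

pairs = communication_counts(emails)
degree = contact_counts(pairs)

# People with the most distinct contacts are candidates for a closer look.
for person, n in sorted(degree.items(), key=lambda item: -item[1]):
    print(person, n)
```

A person with many distinct contacts is not necessarily important to the case, but the count is a cheap first signal of who sits at the center of the conversation.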

Analysis, especially during early case assessment, is often conducted in the context of great uncertainty. After all, it is an effort directed at reducing that uncertainty. Data may not be fully collected, and it may not yet be fully decided whether the cost of eDiscovery is justified by the merits of the case and the amount at risk. For this reason, sampling is often used in eDiscovery analysis. If, because of time or cost constraints, you cannot analyze the complete collection, analyze a sample.

The ideal sample is one that is randomly chosen from all of the documents that could be considered for responsiveness. A random sample is one where every document or record has an equal probability of being included in the sample. A random sample is desirable because statisticians have found that this is the best, most reliable way to get a representative sample of items and the best way to infer the nature of the population (all of the documents and records) from the sample. In practice, however, especially during early case assessment, a truly random sample may not be possible, so any inferences drawn from an almost-random sample need to be drawn cautiously, but they are often still valuable and useful. The closer you can come to a truly random sample, the more reliable your analysis will be. The sample size needed for a given level of precision depends very little on the size of the population, as long as the population is much larger than the sample. All other things being equal, though, the larger the sample, the better the extrapolation from the sample to the population will be.
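The sketch below illustrates both points: drawing a simple random sample, and seeing how little the precision of an estimate depends on population size. It uses the standard normal-approximation margin of error for a proportion, with a finite population correction. The function names, the seed, and the collection sizes are assumptions made for the example.

```python
import math
import random

def draw_sample(doc_ids, n, seed=0):
    """Draw n documents so that each has an equal chance of selection."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.sample(doc_ids, n)

def margin_of_error(p, n, population=None, z=1.96):
    """Approximate 95% margin of error for an estimated proportion p
    from a random sample of size n (normal approximation)."""
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction: matters only when the sample
        # is a sizable fraction of the population.
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

# Hypothetical collection of 100,000 document IDs.
doc_ids = list(range(100_000))
sample = draw_sample(doc_ids, 400)

# Same sample size, very different population sizes.
small = margin_of_error(0.5, 400, population=100_000)
large = margin_of_error(0.5, 400, population=10_000_000)
print(round(small, 4), round(large, 4))
```

In both cases the margin of error for a sample of 400 is a little under 5 percentage points, while quadrupling the sample to 1,600 would roughly halve it.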

Some specific questions that can be addressed through analysis include:

  • What topics are discussed in this ESI collection?
  • Which individuals are likely to have pertinent knowledge?
  • What is the time frame that should be collected?
  • How can we identify those documents that are very likely to be responsive or nonresponsive?
  • What search terms should we use to identify potentially responsive documents? Are they too vague, yielding too many documents to review, or too specific, missing responsive documents?
  • What resources will be needed to process and review the documents (e.g., languages, file types, volume)?
  • Is there evidence apparent in the data that would support or compel a rapid settlement of the case?
  • How can we respond with particularity to the demands of the other party?
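
The search-term question above can be explored on a reviewed sample. As a minimal sketch, assuming a handful of sample documents that a reviewer has already marked responsive or not, the code below measures each candidate term's precision (how many of its hits are responsive) and recall (how many of the responsive documents it finds). The documents, labels, and function name are hypothetical.

```python
def evaluate_term(term, reviewed_docs):
    """Return (precision, recall) for a term on a reviewed sample."""
    term = term.lower()
    hits = [d for d in reviewed_docs if term in d["text"].lower()]
    responsive = [d for d in reviewed_docs if d["responsive"]]
    true_hits = [d for d in hits if d["responsive"]]
    precision = len(true_hits) / len(hits) if hits else 0.0
    recall = len(true_hits) / len(responsive) if responsive else 0.0
    return precision, recall

# Hypothetical reviewed sample, for illustration only.
reviewed_docs = [
    {"text": "Pricing agreement with the distributor", "responsive": True},
    {"text": "Lunch agreement for Friday", "responsive": False},
    {"text": "Revised distributor pricing schedule", "responsive": True},
    {"text": "Parking agreement renewal", "responsive": False},
]

for term in ["agreement", "pricing agreement", "pricing"]:
    precision, recall = evaluate_term(term, reviewed_docs)
    print(f"{term!r}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy sample, "agreement" is too vague (most of its hits are not responsive), "pricing agreement" is too specific (it misses a responsive document), and "pricing" does well on both measures.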

Analysis is not a single stage in the processing of electronically stored information; rather, it is an ongoing process that continuously reduces uncertainty. It is a systematic approach to understanding the information you have in the context of specific issues and matters. As Pasteur said, "chance favors the prepared mind." eDiscovery favors the prepared attorney, and analysis is the means by which to be prepared.

OrcaTec LLC has prepared a report on early case assessment analysis techniques that is available for a fee. This report discusses about 35 different reports, analytic techniques, and visualizations and about 48 questions to address during early case assessment. Its target audience is electronic discovery service providers, but it may also be of interest to attorneys. Contact for more information.

Related posts include:

Considering Analytics

1 comment:

  1. Sure is interesting to see how electronic discovery has been progressing over the past 5+ years.