This post is a follow-up to my January 7, 2013 post, The Role of Data Driven Models in Law Practice. I offer here ideas about data that lawyers should consider collecting and analyzing to predict and reduce legal problems.

In the earlier post, I discussed an article, What Computer Models Can – and Can’t – Do, by Ryan McConnell (Baker & McKenzie partner), Dianne Ralston (Schlumberger Ltd. deputy GC), and Charlotte Simon (Baker & McKenzie associate). Their ideas intrigued me, but I was disappointed that they seemed to conclude that, because of “noise”, data models would not be helpful in their compliance practices.

To spur thinking about where data collection and modeling might help avoid legal problems and support compliance, I offer below a few ideas to consider. These may be hard to execute or may fail. My deeper concern is epistemological: how do we know what might work?

Am I the only one who thinks it odd – and wrong – that large corporate law and compliance departments seemingly conduct little or no research and development? Companies that employ hundreds of lawyers and compliance professionals already spend a lot on law. Why not do some R&D to find ways to reduce cost? Granted, that R&D might yield poor results. Without trying, however, how do we know? Perhaps the research would lead to much lower ongoing legal or compliance costs.

So, here goes with some possibilities:

The authors discuss the possibility of using job descriptions to aid in compliance but conclude there is too much noise in that data. More data often solves noise problems, so why not aggregate job descriptions across companies? That could yield more insight into problematic positions or locations than any one company’s data. Thinking about ‘compliance as a utility’, there may be multiple opportunities for companies to share non-competitive data to improve compliance. Large data sets, as the authors observe, usually yield more reliable results.

Companies have a very rich, extant store of data that may well yield compliance clues: e-mail messages, files, and databases. Subject to privacy and other potential legal limits, companies could analyze e-mail headers to look for suspicious patterns of communication. Suspicious patterns might include too much contact, too little contact, or unusual combinations of people in touch. Start with a known compliance problem and run this analysis retroactively to learn which signals might be predictive.
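To make the header idea concrete, here is a minimal Python sketch, not a tested compliance tool. It counts who writes to whom using only the From, To, and Cc headers, then lists sender-recipient pairs that are active in a review period but never appeared in a baseline period. The function names, the `min_messages` threshold, and the choice of "new pair" as the suspicious signal are all my own assumptions for illustration; a real project would calibrate the signal against a known compliance problem.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def pair_counts(raw_messages):
    """Count (sender, recipient) pairs using only From/To/Cc headers."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        senders = [addr.lower() for _, addr in getaddresses(msg.get_all("From", []))]
        recipient_headers = msg.get_all("To", []) + msg.get_all("Cc", [])
        recipients = [addr.lower() for _, addr in getaddresses(recipient_headers)]
        for sender in senders:
            for recipient in recipients:
                counts[(sender, recipient)] += 1
    return counts


def unusual_pairs(baseline, current, min_messages=3):
    """Pairs active in the current period that never appeared in the baseline.

    min_messages is a hypothetical cutoff to ignore one-off messages.
    """
    return sorted(
        pair
        for pair, n in current.items()
        if n >= min_messages and baseline[pair] == 0
    )
```

Feeding `pair_counts` a quarter of ordinary traffic as the baseline and the period around a known incident as the current set would show whether new or unusual pairings preceded the problem. The same counts could just as easily flag pairs that went quiet.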

Going one step further with e-mail, companies could perform semantic analysis on e-mail content (not just headers) to look for suspicious substantive discussions. Already in the 1990s the US financial sector did this (using, for example, Assentor), to identify broker e-mail messages that violated securities rules. Today, with the predictive coding techniques developed for e-discovery, much more is possible – and affordable.

Corporate data does not stop with e-mail. Databases to support operations, sales, and expense management may also yield pointers for where to look for compliance issues. With social media, the possibilities seem endless.

If the data the authors cite, e-mail, and other corporate records do not suffice, then collect new data. Compliance officers could consider web-based surveys. If that loses too much nuance, then they could deploy a team of low-cost lawyers to make outbound calls to interview selected employees and systematically enter the interview results into a database for analysis. Who said we have to stop with off-the-shelf data?

Models may never be 100% reliable. The question is whether they are reliable enough for triage. If a model can bucket outcomes into ‘almost certainly not a problem’, ‘almost certainly a problem’, and ‘may be a problem’, then lawyers at least have some indication of where to look. A team of offshore lawyers could apply human judgment to refine model results and surface the most suspicious findings to in-house counsel.
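The triage step can be sketched in a few lines of Python. This assumes some model already produces a risk score between 0 and 1 for each item; the cutoffs of 0.1 and 0.9 and the function names are hypothetical, chosen only to illustrate the three buckets. In practice the cutoffs would depend on how many false alarms the review team can absorb.

```python
def triage(score, low=0.1, high=0.9):
    """Map a model risk score in [0, 1] to one of three review buckets.

    The low/high cutoffs are illustrative assumptions, not recommendations.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score < low:
        return "almost certainly not a problem"
    if score > high:
        return "almost certainly a problem"
    return "may be a problem"


def review_queue(scored_items, low=0.1, high=0.9):
    """Return (item, score) pairs worth human review, most suspicious first."""
    flagged = [
        (item, score)
        for item, score in scored_items
        if triage(score, low, high) != "almost certainly not a problem"
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

The review team would work the queue from the top, so human judgment goes first to the items the model considers most suspicious, and the bottom bucket never reaches a lawyer's desk.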

These ideas are not even in the Big Data realm. All these ideas can be tested with tools that have been available for years. The floor is open for other ideas, Big Data or otherwise.