Interview The International Consortium of Investigative Journalists has announced it will be releasing the structured data from the leaked Mossack Fonseca database on May 9.

The searchable database is not intended to be a "data dump", but will include curated information "about companies, trusts, foundations and funds incorporated in 21 tax havens, from Hong Kong to Nevada in the United States. It links to people in more than 200 countries and territories."

When the data is released, users will be able to search through it and visualise the networks around the offshore entities, including Mossack Fonseca’s internal records of the company’s true owners.

At a reception ahead of Neo Technology's GraphConnect Europe conference, The Register spoke to Mar Cabra, the head of the ICIJ's Data and Research Unit, which used the company's graph database technology as part of its analysis of the leaked Panama Papers.

“We believe that the best [way] we can do the work is to get as many journalists as possible to look at the data and documents and do content curation on the documents based on public interest,” said Cabra.

"We believe that is a secure way of ethically dealing with the documents, and we have the goal of publishing as much data as possible. Therefore we're going to release the structured data part of the Mossack Fonseca database. This includes the names of more than 200,000 offshore companies and the people behind them, because we believe, and the World Bank believes, and many experts believe, that corporate registry should be public," she added. "Therefore, there's no problem to make that information public. That will be done in early May.”

“We have developed this methodology at ICIJ for years where we get reporters to work with us, and so media organisations would put resources – so I think it's distributed economy applied to journalism – so The Guardian would put five reporters, and Le Monde would put another five, and all of a sudden you get a team of 370 journalists working together.”

“While working on stories like Offshore Leaks, I learned how important graph analysis is when investigating financial corruption,” Cabra said, in a customer case study for Neo4j.

“Connections are key to understanding what the real story is: they show you who’s doing business with whom. We decided early on that we needed to use a graph-based approach for the HSBC Leaks,” said Cabra.

The Data and Research Unit's first move in that investigation was to recreate the HSBC client database from the plain Excel files it had been given. Next, they connected every name to one or more countries; names and countries alike become ‘nodes’ in the graph database. Finally, they turned the data into a graph format to explore the connections between nodes.
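As a rough illustration of those three steps, here is a minimal Python sketch that turns tabular rows into nodes and relationships. It is not the ICIJ's code: the sample rows, the `LINKED_TO` label and the function names are all invented for the example, and a real pipeline would load the result into a graph database rather than keep it in memory.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for spreadsheet records.
# The names and countries are invented for illustration only.
ROWS = [
    {"name": "Client A", "countries": ["France", "Switzerland"]},
    {"name": "Client B", "countries": ["Switzerland"]},
    {"name": "Client C", "countries": ["Panama", "France"]},
]


def build_graph(rows):
    """Return (nodes, edges); clients and countries are both nodes."""
    nodes = set()
    edges = set()
    for row in rows:
        client = ("client", row["name"])
        nodes.add(client)
        for country in row["countries"]:
            country_node = ("country", country)
            nodes.add(country_node)
            # "LINKED_TO" is an assumed relationship label.
            edges.add((client, "LINKED_TO", country_node))
    return nodes, edges


def build_adjacency(edges):
    """Index the edges so each node's neighbours can be looked up."""
    adjacency = defaultdict(set)
    for source, _label, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


nodes, edges = build_graph(ROWS)
adjacency = build_adjacency(edges)
```

Looking up `adjacency[("country", "Switzerland")]` then returns every client tied to that country, which is the kind of question a graph layout makes cheap to ask.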

The resulting graph database had more than 275,000 nodes with 400,000 relationships among them. The ICIJ worked with Talend to transfer the original dataset into Neo Technology’s Neo4j graph database.

Another Neo Technology partner, Linkurious, provided a web app as a user interface, so that the graph database could be visualised and easily accessed by reporters. The graph visualisation approach allowed ICIJ journalists to identify the connections between people and bank accounts, helping them “follow the money” to identify dozens of instances of fraud, corruption and tax evasion.
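To give a flavour of what tracing a connection means in practice, the sketch below runs a breadth-first search over a small undirected graph to find the shortest chain linking two parties. The edge list and every name in it are hypothetical, and this is plain Python rather than anything Neo4j or Linkurious actually runs.

```python
from collections import deque

# Hypothetical links between people, companies and an account.
EDGES = [
    ("Person A", "Company X"),
    ("Company X", "Account 001"),
    ("Person B", "Company Y"),
    ("Company Y", "Account 001"),
    ("Person C", "Company Z"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list or None if unconnected."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorting keeps the traversal order deterministic.
        for neighbour in sorted(adjacency[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, "Person A", "Person B")
```

In this toy data the two people never appear in the same record, yet the search links them through the account their companies share.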

“It was a massive undertaking. Technology helped us a lot; without technology this would not have been possible,” Cabra told The Register.

The Panama Papers weighed in at more than 2.6TB, but were not immediately suitable for analysis. Süddeutsche Zeitung, the German publication which received the original leak, has explained the technology behind the investigation. Nuix was used to make text in the scanned image documents machine-readable.

Using Neo4j and Linkurious, the ICIJ was able to establish patterns between those involved and discover what Cabra described as the “concentrators of activities”, with HSBC and its affiliates notably turning up as having registered more than 2,300 of the shell companies.
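One simple way to surface such concentrators is to count how many distinct companies each intermediary is attached to and rank the result. The Python sketch below does that over made-up registration records. The data, the names and the `top_concentrators` helper are assumptions for illustration, and the article does not say this is how the ICIJ did it.

```python
from collections import Counter

# Hypothetical (intermediary, shell company) registration records.
REGISTRATIONS = [
    ("Bank A", "Shell 1"),
    ("Bank A", "Shell 2"),
    ("Bank A", "Shell 3"),
    ("Law Firm B", "Shell 4"),
    ("Bank C", "Shell 5"),
    ("Bank C", "Shell 6"),
    ("Bank A", "Shell 1"),  # duplicate record, counted once
]


def top_concentrators(registrations, limit=2):
    """Rank intermediaries by the number of distinct companies registered."""
    counts = Counter()
    seen = set()
    for intermediary, company in registrations:
        if (intermediary, company) in seen:
            continue
        seen.add((intermediary, company))
        counts[intermediary] += 1
    return counts.most_common(limit)


ranking = top_concentrators(REGISTRATIONS)
```

In graph terms this is a degree count on the intermediary nodes: those with the most relationships float to the top.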

“We don't do investigations at the ICIJ that don't have a data component. Data allows us to talk about systematic issues, to try to find patterns in what happens in the world, and that's what we do, expose cross-border issues that happen systematically.”

“The reality is that our biggest projects, and the most notorious projects, in terms of impact, have been leaks, but I have to say we have done interesting projects with public data. So for example we did this project looking at public data from the World Bank, and data held within thousands of documents, and analysed that and created a unique database of projects financed by the World Bank that were displacing people economically or physically, and we found out that more than three million people had been displaced in the past decade by projects financed by the World Bank, and all that was done with public data, data from the World Bank website.”

Cabra thought “there are many great stories held in public data, and we can get public data using freedom of information laws, but also from public data portals,” but said the ICIJ would not be making the leaked Panama Papers documents completely public.

“We have in our hands the private information of thousands of people. We have information on bank accounts, we have information on their passports, we've got information on minors – minor names and contact details – and we have information about tax evasion, about corruption, and so we believe that it would not be ethical from our journalistic perspective to just dump 11.5m files online, because there could be unintended consequences and collateral damage because of that private data.”

When the data is published, it will be included in the ICIJ's Offshore Leaks database. ®