I'm used to seeing large amounts of personal data left inadvertently exposed to the web. Recently, the Red Cross Blood Service down here left a huge amount of data exposed (well, at least the company doing their tech things did). Shortly afterwards, the global recruitment company Michael Page also lost a heap (also due to a partner, Capgemini). Both cases were obviously extremely embarrassing for the companies involved and they did exactly what you'd expect them to do once they found out about it - they pulled the data offline as fast as humanly possible.

And this is how it generally goes with incidents like this; lots of embarrassment, lots of scrambling to fix then lots of apologising afterwards. Which makes the behaviour of Health Solutions in India all the more confounding. Here's how it all unfolded:

On Wednesday, someone popped up on the Twitters and shared a link with me via DM which went to www.hsppl.com/pathology/downloads/downloadReports and returned this page:

You won't find anything there anymore because they've (eventually) been removed. But what you're looking at in that screen grab is files displayed in a folder with directory listing enabled, precisely the same vulnerable configuration the brought the Red Cross and Michael Page undone. There were 43,203 files in total relating to pathology reports because that's what Health Solutions does:

Now this is very sensitive data so let's take just a moment to reflect on the ethics involved when you discover (or are told of) an incident like this. It was quite clear what was in this folder so I didn't click on any of the documents. Not only did I not want to breach the confidentiality of those involved, I also didn't want to contaminate any logs which may later be subject to forensic review. Having been down this path many times before, one of the things (responsible) companies do after learning of an incident like this is trawl through the logs to get a sense of how broad the exposure is. It's not that I didn't just want my IP in there, I wouldn't want to hit it via VPN or Tor either because that only adds noise to the logs and makes it harder to get to the bottom of what's actually happened.

Having said that, there's another way to get a sense of what's in the documents without actually loading them and that's to see if Google has picked them up. Which it had:

It wasn't all 43,203 documents indexed, but then a bunch of them were .rar files anyway so that's to be expected. The problem was that not only were the files themselves indexed, the contents had also been cached. In other words, Google had been reading the pathology reports and the contents had now spread beyond the source web server.

Keeping the privacy aspect of things in mind, as unfortunate as it is that Google had indexed things it actually made it a lot easier to work out the scope of the leak without actually looking at the files. For example, one of the most sensitive pathology reports I would expect to find are tests for HIV:

So not only had the files been indexed by Google, they were available via cache and contained some very sensitive information too. This is about as bad as it gets in terms of exposed health data.

The next step was to get a sense of who Health Solutions was because all I had at this stage was a domain name. A quick browse around the website showed they were in Mumbai in India:

I found an email address on the website and emailed them at 13:55 on Nov 30. The email immediately bounced back with "No Such User Here"

So I tried the contact us form:

24 hours later, nothing. I then tried the WHOIS admin contact which appears to be a bloke named Sunil working for Maurya Consultancy Services:

That bounced back with the same error as the earlier email:

So I started digging deeper into the site and one of the first things I did was look at where it was hosted which, to my surprise, turned out to be in the US:

I first wondered if I'd made an error somewhere or misinterpreted the results, so I took a look on Shodan as well:

The host name there is server.askmyindianfellows.com which just goes off to a default cPanel page. However, it reinforced the Indian connection and a WHOIS on that name joined all the dots:

I took a bit more of a look around the site to get a sense of what was going on and I think this one image summarises it best:

A marquee tag on a page with a hit counter (broken) and a copyright of 2011. It's the trifecta of "this is old and unloved"! I was rapidly losing confidence in the likelihood of finding anyone who cared, so I turned to Twitter:

I need a Mumbai local to call up the local pathology center leaking all their patient data and have them get in touch with me. Any takers? — Troy Hunt (@troyhunt) December 1, 2016

A bunch of people responded, one of whom was Pranav Dixit, a BuzzFeed reporter. Pranav was excellent, immediately understanding the severity of the issue and getting in touch directly with Health Solutions. This was Thursday evening last week, more than 24 hours after I'd begun trying to raise someone there. I thought this would do it - the files will be gone in no time now - and then we got the most unfathomable responses from them...

Pranav later wrote these in a BuzzFeed piece I'll link to shortly so I'm comfortable sharing the following statements from Health Solutions he sent me via DM. He'd gotten the lab on the phone - this is the lab running the site that exposed the data - and they told him this:

We are not the doctors, we and our franchisees merely do the blood tests and maintaining doctor-patient confidentiality is not our problem

How is this even possible?! The relationship with the doctor is inconsequential because it's the lab that's leaking the data! This alone was unfathomable, but then they went even further:

we are moving to a new domain in january and retiring the existing website, so these problems will be fixed in jan

These messages were sent on the 1st of December so in other words, Health Solutions were saying they were going to leave the patient data exposed for at least another month and it wasn't really their problem anyway because confidentiality wasn't up to them. They had one more thing to say on the time frame:

but till then, we are not planning to do anything about this

We were both gobsmacked by this. How on earth can you leak this sort of data and just not care?! Look, it'd be one thing if there was a heap of engineering work to be done in order to secure the patient records, but where we were then it was simply a matter of removing the files or even better, just turning off the site until it was properly secured. Yes, that could have a business continuity impact but I'd never seen this stop an organisation from securing sensitive data that had been publicly exposed. Never.

The following day, the files were still there. It was Friday now and tens of thousands of pathology files remained publicly accessible and easily discoverable, especially given the Google index. I decided to start applying some social pressure, but I didn't want to name the lab as it'd take one simple Google search with an "inurl" and the documents would be found:

This is not going well, the path lab doesn't want to fix. Here's redacted details, Indian journos DM for details: https://t.co/G1DsIPKZ5f https://t.co/roBn8SZX4i — Troy Hunt (@troyhunt) December 2, 2016

I'm used to seeing inadvertent exposure of data, but it gets fixed immediately once reported. "We'll do it in Jan" is unfathomably reckless. — Troy Hunt (@troyhunt) December 2, 2016

I honestly didn't know how this would play out - if they're adamant that they're not going to fix the exposure and I can't publicly shame them by name, what's going to change? As it turns out, BuzzFeed solved that conundrum by publishing a piece titled The Medical Reports Of 43,000 People, Including HIV Patients, Were Accidentally Released Online shortly after my tweets. Obviously, my preference would have been to see the data secured first, but their position was that given the path lab didn't care, it was fair game. They also elected to print a redacted image showing one of the exposed reports (an HIV test) and shared other information derived from the exposed data such as the involvement of children:

Some included in the breach are as young as 17

BuzzFeed had managed to get in touch with Rodrigues Kustas who they refer to as "an administrator at Health Solutions" and he had some rather, uh, "enlightening" facts to share. For example, security incidents weren't exactly a new thing:

Health Solutions was moving to a new website in January because their current one had been “hacked” several times

In a subsequent interview with The Hindu, he was quoted as saying:

The data leak was six months ago and by now we already have a new server

In that same article, he talks about a "hack":

While the website has been hacked, none of the confidential information on health issue of any of our patients has been compromised

Let us be crystal clear about this - publishing files to a website without any access controls and in a path with directory listing enabled is not "hacking", it's incompetence. That's a really important distinction because the term "hack" shifts the blame to someone else when it should rest squarely on the shoulders of Rodrigues and co. And while we're here, saying that none of the info was compromised is blatantly wrong; BuzzFeed pulled HIV test results! Even if Health Solutions went back through the logs (which they may not even have), the fact that Google indexed it all and stored it in cache means that the files were copied outside their environment by a third party and they simply have no idea who has seen them.

In the BuzzFeed piece, he then goes on to shed some light on how such a shoddy development job may have come about:

the lab’s website was developed by a third-party developer that he described as a personal friend

Those of us who work in tech have seen it all before - someone's brother's uncle's mate's dog does some web development and they built the site. Rodrigues then apparently "refused to provide any more details" but he doesn't need to because it's all over the WHOIS records and hosting arrangement! The site was built by Maurya Consultancy Services, except there's not much there at the moment:

That site is running on the same IP address as all the exposed records were found on. Only 4 weeks ago, it looked like this:

That's the last recorded snapshot on archive.org and it appears that the site has been disabled (scrubbed?) around the time of this incident. Doing a reverse DNS lookup on the host name shows a raft of other sites running on the same IP address:

amcoweigh.com

aphali.com

askmyindianfellows.com

biglife4u.com

cafefumo.com

coloron.co.in

coloroncare.com

dashclinic.net

directproductmark.biz

futureenterprises.biz

gretdhara.com

homebulkdeal.com

hsppl.com

kantalaxmi.com

mauryaindustries.com

mcslinc.com

mymoneystation.com

mysvls.com

nixonchemicals.com

ourflame.biz

sagardevelopers.com

shreelandscaping.com

smileservices.biz

sribalajimedicare.com

swaroopkart.com

tarseelexports.com

thethomsonandthomson.com

thomsonandthomson.co.in

trimsnbeyond.com

vatanucoolengineers.com

At least they were running; every single one now shows the same "we're upgrading our servers" message. However, the internet archive can help fill in the gaps and a bit of searching shows everything from a pharmaceutical export business to a network marketing site ("happyness" is one of the things they do) to one which simply has directory listing enabled and a backup of the site in a zip file at the root (a familiar pattern by this time).

It's now pretty easy to join the dots on who's behind it:

A one-man show run by Sunil Maurya based in Mumbai building PHP, the language used for the Health Solutions site. I'd normally be reticent to name and shame in this way but the gravitas of the situation deserves it. An "MCA Fresher" per the by-line above is someone who has just completed their Masters of Computer Application. In fairness to Sunil, he may have had no idea what he was getting himself into and the blame really has to rest with Health Solutions for not pausing to consider that maybe a personal friend fresh out of college wasn't the most responsible choice for this class of data.

Just to make things even worse again, a bit more Googling revealed this:

For the non-technical readers, this a "shell" which allows an attacker to remotely run commands on a server. Someone has been able to place the malicious software on the Health Solutions websites which then enabled them to perform a whole raft of other nefarious activities on the machine. Google indexed it on the 10th of November, less than 3 weeks before the exposed data was reported to me. It appears to be unrelated to the fact that the site didn't secure the pathology reports in the first place, but it speaks to how poorly managed the entire thing was.

Summing it all up, here's what we know:

Health Solutions hired a personal friend fresh out of school working for himself to manage tens of thousands of pathology reports on his own server he hosted in a foreign country along with a bunch of other unrelated sites that was infected with a malicious shell

This needs calling out because it's such a grave violation of trust on behalf of those impacted. But there's an angle to all of this which we can't ignore, an "elephant in the room", if you like, and it's this:

@troyhunt privacy of personal data, is still not a thing most people in India are bothered about. — Anon desi (@AnonDesi) December 2, 2016

@troyhunt Privacy and security is a myth in India. Nobody gives a damn. — Binoy Xavier Joy ⚡️ (@binoyxj) December 2, 2016

Now I'm not sure it's entirely that black and white because there are inevitably many people that do care, but there are also a different set of priorities in India. I spent a great deal of the last decade and a half building medical systems used across Asia Pacific and as I recently wrote, the development was often off-shored to India. One of the things I quickly learned is that for emerging markets in general, they have issues that far and away trump the privacy situation outlined here. Issues such as poverty and rapid urbanisation. Issues of low literacy rates and high infant mortality. These are foreign concepts to most people living comfortable western lives, but you can no doubt see how privacy of data such as this could be considered a "luxury" we enjoy by virtue of being in more developed nations. None of that should excuse this situation, but it hopefully helps explain why it wasn't approached with the urgency we'd normally expect to see.

Eventually, the data was indeed removed. I first saw it gone Saturday morning then by Sunday, the results in Google were gone too hence my publishing this piece now. Many Indians who've contacted me have expressed concerns not just about this issue, but what it heralds as they rapidly digitise services without yet having the privacy frameworks required to protect data (i.e. there's no HIPAA equivalent). And they're right - it's worrying - and inevitably what we've seen here is far more common than we know.

I doubt that Health Solutions will now contact the thousands of people impacted by this as we'd see mandated in other parts of the world. I also doubt there'll be any legal or regulatory recourse as a result of their incompetence. But what I do know is that what we've seen here is consistent with so many of the other incidents we've seen around the world in terms of the technical failings. I do hope India can get the regulations in place to hold people accountable when it happens again because with the rush to digitise this sort of thing, it will happen again.