Not long ago I heard from a reader who wanted advice on how to stop someone from scanning his home network, or at least recommendations about to whom he should report the person doing the scanning. I couldn’t believe that people actually still cared about scanning, and I told him as much: These days there are countless entities — some benign and research-oriented, and some less benign — that are continuously mapping and cataloging virtually every device that’s put online.

One of the more benign is scans.io, a data repository of research findings collected through continuous scans of the public Internet. The project, hosted by the ZMap Team at the University of Michigan, includes huge, regularly updated results grouped around scanning for Internet hosts running some of the most commonly used “ports” or network entryways, such as Port 443 (think Web sites protected by the lock icon denoting SSL/TLS Web site encryption); Port 21, or file transfer protocol (FTP); and Port 25, or simple mail transfer protocol (SMTP), used by many businesses to send email.

When I was first getting my feet wet on the security beat roughly 15 years ago, the practice of scanning networks you didn’t own looking for the virtual equivalent of open doors and windows was still fairly frowned upon — if not grounds to get one into legal trouble. These days, complaining about being scanned is about as useful as griping that the top of your home is viewable via Google Earth. Trying to put devices on the Internet and then hoping that someone or something won’t find them is one of the most futile exercises in security-by-obscurity.

To get a gut check on this, I spoke at length last week with University of Michigan researcher Zakir Durumeric (ZD) and Michael D. Bailey at the University of Illinois at Urbana-Champaign (MB) about their ongoing and very public project to scan all the Internet-facing things. I was curious to get their perspective on how public perception of widespread Internet scanning has changed over the years, and how targeted scanning can actually lead to beneficial results for Internet users as a whole.

MB: Because of the historic bias against scanning and this debate between disclosure and security-by-obscurity, we’ve approached this very carefully. We certainly think that the benefits of publishing this information are huge, and that we’re just scratching the surface of what we can learn from it.

ZD: Yes, there are close to two dozen papers published now based on broad, Internet-wide scanning. People who are more focused on comprehensive scans tend to be the more serious publications that are trying to do statistical or large-scale analyses that are complete, versus just finding devices on the Internet. It’s really been in the last year that we’ve started ramping up and adding scans [to the scans.io site] more frequently.

BK: What are your short- and long-term goals with this project?

ZD: I think long-term we do want to add coverage of additional protocols. A lot of what we’re focused on is different aspects of a protocol. For example, if you’re looking at hosts running the “https://” protocol, there are many different ways you can ask questions depending on what perspective you come from. You see different attributes and behavior. So a lot of what we’ve done has revolved around https, which is of course hot right now within the research community.

MB: I’m excited to add other protocols. There are a handful of protocols that are critical to operations of the Internet, and I’m very interested in understanding the deployment of DNS, BGP, and TLS’s interception with SMTP. Right now, there’s a pretty long tail to all of these protocols, and so that’s where it starts to get interesting. We’d like to start looking at things like programmable logic controllers (PLCs) and things that are responding from industrial control systems.

ZD: One of the things we’re trying to pay more attention to is the world of embedded devices, or this ‘Internet of Things’ phenomenon. As Michael said, there are also industrial protocols, and there are different protocols that these embedded devices are supporting, and I think we’ll continue to add protocols around that class of devices as well because from a security perspective it’s incredibly interesting which devices are popping up on the Internet.

BK: What are some of the things you’ve found in your aggregate scanning results that surprised you?

ZD: I think one thing in the “https://” world that really popped out was we have this very large certificate authority ecosystem, and a lot of the attention is focused on a small number of authorities, but actually there is this very long tail — there are hundreds of certificate authorities that we don’t really think about on a daily basis, but that still have permission to sign for any Web site. That’s something we didn’t necessary expect. We knew there were a lot, but we didn’t really know what would come up until we looked at those.

There also was work we did a couple of years ago on cryptographic keys and how those are shared between devices. In one example, primes were being shared between RSA keys, and because of this we were able to factor a large number of keys, but we really wouldn’t have seen that unless we started to dig into that aspect [their research paper on this is available here].

MB: One of things we’ve been surprised about is when we measure these things at scale in a way that hasn’t been done before, often times these kinds of emergent behaviors become clear.

BK: Talk about what you hope to do with all this data.

ZD: We were involved a lot in the analysis of the Heartbleed vulnerability. And one of the surprising developments there wasn’t that there were lots of people vulnerable, but it was interesting to see who patched, how and how quickly. What we were able to find was by taking the data from these scans and actually doing vulnerability notifications to everybody, we were able to increase patching for the Heartbleed bug by 50 percent. So there was an interesting kind of surprise there, not what you learn from looking at the data, but in terms of what actions do you take from that analysis? And that’s something we’re incredibly interested in: Which is how can we spur progress within the community to improve security, whether that be through vulnerability notification, or helping with configurations.

BK: How do you know your notifications helped speed up patching?

MB: With the Heartbleed vulnerability, we took the known vulnerable population from scans, and ran an A/B test. We split the population that was vulnerable in half and notified one half of the population, while not notifying the other half, and then measured the difference in patching rates between the two populations. We did end up after a week notifying the second population…the other half.

BK: How many people did you notify after going through the data from the Heartbleed vulnerability scanning?

ZD: We took everyone on the IPv4 address space, found those that were vulnerable, and then contacted the registered abuse contact for each block of IP space. We used data from 200,000 hosts, which corresponded to 4,600 abuse contacts, and then we split those into an A/B test. [Their research on this testing was published here].

So, that’s the other thing that’s really exciting about this data. Notification is one thing, but the other is we’ve been building models that are predictive of organizational behavior. So, if you can watch, for example, how an organization runs their Web server, how they respond to certificate revocation, or how fast they patch — that actually tells you something about the security posture of the organization, and you can start to build models of risk profiles of those organizations. It moves away from this sort of patch-and-break or patch-and-pray game we’ve been playing. So, that’s the other thing we’ve been starting to see, which is the potential for being more proactive about security.

BK: How exactly do you go about the notification process? That’s a hard thing to do effectively and smoothly even if you already have a good relationship with the organization you’re notifying….

MB: I think one of the reasons why the Heartbleed notification experiment was so successful is we did notifications on the heels of a broad vulnerability disclosure. The press and the general atmosphere and culture provided the impetus for people to be excited about patching. The overwhelming response we received from notifications associated with that were very positive. A lot of people we reached out to say, ‘Hey, this is a great, please scan me again, and let me know if I’m patched.” Pretty much everyone was excited to have the help.

Another interesting challenge was that we did some filtering as well in cases where the IP address had no known patches. So, for example, where we got information from a national CERT [Computer Emergency Response Team] that this was an embedded device for which there was no patch available, we withheld that notification because we felt it would do more harm than good since there was no path forward for them. We did some aggregation as well, because it was clear there were a lot of DSL and dial-up pools affected, and we did some notifications to ISPs directly.

BK: You must get some pushback from people about being included in these scans. Do you think that idea that scanning is inherently bad or should somehow prompt some kind of reaction in and of itself, do you think that ship has sailed?

ZD: There is some small subset that does have issues. What we try to do with this is be as transparent as possible. All of our hosts we use for scanning, if look at them on WHOIS records or just visit them with a browser it will tell you right away that this machine is part of this research study, here’s the information we’re collecting and here’s how you can be excluded. A very small percentage of people who visit that page will read it and then contact us and ask to be excluded. If you send us an email [and request removal], we’ll remove you from all future scans. A lot of this comes down to education, a lot of people to whom we explain our process and motives are okay with it.

BK: Are those that object and ask to be removed more likely to be companies and governments, or individuals?

ZD: It’s a mix of all of them. I do remember offhand there were a fair number of academic institutions and government organizations, but there were a surprising number of home users. Actually, when we broke down the numbers last year (PDF), the largest category was small to mid-sized businesses. This time last year, we had excluded only 157 organizations that had asked for it.

BK: Was there any pattern to those that asked to be excluded?

ZD: I think that actually is somewhat interesting: The exclusion requests aren’t generally coming from large corporations, which likely notice our scanning but don’t have an issue with it. A lot of emails we get are from these small businesses and organizations that really don’t know how to interpret their logs, and often times just choose the most conservative route.

So we’ve been scanning for a several years now, and I think when we originally started scanning, we expected to have all the people who were watching for this to contact us all at once, and say ”Please exclude us.’ And then we sort of expected that the number of people who’d ask to be excluded would plateau, and we wouldn’t have problems again. But what we’ve seen is, almost the exact opposite. We still get [exclusion request] emails each day, but what we’re really finding is people aren’t discovering these scans proactively. Instead, they’re going through their logs while trying to troubleshoot some other issue, and they see a scan coming from us there and they don’t know who we are or why we’re contacting their servers. And so it’s not these organizations that are watching, it’s the ones who really aren’t watching who are contacting us.

BK: Do you guys go back and delete historic records associated with network owners that have asked to be excluded from scans going forward?

ZD: At this point we haven’t gone back and removed data. One reason is there are published research results that are based on those data sets, results, and so it’s very hard to change that information after the fact because if another researcher went back and tried to confirm an experiment or perform something similar, there would be no easy way of doing that.

BK: Is this what you’re thinking about for the future of your project? How to do more notification and build on the data you have for those purposes? Or are you going in a different or additional direction?

MB: When I think about the ethics of this kind of activity, I have very utilitarian view: I’m interested in doing as much good as we possibly can with the data we have. I think that lies in notifications, being proactive, helping organizations that run networks to better understand what their external posture looks like, and in building better safe defaults. But I’m most interested in a handful of core protocols that are under-serviced and not well understood. And so I think we should spend a majority of effort focusing on a small handful of those, including BGP, TLS, and DNS.

ZD: In many ways, we’re just kind of at the tip of this iceberg. We’re just starting to see what types of security questions we can answer from these large-scale analyses. I think in terms of notifications, it’s very exciting that there are things beyond the analysis that we can use to actually trigger actions, but that’s something that clearly needs a lot more analysis. The challenge is learning how to do this correctly. Every time we look at another protocol, we start seeing these weird trends and behavior we never noticed before. With every protocol we look at there are these endless questions that seem to need to be answered. And at this point there are far more questions than we have hours in the day to answer.

Tags: Heartbleed vulnerability, Michael D. Bailey, University of Michigan, Zakir Durumeric, Zmap