The goal of this project was to start with a base directory (in this case The Hidden Wiki) and spider outward to discover all reachable Tor servers. After a few trial runs I added some restrictions:

Only HTML and JSON responses were parsed for more links to follow (no JPEGs, XML, etc.)

A few websites were skipped, notably Facebook, Reddit, and a few blockchain websites, because of the amount of spidering time they would require

Each host was limited to 10k visits so the spider would not run indefinitely and could finish in a reasonable time frame

Responses with a status other than 200 OK were skipped
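The restrictions above could be expressed roughly as follows. This is a minimal sketch, not the spider's actual code: the names `visits`, `allowVisit`, `shouldParse`, and the skip-list entry are hypothetical, and an in-memory map stands in for the Redis visit counters.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
	"strings"
)

// maxVisitsPerHost mirrors the 10k-per-host cap described above.
const maxVisitsPerHost = 10000

// skippedHosts is a hypothetical skip list; the real addresses that were
// skipped are not given in this article.
var skippedHosts = []string{"example-skipped.onion"}

// visits is an in-memory stand-in for the Redis visit counters.
type visits map[string]int

// allowVisit reports whether host may be fetched and, if so, counts the visit.
func (v visits) allowVisit(host string) bool {
	host = strings.ToLower(host)
	for _, s := range skippedHosts {
		if host == s || strings.HasSuffix(host, "."+s) {
			return false
		}
	}
	if v[host] >= maxVisitsPerHost {
		return false
	}
	v[host]++
	return true
}

// shouldParse reports whether a response should be parsed for more links:
// only 200 OK responses with an HTML or JSON media type qualify.
func shouldParse(status int, contentType string) bool {
	if status != http.StatusOK {
		return false
	}
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mediaType == "text/html" || mediaType == "application/json"
}

func main() {
	v := visits{}
	fmt.Println(v.allowVisit("abc.onion"))
	fmt.Println(shouldParse(200, "text/html; charset=UTF-8"))
	fmt.Println(shouldParse(200, "image/jpeg"))
	fmt.Println(shouldParse(404, "text/html"))
}
```

Keeping the host check separate from the response check means a skipped or capped host never costs a request, while the content-type and status checks only run once a response has arrived.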

I used a few different tools to build this out:

HAProxy to load balance between Tor SOCKS proxies so multiple could be run at the same time to saturate a network link

Redis to store state information about visits

Golang for the spidering

Postgres for data storage

This was all run on a single dedicated server over a period of about one week. Multiple prototypes ran before that to flush out bugs.
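To illustrate how the Go side of such a setup might reach Tor, here is a small sketch of an HTTP client that sends every request through a single SOCKS5 endpoint. The function name `newTorClient`, the address `127.0.0.1:9050`, and the timeout and connection values are assumptions for illustration; in the setup described above the endpoint would be the HAProxy frontend sitting in front of the Tor instances.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// newTorClient returns an HTTP client that routes all requests through the
// SOCKS5 endpoint at proxyAddr (host:port). net/http accepts socks5://
// proxy URLs in Transport.Proxy.
func newTorClient(proxyAddr string) (*http.Client, error) {
	proxyURL, err := url.Parse("socks5://" + proxyAddr)
	if err != nil {
		return nil, err
	}
	transport := &http.Transport{
		Proxy:               http.ProxyURL(proxyURL),
		MaxIdleConnsPerHost: 4, // assumed value, tune for the link
	}
	return &http.Client{
		Transport: transport,
		Timeout:   60 * time.Second, // assumed value, onion services can be slow
	}, nil
}

func main() {
	// 127.0.0.1:9050 is a placeholder for the load balancer's listen address.
	client, err := newTorClient("127.0.0.1:9050")
	if err != nil {
		fmt.Println("bad proxy address:", err)
		return
	}
	fmt.Printf("client ready, timeout %s\n", client.Timeout)
}
```

Pointing the client at one address keeps the spider unaware of how many Tor processes sit behind it, so more can be added without touching the Go code.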

Crawl Stats

| Metric | Count |
| --- | --- |
| Total Hosts | 107,067 |
| Total Scanned Pages | 14,177,383 |
| Total Visited (non-200+) | 17,038,091 |

Security Headers

| Technology | % using |
| --- | --- |
| Content Security Policy (CSP) | 0.15% |
| Secure Cookie | 0.01% |
| – httpOnly | 0% |
| Cross-origin Resource Sharing (CORS) | 0.07% |
| – Subresource Integrity (SRI) | 0% |
| Public Key Pinning (HPKP) | 0.01% |
| Strict Transport Security (HSTS) | 0.11% |
| X-Content-Type-Options (XCTO) | 0.52% |
| X-Frame-Options (XFO) | 0.58% |
| X-XSS-Protection | 0% |

Some of these headers are interesting when viewed in a Tor light. HSTS and HPKP, for example, can be used for supercookies and tracking (although Tor does protect against this across new identities) (source).

Services implementing CORS also help protect users by preventing cookie fingerprinting via scripts and other malicious fingerprinting methods.
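A check for the header-based rows of the table above could look something like this sketch. The mapping from each technology to a single response header is my assumption (for example, CORS is detected here via `Access-Control-Allow-Origin`), and the names `securityHeaders`, `presentSecurityHeaders`, and `percentUsing` are hypothetical. Cookie flags and SRI are not covered because they are not standalone response headers.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
)

// securityHeaders maps a short label to the response header that is checked.
// Which header signals each technology is an assumption of this sketch.
var securityHeaders = map[string]string{
	"CSP":              "Content-Security-Policy",
	"CORS":             "Access-Control-Allow-Origin",
	"HPKP":             "Public-Key-Pins",
	"HSTS":             "Strict-Transport-Security",
	"XCTO":             "X-Content-Type-Options",
	"XFO":              "X-Frame-Options",
	"X-XSS-Protection": "X-XSS-Protection",
}

// presentSecurityHeaders returns the sorted labels of all security headers
// that have a non-empty value in h.
func presentSecurityHeaders(h http.Header) []string {
	var found []string
	for label, name := range securityHeaders {
		if h.Get(name) != "" {
			found = append(found, label)
		}
	}
	sort.Strings(found)
	return found
}

// percentUsing converts a host count into a percentage of all hosts.
func percentUsing(count, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(count) / float64(total)
}

func main() {
	h := http.Header{}
	h.Set("X-Frame-Options", "sameorigin")
	h.Set("X-Content-Type-Options", "nosniff")
	fmt.Println(presentSecurityHeaders(h))

	// Made-up numbers, only to show the calculation.
	fmt.Printf("%.2f%%\n", percentUsing(1, 200))
}
```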

Software Stats

We can fingerprint and figure out exposed software by taking a look at a few different signatures, like cookies and headers. There are other methods to fingerprint using the response body but due to server restrictions and time I couldn’t save every single page source, so the results based on headers/titles are below:

Source code hosting

| Software | Type | Identifier |
| --- | --- | --- |
| Gitea | Cookie | i_like_gitea [src] |
| GitLab | Cookie | gitlab_session [src] |
| Gogs | Header | Forked version from NotABug.org has header X-Clacks-Overhead: GNU Terry Pratchett |
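The signatures in this table can be matched against a response's headers with a small lookup. The sketch below is an illustration under a few assumptions: the `signature` type and `identify` function are hypothetical names, cookie names are matched by substring so that a leading underscore on a session cookie still matches, and the X-Clacks-Overhead header is treated as a hint for the Gogs fork only because the table lists it, even though other software may send it too.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// signature describes one way to recognise a piece of software.
type signature struct {
	software string
	cookie   string // substring of a Set-Cookie name, or ""
	header   string // response header name, or ""
	value    string // substring expected in that header's value
}

// sourceHosting holds the identifiers from the table above.
var sourceHosting = []signature{
	{software: "Gitea", cookie: "i_like_gitea"},
	{software: "GitLab", cookie: "gitlab_session"},
	{software: "Gogs (NotABug.org fork)", header: "X-Clacks-Overhead", value: "GNU Terry Pratchett"},
}

// identify returns the first matching software name, or "" if none match.
func identify(h http.Header) string {
	// Response.Cookies parses the Set-Cookie headers for us.
	cookies := (&http.Response{Header: h}).Cookies()
	for _, sig := range sourceHosting {
		if sig.cookie != "" {
			for _, c := range cookies {
				if strings.Contains(c.Name, sig.cookie) {
					return sig.software
				}
			}
		}
		if sig.header != "" {
			if v := h.Get(sig.header); v != "" && strings.Contains(v, sig.value) {
				return sig.software
			}
		}
	}
	return ""
}

func main() {
	h := http.Header{}
	h.Add("Set-Cookie", "i_like_gitea=abc123; Path=/; HttpOnly")
	fmt.Println(identify(h))
}
```

Because the signatures live in a slice, adding another product is a one-line change rather than a new branch in the matching code.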

Build Servers

I’m going to focus on build servers because I think they are the easiest front to breach. Not only has Jenkins had some serious RCEs in the past, it is also very helpful in identifying itself with headers and debug information, as seen below. People also tend to store sensitive information on build servers, such as SSH keys and cloud provider credentials.

    X-Jenkins-Session: 8965d09b
    X-Instance-Identity: MIIBIjANBgkqhkiG9w0BAQEFAA.....
    Server: Jetty(9.2.z-SNAPSHOT)
    X-Xss-Protection: 1
    X-Jenkins: 2.60.1
    X-Jenkins-Cli-Port: 46689
    X-Content-Type-Options: nosniff nosniff
    X-Frame-Options: sameorigin sameorigin
    X-Hudson-Theme: default
    X-Jenkins-Cli2-Port: 46689
    Referrer-Policy: same-origin
    Content-Type: text/html;charset=UTF-8
    X-Hudson: 1.395
    X-Hudson-Cli-Port: 46689
    Set-Cookie: JSESSIONID.112b5e69=16uts5qfqz6j....Path=/; Secure; HttpOnly

We can get the Jenkins version, the CLI ports, and the Jetty version just from visiting the host.
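Pulling those values out of a response takes only a few header lookups. The following is a rough sketch; the `jenkinsInfo` type and `detectJenkins` function are names made up for this example, and treating any `X-Jenkins` or `X-Hudson` header as proof of Jenkins is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// jenkinsInfo collects what a Jenkins host reveals in its headers.
type jenkinsInfo struct {
	Version string // from X-Jenkins
	CLIPort string // from X-Jenkins-Cli-Port
	Server  string // from Server, e.g. the Jetty version
}

// detectJenkins reports whether h looks like a Jenkins response and, if so,
// returns the version, CLI port, and Server header it exposes.
func detectJenkins(h http.Header) (jenkinsInfo, bool) {
	version := h.Get("X-Jenkins")
	if version == "" && h.Get("X-Hudson") == "" {
		return jenkinsInfo{}, false
	}
	return jenkinsInfo{
		Version: version,
		CLIPort: h.Get("X-Jenkins-Cli-Port"),
		Server:  h.Get("Server"),
	}, true
}

func main() {
	// Values copied from the example response above.
	h := http.Header{}
	h.Set("X-Jenkins", "2.60.1")
	h.Set("X-Jenkins-Cli-Port", "46689")
	h.Set("Server", "Jetty(9.2.z-SNAPSHOT)")

	if info, ok := detectJenkins(h); ok {
		fmt.Printf("Jenkins %s, CLI port %s, server %s\n", info.Version, info.CLIPort, info.Server)
	}
}
```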

| Software | Type | Identifier |
| --- | --- | --- |
| Jenkins | Headers | X-Jenkins- and X-Hudson- style headers |
| GitLab | Cookie | gitlab_session |
| GoCD | Cookie Path / Title | Generally sets a cookie path at /go and uses "- Go" in <title> tags |
| Drone | Title | Sets a drone title |

Unfortunately I was unable to find any exposed GoCD or Drone servers.

Software Tracking

| Software | Type | Identifier |
| --- | --- | --- |
| Trac | Cookie | trac_session |
| Redmine | Cookie | redmine_session |

I was not able to find any running Bugzilla, Mantis or OTRS instances.

Popular Web Servers

Total with Server Header: 15,630

Total without header: 91,437

Top 11 (full list of 282 available for download)

| Server header | Count |
| --- | --- |
| nginx | 9619 |
| Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.6.30 | 2659 |
| Apache | 1056 |
| nginx/1.6.2 | 249 |
| nginx/1.13.1 | 210 |
| Apache/2.4.10 (Debian) | 161 |
| Apache/2.4.18 (Ubuntu) | 100 |
| Apache/2.2.22 (Debian) | 90 |
| Apache/2.4.7 (Ubuntu) | 82 |
| lighttpd/1.4.31 | 80 |
| FobbaWeb/0.1 | 78 |

Just from the Server header we can gather a bunch of useful information:

2,659 servers are running a potentially vulnerable OpenSSL version (1.0.1e) [vulns] and vulnerable Apache version [vulns]

Many servers leave the OS tag on, revealing a mix of operating systems: CentOS, Debian, Ubuntu, Windows, Raspbian, Amazon Linux, Fedora, Red Hat, Trisquel, YellowDog, FreeBSD, Scientific Linux, and Vine. I think it is also safe to assume that the people who leave fingerprinting on are using the OS package of these servers, which makes it easy to combine OS vulnerabilities and web server vulnerabilities into a single attack path

Some people are exposing application servers directly: thin, node-static, gunicorn, Mojolicious, WSGI, Jetty, and GlassFish

Very old versions of IIS (5.0/6.0), Apache (1.3), and Nginx are still in use

Nginx appears to dominate the server share on Tor. Taking only the top two entries into account, nginx (9,619) is about 3.6x as common as the most common Apache build (2,659)
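The observations above all come from splitting the Server header into a product name and an optional OS tag. Here is a minimal sketch of that parsing, with a tally by product family over the counts listed earlier. The helper names (`serverFamily`, `osTag`, `tallyByFamily`) are hypothetical, and the parsing is deliberately naive: it takes the first token before any `/` as the product and the first parenthesised comment as the OS.

```go
package main

import (
	"fmt"
	"strings"
)

// serverCount pairs a raw Server header value with how often it was seen.
type serverCount struct {
	header string
	count  int
}

// topServers holds the most common Server headers listed in this article.
var topServers = []serverCount{
	{"nginx", 9619},
	{"Apache/2.4.6 (CentOS) OpenSSL/1.0.1e-fips PHP/5.6.30", 2659},
	{"Apache", 1056},
	{"nginx/1.6.2", 249},
	{"nginx/1.13.1", 210},
	{"Apache/2.4.10 (Debian)", 161},
	{"Apache/2.4.18 (Ubuntu)", 100},
	{"Apache/2.2.22 (Debian)", 90},
	{"Apache/2.4.7 (Ubuntu)", 82},
	{"lighttpd/1.4.31", 80},
	{"FobbaWeb/0.1", 78},
}

// serverFamily returns the lowercase product name of the first token,
// without any version suffix.
func serverFamily(header string) string {
	fields := strings.Fields(header)
	if len(fields) == 0 {
		return ""
	}
	return strings.ToLower(strings.SplitN(fields[0], "/", 2)[0])
}

// osTag returns the first parenthesised comment, such as "CentOS", or "".
func osTag(header string) string {
	open := strings.Index(header, "(")
	end := strings.Index(header, ")")
	if open < 0 || end < open {
		return ""
	}
	return strings.TrimSpace(header[open+1 : end])
}

// tallyByFamily sums the counts of all headers sharing a product family.
func tallyByFamily(servers []serverCount) map[string]int {
	totals := make(map[string]int)
	for _, s := range servers {
		totals[serverFamily(s.header)] += s.count
	}
	return totals
}

func main() {
	totals := tallyByFamily(topServers)
	fmt.Println("nginx:", totals["nginx"], "apache:", totals["apache"])
	fmt.Println("OS tag:", osTag(topServers[1].header))
}
```

Summed over the eleven listed entries, this gives 10,078 for nginx and 4,148 for Apache, so nginx still leads by roughly 2.4x once all the listed Apache variants are grouped together.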

Summary

This was a fun project to work on, and I learned quite a bit about scaling up the Tor binary in order to scan the network faster. I’m hoping to make this process a bit less manual and start publishing these results regularly over at my security data website, https://hnypots.com

Have any suggestions for other software to look for? Leave a comment and let me know!