When working on the Shrine library for handling file uploads, I needed to download a file from a URL in multiple places. If you know the Ruby standard library well, the solution might be obvious to you: open-uri.

    require "open-uri"

    result = open("http://example.com/image.jpg")
    result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>

Open-uri is something that I indeed very much wanted to use for my use case. It ships with Ruby, so there are no external dependencies (just Net::HTTP), and it has many benefits:

downloads to a unique filesystem location (using Tempfile)

supports HTTP/HTTPS/FTP links

follows redirects

memory efficient

easy basic authentication

easy proxy

However, considering that in my case the URL could come from user input, open-uri turned out to have many limitations and quirks:

Using Kernel#open makes you vulnerable to remote code execution

If the remote file is smaller than 10KB, open-uri actually returns a StringIO instead of a Tempfile

URL’s file extension isn’t preserved in downloaded Tempfile

You cannot limit maximum number of redirects

You cannot limit maximum filesize

I’ve thought about alternatives: rest-client, curl or wget. However, rest-client was too heavy a dependency just for downloading, and I didn’t want to depend on external CLI tools. Also, none of them were able to properly limit the maximum filesize, which I found important in the context of Shrine.

So, realizing that I still wanted to use open-uri, I decided to make a wrapper around it that addresses these limitations. I want to guide you through my journey, fixing one issue at a time.

Improvements

Kernel#open

Ruby has a Kernel#open method which, given a file path, acts like File.open. But given a string that starts with “|”, it interprets the rest as a shell command and returns an IO connected to the spawned subprocess:

    open("| ls") # returns an IO connected to the `ls` shell command

Open-uri extends Kernel#open with the ability to accept URLs. However, if the URL is coming from user input, we should never pass it to Kernel#open, because different users have different ideas about what a “URL” is; someone might think that | rm -rf ~ is a nice-looking URL.

A little-known fact is that Kernel#open just delegates to URI::(HTTP|HTTPS|FTP)#open, and we can simply use that instead:

    uri = URI.parse("http://example.com/image.jpg") #=> #<URI::HTTP>
    uri.open #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz>
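Calling URI#open only helps if the string actually parsed into an HTTP(S) URI in the first place. Below is a minimal sketch of such a check; the helper name `parse_http_url` is hypothetical (it is not part of open-uri), and restricting input to HTTP(S) is an assumption made for this example, since open-uri itself also accepts FTP links.

```ruby
require "uri"

# Hypothetical helper (not part of open-uri): parse user input and accept
# only absolute HTTP(S) URLs, so nothing else ever reaches #open.
def parse_http_url(input)
  uri = URI.parse(input)

  # URI::HTTPS inherits from URI::HTTP, so this covers both schemes.
  unless uri.is_a?(URI::HTTP) && uri.host
    raise ArgumentError, "not an HTTP(S) URL: #{input.inspect}"
  end

  uri
rescue URI::InvalidURIError
  # Raised for input such as "| rm -rf ~", which is not a URI at all.
  raise ArgumentError, "not a valid URL: #{input.inspect}"
end
```

With this in place, `parse_http_url(params[:url]).open` would reject shell commands and local file paths before any request is made (`params[:url]` stands in for wherever the user input comes from).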

StringIO

Strangely, if the remote file is smaller than 10KB, open-uri will actually return a StringIO instead of a Tempfile.

    uri.open #=> #<StringIO>

In the context of Shrine I wanted the returned IO to always be a file, for consistency and because it could later be handed over for processing. We can easily fix that:

    io = uri.open

    if io.is_a?(StringIO)
      downloaded = Tempfile.new
      # binwrite avoids newline/encoding conversion of binary content
      File.binwrite(downloaded.path, io.string)
    else
      downloaded = io
    end

    downloaded # now always a Tempfile
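The same conversion can be wrapped in a small reusable method. This is only a sketch: the name `ensure_tempfile` and the `"download"` filename prefix are made up for illustration, and the input is assumed to be either a Tempfile or a StringIO, as open-uri returns.

```ruby
require "stringio"
require "tempfile"

# Hypothetical helper: normalize whatever open-uri returned into a Tempfile.
def ensure_tempfile(io)
  return io if io.is_a?(Tempfile)

  tempfile = Tempfile.new("download")
  tempfile.binmode            # downloaded content may be binary (e.g. images)
  tempfile.write(io.string)   # io is assumed to be a StringIO here
  tempfile.rewind             # leave the file ready for reading from the start
  tempfile
end
```

A caller could then write `downloaded = ensure_tempfile(uri.open)` and treat the result as a file regardless of the remote file's size.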

File extension

Surprisingly, open-uri always creates a Tempfile without a file extension, even if the URL has one. In Shrine I wanted downloaded files (which will later be uploaded) to always have an extension if it’s known.

So let’s copy the downloaded IO to a new Tempfile which has a file extension, but use mv when we can, so that we don’t pay any performance penalty (and so that the old file also gets deleted):

    require "fileutils"

    io = uri.open
    extension = File.extname(uri.path)
    # strip the extension from the prefix so it appears only once in the filename
    basename = File.basename(uri.path, extension)
    downloaded = Tempfile.new([basename, extension])

    if io.is_a?(Tempfile)
      FileUtils.mv io.path, downloaded.path
    else # StringIO
      File.binwrite(downloaded.path, io.string)
    end

    File.extname(downloaded.path) #=> ".jpg"
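The piece doing the work here is the array form of Tempfile.new, where the first element is a filename prefix and the second is a suffix. A minimal sketch, assuming a hypothetical helper `tempfile_for` and a fixed `"download"` prefix (both invented for this example):

```ruby
require "tempfile"
require "uri"

# Hypothetical helper: create an empty Tempfile whose name ends with the
# extension found in the URL path, if there is one.
def tempfile_for(uri)
  # uri.path excludes the query string, so "?size=large" never leaks into
  # the extension; File.extname returns "" when the path has no extension.
  extension = File.extname(uri.path)
  Tempfile.new(["download", extension])
end
```

Using a fixed prefix instead of the URL's basename keeps user-controlled text out of the filename apart from the extension itself.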

Redirects

What’s good is that open-uri can automatically follow redirects. What’s bad is that we cannot limit the maximum number of redirects. This allows the attacker to give a URL which causes a redirect loop, and open-uri would continue making requests forever. To be fair, open-uri has a detection for redirect loops, but only if URLs repeat.

So we disable open-uri’s following of redirects, which now raises OpenURI::HTTPRedirect on redirects, allowing us to reimplement it:

    tries = 3

    begin
      uri.open(redirect: false)
    rescue OpenURI::HTTPRedirect => redirect
      uri = redirect.uri # assigned from the "Location" response header
      retry if (tries -= 1) > 0
      raise
    end
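The retry logic above can also be pulled out into a method that takes the actual request as a block, which makes the redirect limit easy to exercise without touching the network. This is a sketch under assumptions: `follow_redirects` and `TooManyRedirects` are names invented for this example, and only OpenURI::HTTPRedirect and its `uri` reader come from open-uri.

```ruby
require "open-uri"

# Hypothetical error class, named here for illustration only.
class TooManyRedirects < StandardError; end

# Hypothetical helper: yields the current URI to the block. Whenever the
# block raises OpenURI::HTTPRedirect, the block is retried with the redirect
# target, at most max_redirects times.
def follow_redirects(uri, max_redirects: 2)
  redirects_left = max_redirects

  begin
    yield uri
  rescue OpenURI::HTTPRedirect => redirect
    raise TooManyRedirects, "too many redirects" if redirects_left.zero?

    redirects_left -= 1
    uri = redirect.uri # target taken from the "Location" response header
    retry
  end
end
```

It would be used as `follow_redirects(uri, max_redirects: 5) { |u| u.open(redirect: false) }`. Raising a dedicated error rather than re-raising the last redirect is a design choice for this sketch, so callers can tell "gave up" apart from an ordinary HTTP error.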

Maximum filesize

Since the URL can sometimes come from user input, I wanted to give Shrine users the ability to limit the maximum filesize of the remote file. Specifically, I wanted the download to abort as soon as the “Content-Length” header reveals that the file will be too large. Luckily, open-uri has the :content_length_proc option, which calls the given proc as soon as open-uri reads “Content-Length”:

    uri.open(
      content_length_proc: ->(size) { raise FileTooLarge if size > max_size },
    )

However, an attacker could theoretically create an app which returns large files, but where the “Content-Length” response header is omitted on purpose. Luckily, open-uri has got our back on this one too with :progress_proc, which calls the given proc whenever a chunk is downloaded, passing the size downloaded so far. That means we can add it as a fallback in case “Content-Length” is missing:

    uri.open(
      content_length_proc: ->(size) { raise FileTooLarge if size && size > max_size },
      progress_proc:       ->(size) { raise FileTooLarge if size > max_size },
    )
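Since both procs perform the same comparison, they can be built once from a single limit. The sketch below assumes a hypothetical helper `size_limit_options` and defines its own `FileTooLarge` error for illustration; only the :content_length_proc and :progress_proc option names come from open-uri.

```ruby
# Hypothetical error class, named here for illustration only.
class FileTooLarge < StandardError; end

# Hypothetical helper: builds the two open-uri options that enforce max_size.
def size_limit_options(max_size)
  check = lambda do |size|
    # size is nil when the "Content-Length" header is missing
    if size && size > max_size
      raise FileTooLarge, "file is larger than #{max_size} bytes"
    end
  end

  { content_length_proc: check, progress_proc: check }
end
```

The resulting hash can be passed straight through, e.g. `uri.open(size_limit_options(5 * 1024 * 1024))`, or merged with other options first.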

User agent

It turns out that when we’re making requests to an application, but we don’t include a “User-Agent” header, most applications will start rejecting our requests after some time.

Open-uri doesn’t include a “User-Agent” by default, but allows us to easily add one, since open-uri treats any option with a string key as a request header:

    uri.open("User-Agent" => "MyApp/1.0")

Result

The result of this investigation is the Down gem, which incorporates all of these improvements, and more. You can use it like this:

    require "down"

    result = Down.download("http://example.com/image.jpg")
    result #=> #<Tempfile:/var/folders/k7/6zx6dx6x7ys3rv3srh0nyfj00000gn/T/20160524-10403-xpdakz.jpg>

More advanced downloading could look something like this:

    Down.download "http://example.com/image.jpg",
      max_size: 20 * 1024 * 1024, # 20 MB
      max_redirects: 5,           # default is 2
      proxy: "http://proxy.com"   # delegates to open-uri

Conclusion

I like that I was able to make a lightweight wrapper around open-uri, which already had most of the features that I wanted, and which let me add the ones that were missing. If you want to use open-uri, but without any of the mentioned quirks, consider using Down.