Lots of OkHttp and Retrofit users have reported bugs complaining that URL special characters (like + , ; , | , = , and * ) weren’t encoded as they expected.

Why can’t HttpUrl just encode ; as %3B ?

Extra escaping is safe in most programming languages and document formats. For example, in HTML there’s no behavioral consequence to replacing a non-delimiter " character with its escape sequence &quot; . Or in JSON, it’s safe to replace the string "A" with "\u0041" .

But URLs are different because URL encoding is semantic: you cannot encode a URL without changing it. This is weird!

Too Much Encoding

Suppose we’re looking up 100% on DuckDuckGo. Since the code point for % is 0x25, that character encodes as %25 and the whole URL is https://duckduckgo.com/?q=100%25.

But what if we encode the already-encoded URL of that query? We would double-encode the % as %2525 and end up searching for 100%25 . Yuck: https://duckduckgo.com/?q=100%2525.
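Here is a minimal sketch of that mistake using only the JDK. It assumes java.net.URLEncoder is a fair stand-in for a query encoder; it implements form encoding (spaces become + ), which is close enough for this input. The class and method names are made up for the example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class DoubleEncoding {
  /** Form-encodes a query value. Illustrative helper, not an OkHttp API. */
  static String encode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  static String decode(String value) {
    return URLDecoder.decode(value, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    String once = encode("100%");  // the % becomes %25
    String twice = encode(once);   // the % of %25 is encoded again

    System.out.println("https://duckduckgo.com/?q=" + once);
    System.out.println("https://duckduckgo.com/?q=" + twice);

    // The server decodes exactly once, so it sees a different search term.
    System.out.println(decode(once));
    System.out.println(decode(twice));
  }
}
```

Encoding is not idempotent: each pass changes what the receiver will decode, so code has to know whether the string it holds is already encoded.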

Too Little Encoding

Next we’ll search for #1 on Google. We’ll encode # as %23 and get this URL: https://www.google.ca/search?q=%231.

What if we forget to encode the # in the query? Since # is used as a delimiter for the URL’s fragment, we’ll end up with an empty query and a fragment of 1 : https://www.google.ca/search?q=#1.
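The JDK’s java.net.URI shows the same split. This is only a sketch: it assumes URI’s RFC-style parsing is representative of what a server or client does with the two URLs above, and the class name is invented for illustration.

```java
import java.net.URI;

class FragmentDelimiter {
  /** Returns the query exactly as written in the URL, or null if absent. */
  static String rawQuery(String url) {
    return URI.create(url).getRawQuery();
  }

  /** Returns the fragment, or null if the URL has none. */
  static String fragment(String url) {
    return URI.create(url).getFragment();
  }

  public static void main(String[] args) {
    String encoded = "https://www.google.ca/search?q=%231";
    String unencoded = "https://www.google.ca/search?q=#1";

    // Encoded: the # travels inside the query and there is no fragment.
    System.out.println(rawQuery(encoded) + " / " + fragment(encoded));

    // Unencoded: the # ends the query early and starts a fragment.
    System.out.println(rawQuery(unencoded) + " / " + fragment(unencoded));
  }
}
```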

Web servers define their own URLs

Ultimately it’s up to the web server to interpret the URLs requested of it. For example, since ; encodes as %3B some web servers will interpret paths like /foo;bar and /foo%3Bbar to be equal. But others can interpret these differently! Both strategies have consequences for security and performance.

These two URLs differ only in whether the ; character is encoded. Click through them to see that they serve different content.
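To make the two strategies concrete, here is a rough sketch of how a strict and a lenient server might compare request paths. Both matchers are hypothetical; real servers have their own routing rules, and java.net.URI is used here only because it exposes the raw and decoded forms of a path.

```java
import java.net.URI;

class PathMatching {
  /** Hypothetical strict server: compares paths exactly as they were sent. */
  static boolean sameRawPath(String a, String b) {
    return URI.create(a).getRawPath().equals(URI.create(b).getRawPath());
  }

  /** Hypothetical lenient server: percent-decodes both paths before comparing. */
  static boolean sameDecodedPath(String a, String b) {
    return URI.create(a).getPath().equals(URI.create(b).getPath());
  }

  public static void main(String[] args) {
    System.out.println(sameRawPath("/foo;bar", "/foo%3Bbar"));     // false
    System.out.println(sameDecodedPath("/foo;bar", "/foo%3Bbar")); // true
  }
}
```

A client can’t tell from the outside which strategy a server uses, so it shouldn’t assume the two spellings are interchangeable.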

Browsers and Specs

The best URL documents are the IETF’s RFC 3986 and WHATWG’s URL Standard. Browsers also do their thing, and I’ve built my own little catalog of what gets encoded where.

My advice

If you’re defining your own URLs, you’ll save a lot of trouble by avoiding characters like < , > , { , } , + , ^ , & , | , and ; .

Avoid attempting to decode a URL without also decomposing it. Otherwise delimiters like / , ? , and # are made ambiguous.
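As an illustration of decomposing first, this sketch splits the raw path on / and only then decodes each segment. It uses the JDK rather than OkHttp, the example.com URL and the class name are invented, and it assumes an absolute path. URLDecoder does form decoding, so the sketch protects literal + characters before calling it.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class DecodeAfterDecompose {
  /** Splits the still-encoded path on '/', then decodes each segment. */
  static List<String> pathSegments(String url) {
    String rawPath = URI.create(url).getRawPath();
    List<String> result = new ArrayList<>();
    if (rawPath == null || rawPath.isEmpty()) return result;
    for (String raw : rawPath.substring(1).split("/", -1)) {
      // URLDecoder would turn '+' into a space; keep it literal in paths.
      result.add(URLDecoder.decode(raw.replace("+", "%2B"), StandardCharsets.UTF_8));
    }
    return result;
  }

  public static void main(String[] args) {
    String url = "https://example.com/a%2Fb/c";

    // Decompose, then decode: two segments, the first containing a slash.
    System.out.println(pathSegments(url));

    // Decode the whole URL first and the %2F becomes a third path delimiter.
    System.out.println(URLDecoder.decode(url, StandardCharsets.UTF_8));
  }
}
```

Once the whole string has been decoded there’s no way to tell that the first / in a/b was data rather than a delimiter.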

Servers own their URLs. If a server gives you a link to /foo;bar.html , don’t canonicalize it to /foo%3Bbar.html . It’s equally broken to do the opposite, converting /foo%3Bbar.html to /foo;bar.html .

HttpUrl a = HttpUrl.parse("https://publicobject.com/%3B.html");
HttpUrl b = HttpUrl.parse("https://publicobject.com/;.html");

// The decomposed form is decoded.
assertEquals(";.html", a.pathSegments().get(0));
assertEquals(";.html", b.pathSegments().get(0));

// And yet the encoded form is preserved!
assertEquals("/%3B.html", a.encodedPath());
assertEquals("/;.html", b.encodedPath());