Ubuntu privacy blunder over Amazon ads continues

By Etienne Perot <etienne (at) perot (dot) me> on 2012-09-25.

First, some context: There have been quite a few complaints and concerns about Ubuntu's attempt to include advertisements in their operating system, in the form of Amazon-affiliate-tracked results showing up in Unity's Dash interface by default. There have also been some attempts at damage control over this PR disaster, including one by Mark Shuttleworth himself, Ubuntu's Self-Appointed Benevolent Dictator For Life (SABDFL).

To his credit, he isn't pulling any punches or dancing around the question:

Why are you telling Amazon what I am searching for? We are not telling Amazon what you are searching for. Your anonymity is preserved because we handle the query on your behalf. Don't trust us? Erm, we have root. You do trust us with your data already. You trust us not to screw up on your machine with every update. You trust Debian, and you trust a large swathe of the open source community. And most importantly, you trust us to address it when, being human, we err.

One of the statements here is pretty ominous at first: "Don't trust us? Erm, we have root." Mark refers to the fact that system updates are all done as root, and they can indeed slip in any code they want in there, which could include a remote-administration trojan or a little script uploading all of $HOME to Canonical's servers... But doing so would go directly against their users and instantly ruin their reputation. It is reasonable for users to expect that their operating system vendor will not snoop on them. The argument, while technically correct, doesn't hold much water when considering user expectations and Canonical's own business interests.

However, I'd like to challenge one particular passage (emphasis mine):

We are not telling Amazon what you are searching for. Your anonymity is preserved because we handle the query on your behalf.

There's a number of issues here.

"We are not telling Amazon what you are searching for."

The way the search is handled goes as follows:

User begins typing in the Dash search field

An HTTP request (not HTTPS!) is made to a server called productsearch.ubuntu.com, containing the keywords

productsearch.ubuntu.com asks Amazon's API for search results; to do so, it obviously needs to send the search terms. It is unknown whether that query is made over HTTPS or not.

The search results are sent back to the client in a JSON string

The request looks like this:

GET /v1/search?q=test HTTP/1.1
Host: productsearch.ubuntu.com
Accept-Encoding: gzip, deflate
User-Agent: gvfs/1.13.9
Accept-Language: en-ca, en;q=0.9, en;q=0.8
Connection: Keep-Alive

And the response:

HTTP/1.1 200 OK
Date: Tue, 25 Sep 2012 07:17:39 GMT
Server: gevent/0.13.0 gunicorn/0.13.4
Vary: X-Geo-Country
Content-Type: application/json
Content-Length: 44674
X-Cache: MISS from alkes.canonical.com
X-Cache-Lookup: HIT from alkes.canonical.com:3128
Via: 1.0 alkes.canonical.com:3128 (squid/2.7.STABLE7)
Via: 1.1 productsearch.ubuntu.com
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
...

Of course, it is trivial to see why the statement is wrong in the first place: productsearch.ubuntu.com is telling Amazon what you're searching for. What it is not telling is who you are, because (supposedly) the API request doesn't contain any identifying information other than your search terms.

This oversight is most likely just poor wording on Mr. Shuttleworth's part, though. What the sentence is really trying to say is: "We are telling Amazon what Ubuntu users are searching for, but we are not telling them who these users are."

That's fine, although it still raises some important privacy questions. Indeed, this search is performed when the user is using Unity's "Home" lens, which is where you can search for applications, files in your $HOME folder, and now Amazon search results. However, the documents in one's $HOME folder are usually fairly private. Even their filenames alone usually speak volumes (Top-secret plan to kill my boss.doc, Confessions of a (fill in the blank).pdf, How to (fill in the blank).epub, credit-card-(some number).kmy, etc.). They often contain people's names, too. The search terms reveal a lot by themselves about the person typing them. Users are going to be searching for those files in the Home lens, because that is what they have always done and that is what they are used to. Unbeknownst to them, they are now sending these sensitive search terms over plain HTTP, visible to your local sysadmin, your boss (through your sysadmin), your ISP, and who knows, maybe your government (through subpoenas). And then Canonical sees them, and Amazon does too, and any other peer along the ride. The only thing that Canonical is doing is masking your IP address from Amazon.
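To give an idea of how little effort this takes for anyone sitting on the wire, here is a minimal Python sketch that pulls the search terms out of a plain-text request like the one captured earlier. The function name is mine, and the only thing it assumes about the protocol is what the capture shows: a GET request whose keywords travel in the q query parameter.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def extract_search_terms(raw_request: str) -> Optional[str]:
    """Return the value of the `q` parameter of a plain-text HTTP GET request."""
    # The request line is the first line, e.g. "GET /v1/search?q=test HTTP/1.1".
    request_line = raw_request.split("\r\n", 1)[0]
    parts = request_line.split(" ")
    if len(parts) != 3:
        return None
    method, target, _version = parts
    if method != "GET":
        return None
    # parse_qs also decodes "+" and percent-escapes back into readable text.
    values = parse_qs(urlsplit(target).query).get("q")
    return values[0] if values else None


captured = (
    "GET /v1/search?q=test HTTP/1.1\r\n"
    "Host: productsearch.ubuntu.com\r\n"
    "User-Agent: gvfs/1.13.9\r\n"
    "\r\n"
)
print(extract_search_terms(captured))  # prints: test
```

Nothing here requires breaking anything: over plain HTTP, the keywords are simply part of the first line of the request.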

The net result that Canonical claims: Canonical knows IPs and search terms, Amazon knows search terms. The only thing that Amazon doesn't know: Who is searching what.

Now, I'd like to question this claim by simply looking at the Wireshark output from running any search query. Try to do the following:

Install Wireshark

Start it and launch a capture

Open the dash and type a few characters

Check what Wireshark says

You'll see something like this:

GET /images/I/41Qemdr7ieL._SL160_.jpg HTTP/1.1
Host: ecx.images-amazon.com
Accept-Encoding: gzip, deflate
User-Agent: gvfs/1.13.9
Accept-Language: en-us, en;q=0.9
Connection: Keep-Alive

And the response:

HTTP/1.0 200 OK
Date: Mon, 24 Sep 2012 15:55:23 GMT
Server: Server
Cache-Control: max-age=630720000,public
Expires: Wed, 18 May 2033 03:33:20 GMT
Content-Length: 4630
Last-Modified: Wed, 08 Aug 2012 22:34:21 GMT
Content-Type: image/jpeg
Age: 47375
X-Cache: Miss from cloudfront
X-Amz-Cf-Id: aBNnNXkOlBFeFzoYljLrLBE2MTi0TMmDIvZfbzslKOM-8V1Wi9T2sA
Via: 1.0 574341a971a46a2980db13237b8175da.cloudfront.net (CloudFront)
Connection: keep-alive
...

This is simply the Dash downloading the thumbnails that accompany each search result. Each item in the dash has a prominent icon, and a label underneath:

Of course these images need to be downloaded from somewhere. Let's download them from the source, images-amazon.com! What could possibly go wrong? This goes against Mr. Shuttleworth's claim that Amazon doesn't know who is searching what. Indeed, while Amazon can't map search terms to IP addresses, what they can do is log the requests on their images server, and simply look at the name of the corresponding product and figure out what the search terms were. Or simply correlate them with a recent API query received from productsearch.ubuntu.com.
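To make the correlation idea concrete, here is a rough Python sketch of the kind of log join I have in mind. Everything in it is hypothetical: the record types, the field names, the five-second window and the example IP addresses are mine, and I have no knowledge of what Amazon actually logs. It only shows that matching image fetches against recent API queries is a few lines of code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ApiQuery:
    """Hypothetical record of a search relayed by productsearch.ubuntu.com."""
    timestamp: float
    terms: str
    image_paths: Tuple[str, ...]  # thumbnails returned for these terms


@dataclass(frozen=True)
class ImageHit:
    """Hypothetical record of a thumbnail fetched directly by a client."""
    timestamp: float
    client_ip: str
    path: str


def correlate(queries: List[ApiQuery], hits: List[ImageHit],
              window_seconds: float = 5.0) -> Dict[str, List[str]]:
    """Map each client IP to the terms whose thumbnails it fetched shortly after."""
    guesses: Dict[str, List[str]] = {}
    for hit in hits:
        for query in queries:
            delay = hit.timestamp - query.timestamp
            if 0 <= delay <= window_seconds and hit.path in query.image_paths:
                terms = guesses.setdefault(hit.client_ip, [])
                if query.terms not in terms:
                    terms.append(query.terms)
    return guesses


thumbnail = "/images/I/41Qemdr7ieL._SL160_.jpg"
queries = [ApiQuery(100.0, "test", (thumbnail,))]
hits = [
    ImageHit(100.8, "203.0.113.7", thumbnail),   # fetched right after the query
    ImageHit(500.0, "198.51.100.9", thumbnail),  # too late to be linked to it
]
print(correlate(queries, hits))
```

In practice caching on either side would make the match noisier, but the point stands: once the client fetches thumbnails from Amazon directly, the IP address is no longer hidden from them.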

Some additional nitpicks:

Those image requests are done over HTTP as well, even though Amazon provides an SSL version of their image service at ssl-images-amazon.com. Fixing it would be a simple one-line replace in the code. The gain from using SSL for image content isn't enormous, but if it's available, why not use it? Some may argue "for speed". I'd advise these people to try out the Unity Dash search by themselves, and get back to me about how fast it currently is. I doubt speed was a big concern.

The request uses a fairly distinctive User-Agent header: gvfs/1.13.9. GVFS is a component of the GNOME desktop used for filesystem stuff, including mounting WebDAV shares and the like. Unity is likely using the GNOME library to perform these HTTP requests. However, I think there is little reason for that component to ever hit the images-amazon.com domain, other than because of the Unity Dash advertisements. As such, Amazon now has an easy way to identify which image requests result from a Unity Dash search.

The request contains an Accept-Language header which contains the user's locale. It is set to en-us, en if you install the US English version of Ubuntu, but can be set to fr if you install the French language pack and set it as default, and so on. This isn't a huge information leak, but it gives Amazon more data to correlate the terms with, because you probably typed your search terms in that language. At any rate, it is not necessary for Amazon to know the language in order to serve static image files, so why tell them?
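As an illustration of the last two nitpicks, here is a small Python sketch of how an image server could single out Dash-originated requests and read the user's language from the headers alone. The function names are mine, and the heuristic (a gvfs/ User-Agent hitting an images-amazon.com host) is just my guess at what a filter could look like, not something I know Amazon does.

```python
from typing import Mapping, Optional


def looks_like_dash_thumbnail_request(headers: Mapping[str, str]) -> bool:
    """Guess whether an image request came from the Unity Dash."""
    host = headers.get("Host", "").lower()
    agent = headers.get("User-Agent", "")
    # Assumption: hardly anything else fetches Amazon images through GVFS.
    return host.endswith("images-amazon.com") and agent.startswith("gvfs/")


def primary_language(headers: Mapping[str, str]) -> Optional[str]:
    """Return the first language tag listed in Accept-Language, if any."""
    value = headers.get("Accept-Language", "")
    # Drop the lower-priority alternatives and any ";q=" weight.
    first = value.split(",")[0].split(";")[0].strip()
    return first or None


# Headers taken from the captured thumbnail request.
captured_headers = {
    "Host": "ecx.images-amazon.com",
    "Accept-Encoding": "gzip, deflate",
    "User-Agent": "gvfs/1.13.9",
    "Accept-Language": "en-us, en;q=0.9",
    "Connection": "Keep-Alive",
}
print(looks_like_dash_thumbnail_request(captured_headers))  # prints: True
print(primary_language(captured_headers))                   # prints: en-us
```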

I have filed a bug about all of these issues.

So there we have it. Something which may have started from good intentions ("Let's have the Dash search the web to provide users with richer search results!") turned into something much worse ("Let's put irrelevant revenue-generating advertisements on by default in a place where the user is likely to type private information and wouldn't expect that information to be sent out to anyone!") through a series of oversights. This was pushed through Ubuntu's Feature Freeze period because it had executive support from the top people, and its release was rushed through with little regard to the users' interest (there was no warning that this was coming), or to the PR disaster that was inevitably going to follow.

Oh, and did I mention that, privacy concerns aside, advertisements in an operating system are not a good idea in the first place? It's an intrusion into the user's personal space, and it drowns the search results in inconsistent, unnecessary, inappropriate, slow-loading, irrelevant noise that sometimes replaces existing local search results. It's especially annoying when you're about to click on one of these, and suddenly what you're clicking on just turned into an ad.

For the record: I don't use Ubuntu personally, although I tend to recommend it to non-technically-inclined people who want to try out a Linux distribution. This whole easily-avoidable advertising mess would make me change my tune.

How to fix it

So now that the damage has been done, how do we get things straightened out?

Step 0: Reconsider

It's not too late to reconsider everything, and to dismiss the idea entirely. There's plenty of justification for that in this very post, or in the comments thread of the main Launchpad bug report. People won't forget what happened, but they will certainly appreciate such a decision because it means that their complaints have been heard.

Step 1: Update your privacy policy

This is a no-brainer. If you're going to gather more data about your users than you previously did, you need to update the privacy policy.

Thankfully, there is already a bug report about this, so this is on Canonical's radar.

Step 2: Make things clear to users

Users don't read privacy policies. It's important to have one, but users won't read it. Yet, they need to be aware of what is happening to their own data. To this end, I propose the following solution:

Whenever the current lens is going to communicate with the Internet, replace the looking glass icon in the text field by a globe icon.

Whenever there is a web request actively going on, make the globe rotate (as opposed to the spinner animation currently in use for local searches).

Whenever the globe icon is clicked, open a little panel explaining to the user the implications of the search they are about to make.

This makes it clear that there is something going on that will send data over the network, and it gives the user easy access to more detailed information about what exactly is going to happen.

Here's a quick mockup of what this could look like (though it needs better fonts and icons):

Think this message sounds scary? That's true. But then again, so is sending sensitive search terms to various unrelated third parties.

Step 3: Make it opt-in rather than opt-out

This is pretty self-explanatory. Any feature that goes against user expectations when enabled by default should be opt-in.

At the very least, it should be easy for the user to remove this feature. Currently, it isn't: The user needs to remove the unity-lens-shopping package:

$ sudo apt-get purge unity-lens-shopping

This is neither user-friendly nor obvious. Canonical plans to address this, though they do not intend to make it opt-in at this time.

Step 4 option A: Make your actual strategy match your intended one

The current strategy doesn't respect the privacy guarantees that Canonical wants to provide. To fix this, here is what needs to happen:

Make the Dash use SSL/TLS when talking to productsearch.ubuntu.com (this is already in Canonical's plans)

Open up the source code used on the backend servers at productsearch.ubuntu.com (why not?)

Make the request from productsearch.ubuntu.com to Amazon use SSL as well. There's no reason not to, and having both hops over SSL strengthens the guarantee that only Canonical and Amazon can see the search terms.

Include the thumbnails of each item inside the reply from productsearch.ubuntu.com to the user. Use the data URI scheme to do that, or have the client request it by itself from productsearch.ubuntu.com (not Amazon), over SSL as well.
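For the thumbnail suggestion, here is a minimal Python sketch of what embedding an image through the data URI scheme could look like on the server side. The JSON field names (title, thumbnail) are made up for the example; I don't know what the actual reply from productsearch.ubuntu.com looks like beyond it being JSON.

```python
import base64
import json


def to_data_uri(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data: URI that can be embedded in a reply."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_result(title: str, thumbnail: bytes) -> str:
    """Build one search result with its thumbnail inlined (hypothetical fields)."""
    return json.dumps({"title": title, "thumbnail": to_data_uri(thumbnail)})


# Placeholder bytes standing in for a real JPEG thumbnail.
print(build_result("Some product", b"abc"))
```

Base64 makes each image about a third larger, which seems like a fair price given that thumbnails are only a few kilobytes and the client no longer needs to contact Amazon at all.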

Step 4 option B: Actually make search terms anonymous

There is a relatively easy solution for Canonical to provide full search terms anonymization, such that Canonical only knows the IP of Ubuntu users (but not what they're searching), and Amazon only knows what Ubuntu users are searching for (but not who is searching what).

To pull this off, all Canonical needs to do is to set up a relay server instead of the current web server at productsearch.ubuntu.com. That relay server would simply forward whatever it gets from a client to Amazon, and send everything it got from Amazon back to the original client.

The client would effectively be performing an Amazon API request directly, using SSL, and Canonical's server would simply forward the encrypted bits along. This way, Canonical doesn't get to see which search terms are sent, thus any logging they may do would be useless. Amazon would see the search terms, but the only IP they would get is the Canonical server's IP. Users would still need to be warned that they shouldn't type identifying information as search terms, so that Amazon cannot link those search terms back to the users.
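A relay of this kind is not much code. Below is a rough Python sketch using asyncio that blindly copies bytes in both directions; the function names and parameters are mine, and the upstream host and port are left as arguments because I won't guess at Amazon's API endpoint. It assumes the client performs the SSL handshake end to end with the upstream server (verifying the upstream's certificate, not the relay's), so the relay only ever sees ciphertext.

```python
import asyncio


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while True:
            chunk = await reader.read(4096)
            if not chunk:
                break
            writer.write(chunk)
            await writer.drain()
    except ConnectionError:
        pass  # One side went away; nothing more to forward.
    finally:
        writer.close()


async def start_relay(listen_host: str, listen_port: int,
                      upstream_host: str, upstream_port: int) -> asyncio.AbstractServer:
    """Start a server that forwards every connection, byte for byte, to the upstream."""

    async def handle(client_reader: asyncio.StreamReader,
                     client_writer: asyncio.StreamWriter) -> None:
        try:
            up_reader, up_writer = await asyncio.open_connection(
                upstream_host, upstream_port)
        except OSError:
            client_writer.close()
            return
        # The relay never parses the traffic; it just shuttles it both ways.
        await asyncio.gather(
            pipe(client_reader, up_writer),
            pipe(up_reader, client_writer),
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

To run it, one would await start_relay(...) with a real upstream address and then call serve_forever() on the returned server. Rate limiting and IP blocking, which the abuse scenario would call for, are deliberately left out of this sketch.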

One of the consequences of this approach is that productsearch.ubuntu.com could now easily become a publicly-available spam relay towards Amazon's API servers. While I doubt that Amazon's API could be brought down solely from traffic coming from a Canonical server (my guess is that the Canonical server would crash and burn long before this happens), such a situation could potentially be solved through abuse complaints from Amazon to Canonical, asking Canonical to block certain IPs from sending further requests.

The downside of this system, of course, is that Canonical doesn't get to see the search terms. They claim they need to gather the search terms and click data so that they can "provide better, more relevant results", in order to make the user experience better.

I have an alternative suggestion for Canonical to make the user experience better: Allow users to rate search results. Add a little section to the Dash under the Amazon results that asks "Were these results relevant?", and corresponding "Yes"/"No" buttons. The data from these buttons will be a more precise metric than the current metric: "whatever the user clicks is relevant".

And if you're telling yourself: "This will never work! Users will click 'No' all the time!", then perhaps you should ask yourself whether this feature was really made with the users' interest at heart in the first place.