I have a copy of Amazon. Meaning that, on my hard drive, there is a massive chunk of Amazon’s product and reviews database—a listing of nine million or so products and 80 million or so reviews taken from 1996 to 2014. The names of all the books in that chunk, their sales ranks, their categories. Every pair of pants for kids, every sock. All the books about Hitler; all the books about snakes. All the different Lego sets. Whatever.

The way I came to be in possession of this thing is that someone tweeted that it existed. I visited a web page, sent an email to a researcher at UC San Diego, and was sent a link to download the data, with a request to cite a paper associated with it, which was presented at the 2015 SIGIR conference: “Image Based Recommendations on Styles and Substitutes.” The data totaled about 20 gigabytes, compressed.

Illustration by Serafine Frey.

It’s not a perfect copy by any means, but neither is it a pirated one. Rather, it is “spidered” data, culled by automatically visiting Amazon’s web site and copying what is found, adding it up, aggregating it. One could do the same with Walmart.com, or with any big company. But Amazon is a special case: It is possibly the most purely optimized commercial enterprise in history, marrying hard computer science to ruthless labor practices in pursuit of delivering brown, branded boxes to anyone who might conceivably want them. It knows so much about us, and we know so little about it. Walmart has done terrible things for longer, but in comparison seems so amateur. Amazon is out for the world. And I write this as a hypocrite. Who knows how many Amazon boxes are on their way to my house? They show up daily sometimes. Fear is the coin flip of admiration.

In the data, the books don’t have authors, many prices are missing (and I can’t find any prices above $999.99), and there are other gaps besides. Nonetheless, it’s what was granted me. A conglomerate in a teacup. I decided to absorb the data into a database. The first draft of the code I wrote to do so informed me that it would take 25 days of processing to complete. That was too long. Also I was out of hard drive space. So I went to a store and bought a computer, a big, boxy, unfashionable PC with a 4-GHz quad-core processor and ten terabytes of extra hard-drive space, installed Linux on it, and got the most recent version of the PostgreSQL database. I could have done all this in the cloud of course, but it’s harder to just mess around in the cloud, and there’s something very comfortable about having your own big machine next to your knee. Besides, the cloud I know best is Amazon’s, and I didn’t want to get conflicted.

With the help of that machine and quite a few database tricks to massage and extract the data, I got 25 days down to one, with searchable titles, descriptions, and reviews. Seven days of programming and one day of absorption to beat one day of programming and 25 days of absorption: a pretty familiar set of trade-offs. You’re always trying to balance your time against the computer’s, but there’s also the challenge of the thing. I probably should have just let it run for four weeks.
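For what it’s worth, the classic trick behind that kind of speedup is to stop doing row-at-a-time INSERTs and instead feed PostgreSQL’s bulk COPY command a pre-flattened stream of tab-separated rows. Here is a minimal sketch in Python of the flattening half; the field names (`asin`, `title`, `price`) are my guesses at the data’s shape for illustration, not the actual schema, and this is one plausible version of the trick rather than a record of exactly what was done:

```python
import gzip
import json


def record_to_copy_row(record, fields):
    """Flatten one JSON record into a tab-separated line for COPY.

    PostgreSQL's COPY text format uses tab as the field delimiter and
    newline as the row delimiter, so both must be escaped inside values;
    the sequence \\N marks a NULL (a missing price, say).
    """
    out = []
    for name in fields:
        value = record.get(name)
        if value is None:
            out.append(r"\N")
        else:
            out.append(
                str(value)
                .replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
            )
    return "\t".join(out) + "\n"


def gz_json_to_copy_stream(path, fields):
    """Yield COPY-ready lines from a gzipped file of one-JSON-object-per-line records."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield record_to_copy_row(json.loads(line), fields)
```

Piped into `COPY products FROM STDIN` (via, say, psycopg2’s `copy_expert`), a stream like this avoids per-statement overhead and is routinely an order of magnitude or more faster than individual INSERTs—which is roughly the difference between 25 days and one.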