Everyone who lives in a busy metropolis like San Francisco has had to deal with rising housing prices and decreasing availability. Business Insider reported that the median rent for a 1-bedroom in San Francisco was $3,100/month – officially making it more expensive than New York.

But did you know that rental services like Airbnb, Homeaway and Flipkey could be partly to blame? In a recent article, the San Francisco Chronicle analysed the latest data from these sites to determine just how bad these services are for the real estate market.

We were very excited to work with Carolyn from the Chronicle to help her get the data for this story. Read on to see just how we gathered the data behind the news.

Finding all the properties in SF

In order to do her analysis, Carolyn wanted data (price, region, landlord’s name, number of bedrooms etc.) on all the rentals in San Francisco from Flipkey and Homeaway (she already had data on Airbnb).

To find all the rentals, we decided to do a search on each site for “San Francisco” and use Magic to extract the search results. This gave us an API that extracted the links for each house’s “profile page”.

Images: search results; Magic extractor

In order to get all the subsequent pages, we needed to generate the URLs for the rest of the results. This was pretty simple as the pagination is nicely represented in the URL:

Using Excel, we were able to generate the correct number of URLs needed to get all of the search results. Once we had a big list of all the URLs, we put them through Bulk Extract to get the link to every rental in the San Francisco area (Flipkey: 434, Homeaway: 1128).
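If you'd rather script this step than use a spreadsheet, here's a minimal sketch of the same idea. The URL template and the number of results per page are assumptions made up for illustration (the real sites use their own URL patterns), and `page_urls` is a hypothetical helper name.

```python
import math


def page_urls(template, total_results, per_page):
    """Build one search URL per results page, starting at page 1."""
    pages = math.ceil(total_results / per_page)
    return [template.format(page=n) for n in range(1, pages + 1)]


# Assumed URL pattern and page size; swap in the real ones for each site.
TEMPLATE = "https://example.com/search/san-francisco?page={page}"
urls = page_urls(TEMPLATE, total_results=434, per_page=20)
```

The resulting list is what you would then feed into a bulk extraction.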

With these links in hand, we were able to move on to stage 2…

Getting property information

Now you’ll no doubt notice that so far we’ve only collected links. The real data is on the rental “profile pages”. So, we built another API to get the data out of the profile pages.

Images: listing profile page; profile page as data

Then we ran our first list of URLs through another Bulk Extract to get the profile information for all the properties.

Data issues

In the course of building our second API we ran into a few snags that needed to be solved in order to collect all the data.

The first was collecting things like the number of bedrooms and bathrooms, property types etc. This seemed simple at first, but because the data moved around from page to page quite a lot, we had to use a manual XPath to anchor the extraction to a specific word such as “bedroom” or “bathroom”. For example, this XPath…

…looks for the word “Bathrooms” and then extracts the data from the next element on the webpage (which should be the number of bathrooms in the property).
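To make the anchoring idea concrete, here's a small sketch using only Python's standard library. The two page fragments are invented for illustration and are not the real Flipkey or Homeaway markup. In a full XPath engine, an expression along the lines of `//*[normalize-space(text())='Bathrooms']/following-sibling::*[1]` would express the same lookup; the code below does the equivalent by hand because the standard library's XPath support doesn't include sibling axes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragments of two profile pages.
# The same fields appear in a different order on each page.
PAGE_A = (
    "<ul><li><span>Bedrooms</span><span>2</span></li>"
    "<li><span>Bathrooms</span><span>1</span></li></ul>"
)
PAGE_B = (
    "<ul><li><span>Bathrooms</span><span>3</span></li>"
    "<li><span>Bedrooms</span><span>4</span></li></ul>"
)


def value_after_label(markup, label):
    """Return the text of the element right after the one whose text is `label`."""
    root = ET.fromstring(markup)
    for parent in root.iter():
        children = list(parent)
        # Skip the last child: it has no following sibling to read from.
        for i, child in enumerate(children[:-1]):
            if (child.text or "").strip() == label:
                return (children[i + 1].text or "").strip()
    return None  # label not found on this page
```

Because the lookup is tied to the label rather than to a position on the page, it returns the right value even when the fields swap places.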

We ran into another issue when we tried to see how many properties each user has. Most of the hosts just use their first name, which isn’t very helpful, so we had to build another API for their owner page to get their user ID.
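Here's a quick sketch of why the user ID matters. The rows, field names and IDs below are made up for illustration; the point is only that counting by first name merges different hosts, while counting by ID keeps them apart.

```python
from collections import Counter

# Hypothetical rows: two different hosts both go by "Sam".
listings = [
    {"url": "https://example.com/p/1", "host_name": "Sam", "owner_id": "u100"},
    {"url": "https://example.com/p/2", "host_name": "Sam", "owner_id": "u200"},
    {"url": "https://example.com/p/3", "host_name": "Sam", "owner_id": "u100"},
]


def properties_per(rows, field):
    """Count how many listings share each value of `field`."""
    return Counter(row[field] for row in rows)


by_name = properties_per(listings, "host_name")  # lumps both Sams together
by_id = properties_per(listings, "owner_id")     # separates the two hosts
```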

Finally, there was the problem of latitudes and longitudes, which we needed to dedupe properties across websites. These are not easy to find on the page itself, but you can find (and extract) them by looking in the page’s source code.
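As a rough illustration of pulling coordinates out of page source, here's a sketch that uses a regular expression. The sample snippet and its `"latitude"`/`"longitude"` keys are assumptions; each site embeds coordinates in its own way, so the pattern would need adjusting.

```python
import re

# Assumed shape of the embedded data; real pages will differ.
SAMPLE_SOURCE = """
<script>
  var listing = {"id": 42, "latitude": 37.7509, "longitude": -122.4337};
</script>
"""

NUMBER = r"(-?\d+(?:\.\d+)?)"
COORD_RE = re.compile(
    r'"latitude"\s*:\s*' + NUMBER + r'\s*,\s*"longitude"\s*:\s*' + NUMBER
)


def extract_coordinates(page_source):
    """Return (lat, lng) as floats, or None if no coordinates are found."""
    match = COORD_RE.search(page_source)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```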

Cleansing the data

Once we had our massive data set, we downloaded it as a CSV, but before we could hand that data over to Carolyn for analysis, we needed to do a little extra work to clean it up. The first task was to de-duplicate all the properties across the different websites using their coordinates – surprisingly, there weren’t that many duplicates.
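One simple way to de-duplicate by coordinates is to round them and treat matching pairs as the same property. This is a sketch under assumptions, not necessarily how we did it: the field names and the four-decimal precision are illustrative, and rounding can miss two points that sit either side of a rounding boundary.

```python
def dedupe_by_coordinates(rows, precision=4):
    """Keep the first listing seen at each rounded (lat, lng) pair."""
    seen = set()
    unique = []
    for row in rows:
        key = (round(row["lat"], precision), round(row["lng"], precision))
        if key in seen:
            continue  # same spot already kept from another site
        seen.add(key)
        unique.append(row)
    return unique


# Hypothetical rows: the first two describe the same property on two sites.
rows = [
    {"site": "flipkey", "lat": 37.75091, "lng": -122.43372},
    {"site": "homeaway", "lat": 37.75093, "lng": -122.43374},
    {"site": "homeaway", "lat": 37.80010, "lng": -122.41000},
]
deduped = dedupe_by_coordinates(rows)
```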

We also found that although all the properties we extracted came up in the San Francisco search, a few of them weren’t actually in San Francisco. Apparently, listing your property in SF (even when it isn’t) helps you get a higher price.
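A crude way to flag out-of-town listings is a bounding-box check on the coordinates. The box below is only a rough approximation of the city and is an assumption for illustration; a proper check would use the real city boundary.

```python
# Approximate bounding box for San Francisco (an assumption, not an
# official boundary).
SF_BOX = {"lat_min": 37.70, "lat_max": 37.84, "lng_min": -122.52, "lng_max": -122.35}


def in_san_francisco(lat, lng, box=SF_BOX):
    """Return True if the point falls inside the rough bounding box."""
    return (
        box["lat_min"] <= lat <= box["lat_max"]
        and box["lng_min"] <= lng <= box["lng_max"]
    )


# Hypothetical listings: one in the city, one across the bay.
listings = [
    {"name": "City flat", "lat": 37.7502, "lng": -122.4337},
    {"name": "East Bay house", "lat": 37.8044, "lng": -122.2712},
]
kept = [row for row in listings if in_san_francisco(row["lat"], row["lng"])]
```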

Analyzing for insights

Armed with our clean data set, we began combing the data for insights into the rental market. For the full analysis I highly recommend you check out Carolyn’s article, but I’ll give you a few highlights here.

On average, SF properties rent for between $282 and $302 per night

Noe Valley has the most properties available to rent

Some owners (with multiple properties) earn up to $10,000/day from renting
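Figures like these come from fairly simple aggregations over the cleaned rows. Here's a sketch of the kind of calculation involved; the rows are invented placeholders rather than data from the analysis, and the per-owner total assumes every property is booked for the night.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical cleaned rows (not real data).
rows = [
    {"neighborhood": "Noe Valley", "price_per_night": 250, "owner_id": "u1"},
    {"neighborhood": "Noe Valley", "price_per_night": 300, "owner_id": "u1"},
    {"neighborhood": "Mission", "price_per_night": 200, "owner_id": "u2"},
]

# Average nightly price across all listings.
average_price = mean(row["price_per_night"] for row in rows)

# Neighborhood with the most listings.
top_neighborhood, top_count = Counter(
    row["neighborhood"] for row in rows
).most_common(1)[0]

# Potential earnings per owner if all their listings are booked for one night.
nightly_by_owner = defaultdict(int)
for row in rows:
    nightly_by_owner[row["owner_id"]] += row["price_per_night"]
```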

Pretty cool, eh? By extracting data from the web, we were able to help the SF Chronicle do some really cool in-depth analysis of the housing market in San Francisco. Plus, because we’ve done it using APIs, it’ll be really easy to re-extract the data next year to see how things have changed.

Data Journalism as a Service

If you’re a journalist who wants to start using data, you’re in luck. We’re offering a Data Journalism service where we work with you to find, extract and clean the data you need to create an awesome story.
