Data-mixing and Web-scraping tool import.io has added a range of new features to its free service, promising even easier ways to turn any Web page into an API, instantly. As it attracts investment and interest from some of the biggest names in data search, co-founder and chief data officer Andrew Fogg speaks to ProgrammableWeb about the new tools and future directions.

“The vision for us is that you should be able to point import.io to any Web page and immediately get the data sorted into rows and columns,” explains Fogg.

“We have put a great deal of investment in our algorithm team,” he says. “There have been a couple of papers out lately on Web extraction, and we have been looking at what’s out there and been building on that as well as doing something different, and the results we are able to achieve are constantly getting much better.”



Fogg also points to a unique way in which aspects of both the “sharing economy” and the freemium business model are coming together to create a better product and a more viable long-term business. Import.io’s algorithms are built on understanding the different ways that Web pages display data, so the more users who remix Web data into their own APIs via the import.io tool, the more the algorithms can be fine-tuned to recognize data displayed in a wider variety of formats.

“The key benefit for the community of us taking this free platform strategy is that it helps us make the product better,” says Fogg. “We have had hundreds of thousands of APIs. Customers are creating around 500 new APIs with import.io every day. With hundreds of thousands of APIs to look at, we are able to make our algorithms better.”

Part of a Network of Data Startups

Last week, import.io also received a round of investment funding from key data industry founders including Jerry Yang (co-founder of Yahoo), Louis Monier (co-founder of AltaVista), and David Axmark and Michael “Monty” Widenius (co-founders of MySQL). Fogg confirms that this investment will go predominantly into further enhancements to the algorithms that import.io has created to understand data on the Web, but he also points to the networks and fellow startups that are made more accessible by being welcomed into the investor fold:

“In a very short time, these investors have supported Evernote, Curbside, Altiscale, Layer, Treasure Data and Zendrive. These are some really nice portfolio companies, all with principal interests in mobile, cloud and big data. So, absolutely, being part of those networks is a great benefit. And if you look at people like David Axmark and Monty Widenius, as the founders of MySQL they have been entrepreneurs, and very clearly, they enjoy working with founders and slightly early-stage businesses.”

Creating Point-and-Click APIs From Web Pages

In the previous iteration of import.io, users first trained the tool to identify where on a website the data was located. Import.io then collected all of that data, provided it in a spreadsheet and enabled access to it via an auto-generated API.

The new features make use of a more sophisticated algorithm to intuitively identify where the data is located.

For example, the new features allow users to point import.io at a real estate website and automatically start organizing data into rows and columns of housing data. In cases where import.io gets it wrong, users can still train the tool by showing it where the row and column data is located, but in the main, this process is now automated.

Above: Import.io automatically detects where data is stored in this real estate website, correctly identifying different data elements and storing them in separate columns. From here, developers can click a publish API button and instantly create a machine-readable data source from this Web page.

More Data Querying In Situ

Once the data is in a spreadsheet, developers are able to test calls and see the JSON responses automatically without building the end code that is necessary to integrate the API into an application or solution.

“We have a lot of advanced users,” says Fogg. “Developers want to go straight into Google Sheets and validate the data. Once you have validated that, you are getting data back and it is all good and you want to start integrating.

“Being able to see the JSON call and response right there is a massive benefit for developers.”
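As a rough illustration of why seeing the JSON response early helps, the sketch below flattens a response body into spreadsheet-style rows. The payload shape, the `results` key and the column names are assumptions made for this example, not import.io's documented schema.

```python
import json

# Hypothetical response body. The "results" key and the field names are
# assumptions for illustration only, not import.io's documented schema.
SAMPLE_RESPONSE = """
{
  "results": [
    {"address": "12 Elm Street", "price": "450000", "bedrooms": "3"},
    {"address": "8 Oak Avenue", "price": "615000", "bedrooms": "4"}
  ]
}
"""


def rows_from_response(body: str, columns: list[str]) -> list[list[str]]:
    """Flatten a JSON response into one list of cell values per record."""
    payload = json.loads(body)
    return [
        [record.get(column, "") for column in columns]  # empty cell if a field is missing
        for record in payload.get("results", [])
    ]


rows = rows_from_response(SAMPLE_RESPONSE, ["address", "price", "bedrooms"])
for row in rows:
    print(row)
```

Checking a handful of rows this way, before writing any integration code, is the kind of validation step Fogg describes.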

For those with more specific data querying needs, the update release also allows users to edit XPath and regex queries as they point import.io at Web page data.
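To make the idea of a per-column XPath concrete, here is a minimal sketch using Python's standard library. It is not import.io's implementation: the sample markup, the `ROW_XPATH` and `COLUMN_XPATHS` names and the column labels are all hypothetical, and `xml.etree.ElementTree` supports only a limited subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real-world HTML usually needs a
# more forgiving parser than ElementTree.
PAGE = """
<html><body>
  <div class="listing"><h2>12 Elm Street</h2><span class="price">$450,000</span></div>
  <div class="listing"><h2>8 Oak Avenue</h2><span class="price">$615,000</span></div>
</body></html>
""".strip()

# One XPath selects the repeating row element; one XPath per column is
# evaluated relative to each row. All names here are illustrative.
ROW_XPATH = ".//div[@class='listing']"
COLUMN_XPATHS = {
    "address": "./h2",
    "price": "./span[@class='price']",
}


def extract_rows(page: str, row_xpath: str, column_xpaths: dict[str, str]) -> list[dict[str, str]]:
    """Return one dict per matched row, keyed by column name."""
    root = ET.fromstring(page)
    rows = []
    for node in root.findall(row_xpath):
        row = {}
        for name, xpath in column_xpaths.items():
            # findtext returns the text of the first match, or the default
            row[name] = node.findtext(xpath, default="")
        rows.append(row)
    return rows


listings = extract_rows(PAGE, ROW_XPATH, COLUMN_XPATHS)
for listing in listings:
    print(listing)
```

Changing a single entry in `COLUMN_XPATHS` redirects just that column, which is the spirit of a per-column override.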

"On any column of data, you can specify an XPath override,” says Fogg. “You can also specify a regex override, and this allows you to do some powerful things.

"For example, import.io allows you to build an API for a specific website. We see it being used in a general way, extracting all the HTML from a Web page, then using a regex to just pull out social handles. What this allows users to do is to feed in, say, 10,000 home pages and get the social handles on all of them without training different tools to extract the right information," he says.

Tutorials on using the new features, including the JSON and XPath/regex query builders, are available. Developers can sign up for a free account on the import.io website.

ProgrammableWeb readers can also review how to use import.io to create a data business and review our suggested guidelines on how to protect your rights when data scraping from the Web.