How to Export Thousands of Pages into Excel

This guide will show you how to efficiently extract thousands of pages for analysis.

One frequent request from clients is, “How do you extract repetitive information from multiple pages into Excel?” Even if you are a programmer who is good at web scraping, you need to spend a long time (hours to days) analysing the structure of all the pages and writing code accordingly. Listly can solve this problem in minutes with just a few clicks.

Identifying content to extract

Here, we will use a lawyer association’s directory, which is already open to the public:

http://www.hklawsoc.org.hk/pub_e/memberlawlist

Each page (Fig 1.) has one person’s contact information. We will use Listly to extract the URL of each of these pages, and then export the information on them into Excel. Here are the steps to achieve this.

1) Extract the URL from each hyperlink associated with a lawyer name

2) Add these extracted URLs into Listly’s “bulk URL feature”

3) Generate a spreadsheet that contains information fields on each lawyer

Most data extraction tools on the web have trouble capturing data when the content differs from page to page. In this case, we can see that some information is omitted for some lawyers (Fig 2. and Fig 3.). Building a tool that takes all of these cases into account would be extremely tedious. Your main work should be data analysis, not data cleansing.

Fig 1. http://www.hklawsoc.org.hk/pub_e/memberlawlist/member.asp?id=938699

Fig 2. Compared to Fig 1., some fields are omitted.

Fig 3. Sometimes there is no information at all.
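
To illustrate why pages with omitted fields are awkward to handle by hand, here is a minimal Python sketch that forces every record into the same set of columns and leaves missing fields blank. The column names and sample records are hypothetical (the directory's real field labels are not listed in this guide), and this is not how Listly works internally.

```python
import csv
import io

# Hypothetical column names; the real directory's field labels may differ.
COLUMNS = ["name", "firm", "email", "phone"]


def to_row(record):
    """Return a dict with every column present; missing fields become ''."""
    return {col: (record.get(col) or "").strip() for col in COLUMNS}


def write_csv(records):
    """Serialise records to CSV text with a fixed header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))
    return buf.getvalue()


# Made-up records mirroring the three cases in Fig 1. to Fig 3.
sample = [
    {"name": "Example A", "firm": "Example Firm",
     "email": "a@example.com", "phone": "0000 0000"},  # all fields present
    {"name": "Example B", "firm": "Example Firm"},     # some fields omitted
    {},                                                # no information
]

csv_text = write_csv(sample)
print(csv_text)
```

Because every row has the same columns, the resulting file opens cleanly in Excel even when a page had nothing to offer.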

1) Extract the URL from each hyperlink associated with a lawyer name

Scroll to the bottom of the web page and click “next” a few times. The URLs follow a specific pattern, as shown below:

http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=1&sj=0

http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=2&sj=0

http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=3&sj=0

… and so on, until the final page:

http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=186&sj=0

Notice that only the number after pg= changes from one URL to the next: from 1 to 2, then 3, and so on up to 186.
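
Since only the page number changes, the full list of 186 URLs does not have to be typed by hand. The short Python sketch below builds it from the pattern above; the function name listing_urls is just an illustrative choice, and the count of 186 comes from the final page shown above.

```python
# Base address and query pattern taken from the listing URLs shown above.
BASE = "http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp"
LAST_PAGE = 186  # number of the final listing page


def listing_urls(last_page=LAST_PAGE):
    """Return one listing URL per page, from pg=1 up to pg=last_page."""
    return [f"{BASE}?name=&pg={page}&sj=0" for page in range(1, last_page + 1)]


urls = listing_urls()

# Print one URL per line so the list can be copied and pasted elsewhere.
print("\n".join(urls))
```

The output is one URL per line, which is a convenient form for pasting into a bulk input box.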

We will add all 186 URLs into Listly’s “bulk URL” feature. This will generate a spreadsheet that has all the lawyer names with a hyperlink attached to each name.
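
For readers curious what this step amounts to, the Python sketch below pulls each name and its hyperlink out of a listing page using only the standard library. The HTML snippet is made up for illustration (the real page markup is not shown in this guide, and the second id is invented), and this is not a description of how Listly itself works.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (link text, absolute URL) pairs for every <a href=...> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            # Resolve relative links against the listing page address.
            self.links.append((name, urljoin(self.base_url, self._href)))
            self._href = None


def member_links(html, base_url):
    """Keep only links that point at an individual member page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return [(name, url) for name, url in parser.links if "member.asp?id=" in url]


# Hypothetical markup: a made-up stand-in for one listing page.
PAGE_URL = "http://www.hklawsoc.org.hk/pub_e/memberlawlist/mem_withcert.asp?name=&pg=1&sj=0"
SAMPLE_HTML = """
<table>
  <tr><td><a href="member.asp?id=938699">Example Name A</a></td></tr>
  <tr><td><a href="member.asp?id=1">Example Name B</a></td></tr>
</table>
<a href="index.asp">Home</a>
"""

links = member_links(SAMPLE_HTML, PAGE_URL)
for name, url in links:
    print(name, url)
```

Each pair corresponds to one spreadsheet row: the lawyer name in one column and the hyperlink to that lawyer's page in the next.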