Here are the steps I use to collect the data from this table for a range of dates. You can skip to the bottom of the page if you want to see the full code snippet.

Step 1: Load the required packages

import numpy as np
import pandas as pd
import time
import datetime
import re
from selenium import webdriver

Selenium is a tool used to automate tasks and actions with a browser through programming languages such as Python or Java. You can think about it as controlling your web browser with commands that you provide it. We are going to use it to collect data from HTML webpages.

To initialize Selenium, you’ll need to load the webdriver at a given path. (Here is the installation guide for Selenium with Python: http://selenium-python.readthedocs.io/installation.html)

webdriver.Chrome() launches a Chrome instance that Selenium can control, and driver.get() navigates that browser to a given URL.

Step 2: Get a sequence of dates to search for

def list_dates(start, end):
    """Creates a list of dates between the 'start' date and the 'end' date."""
    # create datetime objects for the start and end dates
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    # generate the list of dates between the start and end dates
    step = datetime.timedelta(days=1)
    dates = []
    while start <= end:
        dates.append(start.date())
        start += step
    # return the list of dates in string format
    return [str(date) for date in dates]

I wrote the list_dates() function to produce a sequence of dates from a start date to an end date. This is useful because you won’t need to manually input each date.

# this dictionary maps the month numbers produced by the previous function to full month names
month_dict = {
    1: 'January',
    2: 'February',
    3: 'March',
    4: 'April',
    5: 'May',
    6: 'June',
    7: 'July',
    8: 'August',
    9: 'September',
    10: 'October',
    11: 'November',
    12: 'December'
}

def date_part(data, f_mat='%Y-%m-%d'):
    """Extracts Month, Day, Year from the dates produced by list_dates()."""
    # create a pandas dataframe of dates
    dates = pd.DataFrame(data, columns=['date'])
    date_time = dates['date']
    fld = pd.to_datetime(date_time, format=f_mat)
    for n in ('Month', 'Day', 'Year'):
        dates[n] = getattr(fld.dt, n.lower())
    dates['Month'] = dates['Month'].map(month_dict)
    return dates

The date_part() function breaks the dates produced by list_dates() down into Month, Day, and Year so that they can easily be sent to wunderground to search for a date to collect data from. For example, the two functions together will produce the following pandas dataframe:

start = '2014-1-1'
end = '2014-1-4'
date = list_dates(start, end)
date = date_part(date, '%Y-%m-%d')

Date dataframe used to scrape specific weather dates
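If you want to check the output without a browser, here is a self-contained version of Step 2 (the same functions as above) run on the example range:

```python
import datetime
import pandas as pd

def list_dates(start, end):
    """Creates a list of date strings between 'start' and 'end' (inclusive)."""
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    step = datetime.timedelta(days=1)
    dates = []
    while start <= end:
        dates.append(start.date())
        start += step
    return [str(date) for date in dates]

month_dict = {1: 'January', 2: 'February', 3: 'March', 4: 'April',
              5: 'May', 6: 'June', 7: 'July', 8: 'August', 9: 'September',
              10: 'October', 11: 'November', 12: 'December'}

def date_part(data, f_mat='%Y-%m-%d'):
    """Extracts Month, Day, Year columns from the list_dates() output."""
    dates = pd.DataFrame(data, columns=['date'])
    fld = pd.to_datetime(dates['date'], format=f_mat)
    for n in ('Month', 'Day', 'Year'):
        dates[n] = getattr(fld.dt, n.lower())
    dates['Month'] = dates['Month'].map(month_dict)
    return dates

df = date_part(list_dates('2014-1-1', '2014-1-4'))
# df has columns date, Month, Day, Year;
# the first row is ('2014-01-01', 'January', 1, 2014)
```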

Step 3: Search by location

Now that we have the date dataframe, you will need to provide a zip code so that wunderground can pull up the weather information. You can automate this task with Selenium. If you go to the search bar, right-click it, and click Inspect, you will get the HTML tag information for that search bar. This works for any HTML/JavaScript/CSS element on the page. For example, here you can see the unique ‘id’ of the search bar:

Inspecting source code of wunderground web page

Inside the developer source with the tag information

To send a zip code to the search bar, the following code will look for the search bar by a specific ‘id’ with ‘find_element_by_id’, clear the search bar, send the zip code, and finally submit it to wunderground to search for the city.

zipcode = '91770'
search = driver.find_element_by_id('history-icao-search')
search.clear()  # clears the field
search.send_keys(zipcode)  # sends the zip code to the search field
search.submit()  # submits the zip code to search

Step 4: Search for date

You can search for dates much like in Step 3. By looking at the HTML tags associated with the ‘Weather History Date’ pull-down menus, you’ll see that they contain class attributes such as month, day, and year. We can use these to send information to those pull-down menus. You can iterate through the pandas dataframe produced in Step 2 to search for the weather on each date.

Month = 'March'
Day = '14'
Year = '2018'

month = driver.find_element_by_class_name('month')
month.send_keys(Month)
day = driver.find_element_by_class_name('day')
day.send_keys(Day)
year = driver.find_element_by_class_name('year')
year.send_keys(Year)
year.submit()
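One detail worth noting before you loop over the dataframe from Step 2: send_keys() expects strings, while the Day and Year columns hold integers, so it is safest to cast them with str(). Here is a small sketch of the iteration (no browser needed; the dataframe is hand-built to stand in for Step 2’s output):

```python
import pandas as pd

# hand-built stand-in for the dataframe produced in Step 2
dates = pd.DataFrame({'date': ['2014-01-01', '2014-01-02'],
                      'Month': ['January', 'January'],
                      'Day': [1, 2],
                      'Year': [2014, 2014]})

# the values that would be sent to the month/day/year pull-down menus
to_send = [(v['Month'], str(v['Day']), str(v['Year']))
           for _, v in dates.iterrows()]
# to_send == [('January', '1', '2014'), ('January', '2', '2014')]
```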

Step 5: Collect weather data

We’re now ready to collect the weather data. The weather table is contained in a div tag with the unique ‘id’ of ‘observations_details’. We can use this to scrape the whole weather table with a single line of code:

weatherdata = driver.find_elements_by_id('observations_details')[0].text
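Before moving on, it helps to see what that .text blob looks like and how it gets broken into rows in Step 6. The string below is a made-up, simplified stand-in for the real table text (the actual page has more columns and rows), but the cleanup steps are the same ones used in the final script:

```python
import re

# hypothetical, simplified stand-in for the scraped table's .text
weatherdata = ("Time Temp. Humidity\n"
               "12:53 AM 48.0 \u00b0F 87%\n"
               "1:53 AM 46.9 \u00b0F 89%\n"
               "Back to Top")

x = re.sub(r'[^\x00-\x7F]+', ' ', weatherdata)  # replace non-ASCII runs (the degree sign) with a space
rows = x.split('\n')[1:-1]  # one observation per row, dropping the header and trailing line
# rows == ['12:53 AM 48.0  F 87%', '1:53 AM 46.9  F 89%']
```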

Step 6: Bring it all together

Finally, we can combine all the previous steps into a compact script. I have also included some preprocessing steps and a wrapper function that runs everything in one call:

from selenium import webdriver
import time
import numpy as np
import pandas as pd
import datetime
import pickle
import re

def list_dates(start, end):
    """Creates a list of dates between the 'start' date and the 'end' date."""
    # create datetime objects for the start and end dates
    start = datetime.datetime.strptime(start, '%Y-%m-%d')
    end = datetime.datetime.strptime(end, '%Y-%m-%d')
    # generate the list of dates between the start and end dates
    step = datetime.timedelta(days=1)
    dates = []
    while start <= end:
        dates.append(start.date())
        start += step
    # return the list of dates in string format
    return [str(date) for date in dates]

# this dictionary maps the month numbers produced by the previous function to full month names
month_dict = {
    1: 'January', 2: 'February', 3: 'March', 4: 'April',
    5: 'May', 6: 'June', 7: 'July', 8: 'August',
    9: 'September', 10: 'October', 11: 'November', 12: 'December'
}

def date_part(data, f_mat='%Y-%m-%d'):
    """Extracts Month, Day, Year from the dates produced by list_dates()."""
    # create a pandas dataframe of dates
    dates = pd.DataFrame(data, columns=['date'])
    date_time = dates['date']
    fld = pd.to_datetime(date_time, format=f_mat)
    for n in ('Month', 'Day', 'Year'):
        dates[n] = getattr(fld.dt, n.lower())
    dates['Month'] = dates['Month'].map(month_dict)
    return dates

def scrapper(dates, zipcode):
    """Submits the zip code, then scrapes the weather table for each date."""
    data = []  # list to append scraped data
    # submit the zip code to find the closest weather center
    search = driver.find_element_by_xpath('//*[@id="history-icao-search"]')
    search.clear()
    search.send_keys(zipcode)
    search.submit()
    time.sleep(3)  # sleep timer to wait for the page to load
    # iterate through the provided list of dates to scrape weather for
    for i, v in dates.iterrows():
        # input month, day, year into the website to view that date's information
        month = driver.find_element_by_class_name('month')
        month.send_keys(v['Month'])
        day = driver.find_element_by_class_name('day')
        day.send_keys(v['Day'])
        year = driver.find_element_by_class_name('year')
        year.send_keys(v['Year'])
        year.submit()  # submits the search for month, day, year
        # time.sleep(3)  # sleep timer to wait for the page to load (optional)
        # scrape the table at the bottom for the weather information
        weatherdata = driver.find_elements_by_id('observations_details')  # locates the data
        x = weatherdata[0].text  # scrapes that data
        x = re.sub(r'[^\x00-\x7F]+', ' ', x)  # removes unicode
        x = x.split('\n')  # breaks the data into one observation per row
        x = x[1:-1]  # removes the header and trailing line
        data.extend([i + ' ' + v['date'] for i in x])  # appends all scraped data
    return data

def preprocess_data(data):

"""Preprocess the scraped data and load the data into a pandas dataframe"""

dt = [i.replace('Calm Calm', 'Calm 0.0 mph') for i in data]

dt = [i.replace(' AM', 'AM') for i in dt]

dt = [i.replace(' PM', 'PM') for i in dt]

dt = [i.replace('%', '') for i in dt]

dt = [i.replace(' mi', '') for i in dt]

dt = [i.replace(' mph', '') for i in dt]

dt = [i.replace(' in', '') for i in dt]

dt = [re.sub(' +',' ',i) for i in dt] dt = [i.replace('Mostly ', 'Mostly') for i in dt]

dt = [i.replace('Partly ', 'Partly') for i in dt]

dt = [i.replace('Scattered ', 'Scattered') for i in dt]

dt = [i.replace('Light ', 'Light') for i in dt]

dt = [i.replace('Heavy ', 'Heavy') for i in dt]



dt = [i.replace('Fog , Rain', 'Rain') for i in dt]

dt = [i.replace('Fog , Snow', 'FogSnow') for i in dt]

dt = [i.replace('Fog', ' ',1) for i in dt]

dt = [i.replace('Rain , Thunderstorm', 'RainThunderstorm') for i in dt]



dt = [i.replace('Thunderstorm', '',1) for i in dt]

dt = [i.replace('Thunderstorms and Rain', 'ThunderstormsandRain') for i in dt] dt = [i.replace('Rain', '',1) for i in dt]

dt = [i.replace('Snow', '',1) for i in dt]

# dt = [i.replace('Light Drizzle', 'LightDrizzle') for i in dt]



dt = [i.replace('F', '',) for i in dt]

dt = [i.replace(' og', ' Fog') for i in dt]

dt = [i.replace('Patches of Fog', 'PatchesofFog',1) for i in dt]

dt = [i.replace('Lightreezing Rain', 'LightFreezingRain') for i in dt]



dt = [i.split() for i in dt]

dt = [ i[:2] +i[-10:] for i in dt]



dt = pd.DataFrame(dt,columns = ['time','temp(F)','dewpoint(F)','humidity(%)','pressure(in)','visibility(mi)','winddir','windspeed(mph)','gustspeed(mph)','precip(in)','conditions','date'])

dt['time'] = [datetime.datetime.strftime(datetime.datetime.strptime(val, "%I:%M%p"), "%H:%M") for val in dt['time']]

return dt def weather_scrapper(start_date,end_date, zipcode):

"""final webscrapper function"""

dates = list_dates(start_date,end_date)

dates = date_part(dates,'%Y-%m-%d')

data = scrapper(dates,zipcode)

return preprocess_data(data) ####################################################################

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
x = 'https://www.wunderground.com/history/airport/KSFO/2018/2/24/DailyHistory.html?req_city=San%20Francisco&req_statename=California'
driver.get(x)

# YYYY-MM-DD
start = '2014-1-1'
end = '2014-1-5'
zipcode = '10001'

weather_scrapper(start, end, zipcode)

Congratulations! Now you can collect weather data with Selenium and enhance your Kaggle / work / school / side projects without compromising validation scores! You can also apply what you’ve learned here to collect data from other sources.
