
Abhishek Khatri

Nov 28, 2019 · 5 min read


Web Scraping Weather Data using Python


Web Scraping

Web scraping is a method of extracting information from a web page.

It is the process of extracting useful information from the page and formatting the
data in the required format for further analysis and use.

In this blog we will be extracting weather data from the following page:


http://www.estesparkweather.net/archive_reports.php?date=20001

Required Python Libraries

We will start by importing a bunch of libraries:

import pandas as pd
from bs4 import BeautifulSoup
import requests
import re
from datetime import datetime
from tqdm import tqdm

Missing libraries can be installed using pip:

pip3 install tqdm

Let’s move ahead and start extracting data, but before we do, let us first see what
our data looks like:

Data for Jan 1 2009

Note: We are going to extract data for a decade, let us say starting from Jan 2009 till
October 2018. Indexing will be done based on the date for which data is available.

Just look at the above image: we will be extracting info such that we get 19 columns,
from Average temperature till Maximum heat index, and our index will be the date
on which these readings were recorded.

Creating a list of dates

Since we are extracting data for a decade, we will start our code with a dates
variable.

range_date = pd.date_range(start = '1/1/2009', end = '11/1/2018', freq = 'M')

dates = [str(i)[:4] + str(i)[5:7] for i in range_date]


In the above piece of code, a pandas DatetimeIndex with monthly frequency was
created; a list of YYYYMM date strings was then built from range_date.

dates[0:5]

The above cell will give us the following output: ['200901', '200902', '200903', '200904', '200905']
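As a quick sanity check, the two lines above can be run in isolation; note that freq = 'M' produces month-end timestamps, so the range covers Jan 2009 through Oct 2018 (a sketch assuming only that pandas is installed):

```python
import pandas as pd

# Month-end timestamps from Jan 2009 through Oct 2018
range_date = pd.date_range(start='1/1/2009', end='11/1/2018', freq='M')

# Format each timestamp as a YYYYMM string, e.g. '200901'
dates = [str(i)[:4] + str(i)[5:7] for i in range_date]

print(dates[:5])   # ['200901', '200902', '200903', '200904', '200905']
print(len(dates))  # 118 months in total
```

Newer pandas versions prefer the alias 'ME' over 'M' for month-end frequency; 'M' still works but may emit a deprecation warning.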

Extracting information from the page:

Now we will start by creating 2 empty lists, one for our data and the other for the index field,
and we will run a loop over the number of elements in the list dates. Also, we will
create a variable url into which we will pass a new page per iteration for extracting data:

df_list = []
index = []

for k in tqdm(range(len(dates))):
    url = "http://www.estesparkweather.net/archive_reports.php?date=" + dates[k]
    # Our loop will run for the number of elements in dates
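To make the URL construction concrete, here is a minimal sketch (no network access needed) showing the page addresses generated for the first two months:

```python
base = "http://www.estesparkweather.net/archive_reports.php?date="
dates = ['200901', '200902']  # first two YYYYMM strings from the dates list

urls = [base + d for d in dates]
print(urls[0])
# http://www.estesparkweather.net/archive_reports.php?date=200901
```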

Now, to download the webpage, use:

page = requests.get(url)

requests.get() takes the URL and downloads the content of the web page. The
downloaded web page is stored as a string in the Response object’s text
attribute (page.text). If access is successful, we will see 200 as the output of the following:

page.status_code

The next step is parsing our data using BeautifulSoup, a module used for extracting
information from HTML pages.
We should have some basic knowledge of HTML tags before using it. Next, we’ll
create a BeautifulSoup object and apply its methods to extract data as per our
requirement:

soup = BeautifulSoup(page.content, 'html.parser')

In the above block we are using the HTML parser; we generally use either an
HTML or an XML parser. Search the web for other parsers.

Next, we use the find_all method to locate all our <table> tags. For more info on this
method and related ones, refer to the BeautifulSoup documentation.

table = soup.find_all('table')
type(table)
# This will give
# bs4.element.ResultSet

Next we’ll create a list of lists containing the text data of all the rows under the table
tags:

parsed_data = [row.text.splitlines() for row in table]

parsed_data = parsed_data[:-9]

In the last line of the above cell, we are removing all the rows that are not required for
extraction; in other words, this is junk information in the data that can be
ignored. So, the first element of our list will be:

parsed_data[0]

Removing all the empty strings for a better view using slicing:

for l in range(len(parsed_data)):
    parsed_data[l] = parsed_data[l][2:len(parsed_data[l]):3]

After formatting, the 1st element of parsed_data will look like:


parsed_data[0] after further formatting
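The slice [2:len(...):3] keeps every third element starting at index 2, which is what drops the empty strings. A toy illustration (the list below is made up for demonstration, not real page data):

```python
# splitlines() on the table text yields a repeating pattern:
# '', '', value, '', '', value, ...
raw = ['', '', 'Average temperature 37.8°F', '', '', 'Average humidity 35%']

cleaned = raw[2:len(raw):3]  # equivalent to raw[2::3]
print(cleaned)
# ['Average temperature 37.8°F', 'Average humidity 35%']
```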

Finally, we will use a regex to extract the numerical values from each string and create a
final list to store the data in the required format:

# Note: this block sits inside the for k loop started above
for i in range(len(parsed_data)):
    c = ['.'.join(re.findall(r"\d+", str(parsed_data[i][j].split()[:5])))
         for j in range(len(parsed_data[i]))]
    df_list.append(c)
    index.append(dates[k] + c[0])

The above code will run for all the months in the defined period, and we now
have 2 lists, one containing the data and the other the indexes.
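The join/findall trick above turns a label such as 'Average temperature 37.8°F' into the bare number '37.8'. A small sketch on one made-up cell:

```python
import re

cell = 'Average temperature 37.8°F'

# str() of the first five tokens, e.g. "['Average', 'temperature', '37.8°F']"
tokens = str(cell.split()[:5])

# findall pulls out the digit runs ['37', '8']; joining with '.' rebuilds '37.8'
value = '.'.join(re.findall(r"\d+", tokens))
print(value)  # 37.8
```

Note that this rebuilds decimals correctly only because each cell carries a single reading; if more than one number appeared in the first five tokens they would be joined together.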

Now if we check the length of the 1st element of df_list, the output will be 20, and the
elements of our list will be: ['1', '37.8', '35', '12.7', '29.7', '26.4', '36.8', '274', '0.00', '0.00',
'0.00', '40.1', '34.5', '44', '27', '29.762', '29.596', '41.4', '59', '40.1']

But it has been observed that for 8 data points the length of the list is 22, and one such list looks
like: ['', '31.3', '54', '14.2', '29.986', '8.0', '12.3', '319', '0.575',
'13.268', '0.020', '53.2', '1.4', '93', '15', '30.72', '29.04',
'170.2', '255.3', '53.2', '40.7', '22.1']

Before creating a dataframe, we need to get rid of these 8 data points; this can be
achieved as follows:

f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]

data = [df_list[i][1:] for i in range(len(df_list)) if len(df_list[i][1:]) == 19]

After removing the junk data points, we will convert the indexes to the date format %Y-%m-%d
using the following piece of code:

final_index = [datetime.strptime(str(f_index[i]), '%Y%m%d').strftime('%Y-%m-%d') for i in range(len(f_index))]
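strptime with '%Y%m%d' parses the compact index strings (including single-digit days such as '2009011', since %d accepts one or two digits), and strftime rewrites them with dashes. A quick sketch:

```python
from datetime import datetime

# '200901' + day '1' from the scraped row gives a compact index like '2009011'
raw_index = '2009011'

formatted = datetime.strptime(raw_index, '%Y%m%d').strftime('%Y-%m-%d')
print(formatted)  # 2009-01-01
```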

So, we are all set, let us create dataframe:

col = ['Average temperature (°F)', 'Average humidity (%)',
       'Average dewpoint (°F)', 'Average barometer (in)',
       'Average windspeed (mph)', 'Average gustspeed (mph)',
       'Average direction (°deg)', 'Rainfall for month (in)',
       'Rainfall for year (in)', 'Maximum rain per minute',
       'Maximum temperature (°F)', 'Minimum temperature (°F)',
       'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
       'Minimum pressure', 'Maximum windspeed (mph)',
       'Maximum gust speed (mph)', 'Maximum heat index (°F)']

final_df = pd.DataFrame(data, columns = col, index = final_index)

Final Output
With this we are done extracting data; we can further type cast the columns using
.astype(float). Thanks for reading!
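Since every scraped value is stored as a string, that final cast is what makes numeric analysis possible. A minimal sketch with two hypothetical rows (values invented for illustration):

```python
import pandas as pd

# String values, as produced by the scraper
final_df = pd.DataFrame(
    {'Average temperature (°F)': ['37.8', '31.3'],
     'Average humidity (%)': ['35', '54']},
    index=['2009-01-01', '2009-01-02'])

final_df = final_df.astype(float)
print(final_df.dtypes)  # both columns are now float64
```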

Python Web Scraping Beautifulsoup Regex Data Science
