
Performing Analysis of
Meteorological Data
Punam Seal Aug 23 · 9 min read

Data analytics refers to extracting useful information from the plentiful
datasets now available across a wide range of fields and areas of natural
and human activity. The ability to leverage data to improve understanding
has always been important, but it is becoming increasingly so as data
becomes more readily available. Ever wondered how news channels predict
weather conditions so accurately? The answer is data science, which works
in the background throughout the whole process of weather prediction. For
individuals and organizations alike, knowing the weather situation
accurately matters a great deal.

Many businesses are directly or indirectly linked to climatic conditions.
For instance, agriculture relies on weather forecasting to plan when to
plant, irrigate and harvest. Similarly, construction work, airport control
authorities and many other occupations depend on weather forecasting. With
its help, businesses can work more accurately and with fewer disruptions.

OVERVIEW:

In this blog, I analyze a type of data that is easy to find on the net: the
Weather Dataset, which provides historical data on many meteorological
parameters such as pressure, temperature, humidity, wind speed and
visibility. The dataset contains hourly records for roughly 10 years, from
2006-04-01 00:00:00.000 +0200 to 2016-09-09 23:00:00.000 +0200. It
corresponds to Finland, a country in Northern Europe.

You can download the dataset from this Google Drive link:
https://drive.google.com/open?id=1ScF_1a-bkHi1qe8Rn78uxK6_5QwUD9Bu

GOAL:

Our goal is to transform the raw data into information and then into
knowledge: clean the data and analyze it to test the given hypothesis. The
Null Hypothesis H0 is: “Do the Apparent Temperature and Humidity, compared
monthly across 10 years of the data, indicate an increase due to global
warming?”

H0 means we need to find out whether the average Apparent Temperature for
a given month, say April, has increased from 2006 to 2016, and whether the
average Humidity for the same period has increased. This monthly analysis
has to be done for all 12 months over the 10-year period.
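
One way to make this month-by-month comparison concrete is a year-by-month table of averages with a fitted slope per month. The following is only a minimal sketch: it runs on synthetic hourly values, not on the weather dataset, and the names `hourly`, `monthly_table` and `slopes` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hourly values (an assumption for illustration, not the real data):
# a yearly cycle plus a steady rise of 0.5 units per year.
index = pd.date_range("2009-01-01", "2011-12-31 23:00", freq="h")
day = index.dayofyear.to_numpy()
year = index.year.to_numpy()
hourly = pd.Series(
    10 + 10 * np.sin(2 * np.pi * day / 365) + 0.5 * (year - 2009),
    index=index,
)

# Average per (year, month), then spread the months across the columns.
monthly_table = hourly.groupby([year, index.month.to_numpy()]).mean().unstack()
monthly_table.index.name = "year"
monthly_table.columns.name = "month"

# Slope of a straight line fitted through each month's yearly averages,
# measured in units per year since the first year in the table.
years = monthly_table.index.to_numpy() - monthly_table.index.min()
slopes = monthly_table.apply(lambda col: np.polyfit(years, col.to_numpy(), 1)[0])

print(monthly_table.round(2))
print(slopes.round(3))
```

A positive slope for a month only describes the direction of the fitted line. It is not a significance test, and it says nothing about the cause of a change.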

So I used the resampling function, performed exploratory data analysis and
supported the analysis with appropriate visualizations using the Matplotlib
and Seaborn libraries.

TOOLS USED:

Language used- Python

IDE- Jupyter Notebook

DATA ANALYSIS:

First, I describe the dataset using info(), shape, columns, describe() and
similar tools, and then move on to data cleaning.

Import the required libraries


1. I will be using several Python libraries: Pandas, NumPy, Matplotlib and
Seaborn.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading and Reading the data


2. The dataset is in CSV form, so I load it with pd.read_csv and store it
as a DataFrame object in the variable data.

data = pd.read_csv(r"C:\Users\Admin\Downloads\weatherHistory.csv")
data

3. To view a small sample of a Series or DataFrame object, use the head()
and tail() methods.

head() returns the first n rows and tail() returns the last n rows (observe
the index values). The default number of rows to display is five, but you
may pass a custom number.

data.head()
data.tail()

4. describe() is used to view basic statistical details such as percentiles,
mean and std of a DataFrame or a Series of numeric values.

data.describe()

5. The shape attribute returns the dimensions of the DataFrame, which means
that our dataset has 96453 rows and 11 columns.

data.shape
Out:(96453, 11)

6. The DataFrame.columns attribute returns the column labels of the given
DataFrame.

data.columns

7. The DataFrame.info() function gives a concise summary of the DataFrame,
which provides a quick overview of the dataset when doing exploratory
analysis of the data.

data.info()

8. DataFrame.isnull().sum() returns the number of missing values in each
column of the dataset. As we can see, ‘Precip Type’ has 517 missing values
and all the other columns have 0 missing values.

data.isnull().sum()
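
The article does not show how the missing ‘Precip Type’ entries were handled. One possible cleaning step is to fill them with a placeholder category. This is sketched here on a tiny made-up frame; the sample values and the "unknown" label are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics two columns of the weather data.
sample = pd.DataFrame({
    "Precip Type": ["rain", np.nan, "snow", np.nan],
    "Humidity": [0.89, 0.86, 0.92, 0.75],
})

# Count the missing values per column before cleaning.
missing_before = sample.isnull().sum()

# Assumption: treat a missing precipitation type as its own category.
cleaned = sample.fillna({"Precip Type": "unknown"})
missing_after = cleaned.isnull().sum()

print(missing_before)
print(missing_after)
```

Since the hypothesis only involves Apparent Temperature and Humidity, which have no missing values, leaving ‘Precip Type’ untouched would work just as well for this analysis.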

9. While analyzing data, we often want to see the unique values in a
particular column, which can be done using the unique() function. Here I
take the Humidity column as an example to see all of its unique values.

data['Humidity'].unique()

Visualizing the data


10. The DataFrame.corr() function is used to find the pairwise correlation
of the columns in the DataFrame. Any null values are automatically
excluded. Older pandas versions silently ignore non-numeric columns, while
recent versions (2.0 and later) raise an error for them, so the code below
selects the numeric columns first. The heatmap then plots the resulting
correlation matrix.

sns.heatmap(data=data.select_dtypes(include="number").corr(), annot=True)
plt.title("Pairwise correlation of all columns in the dataframe")
plt.show()

11. The to_datetime() function converts its argument to datetime. I applied
it to the Formatted Date column, which changed its dtype from object to
datetime64[ns, UTC].

data['Formatted Date'] = pd.to_datetime(data['Formatted Date'], utc=True)
data['Formatted Date']
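
To see what utc=True does, here is a small sketch on two made-up timestamps written in the same format as the dataset but with different UTC offsets. The values are illustrative assumptions, not rows quoted from the file.

```python
import pandas as pd

# Two made-up timestamps with different UTC offsets (summer and winter time).
raw = pd.Series([
    "2006-04-01 00:00:00.000 +0200",
    "2006-11-01 00:00:00.000 +0100",
])

# utc=True converts every value to UTC, so the whole column shares a single
# timezone-aware dtype instead of mixing offsets.
parsed = pd.to_datetime(raw, utc=True)
print(parsed)
```

Note that local midnight on the first day of a month maps to the previous day in UTC, so after this conversion a few hours at each month boundary are counted in the neighbouring month when the data is grouped by month.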

12. The set_index() method sets one or more columns (or arrays such as a
list or Series) as the index of a DataFrame. The index column can also be
set while creating the DataFrame. After this step, the updated data has a
shape of 96453 rows × 10 columns.

data = data.set_index("Formatted Date")
data

Resampling Data

resample() is primarily used for time-series data. It is a convenience
method for frequency conversion and resampling of time series. The object
must have a datetime-like index (DatetimeIndex, PeriodIndex or
TimedeltaIndex), or the caller must pass the label of a datetime-like
series/index to the on / level keyword parameter.

13. Using the resample function with “MS”, which denotes Month Start, I
converted the hourly data to monthly data and averaged each month with the
mean() function. This step supports the analysis for testing the
hypothesis.

data_column = ['Apparent Temperature (C)', 'Humidity']


resampled_data_monthly_mean = data[data_column].resample('MS').mean()

resampled_data_monthly_mean.head()
resampled_data_monthly_mean.tail()
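
The effect of resample('MS').mean() is easiest to see on a tiny example. The sketch below uses made-up hourly humidity values for two months (0.8 throughout January, 0.6 throughout February); the names `hourly` and `monthly` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up hourly readings for January and February 2006.
index = pd.date_range("2006-01-01", "2006-02-28 23:00", freq="h", tz="UTC")
hourly = pd.DataFrame(
    {"Humidity": np.where(index.month == 1, 0.8, 0.6)},
    index=index,
)

# One row per month, labelled with the first day of that month.
monthly = hourly.resample("MS").mean()
print(monthly)
```

Each of the 1416 hourly rows falls into one of two monthly bins, and every bin is labelled with the timestamp of the month's first day.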

Exploratory Data Analysis


Seaborn's distplot() shows a histogram with a density line on it, that is,
it plots a univariate distribution of observations. The distplot() function
combines the matplotlib hist function with the seaborn kdeplot() and
rugplot() functions.

14. Using this function, I plotted the distribution of the resampled
monthly means for one column, Apparent Temperature.

import warnings
warnings.filterwarnings("ignore")
sns.distplot(resampled_data_monthly_mean['Apparent Temperature (C)'])
plt.show()

Similarly, I plotted the distribution of the Humidity column.

sns.distplot(resampled_data_monthly_mean['Humidity'])
plt.show()

15. The regplot() function draws a scatterplot of two variables, x and y,
then fits the regression model y ~ x and plots the resulting regression
line with a 95% confidence interval for that regression. Using this
function, I plotted the relation between Apparent Temperature and
Humidity.

sns.regplot(data=resampled_data_monthly_mean, x="Apparent Temperature (C)",
            y="Humidity", color="red")
plt.title("Relation between Apparent Temperature (C) and Humidity")
plt.xlabel('Apparent Temperature (C)')
plt.ylabel('Humidity')
plt.show()

16. sns.lineplot() draws a line plot. The relationship between x and y can
be shown for different subsets of the data using the hue, size and style
parameters, which control the visual semantics used to identify the
subsets. Using this function, I plotted the variation of Apparent
Temperature and Humidity with time using the resampled mean data.

sns.lineplot(data = resampled_data_monthly_mean)
plt.xlabel('Year')
plt.title("Variation of Apparent Temperature (C) and Humidity with time")
plt.show()

17. seaborn.pairplot() plots multiple pairwise bivariate distributions in a
dataset. It shows the relationship for every pair of variables in a
DataFrame, C(n, 2) combinations in total, as a matrix of plots, with the
univariate plots on the diagonal. Using this function, I plotted the
overall resampled mean data with the kind parameter set to scatter.

sns.pairplot(resampled_data_monthly_mean,kind = 'scatter')
plt.show()

Function for plotting Month-Wise plots for Apparent Temperature & Humidity of 10 years.

18. DataFrame.iloc[] selects rows and columns by integer position rather
than by label. It is handy when the index labels are something other than
the numeric series 0, 1, 2, 3, …, n, or when the user doesn't know the
labels, because every row and column has a position even though it isn't
displayed in the DataFrame.

Using it, I defined Temp_data and Humidity_data.


Temp_data = resampled_data_monthly_mean.iloc[:,0]
Humidity_data = resampled_data_monthly_mean.iloc[:,1]
Temp_data

Humidity_data
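
Position-based selection with iloc and label-based selection by column name return the same data. The short sketch below shows this on a made-up two-row frame with the same column names; the values are illustrative only.

```python
import pandas as pd

# Made-up monthly means with the same column layout as the resampled data.
monthly = pd.DataFrame(
    {"Apparent Temperature (C)": [1.5, 2.0], "Humidity": [0.85, 0.80]},
    index=pd.to_datetime(["2006-01-01", "2006-02-01"]),
)

by_position = monthly.iloc[:, 1]   # second column, selected by position
by_label = monthly["Humidity"]     # same column, selected by name

print(by_position.equals(by_label))  # True
```

Selecting by name keeps working if the column order changes, which is the main reason to prefer it when the labels are known.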

19. The functions below plot the month-wise Apparent Temperature and
Humidity over the 10 years. They use def, which creates (or defines) a
function, and elif, which is short for else if and lets us check multiple
conditions. Each month gets its own label and colour, which makes the plots
easier to read.

def label_color(month):
    if month == 1:
        return 'January', 'red'
    elif month == 2:
        return 'February', 'purple'
    elif month == 3:
        return 'March', 'orange'
    elif month == 4:
        return 'April', 'green'
    elif month == 5:
        return 'May', 'darkblue'
    elif month == 6:
        return 'June', 'violet'
    elif month == 7:
        return 'July', 'yellow'
    elif month == 8:
        return 'August', 'pink'
    elif month == 9:
        return 'September', 'black'
    elif month == 10:
        return 'October', 'brown'
    elif month == 11:
        return 'November', 'grey'
    else:
        return 'December', 'blue'

def plot_month(month, data):
    label, color = label_color(month)
    # Keep only the rows of the passed-in data that belong to this month
    month_data = data[data.index.month == month]
    sns.lineplot(data=month_data, label=label, color=color, marker='o')

def sns_plot(title, data):
    plt.title(title)
    plt.xlabel('Year')
    for i in range(1, 13):
        plot_month(i, data)
    # Show the figure so that the next call starts a fresh plot
    plt.show()

title = "Month-wise for Apparent Temperature (C) of 10 years"
sns_plot(title, Temp_data)

title = "Month-wise for Humidity of 10 years"
sns_plot(title, Humidity_data)
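
The label_color function above could also be written more compactly with the standard library's calendar module. This is only an alternative sketch: the name `label_color_compact` and the `MONTH_COLORS` list are hypothetical, and it assumes month numbers from 1 to 12.

```python
import calendar

# Same colours as in label_color, listed from January to December.
MONTH_COLORS = [
    "red", "purple", "orange", "green", "darkblue", "violet",
    "yellow", "pink", "black", "brown", "grey", "blue",
]

def label_color_compact(month):
    """Return (month name, colour) for a month number from 1 to 12."""
    return calendar.month_name[month], MONTH_COLORS[month - 1]

print(label_color_compact(4))
```

Unlike the if/elif chain, this version does not fall back to December for unexpected values, so it should only be called with valid month numbers.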

Function for plotting Apparent Temperature and Humidity plots for each month

20. With this function, I plotted Apparent Temperature and Humidity graphs
for each month from January to December across the 10 years.

def sns_month(month):
    label = label_color(month)[0]
    plt.title('Apparent Temperature (C) & Humidity for {}'.format(label))
    plt.xlabel('Year')
    data = resampled_data_monthly_mean[resampled_data_monthly_mean.index.month == month]
    sns.lineplot(data=data, marker='o')
    plt.show()

for month in range(1, 13):
    sns_month(month)

GITHUB LINK:

https://github.com/punamseal14/Suven-Consultants-and-Technology-Tasks/blob/master/Performing%20Analysis%20of%20Meteorological%20Data/main.ipynb

CONCLUSION:

From this analysis, I conclude that the Apparent Temperature and Humidity,
compared monthly across 10 years of the data, indicate an increase due to
global warming. This monthly analysis was done for all 12 months over the
10-year period.

I am thankful to the mentors at https://internship.suvenconsultants.com for
providing awesome problem statements and giving many of us a Coding
Internship Experience. Thank you www.suvenconsultants.com.

Final year B.Tech ECE engineering student.
