
Sheila Thalia - Section Barcelona

Table of Contents

Intermediate Assignment Objectives:

Data cleaning (parsing, duplicates, nulls and formatting)
Basic spreadsheets (IMPORTRANGE, pivot tables and VLOOKUP)
Statistics in spreadsheets (charts and descriptive analysis)
Exploratory Data Analysis (EDA)

Advanced Assignment Objectives:


Hypothesis testing (A/B Test)
Correlation and regression analysis
Intermediate Case 1
Kuala Lumpur Property Listing Price Dataset
For the first case, we use the Kuala Lumpur Property Listing Price
dataset, which has 8 columns and 5001 rows (before cleaning).
Data Cleaning Process

Remove Duplicate
a. Combine data (func. >>> B2&C2&D2&I2&J2&L2&N2&O2&Q2&Z2&AA2)
b. Count duplicated data (func. >>> countif(AB:AB,AB2))
c. Delete duplicated data (func. >>> if(AB2=AB1,"duplicate",""))

Remove Empty Value
Using filtering and conditional formatting

Split Column
From Units (RM), using the split function
Split Property Type using Data >> Split text to columns

Transform Blank
Transform blanks to 0 for Room, Bathroom and Carpark
Transform the x + x format to x (rooms)

Remove Outliers
Removing notably different values (by using pivot or IQR)
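The duplicate check in the Remove Duplicate step (combine the cells into one key, count each key, flag the repeats) can be sketched in Python. This is only an illustration: the column names and sample rows are hypothetical, and the `|` separator is an added assumption that keeps values such as "1" + "23" and "12" + "3" from producing the same key.

```python
from collections import Counter

def flag_duplicates(rows, key_columns):
    """Count composite keys and flag every repeat after its first appearance."""
    # One composite key per row, similar to joining cells with & in a helper column.
    keys = ["|".join(str(row[col]) for col in key_columns) for row in rows]
    counts = Counter(keys)  # plays the role of COUNTIF over the helper column
    seen = set()
    flags = []
    for key in keys:
        flags.append("duplicate" if key in seen else "")
        seen.add(key)
    return counts, flags

# Hypothetical sample rows, not taken from the dataset.
rows = [
    {"location": "KLCC", "price": 1250000, "rooms": 3},
    {"location": "Cheras", "price": 450000, "rooms": 3},
    {"location": "KLCC", "price": 1250000, "rooms": 3},
]
counts, flags = flag_duplicates(rows, ["location", "price", "rooms"])
deduplicated = [row for row, flag in zip(rows, flags) if flag == ""]
```

Unlike a formula that compares each row with the one directly above it, this sketch tracks keys it has already seen, so the rows do not need to be sorted first.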
Outlier Cleaning (IQR Method)

After the outliers are removed (right table), the upper and
lower values change and, consequently, the dataset is
reduced. As a result, the median and the mode have
the same value (1,200,000) and the mean is not far
from the median. The dataset can now be analyzed.
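A minimal sketch of IQR-based outlier removal is given here for illustration. The multiplier `k=1.5` is the common textbook default and an assumption, since the slide does not state the multiplier used for this dataset. The quartile method ("inclusive") is also an assumption, and the price list is made up.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR bounds."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical listing prices in RM, not taken from the dataset.
prices = [400_000, 650_000, 900_000, 1_200_000,
          1_200_000, 1_500_000, 2_100_000, 25_000_000]
cleaned = remove_outliers(prices)
```

Because the bounds are recomputed from whatever data remains, running the filter again on `cleaned` can give different upper and lower values, which matches the change described above.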
Univariate Analysis of price

From the number of listings per property type, it is
concluded that Residential has the
highest number of vacant properties in Kuala
Lumpur, followed by Bungalow in 2nd place. The Other
property type has the lowest property price
variation, which is below average.
Intermediate Case 2
E-Commerce Public Dataset
Combine Dataset

Order Dataset

Customer Dataset

Payment Dataset
Cleaning Data

The cleaning process conducted on this dataset:

1. Check for duplicates and remove them
2. Check for typos (if any)
3. Fill blanks with “unknown” for categorical data and 0 for
numerical data
4. Clean outliers (combined with the descriptive statistics
analysis)
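Step 3 of the list above can be sketched as follows. The column names and sample rows are hypothetical, and treating both `None` and the empty string as "blank" is an assumption about how the blanks appear in the data.

```python
def fill_blanks(rows, categorical, numerical):
    """Return copies of rows with blanks replaced by 'unknown' or 0."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the original row is left untouched
        for col in categorical:
            if fixed.get(col) in (None, ""):
                fixed[col] = "unknown"
        for col in numerical:
            if fixed.get(col) in (None, ""):
                fixed[col] = 0
        cleaned.append(fixed)
    return cleaned

# Hypothetical sample rows, not taken from the dataset.
orders = [
    {"payment_type": "credit_card", "payment_value": 120.5},
    {"payment_type": "", "payment_value": None},
]
filled = fill_blanks(orders, ["payment_type"], ["payment_value"])
```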
Descriptive Statistic (phase 1)
Payment Value
Data Interpretation:
The data is concentrated at the lower payment values. This is shown in the
payment value histogram and the distribution graph (z-score method).
The distribution graph in particular makes it clear that the data is piled up
at the far left with a long tail to the right (positively skewed).

From the descriptive statistics table, the data is interpreted as
follows:
1. Mean > median > mode, which indicates that the data is positively skewed.
For value analysis, the median should be used as the average value (in place
of the mean).
2. The coefficient of variation exceeds 1. This indicates that the mean value
cannot be used for analysis.
3. The standard deviation is almost 3 times the mean value. This indicates that
the data is widely dispersed.
4. The maximum value is greater than the upper value (Q3 + 3*IQR), which
indicates that the data has potential outliers that should be removed from the
analysis.
5. The variance is high, which indicates that the data points are very
spread out from the mean and from one another.
6. The kurtosis has a high negative value, which indicates that the data has a
low peak.
7. The skewness is positive, which indicates that the data is positively skewed,
i.e. concentrated at the lower values.
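The measures discussed above can be computed with a short sketch like the one below. It uses moment-based formulas for skewness and excess kurtosis; spreadsheet functions may apply small-sample corrections, so their values can differ slightly. The payment values are made up for illustration.

```python
import statistics

def describe(values):
    """Return the summary measures used in the interpretation."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample standard deviation
    # Population central moments for the shape measures.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": std,
        "variance": std ** 2,
        "cv": std / mean,                    # coefficient of variation
        "skewness": m3 / m2 ** 1.5,          # > 0 means a long right tail
        "excess_kurtosis": m4 / m2 ** 2 - 3, # 0 for a normal distribution
    }

# Hypothetical payment values, not taken from the dataset.
payments = [20, 20, 35, 50, 60, 80, 120, 900]
stats = describe(payments)
```

With this toy sample the mean is larger than the median, which is larger than the mode, and the coefficient of variation is above 1, the same pattern described in points 1 and 2.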
Descriptive Statistic (phase 1)
Actual Delivery Days
Data Interpretation:

The data is concentrated at the lower actual delivery days. This is
depicted in the actual delivery histogram and the distribution graph (z-score
method) below. The distribution graph in particular makes it clear
that the data is piled up at the far left with a long tail to the right
(positively skewed).

From the descriptive statistics table, the data is interpreted
as follows:
1. Mean > median > mode, which indicates that the data is positively skewed.
2. The coefficient of variation does not exceed 1. This indicates that the
mean value can still be used for analysis, because the mean value
is not far from the median.
3. The standard deviation is lower than the mean value. This indicates that the
data is narrowly dispersed, or clustered around the mean.
4. The maximum value is greater than the upper value (Q3 + 3*IQR), which
indicates that the data has potential outliers that should be removed
from the analysis.
5. The variance is relatively small, which indicates that the data
points are not too spread out from the mean and from one another.
6. The kurtosis has a relatively high negative value, which indicates that the
data has a low peak.
7. The skewness is positive, which indicates that the data is positively skewed,
i.e. concentrated at the lower values.
Descriptive Statistic (phase 2) Payment Value

Data Interpretation:

After the data above the upper value is removed (IQR), the data interpretation changes
as follows:

Mean > median > mode, which indicates that the data is still positively skewed.
The coefficient of variation has dropped below 1. This indicates that the mean value can be
used for analysis.
The standard deviation is lower than the mean value. This indicates that the data has become
narrowly dispersed.
The variance is lower, which indicates that the data points are now less spread
out from the mean and from one another.
The kurtosis has become positive, which indicates that the data has a high peak.
The skewness is still positive but lower than before, which indicates that the data is still
positively skewed, i.e. still concentrated at the lower values.
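The before-and-after comparison can be illustrated with a small sketch that trims values above the upper value Q3 + 3*IQR and recomputes the coefficient of variation. The quartile method ("inclusive") is an assumption, and the payment values are made up, so the numbers do not reproduce the slide's table.

```python
import statistics

def upper_fence(values, k=3.0):
    """Upper value as Q3 + k*IQR (k=3 follows the formula in the slides)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.fmean(values)

# Hypothetical payment values, not taken from the dataset.
payments = [20, 20, 35, 50, 60, 80, 120, 900]
fence = upper_fence(payments)
trimmed = [v for v in payments if v <= fence]

cv_before = coefficient_of_variation(payments)
cv_after = coefficient_of_variation(trimmed)
```

In this toy sample a single extreme value pushes the coefficient of variation above 1, and it falls below 1 once that value is trimmed.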
Descriptive Statistic (phase 2) Actual Delivery Days

Data Interpretation:

After the data above the upper value is removed, the interpretation of the actual
delivery days data changes as follows:

Mean > median > mode, which indicates that the data is still positively skewed.
The coefficient of variation is still lower than 1. This indicates that the mean
value can be used for analysis.
The standard deviation is lower than the mean value. This indicates that the data has
become more narrowly dispersed.
The variance is lower, which indicates that the data points are now
less spread out from the mean and from one another.
The kurtosis has a relatively small negative value, which indicates that the data
still has a low peak, but a higher one than before.
The skewness is still positive but lower than before, which indicates that the data is
still positively skewed, i.e. still concentrated at the lower values.
Exploratory Data Analysis
a. Number of orders per month

Insight:

From the graph, it is concluded that the overall number of
orders, sorted by month and year, is increasing. The
highest number of orders was reached in November
2017. Only from January 2018 to August 2018 is the
number of orders relatively stagnant.
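Counting orders per month can be sketched as below. The timestamp format `%Y-%m-%d %H:%M:%S` and the sample timestamps are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime

def orders_per_month(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count orders per calendar month, returned in chronological order."""
    months = (datetime.strptime(ts, fmt).strftime("%Y-%m") for ts in timestamps)
    # "YYYY-MM" keys sort chronologically when sorted as text.
    return dict(sorted(Counter(months).items()))

# Hypothetical purchase timestamps, not taken from the dataset.
purchases = [
    "2017-11-24 10:15:00",
    "2017-10-02 08:30:00",
    "2017-11-25 21:05:00",
    "2018-01-03 12:00:00",
]
monthly = orders_per_month(purchases)
busiest_month = max(monthly, key=monthly.get)
```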
Exploratory Data Analysis
Daily Order Trend
Payment Value x Payment Type (Bivariate)
Insight:

The median is used to find the insight because the
descriptive statistics show that the data is positively
skewed (see the data interpretation for payment
value).

From the graph, it is concluded that customers used
credit cards to buy products with a median payment value
of about 105 dollars, followed by
boleto with a median payment value of about 93.5
dollars and voucher with a median payment value of
about 77.5 dollars.
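A median-per-payment-type comparison can be sketched as follows. The records are made-up toy values and are not meant to reproduce the medians reported above.

```python
import statistics
from collections import defaultdict

def median_by_group(records):
    """Group (payment_type, payment_value) pairs and take each group's median."""
    groups = defaultdict(list)
    for payment_type, value in records:
        groups[payment_type].append(value)
    return {name: statistics.median(values) for name, values in groups.items()}

# Hypothetical (payment_type, payment_value) pairs, not taken from the dataset.
records = [
    ("credit_card", 50.0), ("credit_card", 110.0), ("credit_card", 400.0),
    ("boleto", 40.0), ("boleto", 90.0),
    ("voucher", 30.0),
]
medians = median_by_group(records)
ranking = sorted(medians, key=medians.get, reverse=True)
```

The median is used here instead of the mean so that a few very large payments do not dominate the comparison.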
Exploratory Data Analysis
Payment Type Percentage

Insight:

From the graph, it is concluded that
customers prefer to pay by credit card
when buying products, followed by boleto
and voucher respectively.
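The percentage share of each payment type can be computed with a short sketch like this one. The list of payment types is a hypothetical sample.

```python
from collections import Counter

def percentage_share(labels):
    """Return each label's share of the total, in percent, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.most_common()}

# Hypothetical payment types, not taken from the dataset.
payment_types = ["credit_card"] * 6 + ["boleto"] * 3 + ["voucher"]
shares = percentage_share(payment_types)
```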
