Data Analyst Portfolio Overview
By: Anushka Shukla
Professional background:
I am an ambitious data analyst intern with a passion for data visualization, data modelling, data understanding and project planning, and with skills ranging from visualization to presentation. My objective is to work closely with senior researchers to analyze financial and economic data and deliver economic impact reports to stakeholders, contributing to informed decision making.
I have also gained experience in breaking down complex problems into smaller, manageable parts to arrive at comprehensive solutions, and in translating complex data into clear, understandable insights that guide stakeholders' decision making.
INDEX:
1. PROFESSIONAL BACKGROUND
2. INDEX
3. INSTAGRAM USER ANALYTICS PROJECT PROFILE
4. APPROACH
5. TASKS UNDERTAKEN
6. ANALYSIS
7. CONCLUSIONS
8. OPERATION & METRIC ANALYSIS PROJECT PROFILE
9. APPROACH
10. TASKS UNDERTAKEN
11. ANALYSIS
12. CONCLUSIONS
13. HIRING PROCESS ANALYTICS PROJECT PROFILE
14. APPROACH
15. TASKS UNDERTAKEN
16. ANALYSIS
17. CONCLUSIONS
18. IMDB MOVIE ANALYSIS PROJECT PROFILE
19. APPROACH
20. TASKS UNDERTAKEN
21. ANALYSIS
22. CONCLUSIONS
23. BANK LOAN CASE STUDY PROJECT PROFILE
24. APPROACH
25. TASKS UNDERTAKEN
26. ANALYSIS
27. CONCLUSIONS
28. IMPACT OF CAR FEATURES PROJECT PROFILE
29. APPROACH
30. TASKS UNDERTAKEN
31. ANALYSIS
32. CONCLUSIONS
33. ABC CALL VOLUME TREND PROJECT PROFILE
34. APPROACH
35. TASKS UNDERTAKEN
36. ANALYSIS
37. CONCLUSIONS
38. APPENDIX
(1) INSTAGRAM USER ANALYTICS PROJECT PROFILE
The major objective of this project is to extract meaningful datasets from the available metadata and to visualize users' interests, supporting both quantitative and qualitative analysis of user activity. The project aims at understanding user engagement on Instagram in order to help the product team improve the platform.
APPROACH
The main approach in this project is to use SQL queries to analyze and extract the data needed to track user engagement, so that the product team can launch new campaigns and improve the user experience on the platform. The tech stack used is MySQL Workbench 8.0 CE, a GUI tool for MySQL. It helps me create and design database schemas, run SQL queries against the stored data and visualize reports related to user data.
TASKS UNDERTAKEN
Insights:
A. Marketing analysis:
1. Loyal user reward (task): identify the five oldest users on Instagram from the provided database.
Required output – given below is a list of the five oldest users of Instagram:
Id    Username            Created At
80    Darby_Herzog        2016-05-06 00:14:21
67    Emilio_Bernier52    2016-05-06 13:04:30
63    Elenor88            2016-05-08 01:30:41
95    Nicole71            2016-05-09 17:30:22
38    Jordyn.Jacobson2    2016-05-14 07:56:26
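SQL Query (a minimal sketch, assuming the same users table with id, username and created_at columns used in the queries below):
SELECT id, username, created_at
FROM users
ORDER BY created_at ASC
LIMIT 5;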
5. Ad Campaign Launch (task) – determine the day of the week on which most users registered on Instagram and provide insights on when to schedule an ad campaign.
Required output – these are the two days on which the most users registered on Instagram:
Day        final
Thursday   16
Sunday     16
SQL Query used –
SELECT DAYNAME(created_at) AS bestday,
       COUNT(*) AS final
FROM users
GROUP BY bestday
ORDER BY final DESC
LIMIT 2;
B. Investor metrics:
1. User engagement (task) – calculate the average number of posts per user on Instagram; also, provide the total number of posts on Instagram divided by the total number of users.
Required output –
Average no. of posts per user: 2.57
Total posts / total users: 0.0509
SQL Query used –
WITH user_involved AS (
  SELECT u.id AS userid, COUNT(p.id) AS photoid
  FROM users AS u
  LEFT JOIN photos AS p ON p.user_id = u.id
  GROUP BY u.id
)
SELECT SUM(photoid) AS all_images,
       COUNT(photoid) AS all_users,
       SUM(photoid) / COUNT(userid) AS post_per_user,
       SUM(photoid) / SUM(userid) AS avg_photos
FROM user_involved;
2. Bots and fake accounts (task) – identify users who have liked every single photo on the site, as this is not typically possible for a normal user.
Required output –
Id username
5 aniya_hackett
14 jaclyn81
21 rocio33
24 maxwell.halvorson
36 ollie_ledner37
41 mckenna17
54 duane60
57 julien_schmidt
66 mike.auer39
71 Nia_haag
75 leslie67
76 janelle.niikolaus81
91 bethany20
SQL Query used –
SELECT username, COUNT(*) AS total_like
FROM users
INNER JOIN likes ON likes.user_id = users.id
GROUP BY likes.user_id
HAVING total_like = (SELECT COUNT(*) FROM photos);
ANALYSIS
Through this project I was able to gain insights that can be used by teams across the business, helping them launch new marketing campaigns, track the success of the app by measuring user engagement, and improve the overall user experience, thereby supporting the wider business.
CONCLUSIONS
2) OPERATION & METRIC ANALYSIS PROJECT PROFILE
[Chart: November insights – total_time and nov_result plotted per day, 25/11/2020 to 30/11/2020]
Throughput/day:
Days          regular_throughput
2020-11-25    0.022
2020-11-26    0.018
2020-11-27    0.010
2020-11-28    0.061
2020-11-29    0.050
2020-11-30    0.050
I would prefer a 7-day rolling average over the daily metric because daily values keep fluctuating, which can make analysing a company's growth trends cumbersome. A 7-day rolling average makes it easier to identify the long-term trends concealed by daily fluctuations, which helps in making informed decisions and improving overall growth and productivity.
SELECT ROUND(COUNT(event) / SUM(time_spent), 3) AS week_throughput
FROM job_data;

Throughput/day:
SELECT ds AS days,
       ROUND(COUNT(event) / SUM(time_spent), 3) AS regular_throughput
FROM job_data
GROUP BY ds
ORDER BY ds;
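The 7-day rolling throughput preferred above could be computed with a window function over the same job_data table; the following is only a sketch, assuming job_data holds one row per reviewed event with ds (date) and time_spent columns as in the queries above:
SELECT ds,
       ROUND(SUM(COUNT(event)) OVER w / SUM(SUM(time_spent)) OVER w, 3) AS rolling_7day_throughput
FROM job_data
GROUP BY ds
-- the window covers the current day plus the 6 preceding days
WINDOW w AS (ORDER BY ds ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
ORDER BY ds;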
No_week user_active
17 663
18 1068
19 1113
20 1154
21 1121
22 1186
23 1232
24 1275
25 1264
26 1302
27 1372
28 1365
29 1376
30 1467
31 1299
32 1225
33 1225
34 1204
35 104
2. User Growth Analysis (task): write an SQL query to calculate the user growth of the product.
Required output –
9    2013    1455
9    2014    1588
10   2013    1620
10   2014    1774
11   2013    1805
11   2014    1935
12   2013    1968
12   2014    2116
13   2013    2155
13   2014    2322
…and so on.
SQL Query used –
SELECT device,
       EXTRACT(WEEK FROM occured_at) AS week_no,
       COUNT(DISTINCT user_id) AS user_record
FROM event
WHERE event_type = 'engagement'
GROUP BY device, week_no
ORDER BY week_no;
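The query above counts weekly engaged users per device. The cumulative user-growth figures shown earlier could be produced along the following lines; this is only a sketch, and the users table with an activated_at timestamp is an assumed source that does not appear in the queries above:
WITH weekly_signups AS (
  SELECT EXTRACT(YEAR FROM activated_at) AS year_no,   -- activated_at is an assumed column
         EXTRACT(WEEK FROM activated_at) AS week_no,
         COUNT(user_id) AS new_users
  FROM users
  GROUP BY year_no, week_no
)
SELECT week_no, year_no, new_users,
       SUM(new_users) OVER (ORDER BY year_no, week_no) AS cumulative_users
FROM weekly_signups
ORDER BY year_no, week_no;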
SELECT
  (SUM(CASE WHEN email_category = 'email_opened' THEN 1 ELSE 0 END)
     / SUM(CASE WHEN email_category = 'email_sent' THEN 1 ELSE 0 END)) * 100 AS open_record,
  (SUM(CASE WHEN email_category = 'email_clickthrough' THEN 1 ELSE 0 END)
     / SUM(CASE WHEN email_category = 'email_sent' THEN 1 ELSE 0 END)) * 100 AS click_record
FROM (
  SELECT *,
         CASE  -- the action-to-category mapping below is an assumed reconstruction;
               -- the raw action names in email_events may differ
           WHEN action IN ('sent_weekly_digest', 'sent_reengagement_email') THEN 'email_sent'
           WHEN action = 'email_open' THEN 'email_opened'
           WHEN action = 'email_clickthrough' THEN 'email_clickthrough'
         END AS email_category
  FROM email_events
) AS alias;
ANALYSIS
This project answers questions such as the total number of jobs reviewed, the throughput, the percentage share of each language and the number of duplicate rows. It also covers user-level metrics such as user engagement, user growth, weekly retention and email engagement.
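Weekly retention, listed above, can be defined in more than one way; one common definition (a user active in week N who is active again in week N+1) can be sketched against the same event table used in the earlier queries:
WITH weekly_users AS (
  SELECT DISTINCT user_id, EXTRACT(WEEK FROM occured_at) AS week_no
  FROM event
  WHERE event_type = 'engagement'
)
SELECT w1.week_no,
       COUNT(DISTINCT w1.user_id) AS active_users,
       COUNT(DISTINCT w2.user_id) AS retained_next_week,
       ROUND(COUNT(DISTINCT w2.user_id) / COUNT(DISTINCT w1.user_id) * 100, 2) AS retention_pct
FROM weekly_users AS w1
LEFT JOIN weekly_users AS w2
  ON w2.user_id = w1.user_id
 AND w2.week_no = w1.week_no + 1
GROUP BY w1.week_no
ORDER BY w1.week_no;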
CONCLUSIONS
The project helped me in the following ways:
An opportunity to analyse and handle large datasets and observe their trends and metrics.
Observing important insights gave me a clearer understanding of advanced SQL.
A better understanding of data models and of the structure behind any dataset.
Learning to extract valuable insights from large datasets while polishing my advanced SQL skills.
3) HIRING PROCESS ANALYTICS PROJECT PROFILE
The project aims at visualizing and extracting important insights from the raw data provided, which helps in improving an organization's overall hiring process. It involves using statistics and advanced Excel to obtain important insights that enable the organization to improve its hiring process.
APPROACH
The major approach of the project is to use Excel to uncover the underlying trends in the hiring process, supporting optimal growth of the company. It also includes finding missing data, summarizing data using statistics, visualizing data using charts, predicting outliers and more. The tech stack used:
Microsoft Excel 2019: a spreadsheet editor that helps operate on a huge dataset to get useful insights and visualize them through charts; it allows me to store, format, analyze and process my dataset in a quick and efficient way.
Microsoft Word 2019: word-processing software which is an effective and user-friendly tool for editing and formatting text. It enables me to prepare the final reports from the insights obtained.
TASKS UNDERTAKEN
1. Hiring Analysis (task): determine the gender distribution of hires. How many males and females have been hired by the company?
Required output – the number of males and females hired by the company is as follows:
Event name    No. of candidates
Male          2563
Female        1856
[Bar chart: number of candidates hired, male vs. female]
Pivot table given below represents salary distribution for hired candidates
4. Departmental Analysis (task): use a pie chart, bar graph, or any other suitable visualization to show the proportion of people working in different departments.
Required output – the proportion of people working in the different departments is as follows:
[Pie chart: proportion of employees by department]
Row Labels                    Proportion of dept.
Finance Department            3.75%
General Management            2.41%
Human Resource Department     1.49%
Marketing Department          4.30%
Operations Department         39.24%
Production Department         5.24%
Purchase Department           4.90%
Sales Department              10.33%
Service Department            28.36%
Grand Total                   100.00%
5. Position Tier Analysis (task): use a chart or graph to represent the different position tiers within the company. This will help you understand the distribution of positions across different tiers.
Required output – given below is the distribution of the different position tiers within the company.
[Bar chart: count of posts per position tier (b9, c10, c5, c8, c9, i1, i4, i5, i6, i7, m6, n6)]
Pivot table used to create the chart:
Row Labels    Count of Post Name
b9            308
c10           105
c5            1182
c8            193
c9            1239
i1            151
i4            32
i5            511
i6            337
i7            635
m6            2
n6            2
Grand Total   4697
ANALYSIS
A few of the analyses drawn are as follows. These insights help in reviewing job requirements and examining hiring efficiency.
The number of males hired by the company is greater than the number of females.
The most common position tier in the company is c9, followed by c5.
The largest number of people work in the Operations department, followed by the Service department.
The average salary offered by the company is approximately 50,000.
CONCLUSIONS
4) IMDB MOVIE ANALYSIS PROJECT PROFILE
Required output –
Given below is the list of all genres along with their descriptive statistics. From the chart it can be concluded that the most common genres are comedy, action and drama, followed by adventure, crime, biography and horror.
Genre counts:
Action        933
Adventure     365
Animation     45
Biography     205
Comedy        1003
Crime         249
Documentary   36
Drama         658
Family        3
Fantasy       35
Horror        155
Musical       2
Mystery       22
Romance       1
Sci-Fi        7
Thriller      2
Western       2

Genre statistics:
Mean                 219
Median               36
Mode                 2
Standard Deviation   331.7097753
Variance             103558.9412
Maximum              1003
Minimum              1
Range                1002
[Bar chart: total number of movies per genre]
=MEDIAN(Q5:Q21)
=MODE(Q5:Q21)
=STDEV(Q5:Q21)
=VAR.P(Q5:Q21)
Required output –
From the given scatter plot it is clear that the trendline between movie duration and IMDB score has a positive slope, which means that movies with longer durations tend to get higher IMDB scores.
Descriptive statistics of movie duration are as follows:
Mean          Median   Mode   Standard Deviation
110.2634972   106      101    22.67832498
[Scatter plot: IMDB score vs. movie duration, with trendline]
C. Language Analysis (task): determine the most common languages used in movies and analyze their impact on the IMDB score using descriptive statistics.
Required output –
From the insight into the most common languages, along with the mean and median of their IMDB scores, it is clear that the most common language is English, followed by French, Spanish, Mandarin, Japanese and German.
Row Labels   Count of language   Mean          Median
English      3566                6.427509815   6.6
French       34                  7.355882353   6.6
Spanish      23                  7.082608696   6.6
Mandarin     14                  7.021428571   6.6
Japanese     10                  7.66          6.6
German       10                  7.77          6.6
Cantonese    7                   7.342857143   6.6
Italian      7                   7.185714286   6.6
Hindi        5                   7.22          6.6
Portuguese   5                   7.76          6.6
D. Director Analysis (task): identify the top directors based on their average IMDB score and analyze their contribution to the success of movies using percentile calculations.
Required output – given below is an insight into the top 15 directors on the basis of their average scores, along with percentiles.
Row Labels        Average of imdb_score   Percentile
Akira Kurosawa    8.7                     0.937
Tony Kaye         8.6                     0.812
Charles Chaplin   8.6                     0.812
Required output –
The correlation coefficient is 0.098318102, which means there is a weak relationship between movie budgets and their respective gross earnings.
Following is a list of movies with their profit margin values, and from the given list it is clear that the movie with the highest profit margin is Avatar, followed by others.
ANALYSIS
From the gained insights I have analyzed that:
The most common genre of movies is comedy, followed by action and drama.
There is a very weak relationship between movie budgets and gross earnings.
The top director in the film industry, by average IMDB score, is Akira Kurosawa, followed by Tony Kaye and Charles Chaplin.
The most common language of movies is English, followed by French and Spanish.
CONCLUSIONS
The project is beneficial in:
Understanding the applications of advanced Excel mathematical and statistical functions so as to perform larger calculations in a time-efficient manner.
Learning ways to visualize and present data using graphs and charts, making the insights clearly visible and easy to understand for clients and internal stakeholders.
Understanding data models and the structure behind any dataset.
Revising the concepts of handling large amounts of data, analysing it and visualizing it.
5) BANK LOAN CASE STUDY PROJECT PROFILE
The major objective of the project is to perform exploratory data analysis of bank loan data to protect banks from the risk of financial crisis. The project deals with finding the key factors behind loan default so as to make better decisions about loan approvals in the future. This information is useful for banks to make informed decisions, i.e. whether they should give a loan to a particular client, by how much they should reduce the loan amount, whether to apply higher interest rates when lending to risky or defaulting candidates, and how to prevent the rejection of deserving candidates.
APPROACH
The important approaches followed while doing this project are as follows:
Understanding the distribution of loan data in the previous-application dataset and the current-application dataset.
Cleaning and handling missing values through conditional formatting and Excel formulas, dropping irrelevant columns, and imputing missing columns with statistical values.
Performing the analysis for the given tasks using Excel functions, built-in charts, pivot tables and conditional formatting wherever necessary, and extracting meaningful insights from it.
Visualizing the insights using charts and finally collecting the conclusions drawn after analysing the charts.
The tech stack used for this purpose is:
MS Excel 2019: spreadsheet software which eases mathematical and statistical calculations and also helps to present and visualize the insights obtained from huge datasets through various types of charts and graphs; apart from this, it makes calculations easier through pivot tables, autofill, autosum, data analysis, etc.
MS Word 2019: word-processing software used for preparing reports from the insights gained from large datasets. It makes it easy to write, edit and store these reports efficiently, and it has many time-saving features such as autocorrect, thesaurus tools, find/replace and more.
TASKS UNDERTAKEN
A. Identify Missing Data and Deal with it Appropriately (task): identify the missing data in the dataset and decide on an appropriate method to deal with it using Excel's built-in functions and features.
Required output –
Application Data File
Required output –
1. Used Excel quartile functions to calculate outliers, and analysed the outliers for the application dataset and the previous-application dataset using box charts. For example, the outliers for AMT_GOODS_PRICE are calculated as follows.
2. Outliers for the other columns are calculated in the same way. It can be clearly observed from the charts below that columns such as AMT_GOODS_PRICE, AMT_APPLICATION and AMT_ANNUITY have much larger outliers compared to the other columns.
[Box charts: outlier analysis for AMT_ANNUITY, AMT_GOODS_PRICE, DAYS_DECISION and AMT_INCOME_TOTAL]
Required output –
1. To understand the data imbalance, the COUNTIF function is used to calculate the ratio between candidates who repay their loans (TARGET = 0, paid) and those who default (TARGET = 1, unpaid); an equivalent SQL check is sketched after the chart below.
TARGET    Proportion    Formula
paid      91.94784      =COUNTIF('application dat'!C4:C50003,0)
unpaid    8.052161      =COUNTIF('application dat'!C5:C50004,1)
2. The bar chart shown below clearly depicts that the ratio between loan payers and defaulters is roughly 23:2, i.e. there is a huge imbalance in the loan applications and the distribution of classes in the dataset is skewed.
[Bar chart: proportion of paid vs. unpaid loans]
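If the application data were loaded into a database table (a hypothetical application_data table with the same TARGET column), the equivalent class-balance check in SQL would be:
SELECT TARGET,
       COUNT(*) AS applicants,
       COUNT(*) / SUM(COUNT(*)) OVER () * 100 AS proportion_pct  -- share of each class
FROM application_data
GROUP BY TARGET;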
D. Perform Univariate, Segmented Univariate, and Bivariate Analysis:
Task: Perform univariate analysis to understand the distribution of individual variables, segmented
univariate analysis to compare variable distributions for different scenarios, and bivariate analysis to
explore relationships between variables and the target variable using Excel functions and features.
Required output –
1. For univariate analysis a single variable is taken into consideration, while for bivariate analysis we have to manage, compare and relate two variables at the same time.
2. It can be clearly observed from the univariate analysis that:
Amount income distribution: most of the candidates (45,532) have an average income of 25,000-2,700,000, while only one candidate has an income above 11 crores.
Amount credit distribution: the number of candidates receiving loans between 45 thousand and 54 lakhs is the highest, i.e. 27,105, while only 2 candidates have received loans above 40 lakhs.
Family status: married candidates take the highest loans compared to single ones, and the lowest amount of loans is taken by widows.
Age: individuals in the 40-50 age group take more loans than people of other age groups, and the lowest number of loans is taken by individuals aged above 70.
Conclusions for the other columns are drawn in the same way.
[Charts: univariate distributions of AMT_ANNUITY, income type, AMT_CREDIT, AMT_INCOME_TOTAL and age group (20-30 through 60-70)]
E. Identify Top Correlations for Different Scenarios:
Task: Segment the dataset based on different scenarios (e.g., clients with payment difficulties and all
other cases) and identify the top correlations for each segmented data using Excel functions.
Required output –
1. Pairwise correlations are calculated with the CORREL function, e.g. =CORREL(B:B,C:C).
2. A correlation value close to 1 shows a strong correlation, whereas a value close to 0 shows a weak correlation. These values help identify the relationship between the target and the other factors so as to determine the predictors of loan default.
3. The top 11 correlations for both segments (loan payers and defaulters) are shown below, and the following conclusions can be drawn from them:
It is clear from the defaulters' correlation table that AMT_ANNUITY has its strongest correlation with DAYS_BIRTH, with a correlation coefficient of 0.986944, followed by AMT_GOODS_PRICE with a correlation coefficient of 0.769499.
Similarly, it can be seen from the payers' correlation table that AMT_CREDIT has its strongest correlation with AMT_GOODS_PRICE, with a correlation coefficient of 0.986944.
The second largest correlation in the payers' table is AMT_ANNUITY with AMT_GOODS_PRICE, with a correlation coefficient of 0.774434.
Correlation table – defaulters (clients with payment difficulties); columns follow the same variable order as the rows:
AMT_INCOME_TOTAL       1
AMT_CREDIT             0.01089   1
AMT_ANNUITY           -0.03243   0.06931   1
AMT_GOODS_PRICE       -0.0124    0.08300   0.76949   1
abs days birth        -0.04131   0.06988   0.98694   0.77443   1
abs DAYS_EMPLOYED     -0.07679  -0.016     0.05934  -0.00771   0.0576107    1
abs days reg.         -0.04247  -0.03151  -0.06774  -0.10871  -0.06505949   0.62172831   1
abs days publish      -0.04234  -0.00995  -0.00345  -0.03322  -0.00610104   0.33363250   0.20917    1
CNT_FAM_MEMBERS       -0.04693  -0.00351   0.01222  -0.00672   0.01396776   0.27082514   0.27276    0.10429    1
REGION_RATING_CLIENT   0.01299   0.01122   0.06399   0.07737   0.06162435  -0.27724625  -0.23076   -0.17011    0.02607    1
CNT_CHILDREN           0.06613  -0.03819  -0.10051  -0.1258   -0.10372243  -0.0167792    0.20917   -0.08752    0.00230    0.025985   1

Correlation table – payers (all other cases); columns follow the same variable order as the rows:
AMT_INCOME_TOTAL       1
AMT_CREDIT             0.06931   1
AMT_ANNUITY            0.08300   0.76949   1
AMT_GOODS_PRICE        0.06988   0.98694   0.77443   1
abs days birth        -0.016     0.05934  -0.00771   0.05761   1
abs DAYS_EMPLOYED     -0.03151  -0.06774  -0.10871  -0.06506   0.62172831   1
abs days reg.         -0.00995  -0.00345  -0.03322  -0.0061    0.33363251   0.20917213   1
abs days publish      -0.00351   0.01222  -0.00672   0.01396   0.27082514   0.27276667   0.104299   1
CNT_FAM_MEMBERS        0.01122   0.06399   0.07737   0.06162  -0.27724625  -0.23076292  -0.17011   0.026078   1
REGION_RATING_CLIENT  -0.03819  -0.10051  -0.1258   -0.10372  -0.0167792    0.03455865  -0.08752   0.002307   0.025985   1
CNT_CHILDREN           0.00958   0.00497   0.02617   0.00025  -0.32926375  -0.24153956   0.104299   0.032116   0.880454   0.025914   1
Minimum               -0.03819  -0.10051  -0.1258   -0.10372  -0.32926375  -0.24153956  -0.17011   0.002307   0.025985   0.025914   1
ANALYSIS
There is a huge imbalance in the loan applications and in the distribution of classes, which means the data is skewed.
AMT_ANNUITY has its strongest correlation with DAYS_BIRTH, with a correlation coefficient of 0.986944, followed by AMT_GOODS_PRICE with a correlation coefficient of 0.769499.
Married candidates take the highest loans compared to single ones, and the lowest amount of loans is taken by widows.
Individuals in the 40-50 age group take more loans than people of other age groups.
CONCLUSIONS
6) IMPACT OF CAR FEATURES PROJECT PROFILE
Market category                              Count of Model   Sum of Popularity
Crossover,Diesel                             7                6111
Crossover,Exotic,Luxury,High-Performance     1                238
Crossover,Exotic,Luxury,Performance          1                238
Crossover,Hatchback                          72               120650
Crossover,Hatchback,Performance              6                12054
Crossover,Hybrid                             42               107662
Crossover,Luxury,Diesel                      34               73080
Crossover,Luxury,High-Performance            9                9335
Crossover,Luxury,Hybrid                      24               15142
After analysing the given chart using a slicer, it can be observed that the highest sum of popularity scores is for "N/A", followed by "Crossover", "Flex Fuel" and "Performance".
[Combo chart: count of models and sum of popularity by market category]
Insight Required: What is the relationship between a car's engine power and its price?
Task 2: Create a scatter chart that plots engine power on the x-axis and price on the y-axis. Add a trendline to the chart.
Required output –
Task 2 (Insights):
The scatter chart with a trendline between engine power and MSRP is given below. From the trendline it can be observed that price varies positively with engine power, as the slope is inclined slightly upwards; this means that as the power of a car's engine increases, its price also increases.
[Scatter chart: engine HP vs. MSRP, with trendline]
Insight Required: Which car features are most important in determining a car's price?
Task 3: Use regression analysis to identify the variables that have the strongest relationship with
a car's price. Then create a bar chart that shows the coefficient values for each variable to
visualize their relative importance.
Required output –
Task 3 (Insights):
Regression analysis is used to find the correlation between the various features of a car and its MSRP. This involves the following steps: select the range of data >> Data >> Data Analysis >> Regression.
From the given coefficient data, and after analysing the bar chart, it is clear that the price of a car is most strongly related to "Engine Cylinders", followed by "highway MPG".
Factors             Coefficients
Year                -25.8990629
Engine HP           321.892799
Engine Cylinders    6237.37657
highway MPG         753.131581
city mpg            367.199653
Popularity          -3.11493774
[Bar chart: regression coefficients of car features vs. price]
Insight Required: How does the average price of a car vary across different manufacturers?
● Task 4.A: Create a pivot table that shows the average price of cars for each manufacturer.
● Task 4.B: Create a bar chart or a horizontal stacked bar chart that visualizes the relationship between manufacturer and average price.
Required output –
Task 4.A (Insights):
Pivot charts corresponding to each manufacturer are given below. After analysing the chart using a slicer, it is clear that the highest average MSRP is for "Bugatti", followed by "Lamborghini" and "Maybach".
[Bar chart: average MSRP by manufacturer]
Insight Required: What is the relationship between fuel efficiency and the number of cylinders in a
car's engine?
Task 5.A: Create a scatter plot with the number of cylinders on the x-axis and highway MPG on the y-axis. Then create a trendline on the scatter plot to visually estimate the slope of the relationship and assess its significance.
Task 5.B: Calculate the correlation coefficient between the number of cylinders and highway MPG to quantify the strength and direction of the relationship.
Required output –
Task 5.A (Insights):
The slope between the number of cylinders and highway MPG is negative, which means fuel efficiency keeps decreasing as the number of cylinders increases, and vice versa.
[Scatter plot: number of cylinders vs. highway MPG, with trendline]
Task 5.B (Insights):
The correlation coefficient between the number of cylinders and highway MPG is -0.599665331. The negative value shows a moderate inverse relationship between the number of cylinders and fuel efficiency, which can also be observed from the trendline above.
Formula used: =CORREL('clean dataset'!G:G,'clean dataset'!N:N)
Building the Dashboard:
Task 1: How does the distribution of car prices vary by brand and body style?
Required output –
Task 1 (Insights):
After analysing the given chart with the help of a slicer, it is clear that among body styles the total price of "Sedan" cars is the highest, followed by the SUV body styles, and among brands Chevrolet has the highest value, followed by "Mercedes-Benz".
[Dashboard chart: distribution of car prices by brand and body style]
Task 2: Which car brands have the highest and lowest average MSRPs, and how does this vary by
body style?
Required output –
Task 2 (Insights):
After analysing the chart with slicers, it can be said that, with respect to car brands, "Bugatti" with the "Coupe" body style has the highest average MSRP, followed by "Lamborghini" and "Maybach", while "Suzuki" with the "4dr SUV" body style has the lowest average MSRP.
[Dashboard chart: average MSRP by brand and body style]
Task 3: How do different features such as transmission type affect the MSRP, and how does this vary by body style?
Required output –
Task 3 (Insights):
● From the chart it can be clearly seen that "Automated Manual" transmission with the Coupe body style has the highest average MSRP value of 99508.37061, followed by "direct_drive" with the "Sedan" body style, with an average MSRP of 47351.25.
[Dashboard chart: average MSRP by transmission type and body style]
Task 4: How does the fuel efficiency of cars vary across different body styles and model years?
Required output –
Task 4 (Insights):
● From the given chart it can be seen that fuel efficiency increases as time passes, which reflects the technological improvement over time, and it can be expected that fuel efficiency will keep increasing in the future as well.
[Dashboard chart: fuel efficiency by model year and body style]
Task 5: How does the car's horsepower, MPG, and price vary across different Brands?
Required output –
Task 5 (Insights):
It is clear from the given chart that there is a negative relationship between highway MPG and the price of a car, which is why "Bugatti", with the highest MSRP, has a very low highway MPG value, while there is a direct correlation between engine horsepower and MSRP.
[Dashboard chart: engine HP, highway MPG and MSRP by brand]
ANALYSIS
Fuel efficiency increases as time passes, which reflects the technological improvement over time.
"Automated Manual" transmission with the Coupe body style has the highest average MSRP value.
There is a direct relationship between engine power and MSRP values.
With respect to car brands, "Bugatti" with the "Coupe" body style has the highest average MSRP, followed by "Lamborghini" and "Maybach", while "Suzuki" with the "4dr SUV" body style has the lowest average MSRP.
CONCLUSIONS
7) ABC CALL VOLUME TREND PROJECT PROFILE
TASKS UNDERTAKEN
1. Average Call Duration (Your Task): What is the average duration of calls for each time bucket?
INSIGHTS
The total average call duration is 198.62 seconds. This value is obtained by selecting the time-bucket and call-duration (seconds) columns and inserting a pivot table over them. The average call duration for each time bucket is given in tabular form below:
Row Labels Average of Call_Seconds (s)
10_11 203.3310302
11_12 199.2550234
12_13 192.8887829
13_14 194.7401744
14_15 193.6770755
15_16 198.8889175
16_17 200.8681864
17_18 200.2487831
18_19 202.5509677
19_20 203.4060725
20_21 202.845993
9_10 199.0691057
Grand Total 198.6227745
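If the same call records were loaded into a SQL table (a hypothetical call_records table with time_bucket and call_seconds columns), the per-bucket average could be computed as:
SELECT time_bucket,
       AVG(call_seconds) AS avg_call_seconds   -- average duration per bucket
FROM call_records
GROUP BY time_bucket
ORDER BY time_bucket;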
[Chart: average call duration (seconds) per time bucket]
2. Call Volume Analysis (Your Task): Can you create a chart or graph that shows the number of calls received in each time bucket?
INSIGHTS: The total number of calls received in each bucket is calculated in tabular as well as graphical form:
Row Labels   Count of Customer_Phone_No   Share of calls (%)
0 1 0.00%
10_11 13313 11.28%
11_12 14626 12.40%
12_13 12652 10.72%
13_14 11561 9.80%
14_15 10561 8.95%
15_16 9159 7.76%
16_17 8788 7.45%
17_18 8534 7.23%
18_19 7238 6.13%
19_20 6463 5.48%
20_21 5505 4.67%
9_10 9588 8.13%
(blank) 0.00%
Grand Total 117989 100.00%
[Combo chart: number of calls and share of calls per time bucket]
3. Manpower Planning (Your Task): What is the minimum number of agents required in each time bucket to reduce the abandon rate to 10%?
INSIGHTS
Number of agents needed: 57
Improvement suggestions: from the graph given below it can be clearly seen that the percentage of abandoned calls is largest in the mornings and evenings, which is an alarming factor for the call-centre industry; the company should work on bringing this abandoned-call percentage down.
[Chart: abandoned-call percentage by hour of day]
4. Night Shift Manpower Planning (Your Task): Propose a manpower plan for each time bucket throughout the day, keeping the maximum abandon rate at 10%.
INSIGHTS:
The number of agents required in the different time labels is as follows:
Time label      9pm-10pm  10pm-11pm  11pm-12am  12am-1am  1am-2am  2am-3am  3am-4am  4am-5am  5am-6am  6am-7am  7am-8am  8am-9am
No. of agents   2         2          1          1         1        1        1        1        2        2        2        3
ANALYSIS
The average call duration is highest for the 19_20 time bucket (203.41 s), followed closely by 10_11 (203.33 s) and 20_21 (202.85 s).
In the night shift, the highest number of agents (3) is required between 8 am and 9 am.
The minimum number of agents required to reduce the abandon rate to 10% is 57.
The highest number of calls is received in the 11_12 time bucket (14,626), followed by 10_11 (13,313).
CONCLUSIONS
Through the project I got an opportunity to brush up my skills in predictive analysis, including data mining and statistical analysis, to improve the customer experience.
The project has also helped me utilise my numerical-ability skills, especially in manpower planning and night-shift manpower planning.
It helped me understand and preprocess a huge dataset, thus improving my overall data modelling and interpretation skills.
APPENDIX
GOOGLE SHEET LINKS FOR THE PROJECTS ARE AS FOLLOWS: