
Data Analyst Portfolio

By :- Anushka Shukla

Professional background:
I am an ambitious data analyst intern with a passion for data visualization, data modelling, data understanding and project planning, with skills ranging from visualization to presentation. My objective is to work closely with senior researchers to analyze financial and economic data and deliver an economic impact report to stakeholders, contributing to informed decision making.

I graduated with a Bachelor of Business Administration in 2024 with 70% marks. Through the course I learnt the application of statistics very well, which helps me detect patterns in data; avoid distortions, inconsistencies, and logical errors in my assessments; and produce accurate and consistent outcomes, which is possible only with a solid foundation in statistics and probability.

I recently participated in Accenture's data analytics and visualization job simulation on the Forage platform. Through this program I realized that I really enjoy cleaning, modelling and analyzing client data, creating slides to communicate findings, and presenting insights back to the client. I would love to apply what I have learned in a diverse project team at a company like Accenture.

I also gained experience in breaking down complex problems into smaller, manageable parts, resulting in comprehensive solutions, and learned to translate complex data into clear, understandable insights that guide stakeholders' decision making.
INDEX:

1. PROFESSIONAL BACKGROUND
2. INDEX
3. INSTAGRAM USER ANALYTICS PROJECT PROFILE
4. APPROACH
5. TASKS UNDERTAKEN
6. ANALYSIS
7. CONCLUSIONS
8. OPERATION & METRIC ANALYSIS PROJECT PROFILE
9. APPROACH
10. TASKS UNDERTAKEN
11. ANALYSIS
12. CONCLUSIONS
13. HIRING PROCESS ANALYTICS PROJECT PROFILE
14. APPROACH
15. TASKS UNDERTAKEN
16. ANALYSIS
17. CONCLUSIONS
18. IMDB MOVIE ANALYSIS PROJECT PROFILE
19. APPROACH
20. TASKS UNDERTAKEN
21. ANALYSIS
22. CONCLUSIONS
23. BANK LOAN CASE STUDY PROJECT PROFILE
24. APPROACH
25. TASKS UNDERTAKEN
26. ANALYSIS
27. CONCLUSIONS
28. IMPACT OF CAR FEATURES PROJECT PROFILE
29. APPROACH
30. TASKS UNDERTAKEN
31. ANALYSIS
32. CONCLUSIONS
33. ABC CALL VOLUME TREND PROJECT PROFILE
34. APPROACH
35. TASKS UNDERTAKEN
36. ANALYSIS
37. CONCLUSIONS
38. APPENDIX
1) INSTAGRAM USER ANALYTICS PROJECT PROFILE
The major objective of this project is to extract a meaningful dataset from the metadata and visualize users' interests, which helps with quantitative and qualitative analysis of user activity. The project aims at understanding user engagement practices on Instagram to help the product team improve the platform.
APPROACH
The main approach in this project is to use SQL queries to analyze and extract important data to track user engagement, so that the product team can launch new campaigns and improve the user experience on the platform. The tech stack I used is MySQL Workbench 8.0 CE, a GUI tool for MySQL. It helps me create and design database schemas, run SQL queries against the stored data, and visualize reports related to user data.
TASKS UNDERTAKEN
Insights:
A. Marketing analysis:
1. Loyal user reward (task) – identify the five oldest users on Instagram from the provided database.
Required output – given below is the list of the five oldest users of Instagram:
Id   Username           Created At
80   Darby_Herzog       2016-05-06 00:14:21
67   Emilio_Bernier52   2016-05-06 13:04:30
63   Elenor88           2016-05-08 01:30:41
95   Nicole71           2016-05-09 17:30:22
38   Jordyn.Jacobson2   2016-05-14 07:56:26

SQL Query used –


SELECT * FROM users
ORDER BY created_at
LIMIT 5;
2. Inactive user engagement (task) – identify users who have never posted a single photo on Instagram.
Required output – the users who remain inactive on Instagram are as follows:
ID USERNAME CREATED_AT
1 Kenton_Kirlin 2017-02-16 18:22:11
2 Andre_Purdy85 2017-04-02 17:11:21
3 Harley_Lind18 2017-02-21 11:12:33
4 Arely_Bogan63 2016-08-13 01:28:43
5 Aniya_Hackett 2016-12-07 01:04:39
6 Travon.Waters 2017-04-30 13:26:14
7 Kasandra_Homenick 2016-12-12 06:50:08
8 Tabitha_Schamberger11 2016-08-20 02:19:46
9 Gus93 2016-06-24 19:36:31
10 Presley_McClure 2016-08-07 16:25:49
11 Justina.Gaylord27 2017-05-04 16:32:16
12 Dereck65 2017-01-19 01:34:14
13 Alexandro35 2017-03-29 17:09:02
14 Jaclyn81 2017-02-06 23:29:16
15 Billy52 2016-10-05 14:10:20
16 Annalise.McKenzie16 2016-08-02 21:32:46
17 Norbert_Carroll35 2017-02-06 22:05:43
18 Odessa2 2016-10-21 18:16:56
19 Hailee26 2017-04-29 18:53:40
20 Delpha.Kihn 2016-08-31 02:42:30
21 Rocio33 2017-01-23 11:51:15
22 Kenneth64 2016-12-27 09:48:17
23 Eveline95 2017-01-23 23:14:19
24 Maxwell.Halvorson 2017-04-18 02:32:44
25 Tierra.Trantow 2016-10-03 12:49:21
26 Josianne.Friesen 2016-06-07 12:47:01
27 Darwin29 2017-03-18 03:10:07
28 Dario77 2016-08-18 07:15:03
29 Jaime53 2016-09-11 18:51:57
30 Kaley9 2016-09-23 21:24:20
31 Aiyana_Hoeger 2016-09-29 20:28:12
32 Irwin.Larson 2016-08-26 19:36:22
33 Yvette.Gottlieb91 2016-11-14 12:32:01
34 Pearl7 2016-07-08 21:42:01
35 Lennie_Hartmann40 2017-03-30 03:25:22
36 Ollie_Ledner37 2016-08-04 15:42:20
37 Yazmin_Mills95 2016-07-27 00:56:44
38 Jordyn.Jacobson2 2016-05-14 07:56:26
39 Kelsi26 2016-06-08 17:48:08
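
No SQL query was reproduced for this task; a minimal sketch, assuming the same users and photos tables used elsewhere in this project, is:

SELECT u.id, u.username, u.created_at
FROM users u
LEFT JOIN photos p ON p.user_id = u.id
WHERE p.id IS NULL
ORDER BY u.id;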

3. Contest winner declaration (task) – determine the winner of the contest and provide their details to the team.
Required output – the person with the highest number of likes on a single post is:
Username     Zack_Kemmer93
Id           52
Photo_id     145
Final_likes  48

SQL Query used –


SELECT username, photos.id, photos.image_url,
       COUNT(likes.user_id) AS final_likes
FROM photos
INNER JOIN likes ON likes.photo_id = photos.id
INNER JOIN users ON photos.user_id = users.id
GROUP BY photos.id
ORDER BY final_likes DESC
LIMIT 1;
4. Hashtag research (task) – identify and suggest the top five most commonly used hashtags on the platform.
Required output –
Tag_name
smile
beach
party
fun
concert

SQL Query used –


WITH top_tags AS (
  SELECT tag_id
  FROM photo_tags
  GROUP BY tag_id
  ORDER BY COUNT(tag_id) DESC
  LIMIT 5)
SELECT t.tag_name
FROM top_tags
JOIN tags t ON top_tags.tag_id = t.id;

5. Ad campaign launch (task) – determine the day of the week when most users register on Instagram, and provide insights on when to schedule an ad campaign.
Required output – these are the two days on which the most users registered on Instagram:

day       final
Thursday  16
Sunday    16
SQL query used –
SELECT DAYNAME(created_at) AS bestday,
       COUNT(*) AS final
FROM users
GROUP BY bestday
ORDER BY final DESC
LIMIT 2;

B. Investor metrics:
1. User engagement (task) – calculate the average number of posts per user on Instagram; also, provide the total number of posts on Instagram divided by the total number of users.
Required output –
Average no. of post/user:- 2.57
Total posts/total users:- 0.0509
SQL queries –
WITH user_involved AS (
  SELECT u.id AS userid, COUNT(p.id) AS photoid
  FROM users AS u
  LEFT JOIN photos AS p ON p.user_id = u.id
  GROUP BY u.id)
SELECT SUM(photoid) AS all_images,
       COUNT(photoid) AS all_users,
       SUM(photoid) / COUNT(userid) AS post_per_user,
       SUM(photoid) / SUM(userid) AS avg_photos
FROM user_involved;
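
Since the average number of posts per user is simply total posts over total users, an equivalent one-line sketch (assuming the same photos and users tables) is:

SELECT (SELECT COUNT(*) FROM photos) / (SELECT COUNT(*) FROM users) AS posts_per_user;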
2. Bots and fake accounts (task) – identify users who have liked every single photo on the site, as this is not typically possible for a normal user.

Required output –

Id username
5 aniya_hackett
14 jaclyn81
21 rocio33
24 maxwell.halvorson
36 ollie_ledner37
41 mckenna17
54 duane60
57 julien_schmidt
66 mike.auer39
71 Nia_haag
75 leslie67
76 janelle.niikolaus81
91 bethany20
SQL query –
SELECT username, COUNT(*) AS total_like
FROM users
INNER JOIN likes ON likes.user_id = users.id
GROUP BY likes.user_id
HAVING total_like = (SELECT COUNT(*) FROM photos);

ANALYSIS
Through this project I was able to gain insights that teams across the business can use: launching a new marketing campaign, tracking the success of the app by measuring user engagement, and improving the experience altogether while helping the overall business structure.

CONCLUSIONS

Through the project I got an opportunity to learn about the scalability and accessibility of SQL, which helps in managing and handling large volumes of data as per the requirements of an application. It also gives a clear understanding of the data, compared to depending only on the data as presented, along with an understanding of data models and the structure behind any data, and experience handling, analyzing and visualizing large amounts of data.
2) OPERATION AND METRIC ANALYSIS PROJECT PROFILE
This project aims at collecting the insights necessary to measure a business's performance. The data obtained is helpful in measuring productivity and efficiency, which results in sustainable growth and improved performance of an organization.
APPROACH
The approach of the project is to identify areas of improvement to obtain optimal growth, understand user engagement, and gain insights regarding user interests using SQL queries, and finally present the reports. The tech stack I used in this project is as follows:
 MySQL Workbench 8.0 CE: a GUI tool for MySQL that helps me create and design database schemas, run SQL queries against the stored data, visualize reports and more.
 Microsoft Excel 2019: spreadsheet software that helps me create dashboards and charts from the obtained insights, and is also useful for importing data into MySQL.
 Microsoft Word 2019: to prepare the reports from the final results obtained after operating on the whole dataset.
TASKS UNDERTAKEN

Case Study I: Job Data Analysis

1. Jobs reviewed over time (task) – write a SQL query to calculate the number of jobs reviewed per hour for each day in November 2020.
Required output – given below is a list of jobs reviewed per hour in November 2020:
total_id  days        total_time  nov_result
1         2020-11-25  0.01        0.00000617
1         2020-11-26  0.02        0.00000496
1         2020-11-27  0.03        0.00000267
2         2020-11-28  0.01        0.00001684
1         2020-11-29  0.01        0.00001389
2         2020-11-30  0.01        0.00001389

[Chart: November insights – total_time and nov_result plotted per day, 25/11/2020 to 30/11/2020]

SQL Query used –


SELECT COUNT(job_id) AS total_id,
       ds AS days,
       ROUND(SUM(time_spent)/3600, 2) AS total_time,
       (COUNT(job_id)/SUM(time_spent)/3600) AS nov_result
FROM job_data
WHERE ds BETWEEN '2020-11-01' AND '2020-11-30'
GROUP BY ds
ORDER BY ds;
2. Throughput analysis (task) – write a SQL query to calculate the 7-day rolling average of throughput. Additionally, explain whether you prefer daily metrics or the 7-day rolling average for throughput, and why.
Required output –
Throughput/week ==> 0.027

Throughput/day==>

DAYS regular_throughput

2020-11-25 0.022

2020-11-26 0.018

2020-11-27 0.010

2020-11-28 0.061

2020-11-29 0.050

2020-11-30 0.050

 I prefer the 7-day rolling average over the daily metric because daily values keep fluctuating, which can make analysing a company's growth trends cumbersome. The 7-day rolling average makes it easier to identify long-term trends concealed by daily fluctuations, which helps in making informed decisions and increasing overall growth and productivity.

SQL query used –

Throughput/week ==>

SELECT ROUND(COUNT(event)/SUM(time_spent), 3) AS week_throughput
FROM job_data;

Throughput/day ==>

SELECT ds AS days,
       ROUND(COUNT(event)/SUM(time_spent), 3) AS regular_throughput
FROM job_data
GROUP BY ds
ORDER BY ds;
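
The queries above report overall and per-day throughput; an actual 7-day rolling average needs a window function. A minimal sketch, assuming MySQL 8+ and the same job_data table:

SELECT days,
       ROUND(AVG(daily_throughput)
             OVER (ORDER BY days ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), 3)
         AS rolling_avg_throughput
FROM (SELECT ds AS days, COUNT(event)/SUM(time_spent) AS daily_throughput
      FROM job_data
      GROUP BY ds) AS daily;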

3. Language share analysis (task) – write a SQL query to calculate the percentage share of each language over the last 30 days.
Required output – the percentage share of each language is as follows:
language total_lan lang_share
Persian 3 37.5000
English 1 12.5000
Arabic 1 12.5000
Hindi 1 12.5000
French 1 12.5000
Italian 1 12.5000

[Chart: language share analysis]

SQL Query used –


WITH dp AS (
  SELECT language, COUNT(language) AS pqr
  FROM job_data
  WHERE ds BETWEEN '2020-11-01' AND '2020-11-30'
  GROUP BY language)
SELECT language AS lang,
       pqr AS total_lan,
       (100 * pqr / SUM(pqr) OVER ()) AS lang_share
FROM dp
ORDER BY lang_share DESC;
4. Duplicate rows detection (task) – write a SQL query to display the duplicate values from the job_data table.
Required output –

job_id actor_id event language Time_spent org ds


23 1003 decision Persian 20 C 29/11/2020
23 1005 transfer Persian 22 D 28/11/2020
23 1004 skip Persian 56 A 26/11/2020

SQL Query used –


SELECT * FROM job_data
WHERE job_id IN (SELECT job_id FROM job_data GROUP BY job_id HAVING COUNT(*) > 1);
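
Note that this returns every row whose job_id repeats, including the first occurrence. To list only the extra copies, a window-function sketch (assuming MySQL 8+ and the same table):

SELECT job_id, actor_id, event, language, time_spent, org, ds
FROM (SELECT jd.*, ROW_NUMBER() OVER (PARTITION BY job_id ORDER BY ds) AS rn
      FROM job_data jd) AS t
WHERE rn > 1;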

Case Study II: Investigating metric spikes

1. Weekly user engagement (task): write an SQL query to calculate weekly user engagement.
Required output –

No_week user_active
17 663
18 1068
19 1113
20 1154
21 1121

22 1186
23 1232
24 1275
25 1264
26 1302
27 1372
28 1365
29 1376
30 1467
31 1299
32 1225
33 1225
34 1204
35 104

SQL Query used –


SELECT EXTRACT(WEEK FROM occured_at) AS no_week,
       COUNT(DISTINCT user_id) AS user_active
FROM events
WHERE event_type = 'engagement'
GROUP BY no_week
ORDER BY no_week;

2. User growth analysis (task): write an SQL query to calculate the user growth of the product.
Required output –

weeks years active_user_record


0 2013 23
0 2014 106
1 2013 136
1 2014 262
2 2013 310
2 2014 419
3 2013 455
3 2014 568
4 2013 598
4 2014 728
5 2013 776
5 2014 909
6 2013 947
6 2014 1082
7 2013 1124
7 2014 1249
8 2013 1283
8 2014 1412

9 2013 1455
9 2014 1588
10 2013 1620
10 2014 1774
11 2013 1805
11 2014 1935
12 2013 1968
12 2014 2116
13 2013 2155
13 2014 2322

…and so on.

SQL Query used –


SELECT weeks, years,
       SUM(active_users) OVER (ORDER BY weeks, years
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS active_user_record
FROM (
  SELECT EXTRACT(WEEK FROM activated_at) AS weeks,
         EXTRACT(YEAR FROM activated_at) AS years,
         COUNT(DISTINCT user_id) AS active_users
  FROM users
  WHERE state = 'active'
  GROUP BY years, weeks
  ORDER BY years, weeks) AS alias;
3. Weekly retention analysis (task): write an SQL query to calculate the weekly retention of users based on their sign-up cohort.
Required output –
Weeks Users_record
17 72
18 163
19 185
20 176
21 103
22 196
23 196
24 229
25 207
26 201
27 222
28 215
29 221
30 238
31 193
32 245
33 261
34 259
35 18

SQL Query used –


SELECT EXTRACT(WEEK FROM occured_at) AS weeks,
       COUNT(DISTINCT user_id) AS users_record
FROM events
WHERE event_type = 'signup_flow' AND event_name = 'complete_signup'
GROUP BY weeks
ORDER BY weeks;
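
The query above counts completed sign-ups per week rather than retention by cohort. A sketch of true cohort retention, assuming the same events table, pairs each user's sign-up week with their later engagement weeks:

WITH cohort AS (
  SELECT user_id, MIN(EXTRACT(WEEK FROM occured_at)) AS signup_week
  FROM events
  WHERE event_type = 'signup_flow' AND event_name = 'complete_signup'
  GROUP BY user_id)
SELECT c.signup_week,
       EXTRACT(WEEK FROM e.occured_at) - c.signup_week AS weeks_since_signup,
       COUNT(DISTINCT e.user_id) AS retained_users
FROM cohort c
JOIN events e ON e.user_id = c.user_id AND e.event_type = 'engagement'
GROUP BY c.signup_week, weeks_since_signup
ORDER BY c.signup_week, weeks_since_signup;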

4. Weekly engagement per device (task): write an SQL query to calculate the weekly engagement per device.
Required output –
device week_no user_record
acer aspire desktop 17 9
acer aspire notebook 17 20
amazon fire phone 17 4
asus chromebook 17 21
dell inspiron desktop 17 18
dell inspiron notebook 17 46
hp pavilion desktop 17 14
htc one 17 16
ipad air 17 27
ipad mini 17 19
iphone 4s 17 21
iphone 5 17 65
iphone 5s 17 42
kindle fire 17 6
lenovo thinkpad 17 86
mac mini 17 6
macbook air 17 54
macbook pro 17 143
nexus 10 17 16
nexus 5 17 40
nexus 7 17 18
nokia lumia 635 17 17
samsumg galaxy tablet 17 8
samsung galaxy note 17 7
samsung galaxy s4 17 52
windows surface 17 10
acer aspire desktop 18 26

…and so on.
SQL query used –
SELECT device, EXTRACT(WEEK FROM occured_at) AS week_no,
       COUNT(DISTINCT user_id) AS user_record
FROM events
WHERE event_type = 'engagement'
GROUP BY device, week_no
ORDER BY week_no;

5. Email engagement analysis (task): write an SQL query to calculate the email engagement metrics.
Required output –
open_record  click_record
33.5834      14.7899

SQL query used –

SELECT
  (SUM(CASE WHEN email_category = 'email_opened' THEN 1 ELSE 0 END) /
   SUM(CASE WHEN email_category = 'email_sent' THEN 1 ELSE 0 END)) * 100 AS open_record,
  (SUM(CASE WHEN email_category = 'email_clickthrough' THEN 1 ELSE 0 END) /
   SUM(CASE WHEN email_category = 'email_sent' THEN 1 ELSE 0 END)) * 100 AS click_record
FROM (
  SELECT *,
         CASE
           WHEN action IN ('sent_weekly_digest', 'sent_reengagement_email') THEN 'email_sent'
           WHEN action IN ('email_open') THEN 'email_opened'
           WHEN action IN ('email_clickthrough') THEN 'email_clickthrough'
         END AS email_category
  FROM email_events) AS alias;

ANALYSIS
This project helps in answering questions such as the total number of jobs reviewed, the calculation of throughput, the percentage share of each language and the number of duplicate rows. It also helps in finding details regarding users, such as user engagement, user growth, weekly retention and email engagement.

CONCLUSIONS
The project helped me in the following ways:
 I got an opportunity to analyse and handle large datasets and observe their trends and metrics.
 Observing important insights gave me a clear understanding of advanced SQL.
 I gained an understanding of data models and the structure behind any data.
 The project helped me learn the skill of extracting valuable insights from large datasets while polishing my advanced SQL skills.
3) HIRING PROCESS ANALYTICS PROJECT PROFILE
The project aims at visualizing and extracting important insights from the raw data provided, which helps in improving the overall hiring process of an organization. It involves using statistics and advanced Excel skills to get important insights, enabling the organization to improve its hiring process.

APPROACH
The major approach of the project is to gain important insights using Excel to find underlying trends in the hiring process, resulting in optimal growth of the company. It also includes finding missing data, summarizing data using statistics, visualizing data using charts, predicting outliers and more. The tech stack I used is:
 Microsoft Excel 2019: a spreadsheet editor that helps operate on a huge dataset to get useful insights and visualize them through charts; it allows me to store, format, analyze and process my dataset in a quick and efficient way.
 Microsoft Word 2019: word processing software that is an effective and user-friendly tool for editing and formatting text. It enables me to prepare the final reports from the insights obtained.
TASKS UNDERTAKEN
1. Hiring analysis (task) – determine the gender distribution of hires. How many males and females have been hired by the company?
Required output – the numbers of males and females hired by the company are as follows:

event name  no. of candidates
male        2563
female      1856

[Column chart: male vs. female hires]
Formula used – For males: =COUNTIFS(D:D,"male",C:C,"hired")


For females: =COUNTIFS(D:D,"female",C:C,"hired")
2. Salary analysis (task): what is the average salary offered by this company? Use Excel functions to calculate this.

Required output – the average salary offered is 49983.02902.

Formula used – =SUM(G:G)/COUNT(A:A)
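Note: =AVERAGE(G:G) would give the same figure more directly, provided every row has both an entry in column A and an offered salary in column G (column references assumed from the formula above).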


3. Salary distribution (task): create class intervals for the salaries in the company. This will help you understand the salary distribution.
Required output –
 The pivot table given below represents the salary distribution for all candidates:
Row Labels      Frequency (salary)
1-10000 678
10001-20000 732
20001-30000 711
30001-40000 710
40001-50000 781
50001-60000 750
60001-70000 698
70001-80000 734
80001-90000 711
90001-100000 659
190001-200000 1
290001-300000 1
390001-400000 1
Grand Total 7167

 The pivot table given below represents the salary distribution for hired candidates:

Row Labels Count of Offered Salary


1-10000 439
10001-20000 489
20001-30000 457
30001-40000 486
40001-50000 527
50001-60000 494
60001-70000 450
70001-80000 479
80001-90000 459
90001-100000 414
190001-200000 1
290001-300000 1
390001-400000 1
Grand Total 4697
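
The same frequency counts can also be produced without a pivot table using Excel's FREQUENCY function; a sketch, assuming the offered salaries sit in column G and the upper class boundaries (10000, 20000, …) are typed into I2:I11: select the output cells and enter =FREQUENCY(G:G,I2:I11) as an array formula (Ctrl+Shift+Enter in Excel 2019).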

4. Departmental analysis (task): use a pie chart, bar graph, or any other suitable visualization to show the proportion of people working in different departments.
Required output – the proportions of people working in different departments are as follows:

[Pie chart: proportion of employees working in each department]

Pivot table used to create the chart:

Row Labels                    Proportion of dept.
Finance Department 3.75%
General Management 2.41%
Human Resource Department 1.49%
Marketing Department 4.30%
Operations Department 39.24%
Production Department 5.24%
Purchase Department 4.90%
Sales Department 10.33%
Service Department 28.36%
Grand Total 100.00%

5. Position tier analysis (task): use a chart or graph to represent the different position tiers within the company. This will help you understand the distribution of positions across different tiers.
Required output – given below is the distribution of the different position tiers within the company.

[Column chart: count of postings per position tier]
Tiers:  b9, c10, c5, c8, c9, i1, i4, i5, i6, i7, m6, n6
Totals: 308, 105, 1182, 193, 1239, 151, 32, 511, 337, 635, 2, 2

Pivot table used to create the chart:

Row Labels   Count of Post Name
b9 308
c10 105
c5 1182
c8 193
c9 1239
i1 151
i4 32
i5 511
i6 337
i7 635
m6 2
n6 2
Grand Total 4697

ANALYSIS
A few of the conclusions I drew are as follows. These insights help in reviewing job requirements and examining hiring efficiency:
 The number of males hired by the company is greater than the number of females.
 The most common position tier in the company is c9, followed by c5.
 The largest number of people work in the Operations department, followed by the Service department.
 The average salary offered by the company is approximately 50000.

CONCLUSIONS

 Through the project I got an opportunity to visualize and operate on a huge dataset using Excel and statistics and to draw meaningful conclusions from it.
 This project was helpful in brushing up my skills in advanced Excel and data visualization.
 It is useful for drawing valuable insights such as the salary distribution, overall salary, number of interviews taken etc., which helps improve the overall ROI of an organisation.
4) IMDB MOVIE ANALYSIS PROJECT PROFILE
The project aims at predicting the factors influencing the overall success of a movie, such as finding trends in movie duration, comparing budgets and profits, language trends etc. This type of analysis proves helpful for investors, directors and producers, so that they can improve their performance in future, resulting in an increased global audience; extracting meaningful insights from the IMDB dataset thus proves very fruitful for internal stakeholders.
APPROACH
The main approaches in this project are as follows:
o Cleaning the obtained dataset by removing duplicate values, finding and removing blank values, handling outliers, dropping unnecessary columns etc.
o Gaining meaningful insights from the preprocessed dataset using advanced Excel functions and pivot tables wherever necessary, sorting and filtering the values as per the requirements of the question, and visualizing and presenting the obtained insights using Excel charts. The tech stack used in this project is:
 MS Excel 2019 --> an effective data analytics and visualization tool which eases mathematical and statistical calculations and also helps present and visualize the insights obtained from huge datasets through various types of charts and graphs.
 MS Word 2019 --> user-friendly word processing software used for preparing the reports obtained after extracting useful insights from large datasets. It makes it easier to write, edit and store these reports efficiently.
TASKS UNDERTAKEN
A. Movie genre analysis --> Task: determine the most common genres of movies in the dataset. Then, for each genre, calculate descriptive statistics (mean, median, mode, range, variance, standard deviation) of the IMDB scores.

Required output –
Given below is the list of all genres along with their descriptive statistics. From the chart it can be concluded that the most common genres are comedy, action and drama, followed by adventure, crime, biography and horror.

Genres        Count of genres
Action        933
Adventure     365
Animation     45
Biography     205
Comedy        1003
Crime         249
Documentary   36
Drama         658
Family        3
Fantasy       35
Horror        155
Musical       2
Mystery       22
Romance       1
Sci-Fi        7
Thriller      2
Western       2

Genre statistics (over the counts above)
Mean                 219
Median               36
Mode                 2
Standard deviation   331.7097753
Variance             103558.9412
Maximum              1003
Minimum              1
Range                1002

[Column chart: count of movies per genre]

Functions used – =AVERAGE(Q5:Q21)
=MEDIAN(Q5:Q21)
=MODE(Q5:Q21)
=STDEV(Q5:Q21)
=VAR.P(Q5:Q21)

B. Movie duration analysis --> Task: analyze the distribution of movie durations and identify the relationship between movie duration and IMDB score.

Required output –
From the scatter plot it is clear that the trendline between movie duration and IMDB score has a positive slope, which means that movies with longer durations tend to get higher IMDB scores.
Descriptive statistics of movie duration are as follows:
Mean         Median  Mode  Standard deviation
110.2634972  106     101   22.67832498

[Scatter plot: IMDB score vs. movie duration, with trendline]
C. Language analysis --> Task: determine the most common languages used in movies and analyze their impact on the IMDB score using descriptive statistics.

Required output –
From the insight below, showing the most common languages along with their mean and median scores, it is clear that the most common language is English, followed by French, Spanish, Mandarin, Japanese and German.

Row Labels   Count of language   Mean   Median
English 3566 6.427509815 6.6
French 34 7.355882353 6.6
Spanish 23 7.082608696 6.6
Mandarin 14 7.021428571 6.6
Japanese 10 7.66 6.6
German 10 7.77 6.6
Cantonese 7 7.342857143 6.6
Italian 7 7.185714286 6.6
Hindi 5 7.22 6.6
Portuguese 5 7.76 6.6

Functions used – =AVERAGEIF(H:H,O52,J:J)


=MEDIAN(H:H,O52,J:J)
=VAR(H:H,O52,J:J)
=STDEV(H:H,O52,J:J)
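
A caveat: unlike AVERAGEIF, the MEDIAN, VAR and STDEV functions do not accept criteria arguments, so per-language values need array formulas. A sketch, assuming languages in column H, IMDB scores in column J and the language name in O52: =MEDIAN(IF(H:H=O52,J:J)) entered with Ctrl+Shift+Enter (the same pattern works for VAR and STDEV).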

D. Director analysis --> Task: identify the top directors based on their average IMDB score and analyze their contribution to the success of movies using percentile calculations.
Required output – given below is an insight into the top directors on the basis of their average scores, along with percentiles.
Row Labels           Average of imdb_score   PERCENTILE
Akira Kurosawa 8.7 0.937
Tony Kaye 8.6 0.812
Charles Chaplin 8.6 0.812

Alfred Hitchcock 8.5 0.562


Majid Majidi 8.5 0.562
Ron Fricke 8.5 0.562
Damien Chazelle 8.5 0.562
Sergio Leone 8.433333333 0.5
Christopher Nolan 8.425 0.437
Asghar Farhadi 8.4 0.312
Richard Marquand 8.4 0.312
Lenny Abrahamson 8.3 0.062
Lee Unkrich 8.3 0.062
Billy Wilder 8.3 0.062
E. Budget analysis --> Task: analyze the correlation between movie budgets and gross earnings, and identify the movies with the highest profit margin.

Required output –
The correlation coefficient is 0.098318102, which means there is a weak relationship between movie budgets and their respective gross earnings.
Following is a list of movies with their profit values; from the list it is clear that the movie with the highest profit is Avatar, followed by the others.

Function used: =CORREL(C:C,B:B)

movie_title PROFIT budget


Avatar 523505847 237000000
Jurassic World 502177271 150000000
Titanic 458672302 200000000
Star Wars: Episode IV - A New Hope 449935665 11000000
E.T. the Extra-Terrestrial 424449459 10500000
The Avengers 403279547 220000000
The Lion King 377783777 45000000
Star Wars: Episode I - The Phantom Menace 359544677 115000000
The Dark Knight 348316061 185000000
The Hunger Games 329999255 78000000
Deadpool 305024263 58000000
The Hunger Games: Catching Fire 294645577 130000000
Jurassic Park 293784000 63000000
Despicable Me 2 292049635 76000000
[Column chart: profit and budget per movie]

ANALYSIS
From the gained insights I have analyzed that:
 The most common genre of movie is comedy, followed by action and drama.
 There is a very weak relationship between movie budgets and gross earnings.
 The top director in the film industry is Akira Kurosawa, followed by Tony Kaye and Charles Chaplin.
 The most common language of movies is English, followed by French and Spanish.

CONCLUSIONS
The project is beneficial in:
 Understanding the applications of advanced Excel mathematical and statistical functions so as to do large calculations in a time-efficient manner.
 Learning ways to visualize and present data using graphs and charts, making the insights clearly visible and easy to understand for clients and internal stakeholders.
 Understanding data models and the structure behind any data.
 Revising the concepts of handling large amounts of data and doing its analysis and visualization.
5) BANK LOAN CASE STUDY PROJECT PROFILE
The major objective of the project is to do exploratory data analysis of bank loans, to save banks from the risk of financial crisis. The project deals with finding the key factors behind loan default in order to make better decisions about loan approval in future. This information is useful for banks to make informed decisions, i.e., whether they should give a loan to a particular client or not, by how much they should reduce the amount while giving loans, whether to apply higher interest rates when lending to risky or defaulting candidates, and how to prevent the rejection of deserving candidates.
APPROACH
The important approaches I followed while doing this are as follows:
 Understanding the distribution of loan data in the previous application dataset and the current application dataset.
 Cleaning and handling missing values through conditional formatting and Excel formulas, dropping irrelevant columns, and imputing missing columns with statistical values.
 Doing analysis and operations on the given tasks using Excel functions, inbuilt charts, pivot tables and conditional formatting wherever necessary, and extracting meaningful insights from them.
 Visualizing the insights using charts and finally collecting the necessary conclusions drawn after analysing the charts.
The tech stack I used for this purpose is:
MS Excel 2019: spreadsheet software which eases mathematical and statistical calculations and also helps present and visualize the insights obtained from huge datasets through various types of charts and graphs; apart from this, it makes calculation easier through pivot tables, autofill, autosum, data analysis etc.
MS Word 2019: word processing software used for preparing the reports from the insights gained from large datasets. It makes it easier to write, edit and store these reports efficiently. It has many time-saving features such as autocorrect, thesaurus tools, find/replace and many more.
TASKS UNDERTAKEN
A. Identify missing data and deal with it appropriately (task): identify the missing data in the dataset and decide on an appropriate method to deal with it using Excel's built-in functions and features.

Required output –
Application data file:

1- The proportion of null values is computed by finding the total count of each column with the COUNTA formula; the null-value percentage is then calculated from that total count.
2- 51 columns with a null-value percentage > 30% were identified. These columns were deleted to make the dataset easier to handle. Some unnecessary columns, such as the mobile flag, were also deleted because they contain the value 0 in every cell except one.
3- Some irrelevant columns, such as the document flags and the external source scores (EXT_SOURCE_2, EXT_SOURCE_3) etc., were also deleted as they are not needed in further analysis.
4- Imputation of missing values was done by filling them with median values, to prevent uneven scattering of the data.

Previous application data file:

1- Deleted 11 columns with null values > 30%, and some irrelevant columns such as HOUR_APPR_PROCESS_START, NFLAG_LAST_APPL_IN_DAY etc.
2- The rest of the process followed is the same as for the application dataset; after these steps the cleaned file is ready for further analysis.

B. Identify outliers in the dataset:

Task: detect and identify outliers in the dataset using Excel statistical functions and features, focusing on numerical variables.

Required output –

1- Used Excel quartile functions to calculate outliers, and analysed outliers for the application dataset and the previous application dataset using box charts. For example, outliers for AMT_GOODS_PRICE are calculated as follows:

Analysis     AMT_GOODS_PRICE   Formula used
quartile 1   7189.74           =QUARTILE.INC(E4:E50003,1)
median       10879.92          =MEDIAN(E4:E50003)
quartile 3   16256.16          =QUARTILE.INC(E4:E50003,3)
IQR          9066.42           =E50009-E50007
lower limit  -6409.89          =E50007-(1.5*E50010)
upper limit  29855.79          =E50009+(1.5*E50010)

2- Outliers for the other columns are calculated in the same way. It can be clearly observed from the charts that columns such as AMT_GOODS_PRICE, AMT_APPLICATION and AMT_ANNUITY have much larger outliers than the other columns.

[Box charts, previous application data: AMT_ANNUITY, AMT_GOODS_PRICE, DAYS_DECISION]

[Box charts, application data: DAYS_LAST_PHONE_CHANGE, DAYS_BIRTH, AMT_ANNUITY, AMT_INCOME_TOTAL]
C. Analyze data imbalance:

Task: determine if there is data imbalance in the loan application dataset and calculate the ratio of data imbalance using Excel functions.

Required output –

1- To understand the distribution of data imbalance, I used the COUNTIF function to calculate the ratio between candidates who repay their loans (target 0, paid) and those who default (target 1, unpaid):

TARGET   Proportion (%)   Formula
paid     91.94784         =COUNTIF('application dat'!C4:C50003,0)
unpaid   8.052161         =COUNTIF('application dat'!C5:C50004,1)
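
As written, COUNTIF returns raw counts; the percentages shown follow by dividing each count by the total number of rows, e.g. (a sketch, assuming TARGET sits in column C) =COUNTIF(C:C,0)/COUNT(C:C)*100 for the paid share.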

2- The bar chart below clearly depicts that the ratio between loan payers and defaulters is about 23:2, i.e., there is a huge imbalance in the loan applications and the distribution of classes in the dataset is skewed.

[Bar chart: proportion of paid vs. unpaid applications]
D. Perform univariate, segmented univariate, and bivariate analysis:
Task: perform univariate analysis to understand the distribution of individual variables, segmented univariate analysis to compare variable distributions across different scenarios, and bivariate analysis to explore relationships between variables and the target variable, using Excel functions and features.

Required output –

1- In univariate analysis a single variable is taken into consideration, while in bivariate analysis we have to manage, compare and relate two variables at the same time.
2- It can be clearly observed from the univariate analysis that:
 Amount income distribution: most of the candidates (45532) have an average income of 25000-2700000, while only one candidate has an income above 11 crores.
 Amount credit distribution: the number of candidates receiving loans between 45 thousand and 54 lakhs is the highest, i.e., 27105, while only 2 candidates received a loan above 40 lakhs.
 Family status: married candidates take the most loans compared to single ones, and the lowest amount of loan is taken by widows.
 Age: individuals in the 40-50 age group take more loans than people of other age groups, and the lowest number of loans is taken by individuals aged above 70.
 Conclusions for the other columns are drawn in the same way (charts referenced below).

[Column charts: univariate distributions of AMT_ANNUITY, income type, AMT_CREDIT, AMT_INCOME_TOTAL and age group]
E. Identify top correlations for different scenarios:
Task: segment the dataset based on different scenarios (e.g., clients with payment difficulties and all other cases) and identify the top correlations for each segment using Excel functions.

Required output –

1. I followed this procedure to find the correlations between the various columns of the dataset: segregate the tables into target 1 and target 0 using sorting; use the CORREL function to determine the relationship between two fields; use the MIN function to get the minimum value from the results.
FORMULA USED:

=CORREL(B:B,C:C)

2. A correlation value close to 1 shows a strong correlation, whereas a correlation value close to 0 shows a weak correlation. These values help identify relationships between the target and the other factors, so as to determine predictors of loan default.
3. The top correlations for both segments, loan payers and defaulters, are shown below, and the following conclusions can be drawn:
 It is clear from the defaulters' correlation table that AMT_ANNUITY has the closest correlation with DAYS_BIRTH, with a correlation coefficient of 0.986944, followed by the goods price with a correlation coefficient of 0.769499.
 Similarly, it can be seen from the payers' correlation table that AMT_CREDIT has the closest correlation with GOODS_PRICE, with a correlation coefficient of 0.986944.
 The second largest correlation in the payers' table is AMT_ANNUITY with GOODS_PRICE, with a correlation coefficient of 0.774434.
Defaulters (target 1) – lower-triangular correlation matrix; columns follow the same variable order as the rows:

AMT_INCOME_TOTAL:     1
AMT_CREDIT:           0.010894, 1
AMT_ANNUITY:          -0.03243, 0.069316, 1
AMT_GOODS_PRICE:      -0.0124, 0.083009, 0.769499, 1
abs days birth:       -0.04131, 0.069886, 0.986944, 0.774434, 1
abs DAYS_EMPLOYED:    -0.07679, -0.016, 0.059343, -0.00771, 0.0576107, 1
abs days reg.:        -0.04247, -0.03151, -0.06774, -0.10871, -0.06505949, 0.62172831, 1
abs days publish:     -0.04234, -0.00995, -0.00345, -0.03322, -0.00610104, 0.333632509, 0.209172, 1
CNT_FAM_MEMBERS:      -0.04693, -0.003519, 0.01222, -0.006721, 0.01396776, 0.27082514, 0.272767, 0.104299, 1
REGION_RATING_CLIENT: 0.012992, 0.011227, 0.063998, 0.077379, 0.06162435, -0.27724625, -0.23076, -0.17011, 0.026078, 1
CNT_CHILDREN:         0.06613, -0.03819, -0.10051, -0.1258, -0.10372243, -0.0167792, 0.209172, -0.08752, 0.002307, 0.025985, 1

Payers (target 0) – lower-triangular correlation matrix; columns follow the same variable order as the rows:

AMT_INCOME_TOTAL:     1
AMT_CREDIT:           0.069316, 1
AMT_ANNUITY:          0.083009, 0.769499, 1
AMT_GOODS_PRICE:      0.069886, 0.986944, 0.774434, 1
abs days birth:       -0.016, 0.059343, -0.00771, 0.057611, 1
abs DAYS_EMPLOYED:    -0.03151, -0.06774, -0.10871, -0.06506, 0.62172831, 1
abs days reg.:        -0.00995, -0.00345, -0.03322, -0.0061, 0.33363251, 0.20917213, 1
abs days publish:     -0.00351, 0.01222, -0.00672, 0.013969, 0.27082514, 0.27276667, 0.104299, 1
CNT_FAM_MEMBERS:      0.011227, 0.063998, 0.077379, 0.061624, -0.27724625, -0.23076292, -0.17011, 0.026078, 1
REGION_RATING_CLIENT: -0.03819, -0.10051, -0.1258, -0.10372, -0.0167792, 0.03455865, -0.08752, 0.002307, 0.025985, 1
CNT_CHILDREN:         0.009589, 0.004972, 0.026179, 0.000253, -0.32926375, -0.24153956, 0.104299, 0.032116, 0.880454, 0.025914, 1

Column-wise minimum: -0.03819, -0.10051, -0.1258, -0.10372, -0.32926375, -0.24153956, -0.17011, 0.002307, 0.025985, 0.025914, 1

ANALYSIS
 There is a huge imbalance in the loan applications, and the distribution of classes is skewed.
 For defaulters, AMT_ANNUITY has the closest correlation with DAYS_BIRTH, with a correlation coefficient of 0.986944, followed by the goods price with a correlation coefficient of 0.769499.
 Married candidates take the most loans compared to single ones, and the lowest amount of loan is taken by widows.
 Individuals in the 40-50 age group take more loans than people of other age groups.
CONCLUSIONS

 This project provided an opportunity to utilise my skills in data analysis, data modelling and visualization, giving me real-life experience of handling and examining such a huge dataset.
 The project proved very useful in polishing the skills of descriptive analysis and in understanding many important components of loan applications.
 It was useful for learning the applications of different charts in different places, such as box plots for handling outliers and column or bar charts for comparing different variables.
 It gave me the opportunity to learn the time-saving features of pivot tables and pivot charts and to extract meaningful insights from them.
6) IMPACT OF CAR FEATURES PROJECT PROFILE
The project aims at finding the effect of a car's qualities on its price, so as to analyse the trends between car features and price. This helps investors, manufacturers and internal stakeholders make informed decisions, so that they can improve the features of their cars for future growth. The task involves comparing various features of a car, such as fuel type, market category, vehicle size, vehicle style and popularity, and noticing their trends, which in turn proves beneficial for the overall success of the automotive industry.
APPROACH

The main approaches I followed while working on this project are as follows:

 Preprocessing the raw dataset, which includes finding the blank-value percentage using formulae such as COUNTBLANK and COUNTA, removing rows with blank values in text columns, imputing median values in numerical columns, removing duplicates etc.
 Finding insights using pivot tables, slicers, predefined charts, inbuilt Excel functions etc., and presenting the reports through interactive dashboards.
 Finding correlations between the different factors affecting profit and price using regression analysis, and using statistical methods to find the quantitative distribution of different car features.
 Extracting meaningful insights from the above findings and presenting them in readable form through visualization charts such as scatter plots, bar charts and pie charts. The tech stack used for this purpose is:
MS Excel 2019: a spreadsheet program used to record data in a tabulated way, which makes it easier to analyse a huge dataset.
MS Word 2019: a word processor used to write, edit and save various types of documents. Its main features that prove useful for this project are font formatting, page layout options, spell check, headers, footers, cut, copy, paste, alignment etc.
TASKS UNDERTAKEN
Insight required: How does the popularity of a car model vary across different market categories?
 Task 1.A: create a pivot table that shows the number of car models in each market category and their corresponding popularity scores.
 Task 1.B: create a combo chart that visualizes the relationship between market category and popularity.

Task 1.A (insights):

 The number of car models in a few categories, with their corresponding popularity scores, is given in the pivot table below.
 The pivot table is created by selecting the make and popularity columns, setting the value property through the value field settings, and clicking OK.

Row Labels Count of Model Sum of Popularity


Crossover 1110 1715242

Crossover,Diesel 7 6111

Crossover,Exotic,Luxury,High-Performance 1 238

Crossover,Exotic,Luxury,Performance 1 238

Crossover,Factory Tuner,Luxury,High-Performance 26 47410

Crossover,Factory Tuner,Luxury,Performance 5 13037

Crossover,Factory Tuner,Performance 4 840

Crossover,Flex Fuel 64 132720

Crossover,Flex Fuel,Luxury 10 11732

Crossover,Flex Fuel,Luxury,Performance 6 9744

Crossover,Flex Fuel,Performance 6 33942

Crossover,Hatchback 72 120650

Crossover,Hatchback,Factory Tuner,Performance 6 12054


Crossover,Hatchback,Luxury 7 1428

Crossover,Hatchback,Performance 6 12054

Crossover,Hybrid 42 107662

Crossover,Luxury 410 362665

Crossover,Luxury,Diesel 34 73080

Crossover,Luxury,High-Performance 9 9335

Crossover,Luxury,Hybrid 24 15142

Crossover,Luxury,Performance 113 151968

 Task 1.B (insights):

 Below is the combo chart showing the relationship between market category and popularity.
 After analysing the chart using a slicer, it can be observed that the highest sum of popularity scores is for "N/A", followed by "Crossover", "Flex Fuel" and "Performance".

[Combo chart: count of models (columns) and sum of popularity (line) per market category]

Insight required: What is the relationship between a car's engine power and its price?
Task 2: create a scatter chart that plots engine power on the x-axis and price on the y-axis. Add a trendline to the chart to visualize the relationship between these variables.

Task 2 (insights):
 The scatter chart with a trendline between engine power and MSRP is shown below.
 From the trendline it can be observed that price deviates positively with engine power, as the slope is inclined slightly upwards; this means that as the power of a car's engine increases, its price also increases.

[Scatter chart: price vs. engine power, with trendline]
Insight required: Which car features are most important in determining a car's price?
Task 3: use regression analysis to identify the variables that have the strongest relationship with a car's price. Then create a bar chart that shows the coefficient values for each variable to visualize their relative importance.

Task 3 (insights):
 Regression analysis is used to find the correlation between the various features of a car and its MSRP; this involves the following steps: select the range of data >> Data >> Data Analysis >> Regression.
 From the coefficient data, and after analysing the bar chart, it is clear that the price of a car is most strongly associated with "Engine Cylinders", followed by "highway MPG".

Factors           Coefficients
Year              -25.8990629
Engine HP         321.892799
Engine Cylinders  6237.37657
highway MPG       753.131581
city mpg          367.199653
Popularity        -3.11493774

[Bar chart: car features vs. price coefficients]
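
The same coefficients can also be obtained in one step with Excel's LINEST array function; a sketch, assuming the predictor columns lie in one contiguous range and MSRP in another (actual ranges depend on the sheet layout): select a one-row output range and enter =LINEST(msrp_range, features_range) as an array formula. Note that LINEST returns the coefficients in reverse column order, with the intercept last.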

Insight required: How does the average price of a car vary across different manufacturers?
● Task 4.A: create a pivot table that shows the average price of cars for each manufacturer.
● Task 4.B: create a bar chart or a horizontal stacked bar chart that visualizes the relationship between manufacturer and average price.

Task 4.A (insights):
The pivot table of average MSRP per manufacturer is as follows:

Row Labels       Average of MSRP
Acura            34887.5873
Alfa Romeo       61600
Aston Martin     197910.3763
Audi             53452.1128
Bentley          247169.3243
BMW              61546.76347
Bugatti          1757223.667
Buick            28206.61224
Cadillac         56231.31738
Chevrolet        28350.38557
Chrysler         26722.96257
Dodge            22390.05911
Ferrari          238218.8406
FIAT             22670.24194
Ford             27399.26674
Genesis          46616.66667
GMC              30493.29903
Honda            26674.34076
HUMMER           36464.41176
Hyundai          24597.0363
Infiniti         42394.21212
Kia              25310.17316
Lamborghini      331567.3077
Land Rover       67823.21678
Lexus            47549.06931
Lincoln          42839.82927
Lotus            69188.27586
Maserati         114207.7069
Maybach          546221.875
Mazda            20039.38298
McLaren          239805
Mercedes-Benz    71476.22946
Mitsubishi       21240.53521
Nissan           28583.4319
Oldsmobile       11542.54
Plymouth         3122.902439
Pontiac Porsche  19321.54839
Rolls-Royce      101622.3971
Saab             351130.6452
Scion            27413.5045
Spyker           19932.5
Subaru           213323.3333
Suzuki           24827.50391
Tesla            17900.9569
Toyota           85255.55556
Volkswagen       29030.01609
Volvo            28102.38072
(blank)          28541.16014
Grand Total      40596.86031
Task 4.B (insights):

 After analysing the chart using a slicer, it is clear that the highest average MSRP is for "Bugatti", followed by "Lamborghini" and "Maybach".

[Bar chart: average price vs. manufacturer]
Insight required: What is the relationship between fuel efficiency and the number of cylinders in a car's engine?
 Task 5.A: create a scatter plot with the number of cylinders on the x-axis and highway MPG on the y-axis. Then create a trendline on the scatter plot to visually estimate the slope of the relationship and assess its significance.
 Task 5.B: calculate the correlation coefficient between the number of cylinders and highway MPG to quantify the strength and direction of the relationship.

Required output –
 Task 5.A (insights):
 The slope between the number of cylinders and highway MPG shows a negative deviation, which means fuel efficiency keeps decreasing as the number of cylinders increases, and vice versa.

[Scatter plot: highway MPG vs. number of cylinders, with trendline]

 Task 5.B (insights):
 The correlation coefficient between the number of cylinders and highway MPG is -0.599665331.
 The value is negative, indicating a moderate inverse relationship between the number of cylinders and fuel efficiency. This can also be observed from the trendline above.
 Formula used: =CORREL('clean dataset'!G:G,'clean dataset'!N:N)


Building the dashboard:

Task 1: How does the distribution of car prices vary by brand and body style?
Required output –
 Task 1 (insights):
 After analysing the chart with the help of a slicer, it is clear that among body styles, "Sedan" has the highest price total, followed by "4dr SUV".
 Among brands, Chevrolet has the highest value, followed by "Mercedes-Benz".

[Stacked chart: price distribution by brand and body style (Wagon, Sedan, Regular Cab Pickup, Passenger Van, Passenger Minivan, Extended Cab Pickup, Coupe, Convertible, 4dr SUV, Convertible SUV, Cargo Van)]

Task 2: Which car brands have the highest and lowest average MSRPs, and how does this vary by body style?

Required output –

 Task 2 (insights):

 After analysing the chart with slicers, it can be said that, with respect to car brands, "Bugatti" in the "Coupe" body style has the highest average MSRP, followed by "Lamborghini" and "Maybach".
 "Suzuki" in the "4dr SUV" body style has the lowest average MSRP.

[Bar chart: average MSRP by brand and body style]

Task 3: How do different features such as transmission type affect the MSRP, and how does this vary by body style?
Required output –
 Task 3 (insights):
● From the chart it can be clearly seen that "automated manual" with the Coupe style has the highest average MSRP value of 99508.37061, followed by "direct drive" with the Sedan style, with an average MSRP of 47351.25.

[Chart: average MSRP by transmission type (automatic, direct drive, manual, automated manual, unknown) and body style]

Task 4: How does the fuel efficiency of cars vary across different body styles and model years?
Required output –
 Task 4 (insights):
● From the chart it can be seen that fuel efficiency increases as time passes, which reflects technological improvement over time; it can also be expected that fuel efficiency will keep increasing in future.

[Chart: fuel efficiency by model year and body style]

Task 5: How do a car's horsepower, MPG, and price vary across different brands?
Required output –
 Task 5 (insights):
 It is clear from the chart that there is a negative deviation between highway MPG and the price of a car, which is why "Bugatti", with the highest MSRP, has a very low highway MPG value.
 There is a direct correlation between engine horsepower and MSRP value.

[Chart: brand variation of engine power, highway MPG and MSRP]

ANALYSIS
 Fuel efficiency increases as time passes, which reflects technological improvement over time.
 "Automated manual" with the Coupe style has the highest average MSRP value.
 There is a direct relationship between engine power and MSRP values.
 With respect to car brands, "Bugatti" in the "Coupe" style has the highest MSRP, followed by "Lamborghini" and "Maybach", while "Suzuki" in the "4dr SUV" body style has the lowest average MSRP.
CONCLUSIONS

 I got a thorough understanding of different types of charts, their applications and their versatility; through the project I got a chance to test my visualization as well as presentation skills.
 This project gave me an overview of handling datasets of large quantity in a shorter span of time.
 Through the project I got an opportunity to brush up my skills in data cleaning, conditional formatting, handling outliers, data preprocessing etc.
7) ABC CALL VOLUME TREND PROJECT PROFILE
The major objective of the project is to ease customer support in order to attract, involve and delight consumers, so as to make them consistent customers of the company, which proves financially beneficial for the organisation and helps increase the company's overall sales. In this project we have to analyse the data of the calls received and also ensure the quality of the data.
APPROACH
The approach followed in this project is given below:
 Understanding the whole dataset, finding blank values, checking for and removing duplicates (if any), identifying the appropriate columns, and applying formatting techniques as per the requirements.
 Imputing the blank cells in the "wrapped by" column with the mode value (i.e., "agent") using conditional formatting.
 Using pivot tables and filters wherever necessary to find the customer call data according to different time distributions, and applying mathematical and analytical skills to find the details asked for in the task.
 Representing the gained insights with the help of charts, graphs or tables as per the requirements. Overall, the project is a bit challenging in applying numerical-ability skills.
The tech stack I used is MS Excel for analysis and MS Word for presentation.

TASKS UNDERTAKEN
1. Average call duration (task): What is the average duration of calls for each time bucket?

INSIGHTS
 Total average call duration: 198.62 seconds. This value is obtained by selecting the time-bucket and call-duration (seconds) columns and inserting a pivot table accordingly.
 The average call duration for each bucket is given in tabular form below:
Row Labels Average of Call_Seconds (s)
10_11 203.3310302
11_12 199.2550234
12_13 192.8887829
13_14 194.7401744
14_15 193.6770755
15_16 198.8889175
16_17 200.8681864
17_18 200.2487831
18_19 202.5509677
19_20 203.4060725
20_21 202.845993
9_10 199.0691057
Grand Total 198.6227745
[Pie chart: average call seconds per time bucket]

2. Call volume analysis (task): Can you create a chart or graph that shows the number of calls received in each time bucket?

INSIGHTS: the total number of calls received in each bucket is given in tabular as well as graphical form:
Row Labels   Count of Customer_Phone_No   % of calls
0 1 0.00%
10_11 13313 11.28%
11_12 14626 12.40%
12_13 12652 10.72%
13_14 11561 9.80%
14_15 10561 8.95%
15_16 9159 7.76%
16_17 8788 7.45%
17_18 8534 7.23%
18_19 7238 6.13%
19_20 6463 5.48%
20_21 5505 4.67%
9_10 9588 8.13%
(blank) 0.00%
Grand Total 117989 100.00%

[Combo chart: count and percentage of calls per time bucket]
3. Manpower planning (task): What is the minimum number of agents required in each time bucket to reduce the abandon rate to 10%?

INSIGHTS

 Minimum number of agents required in each time bucket to reduce the abandon rate to 10%: 57.

 The technique used to find the above value is as follows:

abandoned/day  answered/day  transferred/day  total calls/day  abandon %
300.4782609    276.8695652   1.47826087       578.826087       52%
262.0869565    372.173913    1.652173913      635.9130435      41%
133.6086957    410.0869565   6.391304348      550.0869565      24%
113.7826087    383.8695652   5                502.6521739      23%
107.6086957    346.6956522   4.869565217      459.173913       23%
52.7826087     337.3913043   8.043478261      398.2173913      13%
32.47826087    341.3913043   8.217391304      382.0869565      9%
34.04347826    330.4782609   6.52173913       371.0434783      9%
40.56521739    269.5652174   4.565217391      314.6956522      13%
80.34782609    199.0434783   1.608695652      281              29%
114.1304348    124.7826087   0.434782609      239.3478261      48%
223.8695652    192.5217391   0.47826087       416.8695652      54%

average calls/day        1495.78  3584.87  49.26  5129.96
% of average calls/day   29%      70%      1%     100%

average duration of answered calls (in sec)        198.62
hours of talk time needed to answer 90% of calls   254.7281638

no of agents needed 57
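
A plausible reconstruction of this calculation, assuming each agent spends about 4.5 hours per day actually talking on calls (e.g., a 9-hour shift at roughly 60% call-time occupancy): answering 90% of 5129.96 calls/day ≈ 4617 calls; 4617 × 198.62 s ≈ 917,000 s ≈ 254.73 hours of talk time per day; 254.73 / 4.5 ≈ 57 agents.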

 Improvement suggestions: from the chart below it can be seen clearly that the percentage of abandoned calls is largest in the morning and evening, which is an alarming factor for the call centre; the company should work on reducing this abandoned-call percentage.

[Chart: abandoned-call percentage by hour]

4. Night shift manpower planning (task): Propose a manpower plan for each time bucket throughout the day, keeping the maximum abandon rate at 10%.

INSIGHTS:
The numbers of agents required in the different time labels are as follows:

Time label     9pm-  10pm-  11pm-  12am-  1am-  2am-  3am-  4am-  5am-  6am-  7am-  8am-
               10pm  11pm   12am   1am    2am   3am   4am   5am   6am   7am   8am   9am
No. of agents  2     2      1      1      1     1     1     1     2     2     2     3

The following insights were gained while working on the Excel sheet:

average duration of answered calls (in sec)                     198.62
average calls/day                                               5129.96
average calls/day (night shift)                                 1538.988
hours of talk time needed to answer 90% of calls (night shift)  76.41844914
no of agents needed                                             17

time label  call distribution  time division  agents req. (night)
9_10        3                  10             2
10_11       3                  10             2
11_12       2                  15             1
12_1        2                  15             1
1_2         1                  30             1
2_3         1                  30             1
3_4         1                  30             1
4_5         1                  30             1
5_6         3                  10             2
6_7         4                  7.5            2
7_8         4                  7.5            2
8_9         5                  6              3
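
The per-bucket figures appear to follow from distributing the 17 night-shift agents in proportion to each bucket's share of the call distribution (30 units in total) and rounding up, e.g. 5/30 × 17 ≈ 2.8 → 3 agents for the 8_9 bucket.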

ANALYSIS
 The average call duration is highest for the 19_20 time bucket (203.41 s), closely followed by 10_11 and 20_21.
 The highest number of agents required in the night shift is for the 8am-9am bucket.
 The minimum number of agents required across the time buckets to reduce the abandon rate to 10% is 57.
 The highest number of calls is received in the 11_12 time bucket, followed by 10_11.
CONCLUSIONS
 Through the project I got an opportunity to brush up my skills in predictive analysis, including data mining and statistical analysis, to improve customer experience.
 This project also helped me utilise my numerical-ability skills, especially in manpower planning and night-shift manpower planning.
 It helped me understand the model of a huge dataset, its interpretation and preprocessing, thus improving my overall skills in data modelling and interpretation.
APPENDIX
Google Sheet links for the projects are as follows:

o HIRING PROCESS ANALYTICS PROJECT
https://docs.google.com/spreadsheets/d/1K1rEc2AMMsCJjKgGFn6Zc0fdelspU_eP/edit?usp=sharing&ouid=114188495578863486381&rtpof=true&sd=true
o IMDB MOVIE ANALYSIS PROJECT
https://docs.google.com/spreadsheets/d/1eD49IzKyZ-yjXKLisNKsqImZ-EBOX1Ta/edit?usp=drive_link&ouid=114188495578863486381&rtpof=true&sd=true
o BANK LOAN CASE STUDY PROJECT
https://docs.google.com/spreadsheets/d/1tAjxsEobdG5IVpmZpanSWz5AqSfGPFCj/edit?usp=drive_link&ouid=114188495578863486381&rtpof=true&sd=true
o IMPACT OF CAR FEATURES PROJECT
https://docs.google.com/spreadsheets/d/1NAh5u4SKNwHiTTgTBcJ8dHAw-QcU15Q6/edit?usp=sharing&ouid=114188495578863486381&rtpof=true&sd=true
o ABC CALL VOLUME TREND PROJECT
https://docs.google.com/spreadsheets/d/1eIx4pDRD1i--Jq78OwvBuzeMtVYZKgWg/edit?usp=drive_link&ouid=114188495578863486381&rtpof=true&sd=true
