Problem Statement

Project 04
Data Science with Python — Real World Project
Having studied Data Science in depth, it is now time to apply the knowledge gained in this course
to real-life scenarios. We will provide you with four scenarios in which you need to implement
data science solutions. To perform these tasks, you can use Python libraries such as
NumPy, SciPy, Pandas, scikit-learn, matplotlib, and BeautifulSoup.
Movielens Dataset Analysis
The GroupLens Research Project is a research group in the Department of Computer Science and
Engineering at the University of Minnesota. Its researchers are involved in many research
projects related to information filtering, collaborative filtering, and recommender systems.
Here, we ask you to analyze the MovieLens dataset using Exploratory Data Analysis. In particular, we
want you to apply the tools of machine learning to predict movie ratings.
The details and scope of the project are listed below.
• Data acquisition of the MovieLens dataset
• users dataset
• rating dataset
• movies dataset
• Perform the Exploratory Data Analysis (EDA) for the users dataset
• Visualize user age distribution
• Visualize overall rating by users
• Find and visualize the user rating of the movie “Toy Story”
• Find and visualize the viewership of the movie “Toy Story” by age group
• Find and visualize the top 25 movies by viewership rating
• Find the rating for a particular user of user id = 2696
  o Visualize the rating data by user of user id = 2696
• Perform machine learning on first 500 extracted records
• Use the following features:
  o movie id
  o age
  o occupation
• Use rating as label
• Create train and test data set and perform the following:
• Create a histogram for movie, age, and occupation
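
The tasks listed above can be sketched in pandas roughly as follows. This is only an illustrative outline, not the expected solution: the three small tables are hypothetical stand-ins for the users, rating, and movies datasets, the column names (user_id, movie_id, title, and so on) are assumptions, and "viewership" is interpreted here as the number of ratings a movie received.

```python
import pandas as pd

# Tiny stand-in tables; the column names are assumed, not taken from the dataset files.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [18, 25, 25, 45],
    "occupation": [4, 7, 7, 1],
})
movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Toy Story (1995)", "Heat (1995)"],
})
ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 1, 2],
    "movie_id": [10, 10, 10, 10, 20, 20],
    "rating": [5, 4, 4, 3, 2, 5],
})

# Join ratings to movie titles and user attributes to get one master table.
master = ratings.merge(movies, on="movie_id").merge(users, on="user_id")

# Viewership of one movie by age group: number of ratings per age value.
toy_story = master[master["title"].str.contains("Toy Story", regex=False)]
views_by_age = toy_story.groupby("age").size()

# Top movies by viewership, measured here as the number of ratings received.
top_movies = master.groupby("title").size().sort_values(ascending=False).head(25)

# Features and label for the modelling step, taken from the first 500 rows.
subset = master.head(500)
features = subset[["movie_id", "age", "occupation"]]
label = subset["rating"]

# Simple reproducible 80/20 split that needs no extra dependencies.
train_idx = features.sample(frac=0.8, random_state=0).index
X_train, y_train = features.loc[train_idx], label.loc[train_idx]
X_test, y_test = features.drop(train_idx), label.drop(train_idx)
```

Calling `views_by_age.plot(kind="bar")` or `features.hist()` (both require matplotlib) would give the corresponding charts. A library splitter such as scikit-learn's `train_test_split` could replace the sampling step.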
https://www.kaggle.com/shubhamrastogi1994/movielens-case-study
https://sandipanweb.wordpress.com/2017/12/16/data-science-with-python-exploratory-analysis-with-movie-ratings-and-fraud-detection-with-credit-card-transactions/
https://towardsdatascience.com/analyzing-wine-descriptions-using-the-natural-language-toolkit-in-python-497ac1e228d5
https://github.com/kuhimans/Comcast-Telecom-Consumer-Complaints-Analysis/blob/master/P3.ipynb
getwd()
comcast = read.csv("Comcast Telecom Complaints data .csv",sep = ";")
head(comcast)
View(comcast)
str(comcast)
# Provide the trend chart for the number of complaints at monthly and daily granularity levels.
# Daily granularity
library(dplyr)
barplot(table(comcast$Date), xlab = "date", ylab = "frequency", col = "orange")
# Monthly granularity
comcast$Date = as.character(comcast$Date)
class(comcast$Date)
# Take the second "-"-separated field of the date string as the month
extractmonth = function(dt)
{
  unlist(strsplit(x = dt, split = "-"))[2]
}
unlist(lapply(comcast$Date, extractmonth)) -> comcast$Month
comcast$Month = as.numeric(comcast$Month)
barplot(table(comcast$Month), xlab = "month", ylab = "frequency")
# Provide a table with the frequency of complaint types.

table(comcast$Status)
barplot(table(comcast$Status), xlab = "complaint type", ylab = "frequency",
        col = c("red", "blue", "green", "yellow"))
## Which complaint types are maximum, i.e., around internet, network issues, or across any
## other domains.
# Create a new categorical variable with value as Open and Closed. Open & Pending is to be
# categorized as Open and Closed & Solved is to be categorized as Closed.
class(comcast$Status)
table(comcast$Status)
open = c("Open","Pending")
close = c("Closed","Solved")
comcast$status[comcast$Status %in% close] = "close"
comcast$status[comcast$Status %in% open] = "open"
#Provide state wise status of complaints in a stacked bar chart.

#Use the categorized variable from Q3. Provide insights on:
#create a cross table for the stacked barplot
table(comcast$status,comcast$State)
barplot(table(comcast$status, comcast$State), xlab = "state", ylab = "frequency",
        col = c("green", "red"), legend.text = TRUE)
# Response: Florida and Alabama have the highest number of complaints; however, the number of
# closed complaints is higher than the number of open ones,
# which means that the complaints are getting resolved.
#Which state has the maximum complaints

comcast%>%group_by(State)%>%summarise(n = n())%>%arrange(desc(n))->st
st
# Georgia has the maximum complaints (all the complaints put together, i.e., open, closed,
# pending, solved)
## Which state has the highest percentage of unresolved complaints (open complaints)
comcast %>% group_by(State, status) %>% filter(status == "open") %>% summarise(total = n()) %>%
  arrange(desc(total)) -> t
sum(t$total) -> s
ptage = (t$total / s) * 100  # s = total open complaints (517 for this dataset)
ptage
# Provide the percentage of complaints resolved till date which were received through the
# Internet and customer care calls.
comcast %>% group_by(Received.Via, status) %>% filter(status == "close") %>%
  summarise(tot = n()) -> p
p
sum(p$tot)
ptage = (p$tot / sum(p$tot)) * 100  # sum(p$tot) = total closed complaints (1707 for this dataset)
ptage
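
Since the project itself is meant to be done in Python, the open/closed recoding and the unresolved-percentage step can also be sketched in plain Python. This is a minimal illustration under stated assumptions: the `complaints` rows are made-up stand-ins for the CSV, `simplify` is a hypothetical helper, and any status other than Open or Pending is treated as closed. Unlike the R script above, which divides each state's open count by the overall number of open complaints, this sketch divides by the state's own total, which is one possible reading of "percentage of unresolved complaints".

```python
from collections import Counter

# Hypothetical stand-in rows of (state, status); the real values come from the CSV file.
complaints = [
    ("Georgia", "Open"), ("Georgia", "Solved"), ("Georgia", "Pending"),
    ("Florida", "Closed"), ("Florida", "Solved"), ("Florida", "Open"),
    ("Florida", "Closed"),
]


def simplify(status):
    """Map Open/Pending to 'open'; everything else is assumed to be 'closed'."""
    return "open" if status in {"Open", "Pending"} else "closed"


# Total complaints per state, and open complaints per state.
totals = Counter(state for state, _ in complaints)
open_counts = Counter(
    state for state, status in complaints if simplify(status) == "open"
)

# Share of each state's own complaints that are still unresolved.
pct_open = {state: 100 * open_counts[state] / totals[state] for state in totals}
worst_state = max(pct_open, key=pct_open.get)
```

With the real data, the same counts could be produced with a pandas `groupby` on the state and recoded status columns.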
