
Project 04

Data Science with Python — Real World Project

After learning about Data Science in depth, it is now time to apply the knowledge gained in
this course to real-life scenarios. We will provide you with four scenarios in which you need to implement
data science solutions. To perform these tasks, you can use Python libraries such as
NumPy, SciPy, Pandas, scikit-learn, matplotlib, and BeautifulSoup.

Movielens Dataset Analysis

The GroupLens Research Project is a research group in the Department of Computer Science and
Engineering at the University of Minnesota. The researchers of this group are involved in many research
projects related to the fields of information filtering, collaborative filtering, and recommender systems.
Here, we ask you to analyze the movielens dataset using Exploratory Data Analysis. In particular, we
want you to apply the tools of machine learning to predict a movie's rating from the movie id and the
user's age and occupation.

The details of these projects and the scope of each project are listed in the sections below.

• Data acquisition of the movielens dataset
  o users dataset
  o rating dataset
  o movies dataset
• Perform the Exploratory Data Analysis (EDA) for the users dataset
• Visualize user age distribution
• Visualize overall rating by users
• Find and visualize the user rating of the movie “Toy Story”
• Find and visualize the viewership of the movie “Toy Story” by age group
• Find and visualize the top 25 movies by viewership rating
• Find the rating for a particular user of user id = 2696
  o Visualize the rating data by user of user id = 2696
• Perform machine learning on first 500 extracted records
• Use the following features:
  o movie id
  o age
  o occupation
• Use rating as label
• Create train and test data set and perform the following:
• Create a histogram for movie, age, and occupation
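The checklist above can be sketched in pandas roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the column names (`user_id`, `movie_id`, `rating`, `age`, `occupation`, `title`), the `::`-separated file layout, the helper names (`read_dat`, `merge_all`, and so on), and the tiny in-memory sample rows are all hypothetical and should be adapted to the actual files. "Viewership" is taken here to mean the number of ratings a movie received, and the train/test split uses `DataFrame.sample` rather than a dedicated library routine.

```python
import io
import pandas as pd

# Assumed column layouts; the project brief does not name the columns.
USER_COLS = ["user_id", "gender", "age", "occupation", "zip"]
RATING_COLS = ["user_id", "movie_id", "rating", "timestamp"]
MOVIE_COLS = ["movie_id", "title", "genres"]


def read_dat(source, names):
    """Read a '::'-separated file (assumed layout) into a DataFrame."""
    # A multi-character separator requires the python parsing engine.
    return pd.read_csv(source, sep="::", engine="python", names=names, header=None)


def merge_all(users, ratings, movies):
    """Join ratings with user and movie attributes into one master table."""
    return ratings.merge(users, on="user_id").merge(movies, on="movie_id")


def ratings_for_title(master, title_prefix):
    """Rows for every movie whose title starts with the given text."""
    return master[master["title"].str.startswith(title_prefix)]


def top_movies_by_views(master, n=25):
    """Titles ranked by the number of ratings they received."""
    return master["title"].value_counts().head(n)


def split_first_records(master, n=500, test_frac=0.25, seed=0):
    """Take the first n rows, keep three features plus the label, and split."""
    subset = master.head(n)[["movie_id", "age", "occupation", "rating"]]
    test = subset.sample(frac=test_frac, random_state=seed)
    train = subset.drop(test.index)
    return train, test


# Made-up sample rows standing in for the real files.
users = read_dat(io.StringIO(
    "1::F::1::10::48067\n2::M::56::16::70072\n3::M::25::15::55117\n"), USER_COLS)
movies = read_dat(io.StringIO(
    "1::Toy Story (1995)::Animation\n2::Jumanji (1995)::Adventure\n"), MOVIE_COLS)
ratings = read_dat(io.StringIO(
    "1::1::5::978824268\n2::1::4::978299026\n"
    "3::1::3::978297600\n3::2::4::978297700\n"), RATING_COLS)

master = merge_all(users, ratings, movies)
toy = ratings_for_title(master, "Toy Story")
views_by_age = toy.groupby("age")["rating"].count()  # viewership per age value
top = top_movies_by_views(master)
one_user = master[master["user_id"] == 3]            # e.g. 2696 on the real data
train, test = split_first_records(master, n=4)
```

With matplotlib installed, each result can be visualized directly, for example `users["age"].plot(kind="hist")` for the age distribution or `views_by_age.plot(kind="bar")` for viewership by age group.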

https://www.kaggle.com/shubhamrastogi1994/movielens-case-study

https://sandipanweb.wordpress.com/2017/12/16/data-science-with-python-exploratory-analysis-with-movie-ratings-and-fraud-detection-with-credit-card-transactions/
https://towardsdatascience.com/analyzing-wine-descriptions-using-the-natural-language-toolkit-in-python-497ac1e228d5

https://github.com/kuhimans/Comcast-Telecom-Consumer-Complaints-Analysis/blob/master/P3.ipynb
getwd()
comcast = read.csv("Comcast Telecom Complaints data .csv",sep = ";")
head(comcast)
View(comcast)
str(comcast)
# Provide the trend chart for the number of complaints at monthly and daily granularity levels.
# Daily granularity
library(dplyr)
barplot(table(comcast$Date), xlab = "date", ylab = "frequency", col = "orange")

# Monthly granularity
comcast$Date = as.character(comcast$Date)
class(comcast$Date)
# Take the second "-"-separated field of the date string as the month
extractmonth = function(dt)
{
unlist(strsplit(x = dt, split = "-"))[2]
}
unlist(lapply(comcast$Date, extractmonth))->comcast$Month
comcast$Month = as.numeric(comcast$Month)
barplot(table(comcast$Month), xlab = "month", ylab = "frequency")

# Provide a table with the frequency of complaint types.
table(comcast$Status)
barplot(table(comcast$Status), xlab = "complaint type", ylab = "frequency",
        col = c("red","blue","green","yellow"))

# Which complaint types are maximum, i.e., around internet, network issues, or across any
# other domains?
# Create a new categorical variable with the values Open and Closed. Open and Pending are
# categorized as Open; Closed and Solved are categorized as Closed.
class(comcast$Status)
table(comcast$Status)
open = c("Open","Pending")
close = c("Closed","Solved")
comcast$status[comcast$Status %in% close] = "close"
comcast$status[comcast$Status %in% open] = "open"

# Provide state-wise status of complaints in a stacked bar chart.
# Use the categorized variable from Q3. Provide insights on:
# Create a cross table for the stacked barplot
table(comcast$status,comcast$State)
barplot(table(comcast$status, comcast$State), xlab = "state", ylab = "frequency",
        col = c("green","red"), legend = TRUE)

# Response: Florida and Alabama have the highest number of complaints; however, the number of
# closed complaints is higher than the number of open ones, which means that the complaints
# are getting resolved.

# Which state has the maximum complaints?
comcast%>%group_by(State)%>%summarise(n = n())%>%arrange(desc(n))->st
st

# Response: Georgia has the maximum complaints (all the complaints put together, i.e., open,
# closed, pending, and solved).

# Which state has the highest percentage of unresolved complaints (open complaints)?
comcast%>%group_by(State,status)%>%filter(status =="open")%>%summarise(total = n())%>%
arrange(desc(total))->t
sum(t$total)->s
# Each state's share of all open complaints (the original hard-coded this total as 517)
ptage = (t$total/s)*100
ptage

# Provide the percentage of complaints resolved till date which were received through the
# Internet and customer care calls.
comcast%>%group_by(Received.Via,status)%>%filter(status =="close")%>%
summarise(tot = n())->p
p
sum(p$tot)
# Each channel's share of all closed complaints (the original hard-coded this total as 1707)
ptage = (p$tot/sum(p$tot))*100
ptage
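Since the course itself is Python-based, the same status bucketing and per-state share can be sketched in pandas. This is an illustrative sketch only: the six rows below are made up, the column names `State` and `Status` are assumed to match the CSV header, and the percentages it produces say nothing about the real Comcast data.

```python
import pandas as pd

# Made-up rows; the real file would be loaded with pd.read_csv instead.
complaints = pd.DataFrame({
    "State": ["Georgia", "Georgia", "Florida", "Florida", "Florida", "Texas"],
    "Status": ["Open", "Closed", "Solved", "Pending", "Closed", "Solved"],
})

# Open and Pending become "open"; Closed and Solved become "close".
bucket = {"Open": "open", "Pending": "open", "Closed": "close", "Solved": "close"}
complaints["status"] = complaints["Status"].map(bucket)

# Cross table of status by state, the input for a stacked bar chart.
by_state = pd.crosstab(complaints["status"], complaints["State"])

# Each state's share of all open complaints, in percent.
open_counts = complaints.loc[complaints["status"] == "open", "State"].value_counts()
open_share = open_counts / open_counts.sum() * 100
```

With matplotlib installed, `by_state.T.plot(kind="bar", stacked=True)` would draw the stacked chart, one bar per state.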
