Emp.data
bonus=Emp.data$Salary*0.1         # 10% bonus column (as in the later examples)
Emp.data=cbind(Emp.data,bonus)    # add the bonus column first
write.csv(Emp.data, file="Bonus.csv")
Emp.data
x=data.frame(Empid="33",Lastname="RK",Firstname="Pillay",Jobcode="PILOT",Salary=97000,bonus=9700)
Emp.data=rbind(Emp.data,x)        # append the new employee as a row
Emp.data
Emp.data
Emp.data[3,2]                     # row 3, column 2
Emp.data[3,]                      # row 3, all columns
Emp.data[,3]                      # column 3, all rows
Emp.data[c(3,5),c("Lastname","Salary")]
sort(Emp.data$Salary)             # salaries in ascending order
ranks=order(Emp.data$Salary)      # row indices that sort by salary
Emp.data[ranks,]                  # rows ordered by salary
Emp.data[order(Emp.data$Salary,decreasing = TRUE),]   # highest salary first
**********
bonus=emp$Salary*0.1              # 10% bonus
empd=data.frame(emp,bonus)
sum(empd$Salary)
r=data.frame(Empid="44",Lastname="Pillay",Firstname="RK",Jobcode="PILOT",Salary=97000,bonus=9700)
empdr=rbind(empd,r)
empdr
empdr[1:5]                        # first five columns
empdr[1:3]                        # first three columns
empdr[c(1:3),c(1:5)]              # first three rows, first five columns
empdr[3,]                         # third row
sum(empdr$Salary+empdr$bonus)     # total payroll including bonus
Descriptive analytics are a very basic form of analytics that notify companies of past events and
patterns. More than 90 percent of companies use descriptive analytics.
Along with more advanced types of analytics such as predictive and prescriptive analytics,
descriptive analytics use big data to provide business solutions in virtually any industry. Descriptive
analytics are an important first step in summarizing data and understanding how it should be applied
in a company.
Descriptive analytics can reveal key performance indicators and details such as the frequency of events,
the cost of operations and the root cause of failures, according to a white paper from IBM. The
information can be displayed within a report or dashboard view, or companies can set up automated
solutions that issue alerts when potential problems arise.
Business Solutions for Descriptive Analytics
Examples
Descriptive analytics can benefit managers by presenting basic data in charts or reports.
These documents help answer questions for budgeting, sales, revenue and cost. “How much did we
sell in each region? What was our revenue and profit last quarter? How many and what types of
complaints did we resolve? Which factory has the lowest productivity? Descriptive analytics also help
companies to classify customers into different segments, which enable them to develop specific
marketing campaigns and advertising strategies,” Decision Line says.
The Dow Chemical Company used descriptive analytics to increase facility utilization across its office
and lab spaces globally. The company identified under-utilized space, ultimately increasing facility use
by 20 percent and generating annual savings of approximately $4 million through space consolidation,
IBM reports.
Wal-Mart mines terabytes of new data each day and petabytes of historical data to uncover patterns
in sales. Wal-Mart analyzes millions of online search keywords and hundreds of
millions of customers from different sources to look at certain actions. For instance, Wal-Mart examines
what consumers buy in store and online, what’s trending on Twitter and how the World Series and
weather affect buying patterns.
Descriptive analytics are important and useful, but their application is limited. Once past events and
patterns are understood, it’s natural to want to use that information to predict what will most likely
happen and what a company should do. Companies must make the transition from descriptive analytics
to predictive and prescriptive analytics to make the most of their data.
For instance, human resources analytics can examine how long certain employees have stayed with a
company, their salary, how many days they were absent in a year and compare it to performance. Using
simple demographic and performance indicators can help predict how long an employee with certain
qualities will stay with the company, Inostix explains. Companies can also establish best practices
based on these insights.
Historical data from Wal-Mart shows that before a hurricane, in addition to tools such as flashlights,
other items are in demand. “We didn’t know in the past that strawberry Pop-Tarts increase in sales, like
seven times their normal sales rate, ahead of a hurricane,” Linda Dillman, former chief information officer
at Wal-Mart and now chief information officer at QVC, told The New York Times. Wal-Mart now places Pop-Tarts at
checkouts before a hurricane.
The United States faces a shortage of 140,000 to 190,000 people with deep analytical skills, according
to McKinsey & Company. There could also be a shortage of 1.5 million managers and analysts who are
able to use big data to make effective decisions.
“Job postings seeking data scientists and business analytics specialists abound these days,” MIS
Quarterly says. “There is a clear shortage of professionals with the ‘deep’ knowledge required to
manage the three V’s of big data: volume, velocity, and variety. There is also an increasing demand for
individuals with the deep knowledge needed to manage the three ‘perspectives’ of business decision
making: descriptive, predictive, and prescriptive analytics.”
4. From question #2, what is the code to create a column with the values of a 10%
bonus on income?
Bonus=Cars$Income*0.1             # 10% of Income
mtcars
head(mtcars)
tail(mtcars)
str(mtcars)
names(mtcars)
summary(mtcars)
cor(mtcars$mpg, mtcars$hp)
dim(Credit)
names(Credit)
credit=Credit[,2:12]              # drop the ID column, keep columns 2-12
credit
names(credit)
summary(credit)
library(nortest)                  # provides the Anderson-Darling test ad.test()
ad.test(credit$Income)
qqnorm(credit$Income)
qqline(credit$Income, col="red")
cor(credit$Income,credit$Limit)
mc_data=credit[,1:5]              # Income, Limit, Rating, Cards, Age
round(cor(mc_data),2)             # correlation matrix, two decimal places
model=lm(Income~Limit+Rating+Cards+Age,data=credit)
summary(model)
1. INTRODUCTION
2. GRAPHICS
3. MODEL CHOICE
4. PARAMETER ESTIMATION
5. MEASURES OF GOODNESS OF FIT
6. GOODNESS OF FIT TESTS, NORMALITY TESTS
Introduction:
Fitting distributions consists of finding a mathematical function that represents a statistical variable
well. A statistician often faces this problem: given some observations of a quantitative variable
x1, x2, … xn, he wishes to test whether those observations, being a sample of an unknown population,
come from a population with a pdf (probability density function) f(x,θ), where θ is a vector of
parameters to be estimated from the available data. We can identify four steps in fitting
distributions:
1. Model/function choice: hypothesize families of distributions;
2. Estimate parameters;
3. Evaluate quality of fit;
4. Goodness of fit statistical tests.
This session addresses fitting distributions, dealing briefly with both theoretical and practical issues,
using the statistical environment and language R.
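As a minimal sketch of these four steps, the following fits a normal model to a simulated sample (the seed and sample are illustrative; fitdistr() comes from the MASS package, which ships with R):

```r
# Sketch: the four fitting steps on a simulated sample
library(MASS)
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)           # the observations x1..xn
fit <- fitdistr(x, "normal")                 # steps 1-2: hypothesize normal, estimate theta by ML
fit$estimate                                 # estimated mean and sd
ks.test(x, "pnorm", fit$estimate["mean"], fit$estimate["sd"])  # steps 3-4: goodness of fit
```

Note that running a Kolmogorov-Smirnov test with parameters estimated from the same data makes the test only approximate; it is shown here to illustrate the workflow.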
Graphics:
Exploratory data analysis can be the first step, getting descriptive statistics (mean, standard deviation,
skewness, kurtosis, etc.) and using graphical techniques (histograms, density estimate, ECDF-Empirical
Cumulative Distribution Function) which can suggest the kind of pdf to use to fit the model.
We can obtain samples from some pdf (such as Gaussian, Poisson, Weibull, gamma, etc.) using R
statements and then draw a histogram of the data. Suppose we have a sample of size n=100
drawn from a normal population N(10,2) with mean=10 and standard deviation=2:
x.norm=rnorm(n=100,mean=10,sd=2)
We can estimate the frequency density using density() and plot() to draw the graphic:
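For instance (a sketch; the seed is only there to make the sample reproducible):

```r
# Sketch: kernel density estimate of the simulated normal sample
set.seed(1)
x.norm <- rnorm(n = 100, mean = 10, sd = 2)
d <- density(x.norm)                         # kernel density estimate
plot(d, main = "Density estimate of x.norm") # smooth curve suggesting the bell shape
```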
A Quantile-Quantile (Q-Q) plot is a scatter plot comparing the fitted and empirical distributions in
terms of the dimensional values of the variable (i.e., empirical quantiles). It is a graphical technique
for determining whether a data set comes from a known population. In this plot the y-axis shows the
empirical quantiles and the x-axis shows those obtained from the theoretical model. R offers two
functions: qqnorm(), to test the goodness of fit of a Gaussian distribution, and qqplot(), for any kind of
distribution.
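A quick sketch of qqnorm() on the simulated sample from above (seed and sample are illustrative):

```r
# Sketch: Q-Q plot of a simulated normal sample against the Gaussian model
set.seed(1)
x.norm <- rnorm(100, mean = 10, sd = 2)
qqnorm(x.norm, main = "Normal Q-Q plot")     # empirical vs theoretical quantiles
qqline(x.norm, col = "red")                  # reference line through the quartiles
```

If the sample really is Gaussian, the points fall close to the red line.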
names(Ip)
str(Ip)
summary(Ip$AHT)
range(Ip$AHT)
sd(Ip$AHT)
library(fBasics)                  # provides skewness() and kurtosis()
skewness(Ip$AHT)
kurtosis(Ip$AHT)
hist(Ip$AHT, main="Histogram")
plot(ecdf(Ip$AHT), main="ecdf")
ip.sort=sort(Ip$AHT)
x=ip.sort[1:699]                  # trim the largest values
summary(x)
kurtosis(x)
hist(Ip$Calls, main="Histogram")
mean(Ip$Calls)
plot(ecdf(Ip$Calls))
PATH="https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/poisons.csv"
df=read.csv(PATH)
df
df_n=df[2:4]
str(df_n)
poison_n=as.factor(df_n$poison)
levels(poison_n)
df_n1=data.frame(df_n,poison_n)
df_n1
str(df_n1)
df_n2=df_n1[c(1,3,4)]
str(df_n2)
df_n2
# graph: boxplot of survival time by poison type
library(ggplot2)
ggplot(df_n2, aes(x = poison_n, y = time, fill = poison_n)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()
# Anova: one-way, time as a function of poison type
anova_one_way=aov(time~poison_n, data=df_n2)
summary(anova_one_way)
Chi square
library(MASS)                     # the survey data set lives in MASS
tbl=table(survey$Smoke,survey$Exer)
tbl
chisq.test(tbl)
# combine the None and Some exercise levels (a common fix when expected cell counts are small)
ctbl=cbind(tbl[,"Freq"], tbl[,"None"]+tbl[,"Some"])
ctbl
chisq.test(ctbl)
Predictive Modelling:
Here predictive analytics is explained through three distinct, related parts, capture, predict and act,
based on the data triangle methodology. Let us understand capture first:
Capture, the first corner of the triangle, relates to the data source. Sources can be traditional
or relational databases, flat files, Excel files and so on. Sources can also mean big data, Hadoop,
NoSQL systems, analytical data sources, etc.
On the second corner of the data triangle, we have the form of data. This includes data at rest, which
means data transactions of the past, and data in motion, which includes data streaming.
On the third corner of the data triangle, we have the types of data. This includes structured data and
unstructured data. Structured data is stored in rows and columns, and unstructured data
includes free forms of text, videos, pictures, etc.
These three corners capture the data triangle methodology as the first part of predictive
analytics.
Now we shall discuss the predictive analytical process and its techniques:
It goes beyond describing the characteristics of the data and the relationships among the
variables. It uses past data to predict the future, first identifying the associations among
the variables and then predicting the likelihood of a phenomenon.
Predictive analytics can further be categorized into:
If condition:
Discussion 1:
Traditional and relational databases:
Discussion 2:
Data at rest, which means data transactions of the past, and data in motion, which includes
data streaming.
Discussion 3:
Structured data is data stored in rows and columns; unstructured data includes free forms of text,
videos, pictures, etc.
Poll 1:
cars
names(cars)
#scatter plot
plot(x=cars$speed,y=cars$dist, main="Distance~speed")
scatter.smooth(x=cars$speed,y=cars$dist, main="Distance~speed")
#box plot
boxplot(cars$speed, main="Speed")
boxplot(cars$dist, main="distance")
#Density plot
plot(density(cars$speed), main="Density: speed")
polygon(density(cars$speed), col="red")
plot(density(cars$dist), main="Density: distance")
polygon(density(cars$dist), col="blue")
#correlation
cor(cars$speed,cars$dist)
#linear model: distance as a function of speed
linearMod=lm(dist~speed, data=cars)
linearMod
summary(linearMod)
In logistic regression, the response variable is binary, so the values of the response variable Y are
taken as (0,1). The graph of logistic regression appears as follows:
The value of the response variable Y is the probability, expressed through the log-odds in favour of
the event, and is written as follows:

log(P/(1-P)) = α + βx

Raising e to both sides:

P/(1-P) = e^(α+βx)

so that

P = e^(α+βx) / (1 + e^(α+βx))

is the function giving the probable value of the logistic response variable.
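The derivation above can be checked numerically; R's built-in plogis() computes exactly this logistic function (α and β below are illustrative values, not fitted coefficients):

```r
# Sketch: the logistic response P = e^(a + b*x) / (1 + e^(a + b*x))
a <- -1; b <- 0.5                   # illustrative coefficients
x <- 2
P <- exp(a + b*x) / (1 + exp(a + b*x))
P                                    # 0.5 here, since a + b*x = 0
all.equal(P, plogis(a + b*x))        # plogis() is R's built-in logistic cdf
```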
crdtdfault
names(crdtdfault)
crdt=crdtdfault[-1]
names(crdt)
str(crdt)
Default=as.factor(crdt$default10yr)
crdt=crdt[,-5]
crdt1=data.frame(crdt,Default)
str(crdt1)
model0=glm(Default~.,data=crdt1, family=binomial)
summary(model0)
TrainingIndex=sample(1:nrow(crdt1), 0.80*nrow(crdt1))   # 80/20 train-test split
Tng=crdt1[TrainingIndex,]
Test=crdt1[-TrainingIndex,]
model1=glm(Default~.,data=Tng, family="binomial")        # fit on the training set only
summary(model1)
pred=predict(model1,Test, type="response")   # predicted default probabilities
pred
pred=round(pred,digits=0)                    # classify with a 0.5 threshold
pred
Actual_pred=data.frame(Test$Default,pred)
Actual_pred
confusion=table(Test$Default,pred)           # confusion matrix: actual vs predicted
confusion
x=344+41                          # correctly classified (diagonal of the confusion matrix)
y=4+11+344+41                     # all test cases
Accuracy=x/y
Accuracy
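More generally, accuracy can be read straight off the confusion matrix instead of retyping the counts; a sketch using a matrix whose entries mirror the run above:

```r
# Sketch: accuracy = sum of the diagonal / sum of all cells
confusion <- matrix(c(344, 11, 4, 41), nrow = 2,
                    dimnames = list(actual = c("0","1"), predicted = c("0","1")))
Accuracy <- sum(diag(confusion)) / sum(confusion)   # correct predictions / all predictions
Accuracy                                            # 0.9625 for these counts
```

This form keeps working when the split (and hence the counts) changes between runs.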