
1. What is the goal of Artificial Intelligence?
The goal of AI is to create systems that can function intelligently and independently.
2. What is speech recognition in AI?
Humans can speak and listen to communicate through language; this is the field of speech
recognition. Much of speech recognition is statistically based, which is why it is also called
statistical learning.
3. What is NLP?
Humans can read and write text in a language; this is the field of NLP, or Natural Language
Processing. NLP is the branch of AI that enables machines to read, understand and derive
meaning from human language.
4. What is machine learning?
Humans have the ability to see patterns, such as grouping like objects; this is the field of
pattern recognition. Machines can be even better at pattern recognition, because they can use
more data and more dimensions of data. This is the field of machine learning.
5. Describe the field of Neural network in human context.
The human brain is a network of neurons, and we use these networks to learn things. If we can
replicate the structure and the function of the human brain, we might be able to get cognitive
capabilities in machines. This is the field of neural networks.
6. What is deep learning?
Deep Learning is a subfield of machine learning concerned with algorithms inspired by
the structure and function of the brain called artificial neural networks.
7. What is the difference between Descriptive Statistics and Descriptive Analytics?
Descriptive statistics summarizes data with patterns and charts, using statistical measures such
as the mean, median, mode, range, standard deviation and variance. When the same summarization is
done with machines and software on big data, it is known as Descriptive Analytics (a short R
illustration follows).
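
For instance, these measures can be computed in R on a small made-up sample (the vector x below is purely illustrative):

# a small, made-up sample used only for illustration
x=c(12, 15, 11, 15, 20, 18, 15, 22)

mean(x)        # arithmetic mean
median(x)      # middle value
range(x)       # minimum and maximum
sd(x)          # standard deviation
var(x)         # variance
table(x)       # frequency table; the most frequent value is the mode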

You must be able to explain the purpose and meaning of the following code.


# Emp.data and bonus are assumed to already exist in the workspace
Emp_data=data.frame(Emp.data,bonus)      # attach the bonus column to Emp.data as a new data frame
Emp_data                                 # display the new data frame

write.csv(Emp_data, file="Bonus.csv")    # save the data frame as a csv file in the working directory

Emp_data=cbind(Emp.data,bonus)           # cbind() is an alternative way to attach the bonus column
Emp_data

Empid=c(Empid="33",Lastname="Rao")       # a named character vector; not used further below

# a one-row data frame describing a new employee
x=data.frame(Empid="33",Lastname="RK",Firstname="Pillay",Jobcode="PILOT",Salary=97000,bonus=9700)

Emp_data=rbind(Emp_data,x)               # append the new row to Emp_data
Emp_data

Emp.data                                 # the original data set, without the bonus column

sort(Emp.data$Salary)                    # salary values in increasing order

Emp.data

Emp.data[3,2]                            # value in row 3, column 2
Emp.data[3,]                             # all columns of row 3
Emp.data[,3]                             # all rows of column 3
Emp.data[c(3,5),c("Lastname","Salary")]  # rows 3 and 5, columns Lastname and Salary

sort(Emp.data$Salary)                    # sort() returns the values in order
ranks=order(Emp.data$Salary)             # order() returns the row positions that would sort Salary
Emp.data[ranks,]                         # the data set ordered by Salary
Emp.data[order(Emp.data$Salary,decreasing = TRUE),]   # ordered by Salary, highest first

**********

emp=read.csv("E:/Business Analytics/BA 2018/Session 3/Bonus.csv")   # read the csv file into R

bonus=emp$Salary*0.1                     # 10% bonus on salary
empd=data.frame(emp,bonus)               # attach the bonus column

sum(empd$Salary)                         # total salary

write.csv(empd, file = "E:/Business Analytics/BA 2018/Session 3/empd.csv")   # save the result

# a one-row data frame for a new employee
r=data.frame(Empid="44",Lastname="Pillay",Firstname="RK",Jobcode="PILOT",Salary=97000,bonus=9700)

empdr=rbind(empd,r)                      # append the new row
empdr

empdr[1:5]                               # first five columns
empdr[1:3]                               # first three columns
empdr[c(1:3),c(1:5)]                     # rows 1 to 3, columns 1 to 5
empdr[3,]                                # row 3, all columns

sum(empdr$Salary+empdr$bonus)            # total cost of salary plus bonus

Business Solutions and Big Data Descriptive Analytics

Descriptive analytics are a very basic form of analytics that notify companies of past events and
patterns. More than 90 percent of companies use descriptive analytics.
Along with more advanced types of analytics, such as predictive and prescriptive analytics,
descriptive analytics use big data to provide business solutions in virtually any industry. Descriptive
analytics are an important first step in compiling data and understanding how that data should be
applied to a company.

What Are Descriptive Analytics?


“The purpose of descriptive analytics is simply to summarize and tell you what happened,” says Michael
Wu, chief scientist at Lithium. Descriptive analytics are “mostly based on standard aggregate functions
[like average, maximum and mode] in databases that require nothing more than grade school math.
Even basic statistics are pretty rare.”
Most descriptive analytics can be classified into three categories (a small R sketch follows the list):

• Event counters, such as the number of posts and followers on social media.
• Simple mathematical operations, such as average response time and average number of replies per post.
• Filtered analytics, such as average posts per week from the United Kingdom vs. average posts per week from Japan.
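
For instance, on a hypothetical data frame of social media posts (the posts data frame and its columns below are made up for illustration), the three categories might look like this:

# made-up example data: one row per post
posts=data.frame(country=c("UK","UK","Japan","Japan","UK"),
                 replies=c(3, 5, 2, 4, 1),
                 response_time=c(12, 30, 8, 15, 20))

nrow(posts)                                   # event counter: number of posts
mean(posts$response_time)                     # simple operation: average response time
mean(posts$replies)                           # simple operation: average replies per post

# filtered analytics: average replies for UK posts vs. Japan posts
mean(posts$replies[posts$country=="UK"])
mean(posts$replies[posts$country=="Japan"])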

Descriptive analytics can reveal key performance indicators and details such as the frequency of events,
the cost of operations and the root cause of failures, according to a white paper from IBM. The
information can be displayed within a report or dashboard view, or companies can set up automated
solutions that issue alerts when potential problems arise.
Business Solutions for Descriptive Analytics

Examples

Descriptive analytics can benefit managers by presenting basic data in charts or reports. These documents
help answer questions about budgeting, sales, revenue and cost. “How much did we
sell in each region? What was our revenue and profit last quarter? How many and what types of
complaints did we resolve? Which factory has the lowest productivity? Descriptive analytics also help
companies to classify customers into different segments, which enable them to develop specific
marketing campaigns and advertising strategies,” Decision Line says.

The Dow Chemical company used descriptive analytics to increase facility utilization across its office
and lab spaces globally. The company identified under-utilized space, ultimately increasing facility use
20 percent and generating an annual savings of approximately $4 million in space consolidation, IBM
reports.

Wal-Mart mines terabytes of new data each day and petabytes of historical data to uncover patterns
in sales, according to DeZyre. Wal-Mart analyzes millions of online search keywords and hundreds of
millions of customers from different sources to look at certain actions. For instance, Wal-Mart examines
what consumers buy in store and online, what’s trending on Twitter and how the World Series and
weather affect buying patterns.

Moving from Descriptive Analytics

Descriptive analytics are important and useful, but their application is limited. Once past events and
patterns are understood, it’s natural to want to use that information to predict what will most likely
happen and what a company should do. Companies must make the transition from descriptive analytics
to predictive and prescriptive analytics to make the most of their data.

For instance, human resources analytics can examine how long certain employees have stayed with a
company, their salary and how many days they were absent in a year, and compare these to performance.
Using simple demographic and performance indicators can help predict how long an employee with certain
qualities will stay with the company, Inostix explains. Companies can also establish best practices
based on these insights.

Historical data from Wal-Mart shows that before a hurricane, items besides obvious tools like flashlights
are in demand. “We didn’t know in the past that strawberry Pop-Tarts increase in sales, like seven times
their normal sales rate, ahead of a hurricane,” Linda Dillman, former chief information officer at Wal-Mart
and now chief information officer at QVC, told The New York Times. Wal-Mart now places Pop-Tarts at
checkouts before a hurricane.

Dramatic Rise of Career Opportunities in Big Data

The United States faces a shortage of 140,000 to 190,000 people with deep analytical skills, according
to McKinsey & Company. There could also be a shortage of 1.5 million managers and analysts who are
able to use big data to make effective decisions.

“Job postings seeking data scientists and business analytics specialists abound these days,” MIS
Quarterly says. “There is a clear shortage of professionals with the ‘deep’ knowledge required to
manage the three V’s of big data: volume, velocity, and variety. There is also an increasing demand for
individuals with the deep knowledge needed to manage the three ‘perspectives’ of business decision
making: descriptive, predictive, and prescriptive analytics.”

1. How many ways were discussed to load a csv file into the R environment?

a. From the Import Dataset menu -> local file, and selecting Heading: Yes
b. Using the code: read.csv("path of the data file/data set name.csv")
2. A data set by name “Cars” has three variables, “Name”, “Age”, “Income”. How do you write
a code to see the values of the variable “Income”?
Code: Cars$Income
3. Assign 24 to x and 35 to y. What is the code to calculate the product of 24 and 35?
x=24    # run
y=35    # run
x*y     # run; returns the product 840

4. From question #2, what is the code to create a column with the values of a 10% bonus on income?

Cars$Bonus=Cars$Income*0.1   # adds a Bonus column to the Cars data set

5. What is the function used for creating a data set?

data.frame()
6. Write code to create a data set having 4 rows and 4 columns. One of the columns should be a
bi-group (two-level) variable.
7. From question #6, create a data set containing the 2nd and 3rd rows and the 3rd and 4th columns
(a possible sketch is given below).
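
A possible answer for questions 6 and 7 (a sketch only; the data set name and values are made up):

# question 6: a 4 x 4 data set with one bi-group (two-level) column
ds=data.frame(Id=c(1,2,3,4),
              Name=c("A","B","C","D"),
              Group=c("Male","Female","Male","Female"),   # bi-group variable
              Score=c(55, 68, 72, 80))
ds

# question 7: rows 2 and 3, columns 3 and 4
ds[c(2,3),c(3,4)]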
8. Write code to save a data set file in the drive where your project is saved (illustrated below with the mtcars data set).

mtcars                                          # a built-in R data set

write.csv(mtcars, file="E:/Business Analytics/BA 2018/Session 5/mtcars.csv")   # save it as a csv file

head(mtcars)                                    # first six rows
tail(mtcars)                                    # last six rows

str(mtcars)                                     # structure: variable names and types

names(mtcars)                                   # variable names

summary(mtcars)                                 # descriptive statistics of every variable

cor(mtcars$mpg, mtcars$hp)                      # correlation between mileage and horsepower

plot(mtcars$mpg, mtcars$hp, main="Scatterplot Example",
     xlab="Miles per Gallon", ylab="Horse power", pch=19)   # scatter plot of the two variables

# Credit is assumed to be already loaded (for example, imported from a csv file)
dim(Credit)                                     # number of rows and columns

names(Credit)                                   # variable names

credit=Credit[-1,2:12]                          # drop the first row and keep columns 2 to 12
credit

names(credit)
summary(credit)

library(nortest)                                # needed for the Anderson-Darling test below
ad.test(credit$Income)                          # Anderson-Darling normality test on Income

qqnorm(credit$Income)                           # normal Q-Q plot of Income
qqline(credit$Income, col="red")                # reference line for the Q-Q plot

cor(credit$Income,credit$Limit)                 # correlation between Income and Limit

mc_data=credit[,1:length(credit[1:5])]          # the first five columns of credit (length(credit[1:5]) is 5)
round(cor(mc_data),2)                           # correlation matrix, rounded to 2 decimals

model=lm(Income~Limit+Rating+Cards+Age,data=credit)   # multiple linear regression
summary(model)

FITTING DISTRIBUTIONS WITH R


TABLE OF CONTENTS:

1. INTRODUCTION
2. GRAPHICS
3. MODEL CHOICE
4. PARAMETER ESTIMATION
5. MEASURES OF GOODNESS OF FIT
6. GOODNESS OF FIT TESTS, NORMALITY TESTS

Introduction:

Fitting distributions consists of finding a mathematical function which represents a statistical
variable well. A statistician is often faced with this problem: he has some observations of a
quantitative variable x1, x2, … xn and wishes to test whether those observations, being a sample from an
unknown population, come from a population with a pdf (probability density function) f(x, θ), where
θ is a vector of parameters to be estimated from the available data. We can identify four steps in fitting
distributions:

1. Model/function choice: hypothesize families of distributions;
2. Estimate parameters;
3. Evaluate quality of fit;
4. Goodness of fit statistical tests.

This session addresses fitting distributions, dealing briefly with both theoretical and practical issues,
using the statistical environment and language R.

Graphics:

Exploratory data analysis can be the first step, getting descriptive statistics (mean, standard deviation,
skewness, kurtosis, etc.) and using graphical techniques (histograms, density estimate, ECDF-Empirical
Cumulative Distribution Function) which can suggest the kind of pdf to use to fit the model.

We can obtain samples from some pdfs (such as the Gaussian, Poisson, Weibull, gamma, etc.) using R
statements, and afterwards we draw a histogram of these data. Suppose we have a sample of size n=200
drawn from a normal population N(10,2) with mean=10 and standard deviation=2:

x.norm=rnorm(n=200,m=10,sd=2)   # 200 random draws from N(10,2)

Get a histogram using the hist() function:

hist(x.norm,main="Histogram of observed data")


Histograms can provide insights on skewness, behaviour in the tails, presence of multi-modal
behaviour, and data outliers; histograms can be compared to the fundamental shapes associated with
standard analytic distributions.

We can estimate the frequency density using density() and plot() to draw the graphs:

plot(density(x.norm),main="Density estimate of data")

plot(ecdf(x.norm),main="Empirical cumulative distribution function")   # empirical CDF

A Quantile-Quantile (Q-Q) plot is a scatter plot comparing the fitted and empirical distributions in
terms of the dimensional values of the variable (i.e., empirical quantiles). It is a graphical technique
for determining whether a data set comes from a known population. In this plot the y-axis shows the
empirical quantiles and the x-axis shows the quantiles obtained from the theoretical model. R offers two
statements: qqnorm(), to test the goodness of fit of a Gaussian distribution, and qqplot(), for any kind of
distribution.

z.norm=(x.norm-mean(x.norm))/sd(x.norm) ## standardized data

qqnorm(z.norm) ## drawing the QQplot

abline(0,1) ## drawing a 45-degree reference line


A 45-degree reference line is also plotted. If the empirical data come from the population with the
chosen distribution, the points should fall approximately along this reference line. The greater the
departure from this reference line, the greater the evidence that the data set comes from a population
with a different distribution.

# Ip is assumed to be an imported data set containing the variables AHT and Calls
library(fBasics)                        # provides skewness() and kurtosis()

names(Ip)                               # variable names
str(Ip)                                 # structure of the data set

summary(Ip$AHT)                         # five-number summary and mean of AHT
range(Ip$AHT)                           # minimum and maximum
sd(Ip$AHT)                              # standard deviation
skewness(Ip$AHT)                        # skewness of AHT
kurtosis(Ip$AHT)                        # kurtosis of AHT

hist(Ip$AHT, main="Histogram")
plot(density(Ip$AHT), main="Density curve")
plot(ecdf(Ip$AHT), main="ecdf")         # empirical cumulative distribution function

ip.sort=sort(Ip$AHT)                    # AHT values in increasing order
x=ip.sort[1:699]                        # keep the first 699 values, trimming the largest ones

summary(x)
hist(x, main="histogram of AHT")
plot(density(x), main="Density of smooth x")

skewness(x)
kurtosis(x)

hist(Ip$Calls, main="Histogram")
mean(Ip$Calls)
plot(ecdf(Ip$Calls))

# read the poisons data set from GitHub
PATH="https://raw.githubusercontent.com/thomaspernet/data_csv_r/master/data/poisons.csv"
df=read.csv(PATH)
df

df_n=df[2:4]                            # drop the first (index) column and keep columns 2 to 4
str(df_n)

poison_n=as.factor(df_n$poison)         # convert poison to a factor
levels(poison_n)                        # the factor levels

df_n1=data.frame(df_n,poison_n)         # attach the factor as a new column
df_n1
str(df_n1)

df_n2=df_n1[c(1,3,4)]                   # keep columns 1, 3 and 4 (time, treat and poison_n)
str(df_n2)
df_n2

# graph: box plot of time by poison type, with jittered points
library(ggplot2)

ggplot(df_n2, aes(x = poison_n, y = time, fill = poison_n)) +
  geom_boxplot() +
  geom_jitter(shape = 15,
              color = "steelblue",
              position = position_jitter(0.21)) +
  theme_classic()

# One-way ANOVA: does mean time differ across poison types?
anova_one_way=aov(time~poison_n, data = df_n2)
summary(anova_one_way)

# Two-way ANOVA: poison type and treatment
anova_two_way=aov(time~poison_n + treat, data = df_n2)
summary(anova_two_way)

Chi-square test

library(MASS)                           # the survey data set is in the MASS package

tbl=table(survey$Smoke,survey$Exer)     # contingency table of smoking habit vs. exercise level
tbl

chisq.test(tbl)                         # chi-square test of independence

# combine the second and third columns of tbl (the "None" and "Some" exercise levels)
ctbl = cbind(tbl[,"Freq"], tbl[,"None"] + tbl[,"Some"])
ctbl

chisq.test(ctbl)

Predictive Modelling:

Predictive analytics is explained here in three distinct but related parts, capture, predict and act,
based on the data triangle methodology. Let us understand capture first.

Capture forms one corner of the triangle and relates to the data source. Sources can be traditional
or relational databases, flat files, Excel files and so on. Sources can also include big data, Hadoop,
NoSQL systems, analytical data sources etc.

On the second corner of the data triangle, we have the form of data. This includes data at rest, which
means data from past transactions, and data in motion, which covers data streaming.

On the third corner of the data triangle, we have the types of data. This includes structured data and
unstructured data. Structured data is data stored in rows and columns, and unstructured data
includes free-form text, videos, pictures etc.

These three corners make up the data triangle methodology, the first part of predictive analytics.

Now we shall discuss the predictive analytical process and its techniques:

Predictive analytics goes beyond describing the characteristics of the data and the relationships
among the variables. It uses past data to predict the future: it first identifies the associations
among the variables and then predicts the likelihood of a phenomenon.
Predictive analytics can be further categorized into:

Predictive modelling: what will happen next, if current conditions continue?

Root cause analysis: why did this actually happen?

Data mining: identifying correlated data.

Forecasting: what if the existing trend continues?

Monte Carlo simulation: what could happen?

Pattern identification and alerts: when should an action be invoked to correct a process?

Discussion 1:
Traditional and Relational databases:

Traditional databases:

Traditional data systems, such as relational databases and data warehouses, have been the primary way
businesses and organizations have stored and analyzed their data for the past 30 to 40 years.
Characteristics of structured data include the following: clearly defined fields organized in records.

Relational databases:

A relational database organizes data in tables (or relations). A table is made up of rows and columns;
a row is also called a record (or tuple). There are also many free and open-source RDBMSs, such as
MySQL, mSQL (mini-SQL) and the embedded JavaDB (Apache Derby).

Discussion 2:

Data at Rest and Data in Motion:

Data at rest means data from past transactions; data in motion covers data streaming functions.

Discussion 3:

Structured data and unstructured data:

Structured data is data stored in rows and columns, while unstructured data includes free-form text,
videos, pictures etc.

Poll 1:

Data Triangle methodology?

cars                                    # a built-in R data set: speed and stopping distance of cars

names(cars)

# scatter plot
plot(x=cars$speed, y=cars$dist, main="Distance~speed")
scatter.smooth(x=cars$speed, y=cars$dist, main="Distance~speed")   # scatter plot with a smoothed line

# box plot
boxplot(cars$speed, main="Speed")
boxplot(cars$dist, main="distance")

# density plot (the e1071 package must be installed for the skewness shown in the subtitle)
plot(density(cars$speed),
     main="Density Plot: Speed", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(cars$speed), 2)))
polygon(density(cars$speed), col="red")

plot(density(cars$dist),
     main="Density Plot: Distance", ylab="Frequency",
     sub=paste("Skewness:", round(e1071::skewness(cars$dist), 2)))
polygon(density(cars$dist), col="blue")

# correlation between speed and stopping distance
cor(cars$speed, cars$dist)

# build a simple linear regression model: distance as a function of speed
linearMod=lm(dist ~ speed, data=cars)
linearMod
summary(linearMod)

Derive a probability function of logistic regression:

The linear regression equation is y = α + βx, where y can take any value in (-∞, +∞).

In logistic regression the response variable is binary, so the response variable Y takes only the values
0 and 1. The graph of logistic regression is therefore an S-shaped (sigmoid) curve bounded between 0 and 1.

The value of the response variable Y is treated as a probability, and the log of the odds in favour of
the event is modelled as follows:

log(P / (1 - P)) = α + βx

Raising e to the power of both sides:

P / (1 - P) = e^(α+βx)

Solving for P:

P = e^(α+βx) / (1 + e^(α+βx))

which is the probability function of the logistic response variable.
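
As a quick check of the formula, the probability can be computed directly in R (the values of α, β and x below are made up for illustration):

a=-2      # alpha (illustrative value)
b=0.5     # beta  (illustrative value)
x=3

p=exp(a+b*x)/(1+exp(a+b*x))   # probability from the derived formula
p

plogis(a+b*x)                 # R's built-in logistic function gives the same value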

# crdtdfault is assumed to be an imported credit-default data set
crdtdfault
names(crdtdfault)

crdt=crdtdfault[-1]                     # drop the first (id) column
names(crdt)
str(crdt)

Default=as.factor(crdt$default10yr)     # the response, converted to a factor
crdt=crdt[,-5]                          # drop the fifth column (the original default10yr variable)
crdt1=data.frame(crdt,Default)          # data set with the factor response attached
str(crdt1)

model0=glm(Default~.,data=crdt1, family=binomial)   # logistic regression on the full data
summary(model0)

# 80/20 split into training and test sets
TrainingIndex=sample(1:nrow(crdt1), 0.80*nrow(crdt1))
Tng=crdt1[TrainingIndex,]
Test=crdt1[-TrainingIndex,]

model1=glm(Default~.,data=Tng, family="binomial")   # refit the model on the training set only
summary(model1)

pred=predict(model1,Test, type="response")          # predicted probabilities for the test set
pred
pred=round(pred,digits=0)                           # convert probabilities to 0/1 predictions
pred

Actual_pred=data.frame(Test$Default,pred)           # actual vs. predicted, side by side
Actual_pred

confusion=table(Test$Default,pred)                  # confusion matrix
confusion

# accuracy = correctly classified / total (the counts below come from one run's confusion matrix)
x=344+41
y=4+11+344+41
Accuracy=x/y
Accuracy
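
The same accuracy can be computed directly from the confusion matrix, without copying the counts by hand (assuming confusion is the 2 x 2 table produced above):

# equivalent calculation: diagonal (correct predictions) divided by the total
sum(diag(confusion)) / sum(confusion)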
