Statistics:
Visualization:
Machine Learning:
Deep Learning:
The data architect is responsible for all of the data and its
storage.
What are they doing to solve the problem now, and why isn’t
that good enough?
CONFUSION MATRIX
                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN

Example:
                     Actual Positive   Actual Negative
Predicted Positive   5                 0
Predicted Negative   1                 9
Accuracy
Recall
Recall or True Positive Rate is defined as the ratio of true
positives to total number of true positives and false negatives.
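As a worked example, the counts from the example matrix (TP = 5, FP = 0, FN = 1, TN = 9) can be plugged into both metrics. This is a minimal sketch in R. The variable names are illustrative, and the accuracy formula, (TP + TN) / total, is the standard definition, which the slide names but does not spell out.

```r
# Counts read off the example confusion matrix
tp <- 5; fp <- 0; fn <- 1; tn <- 9

# Accuracy: share of all predictions that are correct
accuracy <- (tp + tn) / (tp + fp + fn + tn)

# Recall (true positive rate): TP / (TP + FN)
recall <- tp / (tp + fn)

print(round(c(accuracy = accuracy, recall = recall), 3))
```

With these counts, accuracy is 14/15 (about 0.933) and recall is 5/6 (about 0.833).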
This loads the data and stores it in a new R data frame object called
uciCar.
summary(uciCar)
buying maint
high :432 high :432
low :432 low :432
med :432 med :432
vhigh:432 vhigh:432
The summary() command shows us the distribution of each
variable in the dataset.
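The load-then-summarize step can be sketched as follows. The source does not give the file location, so this sketch reads a few made-up rows from inline text instead. Only the column names buying and maint come from the output shown, and the name uciCarSample is hypothetical.

```r
# Made-up sample rows; only the column names come from the slide's output
csv_text <- "buying,maint
vhigh,vhigh
high,low
med,med
low,high
high,high"

# read.csv accepts a file path or URL; text= lets the sketch run standalone
uciCarSample <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# For factor columns, summary() reports the count of each level
print(summary(uciCarSample))
```

For the real dataset, the text argument would be replaced by the path or URL of the data file.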
XLS/XLSX: http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
JSON: http://cran.r-project.org/web/packages/rjson/index.html
XML: http://cran.r-project.org/web/packages/XML/index.html
MongoDB: http://cran.r-project.org/web/packages/rmongodb/index.html
SQL: http://cran.r-project.org/web/packages/DBI/index.html
> table(d$Purpose,d$Good.Loan)
                      BadLoan GoodLoan
  business                 34       63
  car (new)                89      145
  car (used)               17       86
  domestic appliances       4        8
  education                22       28
  furniture/equipment      58      123
  others                    5        7
  radio/television         62      218
  Repairs                   8       14
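Raw counts are hard to compare across purposes of very different size, so one possible next step is to convert each row to proportions. The sketch below rebuilds the counts from the output shown, since the data frame d itself is not available here. The names purposes, counts, and rates are illustrative.

```r
# Counts copied from the table() output shown in the slide
purposes <- c("business", "car (new)", "car (used)", "domestic appliances",
              "education", "furniture/equipment", "others",
              "radio/television", "Repairs")
counts <- matrix(c(34, 63, 89, 145, 17, 86, 4, 8, 22, 28,
                   58, 123, 5, 7, 62, 218, 8, 14),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(purposes, c("BadLoan", "GoodLoan")))

# prop.table with margin = 1 converts each row to proportions
rates <- prop.table(counts, margin = 1)

# Sort purposes by bad-loan rate, highest first
print(round(rates[order(-rates[, "BadLoan"]), ], 2))
```

With the original data frame, the same result would come from prop.table(table(d$Purpose, d$Good.Loan), margin = 1).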
SRM Institute of Science and Technology
WORKING WITH RELATIONAL DATABASES
RMySQL Package:
install.packages("RMySQL")
The data staging area sits between the data source(s) and
the data target(s), which are often data warehouses, data
marts, or other data repositories.
Connecting R to MySQL
Output:
  actor_id first_name last_name         last_update
1       18        DAN      TORN 2006-02-15 04:34:33
2       94    KENNETH      TORN 2006-02-15 04:34:33
3      102     WALTER      TORN 2006-02-15 04:34:33
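The slide shows the query output but not the connection code. A minimal sketch of how such a result might be fetched is given below, using the DBI interface with the RMySQL driver. The database name "sakila", the host, the credentials, the exact query, and the function name fetch_torn_actors are all assumptions, not taken from the source.

```r
# Sketch only: needs the DBI and RMySQL packages and a reachable MySQL server.
# The database name, host, and credentials are placeholders (assumptions).
fetch_torn_actors <- function(user, password,
                              dbname = "sakila", host = "localhost") {
  con <- DBI::dbConnect(RMySQL::MySQL(),
                        user = user, password = password,
                        dbname = dbname, host = host)
  # Close the connection even if the query fails
  on.exit(DBI::dbDisconnect(con))
  # Assumed query; returns the matching rows as a data frame
  DBI::dbGetQuery(con, "SELECT * FROM actor WHERE last_name = 'TORN'")
}

# Example call (not run here):
# fetch_torn_actors(user = "me", password = "secret")
```

Wrapping the connection in a function with on.exit() is one way to make sure the connection is released after each query.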
Missing Values
Invalid Values And Outliers
Data Range
Units
> summary(custdata)
custid sex
Min. : 2068 F:440
1st Qu.: 345667 M:560
Median : 693403
Mean : 698500
3rd Qu.:1044606
Max. :1414286
Missing Values
is.employed
Mode :logical
FALSE:73
TRUE :599
NA's :328 (Missing Values)
housing.type
Homeowner free and clear :157
Homeowner with mortgage/loan:412
Occupied with no rent : 11
Rented :364
NA's : 56 (Missing Values)
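Besides reading NA counts off summary(), missing values can be counted directly per column. The sketch below uses a hypothetical five-row stand-in for custdata. The values are made up and only the column names come from the output shown.

```r
# Hypothetical miniature of custdata; values are made up
custSample <- data.frame(
  is.employed  = c(TRUE, NA, FALSE, TRUE, NA),
  housing.type = c("Rented", NA, "Rented",
                   "Homeowner free and clear", "Rented"),
  stringsAsFactors = TRUE
)

# Count NAs per column
na_counts <- colSums(is.na(custSample))
print(na_counts)

# Share of rows that are missing, per column
print(round(na_counts / nrow(custSample), 2))
```

A column where a large share of rows is missing, as with is.employed in the slide (328 NAs), usually needs a deliberate treatment rather than row deletion.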
> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000
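The minimum of -8700 is a negative income, which may be bad data or a sentinel value, and the maximum is far from the bulk of the distribution. A minimal sketch of flagging both cases follows. The income vector is made up (only its smallest and largest values echo the summary), and the 1.5 * IQR rule is one common rule of thumb, not something the slide prescribes.

```r
# Made-up incomes; -8700 and 615000 echo the min and max in the summary
income <- c(-8700, 14600, 35000, 67000, 615000)

# Negative income may be bad data or a sentinel; flag it rather than drop it
is_negative <- income < 0
print(sum(is_negative))

# Rule of thumb for outliers: more than 1.5 * IQR beyond the quartiles
q <- unname(quantile(income, c(0.25, 0.75)))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- income < q[1] - fence | income > q[2] + fence
print(income[is_outlier])
```

Whether a flagged value is an error or a legitimate extreme is a question for the data owner, so the sketch only marks rows and leaves the data unchanged.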
UNITS
summary(custdata$Income)
summary(custdata[is.na(custdata$housing.type),
                 c("recent.move","num.vehicles")])
summary(as.factor(custdata$is.employed.fix))
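The column is.employed.fix is summarized above but its construction is not shown. One plausible construction, assuming it recodes NA as its own category, is sketched below on a hypothetical five-row stand-in for custdata. The category labels are assumptions.

```r
# Hypothetical miniature of custdata; values are made up
custSample <- data.frame(is.employed = c(TRUE, NA, FALSE, TRUE, NA))

# Recode NA as its own category instead of dropping the rows
custSample$is.employed.fix <- ifelse(
  is.na(custSample$is.employed), "missing",
  ifelse(custSample$is.employed, "employed", "not employed")
)

# Counts per category, including the new "missing" level
print(summary(as.factor(custSample$is.employed.fix)))
```

Keeping "missing" as a level lets a model use the fact that the value was absent, which can itself be informative.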
A data type check confirms that the data entered has the
correct data type. For example, a field might only accept
numeric data.
Code Check
Range Check
Format Check
Consistency Check
Uniqueness Check
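Several of these checks can be sketched in a few lines of R. The example covers the data type, range, format, and uniqueness checks only. The records data frame, its column names, the 0 to 120 age range, and the email pattern are all illustrative assumptions.

```r
# Hypothetical records; column names and rules are illustrative assumptions
records <- data.frame(
  id    = c(1, 2, 2, 4),
  age   = c(34, -5, 51, 230),
  email = c("a@example.com", "not-an-email",
            "c@example.com", "d@example.com"),
  stringsAsFactors = FALSE
)

# Data type check: the whole column must be numeric
type_ok <- is.numeric(records$age)

# Range check: age must fall between 0 and 120
range_ok <- records$age >= 0 & records$age <= 120

# Format check: crude pattern of the form name@domain.tld
format_ok <- grepl("^[^@ ]+@[^@ ]+\\.[^@ ]+$", records$email)

# Uniqueness check: flag every row whose id occurs more than once
unique_ok <- !(records$id %in% records$id[duplicated(records$id)])

print(data.frame(records, range_ok, format_ok, unique_ok))
```

Each check yields a logical vector, so failing rows can be reported or filtered without modifying the original data.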
The test set is the data that you feed into the
resulting model, to verify that the model’s
predictions are accurate.
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh,
                         gp = runif(length(hh)))
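The snippet above assigns one random number per household. A sketch of how that might be carried through to a train/test split is shown below, so that all members of a household land on the same side. The hhdata rows, the seed, and the 0.1 threshold are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical household data: several customers per household
hhdata <- data.frame(
  household_id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 6),
  customer_id  = 1:10
)

# One random group value per household, as in the snippet above
hh <- unique(hhdata$household_id)
households <- data.frame(household_id = hh, gp = runif(length(hh)))

# Attach the household's group value to each of its customers
hhdata <- merge(hhdata, households, by = "household_id")

# Assumed threshold: roughly 10% of households go to the test set
test  <- subset(hhdata, gp <= 0.1)
train <- subset(hhdata, gp > 0.1)
```

Splitting at the household level rather than the row level avoids having related customers in both sets, which would make test results look better than they are. With only six households the test set may be very small or empty, so a real split needs far more groups.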
Structured
Semi-Structured
Quasi-Structured
Unstructured