

ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO. 1

Aim: Import legacy data from different sources such as Excel, SQL Server, Oracle, etc., and
load it into the target system. (You can download a sample database such as AdventureWorks,
Northwind or FoodMart.)

I) Importing Excel Data

Step 1: Open Power BI

Step 2: From the Home ribbon, select Get Data. Excel is one of the most common data
connections, so you can select it directly from the Get Data menu.

Step 3: Select the required file and click Open; the Navigator screen appears

Step 4: Select file and click on Edit

Step 5: Power Query Editor appears



II) Importing Data from OData Feed

• In this task, you'll bring in order data; this step represents connecting to a sales system.
• You import data into Power BI Desktop from the sample Northwind OData feed at the following URL,
which you can copy (and then paste) in the steps below:
http://services.odata.org/V3/Northwind/Northwind.svc/

Connect to an OData feed:


Step 1: From the Home ribbon tab in Query Editor, select Get Data.

Step 2: Browse to the OData Feed data source.

Step 3: In the OData Feed dialog box, paste the URL for the Northwind OData feed.

Step 4: Select OK.

Step 5: In the Navigator pane, select the Orders table, and then select Edit.

Step 6: Power Query Editor appears


ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO. 2(A)

Aim: Perform the Extraction Transformation and Loading (ETL) process to construct the
database in the Power BI.

Step 1: Data Extraction :

Data extraction is the first step of ETL. There are two types of data extraction:
1. Full Extraction: all the data from the source (operational) systems is extracted to the staging area.
(Initial load)
2. Partial Extraction: sometimes we get a notification from the source system to update only specific data.
This is called a delta load.
Source system performance: the extraction strategy should not affect source system performance.

Step 2 : Data Transformation :

Data transformation is the second step. After extracting the data, it needs to be transformed to suit
the target system. Some key points of data transformation:
• Data extracted from the source system is in raw format; we need to transform it before loading it into the
target server.
• Data has to be cleaned, mapped and transformed.

The important steps of data transformation are:


1. Selection: select the data to load into the target.
2. Matching: match the data with the target system.
3. Transforming: change the data as per the target table structures.

Real-life examples of data transformation (a short illustrative R sketch follows this list):


• Standardizing data: data is fetched from multiple sources, so it needs to be standardized as
per the target system.
• Character set conversion: the character sets need to be converted as per the target system
(for example, first name and last name fields).
• Calculated and derived values: the source system has first_val and second_val, and in the
target we need a value calculated from them.
• Data conversion between formats: if the date in the source system is in DDMMYY format and
the target expects DDMONYYYY, this conversion is done in the
transformation phase.
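
As a simple illustration of these transformations (an R sketch with assumed column names and sample values, not Power Query steps from this practical):

# Illustrative source record (assumed values, date in DDMMYY format)
src <- data.frame(first_name = "john", last_name = "SMITH",
                  first_val = 120, second_val = 80,
                  order_date = "251223", stringsAsFactors = FALSE)

# Standardize names, derive a calculated value, and convert DDMMYY to DDMONYYYY
tgt <- data.frame(
  full_name  = paste(tools::toTitleCase(tolower(src$first_name)),
                     tools::toTitleCase(tolower(src$last_name))),
  total_val  = src$first_val + src$second_val,
  order_date = toupper(format(as.Date(src$order_date, format = "%d%m%y"), "%d%b%Y"))
)
print(tgt)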

Step 3 : Data Loading


• The data loading phase loads the prepared data from the staging tables into the main tables.

Step 1: Open Power BI, click on Get Data → OData Feed

Paste URL http://services.odata.org/V3/Northwind/Northwind.svc/


And Click OK

Step 2: Check the box for the Products table and then click on Edit

1) Remove other columns to only display columns of interest

In Query Editor, select the ProductID, ProductName, QuantityPerUnit, and


UnitsInStock columns (use Ctrl+Click to select more than one column, or
Shift+Click to select columns that are beside each other).
Select Remove Columns > Remove Other Columns from the ribbon, or right-
click on a column header and click Remove Other Columns
After selecting Remove Other Columns, only the four selected columns are displayed; the other
columns are discarded.

2) Change the Data Type of the UnitsInStock column

a. Select the UnitsInStock column


b) Select the Data Type drop-down button in the Home ribbon.

c) If not already a Whole Number, select Whole Number for data type
from the drop down (the Data Type: button also displays the data type
for the current selection).

d) Now you can view the fields of the Products table on the right side; check all the fields of the
table to get a representation in chart form.

3) Expand the Orders Table

• Once you have loaded a data source, you can click on Recent Sources to select the
desired table (Orders).
• After selecting the URL, the Navigator window will appear, from which you can select
the Orders table.
• Click on Edit.
The Query Editor window will appear

1. In the Query View, scroll to the Order_Details column.

2. In the Order_Details column, select the expand icon.

3. In the Expand drop-down:


a. Select (Select All Columns) to clear all columns.
b. Select ProductID, UnitPrice, and Quantity.
c. Click OK.

After clicking OK, the following screen appears with the combined columns


ICLES’

MOTILAL JHUNJHUNWALA COLLEGE


DEPARTMENT OF C.S. / I.T

LABORATORY WORK

Name Signature:

Roll No. Date:

PRACTICAL NO- 2(B)

AIM: -Perform the Extraction Transformation and Loading (ETL) process to


construct the database in the SQL server.

Software requirements:
SQL SERVER 2012 FULL VERSION (SQLServer2012SP1-FullSlipstream-ENU-x86)

Step 1: Open SQL Server Management Studio to restore backup file

Step 2: Right click on Databases and then select Restore Database


Step 3: Select Device → click on icon towards end of device box.
Step 4: Click on Add → Select path of backup files

Step 5: Select both files at a time


Step 6: Click OK, and in the Select backup devices window add both AdventureWorks files
Step 7: Open SQL Server Data Tools
Select File → New → Project → Business Intelligence → Integration Services Project &
give appropriate project name.
Environment consists of SQL Server Integration Services(SSIS)
Step 8: Right click on Connection Managers in solution explorer and click on New
Connection Manager.
Add SSIS connection manager window appears.
Step 9: Select OLEDB Connection Manager and Click on Add
Step 10: Configure OLE DB Connection Manager window appears → Click on New

Step 11: Select Server name (as per your machine) from drop down and database name.
Click on Test connection

If test connection is succeeded click on OK.


Step 12: Click on OK

Connection is added to connection manager


Step 13: Drag and drop Data Flow Task in Control Flow tab

Step 14: Drag OLE DB Source from Other Sources and drop into Data Flow tab
Step 15: Double click on OLE DB source → OLE DB Source Editor appears→ click on
New to add connection manager.

Select [Sales].[Store] table from drop down → ok


Click OK

Step 16: Drag OLE DB Destination into the Data Flow tab and connect both
Step 17: Double-click on OLE DB Destination
Click on New to create the destination table; [OLE DB Destination] appears in Name of the table or the
view.
Click OK
Click on Mappings

Click on Error Output

Click on Ok

Step 18: Click on start debugging


Step 19: Go to SQL Server Management Studio
In the database tab → AdventureWorks → right-click on [dbo].[OLE DB Destination] →
Script Table as → SELECT To → New Query Editor Window
Step 20: Execute following query to get output.

USE [AdventureWorks2012]
GO
SELECT [BusinessEntityID]
,[Name]
,[SalesPersonID]
,[Demographics]
,[rowguid]
,[ModifiedDate]
FROM [dbo].[OLE DB Destination]
GO
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO. 3

Aim: Create the cube with suitable dimension and fact tables based on OLAP

Step 1: Creating Data Warehouse



• Let us execute our T-SQL script to create the data warehouse with fact tables and dimensions,
and populate them with appropriate test values.

• Download the T-SQL script for creating the Sales data warehouse from the article “Create First
Data Warehouse” and run it in your SQL Server.

• Download "Data_WareHouse SQLScript.zip" from the article:


https://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse
• After downloading, extract the file into a folder.
• Follow the given steps to run the query in SSMS (SQL Server Management Studio).
1. Open SQL Server Management Studio 2012
2. Connect to the Database Engine
3. Open a new query editor
4. Copy and paste the scripts into the new query editor window, one by one
5. To run the given SQL script, press F5
6. It will create and populate the “Sales_DW” database on your SQL Server
OR
1. Go to the extracted .sql file and double-click it.
2. A new SQL query editor window opens containing the Sales_DW script.
Click Connect.
Step 2: Start SSDT environment and create New Data Source
Go to Sql Server Data Tools --> Right click and run as administrator
Click on File → New → Project
In Business Intelligence → Analysis Services Multidimensional and Data Mining models
→ appropriate project name → click OK
Right click on Data Sources in solution explorer → New Data Source

Data Source Wizard appears

Click on Next

Click on New
Select Server Name → select Use SQL Server Authentication → Select or enter a database
name (Sales_DW)

Note: Password for sa: admin123 (as given during installation of SQL 2012 full version)
Click on Test connection
Click OK

Click Next
Select Inherit → Next

Click Finish
Sales_DW.ds gets created under Data Sources in Solution Explorer

Step 3: Creating New Data Source View


In Solution explorer right click on Data Source View → Select New Data Source View
Click Next

Click Next
Select FactProductSales (dbo) from Available objects and move it to Included objects by
clicking the > button

Click on Add Related Tables


Click Next

Click Finish
Sales DW.dsv appears in Data Source Views in Solution Explorer.

Step 4: Creating new cube


Right click on Cubes → New Cube
Click Next

Select Use existing tables in Select Creation Method → Next


In Select Measure Group Tables → Select FactProductSales→ Click Next

In Select Measures → check all measures → Next


In Select New Dimensions → Check all Dimensions → Next

Click on Finish
Sales_DW.cube is created
Step 5: Dimension Modification
In the Dimensions tab → double-click Dim Product.dim
Drag and drop Product Name from the table in the Data Source View and add it to the Attributes
pane on the left side

Step 6: Creating Attribute Hierarchy in Date Dimension


Double-click on the Dim Date dimension → drag and drop fields from the table shown in the Data
Source View to the Attributes pane → drag and drop attributes from the leftmost Attributes pane to the
middle Hierarchy pane.
Drag the fields into the Hierarchy window in this sequence: Year, Quarter Name,
Month Name, Week of the Month, Full Date UK

Step 7: Deploy Cube


Right click on Project name → Properties
This window appears

Do following changes and click on Apply & ok


Go to Deployment
Right click on project name → Deploy
Deployment successful
To process cube right click on Sales_DW.cube→ Process
Click run

Browse the cube for analysis in solution explorer


ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO. 4

Aim: Execute the MDX queries to extract the data from the data warehouse.

Step 1: Open SQL Server Management Studio and connect to Analysis Services.

Server type: Analysis Services

Server name: (as per the base machine)

Click on Connect
Step 2: Click on New Query and type the following queries based on Sales_DW
select [Measures].[Sales Time Alt Key] on columns
from [Sales DW]

Click on execute

select [Measures].[Quantity] on columns
from [Sales DW]

select [Measures].[Sales Invoice Number] on columns
from [Sales DW]

select [Measures].[Sales Total Cost] on columns
from [Sales DW]

select [Measures].[Sales Total Cost] on columns
, [Dim Date].[Year].[Year] on rows
from [Sales DW]

select [Measures].[Sales Total Cost] on columns
, NONEMPTY({[Dim Date].[Year].[Year]}) on rows
from [Sales DW]

select [Measures].[Sales Total Cost] on columns
from [Sales DW]
where [Dim Date].[Year].[Year].&[2013]
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 5(A)

AIM: - Import the data warehouse data into Microsoft Excel and create the Pivot
Table and Pivot Chart

Creating Data Warehouse


• Let us execute our T-SQL script to create the data warehouse with fact tables and dimensions,
and populate them with appropriate test values.

• Download the T-SQL script for creating the Sales data warehouse from the article “Create First
Data Warehouse” and run it in your SQL Server.

• Download "Data_WareHouse SQLScript.zip" from the article:


https://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse
• After downloading, extract the file into a folder.

• Follow the given steps to run the query in SSMS (SQL Server Management Studio).
1. Open SQL Server Management Studio 2012
2. Connect to the Database Engine

Password for sa: admin123 (as given during installation)


Click Connect.
1. Open a new query editor
2. Copy and paste the scripts into the new query editor window, one by one
3. To run the given SQL script, press F5
4. It will create and populate the “Sales_DW” database on your SQL Server

OR

1. Go to the extracted .sql file and double-click it.
2. A new SQL query editor window opens containing the Sales_DW script.
3. Execute the queries one by one (select each and press F5), or click Execute directly.
4. After execution completes, save and close SQL Server Management Studio, then reopen it to
see Sales_DW in the Databases tab.

Step 1: Open Excel 2016(Professional)

Go to the Data tab → Get External Data → From Other Sources → From Data Connection
Wizard
Step 2: In the Data Connection Wizard → select Microsoft SQL Server → click on Next
Step 3: In Connect to Database Server provide Server name (Microsoft SQL Server
Name)

Provide password for sa account as given during installation of SQL Server 2012 full
version.

Click on Next

Step 4: In Select Database and Table, select Sales_DW (already created in SQL) →
check all dimensions and import relationships between the selected tables.
Click on Next

Step 5: In Save Data Connection files browse path and click finish
Step 6: In Import data select Pivot Chart and Click on OK

Step 7: In fields put SalesDateKey in filters, FullDateUK in axis and Sum of


ProductActualCost in values
Step 8: In Insert Tab → go to Pivot Table
Step 9: Click on Choose Connection to select existing connection with Sales_DW and click
on open
Step 10: In fields, put Customer Name and Gender in Rows, and Sum of SalesPersonID and
Sum of ProductActualCost in Values
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 5(B)

AIM: -Import the cube in Microsoft Excel and create the Pivot table and Pivot
Chart to perform data analysis

Creating Data Warehouse


• Let us execute our T-SQL script to create the data warehouse with fact tables and dimensions,
and populate them with appropriate test values.
• Download the T-SQL script for creating the Sales data warehouse from the article “Create First
Data Warehouse” and run it in your SQL Server.
• Download "Data_WareHouse SQLScript.zip" from the article:
https://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse
• After downloading, extract the file into a folder.

• Follow the given steps to run the query in SSMS (SQL Server Management Studio).
1. Open SQL Server Management Studio 2012
2. Connect to the Database Engine

Password for sa: admin123 (as given during installation)


Click Connect.
1. Open a new query editor
2. Copy and paste the scripts into the new query editor window, one by one
3. To run the given SQL script, press F5
4. It will create and populate the “Sales_DW” database on your SQL Server

OR

1. Go to the extracted .sql file and double-click it.
2. A new SQL query editor window opens containing the Sales_DW script.
3. Execute the queries one by one (select each and press F5), or click Execute directly.
4. After execution completes, save and close SQL Server Management Studio, then reopen it to
see Sales_DW in the Databases tab.

Step 1: Open Excel 2016(Professional)

Go to the Data tab → Get External Data → From Other Sources → From Analysis Services
Step 2: In the Data Connection Wizard → enter the Microsoft SQL Server name, select Windows
Authentication and click on Next.
Step 3: Select the OLAP cube (as created before) and click on Next

Step 4: Browse and select path name and click on Finish


Step 5: Select PivotTableReport → OK

Step 6: Drag and Drop Fields in rows column and values

Put TimeKey in columns, CustomerID and ProductName in Rows, Quantity in Values.

Step 7: Go to Insert tab → pivot chart and select Pivot Chart from drop down
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 06

AIM: - Apply What-If Analysis for data visualization.


Design and generate necessary reports based on the data warehouse data.

• You run a book store and have 100 books in storage. You sell a certain % for the highest price of $50 and a
certain % for the lower price of $20.
• If you sell 60% for the highest price, cell D10 calculates a total profit of 60 * $50 +
40 * $20 = $3800.
Create Different Scenarios
• But what if you sell 70% for the highest price? And what if you sell 80%? Or 90%,
or even 100%? Each different percentage is a different scenario.
• You can use the Scenario Manager to create these scenarios; the short sketch below shows the profit each scenario should yield.
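
As a quick cross-check (an illustrative R calculation, not part of the Excel steps; the 100 books and the $50/$20 prices come from the example above), the expected profit for each scenario is:

# Total profit for selling pct_high of 100 books at $50 and the rest at $20
profit <- function(pct_high, n_books = 100, high = 50, low = 20) {
  n_books * (pct_high * high + (1 - pct_high) * low)
}
sapply(c(0.6, 0.7, 0.8, 0.9, 1.0), profit)
# 3800 4100 4400 4700 5000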

1. On the Data tab, in the Forecast group, click What-If Analysis.

2. Click Scenario Manager.


3. The Scenario Manager Dialog box appears.
Add a scenario by clicking on Add.
4. Type a name (60% highest), select cell C4 (% sold for the highest price) for the Changing cells
and click on OK.

5. Enter the corresponding value 0.6 and click on OK again.


6. Next, add 4 other scenarios (70%, 80%, 90% and 100%).

Finally, your Scenario Manager should be consistent with the picture below:
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 07

AIM: - Implementation of a Classification algorithm in R Programming. (Data
Analysis using Time Series Analysis)

Software required: R 3.5.1

• Time series is a series of data points in which each data point is associated with a timestamp.
• A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
• The R language uses many functions to create, manipulate and plot time series data.
• The data for a time series is stored in an R object called a time-series object. It is also an R data object,
like a vector or data frame.
• The time series object is created by using the ts() function.

Syntax
• The basic syntax of the ts() function in time series analysis is:
timeseries.object.name <- ts(data, start, end, frequency)

• Following is the description of the parameters used:


• data is a vector or matrix containing the values used in the time series.
• start specifies the start time for the first observation in the time series.
• end specifies the end time for the last observation in the time series.
• frequency specifies the number of observations per unit time.
Except for the "data" parameter, all other parameters are optional.
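
For instance, a quarterly series can be created with the optional end argument and frequency = 4 (the sales values below are hypothetical, used only to illustrate the parameters):

# Hypothetical quarterly sales from 2012 Q1 to 2013 Q4
sales <- ts(c(120, 135, 150, 160, 170, 165, 180, 190),
            start = c(2012, 1), end = c(2013, 4), frequency = 4)
print(sales)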

Consider the annual rainfall details at a place starting from January 2012. We create an R time
series object for a period of 12 months and plot it.
Code to run in R

# Get the data points in form of an R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)

# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall, start = c(2012,1), frequency = 12)

# Print the time series data.
print(rainfall.timeseries)

# Give the chart file a name.
png(file = "rainfall.png")

# Plot a graph of the time series.
plot(rainfall.timeseries)

# Save the file.
dev.off()

# After this, plot again to get the chart on screen.
plot(rainfall.timeseries)
OUTPUT:
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 08

AIM: Perform the data clustering using clustering algorithm.

Theory:

• K-means clustering is a type of unsupervised learning, which is used when you have
unlabeled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:

• The centroids of the K clusters, which can be used to label new data

• Labels for the training data (each data point is assigned to a single cluster)

• Rather than defining groups before looking at the data, clustering allows you to find and
analyze the groups that have formed organically. The number of groups K still has to be
chosen; a common approach, the elbow method, is sketched below.
• Each centroid of a cluster is a collection of feature values which define the resulting
groups. Examining the centroid feature weights can be used to qualitatively interpret what
kind of group each cluster represents.
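
A common way to choose K is the elbow method: run kmeans() for a range of K values and look for the point where the total within-cluster sum of squares stops dropping sharply. A minimal sketch on the iris data (illustrative; the range K = 1 to 10 and nstart = 25 are assumed choices):

# Elbow method: total within-cluster sum of squares for K = 1..10
newiris <- iris
newiris$Species <- NULL
set.seed(20)
wss <- sapply(1:10, function(k) kmeans(newiris, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster sum of squares")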
k-means clustering using R

#apply K means to iris and store result

newiris <- iris

newiris$Species <- NULL

(kc <- kmeans(newiris,3))

# Compare the Species label with the clustering result
table(iris$Species, kc$cluster)


#Plot the clusters and their centers

plot(newiris[c("Sepal.Length","Sepal.Width")],col=kc$cluster)
points(kc$centers[,c("Sepal.Length","Sepal.Width")],col=1:3,pch=8,cex=2)

dev.off()

#Plot the clusters and their centre

plot(newiris[c("Sepal.Length","Sepal.Width")],col=kc$cluster)

points(kc$centers[,c("Sepal.Length","Sepal.Width")],col=1:3,pch=8,cex=2)
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 09

AIM: Perform the Linear regression on the given data warehouse data.

Theory:
• In linear regression the two variables (predictor and response) are related through an equation in which
the exponent (power) of both variables is 1.
• Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear
relationship, where the exponent of any variable is not equal to 1, creates a curve.
• y = ax + b is the equation for linear regression,
where y is the response variable, x is the predictor variable, and a and b are constants called
the coefficients.
• A simple example of regression is predicting the weight of a person when his height is known. To do
this we need to have the relationship between the height and weight of a person.

The steps to create the relationship are:


• Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and build the mathematical equation
using these.
• Get a summary of the relationship model to know the average error in prediction
(also called residuals).
• To predict the weight of new persons, use the predict() function in R.

Input Data:
Below is the sample data representing the observations –

# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function:
This function creates the relationship model between the
predictor and the response variable.

Syntax:
The basic syntax for the lm() function in linear regression is:

lm(formula, data)
Following is the description of the parameters used:
• formula is a symbol presenting the relation between x and y.
• data is the vector on which the formula will be applied.

A. Create Relationship Model & get the Coefficients


# Values of height
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# Values of weight
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

print(relation)

OUTPUT:
B. Get the Summary of the Relationship
# Values of height
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# Values of weight
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

print(summary(relation))

OUTPUT:
predict() Function

Syntax:
The basic syntax for predict() in linear regression is:
predict(object, newdata)

Following is the description of the parameters used:


• object is the model which has already been created using the lm() function.
• newdata is the vector containing the new value for the predictor variable.

C. Predict the weight of new persons


# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y~x)

# Find weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)

OUTPUT:
D. Visualize the Regression Graphically
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)

# Give the chart file a name.
png(file = "linearregression.png")

# Plot the chart.
plot(y, x, col = "blue", main = "Height & Weight Regression",
     abline(lm(x~y)), cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")

# Save the file.
dev.off()

# Plot the chart again (on screen).
plot(y, x, col = "blue", main = "Height & Weight Regression")

plot(y, x, col = "blue", main = "Height & Weight Regression",
     abline(lm(x~y)), cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK

Name Signature:
Roll No. Date:

PRACTICAL NO- 10

AIM: Perform the Logistic regression on the given data warehouse data.

Software required: R 3.5.1

Theory:
• Logistic regression is the appropriate regression analysis to conduct when the dependent variable is
dichotomous (binary).
• Like all regression analyses, logistic regression is a predictive analysis.
• Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Data Exploration:
> library(datasets)
> ir_data<- iris
> head(ir_data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

> str(ir_data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

So, we have a data frame with 150 observations of 5 variables. The first 4 variables give information
about plant attributes in centimeters and the last one gives us the name of the plant species. Species is given
as a factor variable with 3 levels:
> levels(ir_data$Species)
[1] "setosa" "versicolor" "virginica"

We should check whether we have any NA values in our dataset:


> sum(is.na(ir_data))
[1] 0

So, we are dealing with a complete dataset here. As we want to use logistic regression here, let's
subset the data so that we deal with 2 species of plants rather than 3 (because logistic regression
is built on binary outcomes).

> ir_data<-ir_data[1:100,]
Also we will randomly define our Test and Control groups:

> set.seed(100)

> samp<-sample(1:100,80)
> ir_test<-ir_data[samp,]

> ir_ctrl<-ir_data[-samp,]

We will use the test set to create our model and the control set to check it. Now, let's
explore the dataset a little bit more with the help of plots:
> install.packages("ggplot2")

Select a CRAN mirror when prompted (e.g., France [Lyon 1] [https])


> library(ggplot2)

> install.packages("GGally")
> library(GGally)
> ggpairs(ir_test)

While drawing the pairs plot, ggpairs prints progress messages and warnings such as:
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Model Fitting
Now, we will try to model this data using logistic regression. Here, we will keep it
simple and will use only a single variable:
> y<-ir_test$Species; x<-ir_test$Sepal.Length

> glfit<-glm(y~x, family = 'binomial')

The default link function for the above model is 'logit', which is what we want. We can use
this simple model to get the probability of whether a given plant is 'setosa' or
'versicolor' based on its sepal length. Before jumping to predictions, let us have
a look at the model itself.

> summary(glfit)

Call:
glm(formula = y ~ x, family = "binomial")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.94538  -0.50121   0.04079   0.45923   2.26238

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -25.386      5.517  -4.601 4.20e-06 ***
x              4.675      1.017   4.596 4.31e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110.854  on 79  degrees of freedom
Residual deviance:  56.716  on 78  degrees of freedom
AIC: 60.716

Number of Fisher Scoring iterations: 6
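
Using the coefficients reported above, the fitted model is p(versicolor) = 1 / (1 + exp(-(-25.386 + 4.675 * Sepal.Length))). A quick sketch evaluating this by hand (coefficients copied from the summary output; the three sepal lengths are arbitrary):

# Probability of 'versicolor' implied by the fitted coefficients above
b0 <- -25.386
b1 <- 4.675
p  <- function(x) 1 / (1 + exp(-(b0 + b1 * x)))
p(c(5.0, 5.5, 6.0))   # roughly 0.12, 0.58, 0.94, matching the predictions below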
We can see that the p-values indicate highly significant results for this model.
Although any model should be checked more deeply, for now we will move on to the prediction part.

Checking the Model's Predictions

Let us use our Control set which we defined earlier to predict using this model:
> newdata<- data.frame(x=ir_ctrl$Sepal.Length)

> predicted_val<-predict(glfit, newdata, type="response")

> prediction<-data.frame(ir_ctrl$Sepal.Length, ir_ctrl$Species,predicted_val)

> prediction
   ir_ctrl.Sepal.Length ir_ctrl.Species predicted_val
1                   5.1          setosa   0.176005274
2                   4.7          setosa   0.031871367
3                   4.6          setosa   0.020210042
4                   5.0          setosa   0.118037011
5                   4.6          setosa   0.020210042
6                   4.3          setosa   0.005048194
7                   4.6          setosa   0.020210042
8                   5.2          setosa   0.254235573
9                   5.2          setosa   0.254235573
10                  5.0          setosa   0.118037011
11                  5.0          setosa   0.118037011
12                  6.6      versicolor   0.995801728
13                  5.2      versicolor   0.254235573
14                  5.8      versicolor   0.849266756
15                  6.2      versicolor   0.973373695
16                  6.6      versicolor   0.995801728
17                  5.5      versicolor   0.580872616
18                  6.3      versicolor   0.983149322
19                  5.7      versicolor   0.779260130
20                  5.7      versicolor   0.779260130
The table above shows the probability our model predicts for a given plant to be 'versicolor'
based on its sepal length. Let's visualize this result using a simple plot, and say that we will
consider any plant to be 'versicolor' if its probability for the same is more than 0.5:
> qplot(prediction[,1], round(prediction[,3]), col=prediction[,2], xlab = 'Sepal Length', ylab = 'Prediction
using Logistic Reg.')

So, from the above plot, we can see that our simple model is doing a fairly good job of predicting
the plant species. We can also see a blue dot in the bottom cluster: although the correct species of
this plant is 'versicolor', our model predicts it as 'setosa'.
