ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK
Name Signature:
Roll No. Date:
PRACTICAL NO. 1
Aim: Import legacy data from different sources (such as Excel, SQL Server, Oracle, etc.) and load it into the target system. (You can download a sample database such as AdventureWorks, Northwind, or FoodMart.)
Step 2: From the Home ribbon, select Get Data. Excel is one of the most common data connections, so you can select it directly from the Get Data menu.
Step 3: Select the required file and click Open; the Navigator screen appears.
In this task, you'll bring in order data. This step represents connecting to a sales system.
You import data into Power BI Desktop from the sample Northwind OData feed at the following URL,
which you can copy (and then paste) in the steps below:
http://services.odata.org/V3/Northwind/Northwind.svc/
Step 3: In the OData Feed dialog box, paste the URL for the Northwind OData feed.
Step 5: In the Navigator pane, select the Orders table, and then select Edit.
Name Signature:
Roll No. Date:
Aim: Perform the Extraction, Transformation and Loading (ETL) process to construct the database in Power BI.
Data extraction is the first step of ETL. There are two types of data extraction:
1. Full Extraction: All the data from the source or operational systems is extracted to the staging area (initial load).
2. Partial Extraction: Sometimes the source system notifies us to extract only the data changed since a specific date. This is called a delta load.
Source System Performance: The extraction strategy should not affect source system performance.
Data transformation is the second step. After extracting the data, it must be transformed to match the target system. Some key points about data transformation (a short sketch follows the list):
• Data extracted from the source system is in raw format. We need to transform it before loading it into the target server.
• Data has to be cleaned, mapped and transformed.
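As a minimal illustration of these three steps outside Power BI, here is a small R sketch (the file orders_source.csv, its columns, and the target table Orders are all assumed names used only for illustration):
# --- Extract: read raw data from a source file (a full extraction) ---
library(DBI)
library(RSQLite)
raw <- read.csv("orders_source.csv", stringsAsFactors = FALSE)  # assumed source file
# --- Transform: clean and map the raw data to the target schema ---
raw$OrderDate <- as.Date(raw$OrderDate)         # normalize the date format
raw <- raw[!is.na(raw$OrderID), ]               # drop rows missing the key
names(raw)[names(raw) == "Qty"] <- "Quantity"   # map a source column to its target name
# --- Load: write the transformed rows into the target database ---
con <- dbConnect(RSQLite::SQLite(), "target.db")
dbWriteTable(con, "Orders", raw, overwrite = TRUE)
dbDisconnect(con)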
Step 2: Click on Check Box of Products table and then click on Edit
c) If not already a Whole Number, select Whole Number for data type
from the drop down (the Data Type: button also displays the data type
for the current selection).
d) Now you can view the fields of the Products table on the right side; check all the fields of the table to get a representation in chart form.
Once you have loaded a data source, you can click on Recent Sources to select the desired table (Orders).
After selecting the URL, the Navigator window will appear, from which you can select the Orders table.
Click on Edit.
Query Editor Window will appear
LABORATORY WORK
Name Signature:
Software requirements:
SQL SERVER 2012 FULL VERSION (SQLServer2012SP1-FullSlipstream-ENU-x86)
Step 11: Select Server name (as per your machine) from drop down and database name.
Click on Test connection
Step 14: Drag OLE DB Source from Other Sources and drop into Data Flow tab
Step 15: Double click on OLE DB source → OLE DB Source Editor appears→ click on
New to add connection manager.
Step 16: Drag OLE DB Destination into the Data Flow tab and connect both.
Step 17: Double click on OLE DB Destination.
Click New; this runs a CREATE TABLE query and [OLE DB Destination] appears in "Name of the table or the view".
Click OK.
Click on Mappings.
Click OK.
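The following T-SQL query, run in SSMS against AdventureWorks2012, verifies the rows loaded into the destination table: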
USE [AdventureWorks2012]
GO
SELECT [BusinessEntityID]
,[Name]
,[SalesPersonID]
,[Demographics]
,[rowguid]
,[ModifiedDate]
FROM [dbo].[OLE DB Destination]
GO
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK
Name Signature:
Roll No. Date:
PRACTICAL NO. 3
Aim: Create the cube with suitable dimension and fact tables based on OLAP
Download the T-SQL script attached with this article for creating the Sales data warehouse, or download it from the article “Create First Data Warehouse”, and run it in your SQL Server.
Click Connect.
Step 2: Start SSDT environment and create New Data Source
Go to Sql Server Data Tools --> Right click and run as administrator
Click on File → New → Project
In Business Intelligence → Analysis Services Multidimensional and Data Mining models
→ appropriate project name → click OK
Right click on Data Sources in solution explorer → New Data Source
Click on Next
Click on New
Select Server Name → select Use SQL Server Authentication → Select or enter a database
name (Sales_DW)
Note: Password for sa: admin123 (as given during installation of SQL 2012 full version)
Click on Test connection
Click OK
Click Next
Select Inherit → Next
Click Finish
Sales_DW.ds gets created under Data Sources in Solution Explorer
Click Next
Select FactProductSales (dbo) from Available objects and move it into Included objects by clicking on the > button.
Click Finish
Sales_DW.dsv appears under Data Source Views in Solution Explorer.
Click on Finish
Sales_DW.cube is created
Step 5: Dimension Modification
In dimension tab → Double Click Dim Product.dim
Drag and Drop Product Name from Table in Data Source View and Add in Attribute
Pane at left side
Name Signature:
Roll No. Date:
PRACTICAL NO. 4
Aim: Execute the MDX queries to extract the data from the data warehouse.
Step 1: Open SQL Server Management Studio and connect to Analysis Services (select the Server name as per your machine). Click on Connect.
Step 2: Click on New Query & type following query based on Sales_DW
SELECT [Measures].[Sales Time Alt Key] ON COLUMNS
FROM [Sales DW]
Click on execute
SELECT [Measures].[Quantity] ON COLUMNS
FROM [Sales DW]
Name Signature:
Roll No. Date:
AIM: Import the data warehouse data in Microsoft Excel and create the Pivot table and Pivot Chart.
Download the T-SQL script attached with this article for creating the Sales data warehouse, or download it from the article “Create First Data Warehouse”, and run it in your SQL Server.
Follow the given steps to run the query in SSMS (SQL Server Management Studio).
1. Open SQL Server Management Studio 2012
2. Connect Database Engine
OR
4. After completing execution save and close SQL Server Management studio &
Reopen to see Sales_DW in Databases Tab.
Go to Data tab → Get External Data → From Other Sources → From Data Connection Wizard
Step 2: In the Data Connection Wizard, select Microsoft SQL Server and click Next.
Step 3: In Connect to Database Server provide Server name (Microsoft SQL Server
Name)
Provide password for sa account as given during installation of SQL Server 2012 full
version.
Click on Next
Step 4: In Select Database and Table Select Sales_DW (already created in SQL)
check all dimensions and import relationships between selected tables.
Click on Next
Step 5: In Save Data Connection File, browse to a path and click Finish.
Step 6: In Import Data, select Pivot Chart and click OK.
Name Signature:
Roll No. Date:
AIM: Import the cube in Microsoft Excel and create the Pivot table and Pivot Chart to perform data analysis.
Follow the given steps to run the query in SSMS (SQL Server Management Studio).
1. Open SQL Server Management Studio 2012
2. Connect Database Engine
OR
4. After completing execution save and close SQL Server Management studio & Reopen to
see Sales_DW in Databases Tab.
Go to Data tab → Get External Data → From Other Sources → From Analysis Services
Step 2: In the Data Connection Wizard, enter the Microsoft SQL Server name, select Windows Authentication, and click Next.
Step 3: Select the OLAP cube (as created before) and click Next.
Step 7: Go to Insert tab → pivot chart and select Pivot Chart from drop down
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK
Name Signature:
Roll No. Date:
PRACTICAL NO- 06
Assume you own a book store and have 100 books in storage. You sell a certain % for the highest price of $50 and a certain % for the lower price of $20.
If you sell 60% for the highest price, cell D10 calculates a total profit of 60 * $50 +
40 * $20 = $3800.
Create Different Scenarios
But what if you sell 70% for the highest price? And what if you sell 80% for the highest price? Or 90%,
or even 100%? Each different percentage is a different scenario.
You can use the Scenario Manager to create these scenarios.
Finally, your Scenario Manager should be consistent with the picture below:
6. Next, add 4 other scenarios (70%, 80%, 90% and 100%).
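As a quick cross-check of the scenario arithmetic, here is a short R sketch (the percentages and prices are taken from the exercise above):
# Profit when a fraction pct_high of the 100 books sells at $50 and the rest at $20
profit <- function(pct_high) 100 * (pct_high * 50 + (1 - pct_high) * 20)
sapply(c(0.6, 0.7, 0.8, 0.9, 1.0), profit)
# 3800 4100 4400 4700 5000 -- matching the Scenario Manager results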
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK
Name Signature:
Roll No. Date:
PRACTICAL NO- 07
Time series is a series of data points in which each data point is associated with a timestamp.
A simple example is the price of a stock in the stock market at different points of time on a given day.
Another example is the amount of rainfall in a region at different months of the year.
R language uses many functions to create, manipulate and plot the time series data.
The data for a time series is stored in an R object called a time-series object. It is also an R data object, like a vector or data frame.
The time series object is created by using the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
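A minimal working example (the rainfall values below are assumed sample data used only for illustration):
# Monthly rainfall figures (assumed sample data)
rainfall <- c(799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
              685.5, 998.6, 784.2, 985, 882.8, 1071)
# Convert to a time-series object starting January 2012 with monthly frequency
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)
# Give the chart file a name, plot the series, then save the file.
png(file = "rainfall.png")
plot(rainfall.timeseries)
dev.off()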
Name Signature:
Roll No. Date:
PRACTICAL NO- 08
Theory:
K-means clustering is a type of unsupervised learning, which is used when you have
unlabeled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature
similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that form organically. The number of groups K is chosen beforehand, commonly by comparing results for several values of K.
Each centroid of a cluster is a collection of feature values which define the resulting
groups. Examining the centroid feature weights can be used to qualitatively interpret what
kind of group each cluster represents.
k-means clustering using R
# Use the iris data without the class label (unsupervised learning)
newiris <- iris
newiris$Species <- NULL
# Cluster the observations into K = 3 groups
kc <- kmeans(newiris, 3)
plot(newiris[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)
points(kc$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)
ICLES’
MOTILAL JHUNJHUNWALA COLLEGE
DEPARTMENT OF C.S. / I.T
LABORATORY WORK
Name Signature:
Roll No. Date:
PRACTICAL NO- 09
AIM: Perform the Linear regression on the given data warehouse data.
Theory:
In linear regression, the two variables (the predictor and the response) are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of a variable is not equal to 1, creates a curve.
y = ax + b is an equation for linear regression.
Where, y is the response variable, x is the predictor variable and a and b are constants which are
called the coefficients.
A simple example of regression is predicting the weight of a person when their height is known. To do this we need the relationship between the height and weight of a person.
Input Data:
Below is the sample data representing the observations –
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function:
This function creates the relationship model between the predictor and the response variable.
Syntax :
The basic syntax for lm() function in linear regression is −
lm(formula,data)
Following is the description of the parameters used −
• formula is a symbol presenting the relation between x and y.
• data is the vector on which the formula will be applied.
# Values of height
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# Values of weight
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(relation)
OUTPUT:
B. Get the Summary of the Relationship
# Values of height
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# Values of weight
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
print(summary(relation))
OUTPUT:
predict() Function
Syntax:
The basic syntax for predict() in linear regression is −
predict(object, newdata)
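A minimal sketch of this step, reusing the relation model fitted above (the height value 170 is an assumed example):
# Find the weight of a person with height 170 cm
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)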
OUTPUT:
D. Visualize the Regression Graphically
# Create the predictor and response variable.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
png(file = "linearregression.png")          # give the chart file a name
plot(x, y, col = "blue", main = "Height & Weight Regression",
     xlab = "Height in cm", ylab = "Weight in kg")
abline(relation)                            # add the fitted regression line
# Save the file.
dev.off()
Name Signature:
Roll No. Date:
PRACTICAL NO- 10
AIM: Perform the Logistic regression on the given data warehouse data.
Theory:
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is
dichotomous (binary).
Like all regression analyses, the logistic regression is a predictive analysis.
Logistic regression is used to describe data and to explain the relationship between one dependent binary
variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Data Exploration:
> library(datasets)
> ir_data<- iris
> head(ir_data)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> str(ir_data)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
So, we have a data frame with 150 observations of 5 variables. The first 4 variables give information about plant attributes in centimeters, and the last one gives us the name of the plant species. Species is given as a Factor variable with 3 levels:
> levels(ir_data$Species)
[1] "setosa" "versicolor" "virginica"
So, we are dealing with a complete dataset here. As we want to use logistic regression in this post, let's subset the data so that we have to deal with 2 species of plants rather than 3 (because logistic regression is built on binary outcomes):
> ir_data<-ir_data[1:100,]
Also we will randomly define our Test and Control groups:
> set.seed(100)
> samp<-sample(1:100,80)
> ir_test<-ir_data[samp,]
> ir_ctrl<-ir_data[-samp,]
We will use the test set to create our model and the control set to check it. Now, let's explore the dataset a little bit more with the help of plots:
> install.packages("ggplot2")
> install.packages("GGally")
> library(GGally)
> ggpairs(ir_test)
(ggpairs draws a matrix of pairwise plots for the test set; while rendering it prints progress messages such as: `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.)
Model Fitting
Now, we will try to model this data using logistic regression. Here, we will keep it simple and use only a single variable:
> y <- ir_test$Species; x <- ir_test$Sepal.Length
> glfit <- glm(y ~ x, family = "binomial")
The default link function for the above model is 'logit', which is what we want. We can use this simple model to get the probability of whether a given plant is 'setosa' or 'versicolor' based on its 'sepal length'. Before jumping to predictions, let us have a look at the model itself.
> summary(glfit)
Call:
glm(formula = y ~ x, family = "binomial")

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.94538  -0.50121   0.04079   0.45923   2.26238

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -25.386      5.517  -4.601 4.20e-06 ***
x              4.675      1.017   4.596 4.31e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 110.854  on 79  degrees of freedom
Residual deviance:  56.716  on 78  degrees of freedom
AIC: 60.716

Number of Fisher Scoring iterations: 6
We can see that the P-values indicate highly significant results for this model. Although we should examine any model more deeply, for now we will move on to the prediction part.
Checking Model's Predictions
Let us use our Control set which we defined earlier to predict using this model:
> newdata <- data.frame(x = ir_ctrl$Sepal.Length)
> predicted_val <- predict(glfit, newdata, type = "response")
> prediction <- data.frame(ir_ctrl$Sepal.Length, ir_ctrl$Species, predicted_val)
> prediction
   ir_ctrl.Sepal.Length ir_ctrl.Species predicted_val
1                   5.1          setosa   0.176005274
2 4.7 setosa 0.031871367
3 4.6 setosa 0.020210042
4 5.0 setosa 0.118037011
5 4.6 setosa 0.020210042
6 4.3 setosa 0.005048194
7 4.6 setosa 0.020210042
8 5.2 setosa 0.254235573
9 5.2 setosa 0.254235573
10 5.0 setosa 0.118037011
11 5.0 setosa 0.118037011
12 6.6 versicolor 0.995801728
13 5.2 versicolor 0.254235573
14 5.8 versicolor 0.849266756
15 6.2 versicolor 0.973373695
16 6.6 versicolor 0.995801728
17 5.5 versicolor 0.580872616
18 6.3 versicolor 0.983149322
19 5.7 versicolor 0.779260130
20 5.7 versicolor 0.779260130
We can see in the table above what probability our model predicts for a given plant to be 'versicolor' based on its 'sepal length'. Let's visualize this result with a simple plot, and say that we will consider any plant to be 'versicolor' if its probability for the same is more than 0.5:
> qplot(prediction[,1], round(prediction[,3]), col = prediction[,2], xlab = 'Sepal Length', ylab = 'Prediction using Logistic Reg.')
So, from the above plot, we can see that our simple model does a fairly good job of predicting plant species. We can also see a blue dot in the bottom cluster: although the correct species of this plant is 'versicolor', our model predicts it as 'setosa'.