Lp Vi Lab Manual 2022-23 Final

LP-VI BE Computer Engineering (2022-23)
Vision
“Service to society through quality education.”
Mission
• Generation of national wealth through education and research
• Imparting quality technical education at the cost affordable to all strata of
the society
• Enhancing the quality of life through sustainable development
• Carrying out high-quality intellectual work
• Achieving the distinction of the highest preferred engineering college in the
eyes of the stakeholders
Department of Computer Engineering, AISSMS COE, Pune

Department of Computer Engineering
Vision
“Contributing to the welfare of society through technical and quality education.”
Mission
• To produce Best Quality Computer Science Professionals by imparting quality training, hands-on experience
and value education.
• To strengthen links with Industry through partnerships and collaborative developmental works.
• To attain self-sustainability and overall development through Research, Consultancy and Development
Activities.
• To extend technical expertise to other technical Institutions of the region and play a lead role in imparting
technical education.

Programme Outcomes (POs)

PO1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and
an engineering specialization to the solution of complex engineering problems.
PO2. Problem analysis: Identify, formulate, research literature, and analyze complex engineering problems
reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering
sciences.
PO3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the public
health and safety, and the cultural, societal, and environmental considerations.
PO4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information to provide
valid conclusions.
PO5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern engineering
and IT tools including prediction and modelling to complex engineering activities with an understanding of the
limitations.
PO6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess societal,
health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional
engineering practice.
PO7. Environment and sustainability: Understand the impact of the professional engineering solutions in
societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable development.
PO8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of the
engineering practice.
PO9. Individual and team work: Function effectively as an individual, and as a member or leader in diverse
teams, and in multidisciplinary settings.
PO10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and design
documentation, make effective presentations, and give and receive clear instructions.
PO11. Project management and finance: Demonstrate knowledge and understanding of the engineering and
management principles and apply these to one’s own work, as a member and leader in a team, to manage
projects and in multidisciplinary environments.
PO12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Programme Specific Outcomes (PSOs)
Computer Engineering graduate will be able to,
PSO1: Project Development: Successfully complete hardware and/or software related system or application
projects, using the phases of project development life cycle to meet the requirements of service and product
industries; government projects; and automate other engineering stream projects.
PSO2: Domain Expertise: Demonstrate effective application of knowledge gained from different computer
domains like, data structures, data bases, operating systems, computer networks, security, parallel
programming, in project development, research and higher education.
PSO3: Career Development: Achieve successful Career and Entrepreneurship- The ability to employ modern
computer languages, environments, and platforms in creating innovative career paths to be an entrepreneur,
and a zest for higher studies.

Companion Course: Business Intelligence Laboratory (410253)

Course Objectives:
• To understand the fundamental concepts and techniques of natural language processing (NLP)
• To understand Digital Image Processing concepts
• To learn the fundamentals of software defined networks
• To explore the knowledge of adaptive filtering and Multi-rate DSP
• To be familiar with the various application areas of soft computing
• To introduce the concepts and components of Business Intelligence
• To study Quantum Algorithms and apply these to develop hybrid solutions
Course Outcomes:
On completion of the course, learner will be able to
• CO1: Apply basic principles of elective subjects to problem solving and modeling.
• CO2: Use tools and techniques in the area of software development to build mini projects.
• CO3: Design and develop applications on subjects of their choice.
• CO4: Generate and manage deployment, administration & security.

Lab Manual
(410256)
Laboratory Practice VI
BE Computer
Year: 2022-2023
Sem-II
Prepared By: Mrs S J Pachouly
Marking Scheme: Term Work: 50 Marks

INDEX
Sr. No. Name of Assignment
1. Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library. Use
Porter stemmer and Snowball stemmer for stemming. Use any technique for lemmatization.
2. Perform bag-of-words approach (count occurrence, normalized count occurrence), tf-idf on data. Create
embeddings using Word2Vec.
3. Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label
encoding. Create representations using TF-IDF. Save outputs.
4. Create a transformer from scratch using the PyTorch library.
5. Morphology is the study of the way words are built up from smaller meaning-bearing units. Study and
understand the concepts of morphology by the use of add delete table.
6. Mini Project (Fine tune transformers on your preferred task): Finetune a pretrained transformer for any of
the following tasks on any relevant dataset of your choice:
• Neural Machine Translation
• Classification
• Summarization
7. Import the legacy data from different sources (such as Excel, SQL Server, Oracle etc.) and load it in the
target system. (You can download a sample database such as AdventureWorks, Northwind, FoodMart etc.)
8. Perform the Extraction Transformation and Loading (ETL) process to construct the database in SQL Server.
9. Create the cube with suitable dimension and fact tables based on ROLAP, MOLAP and HOLAP model.
10. Import the data warehouse data in Microsoft Excel and create the Pivot table and Pivot Chart.
11. Perform the data classification using classification algorithm. Or perform the data clustering using
clustering algorithm.
12. Mini Project: Each group of 4 students (max) is assigned one case study for this. A BI report must be
prepared outlining the following steps:
a) Problem definition, identifying which data mining task is needed.
b) Identify and use a standard data mining dataset available for the problem

Assignment: 1
Aim:
Perform tokenization (Whitespace, Punctuation-based, Treebank, Tweet, MWE) using NLTK library. Use
Porter stemmer and Snowball stemmer for stemming. Use any technique for lemmatization.
Tokenization:
1. Install NLTK library using pip.
2. Import the necessary modules from the NLTK library.
3. Load the text data that needs to be tokenized.
4. Use the appropriate tokenization method based on your requirements, such as whitespace tokenization,
punctuation-based tokenization, treebank tokenization, tweet tokenization or multi-word expression
tokenization.
5. Store the tokenized data in a variable for further processing.
Stemming:
1. Import the required stemmer classes from the NLTK library.
2. Load the text data that needs to be stemmed.
3. Choose the appropriate stemming algorithm, such as Porter stemmer or Snowball stemmer.
4. Initialize the stemmer object and use the stem method to perform stemming on the text data.
5. Store the stemmed data in a variable for further processing.
Lemmatization:
1. Import the WordNetLemmatizer class from the NLTK library.
2. Load the text data that needs to be lemmatized.
3. Use the WordNetLemmatizer class to create a lemmatizer object.
4. Use the lemmatize method to perform lemmatization on the text data.
5. Store the lemmatized data in a variable for further processing.
Input/ Dataset: Use any sample sentence.
Objective:
To perform tokenization, stemming and lemmatization in Python on any input data (i.e. a sentence) using the NLTK library.
Pre-Requisite:
Python programming
NLTK library
Theory:
Tokenization:
Tokenization is the process of breaking down a text into smaller components called tokens. Tokens can be words,
phrases, sentences or any other meaningful unit of text.

Tokenization is a crucial step in many Natural Language Processing (NLP) applications, such as text
classification, named entity recognition, and machine translation. There are many libraries available for
tokenization. Here are some popular libraries used for tokenization.
1] NLTK (Natural Language Toolkit):
It is a popular Python library for NLP tasks, including tokenization. NLTK provides different tokenization
methods, including whitespace tokenization, word tokenization, sentence tokenization and more.
2] spaCy:
It is another popular Python library for NLP tasks that provides efficient tokenization. spaCy's tokenizer is
rule-based and supports custom tokenization rules.
3] Stanford CoreNLP:
It is a suite of NLP tools written in Java that supports multiple human languages. CoreNLP provides
tokenization, sentence splitting, part-of-speech tagging and more.
4] TensorFlow Text:
It is a library of text-processing operations for TensorFlow. It provides various tokenization methods, including
whitespace tokenization and subword tokenization.
5] TextBlob:
It is a Python library for common NLP tasks. It provides word tokenization and sentence tokenization, among
other features.
Here, we are performing tokenization using NLTK library. There are different methods of tokenization using
NLTK, each with its own advantages and limitations. The methods we are using here are as follows:
1) Whitespace tokenization:
This technique splits the text into tokens based on whitespace characters such as space, tab, and newline. It is a
simple and fast technique, but punctuation stays attached to the neighbouring word, and it does not work well for
languages that do not use whitespace as a delimiter.
2) Punctuation-based tokenization:
This technique splits the text into tokens at punctuation marks such as commas, periods, semi-colons, etc.
It separates punctuation from words, but it also splits words that contain punctuation marks, such as
“don’t” and “U.S.A.”
3) Treebank tokenization:
This technique uses a set of rules to tokenize text based on punctuation, as well as special cases such as
contractions. It is more accurate than both of the previous techniques, but it requires more processing and may
not work well for non-standard text.
4) Tweet tokenization:
This technique is similar to punctuation-based tokenization, but it also handles the special syntax and vocabulary
used in tweets, such as hashtags and mentions.
5) Multi-word Expression (MWE) tokenization:
This tokenization technique identifies multi-word expressions such as “New York” and “rock and roll” and
tokenizes each of them as a single unit. It is important for tasks that require accurate semantic analysis, such as
machine translation.
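The five tokenization methods described above can be compared on one sentence. The following is a minimal sketch, assuming NLTK is installed; the sample sentence and the MWE list are made up for illustration, and none of these tokenizer classes needs an extra NLTK download.

```python
from nltk.tokenize import (
    MWETokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Made-up sample sentence containing a contraction, a hashtag and a mention.
text = "I don't live in New York, but #NLP is fun @alice!"

whitespace_tokens = WhitespaceTokenizer().tokenize(text)   # split on spaces, tabs, newlines
punct_tokens = WordPunctTokenizer().tokenize(text)         # split at every punctuation run
treebank_tokens = TreebankWordTokenizer().tokenize(text)   # Penn Treebank rules
tweet_tokens = TweetTokenizer().tokenize(text)             # keeps hashtags and mentions

# MWETokenizer works on an already tokenized list and merges the
# expressions it was given (here only "New York") into single tokens.
mwe_tokenizer = MWETokenizer([("New", "York")], separator="_")
mwe_tokens = mwe_tokenizer.tokenize(treebank_tokens)

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("punctuation", punct_tokens),
    ("treebank", treebank_tokens),
    ("tweet", tweet_tokens),
    ("mwe", mwe_tokens),
]:
    print(f"{name:>12}: {tokens}")
```

Printing the lists side by side shows the trade-offs: whitespace tokenization leaves the comma attached to "York", punctuation-based tokenization breaks "don't" into three pieces, and the tweet tokenizer keeps "#NLP" and "@alice" intact.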
In addition to tokenization, other text processing techniques such as stemming and lemmatization are often used
to further normalize the tokens and reduce the vocabulary size.
Stemming:
Stemming is the process of reducing a word to its base or root form, which is called the stem. Stemming is a
common technique used in Natural Language Processing (NLP) to normalize text data and reduce the
dimensionality of text features. The NLTK library provides two popular stemming algorithms: the Porter
stemming algorithm and the Snowball stemming algorithm (whose English version is also known as the Porter2
stemming algorithm).
1] Porter Stemming:
It is a rule-based algorithm that applies a set of rules to remove suffixes from words. The Porter stemming
algorithm is fast but can produce stems that are not dictionary words. For example, the word “running” is stemmed
to “run”, but the word “studies” is stemmed to “studi”.
2] Snowball Stemming:
It is an improvement over the Porter stemming algorithm and is also rule-based. The Snowball framework supports
several languages, and its English stemmer often produces more readable stems. For example, the original Porter
algorithm stems “generously” to “gener”, whereas the Snowball English stemmer stems it to “generous”.
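The two stemmers can be compared directly. The following is a minimal sketch, assuming NLTK is installed; the word list is arbitrary, and neither stemmer needs an extra NLTK download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the English Snowball stemmer (Porter2)

words = ["running", "studies", "generously"]  # arbitrary sample words

porter_stems = [porter.stem(word) for word in words]
snowball_stems = [snowball.stem(word) for word in words]

# Print the stems side by side to see where the two algorithms differ.
for word, p_stem, s_stem in zip(words, porter_stems, snowball_stems):
    print(f"{word:>12}  porter={p_stem:<10} snowball={s_stem}")
```

Note that a stem such as "studi" is not a dictionary word; stemming only guarantees that related word forms map to the same string.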
Lemmatization:
Lemmatization is the process of reducing a word to its dictionary form, known as the lemma. Unlike a stem, a
lemma is always a valid word, and it may differ from the stem obtained through stemming. For example, the
lemma of the words “am”, “is”, and “are” is “be”.
Lemmatization is an important step in many Natural Language Processing (NLP) tasks, such as text classification,
information retrieval, and machine translation. It helps in reducing the number of unique words in text, which can
improve the efficiency and accuracy of NLP algorithms.
NLTK provides the WordNetLemmatizer class for lemmatization. WordNetLemmatizer uses WordNet, a lexical
database for English, to find the base form of a word. LancasterStemmer, on the other hand, is a stemmer rather
than a lemmatizer: it uses a set of rules and heuristics to find the stem of a word. Here, we will use
WordNetLemmatizer for lemmatization.
WordNet Lemmatizer:
WordNet is a lexical database for the English language that groups words into sets of synonyms called synsets,
each expressing a distinct concept. WordNet is often used in NLP for tasks such as lemmatization, part-of-speech
tagging, and semantic analysis. WordNetLemmatizer is a class provided by NLTK that uses WordNet to
perform lemmatization. It maps a word to its base or dictionary form based on its part of speech. The
‘lemmatize()’ method of the WordNetLemmatizer class takes two arguments: the word to be lemmatized, and its
part of speech. If the part of speech is not specified, the default is noun (“n”).
Installations and downloads needed are as follows:
For tokenization, the NLTK library must be installed. Use the “pip install nltk” command to install NLTK.
After installing NLTK, we can perform whitespace, punctuation-based, treebank, tweet, and MWE tokenization.
To download the additional NLTK resources, we will use the following code:
import nltk
nltk.download('punkt')    # Punkt models used by word_tokenize() and sent_tokenize()
nltk.download('wordnet')  # WordNet data used by WordNetLemmatizer
Note: If the required resources are already downloaded, you can skip the download step.
Conclusion:
Hence, we successfully studied and performed tokenization (Whitespace, Punctuation-based, Treebank, Tweet,
MWE) using the NLTK library. We also performed stemming using the Porter and Snowball stemmers, and
lemmatization using the WordNetLemmatizer class.

Assignment : 2
Title of the Assignment:
Perform bag-of-words approach (count occurrence, normalized count occurrence), tf-idf on data. Create
embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
1. Load the dataset.

2. Preprocess the text data by converting it to lowercase, removing non-alphabetic characters, and tokenizing the
words using an NLP library.
3. Perform bag-of-words using a CountVectorizer, which counts the occurrence of each word in each document
and creates a sparse matrix representation of the data.
4. Perform normalized bag-of-words by dividing each count by the total number of words in the document using
numpy operations.
5. Perform tf-idf using a TfidfVectorizer, which calculates the tf-idf weight of each word in each document and
creates a sparse matrix representation of the data.
6. Train Word2Vec embeddings using a Word2Vec model, which creates a vector representation of each word
based on its context in the text data.
7. Alternatively, load pre-trained Word2Vec embeddings using a pre-trained Word2Vec model that has already
been trained on a large corpus of text data.
8. Use the different types of input features (bag-of-words, normalized bag-of-words, tf-idf, or Word2Vec
embeddings) as input features in a machine learning model.
9. Split the data into training and test sets.
10. Train a machine learning model, such as logistic regression, on the training set using the input features.
11. Evaluate the performance of the machine learning model on the test set.
Objective of the Assignment:

The objective of this assignment is to teach students how to perform data wrangling and text analysis using
Python on an open-source dataset. The assignment aims to develop students' skills in preprocessing textual data,
implementing different text representations, training and using Word2Vec embeddings, comparing text
representations' performance in a machine learning model, and evaluating a model's accuracy.
Prerequisites:
1. Basic knowledge of Python programming.
2. Familiarity with data preprocessing techniques, including data formatting, normalization, and cleaning.

Theory :
The bag-of-words (BoW) approach is a popular technique in natural language processing for representing text
data as numerical features that can be used in machine learning algorithms. It involves counting the frequency of
each word in a text corpus and representing each document as a vector of word counts. The name "bag-of-words"
comes from the fact that the order of the words is ignored and the text is treated as an unordered collection (or
"bag") of words in which only the counts matter.
In the count occurrence approach, the BoW model counts the occurrences of each word in each document,
resulting in a document-term matrix of word counts. This approach treats each word as independent of the others,
disregarding the order in which they appear in the text.
In the normalized count occurrence approach, the count of each word in a document is divided by the total
number of words in that document, resulting in a matrix of relative word frequencies. This approach accounts for
differences in document length, so that long and short documents can be compared.
Term Frequency-Inverse Document Frequency (tf-idf) is a popular weighting scheme used in information
retrieval and text mining to evaluate the relevance of a word to a document. It measures the importance of a word
by combining the frequency of the word in the document (term frequency) with how rare the word is across the
documents of the corpus (inverse document frequency). The tf-idf score of a word is high if the word occurs
frequently in a particular document, but rarely in other documents in the corpus.
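The weighting can be observed on a toy corpus. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the sentences are made up. With its default settings, TfidfVectorizer computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then scales each row to unit Euclidean length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "red" occurs in one document only, "car" occurs in both.
docs = [
    "the car is red",
    "the car is fast and the car is new",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_          # maps each term to its column index

red_weight = tfidf[0, vocab["red"]]     # rare term in document 0
car_weight = tfidf[0, vocab["car"]]     # common term in document 0

print(f"red={red_weight:.3f} car={car_weight:.3f}")
```

Both words occur once in the first document, but "red" receives the higher weight because it appears in fewer documents of the corpus.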
Word2Vec is a neural network-based technique for generating word embeddings, which are numerical
representations of words in a high-dimensional space. Word embeddings capture the semantic and syntactic
relationships between words and are widely used in natural language processing applications such as text
classification, sentiment analysis, and machine translation. The Word2Vec model learns the embeddings by
training on a large corpus of text data and optimizing a loss function to predict the context words of a target word.
Python libraries and tools used for the assignment are as follows:
1. pandas: Pandas is a powerful library for data manipulation and analysis in Python. It provides data structures for
efficiently storing and processing large datasets, and includes tools for data cleaning, filtering, merging, and more.
2. re: The re module is a built-in module in Python for working with regular expressions. Regular expressions are
a powerful tool for working with text data, allowing you to search, replace, and manipulate strings based on patterns.
3. nltk: The Natural Language Toolkit is a library for working with human language data, such as text. It provides
tools for tokenization (breaking text into words or sentences), stemming (reducing words to their base form),
lemmatization (converting words to their canonical form), and more. It also includes resources such as pre-trained
models and corpora for various natural language processing tasks.

4. sklearn: Scikit-learn is a powerful machine learning library in Python that provides a range of tools for statistical
modeling, including supervised and unsupervised learning algorithms. It includes a variety of classification,
regression, and clustering algorithms, as well as tools for model selection, preprocessing, and evaluation.
5. numpy: NumPy is a fundamental library for scientific computing in Python. It provides tools for working with
arrays, matrices, and other numerical data, as well as linear algebra operations, Fourier transforms, and more.
6. gensim: Gensim is a library for topic modeling and natural language processing in Python. It provides tools for
creating word embeddings (representations of words as vectors) using algorithms such as Word2Vec and Doc2Vec.
It also includes tools for performing semantic analysis, such as topic modeling and similarity search.
7. logistic regression: Logistic regression is a linear classification algorithm that is commonly used in machine
learning. It models the probability of a binary outcome (such as whether a car is a certain make or not) as a function
of the input variables (such as features of the car).
8. train_test_split: The train_test_split function is a tool provided by scikit-learn for splitting a dataset into training
and testing sets. This is commonly used in machine learning for evaluating the performance of models, as it allows
you to train a model on one set of data and test it on another set of data that it has not seen before.
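The way train_test_split and logistic regression fit together can be sketched as follows. This is a minimal example, assuming scikit-learn is installed; the texts, the labels, and the choice of tf-idf features are made up for illustration and are not taken from the car dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up labelled texts: four per class.
texts = [
    "compact sedan with good fuel economy",
    "family sedan with smooth ride",
    "small sedan for city driving",
    "affordable sedan with low mileage",
    "pickup truck with towing package",
    "heavy duty truck with diesel engine",
    "off road truck with large tyres",
    "work truck with long bed",
]
labels = ["sedan"] * 4 + ["truck"] * 4

# Hold out 25% of the data; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vectorizer on the training texts only, then reuse it on the test texts.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

predictions = model.predict(X_test_vec)
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on the held-out set: {accuracy:.2f}")
```

Fitting the vectorizer on the training split alone avoids leaking vocabulary statistics from the test set into the model.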
Dataset information :
The dataset from CooperUnion's Car Dataset is a collection of information about various cars, including their
make, model, year, engine size, fuel type, and more. The dataset includes information on over 10,000 cars and is
suitable for use in machine learning applications such as natural language processing and sentiment analysis.
To perform a bag-of-words approach on this dataset, we can preprocess the text by tokenizing the car descriptions
and removing stop words and non-alphabetic characters. We can then use the CountVectorizer class from scikit-
learn to count the occurrence of each word in the preprocessed text.
To normalize the occurrence count, we can divide each count by the total number of words in the corresponding
preprocessed document. To perform tf-idf, we can use the TfidfVectorizer class from scikit-learn.
Finally, to create embeddings using Word2Vec, we can use the Word2Vec class from the gensim library. We can
train a Word2Vec model on the preprocessed text and use the resulting embeddings for downstream tasks such as
classification or clustering.
The CooperUnion Car Dataset provides a rich source of information for exploring the application of natural
language processing techniques to automotive data. A practical assignment could involve using the dataset to train
a model that predicts the fuel efficiency of a car based on its make, model, and other features. Students could be
tasked with preprocessing the text, performing bag-of-words and tf-idf, training a Word2Vec model, and using the
resulting embeddings in a machine learning model. The assignment could also involve exploring other natural
language processing techniques such as sentiment analysis or topic modeling.
For this assignment, you will need to install and download the following libraries and resources:
1. Pandas: to manipulate and analyze data in tabular form.
2. NumPy: to perform numerical computations in Python.
3. Scikit-learn: to implement the machine learning algorithms.
4. NLTK: to perform natural language processing tasks, such as tokenization and stemming.
5. Gensim: to create word embeddings using Word2Vec.
To download the required resources for NLTK, you will need to run the following commands in your Python
environment:
import nltk
nltk.download('punkt')
To download the dataset, you will need to create an account on Kaggle and download the dataset from the
following link: https://www.kaggle.com/datasets/CooperUnion/cardataset.
Conclusion :
We studied the use of bag-of-words approach (count occurrence, normalized count occurrence), tf-idf, and
Word2Vec embeddings on a car dataset. Through our experimentation, we were able to analyze and visualize the
dataset, perform the various techniques, and gain insights into the relationships between the words and the car
attributes.

Assignment: 3
Aim -
Perform text cleaning, perform lemmatization (any method), remove stop words (any method), label encoding.
Create representations using TF-IDF, Save Outputs.
Prerequisites -
Familiarity with the Python programming language
Basic knowledge of the pandas, Scikit-learn, and pickle libraries
Familiarity with the Natural Language Toolkit (NLTK)
Steps -
1. We start by importing the required libraries.
2. We load the news dataset using the read_pickle() method of Pandas.
3. We perform text cleaning by removing digits, punctuation, and converting the text to lowercase using
the str.replace() and str.lower() methods of Pandas.
4. We perform lemmatization using the WordNetLemmatizer from NLTK.
5. We remove stop words using the stopwords corpus from NLTK.
6. We perform label encoding using the LabelEncoder from Scikit-learn.
7. We create a TF-IDF representation of the text using the TfidfVectorizer from Scikit-learn.
8. We save the cleaned dataset, TF-IDF matrix, and label encoder using the pickle library.
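Steps 6 and 7 can be sketched on a tiny data frame. This is a minimal example, assuming pandas and scikit-learn are installed; the column names and rows are hypothetical stand-ins for the news dataset, and scikit-learn's built-in English stop-word list is swapped in for the NLTK stopwords corpus so that the sketch needs no downloads.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the cleaned news dataset.
df = pd.DataFrame(
    {
        "text": [
            "stocks rally as markets open",
            "team wins the final match",
            "markets fall after rate hike",
        ],
        "category": ["business", "sport", "business"],
    }
)

# Label encoding: each category name becomes an integer (classes sorted alphabetically).
encoder = LabelEncoder()
df["label"] = encoder.fit_transform(df["category"])

# TF-IDF representation; stop_words="english" uses scikit-learn's own list.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(df["text"])   # sparse matrix, one row per document

print(df[["category", "label"]])
print(tfidf_matrix.shape)
```

The fitted encoder should be saved together with the matrix, because encoder.inverse_transform() is what turns predicted integers back into category names.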
Theory -
Pandas
The pandas library is an open-source Python library that provides powerful data manipulation and analysis tools
for working with structured data. It is built on top of NumPy and provides data structures for efficiently storing
and manipulating large datasets.
Some of the key features of the pandas library include:
● DataFrame: This is a two-dimensional table-like data structure with labeled columns and rows. It is similar
to a spreadsheet or SQL table and is one of the most commonly used data structures in pandas.
● Series: This is a one-dimensional labeled array that can hold any data type, including numbers, strings, and
objects.

● Data manipulation tools: pandas provides a wide range of data manipulation tools for filtering, sorting,
grouping, and aggregating data.
● Missing data handling: pandas provides powerful tools for handling missing data, including methods for
filling in missing values or dropping rows with missing data.
● Input and output tools: pandas provides functions for reading and writing data in various formats,
including CSV, Excel, SQL databases, and JSON.
● Time series functionality: pandas provides functionality for working with time series data, including tools
for resampling, shifting, and rolling window calculations.
Pandas offers many functions for working with strings in data frames through the .str accessor of a column. Two
such functions are str.replace() and str.lower().
str.replace() is a method used to replace a substring in a string with a new substring. It can be used on a pandas
data frame column to replace a specific string value with another. For example, if we have a column called "City",
we can use it to replace all occurrences of "New York" with "NYC".
str.lower() is a method used to convert a string to lowercase. It can also be used on a pandas data frame column to
convert all the strings to lowercase. For example, if we have a column called "City", we can use it to convert all
the city names to lowercase.
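Both methods can be tried on a small data frame. This is a minimal sketch, assuming pandas is installed; the "City" values are made up, and the last column shows how the two methods combine into the text-cleaning step of this assignment.

```python
import pandas as pd

df = pd.DataFrame({"City": ["New York", "Pune", "New York City 2023!"]})

# Literal substring replacement (regex=False treats the pattern as plain text).
df["City_short"] = df["City"].str.replace("New York", "NYC", regex=False)

# Lowercase every string in the column.
df["City_lower"] = df["City"].str.lower()

# Text cleaning: lowercase, drop everything except letters and spaces, trim.
df["City_clean"] = (
    df["City"]
    .str.lower()
    .str.replace(r"[^a-z\s]", "", regex=True)
    .str.strip()
)

print(df)
```

Passing regex explicitly avoids surprises, because the default value of that parameter has changed between pandas versions.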
Lemmatization
Lemmatization is the process of transforming words to their base or root form, called the lemma. It involves
removing any inflectional endings such as -s, -es, -ed, -ing, and so on, to obtain the basic meaning of the word.
The goal of lemmatization is to reduce a word to its most basic form, making it easier to analyze and compare
words within a text.
Lemmatization is a crucial technique in natural language processing (NLP) and text analysis, where it helps to
normalize words and improve accuracy in tasks such as sentiment analysis, topic modeling, and information
retrieval. It is particularly useful for words with irregular inflections; English, for example, has many irregular
verbs and nouns.
There are various tools and libraries available for lemmatization, including NLTK, spaCy, and Stanford CoreNLP.
These tools use different algorithms and techniques to identify and transform words to their base form, depending
on the language and context of the text. For instance, the NLTK library uses WordNet, a lexical database of
English, to map words to their base forms. spaCy, on the other hand, uses rule-based and statistical methods to
perform lemmatization.
Lemmatization can also be combined with other text preprocessing techniques such as tokenization, stemming,
and stop word removal to further improve the accuracy and efficiency of text analysis. For instance, tokenization
involves breaking a text into smaller units such as words or phrases, while stemming involves reducing words to
their stem or root form, but without considering the context. Stop word removal involves removing common
words such as "the," "a," and "an" that do not carry much meaning in a text.

Text Cleaning
Text cleaning is the process of transforming raw unstructured text data into clean and structured data that can be
used for natural language processing (NLP) applications. The purpose of text cleaning is to remove any noise,
irrelevant information, or inconsistencies in the text data, while retaining the relevant information that is needed
for analysis.
Text cleaning involves several steps, including removing special characters, converting text to lowercase,
removing stop words, lemmatizing or stemming words, and removing any other unnecessary information such as
URLs, numbers, or punctuation.
Cleaning text data is an important step in NLP because it helps to improve the accuracy and efficiency of text
analysis tasks such as sentiment analysis, topic modeling, and text classification. It also helps to reduce the
computational resources required for processing large volumes of text data.
Overall, text cleaning is a crucial step in preparing text data for analysis in NLP applications. It helps to ensure
that the data is clean, structured, and relevant, which in turn improves the accuracy and efficiency of text analysis
tasks.
NLTK (Natural Language Toolkit)
The Natural Language Toolkit, commonly known as NLTK, is a popular open-source platform used for natural
language processing (NLP) tasks. NLTK provides a comprehensive set of tools for processing text data, including
tokenization, part-of-speech tagging, parsing, stemming, and sentiment analysis.
NLTK is written in Python and includes a wide range of data sets, corpora, and models for various NLP tasks. Its
user-friendly interface and extensive documentation make it an accessible tool for researchers, developers, and
students alike.
With NLTK, you can build pipelines for tasks such as text classification and information extraction, and it
also includes modules for translation models and evaluation metrics. NLTK works on plain text, so content in
formats such as HTML or PDF is normally converted to text with other libraries first.
NLTK has become a popular choice for developing NLP applications due to its versatility, ease of use, and
extensive community support. It has been used in numerous research studies, commercial products, and
educational resources. Overall, NLTK is a valuable tool for anyone working with text data and seeking to apply
NLP techniques.
Scikit-Learn

Scikit-learn is a popular open-source Python library for machine learning tasks such as classification, regression,
and clustering. It provides simple and efficient tools for data mining and data analysis. The library includes a wide
range of algorithms, including support vector machines, decision trees, k-nearest neighbors, and random forests.
Scikit-learn also offers tools for data preprocessing, feature selection, and model selection. The library is widely
used in both academia and industry due to its ease of use, versatility, and robustness. Scikit-learn is built on top of
other popular Python libraries such as NumPy, SciPy, and Matplotlib, making it easy to integrate with other
scientific computing tools.
Pickle Library
The pickle library in Python is used for serialization and deserialization of Python objects. Serialization is the
process of converting an object into a format that can be stored or transmitted, while deserialization is the process
of recreating the original object from the serialized form.
The pickle library allows Python objects to be serialized and deserialized in a compact binary format, making it
easy to store and transmit data between Python programs or across different platforms. The serialized form of the
object can be saved to a file or sent over a network, and later retrieved and deserialized back into its original form.
The pickle library supports most Python objects, including built-in types such as lists, dictionaries, and strings, as
well as user-defined classes and objects. However, it cannot serialize certain types of objects, such as file handles
or network sockets.
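A minimal sketch of a serialization round trip is shown below. The dictionary contents are made-up sample data. pickle.dumps and pickle.loads work on bytes in memory, while pickle.dump and pickle.load do the same with an open binary file.

```python
import pickle

# Hypothetical object to persist, e.g. settings produced by a preprocessing step.
record = {"labels": ["spam", "ham"], "vocab_size": 3, "weights": [0.1, 0.7, 0.2]}

data = pickle.dumps(record)    # serialize the object to a bytes string
restored = pickle.loads(data)  # deserialize the bytes back into a new object

print(type(data).__name__, restored == record, restored is record)
```

Only unpickle data from a trusted source, because loading a pickle can execute arbitrary code.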
TF-IDF and Label Encoding
A TF-IDF (Term Frequency-Inverse Document Frequency) matrix is a numerical representation of text data that
is commonly used in natural language processing. It measures the importance of each word in a document or
corpus by taking into account its frequency and how often it appears in other documents. This is helpful in tasks
like text classification, where certain words may be more indicative of a certain category.
A label encoder is a tool used to convert categorical variables into numerical form for use in machine learning
models. It assigns a unique numerical value to each category in the variable, allowing the model to process the
data more easily. This is useful in classification tasks where the target variable is a categorical variable.
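To make the two ideas concrete, the sketch below computes a basic TF-IDF weight (term frequency times log(N / document frequency)) and a label encoding in plain Python. The documents, labels, and function names are hypothetical. scikit-learn's TfidfVectorizer applies idf smoothing and normalization by default, so its numbers will differ from this textbook formula.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents and their class labels.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
labels = ["pos", "neg", "pos"]

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        row = {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        matrix.append(row)
    return matrix

def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order."""
    mapping = {c: i for i, c in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping

matrix = tf_idf(docs)
encoded, mapping = encode_labels(labels)
print(matrix[1])   # "bad" is rarer than "movie", so it gets the higher weight
print(encoded, mapping)
```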
Conclusion:
In this assignment, we performed text cleaning, lemmatization, stop word removal and label encoding, created
TF-IDF representations, and saved the output.

Assignment: 4
Aim:
Create a transformer from scratch using the PyTorch library.
Prerequisites:
• Basic knowledge of Python programming
• Familiarity with the PyTorch library
• Understanding of deep learning concepts, including feedforward neural networks and backpropagation
• Knowledge of the Transformer architecture and its components
Content of Theory:
1. Steps involved
2. Introduction to Transformers and their applications
3. Understanding the Transformer architecture, including its components: Multi-Head Attention,
Feedforward Neural Network, and Layer Normalization
4. Preprocessing the data for the Transformer model
5. Building the Transformer model from scratch using PyTorch
6. Training the Transformer model using a custom dataset
7. Evaluating the performance of the Transformer model
8. Improving the Transformer model performance using techniques such as fine-tuning, transfer learning, and
hyperparameter tuning
9. Libraries imported
Steps:
Creating a transformer from scratch using PyTorch involves several steps:
1. Importing necessary packages: First, you need to import the necessary packages, including PyTorch,
torch.nn, and torch.nn.functional.
2. Defining the encoder and decoder: The transformer consists of an encoder and a decoder. You need to
define the encoder and decoder classes separately.
3. Implementing the self-attention mechanism: The self-attention mechanism is the key component of the
transformer. You need to implement multi-head attention, along with the feed-forward neural network
that follows it in each layer.
4. Defining the position encoding: Position encoding is used to inject positional information into the input
embeddings. You can define the position encoding function as a separate class.
5. Implementing the transformer encoder: The transformer encoder consists of multiple layers of
self-attention and feed-forward neural networks. You can define the transformer encoder class and implement
the forward function.
6. Implementing the transformer decoder: The transformer decoder is similar to the encoder, but its
self-attention is masked so that a position cannot attend to later positions, and it also includes an
attention mechanism that takes the encoder output as input. You can define the transformer decoder
class and implement the forward function.
7. Defining the transformer model: Finally, you can define the transformer model class that combines the
encoder and decoder.
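The self-attention computation in step 3 can be sketched as scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V. The version below uses plain Python lists rather than PyTorch tensors so that the arithmetic is visible. It is an illustration under simplifying assumptions: there are no learned projection matrices and only a single head, and the function names are hypothetical. In the assignment the same steps would be written with tensor operations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)  # self-attention: queries, keys and values all come from x
print(out)
```

Multi-head attention runs several such computations in parallel on different learned projections of the input and concatenates the results.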

Theory:
Transformers are a type of deep neural network architecture that are used for natural language processing tasks
such as machine translation, language modeling, and text generation. The Transformer architecture was
introduced in 2017 by Vaswani et al. and has since become one of the most popular deep learning architectures in
NLP.
Understanding the Transformer architecture: The Transformer architecture is based on the use of attention
mechanisms. It consists of two main components: the encoder and the decoder. The encoder takes in the input
sequence and generates a representation of the sequence, which is then passed to the decoder to generate the
output sequence.
The key components of the Transformer architecture are:
1. Multi-Head Attention: This component allows the model to attend to different parts of the input sequence
simultaneously, making it well-suited for sequence-to-sequence tasks.
2. Feedforward Neural Network: This component applies a non-linear transformation to the output of the
multi-head attention layer.
3. Layer Normalization: This component helps to stabilize the training process by normalizing the inputs to
each layer.
4. Preprocessing the data for the Transformer model: The input data for the Transformer model must be
preprocessed before training. This involves tokenizing the input sequence, converting the tokens to integer IDs,
and creating input and output sequences.
5. Building the Transformer model from scratch using PyTorch: To build the Transformer model from
scratch using PyTorch, we first define the necessary components of the model, including the multi-head
attention layer, the feedforward neural network, and the layer normalization layer. We then define the encoder
and decoder modules, and use these modules to define the full Transformer model.
6. Training the Transformer model using a custom dataset: Once the Transformer model is defined, we can
train it using a custom dataset. We define the loss function and optimizer, and then train the model on the
training data.
7. Evaluating the performance of the Transformer model: For classification tasks, we can evaluate the
Transformer model using metrics such as accuracy, precision, recall, and F1 score, and visualize its
performance using tools such as confusion matrices and ROC curves. For generation tasks such as translation
or language modeling, metrics such as BLEU or perplexity are more usual.
8. Improving the Transformer model performance: To improve the performance of the Transformer model,
we can use techniques such as fine-tuning, transfer learning, and hyperparameter tuning. Fine-tuning involves
training the model on a specific task to improve its performance on that task. Transfer learning involves using a
pre-trained model on a related task and fine-tuning it on the target task. Hyperparameter tuning involves
optimizing the hyperparameters of the model to improve its performance.
One key innovation in transformers is the use of positional encodings to incorporate information about the
position of each word in the input sequence. This is necessary because self-attention alone does not take into
account the order of the words in the input sequence.
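One common choice, used in the original paper by Vaswani et al., is a fixed sinusoidal encoding in which even dimensions use a sine and odd dimensions use a cosine of the position at different frequencies. The sketch below computes it in plain Python as an illustration. The function name is hypothetical, and a PyTorch version would build the same table as a tensor and add it to the embeddings.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=3, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```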
Training a transformer involves minimizing a loss function that measures the difference between the model's
predictions and the true outputs. The most common loss function used in NLP tasks is cross-entropy loss.
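For a single prediction, cross-entropy is the negative log of the probability that the model assigns to the correct class. The sketch below computes it from raw scores (logits) in plain Python. The function name and the sample numbers are illustrative. In PyTorch, torch.nn.CrossEntropyLoss performs the same computation on a batch of logits and class indices and averages the result by default.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    # log(sum(exp(logits))) computed in a numerically stable way.
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

confident = cross_entropy([4.0, 0.5, 0.1], target_index=0)  # correct class scored highest
wrong = cross_entropy([0.1, 0.5, 4.0], target_index=0)      # correct class scored lowest
print(round(confident, 4), round(wrong, 4))
```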
Transformers have achieved state-of-the-art performance on a wide range of NLP tasks, including machine
translation, language modeling, sentiment analysis, and question answering. They have also been used to generate
natural language text, such as in the popular GPT (Generative Pre-trained Transformer) series of models.
However, transformers are computationally expensive and require large amounts of data to train, making them
difficult to train from scratch for many researchers and developers.
Creating a transformer from scratch using PyTorch involves using several PyTorch modules and related libraries, including:
1. torch.nn: This library provides a wide range of neural network layers and functions, such as linear layers,
convolutional layers, activation functions, and loss functions. It is the main library used for building the
transformer model.
2. torch.nn.functional: This library provides many of the same functions as torch.nn, but as functional
interfaces rather than object-oriented interfaces. It is often used in conjunction with torch.nn to define custom
neural network modules.
3. torch.optim: This library provides a variety of optimization algorithms, such as stochastic gradient descent
(SGD), Adam, and Adagrad. These algorithms are used to update the parameters of the transformer model
during training.
4. torch.utils.data: This library provides utilities for loading and preprocessing data, such as Dataset and
DataLoader. These are used to load the training and validation data and to create batches for training the
model.
5. torchtext: This is a library that provides tools for working with text data, such as tokenization, vocabulary
creation, and dataset loading. It is often used in conjunction with PyTorch for NLP tasks.
6. tqdm: This is a library that provides a progress bar for long-running operations, such as training a neural
network. It is often used to monitor the progress of the training process.
Other libraries may also be used depending on the specific requirements of the project. For example, if the
transformer model needs to be deployed on a web server, additional libraries such as Flask or Django may be
required.
Conclusion:
In this assignment, we have created a Transformer model from scratch using PyTorch.

Assignment: 5
Title of the assignment:
Morphology is the study of the way words are built up from smaller meaning-bearing units.
Study and understand the concepts of morphology by the use of an add-delete table.
Prerequisite:
1. Basics of Python programming
2. Concepts of data preprocessing, data formatting, data normalization and data cleaning.
Theory:
Introduction to morphology:
Morphology is the study of words and their parts. Morphemes, like prefixes, suffixes and base words, are
defined as the smallest meaningful units of language. Morphemes are important for phonics in both reading
and spelling, as well as in vocabulary and comprehension.
Teaching morphemes unlocks the structures and meanings within words. It is very useful to have a strong
awareness of prefixes, suffixes and base words. These are often spelt the same across different words, even
when the sound changes, and often have a consistent purpose and/or meaning.
Types of morphemes:
Free vs. bound

Morphemes can be either single words (free morphemes) or parts of words (bound morphemes).
A free morpheme can stand alone as its own word:
• gentle
• father
• licence
• picture
• gem
A bound morpheme only occurs as part of a word:
• -s as in cat+s
• -ed as in crumb+ed
• un- as in un+happy
• mis- as in mis+fortune
• -er as in teach+er
For example, in un+system+atic+al+ly there is a root word (system) and bound morphemes that
attach to the root (un-, -atic, -al, -ly):
system = root
un-, -atic, -al, -ly = bound morphemes
If two free morphemes are joined together they create a compound word.
Inflectional vs. derivational:

Morphemes can also be divided into inflectional or derivational morphemes.
Inflectional morphemes change what a word does in terms of grammar, but do not create a new word.
For example, the word <skip> has many forms: skip (base form), skipping (present progressive), skipped
(past tense).
The inflectional morphemes -ing and -ed are added to the base word skip, to indicate the tense of the word.
If a word has an inflectional morpheme, it is still the same word, with a few suffixes added. So if you
looked up <skip> in the dictionary, then only the base word <skip> would get its own entry into the
dictionary. Skipping and skipped are listed under skip, as they are inflections of the base word. Skipping
and skipped do not get their own dictionary entry.
Derivational morphemes, in contrast, change the meaning or the part of speech of a word, and so create a
new word. For example, we can create new words from <act> by adding derivational prefixes (e.g. re- en-)
and suffixes (e.g. -or).
Thus out of <act> we can get re+act = react, en+act = enact, act+or = actor.
Whenever a derivational morpheme is added, a new word (and dictionary entry) is derived/created.
Add-delete tables are a useful tool for analyzing the structure of words. An add-delete table is a table that
shows the changes in meaning and form that occur when prefixes and suffixes are added or deleted from a
word. By using an add-delete table, you can break down a word into its constituent morphemes and analyze
its meaning and form.
Procedure:

1. Choose five words and break them down into their constituent morphemes. For example, the word
"unhappily" can be broken down into "un-" (a prefix meaning "not"), "happy" (a free morpheme meaning
"feeling or showing pleasure"), and "-ly" (a suffix indicating manner or quality).
2. Create an add-delete table for each word, listing the changes in meaning and form that occur when
prefixes and suffixes are added or deleted. For example:
Word: unhappily
3. Analyze the add-delete tables and discuss the meaning and form of each word.
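As one way to carry out the procedure, the sketch below builds a small add-delete table for the root "happy". Each row says which characters to delete from the end of the root, which to add, and which prefix (if any) to attach. The rules are written by hand for illustration and the function name apply_rule is hypothetical.

```python
def apply_rule(root, delete, add, prefix=""):
    """Delete a suffix from the root, add another, and optionally attach a prefix."""
    if delete and not root.endswith(delete):
        raise ValueError(f"{root!r} does not end with {delete!r}")
    stem = root[: len(root) - len(delete)]
    return prefix + stem + add

# (delete, add, prefix) rows for the root "happy"; hand-written for illustration.
rules = [
    ("", "", ""),        # happy
    ("", "", "un"),      # unhappy
    ("y", "ily", ""),    # happily
    ("y", "ily", "un"),  # unhappily
    ("y", "iness", ""),  # happiness
]

table = [(d, a, p, apply_rule("happy", d, a, p)) for d, a, p in rules]

print(f"{'delete':<8}{'add':<8}{'prefix':<8}form")
for delete, add, prefix, form in table:
    print(f"{delete or '-':<8}{add or '-':<8}{prefix or '-':<8}{form}")
```

Reading a row backwards (remove the added characters, restore the deleted ones) recovers the root from a surface form, which is how such a table supports analysis as well as generation.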
Conclusion:
Through the use of add-delete tables, we can better understand the structure of words and the meaning of
their constituent morphemes. By breaking down words into their component parts, we can analyze their
meaning and form and gain a deeper understanding of language.

Assignment: 6

Import the legacy data from different sources (such as Excel, SQL Server, Oracle, etc.) and load
it into the target system. (You can download a sample database such as AdventureWorks, Northwind,
FoodMart, etc.)
To introduce the concepts and components of Business Intelligence (BI)
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
Contents for Theory:
1. Legacy Data
2. Sources of Legacy Data
3. How to import legacy data step by step

1. What is Legacy Data?

Legacy data, according to Business Dictionary, is "information maintained in an old or out-of-
date format or computer system that is consequently challenging to access or handle."
2. Sources of Legacy Data

Where does legacy data come from? Virtually everywhere. Figure 1 indicates that there are many
sources from which you may obtain legacy data. These include existing databases, which are often
relational, although non-relational databases such as hierarchical, network, object, XML,
object/relational, and NoSQL databases are also used. Files, such as XML documents or "flat files"
such as configuration files and
comma-delimited text files, are also common sources of legacy data. Software, including legacy
applications that have been wrapped (perhaps via CORBA) and legacy services such as web
services or CICS transactions, can also provide access to existing information. The point to be
made is that there is often far more to gaining access to legacy data than simply writing an SQL
query against an existing relational database.
3. How to import legacy data step by step

Step 1: Open Power BI.
Step 2: Click on Get Data; a list of sources is displayed. Select Excel.

Step 3: Select the required file and click on Open. The Navigator screen appears.

Step 4: Select the file and click on Edit.

Step 5: The Power Query Editor appears.
Step 6: Again, go to Get Data and select OData Feed.

Step 7: Paste the URL http://services.odata.org/V3/Northwind/Northwind.svc/ and click on OK.

Step 8: Select the Orders table and click on Edit.
Note: If you just want to see the preview, you can click on the table name without clicking on the
checkbox. Click on Edit to view the table.
Conclusion: In this way we import the legacy datasets using the Power BI tool.

Assignment: 7

Perform the Extraction, Transformation and Loading (ETL) process to construct the database
in SQL Server.
The objective of the Assignment:
ETL process in SQL Server:
Prerequisite:
• Basics of SQL.
• Concept of data extraction.

1. ETL Process.
2. Data Transformation and Loading.

Step 1: Data Extraction
Data extraction is the first step of ETL. There are two types of data extraction:
1. Full Extraction: All the data from the source systems or operational systems is
extracted to the staging area (initial load).
2. Partial Extraction: Sometimes the source system notifies us that data has changed since a
specific date, and only that data is extracted. This is called a delta load.
Source System Performance: The extraction strategy should not affect source
system performance.
Step 2: Data Transformation
Data transformation is the second step. After extraction, the data must be transformed to
match the target system. The main points are:
• Data extracted from the source system is in raw format. We need to transform it before
loading it into the target server.
• Data has to be cleaned, mapped and transformed.
• The important steps of data transformation are:
1. Selection: Select the data to load into the target.
2. Matching: Match the data with the target system.
3. Data Transforming: Change the data to fit the target table structures.

Real-life examples of data transformation:
• Standardizing data: Data is fetched from multiple sources, so it needs to be standardized as per
the target system.
• Character set conversion: The character sets need to be transformed as per the target system
(first name and last name example).
• Calculated and derived values: The source system holds a first value and a second value, and the
target needs a value calculated from the two.
• Data conversion between formats: If the source system stores dates in DDMMYY format and the
target stores them in DDMONYYYY format, the conversion is done in the
transformation phase.
Step 3: Data Loading
• The data loading phase loads the prepared data from the staging tables to the main tables.
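The three steps can be sketched end to end in a few lines. The example below is an illustration only: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server, the source rows and the table names staging_orders and orders are hypothetical, and the derived value is assumed to be a simple sum.

```python
import sqlite3
from datetime import datetime

# Extract: rows as they might arrive from a source system (made-up sample data).
source_rows = [
    ("  alice ", "150323", 10, 5),
    ("BOB", "010124", 7, 3),
]

def transform(row):
    """Standardize the name, convert DDMMYY to DDMONYYYY, and derive a total."""
    name, ddmmyy, first_val, second_val = row
    date = datetime.strptime(ddmmyy, "%d%m%y")
    return (
        name.strip().title(),             # standardizing data
        date.strftime("%d%b%Y").upper(),  # date format conversion (English month names assumed)
        first_val + second_val,           # calculated value (a sum is assumed here)
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (name TEXT, order_date TEXT, total INTEGER)")
conn.execute("CREATE TABLE orders (name TEXT, order_date TEXT, total INTEGER)")

# Transformed rows land in the staging table first.
conn.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)",
                 [transform(r) for r in source_rows])
# Load: copy the prepared rows from the staging table to the main table.
conn.execute("INSERT INTO orders SELECT * FROM staging_orders")
conn.commit()

loaded = conn.execute("SELECT * FROM orders ORDER BY name").fetchall()
print(loaded)
```

In SQL Server the same flow is usually built as an Integration Services package rather than hand-written code.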
ETL process in SQL Server:
The following are the steps to open BIDS\SSDT.
Step 1 − Open either BIDS or SSDT, depending on the version, from the Microsoft SQL
Server programs group. The following screen appears.
Step 2 − The above screen shows that SSDT has opened. Go to File at the top left corner in
the above image and click New. Select Project and the following screen opens.

Step 3 − Select Integration Services under Business Intelligence on the top left corner in
the above screen to get the following screen.
Step 4 − In the above screen, select either Integration Services Project or Integration
Services Import Project Wizard, based on your requirement, to develop or create the package.
Modes
There are two modes − Native Mode (SQL Server Mode) and SharePoint Mode.
Models
There are two models − Tabular Model (for team and personal analysis) and
Multidimensional Model (for corporate analysis).

BIDS (Business Intelligence Development Studio, up to 2008 R2) and SSDT (SQL Server Data Tools,
from 2012) are environments to work with SSAS.
Step 1 − Open either BIDS or SSDT, depending on the version, from the Microsoft SQL Server
programs group. The following screen will appear.
Step 2 − The above screen shows that SSDT has opened. Go to File on the top left corner in the
above image and click New. Select Project and the following screen opens.
Step 3 − Select Analysis Services in the above screen under Business Intelligence as seen on
the top left corner. The following screen pops up.

Step 4 − In the above screen, select any one of the five listed options based on your
requirement to work with Analysis Services.
ETL Process in Power BI
1. Remove other columns to only display columns of interest
In this step you remove all columns except ProductID, ProductName, UnitsInStock, and
QuantityPerUnit.
Power BI Desktop includes Query Editor, which is where you shape and transform your data
connections. Query Editor opens automatically when you select Edit from Navigator. You can
also open the Query Editor by selecting Edit Queries from the Home ribbon in Power BI
Desktop. The following steps are performed in Query Editor.
1. In Query Editor, select the ProductID, ProductName, QuantityPerUnit, and
UnitsInStock columns (use Ctrl+Click to select more than one column, or
Shift+Click to select columns that are beside each other).
2. Select Remove Columns > Remove Other Columns from the ribbon, or right-click
on a column header and click Remove Other Columns.

2. Change the data type of the UnitsInStock column
When Query Editor connects to data, it reviews each field to determine the best data type.
For the Excel workbook, products in stock will always be a whole number, so in this step you
confirm the UnitsInStock column’s data type is Whole Number.
1. Select the UnitsInStock column.
2. Select the Data Type drop-down button in the Home ribbon.
3. If not already a Whole Number, select Whole Number for data type from the drop down
(the Data Type: button also displays the data type for the current selection).
3. Expand the Order_Details table

The Orders table contains a reference to a Details table, which contains the individual products
that were included in each Order. When you connect to data sources with multiple tables (such
as a relational database) you can use these references to build up your query.
In this step, you expand the Order_Details table that is related to the Orders table, to combine
the ProductID, UnitPrice, and Quantity columns from Order_Details into the Orders table.
This is a representation of the data in these tables:
The Expand operation combines columns from a related table into a subject table. When the
query runs, rows from the related table (Order_Details) are combined into rows from the
subject table (Orders).
After you expand the Order_Details table, three new columns and additional rows are added
to the Orders table, one for each row in the nested or related table.
1. In the Query View, scroll to the Order_Details column.
2. In the Order_Details column, select the expand icon.
3. In the Expand drop-down:
a. Select (Select All Columns) to clear all columns.
b. Select ProductID, UnitPrice, and Quantity.
c. Click OK.
4. Calculate the line total for each Order_Details row

Power BI Desktop lets you create calculations based on the columns you are importing, so
you can enrich the data that you connect to. In this step, you create a Custom Column to
calculate the line total for each Order_Details row.
Calculate the line total for each Order_Details row:
1. In the Add Column ribbon tab, click Add Custom Column.

2. In the Add Custom Column dialog box, in the Custom Column Formula textbox, enter
[Order_Details.UnitPrice] * [Order_Details.Quantity].
3. In the New column name textbox, enter LineTotal.
4. Click OK.
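What the Expand operation and the LineTotal custom column do to the data can be illustrated outside Power BI. The Python sketch below is only an analogy, with made-up sample rows and a hypothetical function name: each nested Order_Details row becomes its own output row, carries the parent order's columns, and gets LineTotal = UnitPrice * Quantity.

```python
# Made-up sample data shaped like Orders with a nested Order_Details table.
orders = [
    {"OrderID": 1, "ShipCountry": "France",
     "Order_Details": [{"ProductID": 11, "UnitPrice": 14.0, "Quantity": 12},
                       {"ProductID": 42, "UnitPrice": 2.5, "Quantity": 10}]},
    {"OrderID": 2, "ShipCountry": "Germany",
     "Order_Details": [{"ProductID": 14, "UnitPrice": 18.0, "Quantity": 9}]},
]

def expand_with_line_total(orders):
    """Flatten nested detail rows into order rows and add a LineTotal column."""
    rows = []
    for order in orders:
        for detail in order["Order_Details"]:
            # Copy the order-level columns, then add the detail columns.
            row = {k: v for k, v in order.items() if k != "Order_Details"}
            row.update(detail)
            row["LineTotal"] = detail["UnitPrice"] * detail["Quantity"]
            rows.append(row)
    return rows

rows = expand_with_line_total(orders)
for row in rows:
    print(row)
```

Two orders with three detail rows in total produce three output rows, which matches the note above that expanding adds rows as well as columns.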
5. Rename and reorder columns in the query

In this step you finish making the model easy to work with when creating reports, by
renaming the final columns and changing their order.
1. In Query Editor, drag the LineTotal column to the left, after ShipCountry.

2. Remove the Order_Details. prefix from the Order_Details.ProductID,
Order_Details.UnitPrice and Order_Details.Quantity columns, by double-clicking on each
column header, and then deleting that text from the column name.
6. Combine the Products and Total Sales queries

Power BI Desktop does not require you to combine queries to report on them. Instead, you
can create Relationships between datasets. These relationships can be created on any column
that is common to your datasets.
We have Orders and Products data that share a common 'ProductID' field, so we need to
ensure there's a relationship between them in the model we're using with Power BI Desktop.
Simply specify in Power BI Desktop that the columns from each table are related (i.e.
columns that have the same values). Power BI Desktop works out the direction and
cardinality of the relationship for you. In some cases, it will even detect the relationships
automatically.
In this task, you confirm that a relationship is established in Power BI Desktop between the
Products and Total Sales queries.
Step 1: Confirm the relationship between Products and Total Sales
1. First, we need to load the model that we created in Query Editor into Power BI
Desktop. From the Home ribbon of Query Editor, select Close & Load.

2. Power BI Desktop loads the data from the two queries.
3. Once the data is loaded, select the Manage Relationships button on the Home ribbon.
4. Select the New… button.
5. When we attempt to create the relationship, we see that one already exists. As shown in the
Create Relationship dialog (by the shaded columns), the ProductID fields in each query
already have an established relationship.

6. Select Cancel, and then select Relationship view in Power BI Desktop.
7. We see the following, which visualizes the relationship between the queries.

8. When you double-click the arrow on the line that connects the two queries, an Edit
Relationship dialog appears.
9. There is no need to make any changes, so we just select Cancel to close the Edit
Relationship dialog.

Assignment: 8

Create the cube with suitable dimension and fact tables based on ROLAP, MOLAP and
HOLAP model.
To introduce the concepts of ROLAP, MOLAP and HOLAP model.
Prerequisite:
1. Understanding of Queries in SSMS.
2. Knowledge of Microsoft BIDS Environment.
3. Developing an OLAP Cube
OLAP:
OLAP (Online Analytical Processing) was introduced into the business intelligence (BI) space over 20 years
ago, at a time when computer hardware and software technology weren’t nearly as powerful as they are
today. OLAP introduced a groundbreaking way for business users (typically analysts) to easily perform
multidimensional analysis of large volumes of business data.
Aggregating, grouping, and joining data are the most difficult types of queries for a relational database to
process. The magic behind OLAP derives from its ability to pre-calculate and pre-aggregate data. Otherwise,
end users would be spending most of their time waiting for query results to be returned by the database.
However, it is also what causes OLAP-based solutions to be extremely rigid and IT-intensive.
ROLAP:
ROLAP stands for "Relational Online Analytical Processing". It is a type of OLAP (Online Analytical Processing)
that uses a relational database management system (RDBMS) to store and manage data. ROLAP technology allows
users to perform complex queries and analysis on large volumes of data, and to quickly retrieve the results in a tabular
format.
In a ROLAP system, data is stored in a relational database, typically using SQL (Structured Query Language) as the
query language. ROLAP uses SQL queries to aggregate and summarize data across multiple tables in the database,
and to create multidimensional views of the data. These views can then be used to analyze the data and create reports.
ROLAP systems are particularly useful for analyzing large amounts of data, especially in business intelligence and
data warehousing applications, because the data stays in the relational database and is not limited by the size
of a precomputed cube.
MOLAP:
MOLAP stands for Multidimensional Online Analytical Processing. MOLAP uses a multidimensional cube that
accesses stored data through various combinations. Data is pre-computed, pre-summarized, and stored (a difference
from ROLAP, where queries are served on-demand).
A multicube approach has proved successful in MOLAP products. In this approach, a series of dense, small,
precalculated cubes make up a hypercube. Tools that incorporate MOLAP include Oracle Essbase, IBM Cognos, and
Apache Kylin.
Its simple interface makes MOLAP easy to use, even for inexperienced users. Its speedy data retrieval makes it the
best for “slicing and dicing” operations. One major disadvantage of MOLAP is that it is less scalable than ROLAP, as
it can handle a limited amount of data.
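The difference between the two storage approaches can be sketched in a few lines. This is a toy illustration, not how an OLAP server is implemented: the ROLAP side aggregates on demand with an SQL query (using Python's built-in sqlite3 as a stand-in relational database), while the MOLAP side precomputes every aggregate once and then answers by lookup. The fact table, its values, and the function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical fact rows: (year, region, sales amount).
sales = [("2023", "East", 100), ("2023", "West", 150),
         ("2024", "East", 120), ("2024", "West", 80)]

# ROLAP style: keep the facts in a relational table and aggregate per query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (year TEXT, region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", sales)

def rolap_total(year):
    query = "SELECT SUM(amount) FROM fact_sales WHERE year = ?"
    return conn.execute(query, (year,)).fetchone()[0]

# MOLAP style: precompute a cube of all combinations; "*" stands for "all members".
cube = defaultdict(int)
for year, region, amount in sales:
    for key in [(year, region), (year, "*"), ("*", region), ("*", "*")]:
        cube[key] += amount

def molap_total(year):
    return cube[(year, "*")]  # a lookup, no aggregation at query time

print(rolap_total("2023"), molap_total("2023"), cube[("*", "East")])
```

The cube answers faster but grows with every added dimension, which is the scalability limit mentioned above.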
HOLAP:
HOLAP stands for Hybrid Online Analytical Processing. As the name suggests, the HOLAP storage mode combines
attributes of both MOLAP and ROLAP. Since HOLAP involves storing part of your data in a ROLAP store and
another part in a MOLAP store, developers get the benefits of both.
With this use of the two OLAPs, the data is stored in both multidimensional databases and relational databases. The
decision to access one of the databases depends on which is most appropriate for the requested processing application
or type. This setup allows much more flexibility for handling data. Typically, aggregated (summary) data is held
in the multidimensional database for fast access, while the detailed data remains in the relational database.
Creating a Cube in SSDT
To create a cube, right-click on Cube and select New Cube.

Assignment: 9
Problem Statement:
Import the data warehouse data in Microsoft Excel and create the Pivot table and Pivot Chart.
To introduce the concepts and components of Business Intelligence (BI)
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
Data Warehouse:
A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical data. The data within a data warehouse is usually
derived from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts.
Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”
Pivot Table:
Pivot tables are among the most useful and powerful features in Excel. We use them to summarize the
data stored in a table. They organize and rearrange (or "pivot") statistics to draw attention to the valuable
facts. You can take an extremely large data set and see the relevant information you need in a clean,
concise, manageable way.
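The summarizing that a pivot table performs can be illustrated with a short Python sketch: one field becomes the rows, another becomes the columns, and a value field is summed in each cell. The sample records and the function name are hypothetical, and Excel does this interactively rather than through code.

```python
from collections import defaultdict

# Made-up sales records: (region, quarter, amount).
records = [
    ("East", "Q1", 100), ("East", "Q2", 120),
    ("West", "Q1", 90), ("West", "Q2", 60), ("East", "Q1", 30),
]

def pivot_sum(records):
    """Rows = region, columns = quarter, cell value = sum of amount."""
    table = defaultdict(lambda: defaultdict(int))
    for region, quarter, amount in records:
        table[region][quarter] += amount
    return {region: dict(cols) for region, cols in table.items()}

pivot = pivot_sum(records)
for region, cols in pivot.items():
    print(region, cols)
```

Five input records collapse to a two-by-two summary, with the two East/Q1 records combined into one cell.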
Pivot Chart:
A pivot chart in Excel is a visual representation of the data in a pivot table. It gives you the big picture of
your raw data and allows you to analyze data using various types of graphs and layouts. It is well suited to
business presentations that involve large data sets.

Data Model is used for building a model where data from various sources can be combined by
creating relationships among the data sources. A Data Model integrates the tables, enabling
extensive analysis using PivotTables, Power Pivot, and Power View.
A Data Model is created automatically when you import two or more tables simultaneously
from a database. The existing database relationships between those tables are used to create the
Data Model in Excel.
Step 1 − Open a new blank Workbook in Excel.
Step 2 − Click on the DATA tab.
Step 3 − In the Get External Data group, click on the option From Access. The Select Data
Source dialog box opens.
Step 4 − Select Events.accdb, Events Access Database file.
Step 5 − The Select Table window, displaying all the tables found in the database, appears.


Step 6 − Tables in a database are similar to the tables in Excel. Check the ‘Enable selection
of multiple tables’ box, and select all the tables. Then click OK.
Step 7 − The Import Data window appears. Select the PivotTable Report option. This option
imports the tables into Excel and prepares a PivotTable for analyzing the imported tables.
Notice that the checkbox at the bottom of the window - ‘Add this data to the Data Model’ is
selected and disabled.

Step 8 − The data is imported, and a PivotTable is created using the imported tables.
Explore Data Using PivotTable
Step 1 − You know how to add fields to PivotTable and drag fields across areas. Even if you
are not sure of the final report that you want, you can play with the data and choose the best-
suited report.
In PivotTable Fields, click on the arrow beside the table - Medals to expand it to show the
fields in that table. Drag the NOC_CountryRegion field in the Medals table to the
COLUMNS area.
Step 2 − Drag Discipline from the Disciplines table to the ROWS area.

Step 3 − Filter Discipline to display only five sports: Archery, Diving, Fencing, Figure
Skating, and Speed Skating. This can be done either in PivotTable Fields area, or from the
Row Labels filter in the PivotTable itself.
Step 4 − In PivotTable Fields, from the Medals table, drag Medal to the VALUES area.
Step 5 − From the Medals table, select Medal again and drag it into the FILTERS area.
Step 6 − Click the dropdown list button to the right of the Column labels.
Step 7 − Select Value Filters and then select Greater Than…
Step 8 − Click OK.
The Value Filters dialog box appears, set to show items for which the count of Medal is greater than a given value.

Step 9 − Type 80 in the Right Field.
Step 10 − Click OK.
The PivotTable displays only those regions that have more than 80 medals in total.
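The effect of Steps 4 to 10 (count the medals, then keep only the regions with more than 80 medals in total) can be sketched in plain Python. The medal rows below are invented for illustration and do not come from Events.accdb; regions_above is a hypothetical helper.

```python
from collections import Counter

# Hypothetical (region, discipline) pairs, one per medal; the real workbook
# would supply these from the Medals and Disciplines tables.
medals = (
    [("USA", "Diving")] * 60 + [("USA", "Fencing")] * 30 +
    [("ITA", "Fencing")] * 85 + [("KOR", "Archery")] * 40
)

def regions_above(medal_rows, threshold):
    """Count medals per (region, discipline) cell and keep only the regions
    whose total count is strictly greater than threshold."""
    cell_counts = Counter(medal_rows)                      # VALUES: count of Medal
    totals = Counter(region for region, _ in medal_rows)   # total per region
    keep = {region for region, total in totals.items() if total > threshold}
    return {cell: n for cell, n in cell_counts.items() if cell[0] in keep}

filtered = regions_above(medals, 80)
for cell in sorted(filtered):
    print(cell, filtered[cell])
```

In this made-up data the third region has only 40 medals, so it is dropped, just as the Greater Than value filter hides regions at or below the limit.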
Create Relationship between Tables
Relationships let you analyze collections of data in Excel and create interesting and
aesthetic reports from the data you import.
Step 1 − Insert a new Worksheet.
Step 2 − Create a new table with new data. Name the new table as Sports.

Step 3 − Now you can create a relationship between this new table and the other tables that
already exist in the Data Model in Excel. Rename Sheet1 to Medals and Sheet2 to
Sports.
On the Medals sheet, in the PivotTable Fields List, click All. A complete list of available
tables will be displayed. The newly added table - Sports will also be displayed.
Step 4 − Click on Sports. In the expanded list of fields, select Sports. Excel prompts you to
create a relationship between tables.

Step 5 − Click on CREATE. The Create Relationship dialog box opens.
Step 6 − To create the relationship, one of the tables must have a column of unique, non-
repeated, values. In the Disciplines table, SportID column has such values. The table Sports
that we have created also has the SportID column. In Table, select Disciplines.
Step 7 − In Column (Foreign), select SportID.
Step 8 − In Related Table, select Sports.

Step 9 − In Related Column (Primary), SportID gets selected automatically. Click OK.
Step 10 − The PivotTable is modified to reflect the addition of the new data field Sport.
Adjust the order of the fields in the ROWS area to maintain the hierarchy. In this case, Sport
should come first and Discipline next, as Discipline will be nested in Sport as
a sub-category.
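What the relationship achieves can be sketched in plain Python as a lookup on the shared SportID column. Only the column name SportID and the table names come from the steps above; the rows and the helper relate are hypothetical. The sketch assumes that the related (primary) side holds each SportID exactly once.

```python
# Hypothetical rows; only the SportID column name comes from the text.
sports = [
    {"SportID": "S1", "Sport": "Aquatics"},
    {"SportID": "S2", "Sport": "Skating"},
]
disciplines = [
    {"Discipline": "Diving", "SportID": "S1"},
    {"Discipline": "Figure Skating", "SportID": "S2"},
    {"Discipline": "Speed Skating", "SportID": "S2"},
]

def relate(foreign_rows, primary_rows, key):
    """Attach the matching primary row to each foreign row via key."""
    lookup = {}
    for row in primary_rows:
        if row[key] in lookup:
            raise ValueError(f"duplicate {key} on the primary side: {row[key]}")
        lookup[row[key]] = row
    return [{**row, **lookup[row[key]]} for row in foreign_rows]

joined = relate(disciplines, sports, "SportID")

# Nest Discipline under Sport, as in the ROWS area of the PivotTable.
hierarchy = {}
for row in joined:
    hierarchy.setdefault(row["Sport"], []).append(row["Discipline"])
print(hierarchy)
```

The duplicate check mirrors the requirement in Step 6 that one side of the relationship must have unique, non-repeated values.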

Assignment: 10

Problem Statement:
Perform data classification using a classification algorithm, or perform data clustering
using a clustering algorithm.
Objective of the Assignment:
To introduce the concepts of classification and clustering algorithms.
Prerequisite:
1. Basics of data classification.
2. Concept of data clustering.
Contents for Theory:
1. Classification and clustering algorithms.
2. Time Series.

CLASSIFICATION:
Classification is the process of putting something into a category. Classifying all your clothes by color, for
example, may make it easier to put together an outfit. More generally, classification places things into a class
or group according to particular characteristics so that they are easier to make sense of, whether you are
organizing your shoes, your stock portfolio, or a group of invertebrates. (In an unrelated sense, classification
can also mean a government's system for keeping secrets: a high security classification gives access to
top-secret information.) In data mining, a classification algorithm learns from records whose classes are
already known and then assigns a class to each new record.
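As a minimal sketch of a classification algorithm, the following plain-Python code uses k-nearest neighbours (k-NN). The manual does not prescribe a particular algorithm; k-NN is chosen here only because it is short. The training points, the labels A and B, and the helper knn_predict are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled points: (feature vector, class label).
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

def knn_predict(train, point, k=3):
    """Return the majority label among the k training points nearest to point."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(training, (1.1, 0.9)))
print(knn_predict(training, (4.9, 5.0)))
```

A new point receives the class held by most of its k closest labelled neighbours, so the result depends on both the choice of k and the distance measure.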
CLUSTERING:
Clustering is an unsupervised machine learning task. You might also hear this referred to as cluster analysis because of
the way this method works.
Using a clustering algorithm means you're going to give the algorithm a lot of input data with no labels and let it find
any groupings in the data it can.
Those groupings are called clusters. A cluster is a group of data points that are similar to each other based on their
relation to surrounding data points. Clustering is used for things like feature engineering or pattern discovery.
When you're starting with data you know nothing about, clustering might be a good place to get some insight.
Types of clustering algorithms
There are different types of clustering algorithms that handle all kinds of unique data.
Density-based
In density-based clustering, data is grouped by areas of high concentrations of data points surrounded by areas of low
concentrations of data points. Basically the algorithm finds the places that are dense with data points and calls those
clusters.
The great thing about this is that the clusters can be any shape. You aren't constrained to expected conditions.
The clustering algorithms under this type don't try to assign outliers to clusters, so they get ignored.
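A compact sketch of one density-based algorithm, DBSCAN, is given below in plain Python. It is a simplified illustration, not a tuned implementation: the parameter names eps (neighbourhood radius) and min_pts (minimum number of points, counting the point itself, for a dense region) and the sample points are assumptions made for the example.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)   # j is a core point: keep expanding

    return labels

# Two dense groups and one isolated point (hypothetical data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
print(labels)
```

The isolated point keeps the label -1, which is how the outliers mentioned above are left out of every cluster.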
Distribution-based
With a distribution-based clustering approach, all of the data points are considered parts of a cluster based on the
probability that they belong to a given cluster.
It works like this: there is a center-point, and as the distance of a data point from the center increases, the probability of
it being a part of that cluster decreases.
If you aren't sure what kind of distribution your data follows, you should consider a different type of algorithm.
Centroid-based
Centroid-based clustering is the one you probably hear about the most. It's a little sensitive to the initial parameters you
give it, but it's fast and efficient.
These types of algorithms separate data points based on multiple centroids in the data. Each data point is assigned to a
cluster based on its squared distance from the centroid. This is the most commonly used type of clustering.
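The centroid-based idea can be sketched with k-means in plain Python. The sample points and the fixed starting centroids are hypothetical; real implementations usually choose the starting centroids at random, which is why results can vary with the initial parameters.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move every centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Nearest centroid by squared Euclidean distance.
            idx = min(range(len(centroids)),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[idx].append(p)
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break                    # assignments are stable, so stop early
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups and two starting centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [(1, 1), (9, 8)])
print(centroids)
print(clusters)
```

Each pass repeats the two steps described above: assignment by squared distance, then recomputation of the centroids.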
Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you would get from a company database or
taxonomies. It builds a tree of clusters so everything is organized from the top-down.
This is more restrictive than the other clustering types, but it's perfect for specific kinds of data sets.
TIME SERIES
Time series analysis studies how a response variable behaves with time as the independent variable. To estimate the
target variable for prediction or forecasting, the time variable is used as the point of reference. A time series is a
sequence of observations ordered in time; the interval may be years, months, weeks, days, hours, minutes, or seconds.
The observations are taken at successive, discrete points in time.
The time variable/feature is the independent variable and supports the target variable to predict the results. Time Series Analysis
(TSA) is used in different fields for time-based predictions – like Weather Forecasting models, Stock market predictions, Signal
processing, Engineering domain – Control Systems, and Communications Systems. Since TSA involves producing the set of
information in a particular sequence, this makes it distinct from spatial and other analyses. We could predict the future using AR,
MA, ARMA, and ARIMA models.
Consider the monthly rainfall recorded at a place, starting from January 2012. We create an R time series object
for a period of 12 months and plot it.
# Get the data points in the form of an R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
# Convert it to a time series object.
rainfall.timeseries <- ts(rainfall,start = c(2012,1),frequency = 12)
# Print the timeseries data.
print(rainfall.timeseries)
# Give the chart file a name.
png(file = "rainfall.png")
# Plot a graph of the time series.
plot(rainfall.timeseries)
# Save the file.
dev.off()
Output:
When we execute the above code, it produces the following result and chart −
        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep
2012  799.0 1174.8  865.1 1334.6  635.4  918.5  685.5  998.6  784.2
        Oct    Nov    Dec
2012  985.0  882.8 1071.0
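The simplest of the forecasting models named above, a first-order autoregressive model AR(1), can be sketched in Python, the language used in the earlier assignments of this manual. The sketch fits x[t] = c + phi * x[t-1] by ordinary least squares on the same rainfall values. The helpers fit_ar1 and forecast_next are hypothetical, and twelve observations are far too few for a reliable forecast, so the printed numbers only illustrate the mechanics.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_prev = sum(prev) / n
    mean_curr = sum(curr) / n
    cov = sum((p - mean_prev) * (q - mean_curr) for p, q in zip(prev, curr))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_curr - phi * mean_prev
    return c, phi

def forecast_next(series):
    """One-step-ahead forecast from the fitted AR(1) model."""
    c, phi = fit_ar1(series)
    return c + phi * series[-1]

# Rainfall values from the R example above.
rainfall = [799, 1174.8, 865.1, 1334.6, 635.4, 918.5,
            685.5, 998.6, 784.2, 985, 882.8, 1071]
c, phi = fit_ar1(rainfall)
print(f"c = {c:.1f}, phi = {phi:.3f}, next = {forecast_next(rainfall):.1f}")
```

MA, ARMA and ARIMA models extend this idea with error terms and differencing and are normally fitted with a statistics library rather than by hand.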

