Vision
Mission
• Generation of national wealth through education and research
• Imparting quality technical education at the cost affordable to all strata of
the society
• Enhancing the quality of life through sustainable development
• Carrying out high-quality intellectual work
• Achieving the distinction of the highest preferred engineering college in the
eyes of the stakeholders
Vision
Mission
• To produce Best Quality Computer Science Professionals by
imparting quality training, hands on experience and value education.
CO1: Apply basic principles of elective subjects to problem solving and modeling.
CO2: Use tools and techniques in the area of software development to build mini projects
CO3: Design and develop applications on subjects of their choice.
CO4: Generate and manage deployment, administration & security.
Lab Manual
(410256)
Laboratory Practice VI
BE Computer
Year: 2022-2023
Sem-II
INDEX
Sr. No.   Name of Assignment   Date   Remark
9    Create the cube with suitable dimension and fact tables based on ROLAP, MOLAP and HOLAP model.
10   Import the data warehouse data in Microsoft Excel and create the Pivot table and Pivot Chart.
Assignment: 1
Aim:
Perform tokenization (whitespace, punctuation-based, Treebank, tweet, MWE) using the NLTK library. Use the
Porter stemmer and the Snowball stemmer for stemming. Use any technique for lemmatization.
Tokenization:
1. Install NLTK library using pip.
2. Import the necessary modules from the NLTK library.
3. Load the text data that needs to be tokenized.
4. Use the appropriate tokenization method based on your requirements, such as whitespace tokenization,
punctuation-based tokenization, treebank tokenization, tweet tokenization or multi-word expression
tokenization.
5. Store the tokenized data in a variable for further processing.
Stemming:
1. Import the necessary modules from the NLTK library.
2. Load the text data that needs to be stemmed.
3. Choose the appropriate stemming algorithm, such as Porter stemmer or Snowball stemmer.
4. Initialize the stemmer object and use the stem method to perform stemming on the text data.
5. Store the stemmed data in a variable for further processing.
Lemmatization:
1. Import the necessary modules from the NLTK library.
2. Load the text data that needs to be lemmatized.
3. Use the WordNetLemmatizer module to create a lemmatizer object.
4. Use the lemmatize method to perform lemmatization on the text data.
5. Store the lemmatized data in a variable for further processing.
Objective:
Perform tokenization in Python on any input text (for example, a sentence) using the NLTK library.
Pre-Requisite:
Python programming
NLTK library
Theory:
Tokenization:
Tokenization is the process of breaking down a text into smaller components called tokens. Tokens can be words,
punctuation marks, or other units of text. Tokenization is a common first step in NLP tasks such as text
classification, named entity recognition, and machine translation. There are many libraries available for
tokenization. Here are some popular libraries used for tokenization.
2] SpaCy:
It is another popular Python library for NLP tasks that provides efficient tokenization methods. SpaCy uses
Machine Learning (ML) algorithms to tokenize text and provides support for custom tokenization rules.
3] Stanford CoreNLP:
It is a suite of NLP tools written in Java and available in multiple languages. CoreNLP provides tokenization,
sentence splitting, part-of-speech tagging, and more.
4] TensorFlow Text:
It is a Python library for text processing and tokenization. It provides various tokenization methods, including
word tokenization, sentence tokenization and more.
5] TextBlob:
It is a Python library that supports various NLP tasks. TextBlob provides different tokenization methods,
including word tokenization, sentence tokenization, and more.
Here, we are performing tokenization using NLTK library. There are different methods of tokenization using
NLTK, each with its own advantages and limitations. The methods we are using here are as follows:
1) Whitespace tokenization:
This technique splits the text into tokens based on whitespace characters such as space, tab, and newline. It is a
simple and fast technique, but it may not work well for languages that do not use whitespace as a delimiter.
3) Treebank tokenization:
This technique uses a set of rules to tokenize text based on punctuation, as well as special cases such as
contractions and abbreviations. It is more accurate than both of the previous techniques, but it requires more
processing power and may not work well for non-standard language.
4) Tweet tokenization:
This technique is similar to punctuation-based tokenization, but it also handles the special syntax and vocabulary
used in tweets, such as hashtags and mentions.
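The tokenization methods named in the aim can be compared side by side. The following is a minimal sketch, assuming NLTK is installed; the sample sentence and the choice of WordPunctTokenizer as the punctuation-based tokenizer are illustrative assumptions, not part of the original manual. None of these tokenizer classes needs an extra NLTK data download.

```python
from nltk.tokenize import (
    MWETokenizer,
    TreebankWordTokenizer,
    TweetTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Illustrative input text (an assumption, not taken from the manual).
text = "Don't miss the New York trip, @sam! #travel"

# Whitespace: split only on spaces, tabs and newlines.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)

# Punctuation-based: runs of punctuation become separate tokens.
punct_tokens = WordPunctTokenizer().tokenize(text)

# Treebank: rule-based, splits contractions such as "Don't".
treebank_tokens = TreebankWordTokenizer().tokenize(text)

# Tweet: keeps mentions and hashtags as single tokens.
tweet_tokens = TweetTokenizer().tokenize(text)

# MWE: merges listed multi-word expressions in an already tokenized text.
mwe_tokenizer = MWETokenizer([("New", "York")], separator="_")
mwe_tokens = mwe_tokenizer.tokenize(whitespace_tokens)

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("punctuation", punct_tokens),
    ("treebank", treebank_tokens),
    ("tweet", tweet_tokens),
    ("mwe", mwe_tokens),
]:
    print(name, tokens)
```

Printing the five token lists for the same sentence makes the differences visible, for example how the contraction, the mention and the hashtag are treated by each method.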
In addition to tokenization, other text processing techniques such as stemming and lemmatization are often used
to further normalize the tokens and reduce the vocabulary size.
Stemming:
Stemming is the process of reducing a word to its base or root form, which is called the stem. Stemming is a
common technique used in Natural Language Processing (NLP) to normalize text data and reduce the
dimensionality of text features. The NLTK library provides two popular stemming algorithms: the Porter
Stemming algorithm and the Snowball Stemming algorithm (also known as the Porter2 Stemming algorithm).
1] Porter Stemming:
It is a rule-based algorithm that uses a set of rules to remove suffixes from words. The Porter stemming algorithm
is fast but can produce non-words or incorrect stems in some cases. For example, the word “running” is stemmed
to “run” using the Porter stemming algorithm, but the word “agreed” is stemmed to “agre”, which is not a
dictionary word.
2] Snowball Stemming:
It is an improvement over the Porter stemming algorithm and is also rule-based. The Snowball stemming
algorithm revises several of the Porter rules and often produces more accurate stems. For example, the word
“generously” is stemmed to “generous” using the Snowball stemming algorithm, whereas the Porter stemming
algorithm reduces it to “gener”.
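The two stemmers can be run on the same words to compare their output. This is a minimal sketch, assuming NLTK is installed; the word list is an illustrative assumption. Neither stemmer needs an NLTK data download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the language name is required

# Illustrative words (an assumption, not taken from the manual).
words = ["running", "agreed", "generously", "studies"]

# Store the stems of each word for further processing.
porter_stems = {word: porter.stem(word) for word in words}
snowball_stems = {word: snowball.stem(word) for word in words}

for word in words:
    print(f"{word:12} porter={porter_stems[word]:10} snowball={snowball_stems[word]}")
```

Looking at the printed pairs shows where the two algorithms agree and where the Snowball rules give a different stem.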
Lemmatization:
Lemmatization is the process of reducing a word to its base or dictionary form, known as the lemma. Unlike a
stem, a lemma is always a valid word, so it may differ from the stem obtained through stemming. For example, the
lemma of the words “am”, “is”, and “are” is “be”.
Lemmatization is an important step in many Natural Language Processing (NLP) tasks, such as text classification,
information retrieval, and machine translation. It helps in reducing the number of unique words in text, which can
improve the efficiency and accuracy of NLP algorithms.
NLTK provides the WordNetLemmatizer class for lemmatization. WordNetLemmatizer uses WordNet, a lexical
database for English, to find the base form of a word. Stemmers such as LancasterStemmer, on the other hand,
use a set of rules and heuristics to find the stem of a word and do not perform lemmatization. Here, we will use
the WordNetLemmatizer class for lemmatization.
WordNet Lemmatizer:
WordNet is a lexical database for the English language that groups words into sets of synonyms called synsets,
each expressing a distinct concept. WordNet is often used in NLP for tasks such as lemmatization, part-of-speech
tagging, and semantic analysis. WordNetLemmatizer is a class provided by NLTK that uses WordNet to
perform lemmatization. It maps a word to its base or dictionary form based on its part of speech. The
lemmatize() method of the WordNetLemmatizer class takes two arguments: the word to be lemmatized, and its
part of speech. If the part of speech is not specified, the default is noun (pos="n").
Note: If you have already downloaded the required NLTK resources, you can skip the download step.
Conclusion:
Hence, we successfully studied and performed tokenization (whitespace, punctuation-based, Treebank, tweet,
MWE) using the NLTK library. We also performed stemming using the Porter and Snowball stemmers, and
lemmatization using the WordNetLemmatizer class.
Assignment : 2
Title of the Assignment:
Perform bag-of-words approach (count occurrence, normalized count occurrence), tf-idf on data. Create
embeddings using Word2Vec.
Dataset to be used: https://www.kaggle.com/datasets/CooperUnion/cardataset
Prerequisites:
1. Basic knowledge of Python programming.
2. Familiarity with data preprocessing techniques, including data formatting, normalization, and cleaning.
Theory :
The bag-of-words approach is a method for representing text data as a set of numerical features that can be used in
machine learning algorithms. It involves counting the frequency of each word in a text corpus and representing
each document as a vector of word counts. The name "bag-of-words" comes from the fact that the order of the
words is ignored and the text is treated as an unordered set (or "bag") of words.
Bag-of-Words (BoW) approach is a popular technique for text representation in natural language processing. It is
a simple and effective way to convert text into numerical vectors for machine learning models. The BoW
approach involves counting the frequency of words in the document corpus and representing each document as a
vector of word frequencies.
In the count occurrence approach, the BoW model counts the occurrence of each word in the document corpus,
resulting in a matrix of word frequencies. This approach treats each word as independent of the others,
disregarding the order in which they appear in the text.
In the normalized count occurrence approach, the count of each word in a document is divided by the total
number of words in that document, resulting in a matrix of relative word frequencies. This approach accounts for
differences in document length, so that long documents do not dominate simply because they contain more
words.
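The two counting schemes can be sketched with the standard library alone. This is a minimal illustration on two made-up sentences (an assumption); the normalization here divides by each document's own length, which is one common convention.

```python
from collections import Counter

# Made-up mini corpus for illustration.
docs = [
    "the car is fast",
    "the car is red and the engine is loud",
]

# Whitespace tokenization is enough for this toy example.
tokenized = [doc.split() for doc in docs]

# The vocabulary fixes the column order of the matrices.
vocab = sorted({word for tokens in tokenized for word in tokens})


def count_vector(tokens, vocab):
    """Raw count of every vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def normalized_vector(tokens, vocab):
    """Counts divided by the document length (relative frequencies)."""
    total = len(tokens)
    return [count / total for count in count_vector(tokens, vocab)]


count_matrix = [count_vector(tokens, vocab) for tokens in tokenized]
norm_matrix = [normalized_vector(tokens, vocab) for tokens in tokenized]

print(vocab)
print(count_matrix)
print(norm_matrix)
```

Each row of the normalized matrix sums to 1, so documents of different lengths become directly comparable.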
Term Frequency-Inverse Document Frequency (tf-idf) is a popular weighting scheme used in information
retrieval and text mining to evaluate the relevance of a word in a document. It measures the importance of a word
in a document corpus by taking into account the frequency of the word in the document and the frequency of the
word in the entire corpus. The tf-idf score of a word is high if the word occurs frequently in a particular
document, but rarely in other documents in the corpus.
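A small worked example may help here. The sketch below uses one common textbook variant, tf = count / document length and idf = log(N / df); this exact formula is an assumption, and library implementations such as scikit-learn's TfidfVectorizer use a smoothed idf and normalize each row, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up tokenized documents for illustration.
docs = [
    ["red", "car", "fast", "car"],
    ["blue", "car", "slow"],
    ["red", "truck", "slow"],
]


def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return math.log(len(docs) / df)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


car_score = tf_idf("car", docs[0], docs)
fast_score = tf_idf("fast", docs[0], docs)
print(car_score, fast_score)
```

In the first document "car" occurs twice but also appears in another document, while "fast" occurs once and nowhere else, so "fast" receives the higher score.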
Word2Vec is a neural network-based technique for generating word embeddings, which are numerical
representations of words in a high-dimensional space. Word embeddings capture the semantic and syntactic
relationships between words and are widely used in natural language processing applications such as text
classification, sentiment analysis, and machine translation. The Word2Vec model learns the embeddings by
training on a large corpus of text data and optimizing a loss function to predict the context words of a target word.
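To make "predict the context words of a target word" concrete, the sketch below generates the (target, context) training pairs used by the skip-gram variant of Word2Vec. The function name skipgram_pairs and the sample sentence are hypothetical, chosen for illustration only; the neural network that is trained on these pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["fast", "red", "sports", "car"], window=1)
print(pairs)
```

Words that share many context words end up with many similar training pairs, which is why their learned vectors become close to each other.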
4. sklearn: Scikit-learn is a powerful machine learning library in Python that provides a range of tools for statistical
modeling, including supervised and unsupervised learning algorithms. It includes a variety of classification,
regression, and clustering algorithms, as well as tools for model selection, preprocessing, and evaluation.
5. numpy: NumPy is a fundamental library for scientific computing in Python. It provides tools for working with
arrays, matrices, and other numerical data, as well as linear algebra operations, Fourier transforms, and more.
6. gensim: Gensim is a library for topic modeling and natural language processing in Python. It provides tools for
creating word embeddings (representations of words as vectors) using algorithms such as Word2Vec and Doc2Vec.
It also includes tools for performing semantic analysis, such as topic modeling and similarity search.
7. logistic regression: Logistic regression is a linear classification algorithm that is commonly used in machine
learning. It models the probability of a binary outcome (such as whether a car is a certain make or not) as a function
of the input variables (such as features of the car).
8. train_test_split: The train_test_split function is a tool provided by scikit-learn for splitting a dataset into training
and testing sets. This is commonly used in machine learning for evaluating the performance of models, as it allows
you to train a model on one set of data and test it on another set of data that it has not seen before.
Dataset information :
The dataset from CooperUnion's Car Dataset is a collection of information about various cars, including their
make, model, year, engine size, fuel type, and more. The dataset includes information on over 10,000 cars and is
suitable for use in machine learning applications such as natural language processing and sentiment analysis.
To perform a bag-of-words approach on this dataset, we can preprocess the text by tokenizing the car descriptions
and removing stop words and non-alphabetic characters. We can then use the CountVectorizer class from
scikit-learn to count the occurrence of each word in the preprocessed text.
To normalize the occurrence count, we can divide each count by the total number of words in the corresponding
preprocessed text. To perform tf-idf, we can use the TfidfVectorizer class from scikit-learn.
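The steps above can be sketched as follows, assuming scikit-learn and NumPy are installed. The three short strings are made-up stand-ins for text built from the dataset (an assumption); loading the actual CSV file is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up stand-ins for per-car text (assumption, not real dataset rows).
texts = [
    "BMW coupe premium unleaded rear wheel drive",
    "Audi sedan premium unleaded all wheel drive",
    "Ford pickup regular unleaded four wheel drive",
]

# Count occurrence: one row per text, one column per vocabulary word.
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(texts).toarray()

# Normalized count occurrence: divide each row by its total word count.
normalized = counts / counts.sum(axis=1, keepdims=True)

# tf-idf: words shared by every text (such as "wheel") get lower weights.
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(texts)

print(sorted(count_vectorizer.vocabulary_))
print(counts.shape, normalized.shape, tfidf.shape)
```

CountVectorizer and TfidfVectorizer lowercase and tokenize the text themselves by default, so only dataset-specific cleaning needs to be done beforehand.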
Finally, to create embeddings using Word2Vec, we can use the Word2Vec class from the gensim library. We can
train a Word2Vec model on the preprocessed text and use the resulting embeddings for downstream tasks such as
classification or clustering.
The CooperUnion Car Dataset provides a rich source of information for exploring the application of natural
language processing techniques to automotive data. A practical assignment could involve using the dataset to train
a model that predicts the fuel efficiency of a car based on its make, model, and other features. Students could be
tasked with preprocessing the text, performing bag-of-words and tf-idf, training a Word2Vec model, and using the
resulting embeddings in a machine learning model. The assignment could also involve exploring other natural
language processing techniques such as sentiment analysis or topic modeling.
For this assignment, you will need to install and download the following libraries and resources:
1. Pandas: to manipulate and analyze data in tabular form.
2. NumPy: to perform numerical computations in Python.
3. Scikit-learn: to implement the machine learning algorithms.
4. NLTK: to perform natural language processing tasks, such as tokenization and stemming.
Conclusion :
We studied the use of bag-of-words approach (count occurrence, normalized count occurrence), tf-idf, and
Word2Vec embeddings on a car dataset. Through our experimentation, we were able to analyze and visualize the
dataset, perform the various techniques, and gain insights into the relationships between the words and the car
attributes.
Assignment: 3
Aim -
Perform text cleaning, lemmatization (any method), stop word removal (any method), and label encoding.
Create representations using TF-IDF and save the outputs.
Prerequisites -
Familiarity with the Python programming language
Basic knowledge of the pandas, Scikit-learn, and pickle libraries
Familiarity with the Natural Language Toolkit (NLTK)
Steps -
1. We start by importing the required libraries.
2. We load the news dataset using the read_pickle() method of Pandas.
3. We perform text cleaning by removing digits and punctuation and converting the text to lowercase using
the str.replace() and str.lower() methods of Pandas.
4. We perform lemmatization using the WordNetLemmatizer from NLTK.
5. We remove stop words using the stopwords corpus from NLTK.
6. We perform label encoding using the LabelEncoder from Scikit-learn.
7. We create a TF-IDF representation of the text using the TfidfVectorizer from Scikit-learn.
8. We save the cleaned dataset, TF-IDF matrix, and label encoder using the pickle library.
Theory -
Pandas
The pandas library is an open-source Python library that provides powerful data manipulation and analysis tools
for working with structured data. It is built on top of NumPy and provides data structures for efficiently storing
and manipulating large datasets.
● DataFrame: This is a two-dimensional table-like data structure with labeled columns and rows. It is similar
to a spreadsheet or SQL table and is one of the most commonly used data structures in pandas.
● Series: This is a one-dimensional labeled array that can hold any data type, including numbers, strings, and
objects.
● Data manipulation tools: pandas provides a wide range of data manipulation tools for filtering, sorting,
grouping, and aggregating data.
● Missing data handling: pandas provides powerful tools for handling missing data, including methods for
filling in missing values or dropping rows with missing data.
● Input and output tools: pandas provides functions for reading and writing data in various formats,
including CSV, Excel, SQL databases, and JSON.
● Time series functionality: pandas provides functionality for working with time series data, including tools
for resampling, shifting, and rolling window calculations.
Pandas also offers many functions for working with strings in data frames. Two such functions are str.replace()
and str.lower().
str.replace() is a method used to replace a substring in a string with a new substring. It can be used on a pandas
data frame column to replace a specific string value with another. For example, if we have a column called "City"
and we want to replace all occurrences of "New York" with "NYC", we can call str.replace("New York", "NYC")
on that column.
str.lower() is a method used to convert a string to lowercase. It can also be used on a pandas data frame column to
convert all the strings to lowercase. For example, if we have a column called "City" and we want to convert all the
strings to lowercase, we can call str.lower() on that column.
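A minimal sketch of both string methods on a "City" column follows, assuming pandas is installed. The data frame contents are made up for illustration.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"City": ["New York", "Boston", "New York City"]})

# Replace the literal substring "New York" with "NYC" in every row.
# regex=False makes the pattern a plain string rather than a regular expression.
df["City"] = df["City"].str.replace("New York", "NYC", regex=False)

# Convert every value to lowercase and keep it in a new column.
df["City_lower"] = df["City"].str.lower()

print(df)
```

Both methods return a new Series, so the result has to be assigned back to a column to take effect.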
Lemmatization
Lemmatization is the process of transforming words to their base or root form, called the lemma. It involves
removing any inflectional endings such as -s, -es, -ed, -ing, and so on, to obtain the basic meaning of the word.
The goal of lemmatization is to reduce a word to its most basic form, making it easier to analyze and compare
words within a text.
Lemmatization is a crucial technique in natural language processing (NLP) and text analysis, where it helps to
normalize words and improve accuracy in tasks such as sentiment analysis, topic modeling, and information
retrieval. It is particularly useful in languages with complex inflectional systems, such as English, where there are
many irregular verbs and nouns.
There are various tools and libraries available for lemmatization, including NLTK, spaCy, and Stanford CoreNLP.
These tools use different algorithms and techniques to identify and transform words to their base form, depending
on the language and context of the text. For instance, the NLTK library uses WordNet, a lexical database of
English, to map words to their base forms. spaCy, on the other hand, uses rule-based and statistical methods to
perform lemmatization.
Lemmatization can also be combined with other text preprocessing techniques such as tokenization, stemming,
and stop word removal to further improve the accuracy and efficiency of text analysis. For instance, tokenization
involves breaking a text into smaller units such as words or phrases, while stemming involves reducing words to
their stem or root form, but without considering the context. Stop word removal involves removing common
words such as "the," "a," and "an" that do not carry much meaning in a text.
Text Cleaning
Text cleaning is the process of transforming raw unstructured text data into clean and structured data that can be
used for natural language processing (NLP) applications. The purpose of text cleaning is to remove any noise,
irrelevant information, or inconsistencies in the text data, while retaining the relevant information that is needed
for analysis.
Text cleaning involves several steps, including removing special characters, converting text to lowercase,
removing stop words, lemmatizing or stemming words, and removing any other unnecessary information such as
URLs, numbers, or punctuation.
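These cleaning steps can be sketched with the standard library. The function name clean_text and the tiny stop word set are hypothetical, chosen for illustration; in practice the stop word list would usually come from a library such as NLTK.

```python
import re
import string

# Tiny illustrative stop word set (assumption); real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "at", "and", "of", "to", "in"}


def clean_text(text):
    """Lowercase the text and strip URLs, digits, punctuation and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)           # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_text("The 2 Cats sat at https://example.com, purring!")
print(cleaned)
```

The order of the steps matters: URLs are removed before punctuation, because stripping punctuation first would break a URL into fragments that are harder to detect.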
Cleaning text data is an important step in NLP because it helps to improve the accuracy and efficiency of text
analysis tasks such as sentiment analysis, topic modeling, and text classification. It also helps to reduce the
computational resources required for processing large volumes of text data.
Overall, text cleaning is a crucial step in preparing text data for analysis in NLP applications. It helps to ensure
that the data is clean, structured, and relevant, which in turn improves the accuracy and efficiency of text analysis
tasks.
NLTK
The Natural Language Toolkit, commonly known as NLTK, is a popular open-source platform used for natural
language processing (NLP) tasks. NLTK provides a comprehensive set of tools for processing text data, including
tokenization, part-of-speech tagging, parsing, stemming, and sentiment analysis.
NLTK is written in Python and includes a wide range of data sets, corpora, and models for various NLP tasks. Its
user-friendly interface and extensive documentation make it an accessible tool for researchers, developers, and
students alike.
With NLTK, you can perform tasks such as text classification, information extraction, machine translation, and
text summarization. It also provides corpus readers for formats such as plain text and XML; HTML and PDF
content is usually converted to plain text with other libraries first.
NLTK has become a popular choice for developing NLP applications due to its versatility, ease of use, and
extensive community support. It has been used in numerous research studies, commercial products, and
educational resources. Overall, NLTK is a valuable tool for anyone working with text data and seeking to apply
NLP techniques.
Scikit-Learn
Scikit-learn is a popular open-source Python library for machine learning tasks such as classification, regression,
and clustering. It provides simple and efficient tools for data mining and data analysis. The library includes a wide
range of algorithms, including support vector machines, decision trees, k-nearest neighbors, and random forests.
Scikit-learn also offers tools for data preprocessing, feature selection, and model selection. The library is widely
used in both academia and industry due to its ease of use, versatility, and robustness. Scikit-learn is built on top of
other popular Python libraries such as NumPy, SciPy, and Matplotlib, making it easy to integrate with other
scientific computing tools.
Pickle Library
The pickle library in Python is used for serialization and deserialization of Python objects. Serialization is the
process of converting an object into a format that can be stored or transmitted, while deserialization is the process
of recreating the original object from the serialized form.
The pickle library allows Python objects to be serialized and deserialized in a compact binary format, making it
easy to store and transmit data between Python programs or across different platforms. The serialized form of the
object can be saved to a file or sent over a network, and later retrieved and deserialized back into its original form.
The pickle library supports most Python objects, including built-in types such as lists, dictionaries, and strings, as
well as user-defined classes and objects. However, it cannot serialize certain types of objects, such as file handles
or network sockets.
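A minimal round trip with pickle follows, using only the standard library. The dictionary contents and the file name are made up for illustration.

```python
import os
import pickle
import tempfile

# Made-up object to serialize.
record = {"labels": ["sports", "tech"], "scores": [0.9, 0.1]}

# Serialize to a file in binary mode.
path = os.path.join(tempfile.mkdtemp(), "record.pkl")
with open(path, "wb") as f:
    pickle.dump(record, f)

# Deserialize back into an equal Python object.
with open(path, "rb") as f:
    restored = pickle.load(f)

# Open file handles are an example of objects that cannot be pickled.
try:
    with open(path, "rb") as f:
        pickle.dumps(f)
    handle_picklable = True
except TypeError:
    handle_picklable = False

print(restored == record, handle_picklable)
```

Files must be opened in binary mode ("wb" and "rb"). Only unpickle data from a trusted source, because loading a pickle can execute arbitrary code.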
A TF-IDF (Term Frequency-Inverse Document Frequency) matrix is a numerical representation of text data that
is commonly used in natural language processing. It measures the importance of each word in a document or
corpus by taking into account its frequency and how often it appears in other documents. This is helpful in tasks
like text classification, where certain words may be more indicative of a certain category.
A label encoder is a tool used to convert categorical variables into numerical form for use in machine learning
models. It assigns a unique numerical value to each category in the variable, allowing the model to process the
data more easily. This is useful in classification tasks where the target variable is a categorical variable.
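A minimal sketch of label encoding with scikit-learn's LabelEncoder follows. The category names are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category labels.
labels = ["sports", "politics", "sports", "tech"]

encoder = LabelEncoder()

# fit_transform learns the sorted set of classes and maps each label to its index.
encoded = encoder.fit_transform(labels)

# inverse_transform maps the integers back to the original labels.
decoded = encoder.inverse_transform(encoded)

print(list(encoder.classes_))
print(list(encoded))
print(list(decoded))
```

The integers only identify categories and carry no order or magnitude, which is why label encoding is normally applied to the target variable rather than to input features.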
Conclusion -
Therefore, in this assignment we performed text cleaning, lemmatization, stop word removal, and label encoding,
created representations using TF-IDF, and saved the outputs.
Assignment : 4
Aim:
Create a transformer from scratch using the PyTorch library.
Prerequisites:
Basic knowledge of Python programming
Familiarity with PyTorch library
Understanding of deep learning concepts, including feedforward neural networks and backpropagation
Knowledge of the Transformer architecture and its components
Content of Theory:
1. Steps involved
2. Introduction to Transformers and their applications
3. Understanding the Transformer architecture, including its components: Multi-Head Attention,
Feedforward Neural Network, and Layer Normalization
4. Preprocessing the data for the Transformer model
5. Building the Transformer model from scratch using PyTorch
6. Training the Transformer model using a custom dataset
7. Evaluating the performance of the Transformer model
8. Improving the Transformer model performance using techniques such as fine-tuning, transfer learning, and
hyperparameter tuning
9. Libraries imported
Steps:
Creating a transformer from scratch using PyTorch involves several steps:
1. Importing necessary packages: First, you need to import the necessary packages, including PyTorch,
torch.nn, and torch.nn.functional.
2. Defining the encoder and decoder: The transformer consists of an encoder and a decoder. You need to
define the encoder and decoder classes separately.
3. Implementing the self-attention mechanism: The self-attention mechanism is the key component of the
transformer. You need to implement it by defining the multi-head attention and the feed-forward neural
network.
4. Defining the position encoding: Position encoding is used to inject positional information into the input
embeddings. You can define the position encoding function as a separate class.
5. Implementing the transformer encoder: The transformer encoder consists of multiple layers of
self-attention and feed-forward neural networks. You can define the transformer encoder class and implement the
forward function.
6. Implementing the transformer decoder: The transformer decoder is similar to the encoder, but it also
includes an attention mechanism that takes the encoder output as input. You can define the transformer decoder
class and implement the forward function.
7. Defining the transformer model: Finally, you can define the transformer model class that combines the
encoder and decoder.
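As an illustration of steps 3 and 5, the sketch below implements multi-head self-attention and a single encoder layer. It assumes PyTorch is installed; the class names, the layer sizes and the post-normalization layout are illustrative choices, not the only way to build these parts. Masking, dropout and the decoder are left out to keep it short.

```python
import math

import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention computed over several heads."""

    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # joint query/key/value projection
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)  # attention weights per head
        context = weights @ v
        # Merge the heads back into one vector per position.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out(context)


class EncoderLayer(nn.Module):
    """Self-attention and feed-forward sub-layers, each followed by
    a residual connection and layer normalization."""

    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attention = MultiHeadSelfAttention(d_model, num_heads)
        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attention(x))
        x = self.norm2(x + self.feed_forward(x))
        return x


layer = EncoderLayer(d_model=32, num_heads=4, d_ff=64)
out = layer(torch.randn(2, 5, 32))  # (batch, sequence length, d_model)
print(out.shape)
```

The output has the same shape as the input, which is what allows several such layers to be stacked to form the encoder.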
Theory:
Transformers are a type of deep neural network architecture that are used for natural language processing tasks
such as machine translation, language modeling, and text generation. The Transformer architecture was
introduced in 2017 by Vaswani et al. and has since become one of the most popular deep learning architectures in
NLP.
Understanding the Transformer architecture: The Transformer architecture is based on the use of attention
mechanisms. It consists of two main components: the encoder and the decoder. The encoder takes in the input
sequence and generates a representation of the sequence, which is then passed to the decoder to generate the
output sequence.
The key components of the Transformer architecture are:
1. Multi-Head Attention: This component allows the model to attend to different parts of the input sequence
simultaneously, making it well-suited for sequence-to-sequence tasks.
2. Feedforward Neural Network: This component applies a non-linear transformation to the output of the
multi-head attention layer.
3. Layer Normalization: This component helps to stabilize the training process by normalizing the inputs to
each layer.
4. Preprocessing the data for the Transformer model: The input data for the Transformer model must be
preprocessed before training. This involves tokenizing the input sequence, converting the tokens to integer IDs,
and creating input and output sequences.
5. Building the Transformer model from scratch using PyTorch: To build the Transformer model from
scratch using PyTorch, we first define the necessary components of the model, including the multi-head
attention layer, the feedforward neural network, and the layer normalization layer. We then define the encoder
and decoder modules, and use these modules to define the full Transformer model.
6. Training the Transformer model using a custom dataset: Once the Transformer model is defined, we can
train it using a custom dataset. We define the loss function and optimizer, and then train the model on the
training data.
7. Evaluating the performance of the Transformer model: We can evaluate the performance of the
Transformer model using various metrics such as accuracy, precision, recall, and F1 score. We can also
visualize the performance of the model using tools such as confusion matrices and ROC curves.
8. Improving the Transformer model performance: To improve the performance of the Transformer model,
we can use techniques such as fine-tuning, transfer learning, and hyperparameter tuning. Fine-tuning involves
training the model on a specific task to improve its performance on that task. Transfer learning involves using a
pre-trained model on a related task and fine-tuning it on the target task. Hyperparameter tuning involves
optimizing the hyperparameters of the model to improve its performance.
One key innovation in transformers is the use of positional encodings to incorporate information about the
position of each word in the input sequence. This is necessary because self-attention alone does not take into
account the order of the words in the input sequence.
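One widely used scheme is the sinusoidal positional encoding from the original Transformer paper. The sketch below is a minimal PyTorch version, assuming an even d_model; the class name and the max_len default are illustrative choices.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to the input embeddings."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1).float()  # (max_len, 1)
        # Frequencies 1 / 10000^(2i / d_model) for each pair of dimensions.
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer is saved with the model but is not a trainable parameter.
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encoding of the first seq_len positions.
        return x + self.pe[:, : x.size(1)]


encoding = PositionalEncoding(d_model=16)
encoded = encoding(torch.zeros(1, 4, 16))
print(encoded.shape)
```

Because the input here is all zeros, the output is the positional signal itself: every position receives a different vector, which gives the attention layers access to word order.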
Training a transformer involves minimizing a loss function that measures the difference between the model's
predictions and the true outputs. The most common loss function used in NLP tasks is cross-entropy loss.
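A minimal training loop with cross-entropy loss is sketched below, assuming PyTorch is installed. To stay short, a tiny embedding-plus-linear model stands in for the Transformer, and the token IDs are random; both are assumptions for illustration. The same loop applies to any model that returns one score per vocabulary entry for each position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model = 20, 16

# Stand-in model (assumption): any module producing (batch, seq, vocab) logits fits here.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

# Random token IDs as toy inputs and targets.
inputs = torch.randint(0, vocab_size, (4, 6))   # (batch, sequence length)
targets = torch.randint(0, vocab_size, (4, 6))

losses = []
for step in range(20):
    optimizer.zero_grad()
    logits = model(inputs)  # (batch, seq, vocab)
    # CrossEntropyLoss expects (N, classes) logits and (N,) class indices.
    loss = criterion(logits.view(-1, vocab_size), targets.view(-1))
    loss.backward()   # backpropagation
    optimizer.step()  # parameter update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss should fall over the steps as the model fits the toy data; on a real dataset the loop would iterate over batches from a DataLoader instead of reusing one batch.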
Transformers have achieved state-of-the-art performance on a wide range of NLP tasks, including machine
translation, language modeling, sentiment analysis, and question answering. They have also been used to generate
natural language text, such as in the popular GPT (Generative Pre-trained Transformer) series of models.
However, transformers are computationally expensive and require large amounts of data to train, making them
difficult to train from scratch for many researchers and developers.
Creating a transformer from scratch using PyTorch involves using several PyTorch modules and companion
libraries, including:
1. torch.nn: This library provides a wide range of neural network layers and functions, such as linear layers,
convolutional layers, activation functions, and loss functions. It is the main library used for building the
transformer model.
2. torch.nn.functional: This library provides many of the same functions as torch.nn, but as functional
interfaces rather than object-oriented interfaces. It is often used in conjunction with torch.nn to define custom
neural network modules.
3. torch.optim: This library provides a variety of optimization algorithms, such as stochastic gradient descent
(SGD), Adam, and Adagrad. These algorithms are used to update the parameters of the transformer model
during training.
4. torch.utils.data: This library provides utilities for loading and preprocessing data, such as Dataset and
DataLoader. These are used to load the training and validation data and to create batches for training the
model.
5. torchtext: This is a library that provides tools for working with text data, such as tokenization, vocabulary
creation, and dataset loading. It is often used in conjunction with PyTorch for NLP tasks.
6. tqdm: This is a library that provides a progress bar for long-running operations, such as training a neural
network. It is often used to monitor the progress of the training process.
Other libraries may also be used depending on the specific requirements of the project. For example, if the
transformer model needs to be deployed on a web server, additional libraries such as Flask or Django may be
required.
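As a rough illustration of how the modules listed above fit together, the following minimal sketch trains a single self-attention layer for one pass over random data. It is not a full transformer: the class name SelfAttention, the layer size, the identity task and the toy tensors are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative only)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** 0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model).
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / self.scale
        weights = F.softmax(scores, dim=-1)  # each row of weights sums to 1
        return weights @ self.v(x)


torch.manual_seed(0)
# Toy data (assumed): 8 random sequences, with the input reused as the target.
inputs = torch.randn(8, 5, 16)
loader = DataLoader(TensorDataset(inputs, inputs), batch_size=4)

model = SelfAttention(d_model=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for batch_x, batch_y in loader:
    optimizer.zero_grad()
    loss = F.mse_loss(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()

output = model(inputs)
```

Here torch.nn supplies the layers, torch.nn.functional the softmax and loss, torch.optim the parameter update, and torch.utils.data the batching.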
Conclusion:
In this assignment, we have created a Transformer model from scratch using PyTorch.
Assignment: 5
Title of the assignment:
Morphology is the study of the way words are built up from smaller meaning-bearing units.
Study and understand the concepts of morphology by the use of an add-delete table.
Prerequisite:
Theory:
Introduction to morphology:
Morphology is the study of words and their parts. Morphemes, like prefixes, suffixes and base words, are
defined as the smallest meaningful units of language. Morphemes are important for phonics in both reading
and spelling, as well as in vocabulary and comprehension.
Teaching morphemes unlocks the structures and meanings within words. It is very useful to have a strong
awareness of prefixes, suffixes and base words. These are often spelt the same across different words, even
when the sound changes, and often have a consistent purpose and/or meaning.
Types of morphemes:
un- as in un+happy
mis- as in mis+fortune
-er as in teach+er
In the example un+system+atic+al+ly, there is a root word (system) and bound morphemes that
attach to the root (un-, -atic, -al, -ly).
system = root
un-, -atic, -al, -ly = bound morphemes
If two free morphemes are joined together they create a compound word.
Add-delete tables are a useful tool for analyzing the structure of words. An add-delete table is a table that
shows the changes in meaning and form that occur when prefixes and suffixes are added or deleted from a
word. By using an add-delete table, you can break down a word into its constituent morphemes and analyze
its meaning and form.
Procedure:
1. Choose five words and break them down into their constituent morphemes. For example, the word
"unhappily" can be broken down into "un-" (a prefix meaning "not"), "happy" (a free morpheme meaning
"feeling or showing pleasure"), and "-ly" (a suffix indicating manner or quality).
2. Create an add-delete table for each word, listing the changes in meaning and form that occur when
prefixes and suffixes are added or deleted. For example:
Word: unhappily
3. Analyze the add-delete tables and discuss the meaning and form of each word.
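The idea of an add-delete table can be sketched in a few lines of Python. The sketch below assumes a simple suffix-only view: for each word form it reports which characters are deleted from the end of the root and which are added. The function name add_delete_entry and the chosen word forms are hypothetical, picked only for illustration.

```python
def add_delete_entry(root: str, form: str) -> dict:
    """Describe how to turn `root` into `form` by deleting and adding a suffix."""
    # Length of the longest common prefix of the two strings.
    common = 0
    for a, b in zip(root, form):
        if a != b:
            break
        common += 1
    return {
        "form": form,
        "delete": root[common:],  # characters removed from the end of the root
        "add": form[common:],     # characters appended afterwards
    }


# Hypothetical word forms chosen for this example.
table = [add_delete_entry("happy", form)
         for form in ("happy", "happily", "happiness", "happier")]

for row in table:
    print(f"{row['form']:<10} delete={row['delete']!r:<5} add={row['add']!r}")
```

This sketch handles suffixes only; a prefix such as un- would need a separate column or a second pass from the start of the word.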
Conclusion:
Through the use of add-delete tables, we can better understand the structure of words and the meaning of
their constituent morphemes. By breaking down words into their component parts, we can analyze their
meaning and form and gain a deeper understanding of language.
Assignment: 6
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
1. Legacy Data
2. Sources of Legacy Data
3. How to import legacy data step by step
Step 2: Click on Get Data; the following list will be displayed. Select Excel.
Step 3: Select the required file and click Open. The Navigator screen appears.
Step 7: Paste the URL http://services.odata.org/V3/Northwind/Northwind.svc/ and click OK.
Note: If you just want to see a preview, click on the table name without selecting the checkbox.
Conclusion: In this way, we import legacy datasets using the Power BI tool.
Assignment: 7
Prerequisite:
Basics of SQL.
Concept of data extraction.
Data extraction is the first step of ETL. There are 2 types of data extraction:
1. Full Extraction: All the data from the source systems or operational systems is
extracted to the staging area. (Initial Load)
Source System Performance: The extraction strategy should not affect source
system performance.
Data transformation is the second step. After the data is extracted, it must be transformed
to match the target system. The main points of data transformation are:
Data extracted from the source system is in raw format. It must be transformed before
it is loaded into the target server.
Data has to be cleaned, mapped and transformed.
Standardizing data: Data is fetched from multiple sources, so it needs to be standardized as per
the target system.
Character set conversion: Character sets need to be converted as per the target system.
(First name and last name example)
Calculated and derived values: If the source system has a first val and a second val, the target
may need a value calculated from the first val and second val.
Data conversion in different formats: If the source system stores dates in DDMMYY format and
the target uses DDMONYYYY format, this conversion needs to be done in the
transformation phase.
The data loading phase loads the prepared data from staging tables to main tables.
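The transformation points above can be illustrated with a small Python sketch. It assumes a raw source row held in a dictionary; the column names (first_name, last_name, first_val, second_val, order_date), the sample values and the choice of addition for the derived value are hypothetical and chosen only for this example.

```python
from datetime import datetime

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def transform(row: dict) -> dict:
    """Clean one raw source row and map it to the (assumed) target layout."""
    # Standardize: trim whitespace and normalize capitalization of names.
    first = row["first_name"].strip().title()
    last = row["last_name"].strip().title()
    # Format conversion: DDMMYY in the source, DDMONYYYY in the target.
    date = datetime.strptime(row["order_date"], "%d%m%y")
    return {
        "full_name": f"{first} {last}",
        # Derived value: computed from two source columns (a sum is assumed here).
        "total_val": row["first_val"] + row["second_val"],
        "order_date": f"{date.day:02d}{MONTHS[date.month - 1]}{date.year}",
    }


# Made-up raw row as it might arrive in the staging area.
raw = {"first_name": "  asha ", "last_name": "PATIL",
       "first_val": 120, "second_val": 30, "order_date": "050323"}
staged = transform(raw)
print(staged)
```

In a real ETL tool such as SSIS these steps would be configured as transformation components rather than written by hand; the sketch only shows what each step does to the data.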
Step 1 − Open either BIDS or SSDT, based on the version, from the Microsoft SQL
Server programs group. The following screen appears.
Step 2 − The above screen shows SSDT has opened. Go to File at the top left corner in
the above image and click New. Select Project and the following screen opens.
Step 3 − Select Integration Services under Business Intelligence on the top left corner in
the above screen to get the following screen.
Step 4 − In the above screen, select either Integration Services Project or Integration
Services Import Project Wizard based on your requirement to develop or create the package.
Modes
There are two modes − Native Mode (SQL Server Mode) and SharePoint Mode.
Models
There are two models − Tabular Model (For Team and Personal Analysis) and
Multidimensional Model (For Corporate Analysis).
BIDS (Business Intelligence Development Studio, till 2008 R2) and SSDT (SQL Server Data Tools,
from 2012) are environments to work with SSAS.
Step 1 − Open either BIDS\SSDT based on the version from the Microsoft SQL Server
programs group. The following screen will appear.
Step 2 − The above screen shows SSDT has opened. Go to file on the top left corner in the
above image and click New. Select project and the following screen opens.
Step 3 − Select Analysis Services in the above screen under Business Intelligence as seen on
the top left corner. The following screen pops up.
Step 4 − In the above screen, select any one option from the listed five options based on your
requirement to work with Analysis services.
In this step you remove all columns except ProductID, ProductName, UnitsInStock, and
QuantityPerUnit
Power BI Desktop includes Query Editor, which is where you shape and transform your data
connections. Query Editor opens automatically when you select Edit from Navigator. You can
also open the Query Editor by selecting Edit Queries from the Home ribbon in Power BI
Desktop. The following steps are performed in Query Editor.
1. In Query Editor, select the ProductID, ProductName, QuantityPerUnit, and
UnitsInStock columns (use Ctrl+Click to select more than one column, or
Shift+Click to select columns that are beside each other).
2. Select Remove Columns > Remove Other Columns from the ribbon, or right-click
on a column header and click Remove Other Columns.
When Query Editor connects to data, it reviews each field to determine the best data type.
For the Excel workbook, products in stock will always be a whole number, so in this step you
confirm the UnitsInStock column's data type is Whole Number.
1. Select the UnitsInStock column.
2. Select the Data Type drop-down button in the Home ribbon.
3. If not already a Whole Number, select Whole Number for data type from the drop down
(the Data Type: button also displays the data type for the current selection).
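The effect of the two shaping steps above (Remove Other Columns, then setting the data type to Whole Number) can be sketched in Python. This is only an analogy for what Query Editor does; the sample rows, their values and the extra Discontinued column are made up for the example.

```python
KEEP = ("ProductID", "ProductName", "QuantityPerUnit", "UnitsInStock")

# Toy rows (made up); UnitsInStock arrives as text, as it might from a file.
products = [
    {"ProductID": 1, "ProductName": "Chai", "QuantityPerUnit": "10 boxes",
     "UnitsInStock": "39", "Discontinued": False},
    {"ProductID": 2, "ProductName": "Chang", "QuantityPerUnit": "24 bottles",
     "UnitsInStock": "17", "Discontinued": False},
]


def shape(row: dict) -> dict:
    # "Remove Other Columns": keep only the selected columns.
    kept = {name: row[name] for name in KEEP}
    # "Whole Number": convert the stock count to an integer.
    kept["UnitsInStock"] = int(kept["UnitsInStock"])
    return kept


shaped = [shape(row) for row in products]
print(shaped)
```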
In this step, you expand the Order_Details table that is related to the Orders table, to combine
the ProductID, UnitPrice, and Quantity columns from Order_Details into the Orders table.
This is a representation of the data in these tables:
The Expand operation combines columns from a related table into a subject table. When the
query runs, rows from the related table (Order_Details) are combined into rows from the
subject table (Orders).
After you expand the Order_Details table, three new columns and additional rows are added
to the Orders table, one for each row in the nested or related table.
1. In the Query View, scroll to the Order_Details column.
2. In the Order_Details column, select the expand icon.
3. In the Expand drop-down:
a. Select (Select All Columns) to clear all columns.
b. Select ProductID, UnitPrice, and Quantity.
c. Click OK.
2. In the Add Custom Column dialog box, in the Custom Column Formula textbox, enter
[Order_Details.UnitPrice] * [Order_Details.Quantity].
3. In the New column name textbox, enter LineTotal.
4. Click OK.
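The Expand operation and the LineTotal custom column can be pictured with the following Python sketch. It assumes each order carries its related Order_Details rows as a nested list; the order numbers, prices and quantities are made-up toy values.

```python
# Toy Orders table with nested Order_Details rows (all values made up).
orders = [
    {"OrderID": 1, "Order_Details": [
        {"ProductID": 11, "UnitPrice": 14.0, "Quantity": 12, "Discount": 0.0},
        {"ProductID": 42, "UnitPrice": 9.5, "Quantity": 10, "Discount": 0.0},
    ]},
    {"OrderID": 2, "Order_Details": [
        {"ProductID": 14, "UnitPrice": 18.5, "Quantity": 9, "Discount": 0.0},
    ]},
]

EXPAND = ("ProductID", "UnitPrice", "Quantity")

expanded = []
for order in orders:
    # One output row per related Order_Details row.
    for detail in order["Order_Details"]:
        row = {"OrderID": order["OrderID"]}
        for name in EXPAND:
            row[f"Order_Details.{name}"] = detail[name]
        # Custom column: [Order_Details.UnitPrice] * [Order_Details.Quantity]
        row["LineTotal"] = row["Order_Details.UnitPrice"] * row["Order_Details.Quantity"]
        expanded.append(row)

for row in expanded:
    print(row)
```

Two orders become three rows because the first order has two detail rows, which matches the "additional rows" described for the Expand step.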
3. Once the data is loaded, select the Manage Relationships button on the Home ribbon.
5. When we attempt to create the relationship, we see that one already exists! As shown in the
Create Relationship dialog (by the shaded columns), the ProductID fields in each query
already have an established relationship.
6. We see the following, which visualizes the relationship between the queries.
5. When you double-click the arrow on the line that connects the two queries, an Edit
Relationship dialog appears.
6. No need to make any changes, so we'll just select Cancel to close the Edit
Relationship dialog.
Assignment: 8
Prerequisite:
1. Understanding of Queries in SSMS.
2. Knowledge of Microsoft BIDS Environment.
OLAP:
OLAP (Online Analytical Processing) was introduced into the business intelligence (BI) space over 20 years
ago, at a time when computer hardware and software technology was not nearly as powerful as it is
today. OLAP introduced a groundbreaking way for business users (typically analysts) to easily perform
multidimensional analysis of large volumes of business data.
Aggregating, grouping, and joining data are the most difficult types of queries for a relational database to
process. The magic behind OLAP derives from its ability to pre-calculate and pre-aggregate data. Otherwise,
end users would be spending most of their time waiting for query results to be returned by the database.
However, it is also what causes OLAP-based solutions to be extremely rigid and IT-intensive.
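Pre-aggregation can be sketched in plain Python as follows. The sketch assumes a tiny fact table with three dimensions and computes the total for every combination of dimensions once, so that later queries become simple lookups. The dimension names and sales figures are made up.

```python
from collections import defaultdict
from itertools import combinations

# Toy fact rows: (region, product, year, sales). All values are made up.
facts = [
    ("East", "Pen", 2022, 100),
    ("East", "Book", 2022, 250),
    ("West", "Pen", 2022, 80),
    ("West", "Pen", 2023, 120),
]
DIMENSIONS = ("region", "product", "year")

# Pre-compute the total for every combination of dimensions once.
cube = defaultdict(int)
for *dims, sales in facts:
    for size in range(len(DIMENSIONS) + 1):
        for group in combinations(range(len(DIMENSIONS)), size):
            # Dimensions not in `group` are rolled up and shown as "*".
            key = tuple(dims[i] if i in group else "*"
                        for i in range(len(DIMENSIONS)))
            cube[key] += sales

# Queries are now dictionary lookups instead of scans over the fact rows.
print(cube[("*", "Pen", "*")])     # total Pen sales
print(cube[("West", "*", 2023)])   # West sales in 2023
print(cube[("*", "*", "*")])       # grand total
```

The same sketch also hints at the rigidity mentioned above: every new fact row or new dimension means the stored aggregates must be rebuilt or updated.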
ROLAP:
ROLAP stands for "Relational Online Analytical Processing". It is a type of OLAP (Online Analytical Processing)
that uses a relational database management system (RDBMS) to store and manage data. ROLAP technology allows
users to perform complex queries and analysis on large volumes of data, and to quickly retrieve the results in a tabular
format.
In a ROLAP system, data is stored in a relational database, typically using SQL (Structured Query Language) as the
query language. ROLAP uses SQL queries to aggregate and summarize data across multiple tables in the database,
and to create multidimensional views of the data. These views can then be used to analyze the data and create reports.
ROLAP systems are particularly useful for analyzing large amounts of data, especially in business intelligence and
data warehousing applications.
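A minimal sketch of the ROLAP idea, using Python's built-in sqlite3 module as a stand-in for the relational database: the data stays in an ordinary table and the summary is computed by a SQL query at request time. The table name sales and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("East", "Pen", 100), ("East", "Book", 250),
     ("West", "Pen", 80), ("West", "Book", 120)],
)

# The aggregate is computed by the relational engine when the query runs.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(rows)
```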
MOLAP:
MOLAP stands for Multidimensional Online Analytical Processing. MOLAP uses a multidimensional cube that
accesses stored data through various combinations. Data is pre-computed, pre-summarized, and stored (a difference
from ROLAP, where queries are served on-demand).
A multicube approach has proved successful in MOLAP products. In this approach, a series of dense, small,
precalculated cubes make up a hypercube. Tools that incorporate MOLAP include Oracle Essbase, IBM Cognos, and
Apache Kylin.
Its simple interface makes MOLAP easy to use, even for inexperienced users. Its speedy data retrieval makes it the
best for “slicing and dicing” operations. One major disadvantage of MOLAP is that it is less scalable than ROLAP, as
it can handle a limited amount of data.
HOLAP:
HOLAP stands for Hybrid Online Analytical Processing. As the name suggests, the HOLAP storage mode combines
attributes of both MOLAP and ROLAP. Since HOLAP involves storing part of your data in a ROLAP store and
another part in a MOLAP store, developers get the benefits of both.
With this use of the two OLAPs, the data is stored in both multidimensional databases and relational databases. The
decision to access one of the databases depends on which is most appropriate for the requested processing application
or type. This setup allows much more flexibility for handling data. For theoretical processing, the data is stored in a
multidimensional database. For heavy processing, the data is stored in a relational database.
To create a cube, right-click on Cube and select New Cube.
Assignment: 9
Problem Statement:
Import the data warehouse data in Microsoft Excel and create the Pivot table and Pivot Chart.
Prerequisite:
1. Basics of dataset extensions.
2. Concept of data import.
Data Warehouse:
A data warehouse is a type of data management system that is designed to enable and support business
intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and
analysis and often contain large amounts of historical data. The data within a data warehouse is usually
derived from a wide range of sources such as application log files and transaction applications.
A data warehouse centralizes and consolidates large amounts of data from multiple sources. Its analytical
capabilities allow organizations to derive valuable business insights from their data to improve decision-
making. Over time, it builds a historical record that can be invaluable to data scientists and business analysts.
Because of these capabilities, a data warehouse can be considered an organization’s “single source of truth.”
Pivot Table:
Pivot tables are among the most useful and powerful features in Excel. We use them to summarize the
data stored in a table. They rearrange (or "pivot") statistics to draw attention to the most useful
facts. You can take an extremely large data set and see the relevant information you need in a clean,
concise, manageable way.
Pivot Chart:
A pivot chart in Excel is a visual representation of the data in a pivot table. It gives you the big picture of
your raw data and allows you to analyze it using various types of graphs and layouts. It is well suited to
business presentations that involve large volumes of data.
Data Model is used for building a model where data from various sources can be combined by
creating relationships among the data sources. A Data Model integrates the tables, enabling
extensive analysis using PivotTables, Power Pivot, and Power View.
A Data Model is created automatically when you import two or more tables simultaneously
from a database. The existing database relationships between those tables are used to create the
Data Model in Excel.
Step 3 − In the Get External Data group, click on the option From Access. The Select Data
Source dialog box opens.
Step 5 − The Select Table window, displaying all the tables found in the database, appears.
Step 6 − Tables in a database are similar to the tables in Excel. Check the ‘Enable selection
of multiple tables’ box, and select all the tables. Then click OK.
Step 7 − The Import Data window appears. Select the PivotTable Report option. This option
imports the tables into Excel and prepares a PivotTable for analyzing the imported tables.
Notice that the checkbox at the bottom of the window - ‘Add this data to the Data Model’ is
selected and disabled.
Step 8 − The data is imported, and a PivotTable is created using the imported tables.
Step 1 − You know how to add fields to PivotTable and drag fields across areas. Even if you
are not sure of the final report that you want, you can play with the data and choose the best-
suited report.
In PivotTable Fields, click on the arrow beside the table - Medals to expand it to show the
fields in that table. Drag the NOC_CountryRegion field in the Medals table to the
COLUMNS area.
Step 2 − Drag Discipline from the Disciplines table to the ROWS area.
Step 3 − Filter Discipline to display only five sports: Archery, Diving, Fencing, Figure
Skating, and Speed Skating. This can be done either in PivotTable Fields area, or from the
Row Labels filter in the PivotTable itself.
Step 4 − In PivotTable Fields, from the Medals table, drag Medal to the VALUES area.
Step 5 − From the Medals table, select Medal again and drag it into the FILTERS area.
Step 6 − Click the dropdown list button to the right of the Column labels.
The Value Filters dialog box appears, with the condition "Count of Medal is greater than".
The PivotTable displays only those regions that have more than 80 medals in total.
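What the PivotTable computes in the steps above can be sketched in Python: Discipline goes to the rows, NOC_CountryRegion to the columns, the count of Medal to the values, and a value filter keeps only regions above a threshold. The medal records are made-up toy data, so the threshold is set to 1 here instead of 80.

```python
from collections import Counter

# Toy medal records: (NOC_CountryRegion, Discipline, Medal). Made-up data.
medals = [
    ("USA", "Diving", "Gold"), ("USA", "Diving", "Silver"),
    ("USA", "Archery", "Gold"), ("CHN", "Diving", "Gold"),
    ("KOR", "Archery", "Gold"), ("KOR", "Archery", "Bronze"),
]
THRESHOLD = 1  # the manual uses 80 on the full data set

# VALUES area: count of Medal for each (row, column) cell.
counts = Counter((discipline, region) for region, discipline, _ in medals)

# Value filter: keep regions whose total medal count is greater than THRESHOLD.
totals = Counter(region for region, _, _ in medals)
kept = sorted(region for region, n in totals.items() if n > THRESHOLD)

# Print the pivot: one line per Discipline, one column per kept region.
print("Discipline", *kept)
for discipline in sorted({d for d, _ in counts}):
    print(discipline, *(counts[(discipline, region)] for region in kept))
```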
Relationships let you analyze your collections of the data in Excel, and create interesting and
aesthetic reports from the data you import.
Step 2 − Create a new table with new data. Name the new table as Sports.
Step 3 − Now you can create relationship between this new table and the other tables that
already exist in the Data Model in Excel. Rename the Sheet1 as Medals and Sheet2 as
Sports.
On the Medals sheet, in the PivotTable Fields List, click All. A complete list of available
tables will be displayed. The newly added table - Sports will also be displayed.
Step 4 − Click on Sports. In the expanded list of fields, select Sports. Excel prompts you to
create a relationship between tables.
Step 6 − To create the relationship, one of the tables must have a column of unique, non-
repeated, values. In the Disciplines table, SportID column has such values. The table Sports
that we have created also has the SportID column. In Table, select Disciplines.
Step 9 − In Related Column (Primary), SportID gets selected automatically. Click OK.
Step 10 − The PivotTable is modified to reflect the addition of the new Data Field Sport.
Adjust the order of the fields in the Rows area to maintain the Hierarchy. In this case, Sport
should be first and Discipline should be the next, as Discipline will be nested in Sport as
a sub-category.
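The requirement that one side of a relationship holds unique, non-repeated values can be illustrated with a short Python sketch. The helper name relate and the toy SportID values are hypothetical; the sketch treats the second table as the lookup (primary) side and refuses to build the relationship if its key column repeats.

```python
def relate(rows, lookup, key):
    """Join `rows` to `lookup` on `key`; `lookup` must hold unique key values."""
    index = {}
    for entry in lookup:
        if entry[key] in index:
            raise ValueError(f"duplicate {key}: {entry[key]!r}")
        index[entry[key]] = entry
    # Rows without a match in the lookup table are dropped (inner join).
    return [{**index[row[key]], **row} for row in rows if row[key] in index]


# Toy tables; the SportID values and names are made up.
disciplines = [
    {"SportID": "S1", "Discipline": "Diving"},
    {"SportID": "S2", "Discipline": "Archery"},
]
sports = [
    {"SportID": "S1", "Sport": "Aquatics"},
    {"SportID": "S2", "Sport": "Archery"},
]

joined = relate(disciplines, sports, "SportID")
for row in joined:
    print(row["Sport"], ">", row["Discipline"])
```

Once the tables are related, a field from either table can be used in the same report, which is why Sport can be placed above Discipline in the Rows area.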
Prerequisite:
Basics of data classification.
Concept of data clustering.
Classification is the process of putting something into a category. Classification of all your clothes by color
may make it easier for you to put together an outfit, especially if you favor a monochrome look.
Classification involves putting things into a class or group according to particular characteristics so it’s easier
to make sense of them, whether you’re organizing your shoes, your stock portfolio, or a group of
invertebrates. If you’re an international spy, you might know that classification also can mean a government’s
system for keeping secrets. If you have a high level of security classification, then you know really top secret
stuff.
CLUSTERING:
Clustering is an unsupervised machine learning task. You might also hear this referred to as cluster analysis because of
the way this method works.
Using a clustering algorithm means you're going to give the algorithm a lot of input data with no labels and let it find
any groupings in the data it can.
Those groupings are called clusters. A cluster is a group of data points that are similar to each other based on their
relation to surrounding data points. Clustering is used for things like feature engineering or pattern discovery.
When you're starting with data you know nothing about, clustering might be a good place to get some insight.
There are different types of clustering algorithms that handle all kinds of unique data.
Density-based
In density-based clustering, data is grouped by areas of high concentrations of data points surrounded by areas of low
concentrations of data points. Basically the algorithm finds the places that are dense with data points and calls those
clusters.
The great thing about this is that the clusters can be any shape. You aren't constrained to expected conditions.
The clustering algorithms under this type don't try to assign outliers to clusters, so they get ignored.
Distribution-based
With a distribution-based clustering approach, all of the data points are considered parts of a cluster based on the
probability that they belong to a given cluster.
It works like this: there is a center-point, and as the distance of a data point from the center increases, the probability of
it being a part of that cluster decreases.
If you aren't sure what the distribution of your data might be, you should consider a different type of algorithm.
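The relationship between distance from the center and membership probability can be sketched as follows. The sketch assumes two one-dimensional Gaussian clusters with equal prior weight; the cluster names, means and standard deviations are made up for illustration.

```python
import math


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution; it falls as x moves away from the mean."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


# Two assumed 1-D clusters, each described by (mean, standard deviation).
clusters = {"A": (0.0, 1.0), "B": (5.0, 1.0)}


def membership(x: float) -> dict:
    """Probability that x belongs to each cluster (equal priors assumed)."""
    density = {name: gaussian_pdf(x, m, s) for name, (m, s) in clusters.items()}
    total = sum(density.values())
    return {name: d / total for name, d in density.items()}


near = membership(0.5)    # close to the center of cluster A
middle = membership(2.5)  # exactly halfway between the two centers
print(near)
print(middle)
```

A full distribution-based method, such as a Gaussian mixture model, would also estimate the means and standard deviations from the data instead of fixing them in advance.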
Centroid-based
Centroid-based clustering is the one you probably hear about the most. It's a little sensitive to the initial parameters you
give it, but it's fast and efficient.
These types of algorithms separate data points based on multiple centroids in the data. Each data point is assigned to a
cluster based on its squared distance from the centroid. This is the most commonly used type of clustering.
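A minimal k-means sketch in plain Python shows the centroid-based idea: each point is assigned to the centroid with the smallest squared distance, and each centroid is then moved to the mean of its points. The sample points, the starting centroids and the fixed iteration count are assumptions made for this example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Toy 2-D points forming two visible groups (made up).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Starting from different initial centroids can give different clusters, which is the sensitivity to initial parameters mentioned above.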
Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you would get from a company database or
taxonomies. It builds a tree of clusters so everything is organized from the top-down.
This is more restrictive than the other clustering types, but it's perfect for specific kinds of data sets.
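The tree of clusters can be sketched with a small agglomerative example. Note that this sketch builds the tree bottom-up (merging the two closest clusters at each step, using single linkage), which is one common variant; the resulting tree can still be read from the top down. The function name agglomerate and the input numbers are made up.

```python
def agglomerate(values):
    """Bottom-up clustering of numbers; returns the merge order (single linkage)."""
    clusters = [[v] for v in values]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest members are nearest.
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y)
                               for x in clusters[ab[0]]
                               for y in clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = agglomerate([1, 2, 9, 10, 25])
for step, group in enumerate(merges, start=1):
    print(step, group)
```

Each merge is one internal node of the tree; the last merge is the root that contains every value.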
TIME SERIES
Time Series Data Analysis is a way of studying the characteristics of the response variable with respect to time as the independent
variable. To estimate the target variable for prediction or forecasting, the time variable is used as the point of reference. A
Time Series represents a series of time-ordered observations. The intervals may be Years, Months, Weeks, Days, Hours, Minutes, or Seconds. It is
a sequence of observations taken at successive, discrete time intervals.
The time variable/feature is the independent variable and supports the target variable to predict the results. Time Series Analysis
(TSA) is used in different fields for time-based predictions – like Weather Forecasting models, Stock market predictions, Signal
processing, Engineering domain – Control Systems, and Communications Systems. Since TSA involves producing the set of
information in a particular sequence, this makes it distinct from spatial and other analyses. We could predict the future using AR,
MA, ARMA, and ARIMA models.
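As a minimal illustration of the simplest of these models, the following Python sketch fits an AR(1) model, x[t] = c + phi * x[t-1], by ordinary least squares and makes a one-step forecast. The function name fit_ar1 and the toy series are assumptions made for this example; in practice a statistics package would normally be used.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, nxt = series[:-1], series[1:]
    mean_prev = sum(prev) / len(prev)
    mean_next = sum(nxt) / len(nxt)
    cov = sum((p - mean_prev) * (n - mean_next) for p, n in zip(prev, nxt))
    var = sum((p - mean_prev) ** 2 for p in prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


# Toy series (made up) in which every value is half of the previous one.
series = [16.0, 8.0, 4.0, 2.0, 1.0]
c, phi = fit_ar1(series)

# One-step-ahead forecast from the last observed value.
forecast = c + phi * series[-1]
print(c, phi, forecast)
```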
Consider the annual rainfall details at a place starting from January 2012. We create an R time series object
for a period of 12 months and plot it.
# Get the data points in the form of an R vector.
rainfall <- c(799,1174.8,865.1,1334.6,635.4,918.5,685.5,998.6,784.2,985,882.8,1071)
Output:
When we execute the above code, it produces the following result and chart −