
UNIT-2

DATA, Database,
Data Warehouse
and Data Mining
DATA
Data is the plural of 'datum', originally a Latin noun meaning
"something given."

Today, data means "facts or pieces of information" or just "information",
especially facts or numbers in an electronic form that can be stored and
used by a computer.

According to a Forbes article, 2.5 quintillion bytes of data are created
each day at our current pace, and that pace is only accelerating with the
growth of the Internet of Things (IoT). Ninety percent of the data in the
world was generated over the last two years alone.
VECTOR
● In computer programming, a vector is either a pointer or an array with only one
dimension.
● A vector, in programming, is a type of array that is one dimensional.
● A vector is often represented as a 1-dimensional array of numbers, referred to as
components, and is displayed in either column form or row form.

● Vectors are a logical element in programming languages that are used for storing data.

● Vectors are similar to arrays, but their actual implementation and operation differ.

● Furthermore, vectors can typically hold any object.

● Each item in the vector has to be of the same type and size (illustrated in the sketch below).
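As a minimal sketch (assuming Python and its standard array module; the example data is
hypothetical, not from these slides), a one-dimensional vector of same-typed components can
be created and viewed in row or column form:

    from array import array

    # A typed one-dimensional vector: every component must be the same type
    # ('d' means double-precision float), mirroring the "same type" rule above.
    v = array('d', [1.0, 2.5, 4.0])

    print(v[0])                      # components are accessed by index, like an array

    # Row form vs. column form is only a display convention for the same components.
    row_form = list(v)               # [1.0, 2.5, 4.0]
    column_form = [[x] for x in v]   # [[1.0], [2.5], [4.0]]
    print(row_form, column_form)
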
DATA FRAME
A DataFrame is a data structure that organizes data
into a 2-dimensional table of rows and columns,
much like a spreadsheet.

Every DataFrame contains a blueprint, known as a schema, that defines the
name and data type of each column.

However, the difference is that while a spreadsheet sits on one computer
in one specific location, a DataFrame can span thousands of computers. In
this way, DataFrames make it possible to do analytics on big data using
distributed computing clusters.
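As an illustration (a hypothetical example using the pandas library, which runs on a single
machine; distributed engines such as Apache Spark apply the same idea across a cluster), a
small DataFrame and its inferred schema might look like this:

    import pandas as pd

    # Each dictionary key becomes a column; each list holds that column's values.
    books = pd.DataFrame({
        "title": ["Dune", "Emma"],
        "year":  [1965, 1815],
        "price": [9.99, 7.50],
    })

    print(books)          # the 2-dimensional table of rows and columns
    print(books.dtypes)   # the inferred "schema": name and data type of each column
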
TYPES OF DATA SOURCE
1. Structured Data:
Structured data is highly organized and formatted in a specific way,
typically stored in databases with a clear schema defining the data
types and relationships.
Characteristics:
- Easily searchable and queryable.
- Follows a defined schema.
- Suitable for traditional data analysis techniques.
TYPES OF DATA SOURCE
2. Semi-Structured Data:
Semi-structured data lacks the rigid structure of structured data but still
has some level of organization. It does not conform to a fixed schema but
may have tags, markers, or other mechanisms for organizing and labeling
the data.
Characteristics:
- Less rigid structure compared to structured data.
- May contain repeating elements or nested structures.
- Suitable for scenarios where data schemas may evolve over time (see the example below).
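For example (a hypothetical JSON snippet, not taken from these slides), semi-structured
records carry their own tags but need not share a fixed schema, and Python's standard json
module can parse them:

    import json

    # Two records with slightly different fields -- allowed, since there is no fixed schema.
    raw = '[{"id": 1, "name": "Asha", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
    records = json.loads(raw)

    for r in records:
        # Missing keys must be handled explicitly, because the structure can vary per record.
        print(r["name"], r.get("tags", []))
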
TYPES OF DATA SOURCE
3. Unstructured Data:

Unstructured data lacks a specific organization or structure. It is typically
stored in its native format without any predefined schema, making it
challenging to analyze using traditional methods.

Characteristics:

- No predefined structure or organization.

- Difficult to analyze using traditional databases or methods.

- Requires advanced techniques, such as natural language processing, for analysis.
DATA WAREHOUSE V/S DATABASE
RELATIONAL V/S NON-RELATIONAL DATABASE
RDBMS DATA STRUCTURE

RDBMS (Relational Database Management System) data structure refers to the
way data is organized and stored within a relational database. In an RDBMS,
data is structured into tables, which consist of rows and columns. The key
components of the data structure in an RDBMS are:

1. Tables: Tables are the fundamental structure in an RDBMS. Each table
represents a specific entity or concept within the database. For example,
in a database for a library, there might be tables for books, authors, and
borrowers.
RDBMS DATA STRUCTURE

2. Rows: Rows, also known as records or tuples, represent individual
instances or entries within a table. Each row corresponds to a single
entity or data record. For example, in a table representing books, each
row might represent a specific book with its attributes such as title,
author, and publication year.

3. Columns: Columns, also known as attributes or fields, represent the
characteristics or properties of the entities stored in the table. Each
column corresponds to a specific piece of data for every row in the table.
For example, in a table representing books, columns might include title,
author, publication year, and ISBN.
RDBMS DATA STRUCTURE

4. Primary Key: A primary key is a column or combination of columns that
uniquely identifies each row in a table. It ensures that there are no
duplicate rows and enables efficient data retrieval and indexing. Primary
keys enforce entity integrity within the database.

5. Foreign Key: A foreign key is a column or set of columns in one table
that refers to the primary key in another table. It establishes relationships
between tables and enforces referential integrity, ensuring consistency and
integrity of data across multiple tables (see the sketch below).
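A minimal sketch of these keys, assuming Python's built-in sqlite3 module and the
hypothetical library schema mentioned above:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

    # Primary key: uniquely identifies each author row.
    conn.execute("""CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL)""")

    # Foreign key: books.author_id must match an existing authors.author_id.
    conn.execute("""CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(author_id))""")

    conn.execute("INSERT INTO authors VALUES (1, 'Frank Herbert')")
    conn.execute("INSERT INTO books VALUES (10, 'Dune', 1)")
    # Inserting a book with author_id = 99 would now raise an IntegrityError.
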
RDBMS DATA STRUCTURE

6. Indexes: Indexes are data structures that improve the speed of data
retrieval operations on tables. They are created on one or more columns
of a table to facilitate fast lookup of data based on those columns.
Indexes are especially useful for columns frequently used in queries and
joins.

7. Constraints: Constraints are rules enforced on the data stored in tables
to maintain data integrity and consistency. Common constraints include
NOT NULL, UNIQUE, CHECK, and DEFAULT constraints. They ensure that data
entered into the database meets certain criteria specified by the database
schema (see the sketch below).
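A small, self-contained sketch of an index and common constraints, again assuming Python's
sqlite3 module and a hypothetical borrowers table:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    conn.execute("""CREATE TABLE borrowers (
        borrower_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,                -- NOT NULL constraint
        email       TEXT NOT NULL UNIQUE,         -- UNIQUE constraint
        max_loans   INTEGER DEFAULT 3,            -- DEFAULT constraint
        age         INTEGER CHECK (age >= 0)      -- CHECK constraint
    )""")

    # An index on a frequently queried column speeds up lookups and joins on that column.
    conn.execute("CREATE INDEX idx_borrowers_name ON borrowers(name)")

    conn.execute("INSERT INTO borrowers (borrower_id, name, email, age) "
                 "VALUES (1, 'Asha', 'asha@example.com', 30)")
    print(conn.execute("SELECT max_loans FROM borrowers WHERE name = 'Asha'").fetchone())
    # -> (3,)  the DEFAULT constraint supplied max_loans automatically
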
COLUMNAR DATA STRUCTURES
Columnar data structures refer to a method of organizing and storing data in
which data for each column is stored together separately from the data in
other columns. This is in contrast to row-based storage, where data for each
row is stored together. In columnar data structures, all values for a specific
column are stored contiguously, allowing for more efficient storage and
retrieval of columnar data. Some key characteristics and advantages of
columnar data structures:

1. Storage Efficiency: Columnar storage minimizes the amount of disk I/O
required to access specific columns, especially when only a subset of
columns is needed for a query. This efficiency is achieved because only the
relevant columns need to be read from disk, reducing data transfer times
(a small layout sketch follows).
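A toy illustration (hypothetical data, plain Python) of the same records held row-wise
versus column-wise, showing why a query that touches one column can skip the rest:

    # Row-oriented layout: each record keeps all of its columns together.
    rows = [
        {"title": "Dune", "year": 1965, "price": 9.99},
        {"title": "Emma", "year": 1815, "price": 7.50},
    ]

    # Column-oriented layout: all values of one column are stored contiguously.
    columns = {
        "title": ["Dune", "Emma"],
        "year":  [1965, 1815],
        "price": [9.99, 7.50],
    }

    # Averaging prices needs only the "price" column, not every field of every record.
    print(sum(columns["price"]) / len(columns["price"]))
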
COLUMNAR DATA STRUCTURES
2. Compression: Columnar storage allows for effective compression
techniques to be applied at the column level. Since columns often
contain repetitive or similar values, compression algorithms can exploit
this redundancy to reduce storage space without loss of data integrity.

3. Data Encoding: Columnar data structures often utilize specialized
encoding schemes optimized for each column's data type. For example,
numerical columns might be encoded using delta encoding or run-length
encoding, while string columns might use dictionary encoding. These
encoding schemes further improve storage efficiency and query performance
(a run-length encoding sketch follows).
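As a minimal sketch (illustration only, plain Python) of run-length encoding, the kind of
scheme a columnar store might apply to a column with many repeated values:

    def rle_encode(values):
        """Collapse consecutive repeats into [value, run_length] pairs."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1        # extend the current run
            else:
                runs.append([v, 1])     # start a new run
        return runs

    status_column = ["OK", "OK", "OK", "FAIL", "OK", "OK"]
    print(rle_encode(status_column))    # [['OK', 3], ['FAIL', 1], ['OK', 2]]
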
COLUMNAR DATA STRUCTURES
4. Query Performance: Columnar storage can significantly improve
query performance, especially for analytical workloads that involve
aggregations, filtering, and selective projections of specific columns.
Operations like summation, counting, and averaging can be performed
more efficiently on columnar data due to its structure.

5. Analytics and Data Processing: Columnar data structures are well-suited
for analytical processing and data warehousing applications where complex
queries need to be executed against large datasets. Analytical queries
often involve scanning and aggregating large volumes of data, and columnar
storage helps optimize these operations for better performance.
COLUMNAR DATA STRUCTURES
6. Parallel Processing: Columnar storage facilitates parallel processing
of queries across multiple CPU cores or distributed computing nodes.
Since data for each column is stored separately, query operations can be
parallelized more effectively, leading to faster query execution times.

Popular columnar formats and databases such as Apache Parquet, Apache ORC
(Optimized Row Columnar), and ClickHouse leverage columnar data structures
to provide efficient storage and processing capabilities for big data
analytics and data warehousing use cases.
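As a small sketch (assuming pandas with a Parquet engine such as pyarrow installed; the data
is hypothetical), writing a table to the columnar Parquet format and reading back only the
column a query needs:

    import pandas as pd

    df = pd.DataFrame({"title": ["Dune", "Emma"],
                       "year":  [1965, 1815],
                       "price": [9.99, 7.50]})

    # Parquet stores each column separately, which enables column pruning on read.
    df.to_parquet("books.parquet")
    prices = pd.read_parquet("books.parquet", columns=["price"])
    print(prices)
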
DATA ISSUES
Data Privacy Concerns:
With the increasing amount of personal data collected by
organizations, concerns about data privacy have become more
pronounced. Issues such as data breaches, unauthorized data
sharing, and the misuse of personal information have raised
questions about the protection of individual privacy rights.

Data Security Breaches:
Data breaches continue to be a significant concern for organizations
across industries. Cyberattacks targeting sensitive data, such as
financial information, personal records, and intellectual property, have
become more sophisticated and frequent. These breaches not only compromise
data integrity but also damage the reputation and trust of the affected
organizations.
DATA ISSUES

Bias in AI and Machine Learning Algorithms:
Bias in AI and machine learning algorithms remains a critical issue,
especially in applications such as hiring, lending, and criminal
justice. Biased training data, algorithmic biases, and inadequate
algorithmic transparency can lead to unfair outcomes, perpetuate
societal inequalities, and undermine trust in automated
decision-making systems.

Data Quality and Governance:
Ensuring data quality and governance continues to be a challenge
for organizations dealing with large volumes of data from diverse
sources. Issues such as data inconsistency, incompleteness, and
inaccuracies can undermine the reliability and usability of data for
decision-making and analytics purposes.
DATA ISSUES
Data Sovereignty and Compliance:
Data sovereignty issues arise from the jurisdictional differences in
data protection and privacy laws across countries. Compliance with
regulations such as the GDPR (General Data Protection Regulation) in
the EU, CCPA (California Consumer Privacy Act) in California, and
similar laws in other regions presents challenges for multinational
organizations in managing and transferring data across borders.

Data Integration Challenges:
Integrating data from disparate sources, including structured and
unstructured data, poses significant challenges for organizations
seeking to derive insights from their data assets. Issues such as data
silos, incompatible data formats, and data integration complexity
hinder efforts to create a unified view of data across the organization.
DATA MINING
● Data mining is the process of analyzing a large batch of
information to discern trends and patterns.
● Data mining can be used by corporations for everything from
learning about what customers are interested in or want to buy
to fraud detection and spam filtering.
● Data mining programs break down patterns and connections in
data based on what information users request or provide.
● Social media companies use data mining techniques to
commodify their users in order to generate profit.
● This use of data mining has come under criticism lately as users
are often unaware of the data mining happening with their
personal information, especially when it is used to influence
preferences.
DATA MINING

■ Data mining combines statistics, artificial intelligence and
machine learning to find patterns, relationships and
anomalies in large data sets.
■ An organization can mine its data to improve many
aspects of its business, though the technique is
particularly useful for improving sales and customer
relations.
■ Data mining can be used to find relationships and
patterns in current data and then apply those to new data
to predict future trends or detect anomalies, such as
fraud.
DATA MINING

A Few Examples
■ Banking: Data mining is used to predict successful loan
applicants as well as to detect fraud in credit cards.
■ Retail: Create effective advertisements based on past
responses.
■ Insurance: Predict probability and costs for future
disasters, based on past hurricanes or tornadoes.
■ Grocery stores: Analyze market baskets to find products
usually bought together. Running a sales promotion on
one item can improve sales of the other item at its normal
price.
DATA MINING

■ Manufacturing: Implement just-in-time fulfillment by
predicting when new supplies should be ordered or when
equipment is likely to fail.
■ Customer relationship management: Identify
characteristics of customers who move to competitors,
then offer special deals to retain other customers with
those same characteristics.
■ Security: Intrusion detection techniques use data mining
to identify anomalies that could be network break-ins.
ASSOCIATION RULES

Association rule mining involves the use of machine learning models to
analyze data for patterns in a database.

The association rule learning algorithm is a rule-based machine learning
approach to find patterns from items that are dependent on one another and
map the connections between them. It is usually used in a large database to
find interesting relationships in how and why two items are connected. The
association rule learning algorithm finds application in many real-life
scenarios.
ASSOCIATION RULES

For instance, companies use it to understand consumer behavior and put
tailor-suited products in front of their customers. The association rule
learning algorithm detects patterns from past sales, which are used to
present other related products to customers.

Association rule learning algorithms work like conditional statements
(e.g., if A then B). In this case, A is called the antecedent while B is
called the consequent.
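A small sketch (hypothetical transactions, plain Python) of the two standard measures behind
such a rule, support and confidence, for the rule "if bread then milk":

    # Hypothetical market-basket transactions.
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]

    antecedent, consequent = {"bread"}, {"milk"}            # A and B in "if A then B"
    both = sum((antecedent | consequent) <= t for t in transactions)
    ante = sum(antecedent <= t for t in transactions)

    support = both / len(transactions)   # fraction of transactions containing A and B: 2/4 = 0.5
    confidence = both / ante             # fraction of A-transactions that also contain B: 2/3
    print(support, confidence)
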
ASSOCIATION RULES

Common algorithms for association rule mining:

ASSOCIATION RULES

1. Apriori Algorithm: The Apriori algorithm is one of the most commonly used algorithms for
association rule mining. It employs a breadth-first search strategy to discover frequent
itemsets by iteratively generating candidate itemsets and pruning those that do not meet a
minimum support threshold (a minimal sketch of this idea follows the list).
2. FP-Growth Algorithm: The FP-Growth (Frequent Pattern Growth) algorithm is an alternative
to the Apriori algorithm that uses a divide-and-conquer approach to efficiently discover
frequent itemsets without generating candidate itemsets. It constructs a compact data
structure called the FP-tree to represent the dataset and mines frequent patterns directly
from the tree structure.
3. Eclat Algorithm: Eclat (Equivalence Class Clustering and Bottom-Up Lattice Traversal) is
another popular method for mining frequent itemsets in association rule mining. Similar to the
Apriori algorithm, Eclat is used to identify sets of items that frequently occur together in
transactions. However, it utilizes a different approach that focuses on exploiting a vertical
data format (also known as a Transaction ID List or tid-list) to achieve efficiency.
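
A minimal sketch of the Apriori idea (illustration only, plain Python): count support for
single items first, then build candidate pairs only from frequent items and prune by the same
minimum support threshold:

    from itertools import combinations

    transactions = [                       # hypothetical market baskets
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    min_support = 0.5
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Pass 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent_items = {i for i in items if support({i}) >= min_support}

    # Pass 2: candidate pairs are generated only from frequent items, then pruned.
    frequent_pairs = {pair: support(set(pair))
                      for pair in combinations(sorted(frequent_items), 2)
                      if support(set(pair)) >= min_support}
    print(frequent_pairs)   # e.g. {('bread', 'butter'): 0.5, ('bread', 'milk'): 0.5}
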
ASSOCIATION RULES

Applications of the rule:

Association rule mining has numerous applications, including market basket
analysis (identifying frequently co-purchased items), recommendation
systems (suggesting related products or services based on past behavior),
cross-selling strategies (promoting complementary products), web usage
mining (analyzing user navigation patterns), and bioinformatics (finding
associations between genes and diseases).
