
UNIT-2

DATA, Database,
Data Warehouse
and Data Mining
DATA
Data is the plural of 'datum', originally a Latin noun meaning
"something given."

Today, data means "facts or pieces of information" or just "information",
especially facts or numbers in an electronic form that can be stored and
used by a computer.

According to a Forbes article, 2.5 quintillion bytes of data are created
each day at our current pace, and that pace is only accelerating with the
growth of the Internet of Things (IoT). Ninety percent of the data in the
world was generated over the last two years alone.
VECTOR
● In computer programming, a vector is either a pointer or an array with only one
dimension.
● A vector, in programming, is a type of array that is one dimensional.
● A vector is often represented as a 1-dimensional array of numbers, referred to as
components, and is displayed in either column form or row form.

● Vectors are a logical element in programming languages that are used for storing data.

● Vectors are similar to arrays, but their actual implementation and operation differ.

● Furthermore, vectors can typically hold any object.

● Each item in the vector has to be of the same type and size (illustrated in the sketch below).
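As a minimal sketch (assuming Python and its standard array module; the example data is
hypothetical, not from these slides), a one-dimensional vector of same-typed components can
be created and viewed in row or column form:

    from array import array

    # A typed one-dimensional vector: every component must be the same type
    # ('d' means double-precision float), mirroring the "same type" rule above.
    v = array('d', [1.0, 2.5, 4.0])

    print(v[0])                      # components are accessed by index, like an array

    # Row form vs. column form is only a display convention for the same components.
    row_form = list(v)               # [1.0, 2.5, 4.0]
    column_form = [[x] for x in v]   # [[1.0], [2.5], [4.0]]
    print(row_form, column_form)
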
DATA FRAME
A DataFrame is a data structure that organizes data
into a 2-dimensional table of rows and columns,
much like a spreadsheet.

Every DataFrame contains a blueprint, known as a schema, that defines the
name and data type of each column.

However, the difference is that while a spreadsheet sits on one computer
in one specific location, a DataFrame can span thousands of computers. In
this way, DataFrames make it possible to do analytics on big data using
distributed computing clusters.
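As an illustration (a hypothetical example using the pandas library, which runs on a single
machine; distributed engines such as Apache Spark apply the same idea across a cluster), a
small DataFrame and its inferred schema might look like this:

    import pandas as pd

    # Each dictionary key becomes a column; each list holds that column's values.
    books = pd.DataFrame({
        "title": ["Dune", "Emma"],
        "year":  [1965, 1815],
        "price": [9.99, 7.50],
    })

    print(books)          # the 2-dimensional table of rows and columns
    print(books.dtypes)   # the inferred "schema": name and data type of each column
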
TYPES OF DATA SOURCE
1. Structured Data:
Structured data is highly organized and formatted in a specific way,
typically stored in databases with a clear schema defining the data
types and relationships.
Characteristics:
- Easily searchable and queryable.
- Follows a defined schema.
- Suitable for traditional data analysis techniques.
TYPES OF DATA SOURCE
2. Semi-Structured Data:
Semi-structured data lacks the rigid structure of structured data but still
has some level of organization. It does not conform to a fixed schema but
may have tags, markers, or other mechanisms for organizing and labeling
the data.
Characteristics:
- Less rigid structure compared to structured data.
- May contain repeating elements or nested structures.
- Suitable for scenarios where data schemas may evolve over time (see the example below).
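For example (a hypothetical JSON snippet, not taken from these slides), semi-structured
records carry their own tags but need not share a fixed schema, and Python's standard json
module can parse them:

    import json

    # Two records with slightly different fields -- allowed, since there is no fixed schema.
    raw = '[{"id": 1, "name": "Asha", "tags": ["vip"]}, {"id": 2, "name": "Ben"}]'
    records = json.loads(raw)

    for r in records:
        # Missing keys must be handled explicitly, because the structure can vary per record.
        print(r["name"], r.get("tags", []))
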
TYPES OF DATA SOURCE
3. Unstructured Data:

Unstructured data lacks a specific organization or structure. It is typically
stored in its native format without any predefined schema, making it
challenging to analyze using traditional methods.

Characteristics:

- No predefined structure or organization.

- Difficult to analyze using traditional databases or methods.

- Requires advanced techniques, such as natural language processing, for analysis.
DATA WAREHOUSE V/S DATABASE
RELATIONAL V/S NON-RELATIONAL DATABASE
RDBMS DATA STRUCTURE

RDBMS (Relational Database Management System) data structure refers to the
way data is organized and stored within a relational database. In an RDBMS,
data is structured into tables, which consist of rows and columns. The key
components of the data structure in an RDBMS are:

1. Tables: Tables are the fundamental structure in an RDBMS. Each table
represents a specific entity or concept within the database. For example,
in a database for a library, there might be tables for books, authors, and
borrowers.
RDBMS DATA STRUCTURE

2. Rows: Rows, also known as records or tuples, represent individual
instances or entries within a table. Each row corresponds to a single
entity or data record. For example, in a table representing books, each
row might represent a specific book with its attributes such as title,
author, and publication year.

3. Columns: Columns, also known as attributes or fields, represent the
characteristics or properties of the entities stored in the table. Each
column corresponds to a specific piece of data for every row in the table.
For example, in a table representing books, columns might include title,
author, publication year, and ISBN.
RDBMS DATA STRUCTURE

4. Primary Key: A primary key is a column or combination of columns that
uniquely identifies each row in a table. It ensures that there are no
duplicate rows and enables efficient data retrieval and indexing. Primary
keys enforce entity integrity within the database.

5. Foreign Key: A foreign key is a column or set of columns in one table
that refers to the primary key in another table. It establishes relationships
between tables and enforces referential integrity, ensuring consistency and
integrity of data across multiple tables (see the sketch below).
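A minimal sketch of these keys, assuming Python's built-in sqlite3 module and the
hypothetical library schema mentioned above:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

    # Primary key: uniquely identifies each author row.
    conn.execute("""CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL)""")

    # Foreign key: books.author_id must match an existing authors.author_id.
    conn.execute("""CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(author_id))""")

    conn.execute("INSERT INTO authors VALUES (1, 'Frank Herbert')")
    conn.execute("INSERT INTO books VALUES (10, 'Dune', 1)")
    # Inserting a book with author_id = 99 would now raise an IntegrityError.
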
RDBMS DATA STRUCTURE

6. Indexes: Indexes are data structures that improve the speed of data
retrieval operations on tables. They are created on one or more columns
of a table to facilitate fast lookup of data based on those columns.
Indexes are especially useful for columns frequently used in queries and
joins.

7. Constraints: Constraints are rules enforced on the data stored in tables
to maintain data integrity and consistency. Common constraints include
NOT NULL, UNIQUE, CHECK, and DEFAULT constraints. They ensure that data
entered into the database meets certain criteria specified by the database
schema (see the sketch below).
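A small, self-contained sketch of an index and common constraints, again assuming Python's
sqlite3 module and a hypothetical borrowers table:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    conn.execute("""CREATE TABLE borrowers (
        borrower_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,                -- NOT NULL constraint
        email       TEXT NOT NULL UNIQUE,         -- UNIQUE constraint
        max_loans   INTEGER DEFAULT 3,            -- DEFAULT constraint
        age         INTEGER CHECK (age >= 0)      -- CHECK constraint
    )""")

    # An index on a frequently queried column speeds up lookups and joins on that column.
    conn.execute("CREATE INDEX idx_borrowers_name ON borrowers(name)")

    conn.execute("INSERT INTO borrowers (borrower_id, name, email, age) "
                 "VALUES (1, 'Asha', 'asha@example.com', 30)")
    print(conn.execute("SELECT max_loans FROM borrowers WHERE name = 'Asha'").fetchone())
    # -> (3,)  the DEFAULT constraint supplied max_loans automatically
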
COLUMNAR DATA STRUCTURES
Columnar data structures refer to a method of organizing and storing data in
which data for each column is stored together separately from the data in
other columns. This is in contrast to row-based storage, where data for each
row is stored together. In columnar data structures, all values for a specific
column are stored contiguously, allowing for more efficient storage and
retrieval of columnar data. Some key characteristics and advantages of
columnar data structures:

1. Storage Efficiency: Columnar storage minimizes the amount of disk I/O
required to access specific columns, especially when only a subset of
columns is needed for a query. This efficiency is achieved because only the
relevant columns need to be read from disk, reducing data transfer times
(a small layout sketch follows).
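A toy illustration (hypothetical data, plain Python) of the same records held row-wise
versus column-wise, showing why a query that touches one column can skip the rest:

    # Row-oriented layout: each record keeps all of its columns together.
    rows = [
        {"title": "Dune", "year": 1965, "price": 9.99},
        {"title": "Emma", "year": 1815, "price": 7.50},
    ]

    # Column-oriented layout: all values of one column are stored contiguously.
    columns = {
        "title": ["Dune", "Emma"],
        "year":  [1965, 1815],
        "price": [9.99, 7.50],
    }

    # Averaging prices needs only the "price" column, not every field of every record.
    print(sum(columns["price"]) / len(columns["price"]))
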
COLUMNAR DATA STRUCTURES
2. Compression: Columnar storage allows for effective compression
techniques to be applied at the column level. Since columns often
contain repetitive or similar values, compression algorithms can exploit
this redundancy to reduce storage space without loss of data integrity.

3. Data Encoding: Columnar data structures often utilize specialized
encoding schemes optimized for each column's data type. For example,
numerical columns might be encoded using delta encoding or run-length
encoding, while string columns might use dictionary encoding. These
encoding schemes further improve storage efficiency and query performance
(a run-length encoding sketch follows).
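As a minimal sketch (illustration only, plain Python) of run-length encoding, the kind of
scheme a columnar store might apply to a column with many repeated values:

    def rle_encode(values):
        """Collapse consecutive repeats into [value, run_length] pairs."""
        runs = []
        for v in values:
            if runs and runs[-1][0] == v:
                runs[-1][1] += 1        # extend the current run
            else:
                runs.append([v, 1])     # start a new run
        return runs

    status_column = ["OK", "OK", "OK", "FAIL", "OK", "OK"]
    print(rle_encode(status_column))    # [['OK', 3], ['FAIL', 1], ['OK', 2]]
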
COLUMNAR DATA STRUCTURES
4. Query Performance: Columnar storage can significantly improve
query performance, especially for analytical workloads that involve
aggregations, filtering, and selective projections of specific columns.
Operations like summation, counting, and averaging can be performed
more efficiently on columnar data due to its structure.

5. Analytics and Data Processing: Columnar data structures are well-suited
for analytical processing and data warehousing applications where complex
queries need to be executed against large datasets. Analytical queries
often involve scanning and aggregating large volumes of data, and columnar
storage helps optimize these operations for better performance.
COLUMNAR DATA STRUCTURES
6. Parallel Processing: Columnar storage facilitates parallel processing
of queries across multiple CPU cores or distributed computing nodes.
Since data for each column is stored separately, query operations can be
parallelized more effectively, leading to faster query execution times.

Popular columnar formats and databases such as Apache Parquet, Apache ORC
(Optimized Row Columnar), and ClickHouse leverage columnar data structures
to provide efficient storage and processing capabilities for big data
analytics and data warehousing use cases.
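As a small sketch (assuming pandas with a Parquet engine such as pyarrow installed; the data
is hypothetical), writing a table to the columnar Parquet format and reading back only the
column a query needs:

    import pandas as pd

    df = pd.DataFrame({"title": ["Dune", "Emma"],
                       "year":  [1965, 1815],
                       "price": [9.99, 7.50]})

    # Parquet stores each column separately, which enables column pruning on read.
    df.to_parquet("books.parquet")
    prices = pd.read_parquet("books.parquet", columns=["price"])
    print(prices)
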
DATA ISSUES
Data Privacy Concerns:
With the increasing amount of personal data collected by
organizations, concerns about data privacy have become more
pronounced. Issues such as data breaches, unauthorized data
sharing, and the misuse of personal information have raised
questions about the protection of individual privacy rights.

Data Security Breaches:
Data breaches continue to be a significant concern for organizations
across industries. Cyberattacks targeting sensitive data, such as
financial information, personal records, and intellectual property, have
become more sophisticated and frequent. These breaches not only compromise
data integrity but also damage the reputation and trust of the affected
organizations.
DATA ISSUES

Bias in AI and Machine Learning Algorithms:
Bias in AI and machine learning algorithms remains a critical issue,
especially in applications such as hiring, lending, and criminal
justice. Biased training data, algorithmic biases, and inadequate
algorithmic transparency can lead to unfair outcomes, perpetuate
societal inequalities, and undermine trust in automated
decision-making systems.

Data Quality and Governance:
Ensuring data quality and governance continues to be a challenge
for organizations dealing with large volumes of data from diverse
sources. Issues such as data inconsistency, incompleteness, and
inaccuracies can undermine the reliability and usability of data for
decision-making and analytics purposes.
DATA ISSUES
Data Sovereignty and Compliance:
Data sovereignty issues arise from the jurisdictional differences in
data protection and privacy laws across countries. Compliance with
regulations such as the GDPR (General Data Protection Regulation) in
the EU, CCPA (California Consumer Privacy Act) in California, and
similar laws in other regions presents challenges for multinational
organizations in managing and transferring data across borders.

Data Integration Challenges:
Integrating data from disparate sources, including structured and
unstructured data, poses significant challenges for organizations
seeking to derive insights from their data assets. Issues such as data
silos, incompatible data formats, and data integration complexity
hinder efforts to create a unified view of data across the organization.
DATA MINING
● Data mining is the process of analyzing a large batch of
information to discern trends and patterns.
● Data mining can be used by corporations for everything from
learning about what customers are interested in or want to buy
to fraud detection and spam filtering.
● Data mining programs break down patterns and connections in
data based on what information users request or provide.
● Social media companies use data mining techniques to
commodify their users in order to generate profit.
● This use of data mining has come under criticism lately as users
are often unaware of the data mining happening with their
personal information, especially when it is used to influence
preferences.
DATA MINING

■ Data mining combines statistics, artificial intelligence and
machine learning to find patterns, relationships and
anomalies in large data sets.
■ An organization can mine its data to improve many
aspects of its business, though the technique is
particularly useful for improving sales and customer
relations.
■ Data mining can be used to find relationships and
patterns in current data and then apply those to new data
to predict future trends or detect anomalies, such as
fraud.
DATA MINING

A Few Examples
■ Banking: Data mining is used to predict successful loan
applicants as well as to detect fraud in credit cards.
■ Retail: Create effective advertisements based on past
responses.
■ Insurance: Predict probability and costs for future
disasters, based on past hurricanes or tornadoes.
■ Grocery stores: Analyze market baskets to find products
usually bought together. Running a sales promotion on
one item can improve sales of the other item at its normal
price.
DATA MINING

■ Manufacturing: Implement just-in-time fulfillment by
predicting when new supplies should be ordered or when
equipment is likely to fail.
■ Customer relationship management: Identify
characteristics of customers who move to competitors,
then offer special deals to retain other customers with
those same characteristics.
■ Security: Intrusion detection techniques use data mining
to identify anomalies that could be network break-ins.
ASSOCIATION RULES

Association rule mining involves the use of machine learning models to
analyze data for patterns in a database.

The association rule learning algorithm is a rule-based machine learning
approach to find patterns from items that are dependent on one another and
map the connections between them. It is usually used in a large database to
find interesting relationships in how and why two items are connected. The
association rule learning algorithm finds application in many real-life
scenarios.
ASSOCIATION RULES

For instance, companies use it to understand consumer behavior and put
tailor-suited products in front of their customers. The association rule
learning algorithm detects patterns from past sales, which are used to
present other related products to customers.

Association rule learning algorithms work like conditional statements
(e.g., if A then B). In this case, A is called the antecedent while B is
called the consequent.
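A small sketch (hypothetical transactions, plain Python) of the two standard measures behind
such a rule, support and confidence, for the rule "if bread then milk":

    # Hypothetical market-basket transactions.
    transactions = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]

    antecedent, consequent = {"bread"}, {"milk"}            # A and B in "if A then B"
    both = sum((antecedent | consequent) <= t for t in transactions)
    ante = sum(antecedent <= t for t in transactions)

    support = both / len(transactions)   # fraction of transactions containing A and B: 2/4 = 0.5
    confidence = both / ante             # fraction of A-transactions that also contain B: 2/3
    print(support, confidence)
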
ASSOCIATION RULES

Common algorithms for association rule mining:

ASSOCIATION RULES

1. Apriori Algorithm: The Apriori algorithm is one of the most commonly used algorithms for
association rule mining. It employs a breadth-first search strategy to discover frequent
itemsets by iteratively generating candidate itemsets and pruning those that do not meet a
minimum support threshold (a minimal sketch of this idea follows the list).
2. FP-Growth Algorithm: The FP-Growth (Frequent Pattern Growth) algorithm is an alternative
to the Apriori algorithm that uses a divide-and-conquer approach to efficiently discover
frequent itemsets without generating candidate itemsets. It constructs a compact data
structure called the FP-tree to represent the dataset and mines frequent patterns directly
from the tree structure.
3. Eclat Algorithm: Eclat (Equivalence Class Clustering and Bottom-Up Lattice Traversal) is
another popular method for mining frequent itemsets in association rule mining. Similar to the
Apriori algorithm, Eclat is used to identify sets of items that frequently occur together in
transactions. However, it utilizes a different approach that focuses on exploiting a vertical
data format (also known as a Transaction ID List or tid-list) to achieve efficiency.
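
A minimal sketch of the Apriori idea (illustration only, plain Python): count support for
single items first, then build candidate pairs only from frequent items and prune by the same
minimum support threshold:

    from itertools import combinations

    transactions = [                       # hypothetical market baskets
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    min_support = 0.5
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Pass 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent_items = {i for i in items if support({i}) >= min_support}

    # Pass 2: candidate pairs are generated only from frequent items, then pruned.
    frequent_pairs = {pair: support(set(pair))
                      for pair in combinations(sorted(frequent_items), 2)
                      if support(set(pair)) >= min_support}
    print(frequent_pairs)   # e.g. {('bread', 'butter'): 0.5, ('bread', 'milk'): 0.5}
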
ASSOCIATION RULES

Applications of the rule:

Association rule mining has numerous applications, including market basket
analysis (identifying frequently co-purchased items), recommendation
systems (suggesting related products or services based on past behavior),
cross-selling strategies (promoting complementary products), web usage
mining (analyzing user navigation patterns), and bioinformatics (finding
associations between genes and diseases).
