
UNIT – 1 – Part 1

 Every organization and institution is best defined by its data.


 Whatever the organization may be, its value grows through its greatest asset, i.e., its data.
 No matter what form the data takes, homogeneous or heterogeneous, it becomes an asset to that company.
Now the question arises: what is data?
DATA:
The known facts that can be recorded and that have implicit meaning (i.e., the meaning is not stated explicitly).
Now, if data does not carry explicit, external meaning, why is it an asset to a company?
Here the idea of information comes in.
Data and information are quite different things.
Information: Processed data that carries and conveys external meaning is called information.
The main aim of any analytics company is to extract information from data.
Now, why is this information important?
What can we do with information?
Information can be used to gain valuable insights, and these insights are what turn an organization's data into an asset.
Insight: Insight is gained by analyzing data and information to understand what is going on with a particular situation or phenomenon.

Data → Information → Insight

Data available in the real world can be of any form, but on a broader scale all these forms fall under a classification of data based on its structure.

1.1 CLASSIFICATION OF DIGITAL DATA:


The digital data can be classified into 3 types based on its structure. They are:
1. Structured Data.
2. Unstructured Data.
3. Semi-Structured Data.
Structured Data:
We say data is in structured format if it follows a predefined structured schema. The data in this format
is in an organized form and can be used easily by a computer program.
The best example of structured data is data stored in an RDBMS, i.e., data in the form of tables where each row is a tuple and each column follows a particular data type.
Only about 10% of real-world data is structured.
If we consider a typical RDBMS as an example, the tables inside the DB can be related to each other.
Each table has its own cardinality (Number of rows/tuples in a Table) and its own degree (Number of
columns in a Table).
-> Sources of Structured Data:
Main sources of Structured data are:
1. Databases like MySQL, PostgreSQL, etc.
2. Spreadsheets
3. OLTP systems.
-> Advantages of Structured Data:

Insert/Update/Delete: All data manipulation operations are easy with structured data.
Security: The data can be safeguarded with encryption and decryption mechanisms.
Indexing: Because data is stored in rows and columns, random access is easy through indexing.
Scalability: These databases can be scaled up or down according to the data size.
Transaction Processing: Transactions in an RDBMS are safe and secure because it follows the ACID properties.
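
To make the idea concrete, here is a minimal sketch in Python, assuming only the standard-library sqlite3 module; the student table, its columns, and the rows are invented for illustration. It stores a few tuples under a predefined schema and shows the cardinality and degree mentioned above.

import sqlite3

# A small relational table with a predefined schema (hypothetical example).
conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# Each column has a declared data type; every row (tuple) must follow this schema.
cur.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, marks REAL)")
cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Asha", 87.5), (2, "Ravi", 91.0), (3, "Meena", 78.25)],
)

# Because the structure is known in advance, a program can query it directly.
cur.execute("SELECT COUNT(*) FROM student")
print("cardinality (number of tuples):", cur.fetchone()[0])
cur.execute("SELECT * FROM student")
print("degree (number of columns):", len(cur.description))
conn.close()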

Semi-Structured Data:
We say that data is in semi-structured format if it does not conform to any data model (an RDBMS is an example of a data model), but it still has some structure.
For example, XML files do not conform to any data model, but they do have some structure.
Another example is a C program: it does not conform to any data model, but it has its own structure.
Though semi-structured data does not conform to a data model, it carries metadata that describes the data inside it; however, this metadata is not sufficient to fully describe the original data.
Like structured data, semi-structured data accounts for only about 10% of real-world data.
Usually, semi-structured data is in the form of tags and attributes, as in XML and HTML.
These tags create a hierarchy of records to establish relationships between data items.
There is no separate schema.
-> Sources of semi-structured data:
The main sources of semi-structured data are:
1. XML files
2. JSON files
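
As a hedged illustration (Python standard library only; the employee record is invented), the sketch below shows how the tags and keys in XML and JSON act as self-describing metadata that a program can navigate, even though there is no separate schema:

import json
import xml.etree.ElementTree as ET

# JSON: keys describe the values, but no external schema is enforced.
json_text = '{"employee": {"id": 101, "name": "Asha", "skills": ["SQL", "Spark"]}}'
record = json.loads(json_text)
print(record["employee"]["name"], record["employee"]["skills"])

# XML: tags and attributes form a hierarchy of records.
xml_text = '<employee id="101"><name>Asha</name><skill>SQL</skill><skill>Spark</skill></employee>'
root = ET.fromstring(xml_text)
print(root.attrib["id"], root.find("name").text, [s.text for s in root.findall("skill")])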
Unstructured Data:
Data is in unstructured format if it neither conforms to any data model nor has any predefined format, and it is not easily understood by a computer.
Unlike structured and semi-structured data, it accounts for almost 80-90% of the data in an organization and in the real world.
Ex: PDFs, JPEG/PNG images, MP3 audio, MP4 video, PPT and DOCX files, etc.
We have multiple issues with this unstructured form of data.
-> Issues with unstructured data:
 The main issue with unstructured data is that a computer cannot understand it.
 Another issue is that a file's extension may be .txt or .pdf while the data inside is actually in a structured format; because of the extension we might assume the data is unstructured and miss out on valuable insights.
 Another issue is that the data might have some structure, but the structure may not be properly defined. In this case, even though the data has structure, it falls into the category of unstructured data.
 Another issue is that the data might be highly structured, but the structure is unannounced or unanticipated.
We know that computers cannot understand this unstructured form of data. So what can we do?
-> Dealing with Unstructured Data:
To make a computer understand unstructured data, we must convert it to a structured format.
There are multiple methods for doing so. Some of them include:
1. Data Mining:
a. Association Rule Mining
b. Regression Analysis
2. Text Mining and Text Analytics
3. Natural Language Processing (NLP)
4. POS (Parts-of-Speech) Tagging
etc. (a small sketch of the idea follows below)
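
As a minimal sketch of the underlying idea (not a full implementation of any of the methods above), the Python snippet below uses simple text mining with regular expressions over free-form text, which is invented for illustration, to pull out structured (event, date, amount) records that a program can then analyze:

import re

# Hypothetical free-form text: readable to a human, opaque to a database.
raw_text = """
Order placed on 2024-01-05 for Rs. 2500 by customer Asha.
Refund issued on 2024-01-09 for Rs. 400 to customer Ravi.
Order placed on 2024-02-11 for Rs. 1800 by customer Meena.
"""

# Text-mining step: extract (event, date, amount) into structured rows.
pattern = re.compile(r"(Order placed|Refund issued) on (\d{4}-\d{2}-\d{2}) for Rs\. (\d+)")
rows = [(event, date, int(amount)) for event, date, amount in pattern.findall(raw_text)]

for row in rows:
    print(row)   # each row now behaves like a tuple in a structured table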

-> Sources of Unstructured Data:


- WhatsApp chats
- Images
- Videos
- PDFs from the web
- Emails
- Web pages
- Audio files
Etc.
1.2 Big Data: Definition
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
(or)

Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.
1.3 Characteristics of Big Data:
Definitions of big data may vary slightly, but big data is always described in terms of volume, velocity, and variety. These characteristics are often referred to as the “3 Vs of big data” and were first defined by Gartner in 2001.

1. Volume: - As the term implies, big data analytics entails handling and analyzing vast amounts of data.
To effectively work with such massive datasets, specialized tools and infrastructure are necessary for
capturing, storing, managing, cleaning, transforming, analyzing, and reporting the data.
2. Velocity: - Velocity denotes the speed at which data is generated. To keep up with the rapid generation
of data, systems for processing and analyzing data must possess sufficient capacity to handle the
influx of data and deliver timely, actionable insights.
3. Variety: - Variety refers to the diversity of data types and sources. Data can manifest in various forms,
originate from diverse sources, and exist in structured or unstructured formats. Understanding the types
of data and their sources, as well as the interrelationships within the datasets, is vital for generating
meaningful insights from big data.
4. Variability: - Big data often contains noisy and incomplete data points, which can obscure valuable
insights. Addressing this variability typically involves data cleaning and validation processes to ensure
data quality.
5. Veracity: - Veracity pertains to the accuracy and authenticity of the data. Data must undergo validation
to ensure that it accurately represents essential business functions and that any data manipulation,
modeling, and analysis does not compromise the data's accuracy.
6. Value: - A successful big data analytics strategy must generate value. The insights derived from the
analysis should provide meaningful guidance for improving operations, enhancing customer service,
or creating other forms of value. An integral part of developing a big data analytics strategy is
distinguishing between data that can contribute value and data that cannot.
7. Visualization: - Visualization plays a vital role in data analytics, as it involves presenting the analyzed
data in a visually comprehensible manner. When planning data visualization, it is essential to consider
the end user and the decisions the visualizations aim to support. Well-executed data visualization
facilitates swift and well-informed decision-making.

1.4 Evolution Of Big Data:


1970s - 1990s: Relational Database Management Systems (RDBMS) Era
 1970s: The concept of RDBMS was introduced with the development of foundational database
models like the relational model by Edgar Codd and SQL by IBM.
 1980s - 1990s: Oracle, IBM DB2, Microsoft SQL Server, and other RDBMS systems became
dominant for managing structured data. These systems were efficient for structured data but
struggled to handle unstructured or semi-structured data.
2000s - 2010s: Big Data and Distributed Computing
 Early 2000s: The limitations of RDBMS became apparent as data volumes exploded. Google
introduced the MapReduce paradigm to process large-scale data across distributed clusters.
 2006: Apache Hadoop was developed, offering a distributed file system (HDFS) and the MapReduce
programming model. Hadoop allowed scalable and distributed processing of large datasets
across clusters of commodity hardware.
 2010s: NoSQL databases emerged, offering alternatives to traditional RDBMS, better suited for
handling unstructured or semi-structured data. Technologies like MongoDB, Cassandra, and
others gained popularity for their flexibility and scalability.
2014: Apache Spark Emergence
 2014: Apache Spark was introduced as an open-source, distributed computing system. It offered in-memory processing, faster data processing compared to MapReduce, and a more versatile framework for various data processing tasks (a short word-count sketch follows at the end of this timeline).
 Spark's Advantages: Its ability to handle diverse workloads, including batch processing, interactive
queries, streaming, machine learning, and graph processing, made Spark highly popular
among data scientists and engineers.
2015 - Present: Spark's Growth and Dominance
 2015-2020s: Spark continued to evolve rapidly, gaining a large user base and becoming a de
facto standard for big data processing due to its speed, ease of use, and extensive APIs for
different data processing tasks.
 Integration with Machine Learning and AI: Spark's integration with libraries like MLlib
for machine learning and other analytics tools further solidified its position as a
versatile and comprehensive big data processing framework.
 Real-time Processing: Spark Streaming and later Structured Streaming provided capabilities for real-time data processing, addressing the increasing demand for real-time analytics.
 Cloud Adoption: Spark’s compatibility with various cloud platforms enhanced its accessibility
and scalability, leading to increased adoption across different industries.
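
To make the timeline above concrete, here is a hedged PySpark sketch of the classic word count, essentially the same job that popularized MapReduce; it assumes pyspark is installed and that a local Spark session is sufficient, and the sample lines are invented:

from pyspark.sql import SparkSession

# A local Spark session; on a cluster the same code runs distributed.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.sparkContext.parallelize([
    "big data needs distributed processing",
    "spark keeps intermediate data in memory",
])

counts = (lines.flatMap(lambda line: line.split())   # map phase: emit individual words
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))      # reduce phase: sum the counts per word

print(counts.collect())
spark.stop()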
1.5 Challenges of Big Data:
1. Capture:
 Challenge: Ensuring the collection of relevant and accurate data at the source without missing
critical information. Issues may arise due to the volume, velocity, and variety of
incoming data streams. Maintaining data consistency and quality during capture can be
challenging.
2. Storage:
 Challenge: Storing vast amounts of data efficiently and securely. Dealing with different data types
and formats can complicate storage solutions. Scaling storage infrastructure to accommodate
increasing data volumes while ensuring accessibility and reliability is a significant
challenge.
3. Curation:
 Challenge: Cleaning, organizing, and preparing data for analysis. Managing data quality, dealing with
missing values, inconsistencies, and ensuring data integrity pose challenges. Aligning
various data sources and formats for cohesive analysis is essential but challenging.
4. Search:
 Challenge: Efficiently retrieving and accessing relevant data from large, diverse datasets. Developing
effective search algorithms to navigate through massive amounts of structured and unstructured
data is a challenge. Ensuring quick and accurate search results despite the volume and variety of
data is crucial.
5. Analytics:
 Challenge: Processing and analyzing data to derive meaningful insights. Handling complex
data processing tasks, applying appropriate algorithms, and dealing with real-time analytics
can be challenging. Ensuring that analytics tools can handle large-scale data and provide
accurate results is crucial.
6. Transfer:
 Challenge: Moving data between systems, platforms, or locations securely and efficiently. Ensuring
data integrity during transfer, especially across different environments, dealing with network
limitations, and minimizing latency are significant challenges, particularly in real-time
data transfer scenarios.
7. Visualization:
 Challenge: Representing complex data in a visually understandable format. Creating intuitive and informative data visualizations that effectively communicate insights is challenging. Dealing with diverse data types and ensuring visualizations are both accurate and actionable poses an ongoing challenge.
8. Privacy Violations:
 Challenge: Safeguarding sensitive and personally identifiable information from unauthorized access or
misuse. Ensuring compliance with data protection regulations while using and sharing data for
analytics purposes is a critical challenge. Balancing data utility with privacy concerns presents
ongoing challenges in data management.
1.6 Platform Requirements:
Platform requirements for handling big data involve a combination of technological components and
infrastructure to effectively manage, process, and derive insights from large and complex datasets. Here are
the essential platform requirements:
1. Scalability:
 The platform must be scalable to handle increasing data volumes without sacrificing performance. It
should scale both vertically (increasing hardware resources) and horizontally (adding more nodes or
servers) to accommodate growing data needs.
2. Distributed Computing:
 Leveraging distributed computing architectures allows for parallel processing across multiple nodes or
clusters. Platforms should support frameworks like Hadoop or Apache Spark, enabling distributed storage
and processing of data.
3. Fault Tolerance:
 The platform should have built-in fault tolerance mechanisms to ensure system reliability. Redundancy
and data replication across nodes help maintain operations in case of hardware failures.
4. Data Variety Support:
 The platform should handle diverse data types, including structured, semi-structured, and unstructured
data. This includes support for data ingestion, storage, and processing of various formats such as text,
images, videos, and IoT-generated data.
5. Real-time Processing:
 Capabilities for real-time or near-real-time data processing are crucial. Platforms should support stream
processing frameworks like Kafka or Spark Streaming to analyze data as it arrives.
6. Data Integration:
 The platform should facilitate seamless integration of data from multiple sources and formats. It should
support ETL (Extract, Transform, Load) processes, enabling data transformation and integration into a
unified format for analysis.
7. Security and Privacy Measures:
 Robust security features are essential to protect sensitive data. Access control, encryption, authentication
mechanisms, and compliance with data privacy regulations (GDPR, CCPA) should be integral to the
platform.
8. Data Governance and Metadata Management:
 The platform should have capabilities for metadata management to catalog and track data lineage, ensuring
data quality, compliance, and governance. This involves managing metadata to understand data context,
usage, and relationships.
9. Analytics and Machine Learning Support:
 The platform should provide tools and libraries for data analysis, machine learning, and advanced
analytics. Integration with analytics frameworks like TensorFlow, Scikit-learn, or R enables users to derive
insights and build predictive models.
10. Cost-Effectiveness:
 Balancing performance with cost efficiency is crucial. The platform should offer flexible pricing models
and the ability to optimize resource utilization to manage costs effectively.
11. Cloud and Hybrid Deployment Support:
 Support for cloud deployment allows for scalability, flexibility, and cost-efficiency. Platforms that support
hybrid deployments, combining on-premises and cloud resources, offer additional flexibility.
12. Ease of Use and Accessibility:
 User-friendly interfaces, APIs, and tools are vital for accessibility. Platforms should be intuitive for data
engineers, data scientists, and analysts to interact with and derive value from the data effectively.
1.7 Traditional Business Intelligence Vs Big Data:
Data Size:
- Traditional BI: Deals with manageable data sizes, typically in gigabytes or low terabytes.
- Big Data: Handles massive volumes of data, often in petabytes or more.

Data Types:
- Traditional BI: Primarily structured data from internal sources like databases.
- Big Data: Deals with structured, semi-structured, and unstructured data from varied sources including social media, IoT devices, etc.

Processing Approach:
- Traditional BI: Relies on structured query language (SQL) and relational databases.
- Big Data: Utilizes distributed computing and parallel processing frameworks like Hadoop or Spark.

Processing Speed:
- Traditional BI: Focuses on batch processing and historical analysis.
- Big Data: Combines batch, real-time, and near-real-time processing for quick insights and real-time decision-making.

Analytics Methods:
- Traditional BI: Emphasizes predefined queries, reports, and dashboards.
- Big Data: Involves exploratory analysis, predictive modeling, and machine learning algorithms.

Infrastructure:
- Traditional BI: Often relies on on-premises data warehouses and relational databases.
- Big Data: Utilizes distributed computing clusters, cloud-based storage, and processing resources.

Tools and Technologies:
- Traditional BI: Relies on tools like Tableau, Power BI, and SQL-based databases.
- Big Data: Utilizes frameworks like Hadoop, Spark, NoSQL databases, Kafka, and specialized tools for big data analytics.

Use Cases:
- Traditional BI: Primarily used for historical reporting, trend analysis, and performance monitoring.
- Big Data: Applied in predictive analytics, sentiment analysis, IoT data processing, and handling large-scale real-time data.

Decision-Making Timeframe:
- Traditional BI: Often supports strategic and tactical decision-making over longer periods.
- Big Data: Supports both strategic decisions and quick tactical decisions due to real-time analytics capabilities.

Cost:
- Traditional BI: Generally lower costs due to smaller datasets and simpler infrastructure.
- Big Data: Can involve higher costs due to infrastructure requirements and the complexity of managing and analyzing large volumes of data.

Skill Sets:
- Traditional BI: Requires SQL and domain-specific knowledge, often applied by business analysts.
- Big Data: Requires skills in distributed computing, data engineering, machine learning, and data science.

Regulatory Compliance:
- Traditional BI: Concerns are primarily around standard data governance and compliance.
- Big Data: Faces additional challenges concerning privacy, compliance, and ethical considerations due to diverse data sources and types.

1.8 A Typical Warehouse Environment:


In a typical data warehouse environment, data of any kind, whether operational, transactional, or business data, is gathered from multiple sources, including Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM) systems, legacy systems, and other third-party platforms.
This data may be in any form and may come from the same geographic location or from different ones.
All of this data is integrated, cleaned, transformed, and standardized through the Extraction, Transformation and Loading (ETL) process.
The transformed data is then loaded into a data warehouse, which is available at the enterprise level, or into data marts, which are available at the business-unit level.

[Figure: A typical data warehouse environment. Sources (ERP, CRM, Legacy systems, Third-party data) feed the Data Warehouse through ETL; the warehouse in turn serves Reporting/Dashboards, OLAP, Ad hoc querying, and Modeling.]
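
A minimal, hedged sketch of the ETL flow described above, in Python: the ERP/CRM records are invented, and an in-memory SQLite table stands in for the warehouse, whereas a real environment would use dedicated ETL tooling and an enterprise warehouse.

import sqlite3

# EXTRACT: pull records from two hypothetical source systems (ERP and CRM).
erp_rows = [{"cust": "Asha ", "amount": "2500"}, {"cust": "ravi", "amount": "1800"}]
crm_rows = [{"customer_name": "MEENA", "amount": 3200}]

# TRANSFORM: clean and standardize the records into one unified format.
def standardize(name, amount):
    return (name.strip().title(), float(amount))

unified = [standardize(r["cust"], r["amount"]) for r in erp_rows] + \
          [standardize(r["customer_name"], r["amount"]) for r in crm_rows]

# LOAD: write the integrated data into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", unified)

print(warehouse.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer").fetchall())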

1.9 Introduction To Big Data Analytics: Definition

Big Data analytics is the practice of examining data to answer questions, identify trends, and extract insights.
When data analytics is used in business, it is often called business analytics.

1.10 Classification:
The four different types of business analytics are descriptive (what happened in the past), diagnostic (why did it happen), predictive (what will happen in the future), and prescriptive (what should be done to make it happen).
 Descriptive Analytics:
Descriptive analytics examines what happened in the past.
You’re utilizing descriptive analytics when you examine past data sets for patterns and trends.
This is the core of most businesses’ analytics because it answers important questions like how much you
sold and if you hit specific goals.
Descriptive analytics functions by identifying what metrics you want to measure, collecting that data, and
analyzing it.
It turns the stream of facts your business has collected into information you can act on, plan around, and
measure.
Examples of descriptive analytics include:
o Annual revenue reports
o Survey response summaries
o Year-over-year sales reports
The main difficulty of descriptive analytics is its limitations. It’s a helpful first step for decision makers
and managers, but it can’t go beyond analyzing data from past events.
Once descriptive analytics is done, it’s up to your team to ask how or why those trends occurred,
brainstorm and develop possible responses or solutions, and choose how to move forward.
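
As a hedged illustration of descriptive analytics, the short pandas sketch below (pandas is assumed to be available, and the revenue figures are invented) produces the kind of year-over-year sales summary mentioned above:

import pandas as pd

# Hypothetical past sales records.
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023, 2024, 2024],
    "revenue": [120000, 95000, 140000, 110000, 150000, 130000],
})

# Descriptive analytics: summarize what has already happened.
yearly = sales.groupby("year")["revenue"].sum()
print(yearly)                          # annual revenue report
print(yearly.pct_change().round(3))    # year-over-year growth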

 Diagnostic Analytics:
Another common type of analytics is diagnostic analytics and it helps explain why things happened the
way they did.
It’s a more complex version of descriptive analytics, extending beyond what happened to why it happened.
Diagnostics analytics identifies trends or patterns in the past and then goes a step further to explain why
the trends occurred the way they did.
Diagnostic analytics applies data to figure out why something happened so you can develop better
strategies without so much trial and error.
Examples of diagnostic analytics include:
o Why did year-over-year sales go up?
o Why did a certain product perform above expectations?
o Why did we lose customers in Q3?
The main flaw of diagnostic analytics is that, because it focuses on past occurrences, it provides only limited actionable guidance about the future.
Understanding the causal relationships and sequences may be enough for some businesses, but it may not
provide sufficient answers for others.

 Predictive analytics
Predictive analytics is what it sounds like — it aims to predict likely outcomes and make educated
forecasts using historical data.
Predictive analytics extends trends into the future to see possible outcomes.
This is a more complex version of data analytics because it uses probabilities for predictions instead of
simply interpreting existing facts.
Statistical modeling or machine learning are commonly used with predictive analytics.
A business is in a better position to set realistic goals and avoid risks if they use data to create a list of
likely outcomes.
Predictive analytics can keep your team or the company as a whole aligned on the same strategic vision.
Examples of predictive analytics include:
o Ecommerce businesses that use a customer’s browsing and purchasing history to make product
recommendations.
o Financial organizations that need help determining whether a customer is likely to pay their credit
card bill on time.
o Marketers who analyze data to determine the likelihood that new customers will respond favorably
to a given campaign or product offering.
The primary challenge with predictive analytics is that the insights it generates are limited to the data.
First, that means that smaller or incomplete data sets will not yield predictions as accurate as larger data
sets might.
Additionally, the challenge of predictive analytics being restricted to the data simply means that even the
best algorithms with the biggest data sets can’t weigh intangible or distinctly human factors.
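
A minimal, hedged sketch of predictive analytics using scikit-learn (assumed to be installed); the tiny set of past monthly sales is invented, and a real model would need far more data than this to give trustworthy forecasts:

import numpy as np
from sklearn.linear_model import LinearRegression

# Historical data: month index versus sales (hypothetical figures).
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 110, 125, 130, 145, 155])

# Fit a simple trend model on the past and extend it into the future.
model = LinearRegression().fit(months, sales)
print("forecast for months 7 and 8:", model.predict(np.array([[7], [8]])))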

 Prescriptive analytics
Prescriptive analytics uses the data from a variety of sources — including statistics, machine learning, and
data mining — to identify possible future outcomes and show the best option.
Prescriptive analytics is the most advanced of these types because it provides actionable insights instead of just raw data.
This methodology is how you determine what should happen, not just what could happen.
Using prescriptive analytics enables you to not only envision future outcomes, but to understand why
they will happen.
Prescriptive analytics also can predict the effect of future decisions, including the ripple effects those
decisions can have on different parts of the business. And it does this in whatever order the decisions may
occur.
Prescriptive analytics is a complex process that involves many variables and tools like algorithms,
machine learning, and big data.
Examples of prescriptive analytics include:
o Calculating client risk in the insurance industry to determine what plans and rates an account
should be offered.
o Discovering what features to include in a new product to ensure its success in the market, possibly
by analyzing data like customer surveys and market research to identify what features are most
desirable for customers and prospects.
o Identifying tactics to optimize patient care in healthcare, like assessing the risk for developing
specific health problems in the future and targeting treatment decisions to reduce those risks.

The most common issue with prescriptive analytics is that it requires a lot of data to produce useful results, but a large amount of data is not always available, so this type of analytics can easily become inaccessible for most organizations.
Though the use of machine learning dramatically reduces the possibility of human error, an additional
downside is that it can’t always account for all external variables since it often relies on machine learning
algorithms.
1.11 Challenges:
1. Scale
2. Security
3. Schema
4. Continuous Availability
5. Consistency
6. Partition Tolerance
7. Data Quality
1.12 Terminologies Used in Big Data Environments:
1. In-Memory Analytics
2. In-Database Processing
3. Symmetric Multiprocessor System (SMP)
4. Massively Parallel Processing (MPP)
5. Parallel and Distributed Systems (differences)
6. Shared-Nothing Architecture
7. CAP Theorem
