
Introduction to Big Data

• Definition: Big data refers to extremely large and
complex datasets that exceed the capabilities of
traditional data processing tools.
• Characteristics: The "3Vs" - Volume, Velocity, Variety.

Characteristics of Big Data
• Volume: Massive amounts of data, ranging from
terabytes to petabytes.
• Velocity: Data generated and processed in real-time
or at high speed.
• Variety: Diverse data types and formats, including
structured, semi-structured, and unstructured data.

The case for Big Data
• Building an effective business case for a Big Data project involves
identifying key elements tied directly to a business process.
• Introducing Big Data into an enterprise can be disruptive, impacting scale,
storage, and data center design, resulting in additional costs for hardware,
software, staff, and support.
• Return on investment (ROI) and total cost of ownership (TCO) are crucial
considerations in the business plan.
• To accelerate ROI and reduce TCO, integrating the Big Data project with
existing IT projects driven by business needs can be advantageous.
• Building a business case involves using case scenarios and supporting
information, with many examples and resources available from major
vendors like IBM, Oracle, and HP.

The case for Big Data
• A solid business case for Big Data analytics should cover the following aspects:
• Background: Provide project drivers, align Big Data with business processes, and
clarify the overall goal.
• Benefits Analysis: Identify both tangible and intangible benefits of Big Data,
aligning them with business needs.
• Options: Explore different approaches to Big Data implementation and compare
their pros and cons.
• Scope and Costs: Define the project scope, resource requirements, training, and
associated costs for accurate ROI calculations.
• Risk Analysis: Assess potential risks related to technology, security, compatibility,
integration, and business disruptions.
• ROI: The return on investment is a crucial factor in determining project feasibility and success (a worked example of the arithmetic follows this list).
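
As a rough illustration of the ROI calculation (all figures below are assumed for the example, not taken from any actual project):

    # Illustrative ROI arithmetic with assumed figures
    benefit = 1_200_000          # projected annual gain from the analytics project (assumed)
    tco = 800_000                # total cost of ownership: hardware, software, staff, support (assumed)
    roi = (benefit - tco) / tco  # standard ROI formula
    print(f"ROI = {roi:.0%}")    # -> ROI = 50%
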
Big Data Options
• Traditional data warehousing solutions were not designed to handle the diverse data formats that Big Data encompasses.
• Big Data analytics requires parallel processing across multiple servers.
• Big Data includes various data types like structured, semistructured,
and unstructured data from sources such as log files, machine-
generated data, and social network comments.
• Moore’s Law contributes to the exponential growth of data as
processor and server capabilities increase.
• Big Data solutions gather and interact with all generated data,
allowing administrators and analysts to explore data usage later.
• Hadoop, the leading open-source Big Data platform, grew out of Google's MapReduce and GFS papers and was developed into production use at Yahoo.
The Team Challenge
• Finding and hiring skilled analytics professionals.
• Deciding how to organize the team is crucial.
• Centralized corporate structures may place analytics teams under IT or a business intelligence group.
• Effective organization can involve placing analytics teams by business
function.
• Placing the analytics team in a department where the data has immediate value can accelerate time to results.

Different teams, different goals
• An engineering firm may deal with large volumes of unstructured data
for technical analysis.
• Big Data analytics in this context may involve various data sources.
• Adding market data and economic factors may require a different skill
set.
• Different departments within the firm may have varying needs for
analytics.
• As organizations grow and analytics needs increase, roles, processes, and relationships must evolve accordingly.

Don’t forget the data
• Three primary capabilities are essential in a data analytics team:
locating the data, normalizing the data, and analyzing the data.
• For locating the data, the team member must find relevant data from
internal and external sources.
• Normalizing the data involves preparing raw data by removing spurious entries and putting fields into a consistent format (a minimal sketch follows this list).
• Analyzing the data is a critical task for the data scientist.
• The data analytics team’s functions can have subsets of tasks.
• The data analytics team should be adaptable and able to evolve to meet
the changing needs of the business.
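
A minimal sketch of the normalization step in Python, assuming simple dictionary records; the field names and cleaning rules are illustrative, not the source's method:

    def normalize(records):
        """Drop spurious rows and coerce fields into a consistent shape."""
        clean = []
        for rec in records:
            if not rec.get("user_id"):                  # spurious: no usable key
                continue
            rec["region"] = rec.get("region", "unknown").strip().lower()
            try:
                rec["amount"] = float(rec["amount"])    # unify numeric formats
            except (KeyError, TypeError, ValueError):
                continue                                # spurious: unparseable value
            clean.append(rec)
        return clean

    raw = [{"user_id": "u1", "region": " EU ", "amount": "42.5"},
           {"user_id": "", "region": "NA", "amount": "7"}]
    print(normalize(raw))  # only the first record survives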

Big Data Sources
• Finding data sources for analytics processes is a significant challenge
for most organizations, especially when dealing with Big Data.
• Big Data is not just about its size; other considerations play a crucial
role in locating and parsing Big Data sets.
• Identifying usable data is complex and requires detective work to
determine if the data set is appropriate for use in analytics platforms.

Big Data Sources
• Considerations should include the following:
• Structure of the data (structured, unstructured, semistructured, table
based, proprietary)
• Source of the data (internal, external, private, public)
• Value of the data (generic, unique, specialized)
• Quality of the data (verified, static, streaming)
• Storage of the data (remotely accessed, shared, dedicated platforms,
portable)
• Relationship of the data (superset, subset, correlated)

Hunting for data
• Finding data for Big Data analytics involves science, investigation,
and assumptions.
• Obvious data sources include electronic transactions, web logs, and
sensor information.
• Additional data can be collected using network taps and data
replication clients.
• Finding internal data is relatively easy, but it becomes more complex
with unrelated, external, or unstructured data.
• Understanding business analytics (BA) and business intelligence (BI)
processes helps identify how large-scale data sets can interact with
internal data for actionable results.

Big data sources growing
• Multiple sources contribute to the growth of data applicable to Big
Data technology, including new data sources and changes in data
resolution.
• Industry digitization
• Industries like transportation, logistics, retail, utilities, and
telecommunications generate sensor data.
• The health care industry is moving towards electronic medical records
and images for public health monitoring and research.
• Government agencies digitize public records such as census
information, energy usage, budgets, and law enforcement reporting.

Big data sources growing
• The entertainment industry has transitioned to digital recording,
production, and delivery, collecting large amounts of rich content and
user viewing behaviors.
• Life sciences use low-cost gene sequencing, generating massive
amounts of data for genetic analysis and potential treatment
effectiveness.
• Video surveillance is shifting to Internet protocol television cameras
and recording systems, allowing organizations to analyze behavioral
patterns for security and service enhancement.

Big data acquisition
• Barriers to Big Data adoption are primarily cultural, not technological.
• Many organizations fail to implement Big Data programs because they
don’t see how data analytics can improve their core business.
• A data explosion leading to large and difficult-to-manage data sets
triggers the need for Big Data development.
• Managing and analyzing large data sets present challenges, requiring
appropriate training and integration of development and operations
teams.

Big data acquisition
• To get started, it proves helpful to pursue a few ideologies:
• Identify a business problem that leaders can understand and relate to,
capturing their attention.
• Allocate resources to understand how data will be used within the
business, not just focusing on technical data management challenges.
• Define business objectives and questions first, then discover the
necessary data to answer them.
• Understand tools that merge data and business processes to make data
analysis more actionable.
• Build a scalable infrastructure capable of handling data growth and
analysis efficiently.
Big data acquisition
• Choose trustworthy technologies with professional vendor support or be prepared for long-term maintenance.
• Select technology that fits the specific problem; Hadoop is suitable for
large but relatively simple data analysis and text processing.
• Be aware of changing data formats and needs; Big Data adoption can
be driven by not only volume but also variety of data.

The Nuts and Bolts of Big Data
• Assembling a Big Data solution is sort of like putting together an
erector set.
• With Big Data, the components include platform pieces, servers,
virtualization solutions, storage arrays, applications, sensors, and
routing equipment.
• The right pieces must be picked and integrated in a fashion that offers
the best performance, high efficiency, affordability, ease of
management and use, and scalability.

The Storage Dilemma
• Big Data consists of data sets that are too large to be acquired,
handled, analyzed, or stored in an appropriate time frame using the
traditional infrastructures.
• The scale of Big Data directly affects the storage platform that must be put in place, and those deploying storage solutions have to understand how Big Data consumes storage resources.
• Businesses have been compelled to save more data, with the hope that
business intelligence (BI) can leverage the mountains of new data
created every day.
• Organizations are also saving data that have already been analyzed, which can potentially be used for tracking trends in relation to future data collections.
The Storage Dilemma
• Meeting the challenges posed by Big Data means focusing on some
key storage ideologies and understanding how those storage design
elements interact with Big Data demands, including the following:
• Capacity: Big Data can mean petabytes of data. Big Data storage
systems must therefore be able to quickly and easily change scale to
meet the growth of data collections.
• Scale-out storage solutions address this with a clustered architecture, adding capacity node by node.
• Security: Many types of data carry security standards that are driven
by compliance laws and regulations.
• The data may be financial, medical, or government intelligence and
may be part of an analytics set yet still be protected.
The Storage Dilemma
• Latency: In many cases, Big Data employs a real-time component,
especially in use scenarios involving Web transactions or financial
transactions.
• Access: As businesses get a better understanding of the potential of Big Data analysis, the need to compare different data sets increases, and with it, more people are brought into the data-sharing loop.
• Flexibility: Big Data storage infrastructures also need to account for
data migration challenges, at least during the start-up phase.
• Persistence: Big Data applications often involve regulatory
compliance requirements, which dictate that data must be saved for
years or decades.

The Storage Dilemma
• Cost: Big Data can be expensive, and cost containment is crucial for
organizations.
• Storage deduplication is used in primary storage and can bring value to Big Data storage systems (a toy block-level sketch follows this list).
• Reducing capacity consumption by a few percentage points provides a
significant return on investment as data sets grow.
• Thin provisioning allocates disk storage space flexibly among multiple
users based on their minimum space requirements.
• Snapshots streamline data access and aid data recovery; there are
copy-on-write and split-mirror snapshot types.
• Disk cloning involves copying a computer's hard drive contents to
another storage medium.
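
A toy sketch of block-level deduplication in Python (the fixed block size and in-memory store are simplifying assumptions):

    import hashlib

    BLOCK_SIZE = 4096  # illustrative fixed-size blocks

    def dedup_store(data, store):
        """Store each unique block once, keyed by its content hash;
        the 'file' becomes a list of block references."""
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # repeated blocks cost nothing extra
            refs.append(digest)
        return refs

    store = {}
    refs = dedup_store(b"abc" * 10000, store)
    print(len(refs), "references,", len(store), "unique blocks stored")

Restoring the original data is then just a matter of concatenating store[ref] for each reference.
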
The Storage Dilemma
• Cost: Data storage systems now include an archive component, with
tape being the most economical storage medium.
• Systems supporting multi-terabyte cartridges are becoming the standard in many environments.
• Commodity hardware has the biggest impact on cost containment in
Big Data environments.
• The majority of Big Data infrastructures rely on commodity-oriented, cost-saving strategies rather than expensive enterprise-class hardware.
• Many Big Data users build their own "white-box" systems, leveraging
on-site commodity hardware.

The Storage Dilemma
• Cost: The cost-containment trend in Big Data is driving the development of software-based storage products that can be installed on existing systems or off-the-shelf hardware.
• Vendors are offering software technologies as commodity appliances
or partnering with hardware manufacturers to provide cost-saving
solutions.
• Application awareness is becoming common in mainstream storage
systems, improving efficiency and performance for Big Data
environments.
• Big Data and associated analytics are becoming valuable for smaller
organizations, leading to the need for smaller initial implementations
that fit smaller budgets.
Building a Platform
• Big Data application platforms require support for scalability, security,
availability, and continuity.
• These platforms must handle massive data across multiple stores and
enable concurrent processing.
• Essential technologies for Big Data platforms include MapReduce,
integration with NoSQL databases, parallel processing, and distributed
data services.
• The platform should utilize new integration targets, especially from a
development perspective.

Building a Platform
• Consequently, there are specific characteristics and features that a Big Data
platform should offer to work effectively with Big Data analytics processes:
• Support for batch and real-time analytics:
• Existing platforms lack support for business analytics, driving Hadoop's
popularity for batch processing.
• Real-time analytics requires more than Hadoop can provide, necessitating an event-processing framework (a toy windowed counter is sketched after this list).
• Major vendors like Oracle, HP, and IBM offer hardware and software for real-time
processing.
• Real-time analytics is often cost-prohibitive for smaller businesses, leading to
cloud-based solutions.
• Cloud services currently offer real-time processing for smaller businesses, filling
the gap.
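
A toy event-processing primitive, sketched in Python under the assumption that "real time" means aggregating over a sliding window (class and parameter names are illustrative):

    import time
    from collections import deque

    class SlidingWindowCounter:
        """Count events seen in the last window_seconds."""
        def __init__(self, window_seconds):
            self.window = window_seconds
            self.events = deque()  # timestamps of recent events

        def record(self, timestamp):
            self.events.append(timestamp)
            self._evict(timestamp)

        def count(self, now):
            self._evict(now)
            return len(self.events)

        def _evict(self, now):
            while self.events and self.events[0] < now - self.window:
                self.events.popleft()

    counter = SlidingWindowCounter(60.0)
    counter.record(time.time())
    print(counter.count(time.time()))  # events in the last minute -> 1

A production event-processing framework adds distribution, fault tolerance, and back-pressure on top of primitives like this.
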
Building a Platform
• Alternative approaches:
• Mainstream transformation of Big Data application development
involves integrating with NoSQL databases and creating MapReduce
frameworks.
• Consider existing transaction-processing and event-processing semantics
when developing real-time analytics for Big Data.
• Creating Big Data applications differs significantly from traditional
CRUD applications for centralized relational databases.
• Data domain model design, APIs, and query semantics are key
distinctions in Big Data application development.
• Mapping, such as with MapReduce and object-relational tools like
Hibernate, addresses impedance mismatches in data models and sources.
Building a Platform
• Available Big Data mapping tools:
• Hive serves batch-oriented projects by offering an SQL-like facade over complex Hadoop batch processing.
• Alternatives such as JPA, DataNucleus, Bigtable, GigaSpaces, and Hibernate object-grid mapping are emerging for real-time Big Data applications.
• Big Data abstraction tools:
• Various choices for data abstraction are available, including open source tools
and commercial distributions.
• Spring Data by SpringSource is a notable high-level abstraction tool that maps
diverse data stores into a common abstraction through annotations and plugins.
• Abstraction tools help normalize and interpret data into a uniform structure for effective manipulation, ensuring efficiency for current and future data sets; a toy analogue of such a facade is sketched below.
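
A toy Python analogue of what such abstraction tools provide: one facade, interchangeable back ends (the class names are illustrative, not any vendor's API):

    import json, os

    class KeyValueStore:
        """Uniform facade over heterogeneous data stores."""
        def get(self, key): raise NotImplementedError
        def put(self, key, value): raise NotImplementedError

    class MemoryStore(KeyValueStore):
        def __init__(self):
            self._data = {}
        def get(self, key):
            return self._data.get(key)
        def put(self, key, value):
            self._data[key] = value

    class JsonFileStore(KeyValueStore):
        """Same interface, different medium: callers cannot tell them apart."""
        def __init__(self, path):
            self.path = path
        def _load(self):
            if not os.path.exists(self.path):
                return {}
            with open(self.path) as f:
                return json.load(f)
        def get(self, key):
            return self._load().get(key)
        def put(self, key, value):
            data = self._load()
            data[key] = value
            with open(self.path, "w") as f:
                json.dump(data, f)
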
Building a Platform
• Business logic:
• MapReduce plays a crucial role in Big Data analytics by processing massive amounts of data
through parallel distribution of logic across nodes.
• For custom Big Data application platforms, simplifying MapReduce and parallel execution is essential: mapping its semantics into existing programming models makes parallel processing resemble single-job execution (a minimal word-count sketch follows).
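
A minimal single-process sketch of the MapReduce word-count pattern in Python; a real framework such as Hadoop distributes the map and reduce tasks across many nodes, but the programming model looks like this:

    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) pairs, as a mapper would on each input split.
        for word in document.lower().split():
            yield word, 1

    def shuffle(pairs):
        # Group values by key; the framework performs this between phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(key, values):
        # Aggregate all counts for one key, as a reducer would.
        return key, sum(values)

    documents = ["big data needs parallel processing", "big data at scale"]
    pairs = [p for doc in documents for p in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
    print(counts)  # {'big': 2, 'data': 2, ...}
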
• Moving away from SQL:
• SQL is a powerful query language, but its limitations become evident in the context of Big
Data due to reliance on schemas.
• Big Data's unstructured nature conflicts with SQL's schema-based approach, especially for
dynamic data structures.
• Big Data platforms need to support schema-less semantics, necessitating the extension of
the data mapping layer to handle document semantics.
• Examples include MongoDB, Couchbase, Cassandra, and the GigaSpaces document API. The focus should be on providing flexibility in consistency, scalability, and performance (a brief document-store sketch follows).
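
A brief sketch of schema-less document semantics using pymongo, assuming a MongoDB server on localhost (database, collection, and field names are illustrative):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.analytics.events  # created implicitly on first write

    # No schema to declare: each document can carry different fields.
    events.insert_one({"user": "u1", "action": "click", "page": "/home"})
    events.insert_one({"user": "u2", "action": "purchase", "amount": 19.99,
                       "items": ["sku-1", "sku-2"]})

    # Queries match on whatever fields a document happens to have.
    for doc in events.find({"action": "purchase"}):
        print(doc["user"], doc.get("amount"))
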
Building a Platform
• In-memory processing:
• For optimal performance and low latency, utilizing RAM-based devices and
in-memory processing is crucial.
• To make this effective, Big Data platforms must seamlessly integrate RAM and disk-based devices, ensuring that data written in RAM is asynchronously synchronized to disk while presenting a consistent abstraction to users (a minimal write-behind sketch follows).
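
A minimal write-behind sketch in Python: reads and writes hit RAM, while a background thread persists changes asynchronously (the append-only JSON log is a simplifying assumption):

    import json, queue, threading

    class WriteBehindCache:
        def __init__(self, path):
            self.memory = {}              # primary, RAM-based copy
            self.pending = queue.Queue()  # writes awaiting persistence
            self.path = path
            threading.Thread(target=self._flush_loop, daemon=True).start()

        def put(self, key, value):
            self.memory[key] = value        # hot path stays in RAM
            self.pending.put((key, value))  # disk write happens later

        def get(self, key):
            return self.memory.get(key)

        def _flush_loop(self):
            while True:
                key, value = self.pending.get()
                with open(self.path, "a") as f:  # append-only log
                    f.write(json.dumps({key: value}) + "\n")
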
• Built-in support for event-driven data distribution:
• Big Data applications and platforms need to support event-driven processes
with data awareness for efficient message routing based on data affinity and
content.
• Fine-grained controls are required to create semantics for triggering events based on data operations and content, including complex event processing (a minimal affinity-routing sketch follows).
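
A minimal sketch of data-aware routing: the event's key decides which handler (node) receives it, so related data stays together (the field name and two-node setup are assumptions):

    import hashlib

    def route(event, handlers):
        # Deterministic content-based routing: all events for one customer
        # land on the same handler (data affinity).
        digest = hashlib.md5(event["customer_id"].encode()).hexdigest()
        partition = int(digest, 16) % len(handlers)
        handlers[partition](event)

    handlers = [lambda e: print("node-0:", e), lambda e: print("node-1:", e)]
    route({"customer_id": "c42", "op": "update"}, handlers)
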
Building a Platform
• Support for public, private, and hybrid clouds:
• Big Data applications heavily consume compute and storage resources, prompting the use of elastic cloud capabilities for more cost-effective processing.
• Big Data application platforms should feature native support for public, private, and hybrid clouds, allowing seamless transitions between platforms through integration with frameworks such as jclouds and techniques such as cloud bursting.
• Consistent management:
• The typical Big Data application stack consists of multiple layers, including database,
web, processing, caching, synchronization, distribution, and reporting tools.
• Managing these layers is challenging due to their different tools and inherent complexity,
making integrated management critical.
• Choosing a Big Data application platform that integrates both the application and
management stack enhances productivity and streamlines maintenance. Flexibility is
essential given the evolving nature of Big Data tools and technologies.
BRINGING STRUCTURE TO UNSTRUCTURED DATA
• Unstructured Big Data lacks value in its raw form, necessitating processing and
organization for insights.
• Big Data analytics involve converting unstructured data by establishing links,
integrating structured and loosely unstructured data, and employing tools like
linked data, semantics, and text analytics.
• Techniques encompass in-place data transformation, real-time data integration,
linked data and semantics, RDF triple stores, entity extraction, text analytics, and
sentiment analysis.
• Semantic technology exploits linguistics and natural language processing for
entity extraction, enabling comprehensive analytics through unstructured and
transactional data fusion.
• Text analytics tools identify entities, relationships, and sentiment within unstructured data, progressively enhancing accuracy and offering insights into product names, brands, events, skills, and more (a naive sketch follows this list).
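
A deliberately naive sketch of entity extraction and lexicon-based sentiment scoring (real text-analytics tools use linguistics and NLP models, not these toy rules):

    import re

    POSITIVE = {"great", "love", "excellent"}
    NEGATIVE = {"bad", "poor", "hate"}

    def extract_entities(text):
        # Treat runs of capitalized words as candidate entity names.
        return [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+ ?)+", text)]

    def sentiment_score(text):
        words = set(text.lower().split())
        return len(words & POSITIVE) - len(words & NEGATIVE)

    comment = "Acme Phone is excellent but the Acme Store support was poor"
    print(extract_entities(comment), sentiment_score(comment))
    # -> ['Acme Phone', 'Acme Store'] 0  (one positive term, one negative)
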
BRINGING STRUCTURE TO UNSTRUCTURED DATA
• Entity relation extraction identifies consistent relationships between entities
across documents, delivering valuable insights in science and enterprise
contexts.
• Unstructured data tools detect sentiment in social data, facilitate cross-language
integration, and apply text analytics to audio and video transcripts.
• Increasing video content necessitates handling unstructured transcripts, which
pose unique challenges due to the lack of punctuation.
• Semantic Resource Description Framework (RDF) triple stores maintain relationships between data elements in a flexible, extensible manner, allowing new elements without structural changes.
• While NoSQL databases often include key-value, graph, or document stores, the RDF triple store provides a non-relational yet flexible approach, maintaining relationships and supporting inference (a brief rdflib sketch follows).
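
A brief sketch of an RDF triple store using the open-source rdflib library (the namespace and facts are illustrative):

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, EX.worksFor, EX.acme))
    g.add((EX.alice, EX.skill, Literal("genomics")))
    # New kinds of facts need no schema change: just add more triples.
    g.add((EX.acme, EX.locatedIn, EX.boston))

    query = """SELECT ?s WHERE { ?s <http://example.org/worksFor>
                                    <http://example.org/acme> }"""
    for row in g.query(query):
        print(row.s)  # -> http://example.org/alice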
