
Introduction to Big Data

• Definition: Big data refers to extremely large and
complex datasets that exceed the capabilities of
traditional data processing tools.
• Characteristics: The "3Vs" - Volume, Velocity, Variety.

Characteristics of Big Data
• Volume: Massive amounts of data, ranging from
terabytes to petabytes.
• Velocity: Data generated and processed in real-time
or at high speed.
• Variety: Diverse data types and formats, including
structured, semi-structured, and unstructured data.

The case for Big Data
• Building an effective business case for a Big Data project involves
identifying key elements tied directly to a business process.
• Introducing Big Data into an enterprise can be disruptive, impacting scale,
storage, and data center design, resulting in additional costs for hardware,
software, staff, and support.
• Return on investment (ROI) and total cost of ownership (TCO) are crucial
considerations in the business plan.
• To accelerate ROI and reduce TCO, integrating the Big Data project with
existing IT projects driven by business needs can be advantageous.
• Building a business case involves using case scenarios and supporting
information, with many examples and resources available from major
vendors like IBM, Oracle, and HP.

The case for Big Data
• A solid business case for Big Data analytics should cover the following aspects:
• Background: Provide project drivers, align Big Data with business processes, and
clarify the overall goal.
• Benefits Analysis: Identify both tangible and intangible benefits of Big Data,
aligning them with business needs.
• Options: Explore different approaches to Big Data implementation and compare
their pros and cons.
• Scope and Costs: Define the project scope, resource requirements, training, and
associated costs for accurate ROI calculations.
• Risk Analysis: Assess potential risks related to technology, security, compatibility,
integration, and business disruptions.
• ROI: The return on investment is a crucial factor in determining project feasibility and success (a worked example of the arithmetic follows this list).
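
As a rough illustration of the ROI calculation (all figures below are assumed for the example, not taken from any actual project):

    # Illustrative ROI arithmetic with assumed figures
    benefit = 1_200_000          # projected annual gain from the analytics project (assumed)
    tco = 800_000                # total cost of ownership: hardware, software, staff, support (assumed)
    roi = (benefit - tco) / tco  # standard ROI formula
    print(f"ROI = {roi:.0%}")    # -> ROI = 50%
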
Big Data Options
• Traditional data warehousing solutions were not designed to handle the diverse data formats that Big Data encompasses.
• Big Data analytics requires parallel processing across multiple servers.
• Big Data includes various data types like structured, semistructured,
and unstructured data from sources such as log files, machine-
generated data, and social network comments.
• Moore’s Law contributes to the exponential growth of data as
processor and server capabilities increase.
• Big Data solutions gather and interact with all generated data,
allowing administrators and analysts to explore data usage later.
• Hadoop, the leading open-source Big Data platform, grew out of Google's MapReduce and GFS papers and was developed into production use at Yahoo.
The Team Challenge
• Finding and hiring skilled analytics professionals.
• Deciding how to organize the team is crucial.
• Centralized corporate structures may place analytics teams under IT or a business intelligence group.
• Effective organization can involve placing analytics teams by business
function.
• Placing the analytics team in a department where the data has immediate value can accelerate time to results.

Different teams, different goals
• An engineering firm may deal with large volumes of unstructured data
for technical analysis.
• Big Data analytics in this context may involve various data sources.
• Adding market data and economic factors may require a different skill
set.
• Different departments within the firm may have varying needs for
analytics.
• As organizations grow and analytics needs increase, roles, processes, and relationships must evolve accordingly.

Don’t forget the data
• Three primary capabilities are essential in a data analytics team:
locating the data, normalizing the data, and analyzing the data.
• For locating the data, the team member must find relevant data from
internal and external sources.
• Normalizing the data involves preparing raw data by removing spurious entries and putting fields into a consistent format (a minimal sketch follows this list).
• Analyzing the data is a critical task for the data scientist.
• The data analytics team’s functions can have subsets of tasks.
• The data analytics team should be adaptable and able to evolve to meet
the changing needs of the business.
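
A minimal sketch of the normalization step in Python, assuming simple dictionary records; the field names and cleaning rules are illustrative, not the source's method:

    def normalize(records):
        """Drop spurious rows and coerce fields into a consistent shape."""
        clean = []
        for rec in records:
            if not rec.get("user_id"):                  # spurious: no usable key
                continue
            rec["region"] = rec.get("region", "unknown").strip().lower()
            try:
                rec["amount"] = float(rec["amount"])    # unify numeric formats
            except (KeyError, TypeError, ValueError):
                continue                                # spurious: unparseable value
            clean.append(rec)
        return clean

    raw = [{"user_id": "u1", "region": " EU ", "amount": "42.5"},
           {"user_id": "", "region": "NA", "amount": "7"}]
    print(normalize(raw))  # only the first record survives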

Big Data Sources
• Finding data sources for analytics processes is a significant challenge
for most organizations, especially when dealing with Big Data.
• Big Data is not just about its size; other considerations play a crucial
role in locating and parsing Big Data sets.
• Identifying usable data is complex and requires detective work to
determine if the data set is appropriate for use in analytics platforms.

Big Data Sources
• Considerations should include the following:
• Structure of the data (structured, unstructured, semistructured, table
based, proprietary)
• Source of the data (internal, external, private, public)
• Value of the data (generic, unique, specialized)
• Quality of the data (verified, static, streaming)
• Storage of the data (remotely accessed, shared, dedicated platforms,
portable)
• Relationship of the data (superset, subset, correlated)

Hunting for data
• Finding data for Big Data analytics involves science, investigation,
and assumptions.
• Obvious data sources include electronic transactions, web logs, and
sensor information.
• Additional data can be collected using network taps and data
replication clients.
• Finding internal data is relatively easy, but it becomes more complex
with unrelated, external, or unstructured data.
• Understanding business analytics (BA) and business intelligence (BI)
processes helps identify how large-scale data sets can interact with
internal data for actionable results.

Big data sources growing
• Multiple sources contribute to the growth of data applicable to Big
Data technology, including new data sources and changes in data
resolution.
• Industry digitization
• Industries like transportation, logistics, retail, utilities, and
telecommunications generate sensor data.
• The health care industry is moving towards electronic medical records
and images for public health monitoring and research.
• Government agencies digitize public records such as census
information, energy usage, budgets, and law enforcement reporting.

Big data sources growing
• The entertainment industry has transitioned to digital recording,
production, and delivery, collecting large amounts of rich content and
user viewing behaviors.
• Life sciences use low-cost gene sequencing, generating massive
amounts of data for genetic analysis and potential treatment
effectiveness.
• Video surveillance is shifting to Internet protocol television cameras
and recording systems, allowing organizations to analyze behavioral
patterns for security and service enhancement.

Big data acquisition
• Barriers to Big Data adoption are primarily cultural, not technological.
• Many organizations fail to implement Big Data programs because they
don’t see how data analytics can improve their core business.
• A data explosion leading to large and difficult-to-manage data sets
triggers the need for Big Data development.
• Managing and analyzing large data sets present challenges, requiring
appropriate training and integration of development and operations
teams.

Big data acquisition
• To get started, it proves helpful to pursue a few ideologies:
• Identify a business problem that leaders can understand and relate to,
capturing their attention.
• Allocate resources to understand how data will be used within the
business, not just focusing on technical data management challenges.
• Define business objectives and questions first, then discover the
necessary data to answer them.
• Understand tools that merge data and business processes to make data
analysis more actionable.
• Build a scalable infrastructure capable of handling data growth and
analysis efficiently.
Big data acquisition
• Choose trustworthy technologies with professional vendor support or be prepared for long-term maintenance.
• Select technology that fits the specific problem; Hadoop is suitable for
large but relatively simple data analysis and text processing.
• Be aware of changing data formats and needs; Big Data adoption can
be driven by not only volume but also variety of data.

The Nuts and Bolts of Big Data
• Assembling a Big Data solution is sort of like putting together an
erector set.
• With Big Data, the components include platform pieces, servers,
virtualization solutions, storage arrays, applications, sensors, and
routing equipment.
• The right pieces must be picked and integrated in a fashion that offers
the best performance, high efficiency, affordability, ease of
management and use, and scalability.

The Storage Dilemma
• Big Data consists of data sets that are too large to be acquired,
handled, analyzed, or stored in an appropriate time frame using the
traditional infrastructures.
• The scale of Big Data directly affects the storage platform that must be put in place, and those deploying storage solutions have to understand how Big Data consumes storage resources.
• Businesses have been compelled to save more data, with the hope that
business intelligence (BI) can leverage the mountains of new data
created every day.
• Organizations are also saving data that have already been analyzed, which can potentially be used for tracking trends in relation to future data collections.
The Storage Dilemma
• Meeting the challenges posed by Big Data means focusing on some
key storage ideologies and understanding how those storage design
elements interact with Big Data demands, including the following:
• Capacity: Big Data can mean petabytes of data. Big Data storage
systems must therefore be able to quickly and easily change scale to
meet the growth of data collections.
• Scale-out storage solutions address this with a clustered architecture, adding capacity node by node.
• Security: Many types of data carry security standards that are driven
by compliance laws and regulations.
• The data may be financial, medical, or government intelligence and
may be part of an analytics set yet still be protected.
The Storage Dilemma
• Latency: In many cases, Big Data employs a real-time component,
especially in use scenarios involving Web transactions or financial
transactions.
• Access: As businesses get a better understanding of the potential of Big Data analysis, the need to compare different data sets increases, and with it, more people are brought into the data-sharing loop.
• Flexibility: Big Data storage infrastructures also need to account for
data migration challenges, at least during the start-up phase.
• Persistence: Big Data applications often involve regulatory
compliance requirements, which dictate that data must be saved for
years or decades.

The Storage Dilemma
• Cost: Big Data can be expensive, and cost containment is crucial for
organizations.
• Storage deduplication is used in primary storage and can bring value to Big Data storage systems (a toy block-level sketch follows this list).
• Reducing capacity consumption by a few percentage points provides a
significant return on investment as data sets grow.
• Thin provisioning allocates disk storage space flexibly among multiple
users based on their minimum space requirements.
• Snapshots streamline data access and aid data recovery; there are
copy-on-write and split-mirror snapshot types.
• Disk cloning involves copying a computer's hard drive contents to
another storage medium.
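
A toy sketch of block-level deduplication in Python (the fixed block size and in-memory store are simplifying assumptions):

    import hashlib

    BLOCK_SIZE = 4096  # illustrative fixed-size blocks

    def dedup_store(data, store):
        """Store each unique block once, keyed by its content hash;
        the 'file' becomes a list of block references."""
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # repeated blocks cost nothing extra
            refs.append(digest)
        return refs

    store = {}
    refs = dedup_store(b"abc" * 10000, store)
    print(len(refs), "references,", len(store), "unique blocks stored")

Restoring the original data is then just a matter of concatenating store[ref] for each reference.
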
The Storage Dilemma
• Cost: Data storage systems now include an archive component, with
tape being the most economical storage medium.
• Systems supporting multi-terabyte cartridges are becoming the standard in many environments.
• Commodity hardware has the biggest impact on cost containment in
Big Data environments.
• The majority of Big Data infrastructures rely on commodity-oriented, cost-saving strategies rather than expensive enterprise-class hardware.
• Many Big Data users build their own "white-box" systems, leveraging
on-site commodity hardware.

The Storage Dilemma
• Cost: The cost-containment trend in Big Data is driving the development of software-based storage products that can be installed on existing systems or off-the-shelf hardware.
• Vendors are offering software technologies as commodity appliances
or partnering with hardware manufacturers to provide cost-saving
solutions.
• Application awareness is becoming common in mainstream storage
systems, improving efficiency and performance for Big Data
environments.
• Big Data and associated analytics are becoming valuable for smaller
organizations, leading to the need for smaller initial implementations
that fit smaller budgets.
Building a Platform
• Big Data application platforms require support for scalability, security,
availability, and continuity.
• These platforms must handle massive data across multiple stores and
enable concurrent processing.
• Essential technologies for Big Data platforms include MapReduce,
integration with NoSQL databases, parallel processing, and distributed
data services.
• The platform should utilize new integration targets, especially from a
development perspective.

Building a Platform
• Consequently, there are specific characteristics and features that a Big Data
platform should offer to work effectively with Big Data analytics processes:
• Support for batch and real-time analytics:
• Existing platforms lack support for business analytics, driving Hadoop's
popularity for batch processing.
• Real-time analytics requires more than Hadoop can provide, necessitating an event-processing framework (a toy windowed counter is sketched after this list).
• Major vendors like Oracle, HP, and IBM offer hardware and software for real-time
processing.
• Real-time analytics is often cost-prohibitive for smaller businesses, leading to
cloud-based solutions.
• Cloud services currently offer real-time processing for smaller businesses, filling
the gap.
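
A toy event-processing primitive, sketched in Python under the assumption that "real time" means aggregating over a sliding window (class and parameter names are illustrative):

    import time
    from collections import deque

    class SlidingWindowCounter:
        """Count events seen in the last window_seconds."""
        def __init__(self, window_seconds):
            self.window = window_seconds
            self.events = deque()  # timestamps of recent events

        def record(self, timestamp):
            self.events.append(timestamp)
            self._evict(timestamp)

        def count(self, now):
            self._evict(now)
            return len(self.events)

        def _evict(self, now):
            while self.events and self.events[0] < now - self.window:
                self.events.popleft()

    counter = SlidingWindowCounter(60.0)
    counter.record(time.time())
    print(counter.count(time.time()))  # events in the last minute -> 1

A production event-processing framework adds distribution, fault tolerance, and back-pressure on top of primitives like this.
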
Building a Platform
• Alternative approaches:
• Mainstream transformation of Big Data application development
involves integrating with NoSQL databases and creating MapReduce
frameworks.
• Consider existing transaction-processing and event-processing semantics
when developing real-time analytics for Big Data.
• Creating Big Data applications differs significantly from traditional
CRUD applications for centralized relational databases.
• Data domain model design, APIs, and query semantics are key
distinctions in Big Data application development.
• Mapping, such as with MapReduce and object-relational tools like
Hibernate, addresses impedance mismatches in data models and sources.
Building a Platform
• Available Big Data mapping tools:
• Hive serves batch-oriented projects by offering an SQL-like facade over complex Hadoop batch processing.
• Alternatives such as JPA, DataNucleus, Bigtable, GigaSpaces, and Hibernate object-grid mapping are emerging for real-time Big Data applications.
• Big Data abstraction tools:
• Various choices for data abstraction are available, including open source tools
and commercial distributions.
• Spring Data by SpringSource is a notable high-level abstraction tool that maps
diverse data stores into a common abstraction through annotations and plugins.
• Abstraction tools help normalize and interpret data into a uniform structure for effective manipulation, ensuring efficiency for current and future data sets; a toy analogue of such a facade is sketched below.
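
A toy Python analogue of what such abstraction tools provide: one facade, interchangeable back ends (the class names are illustrative, not any vendor's API):

    import json, os

    class KeyValueStore:
        """Uniform facade over heterogeneous data stores."""
        def get(self, key): raise NotImplementedError
        def put(self, key, value): raise NotImplementedError

    class MemoryStore(KeyValueStore):
        def __init__(self):
            self._data = {}
        def get(self, key):
            return self._data.get(key)
        def put(self, key, value):
            self._data[key] = value

    class JsonFileStore(KeyValueStore):
        """Same interface, different medium: callers cannot tell them apart."""
        def __init__(self, path):
            self.path = path
        def _load(self):
            if not os.path.exists(self.path):
                return {}
            with open(self.path) as f:
                return json.load(f)
        def get(self, key):
            return self._load().get(key)
        def put(self, key, value):
            data = self._load()
            data[key] = value
            with open(self.path, "w") as f:
                json.dump(data, f)
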
Building a Platform
• Business logic:
• MapReduce plays a crucial role in Big Data analytics by processing massive amounts of data
through parallel distribution of logic across nodes.
• For custom Big Data application platforms, simplifying MapReduce and parallel execution is essential: mapping its semantics into existing programming models makes parallel processing resemble single-job execution (a minimal word-count sketch follows).
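
A minimal single-process sketch of the MapReduce word-count pattern in Python; a real framework such as Hadoop distributes the map and reduce tasks across many nodes, but the programming model looks like this:

    from collections import defaultdict

    def map_phase(document):
        # Emit (word, 1) pairs, as a mapper would on each input split.
        for word in document.lower().split():
            yield word, 1

    def shuffle(pairs):
        # Group values by key; the framework performs this between phases.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    def reduce_phase(key, values):
        # Aggregate all counts for one key, as a reducer would.
        return key, sum(values)

    documents = ["big data needs parallel processing", "big data at scale"]
    pairs = [p for doc in documents for p in map_phase(doc)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs))
    print(counts)  # {'big': 2, 'data': 2, ...}
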
• Moving away from SQL:
• SQL is a powerful query language, but its limitations become evident in the context of Big
Data due to reliance on schemas.
• Big Data's unstructured nature conflicts with SQL's schema-based approach, especially for
dynamic data structures.
• Big Data platforms need to support schema-less semantics, necessitating the extension of
the data mapping layer to handle document semantics.
• Examples include MongoDB, Couchbase, Cassandra, and the GigaSpaces document API. The focus should be on providing flexibility in consistency, scalability, and performance (a brief document-store sketch follows).
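
A brief sketch of schema-less document semantics using pymongo, assuming a MongoDB server on localhost (database, collection, and field names are illustrative):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    events = client.analytics.events  # created implicitly on first write

    # No schema to declare: each document can carry different fields.
    events.insert_one({"user": "u1", "action": "click", "page": "/home"})
    events.insert_one({"user": "u2", "action": "purchase", "amount": 19.99,
                       "items": ["sku-1", "sku-2"]})

    # Queries match on whatever fields a document happens to have.
    for doc in events.find({"action": "purchase"}):
        print(doc["user"], doc.get("amount"))
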
Building a Platform
• In-memory processing:
• For optimal performance and low latency, utilizing RAM-based devices and
in-memory processing is crucial.
• To make this effective, Big Data platforms must seamlessly integrate RAM and disk-based devices, ensuring that data written in RAM is asynchronously synchronized to disk while presenting a consistent abstraction to users (a minimal write-behind sketch follows).
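
A minimal write-behind sketch in Python: reads and writes hit RAM, while a background thread persists changes asynchronously (the append-only JSON log is a simplifying assumption):

    import json, queue, threading

    class WriteBehindCache:
        def __init__(self, path):
            self.memory = {}              # primary, RAM-based copy
            self.pending = queue.Queue()  # writes awaiting persistence
            self.path = path
            threading.Thread(target=self._flush_loop, daemon=True).start()

        def put(self, key, value):
            self.memory[key] = value        # hot path stays in RAM
            self.pending.put((key, value))  # disk write happens later

        def get(self, key):
            return self.memory.get(key)

        def _flush_loop(self):
            while True:
                key, value = self.pending.get()
                with open(self.path, "a") as f:  # append-only log
                    f.write(json.dumps({key: value}) + "\n")
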
• Built-in support for event-driven data distribution:
• Big Data applications and platforms need to support event-driven processes
with data awareness for efficient message routing based on data affinity and
content.
• Fine-grained controls are required to create semantics for triggering events based on data operations and content, including complex event processing (a minimal affinity-routing sketch follows).
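
A minimal sketch of data-aware routing: the event's key decides which handler (node) receives it, so related data stays together (the field name and two-node setup are assumptions):

    import hashlib

    def route(event, handlers):
        # Deterministic content-based routing: all events for one customer
        # land on the same handler (data affinity).
        digest = hashlib.md5(event["customer_id"].encode()).hexdigest()
        partition = int(digest, 16) % len(handlers)
        handlers[partition](event)

    handlers = [lambda e: print("node-0:", e), lambda e: print("node-1:", e)]
    route({"customer_id": "c42", "op": "update"}, handlers)
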
Building a Platform
• Support for public, private, and hybrid clouds:
• Big Data applications heavily consume compute and storage resources, prompting the use of elastic cloud capabilities for more cost-effective processing.
• Big Data application platforms should feature native support for public, private, and hybrid clouds, allowing seamless transitions between platforms through integration with frameworks such as jclouds and techniques such as cloud bursting.
• Consistent management:
• The typical Big Data application stack consists of multiple layers, including database,
web, processing, caching, synchronization, distribution, and reporting tools.
• Managing these layers is challenging due to their different tools and inherent complexity,
making integrated management critical.
• Choosing a Big Data application platform that integrates both the application and
management stack enhances productivity and streamlines maintenance. Flexibility is
essential given the evolving nature of Big Data tools and technologies.
BRINGING STRUCTURE TO UNSTRUCTURED DATA
• Unstructured Big Data lacks value in its raw form, necessitating processing and
organization for insights.
• Big Data analytics involve converting unstructured data by establishing links,
integrating structured and loosely unstructured data, and employing tools like
linked data, semantics, and text analytics.
• Techniques encompass in-place data transformation, real-time data integration,
linked data and semantics, RDF triple stores, entity extraction, text analytics, and
sentiment analysis.
• Semantic technology exploits linguistics and natural language processing for
entity extraction, enabling comprehensive analytics through unstructured and
transactional data fusion.
• Text analytics tools identify entities, relationships, and sentiment within unstructured data, progressively enhancing accuracy and offering insights into product names, brands, events, skills, and more (a naive sketch follows this list).
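
A deliberately naive sketch of entity extraction and lexicon-based sentiment scoring (real text-analytics tools use linguistics and NLP models, not these toy rules):

    import re

    POSITIVE = {"great", "love", "excellent"}
    NEGATIVE = {"bad", "poor", "hate"}

    def extract_entities(text):
        # Treat runs of capitalized words as candidate entity names.
        return [m.strip() for m in re.findall(r"(?:[A-Z][a-z]+ ?)+", text)]

    def sentiment_score(text):
        words = set(text.lower().split())
        return len(words & POSITIVE) - len(words & NEGATIVE)

    comment = "Acme Phone is excellent but the Acme Store support was poor"
    print(extract_entities(comment), sentiment_score(comment))
    # -> ['Acme Phone', 'Acme Store'] 0  (one positive term, one negative)
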
BRINGING STRUCTURE TO UNSTRUCTURED DATA
• Entity relation extraction identifies consistent relationships between entities
across documents, delivering valuable insights in science and enterprise
contexts.
• Unstructured data tools detect sentiment in social data, facilitate cross-language
integration, and apply text analytics to audio and video transcripts.
• Increasing video content necessitates handling unstructured transcripts, which
pose unique challenges due to the lack of punctuation.
• Semantic Resource Description Framework (RDF) triple stores maintain relationships between data elements in a flexible, extensible manner, allowing new elements without structural changes.
• While NoSQL databases often include key-value, graph, or document stores, the RDF triple store provides a non-relational yet flexible approach, maintaining relationships and supporting inference (a brief rdflib sketch follows).
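
A brief sketch of an RDF triple store using the open-source rdflib library (the namespace and facts are illustrative):

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, EX.worksFor, EX.acme))
    g.add((EX.alice, EX.skill, Literal("genomics")))
    # New kinds of facts need no schema change: just add more triples.
    g.add((EX.acme, EX.locatedIn, EX.boston))

    query = """SELECT ?s WHERE { ?s <http://example.org/worksFor>
                                    <http://example.org/acme> }"""
    for row in g.query(query):
        print(row.s)  # -> http://example.org/alice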
