
Internal 3 Answers

Big Data Analytics

Part A
1. What is Series data?
A Series is a one-dimensional labeled array capable of holding data of any
type (integer, string, float, Python objects, etc.). The axis labels are collectively
called the index.
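A minimal sketch of a Series, assuming the pandas library (whose Series type matches the definition above):

    import pandas as pd

    # A Series of integers whose axis labels ("a", "b", "c") form the index
    s = pd.Series([10, 20, 30], index=["a", "b", "c"])

    print(s["b"])   # access by label -> 20
    print(s.index)  # Index(['a', 'b', 'c'], dtype='object')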
2. What are the various applications of big data analytics?
• Banking and finance. Big data supports in-depth analysis of
customer portfolios through trained models.
• Healthcare. The healthcare sector is actively trying out multiple
approaches to tap into AI and machine learning.
• Retail and logistics. Major retailers utilize data analytics to manage
supply chains, optimize inventory, and more.
• Travel. Customized ticketing platforms for every form of travel are
well ingrained, with AI and machine learning behind them.

3. What are the types of data streams?

• Sensor readings from machines.
• E-commerce purchase data.
• Stock exchange data used to predict stock prices.
• Credit card transactions for fraud detection.
• Social media data for sentiment analysis.

4. What are the four main steps of sentiment analysis?

• Step 1: Cleaning. The initial step is to remove special characters and
numbers from the text.
• Step 2: Tokenization. Tokenization is the process of breaking down a
text into smaller chunks called tokens, which are typically words or phrases.
• Step 3: Part-of-speech (POS) tagging. Part-of-speech tagging is the
process of tagging each word with its grammatical category, such as noun, verb, or adjective.
• Step 4: Removing stop words. Stop words are common words like ‘have,’ ‘but,’
‘we,’ ‘he,’ ‘into,’ ‘just,’ and so on.
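The four steps can be sketched with the NLTK library (an assumption, since the answer above names no specific toolkit; the sample sentence is hypothetical, and the tokenizer, tagger, and stop word corpora must be downloaded once with nltk.download before first use):

    import re
    import nltk
    from nltk.corpus import stopwords

    # One-time downloads: nltk.download('punkt'), nltk.download('stopwords'),
    # nltk.download('averaged_perceptron_tagger')

    text = "The movie was great, 10/10!"

    # Step 1: Cleaning - remove special characters and numbers
    cleaned = re.sub(r"[^A-Za-z\s]", "", text)

    # Step 2: Tokenization - break the text into word tokens
    tokens = nltk.word_tokenize(cleaned.lower())

    # Step 3: POS tagging - label each token with its grammatical category
    tagged = nltk.pos_tag(tokens)

    # Step 4: Stop word removal - drop common low-information words
    stops = set(stopwords.words("english"))
    content_words = [word for word, tag in tagged if word not in stops]

    print(content_words)  # ['movie', 'great']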
5. What is Mining Time?

Mining Time is a game where users mine at a rock to earn money. It is
almost too simple, as users only need to tap on the screen to see their in-game
currency shoot up.

6. What is an aggregate data model in NoSQL?

An aggregate is a collection of objects that are treated as a unit. In
NoSQL databases, an aggregate is a collection of related data that is stored
and manipulated as a unit.
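For illustration, a hypothetical order aggregate (the field names are invented, not from the answer above) groups a customer's order, its line items, and the shipping address into one unit that is read and written together:

    # One "order" aggregate: a document database stores and retrieves
    # this whole nested structure as a single unit, with no joins.
    order = {
        "order_id": 1001,
        "customer": {"id": 42, "name": "Alice"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.99},
            {"sku": "B-7", "qty": 1, "price": 24.50},
        ],
        "shipping_address": {"city": "Pune", "zip": "411001"},
    }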

7. What is Hadoop integration?

Hadoop is an open-source software framework essential for systems
designed for big data analytics; it supports the processing and storage of
extremely large data sets in a distributed computing environment.
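A small sketch of interacting with Hadoop's distributed file system from Python by shelling out to the standard hdfs dfs command-line tool (this assumes a running Hadoop installation with hdfs on the PATH; the file and directory names are hypothetical):

    import subprocess

    # Copy a local file into HDFS, where it is split into blocks and
    # replicated across the cluster's data nodes.
    subprocess.run(["hdfs", "dfs", "-put", "events.log", "/data/raw/"], check=True)

    # List the directory to confirm the upload.
    subprocess.run(["hdfs", "dfs", "-ls", "/data/raw/"], check=True)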

8. What are the main reasons for developing Pig Latin?

Pig Latin is the scripting language of Apache Pig, developed to make
large-scale data processing on Hadoop easier than writing raw MapReduce
jobs. Its main design goals are:
• Ease of programming: complex data processing tasks can be expressed as
simple data flow sequences of a few lines.
• Optimization opportunities: because scripts describe what to compute
rather than how, the system can optimize their execution automatically.
• Extensibility: users can write their own functions (UDFs) to do
special-purpose processing.

9. Define HiveQL.

HiveQL is the Hive query language. Like all SQL dialects in
widespread use, it does not fully conform to any particular revision of the ANSI
SQL standard. It is perhaps closest to MySQL’s dialect, but with significant
differences.
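As an illustration, HiveQL can be issued from Python with the PyHive client (an assumption; the host, port, and page_views table are hypothetical):

    from pyhive import hive

    # Connect to a HiveServer2 instance
    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    # HiveQL reads much like SQL: group rows in a table by URL
    cursor.execute("SELECT url, COUNT(*) AS views FROM page_views GROUP BY url")
    for url, views in cursor.fetchall():
        print(url, views)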

10. How does a Cassandra client work?

Cassandra uses a peer-to-peer architecture, with each node
connected to all other nodes. Each Cassandra node performs all database
operations and can serve client requests without the need for a master node.
As a result of this peer-to-peer distributed architecture, a Cassandra cluster
has no single point of failure.
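A minimal client sketch with the DataStax Python driver (an assumption; the keyspace and table are hypothetical). Because every node can coordinate requests, the driver only needs one contact point to discover the whole ring:

    from cassandra.cluster import Cluster

    # Any node can act as coordinator; the driver discovers the rest
    # of the peer-to-peer ring from this contact point.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("shop")  # 'shop' keyspace is hypothetical

    rows = session.execute("SELECT id, name FROM customers LIMIT 5")
    for row in rows:
        print(row.id, row.name)

    cluster.shutdown()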

Part B

11. Describe the streaming data model and architecture.

Building a modern streaming architecture ensures flexibility, enabling you to
support a diverse set of use cases. It abstracts the complexity of traditional data
processing architecture into a single self-service solution capable of
transforming event streams into an analytics-ready data warehouse, and it
makes it easier to keep pace with innovation and stay ahead of the competition.


The sections below explain how to decide which popular stream
processing tools to choose for a given business scenario, given the proliferation
of new databases and analytics tools.


What is Streaming Data Architecture?

A streaming data architecture is capable of ingesting and processing massive
amounts of streaming data from a variety of different sources. While traditional
data solutions focused on batch writing and reading, a streaming data
architecture consumes data as it is produced, persists it to storage, and may
perform real-time processing, data manipulation, and data analysis.

Initially, stream processing was considered a niche technology. Today it's
difficult to find a modern business that does not have an app, online advertising,
an e-commerce site, or products enabled by the Internet of Things. Each of
these digital assets generates real-time event data streams. There is a growing
appetite for implementing a streaming data infrastructure that enables complex,
powerful, and real-time analytics.

Streaming data architecture helps to develop applications that use both bounded
and unbounded data in new ways. For example, Netflix uses Kafka streams to
support its recommendation engines, combining streamed data and machine
learning.

PRO TIP: Data streaming architecture is also referred to as Kappa architecture.
It is an alternative to Lambda architecture, which separates slow batch
processing from fast real-time data access. Data streaming technologies like
Apache Kafka or Apache Flink enable near-real-time processing of incoming
data events; this approach allows Kappa architecture to combine batch and
streaming processing into the same data flow.
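As a concrete sketch of this event flow, the kafka-python client can publish and consume a stream (an assumption; the broker address and clickstream topic are hypothetical):

    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Publish click events to a topic as they happen
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("clickstream", {"user": 42, "page": "/pricing"})
    producer.flush()

    # Consume the same stream continuously, rather than in batches
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)  # {'user': 42, 'page': '/pricing'}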

Streaming Data Architecture Use Cases

At smaller scales, traditional batch architectures may suffice. However,
streaming sources such as sensors, server and security logs, real-time
advertising, and clickstream data from apps and websites can generate up to a
gigabyte of events per second.

Stream processing is becoming a vital part of many enterprise data
infrastructures. For example, companies can use clickstream analytics to track
web visitor behavior and tailor their content, while e-commerce historical data
analytics can help retailers prevent shopping cart abandonment and show
customers more relevant offers. Another common use case is Internet of Things
(IoT) data analysis, which involves analyzing large streams of data from sensors
and connected devices.

Benefits of data stream processing

Stream processing provides several benefits that other data platforms cannot:

• Handling never-ending streams of events natively, reducing the overhead
and delay associated with batching events. Batch processing tools require
pausing the stream of events, gathering batches of data, and integrating
the batches to reach a conclusion. While it is difficult to aggregate and
capture data from numerous streams, stream processing enables you
to gain instant insights from massive amounts of streaming data.
• Processing in real time or near-real time for up-to-the-minute data
analytics and insight: for example, dashboards that indicate machine
performance, just-in-time delivery of micro-targeted adverts or
support, or detection of fraud or cybersecurity breaches.
• Detecting patterns in time-series data. Detection of patterns over time,
such as trends in website traffic statistics, requires data to be processed
and analyzed continually. This is made more complex by batch
processing, which divides data into batches, so that certain
occurrences are split across two or more batches.
• Simplified data scalability. Growing data volumes might overwhelm a
batch processing system, necessitating the addition of more
resources or a redesign of the architecture. Modern stream processing
infrastructure is hyper-scalable, with a single stream processor capable of
processing gigabytes of data per second.

Developing a streaming architecture is a difficult task that is best accomplished
by adding software components specific to each use case, hence the need to
"architect" a common solution capable of handling the majority, if not all, of the
envisioned use cases.

12 a) Explain the introduction to NoSQL and its properties

NoSQL (originally referring to "non-SQL" or "non-relational") is a database
that provides a mechanism for the storage and retrieval of data that is
modeled in means other than the tabular relations used in relational databases.
Such databases came into existence in the late 1960s, but did not obtain the
NoSQL moniker until a surge of popularity in the early twenty-first century.
NoSQL databases are used in real-time web applications and big data, and their
use is increasing over time. NoSQL systems are also sometimes called "Not
only SQL" to emphasize the fact that they may support SQL-like query
languages.
NoSQL databases offer simplicity of design, simpler horizontal scaling to
clusters of machines, and finer control over availability. The data structures
used by NoSQL databases are different from those used by default in relational
databases, which makes some operations faster in NoSQL. The suitability of a
given NoSQL database depends on the problem it has to solve. The data
structures used by NoSQL databases are also sometimes viewed as more
flexible than relational database tables.
Many NoSQL stores compromise consistency in favor of availability, speed,
and partition tolerance. Barriers to the greater adoption of NoSQL stores
include the use of low-level query languages, the lack of standardized
interfaces, and huge previous investments in existing relational databases. Most
NoSQL stores lack true ACID (Atomicity, Consistency, Isolation, Durability)
transactions, but a few databases, such as MarkLogic, Aerospike, FairCom c-
treeACE, Google Spanner (though technically a NewSQL database), Symas
LMDB, and OrientDB, have made them central to their designs.
Most NoSQL databases offer a concept of eventual consistency, in which
database changes are propagated to all nodes eventually, so queries for data
might not return updated data immediately or might read data that is not
accurate, a problem known as stale reads. Some NoSQL systems may also
exhibit lost writes and other forms of data loss; some provide concepts such as
write-ahead logging to avoid it. For distributed transaction processing across
multiple databases, data consistency is an even bigger challenge, and it is
difficult for both NoSQL and relational databases. Even current relational
databases do not allow referential integrity constraints to span databases. Few
systems maintain both X/Open XA standards and ACID transactions for
distributed transaction processing.
Advantages of NoSQL:
There are many advantages of working with NoSQL databases such as
MongoDB and Cassandra. The main advantages are high scalability and high
availability.
1. High scalability –
NoSQL databases use sharding for horizontal scaling. Sharding is the
partitioning of data and its placement on multiple machines in such a
way that the order of the data is preserved. Vertical scaling means
adding more resources to the existing machine, whereas horizontal
scaling means adding more machines to handle the data. Vertical
scaling is not that easy to implement, but horizontal scaling is.
Examples of horizontally scaling databases are MongoDB,
Cassandra, etc. NoSQL can handle huge amounts of data because of
this scalability; as the data grows, NoSQL scales itself to handle that
data in an efficient manner.
2. High availability –
The auto-replication feature in NoSQL databases makes them highly
available, because in case of any failure data replicates itself to the
previous consistent state.
Disadvantages of NoSQL:
NoSQL has the following disadvantages.
1. Narrow focus –
NoSQL databases have a very narrow focus: they are mainly designed
for storage and provide very little functionality beyond it. Relational
databases are a better choice in the field of transaction management
than NoSQL.
2. Open source –
NoSQL databases are open source, and there is no reliable standard
for NoSQL yet; in other words, two database systems are likely to be
unequal.
3. Management challenge –
The purpose of big data tools is to make the management of a large
amount of data as simple as possible, but it is not so easy. Data
management in NoSQL is much more complex than in a relational
database. NoSQL, in particular, has a reputation for being
challenging to install and even more hectic to manage on a daily
basis.
4. GUI not available –
Flexible GUI tools to access NoSQL databases are not widely
available in the market.
5. Backup –
Backup is a great weak point for some NoSQL databases like
MongoDB, which has no approach for backing up data in a
consistent manner.
6. Large document size –
Some database systems, like MongoDB and CouchDB, store data in
JSON format, which means documents are quite large (big data,
network bandwidth, speed), and having descriptive key names
actually hurts, since they increase the document size.
Types of NoSQL databases:
The types of NoSQL databases, and the database systems that fall into each
category, are:
1. Key-value store: Memcached, Redis, Coherence
2. Tabular: HBase, Big Table, Accumulo
3. Document based: MongoDB, CouchDB, Cloudant
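As an illustration of the document-based category, a minimal PyMongo sketch (the connection address and collection are assumptions) stores a schema-flexible document and reads it back:

    from pymongo import MongoClient

    # Connect to a local MongoDB instance
    client = MongoClient("mongodb://localhost:27017")
    db = client["shop"]

    # Documents in one collection need not share a schema
    db.customers.insert_one({"name": "Alice", "tags": ["vip"], "visits": 3})

    doc = db.customers.find_one({"name": "Alice"})
    print(doc["tags"])  # ['vip']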

12 b) What is the Pig data model, and what are the data types in Pig?

Data model

Pig is a high-level platform or tool which is used to process large
datasets. It provides a high level of abstraction for processing over
MapReduce, along with a high-level scripting language, known as Pig Latin,
which is used to develop data analysis code.

Pig data types: These cover the kinds of values Pig works with and how Pig
handles concepts such as missing data, and they help us describe our data to Pig.
The data types of Pig can be divided into two categories:

• Scalar data types
• Complex data types

Scalar Data Types

Pig scalar types are simple types that appear in most programming languages.

Data Type | Description | Example
int | A signed 32-bit integer value | 2
long | A signed 64-bit integer value | 15L or 15l
float | A 32-bit floating-point value | 4.5f or 4.5F or 4.5e2f or 4.5E2F
double | A 64-bit floating-point value | 8.5 or 8.5e2 or 8.5E2
boolean | Represents true or false values | true/false
chararray | A character array (string) | hello tutorialandexample
bytearray | A blob or array of bytes; the default data type when none is declared | (no literal form)
biginteger | An arbitrary-precision integer (BigInteger) | 70204096223145
bigdecimal | An arbitrary-precision decimal (BigDecimal) | 198.789654123133211

Complex Data Types

Pig has several complex data types, such as tuples, bags, and maps. All of these
types can contain data of any type, including other complex types. It is therefore
possible to have a map where the value field is a bag containing a tuple in which
one of the fields is a map.

Data Type | Description | Example
map | A set of key-value pairs | [websitename#tutorialandexample.com]
tuple | An ordered collection of one or more fields | (12,56)
bag | A collection of one or more tuples | {(12,56),(54,78)}
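As a loose illustration (a Python analogy, not actual Pig Latin), the three complex types nest much like Python's built-in containers:

    # Pig tuple -> ordered fields, like a Python tuple:  (12,56)
    pig_tuple = (12, 56)

    # Pig bag -> collection of tuples, like a list:  {(12,56),(54,78)}
    pig_bag = [(12, 56), (54, 78)]

    # Pig map -> key/value pairs, like a dict:  [websitename#tutorialandexample.com]
    pig_map = {"websitename": "tutorialandexample.com"}

    # Complex types nest: a map whose value is a bag containing a tuple
    nested = {"scores": [(12, 56)]}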
