
BDA SUPER IMP Questions and Solns

VTU previous year paper analysis by YouTuber Afnan Marquee. The questions present in this
document are repeated 5+ times in previous year papers. So don't miss it by any chance! For
video explanation, check out my YouTube Channel!

Module 1
Questions
1. Big Data and its characteristics
2. Structured, Semi-Structured, Multi-Structured and Unstructured data
3. Five Layer Data Architecture
4. Big Data Management and Analysis
5. Steps in data pre-processing
6. Data Export to Cloud
7. Applications of BigData

Solutions
1. Big Data and its characteristics

2. Structured, Semi-Structured, Multi-Structured and Unstructured data


3. Five Layer Data Architecture

Layer 1: It considers the following aspects:

● Amount of data needed at ingestion layer 2 (L2)
● Push from L1 or pull by L2, as per the mechanism used
● Source data types: database, files, web or service
4. Big Data Management and Analysis
5. Steps in data pre-processing
6. Data Export to Cloud

Cloud offers various services. These services can be accessed through a cloud client (client
application), such as a web browser, an SQL client or another client. Figure 1.4 shows
data-store export from machines, files, computers, web servers and web services. The data
exports to clouds such as IBM, Microsoft, Oracle, Amazon, Rackspace, TCS, Tata
Communications or Hadoop cloud services.
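As a hedged illustration of such an export, the sketch below uploads a local data file to an
Amazon S3 bucket using the boto3 client library; the file name and bucket name are
hypothetical, and AWS credentials are assumed to be configured in the environment.

```python
import boto3  # AWS SDK for Python; credentials assumed configured in the environment

# Hypothetical local data-store file and target cloud bucket
LOCAL_FILE = "sales_data.csv"
BUCKET = "my-analytics-bucket"

# Create an S3 client and export the file to the cloud store
s3 = boto3.client("s3")
s3.upload_file(LOCAL_FILE, BUCKET, "exports/sales_data.csv")
print("Exported", LOCAL_FILE, "to s3://" + BUCKET + "/exports/sales_data.csv")
```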

7. Applications of Big Data


1. BigData in Marketing and Sales
Data are important for most aspects of marketing, sales and advertising. Customer Value (CV)
depends on three factors: quality, service and price. Big data analytics deploys large volumes
of data to identify and derive intelligence, using predictive models, about individuals. These
facts enable marketing companies to decide what products to sell.

2. BigData Analytics in Detection of Marketing Frauds


Big Data analytics enables fraud detection. Big Data usage has the following features for
enabling detection and prevention of frauds:
● Fusing of existing data at an enterprise data warehouse with data from sources such as
social media, websites, blogs, e-mails, and thus enriching existing data
● Using multiple sources of data and connecting with many applications
● Providing greater insights using querying of the multiple source data

3. BigData in credit card risk management


Financial institutions, such as banks, extend loans to industrial and household sectors. These
institutions in many countries face credit risks, mainly the risks of (i) loan defaults and (ii)
untimely return of interest and the principal amount. Financing institutions are keen to get
insights into the following:

4. BigData in HealthCare
5. BigData in Medicine

6. BigData in Advertising
Module 2
Questions
1. MapReduce, its properties and example
2. Hadoop Core Components
3. Hadoop YARN execution
4. Hadoop Ecosystem Tools
5. Explain Rack Awareness
6. Apache PIG, HIVE, OOZIE
7. YARN applications

Solutions
1. MapReduce, its properties and example
2. Hadoop Core Components
3. Hadoop YARN execution
4. Hadoop Ecosystem Tools
5. Explain Rack Awareness
6. Apache PIG, HIVE, OOZIE
7. YARN Applications
Distributed Shell
Distributed-Shell is an example application included with the Hadoop core components that
demonstrates how to write applications on top of YARN.

Hadoop MapReduce
MapReduce was the first YARN framework and drove many of YARN’s requirements. It is
integrated tightly with the rest of the Hadoop ecosystem projects, such as Apache Pig, Apache
Hive, and Apache Oozie.
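As a minimal, framework-free sketch of the MapReduce model itself (not any specific Hadoop
API), the classic word count can be expressed in plain Python: the map phase emits
(word, 1) pairs and the reduce phase groups them by key and sums the counts.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Shuffle + Reduce: group pairs by key and sum the counts per word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["Hadoop runs MapReduce jobs", "MapReduce jobs scale out"]
print(reduce_phase(map_phase(lines)))
# {'hadoop': 1, 'runs': 1, 'mapreduce': 2, 'jobs': 2, 'scale': 1, 'out': 1}
```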

Apache Tez
One great example of a new YARN framework is Apache Tez. Many Hadoop jobs involve the
execution of a complex directed acyclic graph (DAG) of tasks using separate MapReduce
stages. Apache Tez generalizes this process and enables these tasks to be spread across
stages so that they can be run as a single, all-encompassing job.

Apache Giraph
Apache Giraph is an iterative graph processing system built for high scalability. Facebook,
Twitter, and LinkedIn use it to create social graphs of users.

Hoya: HBase on YARN


The Hoya project creates dynamic and elastic Apache HBase clusters on top of YARN. A client
application creates the persistent configuration files, sets up the HBase cluster XML files, and
then asks YARN to create an ApplicationMaster.

Apache Spark
Spark was initially developed for applications in which keeping data in memory improves
performance, such as iterative algorithms, which are common in machine learning, and
interactive data mining.
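A hedged PySpark sketch of that idea: an RDD is cached in memory so that repeated
(iterative) passes avoid re-reading the data. A local Spark installation is assumed, and the
computation itself is just a stand-in for an iterative algorithm.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# Keep the dataset in memory so each iteration reuses it cheaply
data = sc.parallelize(range(1, 1001)).cache()

total = 0
for _ in range(5):  # stand-in for an iterative ML / data-mining loop
    total += data.map(lambda x: x * 2).reduce(lambda a, b: a + b)

print("accumulated:", total)
sc.stop()
```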
Module 3
Questions
1. What is NoSQL database? Mention its characteristics
2. Explain Key-Value Store
3. Explain CSV, JSON and XML
4. How NoSQL manages Big Data
5. SN architecture
6. Explain MongoDB and CassandraDB

Solutions
1. What is NoSQL database? Mention its characteristics

2. Explain Key-Value Store


Key-Value Store
The simplest way to implement a schema-less data store is to use key-value pairs. The data
store characteristics are high performance, scalability and flexibility. Data retrieval is fast in a
key-value pairs data store. A simple string called a key maps to a large data string or BLOB
(Binary Large Object). Key-value store accesses use a primary key for accessing the values.
Therefore, the store can be easily scaled up for very large data.
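A minimal in-memory sketch of the idea using a Python dict, with an auto-generated
(synthetic) key mapping to a value of any type:

```python
import uuid

store = {}  # key -> value; the value can be any data type, even a BLOB

def put(value):
    """Insert a value under a synthetic, auto-generated key."""
    key = str(uuid.uuid4())
    store[key] = value
    return key

def get(key):
    """Primary-key access: the only supported lookup path."""
    return store.get(key)

k = put({"name": "sensor-7", "reading": 42.5})  # any data type as a value
print(get(k))  # fast retrieval by key
# Note: there is no index on values, so lookup by value is not possible.
```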
Advantages
1. The data store can store any data type in a value field.
2. A query just requests the values and returns the values as a single item.
3. Key-value store is eventually consistent.
4. Key-value data store may be hierarchical or may be an ordered key-value store.
5. Values returned on queries can be converted into lists, table columns, data-frame fields and
columns.
6. Has (i) scalability, (ii) reliability, (iii) portability and (iv) low operational cost.
7. The key can be synthetic or auto-generated.

Disadvantages
1. No indexes are maintained on values, thus a subset of values is not searchable.
2. Key-value store does not provide traditional database capabilities, such as atomicity of
transactions, or consistency when multiple transactions are executed simultaneously.
3. Maintaining unique values as keys may become more difficult when the volume of data
increases.
4. Queries cannot be performed on individual values.

3. Explain CSV, JSON and XML - document store


Document Store

Features of Document Store


1. A document store stores unstructured data.
2. Storage has similarity with an object store.
3. Data is stored in nested hierarchies.
4. Querying is easy.
5. No object-relational mapping; this enables easy search by following paths from the root of
the document tree.
6. Transactions on the document store exhibit ACID properties.

CSV Example
JSON Example

XML Example
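The example figures are not reproduced in this document; as a stand-in, the sketch below
embeds one hypothetical student record in each of the three formats and parses it with
Python's standard library.

```python
import csv, io, json
import xml.etree.ElementTree as ET

# The same hypothetical record in the three formats
CSV_TEXT = "usn,name,marks\n1VT21CS001,Asha,87\n"
JSON_TEXT = '{"usn": "1VT21CS001", "name": "Asha", "marks": 87}'
XML_TEXT = "<student><usn>1VT21CS001</usn><name>Asha</name><marks>87</marks></student>"

# CSV: rows of comma-separated values under a header line
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    print("CSV :", row)

# JSON: nested key-value pairs, maps directly to a Python dict
print("JSON:", json.loads(JSON_TEXT))

# XML: a tree of tagged elements, searched by following paths from the root
root = ET.fromstring(XML_TEXT)
print("XML :", {child.tag: child.text for child in root})
```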
4. How NoSQL manages BigData?
5. SN architecture

6. Explain MongoDB and CassandraDB


MongoDB

Features of MongoDB

1. A collection stores a number of MongoDB documents. It is analogous to a table of an
RDBMS. A collection exists within a single DB to achieve a single purpose.
2. Document model is well defined.
3. MongoDB is a document data store in which one collection holds different documents.
4. Storing of data is flexible, and data store consists of JSON-like documents.
5. Storing of documents on disk is in BSON serialization format.
6. Querying, indexing, and real-time aggregation allow accessing and analyzing the data
efficiently.
7. Deep query-ability: supports dynamic queries on documents using a document-based query
language that is nearly as powerful as SQL.
8. No complex Joins.
9. Distributed DB makes availability high, and provides horizontal scalability.
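A hedged pymongo sketch of this document model; the database name, collection and
document are made up, and a local mongod instance is assumed to be running.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["college"]        # hypothetical database
students = db["students"]     # a collection holds the documents

# Flexible, JSON-like document (stored on disk in BSON serialization format)
students.insert_one({"usn": "1VT21CS001", "name": "Asha", "marks": 87})

# Dynamic query on the documents; no joins needed
print(students.find_one({"marks": {"$gt": 80}}))
```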

CassandraDB

Features
Module 4
Questions
1. BigData Architecture
2. Mapreduce Processing Steps
3. Hive Workflow
4. PIG Workflow

Solutions
1. BigData Architecture
2. MapReduce Processing Steps
3. Hive Workflow
4. PIG workflow
Module 5
Questions
1. ANOVA and correlation indicators
2. Linear Regression types
3. Apriori Algorithm
4. Text and Web Mining
5. Page Rank

Solutions
1. ANOVA and correlation indicators

Analysis of variance (ANOVA) is used for accepting or rejecting the null hypothesis. The test
also finds whether to accept an alternate hypothesis. It finds whether the groups being tested
have any difference between them or not.

Relationships and correlations enable training a model on sample data using statistical or ML
algorithms. Statistical correlation is measured by the coefficient of correlation. The correlation r
between two variables x and y, with means x̄ and ȳ, is:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² ]
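A small sketch computing both indicators with SciPy (assumed to be available); the sample
groups are made up.

```python
from scipy import stats

# Hypothetical sample groups / paired observations
g1 = [23, 25, 21, 27, 24]
g2 = [30, 28, 32, 29, 31]

# One-way ANOVA: tests whether the group means differ
f_stat, p_anova = stats.f_oneway(g1, g2)
print("ANOVA F =", round(f_stat, 2), " p =", round(p_anova, 4))

# Pearson coefficient of correlation r between two variables x and y
r, p_corr = stats.pearsonr(g1, g2)
print("correlation r =", round(r, 2), " p =", round(p_corr, 4))
```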
2. Regression Types
3. Apriori Algorithm
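The solution body is missing from this document; as a hedged stand-in, here is a minimal
sketch of the Apriori idea on a made-up transaction list: candidate itemsets are generated
level by level, and only those meeting a minimum support are kept.

```python
from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "butter"},
                {"milk", "bread", "butter"}, {"bread", "butter"}]
MIN_SUPPORT = 2  # minimum number of transactions containing an itemset

def support(itemset):
    """Count transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]

k = 2
while frequent:
    print("frequent", k - 1, "-itemsets:", [set(f) for f in frequent])
    # Join step: build candidate k-itemsets from frequent (k-1)-itemsets
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k}
    # Prune step: keep only candidates meeting the minimum support
    frequent = [c for c in candidates if support(c) >= MIN_SUPPORT]
    k += 1
```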
4. Text and Web Mining
Text Mining
Phase 1: Text pre-processing involves the following steps (a short sketch follows the list):
1. Text cleanup is a process of removing unnecessary or unwanted information.
2. Tokenization is a process of splitting the cleaned-up text into tokens (words), using white
spaces and punctuation marks as delimiters.
3. Part of Speech (POS) tagging is a method that attempts labeling of each token (word) with
an appropriate POS.
4. Word sense disambiguation is a method which identifies the sense in which a word is used
in a sentence; this gives the meaning in case the word has multiple meanings.
5. Parsing is a method which generates a parse tree for each sentence. Parsing attempts to
infer the precise grammatical relationships between the different words in a given sentence.
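A minimal sketch of steps 2 and 3, assuming the NLTK library (and its tokenizer and tagger
data) is installed:

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger data are downloaded

text = "Big Data analytics enables fraud detection."

# Step 2: tokenization, splitting cleaned-up text into word tokens
tokens = nltk.word_tokenize(text)
print(tokens)

# Step 3: Part of Speech (POS) tagging, labeling each token
print(nltk.pos_tag(tokens))
# e.g. [('Big', 'JJ'), ('Data', 'NNP'), ('analytics', 'NNS'), ...]
```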

Phase 2: Features Generation is a process which first defines features (variables, predictors).
Some of the ways of feature generation (see the sketch after this list) are:
1. Bag of words: the order of words is not that important for certain applications. A text
document is represented by the words it contains (and their occurrences). Document
classification methods commonly use the bag-of-words model. The preprocessing of a
document first reduces it to a bag of words.
2. Stemming: identifies a word by its root.
3. Removing stop words (the common words) from the feature space.
4. Vector Space Model (VSM): an algebraic model for representing text documents as vectors
of identifiers, word frequencies or terms in the document index.
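A short sketch of bag-of-words vectors with stop-word removal, using scikit-learn's
CountVectorizer (assumed to be available); the two documents are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["big data enables the analytics",
        "the analytics of big data"]

# Bag of words: word order is ignored; stop words are removed from the feature space
vectorizer = CountVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the feature (term) index
print(vectors.toarray())                   # vector space model rows, one per document
```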

Phase 3: Features Selection is the process that selects a subset of features by rejecting
irrelevant and/or redundant features (variables, predictors or dimensions) according to defined
criteria. The feature selection process does the following:
1. Dimensionality reduction: feature selection is one of the methods of dimension reduction.
2. N-gram evaluation: finding runs of consecutive words of interest and extracting them.
3. Noise detection and evaluation of outliers: identifying unusual or suspicious items, events
or observations in the data set.
Phase 4: Data mining techniques enable insights about the structured database that resulted
from the previous phases. Examples of techniques are:
1. Unsupervised learning (for example, clustering): (i) the class labels (categories) of the
training data are unknown; (ii) it establishes the existence of groups or clusters in the data.
2. Supervised learning (for example, classification): (i) the training data is labeled, indicating
the class; (ii) new data is classified based on the training set.

Web Mining
1. Hubs: These are pages with a large number of interesting links. They serve as a hub, or a
gathering point, where people visit to access a variety of information.
2. Authorities: Ultimately, people gravitate towards pages that provide the most complete and
authoritative information on a particular subject. This could be factual information, news,
advice, user reviews, etc.

5. Page Rank
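The solution body is missing here as well; as a hedged stand-in, below is a minimal
power-iteration sketch of PageRank on a hypothetical four-page link graph, using the
standard damping factor d = 0.85.

```python
# Hypothetical link graph: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
N = len(pages)
d = 0.85                       # damping factor
rank = {p: 1.0 / N for p in pages}

for _ in range(50):            # power iteration until (roughly) converged
    new_rank = {}
    for p in pages:
        # Sum rank contributions from every page that links to p
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - d) / N + d * incoming
    rank = new_rank

print({p: round(rank[p], 3) for p in pages})  # "C" ends up ranked highest
```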
