
KSRCE/QM/7.5.1/CSE
K.S.R. COLLEGE OF ENGINEERING(Autonomous)

Vision of the Institution
• We envision to achieve status as an excellent educational institution in the global knowledge hub,
making self-learners, experts, ethical and responsible engineers, technologists, scientists, managers,
administrators and entrepreneurs who will significantly contribute to research and environment-friendly
sustainable growth of the nation and the world.
Mission of the Institution
• To inculcate in the students self-learning abilities that enable them to become competitive and
considerate engineers, technologists, scientists, managers, administrators and entrepreneurs by
diligently imparting the best of education, nurturing environmental and social needs.
• To foster and maintain a mutually beneficial partnership with global industries and institutions through
knowledge sharing, collaborative research and innovation.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Vision of the Department
• To create evergreen professionals for the software industry, academicians for knowledge cultivation and
researchers for contemporary society modernization.
Mission of the Department
• To produce proficient design, code and system engineers for software development.
• To keep updated with contemporary technology and forthcoming challenges for the welfare of society.
Programme Educational Objectives (PEOs)
PEO1 : Figure out, formulate, analyze typical problems and develop effective solutions by imparting
the idea and principles of science, mathematics, engineering fundamentals and computing.
PEO2 : Competent professionally and successful in their chosen career through life-long learning.
PEO3 : Excel individually or as member of a team in carrying out projects and exhibit social needs
and follow professional ethics.

DATE COURSE FACULTY H.O.D PRINCIPAL

K.S.R. COLLEGE OF ENGINEERING (Autonomous)


Department of Computer Science and Engineering
Subject Name: Theory of Computation
Subject Code: 16CS414 Year/Semester: II/IV
Course Outcomes: On completion of this course, the student will be able to
CO1 Describe and solve problems on finite Automata.
CO2 Match by patterns of strings using regular expressions.
CO3 Generate CFG and its closure properties.
CO4 Construct pushdown automata and conversion from PDA to CFG and vice versa.
CO5 Design Turing machines and their various programming techniques.
Program Outcomes (POs) and Program Specific Outcomes (PSOs)
A. Program Outcomes (POs)
Engineering Graduates will be able to :
PO1 Engineering knowledge: Ability to exhibit the knowledge of mathematics, science,
engineering fundamentals and programming skills to solve problems in computer
science.
PO2 Problem analysis: Talent to identify, formulate, analyze and solve complex engineering
problems with the knowledge of computer science.
PO3 Design/development of solutions: Capability to design, implement, and evaluate a
computer-based system, process, component or program to meet desired needs.
PO4 Conduct investigations of complex problems: Potential to conduct investigation of
complex problems by methods that include appropriate experiments, analysis and
synthesis of information in order to reach valid conclusions.
PO5 Modern tool usage: Ability to create, select, and apply appropriate techniques,
resources and modern engineering tools to solve complex engineering problems.
PO6 The engineer and society: Skill to acquire the broad education necessary to understand
the impact of engineering solutions in global economic, environmental, social,
political, ethical, health and safety contexts.
PO7 Environment and sustainability: Ability to understand the impact of professional
engineering solutions in societal and environmental contexts and demonstrate the
knowledge of, and need for, sustainable development.
PO8 Ethics: Apply ethical principles and commit to professional ethics, responsibilities
and norms of engineering practice.
PO9 Individual and team work: Ability to function individually as well as on
multi-disciplinary teams.
PO10 Communication: Ability to communicate effectively in both verbal and written mode
to excel in the career.
PO11 Project management and finance: Ability to integrate the knowledge of engineering
and management principles to work as a member and leader in a team on diverse
projects.
PO12 Life-long learning: Ability to recognize the need for technological change through
independent and life-long learning.
B. Program Specific Outcomes (PSOs)
PSO1 Develop and Implement computer solutions that accomplish goals to the industry,
government or research by exploring new technologies.
PSO2 Grow intellectually and professionally in the chosen field.


BIG DATA AND ANALYTICS


QUESTION BANK
UNIT -1: BIG DATA AND ANALYTICS
PART-A
1. What are the classifications of Digital Data?
• Unstructured Data
• Semi-structured Data
• Structured Data
2. What are the characteristics of Data?
• Composition
• Condition
• Context
3. Describe the evolution of Big Data?
Initially, data was essentially primitive and structured. Relational databases evolved in the 1980s and
1990s; this was the era of data-intensive applications. The World Wide Web (WWW) and the Internet of
Things (IoT) have since led to an onslaught of structured, unstructured, and multimedia data.

4. Definition of Big Data?
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.

5. What are the challenges with Big Data?
• Data today is growing at an exponential rate.
• Cloud computing and virtualization are here to stay.
• The other challenge is to decide on the period of retention of big data.
• There is a dearth of skilled professionals.
• Data visualization is becoming popular as a separate discipline.
6. Why is Big Data important?
The more data we have for analysis, the greater the analytical accuracy and the greater the
confidence in decisions based on these analytical findings. This entails a greater positive impact
in terms of enhancing operational efficiencies, reducing cost and time, innovating on new
products and new services, and optimising existing services.
7. What is Unstructured Data?
This is data that does not conform to a data model or is not in a form that can be used easily by a
computer program. Example: memos, chat-room transcripts, etc.
8. Define Semi-Structured Data?
This is data that does not conform to a data model but has some structure. However, it is
not in a form that can be used easily by a computer program. Example: emails, XML, etc.
9. What is Big Data Analytics?
Big data analytics is the process of examining big data to uncover patterns, unearth trends, and
find unknown correlations and other useful information to make faster and better decisions.
10. What are the classifications of Analytics?
There are basically two schools of thought:
• Analytics classified into basic, operational, advanced, and monetized.
• Analytics classified into Analytics 1.0, Analytics 2.0, and Analytics 3.0.
11. Why is Big Data Analytics important?
• Reactive - Business Intelligence.
• Reactive - Big Data Analytics.
• Proactive - Analytics.
• Proactive - Big Data Analytics.
12. What is Data Science?
Data science is the science of extracting knowledge from data. In other words, it is the science of
drawing out hidden patterns amongst data using statistical and mathematical techniques.
13. What are the roles of a Data Scientist?
• Understanding of domain.
• Business strategy.
• Problem solving.
• Communication.
• Presentation.
• Inquisitiveness.
14. What are the terminologies used in the Big Data Environment?
• In-Memory Analytics.
• In-Database Processing.
• Security.
• Schema.
• Continuous availability.
• Consistency.
15. Difference between Parallel and Distributed Systems?
Parallel System:
• It is a tightly coupled system. The processors co-operate for query processing.
Distributed System:
• It is loosely coupled and composed of individual machines. Each of the machines can run
its own applications and serve its own respective users.
16. List any 4 analytics tools?
• MS Excel
• SAS
• IBM SPSS Modeler
• Statistica
17. What are the responsibilities of a Data Scientist?
• Data management.
• Analytical techniques.
• Business analysis.
18. What is a typical data warehouse environment?
Data from the source systems is integrated, cleaned up, transformed and standardized through the process of
Extraction, Transformation and Loading (ETL). The transformed data is stored in the enterprise data warehouse.
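
The Extraction, Transformation and Loading flow described above can be illustrated with a minimal Python sketch. The sample rows, the `transform` helper and the `warehouse` list are hypothetical stand-ins chosen for illustration; a real warehouse would use dedicated ETL tooling.

```python
# Hypothetical rows "extracted" from a source system, with messy formatting.
raw_rows = [
    {"name": " asha ", "city": "salem", "sales": "1200"},
    {"name": "RAVI", "city": "Erode ", "sales": "950"},
]

def transform(row):
    """Clean and standardize one extracted row."""
    return {
        "name": row["name"].strip().title(),
        "city": row["city"].strip().title(),
        "sales": int(row["sales"]),
    }

warehouse = []                        # Stand-in for the enterprise data warehouse.
for row in raw_rows:                  # Extract
    warehouse.append(transform(row))  # Transform, then load
print(warehouse)
```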

19. Write about the Hadoop environment?
The data sources are quite disparate, from web logs to images, audio and video, to social
media data, to various docs, PDFs, etc. Here, the data in focus is not just the data within the company’s
firewall but also data residing outside the company’s firewall.
20. Traditional Business Intelligence (BI) versus Big Data?
In traditional BI, all the enterprise’s data is housed in a central server, whereas in a big data environment
the data resides in a distributed file system. The distributed file system scales by scaling in or out
horizontally, as compared to a typical database server that scales vertically.

PART-B
1. Define data? Describe the types of Digital Data?
Collection of information – types of digital data – Unstructured data: has no predefined form –
Semi-structured data: has some structure – Structured data: is in an organised form.
2. Brief about Big Data?
Big data – characteristics of data – evolution – challenges with big data – why big data is used.

3. Define Big Data Analytics? Explain the terminologies used in the Big Data Environment?
Big data analytics – classifications – why used – Data science – Terminologies: In-Memory Analytics –
In-Database Processing – Security – Schema – Continuous availability – Consistency.
4. Explain about the Typical Hadoop Environment?
Typical warehouse environment – architecture of Hadoop – HDFS (Hadoop Distributed File System).

UNIT-2
Part- A
1. What is meant by NoSQL?
NoSQL stands for Not Only SQL. These are non-relational, open-source, distributed databases.
They are hugely popular today owing to their ability to scale horizontally and their adeptness at dealing with a
rich variety of data: structured, semi-structured and unstructured data.
2. Types of NoSQL
• Key-value (big hash table)
• Schema-less (column-oriented, document-based, graph-based)

3. Why NoSQL
It has a scale-out architecture instead of the monolithic architecture of relational databases. It can
house large volumes of structured, semi-structured and unstructured data. It has a dynamic schema: a NoSQL
database allows insertion of data without a pre-defined schema.
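
The dynamic-schema idea above can be illustrated with a minimal Python sketch that uses plain dictionaries as stand-ins for documents. The `collection` list and `insert` helper are hypothetical names for illustration only, not the API of any real NoSQL database.

```python
# Toy in-memory "collection": a list of dicts stands in for a NoSQL store.
collection = []

def insert(document):
    """Store a document as-is; no pre-defined schema is checked."""
    collection.append(dict(document))

# Two records with different fields can live side by side.
insert({"RollNo": 101, "Age": 19})
insert({"RollNo": 102, "EmailId": "sample@abc.com", "Hobbies": ["chess"]})

# Collect every field name that appears in at least one document.
fields_seen = sorted({key for doc in collection for key in doc})
print(fields_seen)
```

A relational table would need every column declared before either row could be inserted; here each record simply carries the fields it has.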

4. Advantages of NoSQL
• Cheap and easy to implement
• Easy to distribute
• Can easily scale up and down
• Relaxes the data consistency requirement
• Doesn’t require a pre-defined schema
• Data can be replicated to multiple nodes and can be partitioned

5. Use of NoSQL in industry
• Key-value pairs: shopping carts, web user data analysis
• Column-oriented: analysis of huge volumes of web user actions, sensor feeds
• Document-based: real-time analytics, logging
• Graph-based: network modelling

6. Difference between SQL and NoSQL

SQL                       NoSQL
Relational database       Non-relational or distributed database
Predefined schema         Dynamic schema for unstructured data
Relational model          Model-less approach
Table-based databases     Document-based or graph-based databases
Vertically scalable       Horizontally scalable

7. Define NewSQL
We need a database that has the same scalable performance as a NoSQL system for On-Line
Transaction Processing (OLTP) while still maintaining the ACID guarantees of a traditional database.
This new, modern RDBMS is called NewSQL.

8. Define Hadoop
Hadoop is an open-source project of the Apache Foundation. It is a framework written in Java.
Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.

9. Features of Hadoop
• It is optimized to handle massive quantities of structured, semi-structured and unstructured data.
• Hadoop has a shared-nothing architecture.
• It complements On-Line Transaction Processing (OLTP) and On-Line Analytical Processing
(OLAP). However, it is not a replacement for a relational database management system.
• Hadoop is for high throughput rather than low latency.

10. Difference between Hadoop and SQL

Hadoop                      SQL
Scale out                   Scale up
Key-value pairs             Relational tables
Functional programming      Declarative queries
Off-line batch processing   On-line transaction processing

11. Define cloud-based Hadoop
The Google Cloud Storage connector for Hadoop empowers one to perform MapReduce jobs
directly on data in Google Cloud Storage, without the need to copy it to local disk and run it in the
Hadoop Distributed File System (HDFS).
12. Distinguish RDBMS and Hadoop

RDBMS                                     Hadoop
Relational database management system     Node-based flat structure
OLTP processing                           Analytical, big data processing
Costs around $10,000 to $14,000 per       Costs around $4,000 per terabyte of
terabyte of storage                       storage
Needs expensive hardware                  A node requires only a processor, a
                                          network card and a few hard drives

13. Define HDFS


The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to implement
a distributed file system that provides high-performance access to data across highly scalable Hadoop
clusters.

14. Application of Hadoop YARN
• Central resource manager
• Node-level agent
• Job application
• Utilization

15. Limitations of HDFS
• The NameNode saves all its file metadata in main memory.
• There is a limit on the number of objects that one can have in memory on a single
NameNode.
• The NameNode can quickly become overwhelmed as the load on the system increases.

16. How does MapReduce work?
MapReduce divides a data analysis task into two parts: map and reduce. In a simple setup, a MapReduce
job works with two mappers and one reducer. Each mapper works on the partial dataset that is stored on its
node, and the reducer combines the output from the mappers to produce the reduced result set.
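
The two-mappers-and-one-reducer flow described above can be simulated with a minimal Python sketch, using word counting as the task. It runs in a single process and only imitates the data flow; the function names `mapper` and `reducer` and the sample partitions are illustrative assumptions, not Hadoop API calls.

```python
from collections import defaultdict

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in this partition."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def reducer(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Two partitions of the dataset, one per (simulated) mapper node.
partition_1 = ["big data is big"]
partition_2 = ["data is everywhere"]

# Each mapper sees only its own partition; the reducer sees all mapper output.
intermediate = mapper(partition_1) + mapper(partition_2)
word_counts = reducer(intermediate)
print(word_counts)
```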

17. Difference between SQL and MapReduce

SQL                            MapReduce
Interactive and batch access   Batch access
Static structure               Dynamic structure
Read and write many times      Write once, read many times
High integrity                 Low integrity
Nonlinear scalability          Linear scalability

PART B

1. Distinguish between SQL, NoSQL and NewSQL
SQL features: • Rely on relational tables • Utilize a defined data schema • Reduce redundancy
through normalization
NoSQL features: • High-performance writes and massive scalability • Do not require a defined
schema for writing data
NewSQL: A database that has the same scalable performance as a NoSQL system for On-Line
Transaction Processing (OLTP) while still maintaining the ACID guarantees of a traditional
database. This new, modern RDBMS is called NewSQL.

2. Explain in detail about Hadoop and its versions
Hadoop:
Hadoop is an open-source project of the Apache Foundation. It is a framework written in Java.
Hadoop uses Google’s MapReduce and Google File System technologies as its foundation.
Versions:
• Hadoop 1.0
• Hadoop 2.0
Hadoop 1.0: data storage framework – data processing framework.
Hadoop 2.0: resource management framework (YARN) – MapReduce programming.
3. Explain overview of the Hadoop ecosystem
HDFS – HBase – Hive – Pig – ZooKeeper – Oozie – Mahout – Chukwa – Sqoop – Ambari

4. Discuss in detail about HDFS
Introduction to HDFS – HDFS daemons: NameNode – DataNode – Secondary NameNode – anatomy of file
read – anatomy of file write
5. Explain processing data with Hadoop
Map task – reduce task – task tracker – job tracker – job configuration – job client – MapReduce
daemons: map task – task tracker – how MapReduce works – MapReduce example: driver class – mapper
class – reducer class
6. Explain about managing resources and applications with Hadoop YARN
YARN – limitations of Hadoop 1.0 architecture – HDFS limitations – Hadoop 2: HDFS – HDFS 2
features – Hadoop 2 YARN: taking Hadoop beyond batch – fundamental idea – basic concepts:
application – container.

Unit 3
Part A:
1. What is MongoDB?
MongoDB is:
• Cross-platform
• Open source
• Non-relational
• Distributed
• NoSQL
• Document-oriented data store.
2. Why MongoDB?
A few of the major challenges with traditional RDBMS are dealing with large volumes of
data, dealing with a rich variety of data (particularly unstructured data), and meeting the scale needs of
enterprise data. The need is for a database that can scale out, or scale horizontally, to meet the scale
requirements.
3. What are the terms used in MongoDB?
• Database
• Collection
• Document
• Fields/Key-value pairs
• Index
• Embedded documents
4. What are the data types in MongoDB?
• String
• Integer
• Boolean
• Double
• Arrays
• Null
• Date
5. Explain the MongoDB query language?
CRUD (Create, Read, Update and Delete)
Create – creation of data
Read – reading of data
Update – updating of data
Delete – deletion of data, using the remove() method.
6. Write a query for performing insert operations in MongoDB?
db.students.insert(
{
RollNo: 101,
Age: 19,
ContactNo: 10124585,
EmailId: "Sample@abc.com"
})

7. Write the features of Cassandra?
• Peer-to-peer network
• Gossip and failure detection
• Partitioner
• Replication factor
• Anti-entropy and read repair
8. List the CQL data types with definition?
• int – 32-bit signed integer
• bigint – 64-bit signed long
• double – 64-bit IEEE 754 floating point
• float – 32-bit IEEE 754 floating point
9. Define CQLSH with example?
cqlsh is the command-line shell for working with Cassandra through CQL. Each exercise at the cqlsh
prompt can be described as follows:
Objective – what we are trying to achieve
Input – what is given to us to act upon
Act – the statement that accomplishes the task at hand
Outcome – the result of executing the statement.
10. Write about KEYSPACE in CQL?
A keyspace is a container to hold application data. It is comparable to a relational database. It is
used to group column families together. Typically a cluster has one keyspace per application.
11. Write an example with KEYSPACE?
CREATE KEYSPACE students WITH replication = {
'class': 'SimpleStrategy',
'replication_factor': 1
};
12. Define CRUD and write a sample program?
To insert data into the column family "Student_info":
An INSERT writes one or more columns to a record in a Cassandra table atomically. An INSERT statement
does not return any output. One is not required to place values in all the columns.
13. Define collection and its types?
A column of type set consists of unordered unique values. However, when the column is queried, it
returns the values in sorted order. The collection types are:
• Set
• List
• Map
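
As a rough analogy only, the three collection types can be sketched with Python's built-in containers. This is not CQL; the variable names are hypothetical, and `sorted()` is used to mimic the sorted order in which a set column is returned when queried.

```python
# Set: unique values; sorted() mimics the sorted order returned on query.
hobbies = {"reading", "chess", "reading", "art"}
hobbies_as_returned = sorted(hobbies)

# List: ordered values, duplicates allowed.
languages = ["Hindi", "English", "Hindi"]

# Map: key-value pairs.
marks = {"maths": 90, "physics": 85}

print(hobbies_as_returned, languages, marks)
```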
14. Define TTL?
Data in a column, other than a counter column, can have an optional expiration period called
TTL (Time To Live). The client request may specify a TTL value for the data.
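
The expiry behaviour can be illustrated with a minimal Python sketch. The `TTLStore` class is a hypothetical toy, not Cassandra's implementation; it assumes the TTL is given in seconds and uses a fake clock so the example runs instantly.

```python
import time

class TTLStore:
    """Toy key-value store where each value may carry an expiry time."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._data = {}

    def put(self, key, value, ttl_seconds=None):
        # No TTL means the value never expires.
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key):
        value, expires_at = self._data.get(key, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # Expired: drop it and report it missing.
            return None
        return value

# A fake clock lets the example run instantly and deterministically.
now = [0.0]
store = TTLStore(clock=lambda: now[0])
store.put("otp", "4821", ttl_seconds=30)
before_expiry = store.get("otp")
now[0] = 31.0                      # Advance the fake clock past the TTL.
after_expiry = store.get("otp")
print(before_expiry, after_expiry)
```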
15. Define the ALTER command?
The ALTER TABLE command changes the definition of an existing table, for example the type of a column:
ALTER TABLE sample
ALTER sample_id TYPE int;

16. Write the steps to export CSV?
• Check the records of the table "elearningLists" present in the "student" database.
• Execute the below command at the cqlsh prompt.
• Check the existence of the "elearninglists.csv" file in D:/.
17. Write the steps to import CSV?
• Check for the table "elearning list" in the "Students" database.
• Check the contents of "D:/elearinglist.csv".
• Execute the command to import data into the table in the database.

Part B
1. Explain in detail about the MongoDB query language with example?
• Cross-platform
• Open source
• Non-relational
• Distributed
• NoSQL
• Document-oriented data store
2. Write in detail about the features of Cassandra?
• Peer-to-peer network
• Gossip and failure detection
• Partitioner
• Replication factor
• Anti-entropy and read repair
• Writes in Cassandra
• Hinted handoffs
3. Explain about CRUD operations with example?
• Create
• Read
• Update
• Delete
4. Explain in detail about collections?
• Set
• List
• Map
5. Explain in detail about import and export in Cassandra?
Export from Cassandra:
• Check the records of the table "e-learninglists" present in the "student" database.
• Execute the below command at the cqlsh prompt.
• Check the existence of the "e-learninglists.csv" file in D:/.
Import into Cassandra:
• Check for the table "e-learning list" in the "Students" database.
• Check the contents of "D:/elearinglist.csv".
• Execute the command to import data into the table in the database.

Unit – IV
Part - A
1) Define MapReduce?
Answer: MapReduce is a processing technique and a programming model for distributed computing based
on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes
a set of data and converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines
those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce task is always performed after the map job.

2) Different types of mapper and definition?
Answer: A Mapper in Hadoop takes each record generated by the RecordReader as input. It then
processes each record and generates key-value pairs.
Types: Chain Mapper, Identity Mapper

3) Define reducer and its types?
Answer: A Hadoop Reducer takes the set of intermediate key-value pairs produced by the mapper as its
input and runs a reducer function on each of them. One can aggregate, filter, and combine this data
(key, value) in a number of ways for a wide range of processing.

4) Difference between combiners and partitioners?
Answer:
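
The answer is left blank in the source. As commonly described for Hadoop, a combiner performs local aggregation on each mapper's output before it is sent over the network, while a partitioner decides which reducer receives each key. The Python sketch below only illustrates these two roles; the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for a stable hash function.

```python
import zlib
from collections import Counter

def combiner(pairs):
    """Pre-aggregate (key, count) pairs on the mapper side."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())

def partitioner(key, num_reducers):
    """Pick a reducer for a key; crc32 keeps the choice stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

mapper_output = [("pig", 1), ("hive", 1), ("pig", 1), ("pig", 1)]
combined = combiner(mapper_output)  # Four pairs shrink to two.

# Route each combined pair to one of two reducers.
buckets = {0: [], 1: []}
for key, count in combined:
    buckets[partitioner(key, 2)].append((key, count))
print(combined, buckets)
```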

5) What is meant by searching in a MapReduce program?
Answer: Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase
(optional) and in the reducer phase.

6) Define compression in MapReduce?
Answer: In data-intensive Hadoop workloads, I/O operations and network data transfer take a
considerably long time to complete. In addition, the internal MapReduce "shuffle"
process is also under huge I/O pressure, as it often has to "spill out" intermediate data to local disks
before advancing from the Map phase to the Reduce phase. Compression reduces the volume of data
written to disk and transferred over the network.
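
The effect of compressing intermediate data can be illustrated with a minimal Python sketch. Python's `zlib` is used here only as a stand-in for whichever codec a cluster is configured with, and the simulated map output is made up for illustration.

```python
import zlib

# Simulated intermediate map output: many repeated key-value lines.
intermediate = "\n".join(f"word{i % 10}\t1" for i in range(10_000)).encode("utf-8")

compressed = zlib.compress(intermediate)   # What would be spilled or shuffled.
restored = zlib.decompress(compressed)     # What the reducer would read back.

print(len(intermediate), len(compressed))
```

Repetitive key-value data like this compresses well, so fewer bytes are written to disk and moved between the Map and Reduce phases, at the cost of some CPU time.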

7) What is Hive?
Answer: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides
on top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
8) What are the features of Hive?
Answer:
• Hive provides data summarization, query, and analysis in a much easier manner.
• Hive supports external tables, which make it possible to process data without actually storing it in
HDFS.
• Apache Hive fits the low-level interface requirement of Hadoop perfectly.
• It also supports partitioning of data at the level of tables to improve performance.

9) List out the data units used in Hive?
Answer: Databases, Tables, Partitions, Buckets

10) What is HQL?
Answer: In the context of Hive, HQL refers to the Hive Query Language (HiveQL), a SQL-like query
language for querying and managing data stored in Hadoop. Hive translates HiveQL queries into jobs
(such as MapReduce jobs) that run on the Hadoop cluster. It should not be confused with the
Hibernate Query Language, which is also abbreviated HQL.

11) What are the Hive data types?
Answer:
• Integral types (TINYINT, SMALLINT, INT/INTEGER, BIGINT)
• Strings
• Varchar
• Char
• Timestamps
• Dates
• Intervals

12) Different types of Hive file formats?
Answer: Apache Hive supports several familiar file formats used in Apache Hadoop.
Formats: TextFile, SequenceFile, RCFile, Avro, ORC, Parquet.

13) Write about RC File?
Answer: RCFile (Record Columnar File) is a columnar file format for storing Hive tables in Hadoop.
It partitions a table horizontally into row groups and stores each row group column by column, which
improves compression and lets queries read only the columns they need.

14) Write about DML statements?
Answer: DML is short for Data Manipulation Language, which deals with data manipulation
and includes the most common SQL statements such as SELECT, INSERT, UPDATE and DELETE. It
is used to store, modify, retrieve, delete and update data in a database.

15) What is UDF?
Answer: A UDF (user-defined function) is a function written by the user to extend the built-in functions.
Functions are built for a specific purpose, to perform mathematical, arithmetic, logical and relational
operations on operands such as table column names.
16) Detail about SerDe?
Answer: SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O. The interface
handles both serialization and deserialization, and also interprets the results of deserialization as
individual fields for processing. A SerDe allows Hive to read in data from a table and write it back out
to HDFS in any custom format. Anyone can write their own SerDe for their own data formats.
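
The serialize/deserialize pair can be illustrated with a minimal Python sketch for a comma-delimited text format. This is not Hive's SerDe interface; the function names, the column list and the sample line are hypothetical and only show the round trip between a stored line and named fields.

```python
def deserialize(line, columns, delimiter=","):
    """Turn one stored text line into a row of named fields."""
    return dict(zip(columns, line.split(delimiter)))

def serialize(row, columns, delimiter=","):
    """Turn a row of named fields back into one text line."""
    return delimiter.join(str(row[name]) for name in columns)

columns = ["rollno", "name", "dept"]
row = deserialize("101,Asha,CSE", columns)   # Read path: line -> fields.
line = serialize(row, columns)               # Write path: fields -> line.
print(row, line)
```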
Part – B
1) Write a MapReduce program to search for a specific keyword in a file and sort by student name?
Program and definition
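
A minimal single-process Python sketch of the idea, not a Hadoop program: the map step keeps the lines that contain the keyword, and the sort step orders them by student name. The record layout (student name as the first comma-separated field), the sample data and the function names are assumptions made for illustration.

```python
def search_mapper(lines, keyword):
    """Emit (student_name, line) for every line containing the keyword."""
    matches = []
    for line in lines:
        if keyword.lower() in line.lower():
            student_name = line.split(",")[0].strip()
            matches.append((student_name, line))
    return matches

def sort_reducer(pairs):
    """Order matches by student name, mimicking the sorted keys a reducer sees."""
    return sorted(pairs, key=lambda pair: pair[0])

# Hypothetical input file contents: name, subject, marks.
records = [
    "Ravi,Hadoop,72",
    "Anu,MongoDB,81",
    "Kavi,Hadoop,90",
]
result = sort_reducer(search_mapper(records, "hadoop"))
print([name for name, _ in result])
```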

2) Explain about MapReduce [Mapper and Reducer]?
Map stage
Reduce stage
Example program

3) Explain Hive and the Hive architecture?
Hive
Features
Architecture
Working

4) List the Hive data types and explain briefly about Hive file formats?
Integral
String
Timestamp
Union types

5) Explain about HQL and also list the DDL, Aggregation, Bucketing?
FROM clause
AS clause
SELECT clause
WHERE clause
ORDER BY clause
Update, Delete, DDL, Aggregation, Bucketing
Unit-5
Part-A
1) What is Pig?
Apache Pig is a platform for data analysis. It is an alternative to MapReduce programming. Pig was
developed as a research project at Yahoo.
2) Define the anatomy of Pig?
• Data flow language
• Interactive shell where you can type Pig Latin statements
• Pig interpreter and execution engine
3) What is the Pig philosophy?
Pigs eat anything
Pigs live anywhere
Pigs are domestic animals
Pigs fly
4) List the features of Pig Latin statements?
• A Pig Latin statement is an operator.
• Pig Latin statements are the basic constructs to process data using Pig.
• Pig Latin statements should end with a semicolon.
5) What are the modes of running Pig?
You can run Pig in two ways:
• Interactive mode
• Batch mode
6) What are the data types in Pig?
int, long, float, double, chararray, bytearray, datetime, boolean
7) Define local mode & MapReduce mode?
Local mode:
To run Pig in local mode, you need to have your files in the local file system.
Syntax: pig -x local filename
MapReduce mode:
To run Pig in MapReduce mode, you need to have access to a Hadoop cluster to read/write files.
8) Define relational operators?
• Filter
• Foreach
• Distinct
• Group
• Limit
• Order by
• Join
• Union
• Split
• Sample
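
As a rough analogy only, a few of the operators listed above can be sketched in Python on a small in-memory dataset. This is not Pig Latin; the `students` records and variable names are hypothetical, and each step is labelled with the operator it imitates.

```python
from itertools import groupby

students = [
    {"name": "Ravi", "dept": "CSE", "marks": 72},
    {"name": "Anu", "dept": "ECE", "marks": 81},
    {"name": "Kavi", "dept": "CSE", "marks": 90},
]

# FILTER: keep rows that satisfy a condition.
passed = [s for s in students if s["marks"] >= 75]

# ORDER BY: sort rows on a field.
ordered = sorted(students, key=lambda s: s["marks"], reverse=True)

# GROUP: collect rows that share a key (groupby needs sorted input).
by_dept = {
    dept: [s["name"] for s in rows]
    for dept, rows in groupby(sorted(students, key=lambda s: s["dept"]),
                              key=lambda s: s["dept"])
}

# LIMIT: keep only the first n rows.
top_one = ordered[:1]
print(passed, by_dept, top_one)
```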

9) Define Piggy Bank?
Pig users can use Piggy Bank functions in Pig Latin scripts, and they can also share their own functions in
Piggy Bank.
10) Write about UDF?
Pig allows you to create your own functions for complex analysis.

Part B
1) Explain in brief about Pig ETL processing and give an overview of Pig Latin?
• Pig Latin: statements
• Pig Latin: keywords
• Pig Latin: identifiers
• Pig Latin: commands
• Pig Latin: case sensitivity
• Operators in Pig Latin
2) Define machine learning and its algorithms?
• Machine learning definition
• Machine learning algorithms
3) Define the execution modes of Pig and running Pig?
Run Pig in two ways:
• Interactive mode
• Batch mode
Execute Pig in two modes:
• Local mode
• MapReduce mode
