You are on page 1of 377

B.E.

8TH SEIVr
CSE/ISE
THREE CBCS MODEL QUESTION PAPERS &
PREVIOUS YEAR QUESTION PAPERS SOLVED

SUNSTAR EXAM SCANNER


::r Internet of Things Technolo
,? Big Data Analytics

,? Network Management
As Per New Syllabus ofVfU 2015 Scheme
Choice Based Credit System(CBCS)

ALL IN ONE
SUNSTAR EXAM SCANNER
B.E.
STH Sem CSE/ISE
CBCS MODEL QUESTION PAPERS (WITH ANSWERS)
&
PREVIOUS YEAR EXAM PAPERS (WITH ANSWERS)

¢ Internet of Things Technology


¢ Big Data Analytics
¢ Network Management
¢ System Modeling and Simulation

C AUTHORED BY A TEAM OF EXPERTS ) _

SUNSTAR PUBLISHER®
#4/1, Kuppaswamy Building, l 9lh Cross,
Cubbonpet, Bangalore - 560002
Phone:08022224143
E-mail: sunstar884@gmail.com
SUNSTAll EXAM SCANNER for 8,. Sein B.E. CSi/ISE, V. T. U
Authored by A Team of Experts and Published by Sunstar Publisher, Bangalore - 2.

©Authors

-----------.------------------,,
Pages : IV + 364 No of Copies : 1000
Paper used : 11.2 kg (58 GSM) Sripa thi
First Edition : 2020 - Book Size : 1/4th Crown

Published By: Copy Right: Every effort has been made


to avoid errors or commissions in this
SUNSTAR PUBLISHER®
,4/ 1, Kuppaswamy Building. 191h Cross,
Cubbonpet, Bangalore• 560002
publication.lnspiteofthis,someerrorsmight
crept in. Any mistake, error or discrepancy
noted maybe brought lo our notice which
All En
Phone: 080 22224143 shall be taken care of in the next edition. It
is notified that neither the published nor the
E-mail: sunstar884@gmail.com
authors or seller will be responsible for any
damage or loss of action to any one, of any
kind. In any manner, therefrom.
No part of this book may be reproduced or
copied in any form or by any means (graphic,
Composed by electronicormechanical,including.tapingor
11.& .SCX.UTH»nl information retrieval system) or reproduced
Bangalore on any disc, tape, perfora tcd media or other
information storage device, etc., without the
written permission of the publishers. Breath
of th.is condition is liable for legal action.
For binding mistakes, misprints or. for
missing pages, etc., the publisher's liability
Printed at:

=l
is limited to replacement within one month
Sri Manjunatha Printers of purchase by similar edition. All expenses
Bangalore in this connection are to borne by the
purchaser.
All disputes a re subject to Bangalore
Jurisdiction only
, Bangal ore - 2.

Sripa thi

Dedicated
has been made
·ssions in this
to
All ~ngineering Students
c next edition. It
ublished nor the
onsible for any
any one, of any
' om.
e reproduced or
fD'ea ns (graphic,
luding. taping or
Oorreproduced
media or other
.tc., without the
lishers. Brea th
legal action.
,prints or. for
isher's liability
hin one month
n. All expenses
orne by the

o Ilaog,lott I_
____________________________________________
11 CONTENTS II

1. Internet of Things Technology


► CBCS Model Question Paper - 1 03 - 28
► CBCS Model Question Paper - 2 29 - 56
► CBCS J une I July 2019 01 - 19
► CBCS December 2019 I January 2020 20 - 44

2. Big Data Analytics


► CBCS Model Question Paper - 1 03 - 46
► CBCS Model Question Paper - 2 47 - 84
► CBCS Model Question Paper - 3 85 - 124
► CBCS June / July 20 19 01 - 24

3. Netwo,rk Manageme_nt
► CBCS Mo_del Question Paper - 1 03 - 19
► CBCS Model Question Paper - 2 20- 36
► CBCSJune I July 2019 01 - 13
► CBCS December 2019 I January 2020 14- 20

3. System Modeling and Simulation


► CBCS Model Question Paper - I 03 - 24
► CBCS Model Question Paper - 2 25 -44
► CBCSJune / July 2019 01 - 08
► CBCS December 2019 / January 2020 09- 16
As Per New VTU Syllabus w.e.f 2015-16
Choice Based Credit System(CBCS)

SUNSTAR
---
-
- - - - - - ]
SUNSTAR EXAM SCANNER
- - - - - - - - - - - - -- - - - - - - - ---

INTERNET OF THINGS TECHNOLOGY

(VIII SEM. B.E. ~.SE / ISE)


SYLLABUS
INTERNET OF THINGS TECHNOLOGY
[AS PE R C IIOICE lll\SEO CRPr>IT SYSTE M (C llC S ) SC IIE ME )
(EFFECTIVI; FROM TH E AC ADEMI C YF'AR 201 6 - 2017)
Subjrtt Codr ISCSSI IA Marks 20
N11mbtr of Ltttun- llounl\Vrtk 04 Exarn Marks 80
'fot1I Nurnh<'r uf Lrrturr Hours so Eurn llours OJ

MODULE 1
What is loT. Ge1~esis of loT, loT and Digitization, loT Impact, Convergence of IT and loT, loT
Challenges, loT Network Architecture and Design, Drivers Behind New Network Architectures.
Comparing loT Architectures, A Simplified loT Architecture, The Core loT Functional Stack, loT
Data Management and Compute Stack.

MODULE2
Smart Objects: Ti\e "Things" in loT, Sensors, Actual ors, and Smart Objects, Sensor Networks,
Connecting Smart Objects, Communications Criteria, loT Access Technologies.

MODULE 3
IP as the loT Network Layer, The Business Case for IP, The need for Optimization, Optimizing
IP for IoT, Profiles and Compliances, Application Protocols for loT, The Transport Layer, loT
Application Transport Methods. ·

MOOULE4
Data and Analytics for loT, An Introduction to Data Analytics for loT, Machine Leaming, Big
Data Analytics Tools and Technology, Edge Streaming Analytics, Network Analytics, Securing
loT, A Brief History ofOT Security, Common Challenges in OT Security, I low IT and OT Security
Practices and Systems Va,y, Formal Risk Analysis Structures: OCTAVE and FAIR, The Phased
Application of Security in an Operational Environment

MODULES
loT Physical Devices and Endpoints - Arduino UNO: Introduction to Arduino, Arduino UNO,
Installing the Software, Fundamentals of Arduino Programming. loT Physical Devices and
Endpoints - RaspberryPi: Introduction to RaspberryPi, About the RaspberryPi Board: llardware
Layout, Operating Systems on RaspberryPi, Configuring Raspberry Pi, Programming Raspberry Pi
with Python, Wireless Temperature Monitoring System Using Pi, OS I 8B20 Temperature Sensor.
Connecting Raspberry Pi via SSH, Accessing Temperature from DS 18820 sensors, Remote
access to RaspberryPi, Smart and Connected Cities, An loT Strategy for Smarter Cities, Smart
City loT Architecture, Smart City Security Architecture, Smart City Use-Case Examples.
Eighth Semester B.E. Degree Examination,
CBCS - Model Question Paper - 1
INTERNET OF THINGS TECHNOLOGY
Time: 3 hrs. Max. Marks: 80
Note : Am wer any FIVE /111/ q11esfio11s, selecting ONE /111/ quest/011 from each mod11/e.

Module - I
I. a. What is IOT'! Explain the features of IOT (08 Marks)
Ans. The most important features of IOT arc
Connectivity: Connectivity refers to establish a proper connection bcn,vccn all the
things of IOT to lvT platform i\l may be server or cloud. Afler connecting the JOT
devices. it needs a high speed messaging between the devices and cloud to enable
reliable. secure and bi-directional communication.
Analyzing: Aller connecting all the relevant things. it comes to real-time analyzing
the data collected and use them to build effective business intelligence. If we have a
good insight into data gathered from all these things. then we call our system has a
smart syste111 .
Jntcgrating: IOT integrating the various models to impro\'e the user experience as
well.
Artificinl lntelliJ!ence: IOT makes things smart and enhances life through the use
of data. For example, if we have a coffee machine whose beans have going to end.
then the coffee machine itself order the coffee beans of your choice from the retailer.
ScnsinJ?: The sensor devices used in IOT technologies dc1cct and measure any
clrnnge in the environment and report on their status. IOT technology brings passive
networks 10 active ne1works. Without sensors, 1here could 1101 hold an effective or
true IOT environmcnl.
Active Engagement: JOT makes the connected technology. product, or services to
active engagement between each other.
Endpoint Management: It is important to be the endpoint management of all the
IOT system otherwise; it makes lhe complete failure of1hc system. for example, if
coffee machines itself order the coffee beans when it goes to end but what happens
when it orders the beans from a retailer and wi: are not present at home for a few
days. it leads to the failure of the IOT systt:m. So, tht:re must bll a need for endpoinl
management.
b. Explain the IOT Applications (08 Marks)
Ans. The versatility of IOT has become very popular in recent years. There are many
advantages to having a device based on lvT. Mckinsey Global Institute reports
that IOT business will reach 6.2 trillion in revenue by 2025. There are lo1s of
applications are available in the market in different areas.
J ) Personal Home A11tom11tion Syi;tem: Home Automation system is the m,~or
example in this area.

3
VIII Se,rn,, ( CSE/ISC:) In.tern.et-of~ T ~
2)Wemo Switch Smart Plug: It is the mol>t useful devices which connected home
devices in the Switch. a smart plug. It plugs into a regular outlet, accepts the
power cable from any device, and can be used to turn it on and off on hit :1
button on your snrnrtphonc.
3) Enterprise: In the enterprise area many applications are there Like environmental
monitoring i.ystem. smart environment etc.
4) Nest Smnrt Thermostat: It is connected to the internet. The Nes t lenrns
automaticnlly your family 's routines and will automatically adjust the
temperature based on your acti\'ities, to make your house more efficient. There
is also a mobile app which allows the user to edit temperature and schedules.
5) Utilities: smarl metering, smarl grid, and water monitoring system are the
most useful applications in the various utility area.
6) Energy Management: Advanced Metering Infrastructure is the major example
in this area.
7) Medical and Health Care: Remote health monitoring and emergency
notification system are Cl\amples of JOT in the medicalfield.
8) Health patch Health Monitor: It can be used for the patient who can't go to
doctors. letting them get CCG. heart rate, respiratory rate, !.kin temperature, body
posture. fall de1ec1ion. and accivity readings remotely.
9) Transportation: Electronic toll collection system is the most useful example in
this area.
10) Large scale deployment: There are various large projects ongoing in the world.
Songdo (South Korea), the first of its kind fully wired Smart City, is near
completion. Everything in this cit) is planned to be wired. connected and turned
into a data strea111 that would be monitored by an array of computers without any
human interaction.
11) Another ex:imple is the Sino-Singapore Gunngzhou Knowledge City work
on improving ~,ir nnd wnter qu:1lity. French company Sigfox commenced building
an ultra-narro,,band wireless data network in the San Francisco Bay area in 2014.
Another ex:unple is the one completed by New York Wnterwnys in New York
City to connect all their vessels and being able to monitor them 24n.
So these are large applications are present in the market which is based on loT.
This world is going to become a beller place to live with more communication
with everyone. In near future large number of de" ices connected to the internet and
provide great facilities to the world.
OR
2. n. Explnin the IOT network architecture components. (10 Marl<S)
Ans. A "thing·• is an object equipped with sensors that gather data which will be transferred
' over a networl- and actuators that allow things to act (for example, to switch on or
off the light. to open or close a door. to increase or dccreasc engine rotation speed and
more). This concept includes fridges, street lamps, buildings. vehicles, production
machinery. rehabilitation equipment and everything else -imnginable. Sensors are
not in all cases physically attached to the things: sensors ma) need to monitor, for
example. what happens in the closest environment to a thing.
Gateways. Data goes from things to the cloud and vice, ersa through the gateways. A
gateway provides connectivity between things and the cloud part of the loT solution.
enables data preprocessing and filtering before moving it to the cloud (to reduce the
volume of data for detailed processing and storing) and transmits control commands
going from the cloud to things. Things then execute commands using their actuators.
Cloud gateway facilitates data compression and secure data transmission between
field gateways and cloud loT servers. It also ensures compatibility with various
protocols and communicates with field gatcwa) s using different protocols depending
on what protocol is supported by gateways. ·
Streaming data processor ensures eflective transition of input data to a data lake
and control applications. No data can be occasionally lost or corrupted.
Data Lake. A data lake is used for storing the data generated by connected devices
in its natural format. Big data comes in "batches" or in ''streams". When the data is
needed for meaningful insights it's extracted from a data lake and loaded to a big
data warehouse.
Big data warehouse. Filtered and preprocessed data needed for meaningful
insights is exfracted from a data lake to a big data warehouse. A big data warehouse
contains only cleaned. structured and matched data (compared to a data lake which
contains all sorts of data generated by sensors). Also. data warehouse stores context
information about things and sensors (for example, where sensors are installed) and
the commands control applications send to things.
Data analytics. Data analysts can use data from the big data warehouse to find
trends and gain actionable insights. When analyzed (and in many cases - visualized
in schemes, diagrams, info graphics) big data show, for example. the performance
of devices. help identify ineffi-:iencies and work out the ways to improve an loT
system (make it more reliable. more customer-oriented). Also, the correlations and
patterns found manually can further contribute to creating algorithms for control
applications.
Machine learning nnd the models ML generates. With machine learning, there
is an opportunity to create more precise and more efficient models for control
applications. Models are regularly updated (for example. once in a week or once in a
month) based on the historical data accumulated in a big data warehouse. When the
applicability and efficiency of new models are tested and approved by data analysts.
new models are used by control applications.
Control applications send automatic commands and alerts to actuators. for example:
• Windows of a smart home can receive an automatic command to open or close
.. depending on the forecasts taken from the weather service.
• When sensors show that the soil is dry, watering systems get an automatic
command to water plants.
• Sensors help mon.itor the state of industrial equipment, and in case of a pre-
failure situation, an loT system generates and sends automatic notifications to
VIII Se-mt (CSf(ISf) I n.tunet o f ~T ~
field engineers.
The commands sent by control apps 10 actuators can be also additionally stored in
a big data warehouse. This may help investigate problematic cases (for example.
a control app sends. commands, but they are not performed by actuators - then
connectivity. gateways and actuators need to be checked). On the other side, storing
command~ from control apps may contribute lo security. as an loT system can
identify that some commands arc too strange or come in 100 big amounts which ma)
evidence securit) breaches (as well as other problems which need investigation and
corrective measures).
Control applications can be either rule-based or machine-learning based. In the first
case, control apps work according to the rules stated by specialists. In the second
case, control apps arc using models which are regularly updated (once in a week.
once in a month depending on the specifics of an loT S)Stcm) with the historical data
stored in a big data warehouse.
Although control apps ensure belier automation of nn loT system, there should
be always an option for users to innuence the behavior of such appheations (for
example, in cases of emergency or when it turns out that an loT system is badly
tuned to perform certain actions).
User applications: Arc a software component of an loT system which enables the
connection of users to an loTsystem and gi,es the options to monitor and control their
smart things (while they arc connected to a network of similar things. for example.
homes or cars and controlled by a central system). With a mobile or web app. users
can monitor the state of their things, send commands to control applications, set the
options of automatic behavior (automatic notifications and actions when certain data
comes from sen<;ors).
b. Write explanatory note on JOT Data Management. (06 Marks)
Ans. loT data management enables businesses to discover usage patterns. It also challenges
assumptions m:ide during design :ind development phases. identifying weaknesses
in connected devices. In other words - it helps create the best connected products
possible. Before the product is released, loT data management allows to conduct a
field test.

Source: Bm,c/1 Software /11111J1111tium


Armed ,~ ith this data, "'e can improve design and create a higher-quality product
that offers the best user experience. For example. un automated vehicle manufacturer
can identify how various parts and components arc used, and assess what conditions

I
they can with stand. When we consider that vehicle recall costs can run into millions
in compensation claims (not to mention the cost to your reputation), this is a no-
brainer. Collecting field data is also an important step post-lounch. we can provide
continuous product improvements with soHware updates and get important insights
for your next version. Througttout the producl lifetime, these insights will support
the development process of new products and additional ill:rations.
Module - 2
3. I , Whal arc Smart obj1..-cls in IOT? Explain {06 Marks)
Ans. The concept of smart in loT is used for physical objects that arc active, digital.
networked. can operate to some ex.tent autonomously. reconfigurable and has local
control of the resources. The smart objects need energy. data storage, etc. A smart
object is an object that enhances the interaction with other smart objects as well as
with people also. The world of loT is the network of interconnected heterogeneous
objects (such as smart devices, smart objects. sensors. actuators, RFID. embedded
computers. etc.) uniquely addressable and based on standard communication
protocols. In a day to day life, people have a lot of object with internet or wireless or
wired connection. Such ns:
• Smartphone
• Tablets
• TV computer
These objects can be interconnected among them and facilitate our daily life (smart
home, smart cities) no matter the situation, localization. accessibility to a sensor.
size, scenario or the risk of danger.
Smart objects arc utilized widely to transform the physical environment around us
to a digital world using the Internet of things (loT) technologies. A smart object
carries blocks of application logic that make sense for their local situation and
interact with human users. A smart object sense. log, and interpret the occurrence
within themselves and the environment, and intercommunicate with each other and
ex.change information with people.
The work of smurl object has focused on technical aspects {such as software
infrastmcture, hardware platforms, etc.) and application scenarios. Application
areas range from supply-chain management and enterprise applications (home and
hospital) to healthcare nnd industrial workplace support. As for human interface
aspects of smart-object technologies arc just beginning to receive attention from the
environment.
b. What arc Sensor Networks in JOT? Mention the churactcristics of IOT.
(10 Marks)
Ans. A Wireless Sensor Networl-. is one kind of wireless network includes a large number
of circulating, self-directed, minute, low powered devices named sensor nodes
called motes. These networks certainly cover a huge number of spatially distributed,
little, battery-operated, embedded devices that ore m:tworkcd to caringly collect,
process. and transfer data to the operators, and it has controlled the capabilities of

,
VIII Sem., (CSE/IS£)

computing & processing. Nodes are the tiny computers, which work jointly to form
the networks.
The sensor node is a multi-functional, energy efficient wireless device. The
applications of motes in industrial are widespread. A colleclion of sensor nodes
collects the data from the surroundings to achieve specific application objectives.
The communication between motes can be done with each other using tcansceivers.
In a wireless sensor net\\ork, the number of motes can be in the order of hundreds/
even thousands. In contrast with sensor n/ws, Ad Hoc networks will have fewer
nodes withou.t any structure.
Wireless Sensor Network Architecture
The most common WSN architecture follows the OSI architecture Model. The
architecture of the WSN includes five layers and three cross layers. Mostly in sensor
n/w we require five layers, namely application. transport, n/w. data link & physical
layer. The three cross planes arc namely power management, mobility management.
and task management. These layers of the WSN are used to accomplish the n/w
and make the sensors work together in order to raise the complete eHiciency of the
network.
Characteristics of Wireless Sensor Network
The characteristics of WSN include the following.
• The consumpti_?n of Power limits for nodes with batteries
• Capacity 10 handle with node failures
• Some mobil~ty or nodes and Heterogeneity of nodes
• Scalability to large scale of distribution
• Capability to ensure strict environmental conditions
• Simple to use
• Cross-layer design
Advantages of Wireless Sensor Networks
The advantages of WSN include the following
• Network arrangements can be carried out without immovable infrastructure.
• Apt for the non-reachable places like mountains. over .the sea, rural areas and
deep forests.
• Flexible if there is a casual situation when an additional workstation is required.
• Execution pricing is inexpensive.
• It avoids plenty of wiring.
• It might provide accommodations for the new devices at any time.
• It can be opened by using a centralized monitoring.
OR
4. a. How do we connect Smart Objects in IOT? Explain (08 Marks)
Ans. In a world of smart networked devices, wearables, sensors and actuators, transparent
and secure access to and usage of the available resources across various IoT domains
is crucial to satisfy the needs of an increasingly connected society. The current loT
ecosystem is however fragmented: a series of vertical solutions exist today which,

8
on the one hand. integrate connected objects within local environments using
purpose-specific implementations and, on the other hand. connect smart space.<.
with a back-end cloud hosting dedicated otien proprietary sotlware components.
symbloTc (symbiosis of smart objects across IoT environments) comes to
remedy this fragmented environment by providing an ah.\lractio11 layer for a
""unified view'" on various platforms and their resources so that platform resources
are transpl)rent to application designers and developers. In addition, symbloTc also
chooses a challenging task to implement loT platform federativ11.1· so that they can
St:curely interoperate. collaborate and share resources for the mutual benefit. and
what is more. support the migration of smart objects between various loT domains
and plalforms. i.e.. "smart object rva111i111('. symbloTe will achieve all of the above
by designing. and implementing an Open Soun·e mediatioi, prototype. In light of
the above, symbloTc opens up the potential for innovative business models that are
incrementally deployable. This is especially imporlant for SM Es who arc symbloTe's
primary targt:I group. Moreover, symbloTe removes the strict separation between loT
islands to ere;:ate ,in environment which i) has significant impact on the market since
it bridges individunl efforts nnd investments. ii) is attractive for a heterogeneous user
group. iii) malclu:s the dynamicity of modern life dut: lo case of use and iv) is helpful
for various businl'ss, home and public infrastructure use cases.
c,.n,pus
J;;;;!I Office
lluilding ~
g Q (:] ~ o~
/
□ Campus
B~ing. . . . . - -
'-
~G)
Platfonn 1

Stadium

Publlc
Q 1ransporr

□~

b. What arc the nchvork technologies for IOT connectivity'! (08 Marks)
Ans. The 3 m,,in network technologies for loT Connectivity are:
I. Standard Wireless Access - Wif.'i. 2G. 3G and standard LTE.
2. Priv~te Long Range - LoRA based platform. Zigbee, and SigFox.
3. Mobile loTTechnologies- LTE-M. NB-loT, and EC-GSM-loT
There is no doubt that loT is driving a lot of changes in connectivity (among many

9
VIII Sem, ( CSE/IS£) I ritern.et o f ~ r ~
other infrastructure technologies). The rule that connected devices impose over
networking technologies is not the same as in our mobile phones..
Standard Wireless Access - WiFi, 2G, 3G and st:mdard LTE
This is a no-brainer move. There are already plenty of devices that use this. such as:
• Smart-TVs ·
• Gaming consoles
• Panic buttons
• Video surveillance
• eHealth
• Flt:et tracking
• Industrial loT
It's an ..obvious choice'' for providers and consumers to continue using their current
internet access (Wi-Fi, 2G, 3G. LTE, etc) as the primary network option for their
home appliances. /\s always, there is a catch. the power consumption is quite a thing
for cordless devices. Most devices using this network access are statically connected
to a power outlet (or big batteries). In case of using a mobile network, another
·constraint is that their data plan acquired was thought for phones. This way it's quite
expensive if you are thinking in multiples devices per user.
Private Long lfange - LoRA based platform, Zigbee, and SigFox
The requirements of low power connectivity opened a window to private companies
to develop new networks with that specific constraint at their core which arc loT
native. These private wireless networks are used for the specific proprietary network.
It's also to deploy a network which allows third-party devices to connect and build
an ecosystem.The three leading technologies in this area arc, LoRAWAN, Zigbee.
_. ,.- .......
........_,._.,
.. ..., ......
- - • 1 " t ,.. . ......, .. ~
............
•I
1.-.. ,~
111'!o• ........ I I
..-

,: ..-;: 1•,n•'""'''
and SigFox.
Module - 3
5. a. Explain the business case towards IP. (08 Marks)
Ans. Much of the technology needed for the Internet of Things {loT) has been available
for some time, but connecting a world's worth of devices to the cloud requires more b. What arc Ligh
than just an Internet connection. For loT deployments to be successful. business Ans. /\long with ph
models and pain points for players on both the device side and network side must protocols for I
application mes
be identified, followed by solutions that can collect and analyze data securely and
efficiently. Lightweight M
protocol betwe
and are thus re
are as folio,
information be
this class treat
deleted, and re
can use differe
LWM2M proto
send messages

10
C8CS · tvfodel,Q~wn,pape,,- - l

TheClood Tetcos
e,ao,-...,t. "°"'°'
c&.,.twt...,i.-..:1~,.._
lllllll'< cb,d. t,Ot ~ TT \• --...r, -.:.t '" .lo, t} M•
ff~IIM~lhtup~,loM,1,,,1,ncf;. I. l!iUI
.,..._.,.,,....~ ..~.J-,..•- •H,o,.Mi.. D,111
.....9119'"''•""'-'-"

~
-,
Gateways J
S1;'1,.UtH;
frvi1n,mf\fo,(<J•h,,,rh,.,-, 1,11,,.1
..... ,
l•"' t\.\l<f",., l,1,11 ,f ,,.
,,1 ... ..,, .......r ,,...~.,,.., 1,,, ou•l•ou.,..._a-,.,.<1><•
..,.,,Lt,,, t..\l<t tl,.,1 ,l,. l;y .,,,r,_., , . j ,.., t r•lf"'1tf,1,l.l ,rt,"\.,,,
...,. , .•, .... .,..,,~,S.1,0:11 t, .... 0 1 ,~ • .......... J,,.....,.,.c;i
•ly•~J J•,-.f- •,J o,, ..... I 11 •~ ,, ,- lo tt,+IJ~111 •\•I\O(H
.,,ti,•,. ,, ..., ,, ......,,..-fT',

S~n~or Networks
Edge Management
llt-• .. ,,._,.,,_t••l-,..-..,,btu• ~ - '•"ti -t
,Ol..._..............frO<on1~-,..,..,, ..,•• ,..f,,•,..-,1 .,.,
.......,.,,.._,,"4.,....,-......... t.r• ....,
r.~. ~
,. ,j ,,.,.,,~,•.-..-.&el,,..,, •-.i
•.a,H,11 ..., , . 0 0
,,... .......o,1., ...",.-........ ,, . ___,
t ,~,. • ..'..,.,.-;••iltl'II" ••►' .,,..-,t' ttu"'t ... _,, ll 0
0

b. What nrc Light Weight Applicntion Layer protocols (08 Murks)


Ans. Along with physical and MAC layer protocols. ,,c also need application layer
protocols for loT networks. These lightweight protocols need to be able to carry
application messages, while simultaneously reducing power as far as possible. OMA
Lightweight M2M (LWM2M) is one such protocol. It delim:s the communication
protocol between a server and a device. The devices often have limited capabilities
and arc thus referred to as constrained devices. The main aims of the OMA protocol
are as follows:( I )Remote device management.(2)Transferring service data/
information between different nodes in the LWM2M network. Atl the protocols in
this class treat all the network resources as objects. Such resources can be created.
deleted, and remotely configured. These devices have their unique limitations and
can use different kinds of protocols for internally representing information. The
LWM2M protocol abstracts all of this away and provides a convenicill interface to
send messages between a generic LWM2M server and a distributed set of LWM2M
VIII Semt (CSE/ISE) C8CS · Mod.el-Q
clients. This protocol is otlen used along \\ ith Co/\ P (Constrained Application • lnan:sctm
Protocol). It is an application layer protocol that allows constrained nodes such as needed.
sensor moct:s or small embedded clcvict:s to communicate across the Internet. CoAP CoAP la..:ks sccun
seamh:ssly integrates with HTTP. yet it provides additional facilities such as suppon sccuri~ tDTLS),
for multicast operations. It is ideally suited for small devices because of its low that reduce it:. ~m
overhcad and parsing complexity and n:liance on UDP rather thnn TCP. which i, n kc~ ad
OR place a he.I\~ bu
6. u. Expl:1in the application protocols for IP. baitCl} po\\ ered d
( 12 Marks)
Ans. Application layer protocols founded on TCP and UDP solve the communication Another applicati
(X\1PP). fhe pro
challenges faced in an lo r project. The TCP protocol enables the XMPP, MQTT
XMPP is not suitn
and REST/11n·p communication protocols. The UDP protocol enables DDSI
As} nchronous an
and there arc DDSI implementations on TCP/IP. Device anti dntn storage server
subscribe approncl
communication is enabled by XMPI' nnd MQ"IT protocols that arc c1rnbled b) TCP/
cl isaclvantagc of X
IP. The RESTful API supported by I ITfP ennbles client/server communication. In
because of XML p
subsequent sections. the communication protocols will be discussed in detail.
low latency and sin
The message queue telemetry tr::i nsport (MQTT): - is a machine to machine
Representational s
architecture developed to enable lightweight connectivity. MQTT supports publish/
over HTrP. Cachin
subscribe over TCP. TCP minimizes the risk of data loss and brings in stream
RF.ST. The accept
simplicity and relinbility. The publish/service protocol is advnntageous in an loT
environment because there is no requirement on clients to request updates which can be implemcn1e
minimizes bandwidth, battery and computational requirements. Due to the advantages which have made i
identified MQ'IT is suitable for home nutomation and mobile communication. MQTT implcmt:ntation.
The advanced m
is laid 0111 in n star architecture where all devices connect to a common server. The
server is referred to ns n broker and TCP enables communication between a server publish/subscribe aA
and a client. Security is enforced through a user name and a password in a similar can be used. Rdiabi
way to I ITTPS. • A messnge is sen
• There is a guaran
MQTT offers flexibilit) in qualit) b) allowing three levels of qualit) enforcement
which arc listedbelow: • There is a guaran
• fhc first option is sending n message without an ad.no" ledgment requirement St:curity in AMQQ i•
• Tht: :.econd option 1s sending n mes~age once and requiring an acknowledgment When selecting the 1'
• The third option is requiring only one delivery through a handshake mechanism need to be considcrc
The constrnined application protocol (CoAP): - uses request/response to enable • Bandwidth requin
• Data latency
communication in recourse-constrained environments. Because the de~ign ofCoAP
is a subset of HTTP intcropernbility between CoAP and HTrP is possible. CoAP • Reliability
is implemented over UDP to minimize its footprint. UDP is preferable over TCP • Memory and code
because UDP minimizes bandwidth and overhead as compared to ·1CP. Becnuse of b. Briefly describe the
the unreliability of UDP the design of CoAP included reliabilit) . faery packet has a Ans. IOT Application Tran:
header thlll specifies message type and qualit) level n:quired. rhe mi::ssage t) pcs that the common t:.\isting s
can be specified are listed below. Notation (JSON) in th
• A confirmnble messnge type is sent S) nchrunously or aS) nchronousl) and an to create a statcful We
ackno" ledgment isrequircd. J) urrP : 1ITTP is
• A non-confi rmable mcssngc does not require acknowledgment. The more secure meth
• An acl-.no" lcdgment me~sagc t) pe requires confirmation or a processed message. de, 1cc. not a server.

12 ~l\s.+M f..c,-M 5<.,-MW'


VIII Se.wl/ (CSE/IS£) I nt"e.-t'\et" o f ~T ~
2) WebSocket: WebSockel is a protocol that provides full-duplt::-.. communication lie second
over a single TCP connection bet"een client and server. It is pa11 of the HTML 5 elsidmt
specification. The WcbSod,et standard simplifies much of the complc:-..ity around bi- managing
directional Web communication and connection management. proccss,s
3) XMPP: XMPP is a good e:-..ample of an existing Web tcchnolog) finding n.:w use asap
in the loT space. XMPP has its roots in instant messaging and presence information. okeach
It has expanded into signaling for VoIP, collaboration. lightweight middle\\are. ....,_ ti
content syndication, and generali7ed routing of XML data. It is a contender for mass
scale management of consumer white goodc; such as washers, dryers. refrigerators. --.-
plllfonns
and so on.
4) CoAP: The Constrained Application Protocol (CoAP) was designed by the IETF
for use with low-power and constrained networks. CoAP is a Rr.S fful protocol. II
"· ......
is semantically aligned" ith IITl"P. and even ha~ a one-to-one mapping to and from savmg big data
HTTP. Cot\P is a good choice of protocol for devices operating 011 batlery or energy and allow better c
harvesting. scalable and ffexi
S) MQTT: MQ Telemetry Transpon (MQTI ) is an open source protocol for replacement of X
constrained devices and lo,,-bandwidth. high-latency networks. It is a publish/ web application.
subscribe messaging transport that is extremely lightweight and ideal for connecting ln-memo11· Dato
small devices to constrained networks. These databa~e st01
Module- 4 the way of big data
process informatio
7. II, Expla in brie fly Big d:1tu anulytics tools nnd technology ( 12 Murks)
therefore. drastical
Ans. Big data analytics is the process of extracting useful information b} analysing different
types of big data sets. Big data analytics is used to discover hidden patterns, market IMDB systems. Vi
trends and consumer preferences. for the benefit of organizational decision mal-ing. same.
There nre sc, ernl stt:ps and technologies ill\ olvcd in big data anal) tic:.. H) hrid Daca Sron
Data Acquisition
Apachl' Hadoop i
scalability and spc
Data acquisition has two components: identification and collection of big data.
Identification of big data is <lone b) analyzing the mo natural form,11, of data- born a 1-ladoop IJistribu
systems known as
digital and born analogue.
Born Digital Data smooth openition e
It is the information which has been captured through a digital medium, e.g. a Google's MapRedu
computer or smanphone app, etc. This type of data has an cvt:r expanding range 'Mapping' and 'Re
since systems keep on collecting different kinds of information from users. Born for big data proce
digital data is traceable nnd can provide both personal and demographic business number of function
insights. E:...nmplc~ include Cookies. Web /\nalytics and GPS traci,.ing. Moreover. I ladoop
Born Analogue Data its development and
When information is in the form of pictures. ,idcos and other such formats \\hich Data Mining
It is a recent conce
relate to ph) sieal elements of our world, it is termed as analogue data. This data
discover the rclatior
requires conversion into digital format b) using sensors, such ns camerns. voice
single data sel for di
recording. digital assistants. etc. The increasing reach oftechnolog.) has also raised
reducing cos1s and i1
the rate al which traditional!) analogue data is being converted or captured through
digital mediums.

14
~
Llion The second step in the data acquisition process is collection and storage of data
MLS sets identified as big data. Since the archaic DBMS techniques were inadequate for
d bi- managing big data. a new method is used for collecting and storing big data. The
process is called MAD-magm:tic. agile and deep. Since, managing big data requires
WllSt:
a significant amount of processing and storage capacity. creating such systems is out-
ation. of-reach for most entities which rely on big data analytics. Thus, tht: most common
ware. solutions for big data proct:ssing today are based on two principles- distributed
mass
storage and Massive Paral.lel Processing a.k.a. MPP. Most of the high-end Hadoop
ators.
platforms and specialty appliances use MPP configurations in their system.
Non-relational Databases
, IETF
The databases that store these massive data sets have also evolved in how and where
col. It
the data is stored. JavaScript Object Notation or JSON is tht: preferred protocol for
saving big data nowadays, Using JSON. the tasks can be written in the application layer
and allow better cross-platform functionalities. Thus enabling. agile development of
scalable and flexible data solutions for the devs. Many companies are using it as a
replacement of XML as a way of transmitting stnactured data between the server and
ublish/ web application.
1ecting In-memory Database Systems
These database storage systems are designed to overcome one of the major hurdles in
tht: way of big data processing-the time taken by traditional databast:s to access and
arks) process information. IMDB systems store the data in the RAM of big data servers.
ilforent therefore. drastically reducing the storage 1/0 gap. /\pache Spark is an example of
market IMDB systems. VoltDB, NuoDB and IBM solidDB are some more examples of the
1aking. same.
Hybrid Data Storage and Processing Systems-Apache Hadoop
Apache Hadoop is. a hybrid data storage and processing system which providt:s
ig data. scah1bility and speed at reasonable costs for mid and smnll-scale businesses. It uses
-born a Hadoop lJistributed File System (HDFS) for storing large files across multiple
systems known as cluster nodes. Hadoop has a repl ication mechanism to ensure
smooth operation even during instances of individual node failun:s. Hadoop uses
, e.g. a Google's MapRcduce parallel programming as its core. The name originates from
grange 'Mapping· and 'Reduction' of functional programming languages in its algorithm
s. Born for big data processing. MapRt:duce works on the premise of increasing the
usiness number of functional nodes over increasing processing power of individual nodes.
Moreover. Hadoop can be run using readily available hardware which has sped up
its development and popularity, significantly.
s which Data Mining
his data It is a recent concept which is based on contextual analysing of big data sets to
s. voice discover the relationship between separate data items. The o~jcctive is to use a
o raised single data set for dilTerent purposes by different users. Data mining can ht: ust:d for
through reducing costs and increasing revenues.

15
VIII Sern., (CSE/ISE)
b. Wh11t do you menn by F.clge streaming a nalytic!>? Explain (04 Marks)
Ans. Azure Stream Analytics (ASA) on loT Edge empowers de\clopers to deploy near-
real-time analytical intelligence closer to loT devices so that the) can unlock the full
value of device-generated data. A.wre Stream Annlytics is designed for low latency.
resiliency, efficient use of bandwidth, and compliance. Enterprises can now deplo>
control logic close to the industrial operations and complement Big Data anal)tics
done in the cloud. Jong
Alurc Stream Analytics on 10·1 1-.dgc runs within the Al.lire lo I Edge frameworl,.. '-'ep,ncd
Once the job is creakd in ASA, you can deploy and manage it using lo r I lub. SA and el
~ of the
OR
physicall) iso
8. n. llow IOT Security is different from Physical and convcntiou:ll IT Security'~ 5. Ecos)stem
Explain. (06 Marks) When genera
Ans. I. Lifecycle mismatch alfect thi: 1arg
Many t) pes of ph) ,ical objects - buildings. automobile<;. refrigerators. light Data brcad11:
switches - last a long ti1m:, decades even. They t) picall) require maintenance, but stolen: thi~ a
they're otten not replaced until the repair bills get too high or they just don't work. multiplies th
We certainly don ·1 expect to replace them because a manufacturer decided not to even be a\\ ar
support them after five years. Ecosysll:m cl
Yet much of the software involved in loT is intended to be dispo~ablc. There may be significant di
no prO\ isions for upgrading client devices at all. Software support for even relatively Linux on ,~cl
expensive consumer devices is usually just on the order of a few) cars. attacks.
From a sccurit) per~pective, this means otherwise functional de, ice~ are likel) to be 6. Common,
exposed to unpatchcd vulnerabilities as they get older. Speaking at
2. Genernl-purposc extensible devices Bruce Schne
If these network-connected systems used specialized hardware and softwan: tu computeri;,e
operate and communicate, outdated software wouldn·t nccessaril) be a major issue. about the fi \
It would likely be hard to force such devices to take actions they weren't original!}' the same \\ a)
designed to do. into one phy
However, in practice, it"s very common for loT devices to effectively be general- I mentioned
purpose computers running open source operating systems and network stacl..s. more broadl
There arc good reasons for this: among other things, it"s easier (at least in principl..:) indi\ idual att
to update them and add new capabilities over time. However, it also means that an But even br
attacker who gains control of a device has more options to wreak ha, oc. failures in s
3, Dael economic incentives matters.
None of the above is unfixable. We keep industrial equipment running for decades. 7. ActuatorS
Sofiwarc vendors, including my employer, offer a vari•!IY of e:-.tended life support we·ve seen I
options for subscription products. These models work because customers are willing convenliona
to pay for ongoing maintenance and support at levels where ifs profitable for vendors different. it".
to supply them. Schneier cal
Those same incentives aren't in place when you buy a light switch, or perhaps even Software al
a vehicle. No consumer is likely to pay for an ongoing light switch contract. Some do. But the
may do so for cars, but it"s not common after the initial warranty period. /\s a result. pervasively
there nre 1\0 incentives for mo, t device nrnkerc; to continue supoortinit what tht:y" Vt:
16 &w.+M C.CAM ~MV
rks) sold beyond a fairly short window.
1ear- 4. Connected by default
full Vulnerability to attackers who connect to an loT device or gateway wouldn't matter
ncy. so much if making that corniection were difficult or impossible. But increasingly it is
ploy not. Tht: norm is to connect to networks, usually wirt:less networks. and often public
ytics networks. Even when thcrc·s no compelling reason to do so.
Ifs Jong been recognized that protecting computer systems again5t intruders w ho
vork. have gained physical access can be extremely challenging. (Witness breaches at the
NSA and elsewhere.) However, pervasive and routine network access introduces
many of the same threat vectors. Certainly it creates a far greater attack surface than
physically isolated systems.
rity? S. Ecosystem effects
nrl.s) When general-purpose. networkaconnected computers are breached. it doesn ·, just
affect the target or the attack.
light Data breaches can affect millions of customers when sensitive information is
e. but stolen: this applies whether we·re talking loT or more conventional IT systems. loT
vork. multiplies the issue by collecting more and more ambient data that people may not
not to even be aware is being collected.
Ecosystem damage can go beyond data. The Mirai botnet's OOoS attackcaused
ay be significant disruption to the internet as a whole. It resulted from outdated versions of
lively Linux on webcams being turned into remote-controlled bots for large-scale network
attacks.
to be 6. Common widespread failu re modes
Speaking at the Open Source L<!adership Summit earlier this year, security expert
Bruce Schneier noted that comp'Jtcrs and networks fail in a di/Terent way tha11 non-
are to computerized systems. "You worry about crashing all the cars. You·re concerned
r issue. about the five sigma guy, not the average guy. It doesn ·1 happen in lock picking in
1 inally
the same way.'' he said. because no matter how skilled, one person can only break
into one physical building at a time.
cneral- I mentioned the Mimi botnc:t c:arlier. But it"s the nature ofloT and connt:cte<I systems
stacks. more broadly that vulnerabilities and attacks usually affect many systems. Ofcourse,
ncipl::) individual attacks can still lead to data breaches or the shutdown of a critical system.
that an But even breaches that would be relatively innocuous in isolation can cause serious
failures in systems like the power grid if multiplied by a thousand or a million. Scale
matters.
ecades. 7. Actuators can affect our environment
support we·ve seen how loTcan differ in scale, connectedness and vendor support from more
willing conventional IT systems. But if I had 10 pick one aspect of loT IhaI ·s rundamentally
different. it's this one: loT is not read-only.

t:l:~;~;
result.
Schneicr calls it ··an intt:rnet that affects the world.'"
Software already controls many critical systems or directly tells humans what to
do. But the degree to which loT is replacing manual and disconnected controls
pervasively and hy default is striking.
they've

1\1\U'
17
VIII Se-n11 (CSE/ISE)
That loT can take physical actions may not really change its security model, but it
certainly raises the stakes.
We prioritize foatures. We prioritize low prici::s. We prioritize today. We do
notprioritize security over a product lifecyclc that may span decades. In devices that
have the power to a[ect the physical world.
All loT Agendu 11etwurk c:u11tributor.1· are respu11siblefor the cu11te11t and accuracy uf
/heir pus1.1·. Opi11io11s are ci /he ,wr-ilers and do 110I 11ec:essurily c:011vey t/1e tlruuglrt.1
<~( loT Agenda.
b. Wlrnt is OCTAVE and FAffi explain? (10 Marks)
Ans. Within the industrial environment, there arc a nup1bcr of standards. guidelines. and
best practices available to help understand risk and how to mitigate it. lEC 62443
is the most commonly used standa.rd globally across industrial v~rticals. It consists
of a number of parts, including 62443-3-2 for risk assessments. and 62443-3-
3 for foundational requirements used to secure the industrial environment from a
networking and communications perspective. Also. ISO 2700 l is widely used for
organizational people. process, and information security management. In addition. The first step
the National Institute of Standards and Technology (N IST) provides- a series of criterion. OC
documents for critical infrastructure, such as the NIST Cybcrsecurity Framework impact, , a lue
(CSF). In the utilities domain, the North /\merican Electric Reliability Corporntion's that at an) po
(NERC's) Criticnl lnfrastructure Protection (CIP) hns legally binding guidelines for model. (Whit
North /\merican utilities. and IEC 62351 is the cybersecurity standard for pO\\er model, dcscri
utilities. The seconds
The key for any industrial environment is that it needs to address security holistically with assets, a
and not just focus on technology. It must include people and processes, and it should owners, cust
include all the vendor ecosystem components thnt make up a control system. It is importar
In this section, we present a brief review of two such risk assessment frameworks: information
• OCT/\VE (Operationnlly Critical Thrent. t\ssct and Vulnerability Evaluation) critical.
from the Software Engineering Institute at Carnegie Mellon IJnivcrsity Within this ai
• f/\ll{ (l· actor Anal)sis oflnformation Risk) from The Open Group of the assets
These two systems work toward establishing a more secure environment but with identifying ti
two dilferent approaches and sets of priorities. Knowledge of the environment is key human actor
to determining security risks and plays n key role in driving priorities. There arc, I
OCTAVE than simply
OCTAVE has undergone multiple iterations. The version this sc<.:tion focuses on is the prioritiz
OCTAVE/\ llegro. which is intended lo be a light"cight and less burdcn~ome process technical con
to implement. Allegro nssumcs that a robust security team is not on stnndby or the applicati~
immediately at the ready to initiate a comprehensive security review. This approach with that ind·
rind the assumptions ii makes are quite appropriate, given thal man) operational The third stc
technology arens are similarly lacking in security-focused human assets. Figure 8 the range of
illustratei; the OCTAVE Allegro steps and phases. This refercn
However, it
or even the A
information.

18
- ModeiQ~Petpet"' · l
)lit it

• do ~,.p~: s,q,6:
ldenufy Are.IS
' that ofCouet:n1
ltknufy R,;ks

·yo/
SttpS:
gill.\ lde11tify Thrc>I
SttJI 7:
An;d)t.C Risks
~ SCC03CIOS

SleJ> 8:
M11tt;111on
A1lS~""(l:,ch

Establish Profil.: ld.:ntity Identity and


om a Driv.:rs As.~cts Thr.:~ts Mitigate Risk
ed for Figure 8 OCTAVE Allegro Steps and Phases
ition. The first step of the OCTAVE Allegro methodology is to establish a risk measurement
·es of criterion. OCTAVE provides a fairly simple 1m:ans ot'doing this II ith an emphasis on
work impact, value. mid measurement. The point oflrnving a risk 111cn~urc1111.:nt criterion is
1tion's that at any point in the later stages, prioritization can take place against the reference
es for model. (While,OCTAVE has more details to contribute, we suggest using the FAIR
model, described next. for risk assessment.)
The second step is to develop an information asset profile. This profile is populated
tically with assets, a prioritization of a$Sets, attributes associated with each asset, including
hould owners, custodians, people. explicit security requirements, and technology assets.
It is importalll ro stress the imµortance of process Certainl~ , lhe net:d to protc1:1
orks: information does not disappear, but operational safety and continuity are more
iation) critical.
Within this asset profile, process are multiple substages that complete the definition
of the assets. Some of these are simply survey and repo11ing activities, such as
tt with identifying the asset-and attributes associated with it. such as its owners, custodians.
tis key human actors with which it interacts, and the composition of its technology assets.
There arc, however, judgment-based attributes such as prioritization. Rather
than simply assigning an arbitrary ranking, the system calls for a justification of
son is the prioritization. With an understanding of the asset attributes. particularly the
roccss technical components, appropriate threat mitigation methods ca11 be applied. With
dby or the application of risk assessment, the level of security investment can be aligned
,proach with that individual asset.
ational The third step is to identify information asset containers. Roughly spc:iking, this is
igure 8 the range of transports-and possible locations where the information might reside.
This references the compute elements and the netwu ks by which they communicate.
I lowever, it can also mean physical manifestations such as hard copy documents
or even the people who know the information. Note that the operable target here is
information, which includes data from which the information is derived.
)UU\lltK
19
VIII Se+11,, ( CSE/IS£)
In OCTAVE. tht: emphasis is on the container level rather than the asset level. The
value is to reduce potential inhibitors within the container for information operation.
In th.: OT "orld, the emphac;ic; ic: on reducing potential inhibitors in the containerized
operational space. If there is some attribute of the information that is endemic to
it, then the entire container operates with that attribute because the information is
the defining element. In some cases this may not be true. even in n environments.
Discrete atomic-level data may become actionable information onl~ ifit is seen in the
context of the rest of the data. Similarly. operational data taken without kno,-.ledgc
of the rest of the elements may not be of particular value either.
The fou11h step is to identify areas of concern /\t this point. \\e depart from a data
flow. touch. ;ind attribute focus to one ,.,.here judgments are made through a mapping
of security-rcl;itcd attributes to more business-focused use cases. At this stage, the
analyst looks to risk profiles and delves into the prcviou~ly mentioned risk analysis.
It is no longer just facts. but tht:rc is also an element of creativit) that can factor
into the evaluation. History both within and outside the organization can contribute.
References to similar operational use cases and incidents of security failures arc
reasonable associations.
Closely related is the fifth step. where threat scenarios arc identified. Threats are
broadly (and properly) identified as potential undesirable events. This definition
means that results from both malevolent and accidental causes are viable threats. In
the context of operational focus, this is a valuable consideration. It is at this point that c, cnt frcqucnc
an explicit identification or actors, motives. and outcomes occurs. These scenarios There arc mul
are described in threat trees to trace the path to undesired outcomes, which. in turn. be undcr!>tood
can be associated "'ith risk metrics. applied to a vu
At the sixth step risks arc identified. Within OCTAVE. risk is the possibility of an ,, cakncs~. but
undesired outcome. This is extended to focus on how the organi,ation is i,mpactcd. fai I as a result
For more focused analysis. this can be localiLcd. but the potential impact to the The other side
organization could extend outside the boundaries of the operation. begins to quant
The seventh step is risk analysis. with the effort placed on qualitative evaluation of The FAIR spec
the impacts of the risk. I lcre the risk measurement crite1 ia defim:d in the first swp are cost estimat..:&
explicitly brought into the process. the target of ti
finally, mitigation is npplied at the eighth step. There arc three outputs or decisions on operational
to be taken at this stage. One may be to accept a risk and do nothing. other than much easier.
document the situation. potential outcomes. and reasons for accepting the risl-. The FA IR defines
second is to mitigate the risk with whatever control effort is required. By walking focused. Of p
back through the threat scenarios to asset profiles. a pairing of compensating controls loss. Respons
to mitigate those threat/risk pairings should be discoverable and then implemented. measure but di
1 he final possible action is to dcfor a decision. meaning risk is neither accepted nor least measurall
mitigated. lhis may imply further research or activit), but it is not required by the
process.
OCTAVI:. i~ n balanced information-focused process. What it offers in terms of 9. a. What llo you
discipline and lnrg.cl~ unconstrained brendth. ho" c, er. 1s offset b~ ih lack ofsccuri[) Aas. setup() - The
spccilicit). fherl:.' i~ an ac;s11111ptio11 that bc)ond these steps arc ~ccmingl) means of \ ariablt.!s. pin 1

20
I. Tiu: identifying specific mitigations that can be mapped to the threats and risks exposed
ration. during the analysis process.
erized FAIR
1ic to FAIR (Factor Analysi~ of Information Risk) is a technical standard for risk definition
tion is from The Open Group. While information security is the focus. much as it is for
ments. OCTAVE. FAIR has clear applications within operational technology. Like OCTAVE.
in the it also allows for non-malicious actors as a potential cause for harm, but it goes to
vledgc greater lengths to emphasize the point. For many operntional groups. it is a welcome
acknowledgement of existing contingency planning. Unlike with OCTAVE, there is
a data a s ignificant emphasis on naming, with risk taxonomy definition as a very specific
apping target.
ge, the FA IR places emphasis on both unambiguous definitions and the idea that risk and
ialysis. associated allributes are measurable. Measurable. quantifiable metrics are a key art:a
factor of emphasis. which should lend itself well to an operational world with a richness of
tribute. operational data.
res an: At its base, FAIR has a definition of risk as the probable frequency and probable
magnitude of loss. With this defi nition, a clear hierarchy of sub-elements emerges.
ats are with one side of the taxonomy focused on frequency and the other on magnitude.
•fin it ion Loss even frequency is the result of a threat agent acting oo an asset with a resulting
eats. In loss to the organization. This happens with a given frequency called the threat
oint that event freq uency (TEF), in which a specified time v.1 iridow becomes a probability.
enarios There arc multiple sub-attributes that define frequency of events, all of which can
in turn. be understood with some form of measurable metric. Threat event frequencies are
applied to a vulnerability. V11/11erability here is not necessarily some compute asset
ty of an wcakm;.ss, but is more broadly dt:11111::d as the probability that tho: targeto:d asstlt will
pacted. fai l as a result of the actions applied. There are further sub-allributes here a~ well.
I to the The other side of the risk taxonomy is the probable loss magnitude (PLM). which
begins to quantify the impacts. with the emphasis again being on measurable metrics.
ation of The FAIR specification makes it a point to emphasize how ephemeral some of these
step are cost estimate~ can be, and this may indeed be the case when information security is
the target of the discussion. Fortunalt::ly for the OT operator. a signiticant emphasis
ccisions on operational efficiency and analysis makes understanding and quantifying costl>
her than much easier.
isk. The FA IR defi nes six forms of loss, four of them externally focused and two internally
walking focused. Of particular value for operational teams are productivity and replacement
controls loss. Response loss is also reasonably measured. with fines and judgments easy to
mented. measure but difficull lo predict. Finally, competitive advantage and reputation are the
·pted nor least measurable.
d by the
Module - 5
terms of 9. a. What do you setup Arduino UNO? (06 Marks)
security Ans. setup() - The setup() function is called when a sketch starts. Use it to initialize
eans of variables, pin modes. start using libraries. etc. The setup function will only run once.

21
VIII Se,m,, (CSE(ISE)
after each powerup or reset of the Arduino board.
Example
int button Pin = 3;
void setup()
{
Serial.begin(9600);
pinMode(buttonPin. INPUT);
}
void loop()
l
II ...
}
b. Explain the IOT netw ork protocol stack ( 10 Marks) IO'.! 15 4 I:, 12
Ans. The Internet Engineering Task Force (IETF) has developed alternative protocols for ~ the adapiat
communication between loT devices using IP because IP is, a flexible and reliable under routinl!.
standard. _The Internet Protocol for Smart Objects (IPSO) Alliance has published ofmthcnCl\~o
various white papers describing alternative protocols and standards for the layers net\\Ork.
of the IP stack and an additional adaptation layer, which is used for communication (3) ~etwork
between smart objects. from the trans
(1) Physical and MAC Layer (IEEE 802.15.4). The IEEE 802.1 S.4 protocol is (ROLL) \\Orki
designed for enabling communication between compact and inexpensive low power Lossy Nerwor
embedded devices that need a long battery life. It defines standards and protocols For such ncl\\
for the physical and link {MAC) layer of the IP stack. It suppo11s low po\,er describes ho"
communication along with low cost and sho11 range communication. ln the case of the nodes after
such resource constrained environments, we need a small frame size. low bandwidth. function is us
and low trnnsm it power. constraints ma
Transmission requires very little power (maximum one milliwatt). which is can be 10 avoi
only one percent of that used in WiFi or cellular networks. This limits the range function can a
of communication. Because of the limited range. the devices have to operate need to be sent
cooperatively in order to enable multihop routing over longer distances. As a result. rhe making of
the packet size is limited to 127 bytes only, and the rate ofcommunication is limited to to nt.•ighboring
250kbps. The codirrg scheme in IEEE 802.1 S.4 has built in redundancy, which makes not depending
the communication robust, allows us to detect losses. and enables the retransmission for\Vard the m
of lost packets. The protocol also supports short 16-bit link addresses to decrease the leaf nodes and
size of the header, communication overheads, and memory requirements. upwards hop b
(2) Adaptation Layer. 1Pv6 is considered the best protocol for communication in follows. We se1
the loT domain because of its scalability anti stability. Such bulky IP protocols were (towards leave
initially not thought lo be suitable for communication in scenarios with low power To manage the
wireless links such as IEEE 802.15.4. nonstoring no
6LoWPAN. an acronym for 1Pv6 over low power wireless personal area networks. nodes arc in a
is a very popular standard for wireless communication. It enables communication route informati
using IP¥6 over the IEEE 802.15.4 protocol. This standard defines an adaptation root. The root
layer between the 802.1 S.4 link layer and the tran~port layer. 6LoWPAN devices can

22 ~I\!,+..... E '"""' $c.,.Mt.,,,


~
communicate,, ith all other IP based devices on the Internet. The choice of 1Pv6 ~:.
bic,ca~ oft he large:: addressing space available in IP, 6. 6LoWPAN nctworl-.s connect
to the Internet via a gateway (Wiri or Ethernet), which also has _protocol support for
comcrsion bch,cen 1Pv4 and 1Pv6 as today·s deployed Internet 1s mostly 1Pv4. 1Pv6
headers are 1101 small enough to tit within the small 127 byte MTU of the 802.1 ~-4
saandard. Ilcncc, squee£ing and fragmenting the packets to earl) onl} the essential
information is an optimi7ation that the adaptation layer performs._ . _ _
SP',-cilkally. the adaptlllion layer performs the following three op11n11zat1ons in order
to n-ducc communication o, erhead·(i)I leader compression 61oWP/\N defines header
compression of I Pv6 packets for decreasing the 0\erhead of I Pv6. Some of the fields
are deleted because they can be derived from link level information or can be shared
across packets.(ii)fragmentation: the minimum MTU si1e (nrnxirnum transmission
unit) of 1Pv6 is 1280 bytes. On the other hand. the maximum size ofa frame in IEEE
Marks)
80'.::. I 5.-t is 127 bytes. Therefore. we need to fragment the 1Pv6 packet. This is done
ocols for
b, the adap1niion layer.(iii)Link la)er fornarding 6LoWPJ\N also supports mesh
reliable
u;ider routing, which is done at the link layer using link level short addresses instead
ublished
he layers
of in the network layer. 'l his feature can be used to communicate \\ithin a 6LoWPAN
nct\\ork.
unication
(3) Network Layer. The network layer is responsible for routing the packets received
rotocol is from the transport layer. The IETF Routing over Low Power and Losf.y Networks
ow power (ROLL) working group has developed n routing protocol (RPI ) for Low Power and
Lossy Nct\,orks (LLNs).
protocob.
iw po\\er For such ncl\\Orks. RPI is an open routing prolocol. based on distance , ectors. It
e case of describes how a destinalion oriented directed acyclic graph (DODAG) is built with
nndwidth. the nodes afler they exchange distance vectors. A set of constraints and an objectivt:
function is used to build the graph with the best path. The objective function and
which is constraints may differ with respect to their requirements. For exnmple, constraints
the range can be to n, oid battery powered nodes or to prefer encrypted links. The objective
~ operate function can aim 10 minimize the latency or the expected number of packets that
need to be senl.
,s a resuh.
limited to Tht: making of this graph starts from the root node. rhc root start~ ~ending messages
ich makes to neighboring nodes. which then process the me~sagc and decide whether to JOin or
nsmission 1101 depending upon the constraints and the objective function. Subsequently. they
1crease the fon\ard the message to their neighbors. In this manner, the messngc traveb till the
leaf nodes and II graph is formed. Now all the nodes in the graph can send packets
ication in upwards hop by hop to the root. We can realize a point to point routing algorithm as
cols wen: follows. We send packets to a common ancestor, from which it travels dow11wards
ow po\\cr (towards leaves) lo reach the destination.
1
To manage the memol) requirements of nodes. nodes are classified into storing and
1networks. nonstoring nodes depending upon their ability to store routing inforn'iation. When
~unication nodes are in a nonstoring mode and a downward pa1h is being constructed, the
ndaptation route information is attached to the incoming message and forwnrdcd further till the
evicescan root. The root receives the whole path in the messagt: and sends .1 data packet alo11
11

23
VIII Sem, ( CSE/ISE) I V\t"e,,-vte-C o f ~ i ~
with the path message lo the destination hop by hop. But there is a trade-off here
because nonstoring nodes need more power and bandwidth to send additional route
information as they do not have the memory to store routing tables.
(4) Transport Layer. TCP is not a good option for communication in low power
environments as it has a large overhead owing to the fact that it is a connection
oriented protocol. Therefore, UDP is preferred because it is a connectionless protocol
and has low overhead.
(5) Applicati~n Layer. The application layer is responsible for data formatting.
and presentation. The application layer in the Internet is typically based on HTTP.
However, HTTP is not suitable in resource constrained environments because it is
fairly verbose in nature and thus incurs a large parsing overhead. Many alternate
protocols have been developed for loT environments such as CoAP (Constrained
Application Protocol) and MQTT (Message Queue Telemetry Transport).(a)
Constrained Application Protocol: CoAP can be thought of as an alternative to HTTP.
It is used in most loT applications. Unlike HTTP, it incorporates optimizations for
product,
constrained application environments. It uses the EXI (Efficient XML Interchanges)
and 1,me. S,
data format, which is a binary data format and is far more efficient in terms of
can run on ,t and.
space as compared to plain text HTM L/XML. Other supported features arc built in ·
\\Ondcrful pro
header compression, resource discovery. autoconfiguration. asynchronous message
Installing and
exchange, congestion control, and support for multicast- messages. There arc
Raspbian, the defi
four types of messages in CoAP: nonconfirmablc, confinnable, reset (nack). and and 3, Ml loading
acknowledgement. For reliable transmission over UDP. confirmable messages arc
clicl.. the top let1 P
used. The response can be piggybacked in the acknowledgement itself. Furthermore.
"P~ thon 3 (IDEL)
it uses DTLS (Datagram Transport Layer Security) for security purposes.(b)Message
Queue Telemetry Transport: MQTT is a publish/subscribe protocol that runs over
TCP. It was developed by IBM primarily as a client/server protocol. The clients
are publishers/subscribers and the server acts as a broker to which clients connect
through TCP. Clients can publish or subscribe to a topic. This communication
takes place through the broker whose job is to coordinate subscriptions and also
authenticate the client for security. MQTT is a lightweight protocol. which makes it
suitable for loT applications. But because of the fact that it runs over TCP, it cannot
be used with all types of loT applications. Moreover. it uses text for topic names.
which increases its overhead.
MQTT-S/MQTr-SN i<: an extension of MQlT, which is designed for lo\\ power
and low cost devices. It is based on MQTT but has some optimizations for WSNs
as follows. The topic names arc replaced by topic IDs, which reduce the overheads
of transmission. Topics do not need registration as they are preregistered. Messages
are also split so that only the necessary information is sent. Further, for power
conservation, there is an offiinc procedure for clients who are in a sleep state.
Messages can be buffered and later read by clients when they wake up. Clients \\ hen P) thon 3 lo
connect to the broker through a gateway device. which resides within the sensor l. Entering Pytb
network and connects to the broker. 2. Input/output~

24
~
f hcrc OR
route
a. Ho" do ) 'OU progrnm raspbe rry pi with python? (10 'htr~)
P)1hon is a programming language that has recently become vcr) poµul;,r-so
ower
popular. in fact. that it is no" the fourth most popular language (according to the
.crion
tocol TIOBE indcx).lJnlikc other l;rnguagelt (C. Ct+, etc.). Python is an 1111e1 prc1cd
language. which means ,1,hcn a Python program is cxccutcd,';111 1111erpre1cr reads
tting the Python instructions and then performs the desired action. The: interr rer itself
TIP. is \Hitten in CPU nati,e instnactions, but the P)thon progra111 is not. fhi-, 111eans
it is Python programs can, in theory, run on ANY computer. so long as that computer
has a Python interpreter. This makes Python n cross-platform lm1gu:1g1.,, whh;h
is one of the main rea:.ons "h) it has bcconu.' so p11pular /\ program ., ri11en on
Windo\\ s does 1101 nccd to be rcwrillen to work on a Mm; or a Linux 111,11:hinc. It's
also incredibly simple. cas) tu read. and pti,,crful. II i;;•n ,Hile ad\'anccd programs.
including graphical interfaces, nel\1 orking. parnllelism. and mu-:h more. Python is a
pro<lucti, c language, resulting in faster program production and requiring less code
nnd time. Since Linux has been written for the Pi (ARM core) the Python mterprcler
can run on it and, thercforc. so can Python programs. So, ho,.,, do we \lart with this
wonderful programming language on the Pi?
e an: lnst:111ing :ind Loading the IDE
• and Rnspbian, the default OS choice for lhe Raspbcrl) Pi, should contain both Python 2
sore a1~d 3, so loadi1_1g !'~thon !thould be ca~y lo navigate through menu oplwns. firstly.
more. click the top lell P1 icon on thi! menu bar. Then na'vigate to ··Progn1111ming" and click
ssagc "Python 3 (IDEL)''.
o,cr • L:!11 •:? ~ •• -••W •.,.,r -
lients
nncct
~~----~
~
ation
also
tkes it
annot
m1:s.

0\\Cr
SNs
1eads
sages
ower
state.
lients Lmuli11g Python 3 ll>t
ensor \\ hen P) thon 3 loads. the, isiblc windo" 1, the shell. \\'luch ha l\\ u n1ai11 pu, poses:
1. Entering Python commands: ,;ome quick calc11int1ons
2. lnput/outpul for programs: gelling datn from a 11-;er and tli..,ph~ in:.: infonnation

MW' 25
VIII Se,m,, (CSE(ISE)

. .,..,,.
Sman trtr11--
Th~ 111ni11 !,/wf/ ~·hgent I
The 1•)• ,11s1> ha~ a 1•1en11 oar,,, .. 111·my options, including file opcmtions, editing., S)Stems is t
shcl\ : .;,iic nption.,; (''li.t-.:nup~ p10-rnm'' for example), program 11\!bugging, IDE avoid accident
optio,,s. ,,ml ·H·.•lp... l ur now, ,. c will only concern ourselves "ith creating a new technologies g
file, "hich is don., ~y chckir.f rile> New File. nccch:rometcrs
infrared sensors
,chide mo,cmcnt
Traffic sur-. eillanc
ncl\,ork to each o
.... -=~II (l•l•I,
Rf· ID de, ices. and
Qpen <~•O
~•t.Q<IUe M•U. pnrt, of the cit\ C:
fi~•Ml!f•~ cond1tiom, can be
CL:.• ~oo-•~l"f .oll•C surveillance using
Po•~Q:'11>,W
can .ilso be imple
~q Clr•S sensors. These ap
s.,,,iui,
5a',,!CoP!~
user •~ driving. St
and users nrc usi
CrM>
"""4\Yl."1<10>,\'
,\pphcntions toe
conditions. It nls
now was mainly i
to help drivers
Creating a new file of drivers and he
tired and helping
CreatingFirst Pro~rnm
Once the ne" w indo,~ lo.ids. "c can linall> enter ou,· first Python program. To do application\ are
this, enter the code as shown below into the window. Then, sa,e the file. the steering ( to
name=,.,. application, ,,hi
such as the acccl
age 0
currcntYcar - 0

26
~-- ----·- ---
yearBorn -= 0
name= input("Enter your name: ")
age=- inl(input("Enter your max age this year:•'))
cum:ntYear = int(inpul("What is the current year?:"))
yearBorn = currentYear - age
print("You were born in the year"+ str(yearBorn))
Running Our Program ·
With the code entered and the file saved, i1's lime to run lhc program. Running a
Python program can be: done in one or three ways: press FS in lhc window with the
program to run. go to the menu bar and click Run > Python Shell. or run the file via
a tenninal window as an argument for Python. For now. the easiest w;iy is In simply
press rs in the ,vindow with the code. Once pressed, lhe code should return no
errors. and the shell window should p~ompt for data.
b. Justify the statement " An IOT strategy for smarter cities". (06 Marks)
Ans; Smart transport applications can manage daily traffic in cities using sensors and
intelligent information processing systems. The main aim of intelligent transport
systems is to minimize traffic congestion. ensure easy and hassle-tree pa,-king, and
avoid accidents by properly routing traflic and spotting drunk dri,1ers. The sensor
technologies governing these types of applications are GPS sensors for location.
accelerometers for speed. gyroscopes for direction. RFIOs for vehicle identification.
infrared sensors for counting passengers and vehicles, and cameras for rccofding
vehicle movement and traffic. There are many typ.cs of applications in this area:( I)
Traflic surveillance and management applications: vehicles .ire connected by a
network to ench other, the cloud. and to a host of loT devici:s such as GPS sensors.
RflD devices. and cameras. These devices can estimate lrafiic conditions in different
parts of the city. Custom applications can analyze trafik patterns so that future traffic
conditions can be estimated. They implemented a vehicle tracking system for traffic
survei Ilance using video sequences captured on the roads. Traffic congest ion detection
can also be implemented using smartphone s~nsors such as accelerometers and GPS
sensors. These applications can detect movement patterns of the vehicle while the
user is driving. Such kind of information is already being colleclcd h, Google maps
and users are using it to route around potentially congested areas vf the city.(2)
Applications lo ensure safety: smart transport does not only impl) managing traffic
conditions. It plso includes safety of people trnvclling in their vehicl!:!s, v. hich up till
now was mainly in the hands of drivers. There are man) lo r applications developed
to help drivers become safer drivers: Such applications monitor driving behavior
of drivers nnd help them drive safely by detecting when they arc feeling drowsy or
tired and 111:lping them to cope with it or suggesting rest. Ted111ologie~ used 111 such
applications are face detection, eye movement detection, and pres~ure detection on
the steering (to measure the grip of the driver's hands on the skering).A srn:111phone
application, which estimates lhc drivcr'.s driving behavior using smnrtphonc sensors
such as the acceleromeler, gyroscope. GPS. and cnmcrn. has been propns,•d by Eren

t'
27
VIII Se-n-11 (CSE/ISE)
et nl. It can decide "hethcr the dri, tng i,; safe 01 rash b) an.ii~ 711lg the sensor data
())Intelligent p:uki11g 111anagen1cnt· in n small tr:111sportat1011 '-Y~tcm, parking. 1s
completely hassle free as one can easil) chcd, on the Internet to find out" l11ch parl,,.ing
lot has frec spaces. Such lots use scn-,ors to detect if the slot~ ate free or occupied
by vehicles. This data is then uploaded to a central server.(4)Smart traffic light<::
traffic lights equipped,, ith sensing. processing. and commun1catio11 capabilities .u-c
called smart traffic light~. These light- sense the traflk congestion at thl· inl<!r~cction
and the amount of trallic going each ''- ,l\. I his inform,tti,111 can he anal) Led and
then sent to neighboring traffic light, N •• ccntr:ll ~ont1ollcr It i'> possible to 11',c
this information crcativel). For example. 111 an en11.:rgenc) situation the tralhc lights
can preferential I} gin ,, 1) to 1111 ambulance. \\ hen the smart trnllic light senses an
ambulance com :ng. 11 dears the path for 1t and ,llso informs neighhoring lights about
it. Technologies used Ill thc~e lights arc cameras, communication technologies. and
data an .. ·ys1 :.1oduks. Su,·h 1 y<:t.!ms ha\'C already been deplo} cd in Rio De fanc1ro.
(S)Accl\lent dctcctio11 api'!1c·11ions: a smartphone application designed by White
dete.:ii. the occurn..11ec 1>f ,111 .1cciJcnt \\ ith the help of an accelerometer and acoustic
data It •.~mcdi:11i.:ly ~ends th1 ,niormation along\\ ith the location to lhc ncan.'.st
hospital '11 Ille addi11011al situational information ~uch as on-site photographs 1s Jlso
sent so th'.lt tlw fin,t r.:spondcrs kncm about the \\hole scenario '.llld the degree of
medical hi.:lp th:,t is required.

S)Sh:m or a C
a fac1lit) such

One of the co
loT \\orld ha,
i■lelligcmtl)p
arc the SA nnd

....
a.
is Eighth Seml'ster B.E. Degree Examination,
g CBCS - Moc.Jet Question Paper - 2
d INTERNET OF THINGS TECHNOLOGY
ls: ae: 3 hrs. Max. Marks: 80
re Net~: Atnwer a11y Ff VF. /11/1 q11e.\li1111.~, .1e/ecti11R ONE full 1111e.1·titJ11 from e11ch module.
bn
nd Module- I
SC
Explain the fou r pillars ofIOT &how arl' they interconnected with each other
~ts
(10 Miu·ks)
an
The four pillars of loT arc M2M, RFID, WSNs and SCA DA (Supervisory Contcol
iut and Data Acquisition).
nd
ro.
• M2M u~es devices to capture events, via a network connt:ction to a central server.
lite that translates the captured events into meaningful information.
,tic • RFID uses radio waves to transfer data from an electronic tag attached to an
·est object to a central system through a reader for the purpose of identifying and
lso
tracking the object.
of • A WSN consists of spatially distributed autonomous sensors co monitor physical
or environmental conditions.
• SCA DA is an autonomous system based on closed-loop control theory or a sma11
system or a CPS data connects, monitors, and controls equipment via network in
a facility s_uch as a plant or a building.

M2M

IOT RFIO

SCADA

One of the common characteristics of the Internet of Things is that objects in a


loT world have to be instrumented. inkrconnccted. before ;10ything can be
intelligently processed and used anywhere, :inytime, anyway. nnd anyhow. which
are the SA and 31 characteristics.

29
WID'
Iclat~. Eighth Semester B.E. Degree Examin.ttion,
!ng IS
irking CBCS - Model Question Paper - 2
up1ed INTERNET OF THINGS TECHNOLOGY
ight,: • 3 hrs Mull.. Marks: 80
cs an.: N.-: An,i.·l!r an.I' F/J, F /11/l t/W!Miom, lt!lt!L'ling 01\'E/11/1 q11elti1111 from e11d1 mot/11/e.
·ction
d ,llld Module - 1
10 USC
lights Explain the four pill:trs of IOT &how :trc they interconnected with each other
ISC', an (10 Marks)
about Au. The four pillars of lo I are M2M, RflD, WSNs and SCA DA (Supcn isory Control
•s. and an<l Onta Acquisition).
iinciro. • M2M uses devices to caplure events,\ ia a network connection to a cenlral se1vcr.
White lhat translates the c.. ptured c,·ents into meaningful information.
coustic • RFID m,cs radio waves to transfer data from an t:lectronic tag attached to an
nearest ObJect to a central S)Stem through a reader for lht: purpose of identifying and
1s :ilso tracking the object.
.grce of • A WSN consi!.tS of spatially distributed autonomous sensors ro monitor physical
or environmental conditions.
• SCAOA is an autonomous system based on closed-loop control theory or a sma11
S) sti:m or a CPS dala connects, monitors, and controls equipment , ia nelwork in
a focility s.uch as a plant or a building.

IOT RFIO

One of the common characteristics of the Internet of Things is that objects in a


loT '"urld hm e to be instrumented. interconnected. before anything can b~
intelligent!) processed :ind u~ed anY', ht•rc, nnytimc, an) WII) . and n n~ hon." hi.:h
arc lhc SA and 31 charncteri~tics.

29
VIII Se,m, ( CSF./ISF.) Interl'\.etof~T~
Four Pillars and Their Relevance to the Network:
-
RFIO Vu No Some

W$N v., Som• No Some

M2'A Som• v.. No Som•

SCADA Some Som, v., Yu

l. Machine-to-Machine (M2M) communication:


Machine-to-Machine (M2M) communication 1s a form of data communication that
im oh cs one or more entities that do not ncccssaril} require human interaction or
intel'\ention in the process of communication. M2M is also named as Machine Type
Communication (MTC) in 3GPP. It is dilfcrt:nt from tht: currt:nt communication
models in the ways that it involves:
- new or different market scenarios
- lo" er costs and effort
- a potentially very large number of communicating terminals The RFID reader
- linle tratnc per terminal. 111 general coil, hich genera
M2M communication could be carried over mobile net\\orks (e.g. GSM-GPRS. 1s usuall) n pass,
CDMA EVDO networks). In the M2M communication, the role of mobile network microchip. so "h
is largely confi ned to serve as a transport network. induction. a volta
Some of the key features ofM2M communication system are given below: for the microchip.
a. Low Mobility : M2M Devices do not move, move infrequently, or move only As ,,di as active
within a certain region different frcquen
b. Time Controlled : Send or receive d111a only al certain pn:-dclincd periods Some frcl111encic.
c. Time Tolerant : Data tran~fer can Ix dcla)'ed other'· can read
d. Packet S" itched : Network operator to provide packet switched service with or ra11ng of the mo
without an MSlSDN in the thous..,nds
e. Online small Data Transmissions: MTC Devices frequently send or receive small unattainable for
amounts of darn. to ch,1ngc. and m
f. Monitoring: Not.intend to prevent theft or vandalism but provide functionality to
detect the events
g. Lo,, Po,,cr Consumption : To improve the ability of the system to efficiently
service M2M applications
h. Location Specific Trigge1 : Intending to trigger M2M device in a particular area
e.g. wake up the device

----...
2. RFID : is the 2nd pillar of IOT.

I!]
30
,,,
• t-,todeL,Q~t.Ct'\IPtJ1)e,.,- • 2

RFIO stands for Jbdio Frequency JDcntifica tion and ifs u non-contact ~cch~o,log~
that"s broadly used in man> industries for task~ <,u~h ~., pc_rsonncl tra~ht,ng, access
control. supply chain management, books tracking 111 hbrancs. tollgate S) stems and

~F~;
data
uses radio waves produced b) a reader to detect the ~resenc_c of(then read the
stored on) an RFl D tag rags arc embedded in small 111:ms hke cards. buttons.
or tin) capsules.
,,
/ ,,
I / ,,
I I I /
I I I I
I I I I
in that I I \ \
ion or \ '
e Type
' ' ...... __
'
' ..... __
- - \ol IU I

,cation 1.,~ 111 .... _ _


11.,n~, .. ,n,),
~J
0 - ' ""'l~,1,,,
The Rfl D reader consist of a rndio frequcnc> module, a control unit <111d an antenna
coil which generates high frequency ck·ctromagnetic field. On the other hand, the tag.
-GPRS. is usually a passive component. which co,1sist of just an antenna and an electronic
!ctwork
microchip. so when it gets nenr the electromagnetic field of the transceiver. due to
induction. a voltage is generated in its antenna coil and this voltage serves as power
for the microchip. RFID Frequencies
vc only
As ,,ell ns nctivc nnd passive systcll's. RI-ID systems can nlso be broken 0111 into
different frequencies.
Some frequencies and systems arc designed to only read one tag at a time. \\ hile
with or others can read multiple. Cost of renders can also VIII) "ildl) ha-;cd the freq11cnc>
rm111g of the modules. In prior years a reader capnblc of reading multiple tags was
vc small in the thousands of dollars. sometimes tens of thousands. These S) stems were
unattainable for most hobbyists and prototypcrs. I lowcvcr. this is finally beginning
1nality to to change. nnd multi-read capable readers are becoming much more affordable.

fficiently

ular area

31
VIII Se,m; ( CSf/ISf) I V\te+-"VU't" o f ~ T~
3. Wireless Sensor Network (Wsn):
WSN is the third pillar of IOT.
Wireless Sensor Network

It's a collection of devices " sensor nodes" . They arc small. inl!xpensivc, with
constrained power. The} an: organi1t:d 111 a cooperati, c network. rhc) communicate
wireless() in multi hop routing. I !cavil) deployment. Changing. network topology
WSN Definition
A sensor nctwo1" is composed of (I large number of scn'>or nodes that nrc dcnscl)'
deployed inside or vel) close to the phenomenon~
random dcployment8
sell:organi,ing capabilities0
(v)Com
Each node of the sensor networb consist of three subs) stcan:O
ll pro"d
Sensor subsystem: senses the en, ironmcn18
Processing subsystem: perfonns local computations on the sensed data0 b. Esplain thl' ap
Communication :;ubS) stem. rcspon'>iblc for mes~age c,chang.c ,, ith neighboring. Aa. Smart Homl':
sensor nodes0
l'hc features of sensor nodes8
Limited sensing region, processmg power, energ} 8 The number o
4. Scada (Supervisory Control ;md O:,t11 Acquisition): 60,000 people.
anal) tics inclu
HMI in\'olved in sm
amount of fun
comrnun~cauon
Suporvlsory
........ t lnte,tacc rapid rate. Th
CADA (
System
AlertMe or N
Baier, or Belk


~
¥ PLC Wearables
RTU
$CADA
Just like sma
P,ourammlng year, consum
Apart from t
It is impossible to keep control and supcn ision on all induc,trial acti1 ities manual\)'. such as the S
Some automated tool 1s required" hich can control, supcn ise. collect data. anal) sc:.
data and generate reports./\ unique solution 1s introduced to 111cc1 all this demand i~
SC ADA S}S!Clll

32
C'i3CS . ModcltQ1-<.e1ttowP~eY · 2

Aux111iary memoL-rv----1 □ Display and


Control Console
Prog.l/01-----t C.P.U. □

Communication Interface

(i) Human Machine Interface (HMI) .


It is an interface which presents process data to a /111ma11 operator, and thr,, tgh this,
the human operator monitors and controls the process.
(ii) Supervisory (computer) system
It gathers data on the process and :;ending commands (or c:ontrol) to tlw process.
(iii) Remote Terminal Units (RTUs) . . . . .
It connect to sensors in the process, convcrtmg sensor s1gnab tc d1g1111l data and
sending digital data to the supervisory system.
(iv) Programmable Logic Controller (PLCs) . . .
It is used as field devices because they are more cconom,cal. versatile, fh:x1ble, and
configurable than special-pl)rposc RTlJs.
(v) Communication infrastructure . : .
It provides connectivity to the supervisory system to the Remote 1 crm111al Units.
b. Explain the applications of IOT. (06 Murks) ·
Ans. Smart Home :
Whenever we think of loT systems. the most important and efficient application that
stands out is the smart home, ranking the highest loT appliemion on all channels.
The number of people searching for smart homes increas~s every 111011th by about
60,000 people. Another interesting thing is that the database of smart homes for loT
analytics includes 256 companies and startups. More companies are no" actively
involved in smart homes, as well as similar applications in the field. The estimated
amount of funding for smart home startups exceeds $2.5 billion a,1c1 i~rowing at a
rapid rate. The list of startups includes prominent startup company names, such as
AlertMe or Nest, as well as .a number of multinational corpoiation~, lil-.c Philips.
Haicr, or Belkin.
Wearables
Just like smart 11omes, wearables remain a hot topic among potential lo T. Every
year, consumers all across the globe await the release of the latest Apple smnrtwatch.
Apart from this, there are plenty of other wearable devices that make our lit~ easy,
such as the Sony Smart B Trainer, LookSee bracelet, or the Myo gesture control.

33
VIII Se-vw (CS'f(I SE)
Smart City
Smart cities, like its name sugge!>tl>, i-, 'I big. inn0Hll1on and spans a ,, idc vantl} of
usc cases, from ,,ater distribution and 1raflic manag.l'mcnt to ,,aste management and
Cll\ iron mental monitoring. The n:ason "hy it 1s sv popular is that 11 trits to rcmo\c
the discomfort and problems of peopl. "I• J II\ e m cities loT sulul 1011,; offered in th,·
smart city s1.:1:tor solve various eity-relah:11 problem~. comprb,ng o ftrntnc, reducing
air and noisi;: pollution, and helping IL make 1.1t1es s,1for.
Smart Grids
Smart grids are another area ofloT t1;.!1111ll1J,:() 1hat stands out. A sm,trt grid bai.icall}'
promic;es to 1::\tract information on the hdia, h.>rs of consumen, and electricit)'
suppliers in an 'automated fashion to ,mproH the elllciem:y, economics. and
rcl111bil1ty of dectmil) ,ltstribut,on. 41,000 111011thl) Google searches ,s a testament
to this 1.onccpt's populant) Smart farm mg ,
lndu~tri:1I lntcl'net number of famu
Ont ,,.:) h.. think ol' 1he l11Justrial Internet i,; by looking 111 connected machines 1hat fam,ers \\O
and u vu·•-~ 111 111dustr:c.s ~•1 h ac; pC'\\CI !,(eneration, oil. g11s, and healthcare. II al~o
1
re\ olutiunize th
nrnl-.t:, ·•~1. 0f sttu tt l(lnS wi1t:n.: 1111pl111rncd downtime and system failures can re:,ult in scale attention.
lili:-: :11~11 ng ,,ituat10:1~. ,\ s,.stc,n \:mbcdded \\ ith the fol lend<; 10 include de, ICC!> not lx umh:n:~1
such"" fitm.:~,; t•:.mh for lie·,! 111oni111ri11g. or i.murl home applianc..:s ·1hcsc '-)'-t-.ms upphc,nion heli
are ii' 1c111 n,11 an.I cnn provide ca~c of 11~1.! hut arc not rl!!iable because the) do not
typicall) cr1;ate emergency situation!> if a cto1Antimc wns to occur. ·
Connl-ctctl Car 1. a. fapl:iin the 10
Connl!cted .:ar technology 1s a vast 1111CI an e:-.tensivc network of multiple sensors. Ans. An O,cn ic" a
antc,111,1s, cmb.:ctded sof\wrirc, and technologies that assist in communication to the4 Stage 10·1
navigalc 111 our 1;0111plex world It has 1hc responsibility of mal-.ing dcci~ton~ "ith Data Acquisiti
co11s1stc11cy, act.urae), and Si eccl I• also has tu be reliable. 1 he~I! rcq111remc11ts \\ ill ufthesestap.c~ c
become e, en more crit1c,1! ,,hen humans give up conlrol of the sttc11ng wheel nnd The 4 Stage
brak,·s II> 1he autonomou~, chicles that are being tested on our highwqys right 110\\,
Connected Health (Digital I lealthffcleh1:alth/Telcmedicine)
loT has various applications in healthcare, which are from ,emote monitoring.
equipment to advance and smart sensors to equipment integration. II has the Th Ill Q
potential to improve how phvsicians deliver care and also keep patients safe t1nd
healthy. I lealthcare lo r can nllow patients to spend more 111111:l mteracting ,, ith their
doctors. which can boo~t pati nt engagement nnd satisfaction I ro111 pl!rsonal fitness
sen~ms to surgical robots. loT in hcalthcnrc brings new tools updated" i\h the latest
technulog~ in the ecos) ~tern that hdps Ill developing bener healthcare. loT helps to
revolutio11i1c healthcare and prO\ idc pocket-friendly solutions for both tht: patient
and healthcare professional.
Snu1rt Retail
Retailer:. have started ndopting luT ~olutions nnd using lo1 embedded systems
across II number of applications that improve store operations, increasing purchases.
reduc111g. 1heft, .:nab ling i1l\ cnto1: mnnagcment, and enhancing the com,umer's
shopping experience. Tl11ough luT physical retaileri. can compcte against online

34
.,

chnllengcrs more strongly. They can regain their lost markel share and altr_act
consumers into the store. thus making it easier for the111 to buy more "' 11 ll: saving
money.
Smurt Su11ply Chain . .
'S\.Jf"Jf'tb• 4-'.'.h ,"lli.n'.'. hnvc already been 1:;c\lln~ SITHlt"lcr for a couple of yen1 •. {)ffenng
M>lulions to probhml5 like trncking or !l,OOd~ )Yhile lht:y are on the road or Ill :rnnsit or
hc!p,ng suppliers exchange inventOI)' information are some of the popular dfcrings.
With an loT cnablt:d system. factory equipment that contains emhecil!,..' <.cnsors
fly communicate data about difforcnt parameters. such as pressure. tempe,-- urc. and
lty utilization of' the mad1ine. The loT system can also process workflow , • ' change
nd equipment settings to optimize performance.
ml Smnrt Funning
Smart fanning is an otlcn overlooked in loT applications. However, be, ause the
number of farming operations is usually remote and the large numb;r J livestock
es that farmers work on. all of this can be moniton.:d by 1he Internet ofTh,ngs and can
·,o revolutionize the way farmers oper:ite day to day. But. this idea i~yet to 1\::tch a large-
t in scale attention. Nevertheless, it still remains on<: of the loT application~ tli.,t should
c~s 1101 be underestimated. Smart· fanning has the µ01cn1ial 10 become :11• imponant
nu, application fleld, spccificall} in the agricullurnl-produc1 exporting countril.s.
ot
OR
2. a. Explain the IOT network nrchitecturc nml design ( lO Murks)
ors. Ans. An Overview of the Main Singes in the loT Architecture D,agrnm. In simple terms.
I to the4 Stage loT architecture consi5ls of Sensors and actuators. Internet gi:taways and
ith Data Acquisition Systems, l:dge IT. Dnt,1 center a111.I <.:loud. The dcl:1iled prc,-,entation
"ill ofiheses1aµ,es can be found on the diaµ,rnm below.
and The 4 Stage loT Solutions Architecture
W.
.., . -,
~l.co,1 .,1....

ring s,... , t, 7
lrH1rnu,t G.,tew 1,.,-. ,
the oa,., ~ c1u1-,1tnJ11 f II T
Tnc ""Thinys
-nncl ~·.,,."'II
their Pt,,,,, I , . •r.• rc-h ,e

,~,, O.J ( 141,'il

-,l -.~'
ness ,$,0.. •c-t;

atest 4 _,
,, . . JiJ,I
?S to l
tient

tems
C'J'l ,,1.ma J
1
C e"tfJJ :O}''i:
•r.,,l'f'O".-~~•
Ow .;f"<J~ n•.;
*
--1~
§:[.,;1

Gl;L.•., -.. .,:_ ,. ~ -


ses.
er·s
SIYS/0~$
Dtl!•FIO,-.,
u
nline c,w,~, Fro,,

35
VIII Se-wt1 ( CSf./ISf.)
To get the proper understanding ofthl main actions and the importance of each stage
in this process, refer to the detailed reviewl> prt>sentcd below.
Stage 1: Networked things (wireless ~eusors and actuator~)
The outstanding feature about :,cm,ors is their abilit) to convert the information
obtained in the outer world into data tor analysis. In other words. it's important to
start with the inclusion of sensors in th..: 4 stages of an loT architecture framework
to get information in an appearance lh:1t can I v actually processed. r-or actuators. the
process goes even further- these de\ ices 1re able: to inu.:n em: the physical reality.
For example. they can switch off thl. light and :idjust the tempcrnture in a room.
Becau.,.: 0f this, sen,,ing and actual iug stag,e 1,;o, en, and adjusts everything needed in
the physical wnrld to ~•ain the ncce~sary insights for furthe1 analysis.
Stngc:? ~Clll>Or tlata aggreg:1'ion S)stt•ms 'lnd analog-to-digital data conversion
Even ,houth this stage of lol ,irchitectun: still means working in a close proximity rce -
with , ,i•·or~ and a'.'lu::itor.- in'i ~net getaways and data acquisition sysh:ms (DAS) cater to >our c
appca,· 'll'' 110. SF'~cifh:.ill) 11,e later conn~ct to the sensor network and aggregate to-end xperie
out1'11:1 , , ii.: I 1tcrnc1 gct,1•· , , , work through Wi-Fi, wired LANs and perform firrn,\a .
furths:, ·,, ",sing. I l•\! • ,\.,I impor1 •II~~ of this slagc is to process the enormous Layer I: ·n:.ors
amou •, ' ,nfonnatior l.ulh:c!<:d on the previou~ stage and squeeze it to the optimal Sensors h, ve bee
size tor furt,1er ·10al) s 1~. i.k,1de~, the necessary conversion in terms of .timing and healthcare aviati
strnl.l'lr: happen~ her1,; In sl:ort, Stage 2 1nak;;s data both digitalized and aggregated. and inexp nsive t
Stage 3. T hl· appeanmce of edge IT systems profession lly. Th
During thb monu.:nt among the stages of loT architecture. the prepared data is connected cnsors
transfeircd to the IT world. In particular, edge IT systems perform enhanced anal) tics Layer 2: icroc
and pn:-processing here. for c.\illl1pk. it refers to machine learning and visualization The secon layer •
technologi~s. At the same time, som\~ -idditional processing may happen here, prior data proc ssing a
to the stage of entering the data center. Likewise. Stage 3 is closely linked to the connectivity to s
previou , µhases in the building of anarchitecture ofloT. Because of this. the location generate o er I 0,
of edge IT ~ystcms is close to the one where sensors and actuators are situated. before se ing it
creating a wiring closet. At the same time. the residing in remote offices is also your colle ted da
possible of unnece sary d,
Stage 4. Analysis, management, and storage of data on data tr nsfer a
The main processes on the last stage of loT architecture happen in data center or Your mic ocontr
cloud. Precisely, it enables in-depth processing, along with a fol low-up revision for loT device store
feedback . Here, the skills of both IT :ind OT (operntion:11 technology) professionals datab:ise Your
arc needed. In other words. the phase already inclu$les the analytical skills of the hold data some
highest rank. both in digital and human worlds. Therefore. the data from other
sources may be included here to ensure an in-depth analysis. Aller meeting all the In some · ses. y
quality standards and requirements, the information is brought back to the physical take acti n and t
world-btll in a processed and precisely analyzed appearance. via a clo d appl'
Stage 5. loT Architecture when a nsor d
In fact. there is an option to extend the process of building a sustainable loT aastomc .
architecture by introducing an extra stage in it. It refers to initiating a user's control
over the structure if onl) your rc~ult doesn't include full automation. of course.

36
The: main tasks here arc visualization and management. After including Stages_. th.:
stage
system turns into a circle where a user sends commands to se1~sors/actuators (Stage
I) 10 pt:rform some actions. And the process starts all over aga111.
ation Ill. Explain the layered IOT Stack. . . . . (06 _Mar~)
ant to The loT stack is rapidly developing and maturmg mto the Thmg Stack. Tiu~ Thmg
work Stack consists of three technology layers: sensors. microcontrollcrs and mternet
s. the connectivity, and service platforms. . .
•ality. • Layer One - Sensors nre emheddt:d in objects or the physical environment lo
·oom. capture information and events for your company.
ed in • La)·er 1\vo - Microcontrollers and internet connectivity share information
captured by sensors within your loT objects and act based on this information to
rsion change the environment.
ximity • Layer Three - Through the aggregation and analysis of data, service platforms
(DAS) cater to your customers. Service platforms also control your loT product's end-
,regal~ to-cnd experience and enable your customers to define system rules and updntc
i:rform firmware.
rmous Layer l: Sensors
ptimal Sensors have been used for years in a number of different industry contexts. like
ng and healthcare. aviation. manufacturing and automotive. Now, sensors are so tiny
gated. and inexpensive they can be embedded in all the devices you use personally and
professionally. The sensor layer of the loT tech stack continues to expand as internet
data is connected sensors are added to new products and services.
ial) tics Layer 2: Microcontrollcrs and internet connectivity
lization The second layer in the Internet of Things technology stack allows for local storage.
e, prior data processing and internet connectivity. The Internet of Things needs internet
cl to the connectivity to send collected data to your cloud database. Because some sensors
ocation generate over I0.000 data points per second. it makes sense to pre-process data locally
ituated. before sentling it to your cloud database. By analy1.ing. extracting and summarizing
is also. your collected data before you send it to your cloud database. you reduce tht: volume
of unnecessary data you send to and store on your cloud database. saving you money
on data transfer and storage costs. ·
enter or Your microcontroller is a small computer embedded within a chip and it helps your
sion for loT device store and pre-process collected data before it's synced to your cloud
sionals database. Your microcontrollcr possesses a processor. a small amount of RAM to
s of the hold d11t11, some kilobytes of EPROM or flash memory to hold embedded software.
n other and solid-date memory to cache data.
, all the
In some cases. your loT device may need to use programmable microeontrollers to
hysical
take action and turn something on or off. In most cases, tht:se decisions are made
via a cloud application, but it makes sense to use progrnmmable microcontrollers
when a sensor detects something that could affect the health and safety of your end
ble loT customers.
control
course.

37
VIII Se-rn,, (CSEIISE) I vit"evn.e,t- o f ~T ~
The main and most important capability of this layer is neL\1orking. which is either
wireless or wired. _If a device is stationary and can access an external power source. a
wired network is sufficient. but a v.,ircd network docsn ·1 make sense for many loT use
cases because physical cables are needed to connec' to the ndwork. WiFi, wireless
modems, and wireless mesh ne_tworks an, the most common ways loT devices arc
connected to the internet.
If you plan to manufacture an loT device you must keep in mind dependencies for
your use case;.. Is your device mobile or fixed? Does your device need a battery or
is it connected to a fixed power supply? How much data do you need to transfer
to your cloud database per hour? Should your device's connectivity be episodic or
cont ii1uous?
Devices you use to track your health and fitness while bicycling. running, and
exercising store data while you· re active, and these devices use episodic connectivity.
Your device then syncs with the cloud v. hen it's dose lo your smartphonc or tablet.
Compare this to the continuous connectivity needed by i\mazon Echo's, oice based
digital assistant who is always listening for your commands, fetching answers from
the internet the instant you ask a question. Deptmding. on your loT product"s ust:
cases, you may need continuous connectivity.
When you research Think Stack vendors, you'll notice a wide range of different
networking protocols, hardware. software, and architectures arc used 10 build loT
products. Due to the variation in use cases and environments, you have many choices
when it comes to adding nerworking and computing capabilitie~ your loT device.
While some vendors focus more on hardware components, other vendors provide a
system of integrated software and hardware. Sometimt!s, loT sol'rware solutions spill
into the third layer of the Thing Stack. wliich is referred to as the service platform.
Layer 3: Service platforms
The first two layers for the Thing Stack embed sensor and microcomputers in your
loT device. but your loT product profits from the service platform layer. This layer
delivers value to your customers by automating processes and delivering rich data
analytics. Your cloud application combines data collected from numerous loT
sensors with your (or your customers) other business data lo produce insights that
generate business value.
It's important for your service platform 10 create a feedback loop between your
loT devices and you device management software, so you and your customers can
upgrade, monitor, and maintain the firmware on each_ t_hc device_. In most cases.
service platforms operate on cloud infrastructure and ut,hle a muln-t_enant software
architecture to deliver a seamless software-as-a-service (SaaS) expern:ncc. .
The convergence between our digital and physical worlds stresses your _IT ~pe_rallons
!t
by increasing demand for data management. storage, tagging and analys,_s. s Ill your
company's best interests to build your loTscrvice platfonn on robust clouct mtrastructure.
so you can scale infinitely as your business grows with your new loT product.

38
is either V. hile "<,OID\;ire i<; eating the world" as Marc Andreessen said, consumers and
·ource. a companies still purchase a lot of physical things. If )OU build a robust loT service
loTusc platform. ) ou obtain insights About how your customers use your loT products
\\ irelcss and scr\ ices. With a great loT sen ice platform you can manage post-trnnsaction
ticcs an: n:lu1ionships in new and engaging ways.
Because sef'I ice platforms store and make decisions based on data collected from
'lcies for all t) pes of loT de, ices. the) arc often considcn:d the backbone of post-tram,actions
tlery or relationships. Whnt if }OU reached out to cw,tomers ,, hone, er powered on your loT
trans for de, ice and proactivcly on-boarded them? What if you used anonymizcd data from
sodic or customers getting the most ,·aluc out of your product to help other customers unlock
more business value?
ing, and Now is the time 10 build ) our lo r product because sensors keep gelling smaller and
ectivity. network connectivity solutions keep gelling better. Create a plan to engage with your
r tablet. customers and facilitate gonl-orienh.:d post-transaction relationships, leading to nc,"
:c based opporluni1ii.::s 10 turn u single transnction in10 a ~trong n:lationship.
,rs from Module - 2
1ct·s use
J. a. \ Vh:it :ire sensors, nctuntors and s m:1rt objects in IOT'! (04 Marks)
Ans.
itfen:nt
ild loT OASIS FOR
choices SENSORS ACTUAT O RS
COMPA RISON
device.
rovidc a Used lo measure thi: co111i11uous and lmpd co111i11uous and discrete
Basic
<liscrct,: process variables. process,:~ parnrncters.
ms spill
tfon11. Plac.:d at Input pon Output port
Outcome Electrical signal Meat or motion
in your
is layer Magnetometer, Cameras. I.ED. Laser, Loudspeaker.
ch data Accelerometer, microphones. Solenoid, motor controllers.
1us loT
b. What 11rc smart physic.ii objccts'in IOT '! (06 Marks)
its that
A ns. The concept smart for a smart physical object simply means that it is active, digital.
;n your net\\orked, can operate to some ell.tent autonomously, is reconfigurable and h11• local
1crs can control of the resources it needs such as cncrg}, data storage, elc. Note, " smart
t cases. ob~cct_ do~s no_t necessarily nee~ lo be intdligenl as in exhibiting a strong essence of
oftware art1fic1al mtell1gence although 1t cnn be designed to also be intelligent.
Physical world smart objects can be described in h.:rms of three properties:!
• Awareue.u: is a smart object's ability to understand (thut i~. sense, interpret, and
erations
in your react to) e, ents,and human activitie~ occurring in the physical world.
ructure. • ~epre~·e11tatio11: refers to a smart object's applicalion and programmmg model-
'" particular, programming abstractions.
• Interaction: denotes the object's ability to converse with the user in terms of
input, output, control, and feedback.

39
VIII Se-m, ( CSE/IS£)

Based ~1~on these properties. these have been classified into three types:
• Ac:1tv1ty-Aware Sm art Object.\·: ~re objects that can record information about
work activities and its own use.
• ~olic:y-Aw11re Sm art Objects: Are objects that are activity-aware Objects can
interpn:t events and acti, ities with respect to predefined organi7.ationnl policie~.
• Process-Aware Smart Objects: Processes play a l'undan1ental role.: in industri,11
work management and operation. A process is a collection of n:latcd activities or
tasks that are ordered according to their position in time and space.
c. What arc the primary components of smart connected products? (06 Marks)
Ans. Smart. connected products have three primary components
• Physical - made up of the product's mechanical and electrical parts.
• Smart - made up of sensors, microprocessors. data storage. controls. software.
and an embedded operating system with enhanced user intcrfru.:e.
• Connectivity - made up ofpo11s, antennae, and protocols enabling wired/wireless
connections that serve two purposes, it allows data to be exchanged with the
product and enables some functions of the product to exist outside the physical
device. ·
Each component expands the capabilities of one another resulting in "a virtuous
cycle of value improvement"'. First. the smart components or a product amplify the
value and capabilities of the physical components. Then. connectivity amplifies the
value and capabilities of the smart components. These improvements include:
The embed cd sys
• Monitoring of the product's conditions, its external environment. and its of these ty cs cont
operations and usage. The csscnti 11 com
• Control of various product functions to belier respond to changes in its like Motor la 68H
environment, as well as to personalize the user experience. factor that ifferen
• Optimization of the product's overall operations based on actual performance their intern I read ·
data. and reduction of downtimes through predictive maintenance and remote and system archite
service.
• Autonomous product operation, including learning fro~ their environment.
adapting to users' preferences and self-diagnosing and service. ~ Sen

OR ';.\VJll.
~
(04 Marks) :;~-r,U

4. a. What :.re actuators in JOT '! Explain . .


Ans. An actuator is a mechanism for turning energy 11110 mo11011. . .
Actuators can be catego11.·2 ed by the energy source they require to generate motion.
$~~(~,
"-•~f~
f:m•t

For example: . . · n
• Pneumatic actuators use compressed all' to ge'.1cratc motto .
• Hydrolic actuators use liquid to generate motion.
• Electric actuators use an external power source, such as a battery, to genera1e
motion. .
• Thermal actuators use a heat sour<.;e to generate motion.

40
IL What arc embedded systems in IOT? EAplnin ( 12 Marks)
ation about It is essential to know nbout the embedded clt.:vices while learning the loT or building
the projects on loT. The embedded devices arc the objects that build tht: unique
bjects can computing system. l'he!>c !>) stems may or may not connect to the Internet.
policies
1:11 An embedded device system genernlly runs :is n single npplication. llowevcr, these
n in<lusll 1,11 devices can connect through the internet connection, and able communicate through
nctivitics or other network devices.

(06 Ma rks)
ltmboddod
Sy•to,n
' ....
, solh~are. ,'

'd/wircless
d with the
he physical

·n virtuous
amplif) the lotcru @t of Thin&• tJoTJ

plifics the Embedded System ffard wnrc


lude: The embedded system can be of type microcontrollcr or type microprocessor. Both
t, and its of these types contain an integrated circuit (IC).
The essential component of the embedded system is n RISC family microcontroller
gcs in its like Motorola 681IC 11, PIC I6F84. Atmel 805 I and many more. The most important
factor that differentiates lhesc mrcrocontrollcrs with the microprocessor like 8085 is
erformnnce their internal read and \Hitable memory. The essential embedded device components
and remote and system arch itecture are specified below.

1vironment.
~--~
L \len-.or"· mn->ed 10
_ _ _ _;.;"'-""~'c,,:.;~:-"
J
.. _ _ __

' Se~
:
~
........
..... .,. t=.:>-
,:.,~ M iorocoat<oll•r
Controll•I'•

0-l Marks)
~Ju

:;,., .. -:,.
·..·.,·'·";I, •• 8051
tOFS.s
4TMl:0

ate motion. ~
::.,·-,t!!\
X.4\fll~
c=> 108.
03HC:1

r F•t

Dl•play Uatt
o generate - c..• ,nen1 Du la,· LC 0 <.,,11,~J"., C,15p1a,

Fig: Basic Embedded Stst <'m

41
VIII Semt (CSf(ISE) I nt"er-vi.et o f ~T ~
OR
5. a. Lis t the bus iness case for IT and IOT (05 Marks)
Ans. I) Recognise the need for a business case
2) Start on the shop floor
3) Identify meaningful data
4) Employ predictive analytics
5) Track your products and assests
6) Create a new revenue modela
7) Move from drawing board to reality
8) Choose the right 1OT platforms and pa1111crs
9) Build a proof of concept
I 0) Roll Olll at scale
b. Wh:11 is me:rnt by device management in JOT (05 Marks)
Ans. Device management (DM) is an es~ential part of the loTand provides efficient means
to perform many of the management tasks for devices:
applicatio s in he
• Provis ioning: Initialization (or ~ctivation) of devices in regards toconfiguration bits p.ir s ·ond) ,
and features to be enabled.
lo" cost a cl ener
• Device Configuration; Management of device st::ttings and parameters.
recent sma phon
• Software Upgrndcs: Installation of firmware. system software. and applications
on the device. 3. Low-R tc, Lo,
are anothe key le
• Faull Management: Enablt::s error reporting and access to device status
4. IP, 6 N vorki
1
t In tht: simplest dt::ployment. the devices communicate directly with the DM server.
making tic fact
This is. however. not always optimal or even possible due to network or protocol
\ariouscap hilitie
constraints, e.g. due to a firewa ll or mismatching protocols.
7 It is fores cable t
t In these cases. the gateway functions as mediator between the server and the
ii can som how co
devices. and can operate in three different ways:
S. 6LoWP N
• If the devices are visible to the DM server. the gateway can simply forward the
(1Pv6 Over Low ~
messages between the device and the server and is not a visible participant in tht:
by the 6Lo PAN
session.
The 6LoW AN c
• In case the devices arc not visible b11t understand the DM protocol in use, the and should e appl
gate\\ay can act as a proxy, essentially acting as a DM server towards the device
limited pro cssing
and a OM client towards the server.
6. RPL
• For deployments where the devices use a different DM protm:ol from the server. 1Pv6 Routi g Prot
the gakway can represent the devices and translate between the different protocols
Low-Powe and L
(e.g. TR-069. OM/\-DM. or CoAP). The devices can be reP.resented either a~
routers an their i
virtual devices or as part of the gate\\ ay.
constraints n proc
c. What :ire the netw orking technologies adopted with IOT (06 Marks) 7. CoAP
Ans. I . Power Linc Communication Constrainc Appli
(PLC) refers to communicating 0\ er power (or phone. 1.:om,, etc.) lines. compute-c nstrain
This amounts to pulsing. with various degrees of power and frequency, the electrical
line~ used fur power distribution. l'LC comes in numerous navors. At low frequencies

42
(tens to hundreds of 1-krlz) ii is possible to communica1c over kilometers with low
bit rates (hundreds of hits per scco nd >· · and was seen
Typically. this type of communication was used for remote mcte~•ng. . ·
as potentially useful for the smart grid. Enhancement~'? allow higher b_,1 rates have
led to the possibility of delivering broadband connec11v1ty over power Imes.
1. LAN (and WLAN)
7 Continues to be important technology for M2M and loT applications.
7 This is due to the high bandwidth, reliability. and lt:gncy of the technologies.
Where power is not a limiting factor. and high bandwidth is required. devices may
connect seamlessly 10 the Internet via Ethernet (IEEE 802.3) or Wi-Fi (IEEE 802. I I).
The utility of existing (W) LAN infrastructure is evident in a number of early loT
applications targeted at lhc consumer market, particularly where integration and
control with smart phones is required.
2. Bluetooth Low Energy
rks) (BLE; .. 131uctooth Smart'') is a recent integration of Nokia's Wilm:c
eans standard with the main 131uc1001h standard. fl is designed for short-range(,50 m)
applications in healthcare, filness, security, etc., where high daln mies (millions of
tion bits per second) arc required to enable application functionality. It is deliberately
low cost and energy efficient by design, and has been integrated into the majority of
recent smart phones. ·
tions 3. Low-Rate, Low-Power Networks
arc another key technology that form the basis of the loT.
4. rPv6 Networking
crver. making the fact that devices arc networked, wilh or without wires, with
tocol variouscapabilities in terms ofrl\ngc and bandwidth, essentially seamless.
T It is foreseeable that the only hard requirement for an embedclt:d device will be that
d the
it can somehow connect with a compatible gateway device.
S. 6LoWPAN
rd the (1Pv6 Over Low Power Wireless Personal Area Networks) was developed initially
in the by the 6LoWPAN Working Group (WG) of the IETF
The 6LoWPAN concept originated from the idea 1h111 "the Internet Protocol could
sc, the and should b.: applied even 10 the smallest devices•·. and that low-power devices with
cvice limited processing capabilities should be nble to participate in the Internet ofThings
6. RPL
erver. IPv6 Routing Protocol for Low-Power and Lossy Networks. Abstract
locols
Low-Power and Lossy Networks ( LLNs) are a class of network in which both the
her as routers and their interconnect are constrained. LLN routers typically operate with
conslrainls on processing power. memory, and energy (battery power).
11rks) 7. CoAP
Constrained Application Protocol (CoAP) is a prot,,col that specifies how low-power
compute-constrained devices can operate in the internet of things (loT).
ctrical
cncies

43
VIII Se,m,, (CSE/ISE)

OR a.AMQP
TIie Advanced Me sage
6. a. Explainany threelOT application layer transport mctho<ls with suitable
lhe financial indus ry. Sec
illustra tions. (12 Marks)
Its run over TCP. MQT
Ans. A.MQTI
messaging. AMQ
MQTT (Message Queue l'elemctry Transport) wa!. dc\clop by or introduci:: by IBM
then forward it, a
in 1999 and standardiLcd by OASIS in 2013 to target come up with lightweight
Rliability. Show i figure
M2M communication. It is publish/subscribe protocol architecture similar to client/
Exchange rcspon ibilit>
server protocol show in figure below. The importance ofMQTT protocol is due to its
simplicity and the 110 necll of high CPU and memory usage (reason is the lightweight Queues is based o pre-d
protocol). MQTT supports a\\ ide range or different de\ ices and mobile platforms. subscribers .,..hos bscrib
At transport layer TLS/SSL security provide to MQTf. Pl' /SHERS

PUBLJSHERS Publisb Subscribe Sl"ISSC'RJBERS

.. Examples: Cs.
BROKER a da•I)
PCs. ICtS

a11dan.1 Servers
Queue alld Top1r titd potm
derice.s

C. COAP
Co/\P (com,train d appl
Show above figure there three component arc there publishers, a broker and embedded devic 'S whe
subscribers. Publishers are genernll) lightweight sensors that connect with a broker Currently there s I ITT
and send data to a broker and go back to sleep. Subscribers are loT applications that HTTP has many featurei
interested in data send by sensors and also connect with a broker. so broker send will need more csourc
mechanism. No for lo
interested data to subscribers. The brokers classif) senS0f) data in topics and send
them lo subscribers interested in the topics. This nil thing is on loT point of vie,~. protocols and .,.. can o~
MQTT provide 3 option to achieve message in As CoAP is a Re tful w~
uses client/serv model
Quality of Services (QoS): networks with I w overt
1. On( delivery (al mosc):
better protocol mparc
Deliver message according to best try of the network. • Co/\P runs er U
An acknowledgment b not required. LO\\ est le\ el of QoS. handshake fore da
2 . One delivery (at lcnst): • Co/\P prot
At least one message can be send and some duplicate message are there. An transfer as it uses fo
acknowh.:dgment is required • It suppon ~ r types
3. O n d e livery: _ I) Cnnfinnablc 2) No
Additional protocol required to ensure that one and only one mcs,ag.c <;end. It 1s Response laye used t
highcst le\cl of QoS. n:sponse, 3) n con
ocher

44
-------
B.AMQP
The Advanced Message Queuing Protocol (AMQP) is a protocol that across from
the financial industry. Security is manage with the use of the TLS/SSL protocols.
irks) Its run over TCP. AMQT is follow publish/subscribe communication protocol for
messaging. AMQP is same like MQTT but AMQT have advantage its store data
IBM then forward it, and this feature used at when network disruptions that tiroe ensures
reliability. Show in figure below a broker divide into two part exchange and queue.
Exchange responsibility to receive publishers messages and distribute to queue.
to its Queues is based on pre-define roles and condition.and it's basically send message to
eight
subscribers who subscribe those data.
rms.
PUBLISHERS Publish - - - = = : : : - - - - - .Subscrib• st:BSC/UIJrRS
BROKER
Qme,nd
uampies· Top:t endpoint Se1wrs
PCs,
and any r-----111r~,,av~~ ~ Sm·ers
dei'ices
Qu1t1•and
Topic endJ»!m

C.COAP
CoAP (constrained application protocol) is used for low power and low memory
embedded devices where it can be used for communication instead of HTTP.
er and
Currently there is I-ITTP protocol available with request/response paradigm but
broker
HTTP has many features and more footprint [5). I ITI' P runs over TCP where TCP
,ns that
will need more resources due to three way handshake and many more complex
r send
mechanism. Now for low power embediled devices, there is no need of this heavy
d send
protocols and we can optimize it to run ov~r TCP.
iew.
As CoAP is a Restful web transfer p.otocol for use with constrained networks. CoAP
uses client/server model of approach same as HTrP. It is designed for constrained
networks with low overhead and lower footprint. Some points for CoAP that makes
better protocol compared to HTTP is:
• CoAP runs over UDP (User data-gram protocol) that helps to avoid costly TCP
handshake before data transmission
• CoAP protocol is only 4-bytc header and provides reliable transter and no reliable·
re. An
transfer as it uses four types of messages.
• It support four types of message
I) Confirmable, 2) Non-Confirmable, 3) Acknowledgement and 4) Reset. Request/
d. It is
Response layer used those message and classified in I) Piggy-backed, 2) Separate
response, 3) Non confirmable request and response, and communicate with each
other.

45
VIII Se,m., ( CSE/IS£)
b. Write expla natory notes on netwo rk optimi1.ation in IOT. (04 Marks)
Ans. Generally, networl-. optimimtion is dl!ffoed as the tcchnolo~y used to improve the
performance of the network for any environment. This plays an important role in IT.
as day by day large amount uf data frum vnrious kinds of devices nnd applications
are being populated into the netv.ork. Networl.. optimization otfors various benefits
such as faster data rate, data recovery, eliminating redundant data and to increase the
response time of application and network. Network optimization in loT is gaining
increased attention due to the e:-.pectation of a high increase in traffic from loT things
and objects, as billions of loT devices are expected tu connect global network in
the coming )'ears. Due to this, it is obvious for researchers nnd operators to provide
efficient solution to optimize loT networl-.s to reduce the loT generated traffic
impacting other services in the network and to utilize network resourceefficiently,
The traffic generated by loT devices is different from the cellular network due to
heterogeneity in applications and device types. Additionally, loT traffic needs to be
regulated to monitor the working of loT devices and its services. loT application
generates fewer amount of data, however integration of devices 10 the application
generates tht: higher volume of traffic because of control plane mt:ssagcs. Hence
this non-application traffic puts a significant additional burden on the nt:twork. So to
· overcome from this burden, efficient mechanism is require(_! to addrei.i. and optimi1.e ai.:uon t at can
the control plane messaging from loT devices. make u e ofth
OR b. Explain I e com
7. a. Justify the stateme nt" Me rging Data Analytcs and IOT will pos itive ly impact Au. I ) Erosio, of Ne
business. {08 Marks) lwo of th major
Ans. Network optimization in loT is gaining increased allention elm: to the expectation of design and ongoin
a high increase in traffic from loT things and objects. a~ billions of loT etc\ ices arc th:itnch\ ol,.s \\ er
expected to connect global nct,\orl-. in the coming }Cars. Due to this, 11 •~ ob, ious or no ~·on ccti, it
for researchers and operators lo provide elftcient solution 10 optimi7e loT networks sullic icnt no" le
to reduce the loT generated tralftc impacting other sef\ ices in the network and to threat ton t\\iorl-.
utilize network rcsourcecfficiently. The tranic generated by loT devices is different or the nctl or!.. he
,t is b,:ttcr o t.. no,
from the cellular network due to heterogeneity in applications and device types.
Additionally, loT traffic needs to be regulated to monitor the working of loT devices communi ation p
and its services. loT application generates fewer amount of data, however integration solid dc!>i n to b
ofdevices to the application generates the higher volume of traffic because of control to har<lwa ·c and
plane messages. I lencc this non-application 1ralftc puts a significant additional Thi~ 1-.in<l of or,
burden on the network. So to o,ercome from this burden, efficient mechanism is nnd the i) troduct
required to address and optimi1.e the control plane messaging from loT de, ices. considcrn.iit•n of
• Competitive Edge: loT is a bull.word in the current era of technology and there poorl~ C'O trolk
are numerous loT application developers and pro, idcrs present in the market. or ina<lc1.1 1ate nc
Tiu:: use of data analytics in loT investments,, ill provide a business unit to offer In man)
better services and will, therefore, provide the ability to gain a competitive edge that arc
Pd ork
in the market.

46
r
1) . d anal tics lhat can be used and applied in the loT
le There are differe~l types of ataS y f th St: types havi: been listt:d and described
investments to gam advantages. ome o e
r.
s below. . .... Tl . . r of data analytics is also referred as event stream
• Stream1ng Analytics. us orm R 1 t' c data streams are
ts processing and it analyzes huge in-motion _data.sets. ea. - im ct· t'ons loT
he in this roccss to detect urgent s1tua11ons and imme iate ac I ••
ng :;;~t::ions base~ on financial transactions. air fleet tracking, traffic analysis etc.
gs can benefit from this method. • . z
in • S atial Analytics: This is the data analytics me_thod .that is used to ana 1_Y,-t:
1de g:ographic patterns 10 determine tht: spatial rclat1onsh1p b~tween t!1e ~hy~i~al
C objects. Location-based loT applications, such as smart parking applications can
t\y. benefit from this form of data analytics. . .
to • Time Serk-s Analytics: As the name suggests, this form of d?ta analytics 1s
o be based upon the time-based data which is analyzed to reveal as~oc'.ated trends and
tion patterns. loT applications, such as weather forecasting app~1cat1ons and health
tion monitoring systems can benefit from this form of data analy11cs metho~. . _
ence • Prescriptive Analysis: This form of data analytics is the comb11rnt1on of
So to descriptive and predictive analysis. It is applied to unde~stnncl tl.1e b_est ?eps of
imizc action that can be taken in a particulal'situation. Commercial loT applications can
make use of this form of data analytics to gain better conclusions.
b. Explain the common challenges in OT security (08 Marks)
1pact Ans. l ) Erosion ofNetworkArchitccturc
arks) Two of the major chnllenges in securing industrial environments have been initial
ion of design and ongoing maintenance. The initial design challenges arose from the concept
cs arc that networks were sale due to physical separation from the enterprise with minimal
vious or no connectivity to the outside world, and the assumption that attackers lacked
vorks sumcient knowledge to carry out security attacks. The challenge. and the biggest
and to threat to network security, is standards and best practices either being misunderstood
fferent or the network being poorly maintained. In fact, from a security design perspective.
types. it is better to know that communication paths are insecure than to not know the actual
eviccs communication paths. It is more common that. over time, what may have been a
ration
solid design to begin with is eroded through ad hoc updates and individual changes
ontrol to hardware and machinery without consideration for the broader network impact.
tional
This kind of organic growth hns led to miscalculations of expanding networks
ism is
and lhe introduction of wireless communication in a standalone fashion, without
s. consideration of the impact to the original security design. These uncontrolled or
, there
poorly controlled OT network evolutions have. in many cases. over time led to weak
arket. or inadi:quate network an·d systems security.
offer
In many industries, the control systems consist of packages. skids. or components
edge
that arc self-contained and may be integrated as semi-autonomous portions of the
network. These packages may not be as fully or tightly integrated into the overall

47
VIII Se,m,, (CSE/IS£)
control system. networl.. management tools. or securit) applications, resulting in
potential risk.
2) Pervasive Legacy Systems
Due to the static nature and long lifccyclcs of equipment in industrial environments.
many operational systems may be deemed legacy systems. For example, in a power
utility environment, it is not uncommon to have racks of old mechanical equipment
still operating alongside modem intelligent electronic de, ices (lEDs). In many cases.
legacy components arc not restricted to i!>olated net\\Ork segment!> but have now
been consolidated into the IT operational environment. from a security perspective.
this is potentially dangcrom, as many devices may have historical vulnerabilities or
weaknesses that have not been patched and updated, or it may be that patches are not
even available due to the age of the equipment.
Beyond the endpoints, the communication infrastructure and shared centralized
compute resources are often not built to comply with modern standards. In fact.
their communication methods and protocols may be generations old and must be
interoperable with the oldest operating entity in the communications path. This
includes switches. routers. firewalls, wireless access points. servers. remote access
systems, patch management, and network management tools. All of these may have
exploitable vulnerabilities and must be protected.
3) Insecure O perational Protocols dch\C'') o mcssa
Many industrial control protocols. particularly those that arc serial based, "ere "cal..ne,,,·rom a
designed without inherent strong security requirements. Furthermore, their operation unsoli~itc rc~po
was often within an assumed secure network. In addition to any inherent weaknesses s.ccunt) c emcnt
or vulnerabilities. their operational environmt:nt may not have been designed wuh
secured access control in mind.
The structure and operation of most of these protocols is often publicly available.
scc:urit,t"
lhc abili~ to tm~
s pr
has been ddrc~,
6) ICCP l ntcr-
While they may have been originated by a private firm, for the sake ofinteropcrability.
they are typically published for others to implement. Thus, it ~ecomes a rela~i~ ely lCCP is a ~omrn
used to mmu
simple matter to compromise the protocols themselves and ~ntroduce ma_l1c1ous
bet\\ ccn iffere
actors that may use them to compromise control systems for either reconna1ssa_nce
expose a utilil)
or attack purposes that could lead to undesirab_le imp~cts in normal syste~ opcra11~n.
)OPC OU:
The followingseetions discuss some common industrial protocols and the_ir respecuve
OPC is a...ed
security concerns. Note that many have serial, IP, or Ethernet~based ,crs~ons: and the
Embed mg (0
· chaIIen ges ,and vulnerabilities are different for the d1t1erent variants.
security doma111 oml P"
4) Modbus T• d facturing acro~s I indu!>
Modbus is commonly found in many industries. such as ~111 111es an manu . • In indu trtal c
environments. and has multiple, ariants (for examph:, serial, TC~/1 P). It was ere:::: of the ontrol
b) the fir.,t programmable logic controller (PLC) vendor, Mod1con, a~d _has . around OPC
in use since the l 970s. It is one of the most widely used protocols 1_n t~duStnal
deployments, and its development i~ ~~verned by the Modbus Organ1zat10~: 1:he
security challenges that have existed with Modbus arc not unm,ual. Au1henuca11on

48
of communicating endpoints was not a default operation because it would alto,~ an
inappropriate source to sead improper commands 10 the recipient. For example, tor a
message to reach its destination. nothing more than the proper Modbus address and
function call (code) is necessary.
Some older and serial-based versions of Modbus eommunic;ite via broadcast. The
ability to curb the broadcast function does not exist in some versions. Th~re _is
potential for a recipient to ;ict on a command that was not spcc!fi_cally 1a~gct1ng 11.
furthermort:, an attack could poten1ially impact unintended rec1p1cnt devices, thus
reducing the need 10 understand the details of the network topology. . .. .
Validation of the Modbus message content is also 1101 performed by the 1111t1at111g
application. Instead. Modbus depends on tht: nt:twork stack to perform this funct ion.
This could open up the potential for protocol abuse in the sys1en1.
S) DNP3 (Distributed Network Protocol)
DNP3 is found in multiple deployment scenarios and industries. It is common in
utilities and is also found in discrete and continuous process systems. Like many
other ICS/SCADA protocols. it was intended for serial communication between
ess controllers and simple IEDs.
ave There is an explicit "secure'' version of DNP3. but there also remain many insecure
implemcncations or DNP3 as well. DNP3 has placed great emphasis on the reliable
delivery of messages. That emphasis, while normally highly des\rablt:, has a specific
1vere weakness from a security pcrspeccivc. In the case of DNP3, participants allow for
1tion unsolicited responses, which could trigger an undesired response. The missing
:sses security clement here is the avility to establish trust in the system's state and thus
with the ability to trust the veracity of the information being presented. This is akin to the
security flaws presented by Grrtuitous ARP messages in Ethernet 111;:tworks. "hich
able. has been addressed by Dynamic ARP Inspection (DAI) in modern Ethernet switches.
ility. 6) lCCP (Inter-Control Center Communications Protocol) · ·
ively ICCP is a common control protocol in utilities acro·s s North America that is frequently
cious used to communicate between utiliti'es. Given that it must traverse the boundaries
;ance between different networks. it holds an extra level of exposure and risk that could
1tion. expose a utility to cybcr attack.
1
ctive 7) OPC (OLE for Process Control)
1d the OPC is ~ased 011 the ~i~rosoft interoperability methodology Object Linking and
Embe.ddmg (OLE). Thrs rs an example where an IT standard used within the IT
doma111 and personal computers has been leveraged for use as a control protocol
uring across an industrial network.
eated
In industrial control ne~vorks. OPC is limited to operation at the higher levels
been of the control sp~ce, .with a dependence on Windows-based platforms. Concerns
1strial ;ir?und OPC ~cg11.1 with the operating system on which it operates. Many or lhc
. The Wmdows devices 111 the operational spact: arc ola. 1101 fully patched, and al risk due
1 ation to a plethora of well-~nown vulnerabilities. The dependence on OPC may reinforce
Iha! dependence. Whtie newer versions ofOPC have enhanced security capabilities.

49
VIII Se,rn,, (CSE(ISE) I Y\t"ur\-et o f ~ T ~

:::e:a:~::es~~sL~r:e~;l~I~~;;~\\ communications modes,\\ hich have both positive and

Of p~rtic~ilar conccr~, with OPC is the dependence on the Remote Procedure Call
(RPC) piotocol. which creates two cl~s.s~s of exposure. The first requires you to
clear_ly understa1~cl th~ _man) vulnerabil111es associated with RPC. and the second
requires you to 1dent1fy the level or risk thc5e vulnerabilities bring to a specific
network.
OR
now ~T m11l OT Security Practices and Systems Vnry? (12 Marks)
8. a The differences between an enterprise IT environment and an industrial-focused OT r,gu 8 Th
Ans.
deployment are important tu understand because they have a direct impact on the
security practice applied to them. Some of these areas arc touched on briefiy earlier el iden
in this chapter. and the> .1re more explicitly discussed in the folloVving sections. operatio al dom
The Purdue Model for Control Hierarchy an mdu rial de
Regardless of where a security threat arises. it must be consistently and unequivocall) • Ente prise z
treated. IT information is typical!) used to make business decisions, such as those in • Le, 5: E
process optimization, whereas OT information is instead characteristically leveraged Res urce
to make physical decisions. such ns closing a valve. increasing pressure. and so on. dO\:l 01\!11t 11
Thus, the operational domain must also address physical safety and environmental the utside
factors as part of its security strategy-and this is not normally associated with the • Le, 1 4: B
IT domain. Organizationally, IT nml OT teams and tu\lh- have been hb1uri<.:nlh this le,el
separate, but this has begun to change. and they have sta11ed to converge. leading opti 11iiatio
to more traditionally IT-centric solutions being introduced to support operational pri1 ing. a1
activities. for example. systems such as firewalls and intru~ion prevention systems lndust ial de
• D:V Z: Th
(\PS) arc being used in loT networks. bet ,ecn th
As the borders between traditionally separate OT and IT domains blur. they must
align strategies and work. more closely together to ensure end-to-end security. The of organi
types of devices that are found in industrial OT environment!> are typically much e, ~thin
O pe tio nal
more higlily optimized for tasks and industrial protocol-specific opemtiun than their
• L cl 3 :
IT counterparts. Furthermore. their operational profile differs as well. m, naging
Industrial environments consist of both operational and enterprise domains. To
an cont,
understand the security and networking require111cnts for a control syst~m. the use of
c;c edulin
a logical framework to describe the basic comllosition and function is needed. The m nagcm
Purdue Model for Control Hierarchy. introduced in Chapter 2, is the most widely used
Sl has D
framework across indu~trial environments globally and is used in manufacturing. eL, cl2:
oil and gas. and many other industries. It segments devices and equipment by
hierarchical function lc\e\S and areas and has been incorporated into the ISA99/l[C
62443 security standard. as sho"n in Figure 8-3. For aclditional_detail on_ how the
Purdue Model for Control Hierarchy is applied to the manufacturing and 011 and gas
indllStries, sec Chapter 9. ··Manufacturing,'' and Chapter I0, "Oil and Gas."

Sahlw

50
,l,•.

Fig11re H: TIie L{)gic11/ Framewurk B11sed 011 the P1m/11e Model /or Cmllro/
Hier11rchy
This model identifies levels of op~rations and defines each level. The enterprise and
operational domains are separated into different zones and kept in strict isolation via
·all) an industrial demilitarized zone (DMZ): •
SC in • Enterprise zone
aged • Level 5: Enterprise network: Corporate-level applications such as Enterprise
o on. Resource Planning (ERP), Customer Relationship Management (CRM).
cntal document management, and services such as Internet access and VPN entry from
lh the the outside world exist at this level.
icall) • Level 4: Business planning and logistics network: The IT services exist at
,a<ling this level and may include scheduling systems, material now applications.
tional optimization and planning systems, and local IT scrvifeS l;uch as phone. email.
stems printing, and security monitoring.
Industrial demilitarized zone
must • DMZ: The DMZ provides a buffer zone where services and data can be shared
. The between the operational and enterprise .1:oncs. II also allows for easy segmentation
much of organizational control. By default. no traffic should traverse the DMZ;
n their everything should originate from or tenninate on this area.
Operational zone
s. To • Level 3: Operations and control: This level includes the functions involved in
use of managing the workOows to produce the desired end products and for monitoring
. The and controlling the l!nlire operational system. This could include production
used scheduling, reliability assurance, systernwide control optimization. security
uring. management, network management, and potentially other required IT services.
nt by such as DHCP, DNS, and timing.
9/IEC • Level 2: Supervisory control: This level includes zone control rooms, controller
w the status. control system network/application administration, and other control-
nd gas related applications, such as human-machine interface (HMI) and historian.
• Level I: Dasie control: At this level, controlkrs and IEDs. dedicated HM ls, and
other applications may talk to each other to run part or all ol'the control function.
• Level 0: Process: This is where devices such as sensors and actuators and
machines such as drives. motors. and robots communicate with controllers or
IEDs.
SM".+".. EKAM Sulll\w- 51
VIII Se.mt ( CSE/IS£) I Vlt"un.et o f ~ T ~
Safety 1.onc
• Safety-critica l: This level include~ device:.. sensors. and other equipment used to
manage the safety functions of the control S) s11:m.
One of the key advnntagcs of designing nn industrial network in !>tructured levels,
as with the Purdue mod..:1, is that it allows security 10 be correctly applied at each
level and between levels. For example, n networks typically rc.:side at Levels 4 and
5 and use security principles common to IT nel\\orks. the lo"cr levels nrc where the
industrial systems and loT net"orks rt:i.idc. As shown in Figure 8, 11 DMZ resides
between the IT and OT levels. Clearly, to protect the lo,"er industrial la} ers, security
technologies such as fire" alls. pro11.y !-.ef\ crs. and IPSs should he used 10 ensure
that only authorized connections from trusted sources on e11.pected ports arc being
used. Al the DMZ, and, in fact, e,en bct\\een the lower leH:ls. industrial fil'\:\\alls
that are capable of understanding the control protocols !,hould be used to ensure the
continuoui. operation of the OT network.
Although security vulncrnbilities may potential!) e11.ist at ench level of the model.
it is clear that due 10 the amount of connectivity and sophistication of devices and
systems. the higher levels ha, e a greater chance of im:ur~i~n due to the "ider atta~k
surface. This docs no\ mean that lower lcvch. art: 1101 as important from n security
perspective; rather, it means that their altac~ surfac~ is small~r, a,~d if m~tigat,~n
tcchniqm:s nrc implemented properly, there 1s poten~ially less 1mp~c_t _l~ the ov_er,,_11
sy!>tcm. /\ s s hown ·Ill ,,·,Oour•·~ 8• I , ..~ rcvic,\ of pubhshed \ ulncrnb1h11cs

associ.,ted

with industrial security in 2011 shows that the a!>scts at the higher le, els ol the
framework had more detected vulnerabilities.
7011 PHbllshC(l llulner~b•l•I~ ArOM

.\
HI'

Figure X-1 201J 1111/m trial Security Report of Pu~i;,lietl


(US /11d11stri11t Control Sy.\lems Cyher £ 111erge11,·y e,\po11se
~:'~~;,:;,~1~;~;:~n
l,ttpl:l/ic1>·Cl!rt. us-cert.gov).
• • • , •1 Explain (04 Marks)
b. \\'lu\t i'I Network Scc~n~ Momt~n~1g . ·ess of finding intruder<; in a network.
An,. Nct,\urk sccurit) momtorlllg (NS!\,) is_a p_roc,. : ., ·ind \\ 'irning.!, 10 prioritize and
. and •unl\z111g tn( 10.:a101., • '
It is '\\:hie, ell h) co II cct lllg • ' ·· . . • \ facl .11, umle~ired presence.
·• , -11 thc assumption that 11e1c 1 1s. 11 ••
in"-,11g.,1te tnciuenh '" ' . . . . l'lnentcd often or thoroughly cnoug1I
. o l'NSM
\ he prac11ce • is· not nc" · )Cl ,tis not unp e
-s......~+..... e..- ~""...... •
52
-
•ithin reasonably mature and large organizations. There are many reasons for
underutilization. but lack of education and organizational patience are common
aasons. To simplify the approach. there is a large amount of readily available data
. . , 1f reviewed, would expose the activities of an intruder.
IS important to note that NSM Ls inherently a process in which discovery occurs
through the review of evidence and actions that have already happened. This is not
meant 10 imply that it is a purely postmortem type of activity. If you recognize that
mtrusion activities are. much like security, an ongoing process. then you see that
there is a similar set of stages tJiat an attacker must go through. The tools deployed
"ill slow that process and introduce opportunities to detect and thwart the attacker.
but there is rarely a single event that represents an attack in its entirety. NSM is the
discipline that wi ll most likely discover the extent of the attack process and, in tum,
define the scope for its remediation.
Module - 5
model.
cs and Explain the features of Ardunio UNO with regard to IOT (08 Murks)
attack Ans. The Arduino Uno is a microcontrollcr board based on the ATmega328. Arduino is
ecuriry an open-source, prototyping platform and its simplicity makes it ideal for hobbyists
igation to use as well as professionnls. The Anluino Uno has 14 digital input/output pins (of
overnll which 6 can be used as PWM outputs). 6 analog inputs. a 16 MHz crystal oscillator.
ociated a USB connection, a power jack. an ICSP header, and a reset button. It contains
of the everything needed to support the microcontroller: simply connect it to a computer
with a USB cable or power it with a /\C-to-DC adapter or battery to get sta1icd.
The Arduino Uno differs from 1111 preceding boards in that it dot:s not use tht: FTDI
USS-to-serial driver chip. Instead. it features the Atmega8U2 microcontroller chip
programmed as a USB-to-serial converter.
"Uno·• mt:ans om: in Italian and is named to mark the upcoming release of Arduino
1.0. The Arduino Uno and version 1.0 will be the ~eforence versions of Arduno.
moving forward. The Uno is the latest in a series of USB Arduino boards, and the
reference model for the Arduino platform.
Features of the Arduino UNO:
• Microcontroller: ATmcga328
• Operating Voltage: 5V
• Input Voltage (recommended): 7-l 2V
• Input Voltage (limits): 6-20V
reus
• Digital 1/0 Pins: 14 (of which 6 provide PWM output)
rER1) • Analog Input Pins: 6
• DC Current per 1/0 Pin: 40 mA
Marks) • DC Current for 3.3V Pin: 50 mA
etwork. • Flash Memory: 32 KB of which 0.5 KB used by bootloader
i1.e and • SRAM: 2 Kn (ATmcga328) '
esence. • EF.l'ROM: I KB (ATmt:ga328)
enough ·• Clock Speed: 16 MHz

53
VIII Sewt1 ( CSE/ISE)
optimiled ell\ ironmcn1 rhis
b. What 11rc the models of Raspberry pi ? (05 Marks)
room temperature, but also
Ans. Raspberry Pi - T[lerc are six different models of Raspberry Pi. The Pi 2 Model
and 1he tools )Oil use to incrc
Bor Pi I Model B+ and Pi 3 Model B arc ideal for beginner projects because the>
are the most versatile and have the widei.t range or capabilities The Pi 3 Mudd
8 has the added bonus of having a quad-core processor and I GB ul' RAM so it
supports heavier operating systems. like Ubuntu and Microsoft I 0. Tht: Model A+is
a powerful board for building robotics, but doesn · 1 have an Ethernet port and only
comes with one USB pon. Raspb1;::rry Pi Zero is basically a miniature version of the
Model A+, but has a more robust computing. power. It has a micro USB pon and mini
HDMI port for I080p output compatibility but doesn't have \\ ire less capability. It
only costs $5- and Ada fruit sells v. 1.3 for just $5, but you can only buy one per order.
The Raspberry Pi Zero W is the same singh:-board computer as tlw standard Zero but
does support wireless and [3luctooth connectivity. It costs $10 on Adafruit,
c. What is a temperature sensor? (03 Marks)
Ans. A temperature sensor is a device, typically, a thermocouple or RTD,that provides
for temperature measurement through an electrical signal. A thermocouple (T/C) is Play
made from two dissimilar metals that generate electrical voltage in direct proportion loT learns as much about )
to changes in temperature. technology to support leisu
OR • Culture and Night
response lo guide you i
10. a. Explain bow IOT can be used with consumer appliances (06 Marks)
recommending rcstaura
Ans. Consumers benefit personally and professionally from the optimization and data
• Vacations - Planning
analysis of loT. loT technology beha, es like a team of personal assistants, advisors.
many utiliLe agencies,,
and security. It enhances the way we live. work. and play.
• Products and Service
Home
need than current ana
loT takes the place or a full staff -
• Butler - loT wai_ts for you to rt:tum home. und ensure, your home remains full> information like your fi
prepared. It mo111tors your supplies, family. and the state of your home. It takes b. Ex.pl:1i11 the IOT Applic,
actions to resolve any issm:s that appear. Ans. Approximately 70 percen
• Chef - /\11 loT kitchen prepares meals or simply aids) ou in preparing them. 2050 according to Gartnc
• Nann_y - loT can _somewhat act a~ a guardian b} controlling acct:ss. providinr strain on the existing infra
supplies. and alertmg the proper individuals in an emergency. living, it's only going to
• Gard~er -The same loT systems of a farm t:asily work for home landscaping. To accommodate lhis nc
• Rep:urman - Smart systems perform key maintenance and repairs and also turning to the Internet of
r,:qu,:st ,h.,m. •
• Security
. Gunrd - loT watches over you 1417 It can observe and imprO\e communicat
.. to improve nearly every
individuals 111 ·1 d - · susp1c1ous
t b cl' I es away. an recognize the potentinl ofrninor equipment problems sm:,rt c:ili.:s.
. o ecomc 1sasters well before they do.
::s..
Work
s::~~:·. ~~
1
:~.:~;~: ~ t~~~ ~.r~ n~ ~h<lirlpool allows two different heat settings on the
11 0 111 g ...n remote control.
A more efficient water su
The Internet ofThings ha
Smart meters can impro\.
A smart office or other workspace combii . . due to ineffic1cnc>, and l
with smart tools. foT !cams about ou ies custom1zat1011 of the work environment entering and anal) ,ing da
y . your Job. and the wa'-, )·ou work to <Ie1·1vcr an
optimi 7 ed environment. This result!> in practical accommodations. lil-.e adjus1111g. the
room temperature. but also more advanced benehl~ lil..e mod1f>111~ your ~d1cdule
and the tools you use to increase your output and reduce your work tame. lo r acts as
a mana •er and consultant ca abk of seein "hat ou cannot.

Play
loT learns as much about you personally as it does professionally. 1 his enables the
technology to support leisure -
• Culture and Night Life - lo r can analyze your real-world activities and
response to guide you in findmg more of the things and places you enjoy such as
arks)
recommending restaurants and events bused on your preferences and experiences.
~ data
isors. • Vacations - Planning and saving for vacations prO\ es difficult for some, and
mimy utili.i:e agencies, wh ich can be replaced by loT.
• Products and Services - loT offers better analysis of the products you like and
need than current analytics based on its deeper access. It integrates with key
s folly
information like your finances to recommend great s~lutions.
\ \akcs b. Explain the IOT Application for smart cities. (10 Marks)
Ans. Approximately 70 percenl of the world"s population is expected to live in cities by
em. 2050 according. to Gartner. This rapid urban gro\\fh is already placing. a considerable;:
ovidinr strain on tht: existing infrastructure, and with more people making the move to urban
living, it's only going to get worse in the coming years.
aping. To accommodate this new demand on cities, municipalities around the globe are
nd also turning to the Internet of Things innovation to enhance their services. reduce costs,
and improve communication and interaction. Though the potential is there for loT
picious to improve nearly every aspect of urban living. there are three lo'f applications for
oblcms smart cities.
A more efficient water supply
on lht: The Internet of Things has the potential to transform the wa> cities consume water.
Smart meters can improve leak dctcction and dntu intcgrit): pn;, cnt lost 11::, cnue
due to inefficiency. and boost productivity b) rt:ducing the amount of time spent
onmenl entering and mialyzing data. Abo. these meters can he de~igncd 10 fo:nurc cu~tomcr-
liver an
55
VIII SetW ( CSE(ISE) I n.te.-net o f ~ r~
facing portals. pro, iding rcside111s wi1h real-lime access to information abou1 their
consumption and water supply.
An innovative solution to traffic congestion
As more and more people move lo cities. traffic congestion - which is alread) a
masl>ive problem-is only going to gt:t worse. Fortunald). the Internet ofThi11gs i-. well
positioned to make improvements in this area that can benefit rei,idcnts immediately.
For example, smart traffic signals can adjust their timing to accommodate commutes
and holiday traffic and keep cars moving. City officials can collect and aggregate data
from traffic cameras. mobilt: phones. vehicles. and road l>ensors to monitor traffic
incidents in real-time. Drivers can be alerted of accidents and directed to routi:s that
arc less congested. The pos1,ibili1ies a.-.: endle,;s and the impact "ill be substantial.
More reliable public transportation
Public transportation is disrupted \\htncver there are road closures. bad weather. or w l mimuce
equipment breakdowns. loT can give transit authorities the n:al-timc insights they - ID illlaKt
need to implement contingency plans. ensuring that residents alwnys have access to . . decisions taken.
safe, reliable, and efficient public transportation. Thi), might be done using insight:. llleapoflOTiso
from cameras or connected devices at bus shelters or other public arens. . _ bme period, the
Energy-efficient buildings ,apulation. With
loT technology is making it easier for buildings with legacy infrastructure to save a IICW age was u
energy and improve their sustainability. Smart building energy management systems. wid1 the creation o
for instance, use loT devices to connect disparate, nonstandard heating. cooling. for Procter & Garn
lighting, and fire-safety systems to a central management application. The energy ID linking the com
management application then highlights areas of high use and energy drills <:o staff
can correct them.
Research shows that commercial built.lings waste up to 30 percent of the energy they
use,' so savings with II smart building energy management system can be significant.
As hlorc smArt city buildings use energy management systems. the city will become
more sustainable 11s a whol..:.
Improved public safety
Smart cities and their CSP partners often implement video monitoring systems to
tackle the safety concern~ that come up in every growing cit). !)ome cities 110\\ have
Internet Phase
hundreds of cameras monitoring traffic for accidtnts and public streets for safet)
concerns. Video analytics software helps process the thousands of hours of video Connectivity
footage each camera produces, whinling it down to onl} important events. Systems (Digitize access
using loT technology turn every camera attached 10 the system into n sensor. with
edge computing and anal} tics starting right from the source. Artificial intelligence Networked Econ
technology like machine lt:arning will then complt:te the analysis 1111<1 st:nd video (Digitize busine
footage to humane; \,ho can react quick!) to ,ol,c problems and 1-eep residents sati:.
Cities arc also improving public safet} with smart lighting initiatives that replace
traditional streetlights \\ ith connected LED infrastructure. Nol only do the LED
lights last longer and conser\C energy, they also provide information on outages in
1cal lime. City workers can use that information to ensure important areas arc well lit
to deter crimes and make the public feel safer.
56
Eight Semeste r B.E. D'--gree Examination, CBCS - June / July 2019
Internet of Things Technology
Max. Marks: 80
Note : Answer atty FIVE fill/ questlotM, selecting ONE full quest/011 f rom each nwdule.

Module- I
What is JOT"! Explain in detail on Genesis ofIOT. (08 Marks)
Internet of Things (IOT) is an ecosystem of connected physical objects that are
accessible through the internet. The 'thing' in IOT could be a person with a heart
monitor or an automobile with built-in-sensors, i.e. objects that have been assigned
an IP address and have the ability to collect and transfer data over a network without
manual assistance or intervention. The embedded technology in the objects helps
them to interact with internal states or the external environment, which in turn affects
the decisions taken.
The age ofIOT is often said to have started between the years 2008 and 2009. During
this time period, the number ofdevices connected to the Internet eclipsed the world's
population. With more "things" connected to the Internet than people in the world,
a new age was upon us, and tho Internet of Things was born. The person credited
with the creation of the term "Internet of Things" is Kevin Ashton. While working
for Procter & Gamble in 1999, Kevin used this phrase to explain a new idea related
lo linking the company's supply chain to the Internet.

-
Buelnee•
Soclo111
IOISNCl

Internet Phase Definition


Connectivity rfhis phase connected people to email, web services, anc
(Digitize access) search so that information is easily accessed.
Networked Economy rrhis phase enabled. e-commerc~ and supply chair
(Digitize business) ~nhancements _along_ with ~ollaboratrve engagement to driv~
increased efficiency in business processes.

gcs in
ell lit

1
VIII Sem, ( CSE I ISE)

is phase extended the Internet experience to encompas


lmmersive
idespread video and social media while always bein 50
Experiences onnected through mobility. More and more applications ar
(Digitize interactions
oved into the cloud. 40
is phase is adding connectivity to objects and machines i
Internet of Things
1e world around us to enable new services and experiences
(Digitize the world)
t is connecting the unconnected.

b. What does loT and digitization mean? Elaborate on this concept. (04 Marks)
Ans. loT and digitization are terms that are often used interchangeably. In most contexts,
this duality is fine, but there are key differences to be aware of. At a high level. 10
loT focuses on connecting "things," such as objects and machines, to a computer
network, such as the Internet. loT is a well-understood tenn used across the industry 0
as a whole. On the other hand, digitization can mean different things to different
people but generally encompasses the connection of "things" with the data they What these numbers
generate and the business insights that result. businesses interact wi
For example. in a shopping mall where Wi-Fi location tracking has been deployed, using real-time conne
the "things" are the Wi-Fi devices. Wi-Fi location tracking _is simply the capability This in tum results i
of knowing where a consumer is in a retail environment through his or her smart services that save tun
phone's connection lo the retailer's Wi-fi network. While the value of connecting quality of life.
Wi-Fi devices or " things" to the Internet is obvious and appreciated by shoppers,
tracking real-time location of Wi-Fi clients provides a specific business benefit to the
mall and shop owners. In this case, it helps the business understand where shoppers
tend to congregate and how much time they spend in different parts of a mall or
store. Analysis of this data can lead to significant changes to the locations of product
displays and advertising. where. to place certain types of shops, how much rent to
charge, and even where to station security guards.

c. Write a abort note on "loT impact in Real World". (04 Marks)


Ans. Projections on the potential impact of loT are impressive. About 14 billion. or just
0.06%, of "things" are connected to the Internet today. Cisco Systems predicts that Scale
by 2020, this number will reach SQ billion. A UK government report speculates that
this number could be even higher, in the range of 100 billion objects connected.
Cisco further estimates that these new connections will lead to $19 trillion in profits
and cost savings. Figure provides a graphical look at the growth in the number of
devices being connected.

Security

2
•June,/ J!Al,y 2019
encompas
ways bcin
Jications ar
50


...,.._ ·-~'-~~~~ / Billion ,
40
machines i
xperiences
fo -·- h-.-.-. / -·
po
-- ••'-
'I

contexts,
high level, 10
_.......,.
a computer 7 3,1• 7 83

the industry 0
2003 2008 2010 2015 2020
to different
data they What these numbers mean is that loT will fundamentally shift the way people and
businesses interact with their surroundings. Managing and monitoring smart objects
using real-time connectivity enables a whole new level ofdatadriven decision making.
This in tum results in the optimization of systems and processes and delivers new
services that save time for both people and businesses while improving the overall
connecting quality of life.
shoppers. OR
efit to the
shoppers Discuss IOT challenges. (08 Marks)
fa mall or
Challenge Description
While the scale oflT networks can be large, the scale of OT can be
several orders of magnitude larger. For example, one large electrical
utility in Asia recently began deploying 1Pv6-based smart meters on
its electrical grid. While this utility company has tens of thousands
Scale ofemployees (which can be considered IP nodes in the network), the
number of meters in the service area is tens of millions. This means
the scale of the network the utility is managing has increased by
more than 1,000-fold! "IP as the loT Network Layer," explores how
new design approaches are being developed to scale 1Pv6 networks
into the millions of devices.
With more "things• becoming connected with other "things" and
people. security is an increasingly complex issue for IoT. Your threat
Security surface is now greatly expanded. and if a device gets hacked, its
connectivity is a major concern. A compromised device can serve as
a launching point to attack ocher devices and systems. loT security
is also pervasive across just ..... ffery facet ofloT.

3
VIII Sern, ( CSE ( ISE)
hardware. Examples includ
As sensors become more prolific in our everyday lives, much ·or the VPNs, and so on. Riding o
data they gather will be specific to individuals and their activities. adds APls and midd\eware
This data can range from health information to shopping patterns Network layer: This is the
Privacy and transactions at a retail establishment. For businesses, this data It includes the devices
has monetary value. Organizations are now discussing who owns them. Embodiments of th
this data and how individuals can control whether it is shared and technologies, such as IEE
with whom. as IEEE 801.11 ah. Also i
loT and its large number of sensors is going to trigger a del~ge of power line communicatio
data that must be handled, This data will provide critical information
Big data and and insights if it can be processed in an efficient maimer. The
data analytics challenge, however, is evaluating massive amounts of data arriving
from different sources in various forms and doing so in a timely
manner,
As with any other nascent technology, various protocols and
architectures are jockeying for market share and standardization
bT within loT. Some of these protocols and architectures are based on
1nteropera 1 ,ty proprietary elements, and others arc open. Recent loT standards ate Commu
Ne
helping minimize this problem. but there are often various protocols
and implementations available for loT networks.

b. With a neat diagram, explain architecture ofloT


Ans.
1. "Things" layer: At
the environment in wh'
information needed.
2. Communications ne
need to communicate
uses a wireless technol
Access network sublay
This is typically made
LoRa. The sensors con
SorvK)os Layor Ne1Wofk Layor:
ApplJC:\\lOn~ La.,01:
or.etn M.rcluckH a common N>J)lk:.ahon..~ tel< to
Gateways and backh
• Smart Energy
• ""6et Traci<llg ceMCet. hOrtZOMa.1uamewo!'k th• AP\, to COffUl'1unk:.G.ta organizes multiple sm
l• f~et Mar\lY;lcment s.worUno R..ttul API• - to sensor•
gateway communicat
Applications layer: The oneM2M architecture gives major attention to connectivity to forward the coll
between devices and their applications. This domain includes the application-layer backhau\) to a headen
protocols and attempts to standardize northbound API definitions for interaction Network transport s
with business intelligence (Bl) systems. Applications tend to be industry-specific transport layer prot
and have their own sets of data models, and thus they are shown as vertical entities. variety of devices to
Services layer: This layer is shown as a horizontal framework across the vertical loT network manage
industry applications. At this layer, horizontal modules include the physical network the headend applicati
th'at the loT applications run on, the underlying management protocols, and the andMQIT.

4
cscs ·J1M1.€,,(JIM(l2019
' uch 'of the hardware. Examples include backhaul communications via cellular, MPLS networks,
activities. VPNs, and so on. Riding on top is the common services layer. This conceptual layer
adds APls and middleware supporting third-party services and applications.
g patterns
s, this data Network layer: This is the commun ication domain for the IoT devices and endpoints.
It incfudes the devices themselves and the communications network that links
who owns
them. Embodiments of this communications infrastructure include wireless mesh
shared and
technologies, such as IEEE 802.15.4, and wireless point-to-multipoint systems, such
as IEEE 801.1 I ah. Also included are wired device connections, such as IEEE 190 I
a deluge of power line communications.
nformation c. Explain core "loT" functional stack.
aimer. The (04 Marks)

Core loT loT Data Management


Functional Stack and Compute Stack

tocols and
dardization
re based on
Applications Cloud
7
Communicauons
-~
tandards ate
us protocols
Network j Fog
I
Things: Sensors and
(04 Marks) Actuators
Edge
I
1. "Things" layer: At this layer, the physical devices need to fit the constraints of
the environment in which they are deployed while still being able to provide the
information needed.
2. Communications network layer: When smart objects are not self contained, they
need to communicate with an external system. In many cases, this communication
uses a wireless technology. This layer has four sublayers:
Access network sublayer: The last mile of the IoT network is the access network.
This is typically made up of wireless technologies such as 802. l lah, 802.15.4g, and
LoRa. The sensors connected to the access network may also be wired.
. Gateways and backhaul network sublayer: A common communication system
organizes multiple smart objects in a given area around a common gateway. The
gateway communicates directly with the smart objects. The role of the gateway is
to connectivity to forward the collected information through a longer range medium (called the
pplication-layer backhaul) to a headend central station where the information is processed.
for interaction Network transport sublayer: For communication to be successful, network and
ndustry-specific transport layer protocols such as IP and UDP must be implemented to support the
:vertical entities. variety of devices to connect and media to use..
ss the vertical IoT network management sublayer: Addi~ prococols must be in place to allow
hysical network the headend applications to exchange data wida 6c sensors. Examples include CoAP
tocols, and the and MQTT.

5
-Jun,e,,/Ju½! 2
In.turtet of~ T~ and moisture ·analy
c:har&cteri,;tics of ga
VIII SeAW ( CSf / [Sf} upper layer, an application need\:
M •\he die density of gas su
~n,a:n. '->'ul~\;,'"5. ...,,..~~\'\ "~ce~~"a.t':(,
-. . - ..- - - -······ .... -...a. ..........., "'-~ ... , ..,,\" ; ' ._;;.on,\l'U\ \\'\.C
process the co\\ccte<l \lata, no\ o n Y " · · t t
to make intelligent decision based on the information collect~~ and, 111 turn, ms ru~ for thermal image
the "things" or other systems to adapt to the analyzed conditions and change their 6. Optical Sensors
behaviors or parameters. it measures a physi
Module- 2 maybe digital for
units. It involves n
3. a. List and explain dll!erent types of senson. . . (08 M arks)
Also, it is used in
Ans. Here is the list of Sensors most commonly used m the l oT devices, the two different
l. Temperature Sensor common devices,
2. Pressure Sensor that tum on auto
3. Proximity Sensor 7. Gas Sensor:
4. Accelerometer and Gyroscope Sensor
5. IR Sensor
6. Optical Sensor it can detect com
7. Gas Sensor 8. Smoke Senso
8. Smoke Sensor
1. Temperature Sensor: A Temperature Sensor senses and ~easure~ the te~perature
and converts it into an electrical signal. They have a maJor role m Environment,
smoke sensor, an
Agriculture and Industries. For example, these sensors can detect the temperature of
photoelectric smo
the soil which is more helpful in the production of crops.
2. Pres'sure Sensor: A pressure sensor senses the pressure applied ie, force per contains a pulsed
every I 0seconds t
unit area, and it converts into an electrical signal. It has high importance in weather
forecasting. There are various Pressure sensors available in the market for many
purposes.
3. Proximity Sensor: A proximity sensor is a sensor able to detect the presence
of nearby objects without any physical contact. A proximity sensor often emits an
electromagnetic field or a bean1 of electromagnetic radiation and looks for changes
in the field or return signal. A most common application of this sensor is used in cars.
4. Accelerometer and Gyroscope Sensor: The difference between Accelerometer
and the gyroscope is accelerometer measures linear acceleration based on vibration
whereas, the gyroscope is intended to detenuine an angular position based on the
principle of the rigidity of space. Accelerometers in mobile phones are used to
detect the orientation of the phone. The gyroscope, adds an additional dimension
to the information supplied by the accelerometer by tracking rotation or twist. A
3D gyroscope has three gyroscopic sensors mounted orthogonally. Accelerometers
and gyroscopes are the sensors of choice for acquiring acceleration and rotational
infqrmation in drones, cell phones, automobiles, airplanes, and mobile loT devices.
5. Infrared Sensors: An Infrared Sensor is an electronic device, which senses certain
characteristics of its surroundings by emitting Infrared radiation. It has the ability
to measure the heat being emitted by an object and also measures the distance. It
has been implemented in various applications. lt is used in Radiation them1omete~s
depend on the material of the object. IR sensors are also used in Flame monitors

6
and moisture analysis. IR sensors are used in gas analyzers which use absorption
characteristics of gases in the IR region. Two types of methods are used to measure
ary, but the density of gas such as dispersive and nondispersive. IR imaging devices are used
instruct for thermal imagers and also for night vision.
ge their 6. Optical Sensors: The Optical Sensors conven light rays into an electronic signal,
it measures a physical quantity of light and transforms into a fonn which is readable,
maybe digital form. It detects the electromagnetic energy and sends the results to the
units. It involves no optical fibers. It is a great boon to the cameras on mobile phones.
arks) Also, it is used in mining, chemical factories, refineries, etc. LASER and LED are
the two different types of light source. Optical sensors are integral parts of many
common devices, including computers, copy machines (Xerox) and light fixtures
that tum on automatically in the dark.
7. Gas Sensor: A Gas Sensor or a Gas detector is a device that detects the gas in an
area, which is very helpful in safety systems. It usually detects a gas leak in an area,
that results are sent to a control system or a microcontroller, that finally shuts down.
it can detect combustible, flammable and toxic gases.
8. Smoke Sensor: A smoke sensor detects smoke and its level of attainment.
Nowadays, the manufacturers of the sensor implement it with a voice alarm through
ALEXA, also notifies in our smartphones. The smoke sensor ifoftM'o types, Optical
smoke sensor, and the ionization smoke sensor. The optical smoke sensor also called
photoelectric smoke alarms works using the light scattering principle. The alarm
contains a pulsed Infrared LED which pulses a beam of light into the sensor chamber
rce per
every I 0seconds to check for smoke particles.
M"cathcr
r many b. Elaborate on small physical objects and small virtual obje\:ts. (04 Marks)
Ans. Refer Q.no 3 od Model Paper I
resence Explain "loT Access Technologies" (04 Marks)
mits an loT Access Technology is spread across licensed and unlicensed spectrum and there
hanges are several number of Radio technologies. At high this access can be classified in
in cars. two categories:
rometer I. Non -Cellular Technologies
ibration 2. Cellular Technologies
on the Each of the technologies available for loT connectivity has its own advantages and
used to disadvantages. However, the range ofloT connectivity requirements- both technical
and commercial - means cellular technologies can provide clear benefits across a
wide variety of applications. While choosing technology following requirement
needs to be considered
• It should have Global Reach
• Matured Ecosystem
• Diverse and Secure
• Scalable and QoS support
• Low Total Cost ofOwnership (TOC)

7
VIII Se.m, (CSE ( ISE)
A common infonnation set is being provided. Particularly, the· following topics are
addressed for each loT access technology: Wireless! !
1. Standardization and alliances: The standards bodies that maintain the protocols is a protoc
for a technology Wireless self:healin
HART the 2.4 G
2. Physical layer: The wired or wireless methods and relevant frequencies
3. MAC layer: Considerations at the Media Access Control (MAC) layer, which can be fi
engincerin
bridges the physical layer with data link control
4. Topology: The topologies supported by the technology Construct
5. Security: Security aspects of the technology . stack for
Thread
6. Competitive technologies: Other technologies that are similar and may be suitable products i
alternatives to the given technology Thread G
OR b. What is SANET? Ex
4. a. Briefty Explain protocol stack utilization IEEE 802.15.4. (08 Marks) based solution offcn.
Ans. A sensor/actuator netw
Description sense and measure thei
.Protocol
The sensors and/or a
Promoted through the ZigBee Alliance, ZigBee defines upper-layer cooperating in a produ
components (network through application) as well as application and cooperation is a p
profiles. Common profiles include building automation, home in SANETs are diverse!
automation, and healthcare. ZigBee also defines device object The following arc some
ZigBee functions, such as device role, device discovery, network join, and offers:
security. For more information on ZigBee, see the ZigBee Alliance Advantages:
webpage, at W\vw.ziibee.oo:. ZigBee is also discussed in more detail I. Greater deployment~
later in the next Section. 2. hard-to-reach places)
6 LOWPAN is an 1Pv6 adaptation layer defined by the fETF 6L0WPAN 3. Simpler scaling to a
4. Lower implementatic
6LoWPAN working group that describes how to transport I Pv6 packets over
IEEE 802.15.4 layers. RFCs document header compression and TPv6 5. Easier long-term mal
enhancements to cope with the specific details of IEEE 802.15.4. 6. Effortless introducti
7. Better equipped to h
An evolution of the ZigBee protocol stack. ZigBec IP adopts the
Disadvantages:
ZigBee lP 6F0WPAN adap~~tion_layer, l~v6 network layer, and RPI. routing
I. Potentially less secu
·pro~ocol. ln ~dd,tton, 1t offers improvements to [P security. ZigBee JP
is discussed 111 more detail later in this chapter.
2. Typically lower tran
3. Great.er level of imp
ISA I00. !,la .is developed by the International Society of Automation
(ISA) as Wireles~ Sy~tems for Industrial Automation: Process Control
ISAIOO. I la nod _Reln~ed Appl1cnt1ons." It is based on IEEE 802 15 4-2006 d Explain working of
spcc1ficnt1ons were bl' h d • · · , an
pu is e m 2 OIO and then as IEC 62734 Th ~oT network lechnolo
network and transport layers arc based on IETK 6L0WPAN IP 6. e mclude cellular, wifi.
UDP standards. , v , .nut
as LPWAN, Bluetooth
consider which ne
mindful of the folio

8
-JUAf!/(J!Afy 2019

WirelessHART, promoted by the HART Communication Fo_u~dation,


is a protocol stack that offers a time-synchronized, self-organizing, and
Wireless self:healing mesh architecture, leveraging IEEE 802.15.4-2006 over
HART the 2.4 GI lz frequency band. A good white paper on WirelessHART
can be found at http://www.emerson.com/i:esource/blob/ system-
engineering-guidelines-iec-625yi-wirelesshart--dar.a-79y00.pdf
Constructed on top of IETF 6LoWPAN/IPv6, Thread is a protocol
stack for a secure and reliable mesh network to connect and control
Thread products in the home. Specifications are defined and published by the
Thread Group at www.threadi:roup.o[i.
b. What is SANET? Explain some advantages and disadvantages that a wireless
based solution offers. (08 Merks)
A sensor/actuator network (SAN ET), as the name suggests, is a network ofsensors that
sense and measure their environment and/or actuators that act on their environment.
The sensors and/or actuators in a SANET are capable of communicating and
pper-layer cooperating in a productive manner. Effective and well coordinated communication
pplication and cooperation is a prominent challenge, primarily because the sensors and actuators
on, home in SANETs are diverse, heterogeneous, and resource-constrained.
ce object The fo llowing are some advantages and disadvantages that a wireless-based solution
join, and offers:
Alliance Advantages:
ore detail I. Greater deployment flexibility (especially in extreme environments or
2. hard-to-reach places)
3. Simpler scaling to a large number of nodes
4. Lower implementation costs
5. Easier long-term maintenance
6. Effortless introduction of new sensor/actuator nodes
7. Better equipped to handle dynamic/rapid topology changes
Disadvantages:
I. Potentially less secure (for example, hijacked access points)
2. Typically lower transmission speeds
3. Grea~er level of impact/influence by environment
utomation Module-3
ss Control
Explain working of IP as the loT network layer, (08 Marks)
2006, and
~oT network techn~logies to be aware of toward the bottom of the protocol stack
2734. The
mclude cellular, w1fi, and Ethernet. as well as more specialized solutions such
1Pv6, .nut
as L'.WAN,. Bluetooth ~ow Energy (BLE), ZigBee, NFC, and RFID. When you
c~ns1dcr which netw~rkmg technologies to adopt within your IoT application, be
mmdful of the followmg constraints:

9
VIII Semt (CSE I ISf) I~n.etc,f ~T~
I. Range technologies are released
2. Bandwidth subject to change.
3. Power usage Security
4. lntennittent connectivity Security is always a priori
5. lnteroperability implement end-lo-end secu
6. Security protection.
Range · h d · ty · II ._ Write note on B usiness
Networks can be described in terms of the distances over wh1c ata 1s pica Y Data flowing from or to
transmitted by the loT devices attached to the network: center servers eith~r in the c
PAN (Personal Area Network) Dedicated applications are
PAN 1·s short-range, where distances can be measured in meters, such as a wearable or on network edge platfi
. that communicates
fitness trackci: device • w1'th an app on a cell phone over BLE. applications communicate
LAN (Local Area Network) solutions combining vari
LAN is short- to medium-range, where distances can be up to hundreds of met~rs, approach with a common
such as home automation or sensors that are installe~ within_ a _factory pr~d~ction or upper (application) lay
line that communicate over wifi with a gateway device that 1s installed w1thm the started playing a key arch·
same building. in the IT markets but also
MAN (Metropolltan Area Network) .
MA1'!_ is long-range (city wide), where distances are meas~red up to a few k,lomet~rs, Discuss ne~ for optlm"
such as smart parking sensors ins1alled throughout a city that are connected m a The Internet ofThings wi
mesh network topology. challenges !.till exist for I
WAN (Wide Area Network) . . of non-IP devices, you
WAN is long-range, where distances can be measured m kilometers, such as levels that loT often im
agricullura\ sensors that are ins1alled across a large farm or ranch that are used to of the IP stack to hand
monitor micro-climate environmental conditions across the property. following sections take a
Bandwidth the nodes and the netwo
Oandwid1h, or the amount of data that can be transmitted in a specific period of is transitioning from vc
time, limits the rate at which data can be collected from loT devices and transmitted the loT space.
upstream.
Power usagt
Transmitting data from a device consumes power. and transmitting data over long Describo application p
ranges requires more power than over a short range. You must consider the devices The loT application prot
that operate on a battery to conserve power to prolong the life of the battery and Vertical industries they a
reduce operating costs. on the characteristics 0
lntennittent connectivity protocols that are suffici
loT devices aren't always connected. In some cases, devices will connect periodically well suited for constrain
by design in order to save power or bandwidth. ne Transport Layer:
Interoperability lhe constrained nature
With so many different devices connecting to the IoT, interoperability can be lnditional transport mec
a challenge. Adopting standard protocols has been the traditional approach for leT Application Trans
maintaining interoperability on the internet. However, for the loT, standardization loT application data an
processes sometimes struggle to keep up with the rapid pace of change and -.clitionaJ networks, TC

10
llc:hnologies are released based on upcoming versions of standards that are still
llbject to change.
~~ . .
Security is always a priority, so be sure to select networking technologies that
aplement end-to-end security, including authentication, encryption, and open port
protection.
Write note on Business case for IP (04 Marks)
Data flowing from or to "things" is consumed, controlled, or monitored by data
center servers either in the cloud or in locations that may be distributed or centralized.
Dedicated applications are then run over virtualized or traditional operating systems
or on network edge platforms (for example, fog computing). These lightweight
applications communicate with the data center servers. Therefore. the system
solutions combining various physical and data link layers call for an architectural
approach with a common laycr(s) independent from the lower (connectivity) and/
or upper (application) layers. This is how and why the Internet Protocol (IP) suite
started playing a key architectural role in the early I990s. IP was not only preferred
in the IT markets but also for the OT environment.
~~~~~iza~ ~~~
The Internet of Things will largely be built on the Internet Protocol suite. However,
challenges still exist for IP in IoT solutions. In addition to coping with the integration
of non-IP devices, you may need to deal with the limits at the device and network
levels that loT often imposes. Therefore, optimizations are needed at various layers
of the IP stack to handle the restrictions that aro present in loT networks. The
following sections take a detailed look at why optimization is necessary for IP. Both
the nodes and the network itself can often be constrained in loT solutions. Also, IP
is transitioning from version 4 to version 6, which can add further confinements in
;mitted the IoT space.
OR
•r long Describe application protocols for IoT. (08 Marks)
evices The loT application protocols you select should be contingent on the use cases and
vertical industries they apply to. In addition, IoT application protocols are dependent
ry and
on the characteristic~ of the lower layers themHlves. For example, application
protocols that arc sufficient for generic nodes and traditional networks often are not
well suited for constrained nodes and networks.
ically
Tbe Transport Layer: IP-based networks use either TCP or UDP. However,
the constrained nature of loT networks requires a c:loler look at the use of these
traditional transport mechanisms.
loT Application Transport Methods: This section aplores the various types of
loT application data and the ways this data can be canted across a network. As in
traditional networks, TCP or UDP are utilized in mOlt caes when transporting loT

11
VIII Set11t (CSE I I S£) -1~1J~ 2019
application data. The transport methods are covered in depth and form the bulk of the
material in this chapter. You will notice that, as with the lower-layer loT protocols,
there are typically multiple options and solutions prestinted for transporting loT
application data. This is because loT is still developing and maruring and has to
account for the transport of not only new application protocols and technologies but
legacy ones as well.
b. Discuss the various methods used in loT application transport. (08 Marks)
Ans. The following categories of loT application protocols and their transport methods
are explored in the following sections:
Application layer protocol not present: In this case, the data pay load is directly
transported on top of the lower layers. No application layer protocol is used.
Supervisory control and data acquisition (SCADA): SCADA is one of the most a. What do you mean by
common industrial protocols in the world. but it was developed long before the days Data Analytics has a si
of IP, and it has been adapted for IP networks. applications and inv
Generic web-based protocols: Generic protocols, such as Ethernet, Wi-Fi. and effective use of their da
4G/LTE, are found on many consumer- and enterprise-class loT devices that Volume: There are huge
communicate over non-constrained networks. The business organizatio
loT application layer protocols: loT application layer protocols are devised to run analyze the same for ex
on constrained nodes with a small compute footprint and are well adapted to the data can be analyzed easi
network bandwidth constraints on cellular or satellite links or constrained 6LoWPAN Structure: loT applicati
networks. Message Queuing Telemetry Transport (MQTT) and Constrained as unstructured, semi-s
Application Protocol (CoAP), covered later in this chapter, are two well-known significant difference in
examples of loT application layer protocols. business executive to an
SCADA and software.
In the world of networking technologies and protocols, loT is relatively new. Driving Revenue: The
Combined with the fact that IP is the de facto standard for computer networking business units to gain an
in general, older protocols that connected sensors and actuators have evolved and lead to the development
adapted themselves to utilize IP. A prime example of this evolution is supervisory expectations. This, in
control and data acquisition (SCADA). Designed decades ago, SCADA is an organizations.
automation control system that was initially implemented without IP over serial Competitive Edge: loT
links, before being adapted to Ethernet and 1Pv4. · numerous loT applicati
IoT Application Layer Protocols data analytics in loT in
When considering constrained networks and/or a large-scale deployment of and will, therefore, pro'II
constrained nodes, verbose web-based and data model protocols. To address this
b. Discuas Bigdata analy
problem, the loT industry is working on new lightweight protocols that are better
suited to large numbers of constrained nodes and networks. Two of the most popular
Alla. Generally, the industry
protocols are CoAP and MQTT. Figure 6-6 highlights their position in a common
1. Velocity: Velocity
Hadoop Distributed Fi
loT protocol stack.
Smart objects can gene
database or file system

12
·JIM\£,,/JiA½t 2019

CoAP MOTT
,_
UDP TCP

INS

~--~.:.:.:.:J'~-'
-~,'·--
.....:¥·---
.
·-·-=·- -b'·~~
~
r~:i:-,._ - .
~.: ,.__
~ I I
-
,,~ -
-
.,..,.:.~~1'
I:.;' .. _..,·
~~--.,.
Module-4
What do you mean by data and analytics for loT? Explain. (04 Marks)
Data Analytics has a significant role to play in the growth and success of loT
applications and investments. Analytics tools will allow the business units to make
effective use of their datasets as explained in the points listed below.
Volume: There are huge clusters of data sets that loT applications make use of.
The business organizations need to manage these large volumes of data and need to
analyze the same for extracting relevant patterns. These datasets along with real-time
data can be analyzed easily and efficiently with data analytics software.
Structure: loT applications involve data sets that may have a varied structure
as unstructured, semi-structured and structured data sets. There may also be a
significant difference in the data formats and types. Data analytics will allow the
business executive to analyze all of these varying sets of data using automated tools
and software.
Driving Revenue: The use of data analytics in loT investments will allow the
business units to gain an insight into customer preferences and choices. This would
lead to the development of services and offers as per the customer demands and
expectations. This, in tum, will improve the revenues and profits earned by the
organizations.
Competitive Edge: loT is a buzzword in the current era of technology and there are
numerous loT application developers and providers present in the market. The use of
data analytics in loT investments w ill provide a business unit to offer better services
and will, therefore, provide the ability to gain a competitive edge in the market._
b. Discuss Bigda ta a nalytics tools and technology. (04 Marks)
Ans. Generally, the industry looks to the "three Vs" to categorize big data:
I . Velocity: Velocity refers to how quickly data is being collected and analyzed.
Hadoop Distributed File System is designed to ingest and process data very quickly.
Smart objects can generate machine and sensor data at a very fast rate and require
database or fi le systems capable of equally fast ingest functions.

13
VIII Se,rn, (CSE I ISE)
2 • Variety: Variety refers to different types of data. Often you see data cate~orized
. d t t d Different database technolog,cs may
as structured, sem r structure , or uns rue ure . 1-1 d .sable to collect and store all
· ofthese types. a oop 1 •
only be capable ofncccptin,i ~e. l h~n combinino machine data from loT devices
~~::~ry- " Th1<1 cnn be ben1:nCIA w "' . \ d'
s\ructured in nature with data from other sources, such as socia me 1a or
multimedia, that is unstructured.
TbePurdueM
Regardless ofwh
. . . treated. IT inform
3. Volume: Volume refers 10 the scale of the data. Typ1cal\y, this 1s measured from
process optimizat
gigabytes on the very low end to petabytcs or even exabytes of da!a on ~he other
to make physical
extreme. Generally, big data implementations scale beyond what 1s available on
locally attached storage disks on a single node. It is common to see clusters of Thus, the operati
factors as part of
servers that consist of dozens, hundreds, or even thousands of nodes for some large
IT domain. Orga1
deployments.
separate, but this
c. With a case study relate the concept ofsecurin& loT. (08 Marks) to more tradition
Ans. It is often said that if World War Ill breaks out, it wil\ be fought in cyberspace. As loT activities. For ex
brings more and more systems together under the umbrella of network connectivity. (JPS) are being
security has never been more important. From the electrical grid system that powers As the borders
our world, to the lights that control the flow of traffic in a city, to the systems that keep align strategies a
airplanes flying in an organized and efficient way, security of the networks, devices, types of devices
and the applications that use them is foundational and essential for all modern more high\y opti
communications systems. Providing security in such a world is not easy. Security IT counterparts.
is among the very few, if not the on\y, techno\ogy disciplines that must operate with
external forces continually working against desired outcomes. To further complicate
matters, these external forces are able to leverage traditional technology as well as
nontechnical methods (for example, physical security. operational processes, and
so on) to m-eet their goals. With so many potential attack vectors, information and
cybcrsecurity is a challenging. but engaging. wpi~ \hat is of critical importance to
-.,uw,ogy vendors. enterprises, and service providers alike.
Information technolol)' (IT) environments have faced active attacks and information
socurity threats for many decades, and the incidents and lessons learned are well-
known aod documented. By contrast, operational technology (OT) environments
were traditionally kept in silos and had only limited connection to other networks. f :lolety

Thus, the history of cyber-attacks on OT systems is much shorter and has far ......,rlsezoae
fewer incidents documented. Therefore, the learning opportunities and the body of • Le¥el 5: Eate
cataloged incidents with their corresponding mitigations are not as rich as in the 1T Resource Pia
world. Security in the OT world also addresses a wider scope than in the IT world. cloc:wnent man
For example, in OT, the word security is almost synonymous with safety. In fact, 1be outside wo
many of the industrial security standards that form the foundation for industrial loT Lffel 4: Bui
security also incorporate equipment and personnel safety recommendations. tbis level and
optimization
printing, and

14
.OR
in In detail how IT 11nd OT sec:■rity practices and systems vary in real
..._ (OS Marks)
TIie Purdue Model for Control Hienrclly
ltegardless of where a security threat arises, it must be consistently and unequivocally
lreatcd.IT information is typically used to make business decisions, such as those in
process optimization, whereas OT information is instead characteristically leveraged
to make physical decisions, such as closing a valve, increasing pressure, and so on.
Thus, the operational domain must also address physical safety and environmental
factors as part of its security strategy-and this is not normally associated with the .
IT domain. Organizationally, IT and OT teams and tools have been historically
separate, but this has begun to change, and they have started to converge, leading
to more traditionally IT-centric solutions being introduced to support operational
activities. For example, systems such as firewalls and intrusion prevention systems
(IPS) are being used in loT networks.
As the borders between traditionally separate OT and IT domains blur, they must
align strategics and work more closely together to ensure end-to-end security. The
types of devices that are found in industrial OT environments are typically much
more highly optimized for tasks and industrial protocol-specific operation than their
IT counterparts. Furth~rmore, tJ1eir operational profile differs as well.

Er1e,,nse Zone

DMZ

OptmtJor•&w,,,t

rmation P"""'""Coiltll/
SCAOI\Zono
e wcll-
~nments
tworks.
has far Enterprise zone
body of • Level 5: Enterprise network: Corporate-level applications such as Enterprise
the IT Resource Planning (ERP), Customer Retltioaship Management (CRM),
world. document management, and services such II lllernet access and VPN entry from
In fact, the outside world exist at this level.
rial loT • Level 4: Business planning and lo1ladca ■■•11-ank: The IT services exist at
this level and may include scheduling s, fl • material flow applications,
optimization and planning systems, and local · such as phone, email,
printing, and security monitoring.

15
VIII Sem, (CSE ( ISE)

Industrial demilitarized zone voltage provid


• DMZ: The DMZ provides a buffer zone where services and data can be shared the board that
between the operational and enterprise zones. It also allows for easy segmentation that operate wit
of organizational control. By default, no traffic should traverse the DMZ; future purposes
everything should originate from or terminate on this area. • Stronger RESE
Operational zone • Atmega l 6U2
• Level 3: Operations and control: This level includes the functions involved in "Uno" means one
managing the workAows to produce the desired end products and for monitoring 1.0. The Uno and
and controlling the entire operational system. This could include production forward: The Uno
scheduling, reliability assurance, systemwide control optimization, security model for the Ard
management, network management, and potentially other required IT services, b. With a neat diag
such as DHCP, DNS, and timing.
• Level 2: Supervisory control: This level includes zone control roo~s, controller
status, control system network/application administration, and other control-
related applications, such as human machine interface (HMI) and historian.
• Level 1: Basic control: At this level, controllers and IEDs, dedicated HMls, and
other applications may talk to each other to run part or all of the control function.
• Level 0: Process: This is where devices such as sensors and actuators and
machines such as drives, motors, and robots communicate with controller~ or
lEDs.
Safety zone
• Safety-critical: This level includes devices, sensors, and other equipment used to
manage the safety functions of the control system.

b. Discuss OCTAVE and FAIR formal r\sk. analysis. Here are \he vari
Am. Refer Q. no 8 b of Model Paper l 1. ARM CPU/
Module-5 that's made up o
processing unit (
9. a. Give a brief note on Arduino UNO (04 Marks) work (taking inp
Ans. TheArduino Uno is a microcontroller board based on theATmega328 (datasheet). It graphics output.
has 14 digital input/output pins (ofwhieh 6 can be used as PWM outputs), 6 analog 2. GPIO - Thes
inputs, a 16 MHz ceramic resonator, a USB connection, a power jack, an lCSP header, will allow the re
and a reset button. lt contains everything needed to support the microcontroller; 3.RCA-An R
simply connect it to a computer with a USB cable or power it with a AC-to-DC devices.
adapter or battery to get started. 4. Audio out -
The Uno differs from all preceding boards in that it does not use the FTDI USB-to• devices such as
serial driver chip. Instead, it features the Atmega 16U2 (Atmega8U2 up to version 5. LEDs-Ligh
R2) programmed as a USB-to-serial convener. Revision 2 of the Uno board has a 6. lJSB - This
resistor pulling the 8U2 HWB line to ground, making it easier to put into DFU mode. (including your
Revision 3 of the board has the following new features: can use a USB
• pinout: added SDA and SCL pins that are near to the AREF pin and two other keyboard if it ha
~wp~ . 7. HDMI -Thi
• placed near to the RESET pin, the IOREF that allow the shields to adapt to the

16
cs -Jr.,f,,tlU!//J~ 2019
volta e rovided from the board. In future. shields will be c~mp~tible b~th with
;hared the b!a! that use·the AYR, which operate with 5V and ~•th the_ Ardumo Due
,tation that operate w1·th~.>.3y. The second one is a not connected pm, that 1s reserved for
DMZ; future purposes.
• Stronger RESET circuit.
• Atmega l 6U2 replace the 8U2. . .
"Uno" means one in Italian and is named to mark the upc~mmg releas~ of Ardu~no
1.0. The Uno and version 1.0 will be the reference versions of Ardumo, movmg
forward: The Uno is the latest in a series of USB Arduino boards, and the reference
model for the Arduino platform; for a comparison with previous versions
:rvices, b. With a neat diagram, explain Raspberry PI board. (04 Marks)
Ans.
troller iCA VID£0 AUDIO lflS LAN
ontrol-
I' ~ ¥ .
. /
.. .
~

llers or Flv 2
CP

somo POWEi
~

Here are the various components on the'Raspberry Pi board:


1. ARM CPU/GPU - This is a Broadcom BCM2835 System on a Chip (SoC)
that's made up ofan ARM central processing unit (CPU) and a Videocore 4 graphics
processing unit (GPU). The CPU handles all the computations that make a computer
work (taking input, doing calculations and producing output), and the GPJJ handles
graphics output.
2. GPIO - These_are exposed general-purpose input/output connection points that
will allow the real hardware hobbyists the opportunity to tinker.
3. RCA - An RCA jack allows connection of analog TVs and other similar output
devices.
4. Audio o ut- This is a standard 3.55-millimeterjack for connection ofaudio output
devices such as headphones or speakers. There is no audio in.
5. LEDs - Light-emitting diodes, for all of your indicator light needs.
6. USB :... This is a common connection port for peripheral devices of all types
(including your mouse and keyboard). Model A has one, and Model B has two. You
o other can use a USB hub to expand the n·umber of ports or plug your mouse into your
keyboard if it has its own USB port.
7. HDMI - This connector allows you to hook up a high-definition television or

17
VIII Sem, (CSc I IS£) •Jl,,l,.t'\,(!//J~20
othtr compatible device using an HDMI cable.
8. Power - This is a Sv Micro USB power connector into which you can plug your
compatible power supply. •· Explain in deta il sm
9. SD cardalot - This is a full-sized SD card slot. An SD card with an operating Refer Q. no IO b. of
system (OS) installed is required for booting the device. They are available for b. With case study expl
purchase from the manufacturers, but you can also download an OS and save it to
the card yourself if you have a Linux machine and the wherewithal.
10. Ethernet - This connector allows for wired network access and is only available
on the Model B.
c. With a neat diagram, explain wirless temperature monitoring system using
Raspberry Pl. (08 Marks)
Ans.
____
RASPBERRY Pl
_,,

- "' 0--...
',
l ONNa-Hl.l/LO '\

-
1/0

... ~
~....
t __
-
....
l
~

--
,I loT-enabled smart ci
environment and imp
:,• /, §:- lighting. Below, we
already implemented

-,
l _, /I

-- ~
... ,,
i' -
\
- Ruad traffic
Smart cities ensure
efficiently as possible
implement smart trail

cb
The components required for the implementation arc Digital tem~ratur~ sensor
OS18B20, Resistor4.7k0 , Breadboard, Jumper Wires, Connectm_g Wires and
Smart traffic solution
drivers' smart phone!
At the same time, s
allow monitoring g
current traffic situati
smart solutions for t
. k't The resistor is utilized as a 'pull-up' for the data-hne. It ensures take measures to pre
Raspberry P, , . . . . ..r ti
that the \-Wire data tine is at a described logic level, and hm1ts mtcucrence ~m for example, being
mechanical sound if the pin is let\ t\oating. Jumper wires is used for int~rfac1_ng has implemented a s
raspberry and sensor. Connect GPIO GND [Pin 61 on the Pi to the 1~e.gat1v_e side and closed-circuit le
of the breadboard and link GPIO 3.3V {Pin 1) on t~e Pi to the pos1t1ve side on a central traffic man
the breadboard. Next step is to insert the DS l 8B20+ mto the breadboard, C~nnect platform users of
OS\ 8820+ GND [Pin I] to the negative side of the breadboard and+ VDD [P~n 31 to Additionally, the ci
the positive side of the breadboard. Next step is to place one arm of 4.7kll res1~tor to make second-by-s1
OS 18B20+ DQ [Pin 21 and other end to the positive side of the breadboard. Finally, conditions in real t
link OS t 8820+ DQ [Pin 21 lo GPIO 4 [Pin 7] alongside a jumper wire.

18
• Jl.,(,t'\,f!, I Ju½12019

~g your OR
a. Explain in detail smart city loT Architecture • (08 Marks)
ReferQ. no 10 b. ofModel paper2
b. With case study explain smart and connected cities using Raspberry PI
(08 Marks)

IoT-enabled smart city use cases span multiple areas: from contributing to a healthier
environment and improving traffic to enhancing public safety and optimizing street
lighting. Below, we provide an overview of the most popular use cases that are
already implemented in smart cities across the globe.
Road traffic
Smart cities ensure that their citizens get from point A to point B as safely and
efficiently as possible. To achieve this, municipalities turn to IoT development and
implement smart traffic solutions.
Smart traffic solutions use different types of sensors, as well as fetch GPS data from
drivers' smart phones to determine the number, location and the speed of vehicles.
At the same time, smart traffic lights connected to a cloud management platform
allow monitoring green light timings and automatically alter the lights based on
current traffic situation to prevent congestion. Additionally, using historical data,
ensures smart solutions for traffic management can predict where the traffic could go and
ce from take measures to prevent potential congestion.
rfacing For example, being one of the most traffic-affected cities in the world, Los Angeles
ive side has implemented a smart traffic solution to control traffic flow. Road-surface sensors
side on and closed-circuit television cameras send real-time updates about the traffic flow to
Connect a central traffic management platform. The platform analyzes the data and notifies the
Pin 3) to platform users of congestion and traffic signal malfunctions via desktop user apps.
sistor to Additionally, the city is deploying a network of smart controllers to automatically
Finally, make second-by-second traffic lights adjustments, reacting to changing traffic
conditions in real time.

19
CBCS · Veo-2019 / J~
b. E;,i.plaio Access Ne;:
Eight Semester B.E. Degree Examination, CBCS - D"c 2019 / ,fan 2020
Ans. The last mile of the I
Internet of Things Technology
wireless technologies
Time: 3 hrs. Max. Marks: 80 to the access network
Note : Ans"'"' an)' FIVE fi,t/ qu.tstlons, selecting ONE full question from end, module.

Module- I t.FC
CM1
1. a. What is IOT? Explain evolutionai;y phases of the internet, (06 Marks) I
< 10 trn
Ans. Internet of Things (IOT) is an ecosystem of connected physical objects that ar~
accessible through the internet. The 'thing' in IOT could be a person with a heart
monitor or an automobile with built-in-sensors, i.e. objects that have been assigucd
an IP address and have the ability to collect and transfer data over a network without
manual assistance or intervention. The embedded technology in the objects helps
them to interact with internal states or the external environment, which in turn affects
the decisions taken. l. j
I J,,
.,,,.
l¥FM
fNtr. ..rYI
\'4\"l'fot'lllfl

CanMClhtly
BualMH WPA.~ W1ctes. P
oo,r@,._ WHAN. V/reeis
and
Soclelal •C....i WFAN Wi'l!less F
•-1 •'t/d,BrowM,
• s.,,,a,
WlAN W',..es, •

As IOT continues
applications and sp
required. !OT somet
match more or less
Internet Phase Definition
technologies w.ere d
Connectivity (Digitize lbis phase connected people to email, web services, One key parameter
access) and search so that information is easily accessed. the smart object a
This phase enabled e-commerce and supply chain technologies you
Networked Economy
enhancements along with co\\aborative engagement to distances.
(Digitize business) drive increased efficiency in business processes. Range estimates a
This phase extended the lnternet experience to the vertica\ where
as follows:
lmmersive Experiences encompass widespread video and social media while 1. PAN (persona
(Digitize internet ions) alwnys being connected through mobility. More and around a pers
more ap lications are moved into the cloud.
2. HAN(home
This phase is adding connectivity to objects and wireless tcchn
Internet of Things machines in the world around us to enable new services
(Digitize the world) 3. NAN (neighb
and ex riences. It is connectin the unconnected. NAN is often
4. FAN (field a1
meters. FAN

20
CS - Veo2019 / J<M\12020
a,. Explain Access Network sublayer witb a neat diagnm. . . . (06 Marks)
The last mile of the IOT network is the access network. This 1s typically made up of
wireless technologies such as 802. 1lah, 802. I5.4g, and LoRa. The sensors connected
to the access network may also be wired.

:JCFrus.
y r OSM. '/.'COVA.
lsFC E:-Of'fl3, • 3CPP "'3 1o-
EM'/ < 6 km Vl'l'°'A)
I
< 10cm
\
GelUat (lir~,....,,
Rongeo '\J4 9'0CI\'
AN Ari LPW
-1 W~l>N WFAN ~-.
,_
LPWA (Ut~Lla,n.<,ed)

/ \
~L \
IOA ,:'A.> 1, fJJIM'M, «2,,.~..., '-..
tJ-IUttO..~,l"'N
~.,,,.,.,
··--
00,(.
ZYl.,..;t w-,.••.•p ... r-,1or ~JbM-,.lCl,ol.f-A~
l•~fllreWJI.I ,.,.~ ')Nit.a ~ 11r•""~1 IA,cyO'-'t.
w...itc,..,.. .UIIMI\Vlft~
~ ,... (.f,.lit~
""'70.-.1>

WPA'i Wrelffl PlnOIIII "'"" 14•'-" WNAN; w,~ Net,r(,QttUJd Arw Nt;W<Jlk
wttm V/ue" Hom:, A11111 Nelwol\ WWAtl. W1R:l~s Wide Arda NtiMlll
WFAN Wireless F1et0 (OI FIICIOly) AIU NBIWOrll LPWA LOW Power Wkle Arel
Wt.AN Wireles, Local Alea Ne~
As IOT continues to grow exponentially, you will encounter a wide variety of
applications and special use cases. For each of them, an access technology will be
required. IOT sometimes reuses existing access technologies whose characteristics
match more or less closely the JOT use case requirements. Whereas some access
technologies w.ere developed specifically for IOT use cases, others were not.
One key parameter determining the choice ofaccess technology is the range between
the smart object and the information collector. Above Figure lists some access
technologies you may encounter in the IOT world 1tnd the expected transmission
distances.
Range estimates are grouped by category names that illustrate the environment or
the vertical where data collection over that range is expected. Common groups are
as follows:
1. PAN (penonal area network): Scale ofa few meters. This is the personal space
around a person. A commo~ wireless technology for this scale is Bluetooth.
2. HAN (home an:a network): Scale ofa few tens of meters. At this scale, common
wireless technologies for JOT include ZigBee and Bluetooth Low Energy (BLE).
3. NAN (neighborhood area network): Scale ofa few hundreds of meters. The term
NAN is often used to refer to a group of house units from which data is collected.
4. FAN (field area network): Scale of several tens of meters to several hundred
meters. FAN typically refers to an outdoor area larger than a single group of

Zl
· Veo2019
VIII Sem, (CSE I ISE)
This concept!
house units. The FAN is often seen as "open space" (and therefore not secured
and not controlled). A FAN is sometimes viewc:<l as a group ofNANs, but some and applicatij
verticals sec the FAN as a group of HANs or a group of smaller outdoor cells. specifications
As you can set:, fAN and NAN may sometimes be used interchangeably. In most be readily elll
cases, the vertical context is clear enough to determine the grouping hierarchy. connecting th
5. LAN (local area network): Scale of up to I00 m. This term is very common servers, whij
in 111:tworking, and it is therefore also commonly used in the JOT space when of oneM2M
standard networking technologies (such as Ethernet or IEEE 802.11) are used. business dom
Other networking classifications, such as MAN (metropolitan area network, with utility, industi
a range ofup to a few kilometers) and WAN (wide area network, with a range of J, Network la)1
more than a few kilometers), an: also commonly used. endpoints. It •
Note that for all these places in the IOT network, a " W" can be added to specifically that Iinks th
indicate wireless technologies used in that space. For example, HomePlug is a wired wireless mes
technology found in a HAN environment, but a HAN is often referred to as a WHAN multipoint sys
(wireless home area network) when a wireless technology, like ZigBce, is used in connectjons, s
that space.
c. What are the elements of one M2MIOT Architecture? Explain. (04 Murks) Explain the lune!
Ans. IP, TCP, and UDP
to take care of d
Multiple protocol

-...... problems. Some n•


interval or based
application querie
also possible.
Following the IP I
transfer phase. Afi
.use the client part
and then data can

AppO:.IIICIIC L.ayor,
Smart Ene,gy
SOr.i;:os Ll)'ar
0f'OM2t.1 •l'Glude~ II CO<TUYIOn
l\,OIW0f1t Ul)OI:
Applicalt0·~ 1al~ 10
HTTP is somethiri
environments wit
I
• Asset Trac~1~g
• Fleel 1-.la!laQement
servlCes horizontal rrame111ork
supportmg RS51lol APls
the APlc to communlcQte
10 l!el\S0'3 packet failure. D,
suggested for the
1. Applications layer: The oneM2M architecture gives major attention to
HTMLS specifica
connectivity between devices and their applications. This domain includes
connection. Some
the application-layer protocols and attempts to standardize northbound AP! the smart object ,
definitions for interaction with business intelligence (Bl) systems. Applications other protocols, s
tend to be industry-specific and have their own sets of data models, and thus they the communicaticil
are shown as vertical entities.
2. Services layer: This layer is shown as a horizontal framework across the vertical .. Describe JOT W(
industry applications. At this layer, horizontal modules include the physical In 20 14 the 101
network that the JOT applications run on, the underlying management protocols,
and the hardware. Examples include baekhaul communications via cellular,
MPLS networks, VPNs, and so on. Riding on top is the common services layer.

22
CS • 'Dec-2019 / ]CM\/2020
secured This conceptual layer adds A Pis and middlewarc supporting third-party sc:rvic\:S
ut some and applications. One of the stated goals of oneM2M is to "develop technical
, or cells. specifications which address the need for a common M2M Service Layer that can
. In most be readily embedded within various hardware and software nodes, and rely upon
rarchy. connecting the myriad of devices in the field area network to M2M application
common servers, which typically reside in a cloud or data center." A critical objective
of oneM2M is to attract and actively involve organizations from M2M-related
re used. business domains, including h:lematics and intelligent transportation, healthcare,
rk, with utility, industrial automation, and smart home applications, to name just a few.
range of 3. Network layer: This is the communication domain for the JOT devices and
endpoints. It includes the devices themselves and the communications network
~cifically that links them. Embodiments of this communications infrastructure include
a wired wireless mesh technologies, such as IEEE 802.15.4, and wireless point-to-
WHAN multipoint systems, such as IEEE 801.1 lah. Also included are wired device
s used in connections, such as IEEE 1901 power line communications.
OR
Explain the functionality ofIOT network management sublayer. (05 Marks)
IP, TCP, and UDP bring connectivity to JOT networks. Upper-layer protocols need
to take care of data transmission between the smart objects and other systems.
Multiple protocols have been leveraged or created to solve IOT data communication
problems. Some networks rely on a push model (that is, a sensor reports at a regular
interval or based on a local trigger), whereas others rely on a pull model (that is, an
application queries the sensor over the network), and multiple hybrid approaches are
also possible.
Following the IP logic, some IOT implementers have suggested HTTP for the data
transfer phase. After all, HTTP has a client and server component. The sensor could
.use the client part to establish a connection to the JOT central application (the server),
and then data can be exchanged. You can find HTTP in some IOT applications, but
kl HTTP is something of a fat protocol and was not designed to operate in constrained
unica111 environments with low memory, low power, low bandwidth, and a high rate of
packet failure. Despite these limitations, other web-derived protocols have been
ntion to suggested for the !OT space. One example is WebSocket. WebSocket is part of the
includes HTMLS specification, and provides a simple bidirectional connection over a single
und API connection. Some lOT solutions use WebSocket to manage the connection between
plications the smart object and an external application. WebSocket is often combined with
thus they other protocols, such as MQTT (described shortly) to handle the JOT-specific part of
the communication.
e vertical b. Describe IOT World Forum (IOTWF) stan~ardize architecture. (07 Marks)
i physical Ans. In 2014 the IOTWF architectural committee (led by Cisco, IBM, Rockwell
rotocols, Automation, and others) published a seven-layer JOT architectural reference model.
l cellular,
While various JOT reference models exist, the one put forth by the IOT World Forum
·ces layer.

Z3
VIII Sern; ( CSE I IS'f)
offers a clean, simplified perspective on IOT and includes edge computing, data
storage, and access. lt provides a succinct way of visualizing lOT from a technical
perspective. Each of the seven layers is broken down into specific functions, and this includes sensors
security encompasses the entire model. Above Figure details the IOT Reference and so on. The top o
Model published by the IOTWF. databases, and applic1


IT. In the past, OT an
a.a.borallon & p,_.... to even talk to each o
(1""'1Mr9 Pooplo &--. """'"-l

G With a neat diagra


oaui 111>s1racuon world. Classify actu
~&Ace,..,)
Sensors are designed
Data Accumulaflon
(- ego) the physical world. T
signals or digital rep
e Edge Compullng
(Cata E_,...,.,.~.,. & fransbnnatan) device or a human).
e Conntetlvlty
(Cotnmonbllln & PrOCNllng Ul41)
signal (commonly a
effect, usually some t,
0 Physical Dtvktl .. Conllollell
{ll'le •n.inoa•in IO'r}

the IOT Reference Model defines a set of levefs with control flowing from the center
(this could be either a cloud service or a dedicated data center), to the edge, which
RealW
includes sensors, devices, machines, and other types of intelligent end nodes. ln Physical
~eneral,data travels up the stack, originating from the edge, and goes northbound to
the center.
c. Compare and contrast IT and OT.
Ans.

Much like sensors, a


Some common way
Query Cctn ot Non real
8i.t~CU Rest Tlme
motion: Actuators c
example, linear, rota
... ...
Evant D~t-lh
....
Ro.ii
• Power: Actuators
power, low powe
685ec Motio, Time • Binary or contin
state outputs.
• Area of applicati
vertical where thl
• Type of energy: ,
Typt
Mechanical actuiito
An interesting aspect of visualizing an IOT architecture this way is that you can start
to organize responsibilities along IT and OT lines. Above Figure illustrates a natural Elec'trical actuators
demarcation point between lT and OT in the JOT Reference Model framework.

24
· Deo2019 I JMl/2020
IOT systcims have to cross several boundaries beyond just the functional layers. The
bottom of the stack is generally in the domain of OT. For an industry like oil and gas,
this includes sensors and devices connected to pipelines, oil rigs, refinery machinery,
and so on. The top of the stack is in the IT area and includes things like the servers,
databases, and applications, all of which run on a part of the network controlled by
IT. In the past, OT and IT have generally been very independent and had little need
to even talk to each other. IOT is changing that paradigm.
Module-2
With a neat diagram, explain bow actuators and sensors intctact with physical
world. Classify actuators based on energy type. (08 Marks)
Sensors are designed to sense and measure practically any measurable variable in
the physical world. They convert their measurements (typically analog) into electric
signals or digital representations that can be consumed by an intelligent agent (a
device or a human). Actuators, on the others hand, receive some type of control
signal (commonly an electric signal or digital command) that triggers a physical
effect, usually some type of motion, force, and so on.
,-
Sense

Sensors
.,. ___
M.iasum

._______
Real World - Digtlal Aeptese11tal100 -
Phyaicel Envirorvnenl Eleclnc Signal

useful Act
Work

Much like sensors, actuators also vary greatly in function, size, design, and so on.
Some common ways that they can be classified include the following: · Type of
t-cn rcol motion: Actuators can be classified based on the type of motion they produce (for
...
Twm,
example, linear, rotary, one/two/three-axes).
• Power: Actuators can be classified based on their power output ( for example, high

Ro.ii
Time
power, low power, micro power)
• Binary or continuous: Actuators·can be classified based on the number of stable-
state outputs.
• Area of application: Actuators can be classified based on the specific industry or
vertical where they are used.
• Type of energy: Actuators can be classified based on their energy type.
Type Examples
Mechanical actu11tors Lever, screw jack, hand crank
Electrical actuators Thyristor, biopolar transistor, diode

2S
VIII Se,m, ( CSE ( ISE) •
Electromechanical actuators AC motor, DC motor, step motor
Electromagnetic actuators Electromagnet, linear solenoid
Hydraulic and pneumatic actuators
Hydraulic cylinder, pneumatic cylinder, piston,
pressure control valves, air motors
Smart material actuators (includes Shape memory alloy (SMA), ion exchange fluid,
ti I d - t ) magnetorestricrive material, bimetallic strip,
1erma an magne11c ac1ua ors . • b" h
p1ezoc1cctric 1morp
Micro- and nanoactuators Electrostatic motor, microvalve, comb drive

b. List out the limitations of the smart objects in WSNs and explain the data
aggregation in WSN with a near diagram. (08 Marks)
Ans. The following are some of the most significant limitations of the smart objects in
WSNs:
I. Limited processing power
2. Limited memory
3. Lossy communication
4. Limited transmission speeds
S. Limited power

Above Figure shows an example of such a data aggregation function in a WSN


where temperature readings from a logical grouping of temperature sensors are
aggregated as an average temperature reading. These data aggregation techniques
are helpful in reducing the amount of overall traffic (and energy) in WSNs with

26
· Veo2019 I J~2020
very large numbers of deployed smart objects. While there are certain instances in
which sensors continuously stream their measurement data, this is typically not the
case. Wirelessly connected smart objects generally have one of the following two
communication patterns:
I. Event-driven: Transmission of sensory information is triggered only when a
smart object detects a particular event or predetermined threshold.
2. Periodic: Transmission of sensory information occurs only at periodic intervals.
The decision of which of these communication schemes is used depends greatly on
the specific application. For example, in some medical use cases, sensors periodically
send postoperative vitals, such as temperature or blood pressure readings. In other
medical .use cases, the same blood pressure or temperature readings are triggered to
be sent only when certain critically low or high readings are measured.
OR
What is Zigbee? Explain 802.15.4 physical layer, MAC layer, and security.
(08 Marks)
Zigbee is based on the IEEE's 802.15.4 personal-area network standard. All you
need to know is that Zigbee is a specification that's been around for more than a
decade, and it's widely considered an alternative to Wi-Fi and Bluetooth for some
applications including low-powered devices that don't require a lot of bandwidth -
like your smart home sensors.
Physical Layer
The 802. 15.4 standard supports an extensive number of PHY options that range
from 2.4 GHz to sub-GHz frequencies in [SM bands. The original IEEE 802.15.4-
2003 standard specifiecl only three Pl IY options based on direct sequence spread
spectrum (DSSS) modulation. DSSS is a modulation technique in which a signal is
intentionally spread in the frequency domain, resulting in greater bandwidth. The
original physical layer transmission options were as follows: 2.4 GHz, 16 channels.
with a data rate of 250 kbps 915 MHz, IO channels, with a data rate of 40 kbps 868
MHz, I channel, with a data rate of20 kbps
6 Bytes o - 127 Bytes

Start of frame Frame


Preamble Deimitor

Synchronlzat.on Head~r PHY Header /

5 Byles
... 1 Bvte ►

MAC Layer
The IEEE 802.15.4 MAC layer manages access to the PHY·channel by defining
how devices in the same area wills~ 1bc fNqueocies allocated. At this layer, the
scheduling and routing ofdata frames are also coordinated. The 802.15.4 MAC layer ·

MM' 27
VIII Sem, (CS£ ( I S£)
performs the following tasks:
• Network beaconing for devices acting as coordinators (New devices use beacons
to join an 802.15.4 network)
• PAN association and disassociation by a device
• Device security
• Reliable link communications between two peer MAC entities
The MAC layer achieves these tasks by using various predefined frame types. Ln
fact, four types of MAC frames are specified in 802.15 .4:
• Data frame: Handles all transfers of data ~ SA<:11rir1 E~11b111rl
• Beacon frame: Used in the transmission of beacons from a PAN coordinator h1t 1nHRmft
Conu ol IS S8l lO l .
• Acknowledgement frame: Confirms the successful reception of a frame
• MAC command frame: Responsible for control communication between devices b. Explain LoRaWAN stan
MAC Frame 4 -201:fy!a• MAC Layer
26 18
var able The MAC layer is defined
of the LoRa physical la)I
battery life and ensure d
The LoRaWAN specifica
• Class A: This class is
nodes, it allows bidir
receive downstream t
after each transmissio
• Class B: This class
can be better defined
0·127 B';tes
windows compared t
beaconing process.
Security • Class C: This class i
The lEEE 802.15.4 specification uses Advanced Encryption Standard (AES) with a enables a node to be
128-bit key length as the base encryption algorithm for securing its data. Established when not transmitti
by the US National Institute of Standards nnd Technology in 200 I, AES is a block
1 Byte
cipher, which means it operates on fixed-size blocks of data. The use of AES by
the US government and its widespread adoption in the private sector has he lped it
become one of the most popular algorithms used in symmetric key cryptography. MAC Header (MHDR)
(A symmetric key means that the same key is used for both the encryption and
decryption of the data.) Security
. In addition to encrypting the data, AES in 802. I 5.4 also validates the data that is Security in a LoRa
sent. This is accomplished by a message integrity code (MIC), which is calculated architecture, as detail
for the entire frame using the same AES key that is used for encryption. of security, protecting
The first layer, called '
·the authentication ofth
LoRaWAN packets by

28 ..._..~~M'
CS - Vro2019 I ]<M112020
4 -20 Sylas
devices use beacons

ies
ned frame types. In
® Auxil ary Securify Header
IIPld ,s added to MAC lrllrM

::I) Security En~bled


t\N coordinator hit 1n Hame
ofa frame c omrol Is seI 10 1
tion between devices
b. Explain LoRaWAN standard and alliance MAC layer and security (08 Marks)
Ans. MAC Layer
28 The MAC layer is defined in the LoRaWAN specification. This layer takes advantage
of the LoRa physical layer and classifies LoRaWAN endpoints to optimize their
battery life and ensure downstream communications to the LoRaWAN endpoints.
The LoRaWAN specification documents three classes of LoRaWAN devices:
• Class A: This class is the default implementation. Optimized for battery powered
nodes, it allows bidirectional communications, where a given node is able to
receive downstream traffic after transmitting. Two receive windows are available
after each transmission.
• Class B: This class was designated "experimental" in Lo RaWAN 1.0. 1 until it
can be better defined. A Class 8 node or endpoint should get additional receive
windows compared to Class A, but gateways must be synchronized through a
beaconing process.
(AES) with a • Class C: This class is particularly adapted for powered nodes. This classification
its data. Established enables a node to be continuously listening by keeping its receive window open
200 I. AES is a block when not transmitting.
The use of AES by
xctur h115 helped it
1 Byte Variable 4 Bytes
..
by c,yptography. MAC Header (MHDA) Message Integrity
111c encryption and Code (MIC)

Security
Security in a LoRaWAN deployment applies to different components of the
architecture, as detailed in Figure. LoRaWAN endpoints must implement two layers
of security, protecting communications and data privacy across the network.
The first layer, called "network security" but applied at the MAC layer, guarantees
'the authentication of the endpoints by the LoRaWAN network server. Also, it protects
LoRaWAN packets by perfonning encryption based on AES.
VIII Sem; (CSE I ISE) · Veo2019 / J,

1\\e io\\\ \)fOCtd\l


the join process,
request and join a
LoRaWAN netwo
and AppKey. The
AppSKey keys.

With a neat diagra


fragmentation.
Header Compressio
(i) Oe'/ice 'flCIYPIS dala em-to-end @ NS decryplS USlll9 net~ort. ~ey 1Pv6 header compres
® Sepanue nar,,Oflc enayPt lo NS @ NS iorN".mls pad<Pt 10 relevanl NS subsequently updated
byte headers and Use
@ 0.,,,ce send$ a pad<eC (1) AS decrypts us.ng app tu;y as 6 bytes combined i
© Al APs In range,_ PQcilet @ NS &elects beSI AP for return TX is only defined for an
Each endpoint implements a network session key (NwkSKey), used by both itself support 1Pv4, and, in
and the LoRaWAN network server. The NwkSKey ensures data integrity through 802.15.4. 6LoWPAN
computing and checking the MIC of every data message as well as encrypting and complicated. Howeve
decrypting MAC-only data message payloads. as implementation of
The second layer is an application session key (AppSKey), which performs various 1Pv6 addressi~
encryption and decryption functions between the endpoint and its application server. use case and how the
Furthermore, it computes and checks the application-level MIC, if included. This
ensures that the LoRaWAN service provider does not have access to the application 18
payload if it is not allowed that access. Endpoints receive their AES-128 application
key (AppKey) from the application owner. This key is most likely derived from
an application specific root key exclusively known to and under the control of the
6LoWPAN Head
application provider.
For production deployments, it is expected that the LoRaWAN gateways are
protected as well, for both the LoRaWAN traffic and the network management and
operations over their backhaul link(s). This can be done using traditional VPN ana 2B 4B
IPsec technologies that demonstrate scaling in traditional IT deployments. Additional i=-~
: -,'..
9-A-I IUDP
security add-ons are under evaluation by the LoRaWAN Alliance for future revisions
, of the specification.
t
6LoWPAN Headen
LoRaWAN endpoints attached to a LoRaWAN network must get registered and with Compressed
1Pv6 Header
authenticated. This can be achieved through one of the two join mechanisms:
• Activation by personalization (ADP): Endpoints don't need to run a join Fragmentation
procedure as their individual details, including DevAddr and the NwkSKey and The maximum trans
AppSKey session keys, are preconfigured and stored in the end device. This same bytes. The term MT
information is registered in the LoRaWAN network server. passed. For IEEE 802
• Over-the-air activation (OTAA): Endpoints ar.e allowed to dynamically join a because 1Pv6, with a
particular LoRa WAN network after successfully going through a join procedure. much smaller one. To

30
· Veo2019 ( ]CU'l.12020
The join procedure must be done every time a session context is renewed. During
the join process, which involves the sending and receiving of MAC layer join
request and join accept messages, the node establishes its credentials with a
LoRaWAN network server, exchanging its globally unique DevEUI, AppEUI,
and AppKey. The AppKey is then used to derive the session NwkSKey and
AppSKey keys.
Module-3
With o ncot diagram, explain 6LOWPAN protocol heuder compression and
fragmentation. (08 Marks)
Header Compression
IPv6 header compression for 6LoWPAN was defined initially in RFC 4944 and
ne!work !ey
subsequently updated by RFC 6282. This capability shrinks the size of IPv6's 40-
IO - ~nl NS
byte headers and User Datagram Protocol's (UDP's) 8-byte headers down as low
as 6 bytes combined in some cases. Note that header compression for 6LoWPAN
,-., 11,turn TX
is only defined for an 1Pv6 header and not 1Pv4. The 6LoWPAN protocol does not
support 1Pv4, and, in fact, there is no standardized 1Pv4 adaptation layer for IEEE
. Key), used by both itself
802.15.4. 6LoWPAN header compression is stateless, and conceptually it is not too
s data integrity through
complicated. However, n number of factors affect the amount of compression, such
as well as encrypting and
as implementation of RFC 4944 versus RFC 6922, whether UDP is included, and
various IPv6 addressing scenarios. It is beyond the scope of this book to cover every
SKey), which performs
use case and how the header ~elds change for each.
and its application server.
6LoWPAN Without Header COmpreaslon
I MIC, if included. This
4--- - -- - --127 Byte IEEE 802.15.4 Frame -------♦
access to the application
1B 408 88 53B
ir AES-128 application
likely derived from r.........::=-...a....-r.--- 1Pv6 _ _ _ ~ ~
UDP [ .............:.~ ~.......... ~
aider the control of the t
6LoWPAN Header
gateways are llLoWPAN With tPYll and UDP Header Compression
management and - - - - - - - - 127 Byte IEEE 802.15.4 Frame - - - - - - - •
iooaJ VPN and 2B 4B 1088
!llts, Additional
ra!all111rmen._ Payload
or future revisions
J ~DPj
- .... ···~· ..
t
6LoWPAN Header
ust get registered and with Compressed
1Pv6 Header
join mechanisms:
't need to run a join Fragmentation
and the NwkSKey and The maximum transmission unit (MTU) for an 1Pv6 network must be at least 1280
e end device. This same bytes. The term MTU defines the size of the largest protocol data unit that can be
passed. For IEEE 802.15.4, 127 bytes is the MTU. You can see that this is a problem
to dynamically join a because fPv6, with a much larger MTU, is carried inside the 802.15.4 frame with a
rough a join procedure. much smaller one. To remedy this situation, large TPv6 packets must be fragmented

31
VIII Sem- ( CSE ( I S£) In.ru-rie:t o f ~ r ~
across multiple 802.15.4 frames a(Layer 2.
The fragment header utilized by 6LoWPAN is composed of three prima~ fields:
Datagram Size, Datagram Tag, and Datagram Offset. The I-byte Datagram Size field
specifics the total size of the unfragmented payload. Datagram Tag identifies the set
of fragments for a payload. Finally, the Datagram Offset field delineates how far
into a payload a particular fragment occurs. Below figure provides an overview of a
6LoWPAN fragmentation header.
6LoWPAN Fragmentation Hoador

18 18 28 18
Datagram Tag ~

r
Datagram
r
Datagram Offset
applicability ofexistin
their suitability for ce
solutions are validated
Size (F,eld Not Present
Oo First Fragment) the Datagram Transpo
6LoWPAN DICE
Fragmentation New generations of co
Header
access networks are e
b, List and explain the key advantage/I of Internet protocol. when implementing U
(04 Marks)
Ans.
• Adoption is just a matter of time The Internet Protocol is a must and a
requirement for any Internet connectivity. It is the addressing scheme for any data
transfer on the web. The limited address capacity of its predecessor, IPv4, has
made the transition to 1Pv6 unavoidable. Google's figures arc revealing an 1Pv6
adoption rate following an exponential curve, doubling every 6 months.
• Scalability: 1Pv6 offers a highly scalable address scheme. The present scheme.
of Internet Governance provides at most 2 x IO 19 unique, globally routable,
addresses. This is many orders of magnitude more that the 2 x I09 that is possible·
with 1Pv4 and the IO 13 that is the largest estimate oflOT devices that will be used
this century. It is quite sufficient to address the needs of any present and future
communicating device still allowing it to have many addresses. ·
• Solving the NAT barrier: Due to the limits of the 1Pv4 address space, the current
Internet had to adopt a stopgap solution to face its unplanned expansion: the
Network Address Translation (NAT). It enables several users and devices to share
the same public IP address. This solution is working but with two main trades-
off: The NAT users are borrowing and sharing IP addresses with others. While olapplicaf
this technique allows single stakeholders to mount large applications, it becomes IDlutions. I
completely unmanageable if the same end-points are to be used by many different IEEE IBIS 201
stakeholders; this would occur in an IOT deployment where the same sensors protocol over IP can
are to be used by multiple, independent, stakeholders. Secondly the mechanism or UDP or by installi
cannot be used to access specific end-points from the Internet. between the serial pro

32
• Malti-Stakeholder Support: 1Pv6 provides for end devices to have multiple
addresses and an even more distribllled routing mechanism than the IPv4 Internet.
This allows different stakeholders ID assign IOT end-device addresses that are
consistent with their own applicaliaa and network practices. Thus multiple
mkeholders can deploy lhcir owa lpplications, sharing a common sensor/
actuation infrastructure, wilhoul I lang the technical operation or governance
of lhe Internet.
Esplain Rl'L encryption and autheadalioa on constraint node. (04 l\:larks)
ACE
Much like the RoLL working group, the Authentication and Authorization for
Constrained Environments (ACE) working group is tasked with evaluating the
applicability ofexisting authentication and authorization protocols and documenting
their suitability for certain constrained-environment use cases. Once the candidate
solutions are validated, the ACE working group will focus its work on CoAP with
the Datagram Trnnsport Luycr Security (DTLS) protocol.
DICE
New generations of constrained nodes implementing an IP stack over constrained
access networks are expected to run an optimized lP protocol stack. for example,
when implementing UDP at the transport layer, the IETF Constrained Application
Protocol (CoAP) should be used at the application layer. In constrained environments
secured by DTLS, CoAP can be used to control resources on a device. (Constrained
environments are network situations where constrained nodes and/or constrained
networks arc present.
The DTLS in Constrained Environments (DICE) working group focuses on
implementing the DTLS transport layer security protocol in these environments.
The first task of the DICE working group is to define an optimized DTLS profile
for constrained nodes. In addition, the DICE working group is considering the
applicability oflhe DTLS record layer to secure multica.~t messages and investigating
how the DTLS handshake in constrained environments can get optimized.
OR
Explain tunneling legacy SCADA over IP networks and SCADA protocol
translation with a neat diagram. (08 Marks)
Deployments of legacy industrial protocols, such as DNP3 and other SCADA
protocols, in modern IP networks call for flexibility when integrating several
generations of devices or operations that are tied to various releases and versions
of application servers. Native support for IP can vary and may require different
solutions. Ideally, end-to-end native IP support is preferred, using a solution like
IEEE 1815 2012 in the case of DNP3. Otherwise, transport of the original serial
protocol over IP can be achieved either by tunneling using raw sockets over TCP
or UDP or by installing an intermediate device that performs protocol translation
between the serial protocol version and its IP implementation.

33
VIII Se,m,, ( CSc I ISc) - oeo2019 I Ji

connection to their ~
RTUs both ends of the linW
over the IP networ
piece of software is
to IP ports. This so
serial redirector in e
Raw Socket ~ aster and m ent Set-Up converts it to a TCP
Detween Routers and SCADA Server the SCADA server s
.i SCA DA
B, where a router o
serial ports to IP po
RTUs Server socket connections.
L:.;...--=-~-=--=-:=:::" Application Comrnunieates Through
COM Port~ Mapped to IP+ TCP/UDP DeKribe MQTT fi
Pons by IP/Senal HPCllrl!Ctor SW Message Queuing
Scenllrio B: Raw Socket between Router and SCADA Server - no SCADA appllcatlon change At th'e end of the I
on ....., but IPfSerlal Redirector aoftwar• and Ethernet Interface to be added Eurotech) were loo
Raw Soclcet Master and Cl .nt Set-Up monitor and control
Between Routers and SCADA Server location, as typicall
! SCADA
the development an
(MQIT) protocol th
ATU:i Server
Appllcanon Communicates Otreclly of Structured lnfor~
over Raw Socket IP + TCP/UDP Ports oil and gas industri
designed, with con
Soenarto C: R- Socket between Router Ind SCADA Server - SCADA appflcallon knows communications, am
how to clreotly COlllmunlcate over II Raw Socket end Ethernet Interface some of the rations
framework based on
Araw socket connection simply denotes that the serial data is being packaged directly
into a TCP or UDP transport. A socket in this instance is a standard application
programming interface (API) composed of an IP address and a TCP or UDP port
that is used to access network devices over an IP network. More modem industrial
application servers may support this capability, while older versions typically require
another device or piece of software to handle the transition from pure serial data to
serial over IP using a raw socket. Above Figure details raw socket scenarios for a
legacy SCADA server trying to communicate with remote serial devices.
In all the scenarios in above figure, notice that routers connect via serial interfaces to
the remote tenninal units (RTUs), which are often associated with SCADA networks.
An RTU is a multipurpose device used to monitor and control various systems, MQIT is a lightwei
applications, and devices managing automation. From the master/slave perspective, fixed header with op
the RTUs are the slaves. Opposite the RTUs in each Figure scenario is a $CADA note that a control pa
server, or master, that varies its connection type. In reality, other legacy industrial an overview of the

34
- Dec-2019 / Jouv2020

application servers could be shown here as well.


In Scenario A in Figure, both the SCADA server and the RTUs have a direct serial
connection to their respective routers. The routers terminate the serial connections at
both ends of the link and use raw socket encapsulation to transport the serial payload
over the IP network. Scenario B has a small change on the SCADA server side. A
piece of software is installed on the SCADA server that maps the serial COM ports
to IP ports. This software is commonly referred to as an IP/serial redirector. The IP/
serial redirector in essence terminates the serial connection of the SCADA server and
converts it to a TCP/IP port using a raw socket coMection. In Scenario C in Figure,
the SCADA server supports native raw socket capability. Unlike in Scenarios A and
B, where a router or lP/serial redirector software has to map the SCADA server's
serial ports to IP ports, in Scenario C the SCADA server has full IP support for raw
socket connections.
b. Describe MQTI framework and messaiie format in detail. (08 Marks)
Message Queuing Telemetry Transport (MQTT)
At tire end of the 1990s, engineers from IBM and Arcom (acquired in 2006 by
Eurotech) were looking for a reliable, lightweight, and co.st-effective protocol to
monitor and control a large number of sensors and their data from a central server
location, as typically used by the oil and gas industries. Their research resulted in
the development and implementation of the Message Queuing Telemetry Transport
(MQTT) protocol that is now standardized by the Organization for the Advancement
of Structured Information Standards. Considering the harsh environments in the
oil and gas industries, an extremely simple protocol with only a few options was
designed. with considerations for constrained nodes, unreliable WAN backhaul
communications, and bandwidth constraints with variable latencies. These were
some of the rationales for the selection of a client/server and publish/subscribe
framework based on the TCP/IP architecture, as shown in below Figure.
Application

,,,,
.,, . MOTT Client
(Subscr1t>er)

,,' IVI MQTT C/ien/

MQTTC/16111 ~~~:.!!~~~..------ • (Sut>scribtlr)


{Publisher) ',,,
''·IVf• MOTTCl!ont
(Subserlbtlr)
Publish: Ternp/AH Subscribe to:Temp/RH
•--------
MQTT is a lightweight protocol because each control packet consists of a 2- byte
fixed header with optional variable header fields and optional payload. You should
note that a control packet can contain a payload up to 256 MB. Below Figure provides
an overview of the MQTT message format.

35
VIII Sem, ( CSf I I S£) · Vec,2019 / JtM\/202

~essage Type

Rema1111ng Leng1h

-
•--:'\,,_ ;-- ... --:;:,;; -·
-:¥-' ~ -•~' .. ".' • ;'. -
- •:"} _ -~-~ ·:~;. ~·:.i.
R..icl.
Pavk...U ro1,11,1n,111
SNitth
NameNode-
Message Type Value Flow Description
CONNECT I Client to server Request to connect
CONNACK 2 Server to client Connect acknowledgement
Client to server
PUBLISH 3 Publish message
Server to client
Client to server
PUBACK 4 Publish acknowledgement Hadoop relics on a scale-o
Server to client
Client to server and storage to distribute
PURREC s Server to client
Publish received MapReduce and HDFS tak
process massive amounts
Client to server
PUEREI. 6 Publish release nodes in the cluster. For H
Server to client
cluster, including NameN
Client to server Publish complete NameNodes: These are a
PTJBCOMP 7
Server to client
HDFS. They coordinate
SUBSCRIM 8 Client to server Subscribe request each block of data is sto
SUBACK 9 Server to client Subscribe acknowledgement is coordinated through the
UNSUBSCRIBE 10 Client to server Unsubscribe request NameNode notified of th
UNSUBACK 11 Server to client Unsubscribe acknowledgement NamcNode takes write
available nodes in configu
PINGRBQ 12 Client to server Ping reouest NameNode is also respo
PINGRESP 13 Server to client Ping response should occur.
DlSCONNLCT 14 Client to server Client disconnecting DataNodes: These are th
NameNode. It is commo
Module- 4 the.data. Data blocks are
7. a. Explain the elements ofHadoop with a neat diagram. _(~7 Marks) three, four, or more time
Ans. Hadoop is the most recent entrant into the data management _market,_ but it 1s arguably of the DataNodes, the D
the most popular choice as a data repository and processmg engine. Hadoo~ -:vas replication policies, to e
originally developed as a result of projects at Google and_ Yahoo!, and the original techniques such as Red
intent for Hadoop was to index millions of websites and quickly return search results not used for HDFS beca
for open source search engines. Initially, the project had two key ele'.11ents: redundancy with this rep
• Hadoop Distributed File System (HDFS): A system for stormg data across b. Explain neural networ
multiple nodes . . .
• MapReduce: A distributed processing engme that splits a large task mto smaller Neural networks are M
ones that can be run in parallel you took at a human fi

36
• 'Dec-2.019 / JOIA'\,,2020

Olstr burlon Swltr h -

Switth

'tXk R~,k R.xlc Rack N


l Z 3

Hadoop relies on a scale-out architecture that leverages local processing, memory,


and storage to distribute tasks and provide a scalable storage system for data. Both
MapRcduce and HDFS take advantage of this distributed architecture to store and
process massive amounts of data and are thus able to leverage resources, from all
nodes in the cluster. For HDFS, this capability is handled by specialized nodes in the
cluster, including NameNodes and DataNodes
NameNodes: These are a critical piece in data adds, moves, deletes, and reads on
HDFS. They coordinate where the data is stored, and maintain a map of where
each block of data is stored and where it is replicated. All interaction with HDFS
is coordinated through the primary (active) NameNode, with a secondary (standby)
NameNode notified of the changes in the event of a failure of the primary. The
NameNode takes write requests from clients and distributes those fi les across the
available nodes in configurable block sizes, usually 64 MB or 128 MB blocks. The
NameNode is also responsible for instructing the DataNodes where replication
should occur.
D:ataNodes: These are the servers where the data is stored at the direction of the
NameNode. It is common to have many DataNodes in a Hadoop cluster to store
the data. Data blocks are distributed across several nodes and often are replicated
three, four, or more times across nodes for redundancy. Once data is written to one
of the DataNodes, the DataNode selects two (or more) additional nodes, based on
replication policies, to ensure data redundancy across the cluster. Disk redundancy
techniques such as Redundant Array of Independent Disks (RAID) arc generally
not used for HDFS because the NameNodes and DataNodes coordinate block-level
redundancy with this replication technique.
b. Explain neural network In machine learning wltb a detailed example.
o smaller (05 Marks)
Neural networks arc ML methods that mimic the way the human brain works. When
you look at a human figure, multiple zones of your brain are activated to recognize

37
VIII Sem,, (CSE ( ISf) · Veo2019 /

colors, movements, facial expressions, and so on. Your brain combines these
elements to conclude that the shape you are seeing is human. Neural networks mimic
the same logic. The information goes through different algorithms (called units), • FNF Flow Mo
each of which i~ in charge of processing an aspect of the information. The resu_lting NetFlow cache
value ofone unit computation can be used directly or fed into another unit for further the flow record
processing to occur. In this case, the neural network is said to have several layers. record: match s~
For example, a neural network processing human image recognition may have two or characteristid
units in a first layer that determines whether the image has straight lines and sharp is the Flow Ex
angles because vehicles commonly have straight lines and sharp angles, and human information, in
Monitor includ
figures do not. If the image passes the first layer successfully (because there are no
or only a small percentage of sharp angles and straight lines), a second layer may the size of the c
look for different features (presence of face, arms, and so on), and then a third layer • FNF flow reco
might compare the image to images of various animals and conclude that the shape is values used to
a human (or not). The great efficiency of neural networks is that each unit processes predefined for
a simple test, and therefore computation is quite fast. This model is demonstrated record aggregat
How N•ural Nalwork1 Netflow. User-
A.eogni. • Oog In• Pholo in the flow reco
a wide range o1
expected that 1I
user-defined anl
example, secur
• FNF Exporte
Using the sho
an application
iJ1formation p
NetFlow Expo
of transport fo~
be configured
• Flow export f
collection and
• NetFlow expo
• NetFlow serve
export. It is oft
patterns.

Explain Formal
Refer Q. no 8 b o
b. Explain the puru

operational domai
an industrial demi

38
• Veo2019 / JC\.+'\12020

Describe the components ofFNF. (04 Marks)

• FNF Flow Monitor (NetFlow cache): The FNF Flow Monitor describes the
NetFlow cache or information stored in the cache. The Flow Monitor contains
the flow record definitions with key fields (used to create a flow, unique per flow
record: match statement) and non-key fields (collected with the flow as attributes
or characteristics of a flow) within the cache. Also, part of the Flow Monitor
is the flow Exporter, which contains information about the export of NetFlow
information, including the destination address of the NetFlow collector. The Flow
Monitor includes various cache characteristics, including timers for exporting,
the size of the cache, and, if required, the packet sampling rate.
• FNF Oow record: A flow record is a set of key and non-key NctFlow field
values used to characterize flows in the Netflow cache. Flow records may be
predefined for ease of use or customized and user defined. A typical predefined
record aggregates flow data and allows users to target common applications for
NetFlow. User-defined records allow selections of specific key or non-key fields
in the now record. The user-defined field is the key to Flexible NetFlow, allowing
a wide range of information to be characterized and exported by NetFlow. It is
expected that different network management applications will support specific
user-defined and predefined flow records based on what they are monitoring (for
example, security detection, traffic analysis, capacity planning).
• FNF Exporter: There are two primary methods for accessing NetFlow data:
Using the show commands at the command-line interface (CU), and using
an application reporting tool. NetFlow Export, unlike SNMP polling, pushes
information periodically to the Netflow rcponing collector. The Flexible
NetFlow Exporter allows the user to define where the export can be sent, the type
of transport for the export, and properties for the exp.ort. Multiple exporters can
be configured pc::r FlowMonitor.
• Flow export timers: Timers indicate how often flows should be exported to the
collection and reporting server. ,
• NctFlow export format: This simply indicates the type offfow reporting format.
• NetFlow server for collection and reporting: This is the destinlllion of the flow
export. It is often done with an analytics tool that looks for anomalies in the traffic
pattems.
OR
Explain Formal R isk Analysis Structures. (08 Marks)
Refer Q. no 8 b of June/July 2019
b. Explain the purude model for control hierarchy and OT network characteristics.
(08 Marks)
This model identifies levels of operations and defines each level. The enterprise and
operational domains arc separated into different zones and kept in strict isolation via
an industrial demilitarized zone (DMZ):

39
VIII Sem- (CSE ( IS£) - t>ec-2019 / J
4. Level 0: Proc
machines such
EMerpr1so L one lEDs.
• Safety zone
DMZ l. Safety-critical
to manage the
OpemtlOns support

Explain the followi


P r = Control / I) Structure
SCADAZOll6
W) Variables
Process LevtlO v) Data type
l)Structurc: Ardui
1 Safoty (variables and con
• Enterprise zone Arduino software p
I. Level 5: Enterprise network: Corporate-level applications such as Enterprise any syntax or com
Resource Planning (ERP), Customer Relationship Management (CRM). consist of two mai
Void setup () {
document management. and services such as Internet access and VPN entry
}
from the outside world exist at this level.
2. Level 4: Business planning and logistics network: The IT services exist at PURPOSE - Tlie
this level and may include scheduling systems, material flow applications, the variables, pin
optimization and planning systems, and local IT services such as phone, email, once, after each p
printing, and security monitoring. INPUT--
• Industrial demllltarized zone OUTPUT--
RETURN--
l. DMZ: The DMZ provides a buffer zone where services and data can be
shared between the operational and enterprise zones. It also allows for easy Void Loop () l
segmentation of organizational control. By default, no traffic should traverse }
the DMZ; everything should originate from or terminate on this area. PURPOSE-A
• Operational zone values, the looi
I. Level 3: Operations and control: This level includes the functions involved consecutively,
in managing the workflows to produce the desired end products and for control the Ard
monitoring and controlling the entire operational system. This could include INPUT--
prod~ction scheduling, reliability assurance, systemwide control optimization, OUTPUT--
secu~1ty management, network management, and potentially other required IT RETURN--
services, such as DHCP, DNS, and timing. ii) Functions:
2. Level 2: Supervisory control: This level includes zone control rooms, controller the loop functi
status, control system network/application administration, and other control- is just writing
related applications, such as human machine interface (HMI) and historian. function, whic
3. Level I: Basic control: At this level, controllers and IEDs, dedicated HMls, type, no need
and other applications may talk to each other to nm part or all of the control a semicolon (
function. declaration us·
Example

40
· 'Dec-2019 / J<M\/2020

4. Level 0: Process: This is where devices such as sensors and actuator,; and
machines such as drives, motors. and robots communicate with controllers or
IEDs.
• Safety zone . .
J. Safety-critical: This level includes devices, sensors, and other equipment used
to manage the safety functions of the conbOI system.
Module-5
Explain the following with respect to Anlaiao programmlag.
I) Structure ii) Fuactlons
W) Variables iv) Flow control statemeats
v) Data type vi) Constaata. . (08 Marks)
i)Structure: Arduino programs can be divided i~ three i:nam parts_: Structure, Values
l (variables and constants), and Functions. In this tutorial, we will learn abo~t the
Arduino software program, step by step, and how we can write the program without
any syntax or compilation error. Let us start with the Structure. Software structure
consist of two main functions - Setup() function, Loop() function
Void setup ( ) {
}
PURPOSE - The setup() function is called when a sketch starts. Use it to initialize
the variables. pin modes, start using libraries, etc. The setup function will only run
once, after each power up or reset of the Arduino board.
INPUT--
OUTPUT--
RETURN--
Void Loop ( ) {
}
PURPOSE - After creating a setup() function, which initializes and sets the initial
values, the loop() function docs precisely what its name suggests, and loops
consecutively, allowing your program to change and respond. Use it to actively
control the Arduino board.
INPUT--
OUTPUT--
RETURN--
ii) Functions: A function is declared outside any other functions, above or below
!11~ loop ~~ction. We can declare the function in two different ways - The first way
1s JU~ wnt1~g the ~ of the fun~tion called a function ·prototype above the loop
function, which consists of- Function return type Function name Function argument
type, ~o need to write the arg_ument name Function prototype must be followed by
a semicolon ( ; ). The following example shows the demonstration of the function
declaration using the first method.
Example

41
VIII Semt (CSE I ISE) • 'Deo2019 / JCM'V2

int sum_func (int x, int y) // functioo declaration {


witch case st
int z = O;
Similar to the
z = x+y; 4
by allowing th
return z; // return the value
executed in val
}
void setup () {
Statements// group of statements
5 Conditional
The condition~
ol
} v) Data Type: Data ty
Void loop () { , variables or functions o
int result = 0 ; much space it occupies i
result "' Sum func (5,6) ; // function call Th e following table pre
} p rogramming.
iii) Variables: Variables that arc declared inside a function or block are local
· variables. They can be used only by the statements that are inside that function or void Boolean I

block ofcode. Local variables are not known to function outside their own. Following
Unsigned s
is the example using local variables long long
Void setup () {
Vii) Constants: The co1
}
Void loop() { modifies the behavior o~
int X, y; the variable can be use
int z ; Local variable declaration changed. You will get a
x=O: Constants defined wit
y = O; actual initialization govern other variable
z= 10; keyword a superior me
} Example Code
iv) Flow Control statements: const float pi = 3.14;
float x;
S.No. Control Statement & Description
II ....
If statement x = pi • 2; // it's fine t
It takes an expression in parenthesis and a statement or block ofstatements. pi = 7; // illegal - y
1 If the expression is true then the statement or block of statements gets ~ Explain Raspberry E
executed otherwise these statements are skipped. Refer Q. no 9 b of Jun
l f ... else statement
2 An if statement can be followed by an optional else statement, which
executes when the expression is false.
If.. .else if .. .else statement
,.
L Write a python pro

Blink
3
The if statement can be followed by an optional else if...else statement,
which is very useful to test various conditions using single if...else if
statement.
., Tums on an LED o

// the setup function


void setup() { // init'
pinMode(2, OUTP

42
switch case statement
Similar to the if statements, switcb.;.case controls the flow of programs
4
by allowing the programmers to specify different codes that should be
executed in various conditions.
Conditional Operator? :
5
The conditional operator? : is the only ternary operator in C.
v) Data Type: Data types in C refers to an extensive system used for declaring
variables or functions of different types. The type of a variable determines how
much space it occupies in the storage and how the bit pattern stored is interpreted.
The following table provides all the data types that you will use during Arduino
programming.
e that function or void Boolean char Unsigned byte int
char Unsigned int word
ir own. Following
Unsigned String-char String-
long short Hoat double array
long array object
..
vn) Constants: The const keyword stands for constant. It 1s a variable qualifier that
modifies the behavior ofthe variable, making a variable "read-only". This means that
the variable can be used just as any other variable of its type, but its value cannot be
changed. You will get a compiler error if you try to assign a value to a const variable.
Constants defined with the const keyword obey the rules of variable scoping that
govern other variables. This, and the pitfalls of using #define, ma~es the const
keyword a superior method for defining constants and is preferred over using #define.
Example Code
consl float pi= 3.14;
float x;
II ....
x = pi • 2; II it's fine to use consts in math
ofstatements.
pi= 7; II illegal• you can't write to (modify) a constant
ents gets
b. Explain Raspberry Pi learning board. (08 Marks)
Ans. Refer Q. no 9 b of June/July 2019
t, which OR
10. a. Write ~ python program on Raspberry Pi to blink an LED. (06 Marks)
Ans. 1•
Blink
Turns on an LED on for one second, then off for one second, repeatedly.
•1
II the _setup function runs once when you press reset or power the board
void setup() { II initialize digital pin 13 as an output.
pinMode(2, OUTPUT);

43
VIII Sem- (CS£ I IS£)
}
// the loop function runs over and OIIII'
void loop() {
digita1Write(2, HlGH); // tum da the voltage level)
delay( I 000); II wait for a
digita1Write(2, LOW);
delay( I000); // wait for
}
(06 Marks)
b. Explain smart dly F , • 5 b II e.
Ans. ReferQ. no lOL•ll?~l~J"..,•
c. Writeasltort.- e
i) IOT Qal1eagel.
(04 Mark.,)
ii) Baeldaaal Technolopll.
Ans. i) IOT Cballen1es: Refer Q.
aetwork switching equipment use the
ii) Backhaul Technologies:
tcnn to mean "getting data• .ai.bolie'' (which is similar to its use in
the satellite communicatioa Ascend uses the term to describe
how its MAX 2000 swildl data from a backhaul T-1
line on which mobile _. ted to an Internet service
provider and the bac may sometimes be used to
mean the use of the unications line.

44
As Per New VTU Syllabus w.e.f 2015-16
Choice Based Credit System(CBCS)

SUNSTAR
(06 Marks)

(04 Marks)

equipment use the


similar to its use in
the term to describe
m a backhaul T-1

etimcs be used to
ons line.

BIG DATA ANALYTICS

(VIII SEM. B.E. CSE / ISE)


SYLLABUS Eighth Seme
BIG DATA ANALYTICS CBCS-
JAS Pl::R CIIOICE HASED LR~l>IT SYDDIBl:S)S<.111:Ml.:I
BIG

..•
(f.FFCCTIVF. FROM TIIE AlAonacYFAll 2016-20171
IA M ark• 20 Time: 3 hrs.
Subjttl Cllllt IS<liG
Eua M arki 80 Note : A11swer a11y FIVE full qu
Numbtr of l ,tttun- lloun/Wttk
t:u • Hours 03
Total Numbr r of Ltrtun- lloun

I. a. Explain the Hadoop dislributi


adoop Distributed File System Basics, Running ms and Benchmarks, Had
apReduce Framework, MapReduce Programmillg Ans. Hadoop Distribution File Syst
(HDFS) was designed for Big
read many model that enables 0
cohe.rence overhead requirement.
ssential Hadoop Tools, Hadoop YARN Appl' Hadoop with Apache Amba · restricts data writing to one user
asic Hadoop Administration Procedurc;s is no random wri1in1.t to HDFS
by1c Slreams arc gua-ra111ccd 1o b
The design of IIDFS is based
· usiness Intelligence Concepts and ~•=housing,
1
Data Mining, published by Google provid~s
archive/gfs.htm I).
isualiution
HDFS is designed for data stre
MOooa.a. The HDFS block size is typically
for standard POSIX fi le system u
ecision Trees. Regression, Artificial Neural tl&t C • Cluster Analysis, Association
no local caching mechanism. The
inings. from HDFS 1han to lry cache the
The 1110s1 i111er~s1ing asp~c1 of H
MODULE S its data locality. A principal des
rext Mining, Naive-Bayes Analysis. Su , Web Mining, Social Netwo the computation to the dala ralhe
nalysis is reflected in how Hadoop will
Data is then moved to and from
parallel file system array. HDFS,
B1. Douglas Eadline,"Hadoop 2 Quick-Start Guide: Learn the Essentials the computer portion of the clusl
of Big Data Computing in the Apache Hadoop 2 Ecosystem", c_~mputation .:nginc and a storag
finally, lladoop clusccrs assume
1stEdition, Pearson Education, 2016. ISBN-13: 978-9332570351 deal with this situation. HDFS ha
provide the data needed by the co
B2. Anil Maheshwari, “Data Analytics”, 1st Edition, McGraw Hill The following points summarize
Education, 2017. ISBN-13: 978-9352604180 • The write-once/ read-many d
• Files may be appended, bul ra
• Converged data storage and p
• " Mo~ing computation is cheai
• A reliable file system maintai
failure of a single node (ore
system.
• A specialized fil~ system is us
Eighth Semester B.E. Degrtt Eumination,
CBCS - Model Question Paper - I
BIO DATA ANALYTICS
Time: 3 hrs. Max. Marks: 80
20
Note: Ati.vwer any FIVE full t{UeMiom,, selecting ONE/111/ que$lionfrom each module.
80
OJ
Module - I
Explain the Hadoop distribution file system design features with it's important aspects'!
(06 Marks)
fladoop Distribut ion File System Design Feature, : ·1ht: lladoop Distribution File System
(IIDFS) was designed for Big Data processing. The design assumes a large file write- once/
B1 63 read many model that enables other optimizations and relaxes many of the concurrency and
coherence overhead requirements ofa true parallel file system. For instance, 1IDFS rigorously
restricts data writing to one user at a time. All additional writers arc "append-only", and there
is no random writing to HDFS files. Bytes an: always appended to the end ofa stream. and
byte streams are guaranteed 10 be stored in the order \Hitten.
I he design of HDFS is based Oil the design of th,· (ioogle I· iles Sy~lt:lll (GFS). A paper
Mining, published by Google provides further background on GFS (http://rescarch. google.com/
archive/gfs.html).
HDFS is designed for data streaming where large amounts of dala are read from disk in bulk.
The HDFS block size is typically 64MB or 128MB. Thus, this approach is entirely unsuitable
ssociation for slandard POSIX file systl!m use. In addition, due 10 the sequl!nlial nature ofthl! da1a, Jherl! is
no local caching mechanism. The large block and file si1e make it more efficient to rl!read data
from I IDFS than lo 1ry cacht: 1he daia.
Thi: most in1eresti11g aspect of I I Dr· s- and 1111: one that separa1es it from other file system- is
Social Netw its data locality. A principal design aspect of Hadoop Map Reduce is 1he emphasis moving
the computation to the data rather than moving the data to 1he computation. This distinction
is reflected in how Hadoop will exist on hardwan: separate from the compute! hardware.
Data is then moved to and from the computer components via high-speed interfaces to the
parallel file system array. HDFS, in co1itrast, is designed to work on thl! saml! hardware as
the compulcr po11ion of the clus1er. Thal is. a single serv1:r node in the cluster is olkn both a
computation engine and a storage engine for the application.
Finally, Iladoop clusters assumt: node (and even rack) failurl! will occur at some point. To
deal wilh this situation, IIDFS has a redundant design that can toll!ratl! system failure and still
provide the da1a needed by 1he compute part of the program.
The following points summarize the important aspects of HDFS:
• The write-once/ read-many design is intended to facilitate streaming reads.
• Files may be appended, but random seeks are not permitted. There is no caching of data.
• Converged data storage and processing happen on the same server nodes.
• "Moving computation is cheaper than moving data··.
• A reliable fi le system maintains multiple copies ofdata across the cluster. Consequently,
failure of a single: node (or even a rack in a large cluster) will not bring down the file
system.
• A specialized file system is used, which isnoc cxsigned for gem:ral use.

3
VIII Sem, (CS£flS£) CBCS · }vfodel,Q~Wt'IIP

b. Wilh II neal I a - ~ - l ) H e a i roles in an HDFS deployment? ( 12 Marks) lets the conversation between t
Ans. HDFS Cuap I • progressing, the NameNode al
The design ofHDf'S 1111111 lypaofnodes: from DataNodes. The lack of
JS DltaNocles. a case, the NameNode will ro
In a basic dcsip. • . . . ., I ---~all die metadata needed to ~tore and retrieve now-missing blocks Because t
B1 64 the actual d-■ ..... C M I acaually stored on the Name Node, however. (decommissioned) for maintena

DataNode dacma- •-•••..._


For a mmunal:; I •

The desil!II IS 1 17
5 C _

El
_._._ a sin~lr Name Node da-,mon and singl,:

adl .- master (Na,m:Node) manages the file


from the I IDFS pool.
The mappings between data b
storage on the NameNode.
system n~mcsp■ar- J Df I • • "1 dieals. File systc:m namespace operations For performance reasons, the
such asopen,._ I 7£ I I aR . .tfnctoricsan: all managed by the Na1m:Node. DataNode provides a block re
The Name odr. . I • Hi g lfllDcks to DataNodes and handles DataNode The block rcpons arc sent every
failures. property.) The reports enable th
The , laves (DataNodes) - 57 . . _... read and write n:quests from the file in the cluster.
system to the clients. The Nalllll•••••lllock creation, deh:tion, amt replication.
An examph: of the client I SP IE X 9 alfflk'tion is provide in Fi~urc: 1.1. When
In almost all Hadoop deploym
required by a NameNode, it is
a client writes data, it first co..,m n&s ~ ameNode and requests to crate a file. (now called CheckPointNode)
The NameNodt: deh:nnines ho"' many lllds ~ needed and provides the clicnt with the primary NameNode in case of i
DataNodes that will ston: the data. As pal.,_ SIOfagc process, the
·-- --· --- -.
The purpose of the Secondary
status of the NameNode. Recall
fast access. It also has two disk
• An image of the fi le system
fsimage_ • and is used only 1
• A series of modifications do
begin with edit_• and refle
The location of these files is set
file.
The SecondaryNameNode peri


. . al: new fsimage, and uploads the n

-----
restarts, the fsimage ti le is reas
• since the last checkpoint. If th
'---·- ---·-· NameNode could take a prohrn
system.

Figure I. I Varlo,,s .,..._,_, In 011 HDFS ,lepluvment


datn block~ nre replicated after they att - - -IO the assigned node. Depending on ho\ 2. a. Explain the map reduce mod
nod"s ar" 111 the cluster, the NameNodc allftnpt 10 w -1 I' f v many
nodes that arc in other st:parate racks I,rpaailllc) II' h . n c rep leas o the data blocks on Ans.
blocks are wriucn to other servers in the s■-c kt ere is only one rack, then the replicated T~e Map Reduce Model: Apa
the file block replication is complete lheclieac:e· ~fter the DataNode acknowledges that Prior to Hadoop version 2, this ,
the operation is complete. Note tha; the•·• ~ e file and i~1fonns lhe NameNode that the Map Reduce capability and
DataNodes. II give's the client a limited.._ r .does not wrrtc any data directly to the all the tools developed for flad
not complete in the time period the operaa o time to complete the operation. If it does Hadoop version 2 MapReduce.
Rc~ding data happens in a similar fashion ao;.: ~ elcd. ,
wh11:h return, the nc,1 DataNo<.lcs from ltlCII c lftlt requests a tile from the NameNodc
!he MapReduce computation
is more common thAn mo~, 11 ~.-
data directly from the: Data odes. " '° l'Qd the data. The client then accesses th~
There 11re two stages: 8 mapp
Thus, once the metadata has been d I. d I~ the mapping stage, a mappin
e ivere to tile clit:nt, the NameNode steps back and
4 kind of fi lter or soning process.
lets the conversation between the client and the DataNodes proceed. While data transfer is
progressing, the NameNodc also moniters the data odes by li~h:ning for heartbeats sent
from Data Nodes. The lack of a heartbeat signal indicates a potential node failure. In such
a case, the NameNode will route around the failed DataNode and begin re-replicating the
now-missing blocks. Because the file system is redundant. DataNodes can be taken offiinc
trieve
(decommissioned) for maintenance by informing the NameNode of the DataNodes to exclude
wever.
sing.le from the 1-!DFS pool.
The mappings between data blocks and the physical DataNode an: not kept in persistent
storage on the NameNode.
for performance reasons, the NameNode stores all IJletadata in memory. Upon startup, each
DataNode provides a block report (which it keeps in persistent storage) to the NameNode.
The: block reports arc sent every IO heartbeats. (The interval between reports is a configurable
pro~rty.) The reports enable the Name Node lo keep an up-to-data account of all data blocks
in the cluster.
In almost all Hadoop deployments, there is a SecondaryNameNode. While not explicitly
required by a NameNode, it is highly recommended. The term "Secondary- NameNode"
(now called CheckPolntNode) , it is not an active failover node and cannot replace the
primary NameNode in case of its failure.
The purpose of the SecondaryNamcNOde is to perform period checkpoints that evaluate the
status of the NameNode. Recall that the NameNode keeps all system metadata memory for
fast access. It also has two disk files that track changes to the metadata:
• An image of the file system state when the NameNode was started. This file begins with
fsimage_• and is used only at startup by the NameNode.
• A series of modifications done to the file system after starting the NameNode. These file
begin with edit_• and reflect the changes made after the fsimage_ • file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs- site.xml
file.
The SecondaryNameNode periodically downloads fsimage and edits files. joints them into a
new fsimage, and uploads the new fsimage file to the NameNode. Thus, when the NameNode
restarts, the fsimage file is reasonably up-to-data and requires only the edit logs 10 be applied
since the last checkpoint. If the SecondaryNameNode were not running, a restart of the
NameNode could take a prohibitively long time due to the number of changes to the file
system.
OR
2. a. Explain the map reduce model with simple mapper script and simple reduce script.
(08 Marks)
Ans. T~e Map Reduce Model: Apache Hadoop is often associated with Mapreduce computing.
plicated
Prior to Hadoop version 2, this assumption was cenainly true. Hadoop version 2 maintained
ges that
ode that B1 101 the MapReduce capability and also made othe~ processing models available to users. Virtually
all the tools developed for Hadoop, such as Pig and Hive. will work seamlessly on top of the
y to the
Hadoop version 2 MapReduce.
fit does
The MapReduce computation model provides a very po1verful tool for many applications and
is more common than most users realize. Its underlying id1:a is very simple.
cNode,
There a re two stages: _a mapping stage a nd a reducing stage,
sscs the
In the mapping stage, a mapping procedure is applied to input data. The map is usually some
ck and kind of filter or soning process.

5
VIII Sewv (CSE/IS£) 13 ~ Vci.t"c;1; A ~

for instancC, assume you need to count how many times the name ..KutuLov·· appears in the
novel War and Peace. One solution is to gather 20 friends and give them each a sectio~ of the Map(key I,value I)-> list(key2, value
book to search. This step is the map stage. The reduce phase happens when eve1 yone ,s done: The reducer function i~ then applied t
counting and you sum the total as your friends tell you their coums. _ _ of values in the saim domain:
Now consider how this same process could be m;complished using s1mpll' •n,x command- Reduce(key2. list(value2))-> list(va
line tool~.~ following grep comm.ind llpplied a specific mnp to a text file: Each reducer call typically produces
$ gn:p .. Kutuzav ·· war-and-peace.tAt MapReduce framework transforms a
This command searches for the word Kutuzov (with leading and trailing space) in a leAt The MapReduce model is inspired
file called war-and-peace.txt. Each match is reported as a single line of text that con1ains · ~any functional programming lang
the search 1enn. 1be actual text file is a 3.2MB text dump of the novel War and Peace and important properties:
is available from lhe book download page. The search term, Kutuzon. is a character in the i.D~ta flow is in one direction (map t
book. lfwe ignon: die grep count (-c) option for the moment, we can reduce the number of ~~e mp1~1 lo another MapReduce proc
instances to a smgle number (257)'by sending (piping) the re,ull~ of grep 11
• As w11h _func1ional programming.
into we -1 11
and reduction limc1ions lo lhe input
(we -1 or •word count" reports the number of lines•it rcce,vcs.) ~~ Hadoop data lake is always prese
S grep - KUllmJII" war-and-peace.txt!wc - I
111 . Because there is no dependency on
257 to th~ data,the mapper and reducer da I
Though noc strictly a MapReduce process, this idea is quite similar to and much faster than provide better perfonnancc
the manual process of counting the inst.inccs of Kut~on in the printed book. rhe analogy
Distributed(paraflel) imple~entations
can be laken a bit funbcr by using the two simple (and naive) shell scripts shown in Listing
analyzed qui~kly. In general, the ma
I. I and Listing 1.2. 1lle shell scripts are operation (much more slowly) and tokcni.:c both the ~ubset of the mpu1 data. Results from ~
Kutuzov and Petersbwa llriags in the text: m the reducer phase:.
S cat war-and-pea"e lAl mapper.sh I ./reducer.sh
KuhtL0V. 615 b. E1plain compllln& and runnln& P
Petersburg, 121 program.
Notice that more iAllace of Kutuzov have been found (the first grcp command ignore Ans. ~ord~ount is a simple application that
instance like -Kunaov.· or ~Kutuzov,"). The mapper inputs a text file and then outputs daia ~•ven mput set. The MapReduce frame
in a (key. value) paar (IOkeo-aame, count) format. Strictly speaking, the input to the 5cript is rs, the frame~ork views the input to the
the file and 111c keys ll'C Kutuzov and Petersburg. The reducer script takes these key-value ~ey-value parrs of diff'erent types. The
pairs and combines 111c similar token and counts the total number of instances. The rcsull is a (mpul) <k I. v I ~ ..... map.,,. ,ki. v" ....
_#

key-value pair Uokcn-name. sum 1. (outpul) -


Llslin& I. I Simple Mapper Script The mapper implemenlation,.via the ma
II! !bin/bash
by ,'he: specified Textlnputformat class
while read line ; do for token in Stine; do wh1tespa~es ~sing the StringTokenizer an·
if ["$token"="Kutuzov"J ; then encho ·•Kutuzov, I" code section rs as follows:
elif["$token" = "Petersburg") ; !hen echo "Petersburg, I" public void map(object kay. Text value
fi done done ) throws IOException, lnterrup1edExc;· I
Listing 1.2 Simple Redueer Script (value.1ostring ()) : while (itr.hasMo li~
II! /bin/bash kcount=0 pcount=0 word t (' re 0
J .se llr.nexToken ()) : conlext.write(
while read line ; do
if ["Sline·•=·•Kutuzov, I "] ; then let kcount=kcount+ I J
elif["Sline'' """Petersburg, I" ] ; then let pcount~pcount+ I Given two inpu1 files with contents Hello
done Hadoop Goodbye Hadoop. the WordCount
echo "Kutuzov, $kcount" echo "Petersburg.Spcount'' <Hello. I>
Formally,the MapReduce process can be described as follows. <World. I >
The mapper and reducer functions arc both defined wrt Jnt., structured in (key,vnluc) pairs. <Bye, I>
The mapper takes one pair of the data with aty~ in one data domain.and returns a list of pairs <World. I>
in a different domain: <Hello. I>
< H11doop, , .,,
Map(key I,value I)-> list(key2,value2)
The reducer function is then applied to each key-value pair." hil:h in turn produces a colh:ction
of values in the saim domain:
Reduce(key2 , list(valu.:2))-> list(valu.:3)
Each reducer call typically produces either one valt1e (value3J or an empty response. Thus.the
MapReduce frnmework 1rnnsforms a list of (key.value) pairs into a list of values.
The MapReduce model is inspired by the map and reduce functions commonly used in
many functional programming languages. The functional nature of MapReduce has some
important properties:
i.Data flow is in one direction (map to reduce).11 is possible to use output of a reduce step as
the input to another MapReduce process.
ii. As with functional programming. the input data an: not changed. 8) applying the mapping
and reduction functions 10 the input data, new data arc produced. In eft'ec1,1he original state
of Hadoop data lake is always preserved.
iii. Because there is no dependency on how the mapping and reducing functions are applied
to the data,the mapper and reducer data flow can be implemented in any number of ways to
provide better performance.
Distributed(parallel) implementations of MapRcduce enable large amounts of data to be
analyzed quickly. In general, the mapper process is fully scalable and be applied to !lny
subset of the input data. Results from multiple parallel mapping functions are then combim:d
in the reducer phase.
b. Explain compllln& and running process or the Hadoop word count example with
B1 111
program. (08 Marks)
WordCount is a simple application that counts the number of occurrences of each word in a
given input set. The MapReduce framework operates exclusively on key• value pairs: that
is, the framework views the input to the job as a set of key-value pairs and produces a set of
key-value pairs of different types. The MapReduce job proceeds as follows:
(input) <k I. v I ; -> map•> <-k2. v2_. . ; Com bin,: •-· , 1..2. v2 -· ro:duce . ; •· k3, v);
(output)
The mapper implementation,.via the map method, processes one line at a time as provided
by _the specifi~d Textln~utFormat_ class. It then splits the line into tokens separated by
wh1tespaces using the StrmgTokenizer and emits a key-value pair of<word. I>. The relevant
code section is as follows:
public void map(object kay, Text value, Context context
) throws IOException, lnterruptedException I StringTokenizer itr=new StringTokenizer
(value.tostring ( ) ) : while (itr.hasMoreTokens () J;
word.set (itr.nexToken ( ) ) ; context.write(word, one):
}
}
Given two input files with contents Hello World Bye World and Hello
Hadoop Goodbye Hadoop. the WordCounl mapper will produce two maps:
<Hello. I>
<World. I>
<Bye, I>
pairs. <World. I>
ufpairs <Hello. I>
< Hadoop, I>

7
VIII Se,rn., (CSf/ISf) 13~V~A ~ CBCS - /vfodel,Q~im\f Po
<Goodbye. I>
<Hadoop, ,,..,.
1_5/05/24 18: 13:26 INFO
hmulus.8188 ·w~' v l/tin11!1in
I
WordCount sets a mapper l_.~105/24 18: 13:26 INFO
jub.se1MappcrClass (Tokeni1erMap~r cla~s) , hmulus/ I 0.0 0 I 8050
a combiner job sc1Comb1ncrClass(lntSumReducer.class) ; l:'\/05124 18:13:26 WARN
and a reducer JOb.se1Combiner('lass(ln1SumRcducercla~~), not performed. Implement tl
Hence, the output of each map 1~ passed through the local combiner (\\hich sums the values to remedy this.
in the same way as 1hc reducer) for local ;1ggrcga1,on :md 1hcn sends 1hc darn on to 1he final 15/05/24 18: 13:26 INFO int
reducer. Thus, each map a,cn"C the wmbincr ~rforms 1he folio\\ ing pre-reductions: 18: 13:27 INFO mapreduce.J
<Bye. I> r...J
,Hello, I> File Input format Counters
<World, 2> Bytes Read=3288746
<Goodbye, I> file Output funnm Counters
<Hadoop, 2> Bytes Wriltcn-467839
<Hello, I> In addition, the fol1011 ing fil
The reducer• n •ia 1M reduce method, simply sums the values, which are the file name may be slightly ditl
occurrcll« C.-S ilr cadl kC) 1be rele..,ant code section i5 as folio\\~: public void reduce S hdfs dfs -Is war·,and-peace~
(Texl kq-. lluak c:lalWrilable> waka. Found 2 times
Con1wccmca1 •rw-r--r-- 2 hdfs hdfs O2015-
) 1hn)lls IOEAccpuon, I ~ : int sum 0: ~rw-r-r-- 2 hdfs hdfs 467863•
for llntWritablc val ~~ - val get ( ) : r~c complete list of word co
I wrth lhe following command
result.set (sum); con1e,t •nlC (key rauhJ: • •and -peac
S hdfs dfs -get war
I If the WordCount program i,
The fi nal output oflhe reducer IS 111£ following: overwr11t: l~te war-and-peace
<Dye, t> remo_vcd wuh the following
,Goodbye. I> $ hdls dfs -rm •r -sJ..ipTrai.h w
<Hadoop. 2>
<Hello, 2> Explain with example A
<World,2> Apache Pig is a high-lcve
To compile ad . . 111e program from the command line, perform the following steps: transformations using n
1. Mae a local wordcount_Classes dam:tcxy a set of transformations
S makdirwordcount_classes cxtmct, transform, and I
2. Compile Ille WordCount.java pnignm using the 'hadoop classpath• data processing.
command to include ull the available Hadoop class paths. Apache Pig has several
Sjavac-q, ' hadoop classpath" -d \llordcoun1_dasscs WordCount.j:1va done on 1he local ma,hi
3. The jar file can be created using the following command modes execute the job o
Tezcngine
Siar -1:vt wordcounl.jnr -C 1\orJcount_d~-
4· lo run the c>.amptc, ..,,.,....... :~r"' J,....,·111ry in HUFS and place a te>.t file in the new
L
directory. For this e~ample, 1\e 1\ ill use the \\ar-and-peace.1>.t:
S hdfs dfs -mkdir war-and-peace-input Interactive Mode
S hdfs dfs -put war-and-pcace.txt war-and-peace-input Batch Mode
S. Run tfle WordCount application using the follo,~ing command: There arc also in1cmc1tH:
S hadoop jar wordcount.jar WordCount war-and-pl.'ncc-input developed locally in inte
__, war-and-peace-output scale on chc cl11s1er in a
If everything is worki ng ~orrcetly. I luduop mi:ssage~ for the job should look like the Pig faample Walk-Th
following tabbn:viated vcr.,1on):

8
15/05/24 18: 13:26 INFO impl.Timelincclientlmp I: Timelinc service address: http://
limulus:8188/ws/v 1/timdinc/
15/05/24 18. 13:26 INFO client .RMProx) Connec1ing to ResourceManager at
limulus/ I 0.0.0. I:8050
15/05/24 18: 13:26 WARN maprcduce.JobSubm111er Hadoop command-line option parsing
not performed. Implement lhe Tool interface and execute your application with ToolRunncr
er (which sums the values to remedy this.
ih the d:it:i on to the final 15/05/24 18:13:26 INFO input.FilclnpulFormat Total input paths to process : I 15/05/24
ng. prc-rcducuons: 18: 13:27 INFO maprcduce.JobSubmitter: number ot splits: I
[...)
File Input Format Counters
Bytes Read 3288746
File Output rom1a1 Counters
Bytes Wri1tcn=467839
In addition, the follo,1 ing fi les should be in the war-and-peace-output din:ctory. The actual
~ the values. "h1ch an: the tile name mny be slightly different depending on your Hadoop ver:,ion.
ollows: public void reduce $ hdfs dfs -ls war-and-peace-output
Found 2 times
-rw-r--r-- 2 hdfs hdfs O20 I5-05-24 11: 14 war-and-peace-output I _SUCCESS
-rw-r-•r-- 2 hdts hdfs 4678639 20 I S·0S-24 11: 14 war-and-peace-outpuV part -r-00000
The complete list of word counts can be copied from HOF S to the working direc1ory
wilh the following command:
$ hdfs dfs -get war-nnd-peace-oulput/part -r-00000.
If the WordCount program is run again using the same outruts, ii will foil when it tries
overwrite the war-and-peace-output directory. 1'11c output directory ,md all contc:nts can be
removed ,1 uh the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output

Module-2
3. a. Explain with example Apache pig and Apache Hive? (08 Marks)
the folio" ing steps: Ans. Apache Pig is a high-level language that enables programmers to IHile complex MapReduce
transfonna1ions using a simple scripting language. Pig Lalin (the actual lunguagc) defines
a set oftransformlltions on a data set such as aggrcga1.:, join. and son. Pig,~ o0en used 10
extmct, transform, and load (ETL) data pipelines, quick n:se.1rch on raw data, and itemtivc
data processing.
Apache Pig has several usage modes. The first is a local mode in which all processing is
done on the local machine. The non-local (cluster) modes are MapReduce and Jez. These
modes execute the job on the cluster using either tht MapReduce engine or the optimized
Tczenginc
l;k.~ a te\t file in the new
fable 2.1 Apache Pig Usage Modes
Local Mode Tez Local Mode Map Reduce Mode Tcz Mode
lnteraclivt Mode Yes Experimental Yes Yes
Batch Mode Yes fapen mental Yes Yes
There are also mteract1ve modes, usmg small amounts~ f data. and then run al
developed locally in interactive modes, using small amounts of data. and then run :11
scale on the cluster in a pmducrion mode. The modes nrc summar,;,cd in Tnt,lc ::?. '
Pig f.Aamplc Walk-Through

9
VIII Se-wt1 (CSf./ISE) 8~V~A ~ C6CS · Moctei Q ~
For this e"<ample, th..- folio,, mg software environment 1s assumed Other environments Comments are delincJ
should "ork in a s1m1lar fashion. called id.out for !he re
• OS: Linux and then start Pig wil
• Platform• RHEL 6 6 $ /bin/rm -r id.out/
• 1lorton\\orkS HOP 2 2 \\tllCh Hadoop v~ion 2 .6
S pig -x local id.pig
• Pig version: 0.14.0 .. . If the script worked c
I f pseudo-distributed installation IS uSN, hlnstallauon R..-c1pcs. instructions for installing
length file with then
Pig are. the only difference is
In this ~imple example, Pig is used 10 cx1rae1 user names fromt h~ 1~tc /pass~vd file. A full S hdfs dfs -rm -r 1d.Ol
description of the Pig Latin langua&e is beyon~ the scope of this mtroduct1on, but more $ pig id.pig
infonnation about Pig can be found al http l p1g.apache.org/docs rtl.14.0 start.html. The
If Apache lez is insta
following example assumes the user IS hdfs.but an> valid user" ith access to IIDFS can run
learn more about wri1
the example. . . . . Usin& Apache Hive
To begin the example, copy the pasPl,d file to a worl..1ng d1re1:tol')' tor 101:al Pig optm111on:
Apache Hive is a dat
S cp /etcipass\\d .
summarization, ad h
Next, copy the data file into IIDFS a Hadoop MapReduce operation:
called HiveQL. Hive
S hdfs dfs -put passwd pass" d
pe1abytes of data usin
You can confinn the file is in IIOFS by anering the following command:
• Tools to enable e
hadfs dfs -ls passwd
-rw-r-r- 2 hdfs hdfs 2526 2015-03-19 11 :08 passwd • A mechanism to i
In the following example of local Pia opm1tion, all processing is done on the local machine • Access 10 fi Jes st
HBase
(Hadoop is not used). F1r5t, 11,c llllffact1ve command line is started·
• Query execu1ion
S pig -x local Hive provides users ~
If Pig starts correct!~. ~ou •ill Jff a grunk prompt You may !llso s~~ a bunch of INro
Hadoop clusters. Alt
messages, which you can ...,.-c Next, enter the following commands to load the passwd file wiJh lhc MapRcduce
and then grab the user name and dump it to the tenninal. Note that Pig commands must end
Hive queries can also
with a semicolon(;).
YARN in lladoop ver
grun1> A~ load 'pass,.,d" using PigStorage l' ;');
Hive £\ample Walk-
grunt> B = forcach A gtnttale SO as id:
grunt..,. dump 0;
For this example. th
The processing will ~tan and a h~I of user name& w,11 ~ printo:d to the screen. To exit the should work in a simi
1n11:rac11ve scsMon. enter 11,c command quit. • OS: Linux
S pig -x mapreduce • Pla1fonn: RHEL 6
• Hononworks 11D
The same sequence of commands can be entered at the grunt> prompl. You may wish to
change the SO argument 10 pull out other items in the passwd file. In the case of this simple • Hive version: O 14.
script,you ,~ill notKC lhal lhc: MapReduce version llke) much longer. Also, bccuasc we arc Although the followi
running this applicalion under Hadoop, make sure t'1e file is placed in HDFS. HDFS can nm the e,, a
If you are using the Horton\\ orks 1-1 DP dis1ribution "ith te1 installed. the 1ez engine can be To star I live, simply e
used as follo" s: a hive> promp1.
S pig -x tez Shive
Pig can also be from a script. An exampll: script (id.pig) is available from the example code (some message may sh
download (see Appendix A, ·•Book Webpage and Code Download"). This script, which is hive>
repeated here.is designed to the same things as the interactive version: As a simple test, create
,. Id.pig., (;).
A = load ·passwd' using PigStorage ('; ·,; - load the pass,\d file hive> CREATE TABL
n - forcaeh A genemtc SO as id,•· extract the u~er IDs OK
dump B. Time tnkcn: 1.705 se1,;
store 13 mto ,id.out'; --,-.r11c rhc results to a directory name id.out hive> SHOW TABLE
10
Comments are delinealed by,.,. and •• at the end ofa line. rhe )Cripl "ill create a directory
. r h It f 'rst ensure that 1he id.out director) isnot in~ our loc:al directory.
called 1d.oul ,or t eresu s. 1 , •
and then s1:1r1 Pig with 1hc script on 1he command hoe:

S pig -x local id.pig


If the scrip1 worked correctly, you should see at least one data file with the results and a zero-
installing length file with the name_SUCCESS. To run the MapRcduce version. the same procedure;
the only difference is that now all reading and writing taken place in HDFS
le. A full $ hdfs dfs -rm -r id.out
but more $ pig id pig
html. The If Apache tel is installcd, you can run the e,-ample script u)ing the -x tez option. You can
Scan run learn more about writing Pig script at http://pig.apache.org/docs/r0.14.0/start. html.
Usln& Apache Hive
Apache llivc is a data warehouse infrastructure built on top of Hadoop for providing data
summariz.ation, ad hoc queries,and the analysis of large data sets using a SQL· like language
called HiveQL. Hive is considered the de facto standard for interactive SQL queries over
petnbytes of data using Hadoop and offers the following features:
• Tools to enable easy data e,,traction, transformation, and loading (El L)
• A mechanism to impose structure on a variety of data formats
• Access to files s1ored either directly in HDFS or in other data storage systems such as
HBase
• Query execution via MapReduce and Tez (optimized MapReduce)
Hive provides users who are already familiar with SQL the capability to query the data on
Hadoop clusters. At the same time, Mive makes it possible for programmers who are familiar
"ilh 1he MapReduce framework to add 1heir cuMom mappers and reducer.; to I live queries.
Hive queries can also be dramaricallly accelt:ra1cd using 1he Apache Tc,1 framework usnder
YARN in l-ladoop version 2.
Hive Eumple Walk-Through
For this example. 1he following software environment is assumed. Other environments
should work in a similar fashion.
• OS: Linux
• Pla1form· RHEL 6.6
• Hortonworks HDP 2.2 with Hadoop version: 2.6
o·u may wish to • Hive version: 0.14.0
of this simple Although the following example assumes the user is hdfs. any valid user with access to
becuase we are HDFS can run the example.
s. To star Hive, simply enter the hive command. If Hive starts correctly, you should get
a hive> prompt.
$ hive
(some message may show up here)
hive>
As a simple test, create and drop a 1able. Note that Hive commands must end with a semicolon
(;).
hive> CREATE TABLE pokes (foo INT, bar STRING) ·
OK
Time 1akcn: 1.705 seconds
hive> SHOW TABLES:

11
VIII Semt (CSE/IS£) CBCS · Modei-Q~P

OK [DEBUG) 434
pokes [ERROR) 3
Time taken: 0.174 s«onds.. fetched: I row (s) [FATAL] I
hive> DROPTABLC pokrs [INFO] 96
OK [TRACE] 816
rime 1aken: 4.0311 ~ecoods [WARN] 4
A more detailed example can be de\elop,:J using a web server log tile to summarize Time taken: 32.624 seconds,
message types. First, create a labk using lhc folio" ing command: To exit Hive, simply type ex
hive> CREATE TABLE logs(tl 5lr1llg. C string, 13 string, t4 string, hive> exit;
..... tS string, t6 string, t7 string) ROW FORMAT DELIMITED FIELDS
b. Explain with lhc following
_, TERMINATED BY' · ;
I) Create the database
OK
Time taken: 0.129 seconds 2) Inspect the database
Next, load the data-in this caK.from the sample.log file This tile is available from the 3) Create row
example code download *•file is found in the local directory and not in HDFS. 4) Delete a row
hive> LOAD DATA LOCAL I PATH ·sample.log· OVERWRITE INTO TABLE logs; 5) Remove a ta ble
Loading data to table defaull logs 6) Adding data in Bulk.
Table default.logs stats: (..-f les=I, numRow=0, totalSize,=99271, rawDataSize"'0) I) Create the Data base
OK The next step is to create the
Time taken: 0.953 s«onds hbase (main) :006:0> create '
Finally, apply the ~lccl Skp 10 the file. Note that this invokes a lladoop MapReduce 0 row(s) in 0.8150 seconds
operation. The resulU appc.,ar II the end of the output (e.g., totals for the message types In this case. the table name i
DEBUG, ERROR. and so on) the row key. The price colun~
hive> SELECT t4 AS ~v. COUNT(") AS cnt FROM logs WIIERE t4 LIKE;(%' GROUP command is used to add to ti
BYt4; data can be entered by using t
Query ID "' hdfs_20 I50327 GOOOO_dlela265-a5d7-4ed8-b785-2c656979 l 368 Total jobs • I put 'apple'. '6-May-15', 'pric
Launching Job I oul of I put 'apple'. '6-May-15', ·pric
Number qf reduce tasks not specified. fatimated from input data size: I put 'apple'. ·6-May-15', 'pric
In order to change the average load for a reducer (in bytes) : put :apple', '6-May-15'. 'pric
set hive .exec.reduce.bytes.per.redueer=<number> put 'apple'. '6-May-ts·, 'vol
In ordtr to limit the mIDlimum number of n:ducers: Note that these commands c
set hive.exec.reduters.max-<number> from the book download files
In order to set a constant number of reducers: commands can be retrieved an
set mapreduce.job.reduces"'<numbcr> 2) Inspect the Database
Starting Job = job_ 1427397392757_000 I, Tkeing URL = http://norbert:8088/proxy/ The entire database can be I
application_ l42739739257 _000 I/ command with large database.
Kill Command= /opl/hadoop-2.6.0/bin/hadoopjoh killjob_l4273973927.57_0001 Hadoop scan 'apple'
job information for Stage- I: number of mappers: I; number of reducers: I 2015-03-27 hbsc (main) :006:0> scan 'app
13 :00: 17,399 Stage- I map= 0%, reduce = 0% Row COLUMN+CELL
20 I5-03-27 13:00:26.100 Stage- I map= I 00%, reduce= 0%, Cumulative CPU 2.14 sec 6-May-15 column~pricc:close
2015-03-27 13:00:34,979 Stage- I map= l00%, reduce = 100%. Cumulative CPU 4.07 sec 6-May-15 column" pricc:high.
MnpReduce To1al cumula1ivc CPU time: 4 seconds 70 msc1: 6-May-15 column- price:low.
Ended Job = job_l427397392757_000 I column-price:open, timestam
MapReduce Jobs Launched: 6-May-15 column=volumc:, ti
Stage-Stage-I : Map: I Reduce: I Cumulmiw CPU: 4.07 se1: I ll)FS Read: 106384 3) Get a Row
HDFS Write: 63 SUCCF.SS You can use the row key to ac
Total MapReduce CPU Time Spent: 4 seconds 70 msec the row key.
OK

12
[DEBUG] 434
[ERROR) 3
[FATAL) I
(INFO) 96
(TRACE] 816
[WARN)4
marize Time taken: 32.624 seconds, Fetched. 6 row(s)
To exit I live, simply type exit: :
hive> exit;
b. E~plain wilh lhe rollowlng com mud~ in 1hr If b11 se c.111111 model. (O.i Murks)
I) Create the database
2) Inspect the database
3) Create row
ilable from the
1 in HDFS.
4) Delete a row
BLE logs; 5) Remo,•e II table
6) Adding dala in Bulk.
111aSize..O) I) Creale the Dalabnse
The next step is to cn:ate the database in HBase using the following command:
hbase (main) :006:0> create 'apple', 'price·, 'volme'
doop MapReduce 0 row(s) in 0.8150 seconds
he message types In this case. the table name is apple, and two columns arc defim:d The data \\ ill be used a~
the ron key. The price rnlumn is a famil) of four values (open, close. low, high). The put
(IKE,(%' GROUP command is used to add to the database from "ilh in !he shcl I. For inslancc. the preceding
data can be cntl!rcd by using lhc following commands.
1368 Total jobs= I put 'apple' . ' 6-May- Is·. ' pricc:open·. · 126.56'
put 'apple'. '6-May-15', 'price:high', ' 126.75'
put 'apple'. '6-May-15', ·price:low·. '125.36'
put :apple', '6-May-l s·. 'price:close·. '125.0 I•
put 'apple'. '6-Mny- 15', ' volume·, '71820387'
Note tha1 these commands can be copied and paMed in10 I!Base shell nnd are available
from the book download files . 'I he shell also k.:t!ps a hbtory for the scc.:tion. imd previous
commands cnn be retrieved and edited for resubmission.
2) Inspect lhe Database
orbert:8088/proxy/ The emire database can be listed using 1he scan command. Be careful when using 1his
command with large database. This example is for one row.
757_0001 Hadoop scan 'apple'
·ers: I 2015-03-27 hbse (main) :006:0> scan 'apple'
Row COLUMN ➔ Crl.L
CPU 2. 14 sec 6-May-15 column -price:closc. timestamp 14309551 :!8359. value 125.0 I
ive CPU 4.07 sec 6-May-15 column pricc:high. timestamp 14309551~602-1. value 1~6.75
6-May-15 column=price:low. 1imestamp 1430955126053. val\Je- I 23.36 6-May- I 5
column=price:open, timestamp-1430955125977. valuc- 126.56
6-May- IS column=volumc:, timestamp= 1430955 I41440. value ·71820387
. 106384 3) Gel a Row
You can use lhe ro11 key lo access an individual ro11 In lhe ~hx:k pric.: datnhase. 1hc daw i~
the row key.

13
VIII Sem, (CSE/ISE)
hbase (main) OOl:O> . . apple', ·6-May- Is·
COLUMN CELL Using the VARN Dlstrlbut
price:closetin.e-..,=1430955 l:?8359, value: 125.0 I For th e purpose of the exam
pricc:high t.mes1an1p• 1430955 I:?6024, value= 126.75 assign the following installat'
price:low tunestanip-1•00955126053, value~l23.36 application:
price:opent1rnestampa 14;o955 I 25977, value-126.56 S export YARN OS
volume: timcstamp 1430955141440. value 71820387 distributedshell.jar -
row(s) in O 130 seconds For the pseudo-distributed in
4)Delete I Row run the Distributed-Shell Bppj
location Hadoop);
You can delete an entire row by &iv11111 the deleteall command as follows:
hbasc(main) :009:0> delctNIJ •app1c·, "6-May-15' S export YARN DS=SH
5) Remove a Table disrributedshell-2.6.0.jar
!fanother dislribution is use
To remove (drop) a table, YOII - fina disable it. The following two commands remo~c
the apple table from Hbasc lllmc(maan) :009:0> disable 'apple' hbase(main) :0 10:0> drop Jar and SCI SYARN_OS base
'apple' can be found by running the
6)Adding data in B11lk S yam org.apache.hadoop.ya
.... -help
There are several ways to efficiently load bulk data into 11 Base. Covering all of these methods
The output of this command
is beyond the scope of this chapter lastad, we will focus on the lmportTsv utility. which
usage: client
loads data in tab-separated values (tsv) format into HBase. It has two di~tinct usage modes:
• Loading data from I tsv-format file as HDFS into IIBase via the put command •appname <arg>
• Preparing StoreFilcs to be loaded via the completebulkload utility
The (OI/Qlrins example shows how to USC lmporTsv for the first option, loading the ISV• ~attempr_failures_ vafidir
rnterval <arg> ·
format file using the put COflUIIIDd. The s«Olld option works_ in a two-step fashion and can
be • Kplo""d by t'MSUlting hUp:l/hbac.apache.org/book.html#1mporttsv. . _ . .
The l\n,t "'"P •• cunv•r\ the, Appl.........__... Ale to 1£Y fonnat The following scr,pt. which IS
included in the book software. "di l'ftllOft Ille nr.,1 lim: and,,., 1hc cunv.:r~1on. In do,ng so,
-c.:u111ainer_mcmo,y <aJ'B>
it creates a file named Apple-stock.lSv
S convert-to-tsv.sh Apple-stock tsv /Imp
Finally, lmpottTsv is run using lhe follow mg command line. Note the column designation in -container_vcores <arg>
the •Dimporttsv.columns option. In lbc example, the HBASE_ROW_ KEY is set as the first
column-that is, the data for the data. -create
$ hbase org.apache.hadoop.hbase.mapmiuce.lmportTsv -Dimporttsv.columnss
.... HBASE_ROW_KEY, price open.price:high,pricc:low,price:close,volume -debug
- apple /tmp/Apple-stock.tvs
The lmportTvs command worts will use MapReduce to load the data into HHase. To "erify -domain <arg>
that the command worts. drop and re-create lhe apple dalabtie, as described previously,
before running the import command. •help
-jar <arg:>
c. What is YARN? E1plaia aay five commands? (04 Marks)
Ans. The Hadoop YARN project includes the Distributed-Shell application. which is example an ·keep containers acros
of a Hadoop non-MapRcducc application built on top of YARN. Distributed- Shell is a applicm ion_anempts
simple mechanism for running shell commands and scripts in containers on multiple nodes
in a Hadoop cluster. This application is not meant to I>.: a production administrntion tool. but -log_propenics <arg>
rather a demonstration of the non-MapRcducc capabiht) that can I>.: implemented on top of -master_memeory <arg>
YARN. There are multiple mature implementations of a distributed shell that administrators
typically use to manage a cluster of machines. In addition. Distributed-Shell can be used as
a s!arting point for exploring and building 1-ladoop YARN applications. This chapter offers -master_~cores <arg>
cu•~c~ on how the Distributed-Shell can be used to understand the operation of YARN
apphcataons.
14
tV~A~
Using the YARN Distributed-Shell
For the plll'J)Ose of the examples presented in the remainder of6is chapter. we assume and
assign 1he following ins1alla1ion path, based on Hononworks HDP2 2. the Dis1ributed-Shell
application:
$ ex pon YARN_ DS ~tusr/hdptcurrenlt hadoop-) arn-clientllladoop-yam-applications-
distributcd~hel I.jar
For the pseudo-distributed install using Apache Hadoop version i .6 0.die following path will
run the Distributed-Shell application (assuming SHA DOOP_HOME is defined to reflect the
location Hadoop):
$ export YARN_DS=SHADOOP_ HOME/shan:'hadoop/yam/h~yam-applications-
s:
distributedshell-2.6.0.jar
If another distribution is used, search for the file hadoop-yam-applicalionl- distribu1edshell•.
jar and sel SYARN_DS based'on its location. Distribu1cd-Shell exposes v«ious options that
0 commands remove
can be: found by running 1hc following command.
main) :010:0> drop
S yarn org.apache.hadoop.yam.applications.distribuledshcll.Client -jar SYARN_DS
_. -help
The output of1his command follows:
gall of these methods
usage: client
portTsv utility. "hich
-appname <arg> Application Name Defauh
!listinct usage modes:
value -distributedSMII
t command
-nt1empt_failurcs_validity_ when attempt_failun:_validit)_ intcrvol in milliseconds is
tion, loading the tsv- inlerval <arg> set 10 >O. the failures Nnber will not 1ake failure which
o-step fashion and can happen out of the validilylnterval into failure count. If
failure count reaches IO maXAppAnempts, the application
v.
!lowing script. which i~ will be: failed.
unversion. In doing :.o, -container_memory <arg> Amount of memory in MB to be requested to run the shell
command
-container_ vcorcs <arg> Amoun1 of vinual cores to be reques1cd 10 run 1hc shell
column designation in command
KEY is set as the first
-create Flag to indicate wht:thc:r to crea1e the domain
specified with -domain.
columns=
volume -debug Dump out information
-domain <arg> ID of the timeline domain when: the timclinc
cnlitics will be put
-help Print usage
Jar file containing the applicatio,i ma~u:r
-jar <arg>
(04 Marks) Flag to indicate whether to keep 1:ontainers acros~
-keep_ conta1ners_across_
. which is an example application attempts. If the nag is true. running containers
application_attempts
istributed- Shell is a will not be retrieved by the new application attempt
ners on multiple nodes log4j .propertit:s file
-log_propcrtics <arg>
inistration tool. but Amount of memory in MB to be reque~ted to run the
imph:m.:ntcd on top of -master_memeory <arg>
application master
hell that administra1ors
Amount of virtual cores 10 bt: requested to run the
Shell can be used as -master_vcorcs <arg>
application master
ns. This chapter offers
e operation of YARN 15
VIII Se-mt (CSE/IS£) CBCS · Model-Q~

UM:~ and groups thal allowed IO modif> lhe limtine of various Hadoop serviJ
-modify_acls -.arg>
cntilics the 1imeline enlilies in the given domain the scrv ice managed b)
services managed by A
-node_label_expression <arg> Node label expression to determine the nodes where all metrics (Ganglia).
the containers of this application will be allocated, " " Da~hboard View
means containers can be allocated anywhere,,f you don't The Dashboard view pr
spei:ify 1he option, default nod1:_labcl_expression of cluster. The actual servi
queue will bl: u~cd. remove, or add these wi
-num_containers , nrg~ No. of containers on \\htch the shell c9mmand needs to • Movmg. C:hcl.. and h
be e11ecuted • F.dit: Place the mous
-priority <arg> Application Priority. corner of the widget.
Default 0 the widget.
RM Queue in which this application is to be submitted • Remove: Place the m
-queue <arg>
• Add: Click the small I
-shell_args <arg> Command line arge for the shell script. Multiple args can will be displayed. Sel
~ separated b~ empt~ space.
Some widgets provide a
-shell_cmd_priority <arg..- Prionty for the shdl n1mmnnd rnntniners instance, 1he DataNodes
-shell_command <erg> Shell command to be executed by the Application Master. Figure 4.2 provides a de
Can only specify either. -shell_command or • -shell_script The Dashboard view als
Environment for shell script. Specified asenv_ kercnv_ map selected m.:trics acr
cluster will be displayed.
val pairs
from the Select Metric p
-shell_script <arg> Location of the shell script to he executed. Can only 4.3
specify either
- -shcll_command or
• -shell_script
-tinu."Out <arg> Application timi:0\1I in milliseconds
-view _acts <arg> Users and group that allowed to view the timeline entities
in the given domain
OR
4. a. Esplaln viru1 or Apache Ambarl? (12 Marks)
Ans. Afier completing the initial installation and logging into Ambari) a dashboard similar to
that shown in Figure 4.1 is presented. The same four-node cluster as created that will be
used to explore Ambari . It you need to reopen the Ambar1 dashboard interface. simpl) enter
the following command (which assumes you are using the Firefox browser. although other
browsers may also be used):
S fin:fo,.. localhost : 8080
The default login and password Arc admin and admin. respectively. Befor1: continuing any
further. you should change the default password. To change the password. select Manage
Ambari from the Admin pull-down menu in the upper-right corner. In the management
window. click Users under User+ Group Management, and then click the admin user name.
Select Change Password and enter a new password. When you arc finished, click the Go To _ ,_. ___
Dashboard link on the ldl side of the wiRdow lo retum to the dashboard view.
···--···-··'
,

To leave the Ambari interface. select the Adm in pull-down menu of the m,t~lled ,crvicc:s.
Figure 4./
A glance at the dashboard should allow you to get a sense of how the cluster is performing.
The top navigation m.:nu bar. shown in figure ,I.I, provides acces~ to th.: Dashboard, Services,
Hosts, Adm in and Views fc:atures (the 3 " J cube is the Views mc:nu). The Matus (up/down)

16 ~l\t.+AI" e.cAM Sc.At11\W'


modify the timline of various I ladoop service is displayed on die left UMltg pttniorange dots. Note that two of
en domain .the service managed by Ambari arc Nag,os and Gangha; the standard cluster management
services managed by i\mbari, they an: sued to provide cluster monitoring (Nag10s) and
he nodes where all metrics (Ganglia).
ill be allocated, " " Dashboard View
ywhcre,if you don't The Dashboard view provides small status widgets for many of the service running on the
label_exjm:ssion of cluster. The actual services are listed on the left-side vertical menu. You can move, edit,
remove, or add these widgets as follows:
I q>mmand needs to • Moving: Click and hold a widget while it is moved about the grid.
• l:::dit: Pince the mouse on the widget and clicl. the gray o:dit symbol in the upper- right
corner of the widget. You can change several dilferent a~pects (including thresholds) of
the widget.
• Remove: Place the mouse on the widget and click the X in the upper-left corner.
is to be submitted • Add: Clicl. the small triangle ncl\t to the Mtcrics tab and select Add. The available widgets
ipt. Multiple args can will be displayed. Select the widgets you want to add and click Apply.
Some widgets provide additionnl information ,~hen you move the mouse over them. I-or
instance, the DataNodes widget displays the number of live, dead, and view. For instance,
Figure 4.2 provides a detailed view of the CPU Usage" idget from figure 4.1.
The Dashboard view also includes a hcatmap view of the cluster. Cluster heatmaps physically
map selected metrics across the cluster. When you clicf- the Heatmaps tab, a heatmap for the
cluster" ill be displayed To select the metric used for the heat map, choose the desired option
from the Select Metric pull-down menu. Note that the scale ory used is displayed in Figure
executed. Can only 4.3
........ ......,
♦ .. ..- .. . ..- ........ • 0 •
• . .i

lhc timclinc entities --,..


......
' 4,4
=1Cf.. ··-~
(12 Marks) -- _, -
a dashboard similar to =-.L. ~ 0.08 me

as cn:ated that will be


intcrfa~·e. simply c:nh:r
52 0 d ..• 1.33
--- 52.0 d

- -- --.., --
browser. although other

. Before continuing any ...• 18.2 d 4/4 4/4


word. select Manage
In the management
lhe admin user name.
n1shc:d, did., the Go To ...._,-.......
.......-_,.
,ic~.
f the rnslalled services. Figure 4.1 Apaclte Ambarl 1l1ul1boar1I 11iew ofa lladoop Cltuter
clu~tcr is performing.
Da~hboard. Services.
The ~tatus (up/down)
17
VIII SW'II (CSf/ISf) CBCS - Modw Q~U¼twn,p

CPU Usago Configuration l11Mory is the


of configuration chnngcs ma

....--- configurations to be sorted b


specific configuration selling
setting is prO\ ided Inter in the

- 5Hvlcc View
The service menu provides a
provides a graphical 1111:thod
etc/hndoop!con f Xt-1 L fi ks).
service metrics and un Allers
Similar to the Dilshboard vie
menu. To select a scrvicc,clic
will have is 01111 summar). /\ I
example. Figure .J.5 shows II
status ofNamcNodc, Second,
displayed in the Summnry 11i
status or the scr1 i,c and its c

·•- -· .. II <> e
metrics are displayl·J ns 11 idg
As on the dashboard, tln:sc wi
the Configs lab will open an
...
--
-
(properties) an: the same on
through the Ambari intnfricc

--- __ _
iii ...........
..,.

-· ·----
_, ,,.._.~ .-~· ....... ..__

'Ill' llost memorv uso e

·~ ■--......,..., • • e

.• ----- ..........

. 1w ,,...

-
1w
C
0
C
C
C
Cl ....... ,

C
....,_ . . . .11...,.1uo
C

Fi;:ur
18
llfrVah:t,A ~

Conliguralion hb1ory is 1he final tab in the dashboard window. This view provides a list
of configurn1ion changes made to the cluster. As sho\!on in Figure 4.4, Ambari enable
configurattons to be sorted by service, configuration, group, data, and author. To find the
specific configurmion se1tings, click the service name. More information on configuration
setting is provided lalcr 111 the chapter.
Service View
The service menu provides a detailed look at each St"n;ice running on the cluster. It also
provides a graphic.ii mc1hod for configuring each service (i.e., instead of hand-editing the/
etc/hadoop/conr XI\ IL Ii les). The summary tab provides a current Summary view of important
service mclrics :ind an Alters and Health Ched.s sub- window.
Similar to the Dashboanl view, the currently installed services arc listed on the lefi- side
menu. To S1.'h:ct a sen icc,click the service name in the menu. When applicable, each service
will have is 01111 M1111niary. Alters and I lealth Monitoring and Service Metrics windows. For
example, Figure .J.5 sho11s lhe service view for HDf'S. Important information such as the
status ofNamcNodc, SccondaryNameNodc, D:1taNodes, uptime, and available disk space is
displayed in the Summary II indow. The Allers and I lealth Checks widow provides the latest
status of the service and its component systems. Finally, several important real-time service
metrics are displayed as \\ idgets nt the bottom of the screen.
As on the dashboard, these widgcls can be c;,..panded to display a more detailed view. Clicking
the Configs tab will open an options from, shown in figure 4.6, for the service. The options
(properti~s) arc the same ones that arc set in the Hadoop XML should manage them only
through the Ambari intrrlhcc-th~t is, the us~r should not edit the files by hand.
• ♦

1io 1111 • • I •

w:::r:==i
........
... --·
,sa e
"" ..1•~,., -- ...,;,e~·......,..,_.,.~
. ...................
fl, • ~.,, .... ..... 11 •- ,.,., .., , .

....._ --·•·-.. •-- ::::.~.~:.--:;:-_ -~-,-


.. -........._.. ,~ ... ""'
..;
............... ... , ,,_...
.,, .,.,.,,.,, .....u•-~ .,, -----··.._
~ ..... ---- . . .,_,.,,

_.,..._\'II

:l

liJt f"i_i:11re 4.5 1/DFS seri•h-e ~1111111u,ry w/11{/ow

19
VIII Sen11 ( CSE/IS£)

TI1e c~nl Sdlm,!.S a ,1ilablc for each service arc shown in the form. Tiu: adminis1rn1or can sofiware installed. I he rcm:iini,
set each of these properties by changing the values in the form. Placing the mouse in the input over the various service compon
box of 1hc proix-n~ <lispla> s n short description of each property. Where possible, property Further details for ;i particular h
are group,:d by functionalily. The form also has provisions for adding prope11ies that are not A s shown in fi1.:111c ,1.s. 1hc indi
listed. An e:.:unpk of changing service properties and restarting the service components is Host Metrics, a71d Summary info
providt>J in the ~M. n:i~1ng I ladoop Service" section. currently nmning on the ho~t. E
If a sen1ce prm J.:s 11s O\\O graphical interface (e.g., IIDFS, YARN, Oozic), then that plnccd in maintenance mode Til
interface can be opened in a separate browser tab by using the Quiel... Links pull- down menu metrics (e.g., C PU, llll'n1ory, dis
located in lop miJdlc of the window. version of the graphic. lhc Sui
Finall), the Sc-n ,,: i\C'tll'n pull-down menu in the uppcr-lefi corner provides a method for includinu till' la~, lime a heartbc~
starting and sto p '° c ch ~c, vice and/or its component dae'.11ons :icross the cluster. Some
service ma) h:nc a sct of unique actions (such a_s rebalancing_ 1IDFS) that apply to o~ly
· .· 1• ns e,n.,llv CH'ry service has a Service Check option to make sure the service
ce1tam s11ua 10 • ~ .. • , . • d
is working pmpcrt~. The scf\icc check is initially run as part of the installation process an
can be valuabk ,,hen d a_no~in:: roblcms.

• • IJ:II I

-~~
Q
--
c:a
. ..... ,..•,,,_

.. r,

----. "'

..
...
.....
..
. ..~ -
. ...... -
; - : :• ...; ;• •, .1• •

'•- ..

Fi;:11r<! 4.6 A111b11rl l<!n'k<! opliOIISfor 1/DFS ·--- ... ....


c-..~_;';, ...... - ..
Hosts View -..,.
1-...
.....,.
Selecting the 110s1s menu item provides the information shown in Figure 4.7. TI1e host - ···-~
name, IP addn:ss, numhcr of cores, memory, disk usage, current load average, and Hadoop
components :tre listed i11 this window in tabular fonn.
To display th..: I tadoop componcnls installed on each host, click the links in the rightmost Fig11r,t 4.
columns. You c:,n abo add new hosts by using the Actions pull-down menu. The new host Adrnin View
must be running the i\mbari agent (or the root SSI I key must be entered) and have the base T_he Administra1ion (Admin) vie,
d1sol:iy~ :i list nf i,"tnll-,J con",,.
20
administrator can sofiware installed. The remaining options in the Actions pull-down menu provide control
ouse in the input over the various service wmponents running on the hosts.
ossible, property Further details for a particular host can be found by clicking host name in the left column.
11ies that are not As shown in Figure 4.8, the individual hcJIII view provides three sub-\\ indows: Components,
ce components is Host Metrics, nnd S11111111ary information TIie Components window lists the services that are
currently running on the ho~t. Each SffVICIC can be stopped, restarted, decommissioned, or
Oozie), then that placed in mainten.ince mode. The Metrics willdow displays widgets that provide important
pull- down menu metrics (e.g., Cl'U, nm11u1y, disk, and nctwak usage). Clicking the widget displays a larger
version of the graphic. The Summary wiadDw provides basic information about the host,
includin!t the last time a hec111bent was receid
ides a method for
the cluster. Some
that apply to only
·-- ..., __ . . . ......~ ....... -
---•-· • •• •
- I
e i■I

ke sure the service


lation process and

♦ • ■
,,_
t,....
- ..
I

I
I
...
..,
:J ' •

.........
_:_

• f , . •

igure 4. 7. The host


erage, and Hadoop
Fi,:ure 4.8 Amb1ui cluster host 1/ellul 1·iew
ks in the rightmost ,'\dmin View . . The first as shown in Figure 4.9,
enu. The new host The Administration (/\dmin) view provides th~ce ?Pll~n~. ' reflects the version of
) and have the base displays a li~I of ins1ull~d soflw:mi. This repos11om:s listing generally
21
VIII Se.m, (CSf/ISE) Bi.g,,V~A ~ CBCS . i'-fode-lt Q~wn, p
Hortonworks Data Platform (IIDP) uscddurin th . . y ARN Wcbl'roxy
option lists the service accOUIIIS added wbtmgthe mStnllation process. The Service Accounts
st t ~e Web Application Proxy i
used to nm various service_.. .11, Ambar" \~ c':'1 was. inS alled. _These accounts are with the cluster web intc=rfa
1 th
on the cluster. A fully ~ - - clust . · . e ird ~ptlon, S~curuy, sets the security the Resource Manager itself,
._.. er is important m many instances and h Id b
explored if a secure e11YM • aceded. s ou e the configuration property ya
YARN Configs view, scroll
Views view
h
Ambari • . is a fram~..-..
Views r a systematic way to pl ug m
· user mtenace
· ..r. capab1ht1es
••• In stand-alone mode ya
Kerberos pnne1p:il
· • name m.
' and
t ~t provide tor custom • • ◄ 1, _management, and monitoring features in Ambari.
Views allows vou to extc:nd . . CIIIIIDmize Ambari to meet our s cific needs. These clements canbe added
Using the J obllistoryServe
The removal of the JobT ·k
leve I . tiramework necessitate
rnc
Jo~H,storyScrvcr provides a

-_______
,....,,.... ...,. , ..,.,___,,,
..........,.....,___.
which
the J to .a<>or.
= cgatc complete
ob~1storyServer can bc
Managing YARN Jobs
_YAR~ jobs can be managed
....,... Mmcludmg -kill • -li•t•, and -st
apReduce jobs can also be c
-appTypi:s <comma-scparared
HU,M - - - -- - •

.,.,, .. _....,,....ni.
.._..........,,.. .........~-,.
-help D

_______
•~ -kill <Application 10>

·---
K
., ' "°''.. - - - - - - ....,. __...._.,. __....,._
....---·-..............._,.,,... ",,..
-list
.,,..
Li
..,..
...
',.,....
....,.,.. ;P
-------
• S tatus <Application ID>
,,nut ,..__ _ _ _ _ _ _ 'f"""'
,.,,......___.......,..
Neither the YARN Resource r
...,,.. applications. If a job needs to
Application ID :ind then use lhe

- -
Setting Container Memory
YARN manages application re
amount ofcomnincr melllOl)' 13~
file:
Figure 4. 9 t1111h11ri i1/.\t11lled p,,c~ with 1•enio11s,1111mbers a11d descriptlo11s • yam.nodcma11ager.reso11rce.1
use for containers.
b. Explain the llusil: "ttiuloop YARN administration? {04 Marks)
Ans. YARN has several administrative features and commands. To find out more about them, • schedulcr.minimum-allocatio
examine the YARN commands documentations at https://hadoop.apache.org/ docs..'currcnt/ ResourccManag.cr. A request.:
hadoop-yarn/hadoop-yarn-site/YamCommands.html#Administration_ Commands. The container of1his size (default
main administration commands is yam rmadmin (resource manager administration). Enter • yarn.scheduler.maximum-all
yam nnadmin -help to learn more about the various options. Resoum:M:in:igcr (default 81
Setting Cont.1incr Cores
Decommissioning YARN Nodes
If a NodeManag.cr host/nodes to be removed from the cluster, it should be decommissioned You can set the munbcr of cores fi
first. Assuming the node is responding, you can easily decommission it from the Ambari web xml:
UI. Simply go 10 1hc 110s1s view, click on the host, and select Decommission from the pull• • yarn.scheduler.maximum-all
down menu next lo the NodcManagcr component. Node that the host may. also be acting as request at the RcsourceMana
a HOFS DotoNodc. Use 1hc Ambari Hosts view to decommission the HDFS host in a similar this allocation will not wke c
number of cores. ·1 h.: default
fashion.

22
VCtmlA~

s. The Service Accounts YA RN Wcbl'ro..:y


led. These accounts are The Web Application Proxy is a separate proxy server in YARN that addresses security issues
curity, sets the security with the cluster web inti:rface on ApplicationMasters. By default, the proxy runs as part of
instances and should be the Resource Manager ibclf, but it can be configured to run in a stand-alon'e mode by adding
the configuration property yarn.web-proxy.addrcss to yam-site.xml. (Using Ambari, go to the
YARN Configs view, scroll to the bottom, and select Custom yam-site.xml/Add property.)
r interface capabilities In stand-alone mode, yarn.web- proxy.principle and yam.web-proxy.keytab control the
Kerb.:ros principal nom.: and the corresponding keytab, respectively, for use in secure mode.
g features in Ambari.
These ekments canbc.: added 10 the yam-site.xml if required.
ific needs. Using the Jol>I listoryScrvcr
The removal of the JobTracker and migration ofMapReduce from a system to an application-
level framework ncccs~itated creation of a place to store MapRcduce job history. The
••••• • JobHistorySem:r provides all YARN MapReduce applications with a central location in
which to aggn:gate compklc<l jobs for historical reference an<l debugging. The settings for
lht: JobHistoryScrvcr can be found in the maprcd-sitc. xml file.
Managini; YAttN Jobs
YARN jobs can be managed using the yarn application command. The following options,
including -kill, -list, and -Matus, art: available to the administrator with this command.
MapReduce jobs cnn also be controlled with the mapred job command. usage: application
-appTypcs <comma-sep:u-atcd list of application types> Works wilh
•·•·lisl lo filler applications based on their type.
-help Displays help for all commands.
-kill <Application ID> Kills the application.
-list Lists applications from the RM. Supports optional use of
appTypcs to filter applications based on application type.
-status <Application ID> Prints the status of the application.
Neither the YARN ResourceManager UI nor the Ambari UI can be used to kill YARN
applications. If a job needs to be killed, give the yarn application command to find the
Applicalion ID and then use the -kill argument.
Selling Container Memory
YARN manai;cs application resource: containers over the entire cluster. Controlling the
amount of container nmnory l:lkes place through three important values in the yam-site.xml
a111I ,tescripliOIIS file:
• yarn.node111anagcr.reso11rce.me111ory-mb is the amount of memory the NodeManager can
(04 Marks)
use for containers.
d out more about them,
• sche<luler.111inin111111-ollocatio11-mb is the smallest container allowed by the
he.org/ docs/current/
ResourccM:11wgcr. A requcsh:d container smaller than this value will result in an allocated
_ Commands. The container of this siic (dcfault 1024Ml3).
administration). Enter
• yarn.schcduli.:r.maximum-allocation-mb is the largest container allowed by the
Resourcc~lanagcr (default 8192MB).
Setting Container Cores
Id be decommissioned You can set the 11t11nbcr of cores for containers using the following properties in the yarn-site.
it from the Ambari web
xml:
ission from the pull-
• yarn.scl11.:dulcr.111axi111um-allocation-vcores: The minimum allocation (or every container
may also be acting as request at 1hc Rc~ourceManager, in terms of virtual CPU ~ores. Requests smal~e~ than
HDFS host in a similar this allocation will not 1akc effect, anti the specified value will be allocated the n11n1mum
number of cores. The dclirnlt is I core.

23
vrrr Sem, (csr:/rsr:)
C13CS • Model, Q ~ t'r.c-n,, Pet.per .
• yarn.schcduli:r.maximum-allocation-vcorcs: The maximum allocation for every container
Bl can help nrnt,.c bolh bctlcr.
request at the Res~Manager, in terms of virtual CPU cores. Request larger than this
Strategic dcci,ion~ are those that i,
allocation will not take dl"cct, anJ the number of cores will be capped at this value. The
out to a Mw customer set would be
default is 32.
Operational decbions arc more
• yarn.no'demanager.resou«e cpu-vcorcs: The number of CPU cores that can be allocated greater e~cicncy. Upd:uing an old 1
for containers. lo stratcgrc decision-making. the !t
Setting Maplkduce Pruptttia the path 10 reach the goal. ~
MapReduce runs <is a YAR!'J a,,plia8aM Consequently, it may be necessary to adjust some of The consequences ol'thc decision w
the mapred-site. ,1111 properties as~ ldlle IO the map and reduce containers. The following scanni?g for new possibilities and 1
properties arc used to set some Jff-cz n -.id memory size for both the map and reduce analysis of man) possible scenarios.
contaim:rs: found from
. data min in":,•
• mnpred.child.Java.opts provides a lllp • smaller heap size for child JVMs of maps Opera11onal dl'cisions can be made m
(e.g., - -Xmx2048m). system can be crc:ucd and modded
• mapreduce.map.mcmory.mt. p.o•-= a lllpr or smaller re~ource limit for maps (default of the domain. rliis model can help
= 1536Ml3). m'.tomalc opcmtio11s level decision-
• m:tpreduce.n:duce.memOl).mb ,..,.,as • larger heap size for child JVMs of maps micro level operational decisions in
(default = 3072Ml3). . F~r e~ample, a bank migh1 want to
• mapreducc.rcduce.java.opts pnmdG a a.pr or smaller heap size for childreducers. sc1t:nt1fic way using data-based mod
Modale-3 A dccision-lrec-bascd model could
Developing sud1 dccision lree m
5. a. Why )houltl uri::111iz11tiun in\·csl •• ,11iu■ lalelllgence(Bl) solutions? Are Bl toob techniques. Elfcc1ive OI has an cvolt
more important than IT sccurit)' toht..._. (08 Marks) people and organizations at't, new fo
Ans. Buslne~s intl'llii;cnce (UI) ban umbrella . . . . _ includes a variety of IT applications that be testcd against the new d,na. and i
are used to analyze an organiwtion 5 ~adQlllllllunicatc the information to relevant users. In that casc, deci~ion models should
Business Intelligence, 131 is a conccpl • ...ity involves the delivery and integration of An _u~cnding process of gencrnting
relevant and usdul business informaaoa ••organization. Its major components are data dec1s1ons, and thus c,111 be a sii.tnifican
warehousing, tbta 111 in ing. query111g. ad lqlllllmg (figure 5. 1). DI tool~ more import.ult Iha~ IT sc
I
..
Hu I . . . lnt-elllgencu 01 tools includc darn wnrdiousino

..... rnporting. dashbo:irds, qucl) ing, a,::i d


01 tools can range from Vl.'rv simplc
D M1n1ng
sophis~icatcd tools that olli:.r a vciy
cxecu11vcs can bc their 0 , 111 13! exp.!
Figure 5.1 811m1as Mldll1enc~ 011d data 111/11/111; cycle mechanisms fur them. Thus. large orua
The nature of life and businc:sscs IS 10 grow. lnfom1ation is the lifeblood of business. th at providt: good information in rea~l t
Businesses use man) tcdmiques for understanding their environment and predicting the A spreadsheet tool, such as Mino~oft
future for their own bcm:lit and growth. °"isions arc maJ,: from facts and feelings. Data- Data can be do,1 nloadcd and stort:d i
based decisions arc more clT~-cuvc than those based on feelings alone. Actions based on then pri:scntl!d in the form of graphs a
accurate data. information. lmowkd;e. experimentation, and testing, using fresh insights, macros and l>thcr li::i1urc~ ·1 hc anal
can more likdy succeed and l,:ad to sustaiot.-d growth. ~unction~. l'i,ot rabies help do sopl
The organiio:iition should iO\ nt in business intelligence(BI) solutions: installed to cn.ible nHxh:,~ilcly sophist"
Companies u~c f31 10 dck•ct significant events and identify/monitor business trends in A dashl>oanling syMcm, such as Table
order to adapt quickly 10 their changing environment and a scenario. If elfectivc business analyzing, and prcsl'1Hi11g data. At Iha
intelligence training is uscJ in the organii.ation, both decision making processes at all levels redesigned e.i~il) 1\ ilh .i r.iplucal us
0

of management and tactil:al strah:gic management processes can be improved . include: many st:11is1ical l'i~nctions. Th
DI for Belter Dcd~iuns: There are two main kinds of decisions: end to ensure that the 1al>ks and or:ip
• Strategic da'(;isions and real time. "
• Operational decisions. Data mining ~)'Mems, sudi as ([3M

24
cacs - Modcl, Q u.e¢c,0' n, f'oq}(w • 1
location for every container Bl can help make l>oth bcltcr.
s. Request larger than this Strategic dcci,ions an: those that impact the direction of the company. The deci~ion to reach
capped at this value. The out to a new customer set would be a strat.:gic decision.
Operational decisions arc more routine and tactical decisions, focused on developing
ores that can be allocated greater d'ficiency. Updating an old website with new features will be an operational decision.
In strnt.:gic dccision-makini:;, the goal itself may or may not be clear, and the same is true for
the path to reach the goal.
essary to adjust some of The consl!quences of the decision would be apparent some ume later. Titus, one i~ rnnstantly
containers. The following scanning for new possibilitic:s and 11ew paths to achieve the goals. B l can help with what-if
001h 1he mup nnd rcd,1cc analysis ofm:iny possible sc.,narios. 131 can also help create new ideas based on n"w patterns
found from data mining.
for child JVMs of maps Operational decisions can be made more efficient using an analysis of past data. Aclassification
system can be created and modeled using the data of past instances to develop a rood model
limit for maps (default of the domain. This model can help improve operational decisions in the fu1urc. Bl can help
automate operations level decision-making and imprO\I! efficiency by makin 6 millions of
or child JVMs of maps micro level operational decisions in a model-driven v.a)
For example, a bl!llk might want to make 1focisions about making financial loans in a more
p size for childreduccrs. scientific way using data-based models.
A decision-tree-based model could provide a consistently accurate loan decisions.
Developing such d.:cision tree models is one of the main applications of data mining
solutions? Are Bl tools techniques. Elfoclivc Bl has an ev.olu1ionary compom:nt, as business models evolve. When
(08 Marks) peopl.: and organizations act, new facts (data) are generated. Current business models can
iety of IT applications that be tested against the new data, and it is possible that those models will not hold up well.
onn,llion to relevant users. In that case, decision modds should be revised and new insights should be incorporated.
li\ery and integration of An unending process of generating fresh new Insights in real time can help make better
~or components are data decisions, and thus can be a signifil:ant compelitivc advantage.
Bl tools more i111porlant than IT sccurily solutio11s:
131 tools include data ward10using, online analytical processing, social media analytics,
reporting, dashboards, querying, and data mining.
Bl tools can range from wry simple tools that could be considered end-user tools, to very
sophisticall!d tools that offer ~ very hro~d and complex set of functionality. Tims, even
executives can be their own !31 i:xperts, or they can rdy on Bl specialists to set up the Bl
mechanisms for lhcm. Thus, large organizations invest in expensive sophisticated Bl solutions
illg cycle that provide good information in real time.
lhe lifeblood of business. A spreadsheet tool, such as Mkrosoft Excel, can act as an easy but elfecti,·e I•I tool by itself.
ent and predicting the Data can be downloaded and stored in the spreadsheet, then analyzed to produce insights,
facts and feelings. Data•
then prcsenkd in lhc form ofgr:iphs and tables. This systern offers limited automation using
alone. Actions based on
macros and other li:!:1turc:s. The analytical features incluck basic statistical and financial
ting, using fresh insights,
functions. Pivot rnbh:s h.:lp do sophisticated what-if analysis. Add-on modules can be
installed to enable modcratdy sophisticated statistical analysis.
A dashboarding 5y~tcm, such as l~1blcau, can offer a sophisticated set of tools for gathering,
onitor business trends in
analyzing, and prcscnting data. At the user end, modular dashboards can be de~igned and
nario. If effective business redesigned easily 11 ith a graphical uso:r interface. The back-end data analytical capabilities
king processes at all levels
include many stmislical functions. The dashboards artl linked to data warehouses at the back
be improved .
end to cnsurt! that the tables and graphs and other tllements of the dashboard are updated in
real time.
Data mining systems, such as IBM SPSS Modeler, are industrial strt)ngth systems that

25
VIII Se.tw (CS'E/IS'E) 8ift'Vam-A~
providccapabilitics to npply a wide mngc ufanalytical models on large data sets. Open source decision making and helpin
systems, such as Wd;a, arc popular platforms dcsigm:u 10 help mine lilrgc amounts of Uilla DW enables a consolidated
to discover path:ms. Thus, the entire organiL1tio
DW lhus provides l>clli.:r a1
b. What ls tlae pur11ose or a dah1 warehouse? (04 Marks)
users to perform c,tcnsive a
Ans. Purpose or D,ua W:m:housc lies socnc..-herc in its definition it.self i.e.a database created by
operational u.11al,as,:s us,·J
combining data th.it is gathcred through various sources thilt can be of different types and
formats (e.g. te,1. sql. xml elc). c. Dusincsscs ncl.'d 11 "two-s~
Now what , ou 11 ill <lo afkr stonng such huge amount of data from different sources into a
single database. you 11 ill a11alyx lite data which you ha".i accumulated and try to answer Ans. Some of the ex;irnplcs citl.'
queries which wi.:re not possible: or were performance intensive earlier. The airlincs haw all lhi: d;tt
In a nutshi.:11 Dara "archoust I process of coll\.'Cting dala , transforming it, loading into until all lhc bags h,t\ i.: arriw
single uatabnsc and th.:n 1151111 a Bl tBusin.iss Intelligence) tool to answer your analytical then n:port ii lo 1he1r n"to
queries and prediction ofal) flll1ba' questions that may arise are helpful to your domain or l..110,1 upfront that lhcir bag
busines~. Po1,er companil.'~ hav.: 1hc
Below arc few rl.'nSOIIS hours after dozo.'ns of c11~101
Im pro, ing \'i~ibility of Dat:a : An organization registers data in different systems, which ohcad oftiim: to pr.:l'cnt foil
support the , ,tn us busm~-ss p11xcs~~ In order to create an overall picture of business The Fcd has all lhi: d,tta 10
opct Miur.~. customm amt supplii.:rs-thus creating a single version of the truth -the data must is still clinging on to ,m o
come 10,.dh,r an one pl.11:.: anJ made compatible. Both external (from the environment) and data and :idjust 1hi: policies
internal dala (from ERP. <:RM and liaanc:1al systems) should merge into the data warehouse smancr. r,•,11 time u>111p111.:1
and thi:n ~ arou~d. Therefor,: having a single source to answers all your qucrie~. l1bco 's products bring Ihat
lmpro,ed Pcrfora1111m:e: On.: could ll5C an already existing operational d.11abase if there is d:lla. Thl• most cvmmon fo
only single <k-stinatio11(D,1tabasc:) ofall the data. yet there few constraints like performance and columns.
which degr.ade for bolh 01)1!ratlOl\al processes and reporting processes. Therefore wc: create n Documents. on th.: othcr h,1
database tum:d and optimiz..-d d;atab.lsc which will be ready to answer queries which require of information conwined i
to bring h11ge amo11111 of data :ind :mal)Sis. anal) zed casil) sincc 1hc co1
lnc:re11sc: D11ta Qualit)· : Stal.chold.:rs and users frequently overestimate the quality of data Enlcrpris.: could unkash if i
in the sourc.: sy~tcms. Unfortunately, source systems quite of\en contain data of poor quality. of docu1111:nts 10 ob1ain the t
When w,· nsc a d.11.1 v.,irchous..:. \IC can gn:atly improve the data quality, either through - lnfinote brings stru,turl! 10
were possihl.: - corrcc1i11g the data 11hilst loading or by tackling the problem at its source. build tools Ihm cnn giw cor
Faster and Mori: 1111\'llnccd Hc11ortini:: The structure of both data warehouses enablc:s end and beconic pru.icti~e instc
ustrs to r.:port in a lkxibll! mannc:r ,tnd 10 quickly perform interncti\'e analysis on the basis lnfinote can hclp) uur corp
of various prcdclini.:tl :mg.ks. Thi.:y may, for example, with a single mouse click jump from your documi:111s.
year level - tv q11,1rt,·r - lo monlh l,:v,:I and quickly switch between the customer data and the Liberty Ston•s C11sc £\ere
product data II ltl.'rcby 1111: indii:.itor remains fixed. In this way, end users can actually juggle Liberty Stores Inc is a spcci,
with the data and lhus ,1uickly gain knowledge about business operations and KPls (Key wellness producls. .imt edui:
Performance lmlii:.itur). and Sus1ainabll.') ciliLl!ns \\\
A dat;a w.1rchuusc ( U\\ ) 1s ,Ill organiz.:d collection of Integrated, ~ubje\:t ui icnlt:u u.1tabases The comp:m) i~ :?O ~""~ o
designl•d to suppm1 d,:cision support functions. DW is organized at the right level of countries, 150 citi.:s, nnd hn
granularil) to pro, idc i:11!,in cnlcrprisc wide data in a standardized format for reports, queries, The company has rl!\cnu,:s
and analysis. DW is pit~ ~icall) .utd func1iom1lly separate from an op,:mtional and transactional The company pa) s ~pecial
datab,tsc. Cn:a1ing a DW for analysis and queries represents significant investment in time nnd produci:d. II donates ab
and elfot1. It h.is to 1>1: wns1a111ly kepi up-to-date for it to ~ useful. OW offers many business chari1ablc causl.'s.
and technical b,:ndits. I. Crcme a cumpr.:hcnsin: l
OW supports b11~i11<.:>> 1.:po11ing and dal:i mining uc1ivities. It can facilitate distributed access 2. Create anolhl!r dasltbo:trd
to up-10-d,ll.: bu~1110·~~ l..no,, h:dge for departments and functions, thus improving business
ellicicncy and customer scrvicl•. OW can pr.:sent a competitive advantage by facilitating

26
riV~A~
data sets. Open source decision making and helping reform business processes
large amounts of data DW enables a consolidated view of corporate d:.ta. all cleaned and organized.
rhus, the entire organi.G.'ltion can see an int..:e:r.:ited, iew of usctf.
DW thus provides belier anti timely information. It simplifies data access and allows end
(0-t Marks)
.a database created by users lO pt:rform extensive analysis. It enhances overall IT performance by not hu ·:fening the
of different types and operational databases used by Enterpris.: Resource Planning (CRPJ and other S) , ms.
c. Businesses need a " two-second :1d,·a111;a~~ to succeed. What does that mean to you?
different sources into a rot Marks)
ated and try to answer s. Some of the examples crtcd for ... Iwo S1."~00l.1 Advantage" are:
r. The airfim.:s have all the dala about your ba=-s Why is then that you have to 11 ai! fc-r eternity
orming it, loading into until all the bags have arrived at the baggag~ can•usel 10 discover that your bag i,. •issing and
answer your analytical then repo11 it to their customer service'.' \\ h) c.10 ·1 airlines be proactive and 11.t ;,n~sengcrs
lpful to your domain or know upfront that th.:ir bags 11 ill be nrri1 ing later''
Powt:r wmpanic~ hnvc thc data rtt hand on grid foilures. Why do they only respond several
hours allcr dozcns of custom.:rs call and compl,,in? Wouldn't it be better ifth~, use the data
ifferent systems, which alwad of timc lo prcl'cnt foilurcs in the first place''
II picture of business The Fed has all the darn to take fi:.cul. economic and monetary dccis,on~ in re::l 1:me. Why
the truth -the data must is still clinging on to an obsoklc model of meo.:11ng only a few tunes a )car to review the
m the environment) and data and adjust the policies and rates in h1mlstght? Why can't the Fed be replaced by a much
into the data warehouse smm1er. real time computer :tlgorithm?
1your querh:s. Tibco's products bring that valu:tbk two second ,1dvantagc ro the cnterprbc fot structured
ional database if there is data. Th.- mo5t common for111 of structun:d data is a database, where data is storl!d in rows
traints like performance and colu11111s.
. Therefore we create a Docu111cnts. on the other hand. represent thc world of unstructured d,1ta. There is a wealth
er queries which require of information contnincd in Enterprise documents. However, this information cannot be
analyzed easily since the content i~ not organiLed in a structured way. Imagine the potential an
imate the quality of data Enterprisc could unleash if it were ablc to analyzl.! the information scattered across thousands
t:tin data of poor quality. of documc.:nts to obtain the two ~ccomt adl'antagc.
quality.-either through - tnfinote brings strurturc to thc u11s1ruc111rcd wo, Id of documents. It provides a platform to
problem at its source. build tools tlwt can give rnrporntions the ability to extract information from their documents
warehouses enables end and become proactive in~tcad of being reactive. In my future columns, I will explain how
1\'C analysis on the basis tnfinotc <:,Ill hdp }'\JUI' corporation get the two second advantage by mining information from
mouse click jump from your documents.
c customer data and the Liberty Storrs C.1sc E,crcisc:
Liberty Stores Inc is a specialized glob:tl rc.:tail ch:iin th:tt sells organic food. u.-ganic clothing.
well_ncss products. :ind education products to enlightened LOI 1/\S (Lifestyles of the Healthy
and Sustainablo.:) citi,:cns worldwide.
oriented databases The company is 20 years old and is growing rapidly. It now operates in 5 continents, SO
at lh..: right level of countrit:s. 150 citi.:s, and has 500 stores. It sells 20.000 products and has I0.000 employees.
for reports, queries, The company has rl.!venucs of over SS billion and has a profit of about 5 p~rctJnt of revenue.
Iand transactional The company pa)s special attention to the conditions under which the products are grown
and 'produced. ft tfonat..:s about one-fifth (20 percent) of its pretax profits from global focal
charitable caus~s.
I. Create a comprehensive dashboard for the CEO of the company.
·1r1a1e distributed access 2. Create another dashbo:irtf for a country hl·ad.
us improving business
\anrage by facilitating

27
VIII Setw (CSE(ISE) 'B~Da.tt..,A~

Suµ,-rv,,ed
le,lr,ung;
Cl,>~slfl(:.itlo

UnsupNvl~
Le.trninr.:
fxr,lnr,,tion
Fi
Data may be mined to he1
e'ICplore lhe data 10 rind in
kind of prnhlem being so
The most import.int class
OR These are problems whet
pallcms 1ha1 would impr
6. a. Whal l.s Ll,1111 ml in,:? \\ hal .ire sur~·n'i)cd ond unsupervised lcarninJ techniques?
data of past decisions is 0
(08 Marks) codified 10 produce more
Ans. Dara min,ni; I) ti,: art anJ scirncc of di5'overing knowledge, insights, Md patterns in data.
learning as lht"rc is a way
It is 1he a~I of c:~•,acting usdul pallcm~ from an organized colleclion of data. Patterns must
A decision ln:e is a hiera
be ,alid, novd, potentially usl•ful, and understandable. The implicit assumption is that data
about the past c.in I cHul patterns of activity that can be projected into the future. an easy and logical mann
many reasons.
Data mining is a multidisciplinary field that borrows techniques from a variety of fields.
It utilizes the lnowlcdgc of data quality and data organizing from the databases area. It I. Decision lrccs arc eas
draws modeling and analytical techniques from statistics and computer science (artificial TI1ey also show a high pr'1
intelligence) areas. It also draws the knowledge of decision-making from the field of business 2. They sdcct lhe mosl rel
management. decision-making.
The field of d.ita mining emerged in the context of pattern recognition in defens.:, such as 3. Decision trees arc tolcr
from the user~.
identifying a fri.:nd-or-foc on a battlefield. Like
4. Even nonlinear relation
many other ddi:nse-inspircd technologies, it has evolved to help gain II competitive advantage
in business. There an: many algorithn
CART, and CHAIO.
For example, "customers who buy cheese and milk also buy bread 90 percent of the time"
would be a useful p:ittcrn for a grocery store, which can then stock the products appropriately. Regression is a relative!
Similarly, "peQplc with blood pressure greater than 160 and an age greater than 65 were at a The goal is to fit a smoo
high risk of dying from a hcan stroke'" Is of gi cat diagnostic value for doctors, who can then for exampl!.!, can be used t
focus on treating such patients with urgent care and great sensitivity. IClllf>':ra1ure !:>imply plolli
Past data can be or predictive value in many complex situations, especially where the pattern equation will lit lhc data
may not be so easily visible without tin: modeling technique. Here is a dramatic case uf a future day can be predict
Artilici11I ncurul nd wor
data-driven decision-making system that beats the best of human experts. Using past data, a
Intelligence slrcam in Co
decision tree mo<lcl was developed to predict votes for Justice Sandra Day O'Connor, who Neurons receive stimuli,
had n swing vote in a 5-4 divided US Supreme Court. All her previous decisions were coded successiwly, and eventual
on n few variables. What emerged from data mining was a simple four-step decision tree that just one neuron :ind the re
was able to :iccurntcly predict her votes 71 percent or the time. In contrast, the legal analysts layers of neurons involve~
could at best predict corrcclly 59 percent of the lime. (Source: Martin et al. 2004) The neural network can~
points. It will continue to
parameters bnscd on fceJ

28
{~,O:;I; A ~ C8CS · Mod<wQ~Paper · l

Machine Decllion Tr-.es


S,,pc~rvi:.cd I e,1rn1ng
l<'llmlng Techniques .-.-C.a1 Neural Networks
Cl.:iss,t,c:otion
Statlst,cal
Technlqut•s n

Un"i.ui>l"'rv,scd Jl,l1achln<' au.a-Analysis


LeJrnin(;. Lealr,inc
rxr,lnr,Jtlon TC'chn1ques Association Ruic- Mining

Figure 6. / /111punu111 data llfi11l111 t1tt:l111iqun


Data may be mined to help makii more efficient decisions in the future. Or it may be used to
explore the daia to find inh:resting associative patterns. The right technique depends upon the
kind of probl.!m being solwd (Figure 6.1 ).
The most irnpon.111: class of problems solved using data mining are classification problems.
These are problems where tlat1 from past decisions is mined to extract the fow rules and
pallems that would improve the accura.:y of the decision-making process in the future. The
techniques? data of past decisions is organized and mined for decision rules or equations, which are then
(08 Marks codified to produce more accurate: decisions. Classification techniques are called supervised
patterns in data learning as then.: is a way to supervise whc:ther the model's prediction is right or wrong.
ta. Patterns mus A decision tree is a hierarchically organized branched, structured to help mali.; decision in
ption is that dat an easy and logical manner. Decision trees are the most popular data mining technique, for
future. many reasons.
variety of fie! I. Decision trees arc easy to understand and easy to use, by analysts as well as executives.
databases area. TI1cy also ~how a high predictive accuracy.
science (artif\ci 2. TI1ey sclccl the most relevant variables automatically out of all the available variables for
e field ofbusine decision-making.
3. Decision trees an: tokrant of data quality issues and do not require much data preparation
from the users.
4. Even nonlinear relationships can be handled well by decision trees.
There are many algorithms to implement decision trees. Some of the popular ones are C5,
CART, and CIIAID.
rcent of the tim Regression is a relatively simple and the most popular statistical data mining technique.
appropriate I The goal is to fit a smooth well-defined curve to the data. Regression analysis techniques,
than 65 were at for example, can be used 10 model and predict lhc energy conswnption as a function of daily
temperature. Simply plolling the data shows a nonlintarcurve. Applying a nonlinear regression
equation will fit the data very \\ell with high accuracy. TI1us, the energy consumption on any
where the patte future day can be predicted using this equation.
ic case of Artifici.11 neunil network (ANN) isa sophisticated data mining technique from the Artificial
sing past data, Intelligence stream in Computer Science. It mimics the behavior of human neural structure:
O'Connor, wh Neurons recci\'c stimuli, process them, and communicate their results to other neurons
·ons were code successively, and eventually a neuron outputs a decision. A decision task may be processed by
decision tree tha just one ncufon and the result may be communicated ~n. Allematively, th~re could be ma~y
,--11, the legal analys Jaye rs of neurons involvc:d in a dccision task, ~~ng upon the complex_iry o_fthe domain.
200-i) The neural net\\ ork can be 1rainc:d by making• dcc1S1on over and over agam with m11~y ~ta
points. /t will con1i11uc 10 /earn by adju~ting irs in!emal c~n:,putation ~nd com~un1cat1on
parameters based on lc.-cdl>a~k received on irs pttVIOUS dec1s1ons. The 1111cnned1ate values

29
VI·II Semt (CSE/ISE)
CBCS - 1-,fcxlel,Q~
pa:.sc\l \\ ithin th.: In) .:rs of neurons mu> not make intuitive sense to an observer. Thus, the
neural m.-tworks ore considered a blat.:k-box system. 6. Outlier data clemen
At some point, the neural network will have learned enough nnd begin to match the predictive of results. For exam
accur01:y ofa human cxpcn or allernativc classlfkation techniqucs The predictions of some educational scuing.
7• Any biases in the sci
ANNs that have been trained over u long p..:riod of time with a large amount of data have
becomt: dcci:.ivcly more aci:uratc than human e.,pcrts At that point, the ANNs can begin co of the phenomena un
than is typical of th
be seriott\l)' considered for dcploymcm, in real situations in real time data.
ANNs nrc reP I ,r bcclluse they arc cvl!ntually able to reach a high predictive accuracy. 8 Data should bc brou
ANNs arc J. , relatively simple to imph:mcnt and do not have any issues with data quality. be available daily, bu
ANNs rl!qum: a Jot of d:ua to tmin to develop good predictive ability. To relate these varia
Clu,ter 11nalysa is an explor.itory learning tc1.hniquc that helps in identifying a set ofsimilar this case, monthly.
gt"()up5 in lhc J.ala It is a techni,1uc used for nucomntic idcn11ficntion of natural groupings of 9. Data may need 10 bet
things Oa&:I lllSla&\ces that ;ire similar to (or near) each other arc catcgorited into one cluster, much variability, bee;
\\ hile dala mst:inces that :ire VCI) different (or far a\\3)') from each other arc categorized into may dull the effects
separale clusacrs TI1ere can be an)' number of clusters that could be produced by the data. information density
The K-means technique is 3 popular 1cchn1quc ant.I al!O\\S the user guidance in selecting the
c. What is data , isunliza
right number (KJ ofclusters from the data. . .
Clustenng is also kno\\11 as the segmentation technique. l he technrque shows thc_clustcrs of Ans. Data Vis11:ilil.1tion is th
things from past data. rhe output is the centroids for each cluster and the ollocatton of data for the end user. ldcal vi
points 10 their clllitcr. right visual form. to con
The ccntruid definition i~ uwd 10 assi~n new data instances that can be assigned lo their
understanding ofthl! co
clustcr hom"-s Clllitering is also a part of the anificial intelligence family of techniques.
availabtc to present data
Assodatlon rules arc a popular data mining method in business, especially where selling is
totality of the situation.
involved. Also l.nown ns marl.ct t,askct analysis. it helps in answering questions about cross•
Data visualization is the
selling opponunities. This is the hean of the personalization engine used by e-commerce
sues til.e Amazon.1:om ,111d streaming movie sites like Netflix.com. 1 he technique helps find presentation in an easy•
in1erestingrdationships (a0init11:s) bet\~ecn \'ariables (items or events). These arc represented data should be convertc
as rules of the form X => Y, "here X and Y are sets of data items. A form of unsupervised the consumer of data. T
lcanung, 1t has no tlt:1'1Cndcn1 variabk; and th,:rc are no right or wrong answers. There arc just an actionablt: 111,inner. I
stronger and \\ eaker allinitics. Thus, each ntle has a confidence level assigned to it. A part data might lose interest
of the machine-learning family, this t\.-chnique achieved legendary status when a fascinating The qu:ility of t.lata vir,
relationship \\as found in the sales of diapers and beers. Data can be pres.:ntl!d
graphs of v:irious type•
b. Why Is data pn.'par11tion so important and time consuming? (04 Marks) in t:ibles" . 1101\ e, cr. ;
Ans. Data cleansing :ind preparation is :i laborintensive or semiautomated activity that can take up give shap.: to dat:i. Tu
to 60 10 70 percent of the time needed for a data mining project. objcctivcr. for graphic,
I. Duplic:11.: data needs lo be n:moved. The snmc data may be rcceivcd from multiple I. Sho\\, and en·n re
sources. When merging the data sets. data must be de-duped. large masSC\ of dat
2. Missing values necd to be fillet.I in, or those rO\\S should be removed from analysis. 2. Induce the viewer t
1\-li~~ing \ olues c;in bc fi 11.:d in \\ ith 11\'crage or modal or default values. so natural to the dal
3. Data elemcnti. ma)' 1wed lo b.: transformed from om: unit to another. For example, total 3. ~void distorting \\
co~ts of health care and the total number of patients may need to be reduced to cosl/ simplifying. some
patient to allow compar.ibility of that value. 4. Make large dat:i sc
'1. Continuous v,alues may ne.:d 10 be binned into a few buckets to help with some analyses.
For c"ample. work ex1ll!fJence could be binned as low, mcdium, and high. data together 10 tt.:I
5. Encourngc th.: eye•
S. Data elements may need to be adjus~cd to make them comparable over time. For example,
e> c~ \\ 01ild rwtural
currency v:ilues may need to be adJustcd for innation; they would need to be convened
to th.: same base yc.ir for com1,arability. TI1ey may need to be convened to a commo 6. Reveal the data :11
~Ull\ill\.)• n curiosity. and thus
30
fi-g-VCttiv A ~ I M \ '
to an observer. Thus, the 6. Outlier data clements need to be removc(j aficr careful review, to avoid the skewing
of results. For example, one big donor could ~i...ew the analysis of alumni donors in an
in to match the pn:clicllve educational setting.
The predictions of some 7. Any biases in the selection of data should be corrected to ensure the data is representative
rge amount of data have of the phenomena under analysis. If the data includ,"S many more members of one gender
t, the ANNs can begin to than is typicnl of the populmion of intcrcM, then ndJustments need to be applied to the
e. data.
igh predictive accuracy. 8. Data should be brought to the same granularity to ensure comparability. Sales data may
issues with data quality. be available daily, but the s:iles person compensation data may only be avail:ible monthly.
To relate these variublcs, the dat:i must be brought to the lowest common denominator, in
dentifying a set of similar this case, monthly.
of natural groupings of 9. Data may need to be selected to incre:ise infonnation density. Some data may not show
e ori2ed into one cluster,
0 much v:iriability, because it was not properly recorded or for ony other reasons. This data
o~hcr arc categorized into may dull the effects of other differences 1n the data and should be removed to improve the
be produced by the data. information dcnsi!y of the d.ua.
guidance in selecting the c. What is data visu:1lizutio11'! llow would you judge the quality of data visualizations?
(0-' M:irks)
,quc shows the clusten. of Ans. Data Visuali1ntion is the a11 and ~cierice of mnking data easy to understand and consume,
and the allocation of data for the end user. Ideal visuali1ation shows the right amount of d.ita, in the right order, in the
right visual form. to convey the high priority information. The right visuali;:•1 ion requir~s an
can be assigned to their understanding oftlte consumer·s needs, nature of the data, and the many tools and •~~!m1ques
family of tcchniq11cs. available to pr..:senl data. The right visualization arises from a complete understanding oftht:
esix-cially where selling is totality of the situation. One should use visuals to tell a trne, complete and fast-paced story.
mg questions about cross• Data visunti,ation is the last step in the dntn lif.: cyclc. This is wherc thc data is processed for
me u~cd by e-commerce presentation in an casy•to-consuml' manner to the right audience for the right purpose. The
The technique helps find data should be conve11ed into a language and fom1at that is best preferred and understood by
ts). These are represented the consumer of dal:i. The pr.:scntation should aim to highlight the insights from the data in
A form of unsupervised an actionable manMr. If the data is presented in too much detail, then the consumer of that
g anS\\ crs. There an: juM data might los.: intcNsl nnd the insight.
level as~i~ncd to it. A pm1 The quality of data visuali1..1tions can be judged by Excellence Visualization.
status "hen a fascinating Data can b1: prcsl'ntcd in the form of 1cctangular tables, or it can be presented in colorful
graph~ of v:iriou~ t:,-pcs. ·'Small. non comp:ir..itive. highly-lnbeh:d da~a sets usually_ belong
in tables" . 1101\ ever. as th.: amount of data grows. graphs arc prelcrablc. Graphics help
give shape to data. fufie. a pioneering expert on data visualization, presents the following
objectives for grnphical cxci:lh:nce: . . .
I. Sho11, and e,en rc\eal, the dat:i: The data should tell a story, especially a story hidden in
large masses of d.ita. 1101\ever, reveal the data in context, so the story is correctly told.
oved from analysis 2. Induce the vie11cr to think of the subst:incc of the data: The fonnat of the graph should be
lues so naturnl to the dat,1. that it hidcs itself anu lets data shim:.
• For example, tota 3. Avoid distorting wh:it th.: data have to say: Statistics can be us~d to lie. In thc n_am~ of
to be ri·duced to cos simplifying, some crucial context could be rcmo,ed leadi~g to_ dts~orted commu111~at1on.
4. Make large data sets cohcn:nt: 13) giving shape to dntn. v1sualiza11ons cnn h~lp bring the
o help 1\ ith some analysi:s. ili1rn togcther to tell a compr<'h1:nsi1 c St0I') . .
, and high. 5. Encour;ige th..: eyes to compar..: diffcrcnt pieces of data: Organize the chart m 1,ays the
le over time. For example, e)·cs 1\ ould naturally mow to derive insights from the graph: . . .
uld need to be conve11ed 6. Reveal thc cl,lla at scvcrnl kwb of detail: Graphs leads to insights, which raise further
converted to a common curiosity. and thus present,1tions should help get to the root cause.
31
VIII Sem,, (CSE(ISE}
7. Serve a ~casonably ~lear pu~sc. - informing or decision-making.
8. Closely mtegi:ate with the stallsllcal and verbal descriptions of lhe dataset. There should
be no separation of ch~rts and text in presentation. Each mode should tell a complett:
story. Intersperse text \\1th the map/graphic to ~ighlight the main insights.
Modulc-4
7. a. What is a decision tree'! Why arc decision trees the most popular classlfkatlon
technique? (02 Marks)
Ans. A decision tree is a tree where each node represents a fcature(attributc), each link(branch)
represents a decision(rule) and each leaf repn:scnts an outcome(categorical or continues
value).The whole idea is to create a tree like this for the entire data and process a single
Fi,:ure 7.1: Scull
outcome at every kaf{or minimite the error in evef) leaf) (Source: Groebner el al.
Decision trees arc a simple way to guide one's path to a decision. The decision may be a simple
Chart (a) sho,\S a very s
binary one, whether to approve a loan or not Or it may be a complex multi-valued decision,
the value ofy increases A
as to what m:iy be the diagnosis for a particular sickness. Decision trees are hierarchically
between the vuriablcs x
branched structures that help one come to a dcc~ion based on asking cert:iin questions in a
decreases proportionnll
particular sequence. • . . .. Chart (c) shows a curvi
Decision trees arc one of the moM widely used techniques for class1ficat1on. A good dec1s1on the value of) decr.:ase
tree should be short and ask only a fe\\ meaningful questions. They are very efficient to use,
relationship, like an nrc
easy to explain, and their cl:issifica1ion acancy is competitive with other methods. Decision
(quadratic means the
trees can gcn.:rate knowledge from a rc:w &esl instanc'--s that can then be applied to n broad
positive curvilinear relal
populallon. 0-:1.i~io,.. tNec ar,· us,•,1 to answer relatively simple binary decisions.
Jh!JS would nol be a stro
That means variables x
·1,. Whot is a rcl(rcssion model? Whal is a scallcr plot? llow does ii ·help?(6 Muks) candidates that model
Ans. Rc:gression is a wdl-knonn st.1tistical tedmi4uc: to model the predictive relationship between regression cqu:11ion en
sevi:ral indepcnd,mt variables (DVs) and one dependent variable. The objective is to find little: more complex, qu
the best-fitting curve for a deJ)l!nd.:nt variable in a mul1idimensional space, with each order polynomial regrcs
ind.:pcndcnt v,u·ioblc being a dimension. TI1e curve could be a straight line, or it could be a Charts (c) and (f) have
nonlinear curve. or using any other mod
The qualit) of fit ofth.: cun.: to the data can be measured by a cocffiei,mt of correlation (r),
i:.Euminc the steps in •
which is the: square root of the amoun1 of variance explained by 1he curve.
kind or ohjci.:1he func
The key slcps for rcsression a~ simple: price predictor S)Slem
I. List oll tho, voriables available for making the model.
2. Establish a Dcpendcnl Variable (DV) of interest.
Ans. It t:ikes resourc.:s, train
3. E~aminc visual (if pos~ible) rdalionships between variables of inlerest. mining pla1fo1111s offer
4. Find a way lo prc:dict DV using the olhcr variables. required to build :in AN
A s~atter plot (or sca~1cr diagram) is a simple exercise for ptouing all dala poinls belween two I. Gather dnt:i: Divide 1
1
variables on a 1wo-d1111ensional graph. It provides a visual layout of where all the da1a po·nts divided into trainin
are placed in that two-dimensional space. 2. Select the nct,1 ork ar
The s~ath:r_plot can_b.: useful for graphically intuiting the rela1ionship between two variables. 3. Sdecl the algorithm.
Here 1s a picture (figure 7.1) that shows many possible patterns in scatter diagrams. 4. Set 11.:1 \\ ork p,1ra111et
5. Train tin: ANN with '!
6. Validate the model w
7. Freeze the" eights a,
8. Test the trained net"
9. Deploy the ANN uh ,
Otht:r neural nct\\'ork :ir

32
classification ....... ...,_....,
(02 Marlu)
link(branch)
or continues
ess a single
Fig11re 7.1: Sc•11fler p/1,1f.f lltuH·i11g IJ-pn ofrdMJuRsl,ips among two 1•arlab/es
ay be a simple (Source: Groebner et al.20 13)
alued decision, Chart (n) shows a vcry strong linear relationship between the variables x and y. That means
hierarchically the value of y increases proportionally with x. Chart (b) also shows a strong linear relationship
questions in a between the variables x and y. Here it is nn invffW ~lationship. That means the value ofy
decreases proportionally with x.
!A good decision Chart (c) shows a curvilinear relationship. It is an inverse relationship, which means that
efficient to use, the value of y decn:ascs proportionally with x. However, it seems a relatively well-defined
thods. Decision relationship, like an arc ofa circle, ,1hich can be represented by a simple quadratic equation
plied to a broad (quadratic means the power of two, that is, using terms like x2 and y2). Chart (d) shows a
decisions. positive curvilinear relationship. J lowever, it docs not seem to resemble a regular shape, and
thus would not be a strong rcl.11ion~hip. Charts (e) and (f) show no relationship.
That means variables x and y ar..i independent of each other. Charts (a) and (b) are good
~6 Marks)
candidates that model a simple linear regression model (the terms regression model and
~tionship between
bjective is to find regression equation can be used interchangeably). Chart (c) too could be modeled with a
space, with each little more comple:-., quailratic regression equation. Chart (d) might require an even higher
,e, or it could be a order polynomial regression equation to represent the data.
Charts (e) and (f) have no rclation~hip, thus, they cannot be modeled together, by regression
t of correlation (r), or using any other modeling tool.
c. Examine the steps in developing a neural nehtork for predicting stock prices. What
kind of objective function and "hat kind of data would be required for a good stock
price predictor ,ptcm using ANN? (08 Marks)
Ans. It takes resourccs, training data. and skill and time to develop a neural network. Most data
mining platfom1s offer at !cast the MLP algorithm lo implement a neural network. The steps
required to build an ANN ar..i as follows:
I. Gather data: Divide mto training data and test data. TI1e training data needs to be further
divided into training data and validation data.
2. Select the m:t\\ork architecture, such as feedforward network.
een two variables. 3. Select the algorithm, ~uc:h as Multila}cr Pcrception.
diayams. 4. Set network p.ir;unctcrs.
5. Train the ANN with training data.
6. Validate the mo<ld with validation data.
7. Fr..:ezc th.: w..:ights :ind other parameters.
8. Test the traincd net"ork with test data.
9. D..ip/oy 1he /\NN nhen it arhie,·cs good pri:dict_i~e ~ccuracy.. . .
Other neural n..:hvorl- architectures include probab11is11c networks and self-org:u11z111g feature
33
VIII Sem,, ( CSf/ ISf) C8CS · N odel-Q~ p
1
maps. A computer neuron is built
to the neuron marked with
Training an ANN: Training data Is split into three parts
~omputation. The input lay
Trainln2 sci lfhis data set is used to adiust the ·•wei11hts on thc nc:ural nctuoric (- we/4). 1s the a>.on. Each input sig
V1lid1tioa wt This data i;t'I is USl.'tl tominimize over liuinit and verifying aecuracv (- 2Xi /o).
0
inpul value and the neuron
This dala set is used only for testing the final solution in order to confirm nrc compuled in the tmini
Testing act lhe actual pn:dic1ivc: JIO" ,-r ofdtc ndwotk (- 20 ¾) d~ttn\ and back \lrO\)ag;l
Neural Ncl\\orks. An acti
approach mcnn~ lhAI Ibo data ii divided into k equal pieces, and the learn in~
· r'C'pCIIICd It-times
· · h each pieces
· · the trammg
· · set in the output signal of the
lid cross - process ,s
k - fold wit becommg
va a IIon trh·1s process leads to Iess b'1as and more accuracy, but is more time C0l1$uming output of other neurons, a
manner. This is the basic i
Machme leaming• has provcd to improve efficiencies si gnificantl y, and there are many
jobs which have b.:en replaced by smarter and faster machines using artificial intelligence more detail in this article.
or machine lcaming. The stock markets are no exceptions to this. Today, there are several Working of Neural Ncl"
Machine Leaming algorithms running in the live markets. These algorithms often provide We will look at an examp
greater returns than other ahemate algorithms or sometimes even higher than experienced consists of the parameters
traders. In this article, I will talk about the concepts involved in a neural network and how it Our brains essentially hav
can be applied to predict stock prices in 1he live markets. Let us staJt by understanding what see, smell und taste. The ri
a neuron is. emotions and feeling:,, fr
make us act or take deci
Neuron
brains. Therefore, there ar
The first layer takes in th
are the inputs to the next
Hence, in this cxtn:mely
input layer, l \10 hidden I
all know that the brain is
computations arc done in
To understand the 11 orl..in
example, ,~here the 0 1
parameters, there is one
price.
This is the neuron that you . . . lie familiar with, well if you aren't you should now be
pateful that you can understacl 6il kause there are billions of neurons in your brain. There
are three components to a neura. • claldrites, the axon and the main body of the neuron.
1be dendrites arc the receivers o(dlc lipal and the axon is,the transmitter. Alone, a neuron
is not of much use, but when ii ii CIGMCCted to other neurons, it does several complicated
computations and helps openle tbc _ . complicated machine on our planet, the human ~
body.
Input value 1 X, ,

Input value 2 X, 1------ Output signal

Input value m Xm

34
C15CS · ModwQue¢icn,p~e,- - l
A comp~ter neuron is b~ilt in a sim_ilar manner, as shown in the diagram. There are inputs
to the ne~ron mark_ed with yellow circles, and the neuron emits an output signal after some
computation. The input layer resembles the dendrites of the neuron and the output s· al
~s the axon. Each Input sign.ii i) U)~igncd • weight, w, This weight is multiplied b~g~he
input value and the neuron stores the weighled sum of all the input variables. These weights
are computed in the training phase of the neural network through concepts called gradient
descent and back propagation, ,,e will cover these topics in our subsequent blog posts on
Neural Networks. An activation function is then applied to the weighted sum which results
in the output signal of thc neuron. The input signals .ire generated by other n~urons, i.e, the
output of other neurons, and the network is built to make predictions/computations in this
manner. This is the basic idea ofa neural network. We will look at each of these concepts in
morc detail in this article.
Working of Neural Nl!l"orks
We will look at an example to understand the working of neural networks. The input layer
consists of the parameters that ,1 ill help us arrive at an output value or make a prediction.
experienc Our brains essentially have five basic input parameters, which are our senses to touch, hear,
rk and how i sec, smell and taste. The neurons in our brain crca_te more complicated parameters such as
tanding wha emotions and feelings, from these basic input parameters. And our emotions and feelings,
make us .ict or take decisions which is basically the output of the neural network of our
brains Therefore, then: arc 1\1 o layers of compulations in this case before making a decision.
The first layer takes in the five senses as inputs and results in emotions and feelings, which
arc the inputs to the next layer of computations, where the output is a decision or an action.
Hence, in this e>.tremcly simplistic modd of the working of the human brain, we have one
input layer, two hidden layers. and one output l.iyer. Of course from our experiences, we
all know th.it the brain is much more complicated than this, but essentially this is how the
comput.ilions an: done in our brnin.
To undersland the working of a m:ural network, let us consider a simple stock prjce prediction
example, where the 01 ILCV (Open-High-Low-Close-_Volume) valu~s. are the input
parameters, there is one hidden layer and the output consists pf the pred1ct1on of the stock
price.
lnpotl.a,rt
u should now be
your brain. There
ody of the neuron.
•r. Alone, a neuron
veral complicated
planet, the human

loft• .S
- OUtput signal

35
VIII Se,m,, {CS'E/IS'E)

~ ~~~~~: ~~:~ ~~~:!~t~: ~


There are five input parametm as shown · th d" ·
neurons ~nd the_ resultant in output lay~~ iseth:a:;:x~t::~
neurons
d · in the hidden

.
• layer will have different weights for each ofthe fi ve input
·

lo various combinauons of the inputs· for example, the fi rst neuron m1g
parameters
an m~ght have _different ~ct•~alton functions, which will activate the input parameters
according
!ook1?g at tht: ~olumc and the ~ilfermce_ between the Close and the Open price and might be
rgnormg the lllgh and _Low pnc~. In this case, th~ weights for High and Low prices will be
· ht be

ze_ro. Based_on the we1~1s that the modrl has trained itself to attain, on activation function
w1ll_bc applied to th~ ,~e1ghted sum an die neuron, this will result in an output value for that
l
This approach could be succ
n~eds to be optimized. Howe
particular neuron. S1m1larly, the other IWO neurons will result in an output value based on ~•dd~n larers increases, the n
their individual nctivation functions and ~ t s . finally, the output value or the predicted lime II will require to train su
value of the stock price will be the sum oldie three output values ofeach neuron. This is how supercomputer. For this rcasQ
the neural network will work 10 pr(dict SIDCk p-ices. computing the weights of the
Gradient descent involves an
Conclusion
There are two ways to code a pro".PID far pafonning a specific task. One is to define all the slope we adjust the weights, t
rules required by the program t o ~ die result given some input to the program. The values for all possible combi
other way is 10 develop tht: frnmcw• .,_ wllidl the code will learn to perform the specific diagrams below. The first pl
iask by training itself on a datak.'t .._.. ......8 the result it computes to be as close to the can be seen that the red ball n
actual results which have been obsam 11lis process is called training the model, we will function. In the second diagra
function. Therefore, we can
now look at how our neural neouwt will- 111elfto predict stock prices.
The neural network will be gi,ca * ....., which consists of the OHLCV data as the moving in the direction of th]
input and as the output, we "ould •a,w* model the Close price of the next day, this is
duration. With this approach,
computations do not take vet)
the value that we want our modd IO - - •predict.The actual value of the output will be
represented by 'y' und the predicted ¥alue will be represented by y", y hat. The training of
the model involves acijllllia& the wcigbas of the variublcs for all the di!Terent neurons present
in the neW':11 nct•111L . . . . . . . . ailimizing the 'Cost Function'. The cost function, as
the name sugaD ~ - - • SI J a pediction using the neural network. It is a measure
of how far oft"dlt. p I S • • 111n die actua1or ~rved valued, y. Thhefrc fareh many
·ons...... $ 6e most popu1ar one 1s compute as a1 o t c sum
cost fun
--" t d.,.-
Cl IINa . - ,__,_,...,ed v:ilues for the ll'lllll•n&'' dataset.
of sqUUJCU 1uC:il - r-· .
C = L□ 1/2 ( "Y - y) 2

h
The way the neural m:t"ork trains ibelfis by first compu~ing the cost functi~n for t e tra_ining
datnscl for a given set of "eights for the neurons. Then ti goes back and adJUStS the we~ghts,
followed by computing the cost function for the training dataset based on the new weights,
The process of i;ending the errors back to the network for adjusting the weights is called Gradient descent can be do
backpropagation. This is .-.:pealed several times till the cost function has been minimized. We gradient descent and mini-ba
will look al how the weights are adjusted and the cost function is minimiz.ed in more detail is computed by summing al
next. computing the slope and adl
Gradient Descent the cost function and the adji
The weights are adjusted to minimize the cost function. One way to do this is through brute dataset. ·1his h ci-tremely u
force. Suppose we take IOOO values for the weights, and evaluate the cost function for these cost function is not strictly
valm:s. Wh.:n w.: plot the graph of the cost function. we will arrive at a graph as shown below. process 10 arrive at the glo
The best value for weights would be the cost function corresponding to the minima of this getting stuck with a subopl
graph.

36
--------
- - -----
er consists of 3
ck price. The 3
nput parameters
put parameters
euron might be
e and might be
iw prices will be
·vation function '
. be successful for a neural network involving a s,·ngle werg
This approach .could • ht w h.rch
n~eds to be op!rmrzcd. However. as the number of weights to be adjusted and the number of
hidden layers increases, lhe number of computations n:quired will increase drastically. The
time it will require to train such a model will be extremely large even on the world's fastest
sul)l!rcompuler. For this reason, ii is essential to develop a better, faster methodology for
computing lhe "'eights of lite neural network. This process is called Gradient Descent.
Gradient descent involves analyzing the slope of the curve of the cost function. Based on the
to define all the slope we adjust lhe weights, lo minimize the cost f1111Ction in steps rather than computing the
e program. The values for all possible combinations. The visualization of Gradient descent is shown in the
form the specific diagrams below. The first plot is a single value of weights and hence.is two dimensional. It
be as close to the can be seen lhat the red ball moves in a zig-zag pattern to arrive at the minimum of the cost
te model, we will function. In the second diagram, we have to adjust two weights in order to minimize the cost
function. Therefore, we can visualize it as a contour, as shown in the graph, where we are
'LCV data as the moving in the direction of the steepest slope, in order to reach the minima in the shortest
e next day, this is duration. With chis approach, we do not have to do many computations and as a result, the
the output will be computalions do not lake very long, making the training of the model a feasible task.
~t. The training of C
mt neurons present
e cost function, as
ork. It is a measure
,, y. There are many
d as half of the sum
dataset. y

ction for the training


!onadjusts the weights,
the new weights.
c weights is called
n minimized. We Gradielll descent can be done in three possible ways. batch gradient descent, stochastic
ized in more detail gradient descent and mini-batch gradie~t descent. In bat~h g':1dicnt de~~nl, the cost function
is compuled by summing all 1hc individual cost functions rn the trammg dataset and then
computing the slope and adjusting the weights. In Mochastic gradient descent, the slope_ of
his is through brute the cost function and the adjustments of weights are done after each ~~ta e~try ln the trarnrng
1 function for these dataset. This is extremely useful to avoid getting stuck at a local m1~1ma ,f_the curve of the
ph as shown below. cost function is not strictly convex. Each time you run the stocha_strc gradient descent, ~e
the minima of this process 10 arrive 01 the global minima will be different. Ba~c~ gradient d~cent m_ay resul~ '!1
gelling stuck with a suboptim:il result if ii scops ar local minima. The third type rs the mmr-

37
VIII S0011 (CSf(I Sf)

batch gradient descent, which is a combination of the batch and stochastic methods. Herc,
we create different batches by clubbing together multiple data entries in one batch. This lnp
essentially results in implementing the stochastic gradient descent on bigger batches of data
entries in the training dataset. Next, let us understand how backpropagation works to adjust
the weights according to the error which had been generated.
The Design Principl or A
Backpropaga tion
I . A neuron is the basi
Backpropagation is an advanced algorithm which enables us to update all the weights in the
element) receives inputs fo
neural network simultaneously. This drastically reduces the complexity of the process to
adjust weights. If we were not using this algorithm, we would have to adjust each weight computation on the basis
individually by figuring out what impact that particular weight has on the error in the passes on the output to the
the weights for each input
prediction. Let us look at tht: steps involved in training the neural network with Stochastic
X
Gradient Descent:
• Initialize the "eights to small numbers very close to O(but not 0)
• Forward propagation - the neurons are activated from left to right, by using the first data x2
entry in our training dataset, until we arrive at the predicted result y
• Measure the error which will be generated
• Backpropagation - the error generated will be back propagated from right to left, and the
weights will be adjusted according to the learning rate
• Repeat the previous three steps, forward propagation, error computation and back
propagation on the entire training dataset Flg11
• This would mark the end of the fu'st epoch, the successive epochs will begin with the 2. A Neural network is a
weight values of the previous epochs, we can stop this process when the cost function neuron, and at least one p
converges within n certain acceptable limit a simple, single-stage co
OR neuron and the result may
of processing elements in
8. a. What is a neural network? llow docs it work? Explain the Design Principles of an depending upon the comp
Artificial Neural Network. (08 Marks) sequence, or they could w
Ans. Artificial Neural Networks (ANN) are inspired by the information processing model of the
mind/brain. The human brain consists of billions of neurons that link with one another in an
intricate pattern. Every neuron receives information from many other neurons, processes it,
~~~l
x2 ' Wll
gets excited or not, and passes its state information to other neurons. ''-wl,il wl l
x~~x>. ...:-

-
Just like the brain is a multipurpose system, so also the ANNs arc very versatile systems.
They can be used for many kinds of pattern recognition and prediction. They are also used
for classification, regression, clustering, association, and optimizatio,n activities. They are
,,.-..-Iii
used in finance, marketing, manufacturing, operations, information systems applications, and ,"
x~124
soon. lnp
ANNs are composed of a large number of highly interconnected processing elements
(neurons) working in a multi-layered structures that receive inputs, process the inputs, and
produce an output. An ANN is designed for a specific application, such as pattern recognition 3. The processing logic of j
or data classification, and trained through a learning process. Just like in biological systems, input streams. The processi
ANNs mak~ adjustments to the synaptic connections with each learning instance. function, from the proces!
ANNs arc_ like a ~la~k box trained into solving a particular type of problem, and they can intermediate weight and pr
develop high prcd1c11vc powers. Their intermediate synaptic parameter values evolve as the in its objective of solving a
sy~tem obtains feedback on its predictions, and thus an ANN learns from more training data an opaque and a black-box
(Figure 8.1 ). 4. The neural network can
many training cases. It w.
communication based on
become better at making a

38
stochastic methods. Herc,
entries in one batch. This
t on bigger batches of data
Inputs L :;:!:§~ § ~utpu\~

ropagntion works to adjust Figure 8.1 Ge11na/AN/II aodel


The Design Principl or ANN :
I. A neuron is the basic proeessin& unit or Ille aetwork. The neuron (or processing
pdate all the weights in the element) ~eceives input~ from its ~ceding neuroas (or PEs), ~oes some nonlinear weighted
plexity of the process to computallon on the basis of those inputs, transforms lbc result mto its output value, and then
ave to adjust each weight pasS<..'S on the output to the next neuron in the netwodt (figure 8.2). X·s are the inputs, w's are
t has on the error in the the weights for each input, and y is the output.
I network with Stochastic x~
• w1
t 0) x2 -........ ,,__ _ _ _ _ __

x3 -::-~,tm-□'tf
right, by using the first data
ult y 'imat~!!_ Y
x4 ·

Fig11re 1.2: Mod~/ for• slnrk •rtlftd•I nnron


2. ANeural network is a D1ulti-layered modeL Thae is at least one input neuron, one output
neuron, and at least one processing neuron. An ANN with just this basic structure would be
a simple, single-stage computational unit. A simple task may be processed by just that one
neuron and the result may be communicated soon. ANNs however, may have multiple layers
of processing elements in sequence. There could be many neurons involved in a sequence
Design Principles or •• depending upon the complexity of the predictive action. The layers of PEs could work in
(08 Marks) sequence, or they could work in parallel (Figure 8.3).
proc~ssing model of the x1........ NeuronH
link with one another in aa '· wln-,"!_!11 N~ronll W2!~ IJ2ll/21 J
other neurons, processes it,
roas.
x3~~ff1 II
- ~12~ - -
fll
r:-,\,~~
l/ "W~l NeUr0l\12
~
~,u Neuron31
jY
x'.3- > . " lleuronl 2 CJ22 /22 . 0,1 /31
are very versatile systems. .rt-23"'. 1 . · ."'2 di
iction. They are also used , ...,•: 1012! n2 j'<. ' w2"1 ~-h,uroN J / /
ion activities. They are x4 _-'wi24 v12',•tm1I 12,.}'
systems applications. and
Input N- on Hidden Outpt.t Neuron

processing elements Figure I.J: Mudel/ur • 11111/tl-l•yer ANN


process the inputs, and 3. The processing logic of each neuron may assign dilferent weights to the various incoming
such as pattern recognition input streams. The processing logic may also u'se nonlinear transfonnation, such as a sigmoid
like in biological systems, function, from the processed values to the output value. This processing logic and the
ing instance. intennediate weight and processing functions are just what works for the system as a whole,
of problem, and they can in its objective of solving a problem collectively. Thus. neural networks are considered lo be
eter values evolve as the an opaque .and a black-box system.
from more training data 4. The neural network can be trained by making similar decisions over and over again with
many training cases. It will continue 10 learn by adjusting its internal computation and
communication based on f1.-edb11ck about its pn:vious decisions. Thus, the neural networks
become better at making a decision as they handle more and more decisions.

39
VIII Sem, ( CSE/ISE)
Depending upon the nature of the problem and the availability of good training data, at some
point the neural network will learn enough and begin to match the predictive accuracy of Now append the scr-J
a h~man e~pert. _In many practical situations, the predictions of ANN, trained over a long number of columns
penod ofllme with a large number of training data, have begun to decisively become more original data and the I
accurate than human e>.pcrts. Al that point ANN can begin to be seriously considered for data to label records
deployment in real situ:itions in'renl time. Generate a predictive
sets. lf it is impossible
b. What is unsupervised learning? When Is ii used? (04 Marks) artifacts then there is !
Ans. Unsupervised learning, by contrast, docs not begin with a target variable. Instead the objective st rong strncture in the
is to find groups of similar records in the data. One can think or unsupervised learning as In the CART model se
a form or data compression: we search for a moderate number of representative records of Original records d ]
to summarize or stand in for the original database. Consider a mobile telecommunications nodes reveal patterns •
company with 20 million customers. The company database will likely contain various randomized anifact.
categories of infonnation including customer characteristics such as age and postal code, We don not expect the i
product infonnation describing the customer's mobile handset, features of the plans the of Original from Copy
subscriber has selected, details of the sub~crib.:rs use of plan fcntures, and billing and payment interesting data group~
infonnation. Although it is almost certain that no two subscribers will be identical on every This approach to uni
detail in their customer records, we would expect to find groups of customers that are very technology because:
similar in their ovcrnll pnttern of llcmoi;rnphics, selected equipment. plan use, and spending • Variable selection
and payment behavior. If we could find say 30 representative customer types such that the groups of variable
bulk of custonms an: \\Cl! described as belonging to their "type". this information could be • Preprocessing or re
very useful for marketing. planning, and new product development. We cannot promise that influenced by how
we can find clusters or groupings in data that you will find useful. But we include a method • Missing values pr
quite distinct from that found in other statistical or data mining software. CART and other • The CART-based
Salford data mining modules now include an approach to cluster analysis, density estimation select the optimal 1
and unsupervised learning using ideas that we trace to Leo Brciman, but which may have
been kno,, n informally In among statisticians at Stanford and elsewhere for some time. The c. What arc assoriatio
Ans. Association rnll! mini
method detects structure in data by contrasting original data with randomized variants of that
help identify shoppin
data. Analysts use this melhod implicitly ,,hen viewing data graphically to identify clusters
interesting relationshi
or other structure in d:11a visually. Take for example customer ages and handsets owned. If
there is a p:lllern in the data then we expect to see certain handsets owned by people in their cross-sell related item
AII data used in this te
early 20's. and rather different handsets owned by customers in their early 30's. If every
learning algorithms. '
hnndscr is just as likely to be owned in every age group then there is no structure relating
these two data dunensions. The method we use generalizes this everyday detection idea to how it is often e:-.plaii
high cfimcnsions. of- sale transaction da
among it.:ms. An exn
The method consists of these slcps:
compulcr and vims
Make a copy of the original datn, and then randomly scramble each column of data separately.
the time."
As an exam pk st.lrting with dnta typical of n mobile phone company, suppose we randomly
exchanged date or bi11h informmion at random in our copy of the database. Each customer In business environ~
marketing, it is used I
re~ord woul~ likely contnin age information belonging to another customer. We now repeat
th~s _process •~ every column of the datn. Orciman uses a variant in which each column of design. online :td\eni
ongmal data 1s rcplaced \\ ith a bootstrap resample of the column and you can use either This analysis can su
method in Salford sofiware. of products promote
In retail environment!
Note that all 11e have donl! is mo,ed infonnation about in the database, but other than moving
data we not changed an) thing. So nggregates such as averages and totals will not have close tougher for cus~
cha?gcd. Any on~ customer rccord is now a "Frankenstein" record, with item of information the cuslomer has to \~
having been obt~ined from a diffl!rcnt cuslomer. Thus, date of birth might be from customer In medicine, this tee
IO 1135. the service plan taken from customer 456779 and the spend data from 98700 I. dingnosis and patient
Represi:nting Associ
40 s..,,..-,....... E,,,..... Sc.,.,,,,...,. s......-,....... E,,,..... Sc.,.,,,,...,.
of good training data, at some Now append the scrambled data set to the original data. We therefo~ now have the _same
h the predictive accuracy of number of columns as before but twice as many rows. The top pon1on of the data 1s the
of ANN, trained over a long original data and the bottom portion will be the scrambled copy. Add a new column to the
n to decisively become more data to label records by their data source (''Original" vs. "Copy").
be seriously considered for Generate a predictive model to attempt to disc~iminate between ~e Original ~nd Copy data
sets. If it is impossible to tell, after the fact, which records are original and which are random
(04 Marks) artifacts then there is no structure in the data. If it is easy to tell the difference then there is
variable. Instead the objective strong structure in the data. . . .
of unsupervised learning as In the CA RT model sepnrating the Original from the Copy records, nodes with o high fraction
r of representative records of Original records define regions of high density and qualify_as potential "clusters"._Such
mobile telecommunications nodes reveal patterns of data values, which appear frequently m the real data but not m the
will likely contain various randomized ai1ifact.
uch as age and postal code, We don not expect the optimal sized tree for cluster de1ection to be the most accurate separator
t, features of the plans the of Original from Copy records. We recommend that you prune•back to a tree size that reveals
tures, and billing and payme interesting data groupings.
rs will be identical on every This approach to unsupervised learning represents an important advance in clustering
ps of customers that ace very technology because:
ment, plan use, and spending • Variable selection is not necessary and different clusters may be defined on different
customer types such that th groups of variable
pe", this information could • Preprocessing or rescaling of the data is unnecessary as these clustering methods are not
ment. We cannot promise th influenced by how data is scaled
eful. But we include a meth • Missing values present no challenges as the methods automatically manage missing data
ing software. CART and oth • The Ci\RT-bnsed clustering gives easy control over the number of clusters and helps
er analysis, density cstimati select the optimal number
Breiman, but which may ha
c. What arc association rules? llow do they help? (04 Marks)
elsewhere for some time.
Ans. Associa1ion rule mining is a popular, unsupervised learning technique, used in business to
ith randomized variants ofth
help identify shopping pallcrns. It is also known as market basket analysis. It helps find
graphically to identify cluste
interesting relationships (allini1h:s) between variables (items or events). Thus, it can help
r ages and handsets owned.
cross-sell related items and increase the size of a sale.
dsets owned by people in the'
in their early 30's. If eve All data used in this technique is categorical. There is no dependent variable. It use~ machine-
there is no structure relati learning algorithms. The fascinating '·relationship between sales of diapers and beers" is
is everyday detection idea how it is ollen explained in popular literature. This technique ac.cepts as input the raw point
of- sale transaction data. The output produced is the description of the most frequent affinities
among items. An example ofan association rule would be, "a Customer who bought a laptop
h column ofdata separately. computer and virus .protection so fl ware also bought an extended service plan 70 percent of
lhe time."
pany, suppose we random
the database. Each custom In business environments a pauern or knowledge can be used for many purposes. In sales and
customer. We now re marketing, it is used for cross-marketing and cross selling, catalog design, e-commerce site
in which each colum design. online adve11ising optimization, product pricing, and sales/promotion configurations.
n and you can use e This analysis can suggest not to put one item on sale al a lime. and instead to create a bundle
of.products promoted as a package to sell other nonsclling items.
base, but other than movinJ In retail environments, it can be used for store design. Strongly associated items can be kept
and totals will not have close tougher for customer convenience. Or they could be placed far from each other so that
with item of information lhe customer has to walk the aisles and by doing so is potentially exposed to other items.
inh might be from customer In medicine, this technique can be used for relationships between symptoms and illnesses;
pend data from 98700 I. diagnosis and patient characteristics/trllatments: genes nnd their functions; and so on.
Representing Association Rules

41
VIII Sem, (CSf/IS'E)
CBCS • Mod<!l,Q"""'°"'l
A generic rule is rcpresenicd berwccn a ser X and Y: X ⇒ y (S%, C%J w.,.... , .....
X, Y: products and/or services
X: LeO-lrnnd-side (LHS or Anrecedcnl) .. ... ......
~inw
O\·-ett11:
...,
~<
::J
!!!!
Y: Right-hand-side (RHS or Consequenr)
S: Support: how often X and Y go together in the total transaction set
C: Confidence: how often Y goes together wi1h X
.......
~

O\'Mllf
r..- .,,,
....,,... i
.:.:.::.
Example: {Laptop Computer, Antivirus Software} ⇒ {Extended Service Plan} [30%, 70%]
Module -5
n..ov
•.a~"\'W
r1A ..,.,
......
,;r

'"""..,... ..,.
9. a. Why ls text mining useful in the age of sodaI media? (04 Marks)
Ans. Text mining is the art and science of discowring knowledge, insights and patterns from an
1°'1.·.-11'11•t
01:~rr,1~
RA'•"
....,.
•,c
organized collcction of textual databases. Textual mining can help with frequency analysis of Step 3: Now, use Naive
important terms, and their semantic relationships. class. The class with th.: I
Text is an important part of th.: growing data in the world. Social media technologies have Problem: Players will pl:i
enabled users to become producers of text and images and other kinds of information. We can solvc ii using abo
Text mining can be applied to large-scale social media data for gathering preferences, P(Ycs I Sunny) - P( Sunn
and measuring emotional sentiments. II can also be applied to societal, organizational and llere we havi: P (Sunny I
individual scales. Now, P (Yes l Sunny) = o.
Text mining works on texts from practically any kind of sources from any business or non- Naive Oaycs uses a simi
business domains, in any formnts including Word documents, PDF files, XML files, text various a"ributes. lliis a
messages, etc. Herc arc some representative examples: having multiple: classes.
I. In thd legal profession. text sources would include law, court deliberations, court orders, Naive Oa) .:s stand for:
etc. The "-Ord Bayes refers to
2. In acadcm ic research, it would include texts of interviews. published research articles, etc. B:iycs) which cornpulcs t
J. The: world of finance will include statutory reports, internal reports, CFO statements, and also on the basis of prior
more. l11e word Naive reprc~cu
4. In medicine, it would include medical journals, patient histories, discharge summaries, etc. arc inJc:pc:nJcnt variables
S. In marketing, it would include advertisi:ments, customer comments, etc. height , weight, age, g1md
6. In the world of technology nnd search, it would include patent applications, the whole of
c. What is II su11port \'Ccto
information on the world-widi: web, nnd more.
Ans. A Supp0rt Vector Mach inc
b. What Is a Naive-Bayes technique? What does Naive & Bayes stand for? hyperplane. In other wor
(08 Marks) outputs an optimal hypi:'11
Ans. Naive Bayes al&orithm : Naive Bayes is a simple technique for constructing classifiers and lhis hypi:rplane is a lined'
models that assign class labels to problem instances. rcpre51:nted as vectors of feature values, Confu~ing? Don't worry,
where the class labels are drawn from some finite set. It is not a single algorithm for training Suppose you arc given pl
such classifiers, but a family of algorithms based on a common principle: all naive Bayes decide a separating line fi
classifiers assume that the value of a particular feature is independent of the value of any
other feature, given the class variable.
Naive Bayes alaiorilhm works :
Let's understand it using an example. Below I have a !raining data set of weather and
corresponding target variable ·Play' (i.uggesting possibilities of playing). Now, we need to
classify whether players will play or not based on weather condition. Let's follow the below
steps to perfonn it.
Step I: Convert the data set into a frequency table Image A: D
Step 2: Create Likelihood table: by finding the probabilities like Overcast probability • 0.29 You might have come
and probability of playing is 0.64. separatl:ls the two classes

42
s.u..+.-t> e.r- ~
rV~A~ CBCS - ModeltQu.e~t"'i.ct'\IPoi.pe,.,- · l

wenttt
...
FIii'

.........
o o] >Jc ...,_,_tatl1
"""""
0\.'4f'Cllt
narav
.
,.1i1t"l4t

.,=··•:•r.
hO
4 ....s,;1,
, ...
....
oz I

......
..._,. > l I
~"'""'"' ~,w
!,tll'l•w ~
,
> d 1 !4 O.>< I
0\'4rtllf
(.A "IV
rt,\f'\y
~r
'-ir
A'I
.I\,,..., ..
5

..
rl'/1.J
,
.,.
vice Plan} [30%, 70%] ",Ot'l'W
nA "'iV

'•tJfl'W
...
...,
(')\•.-n-Hl' .,.
(04 Marks) t-..i.'.."l'T
...,._..
R#li-,V ~;r
ts and patterns from an
ith frequency analysis of Step 3: Now, use Naive Oayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of prediction.
edia technologies have Problem: Players will play if weather is sunny. Is this statement is correct'/
r kinds of information. We can solve it using above discussed method of posterior probability.
r gathering preferences, P(Ycs I Sunny)== P( Sunny I Yes) • P(Ycs) / P (Sunny)
cietal, organizational and Herc we have P (Sunny /Yes)= 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64
Now, P (Yes I Sunny)= 0.33 • 0.64 / 0.36 = 0.60, which has higher probability.
om any business or non• Naive Bayes uses a similar method 10 predict the probability of different class based on
F files, XML files, text various allributes. This algorithm is mostly used in text classification and with problems
having multipl.i class.:s.
eliberations, court orders, Naive 13ayes scand for: •
The word Bayes refers lo llaysian analysis (based on the work of the mathematician Thomas
shed research articles, etc. 13ayes) which com pules the probability of a new occum:nce not only the recent record, but
orts, CFO statements, and also on the basis of prior experience.
The word Narvc represcncs the strong assumpcion chat all the parameters of the instances
discharge summaries. etc. arc in<lcpemlent variables with little or no correlation. Thus if people are identified by their
ents, etc. height, weight, age, ~ender, all these variables are nssunh:<l to be independenc of each other.
applications, the whole of
c. Whut is a su1111orl \'cctor 11111chinc'! (04 Marks)
Ans. A Support Vector Machine (SVM) isa discriminative classificrfonnallydefincd by a separating
stand for? hyperplane. In ocher words, given labeled training data (supervised learning), the algorithm
(08 Marks) outputs an optimal hyperplane which categorizes new examples. In two dimentional space
onstructing classifiers and this hyperplane is a line dividing a plane in two parts where in each class lay in either side.
s vectors of feature values, Confusing'/ Don't wony, we shall learn in laymen terms.
ingle algorithm for training Suppose you are given plot of two label classes on graph as shown in image (A). Can you
principle: all naive Bayes decide a separating line for the classes?
endent of the value of any

•••
data set of weather and •• •
laying). Now, we need to • ■ II
'on. Let's follow lhe below • •
Image A: Drawn line that separates black circles and blue squares.
You might have come up with something similar to following image (image B). It fairly
separates the two classes. Any point that is left of line falls into black circle class and on right

43
VIII Se,m, (CSE/IS£)
C8CS • Model,Q~
falls into blue square class. Separation of classes. Thal's whal SVM does. It finds out a line/
hyper-plane (in multidimensional space that separate outs classes), Shortly. we shall discuss pages. There are two bas
why I wrote multidimensional space. I. Hubs: These are pagei
gathering point, where pe
com, or government site!
•••
••
com and yelp.com could
2. Authorities: Ultimate;
• •• • complete and authoritat
■ ■ information, news, advic
of inbound links from ot~
page for expert medical 0
Image B: Sample cut to divide into two classes. news.
Web usage mining
OR As a user clicks anywhe
10..a. What are the three types of web mining? (10 Marks) entities in many location
web server providing the
Ans. The web could be analyzed for its structure as well as content. The usage pattern of web
pages could also be analyzed. Depending upon objectives,_,~eb mining can be divide~ i?to _ activity on those pages.
proxy server, or ad server,
three different types: Web usage mining, Web content mmmg and Web structure mmmg
The goal of web usage min
(Figure I 0.1 ).
through Web page visits a
access logs, referrer logs,
usage profiles are also g

- -
( Web Mining )

• metadata, such as page alt


The web content could be
I. The server side analysi•
Web Content Mlnlnr, Web Structure Mining: Web Usage Mining: Those websites could be h
Using HTML pages Using URL links using visits, clicks, logs 2. The client side analysis
and created by users.
Figure: JO.I Web Mi11i11g slr11cl11re I. Usage pattern could be
Web content mining for patterns of sequence
A website is designed in the form of pages with a distinct URL (universal resource locator). Clickstream analysis can
A large website may contain thousands of pages. These pages and their content is managed research, and analyzing e
using specialized software systems called Content Management ~ystems. Every _page ~an 2. Textual information ac
have text, graphics, audio, video, fonns, applications, and more kmds of content mcludmg text mining techniques. T
user generated content. . . . . technique to build a Term-
The websites keep a record of all requests received for its page/URLs, mcludmg the analysis and association r
requester infonnation using 'cookies'. l11e log of these requests could be analyzed to gauge sentiment analysis.
the popularity of those pages among different segments of the population. The text and
application content on the pages could be analyzed for its usage by visit counts. l11e pages
on a websit11 themselves could be analyzed for quality of content that attracts most users. Website
Vlsltors1
Thus the unwanted or unpopular pages·could be weeded out, or they can be transformed with Website U$o.n.
different content and style. Similarly, more resources could be assigned to keep the more Customers

popular p<1ges more fn:sh and inviting.


Web structure mining
The Web works through a system ofhyperlinks using the hypertext protocol (http). Any page Fig,
can create a hyperlink to any other page, it can be linked to by another page. The intertwined Web usage mining has man
or self-referral nature of web lends itself to some unique network analytical algorithms. The previously learned rules an
structure of Web pages could also be analyzed to examine tlu: pattern ofhypedinks among It can also help design cro

44
~V~A~
I does. It finds out a line/ pages. There are two basic strategic models for successful websites: Hubs and Authorities.
Shortly, we shall discuss I. Hubs: These are pages with a large number of interesting links. They serve as a hub, or a
gathering point, where people visit to access a variety of information. Media sites like Yahoo.
com, or government sites would serve thal purpose. More focused sites like Traveladvisor.
com and yelp.com could aspire to becoming bubs for new emerging areas.
2. Authorities: Ultimately, people would gravitate towards pages that provide the most
complete and authoritative information on a particular subject. This could be factual
information, news, advice, user reviews etc. These websites would have the most number
of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative
page for e;,.pert medical opinion. NYtimes.com would serve as an authoritative page for daily
news.
es. Web usage mining
As a user clicks anywhere on a webpage or application, the action is recorded by many
entities in many locations. The browser at the client machine will record the click, and the
(10 Marks)
web server providing the content would also male a record of the pages served and the user
he usage pattern of web activity on those p:iges. The entities between the client and the server, such as the router,
ning can be divided into pro;,.y server, or ad server, too would record that click.
d Web structure mining The goal of web usage mining is to extract useful information and patterns from data generated
through Web page visits and transactions. The activity data comes from data stored in server
access logs, referrer Jogs, agent logs, and client-side cookies. The user characteristics and
usage profiles are also gathered directly, or indirectly, through syndicated data. further,
metadata, such as page attributes, content attributes, and usag9tatn are also gathered.
The web content could be analyzed al multiple levels (Figure 10.2).
I. The server side analysis would show the relative populnrity of the web pages nccessed.
Those websites could be hubs and authorities.
2. The client side analysis could focus on the usage pattern or the actual content consumed
and created by users.
I. Usage pattern could be: analyzed using 'clickstream' analysis, i.e. analyzing web activity
for pallerns of sequence of clicks, and the location and duration of visits on websites.
iversal resource locator). Clickstream analysis can be useful for web activity analysis, software testing, market
their content is managed research, and analyzing employee productivity.
Systems. Every page can 2. Textual information accessed on the pages retrieved by users could be analyzed using
inds of content including text mining techniques. The te;,.t would be gathered and structured using the bag-of-words
technique 10 build a Term-document matrix. This matrix could then be mined using cluster
ge/URLs, including the analysis and association rules for patterns such as popular topics, user segmentation, and
Id be analyzed to gauge sentiment analysis.
population. The text and Si.Ulm.A Mine Data for
visit counts. The pages PctRi!:'9 dos, u,~,g~ PJttcrn~
foe ao-1lvil1 u~, Prot,ies
that attracts most users. Webslto
Wcbl.oCt,
lde#1Ufv u,e,,
..w.t, p:a,re
v,slton, ldcm,fv vi1l11
can be transfonned with Wobsltc Unirs, CUtk st,.e.1ms
Identify PO&O
proftles
•Web
igned 10 keep the more Customcirs views
op1lmlt~lon
-summar,zo
o1cl1V'lty dal:t

ocol (http). Any page Figure: 10.2 Web Usage Mi11i11g arc//itcct11rt
paite. The intertwined Web usage mining has 11ld11y bu~incss applications. It can help predict user behavior based on
lyrical algorithms. The previously learned rules and users' profiles, and can help determine lifetime value of clients.
of hyperlinks among It can also help design cross-marketing strategies across products, by observing association

45
VIII Se.tw (CSE/IS'E)
'B(,g,Vam-A~
rules a~ong the pages on the website. Web u_sage can help evaluate promotional cam a, ns
and see rfthe ~s~rs were attracted to the websnc and used the pages relevant to the cam~i'!n.
_Web usage mmrng could be used to present dynamic information to users based on their
interests and profiles. This includes targeted on line ads and coupons at user groups based on Eigl1U
user access panems. C
b. What is social network analysis? faplain the a pplications of SNA? (06 M11rks) Time: 3 hrs.
Ans. Social Network Analysis (SNA) is the mapping and measuring of relationships and streams ,._ Note : ; f11swer Q11y Fl VE
between people, groups, organizations, computers, URLs and other sources of information
that arc connected. M:inagcment consultants in particular use Social Network Analysis to
map their business relations and further investigate their mutual relationships. This can be
compan:d to a central nervous system that connects everything. I . a. Expluln Gcncrul HOF:
Social networks are a graphical representation of relationships among people and/ or entities. Ans. General HOFS Comm
SN/\ is the art nnd science of discouvcring patterns of interaction and influence within the Examples in this scctior
participantsinnnetwork.Thcseparticipantscouldbcpeople,organizations,mnchincs,eoncepts.or S hdfs version
any othcr kind of entities. Then: nrc two mnjor levels ofsocial nel\\ ork analysis: discovering Hadoop 2 .6 .0 .2 .2 _4 .,
sub- networks within the network, and ranking the nodes to find more important nodes or Subversion
hubs. 22aS6~cl>l:448969d0790
Computing the relative influence of each node is doni: on the basis of an input• output matrix Compiled with protoc 2
offlo\\S of influencc among the nodes. Fr~m source with check
Applications orSNA: using /usr/hdp/2 .2 .4 _2_.
• Sclf-awnrcm:ss 1-lDFS provides a series o
• Communities A list of those comman
• Marketing these commands will be t;
• Public henllh S hdfs dfs
Sctr -awareness: Visualizing his/her social network can help a person organize their Usage:. hadoop fs (generi
relationships and support network [•cat [·1gnoreCrc] <src::,.
Communities: SNA can help identification. construction, and strengthening of networks {•checksum <src>...J ..
within communities to build wellness. comfort, and resilience. Analysis of joint authoring [-chgrp [·RJ GROUP PA
relationships and citations help identify subnet works of specializations of knowledge in [·chown f·RJ <MODE(
an academic field. Rescarchcrs nt Northwestern university found that the most detenninant [-copyFromLocal HJ [.'pJ
factor in the success of n broad way play was the strength amongst the crew and cast. (-count (·g) f·h) <path>...
Mnrkcling: there is a popular network insight that any two people arc related to each other [•cp HJ [·p] ·p[topax] J <
through at most seven degrees of links. Organizntions can use this insight to reach out with [•createSnapshot <snapsh
their message to large number of people and also to listen actively to opinion leaders as [-dt!lelesnapsho1 <snapsh
ways to understand their customers needs and behaviors. Politicians can reach out to opinion {-df(-hJ <path::,., ..)
leaders to get their message out. (-du (-sJ [·hJ <path>...] {•c
Public health: Awareness' of network can help identify the paths that certain diseases take [·gel [· pJ [•ignoreCrcJ [-er
to spread. Public health profossionals can isolah: and contain diseases before they expand to f•fetfa11r (·RJ [•n name I ·d
other networks. [•help [cmd...] J
[•Is (·dJ [•h] [·RJ [<path>..
{·mkdir {·pJ <palh>...) {mo
[•movcToLocal <src::,. <loc
[·put HJ f•p) HJ <localsrc
(•rt:nameSnapshot <snapsh
[·rm HJ [•r l ·RJ {•skipTra
[-rmdir [-ignorc-fail-on-n
[•setfacl [·RJ [{·b I •k} {·m
<acl_spec> <path>] J
46
te promotional campaigns
s relevant to the campaign. E ighth Semester B.E. Degree Examimatioo,
on to users based on their
ons at user groups based on CBCS - Model Question Paper - 2
BIG DATA ANALYTICS
SNA? (06 Marks) Time: 3 hrs. Mox. Marks: 80
f relationships and streams Note: Am111er any FIVE/11ll ,111eltio,u, lelet:tlng ONl::ful/ que:itlonfrom eac:11 m odule. ~
1,..

ther sources of information


ocial Network Analysis to Mo<lule-1
I relationships. This cnn be I. a. Exph1in Cencral IIDFS commands 111 di:tall (08 Marks)
Ans. General HDFS Commands: The version of HDFS can be found from the version option.
ong people and/ or entities. Examph:s in this section are run on the UDFS version shown here:
ion and influence within the S hdfs version
ntions,machines,concepts_.or Hadoop 2 .6 .0 .2 .2 .4 .2-2
etwork analysis: discovenng Subversion git@github.com:hortonworks/hadoop.git -r
. nd more important nodes or 22a563cbe448969d07902aed869ac l 3c652b2872 Compiled byjenkinson 20 I5-03-31 TI 9:49z
Compiled with protoc 2 .5 .0
sis of an input• output matrix From source with checksum b348 1c2cdbc2dl8lf2621331926e267 This command was run
using /usr/hdp/2 .2 .4 .2-2/hadoop/hadoop- Common-2 .6 .0 .2 .2 .4 .2-2 .jar
HOFS provides a series of commands similar to those found in a standard POSIX file syslem.
II. lisl of those commands can be obtained by issuing the following command. Several of
these commands will be highlighted here under the user account hdfs.
$ hdfs dfs
Usage: hadoop fs [generic op1ions] [-appendToFile <localsrc>...<dst>]
help a p;rson organize their [-cal [-ignon:Crc] <src>...]
[-d1ecksum <src>...]
nd strengthening of netwo~ks [-chgrp (-R] GROUP PATH...]
e. Analysis of joint authon~g [-chown [·RJ <MODE[,MODEJ...1OCTALMOOE> P/1.TH ...]
ecializations of knowledge in [-copyfromLocal (•f] (·p] (•I) <localsrc>...<dst>)
und that the most determinant [-count [•g] [-h] <path>...]
ongst the crew and cast. [-cp [·f] [·p) -p(lopax]] <src>...<dsl>]
people :m~ related to each ot~cr [-crcaleSnapshot <snapshotOir> [<snapsho1Narne>J ]
sc this insight to reach out with [-deletesnapshot <snapshot Dir> <snapshotName>]
actively to opinion leaders as [•df{-h] <path> ...]
iticians can reach out to opinion [-du [•s] (-h] <path> ...) [-expunge)
(-gel (·Pl (-ignorcCre] [•ere] ""'src>...<localdsl>) [•getfacl [·R) <path>
aths that certain diseases take
[-fetfallr [·R] [-n nnme I ·d] [•i: en] <path> [-getmerge [-nl] <src> <localdst>]
1 diseases before they expand to {-help [cmd...]]
[-Is [·d] [-h] [-R) [<path>...] J
[-mkdir [·p) <path> ...] [movcFromLoeal <localdst>...<dst>]
[-moveToLocnl <src> <localsre>...<ds1>] [-mv <src>...<dst>]
[-put [-f] [·Pl (-1] <localsrc>...<dst>]
[-n:namcSnapshot <snapshot Dir> <oldName> <newNamc>]
[-rm [•f] [•r j -R] {-skipTrash] <src>...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-se1facl [-RJ [{·bl •k} {-m 1-x <ael_spec.} <path>] I [--set
<acl spec> <palh>l 1

47
VIII Set111 (CSE/ISE) Blg,VattvA ~

[-setfattr {-n name [-v value] 1-x name} <path>] (-setrep [·RJ [-w] <rep> <path>...)
[•stat [format] <path>..] drwxr-xr-x -hdfs
[-tail -[-f] <file>] drwxr-xr-x -hdfs
[~test -[dcfsz] <path>] d,wxr-xr-x -hdfs
[-text [-ignoreCrc] <src>...] [-touchz <path>...]
drwxr-xr-:-. -hdfs
[-truncate [-w] <length> <path>...] [-usage [cmd...]]
Generic options supported are drwxr-xr-x -hdfs
-conf <configuration file> specify an application configuration file drwxr-xr-x -hdfs
-D <property=value> use value for given property The samc result l·an be
-fs <local I namenode:port> specify a namenode $ hdfs dfs -Is /user,hdts
-jt <local lresourcemanager:port> specify a ResourceManager Make a Directory in I
-tiles <comma separated list offi lcs> specify comma separated files to be copied to the map To make a directory H9
reduce cluster path is supplied. tht: us
-libjars <comma separated list of jars> specify comma separated jar files to include in the S hdfs dfs -mkdir stuff
class path. Copy Files to IIDFS
-archives <comma separated list of archives> specify comma separated To copy a file from you
archives to be unarchived on the compute machines. full path is not supplit:d
The general command line syntax is in the din:ctory stuff tha
bin/hadoop command [genericoptions) [commadOptions] $ hdfs dfs -put test stuff
List Files in HDFS The file transfer can be
To list the files in the root HDFS directory, enter the following command: $ hdfs dfs -Is stuff
$ hdfs dfs -Is / ' Found I iteri1s
Found 10 items -rw-r--r--- 2hdfs hdfs 12
drwxnvxrwx -yarn hadoop 0 2015-04-28 16:52 /app-logs Copy Files from IIDFS
-hdfs hdfs 02015-04-21 14:28/apps Files can be copied back
drwxr-xr-x
the file we copit:d into H
drwxr-xr-x -hdfs hdfs 0 2015-04-2 1 I0:53 /benchmarks name test-local.
drwxr-xr-x -hdfs hdfs 02015-04-21 15: 18/hdp $ hdfs dfs -get stulli'test t
drwxr-xr-x -mapred hdfs 02015-04-21 14:26 /mapred Copy files Within 1101
drwxr-xr-x -hdfs hdfs 02015-04-21 14:26 /mr-history The following will copy
S hdfs dis -ep stuff/test te
drwxr-xr-x -hdfs hdfs 0 20 15-04-21 14:27 /system Delete a File within 110
drwxrwxrwx -hdfs hdfs 0 2015-05-07 13:29 /tmp TI1e following command
drwxr-xr-x -hdfs hdfs 0 2015-04-27 16:00 /user prcviously:
02015-05-27 9:0 I /var $ hdts df.s -m1 test.hdfs
drwx-wx-wx -hdfs hdfs
Movc;:d: ·hdfs: / /limulus:
To hst files m your home directory, enter the following command.$ hdfs dfs -ls
hdfs/ .Trash/Current
Found 13 items Note thrtt when the ts.tr.i
drwx---·· · -hdfs hdfs 0 2015-05-27 20:00 .Trash deleted files are moved to
drwx------ -hdfs hdfs 0 2015-05-26 15:43 .staging -skipTrash option.
drwxr-xr-x -hdfs hdfs 02015-05-28 13:03 Distributedshc 11 S hdfs dfs -rm -skipTrash
Oi,ll;ti, u Diri,c to ry in HO
drwxr-xr-x -hdfs hdfs 0 2015-05-14 09: 19 TeraGen-S0GB
The following command\
drwxr-xr-x -hdfs hdfs 0.2015-05-14 10: 11 TeraSort-50GB S hdfs dfs -rm -r -skipTra
drwxr-xr-x -hdfs hdfs 0 2015-05-24 20:06 bin Gel an IIDFS Status Re1
Regular users can gt:t an
drwxr-xr-x -hdfs hdfs 02015-04-29 16:52 ecamples
Those with I-IDFS admini
this commamr uses dfsad,
48
~Vc::t,0;vA~ C !iCS · Modclt Que«c.en., Petpe-Y · 2

I<rep> <path>...) drw,r-xr-x -hdf~ hdfs 02015-01-27 16.00 0umc-chauncl


dnv,r-xr-,
d1 wxr-xr-x
-hdfs
-hdfs
hdfs
hdfs
0 2015-04-:?9 14:33 oo7ic-t.l.O
0 20 I5-0-1-30 I 0:35 oozic-examples --
dn,xr-xr-x -hdfs hdfs 0 20 I5-04-29 20:35 oozie-oozi
dn1 :-.r-:-.r-x -hdfs hdfs 0 20 IS-05-24 IS: 11 "ar-and-pcacc-inpu•
-
drwxr-xr-:-. -hdfs hdfs 02015-05-25 15:22 ,~ar-and-peace-outplll
1 he same result can be obtarncd by 1ssu111g the tollowrng command: -
$ hdfs dfs -Is /u~er, hdfs
Make :1 Uircctory in II DFS
s to be copied to the map To make a directory I IDFS, use the following command. As with the -Is com1M11(1, when no
path is supplied, the user's home directory is used (e.g., /usl!fs'hdfs).
jar files to include in the S hdfs dfs -mkclir ~luff
Copy Files lo I IDFS
To copy a file from your cum:nt local dirccto[') into IIDFS, use th1, folio" 1n~ command. If a
roted
full p:ith is not supplied, }our home directory is assumed. In 1h1s r,,w the file 1.:s1 1s placed
in the directory ~tuff that was created previously.
$ hdl's dfs -put test stuff
The file transfer can be confinm:d by using the -Is command'.
mmand: S hdfs clfs -ls stuff
found I itc1ns
-rw-r--r••· 2hdfs hdfs 12857 2015-05-29 13: I2 stuff/test
Copy File~ from IIDFS
(app-logs Files can be copied back 10 your local file system using the following tommand. 111 this case,
apps the file 11c copicd into I IDI S, test, will be copied back 10 the current local directory with the
/benchmarks namc test-local.
$ luffs dis -get stufflte~t test-local
'hdp
Copy Files Within IIOFS
mapred The following will copy a file in I IFDS:
mr-history S hdfs dfs -cp stt1ff/tes1 test.hdfs
system Delete a File within IIDFS
llie following command will delete the HOFS file h:st.dhfs that was created
tmp
previously:
user
S hdfs dfs -rm tcst.hclfs
Moved: 'hdfs: / /limulus:8020/uscr/hdfs/stuffltest' to trash at: hdls: / lirtulus:8020 /user/
hdfs/ .Trash1Current
Note: that when the ts.trash.interval option is set to a non-zero value in core-site .xml, all
deleted files arc movc:d to the user's .rrash directory. This can be avoided by including the
-skipTrash option.
S hdfs dfs -rm -sl,.ipTrnsh stuff/test Ddcted stuff/ti.:st
Delcie a Directory in IIDFS
1 he following command will <lclch: lhe HDFS directory stuff and all its contents:
S hdls dfs -nn -r -skipTrash stuff Dclc11:d stuff
Get :111 HDFS Status Report
Regufar usi:rs can get an abbreviated I IDFS slatus report using the following command.
Those with I-IDFS administr:itor privileges" ill get a full (and potential!) long) rcport. Also,
lhis comm:ind uses dls:1dmin inst~11d of sfs to in• ob; :1dm11,istr:11in, 00111111111,ds. TJ,o ~t atua
49
VIII Setn, (CSf/ISf:)

repo,, is ~imilar to the data presented in the IILJFS web GUI dbcount: An exam
S hdfs dfsadmin -report tlistbbp: A map/red
Configured Capacity: 1503409881088 (1.37 TB) i:rcp: A mapfn:duce
Present Capacity: 1407945981952 ( 1.28 ra) effects a join over s~
DFS Remaining: 12555 I0564864 l 1.14 TB) 1nullililc1,c: /\job t
DFS Used: 152435417088 (141.97 GB) program to find sol
DFS Used%: 10.83% pi: A map rcJuce pn
Under replicated blocls: 54 Blocks with corrupt replicas: 0 Missing blocks:0 nt11domtc1Hilcr: A
------······-·············-···-··-··-·· sccond11rysort: An
report: J\ccess denied for user deadline. Superuser privilege is required. program that sorts ti
b. Wrile a ~hort note on ruMini: map reduce na mplc and alJ>o e,plain the e1isting )udoku: A sudol..u 5
ft\'ailable l'\amplc~. (08 Marls) tcragcn: Generate <l
Ans. Running J\111pltcducc E'\:111111lcs tcrasorl: Run lhc tc~
All Had Jop , ell-ases come wi1h MapReduce example applications. Running the existing lt'r:1111lidatc: C'heckl
MapReJu~e c:-.amplcs is a s;mple pru.:css•once the example files are located, that is. For \\Ordl'Ount: A rnap-'ri
example. 1f you installed ! laddop version 2.6.0 from the Apache sources under /opt, the \\ord111can: A map/r
cx:unrk ,··ill he in •he follo1\i11g dirwory: files.
/opt /h.,<' -r·" .6.0l~hare'hadoop maprcduce/ wordian: A map'redu
In other, .i1 ~ion~. the c:-.amples may I~ in ·un.'lib 'hadoop-maprcduce/ or some other location. "ord slandard de, i
The exact location .>f the e-.:amplc jur file can be found using the find cominand. length of the words in
S find/ -name ''haddop-mapreduce-i:xample* .jar" •print
Consider the following software cn~ironment
2 · a. EAplain with neat di·
• OS: Linux
• Platform: RHEL 6.6 basic steps of MopR •
• Hortonworks I IDP 2.2 with Iladoop Version: 2.6 (diagrnm).
In this en, ironment, the location of the examples is /usr/hdp/2.2.4.2•2/hadoop- mapreduce. Ans. M:1flltcduce rarallcl
For the purpose of this example, an env1ronn1cnt variable called HADOOP EXAMPLES can algorithm is fairly sim
be defim:d as follows: - function. Operntionall
$ export HA DOOP_EXAM Pl ES•tusr/hdp/2.2.4.2•2/hadoop-mapreducc be quite compfc:,. Pura
Once yo,11kfine thi: examples path. >ou can run the Hadoop examples using the commands mapper and reducer pru
discussed in lhe folio" ing sections. The basic steps 11rc as
Listing AvnHuble Eumplcs l.lnpu l Splits, IIDFS ~
A list of the available c:-.umples cnn be found by running the following command. In some chunk or block and wril
cases, tl_1c version number may be pa11 of the jar file (e.g., in the version 2.6 Apache sources, on multiple machines (1
the file 1~ named hadoop-mapreducc-cxamples•2.6.0.jar). determined by I IDf'S a
$ yam jar SHA DOOP_EXAMPLES/hadoop-mnpn:duce•cxnmpte.jar considered pa11 or tire I
Note: In previous Vl!rsion of Hadoop, the command hadoop jar... was used to run Map Reduce throughout HOFS ~crvci
programs. Ne_wer versions provides the yarn command. which offers more capabilities. Both The inpul splits used b>
commands will ,,ork for these e~.imples. example, lhe splil siLe c
The possible examples arc as follows: records) or an nclual size
An example program must be givcn as the first argument. The number of~plits corr
Valid program names are: 2.Map Step. The mappir
agg~rgalc n ordcount: An Aggregate based map/reduce program that counts the words in For large amounts ofdma
the input files. the specific mapping pr
agi:rcgalt!wo_rdhis~: An Aggregate based map/reduce program that computes the histogram where the block resides.
of the words 111 the input files. will lry to pick a node th
bbp: A map redure program that uses Oailey•Borwcin•Ploulfe that compute exact bits of Pi. called rack awareness)

50
C.BCS • >--fodel,Q~«on,petpe,.,- - 2

tlbcount: An example job that count the pagevie" counts lrom a tlatabasc
tlistbbp: A map'reduce program that uses a BUP•t)pe formula to compute exact bits of Pi.
grep: A map/reduce program that counts the match~'S of a regcx in the input jo1fi ·\ job that
effects a join over sorted, equally partitioned data~ct~
mulllfilt>,,c: A job that counts words from several files. pcntomlno: A map/reduce, till laying
program lo lind solutions to pentomino problems.
pi: A map. reduce program that estimates Pi using a quasi-MonteCarlo method,
blocks:0 rnntlomtewritcr: A map/reduce program that writes IOGA of random textual d~ta per node.
sccontlar)sort: An example defining a secondary sort to the reduce. aort: A mJpircduce
program that sorts the data written by the random writer.
ed.
sudoku: A s11dok11 solver.
0 exi>lain the c,isting tcragcn: Generate dntn for the ter.i~on
(08 Marks) tcrn)ort: Run the teroso11
tcravalidate: Checking results oftcrasort
s. Running the e>.tsting wonlcount: ,\ mnp/rl·ducc progrnm thnt counts the word~ in the input files.
are located. that is. For wordmcun: A map/n.:duce progrnm that counts the.: average length of the words in the input
• sources under lopt. the files.
wordian: A map/reduce program that counts the median length of the words in the i11put files.
" ord standard dc1 iation: A map reduce program that counts stand.ire.I dcvbt,on of the
•e/ or sortll! other location length of the words in the input liks.
d cominnnd:
OR
2. E.~plain with ucat diagram Apache lladoop parallel m:1p !'educe data now (or) Explain
11.
basic steps of MapRcducc parallel data flow "ith the CAamplc of 11ord count program
(diagram). (08 Marks)
Ans. MapRcducc Parallel Data Flow: From a programmcrs perspective, the MapReduce
4.2-2/hadoop- mapreduce. algorithm is fairly simple. 1 he programmer must provide a mapping f11ncti1•11 nnd a ri.'ducing
\DOOP_EXAMPLES can function. Operationally, ho11 ever, the Ap:iche Hndoop parallel MapReduc~ data now can
be quite complex. Pnralll'I execution of MapRcduce r.:quircs other steps in addition to the
reduce mapper and reducer processes.
pies using the commands The basic ~tl'ps arc ns follows:
I.Input Splits, HDFS distributes and replicates data owr multiple servers. The default data
chunk or block and wrillen to different machines in the cluster. The data arc also replicated
wing command. In some on multiple machines (typically thr.:e machine). These dnta slices are physical boundaries
rsion 2.6 Apache sources, determined by HDFS and have nothing to do "ith the data in the file. \lso, while not
considered part of tire MapReduce process. the time required to load and Jistributc data
:ll' throughout I IDFS servers can be considercd part of the total processing llmc.
used to run MapReduce TI1e input splits used by MapRe<luce :ire logical boundaries based on the input dnta. For
more capabilities. Both example, the split size can b.: based on the number of records in a ti le (if the d,tl,1 exist ns
records) or an actual size in b:rtes. Splits are almost always smaller than the I IDFS block size.
The number ofsplits corresponds to the number of mapping processes used in the map stage.
2.Mnp Step. The mapping process is II here the parallel nature of I ladoop comes into play.
For large amounts of dat:i, many mt1ppcrs can be operating at the same time. I he user provides
that counts the words in the specific mapping process. MapReducc will try to ext!cutc the m~pper on the machines
where the block resides. Because thc file is replicated in I IDFS, th..: kast b•t~y. Mar,Reducc
computes the histogram will try to pick a node that is closest to the node that hosts the tfata block (a ch:iracteristic
called rack awareness). Thi: last choice is any node in the cluster th.it has access to HDFS.
comp1.1e exact bits of Pi.

51
VIII S<?-iw (CSf(ISE)
3.Combincr Step. It possibl~ to provide an optimization or pre-reduction as pa~ of the map step is to write the outpu
stage when: key-value pairs are combined prior to tht: next stage. The comb111c:r stage 1s As mentioned, a combin
optional. · ·1 k tb For instance, 111 the prev
4.Shuffte Step. Because the parallel reduction stage can complch:, a 11 s11111 ar eys mus e (spot, I)
combined and counh:d by th.: same reducer process. Therefore, results of the map stage ~1ust (run, I)
be collected by the key-value pairs and shulTh:d to the same reducer process. If only a smgll: As shown in figure: I .2,
reducer process is used, the shullle stage is not needed. . . This optimization can he
5.Ucduce Step. The final step is the actual reduction. In this stage, the data reductton 1s
erformed as per the programmers design. The reduce step is also optional. The results ~re
~rittcn to HDFS. Each reducer will write a11 output file. For example, a MapRcducc Job
ru11ni1w four rc:ducers will create files called pa11-0000. part-000 I , and part-0003.
Figure"1. I is an example ofa simple Hadoop MapReduce data now for a word count program.
The map process counts tin: words in the: split,and the reduce process calculates the total for
each word As mentioned earlier, the actual computation of the map and reduce stages are up
to the pro~•rammcr. The Mapreduce data now shown in Figure 1.1 is the same regardless of
the specific map and reduce t,1sl-s.
Input , Spl~ Map Shullle Aeduce OutplA Figure 1.2 Atltl,
b. With lwo programming c
Interface.
Ans. Using lhc Streaming lute
The Apache Hadoop strca
engine. The streams intcrf
and stdout.
When working in the 1-/ado
by ~he user. This approach
eas,ly tested from the comm
Listings I. I and 1.2, will be
Listing 1.1 Python Mappc
#!/ust/ bin/cnv py1hon
import sys
Figure I.I Ap11d1e ll<11luop p1m1llel /llapre<luce 1/maj/0111 # input comes from STDIN
The input to 1hc MapRcduce application is the following file in HDFS with three lines of for line in sys.stdin:
text. The goal is to count the number of times each word is used. # ren_iovc leading and trniling
see spot run # split lhc line into words wo
run spot nm # increase counters
see the cat for word in words:.
The first thing MapReduce will do is create the data splits. for simplicity, each line will be # write the results to STDOU
one split. Since each split will require a map task, there are three mapper processes that count the
the number of words in the split. On a clusler. the results of each map task are written to local # Reduce step. i.e. the input r.
disk and not to HDFS. Next, similar keys need to be collected and sent to a reducer process. # tab-delimited; the trivial wo
The shufile s11:p required data movement and can be expansive in terms of processing lime.
Depending on the nature of the applicntion, the amount of data that must be shuffle throughout Listing 1.2 Python Reducer
the cluster can va,y from small to large. #!/usr/bin/env python
Once the da1a have been collected at1d so,1cd by key, the reduction step can begin (.even if from operator import itemegct
only partial results arc availablc). It is not necessary-and not normally recommended-to have
currcnt_word~None current c
a reducer for each key-value pair as shown in Figure 1.1 . In some cases. a single reducer will
#input comes from STDIN -
provide adequate p,•rfonnnnc.:: in 01hcr cases. multiple reducers may be required to speed up
for line in sys.stdin:
the reduce phase. Th.: number of reducers is a tunable option for many applications. The final
# remove leading and trailing
52
C13CS - Mode.lt Q ~ Pc;,.,per- • 2
ct ion as part of the map step is to write the output to I IDFS.
I he combmer ~tage is As mentioned, a combiner step enables SIJfflC ~reduction of the map output data
For inst:mcc, 1,1 the previous cAample, one map produced the following counts: (nin, I)
II similar keys musl be (spot, I)
!I of then1.1p stage must (nrn, I)
process. If 01,l) a singlt: As shown in figure I .2, the count for run can be combined into (run ,2) before the snuffle.
This optimization can help minimize of data lnlnsfer needed for the shuffic phase.
e, the data reduction is ""' j' ~hulDC!

optional. The results are


nplc, a MapReduce job
nd part-0003.
r a word count program.
s calculah:s the total for
and reduce stages arc up
is the same re11,ard\ess of
Figure J.2 A ,l,li111: 11 combiner pruc:eu lo t'1e 111t1p 1tep i11 M11pRe1/uct.
IJ. With h,o programming ex mapper script 11nd reducl!S script upl11in u~ing the Streaming
lnterruce. (08 Marks)
Using the Streaming Interface:
The Apach~ I ladoop stre:uning interface enable almost any program to use the MapReducc
engine. The streams interface will 11ork 11 ilh any program that can read and write to std in
and stdout
When working in the Iladoop streaming mode, only the mapper and the reducer arc created
by the user. This approach docs h:111c the advantage that th.: m11pper and the reduc.:r can be
easily tested from the command line. In this example, a Python mapper and reduce, shown in
Listings 1.1 and 1.2, will be used.
Listing I. I Python Map1>cr Script (fnappr.py)
#!/ust/bin/env P) thon
import sys
# input comes from STDIN (stm1dard input)
lines of for line in sys.stdio:
# remove l!.!ading and trailing " hitespacc line - line.strip ( )
# split the line into words words= line.split ()
# increase counters
for word in words:-
plicity, each line will be # writl! the results to STDOUT (st;indard output);# what we output hi:re will be the input for
r processes that count the
task are written to local # Reduce step, i.e. the input for m lucer .py #
to a reduceqirocess. # tab-delimited; the trivi.il 11ord count is t print ·%s/t%s' •• (word, t)
s of processing time.
be shuffle throughout Listing 1.2 Python Redui:cr Script (reduce .py)
#!/usr/bin/env python
step can begin (.even if from operator i111po11 itemegcttcr import sys
n:commcnded-to have current_word=Nonc cum:nt_count-0 word = None
a single reducer will #input comes from STDIN
be required to speed up for lim: in sys.stdin:
applications. The final # remove leading and trailing" hitespacc line= line.strip ()

53
VIII Se.mt (CS'E/ISf)
# parse the input we got from mapper .py word, count= line .split(' /t', I)
Locate the hadoop-strenm
# convert count (currently a string) to int contain a version tag. In 1
try: following command line \
count= int(count) input file.
except ValucError: $ hadoop jar /usr/hdp/curr
# count was not a number, so silently# ignore/ discard this line -+
continue -file ./mapper .py
# this IF-switch only works because Hadoop sorts map output # by key (here: word) before -mapper ./mnppcr .py
it is passed to the reducer
-~ le .!reducer ·PY -reduce .
if current_word ~ = word: current_count += count else:
-mput war-and-peace-inpu
if current_word: -output w:ir-and-peacc-out
# write result to STUDOUT
print '%s/t%s' 5 tcurrent_word, current_count) current_count "- count
T_he output will be the f.11111
din:ctory. The ac1ual file na
current word "-word
Also note th,:! lhe Python sc
# do not forget to output the last word if needed! if current_word "- = word:
C code,o,· nny language that
print 'o/os/to/os• % ( current_word, current_count)
The operation of the mapper .py script can be observed by running the commands as shown Al_though the streaming int
usmg Jav~ directly. In parti
in the following: Another disadvantage is that
$ echo "foo foo quux labs foo bar quux" I ./mapper .py
API are not itvaifable in strc
Foo I
Poo I
Quux I 3. a. Explain How to quire data'
Labs I Ans. Apache flume is nn indepen
Foo I HDF~. Often dnta trnnsport i
Bar I machmes and locations. Flu
Quux I message, and just about nny c
Piping the result of the map into the sort command can create a simulated shuffic phase: As shown in Figure 3.1, a Flu
Bar I • Source. The source compo
Poo I to more than one channel.
Foo I another Flume ngent.
Foo I • Channel. A channel is a d,
Labs I It can be though of as buffe
Quux I • Sink. The sink delivers data
Quux I A Flume agent must have nil 1
Finally, the full MapReduce process can be simulated by adding the reducer .py script to the severn/ sources, channels, and
following command pipeline: take data from only a single ch
$ ehco "foo foo quux labs foo bar quux" I ./mapper.py I so1t a si_nk removes the data. By de
-+ -k I, I l./rcduccr .py optionally stored on disk to pre
l3ar I
foo 3
Labs I
Quux 2
To run this application using a I-ladoop installation, create, if needed, a dirertory and move
the war-and-peace.Ix! input file inlo HDFS:
$ hdfs dfs -ml-.dir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
Mnke sure the output directory is removed from any previous rest runs: ..
$ hdfs dfs •rm •r -skpiTrash war-and-peace-output Figure J. / Flume Age11I will,

54
rLg, Vctt~A ~ '
cacs - ModeLQ~t'WYIIPc>.per -2
(' I t', I)
Locat~ thc ha~oop-streaming.jar file in )Our d1~nbution. The location may vary, and it may
contam a vers1on tag. In this example, the Hononworks HOP 2.2 distribution was used The
:ollowing command line will use the mapper py and reducer .py to do a word count 0 ~ the
input file.
S hadoop jar /usr/hdp'currctn/hadoop-mapreduce-cltent/hadoop-streaming.jar
-+
-file /mapper .py
y key (here: word) before -mapper ./mapper .py
-file }reducer .py -reduce ./reducer .py
-input war-and-peacc-input/war-1nd-pcacc .txt
-output war-and-peace-output
The output will be the familiar LSUL'CESS and part -00000) in the war-and-peace- output
unl directory. The actual file name may be slightly difference depending on your I ladoop version.
Also note th~! the Python scripts used in this example could be Bash, Perl, Tel, Awk, compiled
., =word· C code,or nny langua1:,.: that can read nnd ,Hite from stdin and ~tdout.
Although the streaming interfa~e i~ rather sunple, it docs have some disadvantages over
g the commands as shown using Java directly. In pm1icular, not all .. p;,lications are string and character binary data.
Another disadvantage is that many tuning parameter~ available through the full Java Hadoop
API are not available in streaming.
Modulc-2
Explain I low lo quire data stn•ums using Apache F'lume? (0~ Marks)
Apache rlume i~ an 111dcpcndcnt ngent designed to collec1, transport, and store data into
HDFS. Often daw transport involves a number of flume agents that may traverse a series of
machines and locations. Flume is often used for log files, social media-generated data, email
message, and just about any conlinuous data source.
As shown in Figure 3.1, a flume agent is composed of three components.
imulated shuffie phase: • Sou rec. The source component receives data and sends it to a channel. It can send the data
to more than one channel. The input dala can be from a real-time source (e.g., weblog) or
another Flume agent.
• C h111111cl. A channel is a data queue thnt forwards the source dntn to the sink destination.
It can be though of as bufft:r thal manages input (source) and oulput (sink) flow rates.
• Sink. The sink delivers data to destination such as IIDFS, a local file, or another flume agent.
A Flume agent must have all lhree of these components defined. A flume agent can ha,·e
several sources, channels, and sinks. Sources can \Hite to multiple channels, but a sink can
the reducer .py script to the take data from only a single channel. Doto written to n channel remain in the channel until
a sin"- removes the data. By default, the data in a chanmd arc kept in memory but may be
optionally stored on dis"- to prevent data loss in the event ofa network failure.

ded, a directory and move

runs:
..
Figure J. l Flume Age111witlt source, ,·1tu1111el, u11d li11k(atfuptetlfrom Apatl,e Flume
t/tJt'llllll!IIIOII0/1)
55
VIII Set11, (CSE/ISE) CBCS - Modcl,Q~

As sho\\n in Figure 3.2, Sqoop agents may be placed in a p1pclmc, possibly tu traverse namespacc andlogs.
several machines or domains. This configuration i~ normally used when data are collected Th-.: web-based UI c
on one machine (e.g., a web server) and sent to another nmchine that has access to I !DI'S. the NamcNodc. In A
Links pull-down me
tab 1~ iIf open with t

~ ~c,,
____..,,...____
the following comm
S fircfu.-: http:loc:alh
loo Thcire are five labs
Ulilities. fhc Overv
Figure 3.2 Pipeline is created b} connecting Fume agents ( Adapted from Apache Flum.: line tools also offer
Sqoop Documentation) information lir..c th~
The Sanpshot win
informalion on snap
Figure 3.6 provide
read~ the prcv iou~ Ii
image, thereby crca
DataNodcs come o
starts. Completed pl
in italics. Phases tha
have been complete
out of safe mode.
The Ut1lillcs menu
browser. From this\
which is not shown,

......
....,,...a.,....., ...

Overvi8W °llolu
Figure J.J A Flume c:u,uulidutiu11 11f!lwork (Ad11ptedfrom Ap11c:fle Flume
D uc11m,mtutlo11)
In a Flume pipdine, the sink fonn one agent is connected to the source of another. The
data tran~fer format normally used by Flume. which is called Apache Avro, provides several
useful features. First, Avro is a data serialization/deserilization system that uses a compact
binary format. The schema is sent as part of the data exchange and is defined using JSON
(JavaScript Object Notation). Avro also n:mot.: procedure calls (RPCs) to send data. That is, Summary
an Avro sink will contact an Avro source to send data. Another useful flume configuration is
shown in Figurc 3.3. In this configuration.Flume is used to consolidate SC\'eral data sources
before committing them to IIDFS. There arc many possible ways to constnict Flume transport
...,.. • ._...wd1nn
networks. In addition, other Flume features not described in depth here include plug-ins and .....,_ toon(IJ . . . ., . , . . •u

interceptors that can enhance flume pipeline. For pipclini:s.


b. Explain bricOy bask HDFC adminislralion?
Ans. The NameNode User l ntcrfocc
( 12 Marks) -·- . --
.............
Monitoring I IDFS can be done in scwr.il ways. On.: of the more convenient ways to get a
quick view of HDFS status is through the N.omcNodc user interface. This web-based tool
...
provides essential information about I-IOFS and offers the capability 19 browse the I IDFS

56
l lfl'Vett"G\IA~
namespace and logs.
line, possibly to traverse The web-based UI con be s1nrted from within Amban or from n web browser connected tu
when daH1 arc collec1ed the NamcNode. In Ambari, simply select the IIDFS Kn.<ice window and click on the Quick
al hns access to \-IDFS. Links pull-down rncmt"in the top middle of the page. Select NameNodc UI. A n.:w browser
tab will open with the UI shown in Figure 3.4. You can also start the UI directly by entering
the following command:
S fire fox http: loc(llhost:50070
There arc five t111Js on the UI : O,·en iew, Datanodcs, Snapshot, Startup Progress, and
Utilities. fhc Ovcrvicw page provides much of the essential information that the command-
line tools also offer, but in a much easier- to - read format. Th" Dntnnodcs tab cJbpluys nude
pied from Apache Flume information like that shown in figun; 3.5
The Sanpshot window lists the "snap-shottablc" directories and the snapshots. Further
information on snapshots can be found in the "HDFS Snapshots'' section.
Figure 3.6 provides a NamcNode sta1111p progress view. when the NameNodc starts, ii
reads the previous file system image file(fsirnage); applies any new edits to the file system
image, thereby creating a new file system image; and drops into safe mode until enough
DataNodes come onlinc. This progress is shown in real time in the UI as the NameNode
starts. Completed phases nre displayed in bold te,t. The currently running phase IS displayed
in italics. Phases that have not yet begun are displayed in gray text. Figurl! 3.6, all the phases
have been completed, and as indicated in the overview window in Figure 3.4, the system is
out of safe mode.
The Utilities menu ull't:rs twu options. The first, as shown in Figure 3.7. is a file system
browser. From this window, you can easily explore the HDFS namespace. The second option,
which is not sho,~n. links 1~ the various NnmcNodc lolls.

HDFS

Overview 'llntufL8C!2ll '"'· '"~'


....,.,.. nn J♦ J.,•M '1'.1\
Apache Flume ~··I:., ta •ll•"t)fW••..,....J-.Z_ _,.., .. t.\J9-,eTJ

1'91)¥111••111 ,,•Lt .n•,,,_._,.ii,.,,M


he ~ource of :mother. The
he Avro. provides several
)Stem that uses a compact
and is defined using JSON
PCs) to send data. Thal is, Summary
cful Flume configuration is 'lolt<\lllr,• . .

_..
:::::.~::....-C ..
lidate sc,·eral cfala sources 4L"l,ot,t'Ml l • )\J)fll ..... ,._~•lolfl<.....
construct Flume transport ".,..,.,....,y..witna,,.,...,,M-4 ... --'~"Ma•,....,..,..,._.,., .. ,_... -----.•~Ml-
here include plug-ins and 9"ff'MWMll"9'r.,...•••l.,__,,,. M,MI(....,.,....,.,_,.__,,,,,.._._._.., tilt
_.,.

......-
o,,.,..., Ot\C'A
(12 Marks)

-
.... Of • . , . .,
lUff

convenient ways to get 3


ac:e. This web-based tool Figure ].4 01'efl'lew p11gefur NnmeNuJe 11vt!r lult!r/ac.:

57
VIII Se.tw (CSE/ISE) 13i.g-Vc;,.m,A~

1. Add the 1
most,cases, the
hdfs.

Datanode Information

In operalJOn

- <_.,
l~IC.J

no1c.1
1 tJ{4

11n<:ot
.... .,,~
'4,.....

, .. uca
- -...
nn,01
lUUQI
"'
"'
...... ,._. ..-.-
111 w1>.»111t

ll.!SCil4l.N
.• -
f tUJ,.N

JUUU~
Browse Direc
Jid t ,c.a

lM1fA
UM<A

11Mt.r
HAl(;,I

1•••ot
)UJI-

'"""°'
...
1tOU•C10'M
UMc.t j a.ti-. .
• ,UU-l:U-l
)U)U)-.J
0

Oecom,sslomng

- .... .-
__.............
....... _.,__
..

Startup Progress Fig11r1

-
~ ,__... ~~...J<wl'r•~ -• 1111•••••••teM.l1 I._JJ Ml
.,.,...._
.....
.....
--
------
aac
2. Create the username
hdfs dfs -mkdir /uscr/<u
3. Give that account
<usemame>:<groupnam

·-·- ·-
To check the health of
command. The entire 11

--· ,..,. an argument to 1he com

--
--
,Mf,.~tUfff~---00000,aoaaaaa,,,.)t.,OOOGOOCUIDIDUIJTIJ I Ml 0.Hll~J

~repo,urdMKt,(111n:11
l-

·- ·-
,..,.
·-
1---.,-
$ hdfs fsck/
Connecting to namenode
FSCK started by hdfs (a
14:48:01 EDT20 15
·············· .. ,,,,............
,,
' Status: HEALTHY
Figure: 3.6 NameNo,le web i11terfau sl,owi11g start11p progress To\a\ si1.e·. \004'.BS6S1g
Adding Users to HDFS Total symlinks: o (F
Keep in mind that errors that crop up while Hadoop applications are running are often due 'Tota\ b\ocks (va\idated:
to file pennistiont. b)t:>rk.s- (not vs,J;J,.,.J),
To quickly crea1e user accoums manually on a Lmux-based system, perform thefollowing Min
steps: · Ove
Und
S8
l. Add the user to the group for your operating S)stcm on the HDFS client system. In
most,cases, the groupnamc should be that of the HDFS superuser, which is ofi.:n hadoop or
hdfs.
useradd G <ttrou name> <username>

........-- ...... Browse Directory


)H~

''"' •
(lt\"'4

.....,...

..
110101

,uu,1-1
u•1• • H
, ••,1,>J
,.,,..,
... ~- .....
.,.,....,.I'
°""''
~tfll

....

......
.....
....
,.
..••
••
..- --.... ••
-.......
""
~
...

......
.... _,.
-- ..... "11:tM14fb

.. -... .....--
-----·IT·I
....,
..... ....
..,
_
""" ...
--
"""
....••
,...

...,..
··-
.,....,..,. ....
....
.....
.. ....
_..,.
..........
-··
.,.....,
-
-.... -....
.....
..
01
....
..,
, .

Figure: J. 7 NameNode web i11terface tlireclury browl11r


2. Create the uscrname directory in IIDFS.
hdfs dfs -mkdir /user/<uscrnamc>
........- 3. Give that account ownership over its directory in IIDFS. hdfs dfs -chown
<usernnme>:<groupname> /user/<usemamc> Perform nn FSCK on HDFS
To check the health of HDFS, you can i~su.: rhe hdfs fsck <path> (file system check)
command. The entire HDFS namespace can be checked, or a subdirectory can be entered as
an argument to the command. The following example checks the entire IIDFS namespacc.
S hdfs fsck /
Conncccing to namenode via http:/llimulus:50070
FSCK started by hdfs (auth: SIMPLE) from /10.0.0.1 for path/ at Fri l·hty :??
14:48:01 EDT2015
.....................................................................................................................................
······················································•"'''''"''''''''''''"'.......................................................
Status: HEALTHY
Total size: I 00433565781 8 (Total open files size: 498 B) Total di~: 201331 Total fih:s: I 003
Total symlinks: 0 (Files currently being written: 6)
Total blocks (validated): 1735 (avg. block size 57386781 B) (Total open file
running are often due blocks (not validated): 6)
Minimally replicated blocks· 173S (100.0 %)
Over-replicated blocks : 0 (0.0%)
Under - replicated blocs : 0 (0.0 o/o)
59
VIII Se-iw (CSE/ISE) a,cs · Mod,l.,Q~
after the DataNodes ha
Mis-replicated blocks 0 (0.0%)
The administrator can~
Default n:plication factor: 2 S hdfs dfsadmin -safcm
Avergae block replication: 1.7850144 Corrupt blocks: 0 Entering the following
Missing replicas :,0 (0.0%) Number of data• nod1:s : 4 Number of racks : I $ hdfs dfsadmin -safem
FSCK ended at Fri May 29 14: 48: 03 EDT 2015 in 1853 milliseconds I tOFS may drop inlo
The filcsystcm under path'/' is HEALTHY DataNode). The file sys
Other options providt: mort: detail, include snapshots and open files, and management of whefher HDFS is in Sa
corrupted files. $ h<lfs dfsa<lmin -safem
• move moves corrupted files to /lost + found Decommissioning HD
• delete deletes corrupted fi lcs If vou need to remove
• files prinls out files being checked it first. As~uming the n
• openforwrite prints out files ope111:d for writes during clock web UI. Simply go to t
• inch11.lcSnap~hots includes snapshot data. The path indicates the existence of a pull-down menu next t
snap~•,ottable directory or the presence of snapshottablc directories under it.
Note that the host ma
• Jist-c11rru11tlilcblocks prints out a list of missing blocks and the files lo which theybelong
decommission the YAK
• blur!<~ prints out a block r<!pon The restoration process
• lo(.ation~ pr111ts out locations for every block
(or anywhere else). Not
• rn..:k., prints o..t network topology for data-node locaitons.
$ hdfs dfs -cp /user/h
Dal;w.:inr. liPF5
user/hdfs/war-and-peac1
Based on w,agc pallerns and DataNode availability, the number of data blocks across the
DataNodcs may r~come unbalanced. To avoid over-utilized DataNocks, the HDFS balancer Confirmation that the fi
command:
tool rebalances d.1ta blocks across the availabk DataNodes. Data blocks are moved from
over-utilitc<l to under-utilized noctes to within a certain percent threshold. Rebalancing can . ,.,,..,., ~ • 1 ---.tfCII(

be done when new DataNodes are ndded or whc11 a Data Node is removed from service. This ♦- · ~ ......-•.· ,, ' p ... .., . .. . .

step does not create more space in I IDFS, but rather improves efficiency.
The HDFS superuser must run the balancer. The simplest way to run the balancci is to enler
the following command:
S hdfs balancer Snapshot Summ
By default, the bal:incer will continue 10 rebalance the nodes until the nwnber of data block
on all Data Nodes are within I0% of each other. The balancer can be stopped without harming SnapshoUablo dlreetor,es.
HDFS, at any tum: by entering a Ctrl-C. Lower or higher thresholds can be sci using the
-threshold argument. For example, gmng the following command sets a 5% threshold:
$ hdfs balancer -threshold 5
...
The lower the threshold. the longer the balancer will run. To ensure the balancer d~s not
swamp the cluster networks, you can set a bandwidth limit before running the balancer, as S naps~ollcd (1,rcccoroes. I
follows:
$ dfsadmin -setl3alancerl3andwidlh newbandwidth
111c newbandwidth option is the maximum amount of network bandwidth, in bytes per
second, that each DotaNode con use during the balancing opcrnlion.
Balancing datablocks can also break HBase locali1y.When HBase rgions are moved, some Figure: 3.8 Apuc/1
data locality is lost, and the RcgionServcrs will then reqnest the data over the network from $ hdfs dfs -Is /user/hdfr
remote DataNode(s). This condition will persist until a major Hl3ase compaction event take, -rw-r-r- 2 hdfs hdfs
place (which may either occur at r<!g111:lr intcrvnls or be initiated by the administrator). input/war-and-peace.txt
IIDFS Safe Mode The NameNode UI prov
when the NameNode starts, it loads the file system state from the fsimage and then applies have been 1aken. Figure 3
th.: edits log file.. It then waits for DataNodes to n:pon their blocks. During this time, the snapshot. give the followi
NameNode stays 111 a-read-only Safo Mode. The NameNode leaves Safe Mode automatically $ hdfs dfs -deleteSnapsho

60
~V~A~
after the DataNodcs have reported that most file system blocks are available.
The administrator can place liDFS in Safe Mode by giving the following command:
$ hdfs dfsadmin -safcmode enter
Entering the following command turns off Safe Mode:
cks: I $ hdfs dfsadmin -safemode leave
onds I IDFS may drop into Safe Mode if a major issue arises within the file system (e.g., a full
DataNode). The;: file system will not leave Safe Mode until the situation is resolved. To check
les, and management of whelher HDFS is in Safe Mode, enter the following command:
$ hdfs dfsadmin -safemode get
Decommissioning HDFS Nodes
If vou need to remove a DataNode host/node from the cluster you should dccom- mission
it first. Assuming the node is responding, it can be easily decommissioned from the Ambari
web UI. Simply go to the I losts view, click on the host and selected Decommission from the
ates the existence of a pull-down menu next to the DataNode component.
ories under it. Note that the host may also be acting as a Yarn NodeManager. Use the Ambari 1-1 to
e fi ks to ~-.,hich theybelong decommission the YARN host in a similar fashion.
The restoration process is basically Hl-irnple copy from the snapshot to the previous directory
(or anywhere else). Note the use of the ~/. snapshot/wapi-snap- 1 path to restore the file:
$ hdfs <lfs -cp /uscr/hdfs/war-and-peacc-input/.snapshut/wapi-snap-1/war-and- peace . txt /
user/hdfs/war-and-pcace-input
r of data blocks across the Confirmation that the file has been restored can be obtained by issuing the following
Nodes, the HDFS balancer command:
ta blocks are moved from
hreshold. Rebalancing can • ~•~ • ,A l ~
♦.. cflo,r,,.IJ'! ..,..,~.• ti • 1 p •I•• ,t 'II •'(1
l

emoved from service. This ~

ciency.
, run the balancei is to enter
Snapshot Summary
ii the number of data block
be stopped without harming SnJpshonable dtrec10,,es· 1
sholds can be set using the
d si:ts a 5% threshold: .............
nsure the balancer does not
bre running the balancer, as Snaps'1ollcd d1tectoroes· 1
--
rk bandwidth, in bytes per -· ---
"1tnYU,llt.-01,..

se rgions are moved, some Fig11rl!: 3.8 Ap11clte N11111eNotll! Hll!b i11ll!tjt1ce slwHJi11g Mltp.f/,o/ i11for111atio11
data over the network from $ hdfs dfs -ls /user/hdfs/war-and-peacc-input/ Found I items
ase compaction event take, -rw-r--r- 2 hdfs hdfs 32887462015-06-24 21: 12 /user/hdfs/war-and- peace-
by the administrator). input/war•all(f-peace.txt
The NameNode lJI provides a listing of snapshottable direclories and the snapshots that
e fsimage and then applies have been taken. Figure 3.8 shows the results of creating the previous snapshot. To delete a
cks. During this time, the snapshot, give the following command:
s Safe Mode automatically $ hdfs dfs -deleteSnapshot /user/hdfs/war-and-peace-input wapi-snap-1

61
VIII Sem, (CSf/ISf) CBCS - 1,,todel,Q~
To make a directory "un-snapshottable'' (or go back to the default state), use the When the DataNode
following command: from the user.

,... - .....
S hdfs dfsadmin -disallowSnnpshot /user/hdfs/war-and-peace-input Disallowing snapshot
on /user/ hdfs/war-and-peace-input ~uccceded i - -· - ~-~
....
OR
4. a. llow to manage Hadoop scn •kc? (08 Marks)
Ans. During the cot1rse of normal lladoop cluster operations, service may fail for any number of ·-··-
·-·-·-
' u ..
reason. Amb,1ri monitors all of the Hadoop service and reports any service interruption to
the dashboard. In addition, when th!! system was installed, an administrative email for the
Nagios monitoring system was required. All service interruption notifications are sent to this
email address.
·----
g ..

Figure 4.1 shows the Ambari dashboard reporting a down DataNode. The service error
indicator numbers next to the IIDFS service and Hosts menu item indicate this conditions.
The DataNode widget also has turned red and indicates that 3/4 DataNode are operating.
·-··-·-
~
Clicking the HDFS service link in the left vert ical menu will bring up the service summary
screen shoOwn in figure 4.2. The Alters and llealth Checks window confirms that a DataNodc
is down.
The specific host (or hosts} with an issue can be found by examining the llosts window. As
shown in Figure 4.3, the status of host n I ha~ changed from a green dot with a check mark
inside to a yellow dot with a dash inside. An orange dot with a question mark inside indicates
the host is not responding and is probably down. Other service interruption indicator may also

·-- -
be set as a result of the unrcs sive node.

- •· --··-··
I
.............. . . • 1•
Fi~ur~ 4,2 AnllJari

·-·•
·-- ---_... - ·-
·- ---- --
_.,. ....
., •-·
1·-- ...
♦ . . ..........
---
·--
•·-
- - - --
ll;~•- 3,14 ·-·-· ....... ..
'

·-
•·-
.. --
.h• ·-
~
--_, -- - 0.1 4 ms
L• ~
no

·--- -·- ·•· ._·;....,"(..


LJ• a4

"··.. -
-· .. ·-
....... ---
52.9d
,_
1.33 52.9 d
A
- .. - - ·- ·- - --
-- Flg11r~ 4.J

Oft 19.1 d 4 '4 4/4


"'
. . Figure 4.1 Ambarl ma/11 dusltboard l11dlcatlt1g a DataNode issue
C~1ckmg on the n I host link opens the view in Figure 4 .4. Inspecting the Components sub-
wmdow reveals that the DataNode daemon has stopped on the host. Al this (>Oint checking
~h; Data~ode logs on host n I will help identify the actual cause of the failurll. Ass.urning the
11 ures 1s resolved, lhe DataNode daemon can be started using the Start option in the pull-
down menu next lo the service name.

62
~V~An~

fie), use the When the DataNodc daemon is restarted, a confirmation similar to Figure 4.5 is required
from the user.
Disallowing snapshot
•.....,.~•- , ......,"_'
......,. ......., • ~
~-----~........
~ •...:.:-'-"---•------.....-a......,,~ie:,..,.~....,.•-

(08 Marks)
y fail for any num~r of
service intcm1puon to ·-........
...
- ---·-- .,.......,_
----
A=.,,,.,,_ __
---- ..
ninistrative email for th_e
ifications are sent to this
·- --··-
__ .... ·- -
...................
.... - •
,......
· - • • • • •~ t ~ -
- ·•

.,,, ~..:;:-..... ,..i..-•.


-·--·-··--- .,, ---·- - .......
,Node. The service_ ~rror
.I ndicate this cond1t1ons. --·-·--
____ ......... - ---·~
.....
., ~:-==--- ... ...... -. .

---

ataNode are operating.
g up the service summary
confirms that a DataNode

~ing the Hosts window. As


en dot with a check mark
estion mark inside indicates
-
... r--
~
__
•rruption indicator may also

~
-··-·· .......
.,. Fi ure 4.2 Ambari HDFS sen•h·e s11111111arv wi111/ow i11dit-ali11 a dow11 DutaNode
... .... .._..._,_
L- --
-.;,Will ..... ·- . . . -·
-- - ,_...
0
••••

-
n

- ..--. I$ "' • co
......
t \ lo~. .
I
... ,,.
, 11-t. , .... ,.

.. -~~--
u• • I , I ,,·~ "
r:e •• ~
I
....
,.

.33
--- 52.9 d Fig11rt! 4.J A111b11ri Ho1ts SL'tet!1t l11dicat/11g an inut! witll /1ost n/
.:J

414

ataNode issue
cting the Components sub-
host. At this point, checking
of the failur.:. Assuming the
the Start option in the pull•

63
VIII Se+w (CSE/ISE) 13lft'V~A~ CBCS - Modcl,q~

~, _. - .....-
.,,. g ... __,_.o.,:....._,~ 0 D 6 • •

-- ·-·
··-- ·-
·-- - ~· ·-
.....
., __

__
c.........._..,,.,.~~
"'°";,,...t ,_ '--·
~~-'°"""'°"""'
,c-, ,.....-.,1., .. p ~ -
_.,_
.,.,_
·--.
.~~J~
...... ,;.,..
i,

.......
Ml
L--
• ,,,..,_ tcflll!

__, ,,... 1-
OI , ...... ~ ...

--
~f'.-,4 , •
... _

,_..... . - 1...
,.,. .l.._1 . . .

l ... . . . 6 """
'M-'',.,,,.

----
Fig11re 4.4 Amb"ri 111ii11low for /toll 11/ i11dic"ti11g tl,e D"t"Notle/1/DFS sefl•lce /,as
~topped :ig_ure 4.6 Ambari dash
When a service daemon is started or stopped, a progress window similar to Figure 4.5 is ~nd~cators will slowly J
opened. The progress bar indicates the status of each action. Note that previous actions are md1cators arc beginning
part of this window. If something goes wrong during the action, the progress bar will tum n:d. real-time widget updates
If the system generates a warning about the action, the process bar will turn orange.
When these background operations are running. the small ops (operations) bubble on the top b. Wrile a shore note on v.
menu bar will indicate how manv o rations are runnin •.
Ans. The central YARN Re
machine and acts as the
1 Background Operations Running
applications in the clusterl
·,i.-i f ...... . Sto. M ! IO) J resources and, therefore,
I ... users. Depending on the a
the ResourceManagcr dy
I..,,.
• particular nodes. A conta1
to a particular cluster n
100'0
• interacts with a special sy
100'0
• for scalabili1y. NodeMan
fault reporting, and conta
.,,,~,!~ ,,..,v.......
100'0
• ResourceManager dcpen
C:oo~~ ,.,., <>. :11 --- IOO'lo
• User applications are su
through an admission c
various operational and
I-
• accepted pass 10 the scti
Oo,.,. _ _ _ _ _ _ _ _ . . _ _
resources to satisfy the re
state. Aside from intcma
single ApplicationMaster
Fig11re 4.5 Amb,,ri progress wi111/ow for DataNo(le restart the Applicat1onMaster d
Once the DataNode has been resfarted successfully, the dashboard will reflect 1he new status request additional resour
(e.g., 4/4 Data Node are Live). As shown in figure 4.6, all four

64
I•- · - ..........--
. \ ...,.._,
v (: u ... .,.w- o.-M-b~... 0- e -. -;- ■

•-·

--__ --
--.... --
..
--L a..... , ...

;::.:,-_J.._,..
--e) 0.07 ms
-""-
....
-- •...
52.9 d
............_
"'
-·-
--·
'"'---•
,......,
1.33
... ......
52.9d
......-.

- . --
L-- - - - i l - -

"" 19.1 d 4/4 ... 4/4 ....

ru/e/1/DFS seri•ice ltas


Figtire 4.6 Ambari dashboard indicating all DataNodcs are running (The service error
w similar to Figure 4.S is indicators will slowly drop off the screen) DataNodes are now working and the service error
e that previous actions arc indicators arc beginning to slowly disappear. The service error indicators may lag behind the
progress bar wiII turn red. real-time widget updates for several minutes.
r will turn orange.
b. Write a short note on YARN Applic11tion? (04 Marks)
crations) bubble on the top
Ans, The central YARN ResourceManager runs as a scheduling daemon on a dedicated
machine and acts as the central authority for allocating resources to the various competing
applications in the cluster. The ResourceM:mager has a central and global view of all cluster
.:l resources and, therefore, can ensure fairness, capacity, and locality are shared across all
---- users. Depending on the application demand, scheduling priorities, and resource availability,
- -----
"'"'" .
---
the ResourceManager dynamically allocates resource containers lo apllications 10 run on
particular nodes. A container is a logical bundle of resources (e.g., memory, cores) bound

-
S ·- . - ~
to a particular cluster node. To enforce and track such assignments, the ResourceManager
interacts with a special system daemon running on each node called agers are heartbeat based
for scalability. NodeManagers are responsible for local monitoring of resource availability,
fault reporting, and container life-cycle management (e.g., stmting and killing jobs). The
ResourceManager depends on the NodeManagers for its "global view" of the cluster.

6-- User applications are submitted to the ResourceManager via a public protocol and go

1
,..,,. . through an admission control phase during which security credentials are validated and
various operational and administrative checks are performed. Those applications that are
accepted pass to the scheduler and are allowed to run. Once the scheduler has enough
resources to satisfy the request, the application is moved from an accepted state to a running
slate. Aside .from internal bookkeeping, this process i1.volves allocating a container for the
single ApplicationMaster and spawning it on a node in the cluster. Often called container 0,
otle rellart the ApplicationMaster does not have any additional resources at this point, but rather must
d will rcf!l::ct the new status request additional resources from the ResourceManagcr.

65
VIII Sem, (CSE/IS£)
The Application Master is the ·•master" userjob that manages all application life-cycle aspects, was designed and integ
including dynamically increasing and decreasing resource consump1io11 (i.e., containers), Figure 4.7 illustrates th
managing tile flow of execution (e.r., in case of MapReduce jobi, running reducers against the YAl{N components app
oulpul of maps), handling faults and computation skew, and performing other optimizations. and the two applicatio'l
The Applica1ionMaMer is designed to run arbirtratry user code that can be \Hillen in any application uses a dille
programming language, as all communication with the Reso'urceManager and NodeManager passing Interface (MPI)
is encoded using extensible network protocols application.
YARN makes few assumptions about the ApplicationMaster, although in practice it expects The darker client(MPI
most jobs will use a higher-level progr.1mmin:; framework. By delegating all these functions running n MapReduce a
to ApplicationMash:rs, YARN's archilecture gains a great deal of scalability, programming
c. E,pl:iin c:ipaeicy sch
model flexibility, and improved user agility. For example, upgrading and teMing a new
MapReduce framework can be done independently of olhcr running MapReduce frameworks.
Ans. Ca11acity Scheduler Ba
Typically, an ApplicationMastcr will m:ed to harness the processing power ofmuhiple servers Thi: Capacity scheduler
to complete a job. To achieve this, the ApplicationMastcr is~ues resource reqmms to the si:curcly ~hare a large I I;
Resoum:Manager The form of these requests includes specification of locality preferences Capacity scheduh:r has
(e.g., to accommodate I IDFS use) and properties of the containers. The ResorceManager \\ ill To use the Capacity sc
:11tempt to satisfy the resource requests coming from each npplication according to avnilability fmction of the total slo
and scheduling policies. When a resource is scheduled on behalf of an ApplicationMaster, amount of resources fo
hard limits on tho: capa
the ResourceManagcr generates a lease for the resource, which is acquired by a subsequent
Control Lists) that cont
ApplicalionMaster heartbeat.
~- - Mapll.._Cllenl
safeguards arc in place
users.
I Sche!duler I ~ The Capacity scheduler
IAppllcatlonManage, I .,
I
NMU¼#SIM minimum capacity guar
of dcmand (i.e.. a group
fa.ccss slots arc given t
divided by rhe queui: ca
capacity guarantee gel th
elasticity for the users in
Administrntors can chan
run time without disrupt
delete queues at run time
that while existing appli
The Capacity schedule,
application can oprional
Using information from
on the best-suited nodes.
The capacity scheduler
assiiining the minimum
should be assigned a m
Figure 4. 7 : ~1r11 11rd1lre,·111re wir/1 fwo t:lie1lfl(M11pR,:d11,·e tmd MP/). Within each queue, mul
The ApplicationMasler then works with the NodeManagcrs to start the resource. A loken- the approach used with t
based si:curity mechanism guaranh:cs ils a111hentici1y when the ApplicationMaster presents jobs nri: placed in the di!
the container lease to the NodcManager. In a typical situation, running containers will The ResourceManager
communicate with the Applicatio11Mastcr through an applicalion-specific protocol to report thi:ir utilintion. Figure
slatus and health infonn:uion and 10 receive fra111cwork-sp,:cific co111111an<ls. In this way, scheduler view, click th
YARN provides basic infr.i\lructurc for monitoring and life-cych: managcment of containers, Information on configur
"hile each framework manages applic:uion-specific semantics design, in which scheduling org/does/currcnt/hadoo

66
VCNtl;vA~

at ion life-cycle aspects, was designed and integrated around managing only MapReduce tasks.
ption (i.e., containers), Figure 4.7 illustrates the relationship between the application and YARN components. The
ing reducers against the YARN components appear ns the large outer boxes (ResourceManager and NodeManagers),
ing other optimiz.1tions. and the two applications appear as smaller boxes (containers), one dark one light. Each
t can be written in any application uses a different ApplicationMaster; the darker client is running a Message
ager and NodeManager passing Interface (MPI) application and the lighter client is running a traditional MapReduce
application.
gh in practice it expects The darker client(MPI AM.) is running an MPI application,and the lighter client(MR AM 1)is
ating all these functions running a MapRcduce appiication.
calability, programming c. Explain capacity sclu:dulcr background. (04 Marks)
ding and testing a new Ans. Capacily Sd1cdulcr Background
apRcduce frameworks. The Capacity scheduh:r is the default scheduler for YARN thal enables multiple groups lo
ower of multiple servers securely share a large Hadoop clus1cr. Developed by 1he original I ladoop team at Yahoo!, the
resource requests to the C:ipacity scheduler has successfully run many of the largest Hadoop clusters.
n of locality preferences To use tht! Capacity scheduler, one or more queues are configured with a predetermined
he ResorceManager will fraction of the total slot (or processor) capacity. This assignment guarantees a minimum
n according to availability amount of n:sourccs for each queue. Administrators can configure soil limits and optional
f an ApplicationMaster, hard limits on the capacity allocated to each queue. Each queue has strict ACLs (Access
acquired by a subsequent Control Lists) that control which users can submit applications to individual queues. Also,
safeguards anJ in place to ensure that users cannot view or modify applications from other
users.
The Capacity scheduler pcrmits sharing a cluster while giving each user or group certain
~~:MPICllont ,,._':-•· minimum capacity guarantees. Thes..: minimum amounts are not given away in the absence
of th:mand (i.e.. a group is always guaranteed a minimum number of resources is available).
Excess slots an: giwn to the mos, starved queues, based on the number of running tasks
divided by th..: queue capacity. Thus. lhc fullest queues as defined by their initial minimum
capacity gunrantee get the most nee,Jt!d resoum:s Idle capacity can be assigned and provides
elasticity for the users in a co~l-cffectivc manner
Adminislrators can change queue dcfinilions and properties, such as capacity and ACLs, at
run t:me without disrupling users. They can also add more queues at run time, but cannot
delele queues at run time. In addition, administrators can stop queues at run time to ensure
that while existing applications run to completion, no new appli-cations can be submitted.
The Capacity scheduler currently supports memory-intensive applications, where an
application can optionally specify higher memory resource requirements than the default.
Using information from the NodeManagcrs, the capacity scheduler can lhcn place containers
on the best-suited nodes.
The capacity scheduler works best when the workloads arc well known, which helps in
assigning the minimum capacity. For this scheduler to work most effectively, each queue
should be assigned a minimal capacity that is less than the maximal expected workload.
•,hm! "'"' MP/), Within each queue, multiple jobs are scheduled using hierarchical FIFO queues similar to
tart the r..:sourcc. A token- the approach used with the stand-along Flf'O scheduler. If there are·no queues configured. all
l\pplicat ionMastcr presents jobs are placed in the default queue.
, running containers will The ResourceManager UI provides a graphical represrntation of the schedular queues and
-specific protocol to report
their utilization. Figure 4.8 shows two jobs running on ,1 four - node cluster. To select the
c commands. In this way, scheduler vicw, click the scheduler option at the bottom of the lefi-side vertical menu.
nanagcmcnt of containers, Information on configuring lhe capaci1y scheduler can be: found at https : //handoop. apache.
sign, in which scheduling org/docs/currcnt/hadoop-yarn/hadoop -yam-site'CapacitySchcduler.html and from Apache

67
VIII Se,m, (CSE/IS'E) 13~Vam.,A~ CBCS • Mode!,qd
Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2.
In addition to the capacity schedular, H.idoop YARN offers a Fair scheduler. More information
13. Microsoft Bl pla
can be found on the Hadooo webs ite. 14. MicroStrategy
l S. Ml't'S
· - -·..... • 1- - . ...................... .
• .A.._..J..'.~~ .. •· ~ -- ---· --- .. c ■•'l'!lfl,/t • ••••• 16 Open!
~
__ NEW,NEW_ SAVING,SUBMITTED,ACCEPTED,RUNNING Application• - - - -- 17. Oracle Bl

--..,.,
,.-. a ...,.. -..""
..... ._. ._._ ._. <;rt....., .,..,.._~ ......., K •"'IIC.-.IWC'- ..., ~ _ , . . ~ ~ ....... 18. Oracle Enterpris,
~" ....... .........., , ...~ ... --..... ~ ... fr.I.I -.. "'-" .. "'""'" ..... ...... .... ..... --
• • , , a JJ:)O . ....._ oe n 11 t t t t t t
19. Oracle Hyperion
..- ear r · ... "'
,. - • ,..,. :iG2P ; tsrt norcesn
~ :et e cer The Bl tool used in o 1
.....
·--
l'Q ~
..... \,e. . . . . -
As higher education
decision-making. The
quality of student ex
,.,.,.
....... w.,.,.,, ...... ~ .ao --- I. Student enrolmen
·-1!)11
.,,.~ \.11""'1' requires schools to de
. . ... ..
...,,..

, t,.,t .....Mu
- H~"'f
,.,.,.
~>on
~ IN)l'OC D alW
develop models of wll
,..,,..,....
u , ,10,
those students. The st
can be taken in time.
Fig11re 4. 7 : Ap11cl1e YARN res11urce 111a11age web i11terface s!,owi11g c11pacity scl1e,luf(:s
2. Course offerings:
i11/or111atl1111
new courses are likely
Modulc-3 reduce costs, and impr~
Describe list or business intelligence tools used in the organization. Explain any 2 or 3. Alumni pledges: Sc~
5. a
to pledge financial sup
them used in your organization. (10 Marks)
to pledge donations to
Ans. According to the list of best business intelligence tools prepared by experts from Finances
other forms of outreach
Online the leading solutions in this category comprise of systems designed to capture,
categorize, and analyze corporate data and extract best practices for improved decision b. What are the key elem
making. The more advanced the system is, the more data sources it will combine, including Ans. OW has four ke elcmc
internal metrics coming from different company departments, and external data extracted Pata Sources
Operations
from third-party systems, social media channels, emails, or even macroeconomic data.
-ERP systems
Ultimately, business intelligence sofiware helps companies gain insight on their overall · Legacy systems
growth, sales trends, and customer behavior. · Point of Sale
• RFID systems
I. Sisense 20. Palo OLAP Server -W eb usage
2. Actuate Business Intelligence and Reporting Tools (BIRT) 21. Pentaho External
-Suppliers
3. icCube 22. Profit base a Cust omer s
23. QlikView •Government
4. Domo
5. Board Management Intelligence Toolkit 24. Rapid insight.
The first element is the
6. Clear Analytics 25. SAP business intelligence
process of transforming t
7. Duccn 26. SAP DusinessObjects of regularly nnd accurate
8. Gooddata 27. SAP NetWeaver BW is the data access and an,
9. IDM Cognos Intelligence 28. SAS Bl deliver insights and othc
I0. lnsightsquared 29. Silvon Data Sources
OWs are created from s~
I I. JasperSoft 30. Solver need lo be structured befi
12. Looker 31. SpagoBI

68

~ - - --- --
ing with Apache Hadoop 2. 13. Microsofi Bl platform 32. SQL St:rver Analysis Servic~
·che<luler. More information
14. MicroStmteg~ 33. Style Intelligence
15. MITS 34. Syntell solutions
16. Openl 35.Targit
UNNING Applications 17. Oracle DI 36. Vismatica
18. Oracle Enterprise Bl Server 37. WebFOCUS
19. Oracle Hyperion System 38. Yellowfin Bl
The Bl tool ust:d m our organization : Education
As higher education becomes more expensive and competitive, it is a great user of data-based
decision-making. There is a Wong need for efficiency, increasing revenue, and improving the
quality of student experience at all levels of education.
J. Student enrolment (recruitment and retention): Marketing to new potential students
requires schools to develop profiles of the students that are most likely to auend. Schools can
develop mod1ils of what kinds of students are attracted to 1he school, and then reach out to
those students. 1l1e students at risk of not retuming can be nagged, and corrective measures
can be taken in time.
2. Course offerings: Schools can use the class enrolment data to develop models of which
new courses are likely to be more popular with students. This can help increase class size,
reduce costs, and improve student satisfaction.
3. Alumni pledges: Schools can develop predictive models of which alumni arc most likely
iization. Explain any 2 of to pledge financial support to the school. Schools can cr.eate a profile for alumni more likely
( 10 Marks)
to pledge donations to the school. This could lead to a reduction in the cost of mailings and
i1 by experts from Finances other forms of outreach to alumni.
terns designed to capture,
Ices f~r impr~ved. decis~on b. What arc the key clements or a data warehouse? Describe each. (6 Marks)
s it wall combine, including Ans. DW has four ke clements (Figure 5.1 ).
nod external data extracted Qi!Si! ~!2!.!'~l:i
Oper.,t ions l2ila oasa Mart or . Accesslne u,en
ven macroeconomic data. -ERP system~ Transformation & apol(cauons
~ar1bP!.!H
in insight on their overall · Legacy systems • One data OlAP tools
-Point of Sale - Select Data Reporting Tools
-Extract Data mart ror e;,ch
-RFID systems department,
Dashboards
Palo OLAP Server •Web USilgC?
-Cleanse data or
Queries
Exte?rnal
-Compute Data Mobile Devices
•A Ware?house
-Suppliers fields Data Mining
•Integrate Data for the whole
rofit base ·Customers enterprise Custom apps
-Government •load dala

~apid insight. Fig11re 5.1 Data warcl1oull11g arc//itect11re


AP business intelligence The first element is the data sources that provide the raw data. The second element is the
process of transforming that data to meet the decision needs. The third element is the methods
~AP BusinessObjects of regularly and accurately loading of that data into EDW or data marts. 1l1e fourth element
AP Net Weaver BW is the data access and ilnalysis part, where devices and applicnlions use the data from DW to
deliver insights and other benefits to users.
Datu Sources
DWs are created from stnicturcd data sources. Unstructured data, such as text data, ·would
need to be structured before inserted 11110 OW.

69
VIII Semt (CS'f/IS'f) CBCS - lvtodel,Q~
I. Operations data include data from all business applications, including from ERPs systems database management sy
that form the backbone of an organization's IT systems. The data to be extracted will depend and reliable providers of
upon the subject matter or OW. For example, for a sales/marketing DW, only the data about The provider oflhe ope
customers, orders, customer service, and so on would be extracted. Alternatively, a best-of-b
2. Other applications, such as point-of-sale (POS) terminals and e-commerce applications, there for data migration,
provide customer-facing data. Supplier data could come from supply chain management OW Access
systems. Planning and budget data should also be added as needed for making comparisons Data from OW could be
against targets. I. A primary use of D
3. External syndicated data, such as weather or economic activity data, could also be added example, a sales perform
to DW, as needed, to provide good contextual information to decision mal,.ers. with plan. /\ dashboardin
Figure 5.2 Data warehousing architecture users. The data from o
D11ta Transformation Processes executives. The dashboar
The heart ofa useful DW is the processes to populate the DW with good quality data. This is data for root cause analys
called the extract-transform-load (ETL) cycle. 2. The data from the war
I. Data should be extracted from many operational (transactional) database sources on a that make use of the inte
regular basis. 3. Data from DW is used
2. Extracted data should be aligned together by key fie lds. It should be cleansed of any extracted, and then combi
irregularities or missing values. It should be rolled
up together to the same level of granularity. Desired fields, such as daily sales totals, should
be computed. The entire data should then be brought to the same fonnat as the central table 6. a. Des~ribc the key steps i
ofDW. . processes?
3. The transformed data should then be uploaded into DW. This ETL process should be Ans. Effective and successful u
run at a regular frequency. Daily transaction data can be extracted from ERPs, transfonned, skills. The business aspect
and uploaded to the database the same night. Thus, DW is up-to-date next morning. If DW one imagine possible relati
is needed for near-real-time information access, then the ETL processes would need to be help fotch the data from m,
executed more frequently. ETL work is usually automated using programing scripts that are business problem, and the
written, tested, and then deployed for periodic updating DW. An important element is to
OW Design the problem with smaller
Star schema is the preferred data architecture for most DWs. There is a central fact table that it~r~tive sequence of step
provides most of the information of interest. There are lookup tables that provide detailed mmmg techniques over a I
values for codes used in the central table. For example, the central table may use digits to Cross-Industry Standard p
represent n sules person. The lookup table will help provide the name for that sales person (Figure 6. 1):
code. 1lere is an example of a star schema for a data mart for monitoring sales performance
(figure5.2).
m:clJIIOtitm' a tt-ltt'tirtlil:Jtff\RI

u,·m i d ') tit•\ Rt..•1!Ull'I Pl!<IOtf rvtnl !t,,lt::'lo


P••r~iJti 10 Id '.,i ii•-. Quol ~
,n

I
ISh@frASi t ·M!Mli

Fig
Figure 5.2 Star schema arcl1itect11re I. The first and most import
Other schemas include the snownake architecture. The difference between a star and the right business questions
snowflake is thnt in the latter. the lookup tables can have their own further lookup tables. payolfs for the organizatio
There are many technology choices for developing DW. This includes selecting the right mining project is like any otl

,o ~l\t.+M EK""' S....Mv


uding from ERPS systems database management system and the right set of data management tools. There urc a few big
be extracted will depend and reliable providers of DW systems.
OW, only thi: data about The provider of the operational DBMS may be chosen for DW also.
Alternatively, a best-of-breed DW vendor could be used. There are abo a variety of tools 0 111
-commerce applications, there for data migration, data upload, data retrieval, and dat;i analysis.
pply chain management OW Access
for making comparisons Data from DW could be accessed for many purposes, through many devices.
I. A primary use of DW is lo produce routine m;in;igemcnt and monitoring rcpons For
data, could also be added example, a sales performance report would show sales by many dimensions, and compared
ion makers. with plan. /\ dashboarding system will use data from the warehouse and present analysis to
users. TI1e data from DW can be used to populate customized performance dashboards for
e;,,.ecutives. The dashboard could include drill-down capabilities to analyze the performance
good quality data. This is data for root cause analysis.
2. The data from the warehouse could be used for ad hoc queries and any other applications
I) database sources on a that make use of the internal data.
3. Data from OW is used to provide data for mining purposes. Pans of the data would be
hould be cleansed of any c>.tracted, and then combined with other relevant data, for data mining
OR
s daily sales totals, should
format as the central table Describe the key steps in the data mining process. Why Is It important to rollow these
processes? (08 ~larks)
is ETL process should be Effective and successful use of data mining activity requires both business and technology
from ERPs, transformed, skills. The business aspects help understand the domain and the key questions. It also helps
-date next morning. If DW one imagine possible rt:lationships in the data and create hypotheses to test it. The IT aspects
rocesses would need to be help fetch the data from man> sources, clean up the data. assemble it to meet the needs of the
programing scripts that nrc business problem, and then run the data mining techniques on the platform.
An impo11nnt dcmcn1 is to go after th,: problem iteratively. It is better to divide and conquer
the problem 11 ith smaller amounts of data, and get closer to the heart of the solution in an
re is n central fact table that iterative sequence of ~teps. There an: several best practices learned from the use of data
ables that provide detailed mining techniques over a long period of time. The data mining industry has proposed a
tral table ma) use digits to Cross-Industry Standard Process for Data Mining (CRISP-OM). It has sLx essential steps
name for that sales person (Figure 6. 1):
, nitoring sales performance

Fi;:11re 6./ CRISP-DM 1/11/a 111i11i11g cycle .


I. TI~e first a~d most im~ortant step in data mining is business understanding that is, asking
nee between a star and the right business qucstwns. /\ question is a good one if answering it would kad to large
further lookup tables. pa_y~lfs for the_organization, financiall} and otherwise. In other words, selecting a data
eludes selecting the right mining proJect ,s hke nny other project, in 11hich it ~hould show strong payoffs if the project
1\1.-f-Af' f,,1'11\ &.AMv Sttl\.tAf' f.cAM &.AMU' 71
VIII Se-tHI (CSF./IS'£) CBCS · ModeiQ

is successful. There should be strong executive suppon for the data mining project, which 2.Ceomctric Project
means that the project aligns well with the business strategy. A drawback of pixel•
2. A second important step is to be creative and open in proposing imaginative hypotheses understanding the dist
for the solution. Thinking outside the box is important, both in terms of a proposed model as Geometric projection
well in the data sets available and required. data sets.
3. The data should be clean and of high quality. It is important to assemble a team that has a A scatter plot displays
mix of technical and business skills, who understand the domain and the data. Data cleaning added using different
can take 60 to 70 percent of the time in a data mining project. It may be desirable to add new Eg. Where x and y
data elements from external sources of data that could help improve predictive accuracy. different shapes
4. Patience is required in continuously engaging with the data until the data yields some good Through this visualilU
insights. A host of modeling tools and algorithms should be used. A tool could be tried with
different options, such as running different decision tree algorithms.
5. One should not accept what the data says at first. It is better to triangulate the analysis by
applying multiple data mining techniques and conducting many what-if scenarios, to build
confidence in the solution. Evaluate the model's predictive accuracy with more test data.
6. The dissemination and rollout of tht: solution is the key to project success. Otherwise the
project will be a waste of time and will be a setback for establishing and sllpporting a_data•
based decision-process culture in the organization. The model should be embedded in the
organization's business pr~esses.
b. What arc the data visualization techniques? When would you use tables or graphs?
(08 Marks)
Ans. Data Visualization techniques are :
I .Pi,.el oricnlL-d visualization techniques:
Fig
A simple way to visualize the value of a dimension is to use a pixel where the color of the 3.k on based \ isualiza
pixel renects the dimension's value. It uses small icons to re
For a data set of m dimensions pixel oriented techniques create m windows on the screen, 2 popular icon based te
one for each dimension. 3.1 Chem off faces: _ TI
The m dimension values of a record are mapped to m pixels at the corresponding position in human face.
the windows. The color of the pixel renccts other corresponding values.
Inside a window, the data values are arranged in some global order shared by all windows
Eg: All Elcctronil"s maintains a customer information table, which consists of 4 dimensions:
income, credit_limit, trnnsaction_volume and age. We analyze the correlation between
income and other attributes by visualization. Fig 6. 4 : d tcm ujffi
We sort all customers in income in ascending order and use this order to layout the customer ~hspacc {I cm}$ 3.2 St
data in the 4 visualization windows as shown in fig. where each figure has 4
The pixel colors are chosen so that the smaller the value, the lighter the shading. 2 dimensions are mappe
Using pixel ba~cd visualintion 11e can easily observe that credit_limit increases as income angle and/ or length oft
increases customer whose income 1s m the middle range are more likely to purchase more 4.Hlcra rchical \'l~ua li
from All Electronics, these is no clear correlation between income and age. The subspaces are visual

□ EJ@
II Types or Charts
TI1ere arc many kinds
I"' popular fonn of data. It
around alphabetical fist
I shows some of the popu
Income ctC\SitJun.at truu1e1.1oa_volumc .qe
I. Linc graph. This is a
Fig 6.2 : Pixel uric11/1t1/ 1°illl(1/iwtio11 of, 11/lr lbutes ~~ sorli11g 111/ c·ustumers in i11c·ome as a series of points conn
Asc1t11di11g order . is usually sho" n on the l

72
[9'V~A~
' mining project, which 2.Ccometric Projection visualization tech■-...
/\ drawback of pixel-oriented visualization techniques IS dlat they cannot help us much in
imaginative hypotheses understanding the distribution of data in a multidimensional space.
; of a proposed model as Geometric projection techniques help users find interesting projections of mullidir.wnsional
data sets.
emble a team that has a A scatter plot displays 2-D data point using Cartesian cCMXdinates. A third dimension can be
the data. Data cleaning added using different colors of shapes to represent different data points.
be desirable to add new Eg. Where x and y arc two spatial attributes and the third dimension is represented by
predictive accuracy. different shapes
e data yields some good Through this wisualization, we can see that points of types"+" &"X" tend to be collocated.
tool could be tried with IO
~
10 □
iangulate the analysis by 60
hat-if scenarios, to build t:,
y with more test data. y so
ct success. Otherwise the 00 0
ng and supporting a data- a
30
ould be embedded in the b.
10
I,.
to L:,..
use tables or i raphs?
(08 Marks) 0
=
o tO 10 lO 40 ~ 60 10 IO

Fig 6.J : visutilizutiu11 of 1D tla/a set usi11g S<'tlltl!r plot


Kel where the color of the 3.lcon based visualization h:chniqucs:-
lt uses small icons to represent multidimensional data values
m windows on the screen, 2 popular icon based techniques:-
3. 1 Chem off faces: -They display multidimensional data ofup to 18 variablc:s as a cartoon
e corresponding position in
walues.
human face.
,,,.-
.......
er shared by all windows
(h consists of 4 dimensions:
(o_~'')
.... --
re the correlation between
'-...
Fig 6.4: d i em ulfftu•es e11d1fiu·I! reprl!Sl!IIIS 1111 '11' tlimemio1111/ d11111 poi111s (11</8)
rder to layout the customer $\hspace{ I cm}$ 3.2 Stick figures: It maps multidimensional data to five -p1. cc stick figure,
where each figure ha~ 4 limbs amt a body.
ter the shading. 2 dimensions are mapped to the display axes and the remaining dimensions arc mapped to the
t limit increases as income angle and/ or length of the limbs.
~e likely to purchase more 4.Hicrarchical visualiz.ition techniques (i.e. subspaces)
eand age. The subspaces are visualized in a hierarchical manner.
'1 Types of Ch11rts
~ TI1cre are many kinds of data as seen in the case let above. Time· series data is the most
. '·,. . 4 popular form of data. It helps reveal pattems over time. However, d.ita could be organized
• ¥ around alphabetical list of things, such as countries or products or salespeople. Figure 6.1
shows some of the popular chart types and their usage.
age I. Linc graph. This is a basic and most popular type of displaying infom1atio11. It ~hows data
all customers /11 i11co111e ~s a series of points connected by straight line segments. If mining with time-series data, time
1s usually shown on the x-axis. Multiple variables can be represented on the same scale on

73
VIII Se.m, (CSE/ISE)
y-axis to compare of the line graphs of all the varmbles.
2. Scatter plol: This is another very basic and useful graphic fonn. It helps reveal the
relationship bcl\\een two variables. In the 11bo,e caselet, it shows two dimensions: Life
Expectancy and Fertility Rate. Unlike in a line graph, there are no line segment• connecting
the points.
3. Dai- graph: A bar graph shows thin colorful rectangular bars with their lengths being
proportionnl to the values represented. The bars can be plotted vertically or horizontally. The
bar graphs use a lot of more ink than the line grnph and should be used "hen line graphs are
inadequate.
4. Slacked Dar graphs: These are a particul.u method of doing bar graphs. Values of
multiple variables are stacked one on top of the other to tell an interesting story. Bars can
also be nonnaltzed surh as the tolal height of every bar is equal, so ,t can show the relative
composition of each bar.
5. Hlstoi;rams: The~..: ore like tiar graphs, except that they are useful in showing data
freq11en-.1i!!- v: data values on classe~ (or ranges) ofa numerical variable
6. r,e c,,:irts: Tht~e an: very ropul,tr to ~hO\V the di~tribution of a variable, such as sales by
region. ·11,: s:1, of a slice? is 1cp,csentative oithe relative strcngths of each value.
7. 801. l:h:irts· r'it~e arc sped.ti form of charts to show the distribution of variables. The
box slco, ~ thc 1111Jdlc nalfofthe value~, while \,hiskcrs on both side~ extend to the extreme
value, ,11 either direction.
8. Bubbl~ C rnrh: This 1s an inleresting way of displaying multiple dimensions in one chart.
It is a variant of a scatter plot with many data points mao..ed on t\\ o dimensions. Now imagine
that each data point on the graph is a bubble (or a circle) ... the size of the circh: and the color FOOTP
fill in the circle could represent two additional dimensions.
~~~-,--
fr
6SO ....,.,

I
6so r....,__

...-... . Fi,:11re6.7:

~'.~
C.,-,ot, s.--.vi,;.1..,
·-·.,._·.,,• A table ;, b.:st when;
• You need to look up $
• Users need precise vt
• You need to prccisel)
Figure 6.5: Man.I' typ~ ofgraphs • You have multiple da
9. Dials: These are charts lil-e the speed dial in the car, that shows whether the variable value A graph is best when:
(such as sales number) is in the low range, medium r.1ngc. or high range. These ranges could
• The message is conta
be colored red. ycllow and grcen to give an instanl view oftbe data. • You want to reveal r,i
10. Geographical Data maps are panicularly useful maps to denote statistics. Figure 6.6 • Show general trends
shows a tweet density map of the US. It shows where the tweets emerge from in the US. • You have large data ~
11. Pictographs: One can use picturis to rtpresent data. E.g. Figure 6.7 shows the number of
• Graphs and tables sei
liters of water needed to produce one pound ofeach of the products, \Vhere images are used to purpose.
show the product for ea>y reference. Each droplet of \~ater also represents 50 liters of water.

74
C'BCS • MocleiQ~Petpu · 2

1ic form. It helps reveal the


hows two dimensions: Life
no line segments co nnecting

>al'5 with their lengths being


ertically or horizontally. The
be used when line 11,raphs are

oing bar graphs. Values of


an interesting story. Bars can
al, so it can show the relative

are useful in showing data


variable.
fa variable, such as sales by
ths of each value.
aistribution of variables. The
th sides extend to the extreme Fi;:ure 6.6: US tweet 11111p (Source: Slate.CQIII)

ltiple dimensions in one chart.


w O dimensions. Now imagine FOOTPRINT - -
size of the circle and the color

f'/ ,, ~ f~
6SO •ui.r__ 65Q Wh•".;_ 140 0 ~••shwn 2SOOMIIJ•!._

.1
6SQ Tou1__
i
750'•··'~&.t
0
go ru -
r
84 0 C.lln -

.·-·...·....
Figure 6. 7: Pictugraph uf llrtter footprilll (lo11rce: waterfoQ/J)ri11t.org)

-- . A table is best when:


• You need to look up specific values
• Users need precise values
• You need to precisely compare related values
• You have multiple data sets with different units of measur.i
s whether the variable value A graph is best when:
h ronge. These ranges could • The message is cont:iined in the shape of the values
ata. • You want to ri:veal relationships among multiplc values (similarities and differeni:es)
denote statistics. Figure 6.6 • Show general trends
emerge from in the US. • You have large data sets
ure 6.7 shows the number of • Graphs and tables serve different purposes. Chuust: the appropnute data display to lit your
, where images are used to purpose.
represents 50 liters of water.

75
VUI Se-111, (CS£:/IS£)

Modulc-4 a. When to stop buildi


The tree building could
7. a. What Is a splitting variable? Describe lhree crilcri11 for choosing splitting ,·arinble. the tree beco1;1es unrca
(08 Marks) any node is witnin pred
Ans. Splitling lhe Tree: From the rool node. the decision tree will be split into three branches 3. Pru11i11g: The lree c
or sub-trcc:s. one for each of the three values or outlook. Data for the root node (the entire 1 he pruning is oflc11 d
data) will be divided into the three segments. om: for each of the value: or outlook. The sunny usability. The symptom
br,rnch will inherit the data for the in~tnm:e·; that had sunny as the value of outlook These,, ill some of which ma)' refl
be used for further building of that sub-tr..:e. S111111:irly, the rainy branch will inherit data for Then.: arc two approach
the instances that had rainy as the value of outlo,11.;. lnese will be used for further building of
that sub-tree. The overcast branch will inherit the data for the: instances that had overcast as b. Compare 11nd contras
the outlooi.... However. !here will be no need to build further on that branch. There is a clear Ans. Ad,•antages and Disad
decision, yes. for all instances when outlook value is overcast. Rcgn:ssion models are
The decision tree will look like this afier the first level of splitling. I Regression modds a
such as ~orrelation and
2. Regression mw.!I~ p
3. The Mrcngth (or the
correlation coefficients,
4. Regression models c
h.1.,11 \\ ,ndJ l'UJ
YES
T... p .... ~ 5. Regression models ca
6. Regression modeling
Ho1 lh, h r.,1.., No MilJ lhl!h rob< Y-.:.~
IIOI l'·~h r, •• Nu Cool Normol 1-,1..
data niining packages.
"\!'
!1.1,ld If ,:n hll,c N" Cool Ncn,,al ,.... No
capabi Iit ies.
Regression models can
Cc-ol NNm:tl F.,hc y,,. Mold N0t11131 r.JIJC Ve,
I. Regression models c:
Mold Now1~I True: \'u M,hl lhi;h ln1e No
well to remove missing
Decision trees employ the divilie .inJ conquer method. The data is branched at each node validity oflhe models
according to certain criteriJ umil all the data is assigned to leaf nodes. It recursively divides 2. Regression models
a training set until each division consists of examples from one class. among some independ
The following is a pseudo code for making deci!>ion trees: among themselves, the
I. Create a root node and a•sign all oftl1t: training data to it. coefficients will lose the
2. Select 1he b.:st splitting attribute n1:cording to certain criteria. 3. Regression models
3. Add a branch to the root node for each value of the split. although some package
4. Split the dala into mutually exclusive subsets along the lines oflhe specific split. if a large number of vart
5. Repeat steps 2 and 3 for each and every leaf node until a stopping criteria is reached. will be reflected in the r
111crc arc many algorithms for mal.ing decision trees. The most popular ones are CS, CART, power of the modd. Th
and CHAID. TI1ey differ on three key elements: 4. Regn:ssion models di
I . Splittl11g t:rlteriu
The user needs 10 imag
a. Which vuiable to use for the first split? How should one detennine the most important the regression model to
variable for the first branch, and subsequently, S. Regression models ,
for each subtiec'l There urc: many mcasures like least errors, lnfonnatlon gain, and Gini are ways to deal with c
coelficie111.
yes/no value.
b. Wl1111 vnlucs to use for the split? If the variables have continuous values, such as for age
or BP, what \':tlue-rangcs should be used to make bins?
~- I low many branches should be allowed for each node? TI1erc could be binary tn:es, with 8. 11. Compare Neural nctw
Just two branches :it each node. Or there could be more branches allowed. neural network.
2. Stoppi11g uiteriu
Ans. There are maoy dilfere
things to consider: spee
76
CBCS · Modcl,Q~Petpe.· · 2
a. When to stop building the tree? There are two major ways to make that detennination.
The tree hui lding could be stopped when a certain dcpda of'the branches has been reached and
splitting rnriable. the tree becomes unreadable after that. The tree could also be stopped when lhe error level at
(08 Marks) any node is within predefined tolerable levels.
plit into three branches 3. Pfl111i11.I(: The tree could be trimmed to make it IDCIR balanced and more easily usable.
e root node (the entire The pruning is ofien done aller the tree is constructed, lo balance out the tree and improve
e of outlook. The sunny usability: The symptoms of an ovcrfitted tree an: a 11W too deep, with too many branches,
e of outlook. These will some of which may reflect anomalic:s <lue to noise oroudiers. 11111s, the tree should be pruned.
nch will inherit data for There: are two approaches to avoid over-fitting.
d for further building of
b. Compare and contrast decision trees with rcgrcssioD models'! (08 Marks)
ces that had overcast as
Ans. Advantages and Disadvanhtgcs or Regression Models
branch. There is a clear
Regression models arc very popular because they off'er many advantages.
I Regression models arc easy to understand as they are built upon basic statistical principles,
such as .:orrelation and least square error.
2. Regression models provide simple algebraic equations that are easy to understand and use.
3. The strength (or the good:ie~s of fit) of the regression model is measured in terms of the
ny correlation codficients, and other relat.:d statistical parameters that are well understood.
4. Regression models can match and beat the preJictive power of other modeling techniques.
5. Regression models can include all the variables that one wants to include in the model.
\\ 'inti) l'lay
6. Regression modeling tools arc pt:rvasive. They are found in statistical packages as well as
fnls, Yt:'/J
data mining packages. MS Excel spreadsheets can also provide simple regrc~~ion modeling
al False \'cs
capabilities.
al T,uc No
Regression models can however prove inadequate under many circumstances.
31 fol>< Ye< I. Regression models cannot cover for poor data quality issues. If the data is not prepared
~• rrue No well to remove missing values, or is not well-behawd in terms of a normal distribution, the
is branched at each node validity of the model suffers.
es. It recursively divides 2. Regression models suffer from collinear problems (meaning strong linear correlations
ss. among some independent variables). If the independent variables have strong correlations
among themselves, then they will eat into each other's prediclive power and the regression
coefficients will lose their ruggednc:ss.
3. Regression models will not automatil;ally 1:hoose between highly collinc:ar variables,
alll!ough some packages attempt to do that. Regression models can be unwieldy and unreliable
the specific split. ifa large number of variables are included in the modt:I. All variables entered into the model
1g criteria is reached. will be reflected in the regression equation, irrespective of their contribution to the predictive
, pular ones are CS, CART, power of the modc:I. There is no concept ofautomatic pruning the model.
4. Regn::ssion models do not automatically take care of nonlinearity.
The use.r needs to imagine the kind of additional tem1s that might be needed to be addi:d 10
ennine the most important the regression model to improve its fit.
5. Regression modds work only with numeric data and not with categorical variables. Then,
fonnation gain, and Gini are ways to deal with categorical variables though by creating multiple new variables with a
yes/no value.
us values, such as for age OR
ould be binary trees, with 8. a. Compare Neural network with decision tree. Give the advantage and disadvantages of
llowed. neural network. (08 Marks)
Ans. There are many differences between these two, but in practical terms, there are three main
things 10 consider: speed. interprctability, and accuracy.

77
VIII Seoi,, (CS£/IS£)
difficult.
Decision Trees
Should be faster once trained (although both algllrilluns can train ~lowly depending on An ideal cluster can be de
exact algorithm and the amount/dimens1onality of the data). This is because a decision tree In reality, a cluster is a subj
inherently "throws away" the input features that ii doesn·1 find useful, v.hereas a neural net knowledge. In the sample~
will use them all unless you do some feature sdection as a pr.:-processing step.
If it is important to understand what the model is doing, the trees are very 111terpretnbtc.
Onl)' model functions which are axis-parallel splits of the data, which may not be the case.
You probabl· •111 to be sure to prune the tree to avoid over-fining.
Neural Neb
Slower (both for training and classification), and less interpretable.
If your data arrives in a stream, you can do incremental updates with stochastic gradient
descent (unlike decision trees, which u~e inherent!)' batch-learning algorithms).
Can model more arbitrary functions (nonlinear int.:ractions, etc.) and therefore might be more Jr seems like there are two
accurate, provided there is enough training data. But it can be prone to over-fitting as well. as three cl11sters, dependin
There are many advantages of using ANN. way to calculate ii. Heuristi
I. ANNs impose very little restrictions on their use. ANN can deal with (identify/model) Three business application<
hi1;hly nonlinear relationships on their own, without much work from the user or analyst. Cluster analysis is used in
1l1ey help find practical data-driven solutions where algorithmic solutions arc nonexistent or It helps provide characteriz
too complicated. ~atural groupings of custom
2. There is no need to program ANN neural networks, as they learn from examples. They get m a specific domain and
better with use, without much programing effort. business applicatioo of clus
3. ANN can handle a variety of problem types, including classification, clustering, clusters based on their cha~
associations, and so on. on. Here are some e>.ampl~
4. ANNs arc tolerant of data quality issues, and they do not restrict the data to follow strict I. Market Se;:1111mtatio1t: C·
nonnality and/or independence assumptions. by their common wants an
S. ANN can handle bol11 numerical and categorical variables. 2· Product portfolio: Peopl
6. ANNs can be much faster than other techniques. and large sizes for clothing •
7. Most importantly, ANN usually provide better results (prediction and/or clustering) 3· Tl!.\! Mil1/11g: Clustering
10 their contc111 similarities ~
compared to statistical counterparts, once they have been trained enough.
The key disadvantages arise from the fact thal they are not easy to interpret or explain or
compute.
9. II. Give the Comparison bet
I. They are deemed to be black-box solutions, lacking explainability.
Ans. Text Mining is a fonn of da
2. Optimal design of ANN is still an art: It requires expertise and extensive experimentation.
3. It can be difficult to handle a large number of variables (especially the rich nominal Data Mining. However lher
attributes) with an ANN. is that te>.1 mining requ,ires
4. It takes h1r~e data sets to train an ANN. techniques can be applied.
Dimtnsion
b. Define Clusters'! Describe three business applications in you r industry where cluster
analysis will be useful. (08 Marks) Natur, or
data UnMru<:lun:d d
Ans. Definition of a Cluster :An operational definition of a cluster is that, given a representation of
n objects, find K groups based on a mt:asure of similarity, such that objects within the same Languaee Many languag
group arc alike but the objects in different groups are not alike. ustd world; many I
However, the notion of similarity can be interpreted in many ways. Clusters can differ in documtnts an:
terms of their shape, size, and density. Clusters are pattcms, and there can be many kinds Clarity a nd S.:ntcnc.:s can
of pattcms. Some clusters are the traditional lypcs, such as data points hrnging together. prtcision conrrndict the
However, there are other clusters, such as all points representing the circumference of a
Consistenfy Din.:rcnt pans
circle. ll1crc may be concentric circles with points of different circles representing different other
clusters. The presence of noise in the data makes the detection of the clusters even more

78
' V~A~
difficult.
An ideal cluster can be defined as a set of points that is compact and isolated.
n slowly depending on In reality, a cluster is a subjective entity whose significance and interpretation requires domain
· because a dccision m:e J...nowledge. In the sample data below (Figure 8.1), how many clusters can one visualize?
ul, whereas a neural net X
essing step.
e very mterprctable. X
X
ch may not be the case. X
X X
X X
X
X X X
X
with stochastic gradient Fig11re 8.1: Visual clus/er example
algorithms). It seems like there are two clustm of approximately equal sizes. However, they con be seen
d therefore might be more as three clusters. depending on how we draw the dividing lines. There is not a truly optimal
c to over-fitting as well. way to calculate it. I h:uristics are ofien used to define the number of clusters.
Three business applications:
cal with (identify/model) Cluster analysis is used in almoM every fir Id where thet e is a large variety of transactions.
from the user or analyst. It helps provide characterization, definition, and labels for populations. It can help identify
lutions arc nonexistent or natural groupings of customers, products, patients, and so on. It can also help identify outliers
in a specific domain and thus decrease the size and complexity ot prchlems. A prominent
from examples. They get business applicatioo of cluster analysis is in market research. Customers are sei>:-iented into
clusters based on their characteristics-wants and needs, geography: price sensitivity, au-! :o
classification, clustering, on. Herc are some examples of clustering:
I. Market Seg111e11tati1111: Calegorizing customers according to their similarities, for instance
ct the data to follow strict by their common wants and needs, and propensity to pay, can help with targeted marketing.
2. Product portfolio: People ofsimilar sizes can be grouped together to make small, medium
and large sizes for clothing items.
3. Text Mi11i11g: Clustering can help organize a given collection of text documents according
.diction and/or clustering) to their content similarities into clusters of rcluted topics.
enou!?,h.
to i~terpret or explain or Modulc-S
9. a. Give the Comparison between Tcxl Mining and Data Mining? (08 Marks)
11ity. Ans. Text Mining is a form of data mining. There are many common clements between Text and
extensive experimentation. Data Mining. However, there art: some key differences (Table Below). The key difference
:specially the rich nominal is that text mining requires conversion of text data into frequency data, before data mining
techniques can be applied.
Dim,nsion Tut Mining Data Mlaln1
ur industry where cluster Nature of Numbers; alphabetical and logical
(08 Marks) Unstructured data: \\-Ords, phrases. sentences
data values
iat, given a representation of Many languages and dialects used in lhc
hat objects within the same Language Similar numerical S)Stcms across
world; many languages arc extinct. new
UKd
docu1rn:nts arc di~owrcd the world
ways. Clusters can differ in
Clarity 11ml Sentences can be ambiguous: sentiment may Nwnbers an: pn.-cisc
d there can be many kinds pr,cision con1rndict the words
ta points hrnging together.
ing the circumference of a Ditlen:nt parts of data can b.:
DitTen:nl parts of the tc,t can contradict each inconsistc:nt.
Consist,ncy thus. requiring
ircles representing different oth.:r statistical significance analysis
of the clusters even more

79
VIII Sem, (CS'E/IS'E) cacs - ModeiQ~P,
Tcxl may prcscnl a clear ond consis1cn1 nr the computations involved
Scnlimenl mi:1.cd ~n1i1111:nt. across a ~ontinuum. Spoken NIA Bayes, simple Bayes, or in
"ords adds funhcr s.:nlimcnt The advantages of Naive B1
Spdling errors Ditrcrinp , alucs of proper ls,ucs wilh missing valm:,. · II uses a very in1ui1ive 1
Quality nouns, such as nnnu:s. Varying 4uali1y of several free parameters tha
language translation 0111licrs, and so on
· Since lhe classifier return
N111ur\' or ~\:) wonJ•ba~J ~arch. 1,;oc:\.i~lcnce of
/\ full "id.: rJngc of ,1a1is11cal of tasks than ifan arbitrary
and machinc-1.:ammg analysis tor It does not require large
analysis 1hcmcs: scntimcnt mining relationships and diffc:n:nccs · Naive Bayes classifiers a
b. In what wiiys Is Naive-Day&:$ belier than other classification technlqul-s? Compare with
decision tree. (08 Marks)
10, a. What is clickstream anal
Ans. Classification is the separation or ordering of objecls into classes .There are two phases in
On a Web site, clicks1rean
classification algorithm: first, the algorithm tries to find a model for the class attribute as a
collecting, analyzing and
function of other variable\ of the datasets. Next, ii applies pn:viously desigi;icd model on the
- and in what order. The p
new and unseen datasets for detennining the related class of each record.
There are 1wo levels of c
Classification has been applied in many fields such as medical, astronomy, commerce,
Traffic analytics operates at
biology, media, e1c. There are many techniques in classifi&alion method like: Decision Trt:e,
how long ii lakes each png
Nai've Bayes, k-Nearesl Neighbor, Neural Networks, Support Veclor Machine, and Genetic
and how much dala is •~
Algorithm. In this paper we \\-ill use Decision Tree, Naive Bayes, and k-Ncarest Neighbor.
uses clickstream data to de
They are both supervised learning algorithms used for classification tasks. It strongly depends
concerned with what pages
of the data you have and what you arc trying 10 leam. a shopping can, what ilem
Although ii depends on the problem you arc ~olving, but some gcm:ral advan1ages are
loyally program and uses a
following:
Because an extremely larg~
Naive - Biiycs :
many e-businesscs rely on
I. Work well wi1h small dataset compared to DT which need more data
inlerprcl the dala and gene
2. Lesser overfilling
considered to be most effect
3. Smaller in size and faster in processing
evaluation resources.
Decision Tree:
I. Decision Trees arc very flexible, easy lo understand, and easy to debug b. Explain brlcny the leehnl
2. No preprocessing or transfonnalion of features required
3. Prone 10 overfilling bul you can user pruning or Random forests to avoid that. Ans. TECIINIQUES AND AL
In Britf : Decl.s/011 Trl!I! analysis • discovering sub-1
A d1:cision tree is a now-chart-like 1rce s1ruc1un:, \\-here each internal node denotes a test on important nodes or hubs.
an auribute, each branch represents an outcome of 1he test, and leaf nodes represent classes Finding Sub-networks: A
or class distri butions. l l1e popular Decision Tree algorithms are ID3, C4.5, CART. The ID3 be seen as an interconnected
algorithm is con~idcrcd as a very si111ple decision lrce algorithm. II uses infonnalion gain as unique charac1eris1ics. This
splitting criteria. C4.5 is an evolution of 1D3. It uses gain ratio as splitling criteria. between lhem would belong
CART algorithm uses Gini coefficicnl as lhe test attribute selection criteria, and each time belong to separate sub-netwo
selects an a11ribu1e with 1he smallest Gini coefficient as the 1es1 attribute for a given set [5]. no correct number ofsub-ne
n ,e advantage: of using Decision Trees in classifying the dala is that they are simple to for decision -making is lhe
undersland and in1erpre1 . I lowever, decision trees have such di~dvantagcs ,s : Visual represen1a1io11 of 11
I) Most ofthe ulgorithms (Ii/cit /DJ und C.U) require thut the target ut1rih11te will /11.rve only differentiate the types of n
discrete values. help visually identify the s
2) A.f decision tre1ts use the "clivide um/ ,·mUJuer" 1111ttltod, the,• t,mcl to perjurm well ifufew strong relationships around
highly relt.1vun1 uflribut<!.!i l!.ri.il. but less .10 ifmmry complex i11teractiuns are present. sub-network. A sub-network
Naive Buyes : Nai"ve Bayesian classifiers assume 1ha1 there are no dependencies amongst them. In this case ,one or m
attributes. This assumption is called class conditional indcpendenc1:. It is mad11 to simplify

80
the computations involved and, hence is called "naive" . This classifier is also called idiot
Bayes, simple Bayes, or independent Bayes .
The advantages of Naive Bayes are:
· It uses a very intuitive technique. Bayes classifiers, unlike neural networks, do not have
missing values. several free parameters that must be set. This greatly simplifies the design process.
u so on · Since the classifier returns probabilities, it is simpler to apply th!i!se results to a wide variety
range or statistical of tasks than ifan arbitrary scale was used.
tnc-lcarning analysis for · It does not require large amounts of data befon: learning can begin.
iPS and ditTercnccs · Naive Bayes classifiers arc computationally fast when making decisions.

niqucs? Compare with OR


(08 Marks) 10. a. What is clickstream analysis? (02 Marks)
here are two phases in On a Web site, clickstream analysis (also called clickstream analytics) is the process of
r the class attribute as a collecting, analyzing and reporting aggregate data about which pages a website visitor visits
y desigl)e<l model on the -- and in what order. The path the visitor takes though a websi1e is called the clickstream.
cord. There are two levels of clickstream analysis, traffic analytics and e-commerce analtyics.
astronomy, commerce, Trame analytics operates at the server level and tracks how many pages are served to the user,
thod like: Decision Tree, how long it takes each page to load, how often the user hits the browser's back,or stop button
or Machine, and Genetic and how much data is transmitted before the user moves on. E-commerce-based analysis
nd k-Nearest Neighbor. uses clickstream data to determine the effectiveness of the site as a channel-to-market. It's
tasks. It strongly depends concerned with what pages the shopper lingers on, what the shopper puts in or takes out of
a shopping cart, what items the shopper purchases, whether or not the shopper belongs to a
general advantages are loyalty program and uses a coupon code and the shopper's preferred method of payment.
Because an extremely large volume of data can be gathered through clickstrcam an;,lysis,
many e-businesses rely on big data analytics and related tools such as Hadoop to help
data interpret the data and generate reports for specific areas of interest. Clickstream analysis is
considered to be most effective when used in conjunction with other, more traditional, market
evaluation resources.

o debug b. Explain briefly the techniques and algorithms of social network analysis?
(14 Marks)
s to avoid that. TECHNIQUES AND ALGOIUTIIM : There are two major levels of social network
analysis - discovering sub-networks within the network and ranking the nodes to find more
rnal node denotes a h:st on important nodes or hubs.
af nodes represent classes Finding Sub-networks: A large network could be 1he better analyzed and managed if it can
D3, C4.5, CART. The ID3 be seen as an interconnected set of distinct sub-networks each with its own distinct identity and
It uses infom1ation gain as unique characteristics. Thi~ is like doing a cluster analysis of nodes. Nodes with strong ties
splitting criteria. between them would belong to the same sub-network ,while those with weak or no ties would
ion criteria, and each time belong to separate sub-networks .This is unsupervised leaming technique ,as in Apriori there is
tribute for a given set [5]. no correct number of sub-networks in a nc:twork. The usefulness of the of the network structure
is that they are simple to for decision -making is the main criterion for ad'opting a particular structure.
Visual representation of networks can help identify sub-netw<;irks. Use of color can help
differentiate the types of nodes. Representing slrong ties with thicker or bolder lines could
help visually identify the stronger relationships. A sub-network could be a collection of
d lo perform well if a few strong relationships around a hub node. In 1his case, th1. hub node could represenl a distinct
·tiuns are present. sub-network. A sub-network could also be a subset of nodes with dense relationships between
no dependencies amongst them. In this case ,one or more nodes will act as gateway to the rest of the network.
. It is made to simplify

81
VIII Sem.- (CSf/ISf) CBCS - Mode.l,Q~wn, j
There are two outbound I
half of node A's influenc
A ,so both C and A recei
There is only outbound I
node D . There is only
influence of node C.
Node A gets all of the i1
Thus,
Node B gels half the inH
Thus,
Node C gets half the infl
Thus,
Node D gets all of the in
Thus,
Fig A nch••ork wit/1 Dlsti11ct sub Nt!twork We have 4 equations usi
Computing lmportuncc of Nodes: When the connection~ between nodes in the network We can represent the coe
have a direction to them ,then the nodes can be compared for their relative influence or ( I0. I) given below..This
rank. This is done using · Influence Flow Model'. Every outbound link from a node can be not represented in an equ
considered an outflow of influence. Every incoming link is similar an inflow of influence.
More in-links to a node means greater importance. Thus there will be many direct and
indirect flows of influence between any two nodes in the network.
Computing the relative influence of each node is done on the basis of an input-output matrix Ra
of flows of influence among the nodes. Assume each nodes has nn influence value . The Rb
computational task is to identity a set of rank values that satisfies the set of links between the Re
nodes. It is an iterative task ,,here we begin with some initial values and continue to iterate
H.d
till the rank values stabilize.
For simplification , let us
Consider the following simple network with 4 nodes (A,B,C,D) and 6 directed links between
a fraction as the rank valu
them as shown in the figure( I0.2). Note that there is a bidirectional link . Here are the links:
compute new rank values
Node A links into B
as Ji n or //.I for the nod~
Node B links into C
Node C links into D
Node D links into A
Node A links into C
Node B lin!..s into A
0 :!:==:=====;
j
Vari11
ll&i
H.~
0◄•------~ Rel

Fig R~
Computing the revised va,1
The goal is to find the relative importance, or rank, or every node in the network . This will
help identify the most imponance node(s) in the network. shown as iteration I.
Using the rank values fro
We begin by assigning the variables for influence ( or rank) value for each node, as Ra , Rb
for these variables ,show
, Re , and Rd. The goal is to find the relative values of these variables.

82
There are two outbound links from node A to node 8 and C. Thus , both Band C receives
half of node A's influence. Similarly, there are two oudlound links from node B to node C and
A ,so both C and A receives half of node B's influcnc:c
There is only outbound link from node D to node A . Thus ,node A gets all the influence of
node D. There is only outbound link from node C to node D nnd hence, node D gets all the
influence of node C.
Nodc A gets all of the influence of node D and half the inlluence of node 8 .
Thus, Ra =IJ.5 X Rb +Ri.
Node O gets half the infl uence of node A.
Thus, Re =0.5 X Rn.
Node C gets half the infl uence of node A and half the influence of node B.
Thus, Re =IJ.5 X Ra+ 0.5 X Rb.
Node D gets all of the influence of node C and half the influence of node B.
Thus, R,I : Re.
rk We have 4 equations using 4 variables . These can be solved mathematically.
een nodes in the network We can represent the coefficient of these 4 equations in a matrix form ns shown in the Dataset
their relative influence or (10. 1) given below..This is the Influence Matrix. The zero value represent that the term is
d link from a node can bo not repre~ented in an equation.
'tar an inflow of influence. Data St!t lO. I
will be many direct and Ra Rb Re Rd
Ra 0 0.50 0 1.00
is of an input-output matrix
Ith 0.50 0 0 0
an influence value . The
the set of links between the Re 0.50 0.50 0 0
ues an<l continue to iterate Rd 0 0 1.00 0
For sirnplifica11on, let us also state that all the rank values add up to I. Thus. each node has
ind 6 directed links between a fraction as the rank value . Let us start with an initial set of rank values and then iteratively
al link . Here are the links: compute new rank values till they stabilize. One can start with any initial rank values, such
as J/11 or // .J for the nodes.
Variable Initial Value
lb 0.250
Rb 0.250
Re 0.250
Rd 0.250
Variable Initial Value Iteration I
lb 0.250 0.375
Rb 0.250 0. 125
Re 0.250 0.250
Rd 0.250 0.250
Computing the n:v1se<l values usmg the equations startc1l earhcr,we get a revised set of values
~e in the network . This will shown as iteration I.
Using the rank values from Iteration I a~ the new starting values ,we can compute new values
e for each node , as Ra , Rb for these variables ,shown as Iteration 2. Rank values will continue to change.
blcs.
83
VIII Sun, (CSE/ISE)

Variable lnilial Value Iteration I Iteration 2


Ru 0.250 0.375 0.3125
Rb 0.250 0.125 0.1875
Re 0.250 0.250 0.250
Rd 0.250 0.250 0.250 Time: 3 hrs.
Working from values of lterataon2 nnd so. we can do a few more 1terataons sull the values Note : d11n11~r atty Fl
stabili1e. Dataset(I0.2) shows the final values after the 8th iteration.
Dat11 St!t 10.2
Variable Initial V;1l11c Iteration I Iteration 2 ..... Iteration 8 I . a. Write notr on folio
Ra 0.250 0.375 0.313 ..... 0.333 (I) Rack awareness
Rb 0.250 0.125 0.188 ..... 0. 167 (2) HDFS snapshot
0.250 0.250' 0.250 0.250 (3) HDFS Namc-n
Re ··-·· Ans. ( I) Rack Awareness
Rd 0.250 0:250 0.250 ..... 0.250
Rack awareness deal
The fi nal rank shows that rank of node A 1s the highest at 0.333. Thus . the most important
MapReduce is lo mo
node is A . The lowest rank is 0.167 of Rb. Thus , B i~ the least important node. Nodes C and
do not offer bisectio
D arc in the middle. In this case,their ranks did not change at all. locality:
The relative scores of the nodes in this network would have been the .same irrespective of the
initial values chosen for the computations. It may take longer or ~horter number of iterations I. Dara resides on th
for the results to stabilizc for different sets of initial values. 2. Data resides in the
PAGERANK: PagcRank is a particular application of the social network analysis techniques 3. Data resides in ad
above to compute the relative importance of websites in the overall World Wide Web. The When the YARN sch
data on website and their links is gnthered through web crawler bots that traverse through try to place the cont
the \.\Cbpage at frequent interv,lls. Every wcbpagc is a node in the social networl.. and all another rack.
the hyperlinks from that page become din:cted links to other web-pages . Every outbound In addition, the Nam
link from a web-page is considered an outflow of influence of that web-page. An iterative fault tolerance. In su
compulntionnl lcchniquc is applied to compute a relative importance lo each page. That from working. Perfo
score is calh:d PageRank. according to an eponymous algorithm invented by the founders of I IDFS can be madc
Google .the web search company. map the network to~
Page Rank is used by Google for ordering the display of websites in response to search
4ueries. To be shown higher in the search results, many websites owners try to artificially
belong to the same
(2) HDFS Snaps hot
(I'
boost their PageRank by creating many dumm) websites whose rnnks can be made to now HDFS snapshots are
into their desired websites. Also, many websites can be designed to cyclical sets of links -snapshot command.
from nhere the web crawler may not be able to break out. These are called spider traps. TI1ey offer the folio
To overcome these and other challenges , Google includes a Tcleporting factor into • Snapshots can be
computing the PageRank. Tekporting assumed that there is a potential link from an) node • Snapshots can
to any other node, irre~pective of whether it actually exists. Thus, the influence matrix is recovery.
?'ultiplied ~y a w~ighting factor called Beta with a typical value of 0.85 or 85(percent)¾ . • Snapshot creatio
I he ~cma1111ng we1~ht ~f 0.15 or I 5 (percent)% is given to teh:portation . In Teleportation • Blocks on the D
matrix , ~ach cell 1s given a mnk of 1/n , when: 11 is the number of nodes the web. The and the file size.·
~wo ~1atnccs arc added to compute the final inllucncc matrix. This matrix can be used to duplicate files.
1tera11vely compute the l'ageRonk of all the nodes.
• Snapshot do not
(3) IIDFS N11meNo
Another important
providt!d a single na
resources of a single
84
lterulion 2
Eighth Semester B.E. Degree Examination,
0.3125
CBCS - Model Questioa Paper - 3
0.187S
0.2S0
BIG DATA ANALYTICS
Time: 3 hrs. Max. Marks: 80
0.2S0
~ Note: .(Jmwtr a11y FIVE full q11tstio11s, stltt:ti111 ONE/111/ quntionfro111 tach modult. ,
ore itcrat ions still the values
ion.
Module- I
Iteration 8 I. a. Write note on following. (08 Marks)
0.333 (1) Rack awareness
0.167 (2) IID FS snapshots
(3) HDFS Nanu....nods Federation
0.250
Ans. ( I) Rack Awareness
0.250 Rack awareness deals with data locality. Recall that one of the main design goals of Hadoop
. Thus , the most important MapRcduce is to move the computation to the data. Assuming that most data centre networks
important node. Nodes C and do not otrer bisection ~andwidth, a typical Hadoop cluster will exhibit three levels of data
II. locality:
n the same irrespective of the I . Data resides on the local machine (best)
shoner number of iterations 2. Data resides in the same rack (better)
3. Data resides in a different rack (good).
I network analysis techniqU1:s
erall World Wide Web. The When the YARN scheduler is assigning MapReduce containers to work as mappers, it will
er bots that traverse through try to place the container first on the local machine, then on the same rack, and finally on
another rack.
n the social network and all
web-pages . Every outbound In addition, the NameNode tries to place replicated data blocks on multiple racks for improved
that web-page. An iterative fault 1olerance. In such a case, an entire rack failure will not cause data loss or stop IIDFS
portance to each page. That from working. Performance may be degraded, however.
invented by the founders of HDFS can be made rack-aware by using a user-<krived script that enables the master node to
map 1hc ne1work topot"K)' of the cluster. A default Hadoop installation assumes all the nodes
cbsitcs in response to search belong 10 the same (large) rack. In that case, there is no option 3.
ites owners try to artificially (2) HDFS Snapshots
se ranks can be made to flow HDFS snapshots are similar to backups, but an: created by administrators using the hdfs dfs
gned to cyclical sets of links -snapshot command. IIDFS snapshots are read-only point-in-time copies of the file system
~e are called spider traps. They offer the following features:
es a Tclcporting factor into • Snapshots can be taken ofa sub-tree of the file system or the entire file system.
potential link from any node • Snapshots can be used for data backup,protection against user errors, and disaster
:fhus, the influence matrix is recovery.
ue of 0.85 or 85(pcrcent)% . • Snapshot creation is instantaneous.
lc:portation . In Tdeportation • Blocks on the DataNodcs are not copied, because the snapshot files record the block list
ber of nodes the web. The and the file size. There is no data copying, although it appears tot the user that there are
. This matrix con be U)ed to duplicatc files .
• Snapshot do not adversely affect regular HDFS operations.
(3) HDl-"S NameNode Fcdcratio11
Another important foaturc of I IDFS i5 NameNode federation. Older versions of IIDFS
provided a single namespace for the entire cluster managed by a single NameNode. Thus, the
resources of a single NameNode determined the size of the namcspace. Federation addresses

85
VIII Sern, (CSf/ISf) C!lCS·M~Q~
this limitation by adding support for multiple NameNode namespaces to the HDFS file I. Run teragen to gene~
system. The key benefits are as follows: $ yam jar SHA DOOP El
• Namespace scalability. HDFS cluster storage scales horizontally without placing a burden 500000000 -
to the NameNode. --,/ user/hdfs/TeraGen-50
• Better performance. Adding more NamcNode to the cluster scales the file system read/ 2. Run terasort to sort
write operations throughput by separating the total namespaces. $ yam jar $HA DOOP E1
• System isolation. Multiple NameNode enable different categories of applications to be --tuser/hdfs/TeraGen:So
distinguished and users can be isolated to different namespaces. 3. Run teravalidate to
Figure I.I illustrates how IIDFS NameNode Federation is accomplished. NameNodel $ yam jar SHADOOP E
manges the / research an /marketing namespaces, and NamcNodc2 manages the / data and /
--t user/hdfs/TeraSort:SO
project namespaces. The NameNode do not communicate with each other and the DataNodes
To report results, the tim
"just store data block" as directed b either NameNode.
megabytes/second (MB/
should be run with a rcpli
- .....
Nl..,...,;oonOO l"IOI
tasks is set to I. lncreasi

- -
....
,
"""'-
.... aaafN For example, the followi 1
$ yarn jar SHA DOOP E
_, -Dmapred.reduce::Us
do not forget to clean u
following command will
$ hdfs dfs -rm -r -skipTra
Running the TcslDFSI
Hadoop also includes an
Figure I . I HDFC Na111eN01le Feder111/011 ex11mple.
benchmark is a read and
b. Explain with a concept or running basic lladoop Benchmarks. (8 Marks) to and from HDFS and i
Ans. Running Basic Hadoop Benchmarks: Many Hadoop benchmarks can provide insight file size and number offi I
into cluster performance. The best benchmarks are always those that reflect real application benchmark, you should r
performance. steps. In the following ex
The two benchmarks discussed arc : tcr asort and TcstDFSIO, provide a good sense of how 16files of size IGB ares
well your Hadoop installation is operating and can be compared with public data published mapreduce-client-jobclien
for other Hadoop systems. The results, however, should not be taken as a single indicator for file. Running it with no
ssytem-wide perfommnce on all applications. . (load testing the NameN~
The following benchmarks are designed for full Hadoop cluster installations. These tests commonly used Hadoop I
assume a multi-disk HDFS environment. Running these benchmarks in the Hortonworks reported of these benchm
Snadbox or in the pseudo-distributed single-node install is not recommended because all I . Run TestDFSIO in w~
input and output ( 1/0) are done using a single system disk drive. $ynrnjar$HADOOP E
Running the Ter11sort Test -+ TestDf.'SIO -write .:;-,rf
The terasort bechmarks so11s a specified amount of randomly generated data. Example results arc as fol
This benchmark provides combine testing of the HDFS and MApReduce layers ofa Hadoop fs.TestDFSIO: ----TcstO
cluster. A full terasort benchmark run consists of the following three steps: fs.TestDFSIO: Date & ti1
I. Generating the input data via teragen program. files: 16
2: Running the actual tcrasort benchmark on the input dllta. fs.TestDfSIO: Total M
3. Validating the so1ted output data via the teravalidatc program. 14.890106361891005 fs.
In general, each row is 100 bytes long; thus the total amount of data wrillen is 100 times fs.TcstDFSIO: 10 rate ~t
the number of rows specified as part of the benchmark (i.e., to write 100GB of data, use I sec: 105.631
billion rows). The input and output directories need to be specified in HDFS. The following 2. Run TcstDFSIO in re
sequence of commands will run the benchmark for 50G B of data as user hdfs. Make sure the $ yarn jar $HA DOOP_ E
/ user/hdfs directory exists in HDFS before running the benchmarks. --> TcstDFSIO -read -nrFi
'Bi,g,V~ A ~

mespaccs to the HDFS file I. Run tcragen to generate rows of random data to sort.
S yam jar SHADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen
olly without placing a burden 500000000
-+/user/hdfsfferaGen-50O0
scales the file system read/ 2. Run tcrasort to sort the d u tu base.
$ yam jar $HA DOOP_EX/\MPLES/hadoop-mapreduce-examples.jar h:rasort
ces.
gories or applications lo be -+/user/hdfsfferaGcn-50GO /user/hdfs/TcraSort-50O8
ces. 3. Ru n teravalidatc to validate the sort.
accomplished. NameNode I S yarn jar SHA DOOP_EXAMPLES/hadoop-mapreduce-cxample.jar ter::ivalidate
cle2 manages the / data and / -+/user/hdfsfferaSort-50GB /user.hdfsfferaValid-50GB
ach other and the DataNodes To report results, the time for the actual sort (terasort) is measured and the benchmark rate in
megabytes/second (M13/s) is calculated. For best performance, the actual tcrasort benchmark
should be run with a replication factor of I. In addition, the default number of terasort reducer
tasks is sel to I. Increasing the number of reducers oOcn helps with benchmark performance.
For example, the following command will instruct tcrasort to use four reducer tasks:
S yarn jar SIIADOOP_EXAMPLES/hadoop-mapreduce-example.jar terasort
.... -Dmapred.reduce.tasks~4 /user/hdfs/TeraGen-50GB /user/hdfsfferaSort-50GB Also,
do not forget to clean up the terasort data belween runs (and after testing is finished). Tite
following command will perform the cleanup for the previous examplt::
$ hdfs dfs -rm -r -skipTrash 'Iera•
Running the TcstOFSIO Benchmark
Hadoop also includes an I IDFS benchmark application called tcstDFSIO. Tite TestDFSIO
ex ample. benchmark is a read and write test for I IDFS. That is, it will write or rc:ad a number of files
(8 Marks) 10 and from HDFS and is designed in such a way that it will use: one map task per file. The

ch~arks can provide insight file size and number of files are specified as command-line arguments. Similar to the terasort
sc that reflect real application benchmark, you should run this test as u~er hdfs. Similar to terasort, TestDFSIO has several
steps. In the following example.
provide a good sense o~how 16files of size I GB are spc:cificd. Note that the TestDFSIO benchmark is part of the hadoop-
~d with public data published mapreduce-client-jobclicnt.jar. O1her benchmarks are also available as part of this jar
taken as a single indicator for file. Running it with no arguments will yield a list. In addition to TestDFSIO, NNBench
(load testing the NameNo<lc) 3nd MRBcnch (load testing the MapReduce framework) are
ter installations. These tests commonly used Hadoop benchmarks. Nevertheless, TestDFSIO is perhaps the most widely
hmarks in the Hortonworks reported of these benchmarks. The steps to run TestDFSIO are as follows:
t recommended because all I. Run TestDFSIO In write mode and create data.
•e. S yam jar SHA DOOP_EXAMPLES/hadoop-mapreduce-clicnt-jobclient-tests.jar
-+ TestDfiSIO-wrilc -nrfiles 16 -fileSize 1000
encratcd data. Example results are as follows (data and time prefix removed).
pReduce layers of a Hadoop fs.TcslDFSIO: -----TestDFSIO ----: write
three steps: fs.TestDFSIO: Date & lime: Thu May 14 10:39:33 EDT 2015 fs.TestDFSIO: Number of
files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0 fs.TestDFSIO: Throuhput mb/scc:
14.890106361891005 fs.1estDFSIO: Average 10 ralc mb/sec: 15.690713882446289
of data written is 100 times fs.TestDFSIO: 10 rate std deviation: 4.0227035201665595 fs.TestDFS IO: Test exec time
write I00GO of data, use I sec: 105.631
ed in HDFS. The following 2. Run Tcst0FSIO in read mode.
as user hdfs. Make sure the $ yam jar $HA DOOP_EXAMPLES/hadoop-mapreduce-client:jobclient-tests.jar
-+ TcstDFSIO-read -nrFiles 16 -fileSize 1000

87
VIII Semt (CSE/IS£) 8~V~A~ C13CS - Modei-Q~
Example results are as follows (data and time prefix removed). The large standard deviation archives to be unarch
is due to the placement of tasks in the cluster on a small four-node cluster. fs.TestDFSIO: rhe general commam
--TestDFSIO --- : read bin/hadoop command
fs.TestDFSIO: Data & time: Thu May 14 10:44:09 EDT 2015 fs.TestDFSIO: Number of
files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0 fs.TestDFSIO: Thrughput mb/sec: 2. a. Write a short note o
32.38643494172466 fs.TestDFSIO: Average 10 rate mb/scc: 58.72880554199219 i. Specuulative exec
fs.TestDFSIO: 10 rate std deviation: 64.60017624360337 fs.TestDFSIO: Te~t exec time sec: ii. Hadoop Map red;
62.798 Ans. i. Specuh1tivc EllCd
3. Clean up the TestDFSIO data. predict or manage un
$ yam jar SHA DOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar and monitor resourc
-. TestDFSIO -1.:lenn prac1ice, however, tH
Running the TestDFSIO and terasort benchmarks help you gain confidence in a Uadoop possible that a cong ,
installation and detect any potential problems. It is also instructive to view the Amabri some olher similar p
deshboard and the YARN web GUI (as described previously) as the tests run. When one pan of a
Managing Hadoop MapReducc Jobs else because lhe ap
Hadoop MapReduce jobs can be managed using the maprcd job command. The most of lhe parallel Map
important options for thi~ command in terms of the examples and benchmarks are -list, -kill, that input da1a are i
and -status. copy of a running m
In particular, if you need to kill one of the examples or benchmarks, you can use the mapred example. suppose 1h.
job -list command to find the job -id and then use mapred job -kill <job-id> to kill the job that some are still
across the cluster. MapReduce jobs can also be controlled at the application level with the busy or free servers.
yam application command. The possible options for mapred job are as follows: then terminated (or
S mapredjob approa.ch can be ap~
Usage: CLI <command> <args> [-submit <job-file>] execution can seem
[-status <job-id>] .xml configum1ion ~
[-counter <job-id> <group-name> <counter-name>) (-kill <job-id>] ii. Hadoop MapRe
[-set-priority <job-id> <priority>] . Valid values for priorities are: VERY HIGH HIGII The capability of I
NORMAL LOW VERY LOW -
foilures can intluenc
[-events <job-id> <from:-evcnt-#> <#-of-events>) [-history <jobHistoryFilc> J Hadoop clusters ha
[-list [all] J
possible for many d;
[-list-active-trackersJ
a cluster.
[-list-black Iisted-tmckers]
The use of server
[-lisl-attempt-ids
RED ..;job-id> <task-t..,..,..>
., r- <task-state>] • "vii I'd va Iucs for <task-typ.!> arc
I somewhat dilfercnl
. UCE MAP. Valid values for <taks-state> are running, completed
is possible lo build
[-krll-task <task-attempt-id>) [-fail-task <task attempt-id>)
[-logs <job-id> <tosk-allempt-id>] nodes). However.a
Generic options supported are bo1h roles. Another
tolcra1e dissimilars
-conf <configuration file> specify an application configuration file
large disparities in
-D <property; valucs> use value for given propeny
-~s <local I namcnodc:port> specify a namenode MapReduce executi
·JI <local I resourcemunager:port> specify II RcsourceManager b. Ei.plaln com pili ng
-files ~comma separated list offiles> specify comma separated files 10
be cop,ell to the m:ip reduce cluster Ans. rile Hadoop Grep.j
-libjars _<comm? scpamtcd list ofjars> specify comma separated jar times they occum:
files to include m the classpath. docs not display the
-archives <comma separated list of archives> specify conima separated needed for the strin
The program runs
88
oved). The large standard deviation archives to be unarchived on the compute machines.
II four-node cluster. fs.TestDFSIO: The general command line syntall. is
bin/hadoop command (genericOptionas) [commandOptions]
T 2015 fs.TestDI-SIO: Number of OR
.TestDFSIO: Thrughput mb/sec: Write a short note on following
te mb/sec: 58.72880554199219 i. Spccuulative e,.ccution
fs.TestDFSIO: Test exec time sec: ii. lladoop Map reduce h:irdwure. (4 Marks)
Ans. i. Speculative Execution: One of the chalhtngcs with man)' large cluster is the inability to
predict or manage unexpected system bonlcncd.s or failures. In theory, it is possible to control
and monitor resources so that network traffic and processor load can be evenly balanced: in
client-jobclient-tests.jar practice, however, this problem represents a difficult challenge for large systems. Thus, it is
possible that a congested m:twork, slow disk controller, failing disk, high processor load, or
ou gain i:onfidence in a lladoo~ some other similar problem might lead 10 slow performance without anyone noticing.
instructive to view the Amabn When one pait of a Map Reduce process runs slO\\ ly, it ultimately slows down evel)'thing
ly) as the tests run. else because the application cannot compkte until all processes arc finished. The nature
of the parallel MapReduce model provides an interesting solution 10 this problem. Recall
apred job command. The most that input data arc immutable in the MapReduce process. Thercforc,il is possible to start a
les and benchmarks are -list, -klll, copy of a running map process without disturbing any other running mapper processes. For
example, suppose that as most of the map tasks arc coming to a close, the ApplicationMaster
chmarks, you can use the mopred that some arc still running and schedules redundant copies of the remaining jobs on less
ed job -kill <job-id> to kill t~e job busy or free servers. Should the secondary processes finish first, the other first processes are
•d at the application level with the then terminated (or vice versa). This process is known as speculative cxc<:ution, The same
cd job are as follows: approach can bi: applied to n.:duccr processes that sc!\!m to be taking a long time. Speculative
execution can seem to have a slow spot. It can als? b.: turned olf and on in the mapred-si1e
.xml configuration file.
Ii. Hadoop MapRcduce Hardwuc
<job-id>] The capability of I ladoop MapRcduce and IIDFS lo tolerate server-or even whole rack-
iorities are: VERY_HIGH HIGH
failures can influence hardware designs. The use of commodity (typically x86_64) servers for
Hadoop clusters has made low-cost, high-cost, high-availability implementations ofHadoop
<jobH istoryFile>1 possible for many data centres. Indeed, the Apache Hadoop philosophy seems to assume on
a cluster.
The use of server nodes for both storage (I IDFS) and processing (mappers, reducers) is
somewhM dilferent from thi: traditional separation of these two tasks in the data centre. II
Valid values for <task-type> are is possible to build Hadoop system and separate the roles (discrete storage and processing
completed nodes). However.a majority of Iladoop systems use the general approach where servers enact
l both roles. Another interesting feature of dynamic MapReducc execution is the capability to
tolerate dissimilar servers. Thnt is, old and m:w hardware can be used together. Of course,
large disparities in performance will limit the faster systems, but the dynamic nature of
MapReduce execution will still work effectively on such systems.
b. E:r.plain compiling and running the lladoop Grep chaining cumpte .. ith pro,:ram.
(08 Marks)
Ans. The Iladoop Grep.java example extracts matching string from text fi lcs and counts how many
times they occurred. Thc command works differently from the •nix grcp command in that it
docs not display the co111ple1e matching line, only the matching string. If matching lines are
needed for the string foo, use. •roo. • as a regular expression.
The program runs two map / reduce jobs in sequence ans is an example of MapReduce

89
VIII Sem, (CSc/IS'f) cscs · ModevQ~
chaining. The first job counts how many times a matching string occurs in the input, and the grepJob.setOutputKeyCla
second job sorts matching strings by their frequency and stores the output in a single ft le. grepJob.se1Output-Valuescl
Listing 1.1 displays the source code for Grcp.java. grcpJob. waitforCompletio
Note that all the Hadoop example source files can be extracted by locating the hadoop- Job sortJob = new Job (co
mapreduce-example-•-sources.jar either from a Hadoop distribution or from the Apache Fi le InputFormat.setlnputP
Hadoop website (as part of a full Hadoop package) and then extracting the files using the sortJob.sctlnputFormatCla
following command (your version tag may be different): sortJob.setMapperClass(ln
$ jar xf hadoop-mapreduce-example-2.6.0 sources.jar. sortJob.setSortNumReduc
Listing 1.1 Hadoop Grep.java Eumplc sctOutputPath(sortJob,nc
package org.apache.hadoop.examples; sortJob. waitForCompletio
import java.util.Random; }
import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.conf. finally {
Configuration; import org.apachc.hadoop.fs.Filesystem; import org.apache.hadoop.fs.path; FileSystem.get(conf).delet
import org.apache.hadoop.io. LongWritable; }
import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.•; return O;
import org.apache.hadoop.mapreduce.lib.input.FilelnputFormat; }
import org.apache.hadoop.mapreduce.lib.input.Sequencefilelnputformat; public static void main (St
import org.apache.hadoop.mapreduce.lib.map.lnverseMapper int res == ToolRunncr,run(n
import org.apachc.hadoop.mapreduce.lib.map.RegexMapper; System.exit(rcs);
import org.apache.hadoop.mapreduce.lib.output.fileOutputformat; }
!mport org.apache.hadoop.mapreduce.lib.output.SequenceFileOutpulFormat; }
~mport org.apache.hadoop.mapreduce.l ib.reduce. LongSumReducer; In the preceding code, each
import org.apache.hadoop.util.Tool; provided regular exprcssio
import org.apache.hadoop.util.ToolRunner; task and extracts text mate
/*Extracts matching regexs from input files and counts them.*/ output as <matching strin
public class Grep extends configured implements Tool ( sum up the number of mate
private Grep ( ) () // singleton ' reducer uses the LongSum
public int run ~String [] args) throws Exception { if(args .length <3) { key.
System.out.prit In ("Grep <inDri> <outDir> <regex> (<group>]"); The second job takes the ou
return 2;
reverses (or swaps) its input
)
so the Identity Reducer class
Path tempDir =
There is also an ldentityMal
new path ("grep-temp-"+
s1ored in one file and it is soJ
Integer .toString (new Random ( ) .
nextlnt (Integer .MAX_YALU£)); count and a string per line.
The example also demonst ,
Configuration conf=getCom () ;
or a reducer.
~onf.set (RegexMapper .PATTERN, arge(2] );
tf(args. length = .. 4) The following discussion de
conf .set (RegexMapper .GROUP, nrgs [JJ ); The steps are similar to the p
Job grepJob "' new Job (count); I. Create a directory for the
try { $ mkdir Grep_classes
grepJob.setJobName ("&rep-search")· 2. Compile the WordCount.j
Filelnputformat.setlnputPaths (grcpiob, args[OJ); $ javac -cp 'hadoop classpat
grepJob.setMapperClass (RegexMapper.class) ; 3. Create a Java archive usin~
grepJob.selCombinerClass (LongSumReducer.class); $ jar -cvf Grep.jar -C grep_ch
~pJob.setReducerClass (LongSumReducer.class); If needed, create a directory a
FtleOutputFormnt.setOutputPath (grepJob tempDir)· $ hdfs dfs -mkdir war-and-pe
$ hdfs dfs -put war-and-peace
grepJob.setOutputFormatClass (SequencefileOutputFormat.class);
90
S.,,1,l-11.- e«- 5r...,,,,..,..
I 8~V~A~
g occurs in the input, and the grepJob.sctOutputKeyClass (Text.class);
s the output in a single file. grepJob.setOutputValucsclass (Long Writable.class);
grepJob.waitForCompletion (true);
·ted by locating the hadoop- Job sortJob - new Job (conf); sortJob.setJobName ("grep-sort'');
ibution or from the Apache FilelnputFormat.setlnputPaths (sortJob, tempDir);
extracting the files using the sortJob.sctlnputformatClass(Sequencefilelnputfonnat.class);
sortJob.setMapperClass(lnverscMapper.class);
sortJob.setSortNumReduceTasks (I); //Write a single file FileOutputFormut.
sctOutpulPath(sortJob,new path(args [I]); LongWritablc.DccreasingComparator.class);
sortJob.waitForCompletion (true);
}
finally {
org.apache.hadoop.conf.
FilcSystcm.get(conf).delcte (tcmpDir, true);
org.apache.hadoop.fs.path;
}
return O;
apreduce.•; )
public static void main (String [] args) throws Exception {
utformat;
int res = ToolRunner,run(new Configuration ( ), new Grew (), args);
System.exit(res);
}
at; }
utputformat;
In the preceding code, each mapper of the first job takes a line as input and matches the user-
cer; provided regular expression against the line. The RcgcxMnppcr cfnss i~ used to perform this
task and extracts text matching using the given regular expression. Thc matching strings are
output as <matching string, I> pairs. As in the previous WordCount example, each reducer
sum up the number of matching strings and employs a combiner to do local sums. The actual
reducer uses the LongSumRcducer class that outpub 1hc sum of long values per reducer input
key.
<3){
The second job takcs the output of the first job as its input. The mapper is an invcrsemap thal
I"); rcverses (or swaps) its input , key, value> pairs into <valuc,key>.There is no reduction step,
so the ldcntityRcduccr class is used by default. /1. II input is simply passed to the output (Note:
There is also an ldentityMappcr class.) The number of reducers is set to I, so the output is
stored in one file and it is sorted by count in descending order. The output text file contains a
count and a string pcr line.
The example also dcmonstmh:s how to pass a command-line parameter to a mapper
or a reducer.
The following discussion describes how to compile and run the Grep.java example.
The steps arc similar to the previous WordCount example:
I. Create a directory for the application classes as follows:
$ mkdir Grep_classes
2. Compile the WordCount.java program using the following line:
S javac -cp 'hadoop classpath' -d Grep_classcs Grcp.javn
3. Create a Java archive using the following command:
$ jar -cvf Grep.jar -C grep_classes/.
If needed, create a directory and move the war-and-pcacc.txt lite into I IDFS:
S hdfs dfs -mkdir war-and-pcaC\.'-input
S hdf~ dfs -put war-and-peacc.txt war-and-peace-output

91
VIII Se+1'11 ( CSE/IS£)
As always, make sure the output directory has been removed by issuing the following Container: container_ 143266'
command: ....:----=-=-- - -- ---
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
Entering the following command will run the Grep program: Container: container_ 142667
$ hadoop jar Grep.jar org.apache.hadoop.example.Grep war-and-peace-input -=====-===-~= - == It:

---+ war-and-peace-output Kutuzov [..:]


As the example runs, two stages will be evident. Each stage is easily recognizable in A specific container can be
the program output. The results can be found by examining the resultant output file. preceding output. For ex
$ hdfs dfs -cat war-and-peace-output/part-r-00000 examined by entering the co
530 Kutuzov and port number are written
simply replace the _ with a:
1
c. Explain command line log voiding. (04 Marks)
container can be found by ent
Ans. MapReduce logs can also be viewed from the command line. The yam logs command enables
the logs to be easily viewed together without having to hunt for individual log files on the S yarn lo_gs -applicationld ap
-+ contamer_J432667013445
cluster nodes. As before, log aggregation is required for use. The options to yam logs are as
follows:
$ yarn logs
3. a. Exp!ai~ with an diagrams D
Retrieve logs for completed YARN applications.
usage: yarn logs -applicationld <application ID> [OPTIONS] Ans. Ooz1e is a workflow director
general options are: Hadoop jobs. For instance c
-appOwner <Application Owner> AppOwner (assumed to be current user if not Hadoop jobs to be run as ; w
specified) a successive job. Ooze is not
-<:ontainertd <Container ID> Containerld (must be specified if node address is resources for individual Had
specified) Hadoop jobs on the cluster
-nodeAddress <Node Address> NodeAddress in the format nodename: Oozie workflow jobs are re.pre
port (must be specified if container id is specified) are basically graphs that cannot
For example, after running the pi example program (discussed in Chapter 4), the logs can be • Workflow-a specified sequ
examined as follows: · control dependency. Progr
$ hadoop jar SHADOOP_ EXAMPLES/hadoop-mapreduce-examples.jar pi 16 action is complete.
100000 • Coordinate-a scheduled wo
After the pi example completes, note the application Id, which can be found either from the become available.
a~plicati~n o_utput or by using the yarn application command. The applicationld will start • ~undle-a higher-level Oozi
with apphcat1on_and appear under the Application-Id column. integrated with the rest of th
$ yarn application -list -appStates FINISHED of the box (e.g., Java MapR
Next, run the following command to produce a dump of all the logs for that application. Node as system-specific jobs (e.g.,
that the output can be long and is best saved to a file.
. and a web UI for monitoring
$ yarn logs •applicationl~ application_l432667013445_0001 > AppOut Figure_ 3.1 depicts a simple Oo
The AppOut file can be inspected using a text editor. Note that for each container stdout ope~at1on. lft,he application was
!tderr. _and syslog are provided . :Y-he list of actual containers can be found by u~ing th; Ooz1e workflow definitions are
,ollowmg command:
Such worktlow.contain several t
$ grep -8 I = == ==AppOut For example (output truncated): • Con trol flow nodes define th
[...]
end. and oplional fail nodes.
:~n~ain~~containt:r_l43?667013445_0001_01_ 000008 on limulus 45454 • Action nodes are where the
--- =--•==-- - =-=---- =====~~==-= ~==~--- -- finishes, the remote systems n
Action nodes can also include
• Fork/join node enable pamll
two or more tasks to run at the
must wait until all forked tas.

92
cacs • ModeiQ~Pc;q,u. 3
y issuing the following Container: container_ l432667013445_000 I_0 I_OOOOOI on no_45454

Container: contnincr_ 14266701344S_OO0I_0I 000023 on nl _45454


:ace-input =- • === ======= - - =:=:11;1',.r::==•===~· ===a:::z== -
[...]
y recognizable in A specific container can be examined by the containt:rld and the nodcAddress from the
ltant output file. preceding output. For example, container 1432667013445 000 I 0 I 000023 can be
examined by entering the command followingthis paragraph. N-otc th;t the node name (n I)
and port number are written as n I_4S454 in the command output. To get the nodeAddress.
simply replace the _ with a: (i.e., -nodeAddress n I :45454). Thus, the results for a single
(0-1 Marks) container can be found by entering this line:
am logs command enables $ yam logs -application Id application_ 1432667013445_000 l_containerld
individual log files on the -o container_l432667013445_0001_0 1_000023-nodeAddress nl:45454 I more
options to yam logs are as
Module-2
Explain with an diagrams DAG work Oows? (10 Marks)
Oozie is a workflow director system designed to run and manage multiple related Apache
Hadoop jobs. For instance, complete data input and annlysis may required several discrete
1-ladoop jobs to be run as a workftow in which the output of one job serves as the input for
be current user if not a successive job. Ooze is not a substitute for the YARN scheduler. That is, YARN manages
resources for individual Hadoop jobs, and Oozie provides a way to connect and control
if node address is Hadoop jobs on the cluster.
Oozie worktlow jobs are represented as directed acyclic graphs (DAGs) of actions. (DAGGs
nodename: are basically graphs that cannot have directed loops.) Three types of Oozie jobs are permitted:
ontainer id is specified) • Workflow-a specified sequence of Hadoop jobs with outcome-based decision points and
Chapter 4), the logs can be control dependency. Progress from one action to another cannot happen until the first
action is complete.
• Coordinate-a scheduled workflow job that can run at various time intervals or when data
become availabh:.
n be found either from the • Bundle-a higher-level Oozie abstraction that will batch a set of coordinator jobs. Oozie is
11,e applicationld will start integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out
ofth1.1 box (e.g., Java MapReduce, Streaming MapRcduce, Pig, Hive, also Sqoop) as well
as system-specific jobs (e.g., Java program and shell scripts). Oozie also provides a CLI
,gs for that application. Node and a web UI for monitoring j obs.
Figure 3.1 depicts a simple Oozie workflow. In this case, Oozie runs a basic MapReduce
AppOut operation. If the application was successful. the job end; if an error occurred, the job is killed.
t for each container, stdout. Oozie workflow definitions are written in hPDI., (an XML Process Definition Language).
can be found by using the Such workflow contain several types of nodes:
• Control flow nodcs define the beginning and the end ofa workflow. They include st.art,
end. and optional fail nodcs.
• Action nodes are where the actual processing taks are defined. When an action node
1ulus_45454 finishes, the remote systems notify Oozie and the ne>.t node in the workflow is executed.
--=---= Action nodes can also include HDFS commands.
• Fork/join node enable parallel execution of tasks in the workllow. The fork node enable
ulus_45454 two or more tasks to run at the same time. A join node represents a redezvous point that
must wait until all forked tasks complete.

93
VIII Se.+111 (CSE:/ISE:)

• Control flow nodes enable decisions to made about the previous task. Control decisions
run on standard Hadoop
are based on the results of the previous action {e.g., file size or file existence). Decision
inefficienl nml totally unn
nodes are essentially switch-case statements that use JSP . . YARN provides the user
EL (Java Server pages-Expression Language) that evaluate to either true or false. Figure 3.2
MapReduce. Support for
depicts a more complex worl.folw that uses all of these node types.
In addition, using the flex

..
own web interface to mo
iii) I lamster: Hadoop a
The Message Passing lnte
ERROR MPI is primarily a set of
operate over popular ser
have full control over thei
run within a Hadoop clus
discussion of the issues in
~Wo,ldloWDAO
version ofMPICH2 is ava
Figure 3.1 A Jimple 001,ie DAG worJ.jfow iv)Apache Spark
Spark was initially develo
,:;. performance, such as ite
interactive d(lta mining. Sp
Spark holds intermediate r
holds supports more than
possible analyses that can
Java, and Pylhon.
Since 2013, Spark has been
of porting and running Sp
single underlying file syste

4. o. llow to chanie Hadoop p


Ans. One of the challenges of m
configuration properties. In
changes to a property oflcn
FIi( 3.2 A more complex Oozie DAG workftow cluster. This process is tcdi
b. E:i.plain the follo" ing In the Application frame work? (06 Marks) way to manage this process.
(i) Apache Tez Each service provides a Co1
(ii) Apache Giraph properties. Any service prop
(Ill) Hamster Hat.loop ,ind MPI 011 the same clu~ter.
the configuration properties
(iv) Apaehespark.
Ans. (I) Apache Tez
One great example of a nt:w YARN framework is Apache Tcz. Many Hadoop jobs involves
the execution ofa complex directed acyclic graph (DAG) of task using separate MapRcduce
stage. Apache Tez generalizers this process and enables these tasks be spread across stages
so that they can be run as a single, all-encompassing job. Tez can be used as a MapRcduce
replacement for projects such as Apache Hive and Apache Pig. No changes are needed 10 the
I live or Pig application~.
Ii) Apache Giraph
Apache Giraph is an ilerative graph processing system built for high scalability. Facebook,
twiucr, and Linkedln use it to create social graphs users. Giraph was originally writh:n to

~ifAr' E«M 5uMu


lf!-V~A~ C13CS · Mo-de.iQ~Petpu · 3
s task. Control decisions run on standard Hadoop V I using the MapReduce framework, but that approach proved
file existence). Decision inefficient and totally unnatural for various reasons. The native Giraph implementation under
YARN provides the user with an iterative processing model that is not direclty available with
r true or false. Figure 32 MapReducc. Support for YARN has been present in Giraph since its own vusion I O release.
In addition, using the flexibility of YARN, the Giraph developers plan on implementing their
own web interface to monitor job progress
iii) Hamster: Hadoop and MPI on the Same Cluster
The Message Passing Interface (MPI) is widely used in high-performance computing (HPC).
MPI is primarily a set of optimized message-passing library calls for C, C++, and Fortran that
operate over popular server interconnects such as Ethernet and lnliniBand. Because users
have full control over their YARN containers, there is no reason why MPI applications cannot
run within a Hadoop cluster. The llamster effort is a work-in-progress that provides a good
discussion of the issues involved in mapping MPI to a YARN cluster. Currently, an alpha
WOflll\oW.lffl version of MPICH2 is available for YARN that can be used to run MPI applications.
lv)Apache Spark
Spark was initially developed for applications in which keeping data in memory improves
performance, such as iterative algorithms, which are common in machine learning, and
interactive data mining. Spark differs from classic MapReduce in two important ways. First,
Spark holds intermediate results in memory, rather than writing them to disk. Second, Spark
holds suppons more than just MapReduce functions;.that is, it greatly expands the set of
possible analyses that can be e;,cecuted over HDFS data stores. It also provides APls in Scala,
Java, and Python.
Since 2013, Spark has been running on production YARN clusters at Yahoo!. The advantage
of porting and running Spark on top of YARN is the common r~source management and a
single underlying fi le system
OR
How to change Hadoop properties? (12 Marks)
One of the challenges of managing a Hadoop cluster is managing changes to cluster wide
configuration properties. In addition to modifying a large number of properties making
changes to a property often required daemons (and dependent daemons0 across the entire
cluster. This process is tedious and time consuming. Fortunately, Ambari provides an easy
way to manage this process.
Each service provides n Configs tab that opens a form displaying all the possible service
properties. Any service property can be change (or added) using this interface. As an example,
the configuration properties for the YARN scheduler as shown in Figure 4.1.

any Hadoop jobs involves


using separate MapReduce
ks be spread across stages
be used as a MapRcduce
o chanites are needed to the

high scalability. Facebook,


was originally written to
95
VIII Se-wv (CSf/I SE)

.·--
J•- ......., . .__.__
..
.. •(! ••-~ ... .. .. , ••
Clw:.I!. \\\1!. \!S.l!.t ~~ ~~'j
Figure 4.4, is presented.
new property is ohanged

. --
The new property will n

< ) ·;
m - -- --
·--- • ·--- • ·--- •
....
-· Figure 4.5, the Restart b
To be safe, the Rcsta11 Al
service will be restarted ·
Aller the user clicks R•
displayed. Click Confirm
Save Co

---
Fl ure

--
Figure 4.1 Ambari YARN properties view
Summary eon11g,

The number can be options offered depends on the service; the full range of YARN properties
can be viewed by scrolling down the form. To easily view the application logs, this property 0....., YARNOola1M1(•) I .
must be set to true. This property is normally on by default. As an example for our purpose
Figure 4.5 Ambarl
here we will use the Ambari interface to disable this feature. As shown in Figure 4.2, when
a pr~pe is chan ed the ~ccn Save button becomes activated. Confirma
>:•
. _..._ You are at>ot.t to

This ,. 11 :r 991!1
... Ma1n1~• .lt'C8 !

.,.,,,.aa~• 0 • 0

Figure 4.
Figure 4.2 YA RN prop,uties wit!, log aggre1:atio11 t11me1/ off Similar to the DataNodc re•
Changes do not become permanent until the user clicks the Save button. A save / notes progress bar is for the entire
window will then~ displayed. It is highly reco1rnrn:nded that historical notes concerning the arrow to the right of the bar
change be added to this window. Once the restart is complet
nf.~v==e:".c~o='n~filg
i:::u-:-,-a:-;
tl-o-
n- - - - - - - - - - ~
YARN ResourceManager Ap
down menu in the middle oil
4.8 will be displayed .
Ambari tracks all changes m
4.1 and in more detail in Fig
created. Reverting back to a
potential for version confusi
Figure 4.3 Amburi ,•011jig11rutio11 l·u11d11otes w/11<10111. Figure 4.3 and Figure 4. t I).
96
... _____ .., .. Once the user adds nny notes a<ld clicks the Save bunon. another window, shown in
Figure 4.4, is presented. This window confirms that the properties have been saved. O~ce the
new property is changed, an orange Restart button will appe~r at the top left of the wmdO\~.
The new property will not take effect unlll the required services are restarted. As shown tn

- . -. Figure 4.5, the Restart button provides two options: Restart All and Restart NodeManagers.
To be safe the Restart All should be used. Note that Restart All docs not mean all the Hadoop
service will
be restarted ; rather, only those that use the new property will be restarted.
After the user clicks Restart All, a confirmation window, shown in figure 4.6 will be
displayed. Click Confirm Restart All to be in the cluster-wide restart.
ve Configuration Changes

m
Fl ure 4.4 Ambur/ co11 11r"tio11 clw11 e 1101/ 1mllo11
Sunvna,y Cor,'G• Q..d'.ll.Q•

/el view .
he full range ofYARN properties
e application logs, this property
. As an example for our purpose Figure 4.5 Am/Jar/ R~larl 1111clio11 a es i11 lervlce ro erlles
e. As shown in Figure 4.2, when onfirmation X

rated
Voll aro abOl.t 10 leslllrt YAHi

Tn,s " , gge, ., M:i ~. tr~ ur,w 1. •e.tartO<I To ~ppioss alMS.tum OIi
MaT'~ffdr,c.- Mxiv rlJ, •'M,,, ~•I)! IC •IJlnr~ r&slJ~ ail

Figure 4.6 Ambarl co1,jirmalio11 box/or service restart


Similar to the DataNode restart example, a progress window will be displayed. Again, the
eg"tio11 t11me1/ off
progress bar is for the entire YARN restart. Details from the logs can be found by clicking the
the Save bulton. A save / notes
arrow to the right of the bar (see Figure 4.7).
at historical notes concerning the
Once the restart is complete, run a simple example and attempt to view the logs using the
YARN RcsourceManagcr Application UI. (You can access the UJ from the Quick Links pull-
down menu in the middle of the YARN series window.) A message similar to that in Figure
4.8 will be displayed .
Ambari tracks all chMges made to system properties. Is can be seen in Figure
4.1 and in more detail in Figure 4.9, each time a configtJration is changed, a new version is
created. Reverting back to a previous version results in a new version. You can reduce the
potential for version confusion by providing meaningful comments for each change (e.g.,
toles wlmlow. Figure 4.3 and figure 4.11). In the pn:ceding examph:, we created version 12 (V 12). The

97
VIII Se.m, (CSf/ISf) 8£'1'Vatl;l,A~
current version is indicated by a green Current label in the horizontal version boxes or in the
dark horizontal bar. Scrollin thou h the version boxes
1 Background Operationa Running •

.:l rm
cc----.
v-• . .
c..,. ..
■- ...
a::,

,,,_&,_..,.....,c....
Con"9'to<OOZIE

.,1'1e51111_.<lll--~ _.,,..,:,o,i,O:.Z
---· 0

□ Do... _ .. CIIIOQ..,-~•~""-- a Figure 4.9 Ambur,

.........._..._ ..
Fi 11re ,. 7 Ambur/ ro ress wl11dow ur c/u.fter wide YARN start
. . . . . . . . . ...
~

,,...
·-·-
- L09~ not iva,lab~ for Job_104480791998_0001 "99~a11on m1y not t,., co~oete CM<k
~ck later or try the nOdemanooer al nl •5•5•

Fl,:ure 4.8 YARN Resourcc/lla11agcr l11terface wlt/1 lug aggregatio11 turm!d off

---
or pulling down the menu on thi: ten-hand side of the dark horizontal bar will display the
previous configuration versions.
To revert to a previous version, simply select the version from the version boxes or the pull- ""'PM■
down menu. In Figure 4.10, the user has selected the previous version by clicking the Make
Current button in the information box. This configuration will return to the previous state ,.,..,_.

--..--
where log aggregation is enabled.

Figure-I.I
As shown in Figu
is saved. Again. it
box. When the sa
configuration. The
needed before the c

98
c.:acs - ModeltQ~Paper - 3
orizontal version boxes or in the &ullffil/)' Conllgo
o..o...--
...
- - M (10,

"""'
.:l rm,. __- cm ........ _ - 11 . . . . ._ - m •• _,,.ag,J

• .
---
100'II,

.
f'k•W:Juf0.1 M.UM()t 1'

'°""' • .....
'°""' •
'°""' •
,..,...
~-
_.,,. 1024

0 • 0

• yam.ad- •

·- .. yarn_... • ,.!. .

yarnJoO~ 0 • 0

El Figure 4.9 Ambarl co11jig11ration cl,ange 111a11age111ent for YARN service (Version JI/2
current
wide YARN start
rm
..,_
---•---•-..•• •"u• ·-- --·
rm g-·- rm ~---· m. _...
- m _...
-· _. ■

-- -
• ~ ~ ""90'
,: aggregation tumed off
horizontal bar will display the

the version boxes or the pull-


s version by clicking the Make ----
NIP.,.
,..,..,_. n
102•

ill return to the previous state


---
--lc>v-- J;/1

Figure 4.10 Reverting to previous YARN configuration (VII) wit/, An,bar/


As shown in Figure 4.11, a confirmation / notes window open before the new configuration
is saved. Again, it is suggested that you provide note about the change- in the Notes text
box. When the save step is complete, the Make Current button will restore the previous
configuration. The orange Restart button will appear and indicate that a service restart is
needed before the changes take effect.

99

,
VIII Semt (CSE/ISE)

Make Current Confirmation X Ambari, go to the I IDFS sc


screen, select the Add Pro
two properties (the item w
code):
<propcny>
<name>hadoop.proxyuser.
<value>• <tvalue>
Figure 4.11 A111b11r/ co1,jirn1111/011 wl11duw for a 11ew c·u1,jigur11tiu11 < /property>
1l1ere are several imponant points to remember about the Amabri versioning tool: <property>
• Every time you change the configuration, a new version is created. Reverting to a previous <name>hadoop.proxyuse
version created a new version. <value>• <tvalue> </pro
• You can view or compare a vi:rsion to other versions without having to change or restan The name of the user who
service. (See the buttons in the VI I box in Figure 4.10.) In the previous example,
• Each service has its own version record. starts the gateway. If, for
• Every time you change th.i propenies, you must restan the service by using the would be hadoop.proxyus
Restart button. When in doubt, festan all services. value, entered in the prece
b. Deline the c11p11bililies and conliguration steps of an NF5V3 Gateway to HDFS any host. Access is restri
(04 Marks) Entering a host name for ti
Ans. Configuring a n NFSv3 Gateway to I IDFS Nexl, move to the Advan
HDFS suppons an NFS version 3 (NFSv3) gateway. This feature enables files to be easily <name>dfs.nfs3.dump.dir-
moved between HDFS and client systems. The NFS gateway supports NFSv3 and allows <value>/tmp/.hdfs-nfs</v
HDFS to be mounted ns pan of the client's local file system. Currently the NFSv3 gateway </property>
supports the following capabilities: The NFSv3 dump direct
• Users can browse the HDFS file system through their local lile system using an NFSv3 Sequential writes can arri
client-compatible operating system.
temporarily save out-of-o
• Users can download fi les from the HDFS file system to their locol file system. has enough space. For cxa,
• Users can upload files from their local file system directly to the HDFS file svstem. recommended that this di
• Users can strea'.11 ~ata directly to HDFS through the mount point. File append is supported every file.
but random write 1s not supported. '
Once nil the changes have
The g~teway m_ust be run on the same host as n DataNode, NamcNode, or any HDFS client
made to the Notes box in 1
More information about the NFSv3 gateway can be found at https:/' hadoop apache org/
docs/current/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway · · the orange Restart buuon.
.hlml. Step 2: Start the Gatewa

:.Fs~~
In th~ following example, a simple four-node cluster is used to demonstrate tne
~~::~~t~e gateway. Other potential options, including those, related to sec:~~!~;
HDFS . s e ant is e>.ampl~. A D~taNodc is used a!, the gateway node in this example 'and
Log into a DatnNode and 0
Data Node nO is used as th
# service rpebind stop # se
Next, start the HDFS gatei
is mounted on the main (login) cluster node. ,
Step I : &I Conliguration Files and nfs3 as follows :
S~veral Hadoop configuration lites need to be chan ed I h' . #/usr/hdp/2.2.4.2-2/hadoo
will be used to niter the IIDF'S configuratio Iii Dg . n t ts example, the Amban GUI hadoop.sbin/hadoop-daemo
until all lhe following changes are made ~f es. o not sa~e the changes or restart HDF'S The portmap daemon will
the~.: files by hand and then rcstar1 the ap~rop~;t~ :': ?~' u~,ng Ambari, you must change /vvar /log/hadoop/root/h.id
cnv1ronmen is assumeck crvices across the cluster. The following To confirm the gnteway is
• OS: Linux look like the following:
• Platform: RHEL 6.6 #rpcinfo -p no
• Hononworks IIDP 2.2 with lladoop version- 2 6
Several properties need to be add d . . ~
• e to the /etc/hadoop/conf ig/core-site.xml file.
Using LJ.§
100
Sw.~i,_... E.c- ~""-
A b · 0 to the HOFS service window and sel«t the Configs tab. Toward the bottom of~he
X m an, g r k · the Custom core- site xml section Add the following
screen, sele~ tll(c1hA~d1e~~:yro~n1h:key field Ill Ambari is· tile name fi~ld included in this
two properties e 1
code):
<property>
<name>hadoop.proxyuser.root.groups</name>
E·AiA: :I <value>•</value>
< /property>
,r 11ew ,·01,jigurut/011
<property>
nabri versioning tool: <name>hadoop.proxyuser.root.hosts<lname>
5 created.Reverting 10 a previous <value>•<Jvalue> <!property>
The name of the user who will start the Hadoop NFSv3 gateway is placed in the name field.
thout having to change or restart
In the previous example, root is used for this purpose. This setting can be any user who
starts tile gateway. If, for instance, user nf sadmin starts the gateway, then the two names
would be hadoop.proxyuser.nfsadmin.groups and hadoop.proxyuser. nfsadmin.hosts. The •
he service by using the
value, entered in the preceding lines, opens the gateway to all groups and allows it to run on
any host. Access is restricted b entering groups (comma separated) in the group's property.
·v3Gateway to HDFS Entering a host name for the host's property can restrict the host running tile gateway.
(04 M11rks) Next, move 10 the Advanced hdfs-site.xml section and set the following property: property>
<name>dfs.nfs3.dump.dir</name>
Feature enables files to be easily <value>/tmp1.hdfs-nfs<lvalue>
vay supports NFSv3 and allows <lprop1my>
, . Currently the NFSv3 gateway The NFSv3 dump directory is needed because the NFS client ofierl recorders writes.
Sequential writes can arrive at the NFS gateway in random order. This directory is lZ to
>eal file system using an 'NFSv3 temporarily save out-of-order writes before writing to I IDFS. Make sure the dump directory
has enough space. For example, if tile application uploads 10 files, each of size 100MB, it is
~eir local file system. recommended that this directory have IGB of space to coYer a worst-case write reorder for
y to the IIDFS file svstem. every file.
t point. File append is supported, Once all the changes have been made, click the green Save button and note the changes you
made to the Notes box in the Save confinnation dialog. Then restart all of HDFS by clicking
llameNode, or any HOFS client. the orange Restart button.
at https:/' hadoop. apache.org,I Step 2: Start the Gateway
11ay Log into a DataNode and make sure all NFS services are stopped. In this example,
Data Node nO is used as the gateway.
cd to demonstrate tne steps for # service rpebind stop# service nfs stop
g those, related to security, are Next, start the HDFS gateway by using tile hadoop-daemon script to start portmap
way node in this example, and and nfs3 as follows :
#/usr1hdp/22.42-2/hadoop!sbin/hadoop-daemon.sh start portmap #/usr/hdp/224 2-2/
hadoop.sbin/hadoop-daemon.sh start nfs3
is example, the Ambari GUI The portmap daemon will write its log to
e the cllanges or restart HOFS /vvar /log/hadoop/roollhadoop-root-nfs3-n0.log
g Amb:iri, you must change To confinn the gateway is working, issue the following command. The output should
the cluster. The following
look like the folio" ing :
#rpcinfo -p no
program vers proto port service
10000.S 2 top 4242 mountd
f ig/core-site.xml tile. Using

101
VIII Setfl/ (CSF:/ISE) 13iif'V~A~

100000 2 udp 111 portmapper I. Sisense


100000 2 top Ill portmapper 2. Actuate Busim:ss lntc
100005 I tcp 4242 mounted 3. ieCub.:
100003 3 tcp 2049 nfs 4.0omo
100005 I udp 4242 mounted .5. Board Management 1

mounted 6. Clear Ano!) lits


100005 3 udp 4242
7. Oucen
100005 3 tcp 4242 mounted
8. Gooddata
100005 2 udp 4242 mounted
9. IDM Cognos lntcllig
Finally, make sure the mount is availabh: by 1ssuing the following command :
10. lnsightsquared
#showmount -c n0
Export list forn0 : 11. faspcrSofl
t• 12. Looker
Jfthe rpcinfo or showmount command does not work correctly, check the previously 13. Microsofl lll platfc
mentioned of files for problems. 14. MicroStraccgy
Step 3: Mount JIDFS 1.5.MITS
The final step is to mount HDFS on a client node. In this example, the main login node is used.
To mount the HDFS files, exit from the gateway node and create the following directory : 16. Open! •
#mkdir /mnt / hdfs 17. Oracle Bl I
The mount command is as follows. Note that the name of the gateway node will be different 18. Oracle Enh:rprise Q
on other clusters, and an IP address can be used instead of the node name. #mount -t nfs -0 19. Oracle llypcrion SJ
vers = 3, proto = tcp, no lock n0 : / /mnt / hdfs / . Th~ Bl tool us~d in j
Once the file system is mounted, the files will be visible to the client users. The following As higher education t,
command will list the mounted file system: decision-making. Th
# is/mnt/hdfs quality ofstudent ex
app-logs apps benchmarks hdp mapred mr-history system Imp user var The gateway in the I. Student enrolme
current Hadoop n:lease uses AUTH_UNIX-style authentication an requires that the login requires schools to d
user name on the client march the user name that NFS passes to HDFS. For example, if the develop models of w
NFS client is user admin, the NFS gateway will access HDFS as user admin and existing those students. The s
HDFS permissions will prevail. can be taken in time.
ll1e system administrator must ensure that the user on the NFS client machine has the same 2. Course offerings:
user name and user ID as that on the NFS gateway machine. This is usually noc a problem if new courses are like
you use the same user management system, such as LDAP/NIS, to create and deploy users reduce costs, and im
to cluster nodes. 3. Alumni pledges:
Module-3 to pledge financials
to pledge donations
5. a. Describe list or business intelligence tools ustd In the organization. Explain any 2 or other forms of oulre
them used In your organization. (10 Marks)
Ans. According to the list of best business intelligence tools prepared by experts from Finances b. What arc lhe soure
Online the leading solutions in this category comprise of systems designed to capture, evolve In the ai:e of
categorize, and analyze corporate data and extract best practices for improved decision Ans. Data Sources: DWs
making. The more advanced the system is, the more data sources it will combine, including data, would need to
intcmal metrics coming from different company departments, and external data extracted I. Operations data i
from third-party systems, social media channels, emails, or even macroeconomic data. that form the backbo
Ultim;itely, business intelligence software helps companies 'gain insight on their overall upon the subject ma
growth, sales trends, and customer behavior. customers, orders, c

102
I. Siscnsc 20. Palo OL/\P Server
2. /\ctu~t.: Rusin.:ss lntellig.:ncc and Rc:porting Tools (BIRr) 21. Pcnlaho
3. icCube 22. Profit base
4. Domo 23. QlikVic11
5. 80(,rd Management lntcllig.:nc,: Toolkil 24 Rapid tn<ight
6. Clear Anni) tics 25. SAP busine5s intelligence
7. Duccn 26. SAP BusinessObJccts
8. Gooddala 27. SAP Net Weaver OW
9. IOM Cognos lnh:lligence 28. SAS DI
command: I0. lnsightsquared 29. Silvon
11. JasperSoll 30. Solver
12. L<>oker 31. SpagoOI
eek the previously 13. Mieroson Bl platform 32. SQL Server Analysis Services
14. MicroStralcgy 33. SI) le lnlclligcncc
15. MITS 34. Syntell solutions
he main login node is used.
16. Opcnl 35. rargit
the following directory :
17. Oracle 81 36. Vismatica
way node will be different 18. Orack Enterprise 81 Server 37. WebFOCUS
i: name. #mount •t nfs -0 19. Oracle llyp,:rion Sy5tcm 38. Ycllowfin 01
The 131 tool used 111 our orgamzatton : Educallon
client users. The following As higher education becomes more expensive and competitive, it is a great user of data-based
decision-making. There is a strong need for efficiency, increasing revenue, and improving the
quality of studt:nl eKperience at all levels of education.
er var The gateway in the I. Student enrolment (recruitment and retention): Marketing 10 new potential students
an requires that the login requires schools to develop profiles of the sludents that are most likely to attend. Schools can
DFS. For example, if the develop models of what kinds of students are anracled lo the school, and then reach out to
s user admin and existing those students. Tht: students al risk of not returning can be flagged, and corrective measures
can be taken in time.
ient machine has the same 2. Course offerings: Schools can use the class enrolment data to develop models of which
is usually not a problem if new courses arc likely lo be more popular with students. This can help increase class size,
to create and deploy users reduce costs, and improve student satisfaction.
3. Alumni pledges: Schools can develop predictive models of which alumni are most likely
to pledge financial suppon to the school. Schools can create a profile for alumni more likely
to pledge donations to the school. This could lead to a reduction in the cost of mailings and
zation. Explain any 2 or other forms of outreach to alumni.
(IO Marks)
by experts from Finances b. What are the sources and types ordata for a data warehouse? How will data warehousing
ems designed to capture, evolve in the age of social media? (06 Marks)
s for improved decision Ans. Data Sources: DWs arc created from structured data sources. Unstructured data, such as text
it will combine, including data, would need to be structured before insened into DW.
d external data extracted I. Operations data include data from all business applications, including from ERPs systems
en macroeconomic data. Ihat form the backbone of an organization's IT systems. The data to he extracted will depend
upon the subject math:r of DW. For example, for a sales/marketing DW, only the data about
insight on 1heir overall
customers, orders, customer service, and so on would be extracted.

103
VIII Sem., (CSf(ISE) CS-Modei,Q~P

2. Other applications, such ns point-of-sale (POS) terminals and e• commerce applications, for example, work experie
provide customer-facing data. Supplier data could come from supply chain management 5. Data elements may need
systems. PlaMing and budget data should also be added as needed for making comparisons currency values may need
against targets. the same base year for com
3. External syndicated data, such as weather or economic activity data, could also be added 6. Outlier data elements
to DW, as needed, to provide good contextual information to decision makers. of results. For example, o
Three main types or Data Warehouses are: educational setting.
I . Enterprise Data Wareho use: 7. Any biases in the selecti
Enterprise Dalo Warehouse is a centralized warehouse. It provides decision support service of the phenomena under
across the enterprise. It offers a unified approach for organizing and representing data. It nlso than is typical of the popul
provide the ability to classify data according to the subject and give access according to those 8. Data should be brought
divisions. available daily, but the sal
2. Operational Data Store: relate these variables, the
Operational Data Store, which is also called ODS, are nothing but data store required when case, monthly.
neither Data warehouse nor OLTP systems support organiz.ations reporting needs. In ODS, 9. Data may need to be se
Data warehouse is refreshed in real time. Hence, it is widely preferred for routine activities much variability, because
like storing records of the Employees. may dull the effects of otli
3. Data Mart: information density of the
A data mart is a subset of the data warehouse. II specially designed for a particular line of
b. Describe some key steps
business, such as sales, finance, sales or finance. In an independent data mart, data can collect Ans. Data has been described 35
directly from sources. The volume of data used
A DW project reflects a significant investment into IT. All ofthe best practices in implementing
and continues to grow. Fo
1111y IT project should be followed.
1. The DW project should align with the corporate strategy. Top management should be
downloaded from Science
profiles on Scopus and 3
consulted for setting objectives. Financial viability Return on Investment (ROI) should be
harder for a user to grab a
established. The project must be managed by both IT and business professionals. The OW That's where data visual"
design should be carefully tested before begiMing development work. It is of\en much more
and easy-to-understand vis
expensive to redesign after development work has begun.
There arc many advanced
2. It b important to manage user expectations. DW should be built incrementally. Users
fur specialized purposes s
should be trained in using the system, and absorb
disaster relief monitoring.
the many features of the system. is to help readers sec a pal
J. Quality and adaptability should be built In from the start. Only cleansed and high· read tedious descriptions s
quality data should be loaded. The system should be able to adapt to new access tools. As a profit growth of 25% in
business needs change, new data marts can be created for new needs
visualization summarizes i
OR on the points that are relev
6. a. Why Is data preparation so important and time consuming? (0-1 Marks)
An analysis clearly explain
creating a good visualizati
Ans. Data cleansing and prcpnmtion is a laborintensive or semiautomated activity that can take up
to 60 to 70 percent of the time needed for a llata mining project. Visualization E"'ample
To demonstrate how each
I. Duplicate data needs 10 be removed. The same data may be received from multiple sources.
a company who wants to
When merging the data sets, data must be de-duped.
important raw sales data fo
2. Missing values need to be filled in, or those rows should be removed from analysis. Missing
values can be filled in with average or modal or default values. Produ
3. Data elements may need to be transformed from one unit to another. For example, total AA
costs of health care and the total number of patients may need to be reduced to cost/patient to 88
allow comparability of that value.
4. Continuous values may need to be binned into a few buckets to help with some analyses.
cc i
DD
ti.MA
104
-
CBCS - ModevQ~P"'Pet"" - 3

e- commerce applications, For example, work experience could be biMed as low, medium, and high.. .
upply chain management 5. Data clements may need to be adjusted to make them comparable over ttme. !'or example,
d for making comparisons currency values may need to be adjusted for inflation; they would need to be converted to
the same base year for comparability. They may need to be converted to a common currency.
data, could also be added 6. Outlier data elements need to be removed after careful review, to avoid the skewing
sion makers. of results. For example, one big donor could skew the analysis of alumni donors in an
educational setting.
7. Any biases in the selection of data should be corrected to ensure the data is representative
s decision support service of the phenomena under nnnlysis. If the data includes many more members of one gender
d representing data. It also thon is typical of the population of interest, then adjustments need to be applied to the data.
,e access according to those 8. Data should be brought to the same granularity to ensure comparability. Sales data may be
available daily, but the sales person compensation data may only be available monthly. To
relate these variables, the data must be brought to the lowest common denominator, in this
t data store required when case, monthly.
reporting needs. In ODS, 9. Data may need to be selected to increase information density. Some data may not show
:ferrcd for routine activities much variability, because it was not properly recorded or for any other reasons. This data
may dull the elTccts of other differences in the data and should be removed to improve the
information density of the data.
~ned for a particular line of b. Describe some key steps in data visualization. (08 Marks)
nt data mart, data can collect Ans. Data has been described as the new row material for business and the ..oil of the 21" century.
The vol1tme of data used in business, research and technological development is massive
est practices in implementing and continues to grow. For instance at Elsevier, there are about 700 million articles per year
downlondcd from ScienceDircct, 80.000 institution profiles on Scopus, 13 million researcher
Top management should be profiles on Scopus and J million researcher profiles on Mendeley. II becomes harder and
nvestment (ROI) should be harder for a user to grab a key message from this universe of data.
ness professionals. The OW That's where datn visualization comes in: summarizing and presenting large data in simple
work. It is of\en much more and easy-to-understand visualizations 10 give readers insightful information.

r built incrementally. Use11


111ere are many advanced visualizations (e.g., networks, JD-models and map overlays) used
for specialized purposes such as 3D medical imaging, urban transportation simulation, and
disaster relief monitoring. But regardle~s of the complexity of a visualization, its purpose

r· Only cleansed and high•


clapt to new access tools. As
is to help readers see a pattern or trend in the data being analyzed, mthcr than having them
read tedious descriptions such as: "A's profit was more than B by 2.9% in 2000, and despite
a profit growth of 25% in 200 I, A's profit became less than B by 3.5% in 200 I.'' A good
eeds visualization summarizes information and organizes in a way that enables the reader to focus
on the points that are relevant to the key message being conveyed.
An analysis clearly explained with tables, graphs, charts and diagr11ms, keeping in mind that
(04 Marks)
creating a good visualization is an iterative proci:ss.
ated activity that can take up Visualization Example
To demonstrate how each of the visualization tools could be used, imagine an executive for
ived from multiple sources. a company who wants to analyze the saks performance of his division. Table 6.1 show the
important raw sales data for the current year, alphabetically sorted by Product names.
oved from analysis. Missing
rrodurt Re,·cnue Orders SalesPers
another. For example, total AA 9731 131 23
be reduced to cost/patient tu BB 355 43 8
cc 992 32 6
to help with some analyses.
DD 125 31 4

105
VIII Set111 (CS'E/ISE)

EE 933 30 7
FF 676 35 6
CG 1411 128 13
HH 5116 132 38
JJ 215 7 2
KK 3833 122 50
LL 1348 15 7
MM 1201 28 13
Table 6.1: Raw Performance Data
To reveal some meaningful pattern, a good first step would be to sort the table by Product
The number of orde
revenue, with highest revenue first. We could total up the values of Revenue, Orders, and
Sales persons for all products. We can also add some important ratios to the right of the table the revenue is widcl
number of orders.
(fable 6.2).
Product Revenue Orders SalesPers Rev/Order Rev/Sales P Orders/Sales P
AA 9731 131 23 74.3 423.1 5.7
HH 5116 132 38 3S.S 134.6 3.5
KK 3333 122 50 31.4 76.7 2.4
GG 14 11 128 13 11.0 108.5 9.8
LL 1348 15 7 89.9 192.6 2.1
MM 1201 28 13 42.9 92.4 2.2
cc 992 32 6 31.0 165.3 5.3
EE 933 30 7 31.1 133.3 4.3
Therefore, the ord
FF additional data is m
676 35 6 19.3 112.7 S.8 into 4 sizes: Tiny, s
BB 3SS 43 8 8.3 44.4
JJ 21S 7
5.4 Product 1
2 30.7 107.5 3.S AA
DD 12S 31 4 4.0 31.3 7.8 1-fH
Total 25936 734 177 KK
35.3 146.5
Tttble 6.2: Sorted data witlt .. 4.1
GG
There are too many n be . • adtlttt011al raltos
. . um rs on this table to visualize a t els . LL
m different scales so plouing them on th h ny ren m them. The numbers are
n be . e same c art would not be E MM
um rs are m thousands while the SalesP be easy. .g. the Revenue
or double digit. ers num rs and Orders/SalesPers are in the single cc
~ne_ could start by visualizing the revenue as . EE
significantly frorn the first product to th ( ~ pie-chart. The revenue proportion drops FF
It is interesting to no1e that lhe to 3 odenext. Figure 6.1 ).
p pr ucts produce almost 750,
70 ofth e revenue.
BB
JJ
DD
Total

Figure 6.3 is a sh1cke


106 This chart (Figure 6.

&i,,.t"',- e.._ ~IIIIV


CBCS - ModeiQ~Pcq,er · 3

RevfflUof 1hare by
Products

38
2
so
7 ,u . .... ... , ..'O •
13

'o sort the table by Product Figure 6.1: R,mm11e Sitare by Proc/11ct
The number of orders for each product can be plotted as a bar gmph . This show~ thm while
of Revenue, Orders, and the n:venue is widely different for the top four products, they have approximately the same
ios to the right of the table number of orders.
O rdors by Product
,ales P Orders/Sales P
.3.1 5.7

.7
08.5
2.6
3.5
2.4
9.8
2.1
2.2
5.3
II I. III
Figure 6.2: Orcll!fS by Pro,l11cts
I I-I
Therefore, the orders data could be investigated further to sec order patlcms. Suppose
4.3 additional data is made available for Ordi:rs by their size. Suppose the orders are chunked
5.8 into 4 sizes: Tiny, Small, Medium, and Large. Additional data is shown in Table 6.3.

5.4
Product Total Orders Tiny Small Medium Large
AA 131 5 44 70 12
3.5
HII 132 38 60 30 4
7.8 KK 122 20 so 44 8
4.1 GG 128 52 70 6 0
tius L,L 15 2 3 s 5
in them. The numbers an: MM 28 8 12 6 2
t be easy. E.g. the Revenue
cc ~,
.)~ 5 17 10 0
rs/SalesPers are in the single
EE 30 6 14 10 0
revenue proportion drops FF 35 10 22 3 0
BO 43 18 25 0 0
% of the revenue. JJ 7 4 2 I 0
DD 31 21 10 0 0
Total 734 189
.. 329 185 31
711ble 6.3: Aclc/111011al ''"'" Oil Oft/er M:es
figure 6.3 is a stacked bar graph that shows the p.:rcentage of Orders by size for each product.
This chart (Figure 6.3) brings a different set of insights. fl shows that the product HH has
107
VIII Sem, (CSE/ISE) 'Blg,-V~A~ C'BCS·M~Q~
a larger proponion of tiny orders. The products at the far right have a large number of tiny 3. Cltoose approprlu)
orders and very few large orders. it could be presented
4. Tlte data set ,·011l
Product Orders by S ize
0
is not necessarily bett
'j S. Tl,e vlsun/lu,tlon
or targets wilh which
t:-; "'j 6. Tl1e numerical du
#-o

'j I I I sI I I I I I I I I
~ ~ Ill
person were plotted
choices.
J; f i '11 'o 1. l/ig/1-lellf!/ visua/1

... .. ....., . ....


t-•rud,.1c11,
results, a drill•down
e' ~ • a ..... ..,.,., • Jt.., • 4\P
..,1 r·
•ft't'\P,._1
-~.' , 8. Tl,ere may be 11ee
- ~ 1.!•
example, one may rec
Fl,:11re 6.3: Produ,·t Orders by Order Slz.e
Visualization Example phase -2
The executive wants to understand the productivity of salespersons. This analysis could be 7. a. What is pruning?
done both in terms of the number of orders, or revenue, per salesperson. There could be two other'!
separate graphs, one for the number of orders per salesperson, and the other for the revenue Ans. Prunin& : The tree
per salesperson. However, an interesting way is to plot both measures on the same graph to The pruning is often
give a more complete picture. This can be done even when the two data have different scales. usability. The sympt
The data is here resorted by number of orders per salesperson. some of which may re
Ord,.,.., ,nu Rt>vrnuc pt•' ,.Jh•">t-'t!'· •,on
There are two approa
• Pre-pruning means
downside is that ii ii
because we do not
• Post.pruning: Rem
commonly used. C4.!
for pruning. A valida
TI1e most popular de
Tab
Derision Tree C
Figure 6.4: Salapenon produdivily by product llct
Figure 6.4 shows two line graphs superimposed upon each other. One line shows the revenue Full name:
per salesperson, while the other shows the number of orders per salesperson. It shows that the
highest productivity of 5.3 orders per sales person, down to 2.1 orders per salesperson. The
second line, the blue line shows the revenue per sales person for each for the products. The
revenue per salesperson is highest at 630, while it is lowest at just 30. And thus additional
layers of data visualization can go on for this data set.
c. What are some key requirements for good visualization. (04 Marks)
Ans. To help the client in understanding the situation, the following are some key requirements
for good visualization:
I . Fe~ch npprop~lute 0111/ curr~c~dutn for u11ulysls. This requires some understanding of the
domam of the client and what 1s important for the client. E.g. in a business setting, one may Typ,: of da1a
need to understand the many measure of profitability and productivity.
2. Sort the ~uta /11 tlte mu11 appropriate ma1111er. II could be sorted by numerical variables,
or alphabe11cally by name.
· Modet-Q~Papu- - 3
~t have a large number of tiny J. Clloostt appropriate n11ttl1od to,,,__
1t could be presented as any of the~ fM11a.
lilwMI. The data could be presented as a table or

4. 11,tt data silt could btt pntnllfl to._..._,. tbe more slgnllkant elements. More data
is noc ncccs!larily better, unlcs.s it makel die - 5ignificant impacl on the situation.
5. T/1tt vlsual/1.Rtlon could show addl"-'61 M slon for referem:e such as the expectations

Ill
or targets with which to compare the results.
6. The numerlcal data may 11ud to 1w bhuwtl .,_ 11 few t·ategorles. E.g. the orders per
person were plotted as actual values, while the order sizes were binned into 4 categorical
choices.
7. lligll-level 11isuull1.utio11 could be backedby more tlttloiled analysis. For the most significant
results, a drill-down may be required.
8. There may be need to present 11ddlliona( textutll illfom,ot/011 to tell 1hc whole s1ory. for
example, one may requir~ noteS' lo explain some extraordinary results.

r Slt.t
Modulc -4
What Is pruning? Whal are pre-pruning and post-pnaaln&? Why chOOH one over the
persons. This analysis could be other? (08 Marks)
;alesperson. There could be two Pruning : TI1e tree could be trimmed to make it more balanced and more easily usable.
and the other for the revenue The pruning is often done after the tree is constructed. to balance out the tree and improve
~easures on the same graph to usability. The symptoms of an overfilled tree are a tree too deep, with too many branches,
e two data have different scales. some of which may reflect anomalies due to noise or outliers. Thus, the tree should be pruned.
l. There are two approaches lo avoid over-fitting.
• Pre-prunins mcons 10 halt the tree construction early, when certain criteria are met. The
downside is that it is difficult to decide what criteria to use for halting the construction,
because we do not know what may happen subsequently, ifwe keep growing the tree.
• Post-pruning: Remove branches or sub-trees from a "fully grown" tree. This method is
commonly used. C4.S algorithm uses a statistical method to estimate the errors at each node
--1 •- for pruning. A validation set may be used for pruning as well.
The most popular decision tree algorithms are CS, CART and CHAID (Table 7. 1)
Tuble 7.1: Co111purl11g popular Deds/011 Tree ul,:orllltno
Decision Tree C4,S CART CIIAID
llerati\e Classification and r,:gr,:ssion Chi-square automatic
~ product Full nami,
her. One line shows the revenue Dichotomi~r (103) tl\."\.'S interaction d.:1......1or
r salesperson. It shows that the Adjusted significanc-:
Basic algorithm Hunt's algorithm Hunt's algorithm
2.1 orders per salesperson. The h:stin2
for each for the products. The £>.:\eloper Ross Quinlan Bn:mman Gordon KDSS
at just 30. And thus additional When do:vclor>cd 1986 1984 1980
Classification and regn:ssion Classification and
Tyjl(s of11'\l\:S Classification
trees rcl!Jc:ssion
(0-& Marks)
Serial Trec growth and tree Tree gro"th ond tn:o: pnming Trt.:1: gro" lh and Ire:.:
ing are some key requirements im11kmcnt;1tion orunin2 I orunin11

Discreh: and
· s some understanding of the Non-normal data also
Tyl'( of data continuous; Discr.:tc: and continuous :acc.:plc..-U
in a business setting, one may incomnl<.1,: da1a
uctivity.
sorted by numerical variables,

109
VIII Setn, (CSE(ISE) t3i.g,V~A~ cacs . Model,Q
Binary splits only; cle, er
Multiway splits as
T) p,:s of splits Multiway splits surrogatc splits to r.:duc.: trcc dc:fault
depth

Spliuing criteria Information gain Gini coclTicicnt, and oth<'rs Chi-squnrc lCSl
Cle, er bouom up
T.-.:.:s can l>l:comc , cry
Pruning criteria technique avoids Remove weakest Links first
large
overfilling
Popular in market $3,50,000
Publicly availablc in most rcs.:arch, for
Implementation Publicl) a, ailabh: S3,00,000
packages SC'1.mcntation
s2.so.000
b. Using the data that follows, create a regression model to predict a house price Crom the
size or the house. llcre arc samrle house ttata: (08 Marks) i $2,00,000

!louse Prict Siu (sqft) I


:c
Sl,50,000
S229,500 1,850 Sl,00,000
S273,300 2.190 SS0,000
$247,000 2, 100
SIIJ5,100 1,930 so
S26I.OOO 2.300
$179,700 1.710
Figure 7.1 Scatter plo
$168,500 1,550 The two dimensions
S234,400 1,920 diagram. A scauer pl
$168,500 I,!WO Visually, one can see
SUI0,400 1,720 relationship is not pe
Sl56,200 1,660 the following output
S288,3.50 2.40.5
$156.750 1,525
S202,IOO 2.030
$256.800 2.240
Ans. The regression model is described as a linear equation that follows. y is the dependent
variable, that is, the variable being predicted. x is the indcpcndcnt variable, or the predictor
variable. There could be many pr.:dictor variables (such as .TI, :c2, ...) in a regression
It shows the coefficie
equation. However, there can be only one dependent variable (y) in the regression equation.
by the equation, is 0
.1 =Po+ P, .,· + s positively correlated.
Here are sample house data: Regression coefficien
llous, Prict Size (sqR) House Price (S) = 13
$229,500 1,850 TI1is equation explain
$273,300 2.190 Suppose other predic
$247,000 2.100 house, it might help i
$195,100 1,930
S261.000 2.300
$179,700 1,710
Sl68,500 1,550
S234.400 1.920

110
CBCS - Mo-d,cl,Q~Pct.per - 3

$168.500 1,840
Multiway split, a~ $180.-lOO 1.720
default Sl~6.:?00 1,660
S288.HO 2,405
Chi-square test $156,750 1,525
$202,100 2,030
Trees can bccom~ very $256,800 2,240
large
$3,50,000
Popular in market
research, for $3,00,000
s.: m.:ntation $2,S0,000
~
ct a house price from the S: $2,00,000 +-::,-+-::~H:.it7~~~-i-t
(08 M arks) :,c
i $1,50,000 • House Price

$1,00,000 -r.::::!:::bt!:=::::::!:i: - - Unc.ir (House Price)

$50,000
$0
0 1000 2000 3000
Size (Sq ft)
Figure 7. I Scatter plot and regression equal ion between House price and house size
The two dimensions of (one predictor, one outcome variable) data can be plotted on a scatter
diagram. A scatter plot with a best-fitting line looks like the graph that follows (Figure 7.1).
Visually, one can see a positive correlation between house price and size (sqfl). However, the
relationship is not perfect. Running a regression model between the two variables produces
the following output (lruncated).
Rt>&rtssion Statistics
Multiple r 0.891
r2 0.794
Cocllicicnts
ollows. y is the dependent Intercept -54,191
nt variable, or the predictor Size (sql\) 139.48
1• .r2, ...) in a regression It shows the coefficient of correlation 1s 0.891. ,2, the measure of total variance explained
in the regression equation. by the equation, is 0.794, or 79 percent. That means the two variables are moderately and
positively correlated.
Regression coefficients help create the following equation for predicting house prices.
House Price (S) = 139.48 x Size (sqn) - 54,191
This equation explains only 79 percent of the variance in house prices.
Suppose other predictor variables are made available, such as the number of rooms in the
house, it might help improve the regression model. The house data now looks like this:
llouse Prict> Si:i~ (sqrt) # Rooms
$229,500 1.850 4
$273,300 2,190 s
$247,000 2,100 4

111
VIII Sem, (CSE/ ISf)
C3CS - ~~Que.,stum,,p_
$195,100 1,930 3 The predicled Yalues shoul
$261.000 2.300 4 able lo predict the actual v
$179,700 1,710 2 10 fine-tune and improve tH
$168,500 1,550 2
$234,400 , 1,920 4
$161!,SOO 1,840 2 What makes a neural neh
$180,400 1,720 2 learning tasks?
Supervised Learning
$156,200 1,660 2
• Training data includes
$288,350 2.405 s • For some examples the
$156,750 1,525 3
model during the learning
$202,100 2,030 2
• The construction of a pro
$256,800 2.240 4 • These method~ are usual
While it is possible to make a three-d1mc:ns1onal scatter plot, one can alternatively examine • Have lo be able lo gene
the correlation matrix among the variables. without knowing a priori
House Price Size (sq R) # Rooms Supervised learning is bl
llouse Price I classification already assi1
Size (SQ ft) 0.891 I Perceptron (MLP) models
Rooms 0.944 0.748 I I. One or more layers of
It shows that the house pnce has a strong corrclatton with number of rooms (0.944) as well. network that enable the m:
Thus, it is likely that adding this variable: lo the regression model will add to thi: strength of 2. The nonlinearity reflect
the model. Running a regression model between these three variables produces the following 3. The interconnection m
output. These characteristics alonE
Learning through training
Rearesslon Statistics
algorithm. The error co
Multiple r 0.984 output samples and finds
0.968 the desired output and a
rJ
the product of the error s1
Coefficients principle, error back prol)i
lnh:rcepl 12.923 Forward Pass: llere, in~
forward, neuron by neuro
Siic(sq0) 65.60 as output signal: y(n) =
Rooms 23.613 v{n) -l: w(n)y{n). The o
It shows the coefficient o corrclatton of this regression mod el is 0.984. r2, the total variance desired response d(n) 1111
explained by the equation, is 0.968, or 97 percent. That means the variables are positively network during this pass 1
and very strongly correlated. Backward Pass: The e
Adding a new relevant variable has helped improve the strength of the regression model. propagated backward thr
Using the regression coefficients help~ 1.1 ealc the following equation for predicting house each layer and allows the
prices. with the delta rule as:
House rrlce ($) = 65.6 x Si:te (sqR) + 23,613 x Rooms + 12,924 t.w(n) = 11 • 6(n) • y(n).
Thi~ equation sho"s a 97-percent goodness-of-fit with the data, which is very good for This recursive computat'
business and economic dala. There is always some random variation in naturally occurring for each input pallem till
business data, and it is not desirable to overfit the model to the data. Supervised learning para
This predictive equation should be used for future transactions. Given a situation that follows non-linear problems sud
it will be possible 10 calculate the price of the house with 2,000 sqft and 3 rooms. ' Unsupervised Learninil
• The model is not provil
House Price Size s rt # Rooms
•. Can be used to cluste
? 2000 3
House Price (S) - 65.6 x 2,000 (sqR) 1 23,613 x J -t 12,924 - $214,963
112 Swi.tAf' e.- ~ W '
The predicted values should be compared to the ac:aal values to see how close the model is
able 10 predict the actual value. As new data pocnlS become available, there are opportunities
to fine-tune and improve the model.
OR
What makes a neura l network nrsatlle enough for supervised 11s well as non-supervised
lenrning h1slui? (08 Marks)
Supervised Learning
• Training data includes both the input and the desired results.
• For some examples the correct results (targets) are known and are given in input to the
model during the learning process.
• The construction of a proper training, validation and test set (Bok) is crucial.
• These methods are usually fast and accurate.
e can alternatively exami • Have to be nble to generalize: give the correct results when new data are gi\C0 in input
without knowing a priori the target.
# Rooms Supervised learning is based oil ·training a data sample from data sou•-cc with correct
classification already assigned. Such techniques are utilized in feeJfoa ,Hird or Multilayer
Perceptron (MLP) models. These MLP has three distinctive character isucs:
I. One or more layers of hidden neurons that are not part of the input or output layers of the
network that enable the network to learn and solve any complex problems
ber of rooms (0.944) as well;
2. The nonlinearity reflected in the neuronal activity is dilfen:ntiahlc and,
clel will add to the strength 3. The interconnection model of the network exhibits a high degree of connectivity
dobles produces the followi
These characteristics along with learning through training solve difficult and diverse problems.
Learning through training in a supervised ANN model also called as error back-propagation
algorithm. The error correction-learning algorithm trains the network based on the input-
output samples and finds error signal, which is the difference of the output calculated and
the desired output and adjusts the synaptic weights of the neurons that is proportional to
the product of the error signal and the input instance of the synaptic weight. Based on this
principle, error bock propagation learning occurs in two passes:
Forward Pass: Herc, input vector is presented to the network. This input signal propagates
forward, neuron by neuron through the network and emerges at the output end of the network
as output signal: y(n) = 'P(v(n)) where v(n) is the induced lo..:al field of a neuron defined by
v(n) • l: w(n)y(n). The output that is calculated at the output layer o(n) is compared with the
desired response d(n) and finds the error e(n) for that neuron. The synaptic weights of the
s 0.984. r2, the total variance.
network during this pass are remains same.
s the variables are positivelJ Backward Pass: The error signal that is originated at the output neuron of that layer is
propagated backward through network. This calculates the local gradient for each neuron in
of the regression model.
each layer and allows the synaptic weights of the network to undergo changes in accordance
quation for predicting house
with the delta rule as:
6w(n) "' 11 • 6(n) • y(n).
24 This recursive computation is continued, with forward pass followed by the backward pass
ata, which is very good for
for each input pattern till the network is converged .
iation in naturally occurrina Supervised learning parndi~m oran ANN is efficient and finds solutions to several linear and
tlata. non-linear problems such as classification, plant control, forecasting. prediction, robotics etc
iven a situation that follows, Unsupervised Learning
sqft and 3 rooms. • The model is not provided with the correct results during the training.
•. Can be used to cluster the input data in classes on the basis of their statistical properties

113
VIII Se.w (CS'E(ISE) 8(1J'V~A~

only.
• Cluster significance and labeling.
• The labeling can be carried out even if the lnbcls are only a~ailnble for a small number of
objects representative of the desired classes. A scatter plot of IO data
Self-Organizing neural networks learn using unsupervised learning algorithm to identify (Figure 8.1 ). As a bottom
hid<len patterns in unlabelled input data. This unsupervised refers to the ability to learn and intuited.
vrgani.Gc informu1ion without providing :m error sienal to evaluate the potential solution. The points arc distributed
The lack of direction for the learning algorithm in unsupervised learning can sometime circle would represent the
be advantageous, since it lets the algorithm to look back for patterns that ha\e not been distance between the poin
previously considered. The main charactcnstics of Self-Organizing Maps (SOM) are: The three points at the bo
I. It transforms an incoming signal patt..:rn of arbitrary dimern,ion into one or 2 dunensional the other cluster. The two
map and perform this trnnsformation adaptively new centroids.
2. The network represents feed forward stmcturc \\ ith a single computational layer consisting The bigger cluster seems t
of ncun:,,s arnngcd in rows and cc-lumns. sep.irah: r luster. The three
3. At enc 11 st~!_le ot rl'prcscntation, cacl1 input signal is kept in its proper context and, TI1is solution h,1:; rhree cl
4. Ncuro ·s d,:alin~ wtlh closr l} related pieces of information are close together and they However, its centroid is n
comnwn ,ate through ,ynapt1c connections tight-fitting, with a nice ce
The cvm:·utati,•r.al hi)er i~ also called a, competitive layer since the ne1irons in the layer much usefulness.
compe:e with each other to become active Hence, this learning algorithm is called competitive
algori,'111' Unsupervised ahorithm in SOt-.1 works in three phases:
Competition phJse: for ec1ch input pattern x, presented to the network, inner product with
synaptic w.:ight w 1s cJlculoted and the neuron~ in the competitive layer finds a discriminant
function that induce competition among the neurons and the synaptic weight vector that is

close 10 the input vector in the Euclidean distance is announced ns winner in the competition.
That neuron is called best matching neuron, i.e. x = arg min ~ x • w ~. •
Cooperative phase: the\\ inning neuron determines the center of a topological neighborhood
h ofcooperating neurons. This is performed by the lateral interaction d among the cooperative
neurons. This topological neighborhood reduces its size over a time period.
Adaptive phase : enables the winning neuron and its neighborhood neurons to increase their
individual values of the discriminant function in relation to the input pattern through suitable
synaptic weight adjustments, t:.w = 11h(xXx - w). n ~
Upon repeated prcsentntion of the training patterns, the synaptic weight vectors tend to Fig11u 8.1 /11
follow the distribution of the input patterns due to the neighborhood updating and thus ANN
learns without supervisor [2].
Self-Organizing Model naturally represents the neuro-biological behavior, and hence is used
in many real world applications such as clustering. speech recognition, texture segmentation,
vector coding etc
b. ldentiry lhe clusters from the dala as shown in Dataset below table.Determine the
number or cluslen and the center points or those cluslen. (08 Marks)
X y
2 4
2 6
s 6
4 7
8 3
s 6
5 2

114
for a small number of ~
Ans. A scatter plot of 10 data points in 2 dimensions shows lhem distributed fairly randomly
algorithm 10 identify tFigure 8.1 ). As a bottom-up technique, the number of clusters and their centroids can be
he ability to lo:mn and intuited.
the potential solution. The points arc distributed randomly enough that it could be considered as one cluster. The
earning can sometime circle would represent the central point (centroid) of these points. However, there is a big
ns that have not been distance between the points (2,6) and (8,3). So, this data could be brol,.en into two clusters.
aps (SOM) are: The three points at the bottom right could form one cluster and the other seven could form
o one or 2 dimensional the other cluster. The two clusters would look like this (Figure 8.2). The circles will be the
new centroids.
lalional layer consisting The bigger cluster seems too far apart. So. it seems like the four points on the top will form a
separat.: cluster. The three clusters could look like this (Figure 8.3).
per context and, This solution has three clusters. The cluster on the right is for from the other two clusters.
close together and they However, its centroid is not loo close to all the data points. The cluster at the top looks very
tight-filling, with a nice centroid. I he tliird cluster, at the left. is spread out and may not be of
the neu10ns in the layer much usefulness.
thm is called competitive
• 4, T
• ~- 7
vork, inner product with
ayer finds a discriminant • 2. 6
• !,, 6
• 6. G

ptic weight vector that is


vinner in the competition.
11.
• 7, 4
• 4, .

topological neighborhood
n d among the cooperative
• 6, 3
• 8, 3

e period. • s. :l

i1 neurons to increase their


ut pattern through suitable
n - -- - - -4 .,
ic weight vectors tend to Figure IJ. / l11iti11/ 1/1lffl po/11/s tmd rite ce11troitl (l'l10w11 flS I/tick"tlot)
updating and thus ANN

115
VIII Se..111 (CSE(ISE) CBCS • Mod.et,Q~F

ti
.,.,
their old (shaded!_values

n
Fl,:ure8.4 R
0 9

Figuu ll.1Dfridl11,: l1110 two clusters (cemroldl 1how11 a.s tllidc dots)

Ii
0 Ii 8 9

Figure I.J Divldi11g /1110 three clusters (ce11troids show11 a.s thick dot1)
This was an exercise in producing three best-filling cluster definitions from the given data.
The right number of clusters will depend on the data and the application for which the data
would be used.
K-Means Al&orithm for Clustering
K-means is the most popular clustering algorithm. It iteratively computes the cluste~ and
their centroids. It is a top-down approach to clustering. Starting with a given number of K
clusters, say 3 clusters; thus, three random centroids will be created as starting points of the
centers of three clusters (Figure 8.4). The circles are initial cluster centroids.
s,~p I: For a data point, distance values will be from each of the three centroids. The data
point will be assigned to the cluster with the shortest distance to the centroid All dat."I points
will thus be assigned to one data point or the other. The arrows from each data element show
the centroid that the point is assigned to (Figure 8.5).
Step 2: The centroid for each cluster will now be recalculated such that it is closest to all the
data points allocated to that clw.ter. 111c dashed arrows show the centroids being moved from 0

116
cacs· - Model, Q ~ Pc;q,iw - 3
their old (shaded) values to the re~ised new values (Figure 8.7).

• 4, 7
• s. 7

• • 2, 6
• S,6
• 6, 6

• ,, . ......
• 5, 2
• 6,,
- ♦ 8, 3


0
" " ..
Fi,:11re 8.4 Ru11domly assi1:11i11g t//ree ce11troi<lsfor t//ree tlflta clusters
r--
8 9

OW// as tltkk tlots) 1

2. 4

l--~---~--------.J
0
2 a 4 S 6 7 R
Figure 8.5 Assig11i11g data poi11ts to closest ce11troi,/
8 9

ltOWI/ (1$ 1//ick ,tots)


efinitions from the given data.
application for which the data

ely computes the clusters and


ing with a given number of K
reated as starting points of the
ster centroids.
f the three centroids. The data
to the centroid. All data points
from each data element show

uch that it is closest to all the


e centroids being moved from 0 1 1 4 \ 6 7 R
Figure 8.6 Recontputlng ee11trolds for ead, c/11ster
VIII Sem, (CSE(ISE) CBCS - ~od.el,Q
Step 3: Once again, data points are assigned to the three centroids closest to it (Figure 8.7). Here is the pseud
The new centroids will be computed from the data points in the cluster until finally the number of clusters
centroids stabilize in their locations·. These are the three clusters computed by th1s algorithm I. Choose K num~
(Figure 8.8). 2. Repeat till cluste
The three clusters shown are a 3-datapoints cluster with centroid (6.~,4-~), a 2-datapoint a. Allocate each poi
cluster with centroid (4.5,3), and a 5-datapoint cluster with centroid (3.5,3). b. Compute centroi
These cluster definitions are different from the ones derived visually. This is a function of the
random starting centroid values. The centroid points used earlier in the visual exercise were Selecting the Numl
different from that chosen with the K-means clustering algorithm. The K-means clustering The correct choice 0
exercise should, therefore, be run again with this data, but with new random centroid starting the distribution poin
values. With many runs, the cluster definitions are likely to stabilize. If the cluster definitions are needed to pick t
do not stabilize, that may be a sign that the number of clusters chosen is too high or too low. the clusters against t
The algorithm should also be run with different values of K. some point the marg
---- like an elbow.
At that elbow point,
, //♦ S,7 be the desired value
~~♦ S,6
.\ 20

~
.l~.8.3
0 l 2 3 4 5 6 7 a 9

Figure 8. 7 Assl,:11lng data po/Ills to Recomputed ce11trolds


Figure 8. 9 El
To engage with the d
a small number of clu
domain. The number
view.
This helps understand

9. a. What are the most


Ans. Naive Bayes classifie
some of the mails as
classify an incoming
Ans. Some of real world ex
• To mark an email as
• Classify a news arti
0 1 l 4 ~ fi 7 R ~
• Check a piece of tex
Figure 8.8 Reco111putl11g centroids for uch cluster till clusters stabill:e • Also used for facer

118
'BC.S,,VamtA~ CBCS - Mod®Q~PRJ)U • 3

closest to it (Figure 8.7). Herc is the pseudocode for impleN '""" & 1 algorithm. Algorithm K-Means (K
e cluster until finally the number of clusters, D list of data ~ )
omputcd by th1s algorithm I. Choose K number of random data pol-.a-.. ccaln>ids (cluster centers).
2. Repeat till cluster centers stabilize:
id (6.5,4.5), a 2-datapoint a. Allocate each point in D to the nearal all. oida.
id (3.5,3). b. Compute centroid for the cluster usial II,... ■ lbc cluster.
lly. This is a function of the
in the visua I cxen:ise were Selecting the Number or Clusters
. The K-means clusterinl The correct choice of the value ofK is often ambiguous. It depends on the shape and scale of
w random centroid startina the distribution points in a data set and the desired clustering resolution of the user. Heuristics
zc. tfthc cluster dcfinitiom are needed to pick the right number. One can graph the percentage of variance explained by
osen is too high or too low. the clusters against the number of clusters. The first clusters will add more infonnation, but at
some point the marginal gain in variance will fall, giving a sharp angle to the graph, looking
like an elbow.
At that elbow point, adding more clusters will not add much incremental value. That would
be the desired value of K (Figure 8.9).
20
Elbow for KMeans clust ertno
----,.... --,- -r-- -·

0.4
7 I 9

ed c,ntrolds ~,---a... ~ - .--...,..._,---..~----,!.,


N umtHtr or Cl\lsters

Figure 8. 9 Elbow n,ethod for determlniHg nunrber ofd11sters In a data set


To engage with the data and to understand the clusters better, it is often better to start with
a small number of clusters, such as 2 or 3, depending upon the data set and the application
domain. The number can be increased subsequently, as needed from an application point of
view.
This helps understand the data and the clusters progressively better.
Module-5
9. a. Whal are the most popular applications or Naive-Bayes techniques? (04 Marks)
Ans. Naive Bayes classifiers is a machine leaming algorithm. If you wonder, how Google marks
some of the mails as spam in your inbox, a machine learning algorithm will be used to
classify an incoming email as spam or not spam.
Ans. Some of real world examples are as given below
• To mark an email as spam, or aot spam ?
• Classify a news article about teclinolOI)', politics, or sports?
• Check a piece of text expressing positive emotions, or aeaattveemotions?
7 II 'I • • Also used for face recognition softwares
i/11 cluJten stablllu
119
VIII Se,m, (CSE/ISE) 'B½fV~A~ C13CS - Mode,l,Q~
Naive Bayes is used in Real Life. The following arc the Use Cases or Naive Bayes: transformation occurs i

: !;1.
r. :.!i&I
-1
C.tegon,lno N ~ beforehand is not nee
3. SVMs provide a g
ofa Gaussian kernel)
,... ,__... _,.·
··--··-
-""·- ......-
generalization grade,~
4. SVMs deliver a u
I I1 ---~- j advantage compared
local minima and for
Senument ~tys.i.s
5. With the choice ofi
stress on the similari

---
of two companies is,
company, the values 0
of the training sample
__,,___ M•d1~I 0 1Agno•\ &
classified according to
Here arc some exam
monotonicity. One c
estimated with a lin
expected one occordin

l~l~:
The reason for that

W••"••~••«"•" ~ i~~T~ PD and to the score.


dominate or cover the
One of these financia
Categorizing news, email spam detection, face recognition, sentiment analysis, medical Also leverage may sh
diagnosis, digit recognition and weather prediction arc just few of the popular use cases of capital, it may not ex
Naive Bayes algorithm.
may be the size ofa c
Machine Leaming explores the study and construction of algorithms that can learn from
but if a company has g
and make predictions on data. Among Classification Algorithms, Naive Bayes along with
big size may become ,
Regression is one of lhe most popular and powerful algorithms.
financial ratios are oft
Naive Bayes classifiers is a machine learning algorithm. If you wonder, how Google marks
some of the mails as spam in your inbox. a machine learning algorithm will be used to a linear classification
in linear techniques re
classify an incoming email as spam or not spam.
monotone and linear!
b. What Is the Point in Using SVMs as a Classification Technique? (12 Marks) such as SVMs is the J
Ans. All classification techniques have advantages and disodvantoges, which are more or less SVMs cannot represc
important according to the data which are being analysed, and thus have a relative relevance. financial ratios, since
SVMs can be a useful tool for insolvency analysis, in the case of non-regularity in the data, single financial ratios f
for example when the data are not regularly distributed or have an unknown distribution. ratios arc not constant.
It can help evaluate information, i.e. financial ratios which should be transfom1ed prior to is va riable. Using a
entering the score of classical classification techniques. difference between th
The advantages of the SVM technique can be summarised as follows : the training data samp
I. By introducing the kernel, SVMs gain flexibility in the choice of the forn1 of the threshold lnterprctation of rcsul
separating solvent from insolvent companies, which needs not be linear and even needs not as on a local linear ap
have the same functional form for all data, since its function is non-parametric and operates a bi-dimensional grapl
local_ly. As a consequence they can work with financial ratios, which show a non-monotone projects the multidim
rclullon lo the score and to the probability of defouh, or which are non-linearly dependent separating solvent an~
and this without needing any specific work on each non-monotone variable. ' the other financial rat!
2. Since the ~"mcl implic itly coma ins a no1 r fi ·
f\inctlonal form ufthe transfomrntio . ~- me:r trans ~rmation, no assumptions about the this way, diflerent con
,1, w 1uc ma es data lmeorly separable, is necessary. The
However, an analysis
120
'8~V~A~ C13CS • ModdtQ~Ptiper · 3
transformation occurs implicitly on a robu5t theoretical basis and human expertise judgement
Lscs of Naive Bayes: beforehand is not needed.
3. SVMs provide a good out-of-sample generalization, if the parameters C and r (in the case
of a Gaussian kernel) are appropriately chosen. This means that, by choosing an appropriate
generalization grade, SVMs can be robust, even when the training sample has some bias.
4. SVMs deliver a unique solution, since the optimality problem is convex. This is an
advantage compared to Neural Networks, which have multiple solutions associated with
local minima and for this reason may not be robust over different samples.
5. With the choice of 11n appropriate: kernel, such as the Gaussian kernel, one can put more
stress on the similarity between companies, because the more similar the financial structure
of two companies is, the higher is the value or the kernel. Thus when classifying a new
company, the values of its financial ratios are compared with the ones of the support vectors
of the training sample which are more similar to this new company. This company is then
classified according to with which group it has the greatest similarity.
Here are some examples where the SVM can help coping with non-linearity and non-
monotonicity. One case is, when the coefficients of some financial rulios in equation (I),
estimated with a linear parametric model, show a sign that does not correspond to the
expected one according to theoretical economic reasoning.
The reason for that may be that these financial ratios have a non-monotone relation to the
PD and to the score. The unexpected sign of the coefficients depends on the fact, that data
dominate or cover the part of the range, where the relation to the PD has the opposite sign.
. One of these financial ratios is typically the growth rate of a company, as pointed out by .
Also leverage may show non-monotonicity, since if a company primary works with its own
senti~ent analysis, medic
w of the popular use cases o capital, it may not exploit all its e~ternal financing opportunities properly. Another example
may be the size ofa company: small companies are expected to be more financially instable;
gorithms that can leam fr~ but ifa company has grown too fast or ifit has become too static because of its dimension, the
\ms, Naive Bayes along with big size may become a disadvantage. Because of these characteristics, the above mentioned
~
-
u wonder, how Google marks
financial ratios are often sorted out, when selecting the risk assessment model according to
a linear classification technique. Alternatively an appropriate evaluation of this information
g algorithm will be used 10 in linear techniques requires a transformation of the input variables, in order 10 make 1hem
monotone and linearly separable.4 A common disadvantage of non-parametric techniques
such as SVMs is the lack of transparency of results.
que? ( 12 Marks)
SVMs cannot represent the score of all companies as a simple parametric function of the
lges, which are more or less
financial ratios, since its dimension may be very high. It is nei1her n linear combination of
[hus have a relative relevance.
single financial ratios nor has it another simple functional form. The weights of the financial
of non-regularity in the data,
ratios are not constant. Thus the marginal contribution of each financial ratio to the score
ve an unknown distribution.
is variable. Using a Gaussian kernel each company has its own weights according to the
ould be transformed prior to
difference between the value of their own financial ratios and those of the support vectors of
the training data sample.
as follows:
lnterpretation of results is however possible and can rely on graphical visualization, as well
or the form or the threshold
as on a local linear approximation of the score. The SVM threshold can be represented within
be Iinear and even needs not
a bi-dimensional graph for each pair offinancial ratios. This visualization technique cuts and
non-parametric and operates
:which show a non-monotone projects the multidimensional feature space .as well as the multivariate threshold function
separating solvent and insolvent companies on a bi-dimensional one, by fixing the values of
1 are non-linearly dependent.
the other financial ratios equal to the values of the company, which has to be classified. By
one: variable.
this way, different companies will have different threshold projections.
on, no assumptions about the
However, an analysis of these graphs gives an important inp~t about the direction towards
yseparable, is necessary. The

121
VIII Se,m, (CSf/ISf)
which the financial ratios of non-eligible companies should change, in order to reach
eligibilil)'.
The PD can represent a third dimension of the graph, by means of isoquants and colour 10. •· What are the two majo
coding. The approach chosen for the estimation of the PD can be based on empirical estimates Ans. The Web works through
or on a theoretical model. can create a hyperlink to
Since the relation between score and PD is monotone, a local linearization of the PD can be or self-referral nature 0
calculated for single companies by estimating the tangent curve to the isoquant of the score. structure of Web pages ~
For single companies this can offer interesting information about the factors inftuencing their pages. There are two
financial solidity. I. H11bs: These are pag
In the figure below the PD is estimated by means of a Gaussian kemelS on data belonging to gathering point, where ~
the trade sector and then smoothed and monotonized by means of a Pool Adjacent Violator com, or government sit
algorithm.6 The pink curve represents the projection of the SVM threshold on a binary space com and yelp.com could
with the two variables K2J (net income chnnge) and K24 (net interest ratio), whereas all 2. A11tl1orltla: Ultimat
other variables are fixed at the level of company j. The blue curve represents the isoquant for complete and authorita
the PD of companyJ, whose coordinates are marked by a triangle. information, news, advi
Figure, Graphical Visualization or the SVM Threshold and or ■ Loe■I Llnearization or of inbound links from
the Score· page for expert medical
Function: Example of a Projection on a Bi-dimensional Graph with PD Colour Coding news.
Probability of Default
Write short notes on w
Web Minin& Al&oritbm
Hyperlink-foduced Topi
as being hubs or authori
The most famous and
Google co-founder
its search function. This
web page by counting
number of links, and/or
works in a similar way
with relations to more
higher status. PageRank
a Google: Search query.
many ways and the lates
... of the algorithm and m
·228 ,o, '38 788 110 stan~rd el~ments that "i
1
Nol Income Ct,;,na-, 1<21 website. This process is
The grey line corresponds to the linear approximation of the score or PD function projection c. Esplain the Pnctlcal ,
for company j. One interesting result of this graphical analysis is that successful companies between Social Networ
with a low PD often lie in a closed space. This implies that then: exists an optimal combination
area for the financial ratios being considered, outside of which the PD gets higher. If we Ans. PRATICAL CONSID
consider the net income change, we notice that its influence on the PD ii; non-monotone. Both Network Siu :Most SN
too low or too high growth rates imply a higher PD. This may indicate the existence of the network can be very chal
optimal growth rate and suggest that above a certain rate a company may get into trouble; of the number of nodes.
especially if the cost structure of the company is not optimal i.e. the net interest ratio is too possible pairs of links.
high. Out if a company lies in the optimal growth zone, it can also afford a higher net interest
ratio.

122 s.-+M ~ ~
B"if'Vam,A~
thange, in order to reach OR
s of isoquants and colour What are the two major ways that a website can become popular? (04 Marks)
. sed on empirical estimates The Web works through a system ofhyperlinks using the hypertext protocol (http). Any page
can create a hyperlink to any other page, it can be linked 10 by another page. The intertwined
arization of the PD can be or self-referral nature of web lends itself to some unique network analytical algorithms. The
structure of Web pages could also be analyzed 10 examine the panem of hyperlinks among
0 the isoquant of the score.
the factors influencing their pages. There are two basic strategic models for successful websites: Hubs and Authorities.
I. H u bl ': These are pages with a large number of interesting links. They serve as a hub, or a
gathering point, where people visit to access a variety of information. Media sites like Yahoo.
crnelS on data belonging to
com, or government sites would serve that purpose. More focused sites like Traveladvisor.
of 3 Pool Adjacent Violator com and yelp.com could aspire to becoming hubs for new emerging areas.
threshold on a binary space 2. A11tl1oriti~s: Ultimately, people would gravitate towards pages that provide the most
interest ratio), whereas all complete and authoritative information on a particular subject. This could be factual
e represents the isoquant for information, news, advice, user reviews etc. These websites would have the most number
e. f of inbound links from other websites. Thus Mayoclinic.com would serve as an authoritative
or 8 Local Unearlzation o page for expert medical opinion. NYtimes.com would serve as an authoritative page for daily
news.
ith PD Colour Coding
b. Write short notes on web mining algorithm. (04 Marks)
... Ans. Web Minln& Al1orithm1
"' Hyperlink-fnduced Topic Search (HITS) is a link analysis algorithm that rates web pages
as being hubs or authorities. Many other HITS-based algorithms have also been published.
The most famous and powerful of these algorithms is the PageRank algorithm. Invented by
Google co-founder Larry Page, this algorithm is used by Google to organize the results of
its search function. This algorithm helps determine the relative importance of any particular
web page by counting the number and quality of links to a page. The websites with more
number of links, and/or more links from higher-quality websites, will be ranked higher. It
works in a similar way as determining the status of a person in a society of people. Those
with relations to more people and/or relations to people of higher status will be accorded a
higher status. Page Rank is the algorithm that helps determine the order of pages listed upon
a Google Search query. The original PageRank algorithm formuation has been updated in
many ways and the latest algorithm is kept a secret so other websites cannot take advantage
of the algorithm and manipulate their website according to it. However, there are many
standard elements that remain unchanged. These elements lead to the principles for a good
110
website. This process is also called Search Engine Optimization (SEO).
ore or PD function projection c. Explain the Practical consideration or Social network analysis. Give the dlfl'erence
s is that successful companies between Social Network Analysis vis Tn ditlonal Data Analytics
ell.ists an optimal combination (08 Marks)
ich the PD gets higher. If we Ans. PRATICALCONSIDERATION:
the PD is non-monotone. Both Network Size :Most SNA research is done using small networks. Collecting data about large
indicate the existence of the network can be very challenging. This is because the number of link is the or1er of the square
ompany may get into trouble; of the number of nodes. Thus, in a network of 1000 nodes there are potentially I million
i.e. the net interest ratio is too possible pairs of links.
lso afford a higher net interest

123
VIII Sem, ( CSf/ISf)

Gathering Data: Electronics communication records (email, chat, etc.) can be han1essed to Eight Semester B.E.
gather social network data more easily. Data on the nature and quality of relationship need to
be collected using survey documents. Capturing and cleansing and organizing the data can
take a lot of time and effort , just like in a typical analytics project.
Computation And Visualization: Modeling large networks can be computationally Note: Amw~r any FIVEfu
challenging and visualizing them also would require special skills . Big data analytical tools
may be needed to compute large networks.

Dynamic Networks: Relationships between nodes in a social network can be fluid. They can How docs the Radoop
change in strength and functional na16urc. For Example, there could be multiple relationships an example.
between two people... they could simultaneously be coworker, coauthors, and spouses, The In Hadoop, MapReduce
network should be modeled frequently to see the dynamics of the network. individual tasks that can
Table 10. I Social Network Analysis vis Traditional Data Analytics tasks can be joined toget
Dimension Social Network Analysis Traditional Data Mining MapReduce consists of
Nature of learning Unsupervised learning Supervised and Unsupervised Map Function - It take
learning individual elements are
Analysis of goals Hub nodes ,important nodes, Key decision Example - (Map functio~
and sub-networks centroids Input Set of data
Dataset structures A graph of nodes and (directed) Rectangular data of variables
links and instances
Convert into~
Analysis techniques Visualii.ation with statistics; Machine learning, statistics Output set of data
iterative graphical computation
(Key, Value)
Quality measurement Usefulness is key criterion Predictive accuracy
classification techniques Reduce Function _ Take
tuples into a smaller set
Ex.ample - (Reduce funct
Input
(output or Map Seto
function)

Conv
Output small
tuple

124
Eight Semester B.E. Degree Eu■imdioll. CBCS - June / July 2019
t. etc.) can be harnessed 10
ality of relationship need to BigDataAllalytlcs
and organizing the data can e: 3 hrs. Max. Marks: 80
ct. ty Note : Am,wer a11y FIVE full question:,, sel«tbti ONE/11// q11estion/rom eacll module.
can be computational
is . Big data analytical tools
Module- I
How does the Hadoop MapReduce Data flow work or a word count program? Give
ork can be fluid. They can an example. · (08 Marks)
Jld be multiple relationshiP5 In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into
:oaulhors, and spouses, The individual tasks that can be executed in parallel across a cluster of servers. The results of
• network.
tasks can be joined together to compute final results.
Data Analytics
MapReduce consists of2 steps:
itional Data Mining Map Function - It takes a set of data and converts it into another set of data, where
[Vised and Unsupervised individual elements are broken down into tuples (Key-Value pair).
mg Example- (Map function in Word Count)
decision Bus, Car, bus, car, train, car, bus, car, train, bus,
Input Set of data
~ids TRAIN,BUS, bus, caR, CAR, car, BUS, TRAIN
angular data of variables (Bus, I), (Car, I), (bus, I ). (car. I). (1rain, I),
stances Convert inlo another
(car. I), ( bus, I). (car, I), (train, I), (bus. I).
ine learning, statistics Output set of data
(TRAIN,I),(BUS, I), (buS,I), (caR, I), (CAR, I),
(Key,Value)
(car,I), {BUS,1), {TRAIN, l)
ictive accuracy Reduce Function - Takes the output from Map as an input and combines _those data
~fication techniques tuples into a smaller set of tuples.
Example - (Reduce function in Word Count)
(Bus, I), (Car, I), (bus, I), (car, I), (train, I),
Input Set ofTuples (car, I), (bus,I), (car,l), (train,!), (bus,I),
(output of Map
function) (TRAIN,l),(BUS,1), (buS, I), (caR,I), {CAR,1),
(car,I), (BUS,1), (TRAIN,I)
Converts into (BUS,7),
Output smaller set of (CAR,7),
tuples (TRAIN,4)

1
VIII Setr11 ( CSE I IS£)
j.setJarByClass(WordC
j.setMapperClass(Map
lnl-"'l Bu, ~
~1
~ -
1-!.:::.1.l...J:
Out1>~l j .setRcducerClass(Red
C•r 1.
j.setOutputKeyClass(Tc
;,
l lu<Cu Jron t f •••n l
@:] I
,[oTD
eu1 l j.setOutputValueClass(I

-u
au, Cat , , 11n CAR l
Ttt11""1 flll'lt C•r ,~ -{ fro , Plant ~
~ JAAIN l FilelnputFormat.addlnp
lo"1lo<PLO.. ,
1 TAAIN I ,[rAAiN}J- ~~f 2 FileOutputFormat.setO
t lu< &u, Plo"tJ\ 1_:AAIN

s
I
Systcm.exit(j.waitForC
I l •tA~( :
PIAN[ 1J ,u...t 2)
}
public static class Ma
lntWritablc>{
Reducing Combining
public void map{Long
Splitting Mappina Intermediate
Splllllng lnterruptedl::xception
{
F11, Workflow of ~\lplleducinc Srring line = value.toStr
Workftow of MapReducc consists of S steps: String(] words=Jinc.split
1. Splitting - The splitting parameter can be anything, e.g. splitting by space, comma, for(String word: words )
semicolon, or even by a new line ('\n'). {
2. Mapping - as explained above. Text outputKey .. ne
3. Intermediate splitting - the entire process in parallel on different clusters. In order to lntWritablt: outputValu
group them in "Reduce Phase"1he similar KEY data should be on the same cluster. }con.wrile(outputKey, 0 ~
4. Reduce - it is nothing but mostly group by phase.
5. Combining- The last phase where all the data (individual result set from each cluster) }
is combined together to form a result. }
package PackagcDemo; public static class Red
import java.io.lOException; lntWritable>
import org.apache.hadoop.conf.Configurntion; {
import org.apache.hadoop.fs.Path; public void reduce(Text
import org.apache.hadoop.io.lntWritablc; IOException, Interrupted
import org.apache.hadoop.io.LongWritable; {
import org.apache.hadoop.io.Text; int sum= O·
import org.npache.hadoop.mnpreducc.Job; for(lntWritablc value .
import org.apache.hndoop.maprcduce.Mapper; { .
import org.apachc.hadoop.mapreduce.Reducer; ;um += value.get();
import org.apache.hadoop.mapreduce.lib.input.FilelnputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; } con.write(word, new Int
import org.apachc.hadoop.util.GenericOptionsParser;
public class WordCount { }
public static void main(String 0 args) throws Exception }
{
Co'.'f\guration c = ncw Configuration(); b. Briefly explain IIDFS Na
Stnn~0 filcs=new GenericOptionsParser(c,args).getRemniningArgs()· and Backups.
Path input- new Path(files[O)); • Ans. HDFS Federation impro
Pat~ output=new Path(filcs[ I]); of namespace and storage
Job J=new Job(c,"wordcount''):
multiple namespnccs in '

2
· JIM\el(Jvly 2019
j.setJarRyClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setRcducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
ous) j.setOutputValueCla:;s(lntWritable.class);
CAR 2 Filelnputformat.addlnputPath(j, input);
TP.AJN 2
~ TR.\IN 2 { PlAAf 2 FileOu1putformat.sctOutputPath(j, output);
System.exit(j. waitForCompletion(true)?0: I);
} .
PlAI;[ 2 public static class MapForWordCount extends Mapper<LongWritable, Text, Text,
lntWritable>{ .
public void map(LongWritable key, ~xt value, Context con) throws IOExcept1on,
Combining lnterruptedExccption
{
String line= value.toString();
String[] words=linc.splil(",");
for(String word: words )

{ Text outputKey = new Text(word.toUpperCaseQ.trim());


lntWritable outputValue = new lntWritable(I);
n different clusters. ln order
con.write(outputKey, outputValue);
d be on the same cluster.
}
}
ual result set from each clust
}
public static class RcduceForWordCount extends Reducer<rcxt, lntWritable, Text,
lntWritable>
{
public void reduce(Text word, lterable<lntWritable> values, Context con) throws
IOException, lnterruptedExccption
{
int sum= 0;
for(lntWritable value: values)
{
sum+= value.get();
ormat; }
tFormat; con.write(word, new lntWritable(sum));
}
}
}

b. Brielly explain HDFS Name Node federation, NFS Gateway, snapshots, Checkpoint
and Backups. (08 Marks)
'ningArgsO;
HDFS Federation improves the existing HDFS architecture through a clear separation
of namespace and storage, enabling generic block storage layer. It enables support for
multiple namespaces in the cluster to improve scalability and isolation. Federation

3
VIII Sem, ( CSE I ISE) cs -JIM'\£,/Jul;y 20
also opens up the architeclure, expanding the applicability of J IDFS cluster to new HDFS snapshots ares·
implementations and use cases. dfs-snapshot comman
r·· ··· · ··· ····· ··~ r---------------• -----------------.
: :~ •: :NN-n : . system. They offer the

l~
' NN-1
I • Snapshots can be t
I '
I
I '' • Snapshots can be
I
I
I
recC1Very.
I ••
I • Snapshots creation
I
I • Blocks on the Dat

t••·
I
I
list and the file si
there are duplicat
,________________ Jj • Snapshots do not
A Checkpoint Node
changes are just writ
NameNode runs for
because more chang
the metadata. The C
NameNode and me
HDFS has two main layers: uploads the result to
I . Namespacc There was also a si
• Consists of directories, fiks and blocks. the "upload to Nam
• It supports all the namespace related file system operations such as create, delete. the Secondary Name
modify and list files and directories. Secondary NameNod
2. Block Storage Service, which has two parts: Backup Node The •
• Block Management (performed in the Namenode) Node, but is synchr
a. Provides Datanodc cluster membership by handling registrations, and periodic periodically because
heart beats. the current state in-1
b. Processes block reports and maintains location of blocks. checkpoint.
c. Supports block related operations such as create, delete, modify and get block
location.
d. Manages replica placement, block replication for under replicated blocks, and What do you unde
deletes blocks that arc over replicated.
• Storage - is provided by Datanodes by storing blocks on the local fi le system and Hadoop File Syste
allowing read/write access. commodity hardwa
The prior IIDFS architecture allows only a single namespace for the entire cluster. In that and designed using I
configuration, a single Namenode manages the namespace. HDFS Federation addresses HDFS holds very la
this limitation by adding support for multiple Namenodes/namespaces ro HDFS. data, the fi les are st
The NF~ ~11teway for HDFS _allows clients to mount I fDFS and interact with it through fashion to rescue th
NFS, ~s 1f 11 were pa!' of their local file system. The Gateway supports NFSv3. After makes applications
mounting HDFS, a client user can perform the following tasks: Features of IfDFS
Brows~ the I IDFS file system through their local file system on NFSvJ client-compatible a. It is suitable for
oi>c;ra11ng systems. ·
b. 1ladoop provid
Upload and download files between the HDFS fil . C. The built-in se
Stream data directly to I IDFS throu h th e sy~tem ~nd their local file system.
random write is not supported. g e mount point. File append is supported, but of cluster.
d. Streaming acces
e. IIDFS provides
4
• JI.MU?/( JIALy 2019
HDFS snapshots are similar to backups. but arc created by administrator using the hdfs
dfs-snapshot command. HDFS snapsbo&s arc read-only point-in-time copies of the file
system. They offer the following features.
• Snapshots can be taken of a sub-tree of the file system or the entire file system.
• Snapshots can be used for data backups. protection against user errors. and disaster
recovery.
• Snapshots creation is instantaneous.
• Blocks on the DataNodes are not copied, because the snapshot files record the block
list and the file size. There is no data copying, although it appears to the user that
there arc duplicate files.
• Snapshots do not adversely affect regular HDFS operations.
A Checkpoint Node was introduced to solve the drawb'acks of the NameNode. The
chunges are just written to edits and not merged to fsimage during the runtime. If the
NameNode runs for a while edits gets huge and the next startup will take even longer
because more changes have to be applied to the state to detennine the last state of
the metadata. The Checkpoint Node fetches periodically fsimage and edits from the
NameNodc and merges them. The resulting state is called checkpoint. After this is
uploads the result to the NameNode.
There was also a similar type of node called "Secondary Node'' but it doesn't have
the "upload to NameNode" feature. So the NameNode need to fetch the state from
the Secondary NameNode. It also was confussing because the name suggests that the
Secondary NameNode takes the request if the NameNode fails which isn't the case.
Backup Node The Backup Node provides the same functionality as the Checkpoint
Node, but is synchronized with the NameNode. It doesn't need to fetch the changes
periodically because it receives a strem offile system edits. from the NameNode. It holds
the current state in-memory and just need to save this to an image file to create a new
checkpoint.
OR
What do you understand by HDFS? Explain Its components wilb a neat diagram.
(10 Marks)
Hadoop File System was developed using distributed file system design. It is run on
commodity hardware. Unlike other distributed systems, HDFS is highly faulltolerant
and designed using low-cost hardware.
r the entire cluster. Int HDFS holds very large amount of data and provides easier access. To store such huge
FS Federation add data, the files are stored across multiple machines. These files are stored in redundant
spaces to HDFS. fashion to rescue the system from possible data losses in case of failure. HDFS also
d interact with it throu makes applications available to parallel processing.
supports NFSv3. Aft Features of HDFS
a. It is suitable for 1he distributed storage and processing.
b. Hadoop provides a command interface to interact wilh HDFS.
c. The built-in servers of namenode and datanode help users to easily check the status
their local file S) stem. of cluster.
ppend is supported, bti d. Streaming access to file system data.
e. HDFS provides file pennissions and authentication.

5
VIII Sem, (CSE I I SE)
cacs •JIM\.e/lJ~201
HOFS Architecture
number of replicas of a]
11N Cleta (Name, ...,i1u,,...): and can be changed latt
ltlome/loo/data, 3, ... any time.
The NameNode make~
receives a Heartbeat an~
of a I leartbeat implies t
a list of all blocks on a

•• ■
D ata Nodes
,._ _ _, 1~1
__o_a_t_a_N_o_d_e_s___,J
Na
/use1
/usel

Rack 1
'
Rack2
HDFS Architecture (Components)
Given above is the architecture ofa Hadoop file System. HDFS follows the master-slave OJ
architecture and it has the following elements. 0
Namenode The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software. It is a software that can be run on
commodity hardware. The system having the namenodc acts as the master server and it
does the following tasks -
I!] ~'
• Manages the file system namespace.
• Regulates client's access to files.
• It also executes file system operations such as renaming, closing, and opening files 3. a. Explain Apache Sq
and directories. Marks)
Datanode The datanode is a commodity hardware having the GNU/ Linux operating Ans. Figure I describes the
system and datanode sonware. For every node (Commodity hardware/System) in a two steps. In the first
cluster, there will be a datanode. These nodes manage the data storage of their system. the necessary metada
reduce step) Hadoop j
• Datanodes perform read-write operations on the.file systems, as per client request.
transfer using the me
• They also perform operations such as block creation, deletion, and replication
import must have ace
according to the instructions of the namenode.
The imported data ar
Block Generally the user data is stored in the files of HDFS. The file in a file system for the directory, or t
will be divided into one or more segments and/or stored in individual data nodes. These be populated. By de
file segments are called as blocks. In other words, the minimum amount of data that separating different re
HDFS can read or write is called a Block. The default block size is 64MB, but it can be over by explicitly sp
increased as per the need to change in HDFS configuration. placed in I IDFS, the
b. .B.-ing o ut the con ceplJi ofHDFS block replication, with an example. (06 Marks) Data export from the
Ans. Data Replication HDFS is designed to reliably store very large flh::. across machines in as shown in Figure 2
a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the for metadata. I he ex
last block are the same size. The blocks of a file arc replicated for fault tolerance. The database. Sqoop divl
block size and replication factor are configurable per fi le. An application can specify the push the splits to the
the database.

6
C'8CS · JIM'\R/( J"'½' 2019

number of replicas of a file. The replicatioe . . _ can be specified at file creation time
and can be changed later. Files in HDFS -WIiie-once and have strictly 0111: writer at
, 191lllca1,...):
any time.
Illa,~-·-
The NameNode makes all decisions •& iis aq>lication of blocks. It periodically
receives a Heartbeat and a Blockreport from....,llle DataNodes in the cluster. Receipt
of a Heartbeat implies chat the DataNode at 2 1 « prop1:rly. A Rlockn:port contains
a I ist of all blocks on a DataNode.
Blocl. Reillc1lo11
-
Namenode (Filename, nurnRel)lcas, block-ids, ... )
/users/sameerp/data/pat-0, r:2, {1.3}, ...
/users/sameerp/dala/part-1, r:3, {2,4,5}, ...

odes Datanodes

follows the master-slave

ontains the GNU/Linux


are that can be run °!'
, the master server and it
Modulc-2
Explain Apache Sqoop Import and Export method with neat diagrams. (JO
losing, and opening files Marks)
Figure I describes the Sqoop data import (to HDFS) process. The data import is done in
1e GNU/LinlLX operating two steps. In the first step, shown in the figure, Sqoop examines the database to gather
hardware/System) in a the necessary metadata for the data to be imported. The second step is a map-only (no
storage of their system. reduce step) Hadoop job that Sqoop submits to the cluster. This job doos the actual data
,s, as per client req'.1est: transfer using the metadata captured in the previous step. Note that each node doing the
dcletiorT, and repltcatton import must have access to the database.
The imported data arc saved in an HDFS directory. Sqoop will use the database name
. The file in a file system for the directory, or the user can specify any alternative directory where the files should
lividual data nodes. These be populated. By default, these files contain comma-delimited fields, with new lines
num amount of data that separating different records. You can easily override the format in which data are copied
ze is 64MB, but it can be over by explicitly specifying the field separator and record terminator character:,. Once
placed in IIDFS, the data are ready for processing. ·
Data export from the cluster works in a similar fashion. The export is done in two steps,
example. (06 Marks)
as shown in Figure 2. As in the import process, the first step is to examine the uutabase
ie files across machines in
blocks in a file except the for mctadata. The export step again uses a map-only I ladoop job to write the data to the
d for fault tolerance. The database. Sqoop divides the input data set into splits, then uses individual map tasks to
pplication can specify the push the splits to the database. Again, this process assumes the map tasks have access to
the database.

,..,,..._.. ~~ ~ " '.....


7
VIII Sett11 ( CSE I ISE)
,,,, ___
,,,,-----........ start
L. ·,
~ :-_!_--=-~ -:
( ,,, ....,.._,.,. I I
I I
I
I
I
I
I MapR
I
I
I I 4. a. How do you run
I
I _r,._,
'---------✓
\,. _______
_ _.
I
I
.)
Ans.
architecture? Disc
figure I Figure 2
b. Explain with a neat diagram, the Apache Oozle work flow for Hadoop arcbltMectuksn:).
(06 ar
Ans. Oozie is a workftow director system designed to run and manage multiple related
Apache Hadoop jobs. for instance, complete ~ata i~put and analysis ma~ require several
discrete Hadoop jobs to be run as a workftow m which the output of one Job serves as the
input for a successive job. Oozie is designed to construct and manage these workflows.
Oozie is not a substitute for the YARN scheduler. That is, YARN manages resources for
individual Hadoop jobs, and Oozie provides a way to connect and control Hadoop jobs
on the cluster.
Oozie workflow jobs are represented as directed acyclic graphs (DAGs) of actions.
(DAGs are basically graphs that caMot have directed loops.) Three types of Oozie jobs
are pennittcd:
Image Workflow-a specified sequence of Iladoop jobs with outcome-based decision
points and control dependency. Progress from one action to another cannot happen until
the first action is complete.
Image Coordinator-a scheduled workflow job that can run at various time intervals or The ResourceMana
when data become available. therefore, can ens
Image Bundle-a higher-level Oozie abstraction that wjll batch a set ofcoordinator jobs. Depending on the a
Oozie is integrated with the rest of the Hlidoop stack, supporting several types ofHadoop the ResourceManag
Jobs out of the box {e.g., Java MapReduce, Streaming MapReduce, Pig, Hive, and run on particular n
Sqoop) as well as system-specific jobs (e.g., Java programs and shell scripts). Oozie also cores) bound to a p
provides a CLI and a web UI for monitoring jobs. ResourceManager in
Below Figure depicts a simple Oozie workftow. In this case, Oozie runs a basic MapReduce the NodeManager. C
operation. If the application was successful, the job ends; if an error occurred, the job is are heartbeat based ti
killed. resource availability.
and killingjobs). Th
view" of the cluster.
User applications are
through an admissio
and various operatio
B
-J~/]~2019
,,----................

.. -
Slllrt map-,educe._,_OIC;;;..;..,a,j

--t I
, __
','\

---·
_ _ ,.:t_ _ _ _ _ _ _

I I
I
wordcounl

ERROR
,--......,,_.,...._.,
__,
>CfflaJaM1rCo•

o:
I

I MapReduce Worldlow DAG wo,tdlouml


I
I OR

I -0 J\
l __ - c -_ _
How do you run Map Reduce a■d M1 FE I ...... Interface (MPI) on YARN
architecture? Discuss. (10 Marks)

Figuro 2
w for Hadoop archltectu
(06 Mara
d manage multiple relat
analysis may require several
tput ofone job serves as the
d manage these work.flows.
ARN manages r~ources for
and control Hadoop jobs

(DAGs} of actio
types of Oozic jo . lld,ntw
:.... ' .• o,f ConWnerII I
· outcome-based decisioo
another cannot happen until
The ResourceManager has a central and global view of aJI cluster resources and,
therefore, can ensure fairness, capacity, and locality are shared across all users.
Depending on the application demand, scheduling priorities, and resource availability,
tch a set ofcoordinatorjobs. the ResourceManager dynamically allocates resource containers to applications 10
ing several types ofHadoop run on particular nodes. A container is a logical bundle of resources (e.g., memory,
apReduce, Pig. Hive, and cores} bound to a particular clust~r node. To enforce and track such assignments, the
and shell scripts}. Oozie also ResourceManager interacts with a special system daemon rwming on each node called
the NodeManager. Communications between the ResourceManager and NodeManagers
zie runs a basic MapReduce are heartbeat based for scalability. NodeManagers are responsible for local monitoring of
an error occurred, the job is resource availability, fault reporting, and container life-cycle management (e.g., starting
and killing jobs). The Resoun:eManager depends on the NodeManagers for its "global
view" of the cluster.
User applications are submitted to the ResourceManager via a public protocol and go
through an admission control phase during which security credentials are vaJidltted
and various operational and administrative checks are perfonned. Those applications

9
VIII Semt (CSE/ I Sf)
that are accepted pass to the scheduler and are allowed to run. Once the scheduler has
enough resources to satisfy the request, the application is moved from an accepted state Version 1 Apal
MapReducc ~
to a running state. Aside from internal bookkeeping, this process involves allocating
a container for the single ApplicationMaster and spawning it on a node in the cluster.
Often called container 0, the ApplicationMaster does not have any additional resources
at this point, but rather must request additional resources from the ResourceManager.
The ApplicationMaster is the ·•master" user job that manages all application life-cycle
aspects, including dynamically increasing and decreasing resource consumption (i.e.,
containers), managing the flow of execution (e.g., in case of MapReduce jobs, running
reducers against the output of maps), handling faults and computation skew, and
performing other local optimizations.
Above Figure illustrates the relationship between lhe application and YA RN components. Hadoo
The YARN components appear as the large outer boxes (ResourceManagcr and
NodeManagers), and the two applications appear as smaller boxes (containers), one dark
and one light. Each application uses a different ApplicationMaster; the darker client is
running a Message Passing Interface (MPI) application and the lighter client is running Write any four Busioe.
a traditional MapReduce application. 1. Customer Relations~
b. What do you understand by YARN Dbtributcd-Shcll? (06 Marks) customer becomes
Ans. Distributed-Shell is an example application included with the Hadoop core components sentiments of the cu
that demonstrates how to write applications on top of YARN. It provides a simple method also, expand the poo
for running shell commands and scripts in containers in parallel on a Hadoop YARN of marketing.
cluster. a. Maximize the ret
The distributed shell client allows an application master to be launched that in turn would points from data-
run the provided shell command on a set of containers. tuned to better re
This client is meant to act as an example on how to write yarn-based applications. b. Improve custome
To submit an application, a client first needs to connect to the ResourceManager win new custome
aka ApplicationsManager or ASM via the ApplicationClientProtocol. The on their likelihoo
ApplicationClientProtocol provides a way for the client to get access to cluster such as discounts
information and to request for. a new ApplicationId. manner.
Fortheactualjob submission, theclient first has to create anAppl icationSubm issionContext. c. Maximize custon
The ApplicationSubmissionContext defines the application details such as Applicationld customer should
and application name, the priority assigned to the application and the queue to which this a customer new p
application needs to be assigned. In addition to this, the ApplicationSubmissionContext increase revenue
also defines the ContainerLaunchContext which describes the Conlainer with which the opportunity to wo
ApplicationMaster is launched. and value, the busi
The ContainerLaunchContcxt in this scenario defines the resources to be allocated for the d. Identify and delig
Ap~licationMaster's ~ontainer, the local resources Uars, configuration files) to be made best customers can
available and the environment to be set for the ApplicationMaster and the commands to with greater attenti
be executed to run the ApplicationMaster. effectively.
Using the ApplicationSubmissionContext, the client submits the application to the e. Manage brand ima
Re~ourceManager and then monitors the application by requesting the RcsourceManager chatter about itself.
for an Application Report at regular rime intervals. In case of the application taking too nature of comments
long, the client kills the application by submitting a KillApplicationRequest to the 2. Health Care and Wel
ResourceManager. economies. Evidence-
run. Once the scheduler
oved from an accepted s ....,, ,_
~1

process involves allocati


t it on a node in the c\us
ve any additional resou
om the ResourceManager. YARN Re5CUClllllanager
iges all application life-c~c
resource consumption (,.
'or MapReducc job::., runni
and computation sk\!w,
Hadoop Distributed Fie System - HDFS
ion and YARN componen~
. es (ResourceManager
. boxes (containt:rs), on~ d~ Module-3
,nMastcr; the darker chcn~ II Write any four Business Intelligence Applicatia r... various sectors. (08 Marks)
d the lighter client is runmtC 1, Customer Relationship Management A business exists to serve a customer. A happy
customer ~omes a repeat customer. A bw.incss should understand the needs and
(06 Marks) sentiments of the customer, sell more of its offerings IO the existing customers. and
.he Hadoop core componenll also, expand the pool ofcustomers it serves. Bl applications can impact many aspects
!-It provides a simple method of marketing.
,arallel on a Hadoop YARN n. Mai.imizc the return on marketing campaigns: Understanding the customer's pain
points from data-based analysis can ensure that the marketing messages are fine-
ie launched that in tum would tuned to better resonate y, ith customers.
b. Improve customer retention (churn analysis): It is more difficult and expensive to
nm-based applications. "'in new customers than it is to retain existing customers. Scoring each customer
:ct to the ResourceManager on their likelihood to quit can help the business design effective interventions,
licationClientProtocol. The such as discounts or free services, to retain profitable customers in a cost-effective
nt 10 get access to cluster manner.
c. Maximize customer value (cross-selling. upselling): Every contacl with the
iplicationSubmission~o~texL customer should be seen as an opportunity to gauge their current needs. Offering
details such as Apphcattonlcl a customer new products and solutions based on thosl: imputed needs can help
n and the queue to which this increase revenue per customer. Even a customer complaint can be seen as an
plicationSubmissionC_onte:d opportunity to wow the customer. Using the knoY.ledge of the customer's history
the Container with which the and value, the business can choose to sell a premium service to the customer.
d. Identify and delight highly valued customers: By segmenting the customers, the
urces to be allocated for thc best customers can be identified. They can be proactively contacted, and delighted,
nfiguration files) to be made with greater attention and better service. Loyalty programs can be managed more
aster and the commands to effectively.
c. Manage brand image: A business can create a listening post to listen to social media
its the application to the chaner about itself. It can then do sentiment analysis of the text to understand the
ting the ResourceManagcr nature of comments and respond appropriately to the pros~cts and customers.
fthe application taking too 2. Health Cure aod Wellness Health care is one of the biggest sectors in advanced
IApplicationRequest to the economies. faidence- based medicine is the newest trend in data-based health

11
VIII Se,m., (CS£ I I S£) 'B~V~A~ C'BCS • JIM'\R./( J ~ 20

care management. Bl applications can help apply the most effective diagnoses and c. What is Confusion Ml
prescriptions for various ailments. They can also help manage public health issues, Ans.
and reduce waste and fraud.
3. Education As higher education becomes more expensive and competitive, it is a great
user of data-based decision-making. There is a strong need for efficiency, increasing
revenue, and improving the quality of student experience at all levels of education.
4. Retail organii.ations grow by meeting customer needs with quality products, in a
convenient, timely, and cost-effective manner. Understanding emerging customer
shopping patterns can help retailers organize their products, inventory, store layout,
and web presence in order to delight their customers, which in tum would help
increase revenue and profits. Retaih:rs generate a lot of transaction and logistics data
that can be used to solve problems.
b. Explain the star schema design or Data Warehousing with an example.
A confusion matrix
(06 Marks)
Ans. Star schema is the preferred data architecture for most DWs. There is a central fact table classification m~del
that provides most of the information of interest. There are lookup tables that provide known. The confusi
detailed values for codes used in the central table. For example, the central table may use terminology can be ci
digits to represent a sales person. The lookup table will help provide the name for that
sales person code. Herc is an example of a star schema for a data man for monitoring 6. a. Explain CRISP-D
sales performance Ans.
Item id Item name Region id Name

ltomid Sales Region Period Total ~les


Person 10 id Sales Quota
ID
.. J L

-- Prriod id Month/yr

Other schemas include the snowflake architecture. The difference between a star and
snowflake is that in the latter, the lookup table~ can have their own funher lookup tables.
There are many technology choices for developing OW. This includes selecting the right
The data mining I
database management system and the right set of data management tools. There are a
Mining (CRISP-D
few big and reliable providers of DW systems. The provider of the operational DBMS
I. The first and m
may be chosen for DW also. Alternatively, a best-of-breed DW vendor could be used. asking the right
There are also a variety of tools out there for data migration, data upload, data retrieval,
11ml data analysis. lead to large pa
selecting a dat

12
CBCS -JIM\,(!/(J~ 2019
t effective diagnoses and c. What Is Confusion Matrix. (02 Marks)
age public health issues, Am.

d competitive, it is a great
for efficiency, increasing
t all levels of education.
ith quality products, in a
~ding emerging customer
cts, inventory, store lay~
which in tum would hel
-ansaction and logistics

h an example. A confusion m~trix is a table that is often used II> dacribe the performance of a
(06 Ma classification model (or "classifier") on a set of test dlla for which the true values are
. There is a central fact tab known. The confusion matrix itself is relatively simple to understand, but the related
lookup tables that provi terminology can be confusing.
le, the central table may OR
p provide the name f?r t_
a data mart for momtonna 6. a. Explain CRISP-OM cycle with a neat diagram. (08 Marks)
Ans.

'fference between a star and


ir own further lookup tables.
·s includes selecting the right ~e. data mining industry has proposed a Cross-Industry Standard Process for Data
agement tools. There are a Mining (CRISP-OM). II has six essential steps
r of the operational DBMS
I. Th~ first an~ most i'."portant s~ep in data mining is business understanding. that is,
DW vendor could be used.
asking the right business questions. A question is a good one if answering it would
• data upload. data retrieval,
lead t? large payoffs fpr the organi1.ation, financially and othem ise. In other words
selecting a data mining project is like any other project, in which it should sho~

13
VIII Sett11 (CSE ( I SE) 'B~V~A~ L13CS · JIM'\.el( J u½' 2011
strong payoffs if the project is successful. There should be strong executive support
for the data mining project, which means that the project aligns well with the business Data mining is a method
strategy. large amounts of data t
2. A second important step is to be creative and open in proposing imaginative
patterns.
hypotheses for the solution. Thinking outside the box is important, both in terms ofa Data mining is usu;i
proposed model as well in the data sets available and required. business users with th,
3. The data should be clean and of high quality. It is important to assemble a team that engineers.
has a mix of technical and business skills, who understand the domain and the data. Data mining is the co
Data cleaning can take 60 to 70 percent of the time in a data mining project. It may process of extracting d
be desirable to add new data elements from external sourc~s of data that could help data sets.
improve predictive accuracy. One of the most impc
4. Patience is required in continuously cn$aging with the data until the data yields some of data mining tech
good insights. A host of modeling tools and algorithms should be used. A tool could detection and identific2
be tried with different options, such as running different decision tree algorithms. in the system.
s. One should not accept what the data says at first. It is better to triangulate the analysis
by applying multiple data mining techniques and conducting many what-if scenarios,
to build confidence in the solution. Evaluate the model's predictive accuracy with 7. a. What Is splitting va1
more test data. variable.
6. The dissemination and rollout of the solution is the key to project success. Otherwise Ans. Decision trees arc traine
the project will be a waste of time and will be a setback for establishing and supporting repeatedly split accordin
a data-based decision-process culture in the organization. The model should be homogeneous) in terms
embedded in the organization's business processes.
b. What do you understand by the term Data Visualizatio n? How It is Impo rtant in

...
Big Data Analytics? (OS Marks)
Ans. As data and insights grow in number, a new requirement is the ability of the executives
and decision makers to absorb this information in real time. There is a limit to human
comprehension and visualilation capacity. That is a good reason to prioritize and manage
with fewer but key variables that relate directly to the key result areas of a role.
Here are few considerations when presenting data:
I. Present the conclusions and not just report the data.
2. Choose wisely from a palette of graphs to suit the data.
3. Organize the results to make the central point stand out.
4. Ensure that the visuals accurately reflect the numbers. Inappropriate visuals can
create misinterpretations and misunderstandings.
5. Make the presentation unique, imaginative, and memorable.
c. Differentiate between Data Mining and Data Warehousing. (03 Marks) Three criteria for choo
Ans.
a) Which variable to u
Data Mining Data Warehouse variable for the fi _
Data mining is the process of A data warehouse is database system which is measures like least
analyzing unknown patterns of data. designed for analytical instead of transactional b) What values to use
work. age or BP, what vah
c) How many branch~
with just two branc
b. List some of the advar
14
,c strong executive support
Data mining is n method ofcompar;ns Duo, "'~huu5ing Is a method or centralizing
ligns well with the business
large amounts of data to finding right data from different sources into one common
patterns. reposit.or)
in proposing imaginati\e
!
mportant, both in terms of a
1uircd.
Data mining is usually done by
business users with the assistance of
Data w.-ehousing is a process which ne~ds to
occur ~fore any data mining can take place.
engineers.
ant to assemble a team that On the other hand, Data warehousing is the
nd the domain and the data. Data mining is the considered as a
process of extracting data from large process of pooling all relevant d~ta together.
data mining project. 1t may
data sets.
urc~s of data that could help
One of the most important benefits One of the pros of Data Warehouse is its
ta until the data yields some of data mining techniques is the ability to update consistently. That's why it is
hould be used. A tool could detection and identification of errors ideal for the business owner who wants the
decision tree algorithms. . in the system. bc!>t and la1eM features.
ter to triangulate the anal~s•s Module- 4
ting many what-ifscenan~s,
l's predictive accuracy with 7. a. What is splitting variable? Describe three criteria for choosing a splitting
variable. (04 Marks)
o project success. Otherw_ise Ans. Decision trees arc trained by passing data down from a root node to leaves. The data is
r establishing and supporting repeatedly split according to predictor variables so that child nodes are more "pure" (i.e.,
ltion. The model should be homogeneous) in terms of the outcome variable. This process is illustrated below:
Root
on? How it is important in
(OS Marks)
1 theability of the executives
e. There is a limit to human _./' , root split

.. :':--$···
.- ~
ason to prioritize and manage
result areas of a role.

st
1 child split /
. _/
'
nd
2 child spltt
~

/ \
s. Inappropriate visuals can

(03 Marks) Three criteria for choosing a splitting variable.


a) Which variable to use for the first split? How should one determine the most important
Warehouse variable for the first branch, and subsequently, for each subtree? There are many
measures like least errors, information gain, and Gini coefficient.
is database system which is
b) What values to use for the split? If the variables have continuous values, such as for
cal instead of transactional
age or BP, what value-ranges should be used to make bins?
c) How many branches should be allowed for each node? There could be binary trees,
with just two branches at each node. Or there could be more branches allowed.
b. List some of the advantages and disadvantages of Regression Model. (04 Marks)

15
VIII Semt ( CSE ( ISE)
Ans. Advantage:
I. Regression models are easy to understand as they are built upon basic statistical
principles, such a~ correlation and least square error.
2. Regression models provide simple algebraic equations that are easy to understand
and use.
3. The strength (or tho: goodncs,; of fit) of the regression model is measured in terms
of the correlation coefficients, and other related :.tatistical parameters that are well
understood.
Disad,•antage:
I. Regression models cannot cover for poor data quality issues. If the data is not
prepared well to remove missing values, or is not well-behaved in terms of a normal
distribution, the validity of the model suffers.
2. Regression models suffer from collinear problems (meaning strong linear correlations
among some independent variables). If the independent variables have strong
correlations among themselves, then they will eat into each other's predictive power
and the regression coefficients will lose their ruggedness.
3. Regression models will not automatically choose between highly collinear variables,
although some packages attempt to do that. Regression models can be unwieldy and
unreliable if a large number of variables are included in the model. All variables
entered into the model will be reflected in the regression equation, irrespective of their
contribution to the predictive power of the model. There is no concept of automatic
pruning the model. 8. a. Explain the deslg,
c. Create a decision tree for the following data set. Ans. Design Principles
Credit Loan Approved • I. A neuron is th
Age Job House
element) recei11
Young FALSE No Fair No
weighted com
Young FALSE No Good No output value, a
Young TRUE No Good Yes the inputs, w's
Young TRUE Yes Fair Yes
Young FALSE No Fair No
Middle FALSE No Fair No
Midd°le FALSE No Good No
Middle TRUE Yes Good Yes
Middle FALSE Yes Excellent Yes
Middle FALSE Yes Excellent Yes
Old FALSE Yes Excellent Yes
Old FALSE Yes Good Yes
Old TRUE No Good Yes
Old TRUE No Excellent Yes
Old FALSE No Fair No 2. A neural net\\
Then solve the following problem using the model. output neuro
structure wo

16
Age J ob Houe Loan Approved
Young FALSE No ???

strong linear correlations


t variables have strong
other's predictive power

ighly collinear variables,


ds can be unwieldy and Age Loan Approved
the model. All variables Young No
ation, irrespective of their
no concept of automatic OR
Explain the design principles of an Artificial Nnnl etwotk. (08 Marks)
Design Principles of an ANN
I. A neuron is the basic processing unit of the nelWOrL The neuron (or processing
an Approved
clement) receives inputs from its preceding neurons (or PEs), does some nonlinear
No weighted computation on the basis of those inputs. transforms the result into its
No output value, and then passes on the output to the ncxrncuron in the network. X's are
Yes the inputs, w's are the weights for each input, and y is the output.
Yes
No X1""
No
No X2
~
_w2

--
Yes
- Y
Yes
Yes X3 - W3

Yes
Yes /w,
Yes X4
Yes
No 2. A neural network is a multilayered model. Then: is II least one input neuron, one
output neuron, and at least one processing neuron. An ANN with just this basic
structure would be a simple, single-stage computational unit. A simple task may be

17
VIII Sem, (CSE ( ISE) CBCS · J(;(..t'\£11J«½t 201

processed by just that one neuron and the result may be communicated soon. ANNs,
however, may have multiple layers of processing elements in sequence. There could Scru, I> f or
cnun\ of c:.ch
be many neurons involved in a sequence depending upon the complexity of the c:u1d1d~1e
predictive action. The layers of PEs could work in sequence, or they could work in
parallel.
•1 ~
' Wm , w,11 Neuron 11 ~ HJ.- ~
.,.)_:t,-,~1-J,-,~,,
('l
Ki w Wu, ~Neuron

~~~
W, 11 Neuron., Ct:nctntr C2 llcn>>et
wm' _ !\ / Y!.m cnnd1dntc1 hum r..1 { l1, 1_)

r\ w111, 1ulfJ1 Ill.DI


-' Neuron12 /\ " 122 122 - ,Y
\II, 14)

Kl Wm w.,~ j I
In /12 : ~:ll Neuron
13
w,u \11 . 15}
{12 , 13)
(12, 14)

w"1 I rn In
{12.15}
\13. 14}
(IJ. •~J
(14. I~)
Input Neuron layer Hidden Layer Output Neuron Laye r
3. The processing logic of each neuron may assign different weights to the various
incoming input streams. The processing logic may also use nonlinear transformation,
such as a sigmoid function, from the processed values to the output value. This Vcnt!r.lfe (' ~ tnu.cr
ca.uct1<1,1,- h om ( II . 12. n )
processing logic and the intermediate weight and processing functions are just what l.1
works for the system as a whole, in its objective of solving a problem collectively.
Thus, the neural networks are considered to be an opaque and a black-box system.
4. The neural network can be trained by making similar decisions over and over
again with many training cases. It will continue to learn by adjusting its internal 9. a. What is Na'ive Bayes 1,
computation and communication based on feedback about its previous decisions. Ans. It is a classification te
Thus, the neural networks become better at making a decision as they handle more independence among pr
and more decisions.
the presence of a partic
b. How does the Apriori Algorithm work? Apply the same for the following example. feature.
TIO List of Item-Ids For example, a fruit ma
T I00 11 , 12. 15 inches in diameter. Even
the other features, all oft
T200 12, 14
this fruit is an apple and ,
TJ00 12, 13 Naive Bayes model is e
T400 11,12,14 Along with simplicity, N
T500 I 1,13 classification methods.
T600 12,13 Bayes theorem provides ~
and P(xlc). Look at the e
noo 11 ,13
T800 11,12,13,15
T900 11,12,13
Assume the support count- 2.
A ns. (08 Marks)

·18
cs -J~( J~ 2019

municated soon. ANNs. c, 1.,


Sc,n I) for IICmliCI Sun C• qwr aa6dat,

-
t.:i1\;:ll lh."BlJ.CI Sun .u11111
in si:quence. There could !fl I r,
on the complexity of the
coun: of eoch
cand1<la.~ {1'2) 7 ~_,
l..~C.-• {I I)
{12)
(\
7
(13) 6 !IH t,
ce, or they could work io {14) 2 {hi} !
1151 2 !IS} 2

Genern1c C,
r,
llcmM:I Senn D for
•1 /.,
llcmm Su11..- c,.,.,r:irc cand1dotc llemset Sun count
-
Neuron31 cn11d11Ja1c., (,.,,;, l1 I II. 12 J coun1 of each (II , 12) 4 'iaf"P1rt cuunl wi1h { II, I~) ~

Gli]---- >-Y p 1. 13 J c:111wd01e


{II. f.l)
{JI.IS)
{12, 13}
{ II. 13 I
(11, !,I/
(II . I~)
{ 12. Ill
1
~
I
zm..mum!iiuppon
cuunl
111. n1
{II. 15)
(12. 11/
{12, M}
4

4
2
2

fl2, IJ) {12. IJ} 2 {12, IS) 2


{12. IS} {I~. 15/ 2
flJ. 14) {13.14) 0
{13, ,~, {l.l.151 I
(14. 15} {IJ. 1,} (I

utput Neuron Layer


nt weights to the various
, nonlinear transformation, Comp3tc c:mdJ
1 :uc
vcncralc C1 llem,c1 se,n D for lle10<c1 Suo. counl 1uppor1 coom ,..,lh ilCUl.\Cl SUD. Ct>Ultl

to the output value. This canJ«lll~\ ",,m {II. 12. ll) coun1 or c>ch {II. 12, 13) 2 minimum <Uppon fl 1. 12. 13) ?
l,1 cond«J:u~ count
· ng functions are just what {11.12.15} -
{11.11.15} 2
.
(11.12. JS} 2
ing a problem collectively.
and a black-box system. Module- 5
decisions over and over
n by adjusting its internal What is Na'ive Bayes Technique? Explain its model. (05 Marks)
out its previous decisions. It is a classification technique based on Bayes' Theorem with an assumption of
ision as they handle more independence among predictors. In simple terms, .a Naive Bayes classifier assumes that
the presence of a particular feature in a class is unrelated to the presence of any other
feature.
or the following example,
For example, a fruit may be considered to be an apple if it is red, round, and about 3
inches in diameter. Even if these features depend on each other or upon thi: existence of
the other features, all of these properties independently contribute to the probability that
this fruit is an apple and that is why it is known as 'Naive'. ·
Naive Bayes model is easy to build and particularly useful for very large data sets.
• Along with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Bayes theorem provides a way of calculating posterior probability P(cJx) from P(c), P(x)
and P(xlc). Look at the equation below:

(08 Marks)

29
VIII Sem- (CSE I I SE) CBCS - J ~ / J ~ 2
l.i-elmood 1 N
-:,w1 w+CL s,
- za; l
P( c I-") = P(x I c) P(c) subject to the constrai
I P(x) yc(w1 ¢{~J +b)~1
POS-tt:flOr Prob.ab~ o,d1ctorPr~P,obabRrty where C is the capac
represents parameter
training cases. Note t
variables. The kernel
Above, feature space. It sho
P(clx) is the posterior ·probability of class (c, target) given predictor (x. attributes). Thus, C should be ch
P(c) is the prior probability of class. · CLASSIFICATION
P(xlc) is the likelihood which is the probability of predictor given class. In contrast to Class
P(x) is the prior probability of predictor. minimizes the error ti
b. What is Support Vector Machine? Explain its model. (08 Marks) 1 1 .,
Ans. In a nutshell, a support vector machine (or SVM) is an algorithm that works as follows. It
- w 1 w-vp+-~
uses a nonlinear mapping to transform the original training data into a higher dimension.
2 .v;=i
subject to the constra•
Within this new dimension, it searches for the linear optimal separating hyperplane
(that is, a "decision boundary" st:parating the tuples of one class from another). With an Y;(w1 ¢\x,)+b)~
appropriate nonlinear mapping to a sufficiently high dimension, data from two classes In a regression SVM,
can always be separated by a hyperplane. The SVM finds this hyperplane using support variable yon a set ofi
vectors ("essential" training tuph:s) and margins (defined by the support vectors). that the relationship
y deterministic functio
Regression SVM
y = f(_x) + noise
The task is then to fi
the SVM has not bee
model on a sample s~
above), the sequential
this error function, t
REGRESSION SV
For this type of S VM
1 .\'
X -w1 w+cL ,:, +
To construct an optimal hyperplane, SVM employs an iterativt: training algorithm, which 2 r• l
is used to minimize an error function. According to the form of the error function, SVM which we minimizes
models can be classified into four distinct groups: . w 1 ¢(x 1)+ b - y 1 :5'
I. Classification SVM Type I (also known as C-SVM classification)
2. Classification SVM Type 2 (also known as nu-SVM classification) Y, -w1 ¢(x, )-b, :S
J. Regression SVM Type I (also known as epsilon-SVM regression) ~1. ~,· 2::0,i = 1,.... .
4. Regression SVM Type 2 (also known as nu-SVM regression)
Classification SVM REGRESSION SV
CLASSIFICATION SVM TYPE 1 For this SVM modd,
f or this type of SVM, training involves the minimization of the error function:

20
I V
.,
-
~.,,,
-w'w+C~ ,,,
t• l
subject to the constraints:
y,fw'9(x.)+b)~1-s,and{,~O •l ,.V
where C is the capacity constant, w is 6e wector of coefficients, b is a constant, and
represents parameters for handling no,asq.aable data (inputs). Thi;: index i labels thi;: N
training cases. Note that represents the class labels and xi represents the independent
variables. The kernel is used to transform dala from the input (independent) to the
feature space. It should be noted that the larger the C. the more the error is penalized.
ictor (x. attributes). Thus, C should be chosen with care to avoid over fitting.
CLASSIFICATION SVM TYPE 2
ivcn class. In contrast to Classification SVM Type I. the Classification SVM Type 2 model
minimizes the error function:
} T 1 X
(08Ma -w w-vp+- L "'·
m that works as follows, 2 N 1-1 .,,,
ta into a higher dimcnsi sub)ect to the constraints:
mat separating hyperpl .vilw' 9Cx:)+b)~ p- ~t' !t ~ 0, i =l, ...,X and p ~o
lass from another). With
In a regression SVM, you have to estimate the functional dependence of the dependent
ion, data from two cl variable yon a set of independent variables x. It assume:,, like other regression problems,
is hyperplane using sup that the relationship between the independent and dependent variables is given by a
the support vectors). deterministic function f plus the addition of some additive noise:
Rcaresslon SVM
y = f(x) + noise
The task is then to find a functional form for f that can correctly predict new cases that
the SVM has not been presented with before. This can be achieved by training the SVM
model on a sample set, i.e., training set, a process that involves, like classification (see
above), the sequential optimization of an error function. Depending on the definition of
this error function, two types ofSVM models can be recognized:
REGRESSION SVM TYPE 1
For this type of S VM the error function is:
J X X
X
-w'w+C""
') t
£-. . +CY ..... ~Jt • 1
- ,-1 t•l
ve training algorithm, which
which we minimize subject to:
of the error function, SVM
w 1 ¢(x:)+b-y1 ~-E+,:;
Y, -w'¢(x,)-b :Se+::,1

s,.~/ ?:0,i =1, ... ,N


REGRESSION SVM TYPE 2
For this SVM modef, the error function is given by:

Zt
VIII Sem, (CSE/ IS'f)
CBCS · JIM'\£/( l!M:1

.!..wr
2 .'llf ,.. 1
f
w-c( ve+..!... k +c;:,· )1
) I
10. a. Explain briefly t
which we minimize subject to: Ans.
(wr¢(xi)+b)-yi $e+{,
y, -(w1 ¢(xi)+ b,} ~ E+ ~•;
~ 1. ~ · ; ~ O,i = l, ..., N ,e ~ 0
There are number of kerne.ls that can be used in Support Vector Machines models. These
include linear, polynomial, radial basis function (RBF) and sigmoid: Web Conte
Kernel Functions Using HTM
X 1°Xi Linear
Web Content
f , \- (rxi •X j +C)d Polynomial A website is de
K X i, X i .- ( )
locator). A large
. exp -rl X 1 - X ; I
2
RBF
are managed usi
tan1i(rx1.x 1 +c) Sigmoid audio, video, fo
content. The wet
where K(X;.Xi)- 4l{X1)• ~iJ
of these request
that is, the kernel function, represents a dot product of input data points mapped into the and application
higher dimensional feature space by transformation , pages on a webs
Gamma is an adjustable parameter of certain kernel functions.
pages could be t
The· RBF is by far the most popular choice of kernel types used in Support Vector altogether. Simih
Machines. This is mainly because of their localized and fi nite responses across the entire more fresh and i
range of the real x-axis.
Web Structure
c. Mention the 3-steps process of Text Mining. (03 Marks) The Web works t
Ans. page can create
Text mining is a s~miautomated process. Text data needs to be gathered, structured, and
then mined, in a three-step process. Web lends itselft
I. Tho text and documents are first gathered into a corpus and organized. also be analyzed
2. The corpus is then analyzed for structure. The result is a matrix mapping important Web Usage Min·
terms to source documents. As a user clicks a
3. The structured data is then analyzed for word structures, sequences, and frequency. entities in many I
Ter'?•document matrix (TDM): This is the heart of the structuring process. Free the web server pr
Howmg text can be transformed into numeric data, which can then be mined using entities between t
regular data minin techniques. too, would record
data generated th
£stablish the Structure using
Term Document Mine TOM for data stored in ser
Corpus of Text:
Matrix (TOM): Patterns user characteristi •
Gather syndicated data. F
documents, Select a bag of ·Apply data data, arc also gath
clean, prepare for words, compute mining tools like b. Compute the ra
analysis frequencies of classification and
cluster analysis FigQlO(b), Whic
occurrence
OR
a. [ 1plain briefly the three different type .r"" mining. (06 Marks)

( web-.] =c::::::
ector Machines models. These
dsigmoid: Web Content Mining
Using HTML pages
Web Strudlft Mlling:
Using URL links
Web Usqe Mining:
Using visits, clicks, logs
I
Web Content Mining
A website is designed in the form of pages with a distinct URL (univcr5al resource
locator). I\ large website may contain thousands of pages. Those pages and their content
arc managed using content management systems. Every page can have text, graphics,
audio, video, forms, applications, and more kinds of content, including user-generated
content. The websites make a record of all requests received for its page/URLs. The log
of these requests could be analyzed to gauge the popularity of those pages. The textual
and application content could be analyzed for its usage by visits to the website. The
ut data points mapped into the pages on a website themselves could be analyzed for quality of content. Th«: unwanted
pages could be transformed with different content and style, or they may be deleted
unc tions. altogether. Similarly, more resources could be assigned to keep the more popular pages
types used in Support Vector
more fresh and inviting.
nite responses across the entire Web Structure Mining
The Web works through a system of hyperlinks using the hypertext protocol (http). Any
(03 Marks) page can create a link to any other page. The intertwined or self-referral nature of the
to be gathered, structured, and Web lends itself to some unique analytical algorithms. The structure of web pages could
also be analyzed to examine the structure ofhyperlinks among pages.
and organized. Web Usage Mining
·s a matrix mapping important As a user clicks anywhere on a web page or application, the action is recorded by many
entities in many locations. The browser at the client machine will record the click, and
s, sequences, and frequency. the web server providing the content would also log onto the pages-served activity. The
the structuring process. Free entities between the client and the server, such as the router, proxy server, or ad server,
ich can then be mined using too, would record that click. The goal of web usage is 10 el\tract useful information from
data generated through web page visits and transactions. The activity data comes from
data stored in server access logs, referrer logs, agent logs, and client-side cookies. The
MlneTDMfor user characteristics and usage profiles arc also gathered directly, or indirectly, through
Patterns syndicated data. Further, metadata, such,as page attributes, content attributes, and usage
data, are also gathered.
-Apply data
mining tools like b. Compute the rank v11lues for the Nodes for the following network shown in
classificat ion and FigQJ0(b), Which is the Highest ranked node. Solve the snme with e ight ilcn11lons.
cluster analvsl~ (10 Marks)

23
VIII Set111 (CSE/ ISE)

Ans.
Ra Rb Re Rd
Ra 0 0.50 0 1.00
Rb 0.50 0 0 0
Re 0.50 0.50 0 0
Rd 0 0 1.00 0
Variable Initial Value
Ra 0.250
Rb 0.250 .
Re 0.250
Rd 0.250

u C C C g C C C C
:;:;
.!! ~1 -~ -
0 0 0
-~ ..... .! .... "!..,. ·s 0 0
·.:i
E so
0
-~"
_g
~ ..
NET
V>
]~
~ 1! l! ~ ~ -= ~ ~

Ra 0.250 0.375 0.313 0.344 0.328 0.336 0.332 0.334 0.333


Rb 0.250 0.125 0.188 0.156 0.172 0.164 0.168 0.167 0.167
Re 0.250 0.250 0.250 0.250 0.250 0.250 0.250 0.250 0.250
Rd 0.250 0.250 0.250 0.250 0.250 0.250 0.250 0.250 0.250
The final rank shows that rank of node A 1s the highest at 0.33.
{VIII
As Pe r New VTU Syllabus w .e.f 201 5-16
Choice Based Credit System(CBCS)

Rd
1.00
0
SUNSTAR
0
po 0

C: C:
0
g
0
-~ "' -~ .... -~ 00

6
~
0.332
~
0.334
~
0.333
NETWORK MANAGEMENT
64 0.168 0.167 0.167
50 0.250 0.250 0.250
50 0.250 0.250 0.250
.33.
(VIII SEM. B.E. CSE / ISE)
. SYLLABUS
NETWORK MANAGEMENT Eight
li\S rl'll <.1101('1 OASLD <.RU)IT SYS II M (c:tKS) S( 111 M~)
terrl'C-IIVI FROM1111 A(ADl~11l'YI \1(21Jlt\-W17)
Subjrrl C'odr •~< ~8,; I\ \l:1rk} 2Q
,umbrr of I ntun· lloun/'I\ ,-ck UJ L ,:on \ 1...rl..., ~o Time: 3 hr~
Iota! :-.umhn 11r l.rcturc llour, ~o f,:om 111/Ul'll 113 'iotc: fmwu ""Y r11 ·

MODULE 1
Introduction: AnalO!l' ul kkphun~ Nc11,ork M,10,1!!,· 11cm. IJ,ir., aml ·1ck.iw1111nmi.:,11iu~ Scl\\ork 1~1,:nt,111,-J
compulin!! r"' ironm,nts I Cl' 11'11:i;;c,I ~Cl\\ ,rk, I he lmcrn,·1,rn.l lnlr,m:K <.:c•mm11111,,111un, I r..1..,,,1s.,ndS1,,ml,1nls- I a. What :ire the 1:0:11.
('uminunu;ation t\r..,;I it~~ hm::-.. 1•it..)h 1,,;,,1 I .i: . "mt S-:1, 11.:~'i (. ·''"' ~ 11:-,h•ric:-. ul \!t,..,,.,rk1ng tiOd ~vtina~~mL'nl I I
lmpurlancc uf wpulo"~ . r:;1tcn,1,! D1>c • "' Hc,l11cc I uaJ <111 ~ •.le. ',omc (.\m'.1111111 Nct~vurl._ l'rohkms· l_hallcn~, Ans, The goals of nctwo
of Inform.Ilion kd1 .11h1g.1 Manag.er,, Nc11111rl. :-..1a11,1)!c111cn1 Ciuab. Org.an11,111u11. :,nil hmc11011,- G1>al ol Ncl\\Ork
IT services ,1 ilh a q
\lanag.cmc111. Nci" rk Pr,,· i,mmn:!.. t,;,1,H1rk Or,•r.1111111, anll the MX ·. 1'cl\\1•rk ln_,1allat10n and M.unlcnan,,:
~el\\Url. and S: ,, >-l.111.1g,·11,1,1 :S:c11,u • M~ •.:mcnl S"1"111 pl.ufor111. ( urrcnt StalU~ ;uni I ulur.: ol :S.,l\\ui\ The function!> of nc
Mana~cm.:111.
'VIODULE 2
Basic ~·u1111da1i1>1 , ,!,ml,. \h•1'cl~. an,! I ·11:,ll,I!(, ,·1111111. M,magcm.:111 Stan,lards.
11.111,kl. Oriµini1.o•• '.' I. lnli.,, ,: t· ,n \h•lcl \fan ,;,,ment Information lr.:cs. \!anag.,•,I Ob1cc1 l'cr-1-.:,ti,,
Communkalion M• ,1,, \Sl'-:. •-".:mt •I ,. S"n'l\ils. :in<l Con,cmion,. Objcci, .ind Data l~pc,. ObJ,'ct ,\.,mes. Aa
fo,:imph: ol'ASN I from hO li'P I: l.n,ud1n:-, '.'itr- e111rc. M:,cro, ! 1111,11,,nal \10\lcl
MODULE 3 Nc1,,ork
SNMl'vl Ncl\,01k :-..1anag.:rn~11t: M,111agcll N,·1,,ork I he lli,tN) ol SNl\11' l\1anagcmenl. lntcrni:l Org,ini7,lllllll5 Prm isioning
and \landards. ln1e111.:1 D,,cum-·111,. I he S!\i\ II' \10\lcl. I he O~;mi1.i1ion Mo1.kl. S~ ,tc:m O, en ic1,. The lnli11m:11iua
Model - lntrodu,1ion. I h,· S1111 111r • of M.1n,1;_,cn11·n1 lnform~ll,•n..Mana11,cd Obje,1,. Mana11,cmen1 lnfurmalion l\,1-.: Planning
'I he SNMP Communicalinn Mn,td I'll, S'-~il· h:hi1,e111:-c. /\dministrati,c: M01lc:I, SNMP Spc,ificalion,. SN\11' Ocsign
Operations. SNMI' 1\1111 (iroup. Func111•nal Mo,kl SNl\11' M.m.1!(cmcn1 - RMON ; l{cmnh: Monitoring,. Rl\l(JN Sl\11
and MIB. Rl\10NII- RMO'NI 'lc:-.111.il (\,mcnti,,ns. RM(•:-.11 (irm1p, and F11nc11ons. Relationship lkh,c:IXI Control
and O~ta !'ables. UMON I ('omnwn and l:1hcrn,·1 l,ruup,. R\10N Iokcn Rmp F,t<!nsion C.rc111,,,. ltMO;s:2 rtw
!{MON~ Man.,gcmem Information 11.,sc. !l\lO:-.:? Confonnanec Sp,:~1lka11011,.
MODULE4
Broadband A'-css Nct"or\..,. flro.1db,m.f ,\cccs, l'cchnnlv!l). I IF(.' I' Tedmolot!): l'hc 13roadband I .AN. The Cable
Modern. Th.: C:ahlc Mo1lcm Tcrn1in.11ion ~: ,1e111. The Ill'(.' 1'1:inl. I he 1(1' Spectrum forC,hle fllodem: 1)31n Ch er Cable.
Rcfcn:m..-c Arehi1cc1ure: I IFC Managcmcn• - Cat-le Modem and lM IS \olanagernenl. I IF(' l.inl. M,111;1gcmcnt. Ill
Spcctnim Management OSI. li:ehnulog~ ,\,~ mmeirie D1g11.1I Suh,cnl>,·r I.in.: k,hnolug~ -Role ul the /\D',f \cc.:.-
Ne1,\ork in an 01crall Network, /\D!'it. t\rchitccturc. ,\DSI { 'h,umcling ~chcmc,, 1\DSL lncod111g Scheme,. i\l>SL
Mnnngcmenl -/\DSI Network Manag.cmenl Elcmen1'. /\DSI. ( unligura1ion 1\111nagcmcn1. 1\0SI F,,ult Manni:cmcnt.
/\OSI. l'crformanc.: Managcmenl. SNML'-llaseli ADS!. I.inc MIil. MIU lmcsra1111n "ilh lmcrl'acc, Group, in Mlll-2.
ADS!. (;011figur;1tion Profiles
MODULES
Network Management \pplicmion,· Conli-Jura1mn Man.,gcrncnt- Nct\\orl.: Provisioning. lmcmol) \fanag.cmcnt.
Network 'lopolog). Fault Managcnicn1-l'auh Dc1cct1vn. Faull I o,,llion an,I holation 24 techniques. Pcrform<1ncc
Mnnng.cmcnl - Perfonnance Metric,. Data Moniluring. l'rnhlc111 bolmion. l'crl<mnanec Sta1i,1ic,. 1:.,ent Corrdalion
leehmques - Ruk-lla,cd Rea,unmg. :'1111,kl-ll:,,cd Re;iwning. C'.1'ellased Re.1,unmg. Codd,uok corrcl;nion \1<'<ld.
Stale 1 ran.,ition Graph \.10\lcl l'inite State \fachinc l\foJcl. Securi1~ M,tn;igcm,nt - l'olic,~, a11d Procedur.:s. Sccuril)
Rn:a chcs ,md the Hc,nurlcs N~cdc,111, l're<ent lhcm. I irc11,1II,. C'l")ptoi:r.1flh,. /\uthentieation ;,nd Au1hori1.nion
ClicnVScl'\cr /\uthen11c;llion S)slcm,. Mcss;iec, lr.rn,li:r Scc11ri&). Pro1cc11u~ of Network., from Virns A11:1ck~
i\~cuunttnl? Mana1,1cmcn1. R~pon I\ lnn11gcme111, l'ol,c)- Based M,urngcmcnl. Ser\'i,c Le, cl Manugcmcnt.
Eighth Semester B.E. Degree Examinntion,
C BCS - Model Q uestion P:1pcr - 1
211
NETWORK M ANAGEMENT
80
Time: 3 hrs. M11x. !\,, rks: 80
Nole : Am over 1111y Fl V/:,,f11ll q11eJtiom, 1e/efti11g ONEf ull question from l!(lc/t module.
OJ ..-,:

Module - 1
I a. What 11rc the goals, org1111iz11tio11 and function of Network m:1n11gcm,·11t?
(I~ Marks)
Ans. The goals of network management is to en~urc 1ha1 users of a network a1, provided
IT services with a qualities of services.
The runc1ions of network management arc
Ncl\,ork
Management
k Nc1work MJn.,J~
agc,1 Object Pcrs1•~d
.t Types. Obj.:cl N,1111.:s.

Network Network Network


Provisioning operations maintenance
cnl. Intern.:! Org,,niz
Planning fault managl·mcnl system Faull management
Overview. The lnfo11
nag..:mcnl lnforn1a1iun rc~toration Trouble ticket
Design
NMP Specifirations. S Configuration management administrator
1.: Moni1onng.. II.MON Network installation
•lationship \:lc1w~l)l1 C Security management
ion Group,. ltMON2 -
Accounting management
Recovery management
Bro.idbaml LAN. The Invcnlory 111.inageml!nt
bl.: Mml.:m: Data()\ ~r
MIT I.ink Manag~m.:111. Data gathering/ analysis
,, - Role of1hc AOSI ,\
·i. Encoding Sd1.:111c,:-'
' nl. ADSL Fnult Mann!_t.0
h ln\erfac.:s Ciro11ps in M

ning. tnv<0nlol) Manage


24 T10chniqucs. P.:rror
•.: Stn1islic~. i-:,cnt Corre
. Cmlehook corrd:1t1011 M
lid.:, and l'rm:cdur.:s. Sc'\:
1t~n1ica1ion anti /\uthoriz
.:tworks from Virus /\ti
"' d Management.

3
VIII Se,m, ( CS£/ISE) Network, M ~ C13CS · Mode.l.,Q~

Power hits could re


Network The network has a
hit could change co
A performance pro
ecision network managcm
New
'onfiguration dat or application proc
technology With the .:vcr incr·
Performm:cc and lT Restoration
security violation ·
....---trnlfa data
OperatlOll I and M group 2. a. With diagra m, ex
l~ngincering
group network Group NOC network Ans. The telephone ne
& pl,111nin•~ Network install.1tion an connt:ction is gud.
desi •n The key is the m
teh:phone net" urk

Regional
Netwo,k :-nivi<Pn;,: .. ,rn, isls of nt'lwo,k planning and design is the responsibilities centre class
of engiuc..:nng ~rnup. t: 0 ha!. defined fiw OSI network mam1gement application switch
which are fault. confi~t:~,llion. performance. securities and account management.
Trouble admini~tration 1•, th1· administrati,c i-1 of fault mnnagcm..:nt and is used
to track problem in the nel\,ork. When e\et there is a service failure. it is NOC's Sccrional
responsibility to restore service as soon as possible. centre class
Data need tu be gathered by NOC and kept updalcd in a timely foshiun in order to 2 switch
perform some of the nbu, c h,nctiom,. Security management can cover a vel) broad
range of securit). Primary
b. Explain common netwo rk 1>roblcm (08 Murks) ccn1rc cl,1ss
Ans. The most common and serious problems of network's arc connectivity failures, 3 switch
which are in category of fault management. F•lt is the mean to failure in accessing
network and systems by user. The network failure is caused more often by a node
Toll centre
failure than by failure of passive links. class 4
The node failures are limited lo specific interface failures. When this happens, switch
all down streams systems from that interface are in accessible. Such failures arc
associated with network interface card which needs replacement.
The node fai lures manifest :ts connectivity railures to the user. These are networking End off
tools available to localize the fault. Another cause of network connectivity failure is class 5
procedural. Network connectivity is based on IP address, which is a logical address switch
assigned by the network administrator. Mistakes are mnde in assigning duplicate IP
address. ~
Voice
A host or s~stem !nlerfocc problem in a shared medium can bring entire segment
down. The 1ntcrm 111cnt problems could also occur as a result of traffic overload
which causes packets to be lost, Performance monitoring tools can be useful i~
tracking problems.

4
C'BCS - MoctevQ~POtpe-r - l

Power hits could reset network component configuration. causing network failure.
The network has a permanent configuration and a dynamic configuration and a power
~it could change configuration.
A performance problem could manifest as a network delay and is an annoyance to
network management. They need to separate the network delay from the application
or application processes delay and then rectifies the situation.
onfiguration dat With the ever increasing. si.:e of the n..:twork's and the connectivity 10 the internet.
TT Restoration security violation is a frequent problem.
OR
2. a. With diagram, CXJllain airnlogy of telephone network management.
Ans. The telephone network is reliable and dependable and the quality and speed of
connection is gud.
The key is the management and operations of the network. The architecture of
telephone network is hierarchical.
To other
Regional centre
Regional Regional Sectional centre
ign is the responsi~i lit_ies
centre class 1------~ centre class Primary centre
Toll centre
switch I Switch End ofllccs
management applicat1on
nd account managcmenL To other
management an<l is used Sectional Sectional Primary centre
centre class Toll centre
rvice failure. it is NOC's centre class End offices
2 switch
timely fashion in order to
ent can cover a very broad To other
~m~ ~m~ Clan 4 toll points
centre class ci:ntre class End offices
(08 Marks)
arc connectivity failures,
_ 1---,1,;....--►I 3 switch
.._;..._.,..... _,

lCl\ll to failure in accessing


used more often by a node Toll centre Toll centre
class 4 class 4
switch switch
ilurcs. When this happens,
cessible. Such failures are
cement. End off End off
user. These are networking class 5 class 5
ork connectivity failure is switch switch
• which is a logical address
de in assigning duplicate IP 6
Voice Voice ---•Loop
m can bring entire segment - Direct trunck
a result of traffic overload,
oring tools can be useful in -----•Toll connecting trunk
- T o l l trunk

s
CBCS · Mo-d.citQ
VIII Se,m, ( CSf/ISf)
The message in e
There arc five le, els of network S\\ itches and three types of trunks that connect (PDU 's) "hich co
these switchi:s. A trunk is .1 logical link b/ " two ~witchc~ 1ha1 ""'> lnl\ erse one or PCI comain~ hen
more ph)~ical links. ('ht: c:nd office the lo,,cst in the: hic:rarchy. is the local switching.
a service provide
office. I he other four le,els of switches (class 4 through class I) arc to switches
that carr) toll call!>. /\ din.:ct tnmk connects two end omces. a toll connecting truak
connects an end ollice to any toll office: and a toll trunk connects any two offices. 3. n.
Operation ~· r 'NI :.ystcms ensure the qualtl} of service in thc tt:lephone network. Ans.
The,· con"t::ntlv monitor ,·nrious parameters of the network. The quality of call is representation o
me;surcd in te~ns of sign.ii to noise (s/\\) ratio. is measured regularly b) n trunk management inti
maintenance !lystem. For n given rt:g.ion. tht:re is a network operatiuns centre (NOC) bm,c to descri
whert: the global status of the netnork is moniton:d. Thc NOC is the nave centre management inf
of telephone network operations. To manage a network remotely, i.e. to monitor information stor
and control the network components from a central location nel\\ork management agent and m:ina1
function need to be considered in building the components of the network. MIB associated
b. Explain OSI lnycrs .mtl services (08 \\forks) manager is dcsi ,
Lnycr No L:l)'Cr ·amc Salient features pro,·ldctl b) layer on all the netwo
onl} its local inti
Transfers to and ga1hcrs from the physical medium raw
The manager has
bit data
Physical base (MIO). M
I landl~ physical and electrical interfaces lo the configured value
transmission medium
and contains inf
Consists of two sub layers. Logical link control (LLC)

2. Data link
and media access control (MAC)
LL~: Formats the data to go on the medium. Perform
~oaj
error control and flow control
MAC: Controls data transfer to and from LAN resolves
conflicts with other data on LAN.
3. Network Form the switching/ routing layer of network
Multiplexes and de- multiplexes message
application
Acts as a transparent layer to application's and thus
4. Transport isolates tht:111 from transpon system layc:r~
Makes and breaks connection for connection oriented
commun1cat1on Net\\ orl.. clemcn
Controls flows of data in both directions. programs, algori
Establishes and clear sessions for applications and thus contact person a
5. Session
minimizes loss ol data during last data exchange.
b. Explain the tcr
Provides a set of standard protocols so that the display Ans. ASN I is based o
6. Presentation would be transparent to systcm of the applicallon of back us nonna
Data encryption and deeryption <name>:: <de 0

Provides application specific protocols for each Where the notat


7. Application
application and each transport protocol system. ..defined as...

6
kt M~
The message in each la)er is contained in message units called protocol data units
of trunks that connect (POU's) ~hich consists oft,\O pan~- protocol control info (PCI) ancl user data (lJD).
al ma) 1n1, erse one or PC\ contains header 111fo about the la~er. lJI) contain:-. the data thM the layer acting a~
y. is the local sv,it~hmg a sen ice pro, idcr receive:. from or transmitter 10 the upper layer / scr, ice user ln)er.
lass I) are to switches Module - 2
a toll connecting trn;i\i. 3. n. Exphiin infornrntion model (08 Marks)
ccts any two offices. Ans. Information model is concerned with structure and storage of information. The
the tl!lcphone networ~. representation of objects and information relevant lo their management form the
. The qualil~ of cnll ,s management information model. The information model specifics the information
eel regularl) by a trunk base to describe manai;ed objects and their rclatiom,hips. l'hc :,tructurc ol
operations centre l OC) management information (SMI) dcfinl!s the syntax and Sl!manticc, of management
NOC is the nave cc1~tre information stored in the management information b,1se (MIR). MIB is u<,cd b> hoth
motely. i.e. to monitor agent and management process to store and exchange managl!mcnt information.
ion net\\ork management MIB associated with an ngcnt is called ag.1::nt MIB and the MIB associated\\ ilh a
of the net\\ ork. manager is designated as the manager MIB. A manager M1B consists of information
{08 Marks) on all the net\\ork components that it managcc;, where as agent MIB needs to know
ovillctl by luyer only its local information. its MIB view.
The manager has both the management database ( MDB) and management information
, the ph) sical medium raw
ba:,c (MIB). MD0 is II r~·nl databn5\: and \:Ontains 111c::asurcd or acl111inistrati\l:ly
configured value oft:lemc11t~ of network. On the other hand, MIB.is a\ irlual database
and contains infornrntion necessary for proces~ to e.,change information.
Logical linl-. control (Ll.C)
AC)
~0€)-----1 Manager 1----~0~
, ,o on the medium. Perform
rol ·
er to and from LAN resolves
l LAN.
ng. layer of nch, ork
lllhiplexes message
MOO= Management database
er 10 application's and thus Mll3= Mnnagement information ba~e
Manuged objects
rt system hi) ers
tion for connection oriented
IIJ!llliit! =- Agent process
Network clements: Hubs, bridges, roulers, transmission facilities sollware pro..:ess
,oth directions. programs, algorithms, protocol functions. dntabascs administrative informntio11,
ions for applications and thus contact person account number.
ing last data exchange. b. Explnin the terminology, symbols i1nd con,•cntion of ASN I (08 Marks)
protocob so t!lat the display Ans. ASN I is based on Backer S) stem and uses th.: formal S) stem languag.t: and gr:unmar
stem of the arplication ofbackus normal form (BNF) which looks like
ption <name>:: = <definition>
cific protocols for each Where the notation <entity> denotes nn <entity> and the symbol ::= represents
port protocol system. "defi lll!d as·•.

7
VIII Se.m; (CSE(ISE) Networ-~ M ~

We can define an entity <digit> in the follo1\ing way


<digit>::= 011121314\516\71819
Where the symbol ·• 1•• represents ·•or•· we can also define an operation entity <OP>
in the fall way.
The definition on the right side are called primitives. An entity number can be
constructed from the primitive, <digit>
<Number> . · digit> \ <digit> <number>
For ex. number 9 is the digit 9.
Simple arithmetic expression <SAF> from the primitives and construct <number>
<SAE>::- <number> I <SAE> I <SAE> <OP> <SAE>
TI1e format of each line is defined as a production or assignment.
Consider an en with foll two assignments
<Boolean type> :: ;;:: boolean There arc amplifiers
<Boolean value>::;;:: True agent!>. rile outside
Entities that are all in capital letters. such as TRUE and FALSE arc called keywords. mon,tonng tools buil
Data types arc buih up from primitiH~ data types· INTEGER, REAL, NULL and Alternatives: Cl/01
graphic string. List · SfT and SEQ
OR Replication · SET OI
ASN - I S) mbols
4. a. Explain three tier orgnni1.ation model (08 Marks)
Ans. In the two tier models, ti~ networl.. manager recievcs raw data from agents and
Symbol !\lea
processes them. Network managercont,nuously monitoring the events and calculating Dclin
the information an intermediate agent called RMON (Remote monitoring) is inserted 01,al!
between the managed object and the network manager. The RMON function.
Signe
implemented is a distributed fashion on the nctworl-.. The pure SNMP management
S)-Stem consists of SNMP agents and S MP manager:.. f olio
S MP f: Star,
() Starr :i
..... R:ing
ASN - I Keywords ,
Kcy,,ord
BEGIN
CIIOICE
Managed .._1----...1
objects DEFINITIONS
END
The application is in legacy systems management, telt:communication management EXPORTS
network. IDENTIFIER
IMPORTS D
INTEGER /ti.
NULL /ti.

8
an operation entity <OP>

SN MP
managed
nd construct <number>
objects
•nmcnt.
There are amplifiers on tlw outside cab Ii: plant. \\ hich do not have built Ill SNMP
agents. The outside c:ibk pl:int uses some of e11.is1ing cable lechnulog) and has
LSE arc called keywords. monitoring tools built into it.
EGER,REAL,NULL~d Alternatives : CHOICE
List: SET:md SEQUENCE
Replication : SET or- and SEQUENCE OF
ASN - I symbols
(08 Marks) Symbol Meaning
raw data from agents and
Dc!incd as. or assignment
ng the events and calculating
mote monitoring) is inserted Or, alternatives, options of a list signed number
ger. The RMON function. Signed number
e pure SNMP management Following the symbol arc comments
f I
I! end ofa list
S1.111 nnd
() Start and end ora sub type
Range
ASN - I Keywords .
Key\\ord Arief description
BEGIN Start ofASN - I module
CHOICE List of alternative
DEr-INITIONS Definition of a data type or managed object
END End ofASN - I module
ommunication management EXPORTS Data types can be exported to other module:.
IDENTIFIER A sequence of non negnti, e numbi.:rs
IMPORTS Data types defined in external modules
INTEGER Any negative or non m:gative integer
NULL A pl,1cc holder

9
VIII S0t111 (CSE(ISE)

SNMP SNMP SNMP SNMP


manager agent agent Manager S. a. Explain RMONI
Ans. Filter group is a c
performing n boo
Fig bcl<m sho,\s g
a •cnt
Network
clement
RMON acts as network agent. An Sfl.~Pmanagemcnt system can behave as an agent
as well ns mrnngcr, RM< lN. ,-..hile collec ng data from network objects, perform
some of function of" llt'I\\Ork manager.
b. With dbgra,,1, l'Xplain l'ncap.,nlatecl S MP manager (08 Marks)
Ans. The p..:1.. ,,roc..:s~.:., ,,1t1 .. 11, 1· 1~1111:nt S MP and thus support SNMP application
entitic~ ·'•l ,;nlld p:ut,,c.:, 1 l·:11itic~.
Data

Appliration ommunicatio,
Aprlication PDU Version SNMPPDU Data
header units N/w
Gathi:rin ,
Transpon PDU IUDP Meader! SNMPPDU

Network POU
I IP IMeader Transport POU

Data link l'DU IDCC Ileader! Nct,,ork POU


Communication among protocol enhties is accomplished using messages
encapsulated in UDP data. An SNMP message consists of a version identifier an
SNMP. Community name, protocol data unit (PDU). Fig shown the encapsulated The data gathering 1
SNMP message. The version and commwiities name are added to the data POU and network comprising
along with application header the enti~ message is ancd to the transport layers as sets if functions. Th
SNMP PDU. The host and com
Thi: network or datn link lnycr (DLC) header is added before the frame is transmitted hosls. rankinu of tr.
on to the physical medium. ofstatistical llatn 11 ~
An SNMP protocol 1.mlit)' 1s received on pon 16 1 on the host except for trap, which in cthcrnct ~tatistics
ii. received on port 162. The mnn length of the protocol is 484 bytes. The five PDU ·s rhc comrol table co
are : Get rclJuci.t f>DlJ, Get- next reqlleSI- POU, Get-response- POU, Set-requesl It is also used by 10
-PDU. mid trah • J>DU. The fi ltcrcd output
A managed object is n scalar variable and is simply called a variable. Associated • network m1111ager.
with the variable is its value. The pa.rmg ofthc variable and value is called variable The output of the fi
bimli11g llf 111rb1mJ. analysis by network
10
~1,+,ir C..CNA ~1\1\W'
Module-3
~ •
a. E'.\.plain R:\1O · I Grows an d ,unctions . . (OM Marks)
Ans. Filter group is a cascade of two filters. The packet filters incoming r ,ickets by
performing a boolean and/or XOR with a mark specified.
Fig below shows groups and functions

can behave as an agent


twork objects, perfonn
~
Token Rang
Ucrnd
Token Ring
I listory
Hist<1ry
ContrC'I
·-
-
(08 Marks)
port SNMP application

Data ]
- fnh:rnal
S1a1is1ics
Enternal I li!,tory
Ifo,tol)
Control -
~unicatio1
- Data
1-lOSl & CONVERSATION STATISTICS
Host I-lost top N Matrix
r+
Network
N/w ...,
nits Gathering Staistics statistics St:itistic~ manager
-

~lPOU
~
P:illet
ti ltcring
Channel
filh:ring .
Packet
1:apturc
~

POU ] .'\lam1 E, cnt


plished usin~ rn:ssages Gcnerution IGcncration
of a version identifier an The data gathering modules are LAN probes gather data from the re,,., 'el~ monitored
g shown the encapsulated network comprising ctl11:rnct and token ring l,AN. The data serve~ :11, ;nputs to fi ve
dded to the data POU and sets if functions. Three of those compraise of monitoring traffic statistics.
to the transport layers as The host and con, crsallon statistics group deals "irh traffic data nssociated ,, ith
hosts, ranking of traffic for top N hosts and conversation bch,ecn hosts rlu,: group
of sinlist ical dala associated with ethcrnet I .AN is addressed by groups and f 111ctions
in ethernel statistics box.
host except for trap. which The control table controls the various data need to bl:! wllet:tt!d hy dilTcrent lll:tworks.
484 bytes. The five rou·s It is also used by lllken ring modules in lukcn ring statistics box.
sponse- PDU, Set-request
The filtered outputs may generate either alarms or e,ents. Thcse·are rcforte<l to
neh, orl-. manager.
led II vnriable. Associated The output of the liller group could be stored in pallet l'aprurc module for fu11hcr
nd value is called variable 11nalysis by network manager.

11
VIII Se-wv (CSE/ISE) Netwo-vlv i\-1 ~

b. Explain RMON2 management infornrnlion base (08 Marks) b. Explain RMON I


Ans. RMON 2 MIB is arranged in ten groups. Ans. Two new data types
The protocol directory group identihcs 1he protocols that the probe can monitor entry status. A netw4
probe capability can be altered by recontagured by protocol Dist table. The protocols be permitted to ere,
rnnge from the dnta link control ln)er ~r- :nc application layer and are identified by identification is mac
columnas object on the uniqui: protocol ID. l'r<,tO\.ol Dist cont1ol table is configured entry status is used
according to dala to be colli:cted and protocol di~tastl·~ table store~ tht' data collech::d. man ipu lat ion of co11
The address map group is similar ll, tht mklrt·,, transh1tion~ tnbk b in the M/\C The owner string is
address to m:twork un.i:nch interface. It l•,1-, t\\o tabks for control and data The information context
network hycr host group measures the traffi1,; set I frc,m nnd to cnch network address IP address managem
represt·111ing carh lwst dis1.uv.:rccl by probe. The neh\ ork layer mat rill. group provides number.
information oa the cunvers;ition between pa~:s of hosts in both directions. Entry status data t) p
The appl cation layer matrix group provides info on the com ersation between pair i) Valid ii) Create r
of ho•,1~- in huth dirt•ct1on~ The application layer functions are divided into the Session En
appli, ,cit ,lit lay.:r 110~1 gN.tl-' :..nlt application layer mmrix group. . Valid
Alnrn' .111d his;01. inf,,111''llio11 have been combined into the user history collect1on
gro1. J ,{MC>/'. ~ It ha~ .,,uhiple e,inrml and data tables. Data objects arc collected
in bu..,l. , 0 rc•ups Lael, 1>:t.ekct gwup cv1lain to a l'vllB object, and the elements in tht: Create request
group arc th,: i1ht,mcc" ,,;· M113 object. Users can specify the data ol be collc.;:ted by Under creation
cnteri11;.: dula in l,is101y ._:untrvl table. Ench ru" in the former specifics the number Invalid
of btKkcts to b~ ..llorah'<.I for t:ach object and the latter contains rows of instance of
the MIR object. Under the valid stale ,
OR rncn:,ure tlw numb;.:r 4
6. 11. Explain SNMP opcrntion (08 Marks) a row. 13ased on the
Ans. SNMP operaliom, comµ1 isc get and set messages from manager lo agent and gt:t trap and the resource clain
mcSS!ll!CS from accnt t<> m11nal!er. information docs not .
riicneric S~cdic Var Var Var Var process of creation m
PDU Entr·r Agent Time manager and the agen
trap trap bind I bind I ..... bind n bind i1
type prise address stamp valid stale.
lypt> type name value name value
Get RUQ - PDU It starts with the get req operation using a get RUQ PDU from
a manager process to an agent process and get response from the agent with a get 7. a. Explain cnble moden
response PDU . Ans. The cable modern mo 4
Get ncx1 RUQ PDlJ operation is very !>imilar to a get request except the requested equipment to the freq
record is the one next to lhc OUJECT IDENTIFIER specified in request. The mnnager commonly used arc qu
process then issues a gel nc>..t request PDU with the OBJECT IDENTIFIER. modulation (QAM).
The system group all the objects are single valued scalar objects the increasing order capabilities amplitude i
of entity used in SNMP operation in an lcni co-graphic order. shiA keying (PSK).
Get next request has sc,crnl a(l\anlages first we don't need 10 know OBJECT. A digital signa l from a
IDENTIFIER of nc1.t entity. Knowing 1he current OBJECT IDENTIFIER. we can this case a cable muden
rctriev1: ncxr one. Nex1 in the t.:ase of an aggregate object. the number of row is (RF) carrier. The modi
~hangi1~g d) nam ical ly. Another ad, anlage of the get next request is that we can use frequen cies shown as c
11 lo build a MIO lrcc b) repeating the request from any node to any node.

12
I . RI\ION I tcxtu11l rnA,er,.ahon . (08 M:trks) d
b. E,p nm . . . . I RMON I tell.tun I convection arc 0\\ner ~,ring an
Two nc" data types dehned is t 1.:h £\ >rk ,na11agemen1 S' stem '"hich could
at the probt: can monitor entrv status. A ncl\\Or·k has '· mon: t an one ne Vt '
· J. • t ble' rhc owner
/I D:st table. Thu protouols - create u<,e irnd delete the control paramelers Ill ,1 a .
[a}cr nnd nn: ide1111hed by be
. pcrm,tted
'fi to . ,,f. t h l' controI 1·•• Il le d~fincd
· lHl<l. prart ~ by owner string data lypc.. lheh
rden11 u,111011 1s • • " · fl.
entn status is used !o rcsohc con acts t11.11 m1g11 . • . arise manHl!t:ment
~
!>ystem 111 t e
contiol table is configured i

le ~tor.:,; the data c,>llcctcd.


mai;ipulation of co'.1trol
1 ~:11;,le~. .111 NVT ASCII characlers set as displa) strmg. the
tioib table b :n the "1 ·\C The owner str111g s spec, ie ' . . , • bo t ti , O\\ ncr ~uch as
for control and data l'he information contc~t of U\\ller \Iring contains rnlormnllon a u l~ .
d to each network ad<lre~~ Ip address mnnagcment station name. nen,ork mnnagcr!> name locatron or telephone
ycr matri.\ group pro, 1d1::~ number .
both dir~ctions. Fntr) slatu!> data type can exist in one of lour !>lat\rs
con, crr,ation bctwf.'cn pair 1) Va 1·10· "')
II Crea te req uest iii) Under creallon i\) Invalid
tions are di, ided into the Session Enumcrntion Description
group. . Valid I
~ the user h istor: collec11on ROW c.,ists anti is active it is fully configured
I. Data ol~1t·cts arc co!l..uted and operational.
'cc.:t, and the eleme111~ in the Creme I cq11us1 2 Create n nc,, ro,, b) creating 1his object
the data ot be collc~rcd by Under cn:ation 3 Ro\\ is not fu lfy active
rmcr specifies the number Invalid •I Oelete lhl! row by disassociate the mapping of
ontains rows of im,tancc of this cntl')
..
Under the, alrd stale condrtron the 111stantral1on or ro11 of the fable rs opera11onaf and
mcasurt: the number of input octets in IF !_!roup. The invalid slate is used to delete
(08 i\larks) a row. Based 011 the implcmenlatiun used. the row Illa) be deleted immediately
anager to agent nnd get trap and tht' resource claimed or i1 may deleted 111 batch mode later. The dcsircd row of
information does not already e.,ist, the management systems can create a row. The
tr Var Var process of creation may involve more than one exchange of PDU's between the
I ..... bind n bind n manager and the ag.:nt atlcr the creation process h11~ been completed it is set to 1hr:
UC name value valid slate.
using a get RUQ POU from Module- 4
sc from the agent with a get
7, a . Explaiu cal,le modern (08 Marks)
request except the requested Ans. The cable modcrn modulc11cs and demodulates lht: digital signal from the customer
•ificd in request. Tht: manager equipment to the frequency ~ignal carried on cable. Modulation techniques most
JECT IDENTIFIER. commonly used arc quadrant phase shift keying (QPSK) and quadrature amplitude
r objects the increasing order modulation (QAM). Three difforcnt modulation techniques support different
order. capabilities amplitude shit) keying (ASK) frequencies shill keying (FSK) and phase
shifl ke} ing (PSK)
n't need to know OBJECT.
JECT IDENTIFIER. we can A digital sign;il li-0111 n compu1er is convened to a nmrlol.! si!_!nnl b the modern in
bject, the number of ro,, is this case a cablf.' modern. The converted analog signals modulators a ;adio frequency
:1.1 request is that we cnn use (RF) carrier. Th1: modulated signal occupics a bnnd of frequencies around carries
frequencies shown as channel bnnd width.
node lo any node.

13
VIII Se-wtt (CSE(ISf) Networ~/vl~ C13CS - /vfode,l; Q

Digital
Modern
Modul:ited annlog
Cnrrier
Modern I Digital
. ~'
The single lim: di
duplex COlllllHllllC
The any metric di
I
0 n_n_n subscriber line (v1
~l-reque~1c~me signal has a large
Time
Channel upstream signal i
band" ilith difference betwee
Bit rate is the number of bits per second traversing the medium. The band rate is the shorter Iines than
numb1.:r ol :.ignal units per second. The bit rntc equals the hand rate times the number
of bits per symbol. . 8. a. Explain RF spcct
The channel bandwidth and the data rate defined on the rate at which the signal unit Ans. /\n asymmetric c
and on the type of modulation. In QPSK modulation four levels (00, 01, I 0, and allocation of band
11) arc represented four phase states (0°, 90°, 2101) and 180°) phase shirt keying is The t I ical s ectn
limited by the difficulties of detecting small phase shifts , but it can be combined
Upstream Gu
with amplitude modulation to increase the number of levels of a signal. The number
(Forward) 42-
of po~sible levels is the number of PSK le\els times the AM four levels. The
5-42 MHz
downstream signal is at a higher band and carries much more information than the
upstream signal. The downstream signal ha!> higher information capacity. The digital
over cable service interface specification (DOCS ICS) standard de, eloped by MCNS
in the indusiry standafd.
b. Explain DSL technology (08 Marl.s)
Ans. The main motivating factor for employing digital subscribes line (DSL) for access
technology in multimedia services is the pre existence of loop facilities to most
residences. The information capacities of 3000l lz analog noice channel with a 30-
dB signal to noise ratio is 30000 bits per second an unloaded twisted pair of copper
wires from a central office to a residence can carry a digital a digital Tl/OSI signal
at 1.544 mbps. a 0S2 signal at 6.312 mbps and STS. I signal at S 1.840 mbps thus The upstream or re
Shannon fundamental limitations of data rate prevalent in analog modern can be to 42MHz. The do,
overcome by direct transmission. guard band from 42
These distances can be increased for analog telephony if loaded cables that downstream band c
compensate for loss and dispersion arc u~ecl. They cannot support digital subscribes with TV requiremc,
lines because the loaded coils alternate high frequencies. The length limitations of offered at a data rat
copper cable is practically eliminated by 1h1.: 11s1.: offibcr optic configuration. Special services have band,
digital subscriber I inc multiplexing needs 10 be done at the termination of fiber optic Cable modern are cl
line. downstream channc
The basic DSL architecture consist of an unloaded pair or pairs of wire connected cable modern could
between a trans receiver unit at a central office and a trans receiver unit ;it a customer channels to improve
premises. This trans receiver multiplexes and demultiplexes voice and d.ata converts capabilities and corr
the signal to the format suitable for lra115111ission on the DSL link.
b. Explain ADSL chan
A high data rate digital subscriber line (HSDL) operates at a T or £ data rate in
Ans. There are two aspect
a duplex mode with l\\o pairs of wires. The duplex mode with ~airs ~f wires. The
transpor1 bare chann
duplex mode is defined as two way communication \\ ith the same speed in both
as bearer channels a
directions.
14
The single line digital subscriber line (SDSL) is the saint: as NSDL e::1.cept the 2 way
Di ital
duplex communication occur!> over a smgle twisted pair.
The any metric digital subscriber line (Al)SI ) and the very high data rate digital
subscriber line (VDSL) both operate any metrically. As in HFC. the down stream
signal has a larger bandwidth and is at the high end of the spectrum, where as the
upstream siglllll is at lower end of the spectrum and has a lower bandwidth. The
difference between ADSL and VDSL is that VDSL operates at higher data rates over
sho11er lines than ADSL.
um. The band rate is the
nd rate times the number OR
8. a. Explain RF spectrum for cable modern (08 Marks)
al which the signal unit Ans. An asymmetric configuration is required to achieve two way communication and
r levels (00. 01, 10, and allocation of bandwidth is the two directions is different based on the type of service.
on) phase shift keyi1~i; is Tl1c tvo1ca
. I SJJcctrum a IIocatton
. extends on Iv lo 750 Ml IL.
but it can be combmed Upstream Guard band Down slre111n (forward) 54- 750 Ml lz
~ of a signal. The number (forward) 42-54 MHz Analog Digi1al data Digital Telephone
he AM four levels. The 5-42 Mill. video 54- scrviccs 550- video 560- 700-750
nore information than the 550 MHz 560 MHz 700 Ml-12 MHz
ation capacity. The digital

~
dard developed b) MCNS

(08 Marks)
Upstream (Reverse)
ibcs line (DSL) for access
of loop facilities 10 most I Digital
5-42 Milz
Digital Tele
I
g noice channel with a 30-
video Data shopping
aded twisted pair of copJ)(f
•ital a digital Tl/OSI signal control services 25-40 MHz

signal at S1.840 mbps thus The upstream or reverse signal is allocated the low end of spectrum from 5MHz
t in analog modern can be to 42MHz. The downstream or forward signal is allocated from 54 to 750MHz. A
guard band from 42 to SO Ml It separates the forward and reverse spectral bands. The
ony if loaded cables ~hat downstream band contains the analog video from 54 to 550 MHz and is compatible
with TY requirements. The digital services providing data service to the home are
01 support digital subscribes
•s. The length limitations of offered at a data rale of upto 30 mbps. Digital data, digital video arid tek'r,honcs
optic configuration. Spcci~I services have bandwidth allocated in both forw,1rd and reverse directions.
the termination of fiber optic Cable modern are designed to tune themselves automatically to the upstream and
downstream channel frequencies upon initial installation. Under noisy conditions.
or pairs of wire connected cable modern could automntically switch to different downstream and upstream
s receiver unit at a customer channels to improve quality of service. Such a feature is called frequency agile -
xcs voice and d_ata converts capabilities and corresponding cable modern the frequency agile cable modern.
DSL link. b. Explain i\DSL c hanne lling, encoding schem es (08 Marks)
s at a T I or EI data rate in Ans. There are two aspects of channels in ADSL access network. The first is tradilional
e with pairs of wires. The transport bare channel as they are defined in ISDN for ADSL transport frames. level
ith the same speed in both as bearer chan_nels arti dtifined in the downstream signal upstream signal operating

15
VIII Sewtt (CSE/IS£)
in a simph:x mode. The a<, bcau:r channels :-m: in 111ultiph:s ( 1.2,3 or 4) of the rate
1.536 mb. rhrec additional l.S duple:-. channel can call signal in both do,, n '>tream
and upstream directions. . . . . Plain text
Second. is how the signal is bulTercd "hilc traversing AOSL lmk. Real ume signals.
such as cudio and real - time 1.:ideo. use a fa~tcring scheme und hence are reft:rred a~
fast chRnncl. .
ADSL mg,m·, ! is dependent on line ccno<ling scheme The two types of cncod111g
schemes uwJ in ADSL line encoding 1.:arricr~ amplitude 111odulation (LAP) and
discrdc multitonc (DMi). Both arc based on the Quadrature amplitude modulation
lQAM). In both cases, basic approach ll> to sep:irate no rs band (0.4 JHz). Crom
P1
Cipher text is
the secret ke)
j
fack technique can be used to separate c,, crlapp1ng band hclwecn upstrenm and Message ,ligc
do,\nStn:am signals. in digil:tl Irani
Although ANSI has recommendcd the us~ of D~H for ADSL, a significant number frame or pae
. f current I) deployed sysh.:111 use of CAP ') ~tcm. In D\t1T. the entire bandwidth of nlso known as
approximately 1-1 MI +2 is used in encoding. channels each of an approximate!~ 4 receiv<:d chec
Kl-lz bane!. Cl') ptograph ic

Mod ule - 5 version is M


9. a. Expla in secret key Cryptoj.\r:1ph), public kc) cryptography and m:urnge tligcst. mcssngc of nr
(10 Marks) of the input.
Ans. Secret key cryptography - ench letter i~ replaced by another letters in the alphabet
received from
(i.e. kc) of n). The sender and the rccciH:r ha"c to agree ahead of time on the turned on by s
secret kc) for successlhl communication. In the same key used for encryption and b. Write short n
decf) pt ion and is called secret key Cl')pto~raphy. The encryption and deer: pt ion Ans. Security polic
modules can be impleme,111.:d in either hardwan:: or software. An intruder can ea:.il) are given acce
decode the preceding cipher tc,t. The~ kc) 1~ the key to the security of manage1.1 \\O An enterpriser,
standard algorithms implement se1.:rct Kl') Cl) ptograph). They an.: data encl) pt ion net\\ ork acccs
standard (DES) and inte1 national dnt.t encl) pt ion algorithm (IDEA). Both cleal with accounts nrc e
b4 bit message blocks and create ~amc size cipher text. DES uses 56 bit key and appl1cntio11 in
IDEA u~es 128 bit key. Both DES and IUE/\ have same principle of encryption. written so that
The bits in the plain text block arc rearranged several times, using a pn.:dctermincd A basic guide
algorithm and the secret 1-ey. A message at i~ longer than the block length is dh ided l. Identify wh
into 64 bit message block~. One of the more populnr ones is the cipher block chaining. 2. Determine,·
(Cl3C) mechod. 3. Determine h
Public 1,cy cryptogrnphy - In prn ate ke~ cryptograph). each pair of users must 4. Implement,
ha,e n secret key. It O'vcrcomes the ditliculties of having too many cryptographic S. Re\ ie1\ the
keys. Public key c1yptography is OS) mmclric with a public key and a private key The assets tha
is used for decryption. Ex: A memllt:rs mail cannot be accessed by ocher members data, documen
because only the administrator with a private key can open the collection door and The classic ch,
access the mail of all members. unintended an
1 he differ Hell-man public kc> nlgorithm ic; che oldest public key algorithm commonly
used algorithm is RSA. It docs both encl) pt ion and decryption a5 well as digital 10. n. W ith tliagrnm
signatures. Bo1h the message ltngth and kc) length are variable. The commonly used Ans. PGP means p
lenglh is S 12 bits.
zimmennan. ti
~,~~ '\)\=~•.•., 'II:
s..l\~f A.. f:.cAM 5"'
b (1.2,3 or 4) or the rate
nal 111 both do,, n stream Pl,111 text
Plain 1c,t . Description .=====;:>
C111:r)pt1on
L link. Real time signal~.
and hence are referred ,1s

1c two types of encoding Public key Pri"ate key


e modulation (LAP) and Cipher text is always the length of the key. ·1 he message is then transmitk· i:, one of
tre ampliwdc modulation the secret key algorithms.
rs band (0.4 JHt). Crom Message ,liges t:- C)clic redund,1nc) du:cl,,, (CRC) i:; a method of detec 1. g errors
d between upstream and in digital transmission. It in\Oh-cs calculating. a check sum based on lla'.a m the
frame or packet at the sending end and transmitting it along with the ,l ·•a CRC
SL, a significant numbt:r also known as checksum is computed flt the receiving end and matched ag 11nst the
• the entire banclwidth of received checksum to en~ure that packet ha~ not been corrupted. It is de1., cd m,ing a
ch of an approi.imatel) 4 cryptogrnphic hash algorithm called mange digest (MD). One of the most common
version is MOS. ~IDS utilit} is used in free BSD. 2" the utilit) takc,. a~ input a
me:;sage of arbitrary length producing output consisting. of a 128 bit mcssag.1: digest
iphy nud ,mrnagc digest. of the input. The generated message digest for the string ,\as baseu ,n the data
(10 Marks) received from standnrd input. The free BSD version also has a tcsl mode that can be
ther letters in the alphabet turned on by specifying ·•n" as a parameter.
rec ahead of time on the
b. Write short notes oo policies und procedures (06 Marks)
v used for encryption and
Ans. Securil} polic) is defined as a formal statement of the rnlt•s by which puoplu who
·ncryption and deer) pti~n arc given access to an organisations technulog.1 and information assets must abide.
are. An intruder can cusil)
/\n enterprise polic) should address both access and security breather. An enterprises
e sccunly of manager. 'I \\O
nelwork access polic) could alkl\\ all emplo) crs full access to the network and thus
. They arc data encl) pl ion
nccounts arc cstabli~hcd for cmplo)cr to have access :Hl) to appropriate hosts and
un (IDEA). Both deal with
application in these hosts. The access polic) g.owrning these accounts s110uld be
. DES 1ises 56 bit key and
written so that all employees arc fully aware of it.
lC principle of cncrypt_ion.
A basic guide for selling up policies and proccdurl!s indudc the following,
,es, using a pn:detcrmmed
I. Identify wlu_1t you arc lr)'ing to protect.
the block length is divided
2. Determine,, hat )'OU arc tr; ing to protect it from
i'> the cipher block chainin~
3. Determine how lil-.ely the threats arc.
4. Implement measures that" ill protect )0llr assets in a cost cffecti\'c manner.
\, each pair of users mu!il
5. Review the process continuously and make improvements if \\caknes, arc found .
.g too many cryptographic
The assets that need lo be protected ~hould be listed, including hard\\ arc. software
blic key and a private key
ecc~!:ed by other 1m:mbers data, documentation, supplies and the people who h:l\'e respon~ibilil) li...r them.
n the collec1ion door and The clas~ic threats arc from unauthorized access to resources and / or information,
unintended and\ or unauthorized disclosure of info and dcnia! or sen 1cc
ic key algorithm commonly OR
I") plionas well a~ digital 10. 11. With dingrnm expluin PGP procl!Ss (08 Marks)
iable. The commonly used Ans. PGP means pretty good privacy is a secure mail package developed b) phil
zimmennan. that is :l\ ailable in publiu domain. fig shm~ n the , .inous modules in
PGP process at originming end.
Sw\•-f-111r f.c111111 Suu,11v 17
VIII Se,m., (CSE(ISE)

Signature
Public Encrypted
l key and
compn:s!.ed
Plai Encryption
text messa •e

SMTP fon
1::,-mail User pta,~ convcrsio
ration
s ·stem IC.\t
Plain S1gnnturc
text gc11cntin11

i Private
1-.cy
Fig below shown enc!

Then.:, c1-sl p, ,1t·css vecur at 'c receivin~ end PGP is a package, it does not r.:i11vent
the " '. It 1s 111; inl:r 11 ~.:d to tn111~n:it e-mail The signature generation module uses
MDJ 'u .:1.1,... rntc a h.,,I, ~·,,de ,,f 111a, ..1ge and encrypts it "ith senders pri\'att: key.
using R:,A algorithm. II iE/\ or t{SA 1i. employed to generate encrypted manage ·1he
encrypll:d mc,.,ngc ts compn:.,~ed "ith ZIP. 1ne si~naturc is concatenated with the
encrypted 11wssa!,e and com cncd to ASCII:> format by using Radix 64 conversion SMTP formal
module to make it co111patihli: with the e-mail system. t:00\((SIOO

PGP 1s similar tu ENCRYP fED PEM but has additional compression capability. The IC\I

main difference between PGP nnd PEM is how the public key is administered.
b. Explain privllC) enlumccd m.til (PEM) (08 M 11rks) DEK :: Data encrypti
Ans. PEM \\as developed h) IETF. It ii; in11:ndcd to provide privacy enhanced mail, using I K = Inter exchange
end to met cryptography bct\\een originator and recipient process. The privacy MIC:: Message inteo
enhanced services provided arc SMIP - Simple ruail
I. Confidentiality The specification pro
2. Authentication exchange key (IK). O
3. Mcssng.: integrity nssurancc is used to cnct)'pt th •
4. Non repudiation of origin. The lK is a long ran
The cryptographic key called {iata encryption key (DEK) can be either a secret key the DEK transmissiQ
or a public key. generated for digital ,
Fig sho"n MIC- clear message MIC- clear process
encl) pied D[K and
In encl)pted PEM
Public key is the bes
message. encrypted
to pan through e-mn
to e-mail system.

18
~

Signature Signature
ncrypte
Encrypted \_ .

and
; ' TEXT
compr~sscd
messa •e +
E-mail
s stern
SMTP format
User plair, conversion
text
MIC

DEK
t
-~
MIC/DEK ~

t
1~11:iil
IK
Fig below shown encrypted PEM process
MIC:
r.11cryp1c<l
on:
it docs not r.:invcnt tnci, pied
mid
n~ration module uses
senders private key. ~ - - - - \•I IC/OrK-----,
I encoded
ll1<>132C

crypted manage. The
oncatenated with the
Radix 64 conversion -
Us~r 11l11in
ICXI
SMTP format
.
~onvcrs1un
MIC
er11cralor

t
IJl:f-
ENCRYP
fEI) l'l, M

t
ession capability. The DEK IK
is administered. DEK = Data encryption key
(08 Mi1rks) IK = Inter exchange l,.ey
t:nhanced mail, using MIC Message integrity code
process. The privacy SMIP = Simple mail transfor protocol.
The specification provides two types of keys. Data encryption key (DEK) and inter
exchange key (IK). DEK is a rnnclorn 11111nbcr generated on a per message basis and
is used to encrypt the message text to generate an MIC. ifone is needed.
The IK is a long range key agreed between sender and receivt:r i.; "~ed to encrypt
the DEK trnnsmission within the message. The message integrity <.Nie (MIC) is
11 be either a secret key generated for dig.ital signature and is included as past of e-mail.
MIC- clear process and is the simplest. The MIC generated is concat, natcd with
encrypted OF.Kand SMIP text and inserted as the text position in the e-mail.
In encrypted PEM process. SMTP text is padded ir necessary and encrypted.
Public key is the best choice because it guarantees the originator ID. The encrypted
message, encrypted MIC and the encrypted DEK arc all encoclcd in printable code
to pan through e-mail system as ordinary text. They are concatenated and then fed
to e-mail system.

19
CBCS - lvtodei Q
Eighth Semester B.E. Degree Examinution, layers in the two sy
with those layers.
CBCS - Model Question Paper - 2
node/ system.
NETWORK MANAGEMENT The message in cac
Time: 3 hrs. Max. Marks: 80 (POU) which consi
!I. Note: Am wer 1111,v FIVE/11/11f1testio11s, !>~/,:ding ONE/11/l 1f11estio11fro111 eac:/r 111otl11le. ,
(UD). PCI contains
the layer. acting as a
Module - 1 service user 1a;cr.
1 a. Explain the basic communication as architecture. (08 Marks)
Ans. Communication between users and applications occurs at various levels. They can b. Give the comp:iriso
communicate :it the application level. highest level of communication architecture.
Each· system can be <lividecl into t,H) hroad sets of commun ication layers. Top set of Ans. The internet model d
layers w nsist of the application layers and the bottom set of transport layers. layers form the suite

[ __
· _user A I
Peer - protocol interface I User
2
I End user applrcatio

Presentation service
r- - ··
t
\ ';,1,l,..:ntio11 !aver Application layer Data flow control
Transport layer ~ Transpot1 layer
Transrn ission contro

Path co111rol
Physical medium
Data link
System A System 2
Physical
Intermediate system
User A User2 Comparison ofSNA, 0
1: Peer - protocol interface 1: I· In the seven layered s
to one correspondence
Application layer i S stem N
Transport layer
L Application layer (transport). S (session)
model. The p.ith contro
Transport layer conversion Transport layer but it overlaps a little wi
session layer function cq
transmission control lay
Ph sical medium Ph sical medium SNA transmission sub sy
Communication between end system via intermediate system level services combine
Fig (b) shown the enc.I systems communicating via an intermediate system N which control functions.
enables are of dillerent physical media for the two end systems. System N converts
the transport layer information into the appropriate protocols. System A could be 2. a. What are the challenge.
copper wire Li\N and system L could be an fibre optic cable. Ans. The top chnllenges in m,
Various professional organization propose, deliberate anc.l establish stanc.lards. One I. Analysing problems,
of the organisation_is international standard organization (ISO) The corresponding
20 &.111•-f-"'" f".c"'"" Sui""u
layers in the l\\O !>)Ste1m, communicauon on a per to per protocol 1111erfacc associated
with those layer:-.. The end systems communicate b)' going through an inter mediate
node/ ~ysh:m.
The mcssagi: 1n each layer is conca111cd an message units called protocol data units
Max. Marks: 80 (POU) which consists of two parts protocol control information (PCI) and user data
111 tad, 11101111ft. (UD). PCI contains header information about the la)cr. UD contains the data till
the layer. acting :1s a service provider rccci, c1 from or transmits to the upper la)cr /
service user layer.
(08 Marks) b. Give the comparison of SNA, OSI 11nd Internet protocol layer models.
us le,·els. The) can (08 Marks)
ication architecture. Ans. The internet model does not spectly the two lower layers. The transport and nctwrok
1011 layers. Top set of layers form the suite ofTCP/ IP protocol.
nsport layers
End user application Application Application specific
r2 Presentation protocols
Presentation services
Session Transport
Data tlo,, control Transport ~onnection:Connectior
SNICP ... less Ul.J '' less TCP
Transmission control
Nc1,,ork
-------·-
SNDCP .•• Network IP
Path control ·---·---C. '
SNDAP
Data link Data link
Not specific
ystcm 2 Physical Physicnl
Comparison of SN A, OSI and internet protocol lasses models.
User 2
In the seven layered SNA model. physical data link. application layers hnve one-
to one corrcspondi:nce "ith OSI layers. OSI reference model layer 3 (network) 4
(transport). 5 (session) and 6 (presentation) are structured differently in the SNA
lication layer model. The path control layer in the SNA model is similar to OSI m:twork la)cr
but it overlaps a little with OSI transport functions. Much of the SNA transrort and
session layer function equivalent to those of OSI model are done in data flow and
transmission control layers. The combination of these two services is nlso called
SNA transmission sub system. Presentation services, which are known as SNA high
level services combine presentation services :md function with some of session
ediatc system
control functions.
ediatc system N which
s. System N converts OR
Is. System A could be 2. a. What arc the challenges of information technology managers.
Ans. The top challenges in managing. the network are
blish standards. One I. Analysing problems, which requires intuition and skill.
ISO) The corresponding

21
VIII Se+w ( LS[/ISr) Network,, M ~

2. t\nt1c1pnti11g cu~tomers demand


3 Acquiring. resources
4 Managing client/ server environment
S Networking "ith emerging technolog) • s n par! of contnimng education.
6. M:11nt.1i11i11g rL·liabilit).
7. M.1intai11inl,.\ a secure fire wall bet,,een internal network and internet while gaming
the vnlue c,r ,i: • info and services"' ailable from thi: internet.
8. Dctenn1n, 1u responsibility for out:iges to the WAN.
D
Worksta11on
9. Config.uring mnnngement system itself.
I 0. Lxpanding the network
I I. Gathering and analysing stati<;tics for pn:scntation to upper mg.mt.
12. rraditional maintenance, mandatory maintenance.
13. I·ault isolation.
14. 1 rouble shooting.
15. Remo, c constraints and bottle necks.
b. Explain internet fal,ric moc.lcl. (08 Marks)
Ans. Transmission control protocol/ internet protocol (TCP/ IP) is a suite of protocols D

that cnab)e networks to be inter conncctcJ TCP IP form the basic foundation of the
internet
Internet can be viewed as n layered an.:hitecturc as \hown in fig. The nrchitccturc
Workstation
shown the global internet as C!om;cntm; la)ers or work~tations. LANS, WANS
interconnected b) fabrics of medium ncccss controls (MAC's) switches and gatewa) s.
The workstation belong 10 tho.; user pl11111:. the LAN's to the LAN plane and WAN's to


the WAN plnnc. fhe interfaces are defined as the fabrics. MAC fabric interface the
user plane nnd LAN plane. fhc LAN nnd WAN planes interface through switching Gate\
fnbric. rt1c \VAN's in the WAN plane interface via the gnte\\Oy fabric.
The user's \\ orkstation intc1faccs tu a LA , in a medium access control (MAL).
LAN's interfaces to WAN by a switches fabric of bridges. routers and switches.
Communication between two user palnec. takes the following path. The physical path
trnvcls the I\IAL fabric, LAN plane, s,,itchmg fabric. WAN plnnc nnd the gateway
D Swit

fabric to the core and then Murn to the user plane going through all the planes and
interface fabrics.
D MA~

3, a. With diagram, cxpl


Ans. Organization model
relationships. Netwo
bridges routers and s
The managed clemer
The manager comm
manages the manage
intermediate layer ac

22
M~

□ l ,..:, plane □
educa11on.

ernet "hile gaming D LAN plane DWorkstation


Worl-.Mat1011

(08 Mnrks)
a suite of protocol!. □ □
, sic foundation oflhe

fig. The architecture


ions. L/\.NS. WANS
DWorkstation
D
W-0rk5tation

witches 11nd gatcWll)'S.


plane and WAN's to


c fabric interface the Gntcwny fabric
ace through switching

D
y fabric.
ccess control (MAL). Switching fabric
routers and switches.
path. The physical path
plane and the gatcwa}
ugh all the planes and
D MAC fabric

Module- 2
J, a. W ith di11gram, Cl.plain organiz.ation model. (08 Marks)
Ans. Organizalion model describes the components of network management and their
relationships. Network objects consists of network clements such as hosts, hubs,
bridges routers and so on.
The managed clements have a managed process running in them, called an agent.
The manager communicates with the agent in the managed clement. Manager
manages the managed element Fig presents a three tier configuration in which the
intermediate layer acts as both agent and manager.

23
VIII Sem, (CSE(ISf) Network, M ~ C-BCS · ~ode,l,-Q

information be1wJ
Manager protocol) messag,
message (comman
Fig represt:nts corr
Agent/ manager

MDl3 = Management database


.___ __.!= /\gent process □□Managed objects
The applic:ition in
model. It is a part
Net\\ork organization model with MOM
on the network clo
Notifications / 1rt1
Fig belo\\ presents
MOM object and mnna ,j

___ /\gent ___ _ Managf


___ Agent ___ _ applicati
/\gent NMS
__ Agent NMS
Manager
Manager

Man,1gcd objects
Tram,port I

MOM = Manager of manager OSI model w,ei,


MOB = Management database lS icil /\gent process common rnanagcn
NMS = Network management system management prol
Net\\ork domains can be managed locall) and a global vie,\ of the network can be operations using r
monitored hy manager of managers (MOM). The agent and the manager devices are OSI uses both
called agent NMS and manager NMS. communication p
NMS can be configured as peer to peer relationships. In three tier configuration.
intermcdinte layer acts as that both agent and manager. As manager. it collects data 4. a. With diagram, c
from network clements. processes it and stores results in database. As ngent. it· Ans. ASN I synta:-. lh.J
transmits into to the top le\.el manager. encoding rules (Bi
h. Expla in communication model. (08 Marks) TLV encoding :,tr
Ans, Management data 1<; communicated between agent and managers process us well
ns bctwc.:n message process. 1 hrcc aspects .s,c nddrcsscd in the communication of
24
O"r"K./ M ~
information between two entities- transport medium of message exchange (transport
protocol) 1m:ssage format of communication (application protocol) and the actual
message (commands and response)
Fig represents communication model.

Manager

/\ppli~:ttiun
Notification / Traps

The application in the manager module initillfe requests to the agent in the internet
model. It is II part of operations in the OSI model. Tht: agent c>..ecutes the request
on the net\1 ork element i.e. managed object and returns responses to the manager.
Notifications / traps arc the unsolicited messages such as alarms generated by agent.
Fig below presents communicntion protocol used to transfer info between mannged
object and manai.dn~ orocess, as well as between management processes.
Manager Operations/ requests/ responses
applications Trans / notifications Agent nppli<;ation

Manager
Manager Agent
SNMP (Internet)
communication communication
Module CMIP(OSI)
Module

Tran~port layers UDP/ IP(lntemel) .• I


•OSI lower layer broken (OSI)► 1ransport ayers

OSI model uses common management information protocol (CMIP) alons with
common management information si:rvicc {CMIS). Internet uses simple network
managcmenl protocol (SNMP) for communication. The services are part of
C'\\or the networl-. ~an be operations using requests, responses and alann notifications.
the manager devices are OSI uses both TCP and UDP. CMIP and SNMP specifies the management
communication protocols for OSI and internet management respectively.
three tier configuration. OR
manager. it collects data
in database. As 11gent, it· 4. n. With diagrnm, explain TLV encoding s tructures. (08 Mu-rks)
Ans. ASN I synta1' that contains management inform Ilion is encoded using the basic
encoding rules (BER). ASCII text data is converted into bit oriented data.
(08 Marks) TLV encoding structure is shown below.
managers process as well
m the communication of

25
VIII Se+111 ( CSf/ISE) Network,l-,1~ CBCS - l-,iode,l, Q

Type Length Valuti


(3) TMN
2
3
4
Class (7th- Tag no (1st -
PIC (6th bit) ..,
8th bit) 5th bits)
(4) IEEF
The type has three sub components: Class, PLC and tag number PLC specifies 3
whether the structure is a simple. type or a construct, which is anything other than a 4
simple type. It is encoded as a one type (an och:t) field. T,,o most significalll bits (7th (5) Web based I
and 8th bit) that specify class are coded ACC to values defined in table managt:ment 2
Class 8" bit 71• bit
Universal 0 0 5. a. Explain the structf
Applications 0 I Ans. A managed object
Context specific I 0 instance as shown i
Private •I I
..
The value of PLC is O for pnm111ve and I for consuuct and 1s designated as the 6th bit.
TI1t: lowest 5 bits ( 1-5) designate tag value in binary. The length specifics the length of
the value field in number of octets. Length is defined as series of octets. The MSB (8th
bit) i,; 0 for short length with low 7 bits indicating the length of the rnlue. If the value
field is longer than 127 The 8th bit of the first octet is marled a!> and rest of the sc, en
bits of the first octet indicate how many octets follow to specify the length.
Ex: Value of length 128 would look like
10000001 10000000
The integer H1lue is encoded using twos complement from which it is computed.
b. Explain network management standards. (08 Marks) amc obje
Ans. Network management standards arc idenlfier
Standard Salient points SMI is concerned
I . International standard (ISO/ OSI) Fig shown the sit
2. Managemt:nt of data communication network LAN and WAN managed object nt:
3. Oeals with all seven layer., an object name ••in
(1)0S1/CMIP 4. Most complete only n means of id
S. Object oriented Objecl t) pe is n d,
6. Well srmcturc and la, cred objecl type is dt:fine
7. Consumer large rcso~rct:s in implemcntalion. is the encoding sch
I. lnclu~lry ~lnndard ( IETF) Every objecl lypt:
2. Originall) inte~dt:d for managemt:nt of internet components. OBJECT I DENTI !
(2) SNMP/
currently adopted tor WAN and telecommunicalion systems. case letters. OBJE
lnremct information tree.
3. Easy to implement
4. Most" idcly implemented Internet OBJECT
An) combin:ition
used.
26
"'M~
I. International standard (ITU-T)
2. Management oftele communication network
(3) TMN
3. Based on OSI network management framework
4. Addresses both network and administrative aspects of management.
I. I EEF standard adopted internationally
. 2. Addresses management of L/\N's and WAN's
(4) IEEF
3. Adopts OSI standards significantly
mber PLC specifies
4. Deals with first two layers of OSI reforence model
nything other than a
(5) Web based I. Web based enterprise management
t significant bits (7th
management 2. Java management extension (JMX)
in table
Module- 3
5. a. Explain the structure of management information.
Ans. A managed object can be considered to be composed of an object type and an object
instance as shown in fig in next page. .--------,

Object
signat_ed as the 6th bit.
specifies the length of
octets. The MSB (8th
the value. If the value Object
t e Object
s and rest of the seven
Instance 3
the length.
Object
Instance 2
ich it is computed.
Name objec Syntax Encoding
(08 Marks) Object
idcntfier ASN-1 BER Instance I
SMI is concerned only with object type and not object type and not object instance.
Fig shown the situation where there are multiple instances of an object type. A
ork LAN and WAN managed object need not be just a network element. In internet as an organization has
an object name "internet'' with OBJECT IDENTIFIER 1.3.6.1. A managed object is
only a means of identifying an object, whether it is physical or abstract.
Object type is a data type, has a name. syntax and an encoding scheme syntax of an
object type is defined using abstract syntax notation ASN- I basic encoding rules (BER)
lion. is the encoding scheme for transfer ofdata types between agent and manger processes.
Every object type is uniquely identified by a DESCRIPTION and an associated
of internet OBJECT IDENTIFIER. The DESCRIPTOR name is mnemonic and is all in lower
nication systems. case letters. OBJECT IDENTIFIER is a unique name nnd number in the management
information tree.
Internet OBJECT IDENTll711:.R :::. [ iso org. (3) dod (6) I!
Any combination of unique name and node number on the management tree can be
used.

27
VIII Semt (CSf/ISE) Network, l - 1 ~

b. Expluin RMON token r ing utcnsion groups. (08 Mllrks) types of net"orks. Co1
Ans fhere are two tok1:n ring statistic~ group~. one at MAC layer and the other on packets \)O\\in\!, 9enod ,md o\\\c
coll~")cd promiscuously. Both contams statis.tici. on ring utilization and ring error Hic.-,\(\t'\' \;fOtlj) IS hdpfo
stat1st1cs.
3) .\larm j.\roup - p
Token ring Function Tables
group the probe and Cl•mp
V. lu:never lh1: monit
l. Statistics Current utilization and error Token ring ML stats table avoid excessive en:nt
statistics of MAC layer
are specified. Groups
2. Promiscuous Current utilization and error Token ring ML stats 2 table alarm parameters. I h
statislics statistics of promiscuous data token ring P stats table to select the variable a
3. MAC layer Historical utilization and error Token ring P stats 2 table token -') Host group - cont·
history statistics of MAC layer ring ML history table a list of host~ by loo•
4. Promiscuous Historical utilization and error Token ring P history table source and dcstinatior
history statistics of promiscuous data and tune table host c
5. Ring station Station statistics Ring station control table ring done host-able contair
stat ion table: S) nchroni,cd \\ ith res
6. Ring station Stntion ordl.!r Ring station control 2 table ring 5) Host top N group.
order station order table categories. ·1he ho,t t
7. Ring station Active configuration of ring Ring station con fig table 6) l\1:ltri'1. group - sto
configuration stations is ere/lied for each co
8. Source Utilization statistics ofsourcl.! Source routing slats table source are - rna1ri, control t
routing routines info routing stats 2 table keeps track or source
The promiscuous group com::cts s1at1st1cs on the number of packets of various sizes destination 10 source t
and the type of packets multicast or broadcast data. There are tow corresponding 7) Filter group - 1s 11.
history statistics groups. Current and promiscuous. One data tabl~ is associated with Stream of data based
each of four statistics groups. The history control table is common to both Ethernet a fi her table and a ch,
and token ring. arbitraf) fi her expres
Three groups are associated with the stations on the ring. The ring station group each filter associated
provide statistics on each station being monitored on the ring, along with its status. 8) Pucket capture gr
The data art: stored' in the ring station tablt:. The ring ~la lion order group provider the based on fi lrcr criteria
order of the station of the monitored rings and has only a data table. The ring station 9) Event group - C
configuration group manager the station on the ring. and falling alam1s c:i
The last of the ring groups is the source routing group. It is used to gather statistics trnnsmit events, syste
on routing on routing info in a pure source routing environment. b. Write short notes on
OR i\nll,
Structure
6. a. Expla in RMON, common and Ethernet groups (10 Marks)
Ans. The RMON, Ethernet groups are
I) Statistics group- contains statistics measured by the probe for each monitored
Ethernet interface on a device. The ether stats table in the group has an entry for each I. l'rim itivc I) pcs t
interface. The data include statistics on packet types. seizer and errors. The group is
used to measure live statistics on nodes and segments.
2) History eroup consists of two sub 1:,aroups history control group and history (data)
group. History control group controls the periodic statistic sampl mg of data from various
28 ~l\.f-M f ,t,.M Sul\l\'-"
M~
(08 Marks) types of neh\orks. Control table stores configuration entries compressing interface
the other on packets polling period ·and other parameters. The information is stored in media sp~··111c table.
ation and ring error J listory group i1, helpful in travelling o,erall trend in the volume oftrallic.
3) Alarm group - periodically takes statistical samples on spt:cificd ~, ,ables in
the probt: and compares them with the preconfigured threshold store, probe.
Whenever the monitored variable crosses the threshold. an event is gl: 1.:1.,ted. To
Lstats table avoid excessive event generation on the threshold border, rising and fall in~ '1 rcsholds
arc specified. Groups contains an alarm table with a list of entries tha, ,;, nne the
stats 2 table alarm parameters. The columnar objects alarm variable and alarm inter. a, nre used
ats table to select the variable and the sampling interval.
tats 2 table token 4) Host group - contains information about the hosts on the network. It ompiles
ry table a list of hosb by looking at the good baskets traversing the network a11< , ,tracting
history table source and destination. Group compresses three tables host contrast table, ::ost table
and time table host control table controls the interfaces on which data bathering in
ontrol table ring done host-able contains statistics about thc::"11ost. The entries in the two data tc,bles ar.e
synchronized with respect to the host in· host cohtrol table.
ontrol 2 table ring S) Host top N gmup- generates reports ranking the top N hosts is selected ~tatistics
table categories. The host top N control table is used to initiate gc111:ratio,n of such a report.
6) Matrix group - ~tores statistics on conversation bet~,een pairs of hosts. An entity
con fig table
is created for each conversations that probe dctects. Thc three tables in the group
arc - matrix control tablt: contains the information 10 be gathered uiatrix SD table
g stats table source
2 table keeps track of source to destination conversations. Matrix DS table keeps data by
destination to source trallic.
ckets of various sizes
7) Filter group - is used to base the capture of filter packets on logical expressions.
re tow corresponding
bl~ is associated with Stream of data based on a logical expression is called a channel. The group contains
mon to both Ethernet u filter table nnd n chnnncl 1abl1:. Filler tabl,; allow:; pockets to be filtered with c111
arbitrary frlttr expression. for cach channel. the input packet is validated against
he ring station group cach·filtcr associated with that channel and is accepted if it passes any of tht tests.
along with its status. 8) Packet cnpttlrc group is a post filter group it captures pnckcts from each channel,
er group provider the based on fi Iler criteria of packet and channel filters in the fl lter group.
table. The ring station 9) Event group - Controls the generation and notification of events. Both rising
and falling alarms can be specified in the event table associated with ,he group. To
cd to gather statistics transmit events, system maintains a log.
nt. b. Write s hort notes on 'SNMP based ASN -l data type structure (06 Marks)
Ans.

(10 Marks) Structure D:1ta type Comment~


INTEGER Subtype INTEGER (n,...n,) special casc
for each monitored : enumerated int I} pe ·
p has an entry for each I . Primitive types OCTET STRING 8 bit binnl") and tcxtirnl data sub types
d errors. The group is cnn be specihlld b> range or fixed
OBJECT IDENTIFIER NULL Object po. hion in MIB placi: holder
up and history (data)
ng ofdala frum various
...,..,.., E 11AM fu1111U' 29
VIII Sem., (CSf/ISE) CiKS · ~od.cl,Q
rl1t:' bot1om part of
Network address 11' nd,lr.:~~ Not u~ed modern tcrminatio1
counter Doner decimal IP addres~" rap .1round. nnd 11 her<; servers.
n,1n negative integer n1ontonically
incn:asing. max 2 32 -1 acce~:. controlle1.
CMTS consi~cs ol
Gauge C'appcd, non negative integer increa~e
or decrease
sidc and a nct\\Or
2. Defined types
modulator is conn
Time ticks Non negative integer in hundreds of
se.:ond units and feeds them to
lc,el and reeds it
Opasuc /\pplicauon wide arbitrary ASN - I
syntax, doublt: wrapped OCTET the CMTS dcmod
SrRING Sen er at the hen
managed by sccu
3. C1t.1slruclion ~FQU£NCE List maker
cable modern lo c1
tvp~~ Sl'QUENCI:. OF 1ablc maker
intcrfoce. \\ hich i
l\1odulc - 4 The second catcg
third catcgor:, incl
7. a. With d1.1~rum c:\pl.1in dutn o,cr c:11blc rcfcn:ncc architecture ( 10 Marks)
fhe RF intcrfoccs
Ans. The 1.-1, p ·st of the figure sho\\ n the system reference architecture of HFC data (lvc:r CMTS nnd the I 11·
servi~,·., and intcrfa~·es
b. Explain IIFC tee
,----- .....
: WAN.- llead Subscriber Ans. HFC technologies
' ...... ____ ,, End PC In I IFC. 1hc si1-:n·
distribulcd via ~o'

termination system
s.
(CMTS Fiber
Receiver
2
Switch/ WAN
router
Splitler ISP
6
and filter
Servers
Video
.-------,
Upemtions support 3 Security and
S) stems/ clement access controlle
mcss:11.tcs
lnterfnccs :-
1. CMCI Cable modem to CPl:. interface At the head encl,
2. CMTS - NSI CMTs Net\\ ork side interface WAN and intern
3. DOCS - OSSI Data over cable services operations support system arc multiplexed
interface Communication
4. CMTRI Data over cable security system head end to the Ii
5. DOCSS Data over cable security system from the hcntl c1
G. Rrl Cable modem to RF interface. signal. The sign
up~tream i.i1malJ
30 S.-1\.tM C.CAM Si.AM
M~
The bottom pan offig sho" nan expanded, ic\\ of the head end It comparc1l tl-ie cable
mod..:r 11 1cn111na1ion system (CM.IS) S\\ itch , router combiner transmiill·, splilh:r
al Ip address "r.1p around. and ti hers servers, operations support systems / t:lemcnt manager and ~- ·1ity and
integer montonically acccs~ controller.
ax 2 32 -1 CMTS consi,ts of a modulator mod and a dcmodulator clcmod on ti· •re link
negotive integer ,ocrc~se side and a m:twork terminator term to the S\\ ilch / rouler connecting l\l \,.\N. Tht:
modulator is connectt:d to the combiner, which multiplexes the data and, ,~.!,, ,;ignals
e int,:ger in hundreds of and feed~ tht:m lo transmillt:r. The rccciH:r converts optical signal do,, , the RF
level and feeds il lo the splitter where the various channels arc split. ThL dcmod in
wide arbilrary ASN -1 the CMTS demodulates the analog signal back to digital data.
le wrapped OCTET Sen er al the head end handle applications and databases. The securit) ' o1ction is
manngccl by security and access controller~. Two interfoces in the datn ' 1 . · face nre
cable modern 10 customer premises equipment (CPE) interlace and the l . rs - NSI
interfoce, \\hich is the nct\\ork side of the interface ofCMTS ,~ith s, i .1 , router.
l"hc second cntegory of interfaces include operations supporl S) stems interfaces and
third categoi} include security system int.:rfacc.
re (10 Marks)
The IU 1nterfac.:cs arc between the cable modern and lhe HI C network ,111. between
ure of HFC data over CMTS and the I IFC network in the upstream and downstream directions respectively.
b. Explain IIFC tcchnolo1,:) (08 '.\1:lrks)
Ans. 1IFC technologies i~ based on existing cable t-:levision I cablc tv or(ATV) technology.
In I IFC, the signal is brought to a fiber node via a pair of optical fibus and then
distributed via co cablc to customers as illustmted in fig below
Cab c
Satellite dish
·mnsmitte
Fiber
Receiver
\\'AN

ISP------

Work station
At the head end, signals from various sources analog and digital Ser\ ices using
WAN :ind inkrnct service pro, idcr ( ISP) sen ice~ 11si11~ prh ate backbrn1L' network
are multiplexed and converted up from an electrical signal to optical signal.
tions support system
Communication is one ''") on the optical fiber. each p:iir <'I opllc/ll lil>bN f·om the
head end to the fiber node calling onc way traffic in one dir..:c1tons l'hc signal going
from the head end to the customer premier is cnlled a do" nstream sig,1,11 or a path
signal. The signal going from the customer_prem icr to tire hi:all .:•1CI is called an
up~lrt'am \ignal or arc, 1:1 w p.iiu ~ig.nal.
SMl\t.-tA.. C..AM Sul"""" 31
VIII ~em., (CSf/ISE) h'etwo-r-k, M~~ C13CS · Mod.cl,Q~
Broad band !iignal transmitted over co a>.ial ..able difl"t:rs from the band signal u - R, - lntc1face be
transmitted over a pair of wires. V,. Lo1:,ical interfo
At the customer premises, a network interface unit (NIU) also refereed to as networl-. v,. = lt1tcrfac..• bctw
interface device (NID). The analog signal is split at the NIU. The tv signal is directed TL= Terminal equip,
to the TV and the data to the cable mC'dcrn Cahle modern convert the analog signal POTS = Plain old te
to an Ethernet output feeding Either a PC or LAN. ·1he tclt:phone signal is abo PSTN = Public swit
transmiued \\ ith, idco and data at s0111e cable site:;.
Splitter:, at central
HIC technolog) is based on
telephones s ignal ti
I. Broadband LAN 2. Asymmetric b,11,dw1dth allocation to achieve two \\ay
central office. klcp
communication. 3. Radio frequency )prcad )pectrum technique for causing signals
are all off the spfitt
oH:r IIFC 4. Radio frequency spectrum allocation to cause multimedia telephones
POTS interfoct: arc
(voice). tdcvi<;icn (v1cleo) and computer communication data services.
satellite feed direct I
OR
b. With diagram exph
8. a. With •li11f!rnm, c~ph1in ADSLsy,;tcm reference model (08 Murks) Ans. fhe access networ
Ans. SON CT WAN arcs
'·., V
! '
U-C: 4-c UR ' UR-1 .F-sm
T
Digit.ii_:
I •
' s
.

''
broadcast\ ''
ATU-C I

Broa,llast '
''
nctworl-. j ATU-C
----,,
ATU-R i : TE
: r.,_
Narrow band :
network : I .
t I '
I : TE
POT JC -·· POTS-R ....
1:1 ATU-C TE
Nctwo,k ATU-C Premises :
management Phone (s) distribution
PSTN network
Acc..:si.
node
Interfaces:
B= Auxiliary data input
POTS-C = Interface between PSTN and POTS splitter at network end
POTS- R Interface between phones and POTS satellite at premier end
T = Interface between premier distribution network and service mobiles
T/ sm = Interface between ATU-R and premier distribution network
u - c = Interface between LOOP and ATu - c Two types of occei
u - C1 = Interface between POTS splitter and ATu -c customers. They :ire
11-R = Interface between LOOP and ATU - R subscriber line (OS
can support the trn1
transmitted to cabl
The alternnti.,.c me

32
from the band signal u. R = Interface between POTS spliller and ATU • R
V = logical interface b/ w ATU - C and access node
refereed to as network VC = lnterfac·.! between accesi. node and network
he tv signal is din:cted TE·= Terminal equipment •
vert the analog sigm,I POTS = Plain old tclcpho11e service
lcphone signal is abo PSTN = Public switched telephone network
Splillers at central office and customer premises separate the low frequencies
telephones signal from video and digital data. PSTW is the switch connected at
to achieve two way central office. telephones operate off the splitter at customer end. The u- interface~
ue for causing signals are all olf the splitters. Interfaces may disappear when ASDL lite is implemented
multimedia telephones POTS interface are also from splitters. B interface is for auxiliary data input in a
services. satellite fet:d din:ctty into a service module such a set top box.
b. With diagrnm explain broad bnnd ncct:ss network (08 Marks)
(08 Mnrks) Ans. The access nctworl.s serving the customer premises from the backbone SDH/
SON ET WAN arc shown in tig below.
Cabk
F-sm
T
Tclophouc
s loop


IIFC
clwork

'TE
: Ti-.
'Tr
• TE
Premises :
distribution SD/1/00NF.T
WAN
network
Satclli1c communication
and h:lcphon~ loop

twork end
premier end
ice mobiles Wireless - - ~
link
1elcphonc loop
network
Two types of access technology are available to residential and small business
custo~ers. '.hey arc based on hybrid fibre coaxial (HFC cable) modern and the digital
subscnbcr l111e (DSL). Network access based on either of theses two technologies
can support th9 transmission of voice video and data. In case of HFC information is
transmitted to cable modern at the customer from an MSO facility.
The alternative method of getting broadband-services to the home is via a satellite

33
VIII Se-w11 ( CSE/ISE) Networ'lv~~
communica1ion m:1work. /\;, with "irdcss, thi~ neh,ork is t:nvisiont:d as bc111g on~
"a> 10 the customers local ions Md telcphon~ return. /\gain, the potential or 1,,0 -
way satcllilc communication cxi~t,;. SMT
Broadband acce;,s technologies consist of four mutu..llv independent of transporting Gote
multimedia data to a customer premises. ·1he} are based on cable modern, digital
subscriber line and two wireless media methods. rrn
Gate\\-
Module- 5 Screene
9. a. E,plain linit.- s t:lll' rn:11.:hinc model (08 Marks) nnd f
Ans. Model based fou lt detection scheme uses communicating finite :-latt: machine Thc m.iin
foature or 1hi:, procc~:. is. it is a passive testing ~)stcm based on th1: assumption than Packet ti ltering routers c
an observer agent is pres\;nt in each node and reports nonnali1:iblc to a centrnl point. further screening. The nil
A failure in a node or a link is indicated b) state machine associated with component Application level gnlcw
entering an illegal state. An applicntion on the server co- relate--. the events. packet fi hering.
FSM enables communication between a client and server -.ia a communication Firewalls l find 2 will dnt
channel. Client and server both have t\\0 states each. The client '-"hich is in send A secured LAN is a gm
request stale. sends a request message to the sener and transition to the rece1,·c gateway. Ex: FTP service
rcsponsc state. For TELNET service, ap
The sen er in the receive requc<,t state. rccei\ e:. the request state. receiver rcqm:st
and transition to the send n:spom,e stati:. Alier processing the request. it send<, the
response and transition back 10 the rccci, c request state. The client then receive the
response from the scrver and transition to send request state.

l ,,mmunn;J11on
Kc4uc,t chaMcl
Firewalls protect a secu
parameters (In FTP, NNTI

IO. a.Explain policy based m


If either client or server enters an illegal state during the transitions. system has Ans. Policy needs to be built i
encountered a fault./\ message is sent to a central location as a fault condition. The observe neh, 0 1k manage
passive detection scheme is similar 10 lntp mechanism. to take. This action will d
b. Mention nnd explain different types orfin:wall. (08 Marks) the failure. They should
Ans, Firewall involve w,t: of packet filtering or application level gateways as the two when failure occurred an
primary techniques of controlling unde!oircd traffic. Policy management arch'
Object in the domain s
Packet filter is based on protocol specific criteria. It is done in data link, networl-. and
performance and authe
transport la}crs. Packet filters are implemented in screening routers or packet filtering
attributes. Attributes of
routers. Filtering is done on the fall parameters. Source IP address. destination IP
device. Allributc of pac
address source l'CP UDP port and destination rCP IP port. filtcring is implemented percentage loss. They ar
in each port of' the router and can be programmed ind1.:pcndently.

34
k,l-,1~

isioned as being on.:


the potential of two- ~
ndent of transporting
cable modern. digital
Gatewa

FTP
Packet
filtering
router
g
(08 Marks)
Secure network
late machine. The main
Packet filtering routers can l'ither drop packets or redirect them to specific hosts for
11 the assumption than
• further screening. The rules to be implemented should be simple.
,able to a central point.
ciated with component
Application level gateway - is used to overcome some of problem identified for
packet fi ltcring.
s the events. . .
via a communication Firewalls I and 2 will data only if it going to or coming from the application gateway.
client which is in sc_nd A secured LAN is a gateway LAN. Filtering is handled by the proxy services in
allsition to the receive gateway. Ex: FTP service fi le is stored first is application gateway and then forwarded.
For TELNET service, application gateway authenticates foreign host.
t state. receiver reqm:sl Secure
the request. it send~ the Firewall I
c dient then receive the

Proxy
services

Application
gnteway

Firewalls protect a secure site by checking addresses (In IP address) transport


parameters (In FTP, NNTP) and applications.
OR
10. a.Explain policy based management (08 Marks)
,e transitions. system has
Ans. Policy needs to be built into the system. In network operation centre personnel may
11 a~ a fault condition. The
observe network management system, at which time they ne~d to know what action
to take. This action will-depend on what component failed, severity or criticizing of
(08 Marks) the failure. They should know who should be informed and how. This depends on
. cvel gateways ns the two when fai lure occurred and what service level agreements are in eftect.
Policy management architecture is as shown below.
e in data link, network and Object in the domain space arc events such as fault management packet loss in
g routers or packet filtering performance and authentication failure is security management. Objects have
Ip address, destination IP attributes. Attributes of alarms include severity, type of device and location of
rt • filterinob is implemented device. Attribute of packet loss can include the layer at which packets are lost and
dently. percentage loss. They arc same as rule based reasoning.

35
VIII Sew11 ( CSE(ISE)
E ighth Semester 8.1

Time: 3 hrs.
~ Note : Am wer any FfJIJ::J

I VIC}
I. a. Wha l are tbc challc11
~pace____.___ _,
why?
Domain Ans. For challenge Refer Q.
s ace ..__ _ Polic) Action NMS is required for l
_ _ _ driver space (a) For proactive rnana
(b) To verify customer
Rule space (c) To diagnose problc,
(d) To provide statistici
Policy driwr i~ 1he control mechani!.m. Thus _t~e o~jccts in 1he do~ain ~pace and tl~e (e) To help remove bot
set of rules in rule space are combined for dec1l>IO,l 1.c. made by policy driver. A rule '.s (f) To see the trend in ~
that all network failures an: to be reported to engineering group. A statement that this b. What is Nehvork
rut~ applies to the operations personnel at the netwon.. management desk is a policy. management function
Ans, Refer Q.No. I .a. of M(
b. Explain fo ult detection, location and isolation techniques. {08 Marks!·
Ans. fault detection is accomplished by using either a polling scheme or by generation o c. Explain Network mar
traps. An application program generates the ping command pcriod1call) and "~its
for response. Connecti\ it) is declared broken \\hen a pre set number of consccl'.tlve Ans. A network managem
responses are not received. The freque ncy of pinging and pre set number for fo'.lt~rc systems A and B exch
detection may be optimiLed for balance betnecn traflic over head and the rap1d1ty of management'infonn
\\ ith which failure i~ detected. Traps genetic trap van be set in the agents, giving management controls (
them capability to report events to the network management system with legitimate
communit) name. Advantage of traps o,er polling is that failure detection is
accomplished faster with less traffic overhead.
fault location using a simple upproach would be to detect all the net\\.Ork component:.
that have failed. The origin of the problem could then be traced by traversing the
topology tree to where the problem starts. If an interlace card on a router failed, all
managed components connected to that interface would indicate failure.
/\fter having located where the fault is. nc:1.t step is to isolate the fault. f-irst is to
determine whether the problem il> failure of the component or failure of ph)s1cal
link. Let us assume that the link is not the problem and the interface card is then
proceed to isolate the problem to the layer that is causmg. t:..,cessive packet toss may
cause disconnection and can measure the packet loss by pinging, if p,nging can be
used. We can query the various MIB parameters on the node itself or on other related
nodes to localiLe further the cause of the problem.
The i?eal solution t~ locating and isolating the fault is to have an artificial intelligence
sulullon. Uy observing all the symptoms, we might be able to identify the source of
the r,roblcrn,;

36
Eighth Semester B.F,. Degree Eumination, CBCS - June/ July 2019
Network Management
Max. Marks: 80
Note : A11swer any FIVE full q11ntiom, selec·ting ONEfull quell/011 from ea,·h module.

Module- I
Whal ~ire the challenges of managing nct"'·ork? Explain the ust: of NMS and
why? (05 Marks)
For challenge Refer Q.No. 2.a. of MQP - 2
NMS is required for following reasons
Action
(a) For proactive management of network
space
(b) To verify customer configuration
(c) To diagnose problems
(d) To provide statistics on performance
(c) To help remove bottlenecks
the domain space and the (f) To see the trend in growth.
e by policy driver. A rule ~s
roup. A statement that thl'i b. What is Network management? With neat diagram, explain network
agcment desk is a policy. management functional groupings (06 Marks)
Refer Q.No. I.a. of MQP - I
ues. (08 Marks)
chemc or b) generation of c. Explain Network management Dumbell architecture, with neat diagram.
and periodical\) and waits (OS Marks)
set number of consecutive A network management dumbell architccturc is shown in fig where vendor
pre set number for fa'.11'.re systems A and B exchange common management messages. The message consist
over head and the rnp1d1t) of mnnagement'information data (ex type , id and status of managed objects) and
set in the agents, giving management controls (ex setting and changing configuration of an object)

--
c.~.....
ent svstem \\ ith legitimate t.4M - I
s that failure detection is

all the network components


be traced by traversing the
card on a router failed, all
indicate failure.
0
isolate the fault. First is to
ent or failure of ph) sical
the interface card is then
E.,cessivc packet loss may
pinging, if p,nging can be
e itself or on other related

"e an artificial intelligence


le to identif) the source of
1
VIII Sem, ( CSE I ISE) Network, M ~
CBCS • JIM'U!//JIM(y 201
The protocols and services associated with dumbell architecture are presented in
fig(b). Application services arc the management related applications such a'> fault c. Explain TLV encodin{
and configuration management . The manager protocols are CMIP for OSI model Ans. Refer 4a Model Qui:sti<
and SNMP for internet model. Transport protocol arc first four layers of OSI model
and TCP/JP over any of the first two laycrofscp.crs la)erOSI model. 3. a. What is Management
OR
Ans. Refer Sa Model Questi1
2. a. List and briefly explain the silent features of different network management
models. (08 Marks) b. Explain in brief with
Ans. Refer Q.No. I .b. of MQP - 2 organization model.
Ans. Refer 4a Model Questi
b. With block diagram, explain ASN-1 data t)·pe structure and tags. Give example
(05 Marks) c. Explain SNMP Netwo
Ans. Following arc the ASN-1 Data type conventions
Ans. Network management
nata types ConHntion Example application and agent ai
Object name Initial lo,,cr c:1se letter sys Descr. etherstats pkts UDP. IP, DLC nnd PH~
Application data type Initial upper c.lie letter Counter, lpAddress RFC 1157 describes s
Module Initial upper case Personnel Record network element may 0
The get request messa~
Macro, MIB Module A II upper case letters RMON- MID
of an object . the value
Keywords AII upper case letters INTEGER, BEGIN. The get next request
The body ofa module always starts ,,,th BEGIN and always ends with END instances of object.
.ASN. I data type definition example is sho"n below SN
Personnel Record: : = SET
{ Name,
title Graphic string,
division CHOICE {
marketing [O] SEQUENCE
{Sector.
Country}
research [I] CHOICE
{ produce - based [O] NlJLL,
basic [I) NULL }
production [2] SEQUENCE
{ product - Iine
country} } }
There constructs are used to build structured data ty pes. The data type whose name
is title is defined in - line as the data type Graphic string. It can be defined as follows
title Title:: Graphic string. The constructs SET and SEQUENCE are list builders.
<module name> DEFINITIONS : : - BEGIN
<name> : : = <definition>
END <name> : : =- <definition>

2
CBCS · J (;(NU!/ I J 1M.:1 2019
!cture are presented in c. Explain TLV encoding structure with value of class bits. (03 Marks)
ilications such as fault Ans. Refer 4a Model Questions paper 2.
, CMIP for OSI model Module-2
r layers of OSI model
I model. 3. a. What is Management Information base'! Explain MIB module structure.
(05 Marks)
Ans. Reter Sa Model Questions paper 2
ehvork management
b. Explain in brief with neat diagram, two tier, three tier and proxy server SNMP
(08 Marks)
organization model. (06 Marks)
Ans. Refer 4a Model Questions paper 1
nd tags. Give example
(05 Mark,) c. Explain SNMP Network management architecture, with neat diagram,
(05 Marks)
Ans. Network management architecture portrays the data path between mam1ger
Example application and agent application process via the four transpo1t function protocols :
r ethcrstats pkts UDP, IP, DLC and PHY.
I '
r, lpAddress RFC 1157 describes SNMP system architecture SNMP is management info for a
network element may be inspected or altered by logically remote users.
~el Record
The get request message is generated by the management process requesting value
-MIB of an object . the value of an object is a scalar variable.
The get next request or get next can have multiple values because of multiple
, s ends with EN D instances of object.
SNMP Manager SNMPAgent

,.._
Manageme11t
dnl3 K SNMP Manager
Application
'
( SNMPAgent
Application
i1:, ii:,
1
1ii
" .,X i1:,
::,
<,
~
r:
0
C.
-.. .,..8'
"'
:, ;; t1 0c.."'"r:
~
~ u
C

i
[ ~
i g
C. e C:
:,

e ~
0/)
t " 0/)
C.
!l u g 0/)

SNMP SNMP
UDP UDP
·he data type whose name JP IP
can be defined as follows DLC DLC
UENCE are list builders.
PHY PHY

Physical Medium

3
,.

VIII Sem, (CSE/ I S£) Network, M ~ CBCS •JIM'l-e/(Ju


The set reHuest is generated by management process to ini1iali1.c or reset the valm:
of an object variable. .
The get response message is generated by an agent proce~s. It 1s gen_e~ated only on
receipt of get reQ. get - next reQ or set- request. A trap 1s a~ ~nsolic11ed message
generated by agent proces~ without a message or event amvmg from managers
process.
OR
4. a. Whut is SNMP Community? F,xplaia SNMP community profile and Access
policy. (06 Mark.,)
Ans. A pairing of SNMP MIB view with the SNMP access mode is called community
profile.
SN MP Community profi le is shown below
SJI.MPA&cnt
R.:aJ
Only
l Read
Writc
SNMP ncccss modC
'C

_.l I

Nol
_t_
Read

Write

Read /
accessible only only \Hile
Object I Object 2 Obj~-ct 3 Object 4

< MiRAccess

< SNMP MIB view (


b. Whal is lntcrfi
Ans. The interface gr
if there is more
There are four access privileges - none, read only. write only and read-write. The
above is example of no - access model. associated with
Most of the objects available are read-only and these arc get and trap operations. int,. l.1,'t' cards
A pairing of an SNMP communit) with an SNMP communit) profile in defined as inr.-. 1.1.:c in the;
or a ~) stem. The
SNMP access policy. Fig shows an example of three network management systems
lfent,y OBJEC
in three network operations centres (NOC'S) . Each having access to different
community domains. Agents I and 2 belong to community I. Manager I in NOCI SYNTAX !FEN
ACCESS NOT.
and manger in NOC2 are responsible for managing their respective domains.
STATUS manda
Examples of intq
Entity
if Number

ifNumber i

if Entry

4
,O' E«AM 5c.A1111
cscs -JIM'\£// J~ 2019

ess. It is generated only on


ap is an unsolicited message
nt arriving from managers

unity profile and Access


(06 Marks)
mode is called community
Commumty2

p access mode

Manngcr 2
Object 4 (Communit 2

MIBAcccss b. What is Interface group? Explain wilb an example. (05 Marks)


Ans. The interface group contains managed objects associated with interfaces of a system.
p Mll3 VIC\\ iflhere is more than one interface in the system. the group describes the parameter
nssociutcd with each interface. Ex :- If an Ethernet bridge has several network
int,, I.a,:.: cards , the group covers the info associated with each interface. Each
int.; 1.1cc in the interfaces table can be visualized as attached to either a sub network
re get and trap operations.
or n s) stem. The index for the table i~specified by if Index like in the below module
munity profile in defined as
If entry OBJECT-TYPE
etwork management systems
h having access to different SYNTAX IFENTRY
ACCESS NOT - accessible
unity I. Manager I in NOC I
STATUS mandatory
ir respective domains.
Examples of interfaces Group are
Entity om Description
if Number interfaces Total no of network interfaces in system
List of entities that describes info on each interface
ifNumber interfaces 2
of the system.
An interface entity that contain objects at the sub
if Entry ifTablc I
nelworl... layer for a particular interfaces.

5
VIII Se.111 ( CSE I IS£) Network, M ~

c. Explain ICMP group with neat diagram. (05 Marks) The netw01k ma
Ans. Internet control message protocol group contains the statistics on ICMP control probe monitor th
messages. and backbone ne
ICMPGroup network segment
Entity 0 1D Description (brief) Ex :- RMON cou
Total no of ICMP messages receive by the an abnormal con
icmp ln Mrgr icmpl an alann.
entity including icmp In Errors
No of messages received by entity with ICMP The individual
icmp In Errors icmp2 diagnosed and
specific errors.
No of icmp destination unreachable message technology in a
icmp ln Destunreaches icmp3 productivity for
received.
icmp ln Time Excds icmp4 No ofICMP Time Exceeded messages received. b. Explain RMON
No of ICMP parameter problems messages Ans. Refer Sa Model c
icmp In parm Probs icmp 5
received. c. Explain four b
icmp l n Src Quenches icmp6 No of ICMP source Quencr messages received Ans. Four modes of
icmp In Redirects icmp7 No of ICMP Redirect messages received fibre coaxial (Ii
icmp In Echos icmp8 No of lCMP Echo (reQuest) messages received. Communication
deployed means
icmp In Ecro Reps icrnp 9 No oflCMP Echo reply messages received
..
The syntax of all ent1t1es 1s read - only counter. Ex stat1st1cs on the no of reQucsts
sent might be obtained from the counter reading of icmp out Echos.
Module-3
5. a. What ls Rcmoh: Monitoring? With a fiture, explain use orRMON probe.
(06 Marks)
Ans. Remotely monitoring the network with a probe is referred to as Remote Network
monitoring (RMON . Fi below shows the use of RMON probe

In one - way te
traverses cable.
network telephone facilit
acceptable. For I
internet, there r
implementation
trans mission Ii
NMS three categoric
Probe
Wireless satelli
cacs -JIA.,V\,(!/1JtAl:Y 2019
(05 Marks) The network management system(NMS) is on local ethemet LAN. A token ring
ics on ICMP control probe monitor the token ring LAN. It communicates with NMS via routers, WAN
and backbone network. One advantage is that each RMON device monitor the local
network segment. it relays info in both solicited and unsolicited fashion to the NMS.
)On (brief) Ex :- RMON could be locally polling the network elements in a segment. If it detects
gcs receive by the an abnormal condition such as heavy packet losses or excessive collisions, it sends
an alarm.
~ Errors
The individual segments can be monitored almost continuously. A fault can be
id by entity with ICMP diagnosed and reposted to NMs. The overall benefits of implementing RMON
technology in a network are higher network availability for users and generate
unreachable message productivity for administrators.
b. Explain RMON, Groups and its functions in detail. (OS Marks)
eeded messages reee ·ve Ans. Refer 5a Model Questions paper I.
problems messages
c. Explain four broad band access technology with example. (05 Marks)
Ans. Four modes of access rely on four different technologies There are Hybrid
,encr messages receive fibre coaxial (HFC) Cable, Digital subscriber line (DSL) wireless and satellite
messages received Communication . HFC uses television facilities and cable modem and is most widely
Quest) messages received. deployed means of access of fou,.r.;....._ _ _ __,
Broadband Access
y messages received
0 11 the no of reQue~ts
Echos.

ofRMON probe.
(06 Marks)
to as Remote Network
obe

In one - way telephone return configuration, downstream signal to the customer


traverses cable. The return upstream signal from customer premises is carried over
telephone facilities via regular modem. One - way or two - way communication is
acceptable. For Ex, a residential customer may reQuest a file to be downloaded from
internet, there rcQuirc only small band width . DSL technology has three different
implementation and is referred as XDSL, Wireless access technology uses wireless
trans mission from downstream link to customer site. ISM, MMDS and LMDS are
Ethernet three categories of wireless services.
Probe Wireless satellite communication has been extended to broad band access service.

7
VIII Sem., (CSE I I SE)

OR c. Explain S layers Al
6. a. Explain SNMP operations with details. (06Ma r ks)
Ans. Refer 6a Model Questions paper I. Ans. Management ofl-U,
of either a compute
b. Explain RMON1 m1magement info base with details. (05 Murks) layer architecture fl
Ans. Refer Sb Model Question paper I. WAN via an ATM
c. Explain Data over Cable system reference Architectures. (05 Marks) the applications am
Ans. Refer 7a Model Question puper 2. in the cable mode
Module-4
7. a. Briefly explain HFC technology , wi~b neat block diagram. (06 Marks)
Ans. Refer 7b Model Question paper 2.
b. What is CMTS? Bricfty explain with example. (05 Marks)
Ans. Cable modern and Cable modern termination system (CMTS) can be managed with
SNMP management. Fig shov.n the Mm•s associated with cable modern and MIB
associated with CMTS that are relevant to managing cable modern and CMTS.

Three functional f
support and plan
include detection
gains signal levelf
docsit MIB docsTrCntMlB
127 128 8. a. What is ADSL T
The first category is the generic set of lETF MIB'S, system {mib - 2 I} interfaces with example.
{mib - 2 2} [RFC 1213). The second category comprises MlB'S for interfaces of Ans. Fig shown the c
cable modern and CMTS. The docs If mib is a sub node under transmission and premise network
includes objects for the Baseline privacy Interface. The docs Tr Cm MIB specifics ADSL = Asynch
the telephone return for cable modem and CMiS. ATM = Asynchr
The Docs interfaces objects group, docs if MIB objects has three sub groups. Base STM = Synchro
interface objects common to CM and CMIS. The RF spectrum has n layered interface TE = Terminal ~
similar to that of ATM sub layers. The CMTS interface objects group, docs if cmts OS = Operation i
objects pertains to the CMTS. PDN = Premises
SM = Service M

8
7<,M~ CBCS ·]IM'\£,/]ufy2019
c. Explain S layers Architecture of network management, with neat diagram.
(06 Marks) (05 Marks)
Ans. Ma1~agement of HFC system with cable modems is more complex than management
of either ~ computer network or a telecommunication network. Figure presents the
(05 Marks) layer architecture for network management. the head end is connected to the SON ET
WAN via an ATM link and to the HFC via an HFC link. The head end contains both
(OS Marks)
!he applications and SNMP manager in the application layer. An SNMl> agent resides
in the cable modem that monitors both RF and Ethernet interface.
Head end Cable modem Subscriber
Applications Modem Applications
SNMPAgent Applications
(06 Marks) SNMPMGR
Applications SNMP, FTP
SNMP
(OS Marks) SNMPMGR HTTP etc
) ca1\ be managed with
TCP/UDP TCP/UDP TCP/UDP
cable modern and MIB
odern and CMTS.
IP IP IP
'

Er ATM
link
HFC
link ~
HFC
link
Ethernet
link ~
Ethernet
link

Three functional areas are identified• network maintenance, subscriber (customer)


support and planning. At the physical layer, HFC access network management
include detection of errors and correction of ingress noise interference, amplifier
gains signal levels at cable modems and power supply voltages.
OR
8. a. What is ADSL Technologies ? Briefly explain simplified ADSL access network
em {mib • 2 l} interfaces with example. (06 Marks)
s MlB'S for interfaces of Ans. Fig shown the components of the overall network comprising private , public and
premise networks and the role that ADSL access network plays in it
e under transmission and
ocs Tr Cm MlB specifies ADSL ;; Asynchronous Digital Subscriber Line.
ATM;; Asynchronous Transfer Mode.
s three sub groups. Dase STM = Synchronous Transfer Mode.
m has a layered interface TE = Terminal EQuipment.
objects group, docs if cmts OS = Operation system.
PDN = Premises Distribution Network
SM ~ Service Module.
VIII Sem,, (CSE I I S'f), Network; M ~

/\DSL /\cccss Nctworlo. I CBCS - JIM\el( J&A.o/

SM
Brood b:1nd
NIW
Narrow band Access
NIW Node Message
AOSL ADSL proeessin
Pa.:~,1
NW model
STM

ATM
Packet
STM
..
ATM Packet

Transport Mode: The authentication


The service providers network consists of service systems, operators system (OS) authentic cation.
that perform operations, administration and maintenance (OAM) of net\\ork and as HMAC. MDS
access nodes and ATU- C'S. authentication ensu
The services systems are on a private network that provides on - line services . ?f the message. r
Private network interfaces with the public network, which consists of broad band, identifies associate
narrow band or packet. The access node is the point for broad band signal , narrow
band data and packet data. It is located in a central office or at a remote location such
9. a. Briefly explain AD
as an optical network unit (ONU). A ns. Fault isolation para
The access node may include ATU - C'S and DSCAM. The premises network starts
from the network interface at the o/p of ATU - R. Five transport modts are shown Parameter
in fig . The PON is a part of customer premises network pair cable of a telephone ADSL Line
network. Which distributes signal over power lines, co - axial cable, fibre optic cable status
or a combination of these transmission media. Alarm thresh
b. Briefly explain the role of ADSL access network in an overall network with holds
details. (06 Marks) Unable to
Ans. Refer Sa Model Question paper 2. initaliz.e ATU • R
c. What is SNMPV3 security? Expansing conceptualized representation of SNMP
secure communication. (04 Murks) Rate Change
Ans. The IETF SNMPV2 working group has specification for its security subsystem user
- based security model. It is called user - based. Faults should be di
indication of fau lts
ADSI line status sh~
is a loss of any of t
initialization errors
minutes on loss of si

10
CBCS •JIAKI.(!//Jufy 2019
Securit model
Onto intcsri1~ Aulhen11cation
module
Data origin authentication

Message Privacy
processing i--- Data confidcntialit) _ _....,.
module
model

Timeliness
Message: Timeliness and module
limited rc:play protection

The authentication module provides two services - data integrity and data origin
authentic cation. TI1e authentication process involves the use of protocols such
perators system (OS) as HMAC - MOS or HMAC - SHA - 96 in SNMPV1 . The data origin service
AM) of network and authentication ensures that claimed identify of the use on behalf is truly the originator
of the message. The authentication module appends to each message a unique
es on _ line services . identifies associated with an authoritative SNMP engine.
onsists of broad band.
d band signal • narrow
Module- 5
8 remote location such 9. a. Briefly explain ADSL fault management with example. (08 Marks)
Ans. Fault isolation parameters are ~hown in table below
remises network starts Parameter Component Line Description
port modes are shown
ADSL Line Indicates operational status and
ir cable of a telephone ADSL line Physical
status various types of failures of link.
1cable, fibre optic cable
Alann thresh Generates alarm on failures or
ATU-CIR Physical
holds Crossing of thresh holds.
overall network with Initialization failure of ATU - R from
(06Marks)
Unable to A.TU- CIR Physical
initalizc ATU - R ATU-C
Event generation upon rate changes
presentation of SNMP Rate Change ATU-CIR Physical when shift margins arc Crossed in
(04 Marks) the both upstream and downstream.
curity subsystem user Faults should be displayed by network management system. After the automatic
indication of faults, ATU • C and ATU - R self tests as specific in Tl.413
ADSI line st:itus shown the current status ot line - whether it is operational or there
is a loss of any of the frame , signal , power or link parameters. rt also indicates
initialization errors Alanns are generated \\hen the reset counter reading exceeds IS
minutes on loss of signal, frame. power, link and error seconds.

11
VIII Seffl/ ( CSE I ISE) CBCS • JfM/l.e.,/J~ 2
------------------------------' Network - - M ~
b. What is Event Correlation ? Explain case ba,sed reasoning in detail with
necessary diagram. (08 Marks)
Ans. The management system has to correlate all the events and isolate root cause of the
problem. The methods used to do so are called Event Correlation.
Case based reasoning (CBR) overcomes many of the deficiencies of RBR. The
general CBR architecture i~ shown in fi

Input Retrieve Adapt Process


A user logs on to a c
It consists offour modules; input, retrieve, adapt and process along with case library.
server. After verify
The CBR approach uses knowledge gained previously and extends it to current
ticket to client. Th
situation. The former episodes are stored in case library.
decrypts the mess
If the current situation as received by input module matches one in the case library,
and obtains a servi
it is applied. If it does not match, the closest situation is chosen by adapt module validating ticket a
and adapted to current situation to solve the problem. The process module talces the
())Authentications
appropriate action. Once the problem has been resolved, the newly adapted case is granted. No login j
added to the library. The user sends id t
When a trouble ticket is created on a network problem, it is compared to similar agent to client and a
cases in the case library containing previous trouble tickets with resolution. The (4)Authentication L
current trouble is resolved by adopting the previous case in one of three ways (I) functions. The sen
Parametrized adaptation (2) Abstraction/re specialization adaptation and (3) Critic decrypts the messai
based adaptation. used to encrypt and
OR b. What is Privacy E
10. a. Write a note on different Authentication system in client server Environment. diagram.
(08 Marks) Ans. Refer IO b Model
Ans. There are four types of client - server Environments. They are
(I) Host/User Authentication System
Host authentication requires that certain hosts be validated by the server providing
service. The host names are administered by server administrator. The server
recognises the host by the host address. If server S is authorized to serve a client
host C, anyone who has an account in C could access S. Authentication can also be
provided by ID and password.
(2) Ticket granting system - is explained by kerberos Fig consists of an authentication
server and ticket granting server

12
CBCS • JIMU!/( ]IMIJ 2019
..-------'
c,~~

ng in detail with K.:rbcros


(08 Marks) Client
~e root cause of the
u
~ work Authentication
server
). station
,cies of RBR. The

Application Ticket
~~ Granting
server
server

A user logs on to a client workstation and sends an logins request to the authentication
ng with case library. server. After verifying the user on its access control list, the AS grants an encrypted
:xtends it to current ticket to client. The circuit workstation reQucst a password from the user and then
decrypts the message from AS. The circuit interacts with ticket granting server
in the case library, and obtains a service granting ticket and session key. The application server, after
ten by adapt module validating ticket and session key provides service to user.
ss module takes the (3)Authentication server system - is similar to ticket granting system but no ticket is
wly adapted case is granted. No login identification and password pair is sent from client workstation.
The user sends id to central authentication server,aner validation of user, as a proxy
compared to similar agent to client and authentication the user to the application server.
ith resolution. The (4) Authentication Using Cl) ptographic functions - involves the use of cryptographic
e of three ways (I) functions. The sender can encrypt an authentication request to the receiver, who
ation and (3) Critic decrypts the message to validate the user's identification. Algorithms and keys are
used to encrypt and decrypt messages.
b. What is Privacy Enhanced Mail (PEM) ? Briefly PEM process, with neat block
diagram. (08 Marks)
erver Environment.
(08 Marks) Ans. Refer 10 b Model Question paper I.

the server providing


nistrator. The server
lzed to serve a client
ntication can also be

ofan authentication

13
1/
Eighth Semester 8.E. Degree Examination, CBCS - Dec 2019 / Jan 2020 protocol (CMIP). them
Network Management of tl~e internet suite of I
Time: 3 hrs. Max. Marl-s: 80 and is connectionless ne
Note: A11swer any FIVE/111/ q11estiom, selecting ONE full questio11/ro111 each 111od11/e. An~· Discuss the challenges
. Refer Q.No. 2.a. MQP. 2
Module- I
1. a. Explain the salient services provided in OSI layer. (08 Marks) 3. a. Explain the various dat
Ans. Refer Q.No. 2.b. of MQP - I Ans. Refer Q.No. 2.b. of MQP
b. Briefly explain Network Management functional groupings, with diagram. Refer Q ·No· 2·b· ofJ une
(08 Marks) b. Discuss 2 Tier mod 1
Ans. ReferQ.No. l.b. of June/ July 2019 Ans Tw • c and
. o tier model • network
OR hubs, bridges and routcrs.

~
2, a. Discuss application-specific protocol in OSI Models (08 Marks)
Ans. All application specific protocol services in OSI arc sandwiched between the user
and presentation layers. _ _ _ __
OSI user MDB-M
- _- anagement Datab
-Agent Process

Terminal
VT Application
As shown in fig the .
mana ' re is a d
ges the managed ele

~
management
. . data, Processes
FTAM File Transfer aF m11111nal set of infiormation
.
or three tier, Refer Q.No. 4.

Mail/ Message
MUTIS -----1 Transfer . a. Discuss the vnrious models
ns. Refer Q.No. 2.a. of June/ Jul
Management b. Explain the following·
CMIP i) Macros ·
Application
ii) Functional Model
ns. (a) Macros. ASN.I langua
Presentation layer types and values by defining
ThcASN. l macros also fncili
The boxes on the right hand side describe the services offered. A user interfaces with characteristics associated wi
a hu:,l at a remote terminal using virtual terminal (VT). File transfer are accomplished Structure of ASN. I macro is
using file transfer access using file transfer access and management (FTAM) in OSI <macroname> MACRO : :
model. Similar to SMTP protocol is message oriented text interchange standard BEGIN
(MOI IS). Network management is accomplished using common management info TYPE NOTATION : : = <S)

14
CBCS - 'f)ec,2019 I Jt:M'\/2020
protocol (CMIP). the most used network protocol is lntemct protocol (lP). It is a part
2019 / Jno 2020
of the internet suite of transmission control protocol and internet protocol (TCP/IP)
and is connectionless network service protocol.
Max. Marks: 80 b. Discuss the challenges facL-d by Network Managers. (08 Mark.,)
1 from eacll module. Ans. Refer Q.No. 2.a. MQP - 2
Module- 2
J. a. Explain the various data types and symbol used in ASN.1 (08 Mark.,)
(08 Marki)
Ans. Refer Q.No. 2.b. of MQP - l
Refer Q.No. 2.b. of June / July 2019
pings, with diagram,
(08 Marks) b. Discuss 2 Tier model and 3 Tier model. (08 Marks)
Ans. Two tier model - network management consists of network elements such as hosts,
hubs, bridges and routers.

(08 Marki)
iched between the user
MDB = Management Database
- = Agent Process

erminal
pplication As shown in fig, there is a database in the manager but not in the agent. The manager
manages the managed element. TI1e manager acquires the agent and receives
management data, processes it and stores it in the database. The agent can also send
a minimal set of information to the manager unsolicited.
le Transfer
For three tier, Refer Q.No. 4.a. of MQP - I
OR
ii / Message Discuss the various models in Network Management. (08 Marks)
Transfer Refer Q.No. 2.a. of June / July 20 19
b. Explain the following:
anagement i) Macros
pplication ii) Functional Model (08 Mark.,)
ns. (a) Macros - ASN. l language permits extension of capability to define new data
types and values by defining ASN. l macros.
The ASN. I macros also facilitate grouping of instances ofan objects and determining
. A user interfaces with characteristics associated with it.
nsfcr are accomplished Structure of ASN. I macro is shown below
gcment (FTAM) in OSI <macronamc> MACRO : : =
xt interchange standard BEGIN
mon management info TYPE NOTATION : : = <SyntaxofNewType>

15
VIII Se,m, (CSE I ISE) Net:wor-'/v M ~ CBCS - Veo201~ I J~
V/\LUE NOTATION : : = <SyntaxofNewValue> (I) It should minimi~
<auxiliary assignments> by management ager
END (2) lt should be flexi
The keyword for MACRO is in capital letters Type notation defined the syntax of (3) SNMP architect
new types and VALUE notation defined the syntax of new values. The auxiliary particular hosts and
assignments define and describe any new types identified. Only non aggregate
Example of object identity macro is shown below are communicated a
OBJECT - IDENTITY MACRO The get request and
BEGIN data from network el
TYPE NOTATION : : = of network element.
"STATUS" Status messages from mana
"DESCRIPTION" Text time stamp. SNMP n
Refer Part protocol in order to b
Value NOTATION : : = b. Explain various gro
Value (VALUE OBJECT IDENTIFIER)
Status : : =·'Current''\ "depreciated"\ Obsolete"
Refer part :: = "REFERENCE'' Text I empty Ans. Refer Q.No. S.a. of M
Text : : ="'"'String"'""'
END 6. a. Explain various S~
The two syntactical expression STATUS and DESCRIPTION are mandatory and
Ans. Refer Q.No. 6.a. of M
the type Refer part is optional. The value in VALUE NOTATION defines object
identifier. b. Explain the followin
b) Func tional Model - Component of OSI model addresses the us~r oriented i) Create and Go Ro
npplications and an, specified as shown in fig below
ii) Create and Wait E
OSI Functional Ans. i Create and Go Ro
model
Manager
Configuration process
Fault Performance
Management Manngemcnt Security A •
Management M ccountmg
. . __..;;..._..J anagement Management
Configurat1on management addresses the I . .
networksand theircomponents Fa It se tmg ~nd chang111g of configuration of
o~the problem causing the fail~rc ~n t;~ea~:f:::n~t~v~~e; delectation and !solation
displays in real time all major and minor I .b const_antly rn~nrtors and
trouble ticket administration of fault. a arm ascd on seventy of failures. The <lat

Module -3
5. a. Explain SNMP ~etwork management architecture. (08 Marks
Ans. The SNMP architecture consists of communication between netwo k )
stations and managed network elements or objects SNMP co ~ ~anagement
is us d 1 • . fi · mrnun1cat1on protocol
. e _o communicate m o between network management stations and ,
station m the elements. management
There are three goals of architecture

16
CBCS · Veo2019 ( J<M\/2020

(I) It should minimize the number and complexity of management functions realized
b} management agent.
(2) It should be fl exible to allow expansion
tion defined the syntax of (3) SNMP architecture should be independent of architecture an<\ mechanism of
new values. The auxiliary particular hosts and gateways.
Only non aggregate objects arc communicated using SNMP. the aggregate objects
arc communicated as instances of object.
The get request and get next req message and generated by manager to retrieve
data from network elements. The set request is used to initialize and edit parameters
of network element. The get response request response from agent to get and set
messages from manager. there are three types of traps: Generic trap, specific trap and
lime stamp. SNMP messages are ex-changed using connectionless UDP transport
protocol in order to be consistent with simplicity of model as well as to reduce traffic.
b. Explain various groups and functions, RMON 1 perform at the data link layer.
(08 Marks)
Ans. Refer Q.No. 5.a. of MQP - I
OR
6. a. Explain various SNMP operations. (08 Marks)
lPTlON are mandatory and Ans. Refer Q.No. 6.a. of MQP • I
NOTATION defines object b. Explain the following briefly:
i) Create and Go Row creation
ii) Create and Wait Row creation. (08 Marks)
Ans. i Create and Go Row creation
Manager Agent Managed
process process Entity

Security Accounting
mcnt Management
Set Request {
changing of configu~ation_ of Status 3 = 4
Ives delectation and isolauon index 3 = 3
MS constantly monitors and data, 3 = Def (Data) Create
on severity of failures. The instance

Instance
Created
(08 Marks) Response {
cen network management Status 3 = I
MP communication protocol index 3 = 3
nt stations and management data, 3 = Def(Data)

17
CBCS · Veo2019 ( j<M'V1020

(I) lt should minimize the number and complexity of management functions realized
by management agent.
(2) It should be flexible to allow expansion
tion defined the syntax of (3) SNMP architecture should be independent of architecture anct mechanism of
ew values. The auxillary particular hosts and gateways.
Only non aggregate objects are communicated using SNMP. the aggregate objects
are communicated as instances of object.
The get request and get next req message and generated by manager to retrieve
data from network elements. The set request is used to initialize and edit parameters
of network element. The get response request response from agent to get and set
messages from manager. there are three types of traps: Generic trap, specific trap and
time stamp. SNMP messages are ex-changed using connectionless UDP transport
protocol in order to be consistent with simplicity of model as well as to reduce traffic.
b. Explain various groups ond functions, RMON 1 perform ot the data link layer.
(08 Marks)
Ans. Refer Q.No. 5.a. of MQP - I
OR
6. a. Explain various SNMP operations.
PTlON are mandatory and (08 Marks)
Ans. Refer Q.No. 6.a. of MQP - I
NOTATION defines object
b. Explain the following bric0y:
i) Create and Go Row creation
addresses the user oriented
ii) Create and Wait Row creation. (08 Marks)
Ans. i Create and Go Row creation
Manager Agent Managed
process process Entity
Security Accounting
anagement Management
Set Request {
changing of configu~ation. of Status 3 = 4
Ives delectation and isolauon index 3 = 3
MS constantly monitors and Create
data, 3 = Def (Data)
cl on severity of failures. The instance

Instance
(08 Marks) Response { Created
etween network management Status 3 = I
MP communication protocol index 3 =3
ment stations and management data, 3 = Def (Data)

17
VIII Sem, (CSE I ISE) Networ-k, M ~ CBCS - Veo2019 (
The creation of a row and deletion of a row arc significant new features in SMIV2.
There are 2 states for Row status. CreateAndGo and CreateAndwait. which are
operations. In createAndGo the manager sends a message to the agent to create a row 7. a. With neat diag
and make its a status active immediately. Fig below shown create and go operation.
The manager process initiates a set request PDU to create a conceptual row with the Ans. Refer Q.No. 7.a 0
values given for three columnar instances of the row. The value for the index column b. Explain with nea
is specified by VarBind index = 3. Ans. Refer Q.No. 5.c.
This is suffixed to the other two columnar objects in the next row to be created.
ii) Create and Wait Row creation.
Fig shows the o rational sequence to create a row using the create and wait tnethod. 8. a. Discuss the folio
Manager Agent i) ADSL Nehvor

..----t-------
Process
Sec Request l
Process Ans. i) ADSL Networ
ii) ADSL Fault M
Scaua-3 : 5
ladn•3 • J b. Discuss protocol I
Ans. Refer Q.No. 7.b. 0

GotRequeot (data 3)0

9. a. Discuss security
Ans. Security manageme
management. ltinvo
Set Request ( access to data store
through the network
-
dll>J DclDaia) :::::---;
Response(
Status- 3 • 2 mobile station. Secu
Index - 3 • DelDala) The IETF work gro
Sclltoqual( statement of lhe rut
11111113 • I ::::-----:,.
technology and info
Rc,poosc[ - An enterprise policy
- - Sta1u1-3 • I
into systems and ace
The manager and the agent are shown. The manager process sends a set request A basic guide for ge
POU to the agent process. the value for status is 5, which is to create and wait. I. Identify what you
The third columnar object expects a default value, which is not in the set-request 2. Determine what y
message. The agent process responds with a status value of 3, which is not ready. 3. De1ermine how Ii
The manager sends a get request to get the data for the row. The agent responds with 4. Implement measu
a nosuchlnstance message, indicating that the data vlue is missing. The manager 5. Review the proces
subsequently sends tho value for data and receives a response of not Inservice(2) The assets that needs
from the agent. The fourth and final enchange of message in the fig is to activate data, documentation,
the ro~ with a ~tatus value of I. With each message received from the manager, the classic threats are fro
agent either validates or sets the instance value on the managed entity. I or unauthorized dis
Denail of service is a
state in which it no I
b. Discuss Event corrc
Ans. Refer Q.No. 9.b. of J

JD
cscs -Veo2019 IJCM'\12020
[~e~==-
eAndwait. which are
Module- 4
7. a. With neat diagram, explain data over cable system reference architecture.
agent to create a row (08 Marks)
ate and go operation. Ans. Refer.Q.No. 7.a ofMQP- 2
nceptual row with the (08 Marks)
b. Explain wilh neat sketch brondbund access technology.
for the index column
Ans. Refer Q.No. 5.c. of June/ July 2019
w to be created. OR
8. a. Discuss the following:
ate and wait method. I) ADSL Network Munagemcnt ii) ADSL Fault Management (08 Marks)
Agent Ans. i) ADSL Network Management : Refer Q.No. 8.b. of June / July 2019
Process ii) ADSL Fault Management : Refer Q.No. 9.a. of June / July 2019
b. Discuss protocol layer in IIFC system. (08 Marks)
Ans. Refer Q.No. 7.b. of MQP - 2
Module- S
9. a. Discuss security Management. (08 Marks)
Ans. Security management is both a technical and an administrative consideration of info
management. It involves securing access to the network and info flowing in the network,
access to data stored in the network and manipulating the data stored and flowing
through the network. Another area of concern in secure communication is the use of
mobile station. Security management goes beyond the realm of SNMP management.
The IETF work group that generates RFC defined a security policy as "a formal
statement of the rules by v,hich people who are given access to an organization's
technology and info assets must abide".
An enterprise policy should address both access and security breaches. Illegal entry
into systems and accessing of networks must be protected against.
A basic guide for getting up policies and procedures includes following.
css sends a set request I. Identify what you arc trying to protect
h is to create and wait. 2. Determine what you are trying to protect
is not in the set-request 3. Determine how likely threats are
f 3 which is not ready. 4. Implement measures that will protect your assets in a cost efiective manner
Th~ agent responds with 5. Rc,ic\\ the process continuously and make improvements of weakness are found.
s missing. The manager The assets that needs to be protected should be listed including hardware, software,
nse of not lnservice(2} data, documentation, supplied and the people who have responsibility for them. The
in the fig is to activate classic threats are from unauthorized access to resource and/ or info, unintended and
from the manager, the I or unauthorized disclosure of info and devial of sen ice.
ed entity. Dena ii of service is a serious attack on network because the network is bought into a
state in \\hich it no longer carry legitimate user's data.
b. Discuss t'~ent correlation techniques. (08 Mar~)
Ans. Refer Q.No. 9.b. of June/July 2019

19
VIII Sem,(CSE ( ISE) Netwo-r-k- M ~

OR
10. a. Discuss the Following:
i) Fault Management ii) Performance MIIDagement. (08 Marks)
Ans. i) Fault Management : Refer Q.No. 10.b. of MQP - 2
ii) Performance Management : The purpose of network is to carry info, and
performance management is (data) traffic management. It involves data monitoring,
problem isolation, perfonnancc tuning • analysis of statistical data for recognising
trends and resource planning.
The parameters are throughput, response time, network availability an network
reliability. The metrics of these parameters depend on what when and where
measu:-ements are made. The parameters that affect network throughput are
bandwidth or capacity of the transmission media, its utilization, channel error rate,
peak load and average traffic load. The response time depends on the application.
Three types of metrics arc used to measure applications responsiveness application
availability, response time between user and server and burst frame is the rate at
which the requested data arrives at user station.
b. Explain configuration Management. (08 Marks)
Ans. Configuration management in network management is used in discovering
network topology, mapping the network and setting up configuration parameters in
management agents and management systems.
Network provisioning also called circuit provisioning is an automated process. A
trunk and a special service circuit are designed by application programs written in
operation system.
Planning and inventory systems are integrated with design systems to build an overall
SYST
system. Ex :- Trunk Integrated Record keeping system. Network provisioning in a
computer communication user packet switched paths, broad band communication
uses WAN communication.
Network management is based on knowledge of network topology. As a network
grows, shrinks or otherwise changes. network topology needs to be updated
automatically. Auto discovery can be done by using broadcasting on each segment
and following up with further SNMP queries to gather more details on system.
One efficient method is to look at the ARP cache in the local router. This process is
continued until the desired info is obtained on all IP addresses defined by the scope
(VIII S
of auto discovery procedure. A map showing network topology is presented by auto
discovery procedure after the address of the network entities have been discovered.
TI1e auto discovery procedure becomes more complex in virtual LAN configuration.
An efficient database system is an essential part of inventory management system.
Two of the systems that TRICKS uses are equipment inventory (El) and facilities
inventory (Fl). An object oriented relational database is helpful in configuration and
inventory management.

20
rktM~

As Per New VTU Syllabus w.e.f 2~15-16


(08 Marks)
Choice Based Credit System(CBCS)
is to carry info, and
olves data monitoring,
I data for recognising

ailability an network
iat when and where
~vork throughput are
on, channel error rate,
SUNSTAR·
ds on the application.
nsiveness application
t frame is the rate at

(08 Marks)
used in discovering
uration parameters in

automated process. A
011programs written in
SYSTEM MODELLING AND
terns to build an overall
vork provisioning in a
cl band communication
SIMULATION
opology. As a network
needs to be updated
sting on each segment
.ore details on system.
I router. This process is (VIII SEM. B.E. CSE / ISE)
es defined by the scope
gy is presented by auto
have been discovered.
ual LAN configuration.
management system.
tory (El) and facilities
fut in configuration and
SYLLABUS
SY ST EM MODE L LING AN D SIM U LATION E
(AS l'lil< Cl!OlCE BASED CREDl'I SY~ 1l Mt('(l(.-;l ~Ull·MF)
(EHEC'I IVE FROM TIil· ACADF\I.IIL 'i I ,\R 2lllh • 20171
SulJj«t C:od• 1;;('S833 I \ \l:trk, zo s
~:»1111 \ l:,rl,, 80 Time: 3 hrs.
SumlJ<•r of L,•,tun· Ilours/Week 03
Note : ,<1,,,.,.,er 0 11_,
rotal Xumbcr of L«1urr llou•·• ~o •.·u•m llou,·s 03

MODULE 1
Introduction: When simulation is the appropriate tool amt" hen it is not appropriate, Advantages l. a. With a neat d
and disadvantages of Simulation; Areas of application, Systems and system environment; Ans. Steps in simul
Components of a system; Discrete and continuous systems, Model of a system; Types of Models,
Discrete-Event System Simulation examples: Simulation of queuing systems.
G ene ra l Principles, S imulntio n Softw11re: Concepts in Discrete-Event Simulation. The Event-
Scheduling / Time-Advance Algori.thm, Manual simulation Using Event Scheduling

MODULE 2
Statistical Models in Simulation : Review of terminology and concepts. Useful statistical
models,Discrete distributions. Continuous distributions,Poisson process. Empirical distributions.
Queuing M od els : Characteristics of queuing systems,Qucuing notation.Long-run measures of
performance of queuing systems,Long-run measures of performance of queuing systems cont. ..
,Steady-state behavior of M /GI 1 queue, Networks of queues.
/

MODULE3
Random-Number Gcner:ttion: Properties of random numbers; Generation of pseudo-random
numbers, Techniques for generating random numbers, Tests for Random Numbers,
Ra ndom-V11ria te G eneratio n : Inverse transform technique Acceptance-Rejection technique.

MODULE4
Input Modeling: Data Collection; Identifying the distribution with data, Parameter estimation,
Goodness of Pit Tests, Fitting a non-stationnry Poisson process, Selecting input 1_nodels without
data, M11ltiv11riatt: and Time-Series input models.
F.stimation o f Absolu te Pc rfo rm11ncc: Types of si1111.lations with respect to output ·analysis
,Stochastic nature of output data, Measures of pcrforma1.cc and their estimation, Contd..

MODULE 5
Measures of performance and their estimation,Output analysis fur terminating simulations
Continucd..,Outpul analysis for steady-state simulations.
Verification, C:tlibr:ttion And V:1lidatio11: Optimiz:tlion: Model building, verification and
validation, Verification of simulation models, Verification of simulation models,Calibralion and
validation or models, Optimiwtion via Simulation.
.TION Eighth Semester B.E. Degree Examination,
C BCS - Model Question Pnper - I
SYSTEM MODELING AND SIMULATION
20
Time: 3 hrs. Max. Marks: 80
80
),. Note: A mwer a11y FIV£fi1/I t/lll!M/1111s, .1·e/ecti11,: ON£/11ll 1111e.1t/011/ro111em:/111wd11/e.
OJ

Module - 1
I . a. With a neat diagrum explain the steps in the simul11tion study. (08 Marks)
1 appropriate, Ad,antages
nd system environment; Ans. Steps in simulation study
i.ystem; Types of Models, Problem
formulation
stems.
Ill Simulation. The F,vent-
Selling of objectives and
nt Scheduling 2
ovcrnll project plan

nccpts. Useful Model Data


ss. Empirical distributions. conceptualization 4
collection
ion.Long-run measures of
of queuing systems cont.• ·

,cration of pseudo-random
m Numbers,
cc-Rejection technique.
No No

data, Parameter estimation.


>cting input 1J1odds without

respect to output analysis


cstilllation, Contd ..
Yes
or terminating simulations

I building, ,critication nn
tion models,Calibration an

12 Implementation
VIII Seffll (CSE/IS'f:) Syst"em, ~~ a-t'\.d,S~t,0-n,
I. Problem formulation 12. l mplementatlo ]
Every study should begin "ith a statement of the problem. The success of the
• If the statement is provided by the policymakers or those that have the problem. ha,e been performe
the analyst must ensure that the problem being described is clear!)' understood. I phase:- Consists
• lfa problem statement is being developed by the analyst, it is important that the • It is period of di
policymakers understand and agree with the fonnulation. • The analyst ma
2. Setting of ol.Jjcctives and overall project plan • Recalibration nn
The objectives indicate the questions to be anS\\ered by simulation. At this po_int. 11 phase:- Consist
a determination should be made concerning whether simulation is the appropriate • A continuing int
methodology for the problem as formulated and objectives as stated. • Exclusion of m
111 phase:- Consist
3. Model conceptuali1.ation .
The art of modeling is enhanced by an ability abstract the essential features of a • Conceives a thro
problem. 10 select and modify basic assumptions that characterize the S)i.tem. and • Discrete event s1
then enrich and elaborate the model until a useful approximation result~. • lhe o/p variabl
4. D11t11 collection statistical analys
There is a constant interplay between the construction o f the model and the collection IV phase:- Consists
of the needed input data. As the complexity of the model changes. the required data • Successful impl~
e lements can also change. successful compl
5. Model translation
The modeller must decide whether to program the model in a simulation language
I. b. Explain advantao
Ans. AdvunCuges of sim,
..
such as GPSS/ 11 or 10 use special purpose simulation sollware. • New policies,
6. Verified organ izational p
Verification pertains to the computer program prepared for the simulation model, operations of th
with complex models. it is difficult if not impossible to translate II model successfully • Ne,\ hardware
in its entirety without a good deal of debugging. tested without c
7. Validntcd • H) pothesis abou
Validation usually is achieved through the c11libration of the model, an iterative • ln<;ight can he o
process of comparing the model against actual system behaviour and using the the system.
discrepancies between the two and the insights gained, to improve the model. • Insight can be o
8. Exr1crlmcnt:tl design
• Bottleneck ana
l11e altc111ativcs that arc to be simulated must be dctenni111..'<I. The decision conceming \\hich information, m
altematives to simulate will be function of runs that ha\.c bl..-cn complcH.'<i and anal) sect. • A simulation st
9. Production runs nnd nnnlysis ' how individual
Production runs and their subseqm:nt analysis. arc used to estimate measures of Dis1ulv11111agcs
performance for the system designs that an: being simulated. • Model l>uildin
JO. More runs
and through e ·
Gi\ en the analysis of runs 111111 havi: bi:1:11 completed the analyst determines ,~hcthcr eom~tcnt indi
additional runs are needed and what design those additional experimentation ,hould the) \\ iII be th
follO\\ . • Simulation res
II. Documentation nntl ~porting essentially ran
There are two t)'pes of documentation program and progress. observation is
• Program documentation:- If the program is going lo be used again by the • S11nulation m
same or diffen:nt analysts, if could be ncccssa~ tu understand how the program • Simulation is
operates.
4
12. Imple mentation
The success of the implementation phase depends on how well the previous steps
at have the problem, have been performed. The simulation modeling can be broken into 4 phase
clearly understood. I phase:- Consists of step I and step 2
t is important that the • It is period of discovery/ orientation.
• The analyst may have to restart the process if it is fine tuned.
• Recalibration and classifications may occur in this phast: or another phase.
ulation . At this point, II phase:- Consist of step 3, 4, 5, 6 and 7
ion is the appropriate • A continuing interplay is requirt:d among the steps
• Exclusion of model user results in implications during implementation.
stated.
III phase:- Consists of step 8, 9 and I0
essential features of a • Conceives a through plan for experimenting
• Discrete event stochastic is a satisfaction experiment
·tcri1.e the system, and
• The o/p variables estimates that contain random error and therefore proper
tion results.
statistical analysis is required.
IV ph.1sc:- Consists of step 11 and 12
odcl and the collection
• Successful implementation ckpends on the involvement of user and every steps
nges, the required data
successful completion.
. .
I . b, Expl:iin advantages :rnd disadvant:igcs ofsimul:1tio11. (08 Marks)
n a simulation language Ans. Adv:intugcs of simulution
are. • New policies, operating procedures. decision n,lcs, information nows,
organizational procedures and so on can be explored without disrupting ongoing
r the silllulation lllodel, operations of the real system.
late a model successfully • New hardware designs, physical layout, transportation systems and so on can be
lt:sted without committing resources for their acquisition.
• Hypothesis about how or way certain phenomena occurs can be tested for feasibility.
the model, an iterative • Insight can be obtained about the interaction of variables to the performanct: of
behaviour anti using the the system.
mprove the model. · • Insight can be obtained about the inkraction of variables.
• Oottlcneck analysis can be performed indicating where work - in - process,
d1.-cision conceming which information, materials and so on are being exclusively delayed.
completed and analysed. • I\ simulation study can help in understnnding how the system operntcs rnther than
how individuals think the system operates.
10 estimate measures of Dismlvm1tages .
ed. • Model building requires special training it is an art that is learned over time
and through experience. Further more, If 2 mod~ls are constructed by different
alvst determines whether competent individuals they might have similarities but it is highly unlikely that
I ;ll.perimentation ~hould they will be the same.
• Simulation results can be different to interpret most simulation outputs arc
essentially random variables, so it can be hard to distinguish whether an
ss. observation is a result of syste111 interrelationships on of randomness.
to be used again by the • Simulation moclt:lling and analysis can be time con~uming a,nd expensive.
Jcrstand how the program • Simulation is used in some cases when an analytical solution is possiblt:.

5
VIII Se,w (CSE/ISE)
C13CS - Mo-de.iq~
OR Service lime dctcrnl
2. a. A small store bas only one check out counter customen arrive al the checkout
counter at random times that arc l to 8 minutes apart. Each possible value of
Inter-arrival time has the same probability of0.125 the sen•ice time vary from
J to 6 minutes, with probability show below.
s·ervice time minutes I 2 J 4 5 6
Probabilities 0.10 0.20 0.30 0.25 0.10 0.05
G iven the random digit distribution for customen arriv11ls us 64, 112, 678, 289,
87 1 11nd ror service times 84, 18, 06, 91 s1,,-qucatially.
Develop simul11tion table fqr 6 customcn. Find
(i) Average w:iiting time (ii) Average senice time (iii) Averuge time customer Custonirr Inter ~
spends in the system. (10 Marks) nrriv:il
Ans. Distribution of time between arriva ls
Time b/w Probability Cumulative Random digit I
lime
.
I
j
arrivals probability ;1,;signment 2 I 1
0.125 3
I 0. 125 001-125
4
I I
0.250 126-250 6
2 0.125
3 0.125 0.375 251-375
5 3 •
6 7

4 0.125 0.500 376-500
5 0. 125 0.625 501-625 I. Average waiti,ig tim
6 0.125 0.750 626-750
7 0.125 0.875 7S 1-87S
8 0.125 1.000 876-000
Service hme d1Str1buhon 2- Avcragcscrvice tim
Service time Probability Cumulative probability Random digit
I 0.10 0.10 01-10
2 0.20 0.30 11-30
3 0.30 0.60 31-60 3 · A vcrage time custon
4 0.25 0.85 61-85
s 0.10 0.95 86-95
6 0.005 1.00 96-00
lal er arrival time determln11Cloa 2. b. Explnin event sehc,
Customer s1rn11:.hoc 111 cluck ... 1
Random digits Time b/w arrivals
Ans. After lhc sysh!m snap
I - - advanced to simulmioi
2 064 I FEL and event is exec
3 112 I • At time 11 nc\\ full
4 678 6 arc scheduled hy
s 289 position on the FE
3
6 • A Her the nc\\ syst
871 7 to the time of then
6
SM11•+M f:.c"'M Su.Mv
Service time determination
·ve at the checkout Customer Random digit Service: lime
,b possible v1tlm! of I 84 4
ice time vary from 2 18 '.!
3 87 s
5 4 81 4
'. o io \ o~s ] 5 06 I
~s 64, ll2, 678, 289, 6 91 5
Simulation table for smgle channel queuing problem
Customer Inter Arrival Sen•ice Time Wniting Time Time lcJtl
ge time customer arrival lime timr servicr time in ~er, ice ruMomrr lime of
(10 Murks) lime Begins <111eue ends ~11e11ds ~yslem service
I . 0 4 0 () 4 4 0
tandom digit 2 I I 2 4 3 <, s 0
:issi):!.11mc11t 3 i 2 s 6 4 II \I 0
001-125 4 6 8 4 II 3 15 7 0
126-250 5 3 II I 15 4 16 5 ()

251 -375 6 187 s 18 0


.. 23 s 2
376-500 , rota! time customers wait in queue
I.Averngewaitingtirm:-----------"---
501-625 total numberof customers
626-750 = ~= 2.33 minutes
751-875 6
. . Totalscrvice time
876-000 2. Avcrageserv1ec t i m e - . - - - - - - - - -
total number of customers
Random digit 21 3.s minutes
:: -= .
01-10 6
. . Total time customers spend in thesystm
11-30 3. J\ vcragc t11nccustomerspen1 m syslcm = - - - - - - - - " - - - ~ - - ' - -
31-60 total number of customers
61-85 =~=S.83 minutes
86-95 6
96-00 2. b, Explain event scheduling / time advance algorithm by generating system
snapshot at clock = t nnd clock -= t 1• (06 Marks)
Ans. After the system snapshot at simulation time clock= t has been updated the clock is
rrivals advnnccd to simulation time clock= t 1 the imminent event notice is removed from the
FEI. and event is cxccull:d.
• /\t time t 1 new future events may or might not be generated but, if any arc, they
arc scheduled by creating e,ents notices and putting them in to their proper
position on the FEL.
• Aller the new system snapshot for time 11 has been updated the clock is advanced
lo the time of the nc,, imminent event is executed.
7
VIII Sem., ( CSE/ISE)
CBCS · l-,todel,,Q
• This process repeats until the simulation is over.
• The st;quence of actions that a simulator must perform 10 ad1..ance the clock
and build a new sys1e111 snapshot is called the event scheduling / time advance
algorithm. ls proportional to th
• The removal and addi1ion of events from FEL is in figure. The rncan and vnrian
Old system snaps hot :11 lime E(x) = II+
Clock System state Future event list 2
and
t (S. I, 6) (3, t 1)-Type 3 event to occur at lime t 1
(I, t:) -Type I event to occur at time t 2
(I, t,) -Type I even! lo occur al time 13 V(x) = (b-
l
The pdfand cdfwhe
(2, 1.)-Typc 2 event to occur at time t.. pdf
lb,)
Event / scheduling/ time advance ulgoritlam
Step I:- Remove the event notice for the imminent event (e, cnt 3, timc 4) from FEL.
0.2
Step 2:- Advance clock to imminent c,cnt time.
Step 3:- Execute inuni111.:nt event update system state. Change entity attributes and
set membership as needed.
Step 4:- Generate future events and place their event notice on FEL rank b} event 0. 1
time.
St1.:p 5:- Update eumulatiH: statistics and counters.
New system snapshot at time I I I 2 J
Clock System state ··•· · Future event lisl The uniform dis1ribu1·
t, (5, I, S) (I, t,)- Type I event to occur at time t distribut1:d bch,ecn 7.

(4, t•)-Type 4 event to occur at time 1i ii) Exponential disfr


(I, t 3 )-Type I event occur al I A random variable x I
n:
(2. 1..)-Type 2 event to occur time t "n iti. pdf i{s t:~n by
Module-2 f(x)= < • X
J. a. Expluin the following continuous distribution• 0, else
i) Uniform distribution ii) Exponeatial distribution. The density litnction •
Ans. I) Uniform distribution (08 Marks) l{x)

A ranido~ variable x is uniformly distributed on the interval (a, b) if its pelf is given by
f(x) j;':';• as; x s; b
0, Otherwise
Thecdf is given by
0, <a
X
o-----
F(x) ~ ~ - aSx<b The Exponential dis
{ b-a arrival:; arc compleh:
I. In these instances. A. i
di~lribution has mea
8
~ s~ W t ' \ I
X. - X1
to advance the clock P(x, <x <x:) = F( x:)-F ( x1) = b-a
~duling / time advance ls proportional to the length of the interval for all x, and x~satisfying a=:x, <x;::b.
The mean and variance of the distribution are given by.
E(x) = n+b
2
and
cur at time 11
ccur at time t 2 V{x)= (b-a)'
ccur at time 13 12
The pdf and cdf when a= I and b= 6 are shown below
pdf cdf
\Ccur at time t__ !lx) l~x)

1.0
cnt 3, time 4) from f"EL. 0.2
0.8
hge entity attributes and 0.6
0.1 0.4
e 011 fEL rank by eve 0.2
X X
I 2 3 -1 5 6 7 8 I 2 3 4 5 c, 7 8
The uniform distribution plays a vital role in simulation. Random numbers, uniformly
distributed between zero and one, provide the means to generate random events.
nl to occur at time 12 ii) F.xponcntial distribution
nt to occur at time t* A random variable x is a said to be exponentially distributed with parameter A..>0 if
mt occur at t3 its pdf is given by
nt to occur at time t. f(x ) ={A,.-'·', x ;,: O
0, else where
The density function is shown below
(OS Ma l{x) f(x)

(a, b) if its pdf is give,

() X O :\

The Exponential distribution has been used 10 model inter arrival times when
arrivals arc completely random and to model service times that arc highly variable.
In these instances, A is a rate, arrivals per hour or servic..: per minut<:. Tht: exponential
distribution has mean and variance is given by

9
VIII Sem, (CSE/ISE) sy~em, ~ ~ ~ s~ t m ' \ I cacs - Modei Q
I I The population
E ( x) = °i and V( x) = },: assumed to be
Thus, the mean and standard deviation are equal. The cdf can be exhibited by For example. c
integrating equation to obtain of time a Iliac
rerno, es the tir
o. x <0
customers. 11 h
F(x)- '
p,i:-..,clt . ., I -c-••. c!: 0 server who ser
1
11.
finite .ind consi
In syslt:111 with
3. b. A production process mnnufactures computer chips on the average at 2% non assumed to be ;
confo rmi'lg· Everyday, 11 rumlom sum pie or size 50 is taken rrom the process. Ir 2. System capn
the sample contnins more than two non conforming chips, the process will be In many qut:11in
stopped. Compute the probability that the process is stopped by the sampling the 11aiting line
scheme and also compute mean and variance. (08 Marks) For c.,ample. an
Ans. Consider the sampling process ns n=50 Bernoull i trnils, ench with p=0.02, then to enter the mcc
the total number of non conforming chips in the sample. x, would have a binomial customer who (i
distribution given by population.

fC
0 3. The Arrirnl I
P(x) = }o.02)' (0.98)50-'. x =0,1,2 .... 50 The anfrnl proc
1 0, Otherwise
of inter armal
time\ or at rand
It is much easier to compute the right - hand side of the following identify to compute charactcri✓ed by
the probability that more than two non conforming chips 11re found in a sample. a time or in bate!
P(x c!:2}=1- P (:-. s2) 4. Queue bcl111\'i
The probability P(x ~ 2) is calculated from Queue bt:haviou
service to begin.
P(x s 2} = L:,, (-o) (0.02)'(0.98t°-x will balk. renege.
Queue discipline
••O X
\\hich customer,
= (0.98r +50(0.02)(0.98}49 + l225(0.C12f (0.98}48 5. Sen ·icc times ,
=0.92 The !.en ice ti111e
Thus, the probability that the production process is stopped on any day, based on the be constant or o
sampling process, is approximately 0.08. The mean number of non confonning chips characteri/ed as
in a random sample of sizt: 50 is given by variahlcs. ·1he ci
E(x) = np = 50(0.0'.!) = I distributions hav
situations.
and the variance is given by
4. h. Lis t the differcn
V{x) =npq =50(0.02)(0.98) = 0.98 Ans. Rccogni£ing the
for parallel scrv
OR
4. •· Explnin the clrnrnctc ristics of queuing system. of this conventio
(08 Marks)
Ans. Charncteristics of queuing system following system
I. T he c:111ing popul:1tio11 A represents the 1
0 represents the ~

10
cacs - Model,Q,-t.e~t"i,owPet.pu- - 1
Tho.: popula1iu11 of potential customers. referred to as the calling population. may be
assun11::d 10 be finite or infinite.
For example. consider a pank or five machines that are curing tires. Atier an interval
df can be exhibited by or 1i11;e a machine automatically opens nnd must be aucnclt:d b) a \\Orker who
removes the tire nnd puts an uncured tire into the machint:. The machines arc the
customers. who arrive at the instant they automatically open. The worker is the
server who serves an open machine as soon as possible. The calling population is
finite <1nd consists of five machines.
In systcni with a large population of potential customers, the calling is usually
the average at 2% non assumed to be infinite.
en from the process. If 2. System cap:1city
ips, the process will be In many queuing systems. there is limit to the number of customers that may be in
the waiting line or system.
topped by the snmpling
(08 Marks) For example. an au10111mic car wash might have room for only IOcars to wait in line
each with p=0.02, then lo enter the 1m:chanism. II may be illegal for cars to wait in the street. An arriving
, would have a binomial customer who finds \ht: system full docs not enter but immediately to the calling
' population.
3. The J\rrirnl (lroccss
The arrival process for i111i11itc population modt:ls is usually charncterizecl in terms
or inter arrival times or successive customers. Arrivab ma) occur at scheduled
times or at random limes. when al random 1i11ics, the inter arrival times are usually
owing identify to compute
a
characterized by probability distribution. In addition. customers may arrive one of
a time or in batches. The batch may be of constant size or of random size.
re found in a sample.
4. Queue behaviour nnd Queue discipline
Queue behnviour refers to the actions of customers while in a queue waiting for
service to begin. In some situations, there is a possibility that incoming customers
will balk, renege.
Queue discipline refers 10 the logical ordering ofcustomers in a c1ucuc and determines
which customer will be chosen for service when a server becomes free.
5. Sen'icc times and the service nH•chanism
The service times of suc~essivc <1rrivals nrc denoted by S,. S:. S,... They may
be constant or of random duration. In the laucr case, [S,. S:, S3..... J is usually
ed on any day, based on the
characteri;,ccl as a sequence of inilependent and identically distributed random
r of non conforming chips
variables. The exponential. weibull. ga111111:1. log normal and truncated normal
distributions have all been used successfully as 111odcls of service limes in different
situations.
4. b. List the different <1ueuing notations using in queuing systems. • (08 Marks)
Ans. Recognizing the diversity of queuing systcms,Kendall proposed a notational system
for pnrallel server systems which has been' widely adopted. An abridged version
(08 Marks) of this convention is based on the format A/8/C/N/K. These letters represent the
fol lowing system chnractcristics.
A represents the inter arrival time distribution.
B represents the service time distribution.

11
VIII Semt (CSE/ISE) Sy~M~~S~ cacs - Modei Q
C repn:sents the number of parallel servers.
N represents the system capacity. o• = mnx I s i s N
K represents the size of the calling population.
Common symbols for A and B include M(exponential)
D(constant or deterministic), Ek (Erlang of order k) o- ""max I S i s
PH(phase - type). H (hyper exponential). G(arbitrary or gcneral) and Gl(general
indepcndent) Step 3:- Compute
Queuing notation for parallel server systems Step 4:- The criti
r n- steady state probability of having n customers in the system. sample: size N.
P.(t)- Probability of n customers in system at time t Step 5:- If D>Da,
,.,,_ Arrival rate Problem
).,, - Effective arrival rate
µ-• Service rate of one server
t- Server utilization
A - Inter arrival time between customers n-1 and n I
S: Service time of the n'11 arriving customer 2
W Total time spent in the system by the n'h arriving customer
11
-
3
W:- total timc spent in the waiting line by customer n
L(t)- The number of customers in Queue of time I 4
L"(t) - TI1e number of customers in queue at time t 5
L- Long run time average number of customers in the system. Compute D= max
Lv· Long run time average number of customers in Queue D= max (0.26, 0.2
W- Long run average time spent in system per customer D=0.26
W v· Long nm average time spent in Queue per customer. Check = D>Da, j
Module-3 0.26<0.S6S
The hypothesis is
S. a. Explain steps in the kolmorgo,· - simronv test. The sequence of random numbers
0.44, 0.81, 0.14, 0.05, 0.93 arc generated using kolmorgo,·- simrno,, test with 5. b. Wlrnt is nccepl:1
a ...0.05 td determine if the hypothesis that he numben arc 011 the interval (0.1) ~ can cx=0.2. The
can be rcjccteil take Du.=0565. (08 Murks)
Ans. Kolmogorov- simrnov k st An1. In acce1>t:111ce rcj
This test compares the F(x) of the uniform distribution with the t:mpricnl SN(K) of the In thi:. 111c1hod a ra
sample of N observation. only then the rand
Problem
F(x)= x. 0:::: x :::I
If the sample from the random number gcnernlor is R 1• R•..... R, tht:n the cmprical Step I:- set n=O.
S"(x) is defined by · Step 2:- R, =-0.43
Step 3:- Since P-
S,i (x )- Number of R,. R~~.R~ Whichare:s: x
Stcp I:- Set n=0.
Stcp2:- R,=0.414
Step I:- Rank the data from smallest to largest Step 3:- Since P~
R,:::: R~:::.......RN Step I:- Set n=O.
Step 2:- Compute Step 2:- R,•0.83
Step 3:- Since p -
Step 1:- R,=0.99
riis~
D' =mm.Isis N { ~- R}
D" = ma:-. I s i S N { R, - ( i ~ 1) }
eral) and Gl(gem:ral
Step 3:- Compute D- max (D·, D·)
Step 4:- fhc critical value, Da for the specified significance level a, and the given
m. sample size N.
Step 5:- If D>Da, reject 11,. otherwise accept H...
Problem
• I)+

I R I
i/N -i-N 1 -NI -R,
R o-_ (i-1)
I N
I 0.05 0.20 0 0.15 0.05
2 0.14 0.40 0.20 0.26 -
.r 3 0.44 0.60 0.40 0.16 0.04
4 0.81 0.80 0.60 - 0.21
5 0.93 I 0.80 0.07 0.13
Compute D= max (D", D·)
D"' max (0.26, 0.21)
D=0.26
Check = D>Da , if reject
0.26<0.565
The hypothesis is accepte<l because D<Da.
eofrnndom numbers S. b. What is 11ccept11ncc rejection technique? Generate three Pois.,on variates with
,._ simrno,· test with
~c11n a=-0.2. The random numbers are 0.4357, 0.4146, 0.8353, 0.9952 and 0.8004.
on the interval (0.1) (08 Marks)
(08 Mnrks)
Ans. In accc1>t:mcc rejection technique
In this mc1hod a random number is generated and ifit satisfies the particular condition
e emprical S,(x) of the
only then the random variable is generated.
Problem
Step I:- sc::t n"'"O. P= I .
... R~ 1hen the empncal
Step2:- R1=0.4357, P= I. Rl)! 1=1•0.4357=0.4357
Step 3:- Since P=0.4357<0.8187, then accept N=0.
S1cp I :- Set n= O. P-= I
Step 2:- R1=0.4146, P=I. R0 • 1=1•0.4146=0.4146
Step 3:- Since P=0.4 I40<0.8187, then accept N=0
Step I:- Set n=0. P-= I
Step 2:- R,=0.8353. P= l *R0, 1= I *0.8353=0.8353
Step 3:- Since P=0.8353 >c"" then Reject n=I and return to step 2
Slcp 2:- R2=0.9952. P- R,. R,=0.8313

SM,+At e.cAM ~1\1\U' 13


VIII Sem, ( CSE/ISE) sy~e,m, 1 - , i ~~ si.mulat'LOY\t
Step 3:- Since P=0.831 3 2!:e"' then reject return to ste p 2 with n=2
Step 2:- R1:0.ll004, P=R,, R3=0.6654 =0.6654 I
Step 3:- Since P=0.6654<0.81 87 then accept N=2 E(x,) =i
n Rn+I p Accept/ Reject Result I •
0 0.4357 0.4357 So- •s the mean i
Accept )\.tj) A.
0 0.4146 0.41-t6 Accept l\ --0 • The goal here i
0 0.8353 0.8353 Reject an exponential,
I 0.9952 0.8313 Reject • The inve rse Ira
"hen cdf, F(x)
I 0.800-1 0.6654 Accept N=2
• Step h} step p
OR Slep I:- Compute
6. a. What :arc pseutlo random numbus'! What i1rc lhe problem th:1t occur "hilc distribution. the cd
gcncrntin:.! pseudo random numben;? (06 Marks) Step 2:- Set F(x) •
Ans. Pseudo means false. so fobe random numbers arc being. generated 1-e·'-=R on the ra
While g.cncrating pseudo random numbers. cenain problems o r errors can occur. called R. R has a 4
These errors from ideal randomness are related lo the properties of random numbers. Step 3:- Solve the
Errors include the follo\\ ing
1-c-,.. = R
I. The generated random number!> might not be uniformly distributed. e-" =1- R
2. The gcnerntcd numbers might be discrete valued instead of continuous \.alued.
-h =ln( I -R)
3. The mean or, :iriancc of the generated numbers mighl be too high or too lo\\.
4. Then: mig.ht be dependence I
X=-,:-ln{I-R)-
(a) Autocorrdation bet\\Ct:11 numbers.
(b) Succcssi,cl) higher or lm,cr than adjacent numbers. Variate generator
(c) Sc, cral numbers abo,c the mean or follo"ed by several numbers below the mean. x=f·1( R) Gencrau
6. b. Explnin im erse Ir.ms form kchnique of protlucing random v:1ri:1tes for Step 4:- Gencratl
i) faponenliul distribulion random variates
ii) Uniform dis tribulio n. (10 Marks) x, = r'(R.)
Aas. i) Exponcnli:il dis lrilrnlion forexponentiall;;
The c:-..poncnrial distribution has pelf
f(x) ={:...c·'·', x~O Equation (l)so
0. x <0 I
X =-- ln(l - ll
I A
The exponential d i!>tribution has cdf
For i= I. 2. 3 .....
F(x) = Jf(t)df 1- R, byR
_,
={1- c.-h• X ~ 0 x = -~ln(R)
' A ,
0 x<O
This alternative
The A means number of occurrence per lime unit. on (0, I).
Ex:- If inlcr arrival times , 1• x:. xi hatl an exponential dislribution with the rate}., ii) Uniform di•
A means number of arri, als per lime unit or the arrival race, Consider a ra

14
~~
h n=2

Result So_!_ is the mean inter arrival time


N=O "-
N- 0
• The goal here is to develop a procedure for generating values x,. x,. x, that have
an exponential distribution.
• The inverse transform technique is used for any distribution but it is most useful
when cdf. F(x) is of a form so simple that its inverse F· 1• can be computed easily.
N=2 • Step by step procedure for inverse transform technique.
Step I:- Compute the cdf' of the desired random vllriable x. For the exponential
distribution. the cd f' is F(x) = 1-e·'·". :QO. ·
o ble m that occur while Step 2:- Set F(x) = Ron the range of x for the exponential d,istribution. it becomes
(06 Marks) · 1-e·;.,=R on the range X::'..0, x is a random variable, So 1-e·'·' is also a random variable
•neratcd. called R. R has a uniform distribution over the interval (0, I).
ems or errors can occur. Step 3:- Solve the equation F(x)= R for x in terms of R
rties of random numbers. 1-c-'·' = R
e-'-' = I - R
distributed.
t of continuous valm.:<l. -11.x=ln{I-R)
e too high or too low.
x::: _ _!_ln(l - R) ➔ (I) is random
"-
Variate generator for the exponential distribution in general equation I is written as
1numbers below the mean. .x=F''(R) Generating a sequence of values is done through Step 4
Step 4:- Generating uniform random numbers R,. R~. R3 and compute the desired
d om varia tes for random variates by
x, = F-1 {R,)
( 10 Marks)
for exponential case, r-1
( R) = _ _!_In (I - R)
i,,.
Equation (l)so

X =-_!_ln(l-R ) ➔ (2)
, "- '
For i= I, 2. 3...... one simplification that is usually in equation (2) is to replace
1-R, by R

x =-..!_ ln(R)
' A '
This alternative is justified by the fe<a:I thnt both R, and I- R, are uniformly distributed
on (0, I).
stribution with the rntc 11., ii) Uniform dis tri bution
Ile. Consider n random variable x that is uniformly distributed on the interval (a. b)

15
VIII Sem, ( CS'f(IS'f) CBCS · Mod.el,Q
Step I:- Find the cdf F{x) i.e., 2. Label the horiz
0, X <a
3. Find the frequc,
x-a 4. Label the vcrtic
F(x) = - - , as:xsb
{ b -a 5. Plot the freque
I, x>b 3. Selcc1i11g lhc
After cons1ructinij
Step 2:- Set upper part Off he rl
. X -11 Depending on thu
F( x ) = R 1.e.,--=R
b-a
Step 3:- Get
1
X = f- (R)

. x-11 R
I.C.• - - =
b-u
x-a = R(b - a)

Ix =a+ R(b - a)I


Which is the random variate generator for u(a, bl
Module-4
7. a. Ex1>luin the diffcnmt steps in the development ofn useful model of input dnln.
(10 Marks)
Ans. 1. Duin collection
• Suggestions might enhance and facilitate data collection.
• I\ useful expenditure of time is in planning this could begin b) practice or pr1:
observing session. Try to collect data while pre-observing Ex
• Tl) to analyse the data as lhcy arc being collected checkout \\hcther the data Another method
being collecti;d arc adequate to provide tin: distributions .needed as input to the distribution.
simulation. • Poison:- Num
• Try to combine homogeneous data sets, check data for homogeneity in successive • F:xponential:-
time periods an during the same time period on successive days. • Normnl:- Son
• Be aware of the possibility of data censoring in "hich a quantit) of interest is not • Discrete / co
observed in its entirely. is no data.
• To discover whether there is a rclntionship between two , ariables. • Gammn:- /\n
• Consider the possibility that a sequence of obsen :nions that nppcar to be variables.
independent actually has autocorrelation. • Emi>irical :-
• Auto correlation can exist in successive time periods or for successive customers. • Beta :- An
2. ldcnti(,·ine the distribution with data variables.
After collecting the data the appropriate distribution followed by the data should be 4. P11ntmclcr es
identified. l-listogrnm is used for this purpose Once thti distrib
The pammcters
Histogram proccdu1'\:
I. Divide th!! range of the data into intervals of equal width.

16
2. Label the horizontal axis to conform to the inten ab ~eh:ctcd.
3. find the frequencies for each class interval.
4. Label the vertical axis so that the total occum:nces can be plotted for each interval.
5. Plot the frequencies on the vertical axis.
3. Selecting the distribution
After constructing a histogram a smooth curve passing through mid points from the
upper part of Jhc rectangle is drawn. lhis cun e 1s called the frequency curve.
Depending on the shape of the curve. appropriate guess to the distribution is made.

Poisson Uniform

ul model or input dnta.


(10 Marks)

n.
begin by practice or pre
Exponential Normal
ing Another methud to identify the distribution is to use the physical basis of the
hcckout whether the data
ns .needed as input to the distribution.
• Poison:- Number of independent events that occur in a lhed nmount of time.
homogeneity in successive • Exponential:- Time between independent periods
• Normal:- Some of component processes
I
ssive days.
a quantity of interest is not • Discrete/ continuous uniform:- Complete uncertainty cnn be used where there
is no dnta.
vo variables. • Gamm:1:- An extremely flexible distribution used to model non negative random
ations that appear to be vnriables.
• Em1,iric11l :- Re samples from the actual data collected.
r for successive customers. • Octa :- An extremely flexible distribution used to model bounded random
variablt:s.
owed by the data should be 4. Parameter estimation
Once the:: distribution is identified the parnmetcr are tu be idenriflcd and estimated.
The paramc::tcrs can be eliminated by using sample means and variance denoted by
chh.

17
V III Set>11 ( CSE/I SE) CBCS · lvt~Que.!
~ and S: s :

-1 t:i (unfroupeddnta)
6
7 .
x= I x,f, (grouped data) 8
9
~

''
, n
10
( ungrouped data) 11
!
s2 = I
,l
(grouped data) x: - I(o, - E.)
E,
Distribution Parameter Suggested estimator X ~ > .\ 2U, J.. - 5 -
Poison a
x=x
- 27.68 >3.6-t
:. Reject H) poth
Expotential
" A.==
I
l\
8. a. Brien,- explain c
Gamma p,e I Ans.
0==
X • To understnn
bc1wcen meas
Normal µ,o
~I= X
- • One way to rr
prediction inti
a' =S2 • Both confide
being produce
7. b. Apply chi- square goodness of fil lest to poison assumption nith a=3.64, data
Suppose that mo
size=IO0. Observed freq uency is 0i: 12 10 19 17 10 8 7 5 5 J 3 I ta ke le,cl of
unknown.
significance a=0.05 and critical , ·alue I I.I. Expected value should be greater
To lel J be lhe ~
tha n five. (06 l\.1nrks)
simulation
Ans. Therefore. 8 is t
of the average CJ
• lf gool is toe
lfwc arc plannir
(o,-E,}:
X
'
( or o x.( P(x) .
I.:. 0 -
I
I.:. '
I::.,
8 is a relevant p
• A confidenc~
• I I{
0 12 0 0.0262 :?.62 9.85 7.'1.7 s· =- L ('
R-1 .. 1
I 10 10 0.0955 9.55 - -
be the !>ample
2 19 38 0.1739 17.36 1.65 0.156
3 17 51 0.2110 21.10 -4.1 0.796
4 10 40 0.1920 19.16 -9.16 4.379

18
5 8 40 0.1397 13.9 -5.9 2.50
6 7 42 0.084 8.46 -1.46 0.25 1
7 5 35 0.044 4.4 . .
8 5 40 0.0200 2.0 . .
9 3 27 0.0080 0.8 9.51 I 1.062
10 3 30 0.0029 0.3 . .
II I 11 0.00097 0.09 . .
100 364 100 27.68
,,

x 1 > -'- 1a,k - 5-I


27.68 > 3.64
:. Reject Hypothesis(! lo)
OR
8, 11. Briefly expl11in confidence interval estimation. (10 Marks)
Ans.
• To understand confidence intervals, it is important to understand the difference
between measure of' error and measure of risk.
• One way to make the differ<>nce clear is to contrast n confidence intervnl with n
prediction interval.
• Both confidence and pn.:diction intervals arc based on the premise that the data
being produced by the simulation is represented well by a probnbility model.
tlon with o.-3.64, data Suppose that model is the normal distribution with mcon O nnd variancc o2. bolh
'J 5 5 3 3 1 take leYcl of unknown.
~alue should be greater To let J be the avemg.e cycle time for parts produced on the j•h replication of the
(06 Marks) simulntion.
Therefore. 0 is the mathematical expectation of Yi and cr is the day to day variation
ofthc average cycle time.
• If goal is to estimate 0
1
lf"c nrc plmrning to be in business for long time. producing parts da) afler day, then
(o, - E.) 0 is a relevant parnmeter, because it is the long run mean daily cycle time.
o-E • A confidence intervnl is a measure of the error let
I •
E,
I K
s~= - L(Y, -v ... J
- '

9.85 7.117
R-1 ,.1
be the sample varinncc ncrosc; the R replications.
1.65 0.156
-4.1 0.796
-9. 16 4.379

19
VIII SeM11 ( CSE(ISE) Sy~~~ CM'\a1S~t.0nt CBCS · ~odeltQ
• · The usual confidence interval. \\hich assumes the Y, are normally distributed. is
-Y... ± tet / 2. R - I r.:-
s 9. a. Explain the replici\
"R Ans. Replication melho
Where to./2, R-1 is the quantile of the t distribution "ith R-1 degrees of freedom that • If initialization
cuts off u/2 of the area of each tail. level, then the 111
• A prediction interval is designed to be wide enough to contain the actual average
estimator variabi
cycle time on any particular day with high probability.
• The basic idea is
• A prediction interval is a measure of risk a confidence interval is a measure of
one the same wa
error.
• If significant bia
• The normal - theory prediction intcn al is
arc used to redu
Y... ± to. / 2, R-1 S✓l + I~ can be misleadin
• This happens be
The length of this interval will not go ofO as R increases. In the limit it becomes affected only hy
0 ± Za./2a • Thus, incre:ising
to rcOcct the fact that, no matter ho,, much \\I! simulate. our daily average cycle time intervals around
sti II varies. in,csting the init
• If the simulatio
8, b. Prescribe Quantiles in output unalysis for terminating simulations. (06 M:irks) observations in a
Ans. • The basic raw 0
• To present the interval estimator for quantitics,it is helpful to review the interval • Each y,, is derive
estimator for a mean in the special case when the mean represents a proportion ,,
Cusc I :- y is an in
or probability P. could be the delay 0
• When the number of independent replications Y ,......Y k is large enough that to./2, Case 2:- )' is a ba
n-1 =Za/2. the confidence interval for a probabilil) Pis often written as obscr\'alio,;~.

P±Zo. / 2
. ( )
P 1-P
,,
Case 3:- y is a bat
• When using the
sample for the p
R-1
- 1 "
Where I' is the sample proportion. y,.(n.d) = -
n- d d
• The quantile - estimation problem is the inverse for the probability - estimation
problem. find O such that Pr [Y S Ol=P. Thus IO estimate the P quantile we find as the sample mca,
the value O such that I 00% of the data in a histogram of Y is to the lefl ofO. different random n
• Extending this idea. an appro:-.imatc ( 1-u) I()0%. Confidence intc:rval for O can initial conditions (I
be obtained b) finding two values: Ot that cuts of I00 pt % of the histogram and y,.(n,d), .......)R·<-'
Ou that cuts IOOpn % of the histogram. where. are ind1:pcndent an
P(l - P) random sample fro
Pt -= P - Za. / 2
R-1 8n.d -E[y,.(n,d)
The overall point e
- I K
y ...(n.d) ~-
R ,.,
L

20
lly distributed. is Module- 5
9. a. Explain the replicntion method for steady state simulation. (10 Marks)
Ans. Replication method for steady state simulation
es of freedom that • If initialization bitts in the point estimator has been reduced to H negligible
level, then the method of independent replications can be used to estimate point
the actual average estimator variability and to construct a confidence interval.
• The basic idea is simple make R replications initializing and deleting from each
al is a measure of one the same way.
• If significant bias remarks in the point estimation and large number of replication
arc used to reduce point estimator variability, the resulting confidence interval
can be misleading.
• This happens because bias is not affected by the number of replications (R), it is
affected only by deleting more data or extending the length of each run.
limit it becomes • Thus, increasing the number of replications (R) could produce shorter confidence
intc.rvals around the wrong point. Tlum:fore. it is important to do a through job of
average cycle time investing the initial condition bias.
• If the simulation analyst decides to delete t observations of the total of n
ations. (06 Mnrks) observations in a replications then the point estimator of0 is Y. .. (11, d)
• The basic raw output data, {Y,i' r= I ....R, j= I ... n}
review the interval • Each yri is derived in one of the following ways
resents a proportion Case 1:- Y,i is an individual observation from within replication r, for example. Y,,
could be the delay of customer in a queue or the response time to job j in a job shop.
rge enough that to.J2, Case 2:- y,, is a batch mean from within replication r of nurnber of discrete time
written as observations.
Case 3:- y is a batch mean of a continuous time process over time interval j.
• When ~:sing the replication method, each replication is regarded as a single
sample for the purpose of estimating 0. for replication r, define
- I "
y,.(n.d) = - L
n - d ,-<1 H
Yri
obability • estimation
he P quantile we find as the sample mean of all observation in replication r. l-3ccausll all replications use
different random number streams and all initialized at time O by the same set of
is to the left of 0.
nee interval for 0 can initial conditions (IJ the replication averages.
o of the histogram and Y, ,YR.(
.(n.d ) ....... n,d)
are independent and identically distributed random variables that is. they constitute a
random sample from some underlying population having unknown mean
0n.d = E[y, .( n,d)]
The overall point estimator, is also given by
- l~-
y...(n.d) = - L., Y,.(n. d)
R r, I

21
VIII Set111 ( CSE/I S£) CBCS • ModeiQ
Thus, it follows that
E[y....(n,d)] =8n,d
also ,if d and n are chosen sulliciently larl!,e. then On, d ~- and y...(n,d) is an
approximately unbiased estimator of0. The bias in y..(n,d) is 0n,d -0.
9. b. Explain output analysis for stc:1dy state simulation. (06 Marks)
Ans. Consider a single run of a simulation model whose purpose is to estimate a steady
state or lone run characteristic of the system. Suppose that the single run produces
observ11tions y1, y:""' .... hich generating an: samples ofan autocorrelated time series.
The steady state measure of performance, 0 is defined by
I "
0 = lim 11 ➔ OO-LY,
n ,-1
With probability I, \\ here the , aluc of9 is independent of the initial conditions.
For t:xample ,if Y, was the time customer spent taking to an operator, the Owould be • The second st
the long-run average time a customer spends talking to an operator and because 8 is which include
defined as limit, it is independent lo the call confers conditions at time 0. Similarly system. ilp pa
the steady state performance for a continuous time output measure (y (I), t~O} , such • The third step i
as the number of customer in the call centers hold queue is defined as The model builde
♦ =limT1 ➔ . . !. ._ Jy(t)dt verifyi ng and vali
Verification:- Thi:
Tc o
is reOccted accura
With probability I .
The sample size n(or Tr) is a design choice. it is not inherently determined by the verification pmccs
• I lave the corn
nature of the problem. The simulation analyst\\ ill choose simulation run length with
developer.
several consideration in mind.
• Name a flow di
I. Any bias in the point estimator that is due to artificial or arbitrary conditions.
• Closely examin
2. The desired precision of the poi_nt estimator, as measured by the ~tandard error or
• Print the input p
confidence interval half width.
value are not ch
3. Budget constraints on computer resources.
• Mal-.e the com pt
OR • If the compute,
10. a. Exph1in lhe ncnl diagram model building verification mul validalion. (08 Mi1rks) animation imita
Ans. • Have a graphic,
• The first step in model building consists of observing the real system and • It simplifies the
interactions among it various components and collecting data on its behaviour Valid:1tion:- Verifi
also, persons familiar with system or sub system should be questioned to take validation is ovcrnl
advantage, their special knowledge. and ii~ behaviour.
For example operators, supervisors. h.-chnicians etc arc also questioned to gain more Calibration is the i
knowledge about the system. As model de\elopment proceeds new questions may system making ad
arise and model dcvclopers may have to return to this step. making additional

22
Real system

Calibration & validation Conceptual validation


,a and ) ...(n,d) is an
~~.d -8. Conceptual model
(06 Marks) I) Assumptions on system components
is to estimate a steady 2) Structural assumption
l,c single run produ~es 3) Data assumption
ocorrelated time series. 4) Input parameters

Verification

Operational model
c initial conditions.
pemtor, the Owould be • The second step in model building is the construction of a conceptual model.
perator and beca_us~ 0 is ,, hich includes collection of assumptions on a system components, structure of a
ions at time 0. S1m1larly system. i/p parameters, data assumption etc..
easure lY (t), t~O} • such • The third step is the translation of the conceptual model into operational model.
efined as The model builder has to return lo each of these steps many times while building.
verifying and validating the model.
Verification:- Thi: purpose of vcrifica1ion is to cnsun: that tht: conceptual model
is re0ected accurately in the computcri,ed reprc<,t:ntation, some suggestions in the
verification process arc,
rently dctermin~d by the • I lave the computerized reprcs1:n1a1ion checl-.ccl by someone other than its
1mulation run llrngth "ith developer.
• Name a flow diagram of the process.
arbitrary conditions. • Closely examine the model output tor reasonableness under a variety of inputs.
by the standard error or • Print the input paramett:rs at lhc i:nd ofsimulation ofachicvc that the i/p parameter
value arc not changcd.
• Make the computcrizt:d representation as self documentary as possible.
• If the computerized representation is animated verify that what is seen in the
d v11lidntion. (08 M11rks) animation imitates the actual system.
• Have a graphical representation of the model.
ing the real system and • It simplifies the task of model understanding trace the model.
ing data on its behaviour Val_id:1~io1~:- Verification and validaiion are conducted simultaneously by the model
Id be questioned to take validation 1s overall process of comparing the model and its behaviour to real system
and its behaviour.
questioned to gain more Calibratio11 is the itera1ive process of comparing the model and its behaviour to real
ccds new questions may sysl~m ma~i~g adju~lment to the model, comparing the revised model to reality
making aclcl111onal adJustment comparing again and again.

23
VIII Se.m, (CSE/ISf) Sy~ M ~ ~s~
10. b.Which are the common suggestions can be given for use in the , ·erification
process. (08 Marks) Ei
Ans. I. Have the operational model checked by someone other than its developer,
prefcmbl> an expert in the simulation software being used. SYS
2. Make a now diagram that includes each logically possible action a system can take Time: J hrs.
v. hen an event occurs and folio" the model logic for each action for each event type. Note : .-111.1 wer ""Y
3. Closely examine the model output for reasonablcne!>s under a variety of settings
of the input parameters.
4. Have the operational model print the input parameters at the end of the simulation, I. a. Lisi any five ci
to be sure that these parameter values have not been changed inadvertently. when ii is not.
5. Make the operational model as self documenting as possible. Ans. When the simul
6. If the operational model is animated, verify that what is seen in the animated I Simulation en
imitates the actual system. of a comple:-. sy
7. The interactive run controller (IRC) or debugger is an essential component of 2. Informational
successful simulation model building. the effect of the
8. Graphical interfaces are recommended for accomplishing verification and 3. ll1e knowled
validation. The graphical representation of the model is essentially a form of self suggesting imp
documentation. 4. Animations ,
vi:sualized. •
5. The modem Sl
simulation. ·
When the simu
I. When the pro
2. \\'hen the pro
3. When it is cat
4. When the sim
5. When the res,
I. b. A computer tc,
"ho t:1kcs call~
lime betnccn c
table 1.1. Able
which mcnns
their sen ice r
Table I. I Intc1

Table 1.2 Sc

24
e verification Eighth S emester B.E. Degree Examination,
(08 Marks)
its developer,
CBCS - Model Question Paper - 2
SYSTEM MODELING AND SIMULATION
system can take Time: 3 hrs. Max. Marks: 80
ach event type. 1- Note: Answer 1111y /CJV£/11ll 1111estiom, Je/ecti11g ONE / 11/1 q11esliim/ro111 e11d11110,/11/e.
ricty of settings
Module- I
fthe simulation, I. a. List any five circumst:rnces, when the simulation is the appropriate tool and
rtently. when it is not. (08 Marks)
Ans. When the simulation is the approp.-i:lte too l
in the animated I. Simulation enable the study of and expel'imentation with the internal interactions
of a complex systt·m or of a sub system within a complex system.
al component of 2. Informational, organintional and environmental changes can be simulated and
the effect of these alterations on the model's behaviour can be observed.
verification and 3. The knowledge gained. in designing a simulation model of great value toward
ly a form of self suggesting improvement in the system under investigation.
4. Animations shows a system in simulated operation so that the plan can be
visualized.
5. The modern system is so complex that the interactions can be treated only through
simulation.
When the simulation is not appropriate
I. When the problem can be solved using common- sense.
2. When the ,,roblem can be solved analytically.
3. When it is easier to perform direct experiments.
4. When the simulation costs exceed the savings.
5. When the resources or time arc not available.
I. b. A computer technical support centre is staffed by two people, Able and Baker
who takes calls and try to answer questions and solve computer problems. The
time between calls r.rnges from l to 4 minutes with the distribution os shown in
table 1.1. Able is more experilmccd und can providl.' service faster thnn Baker,
which means that, when both arc idle, Able tokes the call. The distribution of
their service times arc shown in table 1.2 and 1.3 respectively. (08 Marks)
Table 1.1 Inter A rriv.il time (IAT) Distribution
IAT I 2 3 4
Probability 0.25 0.40 0.20 0.15
Table 1.2 Scrvicl' time dis tribution of Able
IAT 2 3 4 5
Probability 0.30 0.28 0.25 0.17

25
VIII Semt ( CSE/ ISE)
Table 1.3 Service time distribution of Baker
3 4 s 6 2. a. Prepare a slm
Prob11bility 0.35 0.25 0.20 0.20 queu~ stoppin~
Random digits for inter 11rrival time- 26, 98, 26, 42 given belo
11 l'l'

Random digits for Service time- 95, 21, 51, 92, 89, 38 Simulate the t
Simulate thi11 system for 6 customers by finding
I) Average waiting time of customers
Ii) Average s~rvice time of Able
iii) Average servke time of B:1ker Ans.
ADIi, Inter Arrival deh:rminution
Customer no. Random digit Inter arrival time
I
2 26 2
3 98 4
4 90 4
s 26 2
6 42 2
Simulation of ABLE - Buker S
AUI.F. Uoke,

Cu)lumcr

'"'
I
2
lnltr
:uTl\'al
hme

2
Am"val
11n1r

2
Tune
~.,,~
riq,,n
0
S1,'1'\l(C
hmc

s
1,me
a-ff\!ICC
l'nds
~
~-
11111c
M.'fVl~t:

2
~"i\lt."C
lllll\':
T1a1c
M:r\lCC
Fnd

s
Cwa,....,
0.1.,

0
Time
Spent 1n
S~SICffl

s Sinmlution rablQ
l
1 ~ 6 l ')
' 0
u
)

J
b
Clock LQ
4 4 10 IU s IS 0 s 0
:, 2 12 0
12 b 11 0 6 -
6 2 14 11 l 18 I 4
I I
2 2
i) Average waiting time of customer = Totalt1mecustomer wait mqueue
total number of customers 4 I
6 0
=¾ =0.16 mins 8 I
ii)Averageservicc timeof Able 11 I
16 . IS 0
=-= 4 llllllS
4 16 0
iii) Average service time of Baker Total busy time<
Maximum queue
=~= 4.Smins
2

,, '"-'"... Sc.1,1t11
26
OR
2. a . Prepare II simulution htble miing the time a,h, a nee 11lgorlthm fo1· 11 slniclc chunnel
queue slopping c\"cnl will be Ill 30 minutes. Inter arrival lime 11nd service times
arc given below. Find the busy lime server and maximum queue lcn1tlh.
Slmululc lhe 111ble unlil fifth customer departs
IAT 6 3 7 S 2 4
Service time 4 2 5 4 I 5 4 4
(06 Murks)
Ans.
Serial No IAT Arri,al Sen ice
al time titll(' time
I I 0 4
2 I I 2
·'" 6 2 s
4 3 8 4
5 7 11 I
6 s IX s
7 2 23 4
8 4 25 I
Time l1n1c
,cc
CuSlotffl:I' Spall IP 9 I 29 4
0<1"' iyMefl\
bd Simulation table for checkout counter simulation
0
Cumulative stntic·s
0 l
Q l Clock LQ(t) Ls(t) FEL 13 MQ
0 0 0 I (A. I) ( D,4) (E.30) 0 0
18 0 6 I I I (A.2) ( D,4 ) ( E.30) I I
4
2 2 I (D,4) (/\,8) ( E.30) 2 2
mqueue
4 I I ( D.6) (A,8) ( E, 30) 4 2
ers (A,8) ( D, 11) (1::,30) 6 2
6 0 I
8 I I ( D.11) (A.I I) (E.30) II 2
l1 I I (D, 15) (A , 18) ( r,.30) 11 2
15 0 I ,(D.16) (A, 18) (E.30) 15 2
16 0 0 (A. 18) (E,30) 16 2
Total busy 1I111c ol server = 16
Maximum queue length ~2

27
VIII Set111 ( CSf(ISc) S y ~ M ~ ~ S~ L < m l CBCS ·ModeiQ~
2. b. Six dump trucks arc used to haul coal from lhc entra nce or a smali mine 10 i
the railroad. Each truck is loaded by one of two loaders. After loading, a truck 36 0
immediately moves to scale, to be weighted 11s soon 11s possible. The queue
discipline is FlFO. When It is weighed, a truck travels 10 lhe industry und I
returns lo the loader queue. The distribution of loading time, weighing time
52 0
and travel time are given tu the following table as follows.
Loading time 10 5 5 10 15 10 10
Weighing time 12 12 12 16 12 16 -
Travel time 60 100 40 40 80 - - 6-l 0
Calculate the total busy hme or both the loaders, the scale, averngc loader and
scale utilization. Assume 5 trucks arc at the lo:,dcr and o ne is at the scale at time
0 stopping Event TE=60 mins (10 Marks)
Ans, . a. Explain i) Inv,
4512 Mainlainabillf)
·1· .
Average Ioad erut11zat1on;--= 0 .35
64 A ns. i) In realistic i
least 3 random ,
A veragcscaIc11t1·1·1zallon
. =-=
64 I
64 I. The numbt:r c
Simulation table for dump truck o peration
2. The 11me bet
SunulahM !)t.df t:uurnl.ah\c ~l\1stn:s
3. The lead time
~l,n Load1."1 \\e1ah
In very simple
c1,,.q11 L(l) WQ(1) WCU) 1(1 o, B,
over time and I
L<JII) ~th,"U\! llli~U.:

D'1', EL. 5, DT, • In most real


0 J 2 0 I OT, EL, 10, OT, 0 0
OT, lW, 12,0T, in time and
EL, 10, OT, random.
llf,
EL. 10, DT, • Distributiona
5 2 2 I I
OT,
1n,
CW, 12. OT,
LI. 10,01,
10
' usually base
10 I 2 ] I IJ I
1n. I.I.. I?, l)l 1 20 ,o in a realistic
Ill
tW 10.l>I. • In practice, t
ll 1 LI 12.DT, distribution.
10 0 2 J I Ill F.t.. 20.D1, 20 10
111: EW 2U)l, • Unlike anal
EL. 20, DT, assumptions
01. [L. 25, OT.
12 0 2 2 I
or; r,w.2~.01·,
24 12 • The geomet1
"'·''· 12. or.
• The passion
ll 1 I.I 25. DT extensively
211 0 I l 1 Ill t W !4.Dr, 40 :o
ll I Al.l). 72. IH, • The tail oft
u .. 2s.01.
binomial. wl
l>I. LW. 16, Dr, used than if
24 0 I 2 I 4~ 24
Ill. ALQ, 72 Dr,
ALQ 124 D'I, ii) reliability a
DI, l,W, 16.D'f. distributions in
2$ 0 0 3 I Dl, /\LQ, 72. DT, 4S 2S
DI /\1,Q, 12J, DI,
occur. the time
distribution ari
an c.,ponential

28
~ S~t.Ot'\I

ee of II smnll mine to EW S2.0T,


fter loading, a truck J6 0 0 J I
01 Al.0, 72 or, 4S .16
1)1 , /\I ,(). 124 UT 1
possible. The que ue AL(). 11,. rn,
to the industry and EW 1>-1 DI,
, time, weighing time ALI), 72. 0 r,
S2 0 0 I I or. ALO. 124. 01,
ALO. 76, Dr,
45
''

rn
ALQ. 92. OT,
EW SO.OT,
- 64 0 0 0 I
AU), 72. 0T,
Al,1), 124.0T,
4S 64
At<) 76 or
- Al.I). Y2 DI,
le, averuge loader and Al.l). 144. 01,
is at the scale at time Module - 2
( 10 Mark.'!)
3. a. Explain i) Inventory - supply chain management systems ii) Reliability nnd
Maintainability. · (08 Marks)
Ans. I) In renlistic inventory and supply - chnin mnnagcmcnt systems, there an: al
least 3 random variables.
I. The number of units demanded per order or per time period.
2. The time between demands.
3. The lead time
In very simple mathematical models of inventory systcms, demands is a constant
D,
over time and lead time is zero or a constant.
• In most real world cases hence in simulation models. demands occurs randomly
s.DT, in time and the number. of units demanded 1:ach time a demand occurs is also
0 0
10,0T,
11,DT, random.
10,DT, • Distributional assumptions for demand and lead time in inventory theory texts are
10.or; 10
12,DI, usually based on mathematical tractability but those assumptions could be invalid
10. ur, in a realistic context.
20 10
12. or • In practice, the lead time distribution can often be fitted fairly well by a gamma
, ?O, 1>1,
12, llT,
distribution.
10
20. 1)1, 20 • Unlike analytic models. simulation models can accommodated whatever
. 2s. 01. assumptions appear mosl reasonable.
_ 20. OT,
. 2S,OT, 24 12
• The geometric poisson and negative binomial distributions might be appropriate.
~.24.0T , • The passion di~tribution is oncn used to model demand because it is simple. it is
LQ 72 OT,
extensively tabulated and R is well lrnO\.\ n.
ps OT
20 • The tail of the poisson distribution is generally shorter than that of the negative
W.2·1, IH,
, _lJ 12 nr, binomial. which means that fewer large demands will occur if a poisson model is
1.,2S.Dl used than if a negati\'e binomial distribution is used.
W.3b.OI, 44
1.Q, 72,01 1 ii) reliability and maintainability:- Time to failure has been modelled with numerous
LQ 124, OT,
distributions including the exponential. gamma and wcibull. If only random failures
\W.36,0T.
\LQ, 72,DT,
15 occur. the time to failure distribution may be modelled as exponential. The gamma
~I.Q, 124, OT, distribution arises from modeling stand by. redundancy. where each component has
an exponential time to failure.

29
fS~WtV
a small mine to LW Sl 01,
r loading, a truck ,6 0 0 2 I
IH
or.
AU).72 IH,
1\1 (). Ill ur,
,, J6
sihle. The queue Al.(). 76. Dl,
the industry and EW<,1 Dl,
c, weighing time AL<). 72. 0 I,
S2 0 0 I I OT, AL(). 124, OT, 4$ $2
ALI), 76. OT,
ALI). 92. Dr,
f.W.80, OT,
Al.<),72.01,
AUJ, 12-1, 01 1
6-1 0 0 0 I
Al.I) 76, DI
41 64
Al.l).92.DI,
~ ,d.,,nd 0 Al.I). 144, DI,

t the scale nt time Module - 2


(10 Mark.'1)
3. a. Explain i) Inventory - supply chain management systems ii) Reliability and
Maintainability. · (08 Marks)
Ans. i) In realislic inventory and supply - clrnin mnnngemcnl systems, there are at
least 3 random variables.
I. The number of units demanded per order or per time period.
2. The time between demands.
3. The lead time
<.:umulati\C ~3.\1jln.;.)
I In very simple mathematical models of inventory systems, demands is a constant
o, I\ over time and lead time is zero or a constant.
• In most re11I world cases hence in simulation models. demands occurs.randomly
0 0 in time and the number. of units demanded each tirne a demand occurs is also
f. random.
'
r, s • Distributional assumptions for demand and lead time in inventory theory texts are
~. 10
usually based on mathematical tractability hut those assumptions could be invalid
in a realistic context.
t 20 10

10
• In practice, the lead time distribution can often be fitted fairly well by a gamma
distribution.
,.
r'. 20 • Unlike analytic models. simulation modds can accommodated whatever
assumptions appear most reasonable.
fl, • The geometric poisson and negative binomial distributions might be appropriate.
l. 24 12
i. • The passion distribution is of'tt:n used to model demand because it is simple. it is
OT
r
extensively tabulated and R is well known.
1)1, 40 zo • The tail of the poisson distribution is gem:rall> shorter than that of the negative
DI, binomial. which means that fewer large demands will occur if a poisson model is
~T
PT, 24
used than ifa negative binomial distribution is used.
44
DT 1 ii) reliability a nd malntaln:1bility:-Time 10 failure has been modelled with numerous
~ DI,
distributions including the exponential. gamma and weibull. If only random fai lures
I'll
oi", 4S 2S occur. the time to failure distribution may be modelled as exponential. The gamma
~ .OT, distribution arise~ from modeling stand by. redundancy. where each component has
an exponential time to failure.

t 29
1\1,+,.., f.c"M Sc,.,.M
VIII Sem,, ( CSE/ISE) CBCS - 1-,iode,l.,Q
• The wcibull distribution has extensively used to represent tirnt: to failure and its
nature is such that it can be made lo approximate many observed phenomena. 0,
When there arc a number ofcomponents in a system and failure is due to the most (x -,
serious of a large number of defects or po!;siblc defects, the weibull distributor
seems to de partically well as a model. (b-n)(
F(x) =
3. b. Explain the following continuous distributions I- (c
i) Tria ngular tlistribution (c-b
ii) Lognormal distribution (8 Murks)
Ans. i) Triungulnr distribution :- A random variable x has a triangular distribution if its
pdf is given by Ii) Log normal d

f (x)-
2
(x-a) , a!>x~b its pdif is given by
(b -a)(c-a) ~ex
f(x) J27rax
2(c-x)
(c-b)(c-n)'
0, else where Where a= >O. The
E ( X) = c••· 0
'''
Where a< h~ e. The mode occurs at x=b. A triangular pdf is shown below
f(x) V ( X) = e=,.. •' (e 01

Three lognormal A

a b c x
The pnrnn11.:ters (a. b, c) can be related to other measures. such as the mean and the
mode, as follows
E(x) =a+ b + c
3
the mode can be determined ns
TI1e parameters ►
Mode=b =3 ;1c (x)- (a + c) parameters come
Because a .S: b .S:c a lognormal distr
2a-+ c5 E( lognormal are kn
- · X) ~--
a+ 2c
J 3 given by
The mode is used more oftt:n than the mean to characterize the triangular distribution.
Its height is 2/ (c-a) above the x-axis.
µ= In[ ,µj ,
►lj_+ Gj
The cdffofthe triangular distribution is given by
µ~ +a2
a2 = In ,
[ µl

30
me to failure and its
served phenomena. 0, x=,;a
re is due to the most (x -a}2
wcibull distributor aS:xS:b
(b - a)(c -a)'
f(x) =
1-
(C- x}2 b·s xsc
(c-b)(c-a)'
(8 Murks) I, X >e
tar distribution if its
ii) Log normal distribution:- A random variable x has a log normal distribution if
its pdf is g.iven by

✓2~ox exp[
2
f(x)J (ln ~;lµ) ] X>0

l 0. other wise
Where a= >0. The mean and variance of a log normal random variable are
E( X) = e10• 0
"'

own below
V ( x) = eJi•+a' ( e"~ - I)
Three lognormal pelf's all having mean I, but various 1/2. I and 2, are shown below.

1.5

ch as the mean and the


0.5

2 3

The parameters µ and a2 are not the mean and variance of the lognonnal. These
parameters come from the fact that when y has aµ(µ, o2) distribution then x= e> has
a lognormal distribution with parameters ~land o2. If the mean and variance of the
lognormal are known to be ~ and a 2 respectively, then the parameters µ and a2 are

triangular distribution.
::,::rlJ~,j_:t ,l +O"i.

..,
31
VIII Semt (CSE(ISE)
CBCS. J..iodel,
OR od·
4. a. Explain networks of queues in Queuing model. (06 Murk.~) p[n+IJ: = p[l
Ans. :fhe study of mathematical model of networks of queue is beyond the scope . do;
• A few fundamental principles arc very useful for rough cut modell ing. prior to for II from c 1
simul~tions study. do
• The following results assume a stable system with infinite calling population and p[n+ lJ != Pf I J
no limit no system 1:apacity. ' od;
I. Provided that no customers are created or destroyed in the qu1:uc, then the departure RETURN (cv
rate out of a queue is the same as the arrival rate into the queue, over the long run. end;
2. If customers arrive lo queue i at rate A and a fraction 05 pij 51 of them are routed
to queue j upon departure, Ihen the arrival rate from queue i to queue j is ,.jpij over
the long run. S. a. What ls llnca
3. The over all arrival rate into queue j, Aj, is the sum of the arrival rate from all genemte a seq
sources. If customers ~rri,e from outside the net\\ork at rate aj, then Ans. Linear congri
Ai =a,+ L A, P,1 sequence of in
olh Xi ti= (axi + c
4. If queue j has c < C1J parallel servers.each worl...ing :11 rntc ~lj. then long run The initial \'alu
I
utilization of each server is • lfc#O in eq
Al • If c=0, the fc
p=- xi
J C ~l
1 1
R,~- i =I'
Ill '
and pI <I is n:quired for the queue to be stable. X,., =(a>.iie)
S. if. for each queue j, arrivals from oulside the network from a poisson process
with raic a and if there are c identical service)> ddi,ering exponentially distributed X,=(iix,,+c)
I J
service times with mean ~lI tilen,in steady SIHlc, queue j behaves like an µ/µle J queue X,=(17x27+
with arrival rate x. =502 modl0
A1 =a,+ LA,
P,,
all I
x. =2
2
4. b. Write the mnplc procedure to calculate p0 for the M/M/C/K/K queue. R, =I00=0.02
( 10 Marks) X:=(ax,+c)n
Ans. mmcckk= proc {lambda, mu, c, k}
# return steady state probabilities for M/M/M/C/K/K queue. =(17•2+4
#notice that p[n+l] is p-n, n=0, ...... K = 77modl0
local crho, k fac, c foe, p, n; X1 = 77
P: vector {K+ I, O};
crho:- lambda/ mu; 77
R. =-=0.77
k fac:= k!; · 100
c fac:= c!;
p[ I]:= Sum ((k fac / (n"! (k-11)''!)) 1 crcho"n, n-0, c-1 )+
Sum ((k fac/ (c"(n-c) • (k-n)! • c foe))• crho"n, n=c ..k);
P[ I]:= 1/P[ IJ .
for n from I to c-1

· 32
od;
p[n+I): = p[l) ♦ (k fac/ (n!*(k-n)!))* crho"n;
(06 Marks)
do;
ond the scope . for n from c to k
.ut modeII i1tg. prior to do
p[n+ I]!= p[ 1J• (k fac) (c"(n-c)*(k-nWc facWcrho" n:
calling population and
od;
RETURN (eva / m (p));
eue, then the departure
end;
iue, over the long run.
\j $ 1 of them are routed Modulc-3
to queue j is i-.jpij over 5. a. Whnt is linenr, congruential method? Use the linear congruential method to
generate a sequence of rJntlom numbers x0=27, n= l7, c=43, m=I00. (06 Marks)
he arrival rate from all Ans. Linear congruenti11I method:- It was initially proposed by lchmcr, produces a
aj, then sequence of integers x1• x: between rand m-1 following recursive relationship
Xi ti = (axi + c) mod mi = 0, I, 2........ 1
The initial values x0 is called seed, a..... multiplier c ..... increment and m is models.
rate ~lj. then long run • If c#0 in eq I then the form is known as the mixed congruential method
• If c=0, the form is known. as multiplicative congruential method.
xi .
R,.-- 1=1,2......
m
X,. 1 z {a xi H )mod m
from a poisson process X 1 = (axu -I c)modm
exponentially distributed
X 1 = (17x27 +43) modl00
haves like an µ/µ/c, queue
X1 ~
502mod l00
x, -=2
2
RI =-=0.02
C/K/K queue. 100
( 10 Marks) X 2 =(ax1 +c)modm
= ( 17 • 2 + 43) mod I 00
ue.
= 77modl00
X2 = 77
77
R. ---=- 0.77
· 100

33
C13CS · Mo-dclt
VIII Se+ttt ( CSE/ISE)
X; =(17•77,+43)modl00 Stcp4:-Sim
= I352mod 100
X 3 = 52 0 s,m
52
R3 = =0.52
100 After computi
X 4 =(17*52+43)modt00 z0 S Zee/ 2
x. =27 Problt!m solu
27 Stt:p I:- To tcs
R~ =-=0.27 Step 2:- Calcu
100
it (M+I) m:::N
xj = (17 x 27 +'43)mod I00 3+(4+1)5$30
X5 = 2 28530
z M=4
R5 =-=0.02
100 Sim=..!_[(0.23)
5
5. b. Explain test for Autocorrelation'! Consider the sequence.
0.12 0.01 0.23 0.28 0.89 0.31 0.64 0.28 0.83 0.93
Sim =-0.1945
0.99 0.15 0.33 0.35 0.91 0.41 0.60 0.27 0.75 0.88
and
0.68 0.49 0.05 0.43 0.95 0.58 0.19 0.36 0.69 0.87
Test for whether the 3rd, 8th, 13th ans soon numbers in the sequence arc auto 13(4)+
corrected using x= 0.05, consider critical value 1.96. (10 Marks) O'.,, =
12(4 + I
Ans. Test for auto correlation :- It is concerned with the dependence between numbers
in a sequence. I !ere test to be described sho11ly requires the computation of the auto z = -0.1945 _
0
correlation between every 111 numbers starting with the i th number. 0.1280 -
The auto correlation between numbers Ri, Ri+m, Ri+2m .... Rit (M+l) m. The value -Zo./2 :!;20 :!;2
Mis largest integer such that it (M+l) m:SN where N is the total number of values -l.96!:i:-1.516
in the sequence.
Step 1:- Formulate the hypothesis :. llypothcsisof i
H0 : Sim=O Vs H1 : Sim -:/, O
Step 2:- Calculate M, using it (M+ I) m::,N
Sim 6, a. Explain lhl· foll
Stcp3 : - CalulatcZ0 = - - i) Weibull distri
CTsun
ii) Trinngul:tr di•
It is approximately normal if the values Ri, Ri+m, Ri+2m .... Ri+(M+I )m, are Ans. i) Weibull distri
uncorrelated. machines or elect
Which is distributl'd normally with a mean of O and variance of I, under the f3 xll-1
assumption of indepcndcn1.:c F{x) == ~ e
{
Where e<>O and 13
Step I:- The cdf i
Step 2:- Set F(x) ·

34
~S~LC1'\I

1
Step4 :-Sim = -
Ill
-[:f
+I k ,,
Ri + km. Ri + (k + l)m]- 0.2S

Jl3m + 7
(J ----
S,m - 12 (111 + 1)

Aller computing 20. do not reject the n all h) polhesis of independence if Z-oc/2 ~
zO ~ Zoc / 2
Prol,l~m solution
Step I :- To test 110 : Sim - 0 or 1-1 1: Sim#0
Step 2:- Calculate M
it (M+ I) msN
3+(4+ I )S:S30
28~30
M=4
Sim - t[( 0.23)(0.'.!8) + (0.'.!8)( 0.33) + ( 0.33 )(0.'.!7) + ( 0.27)

{0.05) , (0.05)(0.36)] - 0.25


0.83 0.93 Sim - 0. 1945
27 0.7S 0.88 and
6 0.69 0.87
J13(4)+7 .Jsij 7.68
the sequence :ire auto a,,. = ~......:....-'--:,- =--- - = 0.1280
(10 Marks) 12 ( 4 + I) 60 60
ence between numbers z0 = - 0.1945 - -1.516
computation of the auto 0.1280
umber. -Za./ 2 sz0 s Za. / 2
Rit (M+I) m. The value
c total number of values
-I .96 s;- l .516 ~ 1.96
:. I lypothcsis of indepcndem:cconnot be rejected.
OR
6. Explain the following
11.
i) Weibull distributions
ii) Triungular distributions. (08 Marks)
Ans. I) Wdl,ull distribution :- It was introduced as a model for time to failure for
m .... Ri+(M+ l)m, arc
machines or electronic components. When location parnmctcr v is set to 0, it pdf.
ariancc of I, under the fj ,fl-I ( / )"
F(x) = a.P X e- X Ct.
{
0. other wise
Where u>0 and f3>0 arc the scale and shape parameters of the distribution.
Step I:- The cdf is given b) f(:\)-= 1-e-(x/u)', .\ ~0
Step 2:- Set r(x) - 1-c-(x/u)'1=R ...(I)

35
VIII Sent, ( CSE/ISE) Sy~ModeUrtfttMUi,S~wrt, CBCS · M~Q
Stcp 3:- Solving for x in tem1s of R yields. 6. b. Downtimes fo
x"'<l(- ln( I-R)] 11P ...(2) be gamma di8
By comparing I and 2. ifx is a weibull variate. then x'' is an exponential variate with 1}=2.30 and 0
mean aP, if y is on exponential variate with mean µ, then y•·P is a weibull variate. 0.434, o. 716
With shape parameter P and scale parameter a=µ I/13. Ans.
ii) Triangular distribution :- Consider a random variable x that has pdf;
x, Os x s I Step! :-a =
f(x) = 2-x, lsxs2
(2
{ Step2 :-Gene
0. othen\ ise
This distribution is cnlled n triangular distribution with endpoints (0. 2) and mode at SctV
I. Its cdf StcpJ. -Comp
0, xSO Step4:-x::5.3
x2
Osxsl
2
F(x) =
(2-xt = 0.91
I- I SxS2
2 '
I, X>2 Step2: -Gt:ner

f(x)
Stcp3 : - Compu
Stcp4 :-Since x

0.767,1
Step 5: - Divide ,

0 X 7, 11. Explain lhe difli


Density function for particular triangular distribution.
i1/ I Ans. It is necessal)
2
=
ForO S x :<. I. R -=:> 0 :i. Rs-=:> x J°2R
2
= prcliminal) stud
. (2 -x)2 be resourceful in
and for I ~ x ~ 2, R = I -
2
=> lI s R s I => x = 2 - J2 (I - R) results to the cho.
There are numbe
Thus xis generated by
available.
I. Engineering,
x=l~---O s Rs½ by the manufact
2-J2 (I - R) ¾s R s I speed of a tool is
2. Expert optio
process. They m
highl) variable a
36
CBCS · Model-Que~t"'i-01'\,Ptt.per · 2
6. b. Downtimes for a high production candy making machine have been found to
be gamma distributed with mean 2.2 minutes and variance 2.10 minutes, where
nential variate wit~ j3=2.30 and 0=0.4545. Consider the following random numbers 0.832, 0.21,
is a weibull variate. 0.434, 0.716 (08 Marks)
Ans.
at has pdf; 1
Stepl :-a= 1
, = 0.53.b = 0.91
(2x2.30 - 1) ·
Step 2:-GcnerateR, = 0.832, R~ =0.021
Set V = 0.832 / ( I - 0.832) = 4.952
ts (0. 2) and mode at
Srep3: -Compute x = 2.3( 4.952) '' 11= 5.37
Step4: -x = 5.37 > 0.91 + (2.3(0.53) +I) In 4.952 -
In ( ( 0.832)~( 0.02 I))
=0.9 I+ (2.219) I - 599-423 = 8.68, so reject xand
return to step 2
Step 2: -Generate R, = 0.434, R: = 0. 716,set v = 0.434
((1 -8.434)) = ~.767
Step3 :-Computcx = 2.3(0.767)0.53 = 2.00
Step4: -Since x = 2.00 ::; 0.91 + ( 2.3(0.53) + I] In
2
0:767, In ( (0.434) 0.716) = 2.36nccept x
StepS: - Dividex byp0 = 1.04510 get x = 1.91
Module - 4
7. a. Explain the different w11ys of selecting input models when data Is not available.
(08 Marks)
Ans. It is necessary to develop a simulation modt:I for demonstration purpose or a
preliminary study before any process data are available. In this case. the model must
be resourceful in choosing input models and must carefully check the sensitivity of
results to the chosen models.
I - R) There are number of ways to obtain information about process even if data are not
available.
I. Engineering data:- Often a product or process has performance ratings provided
by the manufactures. Ex:- laser printer can produce 8 pages/ minute. the cutting
speed of a tool is Icm / second.
2. Expert option:- Talk to people who are experienced with the process or similar
process. They might also be able to say whether the process is nearly constant or
highly variable and they migh! be able to define the source of variability.

37

L-----'- -
VIII Se,.m, ( CSE/ISE)
J) Physical or conventional limitations:- Most real process have physical limits
on performance for example, computer data entry cannot be faster than a person
can type. IJecause of company policies, there could be upper limits on how long a cr, - -~x:- n-1
process may 1nke.
4. The nature of proces;:. The description of the distributions can be used to justify
a particular choice even when he data nre available. When data arc not available, the cr = ~
uniform, triangular and beta distributions arc often used as input models.
7. b. Let x1 represent the aver.1ge lend time to deliver and x2 be the annual demand, 104520
for industrial robots. The following data nrc nvnilable on demand and lead time
er =
for the lust ten industrial robots. The following data arc availnble on demand
and lead time for the last ten years. P= cov(x"x 2
Lead time 6.5 4.3 6.9 6.0 6.9 6.9 5.8 7.3 4.5 6.3 er,, cr,.
Demand 103 83 116 97 112 104 106 109 92 96
Lead time an
Find the dependency between the lead time and demand.
Lead lime Demand x 1 • x~ x2 x/
I
(x ) (x,)
8. a. Explain point c
6.5 103 669.5 42.25 10609 Ans. The point estima
4.3 83 356.9 18.49 I "
6.9 116 800.4 47.61
6889
13456
0:-I 0 I I
v
• I

6.0 97 582 36 9409 Where is a sam e


6.9 112 772.8 47.61 12544 may refor to this
6.9 104 7 17.6 47.61 10816 cstimntor e i~ sni
5.8
7.3
106
109
614.8 33.64 11236 In general E( O):
795.7 53.29 1181

6.3
4.5
96
92 414
604.8
20.25
39.69
8464
9216 I\
E(o)
Ix -61.4 2.,x.- 10111 6328.5 386.44 10452 and E(O)- 0 is cnl
that are unbiased
X. = ~::X,
n
6 4
= 1. =6.14
10
of 8.
Example to thee
-X2 =~::X2 l018
-- = - = 101.8

w: - ,Ew an
I
N ,..,
r,..

n 10
in \\ hich case yi i
cov(x.,xi)= n~l(.2:X,x 2 - nx. x2 ) The point estimat
run length is deli,
l • I
= IO-I (~)328.5-10 x 6.14x 101.8) t =T'" LTl y(t)

I and is cc1llcd a tin
= 9 (77.98) = 8.66

38
e physical limits
ter than a person '°'x2
L.., I - x2
I
'°'x2-x2
L.., 2 2
o,=
its on how long a n-1 n- 1
1 be used to justify cr = ✓386.44 10(6.14) =l.0Z4
• not available, the
11odels.
annual demand, cr = 104520 10(101.8) = _
9 930
~nd and lead time
lnblc on d emand
cov(x.,x 2 ) 8.66
4.5 6.3
p= cr,, cr,, = (1.024)(9.930) 0.86
92 96 Lead time and demand are dependent

-r-./ OR
8, a. Explain point estimation in measures of performance. ( 10 Marks)
P609 Ans. The point estimator ofO to based on the data ly,.....y.,} is defined by
,.. I n

889 0=-I
n 1• 1
Y, ....(1)
p456
~409 Where 0 is a sample mean based on a sample size n. Computer simulation languages
,nay refer to this as a discrete time, collect, fall y or observational statistic. The point
2544
estimator 0 is said to be unbiased for Oif its expected valm: is 0- that is, if
0816
In general E( e)- O ... (2)
1236
1181
... (3)
8464
I\
9216 and E(8)- 8 is called the bias in the point estimator 0. II is desirable to have estimators
~0452 that are unbiased or if.this is not possible, have a small bias relative to the magnitude
of0.
Example to the estimator of the form of equation I including wand w 0 of equations
a (N a IN,
W;::: - I w, andwo;::: - Iw~ ➔ w 0
N ' I N ,.,
in which case yi is the time spent in the system by customer i.
The point estimator of$ based on the data {y(t), O~ t~ Tr), where Tc is the simulation ·
run length is defined by
" I
4' ;:::,,~ foTE y(t)dt ... (4)

and is called a time average of y(t) over [O, T1J

39
S y ~ M ~ ~S~i.,01'\, C6CS · Modet-Q
VIII Se.mt ( CS£/ISf)
Simulation languages may refer to this as a --continuous lime''. ·'discrete - change'' 5. A communicaf
components. It is
or ·'time - persistent" statistic.
In general E(0) * 0 ... (S)

and 1jl is said to be biased for+. .


Genemlly, O and ♦ arc regarded mean measures of performance of the system being
simulated. Other measures usually can be put into this common frameworl... For
example, consider estimation of the proportion of days on which sales are last Consider the sys
through an out of stock solution. In the simulation, let event E is define
I, : f out of stock on day: are that all com
y ={
' 0, otherwise
With n equal to the total number of days, 8 defines by equation I is a point estimator
9. a. Explain lnitiali
of 8, the proportion of 0111 of stock days. For a second example, consider estimation
Ans. This method is c
of the proportion of time Queue length is grater thank., customers. If LQ(t) represents
I. Setting the inv
simulated queue length at time t, then define and their arri\al
I, ifL0 (t)k 0 2. Placing custon
y(t) ca .
{ 0. other wise 3. I laving some
Then ♦ 11s defined by equation (4). is a point estimator of~ , the portion of time There arc at least
that the llUeuc Ieng.th is grater than k., customers Thus. e!ttimation of propo11ions or • If the system
I) pical initial
probabilities is a special case of lhe estimator means.
• This method
8. b. Explain terminating si111ul:1tions with respect to output analysis. (06 Mnrl<J) • If the system
Ans. In the analyzing of simulation output data u distinction is made between terminating existing syste
or transient simulations and study state simulations. A tenninating simulation is one • It is recomme
that runs for some duration of time Tr• where E is a specified event that stops the systems to h
simulation. Such a simul.1ted system often at time O under well- specified initial assuming the
conditions and closes at the stopping time Ti: of time 0.
Examples: I. The shady groove Bank opens at 8:30 Am with no customers present • A related ide,
aml 8 of the 11 tellers ,,.. orkcr fact that the back has bt:en open for 480 minutes. The that has been
simulation analyst is interested in modeling the interaction between customers 1111d • The ~implifie
tellers over the entire day. including the ellect of starting up and of closing down at customers int
the end of the day. • A second me
2. Consider the shady gro\'e bank of example 11.1, but restricted to the period from conjunction ,
11 :30 Am to 1.30 Pm \\hen it is especially busy. The simulation run length is Tl=l20 first 1111 initiali
minutes. The initial conditions at time Ocould be specified in essentially two ways. from time T0
3. The rcul system could be observed at 11 :30 on II number of different days 11nd under specifi
a distribution of number data could be used to load the simulation model \\ ith • Data collccti
customers at time 0. and continue.
4. The model could be simulated from 8:30 Am to 11 :30 Am without collecting • The state at ti
output statistics and the ending conditions at 11 :30 Am used as initial conditions for by I, should
the 11 :30 /\m to I :30 Pm simulation. original initi

40
•, ''discrete• change'' 5. A communications system consists of se,eral components plus several back up
components. 11 is represented schematicall) in the figure

•e of the system being


mon framework. For
which sales are last Consider the system over II period of ti.me T1, until the system fails. The stopping
event Eis defined by E= {A fails or D fails or (band C both fail)J. Initial conditions
are that all components are new at time 0.
Module - 5
I is a point estimator 9. a. Explain Initializations Bias in steady state similar. (10 Marks)
, consider estimation Ans. This method is called intelligent initiali7.ation example include.
rs. If L0 ( t) represents I. Setting the inventory levels, number of back· orders and number of items on order
and their arrival dates in an inventory simulation.
2. Placing customers in queue and in service in a queuing simulation.
3. I laving some compont:nts failed or degraded in a reliability simulation.
There are at least two ways to specify the initial condition intelligently.
. the portion of time • If the system exists, collect data on if and use these data to specify more nearly
tion of proportions or typical initial conditions.
• This method sometimes requires a large data collection effort.
lysis. (06 Marks) • If the system being modeled doesn't exist for example, if it is a variant of an
e between terminating existing system this method is impossible to implement.
ting simulation is one • It is reeom111t:nded that simulation analysts use any available data on existing
d event that stops the systems to help initialize the simulation, as this will usually be better than
well- specified initial assuming the system to be completely stocked, empty and idle is to obtain new
of time 0.
no customers present • A related idea is to obtain initial conditions from a second model of the system
1 for480 minutes. The that has been simulated enough to make it nrnthematieally solvable.
tween customers and • The simplified model can be solved 10 find long-term expected number of
nd of closing do\\ n at customers in the queue and these condition~ can be used to initialize the simulation.
• A second method to reduce the impact of initial conditions, possibly used in
ted to the period from conjunction with the first an initialiaition each simulation run into two phases
n run length is Tt "120 first an initialization phast, from OT<, time to, followed by a data collection phase
essentially two ways. from time T0 to the stopping time T.,+T1 that is, the simulation begins at time O
· of different days and under specified initial conditions I,. and runs for a specified period of time T0 •
·mu lat ion model \\ ith • Data collection on the response variables of interest does not begin until time T0
and continues until time T0+TE.
m without collecting • The slate at time T0 is quite important because the system state at time T0 denoted
initial conditions for by I, should be more nearly representative of steady - state behaviour than are the
original initial conditions at time 0, 10 •

41
VIII Se+111 (CSE/ISE) C13CS · tvfodeiQ
9. b. Explain Batch means for interval estimation in steady -state simulations. l ) Face validi
(06 Marks) appears reason
Ans. The method of batch means attempts to solve the disadvantages of a single- about real invo
replication design arises when \\C:: try to compute the:: standard error of the sample Sensitivity anal
mean, by diving the output data then treating the means of these batches as is they Model user is ,
were independent. input variables
When the raw output data aficr detection fonn a continuous time process {y(t) , Tc,S increased. We
t :S T0+ TL} such as the length of a queue or the level of inventory, then we form k 2) Validating
batches of size m=T /k and compute the batch means as assumption and
I ,m
Y, = -m f
( 1-l)m
Y(t+T0 )dt
Structural assu
consider custo
be I or 2 or 3 etc
for j= I. 2, .....k. be FIFO etc. T
When the raw output data after deletion form a discrete time process {Y i = d+ I during appropri
1
d+2.... n]. such as the customer delays in a queue or the cast per period ofan invento~ regarding bank ~
system, then we form k batches of size m= (n-d)/k and compute the batch means as should be based
y =..!_ ,''" Y, +d example, In abo
' M L..,. u- i, M + I service time etc.
For j= 1,2, ....k That is, the batch means are formed as shown here The reliability 0
consist of 3 steps
YI ...... Yd' vd...1••..···yd+ru' yd't-ftl♦I ......vtt ...:!m •••••Yd+(il.-1)111 ..1 •
I. Identifying pr
JddN V1
v, 2. Estimating pan
3. Validating mod
J) Input o utput
Starting with either continuous time or discrete time data the variance of the sample by using the actu
mean is esJimated by that the simulatio

s~ =.!. i: (vj-vr =i: v/-kY: occurred in the r


if is important tha
k k,. 1 k-1 ,. 1 k(k-1) lime period.
Where j is the overall sample mean of the data after deletion. 10. b. Explnin Ucr11tivc
Ans.
OR
• Calibration is
IO. a. E~plain three step approach for validation process as formula ted by Naylor nod making adjustr
Fmgcr. . ( 10 Mnrks) additional adjrn
Ans. The following 3- step approach aids in the validation process. • The compariso
I . Bu.ild a model that has high face value. • Subjective test
2. Validate model assumptions. aspects of the
3. Compare the model input output transformation to corresponding the input / output • Objective tests·
transformation of real system. data produced
• Then one or 1111
model data set.

42
VIII Set11.r ( CSE/IS£) Sy~ M~ ~ s ~ W l ' \ I CBCS · ModeuQ
9. b. Explain Ba tch means for intenial estima tion In steady -state simula tions.
(06 Marks) appears reas
A ns. The method of batch means attempts to solve the disadvantages of a single- about real inv
replication design arises when we try to compute the standard error of the sample Sensitivity an
mean, by diving the output data then treating the means of these batches as is they Model user i
were independent. input variabl
When tilt: raw output data after detection fonn a continuous time process {y(t) , T;, increased. W•
a
t 5 T0+ Tf} such as the length of queue or the level of inventory. then we form k 2) Validalin
batches of size m=T/k and compute the batch means as assumption a
I Structural as
J
JIU
YJ =111- Y(t+T0 )dt consider cust
t1-l)m
be I or2 or3
for j=- I, 2, .....k. be FIFO etc.
When tht: raw output data after deletion fom1 a discrt:te time process {Y,, i = d+ I. during appro
d+2.... n). such as the customer delays in a queue or the cast per period of an inventory regarding bar
system. then we form k batches of size m= (n-d)/k and compute the batch means as should be ba
y _ I ""''" Y, + <l example, In
' - M L,, o ·11 M + I service time c
For j= 1,2,....k That is, the batch means are fom1ed as shown here The reliabilil)1
consist of 3 st
~ • ~ • Yd+ni,l••••••YJ+'m•••••Yd♦(~-l)n1•I'
I. Identifying
deleted Yt Y1
2. Estimating
......Yd ♦~tn 3. Validating
3) Input O UI~
Starting with either continuous time or discrete time data the variance of the sample by using the
mean is estimated by that the simu

S
1
=.!.. i: (vj-Yr ±vf = -k~
occurred in tll
if is importan
time period.
k k ,., k-1 ,., k(k-1)
Where j is the overall sample mean of the data after deletion. J0, b. Explain Ilcrit
Ans.
OR • Calibratio
IO. a . E~plain three slcp approa ch for , ·alidation process as formulated by Naylor a nd making adj
Finger. • , (10 Marks) additional
Ans. The following 3- step approach aids in the \'Rlidation process. • The comp
I. Build a model that has high face value. • Subjective
2. Validnte model assumptions. · aspects of
3. Compare the model input output transformation 10 corresponding the input/ output • Objectivct
transformation of real system. data produ
• Then one
model dat,

42
es~im'\I 1) Face validity:- The first goal of simulation model is to construct a model that
c simulations.
appears reasonable on its face to model users and others who are knowledgeable
(06 Marks)
about real involved in model construction from its conceptualisation to implement.
antages of a single• Sensitivity analysis can also be used to check models face validity.
error of the sample Model user is asked if the model behaves in the expected \\ ay. When one or more
se batches as is they input variables arc changed for example; In queuing systems if arrival rate is
increased. We expect server utili1.atio11 and queue length to increase.
ne process {y(t) , T;, 2) Va lidating motlel assumptions:- Model assumptions are of 2 types structural
tory, then we form k assumption and data assumption.
Structural assumptions involve questions of how the system operates. For example,
consider customers queueing and service facility in a bank. Number of queues may
be I or 2 or 3 etc customers may be moving from queue to other queue discipline may
be FIFO etc . These structural assumption should be verified by actual observation
during appropriate time periods together discu~~ion with managers and teller
process {Y., i d+ I. n;garding bank policies and actual implementation of this policy. Data assumption
eriod of an inventory should be based on connection of reliable data and correct analysis of data. For
te the batch means as example, In above banking system, data may be collected on inter • arrival time,
service time etc.
The reliability of data is verified by consultation with bank mangers. Analysing data
ere consist of 3 steps
I. Identifying probability distance.
2. Estimating parameters.
3. Validating models by goodnesi. of fit test.
3) Input output validation us ing historical da ta:- Here the input data is generated
by using the actual historical record. When using this technology the model knows
variance of the sample that the simulation will duplicate as closely as possible the important events that
occurred in the real system. To conduct n validation test using historical input data
if is important that all input data and all response data be collected during the same
time period.
10. b. Explain Itera tive process of calibrating a model. (06 Marks)
Ans.
• Calibration is the iterative process of comparing the model to the real system,
making adjustment to the model, comparing the revised model to reality, making
ulated by Naylor a nd additional adjustment comparing again and so, on.
(10 Marks)
• The comparison of the model to reality is carried out by a variety of tests.
• Subjective tests usually involves people, who arc knowledge about the model
aspects of the output.
• Objective tests always require data on the system behaviour, plus the corresponding
ing the input / output
data produced by the model. ·
• Then one or more statistical tests are performed to compare some aspect of the
model data s et.

43

L_-----'--------------
VIII Se.-mt ( CSE(ISE) S y ~ M ~ CM'\d, Sim«la-t"i.ort,
• This iterative process ofcomparing model with the system and then revising both Eight Semest
the conceptual and operational models to accommodate any perceived model
deficiencies is continued until the model judged to be sufficiently accurate.
Time: 3 hrs.
Com(llll"C model to realit~ Note : Amwrr any
Initial model
Revise
Real <.:omp,1rc revised model to n:alit)
First re\ ision of model I. a. What is simul
System study•
• Revise Ans, A simulation is
Compare sc,;oml re, is,on tu realit) time. Whether
~-----------.i Second revision of model an artificial hist
inferences conce
Refer Q.no. I (a
b. A computer tee
who take calls ,
time between c •
table I.I. Able
which mean th
their aervice ti
distribution
Table l.J: late

,__
TableJ.2~
~
__
Scrvic
Probab
Table 13 : Serv~

Random digits
~
26, 98, 90, 26, 4:
Random digits J
95,21, SJ,92, 89,
Simulate this l)i
Finding
(l) Average late
(ii) Average se
(iii) Average se
Aas. Refer Q.no. I (b)
and then revising both Eight Semester B.E. Degree Examination, CBCS - June / July 2019
any perceived model System Modelling and Slmulatlon
cicntly accurate. Max. Marks: 80
Note: Amwer any FIVE full questions, Sl'lecting ONE full que:i.tlonfronr l'ach module.
ial model
Module- I
Revise
What Is simulation? Explain with Oowcbart the 1teps involved in slmulalion
ision of model study. (08 Marks)
A simulation is the imitation of the operation of a real-world process or system over
Revise
time. Whether done by hand or on a computer, simulation involves the generation of
an artifkial history of a system, and the observation of that artificial history to draw
vision of model
inferences concerning the operating characteristics of the real system.
Refer Q.no. I (a) of MQP - I
b. A compuler technical support centre in staff'ed by two people. Able and Buker,
who take calls and try to answer questions and solve computer problems. The
time between calls ranges from 1 to 4 minutes with the distribution as shown in
table 1.1. Able is more experienced and can provide service faster than Baker,
which mean that, when both are idle, Able takes the call. The distribution of
their service times are show in table 1.2 and Table 1.3: Inter arrival time (IAT)
distribution
Table 1.1: Inter Arrival Time (IAT) distribution
lAT (mlns) 1 2 3 4
Probability 0.25 0.40 0.20 0.15
Table 1.2: Service Time (ST) Distribution of Able
Service time (mins) 2 3 4 5
Probablllt 0.30 0.28 0.25 0.17
Table 13: Service time distribution of Baker
Service time (mins) 3 4 5 6
Probability 0.35 0.25 0.20 0.20
Random digits for inter arrival times are :
26,98, 90,26,42,74,80,68,22,48,34,45,24,34
Random digits for service time arc :
95,21 , 51,92, 89, 38, 13, 61, 50,49, 39,53, 88,01, 81
Simulate this system for 10 customers by
Finding
(i) Average inter arrival time
(ii) Average service time of Able
(iii) Average service time of Baker. (08 Marks)
Ans. ReferQ.no.l(b) of MQP-2

I
9'~5~tht'\I
m and then revising both Eight Semester B.E. Degree Examination, CBCS -June / July 2019
te any perceived model System Modelling and Slmulatlon
fficiently accurate. Time: 3 hrs. Max. Marks: 80
Nole: A11swer any FIJ/£/u/1 questions, 1>elecling ONE/111/ question from ea,·h module.
itial model
Module- I
Revise
1. a. What is simulation? Explain witb 0owchart the steps involved in simulation
, is ion of model study. (08 Marlu)
Ans. A simulation is the imitation of the operation of a real-world process or system over
Revise time. Whether done by hand or on a computer, simulation involves the generation of
an artifidal history of a system, and the observation of that artificial history to draw
revision of model
inferences concerning the operating characteristics of the real system.
Refer Q.no. I (a) of MQP - I
b. A computer tccboical support centre lo staffed by two people. Able and Baker,
who take calls and try to answer questions and solve computer problems. The
time between calls ranges from 1 to 4 minutes with the distribution as sbown in
table I.I. Able is more experienced and can provide service faster than Baker,
which mean that, when both arc Idle, Able takes the call. The distribution of
their service times are show in table 1.2 and Table 1.3: Inter arrival time (IAT)
distribution
Table I.I: Inter Arrival Time (IAT) distribution
IAT (mins) 1 2 3 4
Probability 0.25 0.40 0.20 0.15
Table 1.2 : Service Time (ST) Distribution of Able
I Service time (mlns) 2 3 4 5
IProbability 0.30 0.28 0.lS 0.17
Table 13 : Service time distribution of Baker
Service time (mins) 3 4 5 6
Probability 0.35 0.25 0.20 0.20
Random digih for later arrival times arc :
26,98, 90,26,42,74, 80, 68,22, 48, 34,45,24,34
Random digits for service time are :
95,21, Sl,92, 89, 38, 13, 61, 50,49, 39,53, 88,01, 81
Simulate this system for 10 customcn by
Finding
I\__A_.,-- -
and then revising both E ight Semester B.E. Degree Examination, CBCS - June / July 2019
any perceived model System Modelling and Simulation
ciently accurate. Time: 3 hrs. Max. Marks: 80
Note: A11swer any FIVE full questions, seleding ONE full questio11/ronr end, module.
al model
Module- I
Revise
I. a. What is simulation? Explain with flowchart the steps involved in simulation
sion of model study. (08 Marks)
Ans. A simulation is the imiuition of the operation of a real-world process or system over
time. Whether done by hand or on a computer, simulation involves the generation of
an artificial history of a system, and the observation of that artificial history to draw
inferences concerning the operating characteristics of the real system.
Refer Q.no. I {a) ofMQP - I
b. A computer technical support centre in stafl'ed by two people. Able and Baker,
who ta ke calls and try to answer questions and solve computer problems. The
time between calls range» from 1 to 4 minutes with the distribution as shown in
table 1.1. Able is more experienced and can provide service faster than Baker,
which mean that, when both are idle, Able ta kes the call. The distribution of
their service times are show in table 1.2 and Table 1.3: Inter arrival time {IAT)
distribution
Table I.I: Inter Arrival Time (IAT) distribution
IAT (mlns) 1 2 3 4
Probability 0.25 0.40 0.20 0.15
Table p : Service Time (ST) Distribution of Able
Service time (mlns) 2 3 4 5
Probability 0.30 0.28 0.25 0.17
Table 13 : Service time distribution of Baker
Service time (mios) 3 4 S 6
Probability 0.35 0.25 0.20 0.20
Random digits for inter arrival times are :
26,98,90,26,42,74,80,68,22,48,34,45,24,34
Random digits for senice time are :
95.21, 51,92, 89, 38, 13, 61, S0,49, 39,53, 88,01, 81
Simulate this system for 10 customers by
Finding
(i) Average Inter arrival time
(ii) Average service time of Able
(Iii) Average service time of Baker. (08 Marks)
Ans. Refer Q.no. I(b) of MQP - 2

1
VIII Sem, (CS£ I I S£) CBCS · Jl,(..t'l,(!/1July 2

OR
2. a. List the various concept used in discrete event simulation and explain a ny four 3. a. Explain binomial 8
of these with examples. (08 Marks) mean and variance
Ans. Concepts used in discrete event simulation Ans. Poisson distributio
System : A collection of entities that interact together overtime to accomplish one It describes many ra
or more goals. given by
Model : An abstract representation of the system
System state : A collection of variables that contain all the infonnation necessary to
describe the system at any time.
P(x)={e:t x
0, Oc
Entity : Any object or component in the system
Attributes : The property of an entity where a> O
List : A collection of associated entities, ordered in some logical fashion Poisson distribution
Event : An instantaneous occurrence that changes the state of the system E(X) = a = v(x)
Even Notice : A record of an event to occur at the current or sometime in future Binomial distributi,
olong with any associated data necessary to execute event. The random variabl
Event list : A list of event notices for future ordered by time of occurrence. a binomial distributi(
Activity : A duration of time of specified length which is known when it begins
Delay : A duration of time of unspecified indefinite length which is known when it P(x) = {(:} • q•·•
begins.
Example able and baker problem 0,
I. Busy state : Able and baker is busy severing It is computed by the
2. Idle state } by s, occurring in th
J. Waiting state cars waiting to get served
Entities P=[Sss. ... '. ·J
Customers, servers [Able and baker)
Event whcreq =I - P, The
Arrival event of cars
Able and baker service completion time
Activities
(:)= x!(n ~x}!
11

• Inter arrival time Outcomes having th


• Service time of able and baker mean and variance
independent bemoul
b. Consider a single server queuing system with inter arrival and service time Then
d etails as s hown below :
X=X, +X
I~ .I ! 1.~ I: I! I~ I~ I! I~ I! I
Stop simulation when simulation clock reaches 23.
and the mean E(X)i,
E(X)= p+
Ans. Refer Q.no. 2(a) ofMQP _ 2 (08 Marks)
and the variance V{
V(X) ::: pq

2
~s~ ~CS · ]IMWI(Ju½' 2019

Module-2
3. a. Explain binomial and Poisson distribution and give probability mass function
n and explain nny four
(08 Marks) mean and variance (06 Marks)
Ans. Poisson distribution
irtime to accomplish one Tt describes many random processes and is simple. The probability mass function is
given by
e""'a."
P(x) = ~ x=O,I.. ...
information necessary to
{
0, Otherwise
where a> 0
Poisson distribution mean and variance ai:e both equal to a., i.e.,
gical fashion E(X) = a= v(x)
of the system Binomial distribution
t or sometime in future The random vnriable x that denotes the number of successes in n Bernoulli trials has
a binomial distribution given by p(x) where
e of occurrence.
nown when it begins
which is known when it
P(x) =j(:)p•q•-•, x=0,1,2......n
0, otherwise
It is computed by the probability of a particular outcome with all the success denoted
bys, occurring in the first x trials, followed by n-x failures.

P • [ SSS.......:..........SS FFF......:::........FF), p" q•·•

where q = I~ P, There are

(:)= x!(nn~x)!
Outcomes having the required number of S, and F, . An easy approach lo calculate
mean and variance of tho binomial distribution is to consider X as the sum of n
independent bemoulli random variables each with mean p and variance p(l -p)=pq,
rival and service time Then
X = X, 4-X 2 +X3 + ..... +X11
and the mean E(X) is given by
E(X) = p+ p+ .......... + p = np
and the variance V( x) is given by
V(X) = pq + pq + ..... + pq = npq

3
VIII Sem,, ( CSE / ISE) Sy~ M~CUtd.-S ~ CBCS · J IM\el / J
b. Explain the following continuous distributions :
(l)p=}../µ
i) Uniform distribution
Ii) Exponential distribution
iii) Triangular distribution
Iv) Normal distribution. (10 Marks)
Ans. (i),(ii) : Refer Q.no. 3(a) of MQP - I
(iii) Refer Q.no. 3(b) ofMQP- 2
( iv) Normal distribution
A random variable x with mean -oo < µ < oo and variance a2 > 0 has a normal
distribution if it has the pdf.
2
I
f(x)= o.finexp
2 ""7
[ - I (x-µ) ] , - oo<x<oo

Properties :
~ µ

a) lim._,, f{x) = 0 and lim._,, f{x)= O. the valueofflx)approacheszero as xapproaches


negative infinity and x approaches has positive infinity. 5. · a. Use the llnea11
b) f{µ - x) = f(µ+x) the pdf is symmetric aboutµ with X0 = 27,
c) The maximum value of the pdf occurs at x =µthe mean and mode are equal. period.
The pdf is given by Ans. Refer Q.no. 5(

J I
F(x) = p(X~x)= _.,.x o,ff;cxp
27
[ - I('-µ)'] dt . b. T he sequence
Use Kolmogo
the numbers
OR Take D0 •0.
4. a. Explain the cha racteristics of a queuing system. (08 Marks) Ans. Refer Q.no. 5(
Ans. Refer Q.no. 4(n) of MQP - I.
b. Explain the various steady state paramcten of M/G/1 queue. (08 Marks) 6. a. Suggest a ste
Ans. Suppose the service times have mean I/µ and variance 0 2 and that there is one server. transform te
• If p = ,.Jµ < I, then the µ/g/1 queue has a steady • state probability distribution Ans. Step 1 : Com
with steady - slate characteristics as given in table. distribution th
• When "'<µ, the quantity p = Alµ is the server utilization or long - run proportion F(x) = I - e•l•
of time the server is busy. Step 2 : Set F
• l-p = p can also be interpreted as the steady state probability that the system
0
1-e-h = Ro
contains one or more customers. variable calle
• L- LQ=- p is the time - average number of customers being sened. R has a unifo
Table : Steady- state parameter of the µ/G/1 Queue Step 3: Solv

4
s
CU'\.d, ~ t m ' \ t CBCS - J IM'l.e/ I J u½1 201 9

(l)p=A/µ

(2)L=p + ;i,_2(J;:2+ o2) p+ p2(1 + 02µ2)


(10 Marks) . 2(1 -p) 2(1-p)

(3)ro =! + ;.(½2 +02)


µ 2(1-p)
o 2 > 0 has a normal
;i,.((µz + o 2
)
(4)WQ_ _..:..;.....;....._..,,........._
2(1 - p)

A
2
(J;: 2 +a )

( S) LQ=___,>;2:..(. ,.'l--
--
2
P2(l+o2µ2)
p) ~ = ___;_2(-!--p-
) ....:..

(6)p = l- p
0

es zero as x approaches Module-3


5. · a. Use the linear congruential method to generate a sequence of random numbers
with X0 = 27, a= P, C = 43 and m = 100. Write 3 ways of achieving maximal
d mode are equal. period. (08 Marks)
Ans. Refer Q,no. S(a) of MQP - 2
. b. The sequence ofrandom members 0.44, 0.81, 0.14, 0.05, 0.93 bas been generated.
Use Kolmogorav Smirnov test with a = 0.05 to determine if the hypothesis that
the numbers are uniformly distributt!d on the Interval [O, 1) can be rejected.
Take Du = 0.565. (08 Marks)
(08 Marks) Ans. Refer Q .no. S(a) of MQP - I
OR
eue. (08 Marks) 6. a. Suggest a step by step procedure to generate random variates using inverse
that there is one server. transform technique for exponential distribution. (08 Marks)
probability distribution Ans. Step I : Compute the cdf of the desired random variable x. For the exponential
distribution the cdf is give by
or long - run proportion F(x) = I - e·l.x, x ~ 0
Step 2 : Set F(x) = Ron the range of x. Fur the exponential distribution, it becomes
ability that the system I - e-1.x = R on the range x ~ 0, x is a random variable, so I - e· 1.x is also random
variable called R.
1g served. R has a uniform distribution over the interrupt
Step 3 : Solve the equation F(x) = R for interns of R

5
cuu;l,S~ cacs •JIMU!/(J~2019
(l)p=AIµ

A,2 (J;:2 +02) p+ p2 (1 +U2µZ)


2
( )L=p+ 2(1 -p) = 2{1-p)
(10 Marks)

(3)ro=.!.+ ,-(J;:2 +02)


µ 2(1-p)
0
2 > O has a normal

4 w - ,{>;:2+02)
()o - 2(1-p)

SL - ,.2(J;:2+02)~P2(t+o2µ2)
( ) Q - 2(1-p) 2(1-p)
(6)p 0
= 1-p

Module-3
s zero as x approaches
S. · a. Use the linear congruential method to generate a sequence of random numbers
with X0 = 27, a = 17, C = 43 and m "' 100. Write 3 ways of acblevina maximal
ti mode are equal. period. (08 Marlu)
Ans. Refer Q.no. S(a) of MQP - 2
. b. Thesequence of random members 0.44, 0.81, 0.14, 0.05, 0.93 has been generated.
Use Kolmogorav Smirnov test with a = 0.05 to determine if the hypothesis that
the numbers are uniformly distributed on the interval f0, J) can be rejected.
Take D = 0.565. (08 Marks)
(08 Marks) Ans. Refer Q~no. S(a) of MQP - I
OR
e. (08 Marks) 6. a. Suggest a step by step procedure to generate random variates using inverse
at there is one server. transform technique for exponential distribution. (08 Marks)
obability distribution Ans. Step I : Compute the cdf of the desired random variable x. For the exponential
distribution the cdf is give by
long - run proportion F(x) = I - e·,._, , x ~ 0
Step 2 : Set F(x) =Ron the range ofx. For the exponential distribution, it becomes
bility that the system I - e-"" = R on the range x ~ 0, x is a random variable, so I - e·"' is also random
variable called R.
served. R has a uniform distribution over the interrupt
Step 3 : Solve the equation F(x) = R for interns of R

5
VIII Sem, ( CS£ I IS£) CBCS · JIM'l.,(V f JIA½'
1-e-i..=R b. Explain the types
-e-"-• =R-1
Ans. Types of steady st
-Ax= ln(l -R) • In analyzing si~
transient simulf
x =-±ln{l-R) ... (I) isa random variable generator for theeqn (I) is • A terminating s
a specified eve,
written as
• Such a simula
X= r 1 (R) and closes at t
Generating a sequence of values is done through step 4. Examples ofter
Step 4 : Generate unifonn random number R1 ~ and ~ and compute the desired i. The shady grov
random variates by 11 tellers workinij
x, = f·I (R,) open for 48° minui
between custome
ForexponentialcaseF-1 (R) = _2._ln (t - R) byeqn(l) and closing down
A
ii. A communicati
-I
So X, =,: ln{l-R.) .... (2) components. It is r
Fori= 1,2,3 .....onesimplification that is usually in eqn(2) is to replace 1- R, by R,
-I
X, =,:ln(R,)

b. What is acceptance rejection technique? Generate three Poisson variates with


mean a = 0.2. The random numbers are 0.4357, 0.4146, 0.8353, 0.9952, 0.8004,
0.7945, and 0.1530. (08 Marks) Consider the systc
Ans. Refer Q.no. S(b) of MQP - I event E is defined
Module-4 are that all compon
Examples of stead
7. a. Explain the steps involved in the development of a useful model of input data. i. Consider the m
(08 Marks) the complete produ
Ans. Refer Q.no. 7(a) of MQP - I production levels a
b. Apply chi - square goodness of fit test for Poisson distribution with a ,. 3.64, shifts, this may be
data size = 100 and observed frequency 0 1 • 12, 10, 19, 17, 10,8,7,5,5,3,3, 1 precise estimates o
(JJ0.05,3• 11.1). (08 Marks) could decide to sim
Ans. Refer Q.no. 7(b) ofMQP - I by the nature of the
design of the simul
OR
8. a. Explain the different ways ofselccting input models when data is not available.
(08 Marks) 9. a. Discuss output ana
Ans. Refer Q.no. 7(a) of MQP - 2 Ans. Refer Q.no. 9(b) of
b. Discuss output au
Ans. Refer Q.no. 8(b) of

6
~s~
CBCS · JWU!,,/ July 2019
b. Explain the types of simulation with respect to output analysis. Give examples.
(08 Marks)
Ans. Types of steady state simulation
• In analyzing simulation output date, a distinction is made between terminating or
transient simulation and steady stale simulation.
theeqn(l)is • A terminating simulation is one that runs for some duration of time Te, where e is
a specified event, that stops the stimulation.
• Such a simulated system op~ns at time O under well specified initial conditions
and closes at the stopping time TE'
Examples of terminating simulation
i. TI1e shady grove bank opens at 8.30 am with no customers present and 8 of the
compute the desired
11 tellers working and closes at 4.30 pm. Here the event E is the fact that the bank
open for 48° minutes. The simulation analyst is interesed in modeling the interaction
between customers and tellers over the entire day including the effect of starting up
and closing down at the end of the day.
ii. A communication system consists of several components plus several backdrop
components. It is represented as below.

lace!- R, byR,

isson variates with


l53, 0.9952, 0.8004, Consider the system over a period of time TE until the system fails. The stopping
(08 Marks) event Eis defined by E = { A fails, or D fails or (Band c both fail)} Initial conditions
are that all components at a new at time 0.
Examples of steady - state simulation
1odel of input data. i. Consider the manufacturing process, beginning with the second shift, when
(08 Marks) the complete production process is under way. It is desired to estimate long - run
production levels and production efficiencies. For the relatively long period of 13
shifts, this may be considered as a steady state simulation. To obtain sufficiently
tion with a = 3.64, precise estimates of production efficiency and other response variables, the analyst
1
17, 10,8,7,5,5,3,3, 1 could decide to simulate for any length of time, TE. That is , TE is not d1:termined
(08 Marks) by the nature of the problem, rather, it is set by the analyst as one parameter in the
design of the simulation experiment.
Module-5
ta is no t available. 9. a. Discuss output analysis for steady state simulation in detail. (08 Marks)
(08 Marks)
Ans. Refer Q.no. 9(b) of MQP • I
b. Discuss output analysis for terminating simulation in detail. (08 Marks)
Ans. Refer Q.no. 8(b) of MQP • 2

7
VIII Sem.- (CSE ( ISE)
OR Eight Semest
10. a.Explain with neat diagram, a model building verification and validation.
(08 Marks) Time: 3 hrs.
Ans. Refer Q.no. IO(a) of MQP - I Note : An., wrr an

b. Describe the 3 steps approach formulated by Naylor and Finger in the validation
process. (08 Marks)
1. a. With a neat d
Ans. ReferQ.no. IO(a)ofMQP-2 Ans. Refer Q.No I.
b. A small gro
checkout cou,
inter-arrival
times vary fi
Serviet
Proba!
Develop slm
(ii) Average s
random digits
are 84, 10, 74
Ans. Distribution c
Time between
I
2
3
4
s
6
7
8
9
10
Distribution o
Service time
I
2
3
4
5
6

8
s~
E ight Semester B.E. Degree Examination, CBCS - Dec 2019 / Jan 2020
System Modelling and Simulation
d validation.
(08 Marks) Time: 3 hrs. Max. Marks: 80
Note : Answer any FIVE f11/I q11estions, selecting ONE full quest/011 f rom eaclt mod11le. _,

er in the validation Module- I


(08 Marks) 1.a. With a neat diagram, explain the steps in simulation study. (08 Mnrks)
Ans. Refer Q.No I .a. of MQP - I
b. A small grocery shop has one checkout counter. Customers' arrive at this
checkout counter at random from I to 10 minutes apart. Each possible value of
inter-arrival time bas the same probability of occurrence equal to 0.10. Service
times vary from 1 to 6 minutes with probability shown below:
Service time· I 2 3 -I 5 6
Probability 0.05 0.10 0.20 0,30 0.25 O.IO
Develop simulation table for 10 customers. Find: (i) Average waiting time
(ii) Average service time (iii) Average-time customer spends in system. Take the
random digits for a rrivals as 91,72, 15,94, 30, 92, 75, 23, 30 a nd for service times
are 84, 10, 74, 53, 17, 79, 91, 67, 89, '38 sequentially. (08 Marks)
Ans, Distribution of time between arrivals ' •·
Time between arrivals Probability Cumulative probability Random digit assignment
I 0.10 0.10 01 - 10
2 0.10 0.20 11 -20
3 0.10 0.30 21 - 30
4 0.10 0.40 31 - 40
s 0.,10 0.50 41 - so
6 0.10 0.60 51 -60
7 0.10 0.70 61 - 70
8 0.10 0.80 71 - 80
9 0.10 0.90 81 -90
10
.
Dastnbution of service tames.
0.10 1.00 91 -00

Service time Probability Cumulative probability Random digit assignment


I O.oJ 0.05 l-S
2 0.10 0.15 6-15
3 0.20 0.35 16-35
4 0.30 0.65 36-65
s 0.25 0.90 66-90
6 0.10 1.00 9 1-00
Inter arrival tame determmatlon

9
VIII Se,m,(CSE I IS'E) S y ~ M ~CMld,,S ~ CBCS · Veo2019 I J
Customers Random digilS Time between arrivals
I - - 2. a. A company uses 6
2 91 10 a railroad. Each t
3 72 8 moves to the weig
4 15 2 come first served
s 94 10 afterward returns
6 30 3 time are given in t
7 92 10
8 75 8 Loadin
9 23 3 Wei hin
10 30 3 Travel ti
Service time determination Calculate the total
Customers Random digits Time between arrivals uUlization. Assum
I 84 5 Stopping event ti
2 10 2 Ans. Refer Q.No. 2.b. 0
3 74 5 b. Explain the terms
4 53 4 i) Event ii
s 17. 3 v) Clock v
6 79 5 Ans. 1. Event : An insta
7 91 6
8 67 5 Ex: an arrivaVdepa
9 89 s 2. Event notice :
10 38 4 along with any as
Simulation table for single channel queue problem record includes the
Customer later Interval Service Time W1ltln1 Time Time customer Idle of an bank, we hav
■rrlv■I time time lffVke time la aervlce spends la time or a possible e,vent no
time be&ln• queue ends system server • Event type (e.g.
I - 0 5 0 0 s s 0 • Event time (e.g.
2 10 10 2 10 0 12 2 s • Customer num
3 8 18 s 18 0 23 5 6 3. FEL : A list ofe
4 2 20 4 23 3 27 7 0 known as the futu
s 10 30 3 30 0 33 3 3 4. Delay : Duratio
6 3 33 .s 33 0 38 .s 0 Ex: a customer's
7 10 43 6 43 0 49 6 s S. Clock : A varia
8 8 .SI .s .SI 0 56 5 2 Consider die Able.
9 3 54 5 56 2 61 7 0
10 3 51 4 61 4 65 8 0 LQ(t): The numbe
LA(t): 0 or I to in
·1) A verage wa1tmg
. . time=
. 9
- = 09 •
. mms LB(t) 0 or I to ind
10 6. System state :
44 to describe the sy
1·1·) Average service
. time=
. - = 4.4 mms
·
10 Ex: Busy or Idle s
"') Average time
· customer spend .m system • -53 • S.3 mins 7. Activity: Dura
111
10 Ex: a service time

10
~s~ CBCS · Vec,2019 / J<M\/2020
Iarrivals OR
2. a. A company usea 6 trucks to haul manganese from the entrance of a mine to
a railroad. Each truck is loaded by one of two loadeni. After loading a truck
moves to the weighting scale to be weighted. Both loaders and scale have lint-
come first served queue. After being weighted a truck begins'a travel time
afterward returns to loader queue. The activity ofloading, weighing and travel
· time are given in the following table:
Loading time 10 5 5 10 15 10 10
Weighing time 12 12 12 16 12 16 -
Travel time 60 100 40 40 80 - -
Calculate the total busy time ofw1th loaders, the scale, average loader and scale
p arrivals utilization. Assume 5 trucks are at the loader and one is at the scale, at time '0'.
Stopping event time TE = 24 min. (08 Marks)
Ans. Refer Q.No. 2.b. of MQP - 2
b. Explain the terms used in discrete event simulation with am example:
i) Event Ii) Event notice iii) FEL Iv) Delay
v) Clock vi) System state vii) Activity viii) System (08 Marks)
Ans. 1. Event : An instantaneous occurrence that changes the state of a system.
Ex: an arrival/departure of a customer
2. Event notice : A record of an event to occur at the current or some future time,
along with any associated data necessary to execute the event; at a minimum, the
record includes the event type and the event time. Ex :when simulating the operation
ofan bank, we have two types of events, Arrival &Departure. With these two events,
Time customu ldlt
a possible e.vent notice will be
spends in time or
system server • Event type (e.g. Arrival or Departure)
s 0 • Event time (e.g. 8)
2 s • Customer number
s 6 3. FEL : A list ofevent notices for future events, ordered by time of occurrence; also
7 0 known as the future event list (FEL).
3 3 4. Delay: Duration of time of unknown length, which is not known until it ends
5 0 Ex: a customer's waiting in-line for service to begin
6 5 S. Clock : A variable representing simulated time.
s 2 Consider the Able. Baker carhop system. The components ofdiscretc-cventmodel are:
7 0
8 0
LQ(t): The number of cars waiting to be served at time t
LA(t): 0 or I to indicate Able being idle or busy at time t
LB(t) 0 or I to indicate Baker being idle or busy at time t .
6. System state: A collection of variables that contain all the information necessary
to describe the system at any time.
Ex: Busy or Idle state of server
7. Activity : Duration of time of specified length .It is known when it begins.
Ex: a service time or inter-arrival time

11
VIII Sem., (CSE I ISE) S y ~ Model.Un.g, ~S~ CBCS · 'Deo2019
8. System : A collection of entities (e.g., people and machine~) that interact together 3) Geometric a
over time to accomplish one or more goals. Bernoulli trials,
Ex: Bank, Transportation to achieve the fi
Module-2
3. a. Discuss different types of discrete distribution. P(x) ={q':P,
(10 Marks)
Ans. Types of discrete distributions
1) Bernoulli trails and tbe Bernoulli distribution : Let x, = I if the jd' experiment The event [x •
resulted in success and let x = 0 if the jd' experiment resulted in a failure. Then the failures has
Bernoulli trails are called a ~moulli process if the trails are independent, each trial p. Thus
has only two possibilities outcomes and the probability of a success remains constant P(FFF .... FJ -=
from trial to trail. Thus The mean varia
P( x., x 2 ••..x = P, (x, ),P1 (x1 ) .. . ..P1 ( x 1 }
• E(x) =.!.
0 )

p
P, x, =I, j =1,2, ....n
and P; ( x,) = P( x,) = 1- P = q, x, = =
0, j 1,2,....n and V(x)=..9...
pl
{ 0, otherwise
4) Poisson dis
For one trial, the distribution given in equation (I) is called Bernoulli distribution.
The mean and variance of x, are calculated as follows.
E(x,) = 0.q + 1.P + P
and V(x)=
J
[(02. q)+ (12. P)-P2 = P(l -P)
:;:·~{·~~; 0,
2) Binomial distribution
The random variable x that denotes the number of successes in Bernoulli trials has a Where a> 0 o
binomial distribution given by p(x) where mean and vari

{(") .•-·
E(x) -= a = V(x
P(x) = x pq x = 0,1,2, ....n ....(1) The cumulativ
0, otherwise F(x) = f e-oi!
,-o
All the successes, each denoted bys, occurring in the first x trials, followed by the
n-x failures, each denoted by an F that i~ b. Explain the d
Ans. Refer Q.No. 4
P(SS;::. SS)(F; -~~-=-FF) p"q" •
4. a. .Explain the
where q -= I - P. There are Ans. Unifonn and

(:)= xl(nn~x)!
Triangular di
Nonnal distr
Each with mean P and variable P ( I - P) • pq. b. Explain Ke
also interp
Then x = x.+ '½ + ..... + x.
and the mean E(x) is given by Ans. The letters I
E(x) = p + p + ... p:. np A-+ represe
and variance V(x) is given by e-rep
V(x) = pq + pq + ..... + pq = npq
C-+rep
N-+re
K-+re
12 s.-+w
Cl,t'\.d, s~ CBCS - Oec-2019 / JOtKv2020

pthat interact together 3) Geometric and Negative Binomial distributions : It is related to sequence of
Bemoul Ii trials, the random variable of interest x, is defined to be the number of trials
to achieve the first success. The distribution ofx is given by
P(x) ={q•-'p, x=l,2.....
(10 Marks) 0 otherwise
The event [x = x) occurs when there are x - I failures followed by a success, Each of
I if the j"' experiment
the failures has an associated probability ofq = I - P and each success has probability
lted in a failure. Then
p. Thus
independent, each trial P(FFF .... F8) = q•- 1 p
ccess remains constant
The mean variance are given by
E(x)=.!_
p

and V(x)=.1.
p2
4) Poisson distribution :- It describes many random process quite well and is
Bernoulli distribution. mathematically simple. The poisson probability mass functfon is given by

P(x) = Je-:~• • x =0,1.....


1 0, Otherwise
in Bernoulli trials has a Where a > 0 one of the important properties of the poisson distribution is that the
mean and variance are both are equal to a, that is
E(x) = a= V(x)
The cumulative function is given by
• e-a•
F(x) = ~ -
~
,-o ·,
I.
x trials, followed by the
b. Explain the different characteristics or queuing systems. (06 Marks)
Ans. Refer Q.No. 4.a. of MQP- I
OR
4. a. Explain the different types or continuous distribution. (10 Marks)
Aas. Uniform and exponential distribution - Refer Q.No. 3.a. ofMQP- I
Triangular distribution - Refer Q.No. 3.b. of MQP - 2
Normal distribution - Refer Q.No. 3.b. of June / July 2019
b. Explain Kendall's notation for parallel sen-er queuing system AIBICINIK and
also interpret meaning or MIMl<X>! oo I• (06 Marks)
Ans. The letters IAIBICINIK represent the following system characteristics
A -+ represents the inter arrival time distribution
B -+ represents the service time distribution
C -+ represents the number of parallel servers
N -+ represents the system capacity
K-+ represents the size of the calling population
5Mt,4'41,- f«AM $uMv 13
VIII Se.mt (CSf I ISf) GBCS - 'Dec- 201
Common symbols for A and B include µ(exponential or Morkov), D(Constant
or deterministic), E/Enlang of order k), pll(Phase type), H(hyperx potential), C
(arbitrary or general) and GI (general independent)
Ex :- M I M I 21 oo I oo indicates a double server system
that has unlimited queue capacity and an infinite population of potential arrivals, the b. Explain the
inter arrival times and service times are exponentially distributed when N and K are
infinite. they may be dropped from the notation. Ans. Properties o
Module-3 A sequence
S. a. What is acceptance-rejection technique? Generate three Poisson variates with properties :
mean a= 0,2. The random numbers are 0.4357. 0.4146, 0.8353,0.9952, 0.8004, independent
0.7945, 0.1530. (06 Marks) I that is, the
Ans. Refer Q.No. S.b. of MQP - I f(x) ,,,,-{ I,
b. Explain linear congruential method. Write three ways of achieving maximal 0,
period. (OS Marks) Some conse
Ans. Refer Q.No. S.a. ofMQP- 2 I) If the inte
c. The sequence of random numbers 0.54, 0.73, 0.98, 0.11 and 0.68 has been expected nu
generated. Use Kolmogorov-Smlrnov test with a - 0.05 to determine if the of observati
hypothesis that the numbers are uniformly distributed on the intenral (0, 1) can 2) The prob
be rejected. Take Dax= 0.565. (05 Marks) previous val
Ans. Properties o
0- ReferQ.No.
D+
i-1 c. Use the Ch
i RI -I

N
-N
I
- -R
N I
R,-C~') uniformly

I 0.11 0.20 0 0.09 0.11


2 0.54 0.40 0.20 - 0.34
3 0.68 0.60 0.40 - 0.28
4 0.73 0.80 0.60 0.07 0.13
5 0.98 I 0.80 0.02 0.18
Compute D - max (D♦, 0--) Ans.
= max(0.09, 0.34)
D = 0.34
Check= D > Da, if reject
= 0.34 < 0.S6S
As D < D0 , Hypotensis is accepted
OR
6. a. What is inverse transform technique? Generate five exponential random
variates with= I and random numben 0.1306. 0.0422, 0.6597, 0.7965, 0.7696.
(06 Marks)
Ans. Inverse transform technique can be used to sample from the exponential, uniform,
weibul, triangular and emprical distributions. It is '!?e under lying principle for
sampling from a wide variety of discrete distributions.

14
cacs . vec, 201 9 / J'4W 2020
s~ 4 5
l 2 3
I
orkov), D(Constant
yperx potential), C R 0.1306 0.0422 0.6597 0.796S 0.7696
0.1400 0.0431 1.078 l.592 1.468
X
b. Explain the properties of random numben and pseudo-random numben.
tenlial arrivals, the (05 Marks)
ed when N and K are
Ans. Properties of Random numben
A sequence of random numbers R1, R1.... must have two important statistical
properties : uniformity and independence. Each random number R, must be an
oiuon variates with independent sample drawn from a continuous uniform distribution between zero and
53,0.9952, 0.8004, 1 that is, the pdf is given by
(06 Marks)
f(x)~·{'• Osxsl
0, otherwise
achieving maximal
(OS Marks) Some consequences of the uniformity and independence properties are following :-
1) If the interval (0, I] is divided into n classes or subintervals of equal length, the
and 0.68 bas been
expected number of observations in each interval i,; N/n where N is the total number
to determine if the of observations.
2) The probability of observing a value in a particular interval is independent of the
tlaeinterval(O,llcan
(OS Marks) previous value drawn.
Properties of Pseudo Random Numbers
Refer Q.No. 6.a. ofMQP • I
c. Use the Chi-square test with a • O.OS to test whether the data shown below are
uniformly distributed or not. (Note: X,2...., • 6.9J
0.594 1>J5I 0.422 0.127 0.182
0.11
0.34 0.928 0.262 0.797 0.788 0.929
0.28 0.515 o.cm 0.798 0.825 0.852
0.13 0.507 0.474 0.227 0.07 0.05
0.18
(05 Marki)
Ans.

Interval (0, - E.)2


01 E; 0-E
I I (01-E,)I
E,
0 - 0.1 5 2 3 9
0.1-0.2 4.S
2 2 0 0
0.2 •0.3 I
0
ntial random 2 -I I
.7965, 0.7696. 0.3 -0.4 2 2 0.5
0 0
06Marks) 0.4 • 0.5 3 2 I
0
o., •0.6 0 2
I o.s
.J, c1ple
unifonn -2
. for' 0.6 - 0.7
0.7 • 0.8
3 2 I
4
I
2
2 2 0
0.5
0.8 • 0.9 2 0 0
2 0
0.9- 1. 0 0 0
~
2 -2
"" .,,. -
4 2
VIII Sem, ( CSE I ISE) Syltem-M~ (U'\d, S ~
x,/ = I0. This is compared with the critical value 16.9
"'r,2 < x/,.. 1 so hypothesis is accepted i.e., 10 < 16.9
Module- 4
7. a. Explain different steps in the development of useful model of input data with
example. (08 Marks)
Ans. Refer Q.No. 7.a. ofMQP- I
b. Explain Chi-Square goodness of fit test. Apply it to Poisson assumption with
parameter a • 3.64, the number ofvebicles arriving at a junction in a five minute
period was observed for 100 days. The resulting data is as follows:
No. of arrivals O I 2 3 4 S 6 7 8 9 IO 11
Frequency 12 10 19 17 10 8 7 S S 3 3
Determine whether the assumption that arrivals follows Poisson distribution
can be accepted at 0.05 level ofsignificance, given x20s,s= 11.1 (08 Marks)
Ans. Refer Q.No. 7.b. ofMQP- I
OR
8. a. Explain the type ofsimulation with respect to output analysis, give an example.
(08 Marks)
Ans. Refer Q.No. 8.b. of June/ July 2019
b. Briefty explain the confidence-interval estimation method. (08 Marks)
Ans. Refer Q.No. 8.a. of MQP - I
Module- 5
9, a. Explain three step approach for validation process as formulated by Nayier and
Fineer (08 Marks)
Ans. Refer Q.No. IO.a. of MQP - 2
b. Explain with heat diagram, model building verification and validation.
(08 Marks)
Ans. Refer Q.No. IO.a. of MQP - I
OR
10. a. Explain output analysis for termination simulation. (08 Marks)
Ans. ReferQ.No. 8.b. ofMQP- 2
b. DiscUJS output analysis for steady state simulation in detail, (08 Marks)
Ans. ReferQ.No. 9.b. ofMQP- I

16
odel of input data with
(08 Marks)

lsson assumption with


unction in a five minute
as follows:
9 10 11
3 3 I
Poisson distribution
U.l (08 Marks)

Is, give an example.


(08 Marks)

(08 Marks)

ulated by Nayierand
(08 Marks)

a■d validation.
(08 Marks)

(08 Marks)

(08 Marks)
TECHNICAL POSTS
OROUP • B
ELECTRONICS
r,:COMMUMCATION

You might also like