FACULTY OF INFORMATION TECHNOLOGY


BIG DATA & INTERNET OF THINGS 600

Name & Surname: BWALYA TERISSA POPOPO ICAS/ITS No: 402106779

Qualification: BSCIT Semester: 2nd Module Name: BGDIOT 600

Date Submitted: 30 Sept 2022

ASSESSMENT CRITERIA                      MARK ALLOCATION   EXAMINER MARKS   MODERATOR MARKS

MARKS FOR CONTENT
QUESTION ONE                             50
QUESTION TWO                             20
QUESTION THREE                           20
TOTAL                                    90

MARKS FOR TECHNICAL ASPECTS
TABLE OF CONTENTS                        2
(Accurate numbering according to the numbering in the text and page numbers.)
LAYOUT AND SPELLING                      3
(Font – Calibri 12; line spacing – 1.0; margins justified.)
REFERENCES                               5
(According to the Harvard method.)
TOTAL                                    10

TOTAL MARKS FOR ASSIGNMENT               100
Examiner’s Comments:

Moderator’s Comments:

Signature of Examiner: Signature of Moderator:


TABLE OF CONTENTS

QUESTION ONE

QUESTION TWO

QUESTION THREE

QUESTION FOUR

REFERENCES


QUESTION ONE
1.1. Define what is Big Data and what is IoT in detail?
(1.1.1) Big Data refers to very large, complex collections of data, both structured and unstructured, that
are hard to manage with traditional tools. Below are the types of Big Data:
 Structured – any data that can be stored, accessed, and processed in a fixed format is called
‘structured’ data.
 Unstructured – any data with an unknown form or structure is classified as ‘unstructured’
data.
 Semi-structured – data that contains elements of both forms; it is not stored in a rigid schema
but still carries some organizational markers.
This data is so large and complex that none of the traditional data management tools can store or
process it efficiently. Its defining characteristics are commonly summarized as the big V’s of Big Data:
Volume, Velocity, Variety, Veracity and Value.

Why is Big Data so important?

Big Data is important for the progress of technology and, if used wisely, for improving our lives.
Big Data also has a lot of potential. Companies use Big Data in their systems to improve
operations, provide better customer service, create personalized marketing campaigns, and more.
This helps businesses make faster and more informed decisions.


How does Big Data work?


Big data brings new ideas that open paths to new opportunities and business models. Data is
produced every day, and the main idea behind Big Data is that the more information we acquire,
the more insight we can gain to make decisions or find solutions.
Analysing Big Data can be done by machines or by humans, depending on the need.

(1.1.2) What is the Internet of Things?


The Internet of Things is a network of physical devices (‘things’) embedded with software and
sensors that allow them to connect and share data. In the home, IoT is simply a network of
Wi-Fi-enabled appliances and other devices that all connect to the internet; the goal is to create a
smart home filled with internet-connected appliances that you can control remotely from your
phone or other devices.
How does the Internet of Things work?
An Internet of Things system usually consists of web-enabled smart devices that use embedded
processors, sensors and communication hardware to collect, send and act on data they acquire
from their environments.
IoT devices contain sensors and small processors that act on the data they collect, often via machine
learning. IoT devices are miniature computers and, because they connect to the internet, they
are also vulnerable to malware and hackers. The devices do most of the work without human
intervention, although people can interact with the devices - for instance, to set them up, give
them instructions or access the data.


Why is Internet of Things important?

The internet of things helps people live and work smarter and gain complete control over their lives. In
addition to offering smart devices to automate homes, IoT is essential to business. IoT provides businesses
with a real-time look into how their systems work, delivering insights into everything from the performance
of machines to supply chain and logistics operations.


1.2. Advantages of Big Data and IoT


 Cost-effective Operations - Due to the reduced downtime periods, ensured by automatically
scheduled and controlled maintenance, supply of raw materials, and other manufacturing
requirements, the equipment may have a higher production rate resulting in bigger profits.
Again, IoT devices greatly facilitate management within individual departments and across the
whole enterprise structure.
 Improved staff productivity and reduced human labour - Thanks to IoT solutions, mundane tasks
can be done automatically, so human resources may be transferred to more complex tasks
requiring personal skills, especially out-of-the-box thinking. This way, the number of workers can
be minimized, which results in reduced costs of business operation.
 Efficient management operations - Another significant benefit offered by the interconnection of
smart devices is automated control over multiple operation areas, including, among others,
inventory management, shipping tracking, and fuel and spare parts management. For example,
this approach involves using RFID tags and a corresponding network of sensors to track the
location of equipment and goods.
 Improved work safety - The scheduled maintenance is also highly advantageous for ensuring
operational safety and compliance with the required regulations. In turn, safe working
conditions make the enterprise more attractive to investors, partners, and personnel, increasing
the brand reputation and trust. Smart devices also reduce the probability of human error during
various stages of business operation, which also contributes to a higher level of safety. In addition,
a network of IoT devices such as surveillance cameras, motion sensors, and other monitoring
devices can be utilized to ensure the security of an enterprise and prevent theft and even corporate
espionage.
 Improved customer service and retention – The collection of user-specific data through smart
devices also helps businesses better understand the expectations and behaviour of their
customers. IoT also improves customer service by facilitating follow-ups after sales, such as automatic
tracking and reminding the customers about required maintenance of purchased equipment after
its predefined period of use, the ending of warranty periods, etc.
 More trustworthy image of the company - A company that employs high-tech solutions, and IoT in
particular, generally makes a positive impression on customers, investors, and other business
partners who are aware of numerous advantages offered by the Internet of Things. Moreover, it is
easier to attract highly-sought experienced staff if a company provides a safe and secure working
environment ensured by a network of smart devices.

Disadvantages:
 Compliance – Another thorny issue for big data analytics efforts is complying with government
regulations. Much of the information included in companies’ big data stores is sensitive or
personal, which means firms may need to ensure that they are meeting industry standards or
government requirements when handling and storing the data. In the Syncsort survey, data
governance, including compliance, was the third most significant barrier to working with big data.
When respondents were asked to rank big data challenges on a scale from 1 (most significant) to
5 (least significant), this disadvantage of big data got more 1s than any other challenge.


 Need for cultural change: Many of the organizations that are utilizing big data analytics don’t just
want to get a little bit better at reporting, they want to use analytics to create a data-driven
culture throughout the company. In fact, in the NewVantage survey, a full 98.6 per cent of
executives said that their firms were in the process of creating this new type of corporate culture.
However, changing culture is a tall order. So far, only 32.4 per cent were reporting success on this
front.
 Data quality – In the Syncsort survey, the number one disadvantage to working with big data was
the need to address data quality issues. Before they can use big data for analytics efforts, data
scientists and analysts need to ensure that the information they are using is accurate, relevant and
in the proper format for analysis. That slows the reporting process considerably, but if enterprises
don’t address data quality issues, they may find that the insights generated by their analytics are
worthless – or even harmful if acted upon.
 Cybersecurity risks – Storing big data, particularly sensitive data, can make companies more
attractive targets for cyber-attackers. In the AtScale survey, respondents have consistently listed
security as one of the top challenges of big data, and in the NewVantage report, executives ranked
cybersecurity breaches as the single greatest data threat their companies face.

 Costs – Many of today’s big data tools rely on open source technology, which dramatically reduces
software costs, but enterprises still face significant expenses related to staffing, hardware,
maintenance and related services. It’s not uncommon for big data analytics initiatives to run
significantly over budget and to take more time to deploy than IT managers had originally
anticipated.
 Difficulty integrating legacy systems – Most enterprises that have been around for many years
have siloed data in a variety of different applications and systems throughout their
environments. Integrating all those disparate data sources and moving data to where it needs to be
adds to the time and expense of working with big data.


QUESTION TWO
2.1. List five big data analysis techniques and define how each is applied.

 Natural Language Processing - NLP is a sub-speciality of computer science, artificial
intelligence, and linguistics, which uses algorithms to analyse human (natural) language.
Natural Language Processing (NLP) allows machines to break down and interpret human
language. It’s at the core of tools we use every day – from translation software, chat-bots,
spam filters, and search engines, to grammar correction software, voice assistants, and social
media monitoring tools. NLP tools transform the text into something a machine can
understand, and then machine learning algorithms are fed training data and expected outputs
(tags) to train machines to make associations between a particular input and its corresponding
output. Machines then use statistical analysis methods to build their own “knowledge bank”
and discern which features best represent the texts, before making predictions for unseen data
(new texts).

Ultimately, the more data these NLP algorithms are fed, the more accurate the text analysis
models will be.

Sentiment analysis is one of the most popular NLP tasks: machine learning models are trained to
classify text by polarity of opinion (positive, negative, neutral, and everywhere in between).
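
As a minimal sketch of this idea (assuming scikit-learn is available; the tiny training set, labels and test
sentence are invented purely for illustration), a simple bag-of-words Naive Bayes classifier can be trained on
labelled examples and then applied to unseen text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, invented training set: text snippets labelled by polarity.
texts = ["I love this product", "great service, very happy",
         "terrible experience", "I hate the new update"]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # turn text into word-count features
model = MultinomialNB().fit(X, labels)     # learn word/polarity associations

# Classify unseen text using the same feature transformation.
new = vectorizer.transform(["the service was great"])
print(model.predict(new))                  # expected: ['positive']
```

With realistic volumes of labelled text, the same pipeline simply scales up, which is why feeding these
algorithms more data generally yields more accurate models.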


 Regression Analysis - Regression analysis is a powerful statistical method that investigates
the relationship between two or more variables. It is used extensively to build models on
datasets that accurately predict the values of a dependent variable. To conduct a regression
analysis, you gather the data on the variables in question. (Reminder: you likely don’t have to do
this yourself, but it’s helpful for you to understand the process your data analyst colleague uses.)
You take all of your monthly sales numbers for, say, the past three years and any data on the
independent variables you’re interested in. So, in this case, let’s say you also find out the average
monthly rainfall for the past three years. Then you plot all of that information on a scatter chart,
with one point per month.

The y-axis is the number of sales (the dependent variable, the thing you’re interested in, is
always on the y-axis) and the x-axis is the total rainfall. Each blue dot represents one month’s
data—how much it rained that month and how many sales you made that same month.
Glancing at this data, you probably notice that sales are higher in months when it rains a lot.
That’s interesting to know, but by how much? If it rains 3 inches, do you know how much you’ll
sell? What about if it rains 4 inches?
Now imagine drawing a line through the chart above, one that runs roughly through the
middle of all the data points. This line will help you answer, with some degree of certainty, how
much you typically sell when it rains a certain amount.


This is called the regression line and it’s drawn (using a statistics program like SPSS or STATA or
even Excel) to show the line that best fits the data. In other words, explains Redman, “The red
line is the best explanation of the relationship between the independent variable and
dependent variable.”
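
A minimal sketch of fitting such a regression line with NumPy; the monthly rainfall and sales figures here
are hypothetical and used only for illustration:

```python
import numpy as np

# Hypothetical monthly figures: rainfall (inches) and sales in the same month.
rainfall = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sales = np.array([120, 150, 185, 210, 260, 290])

# Fit the regression line: sales ~= slope * rainfall + intercept.
slope, intercept = np.polyfit(rainfall, sales, deg=1)
print(f"sales = {slope:.1f} * rainfall + {intercept:.1f}")

# Use the fitted line to estimate sales for a month with 3.5 inches of rain.
print("predicted sales:", slope * 3.5 + intercept)
```

The slope answers the "by how much?" question posed above: roughly how many extra sales to expect
for each additional inch of rain.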

 Association Rule - Association rule learning is a rule-based machine learning method for
discovering interesting relations between variables in large databases. It is intended to identify
strong rules discovered in databases using some measures of interestingness. Association rule
mining is a procedure which aims to observe frequently occurring patterns, correlations, or
associations from datasets found in various kinds of databases such as relational databases,
transactional databases, and other forms of repositories. An association rule has two parts: ->
an antecedent (if) and -> a consequent (then). An antecedent is something that’s found in
data, and a consequent is an item that is found in combination with the antecedent. Have a
look at this rule for instance:
“If a customer buys bread, he’s 70% likely to buy milk.” In the above association rule, bread
is the antecedent and milk is the consequent. Simply put, it can be understood as a retail
store’s association rule to target their customers better. If the above rule is a result of a
thorough analysis of some data sets, it can be used to not only improve customer service but
also improve the company’s revenue.
Association rules are created by thoroughly analyzing data and looking for frequent if/then
patterns. Then, depending on the following two parameters, the important relationships are
observed:
 Support: Support indicates how frequently the if/then relationship appears in the
database.
 Confidence: Confidence indicates how often the relationship has been found to be true.
The association rule is very useful in analyzing datasets. The data is collected using bar-code
scanners in supermarkets. Such databases consist of a large number of transaction records
which list all items bought on a single purchase. So the manager could know if certain groups
of items are consistently purchased together and use this data for adjusting store layouts,
cross-selling, and promotions based on statistics.
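
A minimal sketch of computing support and confidence for the bread-and-milk rule above, using a handful
of invented transactions:

```python
# Hypothetical basket data, e.g. from point-of-sale bar-code scanners.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "milk", "butter"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "milk"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n           # how often bread and milk appear together
confidence = both / bread    # how often milk follows a bread purchase

print(f"support(bread -> milk)    = {support:.2f}")    # 3/5 = 0.60
print(f"confidence(bread -> milk) = {confidence:.2f}")  # 3/4 = 0.75
```

Rules whose support and confidence exceed chosen thresholds are the "interesting" relationships the
technique reports.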


 Social Network Analysis - Social network analysis maps and measures the relationships and
flows between people, groups, organizations, computers, URLs, and other connected
information or knowledge entities. The nodes in the network represent the people and groups
while the links identify the relationships or flow between the nodes. Social network analysis
can be applied to any data that highlights relationships between things (e.g. individuals,
objects, events, etc.). When looking at gangs, the approach works best with data that can
capture non-criminal as well as criminal links, since a lot of useful information is contained in
social links.

 Genetic Algorithm – Genetic algorithms are inspired by inheritance, mutation and natural
selection. The genetic algorithm repeatedly modifies a population of individual solutions. At
each step, the genetic algorithm selects individuals from the current population to be parents
and uses them to produce the children for the next generation. Over successive generations,
the population "evolves" toward an optimal solution. You can apply the genetic algorithm to
solve a variety of optimization problems that are not well suited for standard optimization
algorithms, including problems in which the objective function is discontinuous, non-
differentiable, stochastic, or highly nonlinear. The genetic algorithm can address problems
of mixed integer programming, where some components are restricted to be integer-valued.
The main algorithmic steps are selection, crossover and mutation, outlined below.


The genetic algorithm uses three main types of rules at each step to create the next generation from the
current population (a short code sketch of these steps follows the list):
 Selection rules select the individuals, called parents, that contribute to the population of the next
generation. The selection is generally stochastic and can depend on the individuals' scores.
 Crossover rules combine two parents to form children for the next generation.
 Mutation rules apply random changes to individual parents to form children.
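
As referenced above, here is a minimal, self-contained sketch of these three rules evolving a bit-string
towards all 1s. The fitness function, population size, tournament selection and mutation rate are
illustrative choices, not a prescribed implementation:

```python
import random

def fitness(ind):
    return sum(ind)                          # score = number of 1s in the bit-string

def select(pop):
    # Selection rule: pick the fitter of two randomly chosen individuals.
    a, b = random.sample(pop, 2)
    return max(a, b, key=fitness)

def crossover(p1, p2):
    # Crossover rule: combine two parents at a random cut point.
    cut = random.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):
    # Mutation rule: randomly flip bits in the child.
    return [1 - g if random.random() < rate else g for g in ind]

# Random initial population of 30 individuals, each a 20-bit string.
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]

for generation in range(50):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]

print("best fitness after 50 generations:", max(fitness(i) for i in pop))
```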


2.2. Give a brief description of examples of devices that you might connect to and show your connection
in a simple well-labelled graph.

I. Smart Heating systems are a great way of ensuring your home’s heating bills aren’t hitting the
roof. By monitoring your usage and notifying users when they could potentially be using less
heating, many smart heating systems can bring in savings in certain areas. It’s also nice to be able
to come home to a warm apartment after a day’s work with a few swipes of a smartphone app.

Other more advanced smart heating systems such as NEST can also be used as smoke detectors
and security cameras, helping them to provide an additional layer of built-in home security to their
smart heating systems.

II. Video doorbells - Another one of the more popular IoT smart home devices is the video
doorbell, which allows users to receive video calls from their doorbell when someone is at their
door. These appliances can also allow users to unlock the door and grant access to their homes
remotely from smartphone apps.

One of the added security bonuses to video doorbells is that they are also able to notify
homeowners when someone is loitering near their property, without them having to push a button
on the doorbell itself. You can even set some systems not to ring at specific times of the day or
during periods where it would be inconvenient.

III. Door Locks - Unfortunately, crimes such as burglary and theft can happen to anyone, and so one of
the most popular smart home technologies is connected door locks. One of the biggest selling
points for smart door locks is their ability to allow you to lock your home from anywhere using a
smartphone app.
Tired of fumbling for your keys in the rain? You could unlock your home before you’re even out of
the car using smart locks. Internet-connected smart locks provide an enhanced level of access
control to users by giving them the ability to unlock their home from their office to allow friends or
relatives inside, or just ensure they didn’t forget to lock the house on their way out that morning.

IV. Smart Gardening - While many Internet of Things devices are located inside the home, the garden
has not been forgotten in this era of ever-increasing connectivity. Smart gardening has become a
thriving area of the smart home with automated, remote-controlled sprinkler systems, robot
lawnmowers and various other garden-focused IoT applications currently available and many more
likely in the future.
IoT sensors allow smart gardening systems to automate whether to increase or decrease water
supply or even collect data on incoming weather patterns to determine the most suitable course of
action. These systems can also be remotely controlled and customized by users to suit their
requirements.

V. Light automation - This involves the use of smart lighting, where you can set lights to turn on,
off, or dim at a specific time, whether you’re at home or not. Lighting automation works in
several ways:
 Dims and brightens lights automatically: the system can dim the household lights at bedtime
and brighten them again at the appropriate time, based on how you’ve programmed it.
 Turns off all lights: a single command from the app installed on your phone can turn every
light off when you’re off to bed.
 Sunrise alarm: the system simulates sunrise by gradually brightening the lights in the
morning, which can help you avoid being jolted awake by a loud alarm clock. This is also
controlled through the app installed on your smartphone.
 Entry and pathway lights: you can schedule your entryway or pathway lights to turn on
automatically before you drive in. This works with geo-fencing; the feature comes with the
automation app installed on your phone and needs only a few clicks to enable.
 Night lights: you can set your kitchen, hallway, and bathroom lights to turn on or dim when
you’re up at night, and set the garage lights to turn on once you open the door.

VI. Entertainment automation - Entertainment automation offers excellent benefits. Here are some
examples:
 Music: with smart speakers, you can stream music to every room in your home and control it
from anywhere without running cables.
 Turn on the TV automatically: you can set your TV to turn on when you’re entering your
driveway - more or less a welcome-home feature, useful if you have an important TV program
to catch up on.
 Turn off the TV for bedtime: through the smart hub, you can set all TVs in your home to turn
off at bedtime, which also helps reduce energy bills.

VII. Smart Thermostat and sensors - Another aspect of convenience automation is monitoring
conditions around the home:
 Reduce bathroom humidity: when the bathroom gets humid, an automated fan turns on until
the moisture returns to normal, provided you have programmed it to work that way - you
won’t have to worry about turning the fan on or off yourself.
 Refrigerator notifications: get notified when your refrigerator is left open for a couple of
minutes. Most refrigerators come with this feature, but you can recreate it using an
open-close sensor; a professional can install it on the fridge doors and connect it to your
smart hub.
 Circulate air to fireplaces: if your fireplace causes one room to get too warm, the home
ventilation system automatically turns on and circulates and distributes air to the room. This
needs a smart hub plus indoor and outdoor temperature sensors connected to the home
ventilation system.
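
Question 2.2 also asks how such devices connect. Conceptually, the devices (thermostat, doorbell, locks,
lights, garden sensors) report to a central hub or Wi-Fi router, which relays their data to the cloud and on
to a smartphone app. The self-contained sketch below models that connection in plain Python; the device
names and readings are invented for illustration, and no real protocol (such as MQTT) is implemented:

```python
class Hub:
    """A stand-in for the home hub / router that devices connect to."""

    def __init__(self):
        self.subscribers = []            # e.g. the smartphone app or a cloud service

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, device, reading):
        # Forward every device message to each connected app or service.
        for notify in self.subscribers:
            notify(device, reading)

def phone_app(device, reading):
    print(f"[phone app] {device}: {reading}")

hub = Hub()
hub.subscribe(phone_app)

# Devices push their sensor data to the hub over the home network.
hub.publish("smart thermostat", {"temperature_c": 21.5})
hub.publish("video doorbell", {"motion_detected": True})
hub.publish("door lock", {"locked": False})
```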


QUESTION THREE
3.1. List and describe the IOT architecture components.

 Gateways - data goes from things to the cloud and vice versa through the gateways. A gateway
provides connectivity between things and the cloud part of the IoT solution, enables data
processing and filtering before moving it to the cloud (to reduce the volume of data for detailed
processing and storing) and transmits control commands going from the cloud to things. Things
then execute commands using their actuators.

 Cloud gateways - a cloud gateway facilitates data compression and secure data transmission between
field gateways and cloud IoT servers. It also ensures compatibility with various protocols and
communicates with field gateways using different protocols, depending on which protocol each
gateway supports.

 Streaming data processor - ensures the effective transition of input data to a data lake and to control
applications, so that no data is lost or corrupted along the way.

 Data Lake - A data lake is used for storing the data generated by connected devices in its natural
format. Big data comes in "batches" or in “streams”. When the data is needed for meaningful
insights it’s extracted from a data lake and loaded into a big data warehouse.

 Big data warehouse - Filtered and pre-processed data needed for meaningful insights is extracted
from a data lake to a big data warehouse. A big data warehouse contains only cleaned, structured,
and matched data (compared to a data lake which contains all sorts of data generated by sensors).
Also, the data warehouse stores context information about things and sensors (for example, where
sensors are installed) and the commands control applications send to things.

 Data analytics - Data analysts can use data from the big data warehouse to find trends and gain
actionable insights. When analysed (and in many cases visualized in schemes, diagrams, and
infographics), big data shows, for example, the performance of devices, helps identify inefficiencies,
and suggests ways to improve an IoT system (making it more reliable and more customer-oriented).
Also, the correlations and patterns found manually can further contribute to creating algorithms
for control applications.

 Machine learning and the models ML generates - With machine learning, there is an opportunity
to create more precise and more efficient models for control applications. Models are regularly
updated (for example, once a week or once a month) based on the historical data accumulated in a
big data warehouse. When the applicability and efficiency of new models are tested and approved
by data analysts, new models are used by control applications.

 Things - A “thing” is an object equipped with sensors that gather data that will be transferred over
a network and actuators that allow things to act (for example, to switch on or off the light, to open
or close a door, to increase or decrease engine rotation speed and more). This concept includes
fridges, street lamps, buildings, vehicles, production machinery, rehabilitation equipment and
everything else imaginable. Sensors are not always physically attached to things: a sensor may
need to monitor, for example, what happens in the environment closest to a thing.


 User applications - are software components of an IoT system that enable the connection of users
to an IoT system and give the options to monitor and control their smart things (while they are
connected to a network of similar things, for example, homes or cars and controlled by a central
system). With a mobile or web app, users can monitor the state of their things, send commands to
control applications, and set the options of automatic behaviour (automatic notifications and
actions when certain data comes from sensors).

3.2. Describe two major layers and two other supporting modules of HADOOP.

I. Hadoop Distributed File System (HDFS)


HDFS is the primary data storage system used by Hadoop applications. Hadoop itself is an open-
source distributed processing framework that manages data processing and storage for big
data applications, and HDFS is a key part of the many Hadoop ecosystem technologies. It provides
a reliable means of managing pools of big data and supporting related big data analytics
applications.

How does HDFS work?


HDFS enables the rapid transfer of data between compute nodes. At its outset, it was closely
coupled with Map-Reduce, a framework for data processing that filters and divides up work
among the nodes in a cluster, and it organizes and condenses the results into a cohesive
answer to a query. Similarly, when HDFS takes in data, it breaks the information down into
separate blocks and distributes them to different nodes in a cluster.
With HDFS, data is written on the server once, and read and reused numerous times after that.
HDFS has a primary Name-Node, which keeps track of where file data is kept in the cluster.
HDFS also has multiple Data-Nodes on a commodity hardware cluster -- typically one per node
in a cluster. The Data-Nodes are generally organized within the same rack in the data centre.
Data is broken down into separate blocks and distributed among the various Data-Nodes for
storage. Blocks are also replicated across nodes, enabling highly efficient parallel processing.
The Name-Node knows which Data-Node contains which blocks and where the Data-Nodes
reside within the machine cluster. The Name-Node also manages access to the files, including
reads, writes, creates, deletes and the data block replication across the Data-Nodes.
The Name-Node operates in conjunction with the Data-Nodes. As a result, the cluster can
dynamically adapt to real-time demand by adding or subtracting nodes as necessary.
The Data-Nodes are in constant communication with the Name-Node to determine if the Data-
Nodes need to complete specific tasks. Consequently, the Name-Node is always aware of the
status of each Data-Node. If the Name-Node realizes that one Data-Node isn't working
properly, it can immediately reassign that Data-Node's task to a different node containing the
same data block. Data-Nodes also communicate with each other, which enables them to
cooperate during normal file operations.
Moreover, the HDFS is designed to be highly fault-tolerant. The file system replicates -- or
copies -- each piece of data multiple times and distributes the copies to individual nodes,
placing at least one copy on a different server rack than the other copies.
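
The block-splitting and replication idea can be illustrated with a small conceptual sketch in plain Python.
This is not the real HDFS code: the tiny block size, node names and round-robin placement are invented
for illustration (HDFS defaults to 128 MB blocks and uses rack-aware placement):

```python
BLOCK_SIZE = 16            # bytes here, purely for illustration
REPLICATION = 3            # HDFS's default replication factor
data_nodes = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Break the file into fixed-size blocks, as HDFS does on ingest.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes, replication=REPLICATION):
    # The Name-Node would track this mapping of block -> Data-Nodes.
    placement = {}
    for i, _block in enumerate(blocks):
        placement[f"block-{i}"] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

file_bytes = b"some example file content stored in HDFS"
blocks = split_into_blocks(file_bytes)
for block_id, replicas in place_replicas(blocks, data_nodes).items():
    print(block_id, "->", replicas)
```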



II. Hadoop MapReduce


MapReduce is a software framework and programming model used for processing huge
amounts of data. MapReduce programs work in two phases, namely Map and Reduce. Map
tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby,
Python, and C++. MapReduce programs are parallel in nature, and thus are very useful for
performing large-scale data analysis using multiple machines in the cluster. MapReduce is a
component of the Apache Hadoop ecosystem, a framework that enhances massive data
processing.


How does Hadoop MapReduce work?

MapReduce architecture consists of various components. A brief description of these
components can improve our understanding of how MapReduce works.

In the MapReduce architecture, clients submit jobs to the MapReduce Master. This master will
then sub-divide the job into equal sub-parts. The job-parts will be used for the two main tasks
in MapReduce: mapping and reducing.

The developer writes logic that satisfies the requirements of the organization or company.
The input data is split and mapped.
The intermediate data is then sorted and merged. The resulting output is processed by the
reducer, which generates a final output that is stored in the HDFS.


Phases of MapReduce
The MapReduce program is executed in three main phases: mapping, shuffling, and reducing.
There is also an optional phase known as the combiner phase.

Mapping phase
This is the first phase of the program. There are two steps in this phase: splitting and mapping.
A dataset is split into equal units called chunks (input splits) in the splitting step. Hadoop
provides a RecordReader that uses TextInputFormat to transform input splits into key-value
pairs.
The key-value pairs are then used as inputs in the mapping step. This is the only data format
that a mapper can read or understand. The mapping step contains a coding logic that is
applied to these data blocks. In this step, the mapper processes the key-value pairs and
produces an output of the same form (key-value pairs).

Shuffling phase
This is the second phase that takes place after the completion of the Mapping phase. It
consists of two main steps: sorting and merging. In the sorting step, the key-value pairs are
sorted using the keys. Merging ensures that key-value pairs are combined.
The shuffling phase facilitates the removal of duplicate values and the grouping of values.
Different values with similar keys are grouped. The output of this phase will be keys and
values, just like in the Mapping phase.

Reducer phase
In the reducer phase, the output of the shuffling phase is used as the input. The reducer
processes this input further to reduce the intermediate values into smaller values. It provides a
summary of the entire dataset. The output from this phase is stored in the HDFS.
Splitting is often included in the mapping stage.

Combiner phase
This is an optional phase that’s used for optimizing the MapReduce process. It’s used for
reducing the map outputs at the node level. In this phase, duplicate outputs from the map
tasks can be combined into a single output. The combiner phase speeds up the shuffling
phase by improving the performance of jobs.
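
The three main phases can be mirrored in a pure-Python word-count sketch (illustrative only; a real job
is distributed across a Hadoop cluster rather than run in a single process, and the input splits here are
just short invented strings):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (key, value) pair for every word in the input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle/sort: group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: collapse each key's values into a single summary value.
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data needs hadoop", "hadoop uses map reduce", "map and reduce"]
mapped = [pair for line in splits for pair in map_phase(line)]
print(reduce_phase(shuffle_phase(mapped)))   # e.g. {'hadoop': 2, 'map': 2, ...}
```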

III. Hadoop Common


Hadoop common is a collection of Java libraries and utilities that are required by/common for
other Hadoop modules, which contain all the necessary Java files and scripts required to start
Hadoop.
The Hadoop Common package is considered the base/core of the framework as it provides
essential services and basic processes such as the abstraction of the underlying operating
system and its file system. Hadoop Common also contains the necessary Java Archive (JAR)
files and scripts required to start Hadoop. The Hadoop Common package also provides source
code and documentation, as well as a contribution section that includes different projects from
the Hadoop Community.


IV. Hadoop YARN (Yet Another Resource Negotiator)


Hadoop YARN is the resource management and job scheduling technology in the open-
source Hadoop distributed processing framework. One of Apache Hadoop's core components,
YARN is responsible for allocating system resources to the various applications running in
a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

In a cluster architecture, Apache Hadoop YARN sits between HDFS and the processing engines
being used to run applications. It combines a central resource manager with containers,
application coordinators and node-level agents that monitor processing operations in
individual cluster nodes. YARN can dynamically allocate resources to applications as needed, a
capability designed to improve resource utilization and application performance compared
with MapReduce's more static allocation approach.


QUESTION FOUR
4.1. List and describe five security technologies applied in Big Data.

I. Physical Security
Physical security must not be ignored. It should be deployed when the big data platform in the
data centre is being built. If your data centre is cloud-based, carefully do due diligence on the
cloud provider’s data centre security. Physical security systems serve an important role in that
they can deny data centre access both to strangers and to staff members who should not have
access to sensitive areas. Video surveillance and security logs serve the same purpose.
Building a strong firewall is another useful big data security tool. Firewalls are effective at filtering
traffic that both enters and leaves servers. Organizations can prevent attacks before they happen
by creating strong filters that avoid any third parties or unknown data sources.
Essentially, big data security requires a multi-faceted approach. When it comes to enterprises
handling vast amounts of data, both proprietary and obtained via third-party sources, big data
security risks become a real concern.
A comprehensive, multi-faceted approach to big data security encompasses:
 Visibility of all data access and interactions
 Data classification
 Data event correlation
 Application control
 Device control and encryption
 Web application and cloud storage control
 Trusted network awareness
 Access and privileged user control

II. Data Provenance


Data provenance primarily concerns metadata (data about data), which can be extremely helpful
in determining where data came from, who accessed it, or what was done with it. Usually, this kind
of data should be analysed with exceptional speed to minimize the time in which a breach is active.
Privileged users engaged in this type of activity must be thoroughly vetted and closely monitored
to ensure they do not become their own big data security issues.

III. Encryption
Encryption tools need to secure data in transit and data at rest, and more importantly, these need
to be achieved across massive data volumes. Furthermore, encryption needs to operate on many
different types of data, both user and machine-generated. Encryption tools also need to work with
different analytics toolsets and their output data, and on common big data storage formats
including relational database management systems (RDBMS), non-relational databases like
NoSQL, and specialized file systems such as Hadoop Distributed File System (HDFS). Encrypted data
is useless to external entities, such as hackers, if they do not have the key to unlock it. Moreover,
encrypting data means that both at input and output, information is completely protected.
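
A minimal sketch of encrypting a record at rest with symmetric encryption, assuming the Python
`cryptography` package is available; key handling is deliberately simplified here and would normally be
delegated to a key management service (see the next section):

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, issued and stored by a key service
cipher = Fernet(key)

record = b'{"customer_id": 1042, "email": "user@example.com"}'
token = cipher.encrypt(record)       # ciphertext is useless without the key
print(token)

print(cipher.decrypt(token))         # only holders of the key can read the data
```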


IV. Centralized Key Management


Centralized key management has been a security best practice for many years, and it applies equally
in big data environments, especially those with wide geographical distribution. Best practices include
policy-driven automation, logging, on-demand key delivery, and abstracting key management from key
usage.
The benefits of a centralized key management system include:
 Unified key management and encryption policies
 System-wide key revocation
 A single point to protect
 Cost reduction through automation
 Consolidated audit information
 A single point for recovery
 Convenient separation of duty
 Key mobility

V. User Access Control


User access control may be the most basic network security tool, but many companies practice
minimal control because the management overhead can be so high. This is dangerous at the
network level as well as on the big data platform. Strong user access control requires a policy-
based approach that automates access based on user- and role-based settings. Policy-driven
automation manages complex user control levels, such as multiple administrator settings that
protect the big data platform against inside attack. Controlling who has root access to Business
Intelligence tools and analytics platforms is another key to protecting your data. By developing a
tiered access system, the opportunities for an attack can be reduced.
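
A minimal sketch of the role-based, tiered-access idea in plain Python; the roles, permissions and user
names are invented for illustration, and a real deployment would enforce this in the platform's
access-control layer rather than in application code:

```python
ROLE_PERMISSIONS = {
    "admin": {"read", "write", "manage_users"},
    "analyst": {"read"},
    "viewer": {"read"},
}

USER_ROLES = {"alice": "admin", "bob": "analyst"}

def is_allowed(user, action):
    # Look up the user's role, then check whether that role grants the action.
    role = USER_ROLES.get(user)
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("bob", "read"))           # True: analysts can read
print(is_allowed("bob", "manage_users"))   # False: only admins manage users
```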

4.2. List five ethics and five policies that govern the use and implementation of Big Data.

FIVE ETHICS

i. Informed Consent
Informed consent is the most careful, respectful and ethical form of consent. It requires the data
collector to make a significant effort to give participants a reasonable and accurate understanding
of how their data will be used.

In the past, informed consent for data collection was typically taken for participation in a single
study. Big data makes this form of consent impossible as the entire purpose of big data studies,
mining and analytics is to reveal patterns and trends between data points that were previously
inconceivable. In this way, consent cannot possibly be ‘informed’ as neither the data collector nor
the study participant can reasonably know or understand what will be garnered from the data or
how it will be used.

Revisions to the standard of informed consent have been introduced. The first is known as ‘broad
consent’, which pre-authorises secondary uses of data. The second is ‘tiered consent’, which gives

consent to specific secondary uses of data, for example, for cancer research but not for genomic
research. Some argue that these newer forms of consent are a watering down of the concept and
leave users open to unethical practices.

Further issues arise when potentially ‘unwilling’ or uninformed data subjects have their
information scraped from social media platforms. Social media Terms of Service contracts
commonly include the right to collection, aggregation and analysis of such data. However, Ofcom
found that 65% of internet users usually accept terms and conditions without reading them. So, it’s
not unreasonable to assume that many end-users may not understand the full extent of the data
usage, which increasingly extends beyond digital advertising and into social science research.

ii. Privacy
The ethics of privacy involve many different concepts such as liberty, autonomy, security, and in a
more modern sense, data protection and data exposure.

You can understand the concept of big data privacy by breaking it down into three categories:
 The condition of privacy
 The right to privacy
 The loss of privacy and invasion

The scale and velocity of big data pose a serious concern as many traditional privacy processes
cannot protect sensitive data, which has led to an exponential increase in cybercrime and data
leaks.

One example of a significant data leak that caused a loss of privacy to over 200 million internet
users happened in January 2021. A rising Chinese social media site called Socialarks suffered a
breach due to a series of data protection errors that included an unsecured Elasticsearch
database. A hacker was able to access and scrape the database, which stored:

 Names
 Phone numbers
 Email addresses
 Profile descriptions
 Follower and engagement data
 Locations
 LinkedIn profile links
 Connected social media account login names

A further concern is the growing analytical power of big data, i.e. how this can impact privacy
when personal information from various digital platforms can be mined to create a full picture
of a person without their explicit consent. For example, if someone applies for a job,
information can be gained about them via their digital data footprint to identify political
leanings, sexual orientation, social life, etc. All of this data could be used as a reason to reject
an employment application even though the information was not offered up for judgement by
the applicant.


iii. Ownership
Ownership refers to the redistribution of data, the modification of data, and the ability to benefit
from data innovations. In the past, legislators have ruled that as data is not property or a
commodity, it, therefore, cannot be stolen - this belief offers little protection or compensation to
internet users and consumers who provide valuable information to companies without personal
benefit.

We can split ownership of data into two categories:


 The right to control data - edit, manage, share and delete
 The right to benefit from data – profit from the use or sale of data

Contrary to common belief, those who generate data, for example, Facebook users, do not
automatically own the data. Some even argue that the data we provide to use ‘free’ online
platforms is in fact a payment for that platform. But big data is big money in today’s world. Many
internet users feel that the current balance is tilted against them when it comes to ownership of
data and the transparency of companies who use and profit from the data we share.

iv. Big Data divide

The big data divide seeks to define the current state of data access; the understanding and mining
capabilities of big data are isolated within the hands of a few major corporations. These dividers
create ‘haves’ and ‘have nots’ in big data and exclude those who lack the necessary financial,
educational and technological resources to access and analyze big datasets.

Despite the growing industry of applications that use data to enhance our lives in terms of health,
finance, etc., there is currently no way for individuals to mine their own data or connect potential
data silos missed by commercial software. Again, we face the ethical problem of who owns the data
we generate; if our data is not ours to modify, analyze and benefit from on our own terms, then
indeed we do not own it.

The data divide creates further problems when we consider algorithm biases that place individuals
in categories based on a culmination of data that individuals themselves cannot access. For
example, profiling software can mark a person as a high risk for committing criminal
activity, causing them to be legally stopped and searched by authorities or even denied housing in
certain areas. The big data divide means that the ‘data poor’ cannot understand the data or
methods used to make these decisions about them and their lives.

v. Algorithm bias and objectivity


Algorithms are designed by humans, the data sets they study are selected and prepared by humans,
and humans have bias.

So far, there is significant evidence to suggest that human prejudices are infecting technology and
algorithms, and negatively impacting the lives and freedoms of humans. Particularly those who
exist within the minorities of our societies.
The so-called “coded bias” has been identified in such high-profile cases as MIT lab researcher Joy
Buolamwini’s discovery of racial skin-type bias from commercial artificial intelligence systems
created by giant US companies. Buolamwini found that the software had been trained on datasets
of 77% male pictures and more than 83% white-skinned pictures. These biased datasets created a
situation wherein the programs misidentify white male faces at an error rate of only 0.8%,
whereas dark-skinned female faces are misidentified at an error rate of 20% in one case and 34% in
the other two. These biases extend beyond racial and gendered lines and into the issues of criminal
profiling, poverty and housing.

Algorithm biases have become such an ingrained part of everyday life that they have also been
documented as impacting our personal psyches and thought processes. The phenomenon occurs
when we perceive our reality to be a reflection of what we see online. However, what we view is
often a tailored reality created by algorithms and personalized using our previous viewing habits.
The algorithm shows us content that we are most likely to enjoy or agree with and discards the rest.
When filter bubbles like this exist they create echo chambers and, in extreme cases, can lead to
radicalization, sectarianism and social isolation.

FIVE POLICIES

vi. Data Processing Policies


Organizations should map the way data flows through their organization to see what data is being
processed, how it’s being used and who is receiving it, so a policy is required to ensure this
happens. This has become especially important since the GDPR took effect, as it enables
organizations to account for all their data and provide the necessary information to individuals
who submit data subject access requests.

vii. Email Policies


One of the biggest email-based threats is phishing, which can be mitigated by technology only to
some extent.
Email policies should, therefore, mandate that employees take regular staff awareness courses to
stay up to date with the threat of email-based fraud.

viii. Acceptable Use Policies


If you don’t want employees spending all day on non-work related websites, you ought to put in
place an acceptable use policy.
This outlines any activities that are outright prohibited, as well as stating limits on the amount of
time employees can spend pursuing non-work activities.
Be careful when writing your acceptable use policy. Remember, it’s about keeping your employees
away from malware and viruses as much as it is about preventing them from slacking off.

ix. Encryption Policies


Although encryption won’t stop malicious actors accessing an organization’s personal information,
it will prevent them from being able to use it.
It works by obscuring information and replacing identifiers with something else, meaning it is only
accessible or understandable to approved users.

x. Password Policies
There is so much advice on creating strong passwords, and so many warnings about the perils of
weak ones that there is simply no excuse for employees to use combinations such as “password1” or
“0000” or “1234”.
Policies should outline guidance for what a password should look like (for example, a combination of
letters, numbers and special characters) and require staff to use different passwords for each
account.
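
A minimal sketch of automatically enforcing such a policy; the minimum length and the specific
character-class rules are illustrative assumptions, not a standard:

```python
import re

def meets_policy(password, min_length=12):
    # Reject short passwords, then require lowercase, uppercase, digit and symbol.
    if len(password) < min_length:
        return False
    required = [r"[a-z]", r"[A-Z]", r"\d", r"[^A-Za-z0-9]"]
    return all(re.search(pattern, password) for pattern in required)

print(meets_policy("password1"))        # False: short and predictable
print(meets_policy("T7!rainy-Meadow"))  # True: long with mixed character classes
```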


REFERENCES
 https://www.whichhomeautomation.com/blog/best-home-automation-ideas/
 https://www.rfwireless-world.com/Terminology/Advantages-and-Disadvantages-of-Big-Data.html#
 https://www.scnsoft.com/blog/iot-architecture-in-a-nutshell-and-how-it-works
 https://www.techtarget.com/searchdatamanagement/definition/Hadoop-Distributed-File-System-HDFS
 https://www.section.io/engineering-education/understanding-map-reduce-in-hadoop/
