
FEDERAL TVET INSTITUTE (FTVETI)

Department of ICT
Course title: Big Data Analysis (IT 562)

Big Data Analysis Individual Assignment

Name: Behailu Demssei    ID: MTR 078/12

Submitted to: Dr. Vasu


vasupinninti@gmail.com
Date: June 22, 2020
Table of Contents
Question No 1
A) List the main characteristics of big data architecture with a neat schematic diagram
B) How would you show your understanding of the tools, trends and technology in big data?
Question No 2
A) What are the best practices in Big Data analytics? Explain the techniques used in Big Data Analytics
B) How can you identify the companies that are using Big Data Analytics in Ethiopia?
Question No 3
A) What is the difference between Hadoop and Traditional RDBMS?
B) Highlight the features of Hadoop and explain the functionalities of a Hadoop cluster?
C) What is the best hardware configuration to run Hadoop? What platform (OS) and Java version is required to run Hadoop?
Question No 4
A) Explain the significance of the Hadoop distributed file system and its applications
B) Explain the difference between NameNode, Backup Node and Checkpoint NameNode
Question No 5
A) What is commodity hardware?
B) How does big data analysis help businesses increase their revenue? Give examples and name some companies that use Hadoop
Question No 6
Given the current situation of the ongoing COVID-19 crisis, how could Information Communication Technologies and big data analytics contribute to solving it?



Question No 1
A) List the main characteristics of big data architecture with a neat
schematic diagram.
Answer:
(The 5 Vs of Big Data)
Big Data is important because it enables organizations to gather, store, manage, and
manipulate vast amounts of data at the right speed, at the right time, to gain the right insights.
In addition, Big Data generators must create scalable data (Volume) of different
types (Variety) under controllable generation rates (Velocity), while maintaining the
important characteristics of the raw data (Veracity) and the added value (Value) that the
collected data can bring to the intended process, activity or predictive analysis/hypothesis.

1. Volume: refers to the quantity of data gathered by a company. This data must be
used further to gain important knowledge. Enterprises are overflowing with ever-
growing data of all types, easily building up terabytes or even petabytes of information
(e.g. turning 12 terabytes of Tweets per day into improved product sentiment
analysis; or converting 350 billion annual meter readings to better predict power
consumption).
2. Velocity: refers to the time in which Big Data can be processed. Some activities are
very important and need immediate responses, which is why fast processing
maximizes efficiency. For time-sensitive processes such as fraud detection, Big Data
flows must be analysed and used as they stream into the organizations in order to
maximize the value of the information (e.g. inspect 5 million trade events created
each day to identify potential fraud; analyse 500 million daily call detail records in
real-time to predict customer churn faster).
3. Variety: refers to the type of data that Big Data can comprise. This data may be
structured or unstructured. Big data consists of any type of data, including structured
and unstructured data such as text, sensor data, audio, video, click streams, log files
and so on. The analysis of combined data types brings new problems, situations, and
so on, such as monitoring hundreds of live video feeds from surveillance cameras to
target points of interest, or exploiting the 80% data growth in images, video and
documents to improve customer satisfaction.
4. Value: refers to the important feature of the data which is defined by the added
value that the collected data can bring to the intended process, activity or predictive
analysis/hypothesis. Data value depends on the events or processes the data
represent, such as stochastic, probabilistic, regular or random ones. Depending on this,
requirements may be imposed to collect all data, store it for a longer period (for some
possible event of interest), etc. In this respect, data value is closely related to data
volume and variety.
5. Veracity: refers to the degree to which a leader trusts information in order to make a
decision. Therefore, finding the right correlations in Big Data is very important for
the business future. However, as one in three business leaders do not trust the
information used to reach decisions, generating trust in Big Data presents a huge
challenge as the number and type of sources grows.

B) How would you show your understanding of the tools, trends and technology in big data?
Answer:
Tools
A. Data collection tools:
There is no doubt that today there are a number of Big Data tools present in the
market. Semantria, Opinion Crawl, OpenText and Trackur are some of the commonly
used ones.
B. Data Storage and frameworks tools: The captured data, which may be structured or
unstructured, need to be stored in databases. There is a need for databases that can
accommodate Big Data. A lot of frameworks have been developed by organizations like
Apache, Oracle etc. that are used as analytics tools to fetch and process data which is
stored on these repositories.
C. Data filtering and extraction tools: Some tools are used for data filtering and
extraction. These tools are very helpful to collect useful information from the Internet.
D. Data cleaning and validation tools: Data cleansing and validation is an important
stage. Various validation rules are used to confirm the necessity and relevance of the data
extracted for analysis. Sometimes it may be difficult to apply validation constraints due to the
complexity of the data. Data cleaning tools are very helpful because they help in minimizing
the processing time and reduce the computational load on data analytics tools.

Trends
Big data started with a small shift from traditional analytics to include batch-processing
computations.
It gradually moved from this stage, with the MapReduce paradigm, to a higher level where stream
processing is involved, with the Apache Spark platform. The trend continued to near real-time
processing and is currently progressing to real-time analytics.
I have already read about the following trends since 2019 from different sources:

 IoT.
 Augmented analytics.
 The harnessing of dark data.
 Cold storage and cloud cost optimization.
 Edge computing and analytics.
 Data storytelling and visualization.
 DataOps.

Technology
BIG DATA is a term used for a collection of large and complex data sets that are difficult to
process using traditional applications/tools. It is data exceeding terabytes in size. Because of
the variety of data that it encompasses, big data always brings a number of challenges relating to
its volume and complexity. Here are the technologies used to store and analyse Big Data. Some
books categorise them into two groups (storage and querying/analysis).

 Apache Hadoop
Apache Hadoop is a Java-based free software framework that can effectively store large
amounts of data in a cluster.
 Microsoft HDInsight
It is a Big Data solution from Microsoft powered by Apache Hadoop which is available
as a service in the cloud. HDInsight uses Windows Azure Blob storage as the default file
system. This also provides high availability with low cost.
 NoSQL
While traditional SQL can be effectively used to handle large amounts of structured
data, we need NoSQL (Not Only SQL) to handle unstructured data. NoSQL databases
store unstructured data with no particular schema.
 Hive
This is a distributed data management system for Hadoop. It supports an SQL-like query
option, HiveQL (HQL), to access big data. It is primarily used for data mining purposes
(a small query sketch over JDBC is shown after this list).
 Sqoop
This is a tool that connects Hadoop with various relational databases to transfer data. This
can be effectively used to transfer structured data to Hadoop or Hive.
 PolyBase
This works on top of SQL Server 2012 Parallel Data Warehouse (PDW) and is used to
access data stored in PDW.
 Big data in Excel
As many people are comfortable doing analysis in Excel, a popular tool from
Microsoft, you can also connect to data stored in Hadoop using Excel 2013.
Hortonworks, which primarily works on providing Enterprise Apache Hadoop,
provides an option to access big data stored in its Hadoop platform using Excel
2013. You can use the Power View feature of Excel 2013 to easily summarise the data.
Similarly, Microsoft’s HDInsight allows us to connect to Big data stored in the Azure cloud
using a Power Query option.
 Presto
Facebook has developed and recently open-sourced its query engine (SQL-on-Hadoop)
named Presto, which is built to handle petabytes of data. Unlike Hive, Presto does not
depend on the MapReduce technique and can quickly retrieve data.
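
To make the Hive entry above more concrete, here is a minimal sketch of running a HiveQL query from Java over JDBC. This is my own illustration: it assumes a HiveServer2 instance is reachable at the placeholder address below and that the Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) is on the classpath, and the 'tweets' table is purely hypothetical.

// Minimal sketch: querying Hive with HiveQL over JDBC (assumptions noted above).
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (not always required with JDBC 4, but harmless).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 usually listens on port 10000; 'default' is the default database.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection con = DriverManager.getConnection(url, "", "");
         Statement stmt = con.createStatement()) {
      // A HiveQL query: aggregate rows stored in HDFS as if they were a SQL table.
      ResultSet rs = stmt.executeQuery(
          "SELECT lang, COUNT(*) AS cnt FROM tweets GROUP BY lang ORDER BY cnt DESC");
      while (rs.next()) {
        System.out.println(rs.getString("lang") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}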



Question No 2
A) What are the best practices in Big Data analytics? Explain the
techniques used in Big Data Analytics.
Answer:
Big data good practices
 Creating dimensions of all the data being stored is a good practice for Big data analytics. It
needs to be divided into dimensions and facts.
 All the dimensions should have durable surrogate keys, meaning that these keys can’t be
changed by any business rule and are assigned in sequence or generated by some hashing
algorithm ensuring uniqueness (see the sketch after this list).
 Expect to integrate structured and unstructured data, as all kinds of data are part of Big
data and need to be analyzed together.
 Generality of the technology is needed to deal with different formats of data. Building
technology around key-value pairs works.
 When analyzing data sets that include identifying information about individuals or
organizations, privacy is an issue whose importance, particularly to consumers, is growing
as the value of Big data becomes more apparent.
 Data quality needs to be better. Different tasks like filtering, cleansing, pruning,
conforming, matching, joining, and diagnosing should be applied at the earliest touch
points possible.
 There should be certain limits on the scalability of the data stored.
 Business leaders and IT leaders should work together to yield more business value from
the data. Collecting, storing and analyzing data comes at a cost. Business leaders will go
for it but IT leaders have to look for many things like technological limitations, staff
restrictions etc. The decisions taken should be revised to ensure that the organization is
considering the right data to produce insights at any given point of time.
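
As a small illustration of the durable surrogate key point above, the following Java sketch derives a stable key by hashing a natural (business) key with SHA-256, so the key never changes when business rules change. The field names (customerId, sourceSystem) are hypothetical and only show the idea.

// Minimal sketch: a durable surrogate key generated by hashing a natural key.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SurrogateKey {

  // Hash the concatenated natural key and return it as a hex string.
  public static String durableKey(String customerId, String sourceSystem) throws Exception {
    MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
    byte[] digest = sha256.digest(
        (sourceSystem + "|" + customerId).getBytes(StandardCharsets.UTF_8));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws Exception {
    // The same natural key always yields the same surrogate key, ensuring stability
    // and uniqueness across reloads of the dimension.
    System.out.println(durableKey("CUST-00042", "CRM"));
  }
}
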
Big data analysis techniques
Big data techniques draw from various fields such as statistics, computer science, applied
mathematics, and economics. As these methods rely on diverse disciplines, the analytics tools
can be applied to both big data and other, smaller datasets:
 A/B testing
This data analysis technique involves comparing a control group with a variety of test
groups, in order to discern what treatments or changes will improve a given
objective variable (a small worked sketch appears at the end of this answer).
 Data fusion and data integration
By combining a set of techniques that analyse and integrate data from multiple sources
and solutions, the insights are more efficient and potentially more accurate than if
developed through a single source of data.
 Data mining
A common tool used within big data analytics, data mining extracts patterns from large
data sets by combining methods from statistics and machine learning, within database
management.
 Machine learning
Well known within the field of artificial intelligence, machine learning is also used for
data analysis. Emerging from computer science, it works with computer algorithms to
produce assumptions based on data. It provides predictions that would be impossible for
human analysts.
 Natural language processing (NLP).
Known as a subspecialty of computer science, artificial intelligence, and linguistics, this
data analysis tool uses algorithms to analyse human (natural) language.
 Statistics.
This technique works to collect, organise, and interpret data, within surveys
and experiments.

Other data analysis techniques include spatial analysis, predictive modelling, association rule
learning, network analysis, visual analysis and many, many more. The technologies that
process, manage, and analyse this data belong to an entirely different and expansive field that
similarly evolves and develops over time.
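
To make the A/B testing technique above concrete, here is a minimal Java sketch of the statistics behind it: a two-proportion z-test comparing the conversion rate of a control group with that of a test group. The visitor and conversion numbers are made-up illustrative values, not data from any source.

// Minimal sketch: two-proportion z-test used in A/B testing.
public class ABTest {

  // Two-proportion z statistic using the pooled conversion rate.
  static double zStatistic(int convA, int nA, int convB, int nB) {
    double pA = (double) convA / nA;
    double pB = (double) convB / nB;
    double pooled = (double) (convA + convB) / (nA + nB);
    double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / nA + 1.0 / nB));
    return (pB - pA) / se;
  }

  public static void main(String[] args) {
    // Control: 200 conversions out of 5000 visitors; variant: 260 out of 5000.
    double z = zStatistic(200, 5000, 260, 5000);
    System.out.printf("z = %.3f%n", z);
    // |z| > 1.96 corresponds to significance at roughly the 5% level (two-sided).
    System.out.println(Math.abs(z) > 1.96 ? "Difference is significant" : "Not significant");
  }
}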



B) How can you identify the companies that are using Big Data Analytics in
Ethiopia?
Answer:

Currently I can say that there is no well-known big data analytics company in Ethiopia, but
there are mega companies which have millions of customers in Ethiopia, like Ethiopian
Telecommunication, Ethiopian Airlines and the Commercial Bank of Ethiopia. I could not get
any information about whether these companies use Big data technology or not.
However, there are some international companies that are well-known and have a large
number of customers in Ethiopia, like the social media companies (Facebook, Twitter,
etc.) and Google.



Question No 3
A) What is the difference between Hadoop and Traditional RDBMS?
Answer:
Before going directly to the difference between Hadoop and a traditional RDBMS, I
should first explain the basic meaning of both terms.
Hadoop is not a database; it is basically a distributed file system which is used to process
and store large data sets across a computer cluster.
On the other hand, an RDBMS is a database which is used to store data in the form of tables
comprising several rows and columns. It uses SQL, the Structured Query Language, to
update and access the data present in these tables.
Difference Between Hadoop And Traditional RDBMS
Data Volume:
RDBMS works better when the volume of data is low (in gigabytes). But when the data
size is huge, i.e. in terabytes and petabytes, RDBMS fails to give the desired results. On
the other hand, Hadoop works better when the data size is big. It can easily process and
store large amounts of data quite effectively as compared to the traditional RDBMS.
Architecture:
Hadoop consists of HDFS (Hadoop Distributed File System), Hadoop MapReduce (a
programming model to process large data sets) and Hadoop YARN (used to manage
computing resources in computer clusters). A traditional RDBMS possesses the ACID
properties: Atomicity, Consistency, Isolation, and Durability.
Throughput:
Throughput means the total volume of data processed in a particular period of time so
that the output is maximized. RDBMS achieves a lower throughput as compared to
the Apache Hadoop framework.
Data Variety:
Hadoop has the ability to process and store all varieties of data, whether structured,
semi-structured or unstructured, although it is mostly used to process large amounts of
unstructured data.
A traditional RDBMS is used only to manage structured and semi-structured data. It cannot
be used to manage unstructured data. So, we can say Hadoop handles data variety far better
than the traditional Relational Database Management System.
Latency/ Response Time:
Hadoop has higher throughput; you can access batches of large data sets more quickly than
with a traditional RDBMS, but you cannot access a particular record from the data set very
quickly. Thus, Hadoop is said to have high latency.
The RDBMS is comparatively faster in retrieving the information from the data sets.
It takes very little time to perform the same function, provided that there is a small
amount of data.
Scalability:
RDBMS provides vertical scalability, which is also known as ‘scaling up’ a machine. It
means you can add more resources or hardware, such as memory or CPU, to a machine in
the computer cluster.
Whereas Hadoop provides horizontal scalability, which is also known as ‘scaling out’ a
machine. It means adding more machines to the existing computer cluster, as a result of
which Hadoop becomes fault tolerant: there is no single point of failure.
Data Processing:
Apache Hadoop supports OLAP (Online Analytical Processing), which is used in data
mining techniques.
On the other hand, RDBMS supports OLTP (Online Transaction Processing), which
involves comparatively fast query processing.
Cost:
Hadoop is a free and open source software framework; you don’t have to pay in order to
buy a license for the software.
Whereas an RDBMS is typically licensed software, so you have to pay in order to buy the
complete software license.

B) Highlight the features of Hadoop and explain the functionalities of a Hadoop cluster?
Answer:
Features of Hadoop
 Hadoop is Open Source:
Hadoop is an open-source project, which means its source code is available free of cost
for inspection, modification, and analysis, and this allows enterprises to modify the code as
per their requirements.
 Hadoop cluster is Highly Scalable:
Hadoop cluster is scalable means we can add any number of nodes (horizontal scalable)
or increase the hardware capacity of nodes (vertical scalable) to achieve high
computation power. This provides horizontal as well as vertical scalability to the Hadoop
framework.
 Hadoop provides Fault Tolerance:
Because data blocks are replicated across the nodes of the cluster, the failure of a single
node does not lead to data loss or job failure.
 Hadoop provides High Availability:
Data remains available and jobs keep running even when some machines in the cluster
fail, thanks to this replication.
 Hadoop is very Cost-Effective:
Since the Hadoop cluster consists of nodes of commodity hardware that are inexpensive,
it provides a cost-effective solution for storing and processing big data. Being an open-
source product, Hadoop doesn’t need any license.
 Hadoop is Faster in Data Processing:
Hadoop stores data in a distributed fashion, which allows data to be processed in a
distributed manner on a cluster of nodes. Thus, it provides lightning-fast processing
capability to the Hadoop framework.
 Hadoop is based on Data Locality concept:
Hadoop is popularly known for its data locality feature, which means moving the computation
logic to the data, rather than moving the data to the computation logic.
 Hadoop provides Feasibility:
Unlike traditional systems, Hadoop can process unstructured data. Thus, it provides
feasibility for users to analyse data of any format and size.
 Hadoop is Easy to use:
Hadoop is easy to use as clients don’t have to worry about distributed computing.
The processing is handled by the framework itself.
 Hadoop ensures Data Reliability:
In Hadoop due to the replication of data in the cluster, data is stored reliably on the
cluster machines despite machine failures.
Function of a Hadoop Cluster
 Hadoop clusters can boost the processing speed of many big data analytics jobs, given
their ability to break down large computational tasks into smaller tasks that can be run in
a parallel, distributed fashion.
 Hadoop clusters are easily scalable and can quickly add nodes to increase throughput, and
maintain processing speed, when faced with increasing data blocks.
 The use of low cost, high availability commodity hardware makes Hadoop clusters
relatively easy and inexpensive to set up and maintain.
 Hadoop clusters replicate a data set across the distributed file system, making them
resilient to data loss and cluster failure.
 Hadoop clusters make it possible to integrate and leverage data from multiple different
source systems and data formats.
 It is possible to deploy Hadoop using a single-node installation, for evaluation purposes.



C) What is the best hardware configuration to run Hadoop? What
platform (OS) and Java version is required to run Hadoop?

Answer:
Prerequisites
Server: To run Apache Hadoop jobs, it is recommended to use dual-core machines or
dual processors.
Memory: There should be 4 GB or 8 GB of RAM per processor, with error-correcting code
(ECC) memory. Without ECC memory, there is a high chance of getting checksum errors.
Storage: For storage, high-capacity SATA drives (around 7200 rpm) should be used in the
Hadoop cluster.
Bandwidth: 10 Gigabit Ethernet networks are good for Hadoop.

Hadoop is written in Java, so you will need to have Java installed on your machine, version 6 or
later. Sun's JDK is the one most widely used with Hadoop, although others have been reported to
work. Hadoop runs on Unix and on Windows. Linux is the only supported production platform,
but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.
Windows is only supported as a development platform, and additionally requires Cygwin to run.
During the Cygwin installation process, you should include the openssh package if you plan to
run Hadoop in pseudo-distributed mode.
Hadoop can be run in one of three modes:
 Standalone (or local) mode:
There are no daemons running and everything runs in a single JVM. Standalone mode is
suitable for running MapReduce programs during development, since it is easy to test and
debug them (a minimal example follows this list).
 Pseudo-distributed mode:
The Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.
 Fully distributed mode:
The Hadoop daemons run on a cluster of machines.
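
As a minimal illustration of a MapReduce program that can be developed and tested in standalone (local) mode, here is the classic word count job in Java. It assumes the Hadoop client libraries are on the classpath; the class name and the input/output paths passed as arguments are illustrative.

// Minimal sketch: word count MapReduce job, runnable locally in standalone mode.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // with no cluster config this runs locally
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}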



Question No 4
A) Explain the significance of the Hadoop distributed file system and its applications.

Answer:
Filesystems that manage the storage across a network of machines are called distributed
filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed
Filesystem.
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.
Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes
in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once,
read-many-times pattern. A dataset is typically generated or copied from source, then various
analyses are performed on that dataset over time. Each analysis will involve a large proportion, if
not all, of the dataset, so the time to read
the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware from multiple
vendors) for which the chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable interruption to the user in
the face of such failure.
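
To illustrate how an application reads a file stored in HDFS, here is a minimal Java sketch using the Hadoop FileSystem API. It assumes the Hadoop client libraries are on the classpath; the URI in the comment is only an illustrative placeholder.

// Minimal sketch: stream a file from HDFS to standard output.
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://namenode:9000/user/someuser/sample.txt (placeholder)
    Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml if present
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                    // streaming, write-once/read-many access
      IOUtils.copyBytes(in, System.out, 4096, false); // copy to stdout in 4 KB buffers
    } finally {
      IOUtils.closeStream(in);
    }
  }
}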

BIG DATA APPLICATIONS


Big data technologies have a wide and long list of applications. They are used for search engines,
log processing, recommender systems, data warehousing, video and image analysis, banking
and finance, telecom, retail, manufacturing, web and social media, medicine, healthcare,
science and research, and social life.
 Politics
Big Data analytics helped Mr. Barack Obama win the US presidential election in 2012.
His campaign built a 100-strong analytics staff to work through dozens of terabytes of data.
They used a combination of the HP Vertica massively parallel processing analytical
database and predictive models with R and Stata tools.
http://www.infoworld.com/article/2613587/big-data/thereal-story-of-how-big-data-analytics-helped-obama-win.html
 National Security
Authors relate Big Data technologies to national security and crime detection and
prevention. They present strategic approaches to deploy Big Data technologies for
preventing terrorism and reducing crime.
 Health Care and Medicine
Big Data technologies can be used for storing and processing medical records. Streaming
data can be captured from sensors or machines attached to patients, stored in HDFS and
analysed quickly. https://github.com/hadoop-illuminated/hadoop-book

 Science and Research
Science and research are now driven by technologies. Big Data adds new possibilities to
them.
 Social Media Analysis
IBM provides social media analytics, a powerful SaaS solution to discover hidden
insights from millions of web sources. It is used by businesses to gain a better
understanding of their customers, market and competition.

B) Explain the difference between NameNode, Backup Node and Checkpoint NameNode.

Answer:
NameNode: The NameNode is the core of HDFS that manages the metadata: the information of
which file maps to which block locations and which blocks are stored on which DataNode. In simple
terms, it’s the data about the data being stored. The NameNode supports a directory tree-like structure
consisting of all the files present in HDFS on a Hadoop cluster. It uses the following files for the
namespace:

fsimage file - It keeps track of the latest checkpoint of the namespace.

edits file - It is a log of changes that have been made to the namespace since the last checkpoint.

Checkpoint NameNode: The Checkpoint NameNode has the same directory structure as the NameNode,
and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits
files and merging them within the local directory. The new image after merging is then
uploaded to the NameNode.

There is a similar node like Checkpoint, commonly known as Secondary Node, but it does not
support the ‘upload to NameNode’ functionality.

BackupNode: The Backup Node provides similar functionality to the Checkpoint node, enforcing
synchronization with the NameNode. It maintains an up-to-date, in-memory copy of the file system
namespace and doesn’t require getting hold of changes after regular intervals. The Backup Node
only needs to save the current in-memory state to an image file to create a new checkpoint.



Question No 5

A) What is commodity hardware?

Answer:
Let me start with a simple definition:
Commodity hardware means cheap servers.

This doesn’t mean scrap servers. Instead, it indicates servers that are affordable and easy to
obtain, even in large quantities. The modern company needs computing power to process and
store information. Of course, we can get that from a battery of servers. The thing is, any company
needs that. Alessandro Maggio tries to elaborate this concept with a scenario on
www.ictshore.com/data-center:

Imagine a company in the ’60s, when all the information was on paper. At that time,
companies heavily relied on paper to store information. Yet, no one thought that paper
was a critical part of the business model. Yet, everyone used it. Paper was just there; its
usage was taken for granted. In other words, paper was a commodity.

Fast forward 60 years, things have changed, yet principles stay the same. At the end of
the day, business is hungry for information, just like they have been for the last century.
However, they now rely on modern technologies instead of paper. Now like then, what’s
important is the information, not the way you process it. Thus, now the hardware is the
commodity.

Big Data is another buzzword of modern times. With all this digital information, any company
has a lot of data to process. Big data means big computers to process them, right? Well, not
quite. Thanks to new development paradigms, the industry is moving away from super
computers. Rather than having one big system processing everything, we now want to have
many servers, each processing a small chunk of that. That’s where commodity hardware comes
into the picture.

Now, modern applications prefer parallelism. This means we can have lots of not-so-powerful
servers instead of a single big one. If one server fails, we lose only one tiny part of our
processing power. In the end, we will have better efficiency and more availability. In case
we need more power, we can simply add a server: the application is already parallel, and we don’t
need to rethink our entire solution.

Commodity Hardware in Hadoop

Hadoop is an open source solution by the Apache Foundation that helps us achieve parallelism. It
is one of the leading projects in the Big Data world, and has driven the industry in its early
stages.

The concept behind Hadoop is simple: we have several servers (the commodity hardware) and
we distribute the load among them. This is possible thanks to Hadoop MapReduce, a special
feature of this solution. Hadoop is installed on all the servers, and it then distributes the data
among them. In the end, each server will have a piece of data, but no server will have everything.
However, the same piece of data will be duplicated on two servers to protect against faults. Now
that each server has its piece of data, we can process the data. When we do that, Hadoop tells each
server to process its own data and give back only the results. That’s parallelism: many servers
running together at the same time toward the same goal.

To summarize this question:

Commodity hardware is server hardware you can get at affordable prices, fast and in
large quantities. That’s what modern companies use, even tech giants like Google.

B) How does big data analysis help businesses increase their revenue? Give
examples and name some companies that use Hadoop.

Answer:
Big data analytics is done using advanced software systems. This allows businesses to
reduce the analytics time for speedy decision making. Basically, the modern big data
analytics systems allow for speedy and efficient analytical procedures. This ability to
work faster and achieve agility offers a competitive advantage to businesses.

Here are some of them…

 Using Big Data Analytics to Boost Customer Acquisition and Retention

The use of big data allows businesses to observe various customer related patterns and
trends. Observing customer behavior is important to trigger loyalty. Theoretically, the
more data a business collects, the more patterns and trends the business can
identify. In the modern business world and the current technology age, a business can
easily collect all the customer data it needs. This means that it is very easy to understand
the modern-day client. Basically, all that is necessary is having a big data analytics
strategy to maximize the data at your disposal. With a proper customer data analytics
mechanism in place, a business will have the capability to derive critical behavioral
insights that it needs to act on so as to retain the customer base.

Understanding the customer insights will allow your business to be able to deliver what
the customers want from you. This is the most basic step to attain high customer
retention. Example of a Company that uses Big Data for Customer Acquisition and
Retention
o Coca-Cola. In the year 2015, Coca-Cola managed to strengthen its data strategy
by building a digital-led loyalty program.

 Big Data Analytics to Solve Advertisers Problem and Offer Marketing Insights

A more targeted and personalized campaign means that businesses can save money and
ensure efficiency. This is because they target high potential clients with the right
products. Big data analytics is good for advertisers since the companies can use this data
to understand customers purchasing behavior. We can’t ignore the huge ad fraud
problem. Through predictive analytics, it is possible for the organizations to define their
target clients. Therefore, businesses can have an appropriate and effective reach avoiding
the huge losses incurred as a result of Ad fraud. Example of a Brand that uses Big Data
for Targeted Adverts

o Netflix is a good example of a big brand that uses big data analytics for targeted
advertising. If you are a subscriber, you are familiar with how they send you
suggestions for the next movie you should watch. Basically, this is done using
your past search and watch data. This data is used to give them insights on what
interests the subscriber most.
 Big Data Analytics for Risk Management

So far, big data analytics has contributed greatly to the development of risk management
solutions. The tools available allow the businesses to quantify and model risks that they
face every day. Considering the increasing availability and diversity of statistics, big data
analytics has a huge potential for enhancing the quality of risk management models.
Therefore, a business can be able to achieve smarter risk mitigation strategies and make
strategic decisions. Example of Brand that uses Big Data Analytics for Risk Management

o UOB bank from Singapore is an example of a brand that uses big data to drive
risk management. Being a financial institution, there is huge potential for
incurring losses if risk management is not well thought of. UOB bank recently
tested a risk management system that is based on big data.
 Big Data Analytics as a Driver of Innovations and Product Development

Every design process has to begin from establishing what exactly fits the customers.
There are various channels through which an organization can study customer needs.
Then the business can identify the best approach to capitalize on that need based on the
big data analytics. Example of use of Big Data to Drive Innovations
o Amazon Fresh and Whole Foods. This is a perfect example of how big data can
help improve innovation and product development. Amazon leverages big data
analytics to move into a large market.
 Use of Big Data in Supply Chain Management

Modern supply chain systems based on big data enable more complex supplier networks.
These are built on knowledge sharing and high-level collaboration to achieve contextual
intelligence. It is also essential to note that supply chain executives consider the big data
analytics as a disruptive technology. This is based on the thinking that it will set a
foundation for change management in the organizations. Example of a Brand that uses
Big Data for Supply Chain Efficiency

o PepsiCo is a consumer-packaged goods company that relies on huge volumes of
data for efficient supply chain management.

Question No 6
Given the current situation of the ongoing COVID-19 crisis, how could Information
Communication Technologies and big data analytics contribute to solving it?
solve it?

Answer:

Big Data Fight against COVID-19

It is a known fact that big data is acting as an asset helping to forecast and understand the impact of
the coronavirus. It is being used by healthcare workers, scientists, epidemiologists, and
policymakers to aggregate and synthesize data on a regular basis.

Here are 10 ways artificial intelligence, data science, and technology are being used to manage
and fight COVID-19.

1. AI to identify, track and forecast outbreaks

The better we can track the virus, the better we can fight it. By analyzing news reports, social
media platforms, and government documents, AI can learn to detect an outbreak. Tracking
infectious disease risks by using AI is exactly the service Canadian startup BlueDot provides. In
fact, BlueDot’s AI warned of the threat several days before the Centers for Disease Control
and Prevention or the World Health Organization issued their public warnings.

2. AI to help diagnose the virus

Artificial Intelligence Company Infervision launched a corona virus AI solution that helps front-
line healthcare workers detect and monitor the disease efficiently. Imaging departments in
healthcare facilities are being taxed with the increased workload created by the virus. This
solution improves CT diagnosis speed. Chinese e-commerce giant Alibaba also built an AI-
powered diagnosis system they claim is 96% accurate at diagnosing the virus in seconds.

3. Process healthcare claims

It’s not only the clinical operations of healthcare systems that are being taxed but also the
business and administrative divisions as they deal with the surge of patients. A blockchain
platform offered by Ant Financial helps speed up claims processing and reduces the amount of
face-to-face interaction between patients and hospital staff.

4. Drones deliver medical supplies

One of the safest and fastest ways to get medical supplies where they need to go during a disease
outbreak is with drone delivery. Terra Drone is using its unmanned aerial vehicles to transport
medical samples and quarantine material with minimal risk between Xinchang County’s disease
control centre and the People’s Hospital. Drones also are used to patrol public spaces, track non-
compliance to quarantine mandates, and for thermal imaging.

5. Robots sterilize, deliver food and supplies and perform other tasks

Robots aren’t susceptible to the virus, so they are being deployed to complete many tasks such as
cleaning and sterilizing and delivering food and medicine to reduce the amount of human-to-
human contact. UVD robots from Blue Ocean Robotics use ultraviolet light to autonomously kill
bacteria and viruses. In China, Pudu Technology deployed its robots that are typically used in the
catering industry to more than 40 hospitals around the country.

6. Develop drugs

Google’s DeepMind division used its latest AI algorithms and its computing power to
understand the proteins that might make up the virus, and published the findings to help others
develop treatments.  Benevolent AI uses AI systems to build drugs that can fight the world’s
toughest diseases and is now helping support the efforts to treat corona virus, the first time the
company focused its product on infectious diseases. Within weeks of the outbreak, it used its
predictive capabilities to propose existing drugs that might be useful.  

7. Advanced fabrics offer protection

Companies such as Israeli startup Sonovia hope to arm healthcare systems and others with face
masks made from their anti-pathogen, anti-bacterial fabric that relies on metal-oxide
nanoparticles.

8. AI to identify non-compliance or infected individuals

While certainly a controversial use of technology and AI, China’s sophisticated surveillance
system used facial recognition technology and temperature detection software from SenseTime
to identify people who might have a fever and be more likely to have the virus. Similar
technology powers "smart helmets" used by officials in Sichuan province to identify people with
fevers. The Chinese government has also developed a monitoring system called Health Code that
uses big data to identify and assess the risk of each individual based on their travel history,
how much time they have spent in virus hotspots, and potential exposure to people carrying the
virus. Citizens are assigned a color code (red, yellow, or green), which they can access via the
popular apps WeChat or Alipay to indicate if they should be quarantined or allowed in public.

9. Chatbots to share information

Tencent operates WeChat, and people can access free online health consultation services through
it. Chatbots have also been essential communication tools for service providers in the travel and
tourism industry to keep travellers updated on the latest travel procedures and disruptions.

10. Supercomputers working on a corona virus vaccine

The cloud computing resources and supercomputers of several major tech companies such as
Tencent, DiDi, and Huawei are being used by researchers to fast-track the development of a cure
or vaccine for the virus. The speed at which these systems can run calculations and model solutions is
much faster than standard computer processing.
