
BIG DATA AND ANALYTICS
CHAPTER 1: DATA AND THE INTERNET OF THINGS

The Internet of Things (IoT) not only attaches sensors to existing things, it creates a market for new
connected things. All of these connected things are generating data. This adds up to an almost
unimaginable quantity of data, called Big Data. This chapter discusses how all of this data can be analyzed
and put to use to improve our lives. It explains the different types of data, where it comes from, and how it
can all be managed.

Our connected world is complex. This complexity generates an ever-increasing amount of data, which is
available at our fingertips. The volume of data that needs to be stored and analyzed continues to expand.
The velocity of data generation shows no signs of slowing. The variety of data will continue to reach into new
areas that have never before been available for analysis. Interactions between people using media
platforms, the automation of processes, and the aggregation of data coming from different sources create the Internet of Things (IoT). This digital transformation will reveal new insights that promise to change the
way we live, work, and play.

The digital transformation has a profound impact on three main elements of our lives: business, social, and
environmental, as shown in the figure. Interactions in these areas will create more data to fuel new ideas,
products, and solutions. This will produce even more new data, resulting in a repeating cycle of exponential
innovation that helps us make better decisions and have better ideas.

This course focuses on the pervasive and unique role of data. In this first chapter, you will learn fundamental terminology related to data and the technologies used to process it, and you will explore the concept of Big Data.

WHAT IS DATA?

Data can be the words in a book, article, or blog. Data can be the contents of a spreadsheet or a database.
Data can be pictures or video. Data can be a constant stream of measurements sent from a monitoring
device. By itself, data can be rather meaningless. As we interpret data, by correlating or comparing, it
becomes more useful. This useful data is now information. As this information is applied or understood, it
becomes knowledge.

When collecting data, determine the amount of data you will need. It is not always necessary, or possible, to
collect all available data within a project or solution. The amount of data that can be collected is
determined by the ability of the sensors, network, computers, and other hardware involved. It is also determined by necessity. For example, in a high-speed bottling line, each bottle must be checked for the proper alignment of the label and kicked out of the line if there is a problem. In this case, the data from every bottle is important. With a different sensor, such as a humidity sensor in a corn field, it is not necessary to report the humidity every tenth of a second. Every five or ten minutes may be sufficient. The rate at which readings are reported is known as the data sampling rate.
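
As a simple illustration of a sampling rate, the Python sketch below keeps one reading every five minutes from a stream that reports once per second. The readings and the interval are made-up values, not data from the course.

# Minimal sketch: downsample a stream of sensor readings to a lower sampling rate.
readings = [(t, 40 + (t % 7) * 0.1) for t in range(3600)]  # (seconds, % humidity), one per second

SAMPLE_INTERVAL = 300  # keep one reading every 300 seconds (5 minutes)
sampled = [(t, h) for (t, h) in readings if t % SAMPLE_INTERVAL == 0]

print(f"Collected {len(readings)} readings, kept {len(sampled)} after downsampling")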

Not all data collected can be used as is. Extraneous data might have been collected. Incorrect or false data
might also have been collected. In order to make this data usable, it must be cleaned. Cleaning data
consists of removing unwanted data, changing incorrect data, and filling in missing data. It is common to
use code to clean data. This is accomplished by searching for criteria, or lack thereof, and operating on the
data until there are no more anomalies. After the data has been cleaned, it can more easily be searched,
analyzed, and visualized.
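
The sketch below shows, in plain Python, what a minimal cleaning pass might look like. The records, error code, and default fill-in value are all hypothetical.

# Minimal sketch: remove unwanted records, change incorrect data, and fill in missing data.
raw = [
    {"sensor": "A", "temp_c": 21.5},
    {"sensor": "B", "temp_c": None},     # missing value
    {"sensor": "C", "temp_c": -999.0},   # sensor error code (incorrect data)
    {"sensor": "TEST", "temp_c": 22.0},  # test record we do not want
]

DEFAULT_TEMP = 21.0                      # assumed fill-in value for missing readings

cleaned = []
for record in raw:
    if record["sensor"] == "TEST":
        continue                         # remove unwanted data
    temp = record["temp_c"]
    if temp is None or temp < -100:      # fill in missing data, change incorrect data
        temp = DEFAULT_TEMP
    cleaned.append({"sensor": record["sensor"], "temp_c": temp})

print(cleaned)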

Through data analysis, interesting insights can be learned and trends can be uncovered. This often leads to
new queries that had not yet been realized. When you discover that you might be able to discern additional
value from some data set, you can begin to experiment with how the data is organized and presented. For
example, a security camera monitoring a parking lot for crimes could also be used to report the number
and location of free spaces to drivers.

ESTIMATING EXPONENTIAL GROWTH

There are two types of growth: linear growth and exponential growth. It is not difficult to understand linear
growth. For example, if a person gains ¼ kilogram each month, then in one year he would have gained three
kilograms. In two years, he would have gained six kilograms.

Exponential growth is much more dramatic than linear growth. For example, if a person saves $1 one month,
$2 in the next month, $4 the next month, $8 the next month, continuing to double the amount saved every
month, how long would it take for this person to become a millionaire? If you try this on your calculator, you
will notice that in the 12th month the person saves $2,048, and $4,096 the following month. Eight months later, in the 20th month, the single monthly deposit is a little over half a million dollars and the total savings pass one million dollars (Figure 1).
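
You can verify this arithmetic with a short Python loop; it is only a check of the example above, not course code.

# Verify the doubling-savings example: in which month do total savings pass $1,000,000?
saved = 0
deposit = 1
month = 0
while saved < 1_000_000:
    month += 1
    saved += deposit     # add this month's deposit
    deposit *= 2         # double the deposit for next month
print(month, saved)      # 20 1048575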

An old legend also demonstrates the concept of exponential growth. The inventor of chess showed the
game to his king, who was so pleased that he told the inventor to name his prize for the invention. The
inventor asked that for the first square of the chess board, he would receive one grain of wheat, two grains
for the second one, four grains for the third one, doubling the amount each time. The king quickly granted
this modest request. However, the treasurer explained that it would take more than all of the assets of the kingdom to give the inventor the reward. The story ends with the inventor becoming the new king.

Imagine that each grain of wheat is equivalent to one byte of data. If so, then the number of bytes would
reach over nine exabytes in the last square of the chess board, as shown in Figure 2. One exabyte is roughly
1.07 billion gigabytes. Nine exabytes is roughly equivalent to the amount of Internet traffic for the year 2014. It
is also equivalent to the amount of global IP traffic that occurred in three days in 2015. In 2016, 88.7 exabytes
crossed the global Internet every month.
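
As a quick check of the figure for the last square, the calculation below treats each grain as one byte and assumes one exabyte equals 10^18 bytes.

# Grains on the last (64th) square of the chess board, treated as one byte each
last_square_bytes = 2 ** 63              # 9,223,372,036,854,775,808
exabytes = last_square_bytes / 1e18      # assuming 1 exabyte = 10^18 bytes
print(f"{exabytes:.2f} exabytes")        # about 9.22 exabytes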

GROWTH OF DATA

Why do we care about exponential growth in a course about data? Because today, the growth of data is
exponential! The following are just a few statistics that are part of Cisco’s Visual Networking Index (VNI)
forecast for data growth between 2015 and 2020:

● Consumer mobile data traffic will reach 26.1 exabytes per month in 2020, the equivalent of 7 billion
DVDs per month, or 9 million DVDs per hour.
● Globally, IP traffic will reach 194.4 exabytes per month in 2020, up from 72.5 exabytes per month in
2015, as shown in the figure.
● Globally, 64% of all Internet traffic will cross content delivery networks in 2020, up from 45% in 2015.
● Global mobile data traffic will grow 3 times faster than global fixed IP traffic from 2015 to 2020.
● Global Internet traffic in 2020 will be equivalent to 95 times the volume of the entire global Internet in
2005.
● Globally, the average fixed broadband speed will grow 1.9-fold from 2015 to 2020, from 24.7 Mbps to
47.7 Mbps.
● In 2020, the gigabyte equivalent of all movies ever made will cross global IP networks every 2
minutes.
● Globally, consumer IP VOD traffic will reach 28.8 exabytes per month in 2020, the equivalent of 7
billion DVDs per month, or 10 million DVDs per hour.

DATA GROWTH IMPACT

The proliferation of devices in the IoT is one of the primary reasons for the exponential growth in data
generation. While the number of sensors and other end devices grows exponentially, mobile routers are
increasingly used to better manage Internet traffic for systems that are literally on the move. Mobile routers
are deployed in airplanes, commercial vehicles, and even in personal automobiles. Not only is the IoT
growing, but its boundaries are actually moving! Just as the advent of wireless roaming improved Internet
access, the implementation of mobile networks is changing the psychology and behavior of consumers and
businesses by expanding anytime, anywhere, on-demand access.

Here are three examples of how data growth is affecting society in healthcare, retail, and education:

● Robotics, mobile devices, integrated software systems, and collaboration are changing the way
healthcare is delivered. Many of these technologies enable or expand upon efficiencies in healthcare
delivery. These technologies use data, and in turn, create more data. Click Play in Figure 1 to see how
healthcare is being implemented at the Palomar Medical Center.

● Retailers increasingly depend on the data generated by digital technologies to improve their bottom
line. Cisco’s Connected Mobile Experiences (CMX) allow retailers to provide consumers with highly
personalized content while simultaneously gaining visibility into their behavior in the store. Click Play in Figure 2 to see how shopping centers in the UK are engaging with their customers to create a new retail experience with Cisco CMX.

● Education is changing with digital technologies. It is now standard practice to incorporate tablets in
elementary school education in many parts of the world. Virtual schools give students access to
textbooks, content, and assistance using learning management systems. Students and teachers
want to be able to bring their own devices and connect to learning resources. Click Play in Figure 3 to
see how the McAllen Independent School District uses Cisco Identity Service Engine (ISE) and Secure
Access to implement a bring your own device (BYOD) initiative.

BUSINESS EXAMPLE: KAGGLE

To stay competitive in the business world, every organization must become more efficient. Innovation allows
an organization to stay relevant. More organizations are putting sensors in their operations and products.
Their goal is to collect and analyze the data to gain valuable insights. To take advantage of the power of IoT,
organizations require skilled and creative people. Online platforms, such as Kaggle, allow companies to
connect with talented people from different parts of the world.

Kaggle is a platform that connects businesses and other organizations that have questions about their data
to the people who know how to find the answers. They run online competitions to create the world’s best
predictive data models. Players in the competitions generate many models using a variety of techniques.
Players come from all around the world with different educational backgrounds and specializations. They
can connect to form teams or simply help each other. The winner, or winning team, of each competition
wins a prize. Usually this prize is money, but occasionally it will be employment, or something equally
desirable.

In each competition, there are continuous improvements, as each winner beats the previous score.
Eventually the scores plateau. This means that the players have found the threshold of what is possible to
predict with the data provided. These new predictive data models consistently outperform existing
best-of-breed models.

The Mayo Clinic, NASA, GE, and Deloitte are just a few of the businesses and organizations that have hosted
competitions on Kaggle.

ENVIRONMENTAL EXAMPLE: PLANETARY SKIN INSTITUTE

NASA and Cisco partnered to develop an online collaborative global monitoring platform called the
Planetary Skin. This platform captures, collects, analyzes, and reports data on environmental conditions
around the world.

Planetary Skin Institute (PSI) is a global, nonprofit organization. It collaborates with research and
development partners to incubate scalable Big Data innovations that are designed to increase food, water,
and energy security, and to protect critical ecosystems and biodiversity.

“Mitigating the impacts of climate change is critical to the world's economic and social stability. This
unique partnership taps the power and innovation of the market and harnesses it for the public good.
Cisco is proud to work with NASA on this initiative and hopes others from the public and private sectors will
join us in this exciting endeavor.”

John Chambers, Former Cisco CEO

DEFINING BIG DATA

The exponential growth of data has created a new area of interest in technology and business called “Big
Data". In general, a data set or a business problem belongs to the Big Data classification when its data is so
vast, fast or complex that it becomes impossible to store, process, and analyze using traditional data
storage and analytics applications.

How much data does it take to become Big Data? Are 100 terabytes enough? Are 1000 petabytes enough?
The baby in the figure already has gigabytes of online data associated with her name. Is that Big Data?
Volume is only one of the criteria. The need for real-time processing of the data (also called data in motion), or the need to integrate structured and unstructured data, may also qualify a problem as a Big Data problem.

For example, International Data Corporation (IDC) uses 100 terabytes as the size of a data set that qualifies
as Big Data. If the data is streaming, the size of the data set can be smaller than 100 terabytes but still be
considered Big Data as long as the data being generated is increasing at a rate of more than 60%
a year.

For IBM’s Big Data perspective, click here to view a video in which the presenter, Paul Zikopoulos, says that
200 to 600 terabytes are a minimum qualification for data to be called Big Data.

Many of the awe-inspiring quantifications of data sizes in our near future are documented in the Cisco
white paper The Zettabyte Era: Trends and Analysis.

In response to this need, a completely new class of software platform has emerged called Big Data
Platforms. It is discussed in Chapter 6 of this course.

According to NIST’s Big Data Interoperability Framework: "The Big Data paradigm consists of the distribution
of data systems across horizontally coupled, independent resources to achieve the scalability needed for
the efficient processing of extensive data sets."

BIG DATA CHARACTERISTICS

To help distinguish data from Big Data, consider the Four Vs of Big Data:

● Volume - This describes the amount of data being transported and stored. The current challenge is
to discover ways to most efficiently process the increasing amounts of data, which is predicted to
grow 50 times by 2020, to 35 zettabytes.

● Velocity - This describes the rate at which this data is generated. For example, the data generated
by a billion shares sold on the New York Stock Exchange cannot just be stored for later analysis. The
data infrastructure must be able to immediately respond to the demands of applications accessing
and streaming the data.

● Variety - This describes the type of data, which is rarely in a state that is perfectly ready for
processing and analysis. A large contributor to Big Data is unstructured data, which is estimated to
represent anywhere from 70 to 90% of the world’s data.

● Veracity - This is the process of preventing inaccurate data from spoiling your data sets. For
example, when people sign up for an online account, they often use false contact information.
Increased veracity in the collection of data reduces the amount of data cleaning that is required.

While there are four Vs listed here, most discussions, tools, and documents will address only the top three Vs
(volume, velocity, variety).

SOURCES OF BIG DATA

To businesses, data is the new oil. Like crude oil, it is valuable, but if it is unrefined it cannot be easily used.
Crude oil has to be changed to gasoline, plastic, chemicals, and other substances to create a valuable
product. It is the same with data. Data must be broken down and analyzed for it to have value.

Having the right data that can be turned into information and then into business intelligence is critical to
success. The data sources available to businesses are growing exponentially. The proliferation of sensors
guarantees that they will continue to be a primary source of Big Data. Sensors are found in a variety of
applications:

● Telemetry for vehicle monitoring
● Smart metering
● Inventory management and asset tracking
● Fleet management and logistics

Businesses need information and information is everywhere in an organization. The two primary types are
transactional information and analytical information. Transactional information is captured and stored as
events happen. Transactional information is used to analyze daily sales reports and production schedules
to determine how much inventory to carry. Analytical information supports managerial analysis tasks like
determining whether the organization should build a new manufacturing plant or hire additional sales
personnel.

Click Play in the figure to see how the use of intelligent devices in cars can be used to facilitate day-to-day
activities. These types of applications are important generators of data. Each sensor in the car contributes
to the overall amount of generated data.

REAL-WORLD EXAMPLES OF BIG DATA SOURCES

To gain some perspective on Big Data, here are some specific real-world examples of Big Data generators,
as shown in the figure. An Airbus A380 Engine generates 1 petabyte of data on a flight from London to
Singapore. The Large Hadron Collider (LHC) generates 1 gigabyte of data every second. The Square
Kilometer Array (SKA), when it becomes operational in 2020, will be the largest radio telescope in the world.
It will generate 20 exabytes of data per day. That is equivalent to 20 billion gigabytes per day. Click here to learn more about the LHC.

The Human Genome Project was an effort to sequence and map all the human genes. It began in 1990 and
was completed in 2003 for approximately $3 billion. Now an individual can order a complete sequencing of
his or her genes for about $1,000 to $2,000.

WHAT IS OPEN DATA?

With the rise in importance of data to businesses and people, many questions arise regarding the privacy
and the availability of large public and private data repositories. For a data professional, it is fundamentally
important to understand the continuum between open data and private data. Making decisions about what
and how various types of data will be used in an organization is as important as knowing how to implement
a distributed storage and processing solution for Big Data.

The Open Knowledge Foundation defines open knowledge as “any content, information or data that people
are free to use, reuse, and redistribute without any legal, technological, or social restriction.” They then go on
to explain that open data comprise the building blocks of open knowledge. Open knowledge is what open
data becomes when it is useful, usable, and used.

The value of open data can immediately be seen by viewing sites like New York City’s Open Data Portal, NYC
Open Data, where a resident or visitor can quickly find ratings for restaurants based on annual inspections
by the Department of Health and Mental Hygiene. A visualization of the portal is shown in the figure and can
be accessed at the NYC Open Data portal. The portal is a clearinghouse of over 1300 data sets from city
agencies to facilitate government transparency and civic engagement. A data set is a collection of related
and discrete records that may be accessed for management individually or as a whole entity.

Gapminder is a non-profit venture promoting sustainable global development. The site presents engaging
analyses of open data sets with clarifying statistics on such topics as:

● Health and wealth of nations
● CO2 emissions since 1820
● Child mortality
● HIV infection

Statistician Hans Rosling gave a must-see TED talk, The Best Stats You’ve Ever Seen, bringing to life the facts
and myths about the developing world in a quick tour through human history.

WHAT IS PRIVATE DATA?

The expectation of privacy and what an individual or society considers private data continues to evolve. As
new apps are developed, more and more data is requested from the end user to give companies and
advertisers more information to make business decisions.

The current state of data protection regulations across the globe is shown in the figure. Click here to access the World Economic Forum report.

What is the right approach to maximize the benefit of new sources of data while, at the same time,
empowering individuals with the ability to control access to their personal data? Some intriguing work is
being done in this area including efforts by openPDS and Privacy by Design.

Instead of looking to strip personal data of all its identifying characteristics, called data anonymization,
openPDS takes a slightly different approach. Using what they call the SafeAnswers framework, openPDS
provides only answers to specific queries. No raw data is sent. The calculation for the answer is done within
the user’s personal data store (PDS):

“Only the answers, summarized data, necessary to the app leaves the boundaries of the user’s PDS. Rather
than exporting raw accelerometer or GPS data, it could be sufficient for an app to know if you’re active or
which general geographic zone you are currently in....computation can be done inside the user’s PDS by the
corresponding Q&A module.”

Click here to learn more about openPDS.

Privacy by Design began in the 1990s to address the growing concern of large-scale, networked data
systems. With its “7 Foundational Principles”, Privacy by Design “advances the view that the future of privacy
cannot be assured solely by compliance with legislation and regulatory frameworks; rather, privacy
assurance must ideally become an organization’s default mode of operation.”

In Europe for example, the official texts of a new Regulation have been published in the EU Official Journal.
The regulation will apply starting in May of 2018.

STRUCTURED DATA

Previously, we have classified data in terms of its accessibility; data is either open or private. Data can also
be classified by the way it is arranged, either structured or unstructured.

Structured data refers to data that is entered and maintained in fixed fields within a file or record. Structured
data is easily entered, classified, queried, and analyzed by a computer. This includes data found in
relational databases and spreadsheets. For example, when you submit your name, address, and billing
information to a website, you are creating structured data. The structure will force a certain format for
entering the data to minimize errors and make it easier for a computer to interpret it.

If the data set is small enough, structured data is often managed with Structured Query Language (SQL), a
programming language created for querying data in relational databases. SQL only works on structured
data sets. However, with Big Data, structured data may be part of a data set, but Big Data tools do not
depend on that structure. It is not uncommon for Big Data to have data sets that consist of unstructured
data.

UNSTRUCTURED DATA

Unstructured data lacks the organization found in structured data. Unstructured data is raw data.
Unstructured data is not organized in a predefined way. It does not possess a fixed schema that identifies
the type of the data. Unstructured data lacks a set way of entering or grouping the data, and then analyzing
the data. Examples of unstructured data include the content of photos, audio, video, web pages, blogs,
books, journals, white papers, PowerPoint presentations, articles, email, wikis, word processing documents,
and text in general. Figure 1 shows Dostoevsky’s notes for Chapter 5 of The Brothers Karamazov. The
contents of the notes are not searchable because they have no structure. Figure 2 shows Chapter 5 of the
same novel after publication. Even a PDF version of this chapter is unstructured. The text is searchable, but it
is not organized in a predefined form, for example, using fields and records.

Both structured and unstructured data are valuable to individuals, organizations, industries, and
governments. It is important for organizations to take all forms of data and determine ways to format that
data so it can be managed and analyzed.

NATURE OF DATA

In the past, data sets were mostly static, residing on a single server or a collection of servers within the
organization, and processed using a database programming language like SQL. Although this model still
exists, storage of large data sets has migrated to data centers. Today, with the rise of cloud computing, Big
Data, and the need to analyze data in real time, data continues to be stored in data centers. Data must also
be available for analysis closer to where it is created, where the knowledge gained from that data can have the greatest impact. This is called fog computing.

As shown in the figure, fog is a cloud close to the ground, close to the source of data generation. Fog
computing is not a replacement for cloud computing, rather, fog computing enables the development of
new tools. In the fog computing model, there is interplay between the cloud and the fog, particularly when it
comes to data management and analytics. Fog computing provides compute, storage, and networking
services between end devices and traditional data centers.

Fog computing produces an enormous amount of data from the various sensors and controllers. When
dealing with data in the IoT, three very important factors must be taken into consideration:

● Energy or Battery – The amount of energy used by an IoT sensor, for example, depends on the
sample rate of the sensor. Range between devices can also affect how much energy must be used
to report sensor data to controllers. The farther away the sensor, the more energy that must be used
by the transmitting radio.

● Bandwidth – When many sensors are transmitting data, there may be delay in communications if
there is not enough bandwidth to support all of the devices. Additional analysis in the fog can help to
alleviate some communications bandwidth requirements.
● Delay – Real-time data analysis is affected when there is too much delay in the network. It is very
important that only the necessary communications to the cloud are performed and computation
happens as close to the data source as possible.

DATA AT REST AND DATA IN MOTION

Data can be at rest and data can be in motion. Data at rest is static data that is stored in a physical
location, for example, on a hard drive in a server or data center. Data at rest follows the traditional analysis
flow of Store > Analyze > Notify > Act. Data is stored in a database and then analyzed and interpreted.
Decision makers are notified and determine whether action is required.

Data in motion is dynamic data that requires real-time processing before that data becomes irrelevant or
obsolete. It represents the continuous interactions between people, process, and things. Analysis and action
happen sooner rather than later. Devices at the edge of the network work together to act immediately on
knowledge gained from dynamic data analysis. The flow of analysis for data in motion is often Analyze > Act
> Notify > Store. The order of analyze, act, and notify can be different. The important distinction between
data at rest and data in motion is that, with data in motion, acting on the data happens before the data is
stored.

Data in motion is used by a variety of industries that rely on extracting value from data before it is stored.
Sensors in a farmer’s field continuously send data for temperature, soil moisture, and sunlight to a local
controller. The controller analyzes the data. If the conditions are not right, the controller acts immediately by
sending signals to actuators in the field to begin watering. The controller then sends a notification to the
owner of the field that watering has begun. Then the controller sends the data to be stored for historical
records.
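
The sketch below illustrates the Analyze > Act > Notify > Store flow for the field-watering example. The readings, threshold, and helper functions are invented for illustration; a real controller would talk to actual sensors and actuators.

# Conceptual sketch of the Analyze > Act > Notify > Store flow for data in motion.
MOISTURE_THRESHOLD = 30  # percent; assumed value

def start_watering():                    # Act: signal the actuators in the field
    print("Actuators: watering started")

def notify_owner(reading):               # Notify: alert the owner of the field
    print(f"Notification: watering began (moisture={reading['moisture']}%)")

history = []                             # Store: historical records
incoming = [{"moisture": 45}, {"moisture": 28}]   # simulated stream from field sensors

for reading in incoming:
    if reading["moisture"] < MOISTURE_THRESHOLD:  # Analyze
        start_watering()                          # Act
        notify_owner(reading)                     # Notify
    history.append(reading)                       # Store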

Due to the characteristics of Big Data, it is no longer feasible to duplicate and store all that data in a
centralized data warehouse. Emerging device implementations include a large number of sensors
capturing and processing data. Decisions and actions need to take place at the edge, where and when the
data is created. With sensors gaining more processing power and becoming more context-aware, it is now
possible to bring intelligence and analytic algorithms close to the source of the data. In this case, data in
motion stays where it is created and presents insights in real time, prompting better, faster decisions.

BIG DATA INFRASTRUCTURE

Many companies realize that it makes sense to invest in some of the Big Data technologies to remain
competitive in their market. Currently, their data infrastructures may look something like Figure 1, with
database servers and traditional data processing tools. Typically, data access is limited to a few
knowledgeable individuals within the organization. Companies are rapidly moving towards leveraging Big
Data technologies to drive business intelligence. According to NIST, the Big Data paradigm consists of the
distribution of data systems across horizontally coupled, independent resources to achieve the scalability
needed for the efficient processing of extensive data sets. This is horizontal scalability. It is different from
vertical scalability in that it does not attempt to add more processing power, storage or memory to existing
machines. These infrastructures can allow many users to seamlessly and securely access the data
simultaneously. One such example is thousands of online shoppers or mobile gamers. In this course, we will
briefly explore the technologies that are now common with Big Data implementations, as shown in Figure 2.
The icons in Figure 2 represent devices in an organization’s Big Data infrastructure.

FLAT FILE DATABASE

Before SQL and other database programming languages were commonplace, professionals worked with
flat file databases. A flat file database stores records in a single file with no hierarchical structure. As shown
in the figure, these databases consist of columns and rows. Columns are also called fields, and rows are
also called records. A spreadsheet file is an example of a flat file database.
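
A comma-separated values (CSV) file, such as a spreadsheet export, is a common flat file database format. The short Python sketch below writes one and reads it back; the file name and records are made up for illustration.

# A flat file database: one file, no hierarchy; columns are fields, rows are records.
import csv

rows = [
    {"id": "1", "name": "widget", "price": "12.50"},
    {"id": "2", "name": "gadget", "price": "31.00"},
]

with open("inventory.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "price"])
    writer.writeheader()                 # first row holds the field (column) names
    writer.writerows(rows)

with open("inventory.csv", newline="") as f:
    for record in csv.DictReader(f):     # each row comes back as a field-name/value mapping
        print(record)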

RELATIONAL DATABASES

The next generation of data management emerged with the relational database management system
(RDBMS). For 30 years, this was the standard approach to data management. Relational databases capture
the relationships between different sets of data, creating more useful information. In Figure 1, these
relationships are shown with lines. For instance, more detail about subcontractors can be accessed from
both Product and Material database queries.

IBM's Structured Query Language/Data Store (SQL/DS) and Relational Software Corporation's Oracle were
the first two commercial RDBMS solutions. Most commercial RDBMS solutions use SQL as their query
language to this day. An example of SQL is: SELECT id,name,price FROM inventory WHERE price < 20. Examples
of products that use structured query language to access data include MySQL, SQLite, MS SQL, Oracle, and
IBM DB2.
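
Because Python ships with the sqlite3 module, you can try the SELECT statement above against a small in-memory database. The inventory rows below are invented for illustration.

# Run the SQL example from this section against an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (id INTEGER, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [(1, "cable", 9.99), (2, "switch", 149.00), (3, "adapter", 18.50)],
)

for row in conn.execute("SELECT id,name,price FROM inventory WHERE price < 20"):
    print(row)                           # (1, 'cable', 9.99) and (3, 'adapter', 18.5)

conn.close()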

Another characteristic of relational databases is the distinction between the database and the
management system used to query the database. Typically, with an RDBMS and the underlying database, many
users can be querying the relational database at the same time. The user normally does not know all the
relationships that exist inside the database. Rather, the user abstracts a view of the database that is suited
to that user’s needs.

Figure 2 shows a simplified view of data abstraction in a relational database. The lowest level of abstraction
describes how the data is physically stored. The next level describes what data is stored and the
relationships between the data. This is the level where a database administrator operates. The user level is
the highest level and describes what part of the database a particular user or group of users can access.
There can be many different views defined and many simultaneous connections to the database at any
given time.

In contrast to traditional SQL relational database management systems, which can be challenging to scale, non-relational (NoSQL) databases scale very well as distributed databases. NoSQL can handle Big Data and real-time web applications better than an RDBMS. NoSQL database queries are focused on collections of documents, such as information gathered from websites. NoSQL also allows clusters of machines to process the data and provides better control over availability.

NoSQL databases are being adopted widely to solve business problems.

DISTRIBUTED DATA AND PROCESSING

From a data management perspective, analytics were simple when only humans created data. The amount
of data was manageable and relatively easy to sift through. Relational databases serve the needs of data
analysts. However, with the pervasiveness of business automation systems and the explosive growth of web
applications and machine-generated data, analytics is becoming increasingly difficult to manage with
just an RDBMS solution. In fact, 90% of data that exists today has been generated in just the last two years.
This increased volume within a short period of time is a property of exponential growth. This high volume of
data is difficult to process and analyze within a reasonable amount of time.

Rather than large databases being processed by big and powerful mainframe computers and stored in
giant disk arrays (vertical scaling), distributed data processing takes the large volume of data and breaks it
into smaller pieces. These smaller data volumes are distributed in many locations to be processed by many
computers with smaller processors. Each computer in the distributed architecture analyzes its part of the
Big Data picture (horizontal scaling).

Most distributed file systems are designed to be invisible to client programs. The distributed file system
locates files and moves data, but the users have no way of knowing that the files are distributed among
many different servers or nodes. The users access these files as if they were local to their own computers. All
users see the same view of the file system and are able to access data concurrently with other users.

Hadoop was created to deal with these Big Data volumes. The Hadoop project started with two facets: the Hadoop Distributed File System (HDFS), a distributed, fault-tolerant file system, and MapReduce, a distributed way to process data. Hadoop has now evolved into a very comprehensive ecosystem of software for Big Data management. There are many other distributed file system (DFS) programs. Here are
just a few: Ceph, GlusterFS, and Google’s File System.
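
The sketch below imitates the MapReduce idea in plain Python rather than Hadoop itself: a map step turns each chunk of text into (word, 1) pairs, and a reduce step totals the pairs. In a real cluster, the map tasks would run in parallel on different nodes.

# Conceptual sketch of MapReduce-style word counting.
from collections import defaultdict

chunks = [                               # pretend each string lives on a different node
    "big data needs distributed processing",
    "distributed processing splits big data",
]

def map_phase(chunk):
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

mapped = []
for chunk in chunks:                     # these map tasks could run in parallel
    mapped.extend(map_phase(chunk))

print(reduce_phase(mapped))              # {'big': 2, 'data': 2, 'needs': 1, ...}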

A NoSQL database stores and accesses data differently than relational databases. NoSQL is sometimes
called “Not only SQL”, “non-SQL”, or “non-relational”. NoSQL systems may support SQL-like query languages.
NoSQL databases use data structures such as key-value, wide column, graph, or document. Many NoSQL
databases provide "eventual consistency". With eventual consistency, database changes eventually appear
in all nodes. This means that queries for data might not provide the most recent information available. This
problem is known as ‘stale reads’.

The reason for creating NoSQL was to make database design simpler. It is easier to scale clusters of nodes
using NoSQL than it is in standard relational databases.

The most popular NoSQL databases in 2015 were MongoDB, Apache Cassandra, and Redis.
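
As a purely conceptual sketch, the Python snippet below uses dictionaries to stand in for the schema-free documents a NoSQL document store might hold. It is not a real NoSQL engine; the records and fields are invented.

# Documents in a NoSQL store are schema-free records, often represented as JSON.
documents = [
    {"_id": 1, "type": "post", "title": "Big Data basics", "tags": ["data", "iot"]},
    {"_id": 2, "type": "user", "name": "Dana", "city": "Austin"},   # different fields, same store
]

# A simple "query": find documents that carry a given tag.
matches = [doc for doc in documents if "iot" in doc.get("tags", [])]
print(matches)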

WHAT IS SQLite?

Structured Query Language (SQL) is designed to manage, search, and process data, including Big Data.
SQLite is a simple and easy to use SQL database engine that you will use in labs later in this course.

SQLite is an in-process library that uses a self-contained, transactional SQL database engine. The code for
SQLite is in the public domain, which means it is free to use for commercial or private purposes. SQLite is the
most widely deployed database in the world. Go here to learn more about SQLite being used by many
high-profile organizations.

SQLite is also an embedded SQL database engine. Unlike most other SQL databases, SQLite does not have a
separate server process. SQLite reads and writes directly to ordinary disk files.

SQLite is a popular choice for the database engine in mobile phones, MP3 players, set-top boxes, and other
electronic gadgets. SQLite has a small code footprint, makes efficient use of memory, disk space, and disk
bandwidth, is highly reliable, and requires no maintenance from a Database Administrator. SQLite is often
used instead of an enterprise RDBMS for testing. SQLite requires no setup, which makes testing much easier.
This also makes it a good choice for databases that are behind small to medium-sized websites.

SQLite FEATURES

SQLite has several useful features. A few are listed here:

● No setup or administration is required. It has an easy to use API.
● A complete database is stored in a single, cross-platform disk file. It can be used as an application file format.
● It has a small code footprint.
● It is a cross-platform SQL. It supports Android, iOS, Linux, Mac, Windows and several other operating
systems.
● Sources for SQLite are in the public domain.
● It has a stand-alone command-line interface (CLI).
● All changes within a single transaction occur completely or not at all. This is true even in the event of a program or operating system crash, or a power failure. A short sketch of this behavior follows the list.
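
The sketch below uses Python's built-in sqlite3 module to show the all-or-nothing transaction behavior; the table and values are made up for illustration.

# Demonstrate that a failed transaction leaves no partial changes behind.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")

try:
    with conn:                            # one transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO readings (id, value) VALUES (1, 1.5)")
        conn.execute("INSERT INTO readings (id, value) VALUES (1, 2.5)")  # duplicate key -> error
except sqlite3.IntegrityError:
    pass                                  # the whole transaction was rolled back

count = conn.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(count)                              # 0 -- neither insert survived
conn.close()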

ABOUT PROGRAMMING

Using SQL and database technologies is a very effective way to extract a subset of data from an existing data set stored in the database. The SQL expression that performs this action is called a SQL query. In business, many important problems cannot be solved with just a simple SQL query and need a more complex analytical process. This is where a more powerful data analysis programming language like R or Python comes in. R and Python have very large communities of developers. Their users are known for developing data analysis modules and making them available to the community free of charge. Because of that, any user can download and use pre-programmed modules and tools.

While third party tools and program modules are very useful, it is very important to know how to create your
own data analysis tools. The ability to create data analysis tools from scratch allows for highly customized
applications. The process of creating a data analysis tool from scratch can be divided into two main parts: the model and the code.

Modeling consists of deciding what to do with the data to achieve the desired results and conclusions.
Suppose for example, you want to create a personal fitness tracker. Suppose there is no pre-programmed
module in existence to do exactly what you want to do. This is why learning a programming language is so
important. You do not have to change your idea of what you want, instead, you can become part of the
developer community and create exactly what you want.

In your model, the tracker, which is built into a chest band, contains an accelerometer, which is a sensor
capable of measuring the device’s acceleration. The accelerometer can be used to determine the speed
and direction of movement. The speed and direction of the device’s movement always matches the speed
and direction of its user when attached to the user’s chest. But what if the device is attached to a dumbbell
weight? How about a tennis racket? The device will still yield the same data, speed, and movement
direction, but because of the different applications, the interpretation of this data must be adjusted for the
new usage. In this context, modeling can be seen as a way to interpret and process data. If the fitness
tracker is attached to the user’s chest, two consecutive points of no movement (speed equals zero), likely
represent the beginning and end of a sprint. When attached to a dumbbell weight, the same data points
likely represent the moment that the dumbbell was picked up off the floor and the highest point the user
was able to lift it before putting it back on the floor.

The code (or the program) is the second part of creating data analysis tools from scratch. The code is the
program that processes the data and must be written according to the model created. While the model
and the code are two separate entities, they are related because the code is built based on the model. In
this course, we focus on the programming language known as Python.
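
As a minimal sketch of the modeling idea for the chest-mounted tracker, the Python code below treats two consecutive zero-speed points as the start and end of a sprint. The speed samples are invented; a real tracker would derive speed from the accelerometer data.

# Model: a sprint runs from the moment speed leaves zero until it returns to zero.
speeds = [0.0, 0.0, 2.1, 4.8, 6.3, 5.9, 3.0, 0.0, 0.0]   # meters/second, one sample per second

sprints = []
start = None
for i, speed in enumerate(speeds):
    if speed > 0 and start is None:
        start = i                        # movement begins
    elif speed == 0 and start is not None:
        sprints.append((start, i))       # movement ends
        start = None

print(sprints)                           # [(2, 7)] -> one sprint from second 2 to second 7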

SUMMARY

Data can be the words in a book, article, or blog. Data can be the contents of a spreadsheet or a database.
Data can be pictures or video. Data can be a constant stream of measurements sent from a monitoring
device.

There are two types of data growth: linear growth and exponential growth. Exponential growth is much more
dramatic than linear growth. The digital transformation and its generation of Big Data has a profound
impact on three main elements of our lives: business, social, and environmental.

The four Vs of Big Data are: volume, velocity, variety, and veracity.

Open data (or knowledge) is “any content, information or data that people are free to use, reuse, and
redistribute without any legal, technological, or social restriction.” The expectation of privacy and what an
individual or society considers private data continues to evolve.

Structured data refers to data that is entered and maintained in fixed fields within a file or record.
Unstructured data does not possess a fixed schema that identifies the type of the data.

Data at rest is static data that is stored in a physical location, for example, on a hard drive in a server or
data center. Data at rest follows the traditional analysis flow of Store > Analyze > Notify > Act. Data in motion
is used by a variety of industries that rely on extracting value from data before it is stored. The flow of
analysis for data in motion is often Analyze > Act > Notify > Store.

A flat file database stores records in a single file with no hierarchical structure. These databases consist of
columns and rows. Relational databases capture the relationships between different sets of data, creating
more useful information.

Rather than large databases being processed by big and powerful mainframe computers and stored in
giant disk arrays (vertical scaling), distributed data processing takes the large volume of data and breaks it
into smaller pieces. These smaller data volumes are distributed in many locations to be processed by many
computers with smaller processors. Each computer in the distributed architecture analyzes its part of the
Big Data picture (horizontal scaling).

Structured Query Language (SQL) is designed to manage, search, and process data, including Big Data. SQLite is a simple and easy to use SQL database engine.

CHAPTER 2: FUNDAMENTALS OF DATA ANALYSIS

DATA IS EVERYWHERE

Data is being generated at an unprecedented rate by machines, by people, and by things. The Cisco Visual
Networking Index (VNI) forecasts global Internet traffic growth and broadband trends for mobile and fixed
networks. According to the Cisco VNI, IP traffic will triple over the next 3 years. By 2020, there will be more
than 26 billion global IP networked devices/connections (up from 16.3 billion in 2015). Globally, IP traffic will
reach 194.4 exabytes per month in 2020. Internet video will account for 79 percent of global Internet traffic
by 2020. That is up from 63 percent in 2015. The world will reach three trillion Internet video minutes per
month by 2020, which is five million years of video per month, or about one million video minutes every
second.

We are more connected than ever. In our homes, schools, work and even the areas in which we play,
advancements in IoT technologies are generating large quantities of data. Everywhere you go and
everything you do in this digital world becomes a new source of data. Data is being generated from sensors,
devices, video, audio, networks, log files, transactional applications, the web and social media. It is more
commonly streaming over the networks and comes in a variety of sizes and formats. The high volume, high
velocity and high variety of these data sets is a key feature that distinguishes data from Big Data.

The emergence of these large data sets requires more advanced methods, technologies, and infrastructure
to process the data and convert it to actionable information. Data can no longer be stored on a few
machines nor be processed with one tool.

Companies are actively creating profiles and processing data on their systems, their users, and their
processes to spur growth and innovation. Researchers and analysts are looking for ways to access and
analyze data that was once considered unusable. Advanced analytics techniques, such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, can be used on large data sets. Businesses can analyze previously untapped data sources independently, or together
with their existing enterprise data, to gain new insights. These insights result in significantly better and faster
decisions.

This chapter explains what we mean by data analysis, what it can do for businesses and other
organizations, and the tools and methodologies available. In the past, data analysts had access to
historical, static data. This data was from the past and it did not change. Now, with sensors, social media
and other sources, data is dynamic. It needs to be analyzed as soon as it is created. There is also so much
more of this dynamic data that we need new tools and methodologies, new storage solutions, and new
ways of thinking about this Big Data.

CRISP-DM

What does data analysis mean? Is it a standardized process or more of an art? With data analysis we start
with a business question and the availability of some data. We end with the creation of new information
relevant to solving the business question. This chapter presents the concept of data analysis and applies basic elements of data analysis to a particular situation. In this case, we measure the Internet speed of a connected node.

There are many methodologies for conducting data analysis, including the popular Cross Industry Standard
Process for Data Mining (CRISP-DM) used by more than 40% of data analysts. About 27% of data analysts
use their own methodology. The rest use a variety of other methodologies. (Source: KDnuggets)

To keep it simple in this introductory course, we will use the six-step Data Analysis Lifecycle shown in the
figure. Closely resembling the scientific method, the Data Analysis Lifecycle is designed for use in a business
environment. Notice that arrows are pointing in both directions between some steps. This highlights the fact
that the lifecycle may require many iterations before decision makers are confident enough to move
forward.

DATA ANALYTICS TOOL CAPABILITIES

Even before computers were invented, the information gathered while doing business was reviewed with the
goal of making processes more efficient and profitable. With the limited amount of data and the
painstaking process of manual analysis, the task was still worthwhile. Today, with the massive growth in the
volumes of data, computers and software are required to gain insight into business patterns and make
sense of all this data.

What tools you use depends on your needs and the solutions you have already implemented. Because
“best” is a relative term, the tools you use will depend on your specific objectives or the questions you are
trying to answer. The tool to use depends on the type of analysis to be performed. Some tools are designed
to handle manipulation and visualization of large data sets. Other tools are designed with complex
mathematical modeling and simulation capabilities for prediction and forecasting. No matter which tools are used, they should provide the following capabilities:

● Ease of use – The tool that is easy to learn and to use is often more effective than a tool that is
difficult to use. Also, a tool that is easy to use requires less training and less support.

● Data manipulation – The software should allow users to clean and modify the data to make it more
usable. This leads to data being more reliable because anomalies can be detected, adjusted or
removed.

● Sharing – Everyone must be looking at the same data sets to be able to collaborate effectively. This
helps people to interpret data the same way.

● Interactive visualization – To fully understand how data changes over time, it is important to visualize
trends. Basic charts and graphs cannot fully represent how information evolves the way a heat map
or time motion view can.

THE ROLE OF PYTHON IN DATA ANALYSIS

There are a variety of programs that are used to format data, clean it, analyze it, and visualize it. Many
companies and organizations are turning to open source tools to process, aggregate and summarize their
data. The Python programming language has become a commonly used tool for handling and
manipulating data. Python will be used in this course to perform all of these functions.

Python was created in 1991 as an easy to learn language with many libraries used for data manipulation,
machine learning, and data visualization. Through the use of these libraries, programmers do not have to
learn multiple programming languages or spend time learning how to use many different programs to
perform the functions of these libraries. Python is a flexible language that is growing and becoming more
integral to data science because of this flexibility and ease of learning.

This course will use Jupyter Notebooks, shown in the figure. Jupyter Notebooks allow instruction and
programming to exist within the same file. It is easy to alter code in the notebooks and experiment with how
different code can be used to manipulate, analyze, and visualize your data.

These are some of the libraries that will be used in this course, with a brief example after the list:

● NumPy – This library adds support for arrays and matrices. It also has many built-in mathematical
functions for use on data sets.

● Pandas – This library adds support for tables and time series. Pandas is used to manipulate and
clean data, among other uses.

● Matplotlib – This library adds support for data visualization. Matplotlib is a plotting library capable of
creating simple line plots to complicated 3D and contour plots.
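
Here is that brief example, using all three libraries together on a few invented temperature readings.

# A small taste of NumPy, Pandas, and Matplotlib working together.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

temps = np.array([21.0, 22.5, 24.1, 23.4, 22.0])            # NumPy array of readings

df = pd.DataFrame({"hour": range(5), "temp_c": temps})      # Pandas table (DataFrame)
print(df.describe())                                        # quick summary statistics

df.plot(x="hour", y="temp_c", title="Temperature by hour")  # plotting uses Matplotlib
plt.show()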

BIG DATA AND DECISION MAKING

The scalable technologies made possible by distributed computing and virtualization are enabling data
center administrators to manage the top three of the four aspects of Big Data: volume, velocity, and variety.
Statistical methodologies embedded in applications are empowering data analysts to interpret and use Big
Data to make better decisions. Modern data analysis tools make it possible to extract and transform the raw
data to produce a much smaller set of quality data. However, data alone is not meaningful information; the data must be analyzed and then presented in a form that can be interpreted. This is what decision makers need to take the right action.

Decision makers will increasingly rely on data analytics to extract the required information at the right time,
in the right place, to make the right decision. This information can tell many different stories, depending on
how the data is analyzed. For example, in politics, it is common practice for data analysts to extract
information that is relevant to their candidate. In business, a data analyst may uncover market trends that
enable a company to move ahead of its competition.

DATA, INFORMATION, KNOWLEDGE, AND WISDOM

The Data, Information, Knowledge, and Wisdom (DIKW) model shown in the figure is used to illustrate the
transitions that data undergoes until it gains enough value to inform wise decisions. This structure provides
a means of communicating the value of data in various states of incarnation.

The following is an example of each level of the pyramid, from the bottom up:

● Data – Collect temperature readings from multiple, geo-localized sensors.

● Information – Extract temporal and location insights. These show that temperatures are constantly rising globally.

● Knowledge – Compare multiple hypotheses; it becomes apparent that the rise appears to be caused by human activities, including greenhouse gas emissions.

● Wisdom – Work to reduce greenhouse gas emissions.


Wise decisions rely on a well-established base of knowledge. A common phrase used in data analytics is
business intelligence. Business intelligence encompasses the entire process from data to information to
knowledge to wisdom.

DESCRIPTIVE ANALYTICS

There are multiple types of analytics that can provide businesses, organizations and people with
information that can drive innovation, improve efficiency and mitigate risk. The type of data analytics to
implement will depend on the problem that needs to be solved or questions that need to be answered.

Three types of data analytics will be covered in this course:

● Descriptive
● Predictive
● Prescriptive

Descriptive analytics primarily uses observed data. It is used to identify key characteristics of a data set.
Summarized data from descriptive analytics provides information on prior events and trends in
performance. Descriptive analytics relies solely on historical data to provide regular reports on events that
have already happened. This type of analysis is also used to generate ad hoc reports that summarize large
amounts of data to answer simple questions like “how much...” or “how many...” or “what happened.” It can
also be used to drill down into the data, asking deeper questions about a specific problem. The scope of
descriptive analytics is to summarize your data into more compact and useful information. An example of a
descriptive analysis is an hourly traffic report.
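
A minimal descriptive-analytics sketch: the Pandas code below summarizes invented traffic-sensor counts into a simple hourly report.

# Summarize observed data to answer "what happened" questions.
import pandas as pd

data = pd.DataFrame({
    "hour":     [8, 8, 9, 9, 10, 10],
    "vehicles": [120, 135, 310, 290, 180, 175],
})

hourly = data.groupby("hour")["vehicles"].agg(["sum", "mean"])
print(hourly)                            # total and average vehicle count per hour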

PREDICTIVE ANALYTICS

Predictive analytics attempts to predict what may happen next with a certain degree of confidence, based
on data and statistics. Predictive analytics can be used to infer missing data and establish a future trend
line based on past data. It uses simulation models and forecasting to suggest what could happen.

An example of a predictive analysis is a computer model that uses Big Data to forecast the weather.

Another way to look at predictive analytics is to produce new data by starting with existing data. A common
example is the price of a house. Imagine you want to sell your house and you do not know what price to set
for it. You can take the prices of recent sales of houses in the neighborhood and the characteristics of those
houses (e.g., number of bedrooms, bathrooms, status, etc.) as an indication of the price. But your house is
probably not identical to any of the other houses. Here is where predictive analytics can help. A predictive
model for the price is based on the data that you have of previous sales. It “predicts” the appropriate price
for your house. Another example is classification. For example, given a tweet or a post, classify the tweet as
positive or negative based on the text it contains.
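
A minimal predictive sketch of the house-price example: the NumPy code below fits a straight line to a few invented neighborhood sales and uses it to "predict" a price. Real predictive models are usually far more sophisticated.

# Fit price as a simple linear function of floor area, then predict for a new house.
import numpy as np

area_m2 = np.array([80, 95, 110, 120, 150])      # recent sales: floor area
price_k = np.array([210, 245, 280, 300, 370])    # recent sales: price in thousands of dollars

slope, intercept = np.polyfit(area_m2, price_k, 1)   # least-squares straight line

my_area = 105
predicted = slope * my_area + intercept
print(f"Predicted price: about ${predicted:.0f}k")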

In 2014, Jameson Toole won the MIT Big Data Challenge: “What can you learn from data about 2.3 million taxi
rides?” Based on information garnered by writing pattern-matching machine-learning algorithms for very
large data sets, Toole was able to predict the number of taxi pickups that would likely occur in 700 two-hour
time intervals at 36 locations in the Boston area. Knowledge of where and when taxis are needed most
could be used to reduce traffic congestion by eliminating the need for taxis to drive around the city looking for people to pick up.

PRESCRIPTIVE ANALYTICS

Prescriptive analytics predicts outcomes and suggests courses of actions that will hold the greatest benefit
for the enterprise or organization. Prescriptive analytics recommends actions or decisions based on a
complex set of targets, constraints, and choices. It can be used to suggest how to mitigate or even avoid
risks. Prescriptive analytic implementations may require a feedback system to track the outcome of the
actions taken.

An example of a prescriptive analysis is a computer model that uses Big Data to make stock market
recommendations to buy or sell a stock.

All three types of analytics are used in data analysis.

THE ROLE OF TIME IN DATA ANALYTICS

Before the era of Big Data, the role of time in data analytics was restricted to how long it took to compile a
data set from disparate sources, or how long it took to run a data set through some calculation. With Big
Data, time becomes important in other ways because much of the value of data is derived from creating
opportunities to take action immediately.

Data is being generated constantly by sensors, consumers, social media users, jet engines, the stock
market, and almost anything else that is connected to a network. This data is not just growing in quantity; it
is also changing in real time. Data analysis must also be carried out in real-time while the data is being
collected.

In business, making decisions based on analytics can improve return on investment (ROI), and the speed at which those decisions are made matters. Data-driven decisions can have the following benefits:

● Increased time for research and development of products and services
● Increased efficiency and faster manufacturing
● Faster time to market
● More effective marketing and advertising

TRADITIONAL ANALYTICS TO BIG DATA ANALYTICS

In the past, when most data sets were relatively small and manageable, analysts could use traditional tools
such as Excel or a statistical program, such as SPSS, to create meaningful information out of the data.
Typically, the data set contained historical data and the processing of that data was not always time
dependent. Traditional databases had to be designed before the data could be entered. Then the data, if
not too large, could be cleaned, filtered, processed, summarized and visualized using charts, graphs and
dashboards.

As the data sets grow in volume, velocity and variety, the complexity of data storage, processing, and
aggregation becomes a challenge for traditional analytic tools. Large data sets may be distributed and
processed across multiple, geographically-dispersed physical devices as well as in the cloud. Big Data
tools, such as Hadoop and Apache Spark, are needed for these large data sets to enable real-time analysis
and predictive modeling.
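
As a brief illustration, a distributed aggregation in Apache Spark might look like the following sketch; it assumes the pyspark package is installed, a working Spark environment, and a hypothetical sensor_readings.csv file with sensor_id and temperature fields.

# A minimal Apache Spark sketch (assumes pyspark is installed and that a
# hypothetical file named sensor_readings.csv exists).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BigDataExample").getOrCreate()

# Read a large CSV file; Spark distributes the data across the cluster.
df = spark.read.csv("sensor_readings.csv", header=True, inferSchema=True)

# Aggregate in parallel: average temperature reported by each sensor.
summary = df.groupBy("sensor_id").agg(F.avg("temperature").alias("avg_temp"))
summary.show()

spark.stop()
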
NEXT GENERATION ANALYTICS

For businesses to make optimal decisions, it is no longer enough to gather data from the previous fiscal year
and run descriptive analysis types of queries. It is increasingly necessary to use predictive and prescriptive
analysis tools to remain competitive in a world in which the rate of change is accelerating. Next generation
analytics do not have to solely rely on performing statistical analytics on an entire data set, as was done
with traditional analytics tools. Because of the vast amount of data points and attributes collected per
record or per “thing”, new behaviors and insights can be gained from advanced analyses that improve
prediction and prescription accuracy.

For example, the following questions can be answered to make real-time adjustments to decisions:

● Which stocks will most likely have the highest daily gain based on trading in the last hour?
● What is the best way to route delivery trucks this afternoon, based on morning sales, existing
inventory, and current traffic reports?
● What maintenance is required for this airplane, based on performance data generated during the
last flight?

The handling of this machine-generated data, combined with the geographical scope of very large-scale systems, the number of data-generating devices, the diversity of device manufacturers, the frequency of data generation, and the overall volume of data, requires new infrastructure software. This infrastructure software must be able to distribute the computing and the storage of the data among the edge, fog, and cloud, wherever it best fits the needs of the business.

THE SCIENTIFIC METHOD

The process that a data analyst uses to make conclusions is very similar to the scientific method shown in
the figure. A data analyst may ask the question, “What district in San Francisco had the most incidents of
reported crimes between June 1 and August 1 of 2014?” A scientist may want to solve the problem, “Why
does blood from a young mouse reverse the effects of aging when put into an older mouse?” Regardless of
the exact method or steps used, both data analysts and scientists will complete a process that includes
asking questions, gathering data, analyzing the data, and making conclusions or presenting the results.

BUSINESS VALUE

Data analytics allows businesses to better understand the impact of their products and services, adjust
their methods and goals, and provide their customers with better products faster. The ability to gain new
insights from their data brings business value.

Michael Porter of Harvard describes how IT, for the third time in 50 years, has reshaped business:

“The first wave of IT, during the 1960s and 1970s, automated individual activities like paying employee
stipends or supporting the design and manufacturing of products. The second wave of business transformation was the rise of the Internet in the 1980s and 1990s, which enabled the coordination and
integration of outside suppliers, distribution channels, and customers across geography.

With IoT, we are now in the third wave, in which IT is becoming an integral part of the product itself. Embedded
sensors, processors, software, and connectivity in products (in effect, computers are being put inside
products), coupled with a cloud where product data is stored and analyzed and some applications are run,
are driving dramatic improvements in product functionality and performance. Massive amounts of new
product-usage data enable many of those improvements.”

DATA ANALYSIS LIFECYCLE EXAMPLE

Like in the scientific method, the Data Analysis Lifecycle begins with a question. For example, we could ask
the question, “What was the most prevalent crime committed in San Francisco, California on July 4, 2014?”
Each step in the Data Analysis Lifecycle includes many tasks that must be completed before moving on to
the next step. Only one example task is shown in the figure.

The following is a brief description of each step:

● Gathering the data - The process of locating data and then determining if there is enough data to
complete the analysis. In this case, we would search for an open data set of crime statistics for San
Francisco during July of 2014.

● Preparing the data - This step can involve many tasks to transform the data into a format
appropriate for the tool that will be used. The crime data set may already be prepared for analysis.
However, there are usually some adjustments to make to help answer the question.

● Choosing a model - This step includes choosing an analysis technique that will best answer the question with the data available. After a model is chosen, a tool (or tools) for data analysis is selected. In this chapter, you will learn to use Python and Python libraries to prepare, analyze, and present data (a minimal sketch of these steps follows this list).

● Analyzing the data - The process of testing the model against the data and determining if the model
and the analyzed data are reliable. Were you able to answer the question with the selected tool?

● Presenting the results - This is usually the last step for data analysts. It is the process of
communicating the results to decision-makers. Sometimes, the data analyst is asked to
recommend actions. For the July 4th crime data, a bar graph, a pie chart, or some other
representation could be used to communicate which crime was most prevalent. An analyst might
suggest increasing police presence in certain areas to deter crime on a specific holiday like July 4th.

● Making decisions - The final step in the data analysis lifecycle. Organizational leaders incorporate
the new knowledge as part of the overall strategy. The process begins anew with gathering data.
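
The following is a minimal sketch of how these steps might be carried out with the pandas library (assumed to be installed); the file name sf_crime_july_2014.csv and the Date and Category field names are hypothetical.

# A minimal sketch of the lifecycle steps for the July 4th crime question.
import pandas as pd

# Gathering the data: load an open data set of crime reports.
crime = pd.read_csv("sf_crime_july_2014.csv")

# Preparing the data: parse dates and keep only July 4, 2014.
crime["Date"] = pd.to_datetime(crime["Date"])
july4 = crime[crime["Date"].dt.date == pd.Timestamp("2014-07-04").date()]

# Analyzing the data: count incidents per crime category.
counts = july4["Category"].value_counts()

# Presenting the results: report the most prevalent crime (a bar chart could also be plotted).
print(counts.head(5))
print("Most prevalent crime:", counts.idxmax())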

FILES

There are many different sources of data. A vast amount of historical data can be found in files such as Microsoft Word documents, emails, spreadsheets, PowerPoint presentations, PDFs, HTML, and plaintext files. These are just a few of the types of files that contain data.

Big Data can also be found in public and private archives. Scanned paper archives containing historical data from a variety of sources certainly qualify as Big Data. For example, there is an enormous amount of data in medical insurance forms and invoices, business statements and customer interactions, and tax documents. This list is just a small portion of archived data.

Internal to organizations, raw data is created through customer relationship management systems,
learning management systems, human resource systems and records, intranets, and other processes.
Different applications create files in different formats that are not necessarily compatible with one another.
For this reason, a universal file format is needed. Comma-separated values (CSV) files are a type of
plaintext file outlined in RFC 4180. CSV files use commas to separate columns in a table of data, and the
newline character to separate rows. Each row is a record. Although they are commonly used for importing
and exporting in traditional databases and spreadsheets, there is no specific standard. JSON and XML are
also plaintext file types that use a standard way of representing data records. These file formats are
compatible with a wide range of applications. Converting data into a common format is a valuable way to
combine data from different sources.
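
As a small sketch of converting data to a common format, the standard csv and json modules can turn a CSV file into JSON; the file name customers.csv is hypothetical, and the first row is assumed to contain the field names.

# Convert a CSV file into JSON using only the Python standard library.
import csv
import json

with open("customers.csv", newline="") as csv_file:
    # DictReader maps each row to a dictionary keyed by the header fields.
    rows = list(csv.DictReader(csv_file))

with open("customers.json", "w") as json_file:
    # Each CSV record becomes a JSON object in a list.
    json.dump(rows, json_file, indent=2)

print("Converted", len(rows), "records to JSON.")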

INTERNET

The Internet is a good place to look for Big Data. There you can find images, videos, and audio. Public web
forums also create data. Social media such as YouTube, Facebook, instant messaging, RSS, and Twitter all
add to the data found on the Internet. Most of this data is unstructured, which means it is not easy to
categorize into a database without some type of processing.

Web pages are created to provide data to humans, not machines. “Web scraping” tools automatically
extract data from HTML pages. This is similar to the web crawler, or spider, of a search engine, which explores the web to extract data and build the index used to respond to search queries. Web scraping software may
use Hypertext Transfer Protocol or a web browser to access the World Wide Web. Typically, web scraping is
an automated process which uses a bot or web crawler. Specific data is gathered and copied from the web
to a database or spreadsheet. The data can then be easily analyzed.

To implement web scraping, the process must first download the web page and then extract the desired
data from it. Web scrapers typically take something out of a page, to make use of it for another purpose
somewhere else. Perhaps the web scraper is being used to find and copy names, phone numbers, and
addresses. This is known as contact scraping.

In addition to contact scraping, web scraping is used for other types of data mining such as real estate
listings, weather data, research, and price comparisons. Many large web service providers such as
Facebook provide standardized interfaces to collect the data automatically using APIs. The most common
approach is to use RESTful application program interfaces (APIs). RESTful APIs use HTTP as the communication protocol and typically encode the data as JSON. Internet websites like Google and Twitter gather large amounts of static and time series data. Knowledge of the APIs for these sites allows data analysts and engineers to access the large amounts of data that are constantly being generated on the Internet.
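
The following is a minimal sketch of collecting data from a RESTful API. It assumes the third-party requests package is installed; the URL, parameters, and field names are hypothetical, and the API is assumed to return a JSON list of posts.

# Request JSON data from a hypothetical RESTful API endpoint.
import requests

url = "https://api.example.com/v1/posts"
params = {"topic": "IoT", "limit": 10}

response = requests.get(url, params=params)
response.raise_for_status()        # Stop if the HTTP request failed.

posts = response.json()            # RESTful APIs commonly return JSON.
for post in posts:
    print(post.get("user"), post.get("text"))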

SENSORS

The Internet of Things (IoT) uses sensors to create data. This data can come from temperature and
humidity sensors found in agriculture. Sensors are now in everything from smart phones to cars, and jet
engines to home appliances. These, along with many other types of sensors (the list of things with sensors
grows every year) contribute to the exponential growth of Big Data. Click here to learn more about the
sources of Big Data.

We need new tools, new technologies, and a new way to approach how we store, process and compute, so
that raw data can become meaningful information.

DATABASES

Databases contain data that has been extracted, transformed and loaded (ETL). ETL is the process of
‘cleaning’ raw data so that it can be placed into a database. Often data is stored in multiple databases and
must be merged into a single dataset for analysis.

Most databases contain data that is owned by an organization and is private. As mentioned in the previous
chapter, there are many public databases that can be searched by anyone. For example, the Internet has
several public databases with ancestral records available for free or low cost.

DATA TYPES AND FORMATS

After data has been accessed from different sources, it requires preparation for analysis. In fact, experts in
the field of Data Science estimate that data preparation can take up 50 to 80 percent of the time required
to complete an analysis.

Because the data that will comprise the data set to be analyzed can come from very diverse sources, it is
not necessarily compatible when combined. Another issue is that data that may be presented as text will
need to be converted to a numeric type if it is to be used for statistical analysis. Data types are important
when computer languages, such as Python or R, are used to operate on data. Some different data types, and their descriptions, are shown in Figure 1.

In addition to different data types, a single type of data can be formatted differently, depending on its
source. For example, different languages may use different symbols to represent the same word. British
English may use different spellings than American English. An analysis of English text for mentions of modes
of travel would need to look for both airplane and aeroplane in order to be accurate.

Time and date formats present challenges. Although times and dates are very specific, they are
represented in a wide variety of formats. Time and date are essential to the analysis of time series
observations. Therefore, they must be converted to a standard format in order for an analysis to have any
value. For example, dates may be formatted with the year first, followed by the month and the day, in some countries, while other countries may present dates with the month first, followed by the day and the year.
Similarly, time may be represented in 12-hour format with the AM and PM designation, or could be
represented in 24-hour format.
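
As a small sketch of standardizing such values, the standard datetime module can parse several differently formatted date strings and write them back out in one consistent format; the sample strings below are invented.

# Standardize differently formatted dates and times using the datetime module.
from datetime import datetime

samples = [
    ("07/04/2014 02:30 PM", "%m/%d/%Y %I:%M %p"),   # month-first, 12-hour clock
    ("04-07-2014 14:30",    "%d-%m-%Y %H:%M"),      # day-first, 24-hour clock
    ("2014-07-04 14:30",    "%Y-%m-%d %H:%M"),      # year-first (ISO style)
]

for text, fmt in samples:
    parsed = datetime.strptime(text, fmt)
    # Convert every variant to one standard format for analysis.
    print(parsed.strftime("%Y-%m-%d %H:%M"))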

DATA STRUCTURE

Data science is a rapidly evolving field. Like many new disciplines, the language used is diverse and not
widely standardized. This means the same thing may have several names depending on the context and
background of the speaker. This is true of data structures as well.

When discussing data, we can think of a hierarchy of structures. For example, a data warehouse or data
lake is a place that stores many diverse databases in such a way that the databases can be accessed
using the same system. A database is a collection of data tables that are related to one another in one or
more ways. Data tables consist of fields, rows, and values that are similar to the columns, rows, and cells in
a spreadsheet. Each data table can be considered as a file, and a database as a collection of files. Figure 1
illustrates the relationship of these structures and associated terminology. For this course, we will use fields,
rows, and values as our standard vocabulary for the structure of data tables.

Other data structures, or objects, are used by Python. For example, Python uses strings, lists, dictionaries,
tuples, and sets as its primary data structures. Each data structure has its own group of functions, or
methods, which can be used to work with the object. Figure 2 shows the common Python data structures. In
addition, a popular Python data analysis library called ‘pandas’ uses other data structures such as series
and data frames.
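
The following brief sketch shows the core Python data structures mentioned above, along with a small pandas DataFrame; the pandas library is assumed to be installed, and the sample values are invented.

# Core Python data structures, plus a pandas DataFrame as a data table.
import pandas as pd

reading = "22.5C"                         # string
temps = [22.5, 22.7, 23.1]                # list (ordered, mutable)
sensor = {"id": 7, "type": "humidity"}    # dictionary (key/value pairs)
location = (37.77, -122.42)               # tuple (ordered, immutable)
units = {"C", "F", "K"}                   # set (unique values)

# A pandas DataFrame is a data table made of fields (columns), rows, and values.
table = pd.DataFrame({"sensor_id": [1, 2, 3],
                      "temperature": temps})
print(table)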

EXTRACT, TRANSFORM, AND LOAD DATA

As mentioned earlier in this topic, much of the data that is going to be placed in a database so that it can
then be queried comes from a variety of sources and in a wide range of formats. Extract, Transform and
Load (ETL) is a process for collecting data from this variety of sources, transforming the data, and then
loading the data into a database. One company’s data could be found in Word documents, spreadsheets,
plain text, PowerPoints, emails and pdf files. This data might be stored in a variety of servers which use
different formats.

There are three steps to the ETL process:

Step 1. Extract – Data is culled from several sources.

Step 2. Transform – After the data has been culled, it must be transformed. Data transformation may
include aggregating, sorting, cleaning and joining data.

Step 3. Load – The transformed data is then loaded into the database for querying.

The above descriptions of the three steps of the ETL process are simplified. In fact, there is quite a lot of work
to do before data can be loaded into a database and then queried.
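
A greatly simplified sketch of the three steps, using only the Python standard library, might look like the following; the file name sales_raw.csv, the field names, and the validation rule are hypothetical.

# A simplified ETL sketch: extract from a CSV file, transform and validate the
# records, and load them into a SQLite database.
import csv
import sqlite3

# Extract: read raw records from a source file.
with open("sales_raw.csv", newline="") as src:
    raw_rows = list(csv.DictReader(src))

# Transform: clean, convert, and validate the data.
clean_rows = []
for row in raw_rows:
    try:
        amount = float(row["amount"])
    except ValueError:
        continue                      # Reject records that fail the validation rule.
    clean_rows.append((row["region"].strip().title(), amount))

# Load: place the transformed data into a database table for querying.
conn = sqlite3.connect("sales.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean_rows)
conn.commit()
conn.close()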

EXTRACTING DATA

The extract step gathers the desired data from the source and makes it available to be processed.
Extraction converts the data into a single format that is ready to be transformed. For example, combining
data from a NoSQL server and an Oracle DB will give you data in different formats. This data must be
converted into a single format. Also, the data must be checked to ensure it has the desired type of
information (value). This is done using validation rules. If data does not meet the validation rule(s), it may
be rejected. Sometimes, this rejected data is rectified and then validated.

Ideally, during extraction, all of the required data from the source(s) is retrieved using minimal compute
resources, so as not to affect network or computer performance.

TRANSFORMING DATA

The transform step uses rules to transform the source data to the type of data needed for the target
database. This includes converting any measured data to the same units (e.g. Imperial to metric). The
transformation step also requires several other tasks. Some of these tasks are joining data from several
sources, aggregating data, sorting, determining new values that are calculated from aggregated data, and
then applying validation rules (Figure 1).

While it may seem as though this data is completely ready to load, there is usually still work to be done to
prepare it. Data (possibly including some rejected data) may go through another part of the transform step
known as ‘cleaning’ or ‘scrubbing’ data. The cleaning part of the transform step further ensures the
consistency of the source data.

LOADING DATA

The load step is when the transformed data is loaded into the target database. This may be a simple flat file
or a relational database. The actual load process varies widely. It depends on the type of source data, the
type of target database, and the type of querying that is to be done. Some organizations may overwrite
existing data with cumulative data. Loading new transformed data may be done on an hourly, daily, weekly,
or monthly basis. It may only happen when there has been a specific amount of change to the transformed
data.

During the load step, rules that have been defined in the database schema are applied. Some of these rules check for uniqueness and consistency of data, and that mandatory fields contain the required values. These rules help to ensure that the load and any subsequent querying of the data is successful.

CURRENT AND FUTURE REGULATIONS

A quick search on the Internet will most likely reveal that the ethical use of data continues to cause concern
for many people. However, the response from governments on data protection regulations varies from
country to country, as shown in the figure. The European Union (EU) has enacted the strictest regulations,
defining personal data as “any data that can be attributed to an identifiable person either directly or
indirectly.” For more information, click here for the World Economic Forum’s Global Information Technology
Report 2014.

The General Data Protection Regulation (GDPR) was approved by the EU Parliament on April 14, 2016. It goes
into effect on May 25, 2018 at which time any organizations in non-compliance will face heavy fines. The EU
GDPR was designed to make data privacy laws consistent across Europe, to protect data privacy of all EU
citizens, and to reshape the way organizations across the region approach data privacy. Click here for more
information about the GDPR.

On October 6, 2015, the European Court of Justice invalidated the previous Safe Harbour Framework. The resulting set of data privacy requirements is included in the EU-US Privacy Shield, which takes the place of the Safe Harbour Framework. This new arrangement requires that companies in the U.S. protect the personal data of
Europeans. It requires stronger monitoring and enforcement by the U.S. Department of Commerce and
Federal Trade Commission (FTC), including increased cooperation with European Data Protection
Authorities. Click here to learn more about the EU-US Privacy Shield.

Click here to read more about US Data Privacy.

What about the people who work with Big Data on a daily basis, the data scientists? What do they think
about the ethical issues around its use? In August 2013, Revolution Analytics surveyed 865 data scientists.

These are some results of that survey:

● 88% of those surveyed believed that consumers should worry about privacy issues.
● 80% of those surveyed agreed that there should be an ethical framework in place for collecting and
using data.
● More than half of those surveyed agreed that ethics already play a big part in their research.

BIG DATA ETHICS SCENARIOS

Consider the following scenarios and how your own personal data might be used:

Scenario 1: You are unconscious after an accident and you are taken to the hospital for treatment. A vast
amount of data is generated over the next couple of hours as the medical professionals work to save your
life. Do you own this data even though it could be used in the future to save other lives? Click here to read
about actual medical information data protection case studies. This site also contains several other types of
data protection case studies.

Scenario 2: A city installs surveillance cameras to reduce crime. Later, the city performs Big Data analytics on the citywide video data from the last year and finds that pedestrian traffic patterns in your neighborhood show that sidewalk utilization is lower than in other neighborhoods. The city then uses this data
analysis to justify a street widening, resulting in a significant increase in traffic noise in your home. Does the
overall reduction in crime in the city outweigh your rights as a homeowner?

Scenario 3: An online retailer uses Big Data predictive modeling to make suggestions to you about future
purchases based on your previous purchasing data. You save dozens of hours over a period of a few years
by spending less time researching product pricing and availability. What if the retailer sells your purchasing
habits to a third party? Should you be responsible for reading and understanding a lengthy End-User
License Agreement (EULA) that is used as legal cover by a corporation reselling your information?

These scenarios illustrate the complex ethical issues currently facing organizations, governments, and
individuals. Ethics will continue to be a major concern as the amount of data we generate grows.

DATA SECURITY

Confidentiality, integrity and availability, known as the CIA triad (Figure 1), is a guideline for data security for
an organization. Confidentiality ensures the privacy of data by restricting access through authentication and encryption. Integrity assures that the information is accurate and trustworthy. Availability ensures that the
information is accessible to authorized people.

Confidentiality

Another term for confidentiality would be privacy. Company policies should restrict access to the
information to authorized personnel and ensure that only those authorized individuals view this data. The
data may be compartmentalized according to the security or sensitivity level of the information. For
example, a Java program developer should not have access to the personal information of all
employees. Furthermore, employees should receive training to understand the best practices in
safeguarding sensitive information to protect themselves and the company from attacks. Methods to
ensure confidentiality include data encryption, username ID and password, two factor authentication, and
minimizing exposure of sensitive information.

Integrity

Integrity is the accuracy, consistency, and trustworthiness of the data during its entire life cycle. Data must be unaltered during transit and must not be changed by unauthorized entities. File permissions and user access
control can prevent unauthorized access. Version control can be used to prevent accidental changes by
authorized users. Backups must be available to restore any corrupted data, and checksum hashing can be
used to verify integrity of the data during transfer.

A checksum is used to verify the integrity of files, or strings of characters, after they have been transferred
from one device to another across your local network or the Internet. Checksums are calculated with hash
functions. Some of the common hash functions are MD5, SHA-1, SHA-256, and SHA-512. A hash function uses a mathematical algorithm to transform the data into a fixed-length value that represents the data, as represented in Figure 2. The hashed value is simply there for comparison. From the hashed value, the
original data cannot be retrieved directly. For example, if you forgot your password, your password cannot
be recovered from the hashed value. The password must be reset.

After a file is downloaded, you can verify its integrity by comparing the hash value published by the source with the one you generated using any hash calculator. If the hash values match, you can be confident that the file
has not been tampered with or corrupted during the transfer.
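
The following is a small sketch of this check using the standard hashlib module; the file name downloaded_file.iso and the published checksum are hypothetical.

# Verify file integrity by calculating a SHA-256 checksum.
import hashlib

published_sha256 = "expected checksum copied from the download page"

sha256 = hashlib.sha256()
with open("downloaded_file.iso", "rb") as f:
    # Read the file in chunks so that large files do not exhaust memory.
    for chunk in iter(lambda: f.read(8192), b""):
        sha256.update(chunk)

calculated = sha256.hexdigest()
print("Calculated:", calculated)
print("File is intact:", calculated == published_sha256)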

Availability

Maintaining equipment, performing hardware repairs, keeping operating systems and software up to date,
and creating backups ensure the availability of the network and data to the authorized users. Plans should
be in place to recover quickly from natural or man-made disasters. Security equipment or software, such as
firewalls, guard against downtime because of attacks, such as denial of service (DoS). Denial of service
occurs when an attack attempts to overwhelm resources so the services are not available to the users.

Note: It is not always easy to differentiate between many of the terms you see when reading about the IoT.
Figure 3 explains the difference between data protection, data privacy, data security, and data
confidentiality.

FORMATTING TIME AND DATE DATA

In this last section, you will prepare to complete three labs. These are the first in a series of labs that will be
expanded upon through Chapter 5 of this course. These labs are known as the Internet Meter labs. In the first
lab, you will use a function, called Speedtest, that returns the upload and download speeds of your Internet
connection. After acquiring the measurements, you will save the collected data. You will also import a
larger, previously collected Internet speed data set. This is for learning data manipulation so you can
present the data concisely.

The second lab is not an Internet Meter lab. It is a lab where you will work with Python and SQLite, to prepare
you for the third lab.

The third lab is the next Internet Meter lab. In it, you will use a relational database, SQLite, and perform some
basic SQL queries using Python. You will also calculate the average and plot the data and the calculated
averages of Internet speed data. You will merge tables that contain average speed information and
geographical information into a single database.

As previously mentioned, IoT data that has been combined from many sources may be formatted in ways
that are incompatible. For example, there are many ways that time and date data may be presented.
However, for the purpose of analytics, it is best that times and dates be formatted consistently. One way to deal with this problem is to simply use only data that already has a single time and date format. However, this would cause an analyst to discard relevant, but incompatibly formatted, data, which would create bias and lead to flawed conclusions.

One of the reasons that Python is so popular with data analysts is that the core functionalities of the
language have been extended with many different libraries, or modules. One Python module, which will be
used in the upcoming lab, is dedicated to handling time and date data. This module is called datetime.
Click here to read detailed documentation about the datetime module.

The datetime module is included in most Python distributions as a standard library; however, it must be imported to be used in your code. Features of the datetime module are represented by the object-oriented
programming paradigm. The module consists of the date, time, and datetime classes. Each class has its
own methods that can be called to work with instances of the classes called objects. Figure 1 offers
definitions of some basic object-oriented concepts. A detailed introduction to object-oriented
programming is beyond the scope of this course. Click here for an introductory lesson in Python classes and
objects. Figure 2 illustrates the use of some of the basic objects and methods included with the datetime module.

In the first Internet Meter lab in this chapter, you will change date and time data from one format to
another. This is done with the strftime (string from time) method that is available to datetime objects. The
strftime method uses a series of formatting codes, or directives, as its parameters. The list of formatting
codes is shown in Figure 3.

Figure 4 shows Python code that uses the datetime module to represent the date and time in a format
commonly used in the United States. This code can be recreated in a new Python notebook for practice and
exploration.
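
Because Figure 4 is not reproduced here, the following is a sketch of similar code; the exact code in the figure may differ.

# Use the datetime module to show the current date and time in a format
# commonly used in the United States.
from datetime import datetime

now = datetime.now()                       # a datetime object for this moment

# %m/%d/%Y gives month/day/year; %I:%M %p gives a 12-hour clock with AM/PM.
us_format = now.strftime("%m/%d/%Y %I:%M %p")
print(us_format)                           # e.g. 07/04/2014 02:30 PM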

READING AND WRITING FILES

In addition to the datetime module, the lab uses a comma-separated values module called csv. The csv module is also a core module that is part of the Python standard library. The csv module
allows reading and writing to .csv files. Python also has basic methods for creating, opening, and closing
external files. Later in the course you will learn how to extensively modify data by manipulating it within data
tables. These data tables will only exist in RAM until they are saved to files. You will learn a number of ways to
do this.

In this lab, you will use the Python open() function and the file object's close() method. The open() function is used to create a new file or to open an existing file that will contain the data to be saved. The close() method flushes any unwritten data from the buffers and ends the writing functionality for the specified file. It is important to explicitly close every file when you are finished with it. This preserves system resources and protects the file from corruption. Figure 1 shows the syntax for the open() function and explains some important values that can be supplied to it. It also illustrates the use of the close() method.

Note: Opening a nonexistent file in "a" mode will create that file the same way as "w" mode does. The only difference is where the file pointer is positioned: at the beginning of the file in "w" mode, or at the end of the file in "a" mode.

Figure 2 explains some important values that can be supplied to the open() method. These parameters can
be combined or the “+” symbol can be added to specify that both read and write or read and append
modes are to be used.

Data can be written to the file using the write() file method. If the file was opened in “a” mode, data will be
added to the end of the file. Escape characters (sometimes called escape sequence) may need to be
added to the file for formatting. For example, the \n or \r\n escape characters will add line breaks to the
end of a line of data that has been written. The read() file method reads the contents of an open file object.
This is shown in Figure 3. In the figure, in input cell one, a new file is created. The file is then closed. In cell two,
the file is reopened in append mode. Three lines of text are written to the file. In cell three, the file is closed in
order to ensure that the text has been written to the file. The file is reopened in read mode, and the read() method is used to view the file. The file is also shown as it appears in a text editor that formats the text using the escape characters to create three separate lines.
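
A sketch of a similar sequence is shown below; the exact code in Figure 3 may differ, and the file name internet_speeds.txt is invented.

f = open("internet_speeds.txt", "w")       # create (or overwrite) the file
f.close()

f = open("internet_speeds.txt", "a")       # reopen the file in append mode
f.write("download,upload\n")               # \n adds a line break
f.write("42.1,11.3\n")
f.write("39.8,10.9\n")
f.close()                                  # flush buffers and release the file

f = open("internet_speeds.txt", "r")       # reopen the file in read mode
print(f.read())                            # view the three lines of text
f.close()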

INTERACTING WITH EXTERNAL APPLICATIONS

Python allows interaction with external applications and the operating system. In the first lab, you will install
and run an external application and gather data from it for analysis.

In Jupyter Notebooks, the “!” symbol allows direct interaction with the operating system. For example, Figure 1 shows two Linux commands that have been executed in a Jupyter notebook. Note that the commands begin with the “!” symbol.

Figure 2 illustrates the use of the subprocess module to communicate with an external application and
store the output of a command issued to that application in a Python object. First, a string object is created
to hold the command to be sent to the program. In this case, we intend to send a command to the ping utility that is available from the Linux shell. We then send that command, after splitting it into individual words, to the program using a subprocess method. Finally, we store the output of the command in a variable and split it into a list of strings. Then we can view the contents of the object with print, and address individual elements of it using indexing.
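
A sketch of this approach is shown below. It assumes a Linux system where the ping utility is available; the lab may use a different subprocess call, while this sketch uses subprocess.run.

# Run an external program and capture its output in a Python object.
import subprocess

command = "ping -c 4 8.8.8.8"              # the command we intend to send

# Split the string into individual words and run the external program,
# capturing its output as text.
result = subprocess.run(command.split(), capture_output=True, text=True)

lines = result.stdout.split("\n")          # split the output into a list of strings
print(result.stdout)                       # view the full output
print(lines[-2])                           # address an individual element of the output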

SQL

SQL has been discussed previously in Chapter 1. In the second lab in this chapter we work with SQLite to
create and modify external databases.

There are many ways to work with external files in Python. SQLite is an SQL implementation that works well with Python. Instead of using a client-server method of operation, SQLite works directly with a database file. A connection is established between Python and the SQL database by creating an SQL connection object. This object will have methods
associated with it. After creating the connection object, a method is used to create a cursor object. The
cursor object has SQLite methods available for executing SQL operations on the database. Many SQL
operations can be executed in this way.

BASIC SQL OPERATIONS

SQL is a language for interacting with databases and data tables. There are a number of dialects of SQL; however, some core operations are standard and should work similarly whether in SQLite, MySQL, or other
SQL implementations. SQLite can run in interactive mode, from a command line. Alternatively, a computer
language, like Python, can interact with SQLite through imported modules. The focus of this course will
mostly be on using Python to interact with SQLite.

In general, SQL can be said to be a language composed of three special purpose languages. The first is the
data definition language. It is used to create and manipulate the structure of SQL databases and tables.
Figure 1 shows some common commands from the SQL data definition language. The second is the data
manipulation language. It is used to add, remove, or transform data that is present in data tables. Finally,
there is the data query language. It is used to access data in data tables in order to generate information.
Figure 2 shows common data manipulation and data query language commands.
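
The following is a minimal sketch, using the sqlite3 module from the Python standard library, of how the three sub-languages might be exercised through the connection and cursor objects described earlier; the table name and values are invented.

# Data definition, data manipulation, and data query statements with sqlite3.
import sqlite3

conn = sqlite3.connect("example.db")       # connection object
cur = conn.cursor()                        # cursor object executes SQL

# Data definition language: create the table structure.
cur.execute("CREATE TABLE IF NOT EXISTS speeds (city TEXT, download REAL)")

# Data manipulation language: add data to the table.
cur.execute("INSERT INTO speeds VALUES (?, ?)", ("Boston", 42.1))
conn.commit()

# Data query language: access the data to generate information.
cur.execute("SELECT city, download FROM speeds WHERE download > 40")
print(cur.fetchall())

conn.close()
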
WORKING WITH PYTHON AND SQLite

The figure shows a sequence of commands that illustrate the basics of the SQLite operations that are done
in the lab. First, an external tool called csvkit needs to be installed in the operating system so that a CSV
file can be imported into a SQLite database. The figure illustrates the steps in the process of creating a
SQLite database, importing csv data into the database, executing a query on the data table, and viewing
the results of the query.

Note: In the figure, the "!csvsql --db ..." can be executed as the first command. This is an external tool that
needs to be installed in the OS. A command prompt can be used (Linux CLI) to execute this, but to simplify
things, this external command can be executed directly from a notebook by prefixing the command with "!".

SUMMARY

This chapter began by asking the question “What are analytics?” According to the Cisco VNI, IP traffic will
triple over the next 3 years. The emergence of these large data sets requires more advanced methods,
technologies, and infrastructure to process the data and convert it to actionable information. Data can no
longer be stored on a few machines nor be processed with one tool. There are many methodologies for
conducting data analysis, including the popular Cross Industry Standard Process for Data Mining
(CRISP-DM) used by more than 40% of data analysts. The Python programming language has become a
commonly used tool for handling and manipulating data.

The next section detailed the issues that surround using Big Data. Decision makers will increasingly rely on
data analytics to extract the required information at the right time, in the right place, to make the right
decision. Wise decisions rely on a well-established base of knowledge. Business intelligence encompasses
the entire process from data to information to knowledge to wisdom.

Descriptive analytics relies solely on historical data to provide regular reports on events that have already
happened. The scope of descriptive analytics is to summarize your data into more compact and useful
information. Predictive analytics attempts to predict what may happen next with a certain degree of
confidence, based on data and statistics. Prescriptive analytics predicts outcomes and suggests courses of action that will hold the greatest benefit for the enterprise or organization.

Some data analysis must also be carried out in real-time while the data is being collected. As the datasets
grow in volume, velocity and variety, the complexity of data storage, processing, and aggregation becomes
a challenge for traditional analytic tools. Large data sets may be distributed and processed across multiple,
geographically-dispersed physical devices as well as in the cloud. Data analytics allows businesses to
better understand the impact of their products and services, adjust their methods and goals, and provide
their customers with better products faster. The section closes with an explanation of the Data Analysis Lifecycle.

The next section of this chapter covered data acquisition and preparation. Files, the Internet, sensors, and
databases are all good sources of data. Because the data that will comprise the data set to be analyzed
can come from very diverse sources, it is not necessarily compatible when combined. In addition to
different data types, a single type of data can be formatted differently, depending on its source. Data tables
consist of fields, rows, and values that are similar to the columns, rows, and cells in a spreadsheet. Each
data table can be considered as a file, and a database as a collection of files. As mentioned earlier in this
topic, much of the data that is going to be placed in a database so that it can then be queried comes from
a variety of sources and in a wide range of formats. Extract, Transform and Load (ETL) is a process for
collecting data from this variety of sources, transforming the data, and then loading the data into a
database.

The next section discussed Big Data ethics. A quick search on the Internet will most likely reveal that the
ethical use of data continues to cause concern for many people. Several governments have regulations for
the appropriate use of personal data. Confidentiality, integrity and availability, known as the CIA triad, is a
guideline for data security for an organization. Confidentiality ensures the privacy of data by restricting
access through authentication and encryption. Integrity assures that the information is accurate and
trustworthy. Availability ensures that the information is accessible to authorized people. The responsibility of
data security now extends beyond the user to the cloud service providers. It is important for the user to
make sure that they are following security procedures by using a strong password policy and proper
authentication methods. The cloud service provider must also implement cloud security controls.

The final section of this chapter covered preparation for the Internet Meter labs and working with Python
and SQL.

CHAPTER 3: DATA ANALYSIS

EXPLORATORY DATA ANALYSIS

We add sensors and capture data from our networks, systems, and lives so that we can make data-driven
decisions that ultimately impact performance, the situation, or the environment. The data from the sensors and things is a critical element in providing the opportunities for change. As shown in the data analysis
lifecycle in the figure, data is changed from its raw format into information after it has been gathered,
prepared, analyzed, and presented in a usable format. A first step in creating the needed information is to
perform an exploratory data analysis.

Exploratory data analysis is a set of procedures designed to produce descriptive and graphical summaries
of data with the notion that the results may reveal interesting patterns. It is a process of discovery that
sometimes enables us to create a hypothesis about the data. It allows for the discovery of new questions to
be answered. Sometimes the purpose of an analysis is to answer specific questions. Other times, someone
may have a “hunch” or intuition about some phenomenon in relation to a set of data. An analyst may be
called upon to investigate the cause or effect of that phenomenon. Exploratory data analysis provides a
useful way to examine the data to determine if any relationships exist between the observed or collected
data or if there are problems in the data.

For example, an analyst for a chain of fast food restaurants is asked to examine negative Twitter comments
about the restaurants. These comments have been flagged as negative by a real-time sentiment analysis process. The analyst performs some descriptive analyses on the tweets to see what is happening in the
data. The analyst decides to investigate the time of day when the negative tweets are occurring. By plotting
the tweets versus the time of day, the analyst finds that the number of negative tweets that come in during breakfast time is disproportionately higher than the number generated during the rest of the day. This
basic exploratory analysis reveals that something regarding the breakfast offerings could be a problem but
it does not allow the analyst to make conclusions as to why this is occurring. Further analysis is needed to
understand the specific cause for this result. It could be a specific item on the breakfast menu that is
mentioned in the tweets, or other variables such as customer satisfaction with food quality, service, or
cleanliness.
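
A sketch of this kind of exploratory check with pandas (and matplotlib for the plot, both assumed to be installed) might look like the following; the file name negative_tweets.csv and the created_at field are hypothetical.

# Count negative tweets by hour of day and plot the result.
import pandas as pd
import matplotlib.pyplot as plt

tweets = pd.read_csv("negative_tweets.csv", parse_dates=["created_at"])

by_hour = tweets["created_at"].dt.hour.value_counts().sort_index()
print(by_hour)

by_hour.plot(kind="bar", title="Negative tweets by hour of day")
plt.show()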

ANALYZING IoT DATA

Gathering data is one of the first steps in performing an exploratory data analysis. No matter the type of
analysis to be done, IoT data provides special challenges. First, IoT data may come in large volumes and in
varying formats. Some data may be structured so that the nature and meaning of the data can be quickly
processed and understood. Other data may be unstructured and require considerable processing to be
made meaningful. Because considerable value can be derived from combining structured and
unstructured data for analysis, IoT data may require more advanced analytic tools. New technologies are
constantly being invented for the acquisition, storage, and computational analysis of Big Data.

In addition to the volume, another important aspect of IoT data is time as a variable. IoT data is frequently transmitted in real time or near real time. The data generated from observations of how a variable changes over time is called a time series. Examples of time series data are the air temperature measured at a weather station every minute or the electric power consumption of a home reported by the smart meter to the power grid every 15 minutes. Time series data is different from cross-sectional data, where the observations are made at one specific time across many different variables. Typically, data will be formatted in a
table, as shown in the figure. When the data points have timestamps, the order of the data in the tables
does not matter. This is because the data points can be sorted by their timestamps.

OBSERVATIONS, VARIABLES, AND VALUES

When performing any kind of experiment or analysis, it is critical to define the key characteristics that need
to be measured or observed to answer the questions posed or to create the hypothesis needed. These
characteristics to be studied are called variables. A variable is anything that varies from one instance to
another. Not only is a variable something that can be measured, but its value can also be manipulated or
controlled.

During an experiment or analysis, different variables and their associated values may be observed. The
recording of the values, patterns, and occurrences for a set of variables is an observation. The set of values
for that specific observation is called a data point. Each observation can be considered as a record in a
database or a row of data in an Excel spreadsheet of data. The collection of observations makes up the
data set for your analysis.

Because observations usually have a purpose, only some characteristics are relevant to that purpose. For
example, if you have lost your pet and have asked other people to help you search for it, only a small set of
characteristics are relevant to the observations. These characteristics might be:

● What type of animal is your pet? It is a dog.
● What type of dog? It is a Schnauzer.
● What color is your Schnauzer? It is gray.
● What size is the Schnauzer? It is a medium-sized Schnauzer.
● How much does the Schnauzer weigh? It weighs 15 kg.

As shown in the figure, the variables are the characteristics, such as breed, color, size and weight. All of
these characteristics are variables, because each can have multiple values. As people search for your dog,
data points are added for each observation. Because the purpose of your observations is to find your dog, observations that do not meet the required criteria are discarded.

TYPES OF VARIABLES

When looking for meaningful patterns in data, we frequently look for correlational relationships between
variables. All variables can be classified by the characteristic that is being studied. The variables will either
be categorical or numerical.

Categorical variables indicate membership in a particular group and have a discrete or specific qualitative
value. They are further classified into two types:

● Nominal – These are variables that consist of two or more categories whose value is assigned based
on the identity of the object. Examples are gender, eye color or type of animal.

● Ordinal – These are variables that consist of two or more categories in which order matters in the
value. Examples are student class rank or satisfaction survey scales (dissatisfied, neutral, satisfied).

Numerical variables are quantitative values:

● Continuous – These are variables that are quantitative and can be measured along a continuum or range of values. There are two types of continuous variables. Interval variables can have any value within the range of values; examples are temperature or time. Ratio variables are special-case interval variables where a value of zero (0) means that there is none of that variable; examples include income or sales volume.

● Discrete – These variables are quantitative but take a specific value from a finite set of values. Examples include the number of sensors activated in a network, or the number of cars in a lot.

Why is it important to know what type of variables are in your data set? Some statistical methods and data
visualizations are designed to work better with certain types of data than others. How the results of the
analysis are best displayed will depend on the type of variables used in the data. Some variables lend
themselves better to bar graphs while others may allow for more examination and discovery using a scatter
plot. Examples of some of the suggested types of graphs that represent the different types of variables can
be seen in the figure.

WHAT IS STATISTICS?

Now that the purpose for the analysis is defined and the variables and observations are gathered and
recorded, it is time to perform some statistical analysis. Statistics is the collection and analysis of data using
mathematical techniques. It also includes the interpretation of data and the presentation of findings.
Another use of statistics is to discover patterns or relationships between variables and to evaluate these
patterns to see how often they occur. Statistical findings are frequently judged by their relationship with
chance effects. In other words, what is the chance of something happening repeatedly under the same
conditions? For example, a hypothesis might be that variable x is related to a change in variable y. An
analysis reveals that a relationship does exist. However, variable y also changes when variable x does not
change. A question to be answered is, “How much of the change in variable y is in response to changes in
variable x and how much is due to other factors?” Statistics seeks to answer this question in order to
estimate effects in relation to chance or events that are not included in an analysis. If the results of the
analyses show high probabilities of recurrences, the findings of a study on one representative group can be
generalized to a much larger group.

The terms statistics and analytics are often interchanged, but are somewhat different. In general, analytics
embraces a larger domain of tools than statistics. Analytics uses the mathematical modeling tools in
statistics in addition to other forms of analysis, such as machine learning. It can also involve working with
very large data sets that include unstructured data.

POPULATIONS AND SAMPLES

Statistics focus on aspects of reality that are studied for a specific purpose. Those aspects of reality could
be aspects of people, or the content of tweets or Facebook posts. Statistics have been used extensively in
the social and life sciences. Some terms commonly used in statistics derive from this usage.

One such term is population. A population is a group of similar entities such as people, objects, or events
that share some common set of characteristics which can be used for statistical or investigative purposes.
It may be strange to think of tweets or Facebook posts as members of populations, but this is how they are
thought of for statistical analyses. The definition or structure of a given population varies. A population could
be “all living people” or “all tweets since August 1, 2015”. It is a large group of things that we are interested in
knowing more about.

It may not always be practical to study all living people or even “all tweets since August 1, 2015”. The practicalities of obtaining the required data from the entire population can make data gathering nearly impossible.
Instead, a representative group from the population can be used for analysis. This group is called a sample.
Samples are often chosen to represent the larger population in some way. If this is the case, special care
needs to be taken in selecting the sample in order to ensure that all the necessary characteristics of the
population are represented. A number of techniques are used for deriving samples from populations.

DESCRIPTIVE STATISTICS

After the problem statement (or the questions to be asked) is determined, and a population has been
defined, some form of analysis or statistics are needed. There are two key branches of statistics that we will
discuss in this course:

● Descriptive Statistics
● Inferential Statistics

Descriptive statistics are used to describe or summarize the values and observations of a data set. For
example, a fitness tracker logged a person’s daily steps and heart rate for a 10-day period. If the person met
their fitness goals 6 out of the 10 days, then they were successful 60% of the time. Over that 10-day period,
the person’s heart rate may have been a maximum of 140 beats per minute (bpm) but an average of 72
bpm. Information about counts, averages, and maximums are some of the ways to describe and simplify
the data set that was observed.
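
The following is a small sketch of these descriptive summaries using the standard statistics module; the ten days of fitness-tracker values are invented for illustration.

# Summarize ten days of invented fitness-tracker data.
import statistics

heart_rate_avg_per_day = [68, 72, 75, 70, 71, 74, 69, 73, 72, 76]
goal_met = [True, False, True, True, False, True, False, True, True, False]

print("Days observed:", len(goal_met))
print("Success rate:", sum(goal_met) / len(goal_met) * 100, "%")
print("Average heart rate:", statistics.mean(heart_rate_avg_per_day), "bpm")
print("Maximum heart rate:", max(heart_rate_avg_per_day), "bpm")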

Basic descriptive statistics may include the number of data points in a data set, the range of values that
exist for numeric data points, or the number of times various values appear in a data set, among others.
Additionally, descriptive statistics include values that summarize the data set in various ways. It may answer
questions such as:

● How widely dispersed is the data?
● Are there values that occur more often than others?
● What is the smallest or largest value?
● Are there particular trends occurring?

The answers to these questions can be provided in numerical and graphical formats. Results of descriptive
statistics are often represented in pie charts, bar charts or histograms. This helps to visualize the data
better.

One important point to note is that while descriptive statistics describe the current or historical state of the
observed population, they do not allow for comparison of groups, conclusions to be drawn, or predictions to
be made about other data sets that are not in the population. In the fitness tracker example, we cannot infer
that the person has poor health because they were only successful in meeting their goal 60% of the time.
We also cannot use the data set for this one person to predict the fitness performance for others with
similar characteristics. This is where inferential statistics becomes important.

INFERENTIAL STATISTICS

Descriptive statistics allow you to summarize findings based on data that you already have or have
observed about a population. But there are situations in which gathering data for a very large population
may not always be practical or even possible. For example, it may not be possible to study every person in
the world in order to discover the effects of a new drug that is under development. However, it is possible to
study a smaller, representative sample of a population and use inferential statistics to test hypotheses and
draw conclusions about the larger population.

Inferential statistics is the process of collecting, analyzing and interpreting data gathered from a sample to
make generalizations or predictions about a population. Because a representative sample is used instead
of actual data from the entire population, concerns that the particular groups chosen for the study, or the
environment in which a study is carried out, may not accurately reflect characteristics of the larger group
must be addressed. When using inferential statistics, questions of how close the inferred data is to the
actual data and how confident we can be in the findings must be answered. Typically, these types of
analyses will include different sampling techniques to reduce error and increase confidence in the
generalizations about the findings. The type of sampling technique used will depend on the type of data.

STATISTICS AND BIG DATA

Different statistical approaches are used in Big Data analytics. As we know, descriptive statistics describe a
sample. This is useful for understanding the sample data and for determining the quality of the data. When
dealing with large amounts of data that come from multiple sources, many problems can occur.
Sometimes data points can be corrupted, incomplete, or missing entirely. Descriptive statistics can help
determine how much of the data in the sample is good for the analysis and identify criteria for removing
data that is inappropriate or problematic. Graphs of descriptive statistics are a helpful way to make quick
judgements about a sample.

For example, a sample of tweets may be selected for analysis. Some tweets in the sample contain only characters, while other tweets contain characters and images. Deciding whether to analyze tweets that contain images or tweets with no images gives you a very simple criterion for identifying which tweets are invalid. Data points that do not meet this basic criterion would be removed from the sample before the analysis continues.

A number of types of inferential and machine learning analysis are very commonly used in Big Data
analytics:

● Cluster – Used to find groups of observations that are similar to each other

● Association – Used to find co-occurrences of values for different variables

● Regression – Used to quantify the relationship, if any, between the variation of one variable and the variation of one or more other variables

In machine learning, computer software is either provided with a set of rules, or derives its own rules, that are used to perform an analysis. Machine learning techniques can require a lot of processing power and have only
become viable with the availability of parallel processing.
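
As a brief illustration of the cluster analysis mentioned in the list above, a minimal scikit-learn sketch might look like the following; the library is assumed to be installed, and the two-dimensional observations are invented.

# Group similar observations into clusters with k-means.
from sklearn.cluster import KMeans

# Each observation has two variables, e.g. daily steps (thousands) and resting heart rate.
observations = [[3, 78], [4, 75], [5, 74], [11, 62], [12, 60], [13, 58]]

# Group the observations into two clusters of similar data points.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(observations)

print(labels)                    # cluster assigned to each observation
print(model.cluster_centers_)    # the center of each cluster
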
DISTRIBUTIONS

There are multiple ways to summarize the data using descriptive statistics. You can look for the actual
distribution of the data, measures of central tendency, or measures of range. At a basic level, distribution is
a simple association between a value and the number or percentage of times it appears in a data sample.
Distributions are useful for understanding the characteristics of a data sample. The figure shows a table
consisting of two fields. One field contains a variable, and the other consists of a statistic that describes the
value of that variable. In this example, ten students have taken a ten-point quiz. The score for each student
is shown in the Raw Score by Student table. When the teacher analyzes the scores, a distribution of scores is
created as shown in the second table. This expresses the number of times that a score occurred in the
class. The probability of the score occurring is expressed as a ratio of the frequency of the score to the total
number of scores.

Frequency distributions consist of all the unique values for a variable and the number of times each value occurs in the data set. In probability distributions, the proportion of times each value occurs in the data is used instead of the frequency.
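
The following is a minimal sketch, using hypothetical quiz scores similar to the ten-student example above, of how a frequency distribution and a probability distribution might be computed with pandas. The score values are invented for illustration.

import pandas as pd

# Hypothetical quiz scores for ten students on a ten-point quiz
scores = pd.Series([7, 8, 8, 9, 6, 8, 7, 10, 9, 8])

# Frequency distribution: each unique score and how many times it occurs
print(scores.value_counts().sort_index())

# Probability distribution: the proportion of times each score occurs
print(scores.value_counts(normalize=True).sort_index())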

A histogram gives an immediate picture of the distribution of a dataset. For a discrete variable, each bin of the histogram is assigned to a specific value. For a continuous variable, each bin is associated with a range of values. In both cases, the height of the bin represents the number of times the variable takes that specific value or falls into that range, respectively.

The histogram representation of the data distribution can take any shape. In the case of the continuous
variable, the shape will also depend on the width of the bins, i.e. their range. Some shapes can be modelled
using well-defined functions, which are called probability distribution functions.

Probability distribution functions allow for representing the shape of the whole dataset distribution using
only a small set of parameters, such as the mean and the variance, which will be explained later in the
chapter. A probability distribution function that is particularly well suited to represent many events occurring in nature is the Gaussian, or normal, distribution, which is symmetrical and bell-shaped.

Other distributions are not symmetrical. The peak of the graph could be either to the left or to the right of
center. This property of a distribution is called skew. Some distributions will have two peaks and are known
as bimodal. The right and left ends of the distribution graph are known as the tails.

CENTRALITY

One very commonly used characteristic of distributions is the measure of central tendency. Measures of central tendency express the value of a variable that is closest to the central position in a distribution of data. The common measures of centrality are the mean, median, and mode. The mode of a data sample is the value that occurs most often. These measures are illustrated in Figure 1. In many distributions, values that are closer to the center occur with greater frequency.

The mean, also known as the average, is the most well-known measure of central tendency. It takes into
account all of the values in a data set and is equal to the sum of all the data values divided by the number
of values in the data set. Although the mean is very commonly used in everyday life, it is typically not the
best measure of the most representative value for a distribution. For example, if there are unusually high or
low values in the distribution, the mean can be highly influenced by those extreme values, also called
outliers. Depending on the number of outliers in the data set, the mean, or average, is “skewed” or changed
in one direction or another.

The median is the middle value in the data set after the list of values has been ordered. As shown in Figure 2, the median is not sensitive to extreme values. Even when an outlier replaces one of the values in the ordered list, the midpoint of the list, and therefore the median, remains essentially the same. The mean behaves differently: depending on the number and size of outliers in the data set, the mean is pulled, or skewed, in one direction or another.

In addition to outliers, the type of variable used in the data set will also impact which measure of central
tendency is best used to represent the data. As shown in Figure 3, the mean or average is best used when
the data is interval data that is not skewed.
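
As a sketch, and assuming a small invented sample that contains one outlier, the three measures of central tendency could be computed in pandas as follows.

import pandas as pd

# Hypothetical sample containing one extreme value (the outlier 42)
values = pd.Series([3, 4, 4, 5, 5, 5, 6, 7, 42])

print(values.mean())    # the average is pulled upward by the outlier
print(values.median())  # the middle value of the ordered list is not affected by the outlier
print(values.mode())    # the most frequent value (here 5); pandas returns this as a Series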

DISPERSION

While the mean, or average, is commonly used to describe many distributions, it leaves out an important part
of the picture, which is the variability in the distribution. For example, we know that outlier values can distort
the mean. The median gets us closer to what is central in the distribution, however we still do not know how
spread out the values in the sample are.

The most basic way of describing variability in a sample is by calculating the difference between the
highest and lowest values for a variable. This statistic is known as the range. It is always useful to have an
idea of what the highest and lowest values are for a variable as a basic way to know if the data makes
sense.

The variance (σ²) of a distribution is a measure of how far each value in a data set is from the mean. Related to the variance is the standard deviation (σ), which is the square root of the variance. The standard deviation is used to standardize distributions as part of the normal curve, as shown in Figure 1. Figure 2 shows how standard deviation values relate to centrality. The more data points that are centered around the mean, the lower the standard deviation. The standard deviation values become higher as the distribution becomes more spread out.
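
A brief sketch, using two invented groups of test scores with the same mean but different spreads, shows how the range, variance, and standard deviation could be computed with pandas.

import pandas as pd

# Hypothetical test scores: both groups have the same mean but a different spread
group_a = pd.Series([70, 72, 74, 76, 78])
group_b = pd.Series([50, 60, 74, 88, 98])

for name, group in [("A", group_a), ("B", group_b)]:
    data_range = group.max() - group.min()  # highest value minus lowest value
    print(name, "mean:", group.mean(), "range:", data_range,
          "variance:", round(group.var(), 2), "std dev:", round(group.std(), 2))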

Comparing standard deviations between two samples on the same measure can help to tell the story of
what is occurring. For example, if the mean on test scores in one school is higher than that for the same test
in another school, it would be natural to assume that the students in the first school are higher achievers than those in the second school. However, standard deviations can add an extra layer of interpretation to the story. If the standard deviation for the first school is higher, its distribution of scores is more spread out and more students are scoring at the extremes of the distribution. It is possible that
a small group of very high scorers has influenced the mean. Further investigation shows that a special
program for gifted students at the school has elevated the mean by pushing it away from the median.

USING PANDAS

Pandas is an open source library for Python that adds high-performance data structures and tools for
analysis of large data sets. Pandas is easy to use and is very popular for adding extra capabilities to Python
for data analysis. A link to the pandas project is shown in Figure 1.

Pandas data structures include the series and dataframe structures. Dataframes are the primary pandas
structure and are the most commonly used. We will use pandas dataframes often within this course. A
dataframe is like a spreadsheet with rows and columns. In addition, dataframes can have an optional index and optional column names, which act as labels for the rows and columns.

Dataframes are easily built from a range of other data structures and external files, such as csv. A wide
range of methods are available to dataframe objects. Rows and columns can be manipulated in various
ways and operators are available to perform mathematical, string, and logical transformations to
dataframe contents. Figure 2 shows the components of a dataframe.

Pandas is imported into a Python program using import, like other modules. It is conventional to use
import pandas as pd to make reference to pandas components easier to type. Figure 3 shows the code
required to create the dataframe that is shown in Figure 2.
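
The exact dataframe from the figures is not reproduced here, but a minimal sketch of creating a dataframe from a Python dictionary, using invented telephone directory entries, looks like the following.

import pandas as pd

# Hypothetical telephone directory data; the column names are illustrative only
data = {
    "name": ["Ana", "Bo", "Cai"],
    "phone": ["555-0101", "555-0102", "555-0103"],
}

directory_df = pd.DataFrame(data)
print(directory_df)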

IMPORTING DATA FROM FILES

Large data sets are compiled from various sources and may exist as different kinds of files. Creating a
pandas dataframe by coding the data values individually is not very useful for analyzing Big Data.

Pandas includes some very easy to use functions for importing data from external files, such as csv, into
dataframes. We will recreate the telephone directory dataframe, this time from a larger csv file. Pandas
includes a function called read_csv() for this purpose, which returns a dataframe.

The figure illustrates the process of importing data from an external csv file into pandas. The procedure is as follows, and a code sketch appears after the list:

● Step 1. Import the pandas module.


● Step 2. Verify that the file is available from the current working directory. In this case the head Linux
command is used to verify the file and preview its contents.
● Step 3. To import the file into a dataframe object use the pandas read_csv() method. In this case the
dataframe object is called directory_df.
● Step 4. Use the pandas info() dataframe method to view a summary of the file contents.
● Step 5. Display the dataframe. In this case the head() method was used to display the headings,
index, and values for the first five rows.
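
A sketch of these steps in Python is shown below. The file name directory.csv is a placeholder; substitute the actual file used in the lab.

import pandas as pd  # Step 1: import the pandas module

# Step 2 (verifying the file with the Linux head command) is done in the shell, not in Python.

# Step 3: import the csv file into a dataframe object
directory_df = pd.read_csv("directory.csv")

# Step 4: view a summary of the contents (columns, datatypes, and non-null counts)
directory_df.info()

# Step 5: display the headings, index, and values for the first five rows
print(directory_df.head())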

IMPORTING DATA FROM THE WEB

It is very easy to import data from the web with pandas. While there are many application program
interfaces (APIs) available for accessing web data, including streaming data, static data sets can also be
accessed from the Internet based on the URL of the file. In the example shown in the figure, a data set is imported into a dataframe from the extensive collection at the Humanitarian Data Exchange. This website is an excellent resource for people interested in exploring data related to international humanitarian concerns. In this case, we import a data set containing information regarding the percentage of women serving in
national parliaments for a series of nations over a period of years. Information about this data set can be
found here. The data set can be downloaded here.

The process is simple, and a code sketch follows the steps:

● Step 1 Import pandas.


● Step 2. Create a string object to contain the URL of the file.
● Step 3. Import the file into a dataframe object using the pandas read_table() method. read_table()
is essentially the same as the read_csv() method, but it allows use of different delimiters. In this
case, we specify the comma as the separator to illustrate how a separator is specified for this
method.
● Step 4. Verify import with head() and info(). Note that the info() output indicates a number of
missing values (null entries), which is the difference between the total number of entries and the
number of non-null entries for each year.
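
The following sketch mirrors these steps. The URL shown is a placeholder, because the actual address of the file at the Humanitarian Data Exchange is not reproduced here.

import pandas as pd  # Step 1: import pandas

# Step 2: create a string object to contain the URL of the file (placeholder address)
url = "https://example.org/women_in_parliaments.csv"

# Step 3: import the file, specifying the comma as the separator
parliament_df = pd.read_table(url, sep=",")

# Step 4: verify the import and check for missing (null) values
print(parliament_df.head())
parliament_df.info()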

There are many sources of data on the Internet. For example, sites like Google and Twitter have APIs that
allow for the connection of Python programs to live streaming data. Numerous other databases exist online
which can be directly addressed and inserted into pandas dataframes using a range of pandas methods
and the associated parameters.

DESCRIPTIVE ANALYSIS IN PANDAS

Pandas provides a very simple way of viewing basic descriptive statistics for a dataframe. The describe()
method for dataframe objects displays the following for numeric data types:

● count – This is the number of values included in the statistics.


● mean - This is the average of values.
● std - This is the standard deviation of the distribution.
● min - This is the lowest value in the distribution.
● 25% - This is the value of the first quartile. 25% of the values are at or below this value.
● 50% - This is the value for the second quartile. 50% of the values are at or below this value. This is also
the median value.
● 75% - This is the value for the third quartile. 75% of the values are at or below this value.
● max - This is the highest value in the distribution.

In the example shown in the figure, the same data set has been used as in the previous page. However, this
time only the first, second, and seventh columns have been imported into the dataframe. This shows the
country name and value for the years 2015 and 2010. The describe() method is called on the resulting
dataframe, and descriptive statistics are shown for the two years. This allows a quick comparison of the
data over a five year period.
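
As a small self-contained sketch, the invented values below stand in for the country name and the 2015 and 2010 columns described above; describe() then reports the summary statistics for the two numeric columns.

import pandas as pd

# Hypothetical subset of the data: country name plus values for 2015 and 2010
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "2015": [19.9, 41.3, 27.7, 9.5],
    "2010": [17.0, 39.6, 24.3, 8.8],
})

# count, mean, std, min, 25%, 50%, 75%, and max for the numeric columns only
print(df.describe())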

CORRELATION VS. CAUSATION

“Correlation does not imply causation” is a statement commonly heard about interpreting statistical
analyses. This is because people tend to confuse the two. What is causation? What is correlation?

Both causation and correlation are types of relationships between conditions or events. Causation is a
relationship in which one thing changes, or is created, directly because of something else. For example, an
increase in global temperature causes a decrease in the Arctic ice cap. This is an intuitive relationship between phenomena. An increase in global temperature may also result in a decrease in the consumption of wool
for use in creating warm garments. This too is an intuitive relationship. The warmer the climate becomes,
the less demand there will be for warm clothing.

Correlation is a relationship between phenomena in which two or more things change at a similar rate. For
example, both the extent of the Arctic ice cap and the consumption of wool are decreasing. These two things can
be said to positively correlate because they change at a similar rate, and in a similar direction (they both
decrease).
It is a problem, however, if we say that one of these things causes the other. In other words, a decrease in wool consumption cannot be seen to cause a decrease in Arctic ice. This is pretty clear. It is less obvious, but equally incorrect, to say that a decrease in Arctic ice causes a decrease in wool consumption. Both phenomena share
the same cause, but the decrease in Arctic ice is unlikely to make people feel so much warmer that they do
not need warm clothes, especially in places very far from the Arctic, like South Africa.

Some people have made an industry out of identifying and sharing spurious correlations. Many unrelated
phenomena change similarly over the same period of time. A number of examples have been collected by
Tyler Vigen on his website. Tyler has also written a book that collects many examples of incorrect
correlations.

CORRELATION COEFFICIENT

Correlations can be positive or negative. Positively correlated quantities change in the same direction. If one
quantity increases the other increases to a similar degree. Negative correlation occurs when quantities
change in some similar proportion but in opposite directions. In other words, if one increases, the other
decreases similarly.

Correlations between quantities can be quantified using statistical approaches. The most commonly used
statistic for expressing correlation is the Pearson product-moment correlation coefficient. This is also known
as the Pearson r. The Pearson r is a quantity that is expressed as a value between -1 and 1. Positive values
express a positive relationship between the changes in two quantities. Negative values express an inverse
relationship. The magnitude of the value, whether positive or negative, indicates the degree of correlation. In other words, the closer the value is to 1 or -1, the stronger the relationship. A value of 0 indicates no relationship. The
figure shows scatterplots of small data sets that have low, positive, and negative correlation.

An excellent interactive resource for visualizing samples with different correlation coefficient values can be
found here.

CORRELATION IN PANDAS

Pandas has some wonderful statistical methods available for dataframes that are very easy to use. You
have seen that the dataframe describe() method quickly generates summary descriptive statistics. The
corr() method is also easy to use.

In the example shown in the figure, a small data set has been created from a larger data set which
describes the demographics of the population of a series of cities. The data has been simplified to contain
only two fields, the percent of people living below the poverty threshold of monetary income, and the
percentage of people who are unemployed. It should be no surprise that these two fields show a strong correlation. In other words, we would expect that in a city where many people are living in poverty, unemployment would also be high.

In the pandas example in the figure, a csv file containing the data is imported into a dataframe. The data is
quickly checked by using the head() and describe() methods to verify that the import worked as expected.
Finally, the corr() method is called for the dataframe. The result is displayed in a correlation table. We can
see that with a correlation coefficient of 0.73, unemployment does indeed have a strong relationship to
poverty.
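
A minimal sketch of this workflow, with invented city values rather than the actual data set, is shown below.

import pandas as pd

# Hypothetical city data: percent below the poverty threshold and percent unemployed
cities_df = pd.DataFrame({
    "poverty_pct": [12.3, 25.1, 8.7, 30.4, 15.9, 22.0],
    "unemployment_pct": [5.1, 9.8, 4.2, 11.5, 6.3, 8.9],
})

print(cities_df.head())       # quick check of the values
print(cities_df.describe())   # summary descriptive statistics

# Pearson correlation coefficients between every pair of numeric columns
print(cities_df.corr())
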
APPROPRIATE VISUALIZATION

Correlations can be calculated for multiple variables simultaneously. This will result in correlation
coefficients being computed between all the fields supplied to the dataframe. The results for this can be a
large table of correlation coefficients. A visualization called a heat map is useful for understanding how
values for correlation coefficients relate to one another.

As we saw in the previous example, scatterplots are useful for quickly visualizing possible correlation in a
data set. In a heat map, the fields in the data form the horizontal and vertical labels for a grid of values. Each cell in the grid holds the correlation coefficient for the pair of fields at that intersection of the horizontal and vertical dimensions.
To further aid in interpretation of the data, the correlation values are color-coded. The intensity or hue of the
color for each value is proportional to that value. For example, all negative correlation coefficients might be
presented in a shade of red and all positive ones in a shade of blue. The deeper the color, the closer to 1 or -1
the value is. This helps to bring meaning from the correlation data.
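
One way to draw such a heat map is sketched below. The seaborn plotting library is an assumption here; it is not required by the course, and the data values are invented.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # assumption: seaborn is installed alongside pandas and matplotlib

# Hypothetical dataframe with several numeric fields
df = pd.DataFrame({
    "poverty": [12, 25, 9, 30, 16],
    "unemployment": [5, 10, 4, 12, 6],
    "income": [52, 31, 60, 27, 45],
})

# Color-code the table of correlation coefficients; deeper colors are closer to 1 or -1
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, cmap="RdBu")
plt.show()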

ISSUES WITH DATA QUALITY

This section focuses on some of the skills that are necessary for completing the final lab in this chapter. This
lab focuses on basic analysis, reporting, and visualization of Internet meter data similar to that collected in
previous labs. Before the analyses and visualizations can be completed, the data must be prepared.

So far it has been relatively easy to bring a data set into pandas. You choose a portion of that data, and act
on it to derive some meaningful analysis, like descriptive statistics and simple correlations. This apparent ease can be misleading. More often than not, the data sets that you will work from will have incompatibilities, like
improper or inconsistent formatting, unwanted information, and even missing portions of information in the
data. It is the preliminary task of the data analyst to clean the data in the data set.

Cleaning data can involve removing missing or unwanted values, or altering the format of the values to
make them consistent. For example, you are expecting a data set to return data in the format of integers,
but a few of the values are returned in the form of floats or strings. How will you deal with the errant values
that are incorrectly formatted? You will need to convert or clean those particular instances of data.

DEALING WITH MISSING DATA

An example of a data set that needs preliminary cleaning is a data set with the presence of NaN values.
NaNs (Not a Number) are used to represent data that is undefined or cannot be represented. Pandas refers
to missing data as NaN values, which are also commonly referred to as NA values. NaNs can make data
analysis functions abruptly terminate during calculations, throw errors, or produce incorrect results. NaNs
can also be purposely used to uniformly represent all of the pieces of information that are missing from the
data set, either incorrect or null values, or data that is simply not present. Many data sets have missing data
because the data failed to be collected correctly, or was missing from the start. Another common cause of
NaNs is the reindexing of the data in a data set. See the example in Figure 1.

Missing values can take different forms based on the datatype. The pandas datatypes are: object/string, int64/integer, float64/float, and datetime64/timestamp. NaN is used for undefined strings, integers, and floats, and NaT is used for timestamps. There may also be situations where the Python value None represents missing data.
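
A short sketch of spotting and handling NaN values in a hypothetical dataframe follows; the column names and values are invented.

import numpy as np
import pandas as pd

# Hypothetical Internet meter readings with some missing values
df = pd.DataFrame({"speed": [22.1, np.nan, 19.8, 21.5],
                   "ping": [31, 28, np.nan, 30]})

print(df.isnull().sum())     # count the missing values in each column
print(df.dropna())           # drop every row that contains a missing value
print(df.fillna(df.mean()))  # or fill the missing values, here with the column mean
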
CONVERTING DATA TYPES

Pandas has many built-in functions for converting the datatypes. As mentioned before, the pandas
datatypes are: object/string, int64/integer, float64/float, and datetime64/timestamp. In this example,
you convert datatypes in a sample data set. Open a new iPython Notebook and enter the following
commands. In Figure 1, notice how the data2 data set is made up of integers, strings, and floats. Also, notice
how each column is represented by a datatype. Now you will change column 2 from an object/string
datatype to a numeric datatype.

In Figure 2, the convert_objects() function converts column 2 from a string/object to a numeric datatype. In Figure 3, notice how column 2 ended up converting to floats. This was due to the presence of the '0.33' string
in column 2. If the string had been ‘33’ then the column would have converted to integers. If the string had
been ‘x’ instead of ‘0.33’ the conversion would have thrown an error due to the inability to convert ‘x’ to a
numeric value, integer or float.
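
The course figures use the convert_objects() function; in current pandas releases that function has been removed, so the sketch below uses pd.to_numeric() instead, with an invented dataframe.

import pandas as pd

# Hypothetical dataframe in which column 'two' holds numbers stored as strings
df = pd.DataFrame({"one": [1, 2, 3], "two": ["0.33", "0.5", "1.0"]})
print(df.dtypes)  # column 'two' starts out as an object (string) datatype

# Convert the strings to a numeric datatype; the decimal value '0.33' forces float64
df["two"] = pd.to_numeric(df["two"])
print(df.dtypes)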

MANIPULATING DATAFRAMES

Cleaning the data set is a preliminary task before data analysis can take place. Manipulating a
2-dimensional dataframe using pandas in Python can involve dropping, adding, or renaming data columns
or rows. To do this, invoke the drop() function, the loc() function and the rename() function.

To drop a column of data, invoke the drop() function or the del command. In Figure 1, a simple data set is created. In the following figures, columns and rows will be dropped using the drop() function and added using label assignment and the loc() method.

In Figure 2, the column named 'one' is removed using the drop() function. The axis parameter tells drop() whether to remove rows or columns: axis=0 refers to the row index, and axis=1 refers to the columns, so axis=1 is used when dropping a column.

The same result can be achieved with the del command.

del df3['one']

Now drop the first two rows (0,1) with the following drop() function, where [0,1] are the labels of the rows that will be dropped, and axis=0 indicates that rows, identified by the index on the far left, are being dropped (Figure 3).

You can add a column by assigning it a label. You can also assign a value (Figure 4).

To add a row, you can use the location or loc() method. In Figure 5, the location method also finds the
maximum index or last row number and then adds 1 to it, creating row 3.

If you simply call the loc() function and pass it an index number that is not already in use, the row will be appended as a new bottom row (Figure 6). Notice how the row that was added is numbered 1 even though it is the last row.

You can change the index by assigning new index values with the dataframe index property (Figure 7).

Along with the drop() and loc() functions, pandas also offers a rename() function. You can use the rename() function to rename the column labels to 'one', 'two', and 'three' respectively. To do this, pass rename() the columns as key:value pairs in which each old name is the key and the corresponding new name is the value (Figure 8).
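
The sketch below gathers these operations, applied to a small invented dataframe named df3, in one place.

import pandas as pd

# Hypothetical dataframe with three labeled columns
df3 = pd.DataFrame({"one": [1, 2, 3], "two": [4, 5, 6], "three": [7, 8, 9]})

df3 = df3.drop("one", axis=1)    # drop a column (axis=1 refers to columns)
df3 = df3.drop([0, 1], axis=0)   # drop the first two rows (axis=0 refers to the index)

df3["four"] = 0                   # add a column by assigning it a label and a value
df3.loc[df3.index.max() + 1] = 0  # add a row after the current highest index value

# rename columns by passing old:new key:value pairs
df3 = df3.rename(columns={"two": "alpha", "three": "beta"})
print(df3)
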
BASIC DATA STATISTICS

Pandas has built-in functions for running statistical analysis on data sets, including functions for computing
averages, standard deviations, and correlations.

In Figure 1, a dataset and a pandas dataframe are created using the data4 array of numbers. The dataframe is printed to the screen by running the code cell.

In Figure 2, the mean, or average, of all of the numbers is calculated using the dataframe mean() method.
The mean is 6.863636. Using dot syntax, the result can also be rounded to the nearest whole number by
attaching the round() method after the mean().

In Figure 3, the mean is calculated manually by using the sum() method to add all of the numbers in the dataset and dividing by the result of the count() method, which counts the number of items in the dataset.

In Figure 4, the median, or middle, value is calculated. If the dataset has an odd number of items, the median is the middle number after the values are sorted in numerical order. If the dataset has an even number of items, the median is the mean of the two middle numbers.

The second box shows the std() method for calculating the standard deviation. The standard deviation
shows the amount of variation in a set of data values. A low standard deviation indicates that the numbers
in the dataset tend to be close to the mean. The standard deviation is found by taking the square root of the
average of the squared deviations of the values from the average or mean value in the dataset.
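
The following sketch reproduces these calculations on an invented list of numbers standing in for the data4 array in the figures.

import pandas as pd

# Hypothetical stand-in for the data4 array of numbers
data4 = [2, 4, 4, 4, 5, 5, 7, 9, 10, 11, 12.5]
df = pd.DataFrame({"values": data4})

print(df["values"].mean())                        # average of the values
print(round(df["values"].mean()))                 # rounded to the nearest whole number
print(df["values"].sum() / df["values"].count())  # the mean computed manually
print(df["values"].median())                      # middle value of the ordered list
print(df["values"].std())                         # standard deviation (spread around the mean)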

SUMMARY

This chapter began with a large section about exploratory data analysis. Exploratory data analysis is a set of
procedures designed to produce descriptive and graphical summaries of data with the notion that the
results may reveal interesting patterns. When performing any kind of experiment or analysis, it is critical to
define the key characteristics that need to be measured or observed to answer the questions posed or to
create the hypothesis needed. These characteristics to be studied are called variables. A variable is
anything that varies from one instance to another. Not only is a variable something that can be measured,
but its value can also be manipulated or controlled. Categorical variables indicate membership in a
particular group and have a discrete or specific qualitative value. They are further classified into two types:
nominal and ordinal. Numerical variables are quantitative values that are further categorized as either
continuous or discrete.

Statistics is the collection and analysis of data using mathematical techniques. It also includes the
interpretation of data and the presentation of findings. Another use of statistics is to discover patterns or
relationships between variables and to evaluate these patterns to see how often they occur. The terms
statistics and analytics are often interchanged, but are somewhat different. In general, analytics embraces
a larger domain of tools than statistics. Analytics uses the mathematical modeling tools in statistics in
addition to other forms of analysis, such as machine learning. Analytics can also involve working with very
large data sets that include unstructured data.

A population is a group of similar entities such as people, objects, or events that share some common set of
characteristics which can be used for statistical or investigative purposes. Descriptive statistics are used to
describe or summarize the values and observations of a data set. Inferential statistics is the process of
collecting, analyzing and interpreting data gathered from a sample to make generalizations or predictions
about a population. If a population is too large to be used, a representative group from the population can
be used for analysis. This group is called a sample. Samples are often chosen to represent the larger
population in some way.

A number of types of inferential and machine learning analysis are very commonly used in Big Data
analytics:

● Cluster – Used to find groups of observations that are similar to each other
● Association – Used to find co-occurrences of values for different variables
● Regression - Used to quantify the relationship, if any, between the variations of one or more variables

At a basic level, distribution is a simple association between a value and the number or percentage of times
it appears in a data sample. Distributions are useful for understanding the characteristics of a data sample.
One characteristic of distributions that is very commonly used is measures of central tendency. These
measures express the values that a variable has that is closest to the central position in a distribution of
data. The common measures of centrality are the mean, median, and mode. The standard deviation is used
to standardize distributions as part of the normal curve. The more data points that are centered around the
mean, the lower the standard deviation. The standard deviation values are higher as the distribution
becomes more spread out.

Pandas is an open source library for Python that adds data structures and tools for analysis of large data
sets. Pandas includes some very easy to use functions for importing data from external files, such as csv,
into dataframes. Pandas provides a very simple way of viewing basic descriptive statistics for a dataframe.
The describe() method for dataframe objects displays the following statistics for numeric data types: count, mean, std, min, 25%, 50%, 75%, and max.

The section closes with an explanation of both causation and correlation. Both are types of relationships
between conditions or events. Causation is a relationship in which one thing changes, or is created, directly
because of something else. Correlation is a relationship between phenomena in which two or more things
change at a similar rate. Correlations can be positive or negative. Positively correlated quantities change in
the same direction. If one quantity increases the other increases to a similar degree. Negative correlation
occurs when quantities change in some similar proportion but in opposite directions. In other words, if one
increases, the other decreases similarly. A visualization called a heat map is useful for understanding how
values for correlation coefficients relate to one another. Scatterplots are useful for quickly visualizing
possible correlation in a data set. In a heat map, the fields in the data form horizontal and vertical labels for
a grid of values.

The final section of this chapter covered preparation for the Internet Meter labs and basic analysis with
pandas. More often than not, the data sets that you will work from will have incompatibilities, like improper
or inconsistent formatting, unwanted information, and even missing portions of information in the data. It is
the preliminary task of the data analyst to clean the data in the data set.

Cleaning data can involve removing missing or unwanted values, or altering the format of the values to
make them consistent. An example of a data set that needs preliminary cleaning is a data set with the
presence of NaN (Not a Number) values. NaNs are used to represent data that is undefined or cannot be
represented. NaNs can make data analysis functions abruptly terminate during calculations, throw errors, or
produce incorrect results. Pandas has many built-in functions for converting the datatypes. Pandas has
built-in functions for running statistical analysis on data sets, including functions for computing averages,
standard deviations, and correlations.

CHAPTER 4: ADVANCED DATA ANALYSIS AND MACHINE LEARNING

MACHINE LEARNING: LOOKING AHEAD

As we know, Big Data is characterized by its volume, velocity, variety, and veracity. These characteristics
distinguish Big Data from the data which has been traditionally used to make decisions in a pre-Big Data
world. These four Vs present not only considerable challenges, but also enormous opportunities in many
areas.

The four Vs of Big Data offer great promise for analytics. The ever-evolving platforms and tools offer the
ability to develop models that learn from existing patterns among variables and that apply those models to
reliably predict what will happen in the future. In addition, using these models, simulations can be
developed to not only answer the question, “What will happen?” but also to answer the question “How
should we act?” to respond to trends.

Machine learning addresses the challenges and opportunities presented by Big Data analytics to model
existing data in order to predict future outcomes. Google has created an excellent video that highlights
some of the benefits of machine learning and the promise that it offers.

WHAT IS MACHINE LEARNING?

In an article in Wired, it was claimed that Google is not actually a search company and that it is instead a
machine-learning company. So what is machine learning, and why is it so important?

In his book, Kevin Patrick Murphy defines machine learning as “…a set of methods that can automatically
detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other
kinds of decision making under uncertainty.” For example, a computer program is designed by a video service to recommend movies that individual users might enjoy. The algorithm analyzes movies
that viewers have already watched and movies that people with similar viewing preferences have rated
highly. The goal is to increase customer satisfaction with the video service.

Machine learning methods have been applied to a wide range of applications including speech recognition,
medical diagnostics, self-driving cars, sales recommendation engines, and many others.

Whatever the application, machine learning algorithms improve their performance on specific tasks based
on repeated performance of those tasks, if the algorithm and the model are able to cope with the increased
variability introduced by the additional data. This is the fundamental trigger for looking for better models
and algorithms.

TYPES OF MACHINE LEARNING ANALYSIS

Machine learning encompasses many different algorithms, some with a wide range of applicability, while
others may be suited only for specific applications. These algorithms can be divided into two main categories:
supervised and unsupervised. Supervised machine learning algorithms are the most commonly used
machine learning algorithms for predictive analytics. These algorithms rely on data sets that have been
processed by human experts (hence the word "supervision"). The algorithms then learn how to perform the
same processing tasks autonomously on new data sets. In particular, supervised methods are used to
solve regression and classification problems:

● Regression problems – These are the estimation of the mathematical relationship(s) between
a continuous variable and one or more other variable(s). This mathematical relationship can then
be used to compute the values of one unknown variable given the known values of the others.
Examples of regression are the estimation of a car position and speed using GPS, predicting the
trajectory of a tornado using weather data, or predicting the future value of a stock using historical
data and other sources of information. To mentally visualize the simplest example of regression,
imagine two variables whose values are visualized as points in a 2D plot similar to the image on the right side of Figure 1. Performing regression means finding the line that best interpolates the values. The line can take many different shapes and is expressed as a regression function. A regression
function allows you to estimate the value of one variable given the value of the other, for values that
have not been observed before.

● Classification problems – These are used when the unknown variable is discrete. Typically, the
problem consists of estimating to which of a set of pre-defined classes a specific sample belongs.
Typical examples of classification are image recognition, or diagnosing pathologies from medical
tests, or identifying faces in a picture. A visual interpretation of a classification problem can be seen
in two dimensions, where points belonging to different classes are marked with a different symbol,
similar to the image on the left side of Figure 1. The algorithm "learns" examples of the location and
the shape of the boundary line between the classes. This boundary line can then be used to classify
new examples.

Unsupervised machine learning algorithms do not require human experts to learn from, but autonomously discover patterns in data. Examples of problems solved with unsupervised methods are clustering and association:

● Clustering methods – These can be seen as the automatic discovery of groups of samples that have
similar characteristics, which can possibly point to the fact that a member of the cluster belongs to a
well-defined class. For example, clustering algorithms are used to identify groups of users based on
their on-line purchasing history, and then send targeted ads to each member. In Figure 2, the clustering algorithm has automatically assigned a different color to each group of observations that are "close" to each other.

● Association methods – These are also a very relevant problem for on-line retailers, and consist of
discovering groups of items that are frequently observed together. They are used to suggest
additional purchases to a user, based on the content of their shopping cart. For a detailed summary
of ten of the most commonly used machine learning algorithms, read this article on the KDnuggets
website. KDnuggets is an excellent resource for data scientists and data scientists in training.

A MACHINE LEARNING PROCESS

Developing a machine learning solution is seldom a linear process. Several trial and error steps are
necessary to fine tune the solution, similar to what we have seen for the CRISP-DM model. However, the
process can be simplified:
Step 1. This is the data preparation step. In this step, we include the data cleaning procedures (i.e. the
transformation into a structured format, the removal of missing data and noisy/corrupted observations).

Step 2. Create a learning set that will actually be used to train the model.

Step 3. Create a test set that will be used to evaluate the model performance. The test set step is only
performed in the case of supervised learning.

Step 4. Create a loop. An algorithm is chosen, based on the problem at hand, and its performance is evaluated on the learning data. Depending on the chosen algorithm, additional pre-processing steps might
be necessary, such as extracting features from the data set that are relevant for the problem. For example, if
you are trying to analyze the activity level of a person based on a fitness tracker, features such as the
number of steps, elevation, maximum acceleration and so forth can be extracted from the raw sensor
measurements. Post-processing steps can also be performed at this point, such as fine-tuning of the
model/algorithm parameters. If the algorithm and the model reach a sufficient performance on learning
data, the solution is validated on test data. Otherwise, a new model and/or algorithm is proposed and the
learning process is repeated.

Step 5. Test the solution on the test data. This is called the model evaluation step. The performance on learning data is not necessarily transferable to test data. The more complex and fine-tuned the model is, the
higher the chances are that the model will become prone to overfitting. Imagine a student studying for an
exam. Learning the material by heart does not ensure a positive outcome in the exam. In the same way, a
machine learning algorithm can focus too much on the learning set, and perform poorly on the test.
Overfitting can result in going back to the model learning process.

Step 6. When the model achieves satisfactory performance on the test data, the model can be implemented.
This means performing the necessary tasks to scale the machine learning solution to Big Data, and
deciding what component of the IoT will actually perform each step. Can some computation be done on
the device? Does it require a cloud infrastructure? Does the fog computing model help with massive
amounts of streaming data? All these questions can only be answered with a collaboration between
experts of different fields, such as Data Analysts, Data Engineers and Business Managers.

REGRESSION ANALYSIS

Regression analysis is one of the oldest and most commonly used statistical methods for analyzing data.
The main idea of regression is to quantify the mathematical relationship between one or more independent
(also called predictor) variable(s), and a dependent (also called target) one. Being a supervised method, it
relies on a data set of observed predictor(s) and target values. When the relationship (also called the
regression function) between the two is obtained, it can be used to estimate the values of the dependent
variable outside of the range of the observed values. In other words, a regression model allows the analyst
to extrapolate outside of the available data set.

When working with time-series data, for example, regression allows the analyst to predict future values from
historical data. In principle, regression looks to find a relationship between any types of continuous
variables. In particular, it tries to answer the generic question: "how much will variable V1 change if
variable(s) V2 (V3, V4, V5) change(s) by quantity X?" A simple way to visualize a regression function is to
imagine a set of points in two dimensions, such as the ones in the figure. The predictor variable, by
convention plotted on the X-axis, is the proportion of licensed drivers in different geographical areas. On the
Y-axis, normally used for the target variable, is the corresponding consumption of gasoline. In this case, a
possible regression function is represented by the red line. The fact that, in this example, it is a straight line,
suggests a very intuitive result: an increase of licensed drivers in the area will cause a proportional increase
of gasoline consumption. While a simple visual examination of the distribution of the data points suggests
that a line is the best fit, regression makes no restriction on the shape of the regression function. In general,
it can be said that the more flexible the shape of the regression function, the more parameters the model
contains, and the more complicated the algorithm will be from a mathematical and computational point of
view.

LINEAR REGRESSION

The most common regression methods are called linear regressions. These are the simplest from both a
computational and mathematical point of view; and therefore, represent the first option for a data analyst
presented with a regression problem. Despite the name, linear regression does not imply fitting a line
through the data points. The term linear means that the regression function will always try to fit the data
using a weighted sum of other functions, whether those functions are linear or not. The linearity property
simplifies the calculation of the parameters of the regression model, while at the same time allowing for
virtually any shape to be used to fit the observations. The simplest case of linear regression consists of
fitting a straight line. This is also called a simple linear model, as shown in Figure 1.

A high Pearson correlation indicates that a simple linear model is a good candidate to fit the data. Figure 2
shows examples of strongly positively and negatively correlated observations. The regression process, in this case, consists of finding the slope and the intercept of the line that minimizes the sum of the squared distances
between the line and all the data points, as shown in Figure 3. When using linear models, the most common
algorithm used to estimate these optimal model parameters is called least squares.

In Figure 4 we can see three data sets, each has one target and one predictor variable. In all three cases, it
can be observed how, despite the noise affecting the observations, there is a clear line that captures the
underlying relationship between the variables. In each case, the red line represents the linear regression model that minimizes the overall distance from the observations.
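
The scikit-learn library, which is an assumption here and is not named in the text, provides a least squares linear regression. The sketch below fits a straight line to invented driver and gasoline figures that echo the example above.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: percent of licensed drivers (predictor) and gasoline consumption (target)
rng = np.random.default_rng(0)
drivers = rng.uniform(40, 80, size=50).reshape(-1, 1)
gasoline = 3.5 * drivers.ravel() + 20 + rng.normal(0, 10, size=50)  # linear trend plus noise

# Least squares fit of a simple linear model (slope and intercept of the best line)
model = LinearRegression()
model.fit(drivers, gasoline)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("predicted consumption at 65% licensed drivers:", model.predict([[65]])[0])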

APPLICATION OF REGRESSION ANALYSIS

Regression analysis has many applications. It is frequently used in business and financial analysis with
historical data to inform strategies for future action. It can be used to predict trends in economics and can
inform political action to guide economic growth. Customer behavior can also be predicted to distinguish normal from possibly fraudulent behavior in the fields of insurance and consumer credit.

In healthcare, multiple regression can be used to evaluate which of a number of variables may influence a
target variable. For example, the relationship between a group of lifestyle choices such as smoking, amount
of exercise, and eating habits could be analyzed to determine how they affect a health variable such as
blood pressure, diabetes, or even life expectancy.

No matter what the application, any machine learning model requires validation. Some models are very sensitive to outliers or other data anomalies. Other models may generate results that are unsuitable for answering the research question.
CLASSIFICATION PROBLEMS

Classification is another common machine learning problem that fits in the category of supervised learning.
Massive improvements have been achieved in the last decade, especially in the domain of image
recognition. Classification can be seen as a regression problem where the target variable is discrete and represents the class in which a human expert has placed the data sample. In classification problems, it is common not only to provide a set of example data points for each class, but also to establish which features of each data point are most useful for estimating the corresponding class. These features can be
readily available from the sensors, but more often need to be computed (or extracted) from the raw data
before being fed to the learning algorithm. The definition of relevant features is a crucial step that, with the
exception of very advanced algorithms such as Deep Learning, relies on human expert knowledge.

For example, a web-based travel company is interested in providing a reliability rating for the flights that it finds for customers. The reliability rating is designed to communicate the degree of likelihood that a flight will be on-time, delayed, or cancelled. The company has access to a large amount of historical data for different airlines, flights, origins and destinations, flight status, and other information, and it decided to use a classifier to predict which flights are most likely to belong to the groups of on-time, late, or cancelled flights. Via trial and error with different models, it has been determined which variables among all the ones in the data set are the most relevant for the classification (these are also said to have the highest discriminant power). Only these relevant features are extracted from the data and used to train the classifier.

CLASSIFICATION ALGORITHMS

There are numerous classifier algorithms that are popular for various purposes. We will briefly discuss three
of them, followed by a short code sketch of the first:

● k-nearest neighbor (k-NN) - k-NN is possibly the simplest classifier, which uses the distance between
training examples as a measure of similarity. To visualize how a k-NN classifier works, imagine that
each sample has two features, for which the values can be represented in a 2D plot. In Figure 2, the
data points of each class are marked with a different symbol. The distance between two points represents the difference between the values of their features. Given a new data point, a k-NN classifier will look at the k closest training points. The predicted class for the new point will be the most common class among those k neighbors.

● Support vector machines (SVM) - Support vector machines (SVM), shown in Figure 3, are examples
of supervised machine learning classifiers. Rather than basing the assignment of category
membership on distances from other points, support vector machines compute the border, or hyperplane, that best separates the groups. In the figure, H3 is the hyperplane that maximizes the
distance between the training points of the two classes, marked in color or black and white. When a
new data point is presented, it will be classified based on whether it lies on one side or the other of
H3.
● Decision trees - Decision trees represent a classification problem as a set of decisions based on the
values of the features. Each node of the tree represents a threshold over the value of a feature, and
splits the training samples in two smaller sets. The decision process is repeated over all the features,
growing the tree until an optimal way of splitting the samples is computed. The classification of a
new sample can then be obtained by following the tree branches based on the values of its features.
A simplified view of a binary decision tree and the types of nodes is shown in Figure 4.
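
A sketch of the first of these, k-NN, is shown below using the scikit-learn library, which is an assumption and is not named in the text; the training points and labels are invented.

from sklearn.neighbors import KNeighborsClassifier

# Invented training data: two features per sample and a class label for each sample
X_train = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # class 0
           [3.0, 3.2], [3.3, 2.9], [2.8, 3.1]]   # class 1
y_train = [0, 0, 0, 1, 1, 1]

# k-NN assigns a new point the most common class among its k closest training points
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

print(knn.predict([[1.0, 1.1], [3.1, 3.0]]))  # expected output: [0 1]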

VISUALIZING CLASSIFICATIONS

● This is a visualization of a decision tree classifier built to predict which passengers on the sunken cruise ship Titanic would be survivors or victims. Note that the decision tree nodes include a measure of probability and
percentage of the population of passengers that is represented by each node. This decision tree is
very useful in identifying the factors, such as gender, that had the greatest impact on survivability.
This system could be supplied with fictitious passengers and it would very accurately classify the
new passengers by survival outcome.

● Figure 3 – This is a three-dimensional plot of a support vector machine. In this case, the hyperplane
that separates the two groups is derived from three variables for each observation. A small amount
of error is present as shown by the data points that appear on the wrong side of the hyperplane.

APPLICATIONS OF CLASSIFICATION

Classification algorithms have many applications. Here are several examples:

● Risk Assessment - Classification systems can be used to determine which of many factors
contribute to the likelihood of various risks. For example, a number of factors can be used to classify
automobile insurance users into low, medium, and high risk categories and to adjust the premiums
that the drivers pay according to the level of risk.

● Medical Diagnostics - Classification systems can use guided questions to build a decision tree that
can help diagnose various diseases and risks of disease. Machine learning classification systems
could also perform preliminary analysis of large numbers of diagnostic images, and flag suspicious
conditions for review by physicians.

● Image Recognition - For example, in handwriting recognition, a system may be working at the task of
identifying handwritten numerals. The numerals 0 - 9 can be considered as classes. The classifier is
provided with a large sample of handwritten numerals, each of which has been labelled with the
actual numeral represented. The classifier would look for features that are most likely to be present
and unique to each of the numerals.

ISSUES IN USING ANALYSIS

Scientific discovery often comes from the use of the scientific method. The scientific method is a six-step
process:

Step 1. Ask a question about an observation such as what, when, how, or why.
Step 2. Perform research.
Step 3. Form a hypothesis from this research.
Step 4. Test the hypothesis through experimentation.
Step 5. Analyze the data from the experiments to draw a conclusion.
Step 6. Communicate the results of the process.

Sensationalism often dominates the news coverage of any scientific discovery that promises to change the
world as we know it. For example, in 1989, two scientists claimed to have created cold fusion which would
provide a low-cost means of acquiring clean and abundant energy.

However, after a few years, it became clear to other scientists that the original experiment was flawed and
that it could not be repeated. The issues of validity and reliability are fundamental to supporting the results
claimed by any experiment or study.

VALIDITY

Other scientists who analyzed the design of the cold fusion experiment found that it was lacking some
necessary controls. This meant that the validity of the original experiments was now in question.

While there are many terms used to describe types of validity, researchers typically distinguish between
four types of validity:

● Construct validity - Does the study actually measure what it claims to measure?

● Internal validity - Was the experiment designed correctly? Does it include all the steps of the
scientific method?

● External validity - Can the conclusions apply to other situations or other people in other places at
other times? Are there any other causal relationships in the study that might account for the results?

● Conclusion validity - Based on the relationships in the data, are the conclusions of the study
reasonable?

RELIABILITY

A reliable experiment or study means that someone else can repeat it and achieve the same results. For
example, children can reliably repeat the experiment of mixing baking soda and vinegar to achieve the
same results: a volcano.

Researchers distinguish between four types of reliability:

● Inter-rater reliability - How similarly do different raters score the same test or observation?

● Test-Retest Reliability - How much variation is there between scores for the same person taking a
test multiple times?

● Parallel-Forms Reliability - How similar are the results of two different tests that are constructed from
the same content?

● Internal Consistency Reliability - What is the variation of results for different items in the same test?

In the cold fusion and volcano examples, the verification of the validity of the claims can be performed by
replicating the experiment. In data analytics, however, repeating an experiment can be too costly or even
impossible. An example is the image classification system implemented by Facebook, which allows users to
search for images using a text description. When developing such a solution, Facebook data scientists have
access to millions of pictures with a textual description provided by human experts. They can fine tune their
classification algorithm to improve its performance but they cannot know how it will behave with new
pictures that users will post in the future, and they cannot evaluate whether it will give a correct answer or
not.

So how can they be reasonably sure that the classification system will work with pictures that it has not
processed before? They resort to a method called cross-validation. Cross-validation is where you train your
algorithm using only a randomly selected sample of the data, called the training set. Then, the model is
evaluated on the rest of the data, called the validation set. The performance that a classification system obtains on the training set is usually higher than its performance on the validation set. However, the validation performance better represents how the algorithm behaves with samples that it has not processed before.

The success of the classification algorithm on pictures that users will post in the future will depend on
how well the training set represented the entire data set. If, for example, users start posting a picture of a
solar eclipse, the system will know how to classify it only if there were examples of a solar eclipse in the
training set.
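
A sketch of splitting data into a training set and a validation set, using the scikit-learn library (an assumption, not something named in the text) and its built-in iris sample data, is shown below.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small sample data set and hold out 30 percent of it as a validation set
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# The training score is usually higher; the validation score better represents
# how the algorithm behaves on samples that it has not processed before
print("training accuracy:  ", clf.score(X_train, y_train))
print("validation accuracy:", clf.score(X_val, y_val))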

Data analytics solutions are sometimes subject to the same biases as humans, because these biases are expressed
in the data sets used for training them. Click here to read about some examples of how this bias can lead to
discrimination.

ERROR IN DATA ANALYTICS

Errors, and more generally uncertainty, affect the data analytics process at different levels. The first type of error is the measurement error. We have often said that data needs to be cleaned because the values of the variables can be corrupted by noise. But where does this noise come from? Often, the error is caused by the sensor itself, or by the human reading or using the sensor.

Any device for taking measurements is limited in its precision. Therefore, all measurements have a built-in
error component. Regardless of which human is reading the measurement, the device will always have this
built-in error. For example, measuring tape used to mark the cutting line on a piece of plywood has a
built-in measurement error. An error of several millimeters would not impact the effectiveness of storm
shutters. However, you might want to be more precise when cutting wood for kitchen cabinets.

Measurement error is defined as the difference between the true value (which is unknown) and the measured one. Because of the measurement error, the true value can never be known exactly, but the error can be studied statistically and accounted for.

Another type of error is the prediction error. In supervised learning, the prediction error is quantified as the
difference between the value predicted by the model and the observed value. The observed value is
affected by the measurement error, and while this cannot be eliminated, there are techniques, such as
cross-validation, to limit its effects.

TYPES AND SOURCES OF MEASUREMENT ERROR

Measurement errors can be further categorized into these three groups:

● Gross errors - These are caused by a human mistake in using the instrument that takes the measurement, or in recording the result of the measurement. For example, an observer records 1.10 instead of the actual measurement of 1.01.
● Random errors – As shown in Figure 1, these are caused by factors that randomly impact the
measurement over a sample of data. For example, a calibrated scale at your grocery store might still
have a plus or minus error of 1 gram every time you weigh the same item.

● Systematic errors - As shown in Figure 2, these are caused by instrumental or environmental factors
that impact all measurements taken over a given period of time. For example, a scale that is not
calibrated will generate a systematic error every time a measurement is taken.

Random errors tend to create a normal distribution around the mean of an observation (Figure 1). It is
possible to build a statistical model of the error, in which case regression and classification algorithms can
easily take it into account. For some methods, the fact that the error follows a normal distribution is, in fact, a
requirement.

Systematic errors tend to shift the distribution of the observations (Figure 2) in one direction or another. A
systematic error is therefore harder to deal with, because the true value is not known, so the only way to
detect a systematic error is to use another measurement system that we deem more reliable.
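
The following is a minimal sketch, using NumPy and made-up numbers, of the difference between a random error (zero-mean noise around the true value) and a systematic error (a constant shift of the whole distribution):

import numpy as np

rng = np.random.default_rng(seed=1)
true_weight = 500.0                                   # grams; the "true" value is assumed known here

# Random error: zero-mean noise of about +/- 1 gram on a calibrated scale.
random_readings = true_weight + rng.normal(loc=0.0, scale=1.0, size=1000)

# Systematic error: an uncalibrated scale that always reads 5 grams too high,
# on top of the same random noise.
systematic_readings = true_weight + 5.0 + rng.normal(loc=0.0, scale=1.0, size=1000)

print("Mean with random error only:   ", random_readings.mean())       # close to 500
print("Mean with systematic error too:", systematic_readings.mean())   # shifted toward 505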

ERRORS IN PREDICTIVE DATA

Prediction error is the difference between the value predicted by the regression or classification model and the measured value. For regression, a simple visual explanation of the error is provided in the figure.

The prediction error is the distance between the regression function and the data points. In particular, it is common to evaluate the prediction error using the mean of the squared distances over all the points (the mean squared error). For classification, the error is the number of times the true class and the estimated class differ, commonly divided by the number of data points to give a misclassification rate.
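
The following is a minimal sketch of both error measures, using NumPy and small made-up arrays of observed and predicted values:

import numpy as np

# Regression: mean squared error between observed and predicted values.
observed = np.array([2.0, 4.1, 6.3, 7.9])
predicted = np.array([2.2, 3.9, 6.0, 8.4])
mse = np.mean((observed - predicted) ** 2)

# Classification: fraction of samples where the estimated class differs from
# the true class (the misclassification rate).
true_class = np.array(["cat", "dog", "dog", "cat", "cat"])
estimated_class = np.array(["cat", "dog", "cat", "cat", "dog"])
error_rate = np.mean(true_class != estimated_class)

print("Mean squared error:    ", mse)
print("Misclassification rate:", error_rate)   # 2 errors out of 5 = 0.4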

In regression, the error on the training set is typically smaller than the error on the validation set. The prediction error has two components:

● The first component is caused by the choice of model. No matter the algorithm, every time we fit a regression or a classification model, we make an assumption about how the data is distributed, which is inevitably an approximation. For example, we could fit a model that is a good approximation only for a given range of samples, but fails to capture the relationship outside of it, as shown in the figure. A well-known saying among data analysts is that "all models are wrong, but some are useful". The figure shows the difference between a second order polynomial model (Figure 1) and a third order model (Figure 2). The third order model clearly performs better in terms of error (Figure 3). A small sketch comparing two such models follows this list.

● Even when the chosen model perfectly reflects the true distribution, there will still be differences between predicted and actual values because of the measurement error. This cannot be eliminated, and therefore the measurement error influences the regression model.

● In machine learning, the first cause of prediction error is often called the bias of a model, while the second is called the variance. One cannot minimize both at the same time; this situation is known as the bias-variance tradeoff.
MISLEADING RESEARCH

Understanding the impact of validity, reliability, and errors in a pattern of data is an important first step to
ensuring that your conclusions are based on a solid research design. It is also the first step in evaluating the
results reported by someone else.

Misleading, bad, or erroneous research is more common than you may think. In fact, John P.A. Ioannidis
states that most research findings are false. Click here to read how the probability of a research finding
being true is based on six corollaries.

SENSATIONALISM IN RESEARCH FINDINGS

Many different voices in today's culture clamour for your attention. Therefore, it is not uncommon to see
misleading or incorrect headlines from some media sources. For example, click here to read why Cancer Research UK had to refute the claim that chocolate can detect cancer. The more accurate finding was that ingesting a spoonful of sugar before an MRI might help detect cancer. But which headline gets more clicks?

Sometimes there can be unrealistically high expectations about a new technology. For example, the Tesla
Autopilot driving assistance system is a great example of machine learning and Big Data analytics applied
to sensor data. Self-driving cars, mainly thanks to Google, have been in the news for years. When Tesla
released its Autopilot system, the public assumed that the system was able to completely replace the
driver.

However, Tesla Autopilot is not a self-driving technology, because the driver is supposed to keep their hands
on the steering wheel and take over at any moment. After a recent fatal car accident involving a Tesla, the
media spoke out against self-driving cars and accused them of being unreliable.

In this case, inflated expectations of an AI solution led to misuse of the technology and to an accident for which the AI was blamed. An analysis of the Tesla Autopilot crash revealed a limitation in the camera and the image classification system. The system itself was not designed to replace the driver. Self-driving cars are definitely feasible, but we are still a few years away from a system that can replace the driver completely, in any weather conditions, and with greater safety than a human driver.

GUIDELINE FOR EVALUATING RESULTS

There are several guidelines you can follow when evaluating the results reported by a research study or a data analysis report:

● Statistics - Does the study have a large enough sample size to support the findings? For example, a nationwide opinion poll should have a sample size of at least 1,024 participants to get a margin of error of about 3% (a sketch of this calculation appears after this list). For classical scientific studies, statistics provides a set of tools to determine precisely how much data is needed. For data analytics, it is not possible to answer this question in general. Cross-validation is crucial in this case to predict how the model will generalize beyond the available data.

● Research design - Did the architects of the study follow generally accepted methods of research
design? Did they use blind observation or control groups, if necessary? Did they account for their own
bias in conducting the research? Who paid for the research and what is that organization's
motivation?
● Duration - Does the research appropriately account for the impact on time? How long should a
research team follow the participants of a study to make sure the results are valid? In the case of a
data analytics model for an IoT solution, the duration is important because of environmental
changes. An image classification system trained with pictures of trees in spring might fail to
recognize them in winter.

● Correlation and causation - Just because two variables are correlated does not mean that one
caused the other. Click here to read about the correlation between ice cream sales and crime. Ask if
the researchers accounted for any other confounding variables that may have impacted the study.

● Alignment to other studies - Do the results confirm or align with other studies in the field? If not, can
the study be replicated to account for the reliability of the findings?

● Peer review - Has the study been reviewed by experts in the same field? Are there any experts who
disagree with the findings?
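
The following is a minimal sketch of the classical margin-of-error calculation mentioned in the Statistics guideline, at a 95% confidence level with the worst-case proportion p = 0.5. The sample sizes are illustrative:

import numpy as np

z = 1.96          # z-score for a 95% confidence level
p = 0.5           # worst-case proportion

def margin_of_error(n):
    # Margin of error for a simple random sample of size n.
    return z * np.sqrt(p * (1 - p) / n)

for n in (500, 1024, 2000):
    print(f"n = {n:5d}  margin of error = {margin_of_error(n):.3%}")
# Larger samples shrink the margin of error, but with diminishing returns.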

DETECTING ANOMALIES

Anomalies can be data values that are erroneous, or values that are genuinely unusual. Data can be corrupted or distorted by many factors during measurement, transmission, or storage. The resulting values are considered outliers. They deviate so far from expected values that they could distort the results of the analysis. These observations are frequently removed from the data set after careful consideration.

There are other types of anomalies that are very important. These anomalies can represent serious
problems with the item that is being measured. For example, unusually high temperature or vibration
measurements made by sensors attached to a machine could indicate that a part is ready to fail. In this
case, a streaming data analysis application on the IoT could send an alarm that would alert maintenance
personnel that the machine requires attention.

The rest of the lab uses classes and functions from matplotlib and NumPy. Notice in the 3D plot in Figure 1 that there are several points of data that lie outside of the clustered area. These outliers are anomalies. Anomalies can be identified by detecting points that lie far from the mean. By measuring the difference between the x, y, and z coordinates of each data point and the x, y, and z coordinates of the mean, you derive a distance for every data point. This distance is called the Euclidean distance.

To detect anomalies, you will need to identify the decision boundary that defines whether a data point is normal or an anomaly. To do this, you first normalize the distance data by setting the farthest distance to 1. Then you choose a threshold between 0 and 1 that defines the decision boundary. In the lab, you will set the threshold to 0.1 and code functions to visualize the 3D plot in Figure 2. The sphere shows the decision boundary between normal data and anomalous data.
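
The following is a minimal sketch of this distance-based approach, using NumPy on a made-up cloud of three-dimensional data points. The threshold value and the data are illustrative and may differ from the lab:

import numpy as np

rng = np.random.default_rng(seed=3)
points = rng.normal(loc=[50.0, 10.0, 20.0], scale=2.0, size=(200, 3))
points = np.vstack([points, [[80.0, 30.0, 60.0]]])        # add one obvious outlier

# Euclidean distance of every point from the mean of the cloud.
center = points.mean(axis=0)
distances = np.linalg.norm(points - center, axis=1)

# Normalize so the farthest point has a distance of 1, then apply a threshold.
normalized = distances / distances.max()
threshold = 0.5                    # illustrative; the lab uses its own threshold
anomalies = points[normalized > threshold]

print("Number of anomalies detected:", len(anomalies))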

SUMMARY

This chapter began with an explanation of how Big Data is characterized by its volume, velocity, variety and
veracity. It continued by discussing the concept of machine learning. Machine learning is “…a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty.”

Regression and classification analyses are examples of supervised machine learning approaches. Clustering and association analysis are examples of unsupervised machine learning.

Regression analysis is the most commonly used statistical method for analyzing data. It is a supervised
machine learning technique. Regression uses the historical relationship between one or more independent
variables and a dependent variable to predict future values of the dependent variable. The goal of linear
regression is to construct a trend line that best fits the data.

Next to regression analysis, classification is the most common type of machine learning used in Big Data
analytics. Classification modeling is performed by a family of machine learning algorithms that are
commonly used to assign observations to groups. Classification models, also known as classifiers, are
supervised machine learning algorithms. There are numerous classifier algorithms that are popular for
various purposes: k-nearest neighbor (k-NN), Support vector machines (SVM), and Decision tree.
Classification can be seen as a regression problem where the target variable is discrete and represents a class to which a human expert has assigned the data sample. In classification problems, it is common to provide not only a set of example data points for each class, but also to establish which features of each data point are most useful for estimating the corresponding class.

The next section discussed evaluating models and scientific discovery. Scientific discovery often comes from the use of the scientific method. The scientific method is a six-step process:

Step 1. Ask a question about an observation such as what, when, how, or why.
Step 2. Perform research.
Step 3. Form a hypothesis from this research.
Step 4. Test the hypothesis through experimentation.
Step 5. Analyze the data from the experiments to draw a conclusion.
Step 6. Communicate the results of the process.

While there are many terms used to describe types of validity, researchers typically distinguish between
four types of validity: construct, internal, external, and conclusion. Researchers distinguish between four
types of reliability: inter-rater, test-retest, parallel-forms, and internal consistency. Error is the difference
between the actual value and the measured value of an observation.

error = actual value - measured value

We distinguish two main types of errors in data analytics: the measurement error and the prediction error. Measurement error is caused either by a human mistake, or by noise or lack of precision in the measurement system or sensor. There are three basic types of measurement error: gross, systematic, and random. Random errors tend to have a normal distribution around the mean of an observation. Systematic errors tend to shift the distribution of the observations. Prediction error is the difference between the value predicted by the regression or classification model and the measured value. In machine learning, the first cause of prediction error (the choice of model) is often called the bias of a model, while the second (the effect of measurement noise) is the variance. One cannot minimize both at the same time, and this situation is often called the bias-variance tradeoff.

Understanding the impact of validity, reliability, and errors in a pattern of data is an important first step to
ensuring that your conclusions are based on a solid research design.

The final section of this chapter covered preparation for the Internet Meter labs. In the first lab, you used
regression analysis to view historical data about the growth of Internet traffic. You quantified the
relationship between the year and the measurement of Internet traffic. You installed pandas, numpy, and
matplotlib. The matplotlib library includes different styles for showing your plots.
In the second lab, you visualized data in three dimensions. To do so, you extended matplotlib by importing the mplot3d toolkit from the mpl_toolkits package. You then used the Internet meter data to create a 3D plot displaying three axes: download rate (x-axis), upload rate (y-axis), and ping rate (z-axis). To detect anomalies, you identified the decision boundary that defines whether a data point is normal or an anomaly.
CHAPTER 5: STORYTELLING WITH DATA

TELLING A STORY

Exploratory analysis is a set of procedures designed to produce descriptive and graphical summaries of
data with the notion that the results may reveal interesting patterns. It is a process of discovery that
sometimes enables us to create a hypothesis about the data.

At this point in the data analysis lifecycle, you have completed the exploratory analysis. From the results,
you have concluded that there is something interesting that you need to report to decision-makers.

Telling the story is the step in the process that will actually drive the changes made by decision-makers.
However, you do not want to spend too much time on the data itself. This is hard to do: you have spent a lot of time gathering and analyzing the data, and you have created many data visualizations during your exploratory analysis. It is very tempting to walk your audience through the entire analysis process.

Click here for eight examples of how to present rich data in simple ways.

Look for those one or two crucial insights that your data analysis revealed. While telling the story, give
enough of the data to explain your point. Remove what Edward Tufte would call “administrative debris” or
the superfluous navigational elements that detract from the content of your message. For example, can a
legend in your visualization be incorporated in the data instead of creating a separate box with this
information?

Click here to learn more about Edward Tufte.

AUDIENCE

Before you can effectively tell a story, you have to know your audience. Answering the following questions
will go a long way to determining the perceptions and motivations of your audience.

Who is your audience?

Who will hear your story? What is the listener’s motivation to hear your story? What is each important
participant’s level of technical knowledge and familiarity with the business problem? What are the possible
reactions to hearing your story?

Where is your audience?

Are they all in a conference room sitting around a table listening? Or are they online in a virtual conference?
Or is your audience a combination of both? If online, how much will you depend on verbal messaging? Will
you be sharing a presentation? Are some of the participants joining the conference using audio only?

When is your audience available?

If any important participants were not able to join your conference or were only able to call in, will they be
able to access a recording of the presentation later? How time-sensitive is your story? Is the recording of the
presentation allowed? If so, what are the security considerations? How confidential is the information? Can
the participants who were not in the meeting take action on your story in a timely manner?

You may have one version of a presentation for a live meeting in which you demonstrate certain findings and aspects of the data. But you may also create another version of the presentation that persists as a "research product" in your research archive. Any current or potential stakeholders should be able to understand your presentation as a stand-alone data product.

Regardless of who, where, or when your audience is, a good data visualization should stand on its own when
removed from the context of the original delivery environment. Click here to read an article by Daniel
Waisberg of Google, “Tell a Meaningful Story with Data”.

BUSINESS VALUE AND GOAL

Business value means different things to different audiences, even within the same company. Therefore, you
should be very clear on why your particular audience should care about the story you are telling.

Identifying the business value for each audience will point you to the overall goal of your presentation. What
do you want the audience to take away? What do you want to teach them? What is your call to action?

For example, a presentation to different departments in a manufacturing company about an upcoming new product should be tailored to the audience. Marketing wants to know the features that will help them
sell the product. Finance wants to know the cost to make the product and whether customers will need
credit options. Production wants to know the specifications so they can begin to integrate the product in the
manufacturing process.

A good example at the consumer level is the announcement of the iPod in 2001. Steve Jobs knew users did
not really care how many gigabytes they could store on their MP3 players. What they cared about was how
many songs they could store. So Jobs told the audience that the iPod “puts a thousand songs in your
pocket.”

USING EVIDENCE

Your story is usually meant to persuade the audience to adopt your point of view. The explanation is done
through the presentation of evidence.

The evidence you present should be critical to your end goal. If a piece of evidence does not support your
concluding remarks, or is secondary to your primary focus, you should consider leaving it out of your
presentation.

DEDUCTIVE REASONING

Logic is reasoning that is used to make valid statements about a conclusion. There are two basic types of
reasoning: deductive and inductive.

Deductive reasoning uses facts, or premises, to arrive at a conclusion. A syllogism is an example of deductive reasoning. A syllogism is made up of three statements: two premises and a conclusion. The premises, which are assumed to be true, each share a term with the third statement. For example:

All mammals have eyes. Humans are mammals. Therefore, humans have eyes.

The third statement shares the term “eyes” with the first statement and it also shares the term “humans”
with the second statement.
In research, the conclusion is a new specific fact that is derived from the general theory. Deductive
reasoning is considered “top-down” in that it moves from a general premise to specific facts that are
derived from the general premise. In the syllogism given above, the major premise describes all mammals;
it is the general rule. The next premise states a narrower case from which it can be deduced that humans
have eyes.

Sound deductive reasoning always leads to conclusions that are true.

INDUCTIVE REASONING

Inductive reasoning works in the opposite direction. Inductive reasoning moves from the specific to the
general. It is the process of creating a conclusion based on observations, patterns, and hypotheses. In
exploratory data analysis, we are often using inductive reasoning. We sample a population, study the
sample, and then make inferences that we believe will be true for the entire population. These inferences
can then lead to a hypothesis. That hypothesis would then need to be confirmed by using deductive
reasoning.

To avoid making an inference that is not logical, be sure that your sample actually represents the
population to which you are applying your conclusions. When asked the question, “What are you touching?”
blindfolded people touching different parts of an elephant will report different conclusions.

Inductive arguments should be very clear about the underlying assumptions in the data. For example, in the
following series of numbers, what is the next number?

6, 13, 20, 27...

If you said 34, then your underlying assumption is that the data is a simple series of numbers which
increases by increments of seven, with no end. However, what if the numbers represent dates? If that is your
underlying assumption, then you may not know the next number. Is the month 30 days long, 31 days long, 28
or 29 days long? In this case, there is no month with 34 days, so that cannot be the next number.

Because inductive reasoning tends to be exploratory and based in observation, even the most precise inductive reasoning can lead to false conclusions. This is not a flaw of inductive reasoning. Both types of
reasoning have their value. Inductive reasoning can lead to the development of hypotheses that require
proof through deductive reasoning.

Regardless of whether you are using deductive or inductive reasoning to tell your story, clearly describe the
data that is necessary to support your logic. In addition, state any possible caveats or limitations to the truth
of your conclusions. This will help your audience to avoid fallacious conclusions and provide them with the
information that they require to make decisions.

FALLACIES

A fallacy in your reasoning means that your conclusion is not justified by the premises of your argument. This can occur for several reasons:

● Your argument might not correctly apply a rule of logic.
● Your argument might leave out or misinterpret a crucial premise that would invalidate your conclusion.
● Your conclusion might not follow logically from the premises.

There are two types of logical fallacy:

● Formal fallacy - The logical structure of the argument is invalid, so the conclusion does not follow from the premises even when the premises are true. For example, the following argument is formally fallacious:

If milk is kept in the refrigerator, it will not spoil. The milk is not spoiled. Therefore, the milk was kept in the refrigerator.

The conclusion does not follow from the premises: the milk could have been left out and simply not spoiled yet.

● Informal fallacy - The premises do not adequately support the conclusion. For example, the following
argument is a burden of proof informal fallacy:

Humberto: Some people have psychic powers.

Alexei: Can you prove it?

Humberto: No one has been able to disprove it.

Humberto is obligated to provide evidence for his premise. However, he fallaciously switches the burden of
proof to Alexei to disprove Humberto’s claim.

When telling your story, make sure that the evidence of your supporting proposition does not suffer from a
logical fallacy. Click here and here for fallacy examples.

INTRODUCTION TO PYPLOT

Pyplot is a matplotlib module that includes a collection of style functions you can use to create and
customize a plot. The code in the figure creates a quick plot with default styles. Click here for a reference of
all the default settings for style attributes.
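
The following is a minimal sketch of such a quick plot with default styles. The data values are made up, and the figure in the course may differ:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 9, 16, 25]

plt.plot(x, y)                    # default line width, color, and font sizes
plt.xlabel("x values")
plt.ylabel("y values")
plt.title("A quick plot with pyplot defaults")
plt.show()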

MODIFYING THE DEFAULT STYLE INLINE

You can customize the default style by adding inline code. The following modifications are implemented in
the figure:

● The default figure size is 6.4 by 4.8 inches. To change the size, use plt.figure(figsize=(width, height)).
● The line width defaults to 1.5 points. To change it, add the linewidth attribute to the plot function.
● The font size defaults to 10.0 points. To increase it, add the fontsize attribute to the labeling functions.
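
The following is a minimal sketch of these inline modifications; the specific sizes are arbitrary examples:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 9, 16, 25]

plt.figure(figsize=(10, 6))              # override the default 6.4 x 4.8 inch figure
plt.plot(x, y, linewidth=3)              # thicker than the 1.5 point default
plt.xlabel("x values", fontsize=14)      # larger than the 10 point default
plt.ylabel("y values", fontsize=14)
plt.title("Customizing a plot inline", fontsize=16)
plt.show()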

CREATING A CUSTOM STYLE SHEET

If you find yourself reusing the same inline code for several plots, you may want to create a custom style sheet. By referencing a custom style sheet, you can give all your plots the same style features and avoid introducing minor errors in repeated inline code.

To create a custom style sheet, open a text editor and add a line for each element in the plot that you want
to customize. In Jupyter notebooks, you can use the Linux terminal to create and edit a text file.

Click here to learn how to create and edit Linux text files.

The code in the figure changes a few of the many customizable plot attributes. Click here for a reference to
settings for style attributes.
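
The following is a minimal sketch of writing a small style sheet from Python. The file name and the attribute values are illustrative assumptions, not the settings used in the course figure:

# Write a custom style sheet; each line sets one customizable plot attribute.
style_text = """figure.figsize  : 10, 6
lines.linewidth : 3
font.size       : 14
axes.grid       : True
"""

with open("mystyle.mplstyle", "w") as f:
    f.write(style_text)
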
REFERENCING A STYLE SHEET

Save the style sheet in an appropriate location with the extension .mplstyle. You can store it anywhere, but
then you will need to provide path information when you reference the style sheet. For example, if you store
the style sheet in ‘myfiles’, then your code to reference the style sheet is as follows:

plt.style.use('/home/pi/notebooks/myfiles/mystyle.mplstyle')

However, if you store the style sheet in the matplotlib configuration directory, then you do not need path information. Use the matplotlib.get_configdir() function to find the default location for matplotlib configuration files. Then copy your style sheet into that location, as shown in Figure 1.
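
The following is a minimal sketch of both options. The path shown is the one used above; note that recent matplotlib versions expect style sheets referenced by name to be placed in a stylelib subdirectory of the configuration directory:

import matplotlib
import matplotlib.pyplot as plt

print(matplotlib.get_configdir())    # default location for matplotlib configuration files

# Option 1: reference the style sheet with an explicit path.
plt.style.use('/home/pi/notebooks/myfiles/mystyle.mplstyle')

# Option 2: after copying the file into the configuration location,
# reference it by name instead.
# plt.style.use('mystyle')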

MATPLOTLIB STYLE SHEETS

Matplotlib already has several built-in styles that you can call in your code. Use print(plt.style.available) to see the available matplotlib styles, as shown in Figure 1.

Matplotlib defaults to the classic style. Use plt.style.use() to change the style. For example, plt.style.use('grayscale') will change the style to gray scale, as shown in Figure 2. Figures 3 and 7 show more examples of the style sheets available in matplotlib.
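
The following is a minimal sketch of listing the built-in styles and switching to one of them; the exact list printed depends on your matplotlib version:

import matplotlib.pyplot as plt

print(plt.style.available)       # e.g. ['bmh', 'classic', 'dark_background', 'ggplot', 'grayscale', ...]

plt.style.use('grayscale')       # all subsequent plots use the grayscale style
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.title("Plot rendered with the grayscale style")
plt.show()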

INTRODUCTION TO PLOTLY

Plotly is a powerful online tool that you can use to quickly generate beautiful data visualizations. Plotly has a
variety of resources for data analysts and web developers including API libraries, figure converters, apps for
Google Chrome, and an open source JavaScript library.

The Plotly website has an extensive amount of content that is available for free, including a GUI for creating
and visualizing your data. To save your visualizations, you will need to sign up for a free user account, as
shown in the figure.

Click here to register. After you register, you can browse the website for examples of public data
visualizations done by other users.
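
The following is a minimal sketch using the Plotly Python library (the plotly.graph_objects interface), which may differ from the online GUI workflow described above. The data values are made up:

import plotly.graph_objects as go

fig = go.Figure(
    data=go.Scatter(x=[1, 2, 3, 4], y=[10, 15, 13, 17], mode="lines+markers")
)
fig.update_layout(title="A simple Plotly line chart",
                  xaxis_title="x values",
                  yaxis_title="y values")
fig.show()    # renders the interactive chart in a browser or notebook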

COMMON TYPES OF DATA VISUALIZATIONS

So far in this section, we have been using a line chart to demonstrate different plot styles in Python. But a
line chart is not always the best chart type to use. Determining the best chart usually depends on your
answers to the following questions:

● How many variables are you going to show?


● How many data points are in each variable?
● Is your data over time or are you comparing items?

LINE CHARTS

Line charts are one of the most commonly used types of comparison charts. Use line charts when you have
a continuous set of data, the number of data points is high, and you would like to show a trend in the data
over time. Some examples include:

● Quarterly sales for the past five years.


● Number of customers per week in the first year of a new retail shop.
● Change in a stock’s price from opening to closing bell.

Some best practices for line charts include:

● Label your axes.


● Plot time on the x-axis (horizontal) and the data values on the y-axis (vertical). Label both axes.
● Use a solid line for the data to emphasize the continuity of the data.
● Keep the number of data sets plotted to a minimum. You should have a really good reason for
plotting more than four lines. If needed, add a legend to your line chart to help your audience
understand what they are viewing.
● Remove or minimize gridlines to reduce distraction. Consider using no gridlines unless you want to
emphasize data values or time.
● Modify the axis starting point to obtain a 45-degree slope. This ensures you emphasize the change in
the data without introducing distortions that dramatize the visualization.
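
The following is a minimal sketch of a line chart that follows these practices (labeled axes, a solid line, minimal decoration). The quarterly sales figures are made up:

import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
sales = [120, 135, 150, 170]              # in thousands of dollars, illustrative

plt.plot(quarters, sales, linestyle="-", marker="o")
plt.xlabel("Quarter")
plt.ylabel("Sales (thousands of USD)")
plt.title("Quarterly sales")
plt.show()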

COLUMN CHARTS

Column charts are positioned vertically, as shown in the figure. They are probably the most common chart
type used when you want to display the value of a specific data point and compare that value across
similar categories. Some examples include:

● Population of the BRICS nations (Brazil, Russia, India, China, and South Africa).
● Last year’s sales for the top four car companies.
● Average student test scores for six math classes.

Some best practices for column charts include:

● Label your axes.


● If time is one of the dimensions, it should be plotted on the x-axis.
● If time is not part of the data, consider ordering the column heights to ascend or descend.
● Fill the columns with a solid color. If you would like to highlight one column, consider using an accent
color and make all the other columns the same color.
● Columns are best when there are no more than seven categories on the horizontal axis. This will help
the viewer clearly see the value for each column.
● Start the value of the y-axis at zero to accurately reflect the full value of the column.
● The spacing between columns should be roughly half the width of a column.
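
The following is a minimal sketch of a column chart that follows these practices, with one accent-colored column and a y-axis starting at zero. The scores and colors are made up:

import matplotlib.pyplot as plt

classes = ["Class A", "Class B", "Class C", "Class D"]
avg_scores = [72, 85, 78, 90]
colors = ["steelblue", "steelblue", "steelblue", "darkorange"]   # highlight Class D

plt.bar(classes, avg_scores, color=colors, width=0.6)
plt.ylim(bottom=0)                        # start the y-axis at zero
plt.xlabel("Math class")
plt.ylabel("Average test score")
plt.title("Average student test scores")
plt.show()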

BAR CHARTS

Bar charts are similar to column charts except they are positioned horizontally. Longer bars indicate larger numbers. They are best used when the names for each data point are long. Some examples include:

● Gross domestic product (GDP) of the top 25 nations.


● Number of cars sold by each sales representative.
● Exam scores for each student in a math class.

Some best practices for bar charts include:

● Label your axes.


● Consider ordering the bars so that the lengths go from longest to shortest. The type of data will most
likely determine whether the longest bar should be on the bottom or the top.
● Fill the bars with a solid color. If you would like to highlight one bar, consider using an accent color and make all the other bars the same color.
● Start the value of the x-axis at zero to accurately reflect the full value of the bar.
● The spacing between bars should be roughly half the width of a bar.

PIE CHARTS

Pie charts are used to show the composition of a static number. Segments represent a percentage of that
number. The total sum of the segments must equal 100%.

Some examples include:

● Annual expenses for a corporation (e.g., rent, administrative, utilities, production)


● A country’s energy sources (e.g., oil, coal, gas, solar, wind)
● Survey results for favorite type of movie (e.g., action, romance, comedy, drama, science fiction)

Some best practices for pie charts include:

● Keep the number of categories to a minimum so that the viewer can differentiate between
segments. After ten or more segments, the slices begin to lose meaning and impact.
● If necessary, consolidate smaller segments into one segment with a label such as “Other” or
“Miscellaneous”.
● Use a different color or gray scale for each segment.
● Order the segments according to size.
● Make sure the value of all segments equals 100%.
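
The following is a minimal sketch of a pie chart that follows these practices, with a small number of segments ordered by size and summing to 100%. The percentages are made up:

import matplotlib.pyplot as plt

sources = ["Coal", "Gas", "Oil", "Solar", "Wind", "Other"]
share = [35, 25, 20, 10, 7, 3]            # percentages, already ordered by size

plt.pie(share, labels=sources, autopct="%1.0f%%", startangle=90)
plt.axis("equal")                         # draw the pie as a circle
plt.title("A country's energy sources")
plt.show()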

SCATTER PLOT

Scatter plots are very popular for correlation visualizations, or when you want to show the distribution of a
large number of data points. Scatter plots are also useful for demonstrating clustering or identifying outliers
in the data.

Some examples include:

● Comparing each country’s life expectancy to its GDP.


● Comparing the daily sales of ice cream to the average outside temperature.
● Comparing the weight to the height of each person.

Some best practices for scatter plots include:

● Label your axes.


● Make sure the data set is large enough to provide visualization for clustering or outliers.
● Start the value of the y-axis at zero to accurately reflect the full range of the data. The scale of the x-axis will depend on the data. For example, age ranges might be labeled on the x-axis.
● If the scatter plot shows a correlation between the x- and y-axis variables, consider adding a trend line.
● Do not use more than two trend lines.

When you need to add a third variable to a scatter plot, consider using a bubble chart. For example, a
scatter plot showing a correlation between life expectancy and GDP can be enhanced by increasing the size
of each data point to represent population.
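
The following is a minimal sketch of a scatter plot extended into a bubble chart, where the marker size encodes a third variable (population). All values are made up:

import matplotlib.pyplot as plt

gdp_per_capita = [5000, 15000, 30000, 45000, 60000]       # x-axis
life_expectancy = [62, 70, 76, 80, 83]                    # y-axis
population_millions = [40, 120, 10, 65, 5]                # encoded as bubble size

plt.scatter(gdp_per_capita, life_expectancy,
            s=[p * 5 for p in population_millions],       # scale sizes for visibility
            alpha=0.6)
plt.xlabel("GDP per capita (USD)")
plt.ylabel("Life expectancy (years)")
plt.title("Life expectancy vs. GDP (bubble size = population)")
plt.show()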

Click here to view Hans Rosling’s TED talk for 2006 to see the power of a bubble chart.

To learn more about other chart types, click here to read Tableau Software's guidelines for choosing the right chart.

FOLIUM LIBRARY

You can also use Python libraries to plot maps. In the first lab, you will import a Python script that calls
methods from the Folium library. Folium combines the strength of Python with the mapping abilities of the
Leaflet.js library. Folium allows you to take your Python data frames and display them on an interactive
Leaflet map.

FOLIUM TILESETS

A tileset is a collection of raster or vector data that can display a map on mobile devices or in a browser.
The Folium library supports a number of different tilesets including OpenStreetMap, Mapbox, and Stamen. By
default, Folium uses the OpenStreetMap tileset. Mapbox and Stamen maps can be specified with the tiles
attribute. However, Mapbox requires a user account to obtain API access tokens.

You can create a quick map by importing the Folium library and specifying a set of coordinates. In the
figure, the coordinates map to Silicon Valley in California. Notice that the OpenStreetMap tileset is used.
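
The following is a minimal sketch of such a quick map. The coordinates roughly correspond to Silicon Valley, and the zoom level is an arbitrary choice:

import folium

silicon_valley = [37.387, -122.057]

m = folium.Map(location=silicon_valley, zoom_start=10, tiles="OpenStreetMap")
m.save("silicon_valley_map.html")    # open the saved HTML file in a browser
# In a Jupyter notebook, evaluating `m` renders the interactive map inline.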

SUMMARY

This chapter began with an explanation of the basics of telling your story with data. Exploratory analysis is a
set of procedures designed to produce descriptive and graphical summaries of data with the notion that
the results may reveal interesting patterns.

Before you can effectively tell a story, you have to know your audience. Who are they? Where are they? Will
some of them need to see a recorded version of your presentation? Identifying the business value for each
audience will point you to the overall goal of your presentation. What do you want the audience to take
away?

Your story is usually meant to persuade the audience to adopt your point of view. The explanation is done
through the presentation of evidence. The evidence you present should be critical to your end goal.
Deductive reasoning uses facts, propositions, or other statements of truth to arrive at a conclusion. In data
analysis, we are often using inductive reasoning. We sample a population, study the sample, and then make
inferences about the population as a whole. When telling your story, make sure that the evidence of your
supporting proposition does not suffer from a logical fallacy.

The next section discusses tools for visualizing your data. Pyplot is a matplotlib extension that includes a
collection of style functions you can use to create and customize a plot. If you find yourself reusing the same
inline code for several plots, you may want to create a custom style sheet.

Plotly is a powerful online tool that you can use to quickly generate beautiful data visualizations. Plotly has a
wealth of tutorials and help pages that you can explore to learn more about this powerful data visualization platform.
The last topic of this section reviews some uses and best practices for the most popular chart types: line,
column, bar, pie, and scatter.

Line charts are one of the most commonly used types of comparison charts. Use line charts when you have
a continuous set of data, the number of data points is high, and you would like to show a trend in the data
over time.

Column charts are positioned vertically. They are probably the most common chart type used when you
want to display the value of a specific data point and compare that value across similar categories.

Bar charts are similar to column charts except they are positioned horizontally. Longer bars indicate larger
numbers.

Pie charts are used to show the composition of a static number. Segments represent a percentage of that
number. The total sum of the segments must equal 100%.

Scatter plots are very popular for correlation visualizations, or when you want to show the distribution of a
large number of data points. Scatter plots are also useful for demonstrating clustering or identifying outliers
in the data.

The final section of this chapter covered preparation for the Internet Meter labs, importing a Python script that calls methods from the Folium library, using Folium tilesets, and modifying and labeling a map.
CHAPTER 6: ARCHITECTURE FOR BIG DATA AND DATA ENGINEERING

EDGE ANALYTICS AND CLOUD ANALYTICS

Transforming data into valuable insights requires computing and storage capacity. Different IoT
architectures have slightly different approaches on where and when the data is processed and stored. For
example, in the architecure Device-Network-Cloud, all the data points collected by the sensors that are
included in the connected device are sent directly to the cloud for storage and processing. This is what
happens with most of the wearables used to track fitness activities. The data collected by the device is
wirelessly sent to the cloud. There, it is transformed using descriptive analytics and presented to the user on
the web profile.

This architectural model is simple but not scalable. When the number of sensors increases along with the number of data points generated, or when processing the data requires a much shorter response time, the data needs to be processed closer to where it is generated. This is where the Device-Gateway-Network-Cloud architecture is used. Depending on the application, data can be processed almost immediately after it is generated, very near the source of its creation, on the gateway or at other intermediate places in the network. This is called fog computing. It is called fog because it is similar to the “cloud” but is closer to the ground. This source area is also known as the edge and, for this reason, this approach is also called edge analytics.

Examples of applications that require fog computing are the networks of sensors that are geographically
distributed, like ground moisture sensors in a vineyard and motion sensors at every intersection in a city.
Fog computing helps to reduce the response time (low latency) and reduce the amount of data that must
be sent to the cloud. For example, fog computing removes redundant data when a sensor variable is not
changing because it is pointless to keep transmitting the same value. Instead, the data is transmitted only
when its value changes.
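
The following is a minimal conceptual sketch of this report-only-on-change idea. The readings and the forward_to_cloud function are hypothetical, for illustration only:

def forward_to_cloud(value):
    print(f"Sending {value} to the cloud")

readings = [21.5, 21.5, 21.5, 21.6, 21.6, 21.7, 21.7, 21.7]   # e.g. temperature samples

last_sent = None
for value in readings:
    if value != last_sent:          # forward only when the value actually changes
        forward_to_cloud(value)
        last_sent = value
# Only 3 of the 8 samples are transmitted, reducing traffic toward the cloud.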

Regardless of the IoT architecture used, most or all of the data will eventually be collected and stored in the
cloud, where potentially infinite computing and storage capacity is available. What is behind the cloud? A
network of data centers that use virtualization technology. The data centers are physical facilities where
thousands (or hundreds of thousands) of servers, and exabytes of storage capacity, are available over
high-speed network connections. Virtualization technology allows one or more virtual machines to be created inside each physical server, and the data analytics process can run in these virtual machines. Cloud platform providers have a network of fault-tolerant data centers. These data centers are the infrastructure that makes the cloud possible.

THE DATA CENTER

Data moves on the network infrastructure. Organizations are now critically dependent on their IT operations.
The ability of the infrastructure to rapidly make resources available directly affects the velocity of data.
Leveraging Big Data to extract information and insight for business requires powerful solutions, such as
those provided by data centers, as shown in the figure.

For example, as organizations evolve, they require increasing amounts of computing power and hard drive
storage space. If left unaddressed, this will have a negative impact on an organization’s ability to provide
vital services. The loss of vital services means lower customer satisfaction, lower revenue, and, in some
situations, loss of property.
Large enterprises typically own a data center to manage the storage and data access needs of the
organization. In these single-tenant data centers, the enterprise is the only customer using the data center
services. However, as the amount of data continues to expand, even large enterprises are expanding their
data storage capacity by utilizing the services of multi-tenant data centers.

Data centers can be used to serve internal IT needs, as explained above. This is sometimes called a private
cloud. Data centers can also offer these same products and services to other companies and organizations.
This is sometimes called a public cloud.

DATA CENTERS AND CLOUD COMPUTING

To help address the four Vs of Big Data: volume, variety, velocity, and veracity, many organizations are
turning to cloud computing. Cloud computing supports a variety of data management issues:

● It enables access to organizational data anywhere and at any time.


● It streamlines the organization’s IT operations by subscribing only to needed services.
● It eliminates or reduces the need for onsite IT equipment, maintenance, and management.
● It reduces cost for equipment, energy, physical plant requirements, and personnel training needs.
● It enables rapid responses to increasing data volume requirements.
● Its “pay-as-you-go” model allows organizations to treat computing and storage expenses as a utility, rather than investing in infrastructure. Capital expenditures are transformed into operating expenditures.

The three main cloud computing services defined by the National Institute of Standards and Technology
(NIST) in their Special Publication 800-145 are:

● SaaS – Software as a Service


● PaaS - Platform as a Service
● IaaS – Infrastructure as a Service

These services are discussed in more detail later in this chapter.

Cloud service providers have extended this model to also provide IT support for each of the cloud
computing services (ITaaS).

Currently, there are over 3,000 data centers in the world that offer general hosting services to organizations.
There are many more data centers that are owned and operated by private industries for their own use.

DATA CENTER STRUCTURE

Data centers are centralized locations containing large amounts of computing and networking equipment.
This equipment is used to collect, store, process, distribute, and provide access to vast amounts of data. Its
main function is to provide business continuity by keeping computing services available whenever and
wherever they are needed.

To provide the necessary level of service to its clients, several factors must be considered when building a data center, as detailed in the figure, including:

● Electrical
● Location
● Network Design
● Security
● Environmental

BENEFITS OF A DATA CENTER

Just about every organization needs its own data center, or needs access to a data center. Some
organizations build and maintain their own data centers in-house. Other organizations rent servers at
co-location facilities (colos). There are others that use public, cloud-based services. Amazon Web Services,
Microsoft Azure, Rackspace, and Google are examples of companies that provide public cloud services.

Due to the operational complexity of data centers, very few organizations manage their own data center
facility. It would be very difficult and expensive for a small or mid-sized company to build out their own
space with all of the features that a colo data center provides. Therefore, many organizations lease space
from specialized service provider-owned data centers to host their systems. Click here to learn about Cisco
Powered services.

The figure identifies the benefits of leasing space at a data center.

ISSUES IN DATA CENTER SECURITY

Data centers typically deal with sensitive or proprietary information; therefore, these sites must be secured
physically and digitally.

Physical security can be divided into:

● Outside perimeter security - This can include on-premise security officers, fences, gates, continuous
video surveillance, and security breach alarms, as shown in the figure.

● Inside perimeter security - This can include continuous video surveillance, electronic motion
detectors, security traps, and biometrics access and exit sensors.

Physical security is not only implemented to keep unwanted physical and digital intruders out, but is also
needed to protect people and equipment. For example, fire alarms, sprinklers, seismically-braced server
racks, and redundant HVAC and UPS systems are in place to protect people and equipment.

Physical security is different than network security, which includes firewalls and other methods designed to
keep electronic intruders and hackers out. Data centers are prone to the same threats as enterprise
networks. A number of security-related issues exist in the data center:

● Instant On - In a data center, there are virtual machines (VMs) that are used only occasionally. When a VM that has not been used for a period of time is brought online, it may have outdated security policies that deviate from the baseline security and can introduce security vulnerabilities.

● Hyperjacking - An attacker could hijack a VM hypervisor and then use it as a launch point to attack
other devices on the data center network.

● Antivirus Storms - This happens when all VMs attempt to download antivirus data files at the same
time.

Cisco Cloudlock is a cloud access security broker (CASB) and cloud cybersecurity platform. It protects
users, data, and apps across SaaS, PaaS, and IaaS.
User Security

Cloudlock uses advanced machine learning algorithms to detect anomalies based on multiple factors. It
also identifies activities outside whitelisted countries and detects actions that seem to take place at
impossible speeds across distances. With User and Entity Behavior Analytics (UEBA), Cisco Cloudlock
detects suspicious activity across SaaS, PaaS, IaaS and IDaaS platforms. By establishing a behavioral
baseline for each individual user and continuously monitoring user activity, Cisco Cloudlock detects
potential anomalies that suggest malicious behavior. Thresholds can be established in centralized policies
and alerts can be sent to security operations in real time.

Data Security

Cloudlock’s data loss prevention (DLP) technology continuously monitors cloud environments to detect and
secure sensitive information. It provides multiple out-of-the-box policies, as well as custom policies that
can be fine-tuned. Cisco Cloudlock protects organizations against data breaches in any cloud environment
and app through a highly-configurable data loss prevention engine. With a wide range of automated,
policy-driven response actions such as file-level encryption, quarantine, and end-user notifications, Cisco
Cloudlock provides exceptional coverage of cloud traffic, including on- and off-network, programmatic, and
user-driven communications by managed and unmanaged users and devices, retroactively and in
real-time.

App Security

The Cloudlock Apps Firewall discovers and controls cloud apps connected to the corporate environment. A
crowd-sourced Community Trust Rating for individual apps is available. Administrators can ban or whitelist these apps based on risk.

WHAT IS VIRTUALIZATION?

Operating systems (OSs) separate the applications from the hardware. OSs create an “abstraction” of the
details of the hardware resources to the application. Virtualization separates the OS from the hardware.

Cloud providers offer services that can dynamically provision servers as required. Server virtualization takes
advantage of idle resources on a physical machine and consolidates several virtual servers on a single
machine. This also allows for multiple operating systems to exist on a single hardware platform.

For example, in the figure, the original eight dedicated servers have been consolidated into two servers
using hypervisors to support multiple virtual instances of the operating systems. The hypervisor is a
program, firmware, or hardware that adds an abstraction layer on top of the real physical hardware. The
abstraction layer is used to create virtual machines which have access to all the hardware of the physical
machine such as CPUs, memory, disk controllers, and NICs. Each of these virtual machines runs a complete
and separate operating system. With virtualization, enterprises can now consolidate the number of servers
they own and operate. For example, it is not uncommon for 100 physical servers to be consolidated as
virtual machines on top of 10 physical servers using hypervisors.

The use of virtualization normally includes redundancy to protect from a single point of failure. Redundancy
can be implemented in different ways. If the hypervisor fails, the VM can be restarted on another hypervisor.
Also, the same VM can be run on two hypervisors concurrently, copying the RAM and CPU instructions
between them. If one hypervisor fails, the VM continues running on the other hypervisor.
ABSTRACTION LAYERS

To help explain how virtualization works, it is helpful to use layers of abstraction in computer architectures.
Abstraction layers are also used by the OSI reference model to help describe network protocols. A computer
system consists of the following abstraction layers, as illustrated in Figure 1:

● Applications
● OS
● Hardware

At each of these layers of abstraction, some type of programming code is used as an interface between the
layer below and the layer above. For example, the programming language C is often used to program the
firmware that accesses the hardware.

An example of virtualization is shown in Figure 2. A hypervisor is installed between the firmware and the OS.
The hypervisor can support multiple instances of operating systems.

HYPERVISORS

A hypervisor is software that creates and runs VM instances. The computer, on which a hypervisor is
supporting one or more VMs, is a host machine. There are two types of hypervisors:

● Type 1 Hypervisor – This is also called the “bare metal” approach because the hypervisor is installed
directly on the hardware, as shown in Figure 1. Type 1 hypervisors are usually used on enterprise
servers.

● Type 2 Hypervisor – This is also called the “hosted” approach. The Type 2 hypervisor adds an extra
layer of abstraction. This is because the hypervisor is an application running on the physical host’s
OS, and additional OS instances are installed in the hypervisor, as shown in Figure 2.

CONTAINERS

Hypervisors allow for each virtual machine to have its own operating system while sharing the same
hardware. This configuration is wasteful if the operating systems used within the virtual machines are the
same as the operating system running on the host computer. Containers address this problem.

Containers are a specialized “virtual area” where applications can run independently of each other while
sharing the same OS and hardware. From an application standpoint, it is the only application running on the
computer. By sharing the host operating system, most of the software resources are reused, optimizing
operation.

The figure shows the structure of container technology compared to the traditional hypervisor. The
container only needs the necessary portion of an operating system, system resources, and any programs
and libraries required to run a program. This allows a server to support many more containers than it could
virtual machines at any one time.

Just because more programs can be run on a server using container technology does not mean that the traditional virtual machine is dead. Containers require the operating system used by the applications to be the same as the host computer's OS. If there is a need for multiple operating systems, hypervisors must be used.
Deciding whether to use a virtual machine or containers depends on what you wish to accomplish. While
the same OS requirement may seem like a limitation, production environment applications usually require
the same OS.

SaaS, PaaS, and IaaS

Data centers can also use virtualization to cut costs and expand offerings as cloud providers. Some of these
offerings are SaaS, PaaS, and IaaS, as shown in the figure.

Amazon Web Services (AWS) is a cloud provider, offering on-demand computing resources and services in the cloud. This means that you can operate one or multiple virtual servers on AWS, on demand, when you need them and for as long as you need them. You can then log on to, configure, secure, and run them as if they were servers in your office. With AWS you can store data, host a website or web app, host a Learning Management System (LMS), and process Big Data generated by the IoT. This is an example of IaaS, where the creation of a computing infrastructure is transformed into the purchase of a service.

Amazon machine learning enables developers to build applications for fraud detection, demand
forecasting, targeted marketing, and click prediction. Amazon machine learning algorithms create Machine
Learning (ML) models by finding patterns in existing data. These models then process new data and
generate predictions for your application. For example, one possible application of an ML model would be to
predict how likely a customer is to purchase a particular product based on their past behavior.

There are many cloud providers. In North America they include Microsoft Azure and Google Cloud. In Europe,
some cloud providers are Aruba Cloud, UpCloud, and CenturyLink.

VIRTUALIZED DATA STORAGE

Storage virtualization combines physical storage from multiple network storage devices into what appears
to be a single storage device. This storage “device” is managed from a central console. Storage
virtualization makes backup, archiving, and recovery simpler and faster by disguising the complexity of a
storage area network (SAN). Storage virtualization is implemented using software, or with hardware and
software hybrid appliances. Storage virtualization benefits include increased storage capacity, automated
management, reduced downtime, and update simplification.

VIRTUALIZED NETWORK

Network virtualization (NV) is the creation of virtual networks within a virtualized infrastructure. The process
combines hardware and software network resources and network functionality into a single,
software-based administrative entity. This entity is the virtual network. Network virtualization combines
network resources by separating bandwidth into channels. Each channel is independent of the others, and
each is assigned to a specified server or device. Each channel is independently secured. Subscribers to a
network each have shared access to all the resources on that network.

To virtualize the network, the control plane function is removed from each device and is performed by a
centralized controller, as shown in the figure. The centralized controller communicates control plane
functions to each device. Each device can now focus on forwarding data while the centralized controller
manages data flow, increases security, and provides service chaining.

Data centers are frequently multi-tenancy (many different customers) environments. NV can provide
separate virtual networks to different customers within a virtual environment. This virtual network is
completely separate from other network resources. Traffic can be separated into a zone, or container, to
ensure that it does not mix with other resources.

WHAT IS DATA ENGINEERING?

Data engineering typically involves a business-related, computer-based information system where information (data) is captured or generated, processed, stored, distributed, and analyzed.

The ability to capture data and analyze it in a meaningful way is typically provided by a database and a database management system (DBMS). Data engineering and data analysis are useful for any business or organization that wants to direct its resources based on meaningful information and statistics. Today, technological advances in computer processing power, storage capability, and Internet speed, together with the development of cloud technologies and virtualization, have given us the ability to collect and process vast amounts of data at incredible speeds.

Evolution of Database Management Systems

Early databases used tape storage systems to store data sequentially. Data could be accessed by fast-forwarding or rewinding through the tape. The invention of spinning magnetic disks for storage allowed individual records to be accessed directly. At that time, however, accessing a database required writing an application that defined exactly how the data would be handled. One of the first major breakthroughs in database technology came from separating the logic for handling the database from the database itself. This was the creation of the Database Management System (DBMS).

Relational Databases

The relational database emerged around the same time as the personal computer revolution. The relational database and the Structured Query Language (SQL) are the foundation of the relational database management system (RDBMS). The relational database was inspired by a paper published in 1970 by the programmer and mathematician Edgar Codd. At its core, the relational model separates how a given set of data is presented to the user from how it is stored on disk or in memory. The relational database differs from the navigational model by creating the ability to search for data by content, rather than by following links. Instead of records being stored in a linked list, a relational database organizes data into columns (attributes), tables (relations), and rows (tuples), where each table can be used for a different type of information.

SQL is the language that was developed to collect information from relational databases through a relational database management system (RDBMS). Today, relational database systems such as MySQL, Microsoft SQL Server, Oracle, and IBM DB2 are the most popular database management systems.

In an RDBMS, there can be multiple users with many database transactions. The Atomic, Consistent, Isolated,
and Durable (ACID) transaction model specifies how database transactions maintain data integrity and
survive failures.
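The minimal Python sketch below illustrates the ACID idea using the built-in sqlite3 module; the table, account names, and amounts are illustrative only. Either both updates are committed, or neither is.

# A minimal sketch of an ACID transaction with Python's standard sqlite3 module.
import sqlite3

conn = sqlite3.connect("bank.db")
conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE name = 'bob'")
except sqlite3.Error:
    print("Transfer failed; no partial update was applied")

conn.close()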

Non-relational Databases

Beginning in the 1990s, a new type of non-relational database was developed from the observation that the emerging object-oriented programming paradigm could also serve as a model for databases. The result was the object-oriented database management system (OODBMS). OODBMSs were not widely adopted and failed to replace relational databases.
In the early 2000s, the emergence of Web 2.0, e-commerce, and companies like Google made it clear that relational databases were not able to meet the volume and velocity of search requests over the web. To meet this demand, Google developed the Google File System (GFS), the distributed parallel processing algorithm MapReduce, and the BigTable distributed NoSQL database. In 2004, Jeffrey Dean and Sanjay Ghemawat of Google published a seminal paper titled “MapReduce: Simplified Data Processing on Large Clusters” that changed forever the way large datasets are stored and processed. That paper inspired two programmers, Doug Cutting and Mike Cafarella, to create Apache Hadoop. The MapReduce approach then formed the basis for the development of Yahoo’s Hadoop ecosystem and HBase database, as well as Amazon’s Dynamo key-value NoSQL database. Hadoop, in turn, helped promote the development of other Big Data technologies and applications.

NoSQL is a large family of databases that do not rely on the relational approach of linked tables. NoSQL databases may use a key-value store approach instead of a relational, table-based approach. Other NoSQL databases store data as structured documents in XML or JSON formats. NoSQL databases can be much faster than relational databases and can import unstructured data. NoSQL databases are designed to scale horizontally, which means that storage and management capacity can be increased simply by adding machines to the cluster. The most popular NoSQL systems include MongoDB, Couchbase, Riak, Memcached, Redis, CouchDB, Hazelcast, Apache Cassandra, HBase, and Dynamo, most of which are open-source software products.
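The short Python sketch below illustrates, without any particular NoSQL product, the difference between the two storage models mentioned above: an opaque key-value pair and a structured JSON document. All keys and field names are invented for the example.

# A minimal sketch contrasting a key-value entry with a JSON document.
import json

# Key-value store: an opaque value looked up by key
key_value_store = {}
key_value_store["session:42"] = "user=maria;cart=3 items"

# Document store: the value is a structured, queryable JSON document
document = {
    "user": "maria",
    "cart": [
        {"sku": "A100", "qty": 2},
        {"sku": "B205", "qty": 1},
    ],
}
document_store = {"order:42": json.dumps(document)}

# The document can be parsed and inspected field by field, unlike the opaque value
print(json.loads(document_store["order:42"])["cart"][0]["sku"])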

Big Data Today

All of these technologies emerged as solutions to the Big Data problem: data so big, fast, or diverse that it cannot be managed by a single computer, no matter how large that computer is. For this reason, it is common to call these software solutions “Big Data technologies”. In reality, Big Data problems cannot be solved by any single technology; solutions must incorporate both new and old technologies.

THE ROLE OF THE DATA ENGINEER

With the emergence of the IoT and Big Data, there are new categories of jobs, and adjustments to existing
jobs. If you are interested in pursuing a career in Big Data Analytics, you should know what some of these
jobs are and how they are different.

Business Analyst

The digitalization of business makes a lot of business data available. To get at the data you need, it is critical
to be able to ask the right question, in the right way. A business analyst is the person who can study a
business or an industry and then formulate a specific question. Business analysts are data experts who work
with company stakeholders to determine the issue of concern. That question is then reformulated into a
specific data problem. Business analysts then create different types of business intelligence reports for company stakeholders. They usually work with Microsoft Excel, SQL, and Business Intelligence software designed for specific industry verticals.

Data Analyst

Data analysts query and process data, provide reports, summarize and visualize data. They leverage
existing tools and methods to solve a problem. They help people, such as business analysts, to understand
specific queries with ad-hoc reports and charts. Data analysts need to understand basic statistical
principles, the process for cleaning different types of data, data visualization, and exploratory data analysis.
Some tools and applications that help data analysts do their jobs are SAS, Rapid Miner, and computer
languages like R or Python.

Data Scientist

A data scientist takes raw data and turns it into meaningful information. Data scientists apply statistics,
machine learning and analytic approaches to answer critical business questions.

Data science is an existing field that has expanded because of the IoT and Big Data. Like data analysts, data scientists must have data analysis skills. But data scientists are also expected to be programmers and Big Data algorithm designers. Data scientists must interpret and deliver the results of their findings through visualization techniques, by building data science applications, or by narrating compelling stories about the solutions to their data (business) problems. They work with data sets of different sizes and shapes, and run algorithms
on large data sets. Data scientists must be current with the latest technologies. They must know computer
science fundamentals and programming, including experience with languages and database (big/small)
technologies. Some of the tools that help data scientists do their work are Python, R, Scala, Apache Spark,
Hadoop, data mining tools and algorithms, machine learning, and statistics.

Data Engineer

None of the three jobs detailed above can exist without the data engineer. The data engineer creates the
infrastructure that supports Big Data. They design and build the platform on which all of this data is stored
and processed. Data engineers also manage all this data. They ensure accessibility and availability for data
scientists and data analysts.

Data engineers may also integrate data from disparate sources and even perform some data cleaning.
However, because data engineers primarily design the Big Data infrastructure for their company, they are
not generally required to know any machine learning or analytics. Some of the tools and applications that
data engineers routinely use are Hadoop, MapReduce, Hive, Pig, MySQL, MongoDB, Cassandra, data streaming tools, NoSQL, SQL, and general-purpose programming languages.

In some environments, there is likely to be some overlap between data analysts, data scientists, and even
data engineers. As the IoT grows and Big Data becomes even more pervasive, these job descriptions may
also change. There is no denying that it is a very exciting time to be in Big Data.

SCALABILITY WITH BIG DATA

In the context of Big Data, scalability means designing a solution that can meet the exponential growth
demands of companies like Google and Facebook. This is also true for any company, organization, or
government agency that needs to store and analyze unstructured and structured data from vast and
varying data sources. Scalability in this context means the ability to scale both data storage as well as data
processing. The Hadoop Big Data ecosystem follows a scalability approach that is similar to Google’s,
shown in the figure.

Storage

Hadoop uses a distributed file system that expands by simply adding more computers and hard disk drives.
Storage is increased by adding servers that are run on commodity (off the shelf) hardware. The ability to
use commodity hardware instead of expensive SANs or NAS storage systems keeps costs down.

Processing

When Google needs additional processing power, it uses the Google Modular Data Center, a custom-designed solution that adds an additional 1,000 CPUs networked, powered, and cooled inside a shipping container. The 1,000 CPUs work together in a cluster thanks to Google’s MapReduce technology. The Hadoop Big Data ecosystem is also based on MapReduce distributed processing. Adding new servers or nodes to the Hadoop cluster not only adds disk storage, it also provides additional CPU processing power.

Database Management

Companies that deal with Big Data or massive amounts of web transactions also have to meet the
challenge of databases that can scale. Traditional relational databases were designed for a traditional
client-server architecture. They were not designed to be distributed across multiple database servers and
they were not designed to work with unstructured data or extremely large objects and rows of data.
Google’s BigTable distributed database, Amazon’s Dynamo, and Hadoop’s HBase are key-value store,
non-relational databases that were designed to be distributed across multiple servers and scale as new
servers are added.

AVAILABILITY WITH BIG DATA

Maintaining availability is the primary concern for many companies working with Big Data. A website that
cannot respond within 3 seconds will lose visitors. The explosion of web-based e-commerce means that it is
extremely costly for a company like Amazon to not be able to process thousands of transactions around the
world immediately.

Load Balancing

Companies can improve website availability by deploying load balancing web servers and load balancing
DNS servers. Duplicate web servers can be deployed in data centers in different locations around the world
to improve web response time.

Distributed Databases

In Big Data, availability refers to the velocity or speed at which extremely large volumes of data can be
searched and processed. Distributed computing including processing, storage and database management
improves speed and availability.

Memcaching

Memcaching servers can offload the demand on database servers by keeping frequently requested data
available in memory for fast access.
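The sketch below shows the caching idea in plain Python, with the database call simulated; a production deployment would use an actual memcached client library, and the key names and timeout here are illustrative.

# A minimal sketch of the memcaching idea: serve repeated requests from memory.
import time

cache = {}            # in-memory key -> (value, stored_at)
TTL_SECONDS = 60      # how long a cached entry stays valid

def slow_database_lookup(product_id):
    time.sleep(0.5)                      # simulate an expensive database query
    return {"id": product_id, "name": "Widget", "price": 9.99}

def get_product(product_id):
    key = f"product:{product_id}"
    entry = cache.get(key)
    if entry and time.time() - entry[1] < TTL_SECONDS:
        return entry[0]                  # cache hit: no database round trip
    value = slow_database_lookup(product_id)
    cache[key] = (value, time.time())    # store the result for later requests
    return value

get_product(7)   # first call hits the "database"
get_product(7)   # second call is served from memory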

Sharding

The relational database was designed to run on a single server, not distributed across many servers. To
adapt to the need for distributed computing, a relational database can be sharded or partitioned across
multiple servers. Sharding adds considerable complexity and restricts the relational database to row-level access on one shard at a time.
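The following minimal Python sketch illustrates one common sharding approach, hash-based partitioning, in which each row is routed to a server based on a hash of its key. The server names are invented for the example.

# A minimal sketch of hash-based sharding across several database servers.
import hashlib

SHARDS = ["db-server-0", "db-server-1", "db-server-2", "db-server-3"]

def shard_for(customer_id):
    # Hash the key and map it onto one of the shards
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for cid in (1001, 1002, 1003, 1004):
    print(cid, "->", shard_for(cid))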

FAULT TOLERANCE WITH BIG DATA


Fault tolerance is similar to availability in that a company’s business needs to be constantly online and
available 24/7. Large, web-based search companies like Google and Yahoo, social network companies like
Facebook and Twitter, and e-commerce sites like Amazon cannot afford any downtime. Big Data
ecosystems like Hadoop achieve fault tolerance through multi-server redundancy.

HOW HADOOP WORKS: THE HDFS

In the early 2000s, Google realized that it would need a new database management system in order to catalogue all of the data on the World Wide Web. No single database or server would be able to handle the massive amount of data involved in such a large task. To meet this challenge, Google would need a distributed computing system made up of a cluster of servers, using a single file system spread across multiple servers, where each server shares a piece of the processing load.

The Hadoop Distributed File System (HDFS) is the filesystem where Hadoop stores data. HDFS does not replace the Linux filesystem on individual servers, but instead sits on top of the cluster of servers as a single filesystem spanning the entire cluster. HDFS stores data in 64 MB chunks across a minimum of three DataNode servers, as shown in the figure.

HDFS manages information across the cluster from a centralized coordination server called the NameNode. The NameNode server tracks what data resides on the various DataNode servers. When data is brought into the system, the NameNode divides it into 64 MB chunks, which are then duplicated across three or more DataNodes, depending on the configuration. This redundancy provides fault tolerance similar to a mirrored RAID array: if one DataNode fails, it can be replaced, and the filesystem and data are restored from the duplicate DataNodes.
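The following Python sketch is only a simulation of the placement logic just described, not the real HDFS client: a file is split into 64 MB blocks and each block is assigned to three DataNodes. The DataNode names are illustrative.

# A minimal simulation of splitting a file into blocks and placing replicas.
import itertools

BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB block size, as described above
REPLICATION = 3
DATANODES = ["datanode-1", "datanode-2", "datanode-3", "datanode-4", "datanode-5"]

def place_blocks(file_size_bytes):
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
    rotation = itertools.cycle(DATANODES)
    placements = {}
    for block_id in range(num_blocks):
        # Assign each block to three different DataNodes for redundancy
        placements[block_id] = [next(rotation) for _ in range(REPLICATION)]
    return placements

# A 200 MB file becomes 4 blocks, each stored on 3 DataNodes
for block, nodes in place_blocks(200 * 1024 * 1024).items():
    print("block", block, "->", nodes)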

HOW HADOOP WORKS: MAPREDUCE

MapReduce is a distributed processing framework for parallelizing algorithms across large numbers of
commodity servers, with the capability of handling massive data sets. MapReduce divides the data
processing into two phases: a mapping phase, in which data is broken up into chunks that can be
processed by separate threads, even running on separate machines; and a reduce phase, which combines
the output from the multiple mappers into a final result, as shown in the figure.
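The following single-machine Python sketch mimics the two phases on a classic word-count example; on a real cluster, many mappers and reducers would run in parallel on separate machines.

# A minimal, single-machine sketch of the map and reduce phases (word count).
from collections import defaultdict

documents = ["big data needs big storage", "data drives decisions"]

# Map phase: each input chunk is turned into (key, value) pairs
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group all values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine the values for each key into a final result
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 2, 'data': 2, ...}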

THE EVOLUTION OF HADOOP

Hadoop is not a single application but an ecosystem of applications all working together. Hadoop v2.0
incorporates the following core technologies:

● HDFS – distributed file system
● HBase – distributed database
● MapReduce – distributed processing
● Hive – SQL-like query interface
● YARN – resource negotiator

YARN negotiates resources for multiple processing engines:

● Spark – for running processes in memory
● Tez – for running batch processes

There are additional client applications which can access Hadoop:

● Pig – a scripting interface
● Mahout – a machine learning interface

THE PROBLEM OF DATA INGESTION

In its most basic form, the big data pipeline consists of three components: data ingestion, data storage, and
data processing (compute), as shown in the figure. For each of these components, there are many software
platforms that are used to complete each task. Depending on the type of data, the type of ingestion, and
the compute requirements, the components of each data pipeline are unique, often put together and
adjusted to work with one another.

The first stage of the big data pipeline, data ingestion, offers many challenges to overcome. Big data often
comes from many different sources. Much of this data is streaming and must be ingested in real-time.
Scalability is also a challenge. As more devices become connected, more data must be ingested. The
quality of this data may also be a challenge. The data may be structured, unstructured, or have very
complex formats.

There are many different tools available to collect and move large amounts of data from multiple sources
to a data store. The tool we will use as an example in the data pipeline is Kafka.

WHAT IS KAFKA?

To ingest data in real-time, a distributed streaming platform such as Kafka must be used. These are some
of the characteristics of Kafka:

● Streams of records are published and subscribed to (pub-sub), in a similar way as an enterprise
messaging system.

● Streams of records are fault-tolerant because of distributed storage.

● Streams of records are processed as they happen, in real-time.

Kafka is used to pipe real-time streaming data between different systems and applications. It is also used
to transform and react to data streams through real-time applications. Kafka was developed by LinkedIn,
but became open source software in 2011. Cisco Systems, Netflix, and eBay are just a few of the businesses
that have used, or currently use, Kafka. These are some important core concepts about Kafka:

● Kafka runs on one or more servers as a cluster. Servers in the cluster are known as brokers.
● The cluster stores streams of records in groupings called topics.
● Each record has a key, a value, and a timestamp.

Kafka has four core APIs, as shown in the figure:

● Producers – This API is where an application publishes a stream of records to Kafka topics.
● Stream Processors – This API is where an application consumes input streams from topics and can produce output streams to topics.
● Connectors – This API is where Kafka topics are connected to existing systems and applications.
● Consumers – This API is where an application subscribes to topics and processes the stream of records they contain.

Kafka acts as a messaging system to centralize communication between all of the producers of data and
all the consumers of data in a system. Kafka has a distributed design, allowing it to handle massive
amounts of data and scale easily. It is also resilient to hardware, software, and network failures.
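The minimal Python sketch below assumes the third-party kafka-python client and a broker reachable at localhost:9092; the topic name and record fields are illustrative. It publishes one record and then reads it back.

# A minimal sketch, assuming the kafka-python package and a local broker.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Publish a record to a topic (the broker adds the timestamp)
producer.send("sensor-readings", {"sensor_id": 17, "temp_c": 21.4})
producer.flush()

consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for record in consumer:
    print(record.key, record.value, record.timestamp)
    break   # stop after one record for this example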

WHAT IS THE ADVANTAGE OF KAFKA VS OTHER APPROACHES?

Traditionally, transferring messages consists of two different methods, as shown in the figure:

● Publish-and-Subscribe – The requested messages are broadcast to all of the consumers.

● Point-to-Point – Multiple consumers read messages from the server. Each of these messages goes
to one of the consumers.

Kafka is much more than just a messaging server. Kafka works in a similar fashion as a distributed
database. The messages written to Kafka are replicated to many servers and written to disk. The data store
can last forever if so desired. Because of this distributed design, Kafka has high availability, supports
automatic recovery, and is highly resilient to network failures.

What makes Kafka different from traditional message brokers is its use of transaction logs. Each topic is made up of a set of logs called partitions. Producers append records to these logs, and consumers can read from them whenever they need to. The topics are replicated to many brokers to achieve fault tolerance. With Kafka, there is no single point of failure.

Kafka is able to manage very high throughput. This is because many of the responsibilities are placed on
the producers and the consumers. This keeps the brokers lightweight. This makes Kafka a desirable tool for
many IoT platforms. With the enormous amount of data coming in from sensors and other devices in
real-time, and the desire to analyze this data without losing information, Kafka can regulate this ingestion
and provide the data to multiple consumers at the same time.

WHAT IS CASSANDRA: NOSQL DB?

Cassandra is an open-source NoSQL distributed database management system. Remember that a NoSQL
database is schema-free and does not use traditional methods of storing and retrieving data like those
found in relational databases. A NoSQL database has a very simple design, supports horizontal scaling, and
provides more control of availability. While Hadoop is primarily used on projects involving data lakes and
data warehousing, Cassandra is used for very high-speed data projects. Cassandra is also completely
distributed, able to be deployed across the globe, if needed. It provides a decentralized database with no
single point of failure. Cassandra is also highly available.

Cassandra was originally developed at Facebook and became open source in 2008. One of the things that makes Cassandra so fast is that it uses sequential reads and writes. This makes it ideal for use with time series data, found throughout IoT solutions. Also, instead of modifying files in place when data is added to or removed from a file, Cassandra creates a new file and deletes the old file(s). This keeps the data sequential on disk, retaining high-speed reads.
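The minimal Python sketch below assumes the DataStax cassandra-driver package and a reachable Cassandra node; the keyspace, table, and column names are invented to model the kind of time-series sensor data described above.

# A minimal sketch, assuming the cassandra-driver package and a running node.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# replication_factor of 3 assumes a cluster of at least three nodes
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS iot
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
# Time-series table: rows are clustered by time for fast sequential reads
session.execute("""
    CREATE TABLE IF NOT EXISTS iot.sensor_readings (
        sensor_id int,
        reading_time timestamp,
        temperature double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")
session.execute(
    "INSERT INTO iot.sensor_readings (sensor_id, reading_time, temperature) VALUES (%s, %s, %s)",
    (17, datetime.utcnow(), 21.4),
)
rows = session.execute("SELECT * FROM iot.sensor_readings WHERE sensor_id = 17 LIMIT 10")
for row in rows:
    print(row.reading_time, row.temperature)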

These are some of the most important features of Cassandra:

● Distributes data easily – Data can be replicated across multiple data centers.

● Supports transactions – Supports Atomicity, Consistency, Isolation, and Durability (ACID), the four primary attributes that a transaction manager ensures for any transaction.
● Elastic scalability – When more data is required or more customers are added, more hardware can
be added to scale with the need.

● Fast linear scaling – As the number of nodes in a cluster increases, throughput increases while maintaining very fast response times.

● Very fast writes – Cassandra performs fast writes without sacrificing read speed, even while storing hundreds of terabytes of data, and it does so on standard hardware rather than top-of-the-line machines.

● Always on – Many applications are business-critical, and with no single point of failure, Cassandra remains available even through hardware and/or network failures.

● Data storage flexibility – Unstructured, structured, and semi-structured data are all supported.

WHAT ARE THE ADVANTAGES OF CASSANDRA VS HADOOP FOR STORAGE?

The Apache Hadoop project defines HDFS as “the primary storage system used by Hadoop applications”
that enables “reliable, extremely rapid computations.” As you have learned, HDFS uses a master-slave
architecture. The primary NameNode in a cluster regulates the file system and data access by clients. There are also DataNodes in the cluster, often one per physical machine, that handle the attached storage. The data is
stored in files, divided into blocks, and spread across the DataNodes. HDFS copies these blocks onto two
additional servers by default. The fact that HDFS has this type of automatic redundancy makes it a popular
choice to use for data warehouse batch analytics.

Cassandra uses the Cassandra File System (CFS). CFS is not a master-slave architecture like HDFS. Every
node in the cluster is the same, a peer-to-peer implementation. Clusters store real-time data and analytic
operations can be performed on that data. Cassandra’s built-in replication copies the data among all
real-time, analytic, and search nodes.

With the CFS, analytic metadata is stored in a keyspace, much like a database in an RDBMS. There are two
column families, similar to RDBMS tables, which contain the data. The data in these column families is
replicated across the cluster for fault tolerance and data protection. The two column families are similar to
the two primary HDFS services. The inode column family takes the place of the HDFS NameNode service. The information in this column family includes the user, parent path, filename, and the block IDs that make up the file. The sblocks column family takes the place of the HDFS DataNode service. This column family stores
the contents of any file. The CFS data model is shown in the figure.

These are some benefits of using CFS over HDFS:

● Better availability – No shared storage solution is needed. The more nodes and clusters there are, the
better availability there will be. This can also be improved by increasing the replication factor which
determines how many nodes will receive replicated data.

● Basic hardware support – No special servers are needed and no special network devices are needed
for CFS.

● Automatic failover – As with availability, failover is automatic because of replication. Nodes, or entire
clusters, even entire data centers can fail and the data on the nodes and clusters will remain
available at other locations.
● Data integration – All data that is written to Cassandra is replicated to both analytics and search
nodes. This allows analytic, enterprise search, and real-time jobs to be completed simultaneously
without affecting each other.

● Easier deployment – Clusters are easy to set up and can be running in just a few minutes. There are
no complicated storage requirements and no master-slave configuration.

● Supports multiple data centers – CFS can run a single database across multiple data centers. Tools
are available to allow each data center to have a local copy of all the data in the database. Analytic
jobs can be run across multiple data centers at the same time.

HDFS is an excellent choice when a low-cost storage solution is needed for a Hadoop application with a focus on data warehousing. CFS is a better fit when analytics must run directly on real-time data from large applications that integrate with the database and DBMS.

THE PROBLEM OF COMPUTING FUNCTION

Perhaps the largest challenge facing Big Data computing is the size of the data sets being used in many
different fields. For example, genetic sequencing (see figure) has become a huge focus of the scientific
community in recent years. This is mainly because high-performance computing (HPC), combined with advances in sequencing technology, has made it possible to sequence the whole human genome in as little as 26 hours and for around $1,000. A sequence of the entire human genome is about 200 GB of data and requires enormous amounts of computing power to work with.

Eventually, the amount of data and the desired analysis of it will exceed the capacity of even the most powerful HPC system. When this happens, the HPC system must be either scaled up, by adding more processors and memory to a single computer, or scaled out, by adding more computers to a cluster and connecting them with high-speed links.
When the volume of processing data is huge, frequent data movement can increase latency considerably.
It is desirable to have a compute system that works to overcome the limitations of the storage system to
keep latency to a minimum.

Big Data contains big value. The analytics performed on Big Data can provide new opportunities and
unforeseen trends, providing a better view of customers and the market in general. Accurate customer
analytics, fraud detection, and risk analysis all benefit from Big Data analytics. These complex computations
require not only large stores of data but low-latency, high-throughput stream processing. This further
increases the size and difficulty of HPC analytics.

To reduce latency from frequent data I/O operations, computations can be moved to the server where the
data is located. Multiple small tasks are performed in many locations to distribute the load. The MapReduce
programming model can reduce much of the I/O overhead and network bottleneck to some degree, but
this is not the main purpose of MapReduce.

Tail latency also contributes to the Big Data problem. Big Data computation is split so that specific computations execute on different nodes. When one node is slower than the others, the response time increases, because the results of the overall job must wait for that node to complete its part. Tail latency is a significant issue in large-scale data centers, with a significant impact on time-sensitive computing on data streams, for example.

Job scheduling can also significantly increase delay. Often, large jobs run while other, smaller jobs are being performed, and the system must schedule these different job types efficiently. Smaller jobs that must wait for large ones to complete incur delay, which can be a real problem when some jobs need to be processed in real time.

There are many different tools available to compute Big Data. The tool we will use as an example in the data
pipeline is Spark.

WHAT ARE THE ADVANTAGES OF SPARK VS MAPREDUCE?

Spark is able to run directly on top of a Hadoop instance, using HDFS for storage and YARN for cluster management, but Spark does not need Hadoop at all. Other storage solutions can be used, such as CFS or AWS S3, and other cluster managers can be used, such as Mesos. Spark is a very flexible platform in that it supports many different technologies and programming languages. This is helpful because there are so many different needs when it comes to Big Data solutions and analysis across many different fields.
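The minimal PySpark sketch below assumes the pyspark package and runs locally, without any Hadoop cluster, to show Spark's in-memory, parallel processing on a tiny word-count example.

# A minimal local PySpark sketch: parallel word count without a Hadoop cluster.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").master("local[*]").getOrCreate()

lines = spark.sparkContext.parallelize([
    "big data needs big storage",
    "spark keeps intermediate data in memory",
])

counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.collect())
spark.stop()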

The approach to a Big Data solution is to pick the right tools for the job. Sometimes a blend of Spark and MapReduce is the best solution to a specific job. Using Hadoop with MapReduce is still viable for batch processing, for applications that use HDFS exclusively, when personnel are already experts with these tools, or when existing applications that use the technology are already in production.

Spark has become very popular because of its performance, ease of administration, simplicity, and the fact
that applications can be created more quickly when using it. These are some reasons to use Spark instead
of MapReduce when creating a big data solution:

● Streaming data – Spark is very capable of dealing with enormous amounts of real-time data. The
data may come from mobile devices, social networking, or IoT sensors, for example.

● Heterogeneous data – Many Big Data solutions acquire data from many different sources. Because
of the flexibility of Spark and its ability to transcend these different silos of data, it is a good choice for
a solution.

● Machine learning – With a built-in machine learning library, Spark is able to cater to a much larger
audience than previous solutions.

● Real-time applications – In-memory processing allows Spark to return calculation results much
faster than MapReduce. This is important in all situations, but it is imperative when real-time results
are required by an application.

● Less code – Spark has support for many different languages which means there is less code that
needs to be written and maintained.

● Developer experience – Spark is much easier to learn and much less intimidating than MapReduce.
This brings more reliable code to the project much faster.

THE ARCHITECTURE: BATCH LAYER, SPEED LAYER, SERVING LAYER

In an effort to converge analysis of historical data and real-time data, the Lambda Architecture was
created. Lambda is a data processing architecture that uses both stream processing and batch processing
to get accurate views of both “live” data and batch data. The Lambda Architecture has four layers, as
shown in the figure:
● Ingestion – This layer is where data is imported. This can be from multiple sources including data
streams.

● Batch – This is the data at rest layer. Data here is often built on a schedule and includes importing
data from the stream layer. Accuracy is more important than speed.

● Stream (speed) – This is a complex layer supporting incremental updates. Low latency is the priority here over accuracy. This data is often only seconds behind its generation.

● Presentation (serving) – This layer answers queries. It accepts each query and elects to use either the speed layer or the batch layer. The batch layer is usually preferred because its data is cleaner; if the query asks for very recent data, the stream (speed) layer is used.

All data going into the architecture is sent to both the batch layer and the speed layer at the same time, as
shown in the figure. The batch layer manages the master data set and pre-computes batch views that are
constantly being computed. The serving layer indexes the batch views so that they can be queried. The
speed layer only deals with recent data which compensates for the latency of updates to the layer which is
serving the data. The two different query results can be merged to form a new view of the data.
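The toy Python sketch below illustrates that merge step: a query combines the pre-computed (but slightly stale) batch view with the recent records held by the speed layer. The data and key names are invented for the example.

# A minimal sketch of merging a batch view with a speed-layer view at query time.
batch_view = {"bus-12": 148, "bus-34": 96}        # totals pre-computed by the batch layer
speed_view = {"bus-12": 3, "bus-77": 1}           # events seen since the last batch run

def query(bus_id):
    # Start from the accurate batch figure, then add whatever the speed layer has seen
    return batch_view.get(bus_id, 0) + speed_view.get(bus_id, 0)

print(query("bus-12"))   # 151: batch result corrected with fresh stream data
print(query("bus-77"))   # 1: only seen by the speed layer so far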

The IoT is the perfect area for the implementation of a Lambda architecture. Data is being produced at an
extremely rapid rate, data sets can be incredibly large, and queries may be for both data at rest and data in
motion. The IoT also benefits from the combined view that the Lambda Architecture can provide.

LAMBDA ARCHITECTURE EXAMPLE: FLOATING BUS DATA

This is a real-life example of how a Lambda Architecture can be used to visualize the 171 bus routes in Los
Angeles, CA. The structure of the solution includes SACK (Spark, Akka, Cassandra, and Kafka). This allows for
near real-time analytics on the data. One tool used here that we have not examined is Akka. Akka is a free
and open-source toolkit for building distributed and resilient message-driven applications. In this structure,
Akka is used to retrieve the route information metadata and store it in Cassandra every 30 seconds.

In this example, Akka is used to request data through a free REST API. The data is then stored in Cassandra.
Subsequent requests are sent to Kafka. Spark is then used to read the vehicle information from Kafka. When
all this data is available, another API is used to visualize the position of the buses with OpenStreetMap. What
is amazing about this process is that the current positions of the buses are streamed right from Kafka onto
the map by using a websocket communication.


This particular platform is not just limited to IoT data. It is suited to solutions where there is a lot of incoming
data. It is very useful when there are multiple and parallel incoming data streams. This is a much better
solution than a traditional RDBMS which would not be able to provide near-real time data.

SETTING THE STAGE

In this course, we have been concerned with the analysis of numeric data that represents the aspects of the
physical world that we want to understand and act upon. We do not normally think of media, such as
images, video, and sound, as data. However, in the digital age, media is numeric data also. It is represented
by ones and zeros as digital data.

Some of the techniques that we have learned in this course can be applied to digital media data. For
example, images are arrays of pixels that have color values for each pixel. Digital images can be
transformed into numpy arrays that can be analyzed as if they were arrays of any other sort of data.
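The short Python sketch below assumes the Pillow and numpy packages and an image file named field.jpg; it loads the image into a numpy array and computes a few simple statistics, the way any other array of data would be analyzed.

# A minimal sketch: treat an image's pixels as a numpy array of numeric data.
import numpy as np
from PIL import Image

pixels = np.array(Image.open("field.jpg"))   # shape (height, width, 3) for an RGB image

print(pixels.shape)
print(pixels.mean(axis=(0, 1)))   # average red, green, and blue values

# A simple vegetation-style check: fraction of pixels where green dominates
green_dominant = (pixels[:, :, 1] > pixels[:, :, 0]) & (pixels[:, :, 1] > pixels[:, :, 2])
print(green_dominant.mean())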

Similarly, machine learning techniques can be used to analyze features of digital media data. These
analyses can be applied to unique instances of the media to provide real-time or near real-time analyses
of streaming media data.

Streaming audio data can be analyzed to identify problems with running machinery. The sound of the
machine can be compared to the sound signatures of failing machine parts to alert personnel to the need
for service before the machinery actually fails.

Digital images captured in an agricultural setting can be used to identify problems with crops. This can alert
farmers to the need for water, fertilizer, or pesticides. Facial recognition algorithms can be used to identify
individuals from streaming security camera data. This type of analysis is extremely valuable to law
enforcement and domestic security agencies.

SUMMARY

This chapter covered how the virtualized data center supports Big Data and analytics. Data can be
processed almost immediately after it is generated, near the source of its creation on the network (fog
computing). Data centers are centralized locations containing large amounts of computing and networking
equipment. This equipment is used to collect, store, process, distribute, and provide access to vast amounts
of data. Its main function is to keep computing services available whenever and wherever they are needed.
Data centers typically deal with sensitive or proprietary information; therefore, these sites must be secured
physically and digitally.

Operating systems (OSs) separate the applications from the hardware. OSs create an “abstraction” of the
details of the hardware resources to the application. Virtualization separates the OS from the hardware. A
computer system consists of the following abstraction layers: applications, OS, firmware, and hardware.
Data centers can also use virtualization to cut costs and expand offerings as cloud providers. Some of these
offerings are SaaS, PaaS, and IaaS. Storage virtualization combines physical storage from multiple network
storage devices into what appears to be a single storage device. Network virtualization (NV) is the creation
of virtual networks within a virtualized infrastructure.

The next section discussed data engineering. Data engineering typically involves a business-related,
computer-based information system where information (data) is captured or generated, processed, stored,
distributed, and analyzed. The data engineer creates the infrastructure that supports Big Data. They design
and build the platform on which all of this data is stored and processed. Data engineers also manage all
this data. They ensure accessibility and availability for data scientists and data analysts.

In the context of Big Data, scalability means designing a solution that can meet the exponential growth
demands of large companies. It is the ability to scale both data storage as well as data processing.
Maintaining availability is the primary concern for many companies working with Big Data. It is extremely
costly for an online company like Amazon to not be able to process thousands of transactions around the
world immediately. Fault tolerance is similar to availability in that a company’s business needs to be
constantly online and available 24/7. Security on these platforms is largely achieved through HTTPS webpages using Transport Layer Security (TLS).

The Hadoop Distributed File System (HDFS) is the filesystem where Hadoop stores data. MapReduce is a
distributed processing framework for parallelizing algorithms across large numbers of commodity servers,
with the capability of handling massive data sets. Hadoop is not a single application but an ecosystem of
applications all working together.

The third section explains how a Big Data pipeline supplies streaming IoT data for analysis. In its most basic
form, the Big Data pipeline consists of three components: data ingestion, data storage, and data processing
(compute). Kafka is used to pipe real-time streaming data between different systems and applications.

Big Data is the term for the vast volumes of data that we are constantly creating from a massive amount of
data sources. It also refers to the vast amounts of data that we have stored. Cassandra uses the Cassandra
File System (CFS). CFS is not a master-slave architecture like HDFS. Every node in the cluster is the same, a
peer-to-peer implementation. Cassandra is an open-source NoSQL distributed database management
system. Spark is an open-source, distributed data processing engine used for Big Data jobs. In an effort to
converge analysis of historical data and real-time data, the Lambda Architecture was created. Lambda is a
data processing architecture that uses both stream processing and batch processing to get accurate
views of both “live” data and batch data.

The last section of this chapter introduces the image processing labs. Some of the techniques in this course
can be applied to digital media data. Digital images can be transformed into numpy arrays that can be
analyzed as if they were arrays of any other sort of data. Machine learning techniques can be used to
analyze features of digital media data.
