
Data Analytics for Beginners

Basic Guide to Master Data Analytics


Table of Contents:
Introduction
Chapter 1: Overview of Data Analytics
Foundations of Data Analytics
Getting Started
Mathematics and Analytics
Analysis and Analytics
Communicating Data Insights
Automated Data Services
Chapter 2: The Basics of Data Analytics
Planning a Study
Surveys
Experiments
Gathering Data
Selecting a Useful Sample
Avoiding Bias in a Data Set
Explaining Data
Descriptive analytics
Charts and Graphs
Chapter 3: Measures of Central Tendency
Mean
Median
Mode
Variance
Standard Deviation
Coefficient of Variation
Drawing Conclusions
Chapter 4: Charts and Graphs
Pie Charts
Create a Pie Chart in MS Excel
Bar Graphs
Create a Bar Graph with MS Excel
Customizing the Bar Graph
Time Charts and Line Graphs
Create a Line Graph in MS Excel
Customizing Your Chart
Annual Employee Losses
Adding another Set of Data
Histograms
Create a Histogram with MS Excel
Creating a Histogram
Scatter Plots
Create a Scatter Chart with MS Excel
Spatial Plots and Maps
Chapter 5: Applying Data Analytics to Business and Industry
Business Intelligence (BI)
Data Analytics in Business and Industry
BI and Data Analytics
Chapter 6: Final Thoughts on Data
Conclusion
Introduction
We live in thrilling and innovative times. As business moves to the digital environment, virtually every
action we take produces data. Information is collected from every online interaction. All sorts of devices
gather and store data about who we are, where we are, and what we are doing. Increasingly-massive
warehouses of data are now freely available to the public. Skilled analyses of all this data can help
businesses, governments, and organizations to make better-informed decisions, respond quickly to
changing needs, and to gain deeper insights into our rapidly-changing environment. It is a challenge to
even attempt to make good use of all of the available data. In order to answer specific questions, a
person must decide what data to collect, which methods to use, and how to interpret the results.
Data analytics is a way to make valuable use of all types of information. Analytics is used to help
categorize data, identify patterns, and predict results. Data use has become so ubiquitous that it has
become necessary for individuals in every profession to learn how to work with data. Those who
become the most proficient at working with data in useful and creative ways will be the most successful
in the new world of business.
Until recently, data analytics was limited to an exclusive culture of data analysts, who characteristically
presented this topic in complicated and often unintelligible terminology. Fortunately, data analytics is
not as complicated as many believe. It simply consists of using analytical methods and processes to
develop and explain specific and useful information from data. The point of data analytics is to enhance
practices and to support better-informed decisions. This can result in: safer practices within an industry,
greater revenues for a business, higher customer satisfaction, or any other object of focus. This eBook
introduces a wide range of ideas and concepts used for deriving useful information from a set of data,
including data analytics techniques and what can be achieved by using them.
Chapter 1: Overview of Data Analytics
With a little statistical understanding and procedural training, you will be able to use analytical methods
to make data-based insights. Data analytics offers new ways to understand the world. Businesses and
organizations were in the habit of making decisions based on assumptions and hoping for favorable
outcomes. Data analytics gives people the insights that they need to plan for improvements and specific
results. Analytics are generally used for the following purposes:

To enhance business organizations and increase returns on investment (ROIs).

To improve the success of sales and marketing campaigns.

To identify trends and emerging developments.

To make society safer.

Foundations of Data Analytics


Data analytics requires the use of mathematical and statistical procedures. It also requires the skills to work
with certain software applications and a knowledge of the subject area you are working with. Without
subject-matter knowledge, analytics is reduced to simple analysis. Due to the increasing demand
for data insights, every field of business has begun to implement data analytics. This has resulted in a
variety of analytic specialties, such as: market analytics, financial analytics, clinical analytics,
geographical analytics, retail analytics, educational analytics, and many other areas of interest.

Getting Started
This chapter explains the major components of data analytics: gathering, exploring, and
interpreting data. As a data analyst, you will be collecting and sorting large volumes of raw,
unstructured, and partially-structured data. The amounts of data that you are likely to be working with
can be too large for a normal database system to process effectively. A data set that is too large, changes
too quickly, or does not conform to the structure of standard database designs requires a special
skillset to manage. Data analytics consists of analyzing, predicting, and visualizing data. When data
analysts gather, query, and interpret data, they conduct a process that is quite similar to data engineering.
Although useful insights can be produced from an individual source of data, the blending of several
sources gives context to the data that is necessary to make more informed decisions. As a data analyst,
you can combine multiple datasets that are maintained in a single database. You can also work with
several different databases maintained within a large data warehouse. Data can also be maintained and
managed within a cloud-based platform specially designed for that purpose. However the data is pooled
and wherever it is stored, the analyst must still issue queries on the data and make commands to retrieve
specific information. This is typically done using a specialized database language called Structured
Query Language (SQL).
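For analysts who prefer to work from a scripting language such as Python (introduced below), queries can also be issued directly from code. The following is a minimal sketch using Python's built-in sqlite3 module; the table and column names (sales, region, amount) are hypothetical and the small in-memory database exists only so the example is self-contained.

import sqlite3

# Build a small in-memory database so the example runs on its own.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120.0), ("South", 85.5), ("North", 60.0), ("West", 42.25)],
)

# A typical analyst query: total sales per region, largest first.
cur.execute(
    "SELECT region, SUM(amount) AS total_sales "
    "FROM sales GROUP BY region ORDER BY total_sales DESC"
)
for region, total_sales in cur.fetchall():
    print(region, total_sales)

conn.close()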
When using a database software application or conducting an analysis using other programming
languages, like R or Python, you can utilize a variety of digital file formats, such as:
Comma-separated values (CSV) files: Virtually all data-based software applications (including
cloud-based programs) and scripting languages are compatible with the CSV file type.
Programming Scripts: Professional data analysts generally know how to write programming
scripts in order to work with data and visualizations in languages like Python and R.
Common File Extensions: MS Excel files have the .xls or .xlsx extension. Geospatial
applications are saved with their own file formats (e.g., the .mxd extension for ArcGIS and the .qgs
extension for QGIS).
Web Programming Files: Web-based data visualizations often use the Data Driven Documents
JavaScript library (D3.js). D3.js files are saved as .html files.
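As an illustration of the first file type in the list above, the short Python sketch below writes and then reads a small CSV file using the standard csv module. The file name sales.csv and its column headings are hypothetical.

import csv

# Write a tiny example file so the sketch is self-contained; in practice
# the CSV would come from an export or a data warehouse.
with open("sales.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "amount"])       # column headings
    writer.writerows([["North", 120.0], ["South", 85.5], ["West", 42.25]])

# Read the file back into a list of dictionaries keyed by the headings.
with open("sales.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(rows[0])        # {'region': 'North', 'amount': '120.0'}
print(len(rows))      # 3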

Mathematics and Analytics


Data analytics requires the ability to perform mathematical and statistical operations. These skills are
necessary both to make sense of the data and to evaluate its relative significance. They are also
important because they are used to conduct data forecasting, decision analytics, and hypothesis testing.
Before getting into more advanced explorations of mathematical and statistical procedures, we will take
some time to explain some distinctions between mathematics and analytics.
Mathematics relies on specific numerical procedures and deductive reasoning to develop a mathematical
explanation of some phenomenon. Like mathematics, analytics provides a mathematical description of a
phenomenon. Analytics is, in fact, a type of analysis that is based on mathematics. However, analytics
uses inductive reasoning and probability to form conclusions and explanations.
Data analysts use mathematical procedures to make decision models, to produce estimations, and to
make forecasts. In order to follow this book, you need little more than common math skills. This book
will teach you how to use statistical techniques to develop insights from data. In the field of data analytics,
statistical procedures are used to determine the meaning and significance of data. This can then be utilized
to test hypotheses, build data simulations, and make predictions about future outcomes.

Analysis and Analytics


The major difference between data analysis and data analytics is the need for subject knowledge. Typical
statisticians specialize in data procedures and have little-to-no knowledge of other fields of study. They
must consult with others who have subject-specific expertise to know which data to look for and to help
find meaning in that data. Data analysts, on the other hand, must understand their subject matter. They
combine analytical insights with their subject-matter expertise to make meaning of the data. Below is a
list of ways that subject matter experts use analytics to enhance performance
in their areas:
Engineering analysts use data analytics with building designs.
Clinical data analysts use predictive methods to foresee future health issues.
Marketing data analysts use regression data to predict and moderate customer turnover.
Data journalists search databases for patterns that may be worth investigating.
Crime data analysts develop spatial models to identify patterns and predict future crimes.
Disaster relief data analysts work to organize and explain important data about the effects of
disasters, which is then used to determine the types of assistance needed.
Communicating Data Insights
Data analysts often have to explain data in ways that non-technical people can comprehend. They must
be able to create understandable data visualizations and reports. Generally, people need to see data
presented visually, in the form of charts, graphs, and pictures, in order to understand it. Analysts have
to be both creative and practical in the ways that they communicate their findings.
Organizational leaders often have difficulties trying to figure out what to do with all of the data that their
organization collects. What they do know, however, is that effectively using analytical tools can help
them to strengthen their business or organization and gain a valuable competitive edge.
Currently, very few of these leaders know the available options for engaging in the process. The
following section discusses the major data analytics solutions and the benefits that can be gained by
organizations.
When implementing data analytics within an organization, there are three key approaches. One can
create an internal data analytics department, one can contract out the assignments to independent data
analysts, or one can pay for a cloud-based software-as-a-service (SaaS) solution that enables novices to
utilize powerful data analytics tools.

There are a few major ways to create an internal data analytics team:

Train current personnel. This can be an inexpensive way to provide an organization with
ongoing data analytics. This training can be used to transform certain employees into highly-
skilled subject-matter experts who are proficient in data analysis.

Train current personnel and also hire professional analysts. This strategy follows the same
process as the first method, but also includes hiring a few data professionals to oversee the
process and personally handle the most challenging problems and tasks.

Hire data professionals. An organization can get its needs met by hiring or contracting with
professional data analysts. This is the most expensive option, because professional data analysts
are in short supply and generally have high salary requirements.
Securing highly-skilled data analysts to meet the needs of an organization can be extremely difficult.
Many businesses and organizations outsource their data analytics jobs to external experts. This happens
in two different ways: One way is to contract with someone to develop a wide-ranging data analytics plan to
serve the entire organization. Another way is to contract with experts to provide individual data analytics
solutions for specific situations and problems that the organization may encounter.

Automated Data Services


Although you must understand certain statistical and mathematical procedures, it is not essential to
learn how to code like professional analysts. Software applications have been developed that
can help to provide powerful capabilities without having to code or script. Cloud-based platform
solutions can provide organizations with most or all of their data analytics needs, although training is
still required for personnel to operate the cloud platform programs.
This book will teach you how to use the power of data analytics to achieve individual and
organizational goals. Regardless of your field of work, learning data analytics can help you to become a
more proficient and sought-after professional. Below is a brief list of benefits that data analytics
provides for various areas:
Benefits for corporations: Cost minimization, higher return on investment (ROI), increased staff-
productivity, reduction of customer loss, higher customer satisfaction, sales forecasting, pricing-
model enhancement, loss detection, and more efficient processes.

Benefits for governments: Increased staff-productivity, improved decision-making models, more
reliable budget forecasting, more efficient resource allocations, and discovery of organizational
patterns.

Benefits for academia: More efficient resource allocations, improved instructional focus and
student performance, increased student retention, refinement of processes, reliable budget
forecasting, and increased ROI for student recruitment practices.
This chapter provided an introduction to the concept of data analytics. Analytics is a growing field of
science that brings together traditional statistical procedures and computer science in order to ascertain
meaningful insights from huge sets of raw data for the benefit of businesses, organizations,
governments, and society. Data analytics is sometimes confused with Business Intelligence (BI) because
of the common tools they both share, particularly data visualizations, such as traditional charts and
graphs. BI, however, is a discipline designed for business leaders without the advanced training
necessary to engage in data analytics. The following chapter discusses the basic principles of data
analytics.
Chapter 2: The Basics of Data Analytics
This chapter will help you to understand the big picture of the field of analytics. It will discuss the steps
of the scientific method, and it will help you to learn how to apply analytics at each step of the scientific
process. Analytics does not only consist of analyzing data. It also consists of using the scientific process
to find answers to questions and make important decisions. The process includes designing studies,
gathering useful information, explaining the data with figures and charts, exploring the data, and
drawing conclusions. We will now examine each step in this process and discuss the critical role of
analytics.

Planning a Study
Once the research question is established, it is time to design a study to answer that specific question.
This requires figuring out the methods that you will use to extract the necessary data. This section covers
the two main types of studies: descriptive studies and experimental studies.
Surveys
With a descriptive study, data are gathered from people in a way that does not have an impact on them.
The most widely used type of descriptive study is a survey. Surveys are questionnaires that are given to
people who are randomly selected from a target population. Surveys are useful data tools for gathering
information. As with all methods of gathering data, improperly conducted surveys are likely to result in
inaccurate information. Common issues with surveys include inadequately worded questions, which can
be confusing, lack of participant response, or lack of randomization in the selection process. Any of
these problems can invalidate the results of the survey, therefore surveys must be carefully planned
before they are implemented.
A limitation of the survey method is that surveys can only provide information on relationships that exist
between variables, not information on causes and effects. If the survey researchers observe, for example,
that people who smoke cigarettes tend to work longer hours per day than those who do not
smoke, they are not in a position to suggest that smoking is the cause of the longer work hours.
Variables that were not part of the research design might cause the relationship, such as the number of
hours participants sleep every night.
Experiments
Experiments involve the application of one or more treatments to subjects in a controlled environment.
The treatments are things that may or may not affect the subject under study. Some studies involve
medical experiments, wherein the subjects are patients who undergo medical treatments. Other
experiments might include students who receive tutoring, or exposure to a particular instructional tool as
the treatment. Businesses engage in experiments that involve sample participants from the consumer
market. These participants may be exposed to a certain type of advertisement and asked how they were
emotionally affected.
Once the treatments are applied, the responses are systematically recorded. For instance, to study the
effect of a drug dosage amount on blood pressure, a group of subjects may be administered 15mg of a
medicine. A different sample group may be administered 30 mg of the same drug. Typically, a control
group is also involved, where subjects each receive a placebo treatment (i.e., a substance with no
medicinal properties).
Experiments are often designed to take place in a controlled setting, in order to reduce the number of
potential unrelated variables and possible biases that might affect the results. Some possible problems
might include: researchers knowing which participants received particular treatments; a particular
circumstance or condition, not factored into the study, that may impact the results (e.g., other
medications that a participant may be taking), or not including an experimental control group. However,
when experiments are designed correctly, differences in responses, found when the groups are compared,
allow the researchers to conclude that there is a cause and effect relationship. No matter what the study,
it must be designed so that the original questions can be answered in a credible way.
Gathering Data
Once a research plan (whether descriptive or experimental) has been designed, the subjects must be
selected, and data must be gathered. This stage of the research process is essential to generating
meaningful data. The ways in which data are collected vary with the type of study. In experimental
designs, the data should be collected in the most controlled manner possible, in order to reduce the
possibility of generating contaminated results. Some experiments require more rigorous procedures
than others. When gathering data on people's perceptions of a new business marketing strategy or data
concerning the effectiveness of a new teaching strategy, the consequences of inaccurate results are not as
critical as they would be in a medical study. Therefore, in low-stakes experiments, it is sometimes
preferable to use less robust data gathering procedures in order to save time and money.
Selecting a Useful Sample
In analytics, as with computer programming, garbage in results in garbage out. If subjects are
improperly chosen, for example by giving some more of a chance to be selected than others, the results
will be unreliable and not useful for making decisions. For example, John is researching the attitudes of
individuals about a possible new tax. John stands in front of a local grocery store and asks passers-by to
share their thoughts and attitudes. The problem with that is that John will only get the attitudes of a)
individuals who shop at that grocery store; b) on that specific day; c) at that specific time; and d) who
actually chose to participate. Because of his limited selection process, the subjects in his survey are not
representative of the entire population of the town. Likewise, John could design an online survey and ask
people to input their feedback on the new tax. However, only people who are aware of the website, have
access to the Internet, and choose to participate will provide data. Characteristically, only people with
the strongest attitudes are likely to participate. Again, these participants would not be representative
of everyone in the town.
In order to avoid such selection bias, it is necessary to select the sample randomly, using some type of
process that gives everyone in the population the same statistical opportunity to be chosen. There are
various methods for randomly selecting subjects in order to get valid and useable results.
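As a simple illustration of random selection, the Python sketch below draws a simple random sample from a hypothetical list of town residents, giving every resident the same chance of being chosen. The population list and sample size are only examples.

import random

# Hypothetical sampling frame: a list of all residents in the town.
population = ["resident_{}".format(i) for i in range(1, 10001)]

# Draw a simple random sample of 200 residents; every resident has the
# same chance of being selected.
random.seed(42)                 # fixed seed so the example is repeatable
sample = random.sample(population, k=200)
print(sample[:5])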

Avoiding Bias in a Data Set


If you were conducting a phone survey on political voting preferences, and you made your calls to
people's landlines at home between the hours of 8:00 a.m. and 4:00 p.m., you would fail to get feedback
from individuals who work at that time. Perhaps those who work during those hours have different
preferences than those at home during those hours. For example, more business owners may be at home
and express voting preferences for something completely different than members of the working class.
Surveys that are poorly designed may be too lengthy, resulting in some participants quitting before they
finish. Participants may not be completely honest if the questions are too personal. If the list of choices
is too limited, the survey will not be able to capture valuable data that people would have provided.
Many things can render survey data invalid.
Experiments can be even more problematic in terms of gathering data. If you want to test how well
people retain information when exposed to loud music, a variety of factors could affect the outcomes.
The experiment designer should consider if everyone will listen to the same song, if they will be asked
about the amount of sleep they got the night before, if they have prior knowledge about the type of
subject matter, how they feel about being there participating in the experiment, whether they use drugs
or alcohol regularly, and a host of other considerations that must be considered in order to control for
outside variables.

Explaining Data
Once data has been collected, it is time to compile it in order to get a view of the entire data set.
Analysts describe data in two basic ways: with images, like graphs and charts, and with figures, called
descriptive analytics. Descriptive analytics are the most commonly-used methods for describing data to
the general population. When used effectively, a chart or graph can easily explain volumes of data in a
single snapshot.
Descriptive analytics
Data can be summarized by using descriptive analytics. Descriptive analytics are numerical
representations of data that highlight the most important features of a dataset. With categorical data,
wherein everything is sorted into groups (e.g., gender, ethnicity, type of car, etc.), things are
usually summarized by the number of units in each category. This is referred to in terms of frequency or
percentage.
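As a small illustration, the Python sketch below summarizes a hypothetical categorical variable (type of car owned by survey respondents) by frequency and percentage, using the standard collections module.

from collections import Counter

# Hypothetical categorical data: the type of car owned by survey respondents.
cars = ["sedan", "SUV", "truck", "sedan", "SUV", "sedan", "hatchback", "SUV"]

counts = Counter(cars)
total = len(cars)
for category, freq in counts.most_common():
    print(category, freq, round(100 * freq / total, 1), "%")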
Numerical data consists of literal quantities or totals (e.g., height, weight, amount of money, etc.),
wherein the actual numbers are meaningful. When working with numerical data, more aspects can be
summarized than just the number or percentage within each category. Such elements include measures
of center (i.e., the central point of the data) and measures of variance (i.e., how widely spread or how
tightly-clustered the data are around the center). Another consideration is a measure of the relationship
between different variables.
Depending on the particular situation, certain descriptive analytics are more appropriate than others. For
example, if you were to assign the codes 1 for men and 2 for women, when analyzing the data, it would not
make sense to attempt to average those numbers. Likewise, attempting to use a percentage to explain a
singular amount of time would not be useful.
Another type of data, ordinal data, is somewhat of a combination of the first two types. Ordinal data
appear in categories; however, the categories have a hierarchical order, such as rankings from 1 to 10,
or student levels from freshman through senior. This data can be analyzed the same way as categorical
data. Numerical data procedures can also be used when the categories represent meaningful numbers.

Charts and Graphs

Data can be presented visually with graphs and charts. Such graphs include pie charts and bar charts,
which can be used with categorical variables like gender or type of car. A bar graph might present data
about attitudes using, for example, a series of five ordered bars labeled from "Strongly Disagree"
through "Strongly Agree."
Not all data, however, can be presented clearly with these types of charts. Numerical data, such as
height, time, or dollar amounts that represent measures or totals, require the types of graphs that
can either summarize the numbers or group them numerically. One such graph is the histogram, which
will be discussed later in this book.
Once the data is collected and described with pictures and numbers, it is time to begin the process of
data analysis. Assuming that the study was planned well, the research question can be properly answered
by applying an appropriate data analysis. As with all previous steps in the process, selecting an
appropriate analytical procedure determines the usefulness of the results.
This chapter discussed the foundations of data analytics. Using mathematical techniques and scientific
procedures to collect, measure, analyze, and draw conclusions from data is what data analytics is all
about. The following chapter discusses the major kinds of data analyses necessary to conduct effective
data analytics. In the following chapter you will learn the basics of calculating and measuring common
descriptive analytics for measuring central tendency and variation within a set of data, as well as the
analytics necessary to evaluate the relative position of a specific value within that data set.
Chapter 3: Measures of Central Tendency
The essence of data analytics is the analysis of data. Analysts use analytical procedures to make sense
out of large amounts of data and their characteristics. Analytical methods can be applied to find
commonalities within groups of people, which can then be used to influence the decisions that they
make. This is done all of the time by advertisers and politicians. A governmental department, for
example, may want to find out the number of people below the age of 18 who use smokeless
tobacco products. Based on the results of their study, the department could propose a new requirement
that advertising for smokeless tobacco products be restricted near schools. Likewise, a fashion
designer might want to learn the height and weight of U.S. women with full time jobs. A great deal of
data analytics is conducted to find averages and other measures of central characteristics among sets of
data.
When investigating a total of 100 units, it can be convenient to gather the entire population and apply
measurements. When dealing with larger numbers, reaching the millions or even billions, measuring the
entire population can be slightly more challenging. In such situations, it is necessary to take random samples
from the total population and allow the sample to represent the total group. This section discusses
some of the essential principles of data analysis that lay the foundation for all types of data analytics.
These important concepts are: mean, median, mode, variance, and standard deviation.

Mean
The mean, or average, of a set of data is the sum of all the numbers within a group divided by the number
of units in the group. The mean of a group is a representative property of the collective group. Useful
assumptions can be made about an entire set of data by figuring out its mean. The formula for
calculating the mean is below.
Mean = Sum of all the set elements / Number of elements.
For example: (1+2+3+4+5) / 5 = 3
The mean of a data set summarizes all of the data with a single value. An analyst might want to compare
the average price of houses between two different neighborhoods. In order to compare these housing
prices, it would be illogical to compare the price of each individual house to the price of every other
house in the study. The best way to approach this research question would be to find the mean prices of
houses in each of the two neighborhoods, and then compare the two means with each other. By doing
this, the analyst will be able to make a valid assumption about which neighborhood has the more expensive
houses.
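A quick way to check these ideas is with Python's standard statistics module. The sketch below reproduces the mean of (1, 2, 3, 4, 5) from the formula above and compares two neighborhoods using hypothetical house prices.

from statistics import mean

neighborhood_a = [250000, 310000, 295000, 405000, 350000]   # hypothetical prices
neighborhood_b = [210000, 198000, 225000, 240000, 230000]   # hypothetical prices

print(mean([1, 2, 3, 4, 5]))     # 3, matching the example above
print(mean(neighborhood_a))      # average house price, neighborhood A
print(mean(neighborhood_b))      # average house price, neighborhood B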

Median
The median is the middle number of a data set. For a set of data that is composed of an odd number of
values, the value in the middle is the median. For a set of data composed of an even number of values,
the average of the two middle numbers is the median. The median is commonly utilized to divide a
collection of data into two separate halves.
In order to find the median of a set of data, write the numbers of the set in order from smallest to largest,
and count the number of units and identify the one or two numbers in the center. This is different from
calculating the mean, because the range of number values is not taken into consideration. Consider this
set of numbers: (1, 2, 3, 4, 20):

Mean: (1+2+3+4+20) / 5 = 6
Median of (1, 2, 3, 4, 20) = 3

The median of a data set is important, because it is not affected by abnormal deviations in the data set.
As we can see in our example, the value "20" disproportionately affects our mean, making it appear as
though half of the values would be below 6 and the other half above 6. The mean, in this case, does not
provide a realistic representation of the data set. If the values represented dollars per week in allowance,
it would appear that the individual receives amounts that are half over and half under $6, when in fact,
the person would have only once received more than $4. The median, in this case, provides us with a
more accurate description of the contents of the data set. Bear in mind that this small collection of data
only consists of 5 values, so it is easy to understand with a quick glance. When the data set contains
hundreds of thousands of values, accurate estimations cannot be made with a quick glance.
The most significant feature of this data set is the single outlier that raises the mean. An outlier is an
outstanding deviation from the majority of the data set. For instance, if a set of data contains the values:
10, 20, 30, 40, 1000, the value 1000 is considered an outlier. Outliers can move the value of the mean far
from its logical central location. The mean of the above set is 1100/5=220 and the median is 30. The
median of this set more accurately represents the data than does the mean.
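The contrast between the mean and the median can be verified with Python's statistics module, using the two example data sets above.

from statistics import mean, median

data = [1, 2, 3, 4, 20]
print(mean(data))      # 6  - pulled upward by the outlier 20
print(median(data))    # 3  - unaffected by the outlier

outliers = [10, 20, 30, 40, 1000]
print(mean(outliers))     # 220
print(median(outliers))   # 30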

Mode
In a data set, the mode is the value that occurs most frequently. Mode is a measure of central tendency
like mean and median. The mode also represents a set of data with a single value. For instance, the mode
of the dataset (1,2,3,3,3,4,4,4,4,4,5,5,6,7) is 4, because it appears more than any other value.
If a data set has a normal distribution of values, the mode is equal to the values of the median and the
mean. With data distributions that are skewed (not symmetric), the mean, median, and mode values may
all be different. In a normal distribution, the data are symmetric about the central value, and the
distribution curve is symmetric about a vertical axis through the mean. Also, in a perfectly normal distribution,
half of the data values are lower than the mean, and the other half are higher.
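The mode of the example set above can be confirmed with Python's statistics module.

from statistics import mode

data = [1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 7]
print(mode(data))   # 4 - the most frequently occurring value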

Variance
It is sometimes necessary, and always helpful, to measure the variation from the mean value within a
set of data. As we saw earlier, one or two outliers can result in an inaccurate representation of the data
set. For example, a large variance in family income data for a city may indicate a mostly poor
population with a few very wealthy members, rather than a solidly middle-class population with a similar mean.
Measuring variance adds context to a standard data analysis. Below is the procedure for finding
variance:
------------------------------------------------------------------------------------------------
Step 1
Calculate the mean of the data set.
Example: (1, 2, 3, 4, 5)
Mean: (1+2+3+4+5) / 5 = 3
Step 2
Find the difference between each individual value in the data set and the mean (using absolute
values, so there are no negative numbers).
|1 - 3| = 2   |2 - 3| = 1   |3 - 3| = 0   |4 - 3| = 1   |5 - 3| = 2
Step 3
Square each of those differences.
2 x 2 = 4   1 x 1 = 1   0 x 0 = 0   1 x 1 = 1   2 x 2 = 4
Step 4
Add all of the squared differences together.
4 + 1 + 0 + 1 + 4 = 10
Step 5
Divide that total by the number of values in the data set minus one.
10 / (5 - 1) = 2.5
Var = 2.5
------------------------------------------------------------------------------------------------
Because the differences are squared (see Step 3), the variance of a dataset is never negative. Calculating
the variance of an actual data set by hand would take far too much time.
The variance of a dataset with thousands of values can be calculated within seconds (actually micro-
seconds) using data software. Perhaps the most important function of the variance value is the fact that it
is used to calculate the Standard Deviation, which is a critical concept of data analytics.
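The five steps above can be reproduced in a few lines of Python; the statistics module's sample variance function gives the same result of 2.5.

from statistics import variance

data = [1, 2, 3, 4, 5]

m = sum(data) / len(data)                  # Step 1: mean = 3
deviations = [x - m for x in data]         # Step 2: differences from the mean
squared = [d ** 2 for d in deviations]     # Step 3: square each difference
total = sum(squared)                       # Step 4: 10
var = total / (len(data) - 1)              # Step 5: divide by n - 1 -> 2.5
print(var)

# The statistics module gives the same sample variance.
print(variance(data))                      # 2.5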

Standard Deviation
Standard deviation is a single value that represents how widely spread the values in a data set are from
the central value (mean). The more spread out a data distribution is, the greater its standard deviation.
This value provides a precise measure of how widely dispersed the values are in a dataset, allowing for
more advanced statistical analyses. The standard deviation is derived by calculating the square root of
the variance of the data set. Therefore, standard
deviation is a highly reliable analytical value that can be used to conduct sophisticated analytical
procedures. Standard deviation is also necessary to perform probability calculations, making it that
much more important to data analytics.

Step 1
Calculate the variance of the data set. This is necessary to find the standard deviation.
In our earlier example the variance was 2.5.

Step 2
Calculate the square root of the variance.
√2.5 ≈ 1.58.

Check to verify that 3 out of 5 (60%) of the values in the data set (1, 2, 3, 4, 5) are within one standard
deviation (1.58) of the mean (3). We know what the standard deviation is...but what does it really mean? In order to
determine whether our standard deviation is low (which means that the distribution is uniform and
therefore representative of the average member of the population) or high (which means that the
distribution is not very uniform and, therefore, it is less representative of the average member), we must
normalize it by calculating the Coefficient of Variation.
Coefficient of Variation
The coefficient of variation (CV) is the standard deviation / mean. This formula is applied to normalize
the standard deviation so that it can be evaluated. Generally, a CV >= 1 indicates high variation, and a CV
< 1 indicates low variation. The farther the CV is from 1 in either direction, the more pronounced the effect. Let us
consider our example:

CV = Std. Dev. / mean


CV: 1.58 / 3 = 0.53

Because the CV < 1, we can assume that our data set is strongly representative of the average member of
the total population.
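The standard deviation and coefficient of variation for the example data set (1, 2, 3, 4, 5) can be checked in Python as follows.

from statistics import mean, stdev

data = [1, 2, 3, 4, 5]

sd = stdev(data)          # sample standard deviation = sqrt(2.5), about 1.58
cv = sd / mean(data)      # coefficient of variation = 1.58 / 3, about 0.53
print(round(sd, 2), round(cv, 2))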

Example:
Imagine that the population of a particular city has an average monthly income of $5,000. We might
assume, based upon the mean, that the average citizen in this city is doing financially well. As data
analysts, however, we know that before we can make a reasonable assumption, it is necessary to
determine how uniformly the income is distributed among the population by calculating the variation of
the data set. If the standard deviation is high, we may assume that the salaries are unevenly distributed
throughout the population. In that event we should not assume that the average member makes a
monthly income in the neighborhood of $5,000. If the standard deviation is low, then we may tend to
consider the population generally affluent.
In a normally distributed data set, about 68% of the values will be within one standard deviation of
the group mean. Ninety-five percent of the values will be within two standard deviations of the mean.
Also, 99.7% of all values in the data set will be within three standard deviations of the mean. Consider
the statement, "Ninety-five percent of a town's residents are between the ages of 4 and 84 years old." If
the ages are roughly normally distributed, the mean lies at the midpoint of that range: (4 + 84) / 2 = 44.
Because the stated range covers 95% of the population (two standard deviations on each side of the mean),
one standard deviation is about (84 - 44) / 2 = 20 years. We can therefore assume that at least 68% of the
citizens are within one standard deviation of 44, that is, between roughly 24 and 64 years old.

Drawing Conclusions
Analysts utilize computers and formulas. However neither computers nor formulas can detect if they are
being used to perform useful operations. Nor can these things determine the meaning or significance of
the results. A common error made in analytics is to overemphasize the significance of the results, or
to apply the results to the general population, when there is no logical basis for doing so. For example, a
research team is researching which types of restaurants airline travelers prefer to frequent. They
interview 100 travelers from the local airport and ask them to rate each restaurant from a provided list.
They produce a top 5 list, and conclude that travelers like those 5 restaurants the most. However, they
actually only know which ones those particular travelers like the most; they cannot draw conclusions
about travelers everywhere.
Analytics is much more than just numbers. It is important for analysts to know how to draw sensible
conclusions from their results.
This chapter discussed measures of central tendency and the role they play in data analytics. Analytical
concepts were explained, including: standard deviation, variance, relative standing, and other measures
of variance. All data analysis is affected by variation and by how the values within the set of
data are distributed. Normally distributed data values strengthen both the inferences that can be drawn
and the predictions that can be made from statistical procedures conducted on a set of data.
Chapter 4: Charts and Graphs
This chapter presents visual ways to present data, including Pie Charts and Bar Graphs for categorical
data, Time Charts for time series data, and Histograms and Boxplots for numerical data. The primary
purpose for data displays is to organize and present data clearly and effectively. The reader will learn the
most common types of data displays used to present both categorical and numerical data. Also discussed
are caveats concerning data interpretation, and guidelines for data evaluation.

Pie Charts
Pie charts are used for categorical data. They illustrate the percentage of individuals that fall in each
category. The total of all of the pieces of the pie equals 100%. Because pie charts are visually
straightforward, categories can clearly be compared and contrasted with each other. Budgets are typically
presented with pie charts to show how money is distributed.

Total Yearly Sales by Quarter


In order to assess the accuracy of a pie chart:
Make sure that the percentages add up to 100% or very close.
Check for pieces of the pie labeled "Other" that are disproportionately big in relation to other
slices.
Verify that the pie chart consists of percentages for each category and not the literal numbers in
each group.
Create a Pie Chart in MS Excel
Step 1. Open a new MS Excel spreadsheet. Enter your data into two columns.

In this example, the pie chart will be created to identify the relative percentage of money spent at the
grocery store. The data table includes a column with the list of grocery items and another column with
the amount of money spent on each item. The process is the same whether it is for a small list of
groceries or a large list of corporate transactions.
Items Amount Spent

Cereal $5.50

Milk $4.10

Bananas $1.25

Yogurt $0.75

Total $11.60

Step 2. Highlight the information that you would like to include in your pie chart. You do not have to
include all of the data in your table, however you must have at least 1 data record. Do this by clicking
and dragging your mouse over the area. Be sure to include the column headings when you do this. In
this example, those would be "Items" and "Amount Spent." This way, you can include the headings in
your chart.
Step 3. Click on the "Insert" menu on the tool bar along the top of the screen. Select "Chart" from the
list of options. Then select the Pie Chart.

Step 4. Choose the type of pie chart you would like to make from the range of options. The pie chart
options consist of a flat chart, a 3D chart, an exploded chart, a pie-of-pie chart, or a bar-of-pie chart; the
last two options break out a selected section of the chart in more detail.
If you would like to preview each pie chart, click the "Press and Hold to View Sample" button.

Step 5. Click Enter, and review your pie chart. To edit or modify your chart, right click on it and
select from the extensive range of options.
Percentages Spent on Groceries
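For readers working in Python rather than Excel, a comparable pie chart can be produced with a few lines of code. The sketch below assumes the matplotlib plotting library is installed and uses the grocery data from the table above.

import matplotlib.pyplot as plt

items = ["Cereal", "Milk", "Bananas", "Yogurt"]
amounts = [5.50, 4.10, 1.25, 0.75]

# autopct shows each slice as a percentage of the total spent
plt.pie(amounts, labels=items, autopct="%1.1f%%")
plt.title("Percentages Spent on Groceries")
plt.show()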

Bar Graphs
Bar graphs are another way to summarize categorical data. Like pie charts, bar graphs display data by
category, indicating how many objects are in each group, or the percentages of each category. Analysts
typically use bar graphs to compare and contrast categorical groups by separating the categories for each
one and displaying the resultant bars next to each other.

Average Number of Cars Sold per Month


Below is a checklist for evaluating bar graphs:
Make sure that the units on the Y-axis are evenly spaced.
Consider units of measurement on the scale of the bar graph. Smaller scales can make minor
differences appear to be huge.
If the bars represent percentages, as opposed to total numbers, look for the total number of units
being summarized.
Create a Bar Graph with MS Excel
Step 1. Create a data table with 1 independent variable. Bar graphs are horizontal visualizations that
illustrate values or data from a single variable.
Include labels for the data and variable at the head of each column. If you want to graph the number of
military personnel recruited in a month, you would write "Branch" at the head of the first column and
"Recruited" at the head of the second column.
Branch Recruited
Army 210
Navy 165
Airforce 130
Marines 75
As an option, you could insert a third column containing a sub-data category. The Bar Graph menu
allows you to choose from a standard, clustered, or stacked bar graph. A stacked bar graph
displays an additional value that is related to each category.
Branch Recruited Code
Army 210 77B
Navy 165 50A
Airforce 130 45C
Marines 75 22D

Highlight the data that you would like to include in your graph. You can include everything in the data
table or just a selection with the data set. Microsoft Excel will separate the X and Y axes by columns.
Step 2. Click on the "Insert" menu on the tool bar along the top of the screen. Select "Chart" from the
list of options. Then select the Bar Graph. Click on the kind of bar graph you want from the choices
available in the bar menu. Bar graph options include: 2-D, 3-D, Cylinder, Cone, or Pyramid shaped
bar graphs.
Step 3. From the range of options, select the type of bar graph you would like to make. To select a
standard bar graph, you will choose "Bar." However, if you would like a vertical graph, then click the
arrow next to "Column."
The image of your graph will quickly appear inside of your Excel sheet.
Customizing the Bar Graph
If you would like to customize your graph, then double click inside of it.
There are a variety of ways to customize the appearance of your graph, including fill, line, 3-D
format, shadow, soft edges, and glow. When you are done formatting your
bar graph, click OK.
Save your Excel sheet with your new bar graph.
Numbers of Military Recruits in a Month
To paste the image into other programs, for use in reports or presentations, click on the bar graph.
When there are circles in all 4 corners, click "Copy." Then paste it into your other file.
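A comparable bar graph can also be produced in Python, assuming the matplotlib library is installed; the sketch below uses the recruiting data from the table above.

import matplotlib.pyplot as plt

branches = ["Army", "Navy", "Airforce", "Marines"]
recruited = [210, 165, 130, 75]

plt.barh(branches, recruited)    # barh() gives horizontal bars, like Excel's "Bar" option
plt.xlabel("Recruited")
plt.title("Numbers of Military Recruits in a Month")
plt.show()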

Time Charts and Line Graphs


Time charts are visual displays that identify trends over a period of time. Time charts are sometimes
referred to as line graphs. Generally, time charts have units of time labeled along the horizontal axis
(e.g., hours, days, years, etc.) and a measured quantity along the vertical axis (e.g., rate of growth, total
gallons, loss of inventory, etc.). For each measure of time, a mark indicates a specific amount. All of the
individual marks are joined along a line to visually highlight changes.
Our example indicates that salaries for city employees increased from 1950 until the early 1980s, began
to fall during the 1980s, and essentially remained the same until the early 2000s, when they slightly
increased.

Average salary for city employees, 1950–2010 (in $1,000s).


Time charts can be used to display data in deceptive ways. If the data is presented in whole numbers,
such as the number of unemployed people, but does not account for changes in the total population
count, it could be made to seem as though unemployment rates increased, even if the actual per capita
rate went down. Always take care to examine the data being displayed in a graph, to ensure that you are
being presented with a fair and accurate picture.
Considerations for evaluating time charts:
Inspect the scale along the vertical axis in addition to the horizontal axis. The data could be
presented to appear more or less significant than they really are by adjusting the scale.

Take into account the units used in the chart and be sure they're appropriate for comparison over
time (for example, are dollar amounts adjusted for inflation?).

Look for uneven spaces along the timeline.

Measurements over a short period are more precise than over a long time period.
Create a Line Graph in MS Excel
Step 1. Enter your data. Line graphs have two axes. Enter the set of data into two columns. It is best to
put your time values (the X-axis) in the left column and your measured values in the right column.
In our example, we will measure the number of employees leaving a job over the course of a year.
Month Losses
January 3
February 2
March 4
April 1
May 7
June 10
July 12
August 9
September 3
October 8
November 3
December 4
Select Insert from the tab at the top of the Excel window.
From Charts select Line Graph. A blank graph field will be displayed in your spreadsheet.
You will see several options for line graphs. Select the standard line graph option if you have a lot of
data values. For small data sets, select the Line with Markers option. This will emphasize each data
point along the line.

Click on the chart, and an editing menu will open. Click on Select Data, and a window will open that
allows you to select the data that you would like. In the Chart Data Range field, highlight the data you
want to include in your line graph. Make sure to include the column headings.
Annual Employee Losses
Customizing Your Chart

Now that the basic line chart is complete, you can edit your chart by right-clicking on it and selecting
Format Chart Area. There are a variety of options for line sizes, background, colors, and chart
elements. You can also format chart elements, one at a time, by right-clicking on them individually and
selecting Format. You can change the title of your chart by clicking the title box and typing inside of
it. Before you change it, the title will automatically be the same as what you labeled the data. The chart
can be resized by clicking the corner and dragging the mouse.

Annual Employee Losses


Adding another Set of Data
To add a second data line, enter your data into the spreadsheet, the same as in the previous section. Add
a third column of data next to your other columns. You should now have three columns that contain the
same number of values. Click on your chart, and select "Select Data" under the Data heading. When the
Select Data Source window opens, click the Add button under Legend, and you will see a field
box labeled Series.

In the Series field, click the cell with the heading name for your second set of data. In the Series values
field, select the cells that contain your new data. Press OK. You will be taken back to the Select Data
Source window. Your new line will appear on the chart.

Now, your second line has been added and labeled.


Annual Employee Losses and Hires
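The same two-series line chart can be sketched in Python with matplotlib (assuming it is installed). The monthly losses come from the table above; the hires figures are hypothetical, since the original values are not listed in the text.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
losses = [3, 2, 4, 1, 7, 10, 12, 9, 3, 8, 3, 4]
hires = [2, 3, 1, 4, 5, 6, 8, 7, 4, 5, 2, 3]    # hypothetical second series

plt.plot(months, losses, marker="o", label="Losses")
plt.plot(months, hires, marker="o", label="Hires")
plt.title("Annual Employee Losses and Hires")
plt.legend()
plt.show()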

Histograms
Histograms present a picture of data separated into ordered groups. A histogram offers a convenient
way to visualize patterns in a large set of data. A histogram is essentially a bar chart that visually
presents numerical data. Unlike categorical data, such as color of hair, which has no innate order,
numerical categories are displayed in order from smallest to largest. Every number in a histogram fits
into only one group. Although the bars on a histogram are contiguous, they do not overlap. Every bar on
the horizontal axis is marked by the values representing its boundaries. The height of each bar signifies
either frequency (the number of units) in each category or the relative frequency (the percentage of
units) in each category.
The following chart displays the grades earned by a group of students, ranging from A to F. The
numerical variable grade is sorted into 5 categories. It is clearly evident that the most frequently
occurring grade is B, and the least frequently occurring is F.
Distribution of Exam Grade Frequencies
If a data point were to fall precisely on a boundary between two categories, then you may want to round up or
down. The validity of the chart is not affected by such decisions, as long as they are consistent.
A Histogram displays three primary elements of numerical data:
The relative distribution of the data (e.g., relative to a bell-shape).
The range of variance in the data (the levels of difference between categories).
The central point of the data set (if you make a combination chart).
A key feature of histograms is that they display the distribution on the data as a shape, which can be
used to make simple inferences. Of course shapes vary with each different set of data; however, there are
three main shapes that are commonly looked for in a set of data:
1. Symmetric (the left side of the histogram is the same as the right side.)
2. Skewed Right (The left side is high and gets continually lower going right.)
3. Skewed Left (The left side is low and gets continually higher going right.)
Variability in the data from a histogram
Histograms also help to illustrate levels of variability within a data set. A histogram that is generally flat
along the top may appear to have low variability; however, that would indicate a wide range within the
data set. Having the same numbers in each category means that the measures were spread out widely. A
hill in the center shows that the majority of measures are near the central point, with a few straying away
from the center in both directions (which is to be expected). The higher the center point, the lower
the variability in the data. In more advanced analyses, the distance of the outliers to the left and right of
the center take on greater significance.
Variability in a histogram is distinct from variability in a time chart. When values on a time chart change
over a period of time, they move either higher or lower on the chart. The more highs and lows along a
time chart indicate greater variability. Conversely, a flat line on a time chart indicates low variability.

Below are considerations for evaluating a histogram:


Inspect the scale being utilized for the frequency (vertical axis). Understand that results can be
made to appear less or more significant by adjusting the size of the scales. For example, if a
group of people have weights that differ by 20 pounds from end to end, this difference can appear to
be massive when measured in grams or insignificant when measured in tons.

Examine the units along the vertical axis to see if the graph is using frequencies (numbers) or
relative frequencies (percentages).

Check the size range of the categories for the numerical variables (on the horizontal axis). If
they represent very small measures, the data may appear to have excessive variation. If they are
very large, the chart may conceal significant amounts of variation.
Create a Histogram with MS Excel

Installing the Analysis ToolPak Add-In


In order to create a Histogram with MS Excel, you must install the Analysis ToolPak Add-in. This
section covers the installation process.
Step 1: Locate the "Excel Add-ins" box under File. You can do this from the MS Excel Home screen.
a) Click Options on the File menu.
b) Click Add-Ins
c) Under Manage, select Add-ins.
d) Click Go.
In the Add-Ins dialog box, click on the Analysis ToolPak check box. It is located under Add-Ins
Available. Next, click OK. The Analysis ToolPak Add-in will not be in the dialog box if it has not
been previously installed. If the Analysis ToolPak is not in the dialog box, run MS Excel Setup and add
the ToolPak to your list of installed items. Now that the Analysis Toolpak is installed and enabled, you
are ready to create a Histogram.
Creating a Histogram
Step 1: Enter the Data. Enter your data into two adjacent columns, and populate the left column with the
"input data" (the set of values that you will analyze with the Histogram tool). In the right column you
will place your bin numbers (the segments that you use for separating and analyzing your data
values). For example, in order to organize ratings into categories of Good, Better, and Best, you could
make bins for 1, 2, and 3.
Navigate to the Data tab, at the top of the screen, and click Data Analysis in the Analysis group. This
will start up the Analysis ToolPak. Then, click to open the Data Analysis box.
In the Data Analysis dialog box, scroll down to Histogram, and click OK. This opens the Histogram
dialog box.

Under Histogram, click the input and the bin ranges from your worksheet. This is done by clicking on
the input box. The input range contains the data that you want to analyze. If the input data is a set of 30
values, and you have copied it into the B column (from B1 to B30), then enter your data range as
B1:B30. The bin range consists of the bin numbers. For example, if there are 5 bins at the very top of
column C, then your bin range will be C1:C5.
Under Output Options, click New Workbook. Then, place a
check in the Chart Output check box.

Once you click OK, you are finished. Excel will produce a new workbook containing a histogram
table along with your chart.
Distribution of Book Prices
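A comparable histogram can be produced in Python with matplotlib, assuming it is installed. The book prices and bin boundaries below are hypothetical and stand in for the input range and bin range described above.

import matplotlib.pyplot as plt

# Hypothetical book prices; the bins play the same role as the bin numbers
# used by the Analysis ToolPak.
prices = [4.99, 7.50, 8.25, 9.99, 10.50, 12.00, 12.75, 13.50,
          14.25, 15.00, 15.99, 17.25, 18.50, 21.00, 24.99]
bins = [0, 5, 10, 15, 20, 25]

plt.hist(prices, bins=bins, edgecolor="black")
plt.xlabel("Price ($)")
plt.ylabel("Frequency")
plt.title("Distribution of Book Prices")
plt.show()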

Scatter Plots
Scatter plots are charts that visually represent the relationship between two variables. A scatterplot
consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series of dots. Each dot on
the scatterplot represents one observation from a data set. The position of the dot on the scatterplot
represents its X and Y values. The example chart below displays the relationship between iPhone sales
and Galaxy Note sales. When we examine the number of Galaxy Note sales along the X (horizontal) axis,
we see that the more Galaxy Note sales there are, the more iPhone sales there are. The red trend line
illustrates this relationship. If the trend line were horizontal and flat, that would tell us that as Galaxy
Note sales go up, iPhone sales remain unchanged. A downward sloping trend line would indicate
that as Galaxy Note sales rise, iPhone sales drop. This would be a possible situation within a small
population (e.g. 10 customers) who have to simply choose one phone or the other.
The dots along the trend line represent actual data points. These data points give us specific information
about the units being measured. They also help us to see the variance in the set of data. The closer
the data points are to the trend line, the stronger the relationship between the two variables. The more
spread out they are, the weaker the relationship. A weak relationship, for example, might be observed if
the data were collected from a customer population that had several other types of phones to choose
from, besides these two.
Relationship between sales of the iPhone and Galaxy Note
If you want to examine relationships between several sets of variables within a data set, you could utilize a scatter plot matrix. This is a series of scatter plots within a single graph that shows the
relationships between multiple variables. Identifying and proving relationships between variables
enables analysts to draw important conclusions that organizational leaders can use to help them to
efficiently achieve their goals. Using a scatter plot is a good way to easily visualize important patterns
and identify outliers in a set of data. This type of graph plots individual data points along the X and Y axes, which also gives you a sense of where the data are centered and how widely they are spread.
Because different charts are able to describe different characteristics of data, it is always a good idea to
use multiple types of graphs and charts to explain a set of data. When using a variety of graphical
displays to explain a data set, it can be useful to begin with a scatter plot, because it will give you a big
picture view of the data characteristics. You could then follow up with a pie chart, histogram, and bar
charts in order to gradually focus in on specific elements within the data set. Graphs and charts allow
you to tell a story about your data in a way that is accessible to non-analysts.
Create a Scatter Chart with MS Excel
Select the data that you want to plot in the scatter chart.
On the Insert tab, in the Charts group, click Scatter.

Click Scatter with Only Markers. Your chart will appear on your Excel worksheet. If you want the trend line, right-click on the chart, click Chart Elements, and check the box marked Trendline.
The Relationship between McDonald's Menu Prices and Calories
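
If you prefer to build the same kind of chart outside of Excel, here is a minimal Python sketch using matplotlib and numpy. The sales figures are invented for illustration, and the trend line is a simple straight-line fit.

import matplotlib.pyplot as plt
import numpy as np

# Invented sales figures for the two phones (one pair of values per observation).
galaxy_sales = np.array([10, 15, 22, 30, 41, 55, 62, 70])   # X axis
iphone_sales = np.array([12, 18, 25, 33, 45, 58, 66, 75])   # Y axis

# Fit a straight line (degree-1 polynomial) to show the overall trend.
slope, intercept = np.polyfit(galaxy_sales, iphone_sales, 1)

plt.scatter(galaxy_sales, iphone_sales, label="Observations")
plt.plot(galaxy_sales, slope * galaxy_sales + intercept,
         color="red", label="Trend line")
plt.xlabel("Galaxy Note sales")
plt.ylabel("iPhone sales")
plt.legend()
plt.show()

The closer the plotted dots sit to the red line, the stronger the relationship between the two variables, exactly as described above.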

Spatial Plots and Maps


Two ways to represent spatial data are spatial plots and maps. A map is simply an image that signifies the sizes, shapes, and locations of a geographical area. Spatial plots visualize the values of a data set and how those values are distributed across locations.
Below are a few common types of spatial plots and maps:
Choropleth Maps: Choropleth maps are spatial data plotted out along area boundary shapes, rather than by point, line, or raster coverage. For example, in a map of the U.S., state boundaries represent the area boundary shapes. Colors may be used within areas to signify the value of the attribute being examined in each state. Perhaps red areas indicate higher values and blue areas signify lower values.

Point Maps: These are composed of spatial data plotted out at specific point locations. Point maps visually display data in graphical point form, rather than in shape, line, or raster surface formats (a brief sketch of a point map follows this list).

Raster Surface Maps: These maps can be anything from a satellite image map to a surface coverage with values derived from basic spatial data points.
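
As a simple illustration of a point map, the sketch below plots a handful of invented store locations by longitude and latitude with plain matplotlib, using the marker size to stand in for a data value at each point. Dedicated mapping libraries can add basemaps and boundaries, but the underlying idea is the same.

import matplotlib.pyplot as plt

# Invented locations and values; each point is one place being measured.
longitudes = [-122.4, -118.2, -87.6, -74.0, -95.4]
latitudes  = [  37.8,   34.1,  41.9,  40.7,  29.8]
sales      = [   120,     90,   150,   200,    75]   # value at each location

plt.scatter(longitudes, latitudes, s=sales)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Point map of sales by location (illustrative values)")
plt.show()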
This chapter discussed the purpose and concepts behind common visual methods for displaying data.
Descriptive analytics uses numbers to summarize aspects of a collection of data. These summaries give you understandable information to help you answer research questions. They can also help you to understand what is happening in your experiment, so that you can later conduct more in-depth analyses. Visual
representations of data help analysts to present data to the outside world plainly and succinctly.
Chapter 5: Applying Data Analytics to Business
and Industry
Business Intelligence (BI)
The goal of business intelligence is to transform raw data into organized information that can be used to
provide insights that business leaders can apply to make well-informed decisions. Business data analysts
rely on business intelligence (BI) tools to help them generate decision models. If you need to build data analytics dashboards, visual presentations, or data reports from collections of data, BI tools can help with the process. Business intelligence consists of:
Large public and private collections of data: Private collections are information sets supplied by the organization's own data collection methods.

Technological tools and skillsets: This includes online analytical data procedures, database
development and management, warehousing of data, and information technology (IT) for
business programs and applications.
The insights generated in business intelligence (BI) come from standardized sets of organized business data. BI solutions are primarily built on transactional data that is produced throughout the course of countless events, such as data created during sales, or records resulting from financial transfers among bank accounts. Transactional data is naturally produced by business actions occurring throughout the organization, and it is critical for the variety of insights that can be gathered from it. BI can be used to extract the following types of business insights:
Customer Information: This data can help managers identify, for example, the areas of their business that are creating the most customer turnover (a brief sketch of this kind of analysis follows this list).

Marketing Data: This data can let businesses know the specific marketing strategies that are
most effective and what exactly makes them so effective.

Operational Data: This data can let businesses know how efficiently different departments are
functioning and the best actions to take in order to fix identified problems.

Employee Data: This data can let businesses know which employees are producing the most, and
which are producing the least.
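
To make the customer-turnover example above concrete, here is a minimal Python sketch using pandas and an invented table of transactional records. It simply groups the records by region and computes the share of customers who left, which is the kind of summary a BI dashboard would surface.

import pandas as pd

# Invented transactional records: one row per customer, 1 = customer left.
transactions = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "East", "East", "West"],
    "churned": [1, 0, 1, 1, 0, 0, 1],
})

# Averaging the 0/1 flag per region gives the turnover (churn) rate.
churn_by_region = transactions.groupby("region")["churned"].mean()
print(churn_by_region.sort_values(ascending=False))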

Because the results of data analytics are often extracted from large datasets, cloud-based data platform solutions are common in the field. Data that's used in data analytics is often derived from data-engineered big data solutions, like Hadoop, MapReduce, and Massively Parallel Processing. Data analysts must be innovative, forward-thinking people who often have to come up with creative solutions in order to overcome limitations in data collection and interpretation. Many data analysts prefer open-source solutions. Because open-source software is free and has a robust development architecture, it is quite popular among analysts, which also benefits the organizations that employ them.
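
Hadoop and MapReduce are cluster technologies, but the core map/reduce idea they are built on can be sketched on a single machine in plain Python. The records below are invented; the point is only to show the pattern of mapping records to key/value pairs and then reducing (combining) the values for each key.

from collections import defaultdict

# Invented records: (region, revenue) pairs produced by the "map" step.
mapped = [("NY", 100), ("LA", 80), ("NY", 120), ("LA", 90), ("NY", 60)]

# "Reduce" step: group by key and combine the values for each key.
totals = defaultdict(int)
for region, revenue in mapped:
    totals[region] += revenue

print(dict(totals))   # {'NY': 280, 'LA': 170}

Real big-data frameworks apply this same pattern across many machines at once, which is what makes them suitable for the massive datasets described above.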
Data Analytics in Business and Industry
Data analytics in business and industry draws on a wider range of data sources than traditional BI, including:
Transactional Data: This is the type of organized data used in most BI models. It includes administration data, customer data, marketing data, organizational data, and employee productivity data.
Social Data: This includes the unfiltered data generated from emails and social networks, like
Facebook, Twitter, LinkedIn, Pinterest, and Instagram.
Machine data from business operations: This data is used to monitor the organization's equipment and machines.
Audio, video, image, and PDF file data: These well-established formats are all sources of unstructured data.
To streamline BI processes, you must make sure that your data is structured for maximum ease of access
and control. You can use multidimensional databases to accomplish this. Unlike the popular relational and flat databases, multidimensional databases sort data into cubes that are organized into multi-dimensional data arrays. To be able to manipulate your data as rapidly and effortlessly as possible, you can place your data in a multidimensional database as a cube, rather than organizing it among multiple relational databases that may have difficulty working with each other. The cubic data architecture allows for online analytical processing (OLAP). OLAP is a technology with which you can conveniently access and use all of your data for several different procedures and explorations.
To understand the OLAP model, imagine that you have a cube of market data with three dimensions: time, location, and department. You could, for example, slice the data to examine only one flat rectangle of the cube, in order to view one particular department. You could dice the data to explore a proportionately smaller cube, consisting of a specific period of time, set of locations, and set of departments. You could also drill up or down through your data set to view very detailed data or highly summarized data. You could also roll up or total a range of numbers along a single dimension in order to sum up the totals for small units of business or examine sales across an extended period of time within a specific location.
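
These cube operations can be imitated, on a small scale, with a pandas pivot table. The sketch below uses invented revenue figures with the same three dimensions (time, location, and department) and shows a full breakdown, a slice for one department, and a roll-up by quarter.

import pandas as pd

# Invented market data with three dimensions: quarter, location, department.
sales = pd.DataFrame({
    "quarter":    ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "location":   ["NY", "LA", "NY", "LA", "NY", "LA"],
    "department": ["Toys", "Toys", "Toys", "Books", "Books", "Books"],
    "revenue":    [100, 80, 120, 90, 60, 70],
})

# The full "cube": revenue broken out by all three dimensions.
cube = sales.pivot_table(index=["quarter", "location"],
                         columns="department",
                         values="revenue", aggfunc="sum")

# Slice: look at a single department across time and location.
toys_only = sales[sales["department"] == "Toys"]

# Roll up: total revenue per quarter, summing over the other dimensions.
per_quarter = sales.groupby("quarter")["revenue"].sum()

print(cube, toys_only, per_quarter, sep="\n\n")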
OLAP is just one system for warehousing data. Another data warehouse system that is popular among BI solutions is called a data mart. This is a data management system used to store specific elements of data, fitting only one area of business in the organization. The process used for extracting, transforming, and loading the data into a database or data mart is known as extract, transform, and load (ETL).
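
Here is a minimal sketch of the ETL pattern in Python: extracting rows from a hypothetical CSV export, transforming them, and loading them into a small SQLite database standing in for the data mart. The file name, column names, and table name are assumptions chosen for illustration.

import sqlite3
import pandas as pd

# Extract: read the raw export (hypothetical file name).
raw = pd.read_csv("sales_export.csv")

# Transform: tidy the column names and keep only the fields the mart needs.
raw.columns = [c.strip().lower() for c in raw.columns]
clean = raw[["date", "region", "revenue"]].dropna()

# Load: write the cleaned table into the target database.
with sqlite3.connect("sales_mart.db") as conn:
    clean.to_sql("regional_sales", conn, if_exists="replace", index=False)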
Typically, business analysts are highly trained in BI technology. As a general rule, BI training is accompanied by traditional IT training and development. Within the business world, data analytics fulfills the same function as BI: to turn mountains of raw data into useful information that can help business leaders make informed, strategic business decisions. If you have large sets of unconnected and possibly incomplete data sources, and you want to convert all of that into valuable and useful business insights throughout the entire organization, data analytics is the discipline that makes it possible. Business data analysts produce critical data insights by identifying patterns and abnormalities in business data.
Data analytics in the business world consists of:
Quantitative Examination: This includes mathematical modeling, statistical analysis, predictive
forecasting, and data simulations. These processes often involve more than one variable at a
time.

Programming skills: You do not need to have software programming skills to gather, organize, and explore data, or to share it with stakeholders, though many business-centric data analysts do program in Python or R.
Business knowledge: Having knowledge of the particular business from a functional perspective
will definitely help you to better understand the relevance and meaning of your findings.

Useful Technologies and Skills


Business-centric data analysts might use machine learning techniques to find patterns in (and derive
insights from) huge datasets that are related to a line of business or the business at large. They're skilled
in math, analytics, and programming, and they sometimes use these skills to generate predictive models.
They generally know how to program in Python or R. Most of them know how to use SQL to query
relevant data from structured databases. They are usually skilled at communicating data insights to end
users. In business-centric data analytics, the end users are business managers and organizational leaders. Data analysts must be skillful at using written, oral, and visual means to communicate valuable data insights. Although business-centric data analysts serve a decision-support role in the enterprise, they're different from the business analyst in that they usually have strong academic and professional backgrounds in math, analytics, engineering, or all of the above. That said, business-centric data analysts also have a strong substantive knowledge of business management.
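
As a small example of the Python-plus-SQL workflow mentioned above, the sketch below pulls a summary out of a structured database and hands it to pandas for further analysis. The database file and table name are assumptions for illustration (they match the hypothetical ETL sketch earlier in this chapter).

import sqlite3
import pandas as pd

query = """
    SELECT region, SUM(revenue) AS total_revenue
    FROM regional_sales
    GROUP BY region
    ORDER BY total_revenue DESC
"""

with sqlite3.connect("sales_mart.db") as conn:
    summary = pd.read_sql_query(query, conn)

print(summary)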

BI and Data Analytics

The comparisons between BI and business-related data analytics are easy to see. They are both rooted in
fundamental statistical analyses. The dissimilarities, however, are not quite so apparent. The function of both is to transform raw data into useful information that can be used to make well-informed decisions. BI and data analytics diverge in their methods and approaches. BI employs predictive methods, such as forecasting, which draw basic inferences from past and current information. Therefore,
BI draws from the past and present to make predictions about future events. Relevant data from
historical and current trends can be extremely useful for helping to guide organizational planning and
operations. It is also instrumental for helping to guide day-to-day decisions.
Data analytics, on the other hand, seeks to make discoveries through the use of advanced mathematical
or statistical methods. It analyzes and makes predictions based on massive amounts of unprocessed data.
Such forward-thinking discernment is critical to the long-term success of an organization. Data analysts
try to discover new models of thinking and original ways of understanding data, in order to provide a
fresh perspective on the organization, the way it functions, and its relationships with stakeholders. Data
analysts, unlike statisticians and traditional business analysts, must have an authentic understanding of
the business itself and its context.
Data analytics requires organizational knowledge, in order to understand how new information is
relevant to the current culture of the organization and its goals. A few other things that distinguish data
analysts from traditional BI include:
Sources of Data: BI relies on structured data housed in relational databases. Data analysts utilize
both structured and unstructured data, for example, the information generated by machines or during social media interactions.

Products: Traditional BI products include reports, data tables, and decision dashboards. Data analysts, on the other hand, produce outputs that may be related to dashboard analytics and advanced data visualization, but typically not data reports. Data analysts typically relay their findings through words and data visualizations, rather than tables and reports, because the sources of data with which they work tend to be more complex than a typical organizational leader would be able to truly grasp.
Technology: BI relies on relational databases, data warehouses, OLAP, and ETL technologies.
Data analytics utilizes data-engineered systems that use Hadoop, MapReduce, or Massively
Parallel Processing.

Expertise: BI relies heavily on IT and business technology expertise, whereas data analysts rely
on expertise in analytics, statistical methods, computer programming, and business.
Because most business leaders are not trained to perform advanced data analytics themselves, it is
beneficial for them to distinguish the types of decisions that are best suited for a business leader from
those best left to their data analysts. In our rapidly-evolving knowledge-based economy, organizations
seeking to remain competitive must constantly become more efficient in their operations and more
strategic with resources. The key to this is capitalizing on the opportunities provided by skilled analyses
of industrial-level Big Data.

Chapter 6: Final Thoughts on Data


Prior to the recent rise in analytics, businesses and organizations did not have the capacity to analyze a
great deal of data, so a relatively small amount was maintained. In today's data-driven world, anything
and everything may have significance, so there has been an attempt to record and keep virtually any data
that we have the capacity to collect; and we have a great deal of capacity. Beyond the quantity of data
that we are gathering and storing, there is the changing nature of the data itself. That is to say, data has grown beyond basic
facts and figures to encompass media files. Video, audio, and presentations have all become units of data
for possible analysis. A major concern with regards to data analytics is how to store and maintain all of
these rapidly-increasing piles of data. The data science community has begun to rely more heavily upon
the software engineering community, in order to find solutions to our over-abundance of data.
Not all data is necessarily valuable on its own. However, society now has advanced data analytics that allow us to glean useful and important information from even the smallest bits of data. Such information, when reconciled with other groups of information, can result (and often has resulted) in breakthroughs in modern science, business, and economics. As we consider our need to increase the role of data analytics in the ways that
we function as organizations, we should keep in mind that data does not contain all of the answers to our
growth and advancement. Data provides us with the building material with which we can create new
understanding and innovation. The other part of the process is distinctively human. This part includes
creativity, risk taking, and cooperation. It appears as though the less we have of one, the more we need
of the other. The more intellectual rigor and collaboration we have between various fields of science, the more we seem to benefit from even limited amounts of data. Conversely, the less of those things we have, the more data we need in order to learn, grow, and innovate. Perhaps the solution to our looming
problem with big data is to reduce our need for so much of it.
Conclusion
As we have seen, data analytics is an inclusive and encompassing field of study. What distinguishes data
analytics from traditional areas of data analysis is its orientation toward the business world and its focus
on Big Data. Data analytics exists at the intersection of data science and computer technology. Each of
these fields is constantly evolving, and each heavily influences the other. Although a career in data analytics does not require specialized training in computer programming, familiarizing oneself with the
fundamentals of computer science will definitely benefit a data analyst. This introductory book has
provided you with the necessary understanding and skills to move on to advanced principles, techniques,
and procedures in data analytics.
Advanced data analytics build upon the fundamentals that are covered in this book. Even the most
sophisticated studies begin with the basic research design principles that we discussed, measures of central tendency, descriptive analytics, basic charts and graphs, and analysis of variance. The differences lie
in additional procedures that are conducted in order to further evaluate the quality of data and reliability
of the results. The majority of data analytics is accomplished utilizing the fundamental principles that
you have just learned.
