
Data Analytics

Unit-1 Data Analytics

Introduction to data analytics

Data Analytics: Data Analytics refers to the techniques used to analyze data to enhance
productivity and business gain. Data is extracted from various sources and is cleaned and
categorized to analyze various behavioral patterns. The techniques and the tools used vary
according to the organization or individual.

A data analyst will extract raw data, organize it, and then analyze it, transforming it from
incomprehensible numbers into coherent, intelligible information. Having interpreted the data,
the data analyst will then pass on their findings in the form of suggestions or recommendations
about what the company’s next steps should be.

What is the difference between data analytics and data science?

The terms “data science” and “data analytics” tend to be used interchangeably. However, they
are two different fields and denote two distinct paths. What’s more, they each have a very
different impact on the business or organization.

One key difference between data scientists and data analysts lies in what they do with the
data and the outcomes they achieve. A data analyst will seek to answer specific questions or
address particular challenges that have already been identified and are known to the business. To
do this, they examine large datasets with the goal of identifying trends and patterns. They then
“visualize” their findings in the form of charts, graphs, and dashboards. These visualizations are
shared with key stakeholders and used to make informed data-driven strategic decisions.

A data scientist, on the other hand, considers what questions the business should or could be asking. They design
new processes for data modeling, write algorithms, devise predictive models, and run custom
analyses. For example, they might build a machine that leverages a dataset and automates certain
actions based on that data—and, with continuous monitoring and testing, improve and optimize
that machine as new patterns and trends emerge. Data scientists might be expected to be
proficient in Hadoop, Java, Python, machine learning, and object-oriented programming, together
with software development, data mining, and data analysis.

Data analysts are typically expected to be proficient in software like Excel and, in some cases,
querying and programming languages like SQL, R, SAS, and Python. Analysts need to be
comfortable using such tools and languages to carry out data mining, statistical analysis,
database management, and reporting.

Types of data analysis

1. Descriptive analytics

Descriptive analytics is a simple, surface-level type of analysis that looks at what has happened
in the past. The two main techniques used in descriptive analytics are data aggregation and data
mining—so, the data analyst first gathers the data and presents it in a summarized format (that’s
the aggregation part) and then “mines” the data to discover patterns.

The data is then presented in a way that can be easily understood by a wide audience (not just
data experts). Descriptive analytics doesn’t try to explain the historical data or establish cause-
and-effect relationships; at this stage, it’s simply a case of determining and describing the “what”.
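To make the aggregation idea concrete, here is a minimal sketch in Python with pandas; the sales figures and column names are invented for illustration.

```python
# A minimal sketch of descriptive analytics with pandas (hypothetical
# sales data): aggregate past records into a summary of "what happened".
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "region":  ["North", "South", "North", "South", "North"],
    "revenue": [1200, 950, 1100, 1025, 700],
})

# Aggregation: summarize revenue per month.
summary = sales.groupby("month", sort=False)["revenue"].agg(["sum", "mean"])
print(summary)
```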

2. Diagnostic analytics

While descriptive analytics looks at the “what”, diagnostic analytics explores the “why”. When
running diagnostic analytics, data analysts will first seek to identify anomalies within the data—
that is, anything that cannot be explained by the data in front of them. For example: If the data
shows that there was a sudden drop in sales for the month of March, the data analyst will need to
investigate the cause.

To do this, they’ll embark on what’s known as the discovery phase, identifying any additional
data sources that might tell them more about why such anomalies arose. Finally, the data analyst
will try to uncover causal relationships—for example, looking at any events that may correlate or
correspond with the decrease in sales. At this stage, data analysts may use probability theory,
regression analysis, filtering, and time-series data analytics.
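As a rough illustration of the diagnostic step, the sketch below uses a z-score to flag an anomalous month and a simple correlation check to hint at a possible explanation; all figures are invented, and correlation alone does not establish causation.

```python
# A sketch of one diagnostic step (assumed data): flag the anomalous
# month with a z-score, then check which other variable moves with sales.
import pandas as pd

df = pd.DataFrame({
    "sales":    [100, 104, 98, 101, 60, 103],   # the 60 looks anomalous
    "ad_spend": [20, 21, 19, 20, 5, 21],
})

z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
print(df[z.abs() > 1.5])                 # rows flagged as anomalies

# A correlation check hints at a candidate cause (not proof of causation).
print(df["sales"].corr(df["ad_spend"]))
```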

3. Predictive analytics

Just as the name suggests, predictive analytics tries to predict what is likely to happen in the
future. This is where data analysts start to come up with actionable, data-driven insights that the
company can use to inform their next steps. Predictive analytics estimates the likelihood of a
future outcome based on historical data and probability theory, and while it can never be
completely accurate, it does eliminate much of the guesswork from key business decisions.

Predictive analytics can be used to forecast all sorts of outcomes—from what products will be
most popular at a certain time, to how much the company revenue is likely to increase or
decrease in a given period. Ultimately, predictive analytics is used to increase the business’s
chances of “hitting the mark” and taking the most appropriate action.
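As a minimal sketch of the idea (the monthly revenue figures below are invented), ordinary least squares can fit a trend line to historical data and extrapolate it one period ahead:

```python
# A minimal predictive sketch with NumPy (invented monthly revenue):
# fit a linear trend to historical data and forecast the next period.
import numpy as np

months  = np.array([1, 2, 3, 4, 5, 6])
revenue = np.array([10.0, 10.8, 11.5, 12.1, 13.0, 13.6])  # in $k

slope, intercept = np.polyfit(months, revenue, deg=1)  # fit y = a*x + b
forecast = slope * 7 + intercept                       # predict month 7
print(f"Forecast for month 7: {forecast:.1f}k")
```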
4. Prescriptive analytics

Building on predictive analytics, prescriptive analytics advises on the actions and decisions
that should be taken. In other words, prescriptive analytics shows you how you can take
advantage of the outcomes that have been predicted. When conducting prescriptive analysis, data
analysts will consider a range of possible scenarios and assess the different actions the company
might take.

Prescriptive analytics is one of the more complex types of analysis, and may involve working
with algorithms, machine learning, and computational modeling procedures. However, the
effective use of prescriptive analytics can have a huge impact on the company’s decision-making
process and, ultimately, on the bottom line.
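A toy sketch of the scenario-weighing idea described above; the actions, probabilities, and payoffs are all invented. Each candidate action is scored by its expected payoff across predicted demand scenarios, and the best one is recommended.

```python
# A toy prescriptive sketch (all numbers assumed): score candidate
# actions by expected payoff across predicted scenarios.
scenarios = {"low": 0.2, "base": 0.5, "high": 0.3}   # predicted probabilities

# Payoff of each action under each scenario (hypothetical, in $k).
payoffs = {
    "discount":   {"low": 5, "base": 12, "high": 15},
    "hold_price": {"low": 2, "base": 10, "high": 20},
    "bundle":     {"low": 4, "base": 11, "high": 18},
}

expected = {
    action: sum(scenarios[s] * p for s, p in by_scenario.items())
    for action, by_scenario in payoffs.items()
}
best = max(expected, key=expected.get)
print(expected, "->", best)      # recommend the highest expected payoff
```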

The typical process that a data analyst follows

The actual process of data analysis can be outlined in the five main steps that a data analyst will
follow when tackling a new project:

Step 1: Define the question(s) you want to answer

The first step is to identify why you are conducting analysis and what question or challenge
you hope to solve. At this stage, you’ll take a clearly defined problem and come up with a
relevant question or hypothesis you can test. You’ll then need to identify what kinds of data
you’ll need and where it will come from.

For example: A potential business problem might be that customers aren’t subscribing to a paid
membership after their free trial ends. Your research question could then be “What strategies can
we use to boost customer retention?”


Step 2: Collect the data

With a clear question in mind, you’re ready to start collecting your data. Data analysts will
usually gather structured data from primary or internal sources, such as CRM software or email
marketing tools. They may also turn to secondary or external sources, such as open data sources.
These include government portals, tools like Google Trends, and data published by major
organizations such as UNICEF and the World Health Organization.

Step 3: Clean the data

Once you’ve collected your data, you need to get it ready for analysis—and this
means thoroughly cleaning your dataset. Your original dataset may contain duplicates,
anomalies, or missing data which could distort how the data is interpreted, so these all need to be
removed. Data cleaning can be a time-consuming task, but it’s crucial for obtaining accurate
results.
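The description above maps directly onto a few pandas one-liners. A minimal cleaning sketch, with an invented dataset and column names: it removes duplicates, drops an impossible value, and imputes a missing one.

```python
# A minimal data-cleaning sketch with pandas (dataset and columns assumed).
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age":         [34, 29, 29, -5, None],  # -5 is an anomaly, None is missing
})

df = df.drop_duplicates()                              # remove duplicate rows
df = df[df["age"].between(0, 120) | df["age"].isna()]  # drop impossible ages
df["age"] = df["age"].fillna(df["age"].median())       # impute missing values
print(df)
```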

Step 4: Analyze the data

Now for the actual analysis! How you analyze the data will depend on the question you’re
asking and the kind of data you’re working with, but some common techniques include
regression analysis, cluster analysis, and time-series analysis (to name just a few). We’ll go over
some of these techniques in the next section. This step in the process also ties in with the four
different types of analysis we looked at in section three (descriptive, diagnostic, predictive, and
prescriptive).

Step 5: Visualize and share your findings

This final step in the process is where data is transformed into valuable business insights.
Depending on the type of analysis conducted, you’ll present your findings in a way that others
can understand—in the form of a chart or graph, for example.

Data analytics tools

Here are some of the data analytics tools that a data analyst might work with.

 Microsoft Excel is a software program that enables you to organize, format, and calculate
data using formulas within a spreadsheet system. Microsoft Excel may be used by data
analysts to run basic queries and to create pivot tables, graphs, and charts. Excel also features
a macro programming language called Visual Basic for Applications (VBA).
 Tableau is a popular business intelligence and data analytics software which is primarily
used as a tool for data visualization. Data analysts use Tableau to simplify raw data into
visual dashboards, worksheets, maps, and charts. This helps to make the data accessible and
easy to understand, allowing data analysts to effectively share their insights and
recommendations.
 SAS is a command-driven software package used for carrying out advanced statistical
analysis and data visualization. Offering a wide variety of statistical methods and algorithms,
customizable options for analysis and output, and publication-quality graphics, SAS is one of
the most widely used software packages in the industry.
 RapidMiner is a software package used for data mining (uncovering patterns), text mining,
predictive analytics, and machine learning. Used by both data analysts and data scientists
alike, RapidMiner comes with a wide range of features—including data modeling, validation,
and automation.
 Power BI is a business analytics solution that lets you visualize your data and share insights
across your organization. Similar to Tableau, Power BI is primarily used for data
visualization. While Tableau is built for data analysts, Power BI is a more general business
intelligence tool.

Different Sources of Data for Data Analysis


Data collection is the process of acquiring, collecting, extracting, and storing voluminous
amounts of data, which may be in structured or unstructured form (text, video, audio, XML
files, records, or other image files), for use in the later stages of data analysis.

1. Primary data:

Data that is raw, original, and extracted directly from official sources is known as primary
data. This type of data is collected directly through techniques such as questionnaires,
interviews, and surveys. The data collected must match the demands and requirements of the
target audience on which the analysis is performed; otherwise, it becomes a burden in data
processing.
A few methods of collecting primary data:
1. Interview method:
In this method, data is collected by interviewing the target audience; the person conducting
the interview is called the interviewer, and the person who answers is the interviewee. Basic
business- or product-related questions are asked and noted down in the form of notes, audio,
or video, and this data is stored for processing. Interviews can be both structured and
unstructured, such as personal interviews or formal interviews conducted by telephone, face
to face, email, etc.
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the
answers are noted down in the form of text, audio, or video. Surveys can be conducted both
online and offline, for example through website forms and email, and the answers are then
stored for analysis. Examples are online surveys and surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher keenly observes
the behavior and practices of the target audience using a data-collection tool and stores the
observed data in the form of text, audio, video, or other raw formats. In this method, the data
is collected by direct observation rather than by posing questions to the participants. For
example, observing a group of customers and their behavior towards products. The data obtained
is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data through performing experiments,
research, and investigation. The most frequently used experimental designs are CRD, RBD,
LSD, and FD:
 CRD – Completely Randomized Design is a simple experimental design used in data
analytics, based on randomization and replication. It is mostly used for comparing
experimental treatments.
 RBD – Randomized Block Design is an experimental design in which the experiment is
divided into small units called blocks. Random experiments are performed on each of the
blocks, and results are drawn using a technique known as analysis of variance (ANOVA).
RBD originated in the agricultural sector.
 LSD – Latin Square Design is an experimental design similar to CRD and RBD but
arranged in rows and columns. It is an N×N arrangement in which each letter (treatment)
occurs exactly once in each row and each column, so differences can be found with fewer
errors in the experiment. A Sudoku puzzle is an example of a Latin square; a small
generator sketch follows this list.
 FD – Factorial Design is an experimental design in which each experiment has two or
more factors, each with several possible levels, and trials are run over the combinations of
those factor levels.
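As promised above, here is a small sketch that builds a Latin square by cyclic shifts, so that each letter occurs exactly once in every row and every column:

```python
# A small Latin square generator: cyclic shifts guarantee each symbol
# appears exactly once per row and once per column.
def latin_square(n):
    symbols = [chr(ord("A") + i) for i in range(n)]
    return [[symbols[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square(4):
    print(" ".join(row))
# A B C D
# B C D A
# C D A B
# D A B C
```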
2. Secondary data:

Secondary data is data that has already been collected and is reused for some valid purpose.
This type of data is derived from previously recorded primary data, and it has two types of
sources: internal and external.
Internal source:
These types of data can easily be found within the organization, such as market records, sales
records, transactions, customer data, accounting resources, etc. Obtaining data from internal
sources costs less and takes less time.
External source:
Data that can’t be found within the organization and must be gained through external
third-party resources is external-source data. The cost and time required are greater because
such sources contain huge amounts of data. Examples of external sources are government
publications, news publications, the Registrar General of India, the Planning Commission, the
International Labour Bureau, syndicate services, and other non-governmental publications.
Other sources:
 Sensors data: With the advancement of IoT devices, the sensors of these devices collect
data which can be used for sensor data analytics to track the performance and usage of
products.
 Satellite data: Satellites collect terabytes of images and data on a daily basis through
surveillance cameras, which can be used to extract useful information.
 Web traffic: Thanks to fast and cheap internet access, data in many formats uploaded by
users on different platforms can be collected (with their permission) for data analysis.
Search engines also provide data on the keywords and queries that are searched most often.

Types of Data

Qualitative or Categorical Data

Qualitative or Categorical Data is data that can’t be measured or counted in the form of
numbers. These types of data are sorted by category, not by number. That’s why it is also
known as Categorical Data. These data consist of audio, images, symbols, or text. The gender
of a person, i.e., male, female, or others, is qualitative data. Qualitative data tells about the
perception of people. This data helps market researchers understand the customers’ tastes
and then design their ideas and strategies accordingly.

Other examples of qualitative data are:

 What language do you speak?
 Favourite holiday destination
 Opinion on something (agree, disagree, or neutral)
 Colours

Qualitative data are further classified into two parts:

Nominal Data

Nominal Data is used to label variables without any order or quantitative value. The colour
of hair can be considered nominal data, as one colour can’t be compared with another colour.
The name “nominal” comes from the Latin word “nomen,” which means “name.” With
nominal data, we can’t perform any numerical tasks or apply any order to sort the data.
These data don’t have any meaningful order; their values are distributed across distinct
categories.

Examples of Nominal Data:

 Colour of hair (Blonde, red, Brown, Black, etc.)
 Marital status (Single, Widowed, Married)
 Nationality (Indian, German, American)
 Gender (Male, Female, Others)
 Eye Color (Black, Brown, etc.)

Ordinal Data

Ordinal data have a natural ordering, in which the values are arranged in some kind of order
by their position on the scale. These data are used for observations like customer satisfaction,
happiness, etc., but we can’t do any arithmetical tasks on them. Ordinal data are qualitative
data whose values have some kind of relative position. These kinds of data can be considered
“in-between” qualitative data and quantitative data. Ordinal data only show sequence and
cannot be used for arithmetic-based statistical analysis. Compared to nominal data, ordinal
data have some kind of order that is not present in nominal data (a short pandas illustration
follows the examples below).

Examples of Ordinal Data:

 When companies ask for feedback, experience, or satisfaction on a scale of 1 to 10
 Letter grades in the exam (A, B, C, D, etc.)
 Ranking of people in a competition (First, Second, Third, etc.)
 Economic Status (High, Medium, and Low)
 Education Level (Higher, Secondary, Primary)
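One way to see the nominal/ordinal distinction in code: pandas lets you mark a categorical as ordered, which enables the comparisons that plain nominal categories do not support. The example values below are taken from the lists above.

```python
# Nominal vs. ordinal data in pandas: only an ordered Categorical
# supports comparison and sorting of its values.
import pandas as pd

nationality = pd.Categorical(["Indian", "German", "American"])  # nominal: no order

education = pd.Categorical(
    ["Secondary", "Primary", "Higher"],
    categories=["Primary", "Secondary", "Higher"],
    ordered=True,                                   # ordinal: natural order
)
print(education.min(), "<", education.max())        # Primary < Higher
```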

Difference between Nominal and Ordinal Data

| Nominal Data | Ordinal Data |
| --- | --- |
| Nominal data can’t be quantified, nor do they have any intrinsic ordering | Ordinal data give some kind of sequential order by their position on the scale |
| Nominal data are qualitative or categorical data | Ordinal data are said to be “in-between” qualitative and quantitative data |
| They don’t provide any quantitative value, and no arithmetical operations can be performed on them | They provide a sequence, and numbers can be assigned to ordinal data, but arithmetical operations still cannot be performed |
| Nominal data cannot be used to compare items with one another | Ordinal data can help to compare one item with another by ranking or ordering |
| Examples: eye colour, housing style, gender, hair colour, religion, marital status, ethnicity, etc. | Examples: economic status, customer satisfaction, education level, letter grades, etc. |

Quantitative Data

Quantitative data can be expressed in numerical values, which makes it countable and
suitable for statistical analysis. This kind of data is also known as numerical data. It
answers questions like “how much,” “how many,” and “how often.” For example, the
price of a phone, a computer’s RAM, and the height or weight of a person all fall under
quantitative data. Quantitative data can be manipulated statistically and represented on a
wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, box plots,
pie charts, line graphs, etc.
Examples of Quantitative Data:

 Height or weight of a person or object
 Room Temperature
 Scores and Marks (Ex: 59, 80, 60, etc.)
 Time

Quantitative data are further classified into two parts:

Discrete Data

The term discrete means distinct or separate. Discrete data contain values that fall under
integers or whole numbers. The total number of students in a class is an example of
discrete data. These data can’t be broken into decimal or fractional values. Discrete data
are countable and have finite values; their subdivision is not possible. They are
represented mainly by bar graphs, number lines, or frequency tables.

Examples of Discrete Data:

 Total number of students present in a class
 Cost of a cell phone
 Numbers of employees in a company
 The total number of players who participated in a competition
 Days in a week

Continuous Data

Continuous data are in the form of fractional numbers. Examples include the version of an
Android phone, the height of a person, the length of an object, etc. Continuous data represent
information that can be divided into ever smaller levels, and a continuous variable can take
any value within a range. The key difference between discrete and continuous data is that
discrete data contain integers or whole numbers, while continuous data store fractional
numbers, recording measurements such as temperature, height, width, time, speed, etc.

Examples of Continuous Data:

 Height of a person
 Speed of a vehicle
 “Time-taken” to finish the work
 Wi-Fi Frequency
 Market share price

Difference between Discrete and Continuous Data

| Discrete Data | Continuous Data |
| --- | --- |
| Discrete data are countable and finite; they are whole numbers or integers | Continuous data are measurable; they are in the form of fractions or decimals |
| Discrete data are represented mainly by bar graphs | Continuous data are represented in the form of a histogram |
| The values cannot be divided into smaller subdivisions | The values can be divided into smaller subdivisions |
| Discrete data have spaces between the values | Continuous data are in the form of a continuous sequence |
| Examples: total students in a class, number of days in a week, size of a shoe, etc. | Examples: temperature of a room, the weight of a person, length of an object, etc. |

Based on how it is organized, data is of 3 types: structured data, semi-structured data, and unstructured data.
1. Structured data –

Structured data is data whose elements are addressable for effective analysis. It is organized
into a formatted repository, typically a database. It concerns all data that can be stored in a
SQL database in a table with rows and columns. Structured data have relational keys and
can easily be mapped into pre-designed fields. Today, structured data is the most processed
form of data and the simplest to manage. Example: relational data.

2. Semi-Structured data –

Semi-structured data is information that does not reside in a relational database but has
some organizational properties that make it easier to analyze. With some processing,
semi-structured data can be stored in a relational database (though this can be very hard
for some kinds of semi-structured data). Example: XML data.

3. Unstructured data –

Unstructured data is data that is not organized in a predefined manner and does not have
a predefined data model; thus, it is not a good fit for a mainstream relational database.
Alternative platforms exist for storing and managing unstructured data. It is increasingly
prevalent in IT systems and is used by organizations in a variety of business intelligence
and analytics applications. Example: Word documents, PDFs, text, media logs.
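A small sketch of the practical difference, using invented records: structured data drops straight into a table, while semi-structured JSON carries organizational tags but needs flattening before it fits rows and columns.

```python
# Structured vs. semi-structured data (invented records).
import json
import pandas as pd

# Structured: fits a table directly.
structured = pd.DataFrame({"id": [1, 2], "city": ["Pune", "Delhi"]})

# Semi-structured: JSON with nesting; it has tags but no fixed table schema.
raw = '[{"id": 1, "address": {"city": "Pune"}}, {"id": 2, "address": {"city": "Delhi"}}]'
semi = pd.json_normalize(json.loads(raw))   # flatten nesting into columns
print(semi.columns.tolist())                # ['id', 'address.city']
```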

Differences between Structured, Semi-structured and Unstructured data:

| Properties | Structured data | Semi-structured data | Unstructured data |
| --- | --- | --- | --- |
| Technology | It is based on a relational database table | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data |
| Transaction management | Matured transactions and various concurrency techniques | Transactions adapted from the DBMS, not matured | No transaction management and no concurrency |
| Version management | Versioning over tuples, rows, and tables | Versioning over tuples or graphs is possible | Versioned as a whole |
| Flexibility | It is schema-dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible, with an absence of schema |
| Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable |
| Robustness | Very robust | New technology, not very widespread | — |
| Query performance | Structured queries allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible |

Data analytics techniques

Data analysts will usually work with quantitative data; however, there are some roles out there
that will also require you to collect and analyze qualitative data, so it’s good to have an
understanding of both. With that in mind, here are some of the most common data analytics
techniques:

 Regression analysis: This method is used to estimate or “model” the relationship between a
set of variables. Regression analysis is mainly used to make predictions. Note, however, that
on their own, regressions can only be used to determine whether or not there is a relationship
between a set of variables—they can’t tell you anything about cause and effect.
 Factor analysis: Sometimes known as dimension reduction, this technique helps data
analysts to uncover the underlying variables that drive people’s behavior and the choices they
make. Ultimately, it condenses the data in many variables into a few “super-variables”,
making the data easier to work with. For example: If you have three different variables which
represent customer satisfaction, you might use factor analysis to condense these variables
into just one all-encompassing customer satisfaction score.
 Cohort analysis: A cohort is a group of users who have a certain characteristic in common
within a specified time period—for example, all customers who purchased using a mobile
device in March may be considered as one distinct cohort. In cohort analysis, customer data
is broken up into smaller groups or cohorts; so, instead of treating all customer data the same,
companies can see trends and patterns over time that relate to particular cohorts. In
recognizing these patterns, companies are then able to offer a more targeted service.
 Cluster analysis: This technique is all about identifying structures within a dataset. Cluster
analysis essentially segments the data into groups that are internally homogenous and
externally heterogeneous—in other words, the objects in one cluster must be more similar to
each other than they are to the objects in other clusters. Cluster analysis enables you to see
how data is distributed across a dataset where there are no existing predefined classes or
groupings. In marketing, for example, cluster analysis may be used to identify distinct target
groups within a larger customer base.
 Time-series analysis: In simple terms, time-series data is a sequence of data points which
measure the same variable at different points in time. Time-series analysis, then, is the
collection of data at specific intervals over a period of time in order to identify trends and
cycles, enabling data analysts to make accurate forecasts for the future. If you wanted to
predict the future demand for a particular product, you might use time-series analysis to see
how the demand for this product typically looks at certain points in time.
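As a small illustration of the time-series technique in the last bullet, the sketch below smooths invented monthly demand figures with a rolling mean to expose the underlying trend behind a seasonal cycle.

```python
# A minimal time-series sketch with pandas (synthetic monthly demand):
# a rolling mean smooths out a 4-month cycle to reveal the trend.
import pandas as pd

demand = pd.Series(
    [120, 135, 150, 110, 125, 140, 155, 115, 130, 145, 160, 120],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

trend = demand.rolling(window=4).mean()   # average over one full cycle
print(trend.dropna().round(1))
```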

Application of Data Analytics In Different Fields

1. Transportation
Data analytics can be applied to improve transportation systems and the intelligence around
them. Predictive analytics helps find transport problems such as traffic and network congestion.
It helps synthesize vast amounts of data and uses them to build and design plans and strategies
for alternative routes, reducing congestion and traffic, which in turn reduces the number of
accidents and mishaps.

2. Logistics and Delivery


Different logistics companies such as DHL, FedEx, and DTDC use data analytics to manage
their overall operations. Using the applications of data analytics, they can figure out the best
shipping routes and approximate delivery times, and they can also track the real-time status of
dispatched goods using GPS trackers. Data analytics has made online shopping easier and more
in demand.
Example of the use of data analytics in logistics and delivery: from the moment a shipment is
dispatched until it reaches the buyer, every position is tracked, which minimizes the loss of
goods.
3. Web Search or Internet Web Results
Web search engines like Yahoo, Bing, DuckDuckGo, and Google use data analytics to deliver
results when you search for something. Whenever you hit the search button, the search engines
use data analytics algorithms to deliver the best search results within a limited time frame. The
set of results that appears whenever we search for information is obtained through data
analytics.

The search term is treated as a keyword, and all the related pieces of information are presented
in a sorted manner that one can easily understand. For example, when you search for a product
on Amazon, it keeps appearing on your social media profiles, showing you details of the
product to persuade you to buy it.

4. Manufacturing
Data analytics helps the manufacturing industries maintain their overall work through certain
tools like predictive analysis, regression analysis, budgeting, etc. The unit can figure out the
number of products that need to be manufactured according to the data collected and analyzed
from demand samples, and likewise in many other operations, increasing operating capacity as
well as profitability.

5. Security
Data analytics provides the utmost security to an organization. Security analytics is an approach
to cybersecurity focused on the analysis of data to deliver proactive security measures. No
business can foresee the future, particularly where security threats are concerned, but by
deploying security analytics tools that can analyze security events, it is possible to detect a
threat before it gets a chance to affect your systems and bottom line.

6. Education
Education is one of the fields where data analytics is most needed in the current scenario. It is
mostly used for adaptive learning, new innovations, adaptive content, etc. Learning analytics is
the measurement, collection, analysis, and reporting of data about students and their specific
circumstances, for the purpose of understanding and optimizing learning and the conditions in
which it happens.
7. Healthcare
Applications of data analytics in healthcare can be used to sift through enormous amounts of
information in seconds to discover treatment options or answers for various illnesses. This not
only gives precise solutions based on historical data but may also give accurate answers for
the unique concerns of specific patients.

8. Military
Military applications of data analytics bring together an assortment of technical and
application-oriented use cases. They empower decision-makers and technologists to make
connections between data analysis and fields such as augmented reality and cognitive
science that are driving military organizations around the globe forward.

9. Insurance
There is a lot of data analysis taking place during the insurance process. Data such as
actuarial data and claims data help insurance companies understand the risk involved in
insuring a person. Analytical software can be used to identify risky claims and bring them
before the authorities for further investigation.

10. Digital Advertisement


Digital advertising has also been transformed as a result of the application of data science. Data
analytics and data algorithms are used in a wide range of advertising mediums, including digital
billboards in cities and banners on websites.

11. Fraud and Risk Detection


Detecting fraud may have been the first application of data analytics. Financial institutions
applied data analytics early because they already had large amounts of customer data at their
disposal. Data analysis was used to examine recent spending patterns and customer profiles to
determine the likelihood of default. It eventually resulted in a reduction in fraud and risk.

12. Travel
Data analysis applications can be used to improve the traveller’s purchasing experience by
analyzing social media and mobile/weblog data. Companies can use data on recent browse-to-
buy conversion rates to create customized offers and packages that take into account the
preferences and desires of their customers.

13. Communication, Media, and Entertainment


When it comes to creating content for different target audiences, recommending content, and
measuring content performance, organizations in this industry analyze customer data and
behavioural data simultaneously. Data analytics is applied to collect and utilize customer insights
and understand their pattern of social-media usage.

14. Energy and Utility


Many firms involved in energy management use data analytics applications in areas such as
smart-grid management, energy distribution, energy optimization, and building automation for
other utility-based firms.
Characteristics of Data

Data quality is crucial – it assesses whether information can serve its purpose in a particular
context (such as data analysis, for example). So, how do you determine the quality of a given
set of information? There are data quality characteristics of which you should be aware.

There are five traits that you’ll find within data quality: accuracy, completeness, reliability,
relevance, and timeliness – read on to learn more.

 Accuracy
 Completeness
 Reliability
 Relevance
 Timeliness

| Characteristic | How it’s measured |
| --- | --- |
| Accuracy | Is the information correct in every detail? |
| Completeness | How comprehensive is the information? |
| Reliability | Does the information contradict other trusted resources? |
| Relevance | Do you really need this information? |
| Timeliness | How up-to-date is the information? Can it be used for real-time reporting? |

Accuracy:

As the name implies, this data quality characteristic means that information is correct. To
determine whether data is accurate or not, ask yourself if the information reflects a real-world
situation. For example, in the realm of financial services, does a customer really have $1 million
in his bank account?

Accuracy is a crucial data quality characteristic because inaccurate information can cause
significant problems with severe consequences. We’ll use the example above – if there’s an error
in a customer’s bank account, it could be because someone accessed it without his knowledge.

Completeness:
“Completeness” refers to how comprehensive the information is. When looking at data
completeness, think about whether all of the data you need is available; you might need a
customer’s first and last name, but the middle initial may be optional.

Why does completeness matter as a data quality characteristic? If information is incomplete, it
might be unusable. Let’s say you’re sending a mailing out. You need a customer’s last name to
ensure the mail goes to the right address – without it, the data is incomplete.

Reliability:

In the realm of data quality characteristics, reliability means that a piece of information doesn’t
contradict another piece of information in a different source or system. We’ll use an example
from the healthcare field; if a patient’s birthday is January 1, 1970 in one system, yet it’s June 13,
1973 in another, the information is unreliable.

Reliability is a vital data quality characteristic. When pieces of information contradict one
another, you can’t trust the data; you could make a mistake that costs your firm money and
reputational damage.

Relevance:

When you’re looking at data quality characteristics, relevance comes into play because there has
to be a good reason as to why you’re collecting this information in the first place. You must
consider whether you really need this information, or whether you’re collecting it just for the
sake of it.

Why does relevance matter as a data quality characteristic? If you’re gathering irrelevant
information, you’re wasting time as well as money. Your analyses won’t be as valuable.

Timeliness:
Timeliness, as the name implies, refers to how up to date information is. If it was gathered in the
past hour, then it’s timely – unless new information has come in that renders previous
information useless.

The timeliness of information is an important data quality characteristic, because information
that isn’t timely can lead to people making the wrong decisions. In turn, that costs organizations
time, money, and reputational damage.

“Timeliness is an important data quality characteristic – out-of-date information costs
companies time and money.”
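Several of these characteristics can be checked mechanically. The sketch below is a minimal example with pandas; the dataset, column names, and reference date are all assumptions made for illustration.

```python
# Simple data-quality checks (assumed data): completeness, uniqueness,
# and timeliness, measured before trusting any analysis built on top.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Ann", "Ben", None, "Ann"],
    "updated":  pd.to_datetime(
        ["2024-01-10", "2024-01-12", "2023-06-01", "2024-01-10"]
    ),
})

completeness = 1 - df["customer"].isna().mean()   # share of filled values
duplicates   = df.duplicated().sum()              # repeated rows
staleness    = (pd.Timestamp("2024-01-15") - df["updated"]).dt.days.max()

print(f"completeness={completeness:.0%}, duplicates={duplicates}, "
      f"oldest record={staleness} days")
```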

Big Data

Big Data is a massive amount of data sets that cannot be stored, processed, or
analyzed using traditional tools. Today, there are millions of data sources that generate data at a
very rapid rate. These data sources are present across the world. Some of the largest sources of
data are social media platforms and networks. Let’s use Facebook as an example—it generates
more than 500 terabytes of data every day. This data includes pictures, videos, messages, and
more.

Data also exists in different formats, like structured data, semi-structured data, and unstructured data.
For example, in a regular Excel sheet, data is classified as structured data—with a definite format. In
contrast, emails fall under semi-structured, and your pictures and videos fall under unstructured data.
All this data combined makes up Big Data.

What is Big Data Analytics?

Big Data analytics is a process used to extract meaningful insights, such as hidden patterns,
unknown correlations, market trends, and customer preferences. Big Data analytics provides
various advantages—it can be used for better decision making, preventing fraudulent activities,
among other things.

Why is big data analytics important?

In today’s world, Big Data analytics is fueling everything we do online—in every industry.
Take the music streaming platform Spotify for example. The company has nearly 96 million
users that generate a tremendous amount of data every day. Through this information, the cloud-
based platform automatically generates suggested songs—through a smart recommendation
engine—based on likes, shares, search history, and more. What enables this is the techniques,
tools, and frameworks that are a result of Big Data analytics.

If you are a Spotify user, you have probably come across the top recommendations section,
which is based on your likes, listening history, and other signals. This is what Spotify does: it
utilizes a recommendation engine that leverages data-filtering tools to collect data and then
filter it using algorithms.
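A toy sketch of that filtering idea, with invented song features: score unheard songs by their cosine similarity to a track the user already likes and recommend the closest match. A real engine like Spotify’s is, of course, far more elaborate.

```python
# A toy content-filtering recommender (all feature values invented).
import numpy as np

# Rows = songs, values = simple hypothetical audio/taste features.
songs = {
    "song_a": np.array([0.9, 0.1, 0.4]),
    "song_b": np.array([0.8, 0.2, 0.5]),
    "song_c": np.array([0.1, 0.9, 0.2]),
}
liked = songs["song_a"]                       # a track the user liked

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

scores = {name: cosine(liked, vec) for name, vec in songs.items() if name != "song_a"}
print(max(scores, key=scores.get))            # -> "song_b"
```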

Uses and Examples of Big Data Analytics

There are many different ways that Big Data analytics can be used in order to improve
businesses and organizations. Here are some examples:

 Using analytics to understand customer behavior in order to optimize the customer experience

 Predicting future trends in order to make better business decisions

 Improving marketing campaigns by understanding what works and what doesn't

 Increasing operational efficiency by understanding where bottlenecks are and how to fix them

 Detecting fraud and other forms of misuse sooner

These are just a few examples — the possibilities are really endless when it comes to Big Data
analytics. It all depends on how you want to use it in order to improve your business.

Benefits and Advantages of Big Data Analytics

1. Risk Management

Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify
fraudulent activities and discrepancies. The organization leverages it to narrow down a list of
suspects or root causes of problems.
2. Product Development and Innovations

Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed
forces across the globe, uses Big Data analytics to analyze how efficient the engine designs are
and if there is any need for improvements.

3. Quicker and Better Decision Making Within Organizations

Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the
company leverages it to decide if a particular location would be suitable for a new outlet or not.
They will analyze several different factors, such as population, demographics, accessibility of the
location, and more.

4. Improve Customer Experience

Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. They
monitor tweets to find out their customers’ experience regarding their journeys, delays, and so on.
The airline identifies negative tweets and does what’s necessary to remedy the situation. By
publicly addressing these issues and offering solutions, it helps the airline build good customer
relations.

Big Data Analytics Tools

Here are some of the key big data analytics tools :

 Hadoop - helps in storing and analyzing data

 MongoDB - used on datasets that change frequently

 Talend - used for data integration and management

 Cassandra - a distributed database used to handle chunks of data

 Spark - used for real-time processing and analyzing large amounts of data (see the sketch after this list)

 STORM - an open-source real-time computational system

 Kafka - a distributed streaming platform that is used for fault-tolerant storage
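As a flavor of how one of these tools is used, here is a minimal word-count sketch with Spark’s Python API; it assumes the pyspark package, a local Spark runtime, and a hypothetical input file logs.txt.

```python
# A minimal Spark sketch (assumes pyspark is installed and logs.txt
# exists): the classic distributed word count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("logs.txt")           # hypothetical input file
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, count in counts.take(10):           # sample of the results
    print(word, count)

spark.stop()
```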

Data Analytics Life Cycle:

The data analytics lifecycle is designed for Big Data problems and data science projects. The
cycle is iterative, to represent a real project. To address the distinct requirements for performing
analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks
involved with acquiring, processing, analyzing, and repurposing data.

Phase 1: Discovery –

 The data science team learns and investigates the problem.
 The team develops context and understanding.
 They come to know about the data sources needed and available for the project.
 The team formulates initial hypotheses that can later be tested with data.

Phase 2: Data Preparation –

 Steps to explore, preprocess, and condition data prior to modeling and analysis.
 This phase requires the presence of an analytic sandbox; the team executes extract, load, and
transform (ELT) operations to get data into the sandbox.
 Data preparation tasks are likely to be performed multiple times, and not in a predefined order.
 Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.

Phase 3: Model Planning –

 The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.
 In this phase, the data science team develops data sets for training, testing, and production
purposes.
 Several tools commonly used for this phase are MATLAB and STATISTICA.

Phase 4: Model Building –

 The team develops datasets for testing, training, and production purposes.
 The team builds and executes models based on the work done in the model planning phase.
 The team also considers whether its existing tools will suffice for running the models, or
whether it needs a more robust environment for executing them.
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – MATLAB, STATISTICA.

Phase 5: Communication Results –

 After executing the model, the team needs to compare the outcomes of modeling to the
criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to the various team
members and stakeholders, taking into account warnings and assumptions.
 The team should identify key findings, quantify the business value, and develop a narrative
to summarize and convey the findings to stakeholders.

Phase 6: Operationalize –

 The team communicates the benefits of the project more broadly and sets up a pilot project
to deploy the work in a controlled way before broadening it to a full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the
model in a production environment on a small scale and make adjustments before full
deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools – Octave, WEKA, SQL, MADlib.
