
Unit-1

Topics Covered:
Introduction to Data Analytics:
 Sources and nature of data,
 Classification of data (structured, semi-structured, unstructured),
 Characteristics of data,
 Introduction to Big Data platform,
 Need of data analytics,
 Evolution of analytic scalability,
 Analytic process and tools,
 Analysis vs reporting,
 Applications of data analytics.

Data Analytics Lifecycle:


 Need, key roles for successful analytic projects,
 Various phases of data analytics lifecycle – discovery, data preparation, model planning,
model building, communicating results, and operationalization.

 What Is Data Analytics?

Data analytics is the science of analyzing raw data in order to draw conclusions from that
information. Many of the techniques and processes of data analytics have been automated into
mechanical processes and algorithms that work over raw data and present the results for human consumption.

Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass
of information. This information can then be used to optimize processes to increase the overall
efficiency of a business or system.
Different Sources of Data for Data Analysis
 Data collection is the process of acquiring, collecting, extracting, and storing
voluminous amounts of data, which may be in structured or unstructured form
such as text, video, audio, XML files, records, or other image files, and which is
used in later stages of data analysis.
In the process of data analysis, data collection is the initial step, performed before
starting to analyze the data for patterns or useful information. The data to be
analyzed must be collected from valid sources.
 The data that is collected is known as raw data. Raw data is not useful as it
stands; once the impurities are cleaned out and the data is put to further analysis,
it becomes information, and the insight derived from that information is known as
“knowledge”. Knowledge can take many forms, such as business knowledge about
sales of enterprise products, knowledge of disease treatment, etc. The main goal of
data collection is to collect information-rich data.
A few methods of collecting primary data:
1. Interview method
2. Survey method
3. Observation method
4. Experimental method

A few methods of collecting secondary data:

 Secondary data:
Secondary data is data that has already been collected and is reused for some
valid purpose. This type of data is derived from previously recorded primary data
and comes from two kinds of sources: internal and external.
Internal source:
This type of data can easily be found within the organization, such as market records,
sales records, transactions, customer data, accounting resources, etc. The cost and time
consumed in obtaining data from internal sources are low.
External source:
Data that cannot be found within the organization and must be obtained through
external third-party resources is external source data. The cost and time consumed are
higher because such sources hold huge amounts of data. Examples of external sources are
government publications, news publications, the Registrar General of India, the Planning
Commission, the International Labour Bureau, syndicate services, and other non-governmental
publications.

Types of Big Data

 Structured
Structured data is one of the types of big data. By structured data, we mean data that can be
processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored in, and accessed from, a database by simple search-engine
algorithms. For instance, the employee table in a company database is structured: the employee
details, their job positions, their salaries, etc., are all present in an organized manner.

 Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.

 Semi-structured
Semi-structured data is the third type of big data. It pertains to data containing
both of the formats mentioned above, that is, structured and unstructured data. To be precise, it
refers to data that, although not classified under a particular repository (database),
contains vital information or tags that segregate individual elements within the data.
The short sketch below illustrates all three types.
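
To make the distinction concrete, here is a small, hedged Python illustration; every value in it is made up.

```python
import json

# Structured: fixed format, like one row of the employee table above
employee_row = ("E101", "A. Kumar", "Analyst", 55000)  # id, name, position, salary

# Semi-structured: no rigid schema, but tags segregate individual elements
record = json.loads('{"id": "E101", "skills": ["SQL", "Python"], "notes": "remote"}')
print(record["skills"])  # the "skills" tag lets us pick out one element

# Unstructured: free-form content, such as the body of an email
email_body = "Hi team, please find last quarter's photos and notes attached..."
```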

Characteristics of Big Data


Let’s discuss the characteristics of big data.
 Variety

Variety of big data refers to the structured, unstructured, and semi-structured
data that is gathered from multiple sources. While in the past data could only
be collected from spreadsheets and databases, today data comes in an array of
forms such as emails, PDFs, photos, videos, audio, social media posts, and much
more. Variety is one of the important characteristics of big data.

 Velocity
Velocity essentially refers to the speed at which data is being created in real
time. In a broader perspective, it comprises the rate of change, the linking of
incoming data sets arriving at varying speeds, and bursts of activity.

 Volume
Volume is one of the characteristics of big data. We already know that big
data indicates huge ‘volumes’ of data being generated on a daily basis
from various sources like social media platforms, business processes,
machines, networks, human interactions, etc. Such large amounts of data are
stored in data warehouses. This brings us to the end of the characteristics of
big data.

BIG DATA ANALYTICS PLATFORMS TO KNOW


 Microsoft Azure
 Cloudera
 Sisense
 Collibra
 Tableau
 MapR
 Qualtrics
 Oracle
 MongoDB
 Datameer

Microsoft Azure
What it does: Users can analyze data stored on Microsoft’s cloud platform, Azure,
with a broad spectrum of open-source Apache technologies, including Hadoop and
Spark. Azure also features a native analytics tool, HDInsight, which streamlines data
cluster analysis and integrates seamlessly with Azure's other data tools.

CLOUDERA

What it does: Rooted in Apache’s Hadoop, Cloudera can handle massive amounts of
data. Clients routinely store more than 50 petabytes in Cloudera’s Data Warehouse,
which can manage data including machine logs, text, and more. Meanwhile,
Cloudera’s DataFlow—previously Hortonworks’ DataFlow—analyzes and prioritizes
data in real time.

GOOGLE CLOUD

What it does: Google Cloud offers lots of big data management tools, each with its
own specialty. BigQuery warehouses petabytes of data in an easily queried
format. Cloud Dataflow analyzes ongoing data streams and batches of historical data
side by side. With Google Data Studio, clients can turn varied data into custom
graphics.
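
As a rough sketch of how BigQuery's "easily queried format" is used from code (not an official example; the project and table names below are hypothetical placeholders, and Google Cloud credentials are assumed to be configured already):

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
query = """
    SELECT channel, SUM(revenue) AS total_revenue
    FROM `my-project.sales.transactions`          -- hypothetical table
    GROUP BY channel
    ORDER BY total_revenue DESC
"""
for row in client.query(query).result():  # run the query and iterate over the rows
    print(row.channel, row.total_revenue)
```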

TABLEAU

What the platform does: The Tableau platform—available on-premises or in the
cloud—allows users to find correlations, trends, and unexpected interdependencies
between data sets. The Data Management add-on further enhances the platform,
allowing for more granular data cataloging and the tracking of data lineage.

MAPR

What the platform does: MapR’s platform, which they term "data ware," has
attracted customers like American Express and Samsung with its massive capacity
(exabytes!) and robust security measures. But it's not a platform so much as a meta-
platform—a dashboard for managing big data spread across various platforms, clouds,
servers and edge-computing devices. Its interface offers users a 10,000-foot
perspective on the totality of their data while letting them manage various data types
in one place.

ORACLE

What the platform does: Oracle Cloud’s big data platform can automatically migrate
diverse data formats to cloud servers, purportedly with no downtime. The platform
can also operate on-premises and in hybrid settings, enriching and transforming data
whether it’s streaming in real time or stored in a centralized repository, aka "data
lake." The platform comes in three formats, including basic and governance editions.

 Why Data Analytics?

Data analytics is needed in Business-to-Consumer (B2C) applications. Organizations
collect data that they have gathered from customers, businesses, the economy, and practical
experience. The data is then processed and categorized as per the requirement, and
analysis is done to study purchase patterns, etc.
1. Cost reduction.

Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient
ways of doing business.

2. Faster, better decision making.

With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new
sources of data, businesses are able to analyze information immediately – and make decisions
based on what they’ve learned.

3. New products and services.

With the ability to gauge customer needs and satisfaction through analytics comes the power to
give customers what they want. Davenport points out that with big data analytics, more
companies are creating new products to meet customers’ needs.

WHAT ARE DATA ANALYST TOOLS?


The term ‘Data analytics tools’ is used to classify software and applications used by
Data Analysts to create and execute analytic processes that help businesses make
smarter, more informed business decisions while minimizing cost and boosting
profits.

HOW TO CHOOSE A DATA ANALYST TOOL?


Start by considering your company’s business requirements:

 Will it be used by seasoned Data Analysts and Data Scientists or non-technical users who
need an intuitive interface?

 Some Data analytics tools provide an immersive experience in code creation, generally
with SQL, while others are more concerned with point-and-click review, best suited for
beginners.

 The Data analytics software should also offer support for visualizations relevant to your
business goals.
 Finally, take price and licensing into consideration. Some Data analytics tools charge
license or subscription fees, while some Data analytics tools are free.

 The most expensive Data analytics tools are not always the most comprehensive, and
there are many robust and free Data analytics tools available in the market that shouldn’t
be overlooked.

MOST POPULAR DATA ANALYTICS TOOLS TO KNOW IN 2021

1. R
R is now one of the most popular analytics tools in the industry. It has surpassed SAS
in usage and is now the Data analytics tool of choice, even for companies that can
easily afford SAS. Over the years, R has become a lot more robust. It handles large
data sets much better than it used to, say even a decade earlier. It has also become a
lot more versatile.

2. Python
Python has been one of programmers' favorite languages since its inception. The
main reason for its fame is that it’s an easy-to-learn language that is also quite
fast. It has developed into one of the most powerful Data analytics tools with the
growth of analytical and statistical libraries like NumPy, SciPy, etc. Today, it
offers comprehensive coverage of statistical and mathematical functions.
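
To make that concrete, here is a minimal sketch, with made-up numbers, of the kind of statistical work NumPy and SciPy support:

```python
import numpy as np
from scipy import stats

data = np.array([12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5])  # hypothetical sample
print("mean:", np.mean(data))
print("sample std dev:", np.std(data, ddof=1))

# One-sample t-test: is the mean significantly different from 12.0?
t_stat, p_value = stats.ttest_1samp(data, popmean=12.0)
print("t =", t_stat, ", p =", p_value)
```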

3. Tableau
Tableau is among the most easy-to-learn Data analytics tools that perform an effective
job of slicing and dicing your data and creating great visualizations and dashboards.
Tableau can create better visualizations than Excel and can most definitely handle
much more data than Excel can. If you want interactivity in your plots, then Tableau
is surely the way to go.
4. Excel
Excel is, of course, the most widely used Data analytics software in the world.
Whether you are an expert in R or Tableau, you will still use Excel for the grunt work.
Non-analytics professionals will usually not have access to tools like SAS or R on
their systems, but everyone has Excel. Excel becomes vital when the analytics team
interfaces with the business team.
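
One common hand-off between the two teams, sketched here under the assumption of a hypothetical workbook sales_report.xlsx maintained in Excel, is loading a sheet into Python for heavier analysis:

```python
import pandas as pd  # reading .xlsx files also requires the openpyxl package

# "sales_report.xlsx" and the sheet name "Q1" are hypothetical placeholders
df = pd.read_excel("sales_report.xlsx", sheet_name="Q1")
print(df.describe())  # quick summary statistics for the numeric columns
```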

Reporting vs. Analytics: What’s the difference?

Reporting

 Reporting takes factual data and presents it. There’s no judgement or insight added.
People can, of course, derive insight from reports, but that’s up to them.

 Reporting extracts data from various data sources, allows comparisons, and makes the
information easier to understand by summarizing and visualizing the data in tables, charts
and dashboards.

 Reporting is “the process of organizing data into informational summaries in order to
monitor how different areas of a business are performing.”

 Reporting: MRR (monthly recurring revenue) is typically reported as a chart showing
the MRR for the last year, marked monthly.
Analytics

Analytics asks questions of the data collected and provides answers and insight. It (hopefully)
injects business expertise and knowledge into the analysis to deliver the final output—a
recommendation, course of action, or prediction.

Analytics is “the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.”

Analytics: here, MRR is sliced by marketing channel over the same twelve months as
above. A minimal sketch of the reporting/analytics distinction follows.
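
A minimal pandas sketch of the distinction, using made-up MRR figures:

```python
import pandas as pd

df = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb"],
    "channel": ["ads", "organic", "ads", "organic"],
    "mrr":     [1200, 800, 1350, 950],
})

# Reporting: summarize the facts (total MRR per month); no judgement added
print(df.groupby("month")["mrr"].sum())

# Analytics: slice MRR by channel to explore *why* the total is moving
print(df.pivot_table(index="month", columns="channel", values="mrr"))
```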
Data Analytics Applications:
Below are the various areas where data analytics applications have been employed:

1.) Policing/Security

Several cities all over the world have employed predictive analysis to predict
areas that would likely witness a surge in crime, using geographical data
and historical data.

2.) Transportation
A few years back, at the London Olympics, there was a need to handle over 18 million
journeys made by fans in the city of London, and fortunately this was sorted out.

3.) Fraud and Risk Detection


This has been known as one of the initial applications of data science, which
emerged from the discipline of finance. Many organizations had very bad
experiences with debt, were fed up with it, and turned to analytics to detect
fraud and assess risk early.
4.) Manage Risk
In the insurance industry, risk management is the major focus. What most people aren’t
aware of is that when insuring a person, the risk involved is assessed not on mere
information but on data that has been statistically analyzed before a decision is made.

5.) Delivery Logistics


Well, data science and analytics have no shortage of applications. There are several logistics
companies working all over the world, such as UPS, DHL, FedEx, etc., that make use of
data to improve their operational efficiency.

Key Roles for a Successful Analytics Project:

 In recent years, substantial attention has been placed on the emerging role of
the data scientist.
 We will explain the various roles and key stakeholders of an analytics
project. Each plays a critical part in a successful analytics project. Although
seven roles are listed, fewer or more people can accomplish the work
depending on the scope of the project, the organizational structure, and the
skills of the participants.
 For example, on a small, versatile team, these seven roles may be fulfilled by
only three people, but a very large project may require 20 or more people.

1) Business User: Someone who understands the domain area and usually
benefits from the results. This person can consult and advise the project team
in the context of the project, the value of the results, and how the outputs will
be operationalized.

2) Project Sponsor: Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines the core business
problem.

3) Project Manager: Ensures that key milestones and objectives are met on
time and at the expected quality.
4) Business Intelligence Analyst: Provides business domain expertise
based on a deep understanding of the data, key performance indicators
(KPIs), key metrics, and business intelligence from a reporting perspective.

5) Database Administrator (DBA): Provisions and configures the
database environment to support the analytics needs of the working team.
These responsibilities may include providing access to key databases or
tables and ensuring the appropriate security levels are in place related to the
data repositories.

6) Data Engineer: Leverages deep technical skills to assist with tuning SQL
queries for data management and data extraction.

7) Data Scientist: Provides subject matter expertise for analytical techniques,
data modeling, and applying valid analytical techniques to given business
problems. Ensures overall analytics objectives are met.

Life Cycle Phases of Data Analytics

The Data Analytics Lifecycle is a cyclic process which explains, in six stages,
how information is made, collected, processed, implemented, and analyzed for
different objectives.
Phase-1 Data Discovery
This is the initial phase to set your project's objectives and find ways to achieve a
complete data analytics lifecycle. Start with defining your business domain and
ensure you have enough resources (time, technology, data, and people) to achieve
your goals.
The biggest challenge in this phase is to accumulate enough information. You need
to draft an analytic plan, which requires some serious work.
Accumulate resources
First, you have to analyze the models you have intended to develop. Then determine
how much domain knowledge you need to acquire for fulfilling those models.
The next important thing to do is assess whether you have enough skills and
resources to bring your projects to reality.
Frame the issue
Problems are most likely to occur while meeting your client's expectations.
Therefore, you need to identify the issues related to the project and explain them to
your clients. This process is called "framing." You have to prepare a problem
statement explaining the current situation and challenges that can occur in the future.
You also need to define the project's objective, including the success and failure
criteria for the project.

Formulate initial hypothesis

Once you gather all the clients' requirements, you have to develop initial hypotheses
after exploring the initial data.
Phase-2 Data Preparation and Processing
The Data preparation and processing phase involves collecting, processing, and
conditioning data before moving to the model building process.

Identify data sources


You have to identify various data sources and analyze how much and what kind of
data you can accumulate within a given timeframe. Evaluate the data structures,
explore their attributes and acquire all the tools needed.
Collection of data
You can collect data using three methods (a brief sketch follows this list):
Data acquisition: you can collect data through external sources.
Data entry: you can prepare data points through digital systems or manual entry
as well.
Signal reception: you can accumulate data from digital devices such as IoT devices
and control systems.
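
A small, hedged sketch of the three methods; the URL and all values are hypothetical placeholders:

```python
import pandas as pd

# Data acquisition: pull data from an external source (placeholder URL)
customers = pd.read_csv("https://example.com/exports/customers.csv")

# Data entry: data points prepared manually or through a digital form
manual_rows = pd.DataFrame([{"customer_id": "C001", "segment": "retail"}])

# Signal reception: readings pushed by devices such as IoT sensors
sensor_readings = pd.DataFrame([{"device": "T-17", "temp_c": 21.4}])
```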
Phase-3 Model Planning
This is a phase where you have to analyze the quality of data and find a suitable
model for your project.
Loading Data in Analytics Sandbox
An analytics sandbox is a part of data lake architecture that allows you to store and
process large amounts of data. It can efficiently process a large range of data such
as big data, transactional data, social media data, web data, and many more. It is an
environment that allows your analysts to schedule and process data assets using the
data tools of their choice. The best part of the analytics sandbox is its agility. It
empowers analysts to process data in real-time and get essential information within
a short duration.
Data is loaded into the sandbox in three ways (a toy sketch follows this list):
ETL − Team specialists make the data comply with the business rules before loading
it into the sandbox.
ELT − The data is loaded into the sandbox first and then transformed as per the
business rules.
ETLT − It comprises two levels of data transformation, including both ETL and ELT.
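
A toy sketch of the ETL/ELT difference, where a plain Python list stands in for the sandbox and "amounts must be positive" is a hypothetical business rule:

```python
import pandas as pd

raw = pd.DataFrame({"amount": [100, -5, 250]})  # hypothetical raw feed
sandbox = []  # stand-in for the analytics sandbox

# ETL: apply the business rule first, then load the clean data
clean = raw[raw["amount"] > 0]
sandbox.append(clean)

# ELT: load the raw data first, then transform it inside the sandbox
sandbox.append(raw)
sandbox[1] = sandbox[1][sandbox[1]["amount"] > 0]
```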
The data you have collected may contain unnecessary features or null values. It may
come in a form too complex to anticipate. This is where data exploration can help
you uncover the hidden trends in the data.

Steps involved in data exploration (a minimal sketch follows this list):

 Data identification
 Univariate Analysis
 Multivariate Analysis
 Filling Null values
 Feature engineering
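
A minimal pandas sketch of these steps over a small, made-up data set:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 32, None, 41], "income": [30000, 52000, 47000, None]})

print(df.dtypes)      # data identification: what columns and types do we have?
print(df.describe())  # univariate analysis: per-column summary statistics
print(df.corr())      # multivariate analysis: pairwise correlations

df = df.fillna(df.median(numeric_only=True))     # filling null values
df["income_per_age"] = df["income"] / df["age"]  # toy feature engineering
```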
For model planning, data analysts often use regression techniques, decision trees, neural networks,
etc. Tools mostly used for model planning and execution include R and PL/R, WEKA, Octave,
Statista, and MATLAB.

Phase-4 Model Building

Model building is the process where you have to deploy the planned model in a real-
time environment. It allows analysts to solidify their decision-making by gaining
in-depth analytical information. This is a repetitive process, as you constantly have
to add new features as required by your customers.
Your aim here is to forecast business decisions, customize market strategies, and
develop offerings tailor-made to customer interests. This can be done by integrating
the model into your existing production domain.
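
A minimal scikit-learn sketch of the model-building step, fitting a simple regression on synthetic data and checking it on a holdout set before promotion (the data and the choice of model are illustrative, not prescribed by the lifecycle):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((200, 3))                              # synthetic stand-in features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("holdout R^2:", model.score(X_test, y_test))    # sanity check before deployment
```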
Phase-5 Result Communication and Publication
This is the phase where you have to communicate the data analysis to your clients.
It requires several intricate processes in which you present information to the clients
in a lucid manner. Your clients don't have enough time to determine which data is
essential, so you must do an impeccable job of grabbing their attention.
Check the data accuracy
Does the data provide information as expected? If not, you have to run some other
processes to resolve the issue. You need to ensure the data you process provides
consistent information. This will help you build a convincing argument while
summarizing your findings.
Highlight important findings
Each piece of data plays a significant role in building an efficient project. However,
some data carries more potent information that can truly serve your audience's
interests. While summarizing your findings, try to categorize the data into different
key points.
Phase-6 Operationalize
As soon as you prepare a detailed report including your key findings, documents, and
briefings, your data analytics lifecycle comes close to its end. The next step is to
measure the effectiveness of your analysis before submitting the final reports to
your stakeholders.
In this process, you have to move the sandbox data and run it in a live environment,
then closely monitor the results, ensuring they match your expected goals. If the
findings fit your objective, you can finalize the report; otherwise, you have to take
a step back in the data analytics lifecycle and make some changes. A minimal version
of that final check is sketched below.
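
A minimal sketch of that final check, with a hypothetical success criterion of 80% accuracy carried over from the discovery phase:

```python
expected_accuracy = 0.80                 # hypothetical success criterion from Phase 1
live_predictions = [1, 0, 1, 1, 0, 1]    # made-up live-environment outputs
live_actuals     = [1, 0, 1, 0, 0, 1]    # made-up observed outcomes

hits = sum(p == a for p, a in zip(live_predictions, live_actuals))
accuracy = hits / len(live_actuals)

if accuracy >= expected_accuracy:
    print("Findings match the objective: finalize the report")
else:
    print("Step back in the lifecycle and revise the model")
```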
