Topics Covered:
Introduction to Data Analytics:
Sources and nature of data,
Classification of data (structured, semi-structured, unstructured),
Characteristics of data,
Introduction to Big Data platform,
Need of data analytics,
Evolution of analytic scalability,
Analytic process and tools,
Analysis vs reporting,
Applications of data analytics.
Data analytics is the science of analyzing raw data in order to draw conclusions from it. Many of the techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.
Data analytics techniques can reveal trends and metrics that would otherwise be lost in the mass
of information. This information can then be used to optimize processes to increase the overall
efficiency of a business or system.
Different Sources of Data for Data Analysis
Data collection is the process of acquiring, collecting, extracting, and storing the
voluminous amount of data which may be in the structured or unstructured form
like text, video, audio, XML files, records, or other image files used in later
stages of data analysis.
In the process of data analysis, “Data collection” is the initial step before starting
to analyze the patterns or useful information in data. The data which is to be
analyzed must be collected from different valid sources.
The collected data, known as raw data, is not immediately useful; cleaning out the impurities and using the data for further analysis yields information, and the information obtained is known as "knowledge." Knowledge takes many forms, such as business knowledge about the sales of enterprise products, or knowledge about disease treatment. The main goal of data collection is to collect information-rich data.
A few methods of collecting primary data:
1. Interview method
2. Survey method
3. Observation method
4. Experimental method
Structured
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search algorithms. For instance, the employee table in a company database is structured: the employee details, their job positions, their salaries, etc., are all present in an organized manner.
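The employee example above can be sketched in code. This is a minimal illustration, not tied to any particular company database: the table name, columns, and values are all hypothetical, but they show why a fixed schema makes retrieval a simple query.

```python
import sqlite3

# Hypothetical employee table with a fixed schema (name, position, salary).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, position TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Asha", "Analyst", 52000.0), ("Ben", "Engineer", 61000.0)],
)

# Because the format is fixed, retrieval is a straightforward SQL query.
rows = conn.execute(
    "SELECT name, salary FROM employee WHERE salary > 55000"
).fetchall()
print(rows)  # [('Ben', 61000.0)]
```

Every record conforms to the same three-column layout, which is exactly what lets a simple query engine find and filter it.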
Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.
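A short sketch of the email example, using Python's standard-library parser on an invented message: the headers have some regularity, but the body is free-form text with no schema, so any analysis falls back to plain text processing.

```python
from email.parser import Parser

# An invented email: the body has no fixed structure at all.
raw = """\
From: ops@example.com
Subject: Server alert

Disk usage on host-7 crossed 90% at 10:05. Please investigate."""

msg = Parser().parsestr(raw)
print(msg["Subject"])  # Server alert

# No schema to query, so we are reduced to searching the raw text.
body = msg.get_payload()
mentions_disk = "disk" in body.lower()
print(mentions_disk)  # True
```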
Semi-structured
Semi-structured data is the third type of big data. It contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although not classified under a particular repository (database), still contains vital information or tags that segregate individual elements within the data. That concludes the types of data; let's discuss the characteristics of data.
Velocity
Velocity essentially refers to the speed at which data is being created in real
time. From a broader perspective, it comprises the rate of change, the linking of
incoming data sets arriving at varying speeds, and activity bursts.
Volume
Volume is one of the characteristics of big data. We already know that Big
Data indicates huge 'volumes' of data being generated on a daily basis
from various sources such as social media platforms, business processes,
machines, networks, human interactions, etc. Such large amounts of data are
stored in data warehouses. That concludes the characteristics of big data.
Microsoft Azure
What it does: Users can analyze data stored on Microsoft’s Cloud platform, Azure,
with a broad spectrum of open-source Apache technologies, including Hadoop and
Spark. Azure also features a native analytics tool, HDInsight, which streamlines data
cluster analysis and integrates seamlessly with Azure's other data tools.
CLOUDERA
What it does: Rooted in Apache’s Hadoop, Cloudera can handle massive amounts of
data. Clients routinely store more than 50 petabytes in Cloudera’s Data Warehouse,
which can manage data including machine logs, text, and more. Meanwhile,
Cloudera’s DataFlow—previously Hortonworks’ DataFlow—analyzes and prioritizes
data in real time.
GOOGLE CLOUD
What it does: Google Cloud offers lots of big data management tools, each with its
own specialty. BigQuery warehouses petabytes of data in an easily queried
format. Cloud Dataflow analyzes ongoing data streams and batches of historical data
side by side. With Google Data Studio, clients can turn varied data into custom
graphics.
MAPR
What the platform does: MapR’s platform, which they term "data ware," has
attracted customers like American Express and Samsung with its massive capacity
(exabytes!) and robust security measures. But it's not a platform so much as a meta-
platform—a dashboard for managing big data spread across various platforms, clouds,
servers and edge-computing devices. Its interface offers users a 10,000-foot
perspective on the totality of their data while letting them manage various data types
in one place.
ORACLE
What the platform does: Oracle Cloud’s big data platform can automatically migrate
diverse data formats to cloud servers, purportedly with no downtime. The platform
can also operate on premise and in hybrid settings, enriching and transforming data
whether it’s streaming in real time or stored in a centralized repository, aka "data
lake." The platform comes in three formats, including basic and governance editions.
Big data technologies such as Hadoop and cloud-based analytics bring significant cost
advantages when it comes to storing large amounts of data – plus they can identify more efficient
ways of doing business.
With the speed of Hadoop and in-memory analytics, combined with the ability to analyze new
sources of data, businesses are able to analyze information immediately – and make decisions
based on what they’ve learned.
With the ability to gauge customer needs and satisfaction through analytics comes the power to
give customers what they want. Davenport points out that with big data analytics, more
companies are creating new products to meet customers’ needs.
Will it be used by seasoned Data Analysts and Data Scientists or non-technical users who
need an intuitive interface?
Some Data analytics tools provide an immersive experience in code creation, generally
with SQL, while others focus on point-and-click exploration better suited for
beginners.
The Data analytics software should also offer support for visualizations relevant to your
business goals.
Finally, take price and licensing into consideration. Some Data analytics tools charge
license or subscription fees, while some Data analytics tools are free.
The most expensive Data analytics tools are not always the most comprehensive, and
there are many robust and free Data analytics tools available in the market that shouldn’t
be overlooked.
1. R
R is now one of the most popular analytics tools in the industry. It has surpassed SAS
in usage and is now the Data analytics tool of choice, even for companies that can
easily afford SAS. Over the years, R has become a lot more robust. It handles large
data sets much better than it used to, say even a decade earlier. It has also become a
lot more versatile.
2. Python
Python has been one of the favorite languages of programmers since its inception. The
main reason for its fame is the fact that it’s an easy-to-learn language that is also quite
fast. However, it developed into one of the most powerful Data analytics tools with the
development of analytical and statistical libraries like NumPy, SciPy, etc. Today, it
offers comprehensive coverage of statistical and mathematical functions.
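A small taste of that statistical coverage, using NumPy on an invented sample (SciPy adds hypothesis tests, distributions, and optimization on top of this foundation):

```python
import numpy as np

# Invented sample data; NumPy supplies the descriptive statistics.
sample = np.array([2.1, 2.5, 2.3, 2.8, 2.6, 2.4])

mean = sample.mean()       # arithmetic mean
std = sample.std(ddof=1)   # sample standard deviation (Bessel-corrected)

print(round(mean, 2), round(std, 3))  # 2.45 0.243
```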
3. Tableau
Tableau is among the most easy-to-learn Data analytics tools that perform an effective
job of slicing and dicing your data and creating great visualizations and dashboards.
Tableau can create better visualizations than Excel and can most definitely handle
much more data than Excel can. If you want interactivity in your plots, then Tableau
is surely the way to go.
4. Excel
Excel is, of course, the most widely used Data analytics software in the world.
Whether you are an expert in R or Tableau, you will still use Excel for the grunt work.
Non-analytics professionals will usually not have access to tools like SAS or R on
their systems. But everyone has Excel. Excel becomes vital when the analytics team
interfaces with the business team.
Reporting
Reporting takes factual data and presents it. There’s no judgement or insight added.
People can, of course, derive insight from reports, but that’s up to them.
Reporting extracts data from various data sources, allows comparisons, and makes the
information easier to understand by summarizing and visualizing the data in tables, charts
and dashboards.
Reporting: Here is how MRR (monthly recurring revenue) is typically reported: a chart
showing the MRR for the last year, marked monthly.
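In code, reporting amounts to presenting the facts as-is. The MRR figures below are invented for illustration; note that nothing is interpreted, only summarized and displayed.

```python
# Invented MRR figures, presented month by month with no judgement added.
mrr_by_month = {
    "Jan": 10000, "Feb": 10400, "Mar": 10100,
    "Apr": 11200, "May": 11800, "Jun": 12500,
}

for month, mrr in mrr_by_month.items():
    print(f"{month}: ${mrr:,}")

total = sum(mrr_by_month.values())
print(f"Half-year total: ${total:,}")  # Half-year total: $66,000
```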
Analytics
Analytics asks questions of the data collected and provides answers and insight. It (hopefully)
injects business expertise and knowledge into the analysis to deliver the final output—a
recommendation, course of action, or prediction.
Analytics is “the process of exploring data and reports in order to extract meaningful
insights, which can be used to better understand and improve business performance.”
Analytics: Here is MRR sliced by marketing channel, over the same twelve months as
above.
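By contrast, analytics asks a question of the data and produces a conclusion. This sketch uses invented per-channel MRR figures and answers a hypothetical question: which channel is driving growth?

```python
# Invented MRR figures, sliced by marketing channel across two months.
mrr = [
    {"month": "May", "channel": "organic", "mrr": 6800},
    {"month": "May", "channel": "paid",    "mrr": 5000},
    {"month": "Jun", "channel": "organic", "mrr": 7600},
    {"month": "Jun", "channel": "paid",    "mrr": 4900},
]

# Pivot the rows into {channel: {month: mrr}}.
growth = {}
for row in mrr:
    growth.setdefault(row["channel"], {})[row["month"]] = row["mrr"]

# The analytic step: compare months per channel and draw a conclusion.
for channel, months in growth.items():
    print(channel, months["Jun"] - months["May"])  # organic 800 / paid -100

best = max(growth, key=lambda c: growth[c]["Jun"] - growth[c]["May"])
print(best)  # organic
```

The output is not just a chart but an answer: under these (made-up) numbers, the organic channel is where the growth comes from.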
Data Analytics Applications:
Below are the various areas where data analytics applications have been employed:
1.) Policing/Security
Several cities all over the world have employed predictive analysis in predicting
areas that would likely witness a surge in crime with the use of geographical data
and historical data.
2.) Transportation
A few years back, at the London Olympics, there was a need to handle over 18 million
journeys made by fans in the city of London, and fortunately it was managed successfully.
In recent years, substantial attention has been placed on the emerging role of
the data scientist.
We will explain the various roles and key stakeholders of an analytics
project. Each plays a critical part in a successful analytics project. Although
seven roles are listed, fewer or more people can accomplish the work
depending on the scope of the project, the organizational structure, and the
skills of the participants.
For example, on a small, versatile team, these seven roles may be fulfilled by
only 3 people, but a very large project may require 20 or more people.
1) Business User: Someone who understands the domain area and usually
benefits from the results. This person can consult and advise the project team
in the context of the project, the value of the results, and how the outputs will
be operationalized.
2) Project Sponsor: Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines the core business
problem.
3) Project Manager: Ensures that key milestones and objectives are met on
time and at the expected quality.
4) Business Intelligence Analyst: Provides business domain expertise
based on a deep understanding of the data, key performance indicators
(KPIs), key metrics, and business intelligence from a reporting perspective.
5) Database Administrator (DBA): Provisions and configures the database
environment to support the analytics needs of the working team.
6) Data Engineer: Leverages deep technical skills to assist with tuning SQL
queries for data management and data extraction.
7) Data Scientist: Provides subject-matter expertise for analytical
techniques and data modeling, and applies valid analytical techniques to
the given business problems.
The Data Analytics Lifecycle is a cyclic process which explains, in six stages,
how information is made, collected, processed, implemented, and analyzed for
different objectives.
Phase-1 Data Discovery
This is the initial phase to set your project's objectives and find ways to achieve a
complete data analytics lifecycle. Start with defining your business domain and
ensure you have enough resources (time, technology, data, and people) to achieve
your goals.
The biggest challenge in this phase is to accumulate enough information. You need
to draft an analytic plan, which requires some serious work.
Accumulate resources
First, you have to analyze the models you have intended to develop. Then determine
how much domain knowledge you need to acquire for fulfilling those models.
The next important thing to do is assess whether you have enough skills and
resources to bring your projects to reality.
Frame the issue
Problems are most likely to occur while meeting your client's expectations.
Therefore, you need to identify the issues related to the project and explain them to
your clients. This process is called "framing." You have to prepare a problem
statement explaining the current situation and challenges that can occur in the future.
You also need to define the project's objective, including the success and failure
criteria for the project.
Data identification
Univariate Analysis
Multivariate Analysis
Filling Null values
Feature engineering
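The steps listed above can be sketched on a tiny invented dataset. The column names and the mean-imputation strategy here are illustrative choices, not a prescribed method.

```python
# Invented dataset with missing values.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
]

# Univariate analysis: summarise one variable at a time.
ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(ages) / len(ages)  # 31.5

# Filling null values: impute missing entries with the column mean.
incomes = [r["income"] for r in rows if r["income"] is not None]
mean_income = sum(incomes) / len(incomes)
for r in rows:
    if r["age"] is None:
        r["age"] = mean_age
    if r["income"] is None:
        r["income"] = mean_income

# Feature engineering: derive a new column from existing ones.
for r in rows:
    r["income_per_year_of_age"] = r["income"] / r["age"]

print(rows[1]["age"])  # 31.5
```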
For model planning, data analysts often use regression techniques, decision trees, neural networks,
etc. Tools commonly used for model planning and execution include R and PL/R, WEKA, Octave,
Statistica, and MATLAB.
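As a minimal illustration of the regression techniques mentioned, here is an ordinary least-squares line fit on invented data, a stand-in for what the tools above do at much larger scale:

```python
# Invented (x, y) observations that roughly follow a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares estimates for y = a + b*x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(round(b, 2), round(a, 2))  # 1.99 0.09
```

The fitted slope (about 2) recovers the trend built into the made-up data, which is exactly the kind of relationship a model-planning phase sets out to capture.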