
Data Analytics (KIT-601)

3rd Year (Semester – VI)


Unit - I

Anil Singh
Asst. Prof.
CSE Dept.
UIT, Prayagraj
Introduction to Data Analytics
• What is data?
– Data is a collection of facts, such as numbers, words, measurements, observations, or just descriptions of things.



Different Sources of Data for Data Analysis
• Data Collection:
– Data collection is the process of acquiring, extracting, and storing large volumes of data, which may be structured or unstructured (text, video, audio, XML files, records, or image files), for use in the later stages of data analysis.
– In big data analysis, data collection is the initial step, carried out before any patterns or useful information can be extracted from the data.

– The data to be analyzed must be collected from different valid sources.
– The collected data is known as raw data. Raw data is not useful on its own; once it is cleaned of impurities and analyzed, it becomes information, and the insight derived from that information is known as “knowledge”.
– Knowledge can take many forms, such as business knowledge, insight into sales of enterprise products, disease treatment, etc.
– The main goal of data collection is to collect
information-rich data.
– Data collection starts with questions such as: what type of data is to be collected, and what is its source?
– Most collected data is of two types: “qualitative data”, non-numerical data such as words and sentences that mostly reflects the behavior and actions of a group, and “quantitative data”, numerical data that can be measured and analyzed using scientific tools and sampling methods.

• Sources of Data:
Data is collected from the following kinds of sources:
– Primary Sources:
• Data collected for the first time by an individual, a group of individuals, or an institution or organization is known as a primary source of data.

– Secondary Sources:
• The data collected from any published or unpublished
sources are called secondary sources.
• [Figure: the different methods of data collection, grouped into primary and secondary sources.]
• Primary Data:
– Raw, original data extracted directly from official sources is known as primary data.
– This type of data is collected directly by performing
techniques such as questionnaires, interviews, and
surveys.
– The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden during data processing.

• A few methods of collecting primary data:
• Interview method:
– In this method, data is collected by interviewing the target audience; the person asking the questions is called the interviewer, and the person answering them is the interviewee.
– Some basic business- or product-related questions are asked and recorded in the form of notes, audio, or video, and this data is stored for processing.
– Interviews can be structured or unstructured, and may be personal or formal interviews conducted by telephone, face to face, email, etc.

• Survey method:
– The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video.
– Surveys can be conducted both online and offline, for example through website forms and email; the answers are then stored for analysis.
– Examples are online surveys and surveys through social media polls.

• Observation method:
– The observation method is a method of data collection in which the researcher keenly observes the behavior and practices of the target audience using some data collection tool and stores the observed data as text, audio, video, or another raw format.
– In this method, data is collected directly, sometimes by putting a few questions to the participants.
– For example, observing a group of customers and their behavior towards products; the data obtained is then sent for processing.

• Experimental method:
– The experimental method is the process of collecting data by performing experiments, research, and investigation. The most frequently used experimental designs are CRD, RBD, LSD, and FD.
– CRD - Completely Randomized Design
• It is a simple experimental design, used in data analytics, based on randomization and replication. It is mostly used for comparing experimental treatments.

– RBD - Randomized Block Design
• It is an experimental design in which the experiment is divided into small units called blocks.
• Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA).
• RBD originated in the agricultural sector.
– LSD – Latin Square Design
• It is an experimental design similar to CRD and RBD but organized by both rows and columns.
• It is an N×N arrangement with an equal number of rows and columns, in which each letter occurs exactly once in each row and once in each column. Hence differences can be identified with fewer errors in the experiment.
• A Sudoku puzzle is an example of a Latin square design; a small construction sketch follows.
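The construction below is a minimal, illustrative sketch (not from the slides): one standard way to build a Latin square is to rotate the list of treatment labels cyclically, which guarantees each label appears exactly once per row and once per column.

    # Build an N x N Latin square by cyclic rotation: row i is the label
    # list shifted left by i positions, so every label occurs exactly once
    # in each row and each column.
    def latin_square(treatments):
        n = len(treatments)
        return [[treatments[(i + j) % n] for j in range(n)] for i in range(n)]

    for row in latin_square(["A", "B", "C", "D"]):
        print(" ".join(row))
    # A B C D
    # B C D A
    # C D A B
    # D A B C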

– FD - Factorial Design
• It is an experimental design in which each experiment has two or more factors, each with several possible levels, and trials are run so that the combinations of factor levels are covered; see the sketch after this item.
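A minimal sketch (the factor names and levels are hypothetical) of enumerating the runs of a full factorial design with itertools.product, so that every combination of factor levels is tried once:

    # Enumerate every combination of factor levels (a full factorial design).
    from itertools import product

    factors = {
        "temperature": ["low", "high"],       # hypothetical factor 1
        "fertilizer": ["none", "standard"],   # hypothetical factor 2
    }

    runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]
    for run in runs:
        print(run)
    # 2 x 2 = 4 runs in total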

• Secondary data:
– Secondary data is data that has already been collected and is reused for another valid purpose.
– It is derived from previously recorded primary data and has two types of sources: internal and external.
– Internal source:
• These types of data can easily be found within the
organization such as market record, a sales record,
transactions, customer data, accounting resources, etc.
• The cost and time needed to obtain data from internal sources are low.

– External source:
• Data that cannot be found within the organization and must be obtained through external third-party resources is external source data.
• The cost and time needed are greater because this involves a huge amount of data.
• Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.

– Other sources:
• Sensors data:
– With the advancement of IoT devices, the sensors of these
devices collect data which can be used for sensor data analytics
to track the performance and usage of products.
• Satellites data:
– Satellites collect many terabytes of images and data daily through surveillance cameras, and this data can be mined for useful information.
• Web traffic:
– Thanks to fast and cheap internet access, data in many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the most frequently searched keywords and queries.

Classification of data
• Data classification is the process of organizing data into
categories that make it easy to retrieve, sort and store
for future use.
• A well-planned data classification system makes
essential data easy to find and retrieve.
• Written procedures and guidelines for data
classification policies should define what categories
and criteria the organization will use to classify data.
• Once a data classification scheme is created, security
standards should be identified that specify appropriate
handling practices for each category.
• Data can be broadly classified into three
types:
– Structured
– Semi-structured
– Unstructured

• Structured data –
– Structured data is data whose elements are addressable
for effective analysis.
– It has been organized into a formatted repository that is
typically a database.
– It covers all data that can be stored in a SQL database, in tables with rows and columns.
– Structured data has relational keys and can easily be mapped into pre-designed fields.
– Today, structured data is the most processed kind of data and the simplest to manage.
– Example: Relational data.
• Examples of Structured Data
– An ‘Employee’ table in a database is an example of structured data; an illustrative sketch follows.
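A minimal illustrative sketch (the column names are hypothetical): the kind of ‘Employee’ table a relational database would hold, built here with pandas to show the fixed, typed, row-and-column structure:

    import pandas as pd

    employees = pd.DataFrame({
        "emp_id": [101, 102, 103],            # relational key
        "name":   ["Asha", "Ravi", "Meena"],
        "dept":   ["CSE", "ECE", "CSE"],
        "salary": [52000, 48000, 61000],
    })

    print(employees.dtypes)                       # fixed, typed columns
    print(employees[employees["dept"] == "CSE"])  # relational-style filtering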

• Semi-Structured data –
– Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze.
– With some processing it can be stored in a relational database (this can be very hard for some kinds of semi-structured data), but semi-structured formats exist precisely to avoid the overhead of a rigid schema.
– Example: XML data.

• Examples of Semi-structured Data
– Personal data stored in an XML file, as in the sketch below.
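A minimal sketch (the tags are hypothetical) of personal data in an XML file: the tags give the data organizational properties even though there is no relational schema, and Python's standard library can walk them:

    import xml.etree.ElementTree as ET

    xml_data = """
    <person>
        <name>Asha</name>
        <age>29</age>
        <city>Prayagraj</city>
    </person>
    """

    root = ET.fromstring(xml_data)
    for field in root:
        print(field.tag, "=", field.text)
    # name = Asha
    # age = 29
    # city = Prayagraj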

• Unstructured data –
– Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database.
– Alternative platforms exist for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications.
– Example: Word, PDF, Text, Media logs.

Differences between Structured, Semi-structured and Unstructured data

  Aspect       | Structured                    | Semi-structured                | Unstructured
  Data model   | Predefined (rows and columns) | Tags/markers, no rigid schema  | No predefined model
  Storage      | Relational (SQL) database     | XML and similar formats        | Alternative platforms
  Example      | Relational data               | XML data                       | Word, PDF, text, media logs


Characteristics of Data
• The seven characteristics that define data
quality are:
– Accuracy and Precision
– Legitimacy and Validity
– Reliability and Consistency
– Timeliness and Relevance
– Completeness and Comprehensiveness
– Availability and Accessibility
– Granularity and Uniqueness

• Accuracy and Precision:
– This characteristic refers to the exactness of the data. It
cannot have any erroneous elements and must convey the
correct message without being misleading.
– Accuracy and precision have a component that relates to the data's intended use.
– Without understanding how the data will be consumed,
ensuring accuracy and precision could be off-target or
more costly than necessary.
– For example, accuracy in healthcare might be more
important than in another industry (which is to say,
inaccurate data in healthcare could have more serious
consequences) and, therefore, justifiably worth higher
levels of investment.
• Legitimacy and Validity:
– Requirements governing data set the boundaries of this
characteristic.
– For example, on surveys, items such as gender, ethnicity,
and nationality are typically limited to a set of options and
open answers are not permitted. Any answers other than
these would not be considered valid or legitimate based
on the survey’s requirement.
– This is the case for most data and must be carefully
considered when determining its quality.
– The people in each department in an organization
understand what data is valid or not to them, so the
requirements must be leveraged when evaluating data
quality.
• Reliability and Consistency:
– Many systems in today’s environments use and/or
collect the same source data.
– Regardless of what source collected the data or
where it resides, it cannot contradict a value
residing in a different source or collected by a
different system.
– There must be a stable and steady mechanism
that collects and stores the data without
contradiction or unwarranted variance.

• Timeliness and Relevance:
– There must be a valid reason to collect the data to
justify the effort required, which also means it has
to be collected at the right moment in time.
– Data collected too soon or too late could
misrepresent a situation and drive inaccurate
decisions.

• Completeness and Comprehensiveness:
– Incomplete data is as dangerous as inaccurate data.
– Gaps in data collection lead to a partial view of the overall picture.
– Without a complete picture of how operations are
running, uninformed actions will occur.
– It’s important to understand the complete set of
requirements that constitute a comprehensive set of
data to determine whether or not the requirements
are being fulfilled.

• Availability and Accessibility:
– This characteristic can be tricky at times due to
legal and regulatory constraints.
– Regardless of the challenge, though, individuals
need the right level of access to the data in order
to perform their jobs.
– This presumes that the data exists and is available
for access to be granted.

• Granularity and Uniqueness:
– The level of detail at which data is collected is
important, because confusion and inaccurate
decisions can otherwise occur.
– Aggregated, summarized and manipulated collections
of data could offer a different meaning than the data
implied at a lower level.
– An appropriate level of granularity must be defined to
provide sufficient uniqueness and distinctive
properties to become visible.
– This is a requirement for operations to function
effectively.

Introduction to Big Data platform
• Big Data is a collection of data that is huge in volume, yet grows exponentially with time.
• It is data of such size and complexity that no traditional data management tool can store or process it efficiently.
• In short, big data is still data, but of enormous size.
• Examples Of Big Data:
• Stock Exchange:
– The New York Stock Exchange generates about one terabyte of
new trade data per day.
• Social Media:
– Statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day, mainly generated through photo and video uploads, message exchanges, comments, etc.
• Jet Engine:
– A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Characteristics Of Big Data
• Big data can be described by the following
characteristics:
– Volume
– Variety
– Velocity
– Variability

• Volume –
– The name Big Data itself is related to a size which
is enormous.
– The size of data plays a crucial role in determining the value that can be derived from it.
– Whether particular data can actually be considered big data depends on its volume.
– Hence, 'Volume' is one characteristic which needs
to be considered while dealing with Big Data.

• Variety –
– The next aspect of Big Data is its variety.
– Variety refers to heterogeneous sources and the
nature of data, both structured and unstructured.
– In earlier days, spreadsheets and databases were
the only sources of data considered by most of the
applications.
– Nowadays, data in the form of emails, photos, videos,
monitoring devices, PDFs, audio, etc. are also being
considered in the analysis applications.
– This variety of unstructured data poses certain issues
for storage, mining and analyzing data.

• Velocity –
– The term 'velocity' refers to the speed of generation
of data.
– How fast the data is generated and processed to meet
the demands, determines real potential in the data.
– Big Data Velocity deals with the speed at which data
flows in from sources like business processes,
application logs, networks, and social media sites,
sensors, Mobile devices, etc.
– The flow of data is massive and continuous.

• Variability –
– This refers to the inconsistency which can be
shown by the data at times, thus hampering the
process of being able to handle and manage the
data effectively.

What Is Data Analytics?
• Data analytics is the science of analyzing raw data in
order to make conclusions about that information.
• Many of the techniques and processes of data analytics
have been automated into mechanical processes and
algorithms that work over raw data for human
consumption.
• Data analytics techniques can reveal trends and
metrics that would otherwise be lost in the mass of
information.
• This information can then be used to optimize
processes to increase the overall efficiency of a
business or system.

• Data analysis involves several distinct steps:
– The first step is to determine the data requirements, i.e., how the data is grouped. Data may be separated by age, demographic, income, or gender. Data values may be numerical or divided by category; a small grouping sketch follows.
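A minimal sketch (hypothetical data) of the grouping described in this step, separating records by a demographic column with pandas:

    import pandas as pd

    df = pd.DataFrame({
        "gender": ["F", "M", "F", "M", "F"],
        "income": [52000, 48000, 61000, 45000, 58000],
    })

    # Numerical values summarized per category:
    print(df.groupby("gender")["income"].mean())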

– The second step in data analytics is collecting the data. This can be done through a variety of sources such as computers, online sources, cameras, environmental sources, or personnel.
– Once the data is collected, it must be organized so
it can be analyzed. Organization may take place on
a spreadsheet or other form of software that can
take statistical data.

– The data is then cleaned before analysis: it is scrubbed and checked to ensure there is no duplication or error and that it is not incomplete. This step corrects such problems before the data goes on to a data analyst; the sketch below shows a typical pass.
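A minimal sketch (toy data) of the cleaning pass described above: removing duplicate rows and records with missing values using pandas:

    import pandas as pd

    raw = pd.DataFrame({
        "customer": ["A", "B", "B", "C", "D"],
        "amount":   [100.0, 250.0, 250.0, None, 80.0],
    })

    clean = (
        raw.drop_duplicates()           # remove exact duplicate rows
           .dropna(subset=["amount"])   # drop records missing an amount
           .reset_index(drop=True)
    )
    print(clean)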

Why Data Analytics Matters
• Data analytics is important because it helps businesses
optimize their performances. Implementing it into the
business model means companies can help reduce
costs by identifying more efficient ways of doing
business and by storing large amounts of data.
• A company can also use data analytics to make better
business decisions and help analyze customer trends
and satisfaction, which can lead to new and better
products and services.

• In short, data analytics helps a business optimize its performance.

The Evolution of Analytic Scalability
• It goes without saying that the world of big data requires
new levels of scalability.
• As the amount of data organizations process continues to
increase, the same old methods for handling data just
won’t work anymore.
• Organizations that don’t update their technologies to
provide a higher level of scalability will quite simply choke
on big data.
• Luckily, there are multiple technologies available that
address different aspects of the process of taming big data
and making use of it in analytic processes.
• Some of these advances are quite new, and organizations
need to keep up with the times.

• Measurement of Data Size: [figure not reproduced]

• Traditional Analytic Architecture: [figure not reproduced]

• Modern Database Architecture: [figure not reproduced]
• Some mechanisms for storing and analyzing large amounts of data:

• MASSIVELY PARALLEL PROCESSING SYSTEMS:
– Massively parallel processing (MPP) database systems have been
around for decades. While individual vendor architectures may vary,
MPP is the most mature, proven, and widely deployed mechanism for
storing and analyzing large amounts of data.
– An MPP database spreads data out into independent pieces managed
by independent storage and central processing unit (CPU) resources.
Conceptually, it is like having pieces of data loaded onto multiple network-connected personal computers around a house.
– It removes the constraint of having one central server with only a single set of CPU and disk resources to manage everything.
– The data in an MPP system gets split across a variety of disks managed by a variety of CPUs spread across a number of servers; the sketch below illustrates the partitioning idea.
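A minimal conceptual sketch (not a real MPP engine): hash-partitioning rows across independent workers, the way an MPP database spreads data over many CPU/disk pairs so each piece can be scanned in parallel:

    from collections import defaultdict

    N_WORKERS = 4
    partitions = defaultdict(list)

    rows = [("cust_17", 120.0), ("cust_42", 55.5), ("cust_17", 80.0),
            ("cust_99", 10.0), ("cust_42", 99.9)]

    # Route each row to a worker by hashing its key (illustrative only;
    # Python's hash() is salted per process, a real system uses a stable hash).
    for key, value in rows:
        partitions[hash(key) % N_WORKERS].append((key, value))

    # Each worker scans and aggregates its own piece; results are combined.
    for worker, part in sorted(partitions.items()):
        print("worker", worker, ":", part)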

• CLOUD COMPUTING:
– Enterprises incur no infrastructure or capital costs, only operational costs. Those operational costs will be incurred on a pay-per-use basis with no contractual obligations.
– Capacity can be scaled up or down dynamically and immediately. This differentiates clouds from traditional hosting service providers, where there may have been limits placed on scaling.
– The underlying hardware can be anywhere geographically. The architectural specifics are abstracted from the user. In addition, the hardware will run in multi-tenancy mode, where multiple users from multiple organizations can be accessing the exact same infrastructure simultaneously.

• GRID COMPUTING:
– A grid configuration can help both cost and performance. It falls into the classification of “high-performance computing.”
– Instead of having a single high-end server (or maybe a few of them), a large number of lower-cost machines are put in place. As opposed to having one server managing its CPU and resources across jobs, jobs are parceled out individually to the different machines to be processed in parallel.
– Each machine may only be able to handle a fraction of the work of the original server and can potentially handle only one job at a time. In aggregate, however, the grid can handle quite a bit.
– Grids can therefore be a cost-effective mechanism to improve overall throughput and capacity.
– Grid computing also helps organizations balance workloads, prioritize jobs, and offer high availability for analytic processing; a small parallel-jobs sketch follows.
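A minimal sketch (illustrative only) of the grid idea: parceling independent jobs out to a pool of lower-cost workers to run in parallel, here with Python's multiprocessing:

    from multiprocessing import Pool

    def run_job(job_id):
        # Stand-in for one unit of analytic work.
        return job_id, sum(i * i for i in range(100_000))

    if __name__ == "__main__":
        with Pool(processes=4) as pool:      # four "machines" in the grid
            for job_id, result in pool.map(run_job, range(8)):
                print("job", job_id, "finished:", result)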

• MAPREDUCE:
– MapReduce is a parallel programming framework. It is neither a database nor a direct competitor to databases.
– This has not stopped some people from claiming it’s going to
replace databases and everything else under the sun.
– The reality is MapReduce is complementary to existing
technologies. There are a lot of tasks that can be done in a
MapReduce environment that can also be done in a relational
database.
– What it comes down to is identifying which environment is
better for the problem at hand. Being able to do something with
a tool or technology isn’t the same as being the best way to do
something.
– By focusing on what MapReduce is best for, instead of what theoretically can be done with it, it is possible to maximize the benefits received; a word-count sketch of the programming model follows.
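A minimal sketch (plain Python, not Hadoop) of the MapReduce programming model as a word count: map emits (key, value) pairs, the shuffle groups values by key, and reduce combines each group:

    from collections import defaultdict

    docs = ["big data needs new tools",
            "map reduce is a programming model",
            "big data tools"]

    # Map: emit a (word, 1) pair for every word.
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: combine each group into a single result.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts["big"], counts["data"], counts["tools"])  # 2 2 2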
Analytic process and tools
• Data Analysis Process consists of the following
phases that are iterative in nature −
– Data Requirements Specification
– Data Collection
– Data Processing
– Data Cleaning
– Data Analysis
– Communication

• Best Analytic Processes and Big Data Tools
– Big data refers to the storage and analysis of large data sets. These are complex data sets which can be either structured or unstructured, and they are so large that it is not possible to work on them with traditional analytical tools.
– The top big data tools used these days are open
source data tools, data visualization tools,
sentiment tools, data extraction tools and
databases.

• Some of the best used big data tools are
mentioned below –
• R-Programming:
– R is a free open source software programming
language and a software environment for statistical
computing and graphics.
– It is used by data miners for developing statistical
software and data analysis.
– It has become a highly popular tool for big data in
recent years.

• Datawrapper:
– It is an online data visualization tool for making interactive charts. You upload your data as a CSV, PDF, or Excel file, or paste it directly into the field.
– Datawrapper then generates any visualization in
the form of bar, line, map etc.
– It can be embedded into any other website as
well. It is easy to use and produces visually
effective charts.

• Tableau Public:
– Tableau is another popular big data tool. It is
simple and very intuitive to use. It communicates
the insights of the data through data visualization.
– Through Tableau, an analyst can check a
hypothesis and explore the data before starting to
work on it extensively.

• Content Grabber:
– Content Grabber is a data extraction tool. It is
suitable for people with advanced programming
skills.
– It is a web crawling software. Businesses can use it
to extract content and save it in a structured
format.
– It offers editing and debugging facilities, among many others, for later analysis.

Analysis vs reporting
• “Analytics” means raw data analysis. Typical analytics requests usually imply a one-off data investigation.
• “Reporting” means data to inform decisions. Typical
reporting requests usually imply repeatable access to the
information, which could be monthly, weekly, daily, or even
real-time.
• Some of the steps involved within a data analytics
exploration:
– create data hypothesis
– gather and manipulate data
– present results to the business
– re-iterate

• Some of the steps involved in building a report:
– Understand business requirement
– Connect and gather the data
– Translate the technical data
– Understand the data backgrounds by different dimensions
– Find a way to display data for 100 categories and their 5 sub-categories each (500+ combinations!)
– Re-work the data
– Business stakeholder gets confused
– Scope gets changed
– Repeat the steps
– More re-work
– Initial visualization in Excel

– Addressing stakeholders’ understanding
– Start the reporting dashboard build
– Configure the features and parameters
– More re-work
– Test the user experience
– Conform with the company style guide
– Test the reporting automation and deployment
– Liaise with technology or production team
– Set up a process for regular refresh and failure
– Document reporting process

Modern Data Analytics Tools
• R Programming:
– R is the leading analytics tool in the industry and is widely used for statistics and data modeling. It can easily manipulate data and present it in different ways.
– R compiles and runs on a wide variety of platforms, viz. UNIX, Windows, and macOS.
– It has 11,556 packages and allows you to browse them by category.
– R also provides tools to install all packages automatically as per user requirements, and it integrates well with big data.

• Tableau Public:
– Tableau Public is free software that connects to any data source, be it a corporate data warehouse, Microsoft Excel, or web-based data, and creates data visualizations, maps, dashboards, etc., with real-time updates presented on the web.
– These can also be shared through social media or with the client.
– It also allows files to be downloaded in different formats.

• Python:
– Python is an object-oriented scripting language that is easy to read, write, and maintain, and it is a free open source tool.
– It was developed by Guido van Rossum in the late 1980s and supports both functional and structured programming methods.
– Python is easy to learn, as it is similar to JavaScript, Ruby, and PHP. It also has very good machine learning libraries, viz. scikit-learn, Theano, TensorFlow, and Keras.
– Another important feature of Python is that it can work with data from almost any source, such as SQL Server, a MongoDB database, or JSON. Python also handles text data very well; a small scikit-learn sketch follows.
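A minimal sketch (toy dataset) of the kind of machine-learning workflow the libraries named above support, using scikit-learn:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a classifier and evaluate it on held-out data.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("test accuracy:", round(model.score(X_test, y_test), 2))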
• SAS:
– SAS is a programming environment and language for data manipulation and a leader in analytics; its development began in 1966 at the SAS Institute and continued through the 1980s and 1990s.
– SAS is easily accessible and manageable and can analyze data from any source.
– SAS introduced a large set of products in 2011 for
customer intelligence and numerous SAS modules for web,
social media and marketing analytics that is widely used
for profiling customers and prospects.
– It can also predict their behaviors, manage, and optimize
communications.

• Apache Spark:
– Apache Spark is a fast large-scale data processing engine
and executes applications in Hadoop clusters 100 times
faster in memory and 10 times faster on disk.
– Spark was built with data science in mind, and its design makes data science work comparatively effortless.
– Spark is also popular for data pipelines and machine
learning models development.
– Spark also includes a library, MLlib, that provides a progressive set of machine learning algorithms for repetitive data science techniques such as classification, regression, collaborative filtering, clustering, etc.; a minimal pipeline sketch follows.
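A minimal sketch (assumes a local Spark installation; the file name and column names are hypothetical) of an MLlib classification pipeline of the kind described above:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

    # Hypothetical input: numeric feature columns plus a 0/1 label column.
    df = spark.read.csv("customers.csv", header=True, inferSchema=True)
    assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
    train = assembler.transform(df).select("features", "label")

    model = LogisticRegression(labelCol="label").fit(train)
    print(model.coefficients)
    spark.stop()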

• Excel:
– Excel is a basic, popular, and widely used analytical tool in almost all industries. Whether you are an expert in SAS, R, or Tableau, you will still need to use Excel.
– Excel becomes important when there is a requirement of
analytics on the client’s internal data.
– It can summarize complex tasks, with pivot-table previews that help filter the data to client requirements.
– Excel also has advanced business analytics options that help with modeling, including prebuilt features such as automatic relationship detection, creation of DAX measures, and time grouping.

• RapidMiner:
– RapidMiner is a powerful integrated data science platform, developed by the company of the same name, that performs predictive analysis and other advanced analytics such as data mining, text analytics, machine learning, and visual analytics without any programming.
– RapidMiner can connect to almost any data source type, including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2, Ingres, MySQL, IBM SPSS, dBase, etc.
– The tool is very powerful and can generate analytics based on real-life data transformation settings.
Applications of data analytics
• Security:
– Data analytics applications or, more specifically, predictive analysis has
also helped in dropping crime rates in certain areas. In a few major
cities like Los Angeles and Chicago, historical and geographical data
has been used to isolate specific areas where crime rates could surge.
On that basis, while arrests could not be made on a whim, police
patrols could be increased. Thus, using applications of data analytics,
crime rates dropped in these areas.
• Transportation:
– Data analytics can be used to revolutionize transportation. It can be
used especially in areas where you need to transport a large number
of people to a specific area and require seamless transportation.
– This technique was applied at the London Olympics a few years ago, where around 18 million journeys had to be made. Train operators and TfL used data from similar events to predict the number of people who would travel, and then ensured that transportation stayed smooth.
• Risk detection:
– One of the first data analytics applications may have been in the
discovery of fraud. Many organizations were struggling under debt,
and they wanted a solution to this problem. They already had enough
customer data in their hands, and so, they applied data analytics. They
used a ‘divide and conquer’ approach with the data, analyzing recent expenditure, profiles, and other important information to estimate the probability of a customer defaulting. Eventually, this led to lower risk and less fraud.

• Risk Management:
– Risk management is an essential aspect in the world of insurance.
While a person is being insured, there is a lot of data analytics that
goes on during the process. The risk involved while insuring the person
is based on several data like actuarial data and claims data, and the
analysis of them helps insurance companies to realize the risk.

• Delivery:
– Several top logistic companies like DHL and FedEx are using data
analysis to examine collected data and improve their overall
efficiency. Using data analytics applications, the companies were
able to find the best shipping routes, delivery time, as well as
the most cost-efficient transport means. Using GPS and
accumulating data from the GPS gives them a huge advantage in
data analytics.

• Fast internet allocation:
– While it might seem that allocating fast internet in every area
makes a city ‘Smart’, in reality, it is more important to engage in
smart allocation. This smart allocation would mean
understanding how bandwidth is being used in specific areas
and for the right cause.

• Reasonable Expenditure:
– When one is building smart cities, it becomes difficult to plan them out in the right way. Remodeling a landmark or making any change would incur a large expenditure, which might eventually turn out to be a waste.
• Interaction with customers:
– In insurance, there should be a healthy relationship
between the claims handlers and customers. Hence, to
improve their services, many insurance companies often
use customer surveys to collect data. Since insurance
companies target a diverse group of people, each demographic has its own preferences when it comes to communication.
• Planning of cities
– One of the untapped disciplines where data analysis can really grow is city planning. While many city planners might be hesitant to use data analysis in their favour, avoiding it only results in faulty cities riddled with congestion. Using data analysis would help improve accessibility and minimize overloading in the city.

• Healthcare
– While medicine has come a long way since ancient times and is
ever-improving, it remains a costly affair. Many hospitals are
struggling with the cost pressures that modern healthcare has
come with, which includes the use of sophisticated machinery,
medicines, etc. But now, with the help of data analytics
applications, healthcare facilities can track the treatment of
patients and patient flow, as well as how equipment is being used in hospitals.
Need of Data Analytics Life Cycle
• Data analytics is important because it helps
businesses optimize their performances.
• A company can also use data analytics to
make better business decisions and help
analyze customer trends and satisfaction,
which can lead to new and better products
and services.

Key Roles for a Data analytics project
• Business User:
– The business user is the one who understands the domain area of the project and benefits from its results.
– This user advises and consults the team working on the project about the value of the results obtained and how the outputs will be put to use.
– A business manager, line manager, or deep subject matter expert in the project domain usually fulfills this role.
• Project Sponsor:
– The Project Sponsor is the one who is responsible to initiate the
project. Project Sponsor provides the actual requirements for the
project and presents the basic business issue.
– This person generally provides the funding and gauges the degree of value delivered by the final outputs of the team working on the project.
– The sponsor introduces the core business problem and frames the desired output.
• Project Manager:
– This person ensures that the key milestones and the purpose of the project are met on time and at the expected quality.
• Business Intelligence Analyst:
– The Business Intelligence Analyst provides business domain expertise based on a detailed and deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting point of view.
– This person generally creates dashboards and reports and knows about the data feeds and sources.

• Database Administrator (DBA):
– The DBA provisions and configures the database environment to support the analytics needs of the team working on the project.

• Data Engineer:
– The data engineer brings deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data intake into the analytic sandbox.
– The data engineer works jointly with the data scientist to shape data correctly for analysis.

• Data Scientist:
– The data scientist provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems.
– He or she ensures overall analytical objectives are met.
– Data scientists design and apply analytical methods and explore the data available for the concerned project.

Data Analytics Lifecycle
• The Data analytic lifecycle is designed for Big Data
problems and data science projects.
• Phase 1: Discovery –
– The data science team learns and investigates the problem.
– Develop context and understanding.
– Come to know about data sources needed and
available for the project.
– The team formulates initial hypotheses that can later be tested with data.

• Phase 2: Data Preparation –
– Steps to explore, preprocess, and condition data
prior to modeling and analysis.
– It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into the sandbox.
– Data preparation tasks are likely to be performed
multiple times and not in predefined order.
– Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.

• Phase 3: Model Planning –
– The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.
– In this phase, the data science team develops data sets for training, testing, and production purposes.
– The team builds and executes models based on the work done in the model planning phase.
– Several tools commonly used for this phase are MATLAB and STATISTICA.

• Phase 4: Model Building –
– Team develops datasets for testing, training, and
production purposes.
– The team also considers whether its existing tools will suffice for running the models or whether it needs a more robust environment for executing them.
– Free or open-source tools: R and PL/R, Octave, WEKA.
– Commercial tools: MATLAB, STATISTICA.

• Phase 5: Communication Results –
– After executing the model, the team compares the outcomes of the modeling to the criteria established for success and failure.
– The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking caveats and assumptions into account.
– The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.

• Phase 6: Operationalize –
– The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to a full enterprise of users.
– This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
– The team delivers final reports, briefings, and code.
– Free or open source tools: Octave, WEKA, SQL, MADlib.

Few Questions…. (Unit-I)
• Differentiate between structured, semi-structured and
unstructured data.
• Explain the evolution of analytic scalability and the analytic process.
• What is Big Data? Explain its characteristics.
• Explain various phases of data analytics life cycle.
• What are the common tools for the model planning phase?
• Discuss in detail common open-source tools for the model
building phase.
• Explain the key roles for a successful analytics project.
• Explain the applications of data analytics.
• Discuss the steps in data analysis.
• Explain the modern data analytic tools.
THANK YOU