TOTAL: 45 PERIODS
TEXT BOOKS:
1. Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer, 2007.
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012.
REFERENCES:
1. Bill Franks, "Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics", John Wiley & Sons, 2012.
2. Glenn J. Myatt, "Making Sense of Data", John Wiley & Sons, 2007.
3. Pete Warden, "Big Data Glossary", O'Reilly, 2011.
4. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, Elsevier, reprinted 2008.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
UNIT-1
Big data refers to data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sourcing.
Manufacturing
Big data provides an infrastructure for transparency in manufacturing industry, which is the ability to
unravel uncertainties such as inconsistent component performance and availability.
Healthcare
Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive
analytics, clinical risk intervention and predictive analytics, waste and care variability reduction,
automated external and internal reporting of patient data, standardized medical terms and patient registries
and fragmented point solutions.
Education
Business schools should prepare marketing managers to have wide knowledge on all the different
techniques used in these sub domains to get a big picture and work effectively with analysts.
Media
To understand how the media utilizes big data, it is first necessary to describe the mechanism used for the media process. Big data is used in specific media environments such as newspapers, magazines, or television. The aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers whose preferences have been gleaned through various data-mining activities.
• Targeting of consumers (for advertising by marketers)
• Data-capture
• Data journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics.
Internet of Things (IoT)
Data extracted from IoT devices provides a mapping of device interconnectivity. Such mappings
have been used by the media industry, companies and governments to more accurately target their
audience and increase media efficiency. IoT is also increasingly adopted as a means of gathering sensory
data, and this sensory data has been used in medical and manufacturing contexts.
Information Technology
IT departments can predict potential issues and move to provide solutions before the problems even happen. In recent years, ITOA businesses have also begun to play a major role in systems management by offering platforms that bring individual data together and generate insights from the whole of the system rather than from isolated pockets of data.
Structured
Any data that can be stored, accessed and processed in the form of fixed format is
termed as a 'structured' data.
An 'Employee' table in a database is an example of structured data, with columns such as:
Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Examples of Unstructured Data
Output returned by 'Google Search'
Semi-structured
Semi-structured data can contain elements of both forms. Semi-structured data may look structured in form, but it is not actually defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples of Semi-structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
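As a sketch, semi-structured records like the XML above can be loaded into structured form with Python's standard library (the data below repeats part of the example; the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# Semi-structured personal data, wrapped in a root element so it parses.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)

# Flatten each <rec> element into a dictionary -- the structured form.
records = [
    {"name": rec.findtext("name"),
     "sex": rec.findtext("sex"),
     "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]
```

Once flattened like this, the records can be stored in a relational table, which is exactly the boundary between semi-structured and structured data.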
Challenges of Conventional Systems in Big Data
• Data
• Process
• Management
Volume
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Processing
More than 80% of today's information is unstructured, and it is typically too big to manage effectively. Data now comes from a wider variety of sources, both inside and outside the organization: documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is endless.
Management
A lot of this data is unstructured, or has a complex structure that's hard to represent in rows and columns.
Web data
Web analytics is the measurement, collection, analysis and reporting of
web data for purposes of understanding and optimizing web usage.[1] However, Web analytics is
not just a process for measuring web traffic but can be used as a tool for business and market
research, and to assess and improve the effectiveness of a website. Web analytics applications
can also help companies measure the results of traditional print or broadcast advertising
campaigns
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
[Figure: launch of the advertising campaign for visitors to the 2014 FIFA World Cup]
Most web analytics processes come down to four essential stages or steps, which are:
• Collection of data: This stage is the collection of the basic, elementary data. Usually, these
data are counts of things. The objective of this stage is to gather the data.
• Processing of data into information: This stage usually takes counts and makes them ratios, although there still may be some counts. The objective of this stage is to take the data and conform it into information, specifically metrics.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• Developing KPI: This stage focuses on using the ratios (and counts) and infusing them with
business strategies, referred to as Key Performance Indicators (KPI). Many times, KPIs deal
with conversion aspects, but not always. It depends on the organization.
• Formulating online strategy: This stage is concerned with the online goals, objectives, and
standards for the organization or business. These strategies are usually related to making
money, saving money, or increasing market share.
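The first three stages above can be sketched in a few lines of Python; the counts and the KPI target below are hypothetical, not real campaign data:

```python
# Stage 1: collection -- raw counts of elementary events.
counts = {"visits": 10_000, "signups": 250, "purchases": 50}

# Stage 2: processing -- turn counts into ratios (metrics).
signup_rate = counts["signups"] / counts["visits"]         # 250/10000 = 0.025
conversion_rate = counts["purchases"] / counts["signups"]  # 50/250   = 0.2

# Stage 3: a KPI ties a metric to a business target.
kpi_target = 0.15                       # hypothetical target conversion rate
kpi_met = conversion_rate >= kpi_target
```

The fourth stage, formulating online strategy, is then a decision taken on top of KPIs like `kpi_met`, not a computation.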
Another essential function developed by analysts for the optimization of websites is experimentation:
• Experiments and testing: A/B testing is a controlled experiment with two variants, in online settings, such as web development.
The goal of A/B testing is to identify changes to web pages that increase or maximize a
statistically tested result of interest.
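One common way to statistically test an A/B result is a two-proportion z-test; a minimal sketch, with made-up conversion counts (the function name and data are illustrative):

```python
import math

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 200/5000 conversions on A, 260/5000 on B.
z = ab_z_test(200, 5000, 260, 5000)
significant = abs(z) > 1.96   # 5% two-sided significance threshold
```

Here `significant` being true means the observed difference between the two page variants is unlikely under the null hypothesis of equal conversion rates.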
Each stage impacts or can impact (i.e., drives) the stage preceding or following it. So, sometimes
the data that is available for collection impacts the online strategy. Other times, the online
strategy affects the data collected.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Web analytics data can be associated with geographic regions and internet service providers, e-mail open and click-through rates, direct mail campaign data, sales and lead history, or other data types as needed.
Page tagging refers to the implementation of tags in the existing HTML code of a
given web presence. These markings help to analyze the behavior of users when
they are moving between two page views.
Data requirements
The data is necessary as inputs to the analysis, which is specified based
upon the requirements of those directing the analysis or customers (who will use the finished
product of the analysis). The general type of entity upon which the data will be collected is
referred to as an experimental unit.
Data collection
Data is collected from a variety of sources. The requirements may be
communicated by analysts to custodians of the data, such as information technology personnel
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
within an organization. The data may also be collected from sensors in the environment, such as
traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,
downloads from online sources, or reading documentation.
Data processing
Data cleaning
Once processed and organized, the data may be incomplete, contain
duplicates, or contain errors. The need for data cleaning will arise from problems in the way that
data is entered and stored. Data cleaning is the process of preventing and correcting these
errors. Common tasks include record matching, identifying inaccuracies in the data, assessing the overall quality of existing data,[5] deduplication, and column segmentation.
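Two of those tasks, deduplication and spotting inaccurate values, can be sketched in plain Python; the records below are invented for illustration:

```python
# Minimal data-cleaning sketch: deduplication plus a validity check.
raw = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 2, "name": "Ravi", "age": -5},   # inaccurate: negative age
    {"id": 1, "name": "Asha", "age": 34},   # exact duplicate of record 1
]

# Deduplicate by matching records on all of their fields.
seen, deduped = set(), []
for rec in raw:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

# Flag records failing a simple validity rule (age must be plausible).
invalid = [rec for rec in deduped if not 0 <= rec["age"] <= 120]
```

Real cleaning pipelines add fuzzy record matching and column-level rules, but the shape, match, deduplicate, validate, is the same.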
Data product
A data product is a computer application that takes data inputs and generates outputs,
feeding them back into the environment. It may be based on a model or algorithm. An example is
an application that analyzes data about customer purchasing history and recommends other
purchases the customer might enjoy.
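A toy version of such a recommender can be built from co-purchase counts; the baskets and item names below are hypothetical:

```python
from collections import Counter

# Hypothetical purchase histories (baskets) for a toy data product.
baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"laptop", "keyboard"},
    {"phone", "charger"},
]

def recommend(item, baskets, k=2):
    """Recommend the k items most often co-purchased with `item`."""
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(basket - {item})
    return [other for other, _ in co.most_common(k)]

suggestions = recommend("laptop", baskets)
```

The output of this model feeds back into the environment, e.g. as a "customers also bought" panel, which is what makes it a data product rather than a one-off analysis.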
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Communication
Once the data is analyzed, it may be reported in many formats to the users of the analysis to
support their requirements. The users may have feedback, which results in additional analysis.
When determining how to communicate the results, the analyst may consider data
visualization techniques to help clearly and efficiently communicate the message to the audience.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
blended from multiple sources with a single click before executing analysis. It makes a variety of
charts, such as bar graphs, pie charts, scatter plots, and more.
6) SAP BusinessObjects (https://www.sap.com/products/bi-platform.html)
SAP’s BusinessObjects provides a set of centralized tools to perform a wide variety of BI and
analytics, from ETL to data cleansing to predictive dashboards and reports. It’s modular so customers
can start small with just the functions they need and grow the app with their business. It supports
everything from SMBs to large enterprises and can be configured for a number of vertical industries. It
also supports Microsoft Office and Salesforce SaaS.
7) Netlink Business Analytics
Netlink’s Business Analytics platform is a comprehensive on-demand solution, meaning no Capex
investment. It can be accessed via a Web browser from any device and scale from a department to a
full enterprise. Dashboards can be shared among teams via the collaboration features. The features
are geared toward sales, with advanced analytic capabilities around sales & inventory forecasting,
voice and text analytics, fraud detection, buying propensity, sentiment, and customer churn analysis.
8) Domo
Domo is another cloud-based business management suite; it is browser-accessible and scales from a
small business to a giant enterprise. It provides analysis on all business-level activity, like top selling
products, forecasting, marketing return on investment and cash balances. It offers interactive
visualization tools and instant access to company-wide data via customized dashboards.
9) InetSoft Style Intelligence
Style Intelligence is a business intelligence software platform that allows users to create dashboards,
visual analyses and reports via a data engine that integrates data from multiple sources such as
OLAP servers, ERP apps, relational databases and more. InetSoft’s proprietary Data Block
technology enables the data mashups to take place in real time. Data and reports can be accessed
via dashboards, enterprise reports, scorecards and exception alerts.
10) Dataiku
Dataiku develops Dataiku Data Science Studio (DSS), a data analysis and collaboration platform that
helps data analysts work together with data scientists to build more meaningful data applications. It
helps prototype and build data-driven models and extract data from a variety of sources, from
databases to Big Data repositories.
11) Python
Python is already a popular language because it’s powerful and easy to learn. Over the years,
analytics features have been added, making it increasingly popular with developers looking to do
analytics apps but wanting more power than the R language. R is built for one thing, statistical analysis, but Python can do analytics plus many other functions and types of apps, including machine learning.
12) Apache Spark
Spark is a Big Data analytics engine designed to run in-memory. Early Big Data systems like Hadoop were
batch processes that ran during low utilization (at night) and were disk-based. Spark is meant to run in
real time and entirely in memory, thus allowing for much faster real-time analytics. Spark has easy
integration with the Hadoop ecosystem and its own machine learning library. And it’s open source,
which means it’s free.
13) SAS Institute
SAS is a long-time BI vendor, so its move into analytics was only natural, and its tools continue to be widely used in the industry. Two of its major apps are SAS Enterprise Miner and SAS Visual Analytics. Enterprise Miner
is good for core statistical analysis, data analytics and machine learning. It’s mature and has been
around a while, with a lot of macros and code for specific uses. Visual Analytics is newer and
designed to run in distributed memory on top of Hadoop.
14) Tableau
Tableau is a data visualization software package and one of the most popular on the market. It is fast visualization software that lets you explore data and make all kinds of analyses and observations through drag-and-drop interfaces. Its intelligent algorithms figure out the type of data and the
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
best method available to process it. You can easily build dashboards with the GUI and connect to a
host of analytical apps, including R.
15) Splunk
Splunk Enterprise started out as a log-analysis tool, but has grown to become a broad-based platform
for searching, monitoring, and analyzing machine-generated Big Data. The software can import data
from a variety of sources, from logs to data collected by Big Data applications such as Hadoop or
sensors. It then generates reports a non-IT business person can easily read and understand.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Even with reports aligned with your organization's big picture, you can't make decisions based on reports alone. Data analysis is the most powerful tool to bring into your business. Employing the powers of analysis can be comparable to finding gold in your reports, allowing your business to increase profits and develop further.
Having accurate research is crucial in devising marketing and advertising materials for your target market, taking into account their needs as well as the advantage held by your competitors.
Why Data Analysis?
Companies that are not leveraging modern data analytics tools and techniques are falling behind. Since Data Analytics tools automatically glean and analyze data and deliver information and predictions, you can improve prediction accuracy and refine the models.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Here is the list of top Analytics tools for data analysis that are available for free (for personal
use), easy to use (no coding required), well-documented (you can Google your way through if
you get stuck), and have powerful capabilities (more than Excel). These Data Analysis tools will
help you manage and interpret data in a better and more effective way. Let’s explore the best
Analytics tools:
#1 Tableau Public
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
It is a simple and intuitive tool which offers intriguing insights through data visualization. Tableau Public, with its million-row limit, is easy to use and fares better than most of the other players in the Data Analytics market.
With Tableau’s visuals, you can investigate a hypothesis, explore the data, and cross-check your
insights.
Uses of Tableau Public
• You can publish interactive data visualizations to the web for free.
• No programming skills required.
• Visualizations published to Tableau Public can be embedded into blogs and web pages and be shared through email or social media. The shared content can be made available for download.
Limitations of Tableau Public
• All data is public and offers very little scope for restricted access.
• Data size limitation.
• Cannot be connected to R.
• The only way to read data is via OData sources, Excel, or txt files.
#2 OpenRefine
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
What is OpenRefine?
Formerly known as Google Refine, it is data cleaning software that helps you clean up data for analysis. It operates on rows of data which have cells under columns, quite similar to relational database tables.
Uses of OpenRefine
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Limitations of OpenRefine
#3 KNIME
What is KNIME?
KNIME helps you to manipulate, analyze, and model data through visual programming. It is used
to integrate various components for data mining and Machine Learning via its modular data
pipelining concept.
Uses of KNIME
• Rather than writing blocks of code, you just have to drag and drop connection points between activities.
• This data analysis tool supports various programming languages.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• In fact, analysis tools like these can be extended to run chemistry data, text mining, Python,
and R.
Limitation of KNIME
#4 RapidMiner
What is RapidMiner?
RapidMiner provides Machine Learning and Data Mining procedures, including data visualization, processing, statistical modeling, deployment, evaluation, and predictive analytics. RapidMiner, written in Java, is fast gaining acceptance as a Big Data Analytics tool.
Uses of RapidMiner
It provides an integrated environment for business analytics, predictive analysis, text mining,
Data Mining, and Machine Learning.
Along with commercial and business applications, RapidMiner is also used for application
development, rapid prototyping, training, education, and research.
Limitations of RapidMiner
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
#5 Google Fusion Tables
When talking about free Data Analytics tools, here comes a much cooler, larger, and nerdier version of Google Spreadsheets. An incredible tool for data analysis, mapping, and large dataset visualization, Google Fusion Tables can be added to the business analytics tools list.
Limitations of Google Fusion Tables
• Only the first 100,000 rows of data in a table are included in query results or mapped.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• The total size of the data sent in one API call cannot be more than 1MB.
#6 NodeXL
What is NodeXL?
It is visualization and analysis software for relationships and networks. NodeXL provides exact calculations. It is free (not the pro version) and open-source network analysis and visualization software. NodeXL is one of the best statistical tools for data analytics, including advanced network metrics, access to social media network data importers, and automation.
Uses of NodeXL
• This is one of the data analysis tools in Excel that helps in the following areas:
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• Data Import
• Graph Visualization
• Graph Analysis
• Data Representation
• This software integrates into Microsoft Excel 2007, 2010, 2013, and 2016. It opens as a
workbook with a variety of worksheets containing the elements of a graph structure like
nodes and edges.
• This software can import various graph formats like adjacency matrices, Pajek .net, UCINet
.dl, GraphML, and edge lists.
Limitations of NodeXL
#7 Wolfram Alpha
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Limitations of Wolfram Alpha
• Wolfram Alpha can only deal with publicly known numbers and facts, not with viewpoints.
• It limits the computation time for each query.
It is a powerful resource which helps you filter Google results instantly to get the most relevant and useful information.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
#9 Solver
The Solver Add-in is a Microsoft Office Excel add-in program that is available when you install
Microsoft Excel or Office. It is a linear programming and optimization tool in Excel. It allows you to set constraints, and it is an advanced optimization tool that helps in quick problem-solving.
Uses of Solver
The final values found by Solver are a solution to the interrelated constraints and decision variables.
It uses a variety of methods, from nonlinear optimization and linear programming to evolutionary
and genetic algorithms, to find solutions.
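The kind of constrained problem Solver handles can be illustrated outside Excel too. Below is a hedged sketch of a small, made-up product-mix problem, solved by brute force over a grid of integer plans rather than by Solver's actual algorithms:

```python
# A Solver-style product-mix problem, solved by exhaustive search:
#   maximize profit = 3*x + 5*y
#   subject to  x + 2*y <= 14,  3*x - y >= 0,  x - y <= 2,
#   with x, y non-negative integers.
best_profit, best_plan = float("-inf"), None
for x in range(0, 15):
    for y in range(0, 8):
        if x + 2 * y <= 14 and 3 * x - y >= 0 and x - y <= 2:
            profit = 3 * x + 5 * y
            if profit > best_profit:
                best_profit, best_plan = profit, (x, y)
```

Brute force only works for tiny grids; Solver's simplex, nonlinear, and evolutionary methods solve the same shape of problem at realistic sizes.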
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Limitations of Solver
#10 Dataiku DSS
This is a collaborative data science software platform that helps teams build, prototype, explore, and deliver their own data products more efficiently.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
It provides an interactive visual interface where users can point and click to build, or use languages like SQL.
This data analytics tool lets you draft data preparation and modeling in seconds. It helps you coordinate development and operations by handling workflow automation, creating predictive web services, monitoring model health daily, and monitoring data.
Limitation of Dataiku DSS
Here are some of the useful data analytics tools and techniques that can be used for better
performance:
• VISUAL ANALYTICS
There are different ways to analyze data. One of the simplest is to create a graph or visual and look at it to spot patterns. Visual analytics is an integrated method that combines data analysis with human interaction and data visualization.
• BUSINESS EXPERIMENTS
Experimental design, A/B testing, and business experiments are all techniques for testing the validity of something: trying out a change in one part of the organization and comparing it with another.
• REGRESSION ANALYSIS
It is a statistical tool for investigating the relationship between variables: for instance, the cause-and-effect relationship between product demand and price.
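A simple linear regression of demand on price can be computed directly from the least-squares formulas; the price/demand numbers below are invented and happen to lie exactly on a line, to make the fit easy to check:

```python
# Least-squares fit of demand against price (hypothetical, exactly linear data).
prices = [10, 12, 14, 16, 18]
demand = [200, 180, 160, 140, 120]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(demand) / n

# slope = covariance(x, y) / variance(x); intercept from the means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, demand)) \
        / sum((x - mean_x) ** 2 for x in prices)
intercept = mean_y - slope * mean_x   # demand ≈ intercept + slope * price
```

A negative slope here quantifies the cause-and-effect intuition: each unit of price increase reduces predicted demand by `|slope|` units.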
• CORRELATION ANALYSIS
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
A statistical technique that allows you to determine whether there is a relationship between two
separate variables and how strong that relationship may be. It is best to use when you know or
suspect that there is a relationship between two variables and wish to test the assumption.
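The strength of such a relationship is summarized by the Pearson correlation coefficient, which can be computed from its definition; the two small samples below are made up for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical: advertising spend vs. sales, a fairly strong positive link.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Values of r near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.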
• TIME SERIES ANALYSIS
Time series data is collected at uniformly spaced time intervals. You can use it when you want to assess changes over time or predict future events based on what happened in the past.
Statistical Concepts
Sampling distribution
In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given statistic based on a random sample. Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference.
For example, consider a normal population with mean μ and variance σ². Assume we repeatedly take samples of a given size from this population and calculate the arithmetic mean x̄ for each sample – this statistic is called the sample mean. The distribution of these means, or averages, is called the "sampling distribution of the sample mean". This distribution is normal, N(μ, σ²/n) (n is the sample size), since the underlying population is normal, although sampling distributions may also often be close to normal even when the population distribution is not (see central limit theorem). An alternative to the sample mean is the sample median. When calculated from the same population, it has a different sampling distribution to that of the mean and is generally not normal (but it may be close for large sample sizes).
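The sampling distribution of the sample mean is easy to see by simulation; a sketch with an assumed population N(μ=50, σ=10) and sample size n=25:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Draw many samples of size n from a normal population and record each mean.
mu, sigma, n = 50, 10, 25
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(2000)
]

# The means cluster around mu with spread approximately sigma/sqrt(n) = 2.
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
```

The empirical `spread` matching σ/√n is exactly the standard-error formula discussed next.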
standard deviation
The standard deviation of the sampling distribution of a statistic is referred to as the standard
error of that quantity. For the case where the statistic is the sample mean, and samples are
uncorrelated, the standard error is:
σ_x̄ = σ / √n
where σ is the standard deviation of the population distribution of that quantity and n is the sample size.
Resampling
In statistics, resampling is any of a variety of methods for doing one of the following:
Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets
of available data (jackknifing) or drawing randomly with replacement from a set of data points
(bootstrapping)
Exchanging labels on data points when performing significance tests (permutation tests, also
called exact tests, randomization tests, or re-randomization tests)
Validating models by using random subsets (bootstrapping, cross validation)
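The first of these, bootstrapping, can be sketched in a few lines; the data values below are made up, and the bootstrap standard error of the median is estimated from the spread of the resampled medians:

```python
import random
import statistics

random.seed(0)  # fixed seed so the resampling is reproducible

data = [3, 7, 8, 12, 13, 14, 18, 21, 23, 27]

# Bootstrap: draw B resamples of the same size, with replacement, and use
# the spread of the resampled statistic as its estimated standard error.
B = 1000
boot_medians = [
    statistics.median(random.choices(data, k=len(data))) for _ in range(B)
]
se_median = statistics.stdev(boot_medians)
```

This is useful precisely because the median, unlike the mean, has no simple closed-form standard-error formula.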
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias
and standard error (variance) of a statistic, when a random sample of observations is used to
calculate it. Historically this method preceded the invention of the bootstrap with Quenouille
inventing this method in 1949 and Tukey extending it in 1958.[3][4] This method was
foreshadowed by Mahalanobis who in 1946 suggested repeated estimates of the statistic of
interest with half the sample chosen at random.[5] He coined the name 'interpenetrating samples'
for this method.
Quenouille invented this method with the intention of reducing the bias of the sample estimate.
Tukey extended this method by assuming that if the replicates could be considered identically
and independently distributed, then an estimate of the variance of the sample parameter could
be made and that it would be approximately distributed as a t variate with n−1 degrees of
freedom (n being the sample size).
The basic idea behind the jackknife variance estimator lies in systematically recomputing the
statistic estimate, leaving out one or more observations at a time from the sample set. From this
new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of
the statistic can be calculated.
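That recomputation can be written directly as a delete-1 jackknife; a sketch using the standard bias and variance formulas, applied to the sample mean with illustrative data:

```python
import statistics

def jackknife(stat, sample):
    """Delete-1 jackknife estimates of bias and variance for `stat`."""
    n = len(sample)
    full = stat(sample)
    # Recompute the statistic leaving out one observation at a time.
    replicates = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_rep = statistics.mean(replicates)
    bias = (n - 1) * (mean_rep - full)
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
    return bias, variance

data = [2.0, 4.0, 6.0, 8.0, 10.0]
bias, var = jackknife(statistics.mean, data)
```

For the sample mean the jackknife bias is zero and the jackknife variance reduces to the usual s²/n, which makes this a handy sanity check before applying it to less tractable statistics.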
Instead of using the jackknife to estimate the variance, it may instead be applied to the log of the
variance. This transformation may result in better estimates particularly when the distribution of
the variance itself may be non-normal.
For many statistical parameters the jackknife estimate of variance tends asymptotically to the
true value almost surely. In technical terms one says that the jackknife estimate is consistent.
The jackknife is consistent for the sample means, sample variances, central and non-central t-
statistics (with possibly non-normal populations), sample coefficient of variation, maximum
likelihood estimators, least squares estimators, correlation coefficients and regression
coefficients.
It is not consistent for the sample median. In the case of a unimodal variate the ratio of the
jackknife variance to the sample variance tends to be distributed as one half the square of a chi
square distribution with two degrees of freedom.
The jackknife, like the original bootstrap, is dependent on the independence of the data.
Extensions of the jackknife to allow for dependence in the data have been proposed.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Another extension is the delete-a-group method used in association with Poisson sampling.
Whether to use the bootstrap or the jackknife may depend more on operational aspects than on
statistical concerns of a survey. The jackknife, originally used for bias reduction, is more of a
specialized method and only estimates the variance of the point estimator. This can be enough
for basic statistical inference (e.g., hypothesis testing, confidence intervals). The bootstrap, on
the other hand, first estimates the whole distribution (of the point estimator) and then computes
the variance from that. While powerful and easy, this can become highly computer intensive.
"The bootstrap can be applied to both variance and distribution estimation problems. However,
the bootstrap variance estimator is not as good as the jackknife or the balanced repeated
replication (BRR) variance estimator in terms of the empirical results. Furthermore, the bootstrap
variance estimator usually requires more computations than the jackknife or the BRR. Thus, the
bootstrap is mainly recommended for distribution estimation." [6]
There is a special consideration with the jackknife, particularly with the delete-1 observation
jackknife. It should only be used with smooth, differentiable statistics (e.g., totals, means,
proportions, ratios, odds ratios, regression coefficients, etc.; not with medians or quantiles). This
could become a practical disadvantage. This disadvantage is usually the argument favoring
bootstrapping over jackknifing. More general jackknives than the delete-1, such as the delete-m
jackknife, overcome this problem for the medians and quantiles by relaxing the smoothness
requirements for consistent variance estimation.
Usually the jackknife is easier to apply to complex sampling schemes than the bootstrap.
Complex sampling schemes may involve stratification, multiple stages (clustering), varying
sampling weights (non-response adjustments, calibration, post-stratification) and unequal-probability sampling designs. Theoretical aspects of both the bootstrap and the jackknife can be
found in Shao and Tu (1995),[7] whereas a basic introduction is accounted in Wolter (2007).[8]
The bootstrap estimate of model prediction bias is more precise than jackknife estimates with
linear models such as linear discriminant function or multiple regression.[9]
Subsampling is an alternative method for approximating the sampling distribution of an estimator. The
two key differences to the bootstrap are: (i) the resample size is smaller than the sample size and (ii)
resampling is done without replacement. The advantage of subsampling is that it is valid under much
weaker conditions compared to the bootstrap. In particular, a set of sufficient conditions is that the
rate of convergence of the estimator is known and that the limiting distribution is continuous; in
addition, the resample (or subsample) size must tend to infinity together with the sample size but at a
smaller rate, so that their ratio converges to zero. While subsampling was originally proposed for the
case of independent and identically distributed (iid) data only, the methodology has been extended to
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
cover time series data as well; in this case, one resamples blocks of subsequent data rather than
individual data points. There are many cases of applied interest where subsampling leads to valid
inference whereas bootstrapping does not; for example, such cases include examples where the rate of
convergence of the estimator is not the square root of the sample size or when the limiting distribution
is non-normal.
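The subsampling scheme described above (resample size b smaller than n, drawn without replacement) can be sketched as follows. The sample sizes, the number of replicates, and the simulated data are all illustrative assumptions:

```python
# Subsampling sketch: approximate the sampling distribution of the sample
# mean using subsamples of size b drawn without replacement, with b << n.
import random

random.seed(0)
n, b = 1000, 50                      # sample size and subsample size; b/n small
data = [random.gauss(0, 1) for _ in range(n)]

def subsample_means(data, b, reps=500):
    # Each replicate is a without-replacement draw of size b from the data.
    return [sum(random.sample(data, b)) / b for _ in range(reps)]

# The empirical distribution of these replicate means approximates the
# sampling distribution of the mean at subsample size b.
means = subsample_means(data, b)
```

Because each draw is without replacement, no observation is repeated within a replicate, which is the key difference from the bootstrap.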
Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held
out for use as validating sets; a model is fit to the remaining data (a training set) and used to predict
for the validation set. Averaging the quality of the predictions across the validation sets yields an
overall measure of prediction accuracy. Cross-validation is employed repeatedly in building
decision trees.
One form of cross-validation leaves out a single observation at a time; this is similar to the jackknife.
Another, K-fold cross-validation, splits the data into K subsets; each is held out in turn as the
validation set.
This avoids "self-influence". For comparison, in regression analysis methods such as linear regression,
each y value draws the regression line toward itself, making the prediction of that value appear more
accurate than it really is. Cross-validation applied to linear regression predicts the y value for each
observation without using that observation.
This is often used for deciding how many predictor variables to use in regression. Without cross-
validation, adding predictors always reduces the residual sum of squares (or possibly leaves it
unchanged). In contrast, the cross-validated mean-square error will tend to decrease if valuable
predictors are added, but increase if worthless predictors are added.
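The K-fold procedure above can be sketched concretely. This is a minimal illustration for simple linear regression, with least squares written out by hand; the synthetic data, the noise level, and the choice K = 5 are all assumptions made for the example:

```python
# K-fold cross-validation sketch for simple linear regression (stdlib only).
# The synthetic data and K = 5 are illustrative choices.
import random

random.seed(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def kfold_mse(xs, ys, k=5):
    # Each fold holds out every k-th point; the model is fit to the rest
    # and scored on the held-out points, so no y value predicts itself.
    n = len(xs)
    errs = []
    for fold in range(k):
        held = set(range(fold, n, k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in held]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        errs.extend((y - (a * x + b)) ** 2 for x, y in test)
    return sum(errs) / len(errs)

cv_mse = kfold_mse(xs, ys)
```

Comparing `cv_mse` between candidate predictor sets is exactly the model-selection use described above: a worthless predictor tends to raise the cross-validated error even though it can only lower the in-sample residual sum of squares.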
A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of
statistical significance test in which the distribution of the test statistic under the null hypothesis is
obtained by calculating all possible values of the test statistic under rearrangements of the labels on
the observed data points. In other words, the method by which treatments are allocated to subjects in
an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under
the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability.
Confidence intervals can then be derived from the tests. The theory has evolved from the works of
Ronald Fisher and E. J. G. Pitman in the 1930s.
To illustrate the basic idea of a permutation test, suppose we have two groups A and B whose
sample means are x̄_A and x̄_B, and that we want to test, at the 5% significance level, whether
they come from the same distribution. Let n_A and n_B be the sample size corresponding to each
group. The permutation test is designed to determine whether the observed difference between the
sample means is large enough to reject the null hypothesis H_0 that the two groups have identical
probability distributions.
The test proceeds as follows. First, the difference in means between the two samples is calculated: this
is the observed value of the test statistic, T(obs). Then the observations of groups A and B are
pooled.
Next, the difference in sample means is calculated and recorded for every possible way of dividing
these pooled values into two groups of size n_A and n_B (i.e., for every permutation of the group
labels A and B). The set of these calculated differences is the exact distribution of possible
differences under the null hypothesis that the group label does not matter.
The one-sided p-value of the test is calculated as the proportion of permutations where the
difference in means was greater than or equal to T(obs). The two-sided p-value is calculated as
the proportion of permutations where the absolute difference was greater than or equal to
|T(obs)|.
If the only purpose of the test is to reject or not reject the null hypothesis, we can instead sort the
recorded differences and check whether T(obs) falls within the middle 95% of them. If it does
not, we reject the hypothesis of identical probability distributions at the 5% significance level.
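The steps above can be sketched directly, since with small groups every relabelling of the pooled observations can be enumerated. The group values here are made-up illustrative data:

```python
# Exact two-sample permutation test sketch (stdlib only).
# The group values are illustrative; with small samples we can enumerate
# every way of dividing the pooled observations into two labelled groups.
from itertools import combinations

a = [12.6, 11.4, 13.2, 11.2, 13.1]   # group A observations (assumed data)
b = [16.4, 14.1, 13.4, 15.4, 14.0]   # group B observations (assumed data)

pooled = a + b
n_a = len(a)
# Observed test statistic: difference in sample means.
t_obs = sum(a) / len(a) - sum(b) / len(b)

# For every choice of n_a pooled observations to carry the "A" label,
# recompute the difference in means; this is the exact null distribution.
diffs = []
for idx in combinations(range(len(pooled)), n_a):
    xa = [pooled[i] for i in idx]
    xb = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diffs.append(sum(xa) / len(xa) - sum(xb) / len(xb))

# Two-sided p-value: proportion of relabellings at least as extreme as t_obs.
p_value = sum(abs(d) >= abs(t_obs) for d in diffs) / len(diffs)
```

With 5 observations per group there are C(10, 5) = 252 relabellings, so full enumeration is cheap; for larger samples one would instead sample random permutations.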
Statistical inference is the process of using data analysis to deduce properties of an underlying
probability distribution. Inferential statistical analysis infers properties of a population, for example
by testing hypotheses and deriving estimates.
Statistical inference makes propositions about a population, using data drawn from the population with
some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences,
statistical inference consists of (first) selecting a statistical model of the process that generates the data
and (second) deducing propositions from the model. The conclusion of a statistical inference is a
statistical proposition; common forms include:
a point estimate, i.e. a particular value that best approximates some parameter of interest;
an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a
dataset drawn from a population so that, under repeated sampling of such datasets, such intervals
would contain the true parameter value with the probability at the stated confidence level;
a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
rejection of a hypothesis;[a]
In statistics, the mean squared prediction error of a smoothing or curve-fitting procedure is the
expected value of the squared difference between the fitted values implied by the predictive function
and the values of the (unobservable) function g.
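The definition above can be written as a formula. This is a minimal sketch, assuming the procedure produces fitted values ĝ(x_i) at design points x_1, …, x_n where the true (unobservable) function takes the values g(x_i):

```latex
\mathrm{MSPE} \;=\; \operatorname{E}\!\left[\sum_{i=1}^{n}\bigl(g(x_i)-\widehat{g}(x_i)\bigr)^{2}\right]
```

The expectation is over the randomness in the observations used to construct ĝ, which is what distinguishes prediction error from the in-sample residual sum of squares.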