TOTAL: 45 PERIODS
TEXT BOOKS:
1. Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer, 2007.
2. Anand Rajaraman and Jeffrey David Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012.
REFERENCES:
1. Bill Franks, "Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics", John Wiley & Sons, 2012.
2. Glenn J. Myatt, "Making Sense of Data", John Wiley & Sons, 2007.
3. Pete Warden, "Big Data Glossary", O'Reilly, 2011.
4. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", Second Edition, Elsevier, reprinted 2008.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
UNIT-1
Big data refers to data sets that are so voluminous and complex that traditional data-processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data sourcing.
Manufacturing
Big data provides an infrastructure for transparency in manufacturing industry, which is the ability to
unravel uncertainties such as inconsistent component performance and availability.
Healthcare
Big data analytics has helped healthcare improve by providing personalized medicine and prescriptive
analytics, clinical risk intervention and predictive analytics, waste and care variability reduction,
automated external and internal reporting of patient data, standardized medical terms and patient registries
and fragmented point solutions.
Education
Business schools should prepare marketing managers to have wide knowledge on all the different
techniques used in these sub domains to get a big picture and work effectively with analysts.
Media
To understand how the media utilizes big data, it is first necessary to describe the mechanism used for the media process. Big data is used in specific media environments such as newspapers, magazines, or television. The aim is to serve or convey a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers whose preferences have been gleaned through various data-mining activities.
• Targeting of consumers (for advertising by marketers)
• Data-capture
• Data journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics.
Internet of Things (IoT)
Data extracted from IoT devices provides a mapping of device interconnectivity. Such mappings
have been used by the media industry, companies and governments to more accurately target their
audience and increase media efficiency. IoT is also increasingly adopted as a means of gathering sensory
data, and this sensory data has been used in medical and manufacturing contexts.
Information Technology
IT departments can predict potential issues and move to provide solutions before the problems even happen. In recent years, ITOA businesses have also begun to play a major role in systems management by offering platforms that bring individual data together and generate insights from the whole of the system rather than from isolated pockets of data.
Structured
Any data that can be stored, accessed and processed in the form of fixed format is
termed as a 'structured' data.
An 'Employee' table in a database is an example of structured data, with columns such as:
Employee_ID | Employee_Name | Gender | Department | Salary_In_lacs
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Examples of Unstructured Data
Output returned by 'Google Search'
Semi-structured
Semi-structured data can contain elements of both forms. Semi-structured data may look structured in form, but it is not actually defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples of Semi-structured Data
Personal data stored in an XML file:
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
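As a sketch, semi-structured records like the XML above can be loaded into structured form with Python's standard library (the data below repeats part of the example; the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# Semi-structured personal data, wrapped in a root element so it parses.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)

# Flatten each <rec> element into a dictionary -- the structured form.
records = [
    {"name": rec.findtext("name"),
     "sex": rec.findtext("sex"),
     "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]
```

Once flattened like this, the records can be stored in a relational table, which is exactly the boundary between semi-structured and structured data.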
Challenges of Conventional Systems in Big Data
• Data
• Process
• Management
Volume
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Processing
More than 80% of today's information is unstructured, and it is typically too big to manage effectively. Data now comes from a wider variety of sources, both inside and outside the organization: documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is endless.
Management
A lot of this data is unstructured, or has a complex structure that's hard to represent in rows and columns.
Web data
Web analytics is the measurement, collection, analysis and reporting of
web data for purposes of understanding and optimizing web usage.[1] However, Web analytics is
not just a process for measuring web traffic but can be used as a tool for business and market
research, and to assess and improve the effectiveness of a website. Web analytics applications
can also help companies measure the results of traditional print or broadcast advertising
campaigns
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
[Figure: launch of the advertising campaign for visitors to the 2014 FIFA World Cup]
Most web analytics processes come down to four essential stages or steps, which are:
• Collection of data: This stage is the collection of the basic, elementary data. Usually, these
data are counts of things. The objective of this stage is to gather the data.
• Processing of data into information: This stage usually takes counts and makes them ratios, although there still may be some counts. The objective of this stage is to take the data and conform it into information, specifically metrics.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• Developing KPI: This stage focuses on using the ratios (and counts) and infusing them with
business strategies, referred to as Key Performance Indicators (KPI). Many times, KPIs deal
with conversion aspects, but not always. It depends on the organization.
• Formulating online strategy: This stage is concerned with the online goals, objectives, and
standards for the organization or business. These strategies are usually related to making
money, saving money, or increasing market share.
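The first three stages above can be sketched in a few lines of Python; the counts and the KPI target below are hypothetical, not real campaign data:

```python
# Stage 1: collection -- raw counts of elementary events.
counts = {"visits": 10_000, "signups": 250, "purchases": 50}

# Stage 2: processing -- turn counts into ratios (metrics).
signup_rate = counts["signups"] / counts["visits"]         # 250/10000 = 0.025
conversion_rate = counts["purchases"] / counts["signups"]  # 50/250   = 0.2

# Stage 3: a KPI ties a metric to a business target.
kpi_target = 0.15                       # hypothetical target conversion rate
kpi_met = conversion_rate >= kpi_target
```

The fourth stage, formulating online strategy, is then a decision taken on top of KPIs like `kpi_met`, not a computation.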
Another essential function developed by analysts for the optimization of websites is experimentation:
• Experiments and testing: A/B testing is a controlled experiment with two variants, in online settings, such as web development.
The goal of A/B testing is to identify changes to web pages that increase or maximize a
statistically tested result of interest.
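One common way to statistically test an A/B result is a two-proportion z-test; a minimal sketch, with made-up conversion counts (the function name and data are illustrative):

```python
import math

def ab_z_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test comparing conversion rates of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)   # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 200/5000 conversions on A, 260/5000 on B.
z = ab_z_test(200, 5000, 260, 5000)
significant = abs(z) > 1.96   # 5% two-sided significance threshold
```

Here `significant` being true means the observed difference between the two page variants is unlikely under the null hypothesis of equal conversion rates.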
Each stage impacts or can impact (i.e., drives) the stage preceding or following it. So, sometimes
the data that is available for collection impacts the online strategy. Other times, the online
strategy affects the data collected.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Web analytics data can be associated with geographic regions and internet service providers, e-mail open and click-through rates, direct mail campaign data, sales and lead history, or other data types as needed.
Page tagging refers to the implementation of tags in the existing HTML code of a
given web presence. These markings help to analyze the behavior of users when
they are moving between two page views.
Data requirements
The data is necessary as inputs to the analysis, which is specified based
upon the requirements of those directing the analysis or customers (who will use the finished
product of the analysis). The general type of entity upon which the data will be collected is
referred to as an experimental unit.
Data collection
Data is collected from a variety of sources. The requirements may be
communicated by analysts to custodians of the data, such as information technology personnel
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
within an organization. The data may also be collected from sensors in the environment, such as
traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews,
downloads from online sources, or reading documentation.
Data processing
Data cleaning
Once processed and organized, the data may be incomplete, contain
duplicates, or contain errors. The need for data cleaning will arise from problems in the way that
data is entered and stored. Data cleaning is the process of preventing and correcting these
errors. Common tasks include record matching, identifying inaccuracies in the data, assessing the overall quality of existing data,[5] deduplication, and column segmentation.
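Two of those tasks, deduplication and spotting inaccurate values, can be sketched in plain Python; the records below are invented for illustration:

```python
# Minimal data-cleaning sketch: deduplication plus a validity check.
raw = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 2, "name": "Ravi", "age": -5},   # inaccurate: negative age
    {"id": 1, "name": "Asha", "age": 34},   # exact duplicate of record 1
]

# Deduplicate by matching records on all of their fields.
seen, deduped = set(), []
for rec in raw:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(rec)

# Flag records failing a simple validity rule (age must be plausible).
invalid = [rec for rec in deduped if not 0 <= rec["age"] <= 120]
```

Real cleaning pipelines add fuzzy record matching and column-level rules, but the shape, match, deduplicate, validate, is the same.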
Data product
A data product is a computer application that takes data inputs and generates outputs,
feeding them back into the environment. It may be based on a model or algorithm. An example is
an application that analyzes data about customer purchasing history and recommends other
purchases the customer might enjoy.
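A toy version of such a recommender can be built from co-purchase counts; the baskets and item names below are hypothetical:

```python
from collections import Counter

# Hypothetical purchase histories (baskets) for a toy data product.
baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"laptop", "keyboard"},
    {"phone", "charger"},
]

def recommend(item, baskets, k=2):
    """Recommend the k items most often co-purchased with `item`."""
    co = Counter()
    for basket in baskets:
        if item in basket:
            co.update(basket - {item})
    return [other for other, _ in co.most_common(k)]

suggestions = recommend("laptop", baskets)
```

The output of this model feeds back into the environment, e.g. as a "customers also bought" panel, which is what makes it a data product rather than a one-off analysis.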
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Communication
Once the data is analyzed, it may be reported in many formats to the users of the analysis to
support their requirements. The users may have feedback, which results in additional analysis.
When determining how to communicate the results, the analyst may consider data
visualization techniques to help clearly and efficiently communicate the message to the audience.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
blended from multiple sources with a single click before executing analysis. It makes a variety of
charts, such as bar graphs, pie charts, scatter plots, and more.
6) SAP BusinessObjects (https://www.sap.com/products/bi-platform.html)
SAP’s BusinessObjects provides a set of centralized tools to perform a wide variety of BI and
analytics, from ETL to data cleansing to predictive dashboards and reports. It’s modular so customers
can start small with just the functions they need and grow the app with their business. It supports
everything from SMBs to large enterprises and can be configured for a number of vertical industries. It
also supports Microsoft Office and Salesforce SaaS.
7) Netlink Business Analytics
Netlink’s Business Analytics platform is a comprehensive on-demand solution, meaning no Capex
investment. It can be accessed via a Web browser from any device and scale from a department to a
full enterprise. Dashboards can be shared among teams via the collaboration features. The features
are geared toward sales, with advanced analytic capabilities around sales & inventory forecasting,
voice and text analytics, fraud detection, buying propensity, sentiment, and customer churn analysis.
8) Domo
Domo is another cloud-based business management suite; it is browser-accessible and scales from a
small business to a giant enterprise. It provides analysis on all business-level activity, like top selling
products, forecasting, marketing return on investment and cash balances. It offers interactive
visualization tools and instant access to company-wide data via customized dashboards.
9) InetSoft Style Intelligence
Style Intelligence is a business intelligence software platform that allows users to create dashboards,
visual analyses and reports via a data engine that integrates data from multiple sources such as
OLAP servers, ERP apps, relational databases and more. InetSoft’s proprietary Data Block
technology enables the data mashups to take place in real time. Data and reports can be accessed
via dashboards, enterprise reports, scorecards and exception alerts.
10) Dataiku
Dataiku develops Dataiku Data Science Studio (DSS), a data analysis and collaboration platform that
helps data analysts work together with data scientists to build more meaningful data applications. It
helps prototype and build data-driven models and extract data from a variety of sources, from
databases to Big Data repositories.
11) Python
Python is already a popular language because it’s powerful and easy to learn. Over the years,
analytics features have been added, making it increasingly popular with developers looking to do
analytics apps but wanting more power than the R language. R is built for one thing, statistical analysis, but Python can do analytics plus many other functions and types of apps, including machine learning.
12) Apache Spark
Spark is a Big Data analytics engine designed to run in-memory. Early Big Data systems like Hadoop were
batch processes that ran during low utilization (at night) and were disk-based. Spark is meant to run in
real time and entirely in memory, thus allowing for much faster real-time analytics. Spark has easy
integration with the Hadoop ecosystem and its own machine learning library. And it’s open source,
which means it’s free.
13) SAS Institute
SAS is a long-time BI vendor, so its move into analytics was only natural, and its tools continue to be widely used in the industry. Two of its major apps are SAS Enterprise Miner and SAS Visual Analytics. Enterprise Miner
is good for core statistical analysis, data analytics and machine learning. It’s mature and has been
around a while, with a lot of macros and code for specific uses. Visual Analytics is newer and
designed to run in distributed memory on top of Hadoop.
14) Tableau
Tableau is a data visualization software package and one of the most popular on the market. It is fast visualization software that lets you explore data and make all kinds of analyses and observations through drag-and-drop interfaces. Its intelligent algorithms figure out the type of data and the
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
best method available to process it. You can easily build dashboards with the GUI and connect to a
host of analytical apps, including R.
15) Splunk
Splunk Enterprise started out as a log-analysis tool, but has grown to become a broad-based platform
for searching, monitoring, and analyzing machine-generated Big Data. The software can import data
from a variety of sources, from logs to data collected by Big Data applications such as Hadoop or
sensors. It then generates reports a non-IT business person can easily read and understand.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Even with reports aligned with your organization's big picture, you can't make decisions based on reports alone. Data analysis is the most powerful tool to bring into your business. Employing the powers of analysis can be comparable to finding gold in your reports, allowing your business to increase profits and develop further.
Having accurate research is crucial in devising marketing and advertising materials for your target market, taking into account their needs as well as the advantage held by your competitors.
Why Data Analysis?
Companies that are not leveraging modern data analytics tools and techniques are falling behind. Since Data Analytics tools automatically glean and analyze data and deliver information and predictions, you can improve prediction accuracy and refine the models.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Here is the list of top Analytics tools for data analysis that are available for free (for personal
use), easy to use (no coding required), well-documented (you can Google your way through if
you get stuck), and have powerful capabilities (more than Excel). These Data Analysis tools will
help you manage and interpret data in a better and more effective way. Let’s explore the best
Analytics tools:
#1 Tableau Public
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
It is a simple and intuitive tool which offers intriguing insights through data visualization. Tableau Public, with its million-row limit, is easy to use and fares better than most of the other players in the Data Analytics market.
With Tableau’s visuals, you can investigate a hypothesis, explore the data, and cross-check your
insights.
Uses of Tableau Public
• You can publish interactive data visualizations to the web for free.
• No programming skills required.
• Visualizations published to Tableau Public can be embedded into blogs and web pages and be shared through email or social media. The shared content can be made available for download.
Limitations of Tableau Public
• All data is public and offers very little scope for restricted access.
• Data size limitation.
• Cannot be connected to R.
• The only way to read data is via OData sources, Excel, or txt files.
#2 OpenRefine
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
What is OpenRefine?
Formerly known as Google Refine, it is data cleaning software that helps you clean up data for analysis. It operates on rows of data which have cells under columns, quite similar to relational database tables.
Uses of OpenRefine
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Limitations of OpenRefine
#3 KNIME
What is KNIME?
KNIME helps you to manipulate, analyze, and model data through visual programming. It is used
to integrate various components for data mining and Machine Learning via its modular data
pipelining concept.
Uses of KNIME
• Rather than writing blocks of code, you just have to drag and drop connection points between activities.
• This data analysis tool supports various programming languages.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• In fact, analysis tools like these can be extended to run chemistry data, text mining, Python,
and R.
Limitation of KNIME
#4 RapidMiner
What is RapidMiner?
RapidMiner provides Machine Learning and Data Mining procedures, including data visualization, processing, statistical modeling, deployment, evaluation, and predictive analytics. RapidMiner, written in Java, is fast gaining acceptance as a Big Data Analytics tool.
Uses of RapidMiner
It provides an integrated environment for business analytics, predictive analysis, text mining,
Data Mining, and Machine Learning.
Along with commercial and business applications, RapidMiner is also used for application
development, rapid prototyping, training, education, and research.
Limitations of RapidMiner
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
#5 Google Fusion Tables
When talking about free Data Analytics tools, here comes a much cooler, larger, and nerdier version of Google Spreadsheets. An incredible tool for data analysis, mapping, and large dataset visualization, Google Fusion Tables can be added to the business analytics tools list.
Limitations of Google Fusion Tables
• Only the first 100,000 rows of data in a table are included in query results or mapped.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• The total size of the data sent in one API call cannot be more than 1MB.
#6 NodeXL
What is NodeXL?
It is visualization and analysis software for relationships and networks. NodeXL provides exact calculations. It is free (not the pro version) and open-source network analysis and visualization software. NodeXL is one of the best statistical tools for data analytics, including advanced network metrics, access to social media network data importers, and automation.
Uses of NodeXL
• This is one of the data analysis tools in Excel that helps in the following areas:
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
• Data Import
• Graph Visualization
• Graph Analysis
• Data Representation
• This software integrates into Microsoft Excel 2007, 2010, 2013, and 2016. It opens as a
workbook with a variety of worksheets containing the elements of a graph structure like
nodes and edges.
• This software can import various graph formats like adjacency matrices, Pajek .net, UCINet
.dl, GraphML, and edge lists.
Limitations of NodeXL
#7 Wolfram Alpha
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Limitations of Wolfram Alpha
• Wolfram Alpha can only deal with publicly known numbers and facts, not with viewpoints.
• It limits the computation time for each query.
It is a powerful resource which helps you filter Google results instantly to get the most relevant and useful information.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
#9 Solver
The Solver Add-in is a Microsoft Office Excel add-in program that is available when you install
Microsoft Excel or Office. It is a linear programming and optimization tool in Excel. It allows you to set constraints, and it is an advanced optimization tool that helps in quick problem-solving.
Uses of Solver
The final values found by Solver are a solution to the interrelated constraints and decision variables.
It uses a variety of methods, from nonlinear optimization and linear programming to evolutionary
and genetic algorithms, to find solutions.
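The kind of constrained problem Solver handles can be illustrated outside Excel too. Below is a hedged sketch of a small, made-up product-mix problem, solved by brute force over a grid of integer plans rather than by Solver's actual algorithms:

```python
# A Solver-style product-mix problem, solved by exhaustive search:
#   maximize profit = 3*x + 5*y
#   subject to  x + 2*y <= 14,  3*x - y >= 0,  x - y <= 2,
#   with x, y non-negative integers.
best_profit, best_plan = float("-inf"), None
for x in range(0, 15):
    for y in range(0, 8):
        if x + 2 * y <= 14 and 3 * x - y >= 0 and x - y <= 2:
            profit = 3 * x + 5 * y
            if profit > best_profit:
                best_profit, best_plan = profit, (x, y)
```

Brute force only works for tiny grids; Solver's simplex, nonlinear, and evolutionary methods solve the same shape of problem at realistic sizes.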
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Limitations of Solver
#10 Dataiku DSS
This is a collaborative data science software platform that helps teams build, prototype, explore, and deliver their own data products more efficiently.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
It provides an interactive visual interface where users can point and click to build, or use languages like SQL.
This data analytics tool lets you draft data preparation and modeling in seconds. It helps you coordinate development and operations by handling workflow automation, creating predictive web services, monitoring model health daily, and monitoring data.
Limitation of Dataiku DSS
Here are some of the useful data analytics tools and techniques that can be used for better
performance:
• VISUAL ANALYTICS
There are different ways to analyze data. One of the simplest is to create a graph or visual and look at it to spot patterns. Visual analytics is an integrated method that combines data analysis with human interaction and data visualization.
• BUSINESS EXPERIMENTS
Experimental design, A/B testing, and business experiments are all techniques for testing the validity of something: trying out a change in one part of the organization and comparing it with another.
• REGRESSION ANALYSIS
It is a statistical tool for investigating the relationship between variables: for instance, the cause-and-effect relationship between product demand and price.
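A simple linear regression of demand on price can be computed directly from the least-squares formulas; the price/demand numbers below are invented and happen to lie exactly on a line, to make the fit easy to check:

```python
# Least-squares fit of demand against price (hypothetical, exactly linear data).
prices = [10, 12, 14, 16, 18]
demand = [200, 180, 160, 140, 120]

n = len(prices)
mean_x = sum(prices) / n
mean_y = sum(demand) / n

# slope = covariance(x, y) / variance(x); intercept from the means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(prices, demand)) \
        / sum((x - mean_x) ** 2 for x in prices)
intercept = mean_y - slope * mean_x   # demand ≈ intercept + slope * price
```

A negative slope here quantifies the cause-and-effect intuition: each unit of price increase reduces predicted demand by `|slope|` units.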
• CORRELATION ANALYSIS
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
A statistical technique that allows you to determine whether there is a relationship between two
separate variables and how strong that relationship may be. It is best to use when you know or
suspect that there is a relationship between two variables and wish to test the assumption.
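The strength of such a relationship is summarized by the Pearson correlation coefficient, which can be computed from its definition; the two small samples below are made up for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical: advertising spend vs. sales, a fairly strong positive link.
r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
```

Values of r near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.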
• TIME SERIES ANALYSIS
Time series data is collected at uniformly spaced time intervals. You can use it when you want to assess changes over time or predict future events based on what happened in the past.
Statistical Concepts
Sampling distribution
In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given statistic based on a random sample. Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference.
For example, consider a normal population with mean μ and variance σ². Assume we repeatedly take samples of a given size from this population and calculate the arithmetic mean x̄ for each sample – this statistic is called the sample mean. The distribution of these means, or averages, is called the "sampling distribution of the sample mean". This distribution is normal, N(μ, σ²/n) (n is the sample size), since the underlying population is normal, although sampling distributions may also often be close to normal even when the population distribution is not (see central limit theorem). An alternative to the sample mean is the sample median. When calculated from the same population, it has a different sampling distribution to that of the mean and is generally not normal (but it may be close for large sample sizes).
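The sampling distribution of the sample mean is easy to see by simulation; a sketch with an assumed population N(μ=50, σ=10) and sample size n=25:

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

# Draw many samples of size n from a normal population and record each mean.
mu, sigma, n = 50, 10, 25
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(2000)
]

# The means cluster around mu with spread approximately sigma/sqrt(n) = 2.
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
```

The empirical `spread` matching σ/√n is exactly the standard-error formula discussed next.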
standard deviation
The standard deviation of the sampling distribution of a statistic is referred to as the standard
error of that quantity. For the case where the statistic is the sample mean, and samples are
uncorrelated, the standard error is:
σ_x̄ = σ / √n
where σ is the standard deviation of the population distribution of that quantity and n is the sample size.
Resampling
In statistics, resampling is any of a variety of methods for doing one of the following:
Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets
of available data (jackknifing) or drawing randomly with replacement from a set of data points
(bootstrapping)
Exchanging labels on data points when performing significance tests (permutation tests, also
called exact tests, randomization tests, or re-randomization tests)
Validating models by using random subsets (bootstrapping, cross validation)
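The first of these, bootstrapping, can be sketched in a few lines; the data values below are made up, and the bootstrap standard error of the median is estimated from the spread of the resampled medians:

```python
import random
import statistics

random.seed(0)  # fixed seed so the resampling is reproducible

data = [3, 7, 8, 12, 13, 14, 18, 21, 23, 27]

# Bootstrap: draw B resamples of the same size, with replacement, and use
# the spread of the resampled statistic as its estimated standard error.
B = 1000
boot_medians = [
    statistics.median(random.choices(data, k=len(data))) for _ in range(B)
]
se_median = statistics.stdev(boot_medians)
```

This is useful precisely because the median, unlike the mean, has no simple closed-form standard-error formula.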
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Jackknifing, which is similar to bootstrapping, is used in statistical inference to estimate the bias
and standard error (variance) of a statistic, when a random sample of observations is used to
calculate it. Historically this method preceded the invention of the bootstrap with Quenouille
inventing this method in 1949 and Tukey extending it in 1958.[3][4] This method was
foreshadowed by Mahalanobis who in 1946 suggested repeated estimates of the statistic of
interest with half the sample chosen at random.[5] He coined the name 'interpenetrating samples'
for this method.
Quenouille invented this method with the intention of reducing the bias of the sample estimate.
Tukey extended this method by assuming that if the replicates could be considered identically
and independently distributed, then an estimate of the variance of the sample parameter could
be made and that it would be approximately distributed as a t variate with n−1 degrees of
freedom (n being the sample size).
The basic idea behind the jackknife variance estimator lies in systematically recomputing the
statistic estimate, leaving out one or more observations at a time from the sample set. From this
new set of replicates of the statistic, an estimate for the bias and an estimate for the variance of
the statistic can be calculated.
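That recomputation can be written directly as a delete-1 jackknife; a sketch using the standard bias and variance formulas, applied to the sample mean with illustrative data:

```python
import statistics

def jackknife(stat, sample):
    """Delete-1 jackknife estimates of bias and variance for `stat`."""
    n = len(sample)
    full = stat(sample)
    # Recompute the statistic leaving out one observation at a time.
    replicates = [stat(sample[:i] + sample[i + 1:]) for i in range(n)]
    mean_rep = statistics.mean(replicates)
    bias = (n - 1) * (mean_rep - full)
    variance = (n - 1) / n * sum((r - mean_rep) ** 2 for r in replicates)
    return bias, variance

data = [2.0, 4.0, 6.0, 8.0, 10.0]
bias, var = jackknife(statistics.mean, data)
```

For the sample mean the jackknife bias is zero and the jackknife variance reduces to the usual s²/n, which makes this a handy sanity check before applying it to less tractable statistics.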
Instead of using the jackknife to estimate the variance, it may instead be applied to the log of the
variance. This transformation may result in better estimates particularly when the distribution of
the variance itself may be non-normal.
For many statistical parameters the jackknife estimate of variance tends asymptotically to the
true value almost surely. In technical terms one says that the jackknife estimate is consistent.
The jackknife is consistent for the sample means, sample variances, central and non-central t-
statistics (with possibly non-normal populations), sample coefficient of variation, maximum
likelihood estimators, least squares estimators, correlation coefficients and regression
coefficients.
It is not consistent for the sample median. In the case of a unimodal variate the ratio of the
jackknife variance to the sample variance tends to be distributed as one half the square of a chi
square distribution with two degrees of freedom.
The jackknife, like the original bootstrap, is dependent on the independence of the data.
Extensions of the jackknife to allow for dependence in the data have been proposed.
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
Another extension is the delete-a-group method used in association with Poisson sampling.
Whether to use the bootstrap or the jackknife may depend more on operational aspects than on
statistical concerns of a survey. The jackknife, originally used for bias reduction, is more of a
specialized method and only estimates the variance of the point estimator. This can be enough
for basic statistical inference (e.g., hypothesis testing, confidence intervals). The bootstrap, on
the other hand, first estimates the whole distribution (of the point estimator) and then computes
the variance from that. While powerful and easy, this can become highly computer intensive.
"The bootstrap can be applied to both variance and distribution estimation problems. However,
the bootstrap variance estimator is not as good as the jackknife or the balanced repeated
replication (BRR) variance estimator in terms of the empirical results. Furthermore, the bootstrap
variance estimator usually requires more computations than the jackknife or the BRR. Thus, the
bootstrap is mainly recommended for distribution estimation." [6]
There is a special consideration with the jackknife, particularly with the delete-1 observation
jackknife. It should only be used with smooth, differentiable statistics (e.g., totals, means,
proportions, ratios, odds ratios, regression coefficients, etc.; not with medians or quantiles). This
could become a practical disadvantage. This disadvantage is usually the argument favoring
bootstrapping over jackknifing. More general jackknives than the delete-1, such as the delete-m
jackknife, overcome this problem for the medians and quantiles by relaxing the smoothness
requirements for consistent variance estimation.
Usually the jackknife is easier to apply to complex sampling schemes than the bootstrap.
Complex sampling schemes may involve stratification, multiple stages (clustering), varying
sampling weights (non-response adjustments, calibration, post-stratification) and unequal-probability sampling designs. Theoretical aspects of both the bootstrap and the jackknife can be
found in Shao and Tu (1995),[7] whereas a basic introduction is accounted in Wolter (2007).[8]
The bootstrap estimate of model prediction bias is more precise than jackknife estimates with
linear models such as linear discriminant function or multiple regression.[9]
Subsampling is an alternative method for approximating the sampling distribution of an estimator. The
two key differences to the bootstrap are: (i) the resample size is smaller than the sample size and (ii)
resampling is done without replacement. The advantage of subsampling is that it is valid under much
weaker conditions compared to the bootstrap. In particular, a set of sufficient conditions is that the
rate of convergence of the estimator is known and that the limiting distribution is continuous; in
addition, the resample (or subsample) size must tend to infinity together with the sample size but at a
smaller rate, so that their ratio converges to zero. While subsampling was originally proposed for the
case of independent and identically distributed (iid) data only, the methodology has been extended to
WWW.VIDYARTHIPLUS.COM V+ TEAM
WWW.VIDYARTHIPLUS.COM
cover time series data as well; in this case, one resamples blocks of subsequent data rather than
individual data points. There are many cases of applied interest where subsampling leads to valid
inference whereas bootstrapping does not; for example, such cases include examples where the rate of
convergence of the estimator is not the square root of the sample size or when the limiting distribution
is non-normal.
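The subsampling scheme described above (resample size b smaller than n, drawn without replacement) can be sketched as follows. The sample sizes, the number of replicates, and the simulated data are all illustrative assumptions:

```python
# Subsampling sketch: approximate the sampling distribution of the sample
# mean using subsamples of size b drawn without replacement, with b << n.
import random

random.seed(0)
n, b = 1000, 50                      # sample size and subsample size; b/n small
data = [random.gauss(0, 1) for _ in range(n)]

def subsample_means(data, b, reps=500):
    # Each replicate is a without-replacement draw of size b from the data.
    return [sum(random.sample(data, b)) / b for _ in range(reps)]

# The empirical distribution of these replicate means approximates the
# sampling distribution of the mean at subsample size b.
means = subsample_means(data, b)
```

Because each draw is without replacement, no observation is repeated within a replicate, which is the key difference from the bootstrap.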
Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held
out for use as validating sets; a model is fit to the remaining data (a training set) and used to predict
for the validation set. Averaging the quality of the predictions across the validation sets yields an
overall measure of prediction accuracy. Cross-validation is employed repeatedly in building
decision trees.
One form of cross-validation leaves out a single observation at a time; this is similar to the jackknife.
Another, K-fold cross-validation, splits the data into K subsets; each is held out in turn as the
validation set.
This avoids "self-influence". For comparison, in regression analysis methods such as linear regression,
each y value draws the regression line toward itself, making the prediction of that value appear more
accurate than it really is. Cross-validation applied to linear regression predicts the y value for each
observation without using that observation.
This is often used for deciding how many predictor variables to use in regression. Without cross-
validation, adding predictors always reduces the residual sum of squares (or possibly leaves it
unchanged). In contrast, the cross-validated mean-square error will tend to decrease if valuable
predictors are added, but increase if worthless predictors are added.
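The K-fold procedure above can be sketched concretely. This is a minimal illustration for simple linear regression, with least squares written out by hand; the synthetic data, the noise level, and the choice K = 5 are all assumptions made for the example:

```python
# K-fold cross-validation sketch for simple linear regression (stdlib only).
# The synthetic data and K = 5 are illustrative choices.
import random

random.seed(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 * x + 1.0 + random.gauss(0, 0.1) for x in xs]

def fit_line(xs, ys):
    # Ordinary least squares for y = a*x + b.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def kfold_mse(xs, ys, k=5):
    # Each fold holds out every k-th point; the model is fit to the rest
    # and scored on the held-out points, so no y value predicts itself.
    n = len(xs)
    errs = []
    for fold in range(k):
        held = set(range(fold, n, k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in held]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in held]
        a, b = fit_line([x for x, _ in train], [y for _, y in train])
        errs.extend((y - (a * x + b)) ** 2 for x, y in test)
    return sum(errs) / len(errs)

cv_mse = kfold_mse(xs, ys)
```

Comparing `cv_mse` between candidate predictor sets is exactly the model-selection use described above: a worthless predictor tends to raise the cross-validated error even though it can only lower the in-sample residual sum of squares.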
A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of
statistical significance test in which the distribution of the test statistic under the null hypothesis is
obtained by calculating all possible values of the test statistic under rearrangements of the labels on
the observed data points. In other words, the method by which treatments are allocated to subjects in
an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under
the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability.
Confidence intervals can then be derived from the tests. The theory has evolved from the works of
Ronald Fisher and E. J. G. Pitman in the 1930s.
To illustrate the basic idea of a permutation test, suppose we have two groups A and B whose
sample means are x̄_A and x̄_B, and that we want to test, at the 5% significance level, whether
they come from the same distribution. Let n_A and n_B be the sample size corresponding to each
group. The permutation test is designed to determine whether the observed difference between the
sample means is large enough to reject the null hypothesis H_0 that the two groups have identical
probability distributions.
The test proceeds as follows. First, the difference in means between the two samples is calculated: this
is the observed value of the test statistic, T(obs). Then the observations of groups A and B are
pooled.
Next, the difference in sample means is calculated and recorded for every possible way of dividing
these pooled values into two groups of size n_A and n_B (i.e., for every permutation of the group
labels A and B). The set of these calculated differences is the exact distribution of possible
differences under the null hypothesis that the group label does not matter.
The one-sided p-value of the test is calculated as the proportion of permutations where the
difference in means was greater than or equal to T(obs). The two-sided p-value is calculated as
the proportion of permutations where the absolute difference was greater than or equal to
|T(obs)|.
If the only purpose of the test is to reject or not reject the null hypothesis, we can instead sort the
recorded differences and check whether T(obs) falls within the middle 95% of them. If it does
not, we reject the hypothesis of identical probability distributions at the 5% significance level.
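The steps above can be sketched directly, since with small groups every relabelling of the pooled observations can be enumerated. The group values here are made-up illustrative data:

```python
# Exact two-sample permutation test sketch (stdlib only).
# The group values are illustrative; with small samples we can enumerate
# every way of dividing the pooled observations into two labelled groups.
from itertools import combinations

a = [12.6, 11.4, 13.2, 11.2, 13.1]   # group A observations (assumed data)
b = [16.4, 14.1, 13.4, 15.4, 14.0]   # group B observations (assumed data)

pooled = a + b
n_a = len(a)
# Observed test statistic: difference in sample means.
t_obs = sum(a) / len(a) - sum(b) / len(b)

# For every choice of n_a pooled observations to carry the "A" label,
# recompute the difference in means; this is the exact null distribution.
diffs = []
for idx in combinations(range(len(pooled)), n_a):
    xa = [pooled[i] for i in idx]
    xb = [pooled[i] for i in range(len(pooled)) if i not in idx]
    diffs.append(sum(xa) / len(xa) - sum(xb) / len(xb))

# Two-sided p-value: proportion of relabellings at least as extreme as t_obs.
p_value = sum(abs(d) >= abs(t_obs) for d in diffs) / len(diffs)
```

With 5 observations per group there are C(10, 5) = 252 relabellings, so full enumeration is cheap; for larger samples one would instead sample random permutations.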
Statistical inference is the process of using data analysis to deduce properties of an underlying
probability distribution. Inferential statistical analysis infers properties of a population, for example
by testing hypotheses and deriving estimates.
Statistical inference makes propositions about a population, using data drawn from the population with
some form of sampling. Given a hypothesis about a population, for which we wish to draw inferences,
statistical inference consists of (first) selecting a statistical model of the process that generates the data
and (second) deducing propositions from the model. The conclusion of a statistical inference is a
statistical proposition; common forms include:
a point estimate, i.e. a particular value that best approximates some parameter of interest;
an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed using a
dataset drawn from a population so that, under repeated sampling of such datasets, such intervals
would contain the true parameter value with the probability at the stated confidence level;
a credible interval, i.e. a set of values containing, for example, 95% of posterior belief;
rejection of a hypothesis;[a]
In statistics, the mean squared prediction error of a smoothing or curve-fitting procedure is the
expected value of the squared difference between the fitted values implied by the predictive function
and the values of the (unobservable) function g.
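The definition above can be written as a formula. This is a minimal sketch, assuming the procedure produces fitted values ĝ(x_i) at design points x_1, …, x_n where the true (unobservable) function takes the values g(x_i):

```latex
\mathrm{MSPE} \;=\; \operatorname{E}\!\left[\sum_{i=1}^{n}\bigl(g(x_i)-\widehat{g}(x_i)\bigr)^{2}\right]
```

The expectation is over the randomness in the observations used to construct ĝ, which is what distinguishes prediction error from the in-sample residual sum of squares.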