
1. How is data classified? Explain discrete and continuous data with examples, and what is the difference between nominal, ordinal, interval and ratio data?

Data classification is broadly defined as the process of organizing data by relevant categories
so that it may be used and protected more efficiently. On a basic level, the classification
process makes data easier to locate and retrieve. Data classification is of particular importance
when it comes to risk management, compliance, and data security.

There are three main types of data classification that are considered industry standards:

 Content-based classification inspects and interprets files looking for sensitive information

 Context-based classification looks at application, location, or creator among other variables as indirect
indicators of sensitive information

 User-based classification depends on a manual, end-user selection of each document. User-based classification relies on user knowledge and discretion at creation, edit, review, or dissemination to flag sensitive documents.

Definition of Discrete Data

The term discrete implies distinct or separate. So, discrete data refers to the type of quantitative data
that relies on counts. It contains only finite values, whose subdivision is not possible: values that can
only be counted in whole numbers or integers and are separate, which means the data cannot be broken
down into fractions or decimals.

For example: the number of students in a school, the number of cars in a parking lot, the number of
computers in a computer lab, the number of animals in a zoo, etc.

Definition of Continuous Data

Continuous data is described as an unbroken set of observations that can be measured on a scale. It can
take any numeric value within a finite or infinite range of possible values. (Statistically, range refers to
the difference between the highest and lowest observations.) Continuous data can be broken down into
fractions and decimals, i.e. it can be meaningfully subdivided into smaller parts according to the
measurement precision.

For example: age, height or weight of a person, time taken to complete a task, temperature, money, etc.

Comparison Chart

Basis for comparison: Discrete data / Continuous data
Meaning: Discrete data is one that has clear spaces between values. / Continuous data is one that falls on a continuous sequence.
Nature: Countable / Measurable
Values: Can take only distinct or separate values. / Can take any value in some interval.
Graphical representation: Bar graph / Histogram
Tabulation is known as: Ungrouped frequency distribution / Grouped frequency distribution
Classification: Mutually inclusive / Mutually exclusive
Function graph: Shows isolated points / Shows connected points
Example: Days of the week / Market price of a product

Nominal

A nominal scale describes a variable with categories that do not have a natural order or ranking. You
can code nominal variables with numbers if you want, but the order is arbitrary and any calculations,
such as computing a mean, median, or standard deviation, would be meaningless.

Examples of nominal variables include:

 genotype, blood type, zip code, gender, race, eye color, political party

Ordinal

An ordinal scale is one where the order matters but not the difference between values.

Examples of ordinal variables include:


 socio-economic status ("low income", "middle income", "high income"), education level ("high
school", "BS", "MS", "PhD"), income level ("less than 50K", "50K-100K", "over 100K"),
satisfaction rating ("extremely dislike", "dislike", "neutral", "like", "extremely like").

Note the differences between adjacent categories do not necessarily have the same meaning. For
example, the difference between the two income levels “less than 50K” and “50K-100K” does not have
the same meaning as the difference between the two income levels “50K-100K” and “over 100K”.

Interval

An interval scale is one where there is order and the difference between two values is meaningful.

Examples of interval variables include:

 temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-
850).

Ratio

A ratio variable has all the properties of an interval variable and also has a clear definition of 0.0.
When the variable equals 0.0, there is none of that variable.

Examples of ratio variables include:

 enzyme activity, dose amount, reaction rate, flow rate, concentration, pulse, weight, length,
temperature in Kelvin (0.0 Kelvin really does mean “no heat”), survival time.

When working with ratio variables, but not interval variables, the ratio of two measurements has a
meaningful interpretation. For example, because weight is a ratio variable, a weight of 4 grams is twice
as heavy as a weight of 2 grams. However, a temperature of 10 degrees C should not be considered
twice as hot as 5 degrees C. If it were, a conflict would be created because 10 degrees C is 50 degrees
F and 5 degrees C is 41 degrees F. Clearly, 50 degrees is not twice 41 degrees. As another example, a pH
of 3 is not twice as acidic as a pH of 6, because pH is not a ratio variable.
2. A company wants to understand the level of satisfaction with its product.
Design a small questionnaire for the same.

1. Which of the following words would you use to describe our product?

 Buggy
 Fine, but there are some issues
 Fine
 Great

2. How well does our product meet your needs?

 Badly
 Fine
 Well
 Very well

3. Which 3 features are the most valuable to you?

4. If you could change just one thing about our product, what would it be?

5. What problem would you like to solve with our product?

6. How would you rate the value for money of the product?

 Bad
 Regular
 Good

7. Compared to our competitors, is our product quality better, worse, or about the same?

 Better
 Same
 Worse

8. On a scale from 0 to 10, how likely are you to recommend our company to a friend or colleague?

9. How likely are you to buy again from us?

 Not likely
 Likely
 Very likely

10. Is our website easy to navigate?

 Yes
 No

11. How responsive have we been to your questions or concerns about our products?
 Not responsive
 Usually responsive
 Very responsive

12. To what extent do you agree with the following statement: The company made it easy for me to
handle my issue.

 Strongly disagree
 Disagree
 Agree
 Strongly agree

3. From the questionnaire designed above, give 3 examples of univariate analysis.

4. From the questionnaire designed above, give 2 examples of bivariate analysis.

5. Explain correlation and regression with a hypothetical example.


Correlation is a technique for investigating the relationship between two quantitative, continuous
variables, for example, age and blood pressure. Pearson's correlation coefficient (r) is a measure of the
strength of the association between the two variables.

The first step in studying the relationship between two continuous variables is to draw a scatter plot of
the variables to check for linearity. The correlation coefficient should not be calculated if the
relationship is not linear. For the purposes of correlation alone, it does not really matter on which axis
the variables are plotted. However, conventionally, the independent (or explanatory) variable is plotted
on the x-axis (horizontally) and the dependent (or response) variable is plotted on the y-axis (vertically).

The nearer the scatter of points is to a straight line, the higher the strength of association between the
variables. Also, it does not matter what measurement units are used.

Nine students held their breath, once after breathing normally and relaxing for one minute, and once
after hyperventilating for one minute. The table indicates how long (in sec) they were able to hold their
breath. Is there an association between the two variables?

Subject:    A   B   C   D   E   F   G   H   I
Normal:    56  56  65  65  50  25  87  44  35
Hypervent: 87  91  85  91  75  28 122  66  58
The chart shows the scatter plot (drawn in MS Excel) of the data, indicating the reasonableness of
assuming a linear association between the variables.

Hyperventilating times are considered to be the dependent variable, so are plotted on the vertical axis.
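
As a quick check, Pearson's r for these data can be computed directly; here is a minimal sketch in R (base functions only, with variable names of our own choosing):

# Breath-holding times in seconds for the nine subjects
normal    <- c(56, 56, 65, 65, 50, 25, 87, 44, 35)
hypervent <- c(87, 91, 85, 91, 75, 28, 122, 66, 58)

plot(normal, hypervent)      # scatter plot, to check for linearity first
cor(normal, hypervent)       # Pearson's r, roughly 0.97 for these data
cor.test(normal, hypervent)  # r together with a significance test

The high r confirms what the scatter plot suggests: a strong positive linear association between the two breath-holding times.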

Regression Analysis
Regression analysis refers to assessing the relationship between an outcome variable and one or more
other variables. The outcome variable is known as the dependent or response variable, and the risk
factors and confounders are known as predictors or independent variables. The dependent variable is
denoted by "y" and the independent variables by "x" in regression analysis.
The sample correlation coefficient is estimated in correlation analysis. It ranges between -1 and +1, is
denoted by r, and quantifies the strength and direction of the linear association between two variables.
The correlation between two variables can be either positive (i.e., a higher level of one variable is
related to a higher level of the other) or negative (i.e., a higher level of one variable is related to a lower
level of the other).
The sign of the correlation coefficient shows the direction of the association; the magnitude of the
coefficient shows the strength of the association.
For example, a correlation of r = 0.8 indicates a strong positive association between two variables,
while a correlation of r = -0.3 indicates a weak negative association. A correlation close to zero indicates
the absence of a linear association between two continuous variables.

Linear Regression
Linear regression is a linear approach to modelling the relationship between a scalar response
and one or more independent variables. If the regression has one independent variable, it is known
as simple linear regression; if it has more than one independent variable, it is known as multiple
linear regression. Linear regression focuses on the conditional probability distribution of the response
given the values of the predictors, rather than on the joint probability distribution. In practice, most
real-world regression models involve multiple predictors, so the term linear regression often refers to
multiple linear regression.
Example of simple linear regression

The table below shows some data from the early days of the Italian clothing company Benetton. Each
row in the table shows Benetton's sales for a year and the amount spent on advertising that year. In this
case, our outcome of interest is sales: it is what we want to predict. If we use advertising as the
predictor variable, linear regression estimates that Sales = 168 + 23 Advertising. That is, if advertising
expenditure is increased by one million Euro, then sales would be expected to increase by 23 million
Euro, and if there were no advertising we would expect sales of 168 million Euro.
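
Since the Benetton table itself is not reproduced here, the sketch below fits such a model in R on invented advertising and sales figures (both in millions of Euro), chosen to be roughly consistent with Sales = 168 + 23 Advertising:

# Hypothetical figures standing in for the Benetton table
advertising <- c(23, 26, 30, 34, 43, 48, 52, 57)               # million Euro
sales       <- c(700, 790, 850, 980, 1160, 1300, 1370, 1500)   # million Euro

model <- lm(sales ~ advertising)               # simple linear regression
coef(model)                                    # estimated intercept and slope
predict(model, data.frame(advertising = 60))   # predicted sales at 60m spend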

6. Explain a typical business situation and how analytics can be used. Use
hypothetical figures as per your imagination.

7. Explain normal distribution in detail


A normal distribution is an arrangement of a data set in which most values cluster in the middle
of the range and the rest taper off symmetrically toward either extreme.
Height is one simple example of something that follows a normal distribution pattern: Most
people are of average height, the numbers of people that are taller and shorter than average are
fairly equal and a very small (and still roughly equivalent) number of people are either
extremely tall or extremely short.


A graphical representation of a normal distribution is sometimes called a bell curve because of its
flared shape. The precise shape can vary according to the distribution of the population but the peak
is always in the middle and the curve is always symmetrical. In a normal distribution, the mean,
mode and median are all the same.
Normal distribution curves are sometimes designed with a histogram inside the curve. The graphs
are commonly used in mathematics, statistics and corporate data analytics.

Example of Normally Distributed Data: Heights

Height data are normally distributed. The distribution in this example fits real data that I collected from
14-year-old girls during a study.

As you can see, the distribution of heights follows the typical pattern for all normal distributions. Most
girls are close to the average (1.512 meters). Small differences between an individual’s height and the
mean occur more frequently than substantial deviations from the mean. The standard deviation is
0.0741m, which indicates the typical distance that individual girls tend to fall from mean height.

The distribution is symmetric. The number of girls shorter than average equals the number of girls taller
than average. In both tails of the distribution, extremely short girls occur as infrequently as extremely
tall girls.
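
Given the figures above (mean 1.512 m, standard deviation 0.0741 m), the normal model lets us compute expected proportions directly; a minimal sketch in R:

mu    <- 1.512   # mean height in metres, from the study above
sigma <- 0.0741  # standard deviation in metres

# Expected proportion of girls within one standard deviation of the mean
pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)  # about 0.683

# Expected proportion taller than 1.66 m (about 2 standard deviations up)
1 - pnorm(1.66, mu, sigma)  # about 0.023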

8. Explain some indicators of a typical business firm which reflect the
financial performance of that firm, e.g. gross margin to sales, net
margin to sales, EBIT to sales, PAT to sales, sales to total assets.

9. Under what circumstances is a chi-square test used?


What is the Chi-Square Test?

The Chi-Square test is a statistical procedure used by researchers to examine the differences between
categorical variables in the same population. For example, imagine that a research group is interested
in whether or not education level and marital status are related for all people in the U.S. After collecting
a simple random sample of 500 U.S. citizens, and administering a survey to this sample, the researchers
could first manually observe the frequency distribution of marital status and education category within
their sample. The researchers could then perform a Chi-Square test to validate or provide additional
context for these observed frequencies.

The Chi-Square calculation formula is as follows:

χ² = Σ (O − E)² / E

where O is the observed frequency and E is the expected frequency in each category.

When is the Chi-Square Test Used in Market Research?

Market researchers use the Chi-Square test when they find themselves in one of the following
situations:

 They need to estimate how closely an observed distribution matches an expected distribution. This is
referred to as a "goodness-of-fit" test.
 They need to estimate whether two random variables are independent.

When to Use the Chi-Square Test on Survey Results

The Chi-Square test is most useful when analysing cross tabulations of survey response data.
Because cross tabulations reveal the frequency and percentage of responses to questions by various
segments or categories of respondents (gender, profession, education level, etc.), the Chi-Square test
informs researchers about whether or not there is a statistically significant difference between how the
various segments or categories answered a given question.

There are important things to note when considering using the Chi-Square test. First, Chi-Square only
tests whether two individual variables are independent, in a binary, "yes" or "no" format. Chi-Square
testing does not provide any insight into the degree of difference between the respondent categories,
meaning that researchers are not able to tell which statistic (result of the Chi-Square test) is greater or
less than the other. Second, Chi-Square requires researchers to use numerical values, also known as
frequency counts, instead of percentages or ratios. This can limit the flexibility that researchers have
in terms of the processes that they use.
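
A minimal sketch of such a test in R, using an invented cross tabulation of education level by marital status for an imagined sample of 500 respondents (the counts are made up for illustration):

# Hypothetical observed frequency counts, not real survey data
observed <- matrix(c( 60,  40,
                     120,  80,
                     110,  90),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(education = c("High school", "BS", "Graduate"),
                                   marital   = c("Married", "Single")))

chisq.test(observed)  # tests whether the two variables are independent

If the reported p-value falls below the chosen significance level, the null hypothesis of independence is rejected.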

10. Explain in detail the cluster analysis technique with special reference to:

a. Its difference from factor analysis
b. Euclidean distance
c. Gower's coefficient of similarity

Cluster analysis is a class of techniques that are used to classify objects or cases into
relative groups called clusters. Cluster analysis is also called classification analysis or
numerical taxonomy. In cluster analysis, there is no prior information about the group
or cluster membership for any of the objects.
Euclidean distance
In general, if you have p variables X1, X2, ..., Xp measured on a sample of n subjects,
the observed data for subject i can be denoted by xi1, xi2, ..., xip and the observed data
for subject j by xj1, xj2, ..., xjp.
The Euclidean distance between these two subjects is given by

d_ij = sqrt( (xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)² )
When using a measure such as the Euclidean distance, the scale of measurement of the
variables under consideration is an issue, as changing the scale will obviously affect the
distances between subjects (e.g. a difference of 10 cm becomes a difference of 100 mm).
In addition, if one variable has a much wider range than others then this variable will tend
to dominate. For example, if body measurements had been taken for a number of different
people, the range (in mm) of heights would be much wider than the range in wrist
circumference, say. To get around this problem each variable can be standardised
(converted to z-scores). However, this in itself presents a problem as it tends to reduce the
variability (distance) between clusters. This happens because if a particular variable
separates observations well then, by definition, it will have a large variance (as the between
cluster variability will be high). If this variable is standardised then the separation between
clusters will become less. Despite this problem, many textbooks do recommend
standardisation. If in doubt, one strategy would be to carry out the cluster analysis twice —
once without standardising and once with — to see how much difference, if any, this makes
to the resulting clusters.
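
A minimal sketch in R of the standardisation step described above (the body-measurement figures are invented):

# Invented body measurements: height dominates wrist circumference in range
body <- data.frame(height_mm = c(1620, 1750, 1685, 1590),
                   wrist_mm  = c(152, 168, 160, 150))

dist(body)             # Euclidean distances, dominated by height
body_z <- scale(body)  # convert each variable to z-scores
dist(body_z)           # distances after standardisation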

A similarity measure can be defined as the distance between various data points. While
similarity is an amount that reflects the strength of the relationship between two data items,
dissimilarity deals with the measurement of divergence between two data items [9]. In
fact, the performance of many algorithms depends upon selecting a good distance
function over the input data set [9]. A brief overview of the similarity measure functions
commonly used for clustering in the literature is given in the following subsections:
2.1.1 Euclidean distance: Euclidean distance is considered the standard metric for
geometrical problems. It is simply the ordinary distance between two points. Euclidean
distance is used extensively in clustering problems, including clustering text, and it is the
default distance measure used with the k-means algorithm. The Euclidean distance takes
the root of the sum of squared differences between the coordinates of a pair of objects, as
shown in equation (1) below [7]:

Dist_XY = sqrt( sum_k (X_ik − X_jk)² )
2.1.2 Cosine distance: The cosine distance measure for clustering determines the cosine of
the angle between two vectors, given by the following formula [36], where θ is the angle
between the two vectors and A, B are n-dimensional vectors:

θ = arccos( (A · B) / (|A| |B|) )
2.1.3 Jaccard distance: The Jaccard measure computes the similarity of two data items
as the size of their intersection divided by the size of their union, as shown in equation (3)
below [36]. The Jaccard similarity measure has also been used for clustering ecological
species [1].

J(A, B) = |A ∩ B| / |A ∪ B|
2.1.4 Manhattan distance: Manhattan distance is a distance metric that sums the
absolute differences between the coordinates of a pair of data objects, as shown in equation (4)
below [7]:

Dist_XY = sum_k |X_ik − X_jk|
2.1.5 Chebyshev distance: Chebyshev distance is also called the maximum value distance.
This distance metric takes the maximum of the absolute differences between the
coordinates of a pair of data objects, as given in equation (5) below [7]:

Dist_XY = max_k |X_ik − X_jk|
2.1.6 Minkowski distance: Minkowski distance is also known as the generalized distance
metric. In equation (6) below [7], note that when p = 2 the distance becomes the
Euclidean distance; the Chebyshev distance metric is the variant of the Minkowski metric
with p = ∞ (taking the limit). This distance can be used for variables that are both ordinal
and quantitative in nature.

Dist_XY = ( sum_{k=1}^{d} |X_ik − X_jk|^p )^(1/p)
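
These measures are straightforward to compute directly; a minimal sketch in R for two arbitrary points x and y (values invented):

x <- c(1, 2, 3)
y <- c(4, 0, 3)

sqrt(sum((x - y)^2))     # Euclidean distance, equation (1)
sum(abs(x - y))          # Manhattan distance, equation (4)
max(abs(x - y))          # Chebyshev distance, equation (5)
p <- 3
sum(abs(x - y)^p)^(1/p)  # Minkowski distance of order p, equation (6)

# Cosine distance: the angle between the two vectors
acos(sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2))))

# Jaccard similarity of two sets
A <- c("a", "b", "c"); B <- c("b", "c", "d")
length(intersect(A, B)) / length(union(A, B))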

11. Recall the importance and usefulness of big data in today's scenario
by giving relevant examples.

Big Data – The Opportunity is Now

In the past, technology platforms were built to address either structured OR unstructured data.
The value and means of unifying and/or integrating these data types had yet to be realized, and
the computing environments to efficiently process high volumes of disparate data were not yet
commercially available.
Large content repositories house unstructured data such as documents, and companies often
store a great deal of structured information in corporate systems like Oracle, SAP, NetSuite
and others. Today's organizations, however, are utilizing, sharing and storing more information
in varying formats, including:

 E-mail and instant messaging
 Collaborative Intranets and Extranets
 Public websites, wikis, and blogs
 Social media channels
 Video and audio files
 Data from industrial sensors, wearables and other monitoring devices

This unstructured data adds up to as much as 85% of the information that businesses
store. Regardless of the size of your business or the industry you are in, you have Big Data. The
ability to extract high value from this data to enable innovation and competitive gain is the
purpose of Big Data analytics. By conducting analytics on large sets of data, business users and
executives are able to see patterns and trends in performance, new relationships between data
sets and potentially new sources of revenue.
EXAMPLE:
Let’s look at a few examples of scenarios where Big Data solutions have helped these
companies gain a competitive advantage.
Coca Cola’s Big Data Wins
Coca Cola has been a leader in the consumer packaged goods industry for over a century,
and its brands are iconic. It distributes its products to a global network of retailers, carries
many SKUs, and must be able to predict buyer behavior to ensure it has the right
inventory, runs the right promotional ads in the marketplace and sponsors the right events worldwide.
Coca Cola has been able to get wins with Big Data analytics by:

 Selecting the ideal ingredient mix to produce juice products
 Creating efficiencies in their warehousing, restaurant and retail supply chain operations
 Mining loyalty program, competitive, POS and social media data to understand buyer behavior
 Creating digital service centres for procurement and HR processes
 Leveraging a new breed of storage media to retain, process and analyze vast amounts of
information

Coca Cola’s customers are in 206 countries, a vastly diverse marketplace with tens of millions
of ultimate consumers. Effectively managing the information relating to their clients,
employees, suppliers and media assets requires effective storage, powerful indexing and search
functionality, and innovative solutions to make sure information can be located and used when
required. Big Data solutions have provided Coca Cola with this ability.

Netflix Uses Big Data to Improve Customer Experience


To make sure its clients keep watching its programming, Netflix is constantly analyzing
trends in:

 Program viewership
 The content its customers are consuming
 The colors of the promotional visuals of its programming
 Devices its clients are watching its programming on
 Whether a viewer watches a portion of a movie, a season of a series, or a complete series back
to back in a weekend binge watching session

For many entertainment, technology and media organizations, Big Data analytics is the key to
retaining subscribers, securing advertising revenues, and understanding the sort of content to
serve as it relates to geographical location, time of day, demographics, and opinions
expressed on social media. Big Data gives Netflix the ability to deliver the content the
customer wants to see, when the customer wants it.

12. Explain one-way ANOVA in detail with special elaboration on:

a. The situation in which it is used
b. Assumptions and how they are fulfilled
c. Steps in hypothesis testing
One Way ANOVA
A one-way ANOVA is used to compare the means of two or more independent (unrelated) groups using
the F-distribution. The null hypothesis for the test is that all the group means are equal. Therefore,
a significant result means that at least two of the means are unequal.

When to use a one way ANOVA


Situation 1: You have a group of individuals randomly split into smaller groups and completing
different tasks. For example, you might be studying the effects of tea on weight loss and form three
groups: green tea, black tea, and no tea.
Situation 2: Similar to situation 1, but in this case the individuals are split into groups based on an
attribute they possess. For example, you might be studying leg strength of people according to weight.
You could split participants into weight categories (obese, overweight and normal) and measure their
leg strength on a weight machine.

The results of a one-way ANOVA can be considered reliable as long as the following assumptions are
met:

 Response variable residuals are normally distributed (or approximately normally distributed).
 Variances of populations are equal.
 Responses for a given group are independent and identically distributed normal random variables (not
a simple random sample (SRS)).

If data are ordinal, a non-parametric alternative to this test should be used, such as the Kruskal–Wallis
one-way analysis of variance. If the variances are not known to be equal, a generalization of the 2-
sample Welch's t-test can be used. [2]

Hypothesis Testing
In order to conduct the one-way ANOVA hypothesis test we follow the step-wise implementation
procedure for hypothesis testing.
Step 1: State the null hypothesis H0 and the alternative hypothesis HA.
The null hypothesis states that the mean annual salary is equal among all groups of graduates:
H0: μ1 = μ2 = μ3 = μ4 = μ5 = μ6

Alternative hypothesis:
HA: not all of the means are equal

Step 2: Decide on the significance level, α.

α = 0.01

Steps 3 and 4: Compute the value of the test statistic and the p-value.
In order to calculate the F test statistic, we need to determine several quantities beforehand. For
illustration purposes we manually compute the F test statistic in R. Step by step we populate the
ANOVA table until we finally get the F test statistic and consequently the p-value.

Step 5: If p ≤ α, reject H0; otherwise, do not reject H0.

p.value <= alpha


## [1] TRUE

The p-value is less than the specified significance level of 0.01; we reject H0. The test results are
statistically significant at the 1% level and provide very strong evidence against the null hypothesis.

Step 6: Interpret the result of the hypothesis test.


p = 5.7469735 × 10^-37. At the 1% significance level, the data provides very strong
evidence to conclude that at least one pair of group means is different from the others.
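
A minimal end-to-end version of these steps in R, using invented salary data for three groups of graduates (the example above has six groups; three suffice to show the mechanics):

# Hypothetical annual salaries (in thousands) for three groups of graduates
salaries <- data.frame(
  group  = rep(c("BS", "MS", "PhD"), each = 5),
  salary = c(48, 52, 50, 47, 53,  60, 62, 58, 61, 64,  75, 80, 78, 72, 77)
)

fit <- aov(salary ~ group, data = salaries)  # one-way ANOVA
summary(fit)                                 # ANOVA table: F statistic and p-value

alpha   <- 0.01
p.value <- summary(fit)[[1]][["Pr(>F)"]][1]  # extract the p-value
p.value <= alpha                             # TRUE here, so we reject H0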

13.State and explain the HACE theorem in the context of big data
analytics

Big Data is a comprehensive term for any collection of data sets so large and multifarious that
it becomes difficult to process them using conventional data processing applications. The
challenges include analysis, capture, search, sharing, storage, transfer, visualization, and privacy
violations. The tendency toward larger data sets is due to the additional information derivable from
analysis of a single large set of related data, as compared to separate smaller sets with the same
total amount of data, allowing correlations to be found to "spot business trends, prevent
diseases, combat crime" and so on.

There are two types of Big Data: structured and unstructured.


Structured data are numbers and words that can be easily categorized and analyzed. These
data are generated by things like network sensors embedded in electronic devices, smart
phones, and global positioning system (GPS) devices. Structured data also include things like
sales figures, account balances, and transaction data.

Unstructured data include more multifarious information, such as customer reviews from
websites, photos and other multimedia, and comments on social networking sites.
These data cannot be easily categorized or analysed.

Big Data Characteristics (HACE Theorem): The HACE theorem models the characteristics of
Big Data: Big Data starts with large-volume, Heterogeneous, Autonomous sources with
distributed and decentralized control, and seeks to explore Complex and Evolving
relationships among data.

While the term Big Data literally concerns data volumes, the HACE theorem suggests that the
key characteristics of Big Data are:
A. Huge with various and miscellaneous data sources: One of the fundamental
characteristics of Big Data is the huge volume of data represented by various and
miscellaneous dimensionalities. This huge volume of data comes from various sites like
Twitter, MySpace, Orkut and LinkedIn etc. This is because different information collectors
prefer their own representation or procedure for data recording, and the nature of different
applications also results in various data representations
B. Autonomous sources with distributed and decentralized control: Autonomous sources with
distributed and decentralized control are a main characteristic of Big Data applications. Being
autonomous, each data source is able to generate and collect information without involving
any centralized control. This is similar to the World Wide Web (WWW) setting where each
web server provides a certain amount of information and each server is able to fully function
without necessarily depending on other servers. On the other hand, the massive volumes of
the data also make an application susceptible to attacks or failure, if the whole system has to
depend on any centralized control unit. For example, Asian markets of Wal-Mart are
inherently different from its North American markets in terms of seasonal promotions, top
sell items, and customer behaviours. More specifically, the local government regulations also
impact on the wholesale management process and result in restructured data representations
and data warehouses for local markets.
C. Complex and evolving associations: In the early stages of data-centralized information
systems, the focus was on finding the best feature values to represent each observation. This type
of sample-feature representation inherently treats each individual as an independent entity
without considering their social connections, which is one of the most important factors of
the human society. The correlations between individuals inherently complicate the whole
data representation and any reasoning process on the data. In a dynamic world, the features
used to represent the individuals and the social ties used to represent our connections may
also evolve with respect to temporal, spatial, and other factors. Examples of complex data
types are bills of materials, word processing documents, maps, time-series, images and video.
Such combined characteristics suggest that Big Data requires a "big mind" to consolidate data
for maximum value.

14. Explain MANOVA with special elaboration on:

a. Its difference from one-way ANOVA
b. Assumptions for its use
c. Multivariate normality
d. Pillai's trace vs. Wilks' lambda

15. Explain the bivariate schema with relevant examples in each category

Bivariate data deals with two variables that can change and are compared to find relationships. If one
variable is influencing another variable, then you will have bivariate data that has an independent and
a dependent variable. This is because one variable depends on the other for change. An independent
variable is a condition or piece of data in an experiment that can be controlled or changed.
A dependent variable is a condition or piece of data in an experiment that is controlled or influenced
by an outside factor, most often the independent variable.
This is very different from univariate data, which is one variable in a data set that is analyzed to
describe a scenario or experiment.
For example, if Mindy was studying for a college test and tracks her study time and her test scores,
she might see that the more time she spends studying, the better her test scores become. Therefore, in
this scenario, Mindy's test scores are the dependent variable because they depend on the number of
hours she studies. Likewise, the number of study hours would be considered the independent variable.
For that reason, we can see the relationship in this bivariate data set, as sketched below.
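
A minimal sketch of such a bivariate data set in R, with invented study hours and test scores for Mindy:

# Invented data: study hours (independent) and test scores (dependent)
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(55, 62, 70, 74, 81, 88)

plot(hours, scores, xlab = "Study hours", ylab = "Test score")
cor(hours, scores)  # close to +1: scores rise with study time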

16. Interpret the chi-square output below and frame the relevant null
and alternative hypotheses

Comparison Chart

Basis for comparison: Null hypothesis / Alternative hypothesis
Meaning: A statement in which there is no relationship between two variables. / A statement in which there is some statistical significance between two measured phenomena.
Represents: No observed effect / Some observed effect
What is it?: It is what the researcher tries to disprove. / It is what the researcher tries to prove.
Acceptance: No changes in opinions or actions / Changes in opinions or actions
Testing: Indirect and implicit / Direct and explicit
Observations: Result of chance / Result of real effect
Denoted by: H0 (H-zero) / H1 (H-one)
Mathematical formulation: Equal sign / Unequal sign

Definition of Null Hypothesis

A null hypothesis is a statistical hypothesis in which no significant difference exists between
the set of variables. It is the original or default statement, with no effect, often represented by H0 (H-
zero). It is always the hypothesis that is tested. It denotes a certain value of a population parameter,
such as µ, σ or p. A null hypothesis can be rejected, but it cannot be accepted just on the basis of a single
test.

Definition of Alternative Hypothesis

A statistical hypothesis used in hypothesis testing which states that there is a significant difference
between the set of variables. It is often referred to as the hypothesis other than the null hypothesis,
and is often denoted by H1 (H-one). It is what the researcher seeks to prove in an indirect way, by using
the test. It refers to a certain value of a sample statistic, e.g., x̄, s or p.

The acceptance of the alternative hypothesis depends on the rejection of the null hypothesis, i.e. until
and unless the null hypothesis is rejected, an alternative hypothesis cannot be accepted.
