You are on page 1of 4

PROGRAMS ON CATEGORICAL VARIABLES USING

BI-VARIATE ANALYSIS:
1) Replacing value
Explanation: Missing data arise in almost all serious statistical analyses. It occurs
due to the factors like missingness at random, missingness that depends on
unobserved predictors etc. Using this technique, the values can be replaced.

2) Encoding labels
Explanation: Categorical variables can be divided into two categories: Nominal (No
particular order) and Ordinal (some ordered).There are many ways we can encode
these categorical variables as numbers and use them in an algorithm.We will use
Pandas and Scikit-learn and category_encoders (Scikit-learn contribution library) to
show different encoding methods in Python.

3) BMI Index
Explanation:Current genome-wide association studies (GWAS) are normally
implemented in a univariate framework and analyze different phenotypes in isolation.
This univariate approach ignores the potential genetic correlation between important
disease traits. Hence this approach is difficult to detect pleiotropic genes, which may
exist for obesity and osteoporosis, two common diseases of major public health
importance that are closely correlated genetically.To identify such pleiotropic genes
and the key mechanistic links between the two diseases, we here performed the first
bivariate GWAS of obesity and osteoporosis.

4) Binary encoding
Explanation:Binary encoding converts a category into binary digits. Each binary digit
creates one feature column. If there are n unique categories, then binary encoding
results in the only log(base 2)ⁿ features

5) Results of Students based on marks and attendance

Explanation:Bivariate statistical analyses are data analysis procedures using two


variables (e.g. self-efficacy and academic performance). Bivariate analyses can be
descriptive (e.g. a scatterplot), but the goal is typically to compare or examine the
relationship between two variables. For instance, researchers may examine whether student
self-efficacy in mathematics is a significant predictor of mathematics standardized test
scores. Another example is comparing academic performance across groups of students
receiving different modes of instruction (e.g. face to face versus online), in which case the
two variables are test scores and group membership (e.g. face-to-face cohort or online
cohort).

6) Coin identifier based on area and width


Explanation: The aim is to evaluate the correlation of dynamic movement of
exchange rate between Bitcoin and Ethereum. This study implemented a statistical
approach of correlation test to detect the covariance movement of exchange rate
using bivariate analysis. Result indicates the correlation factor between Bitcoin and
Ethereum is 0.653. Therefore, result concluded there is significant, positive and
strong association between Bitcoin and Ethereum exchange rate. The finding of this
study can benefits investors to predict the price movement among cryptocurrencies.
In the same time, the knowledge of price movement can help investor to gain better
profit and reducing loss for their investment portfolio.

7)Temperature Reaction time of a chemical


Explanation:Air temperature data retrieved from global atmospheric models may
show a systematic bias with respect to measurements from weather stations. This is a
common concern in local climate studies

EXAMPLES:

1)We have two categorical variables-


First one is the Country of Origin which has two levels- Britisher and American.
The second categorical variable is the Type of Sports which has three levels- Football
(Soccer), Baseball and Cricket. We have a coaching centre where training for each of these
sports is given. We have a sample dataset where the number of Britishers and Americans
enrolling in different sports is known. We have to find if there is any relationship between the
Country a person belongs and the Sports he chooses.

The way chi-square works is by comparing the data that we have collected (The Observed
Frequencies Table) with the frequencies
that it would have expected (Expected Frequencies table). This Expected Frequency table
will have values that we would ‘expect’
in each cell by chance alone. These ‘expected values’ are simply calculated by taking the
total count of that group
e.g. we want to calculate the expected value for Americans who would enrol for the sport-
football.
We take the total count of American (which here is 200) and divide it by the total number of
samples (which in our example is 400) and
multiplying this value by the total of the count of the other group in question which in our
case will be 195 (the count of people enrolled in the sport- Football). Thus if we have to
calculate the expected value for the number of Americans enrolled for Football, we will take
the total number of Americans, the total number of people enrolled for Football and the total
number of people in our sample and use these values to calculate the expected value.

2)Residuals
The regression line is a mathematical model for the overall pattern of a linear relationship
between x and y. The deviations from this overall pattern are called residuals.
Definition: A residual is the difference between an observed
value and its corresponding predicted value:

Example: Compute the residuals from the dose and concentration data given:

3)Suppose that a medical researcher


wishes to investigate whether different dosages of a new drug affect the duration of relief
from particular allergic symptoms. To study this, an experiment is conducted using a random
sample of ten patients and the following observations are recorded:

Dosage (x) 33 4 5 6 6 7 8 8 9

Duration of relief (y) 9 5 12 9 14 16 22 18 24 22

Note: There are two variables in the problem and they are labelled by x and y for our
convenience. The common sense suggests that the variable y depends on x and x can be
independently selected and controlled by the researcher. However, the
variable y cannot be controlled and is dependent on x.

4)Analysis of categorical data generally involves the use of data tables. A two-way table
presents categorical data by counting the number of observations that fall into each group
for two variables, one divided into rows and the other divided into columns. For example,
suppose a survey was conducted of a group of 20 individuals, who were asked to identify
their hair and eye color. A two-way table presenting the results might appear as follows:

Eye Color

Hair Color Blue Green Brown Black Total


-----------------------------------------------------
Blonde 2 1 2 1 6
Red 1 1 2 0 4
Brown 1 0 4 2 7
Black 1 0 2 0 3
-----------------------------------------------------
Total 5 2 10 3 20

The totals for each category, also known as marginal distributions, provide the number of
individuals in each row or column without accounting for the effect of the other variable.
Since simple counts are often difficult to analyze, two-way tables are often converted into
percentages. In the above example, there are 4 individuals with red hair. Since there were a
total of 20 observations, this means that 20% of the individuals surveyed are redheads. One
also might want to investigate the percentages within a given category -- of the 4 redheads,
2 (50%) have brown eyes, 1 (25%) has blue eyes, and 1 (25%) has green eyes.

5)Alcohol consumption Cholesterol level


6)Age of cancer patients Length of survival

-2017115583

You might also like