Professional Documents
Culture Documents
BI-VARIATE ANALYSIS:
1) Replacing value
Explanation: Missing data arise in almost all serious statistical analyses. It occurs
due to the factors like missingness at random, missingness that depends on
unobserved predictors etc. Using this technique, the values can be replaced.
2) Encoding labels
Explanation: Categorical variables can be divided into two categories: Nominal (No
particular order) and Ordinal (some ordered).There are many ways we can encode
these categorical variables as numbers and use them in an algorithm.We will use
Pandas and Scikit-learn and category_encoders (Scikit-learn contribution library) to
show different encoding methods in Python.
3) BMI Index
Explanation:Current genome-wide association studies (GWAS) are normally
implemented in a univariate framework and analyze different phenotypes in isolation.
This univariate approach ignores the potential genetic correlation between important
disease traits. Hence this approach is difficult to detect pleiotropic genes, which may
exist for obesity and osteoporosis, two common diseases of major public health
importance that are closely correlated genetically.To identify such pleiotropic genes
and the key mechanistic links between the two diseases, we here performed the first
bivariate GWAS of obesity and osteoporosis.
4) Binary encoding
Explanation:Binary encoding converts a category into binary digits. Each binary digit
creates one feature column. If there are n unique categories, then binary encoding
results in the only log(base 2)ⁿ features
EXAMPLES:
The way chi-square works is by comparing the data that we have collected (The Observed
Frequencies Table) with the frequencies
that it would have expected (Expected Frequencies table). This Expected Frequency table
will have values that we would ‘expect’
in each cell by chance alone. These ‘expected values’ are simply calculated by taking the
total count of that group
e.g. we want to calculate the expected value for Americans who would enrol for the sport-
football.
We take the total count of American (which here is 200) and divide it by the total number of
samples (which in our example is 400) and
multiplying this value by the total of the count of the other group in question which in our
case will be 195 (the count of people enrolled in the sport- Football). Thus if we have to
calculate the expected value for the number of Americans enrolled for Football, we will take
the total number of Americans, the total number of people enrolled for Football and the total
number of people in our sample and use these values to calculate the expected value.
2)Residuals
The regression line is a mathematical model for the overall pattern of a linear relationship
between x and y. The deviations from this overall pattern are called residuals.
Definition: A residual is the difference between an observed
value and its corresponding predicted value:
Example: Compute the residuals from the dose and concentration data given:
Dosage (x) 33 4 5 6 6 7 8 8 9
Note: There are two variables in the problem and they are labelled by x and y for our
convenience. The common sense suggests that the variable y depends on x and x can be
independently selected and controlled by the researcher. However, the
variable y cannot be controlled and is dependent on x.
4)Analysis of categorical data generally involves the use of data tables. A two-way table
presents categorical data by counting the number of observations that fall into each group
for two variables, one divided into rows and the other divided into columns. For example,
suppose a survey was conducted of a group of 20 individuals, who were asked to identify
their hair and eye color. A two-way table presenting the results might appear as follows:
Eye Color
The totals for each category, also known as marginal distributions, provide the number of
individuals in each row or column without accounting for the effect of the other variable.
Since simple counts are often difficult to analyze, two-way tables are often converted into
percentages. In the above example, there are 4 individuals with red hair. Since there were a
total of 20 observations, this means that 20% of the individuals surveyed are redheads. One
also might want to investigate the percentages within a given category -- of the 4 redheads,
2 (50%) have brown eyes, 1 (25%) has blue eyes, and 1 (25%) has green eyes.
-2017115583