
De La Salle University

A Statistical Study of Applying Confidence Interval Estimation,
Hypothesis Testing for a Single & Double Parameter, and Simple
Linear Regression to the Indian Liver Patient Dataset from the
UCI Machine Learning Repository

Group 5 of N03B
Jon Eiro D. Andal
Eliana Joelle L. Foronda
Gwyneth Elizabeth Ann F. Galicia
Darrel Danier A. Leander

Submitted in partial fulfillment of the requirements for the Foundation Course in
Statistics Class in the College of Science Department of De La Salle University Manila
AY 2021 - 2022

Professor: Mrs. Maria Angeli Reyes



Table of Contents

Abstract
Introduction
Problem Definition
Methodology
Data Representation and Analysis
Statistical Analysis
Conclusions and Recommendations
Appendices
Bibliography


Abstract

Liver disease has been ranked the tenth most common cause of death in India,
according to the World Health Organization. The number of individuals who suffer
from liver-related diseases has been consistently increasing due to the
overconsumption of alcohol, misuse of drugs, inhalation of toxic substances,
hereditary factors, and a sedentary lifestyle. In this context, this statistical research
primarily uses the Indian Liver Patient Dataset, which comprises 416 liver patient
records and 167 non-liver patient records collected from the North East of Andhra
Pradesh, India. The study places a strong emphasis on comparing the findings of
individuals who have liver disease with those who do not, and on determining whether
the amounts of different chemical compounds in the patients' blood have an impact on
the results of their liver function blood tests. This paper constructs, implements, and
tests four distinct statistical methods, namely confidence interval estimation, single
and double parameter hypothesis testing, and simple linear regression. The variables
investigated for this dataset are total and conjugated bilirubin, amount of ALP,
albumin, and the ratio of albumin to globulin.


I. Introduction

The human liver is a vital organ that also serves as a gland in the body. Inside the
human body, this organ performs around 500 different functions. It is considered the
largest solid organ in the body and has the ability to regenerate. The liver is required
for the digestion of food as well as the elimination of toxins from the body (Cleveland
Clinic, n.d.). In the industrialized world, one of the most common causes of liver
disease is alcohol consumption. Moreover, the liver can be severely damaged by
excessive alcohol consumption, inhaling hazardous gases, eating contaminated food,
certain viruses, and the use of drugs (Mayo Clinic, n.d.). Liver disease can also be
inherited, especially if family members have had liver problems in the past. Liver
diseases are diagnosed through the liver function test (Gulia et al., 2014).

This statistical study is predominantly quantitative research that follows a
descriptive research design. The Indian Liver Patient Dataset from the
UCI Machine Learning Repository was selected for this study. This
dataset comprises 416 liver patient records and 167 non-liver patient records collated
from the North East of Andhra Pradesh, India. In essence, this research aims to
compare the findings of individuals who have liver disease with those who do not.
Moreover, the study seeks to determine whether specific chemical compounds have any
bearing on the outcomes of the liver function test. This statistical research utilizes four
statistical methods: Confidence Interval Estimation, Single and Double Parameter
Hypothesis Testing, and Simple Linear Regression.

A confidence interval is a range of values used to estimate an unknown population
parameter. It is always coupled with a given confidence level, which is generally reported
as a percentage, and is always based on a sample from the population (Salkind, 2010).
Hypothesis testing is a method of statistical reasoning that draws conclusions about a
population parameter or a population probability distribution based on data from a
sample (Britannica, n.d.). Lastly, simple linear regression models the relationship
between two continuous variables under a set of assumptions about the data. Usually,
the goal is to use the value of an input (or predictor) variable to forecast the value of
an output (or response) variable (JMP, n.d.).

This study applies the four aforementioned statistical methods to the Indian Liver
Patient Dataset (ILPD) using the variables total and conjugated bilirubin, amount of
ALP, albumin, and albumin and globulin ratio. The results of the various statistical
tests are derived from computerized solutions and the statistical software and
extensions PHStat and Statistica.


II. Definition of Terms

Alanine Aminotransferase
An enzyme normally found in the cells of the liver and heart. When the liver or heart is
injured, it is released into the bloodstream.

Albumin
A protein that is soluble in water and may be coagulated by heat, such as that found in
egg whites, milk, and blood serum.

Alkaline Phosphatase
An enzyme found in the liver, bone, and other tissues that liberates phosphate under
alkaline conditions.

Aspartate Aminotransferase
An enzyme that is released when the liver or muscles are damaged.

Bilirubin
An orange-yellow pigment formed in the liver by the breakdown of hemoglobin
and excreted in bile.

Epidemiology
The discipline of medicine that deals with the occurrence, distribution, and possible
control of illnesses and other health-related factors.

Enzyme
A protein in a living organism that speeds up the rate of a chemical reaction. An
enzyme acts as a catalyst for specific chemical reactions, converting a specific set of
reactants into specific products.

Hepatotoxicity
Liver damage caused by xenobiotics, such as medications, food additives, alcohol,
chlorinated solvents, peroxidized fatty acids, fungal toxins, radioactive isotopes,
environmental toxicants, and even some therapeutic plants.

Liver Cirrhosis
A late stage of scarring of the liver caused by many forms of liver diseases and
conditions, such as hepatitis and chronic alcoholism.

Liver Function Tests
According to the Mayo Clinic (2021), liver function tests are used primarily to
diagnose and monitor the condition of the liver. This is done by measuring
the levels of specific proteins and enzymes present in the blood, namely
alanine transaminase (ALT), aspartate transaminase (AST), alkaline phosphatase
(ALP), albumin and total protein, bilirubin, gamma-glutamyltransferase (GGT),
L-lactate dehydrogenase (LD), and prothrombin time (PT).

Total Protein
The total quantity of protein in the serum is measured using a biochemical test.
Albumin and globulin are two types of protein found in the blood.


III. Review of Related Literature

A Study on the Temporal Trends in the Etiology of Cirrhosis of Liver in Coastal
Eastern Odisha
The etiology of liver cirrhosis is known to evolve over time as a result of a
variety of factors such as public awareness, preventative measures, and changes in
society's lifestyle. However, there is a scarcity of evidence on the causes of liver
cirrhosis in India over time. The purpose of this research was therefore to explore the
etiology of liver cirrhosis over time. A Poisson regression model was
used to estimate the hospitalization rate ratio (RR). In the Poisson regression
model, the numerator and denominator variables were the ratios of various etiologies
[numbers of patients admitted to the gastroenterology department for alcoholic liver
disease (first group), viral hepatitis (second group), and other causes of cirrhosis (third
group)]. From the collated data, the researchers concluded that the
main cause of cirrhosis in coastal eastern India was alcohol consumption. The
study involved 4,331 patients with cirrhosis of the liver who were admitted to the
hospital, of whom 2,742 had alcohol-related cirrhosis, 858 (19.8%) had viral
hepatitis-related cirrhosis, and 731 had non-alcohol and nonviral causes of cirrhosis of
the liver (Mishra et al., 2020).

Aging and Liver Disease
Aging is a natural condition in which a human gradually loses the ability to
maintain homeostasis. Because of this, there are observable negative effects on the
structure and function of the human body. In the study conducted by Kim, Kisseleva,
and Brenner (2015), it was concluded that age is highly correlated with an
increased risk of contracting liver disease and with its mortality rate. The main causes
are decreases in the mass of functional liver cells, liver volume, and blood flow. Other
factors discussed are the declining regenerative ability of the liver, the
decrease in serum albumin and bilirubin, and elevated levels of blood
cholesterol, high-density lipoprotein cholesterol, and neutral fat.


Liver Disease in Women: The Influence of Gender on Epidemiology, Natural
History, and Patient Outcomes
Gender has an effect on the presentation and risk of liver disease. This was the
conclusion reached by Guy & Peters (2013) after reviewing other studies related to
specific types of liver complications. The study contains information about the
differences in likelihood, mortality rate, and general presentation of liver disease across
genders. For example, women more commonly present with
acute liver failure, autoimmune hepatitis, benign liver lesions, primary biliary cirrhosis,
and toxin-mediated hepatotoxicity, but are less likely to have malignant liver
tumors, primary sclerosing cholangitis, and viral hepatitis. There is also a decreased rate
of decompensated cirrhosis in women with hepatitis C virus infection. Women are also
more likely to survive hepatocellular carcinoma, but are equal to men when it
comes to alcohol-induced liver problems.

Sex-and-age-related differences in bilirubin concentrations
In this study by Rosenthal & Pincus (1984), among the
17,995 participants aged 13-96 years (6,740 being men and 11,215 being
women), the mean serum bilirubin concentration found in men far
exceeded that of women. Furthermore, bilirubin concentrations were found to be at
their highest at ages 19-24 years. Despite these differences, it was suggested that
bilirubin concentrations do not have a correlation with other liver functions that are
indirectly related to bile pigment processing.


IV. Problem Definition

Background of the Study
In normal medical practice, the primary method of diagnosing liver disease in a
patient is through blood tests and, at other times, urinalysis. The main components
usually observed are taken from a specific type of blood test called “The Liver Function
Panel.” The components observed are Total Protein, Albumin, Bilirubin (Total and
Direct), Alkaline Phosphatase (ALP), Aspartate Aminotransferase (AST), and Alanine
Aminotransferase (ALT) (University of Michigan Health, 2020). These are then
examined by professionals, who compare them to the standard, or normal, blood test
result. A normal blood test shows ALT levels of 7 to 55 units per
liter (U/L), AST of 8 to 48 U/L, ALP of 40 to 147 U/L, albumin of 3.5 to 5.0 grams
per deciliter (g/dL), total protein of 6.3 to 7.9 g/dL, and bilirubin of 0.1 to 1.2
milligrams per deciliter (mg/dL). These are typical results for an adult male (Mayo
Clinic, 2021).
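
As an illustration only, the comparison of a panel result against the reference ranges quoted above can be sketched in Python. The ranges are the ones stated in this paragraph; the function name `flag_result` and the patient values are hypothetical and are not taken from the dataset.

```python
# Reference ranges as quoted in the paragraph above (typical adult male).
REFERENCE_RANGES = {
    "ALT": (7, 55, "U/L"),
    "AST": (8, 48, "U/L"),
    "ALP": (40, 147, "U/L"),
    "Albumin": (3.5, 5.0, "g/dL"),
    "Total protein": (6.3, 7.9, "g/dL"),
    "Bilirubin": (0.1, 1.2, "mg/dL"),
}


def flag_result(test_name, value):
    """Return 'low', 'normal', or 'high' relative to the quoted range."""
    low, high, _unit = REFERENCE_RANGES[test_name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


# Hypothetical panel for one patient (illustrative values only).
panel = {"ALT": 16, "AST": 18, "ALP": 187, "Albumin": 3.3,
         "Total protein": 6.8, "Bilirubin": 0.7}
flags = {name: flag_result(name, value) for name, value in panel.items()}
for name, flag in flags.items():
    print(f"{name}: {panel[name]} {REFERENCE_RANGES[name][2]} -> {flag}")
```

A check of this kind only reports whether a value lies outside the quoted range; it does not account for the gender and age differences discussed next.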

However, issues may arise if other characteristics of a person are not considered in
the interpretation of a blood test. Differences in characteristics such as gender and age
might alter the results enough to cause a misdiagnosis. A study by Guy & Peters
(2013) on the impact of gender on the risk of liver disease found that
women are more likely to have acute liver failure, autoimmune hepatitis, benign liver
lesions, primary biliary cirrhosis, and toxin-mediated hepatotoxicity, but are less
likely to have malignant liver tumors, primary sclerosing cholangitis, and viral
hepatitis. Men, however, are twice as likely to die from chronic liver disease and
cirrhosis. There is no difference between genders when the main cause of liver
complications is alcohol consumption.

When it comes to age, the older population is at higher risk for
liver disease, as concluded by Kim, Kisseleva, and Brenner (2015) in research focused on
the effects of aging on the liver. The reason for this is that the volume of the liver
decreases as one ages. For people who are 65 years old and older, a 35% decrease in liver
blood volume was observed when compared to those who are less than 40 years old. The
same study also found a decrease in the mass of functional
liver cells. Although there were mixed results about the effects of age on the liver, it was
observed that aging does have an effect on blood components related to the organ.
Serum albumin becomes harder to maintain at normal levels because it decreases
with age. In addition, cholesterol and fat volume in the liver increase, the
metabolism of low-density lipoprotein cholesterol decreases by 35%, and serum
γ-glutamyltransferase and alkaline phosphatase levels are elevated.

Furthermore, in another study with a similar but more specific goal,
conducted by Rosenthal & Pincus (1984), it was found that the mean serum bilirubin
concentration in men far exceeds that found in women. Serum bilirubin levels were
also at their highest at ages 19-24, slowly declining with age. This adds substance to
the claim that there is a distinction between the different groups and their liver function
results.

Purpose of the Study
The general goal of this study is to determine the contrast in the results of those
who have liver disease and those who do not. It also aims to see whether there are
variables that can help in the detection of early signs of liver complications.

Preliminary Hypothesis
Based on the aforementioned statements, the researchers
hypothesize that there is a significant difference in blood test results between liver and
non-liver disease patients.


Significance of the Study
The findings of this research will be useful to the following groups of people.

a. Doctors - the study will help them to know which enzymes present in the blood
help in determining whether a patient has liver disease. It would become
easier for them to diagnose people.

b. Researchers - conducting the study would help the researchers improve their
problem-solving abilities. It would also help them gain a better grasp of this field and
more knowledge of research methodologies, which they can apply
to future studies like this.

c. Future Researchers - this study will serve as reference data for those who are
planning to research this kind of topic and will help them gather data more easily.
Aside from that, this study will also help them conduct more promising research
that may be of help someday.

d. Students - the research paper would be beneficial for academic purposes, such as
studying organology, research, statistics, and more. It could also serve as a
reference for those who want to learn more about this topic.

e. Other medical professionals - after finishing the study, the paper would contain
important data that compare the results of the liver function blood test of
patients with and without liver disease. These data would help medical
professionals to increase their efficiency in diagnosing patients with liver disease.


V. Methodology

Applying Confidence Interval Estimation of the Mean of Total Bilirubin in mg/dL in
Patients

A. Variables

The variable handled for the Confidence Interval Estimation of the
Mean is the total amount of bilirubin in mg/dL in patients, derived
from the Indian Liver Patient Dataset taken from the UCI Machine Learning Repository.
Bilirubin is examined in liver function tests because it can indicate
whether a patient shows signs of liver disease. A low level of
bilirubin in the blood is normal, but a high level might indicate liver illness
(URMC, n.d.). The researchers would like to investigate the range of values that
captures the true value of the unknown parameter, which is the mean total bilirubin in
mg/dL of the patients. Confidence interval estimation
provides a lower and an upper estimate instead of a single
value for the mean. The Indian Liver Patient Dataset consists of 416 liver patient records
and 167 non-liver patient records (a total sample size of 583 patients). Total bilirubin
is the sum of direct and indirect bilirubin. A high total bilirubin
may signify that an individual has an underlying disease.
When too much bilirubin, a chemical contained in the bile of the liver, seeps
into the bloodstream, it can lead to a variety of health issues (Cleveland Clinic, n.d.). The
normal range of total bilirubin is 0.1 to 1.2 mg/dL (Mount Sinai Hospital, n.d.).

B. Application of Confidence Interval Estimation of the Mean of Total Bilirubin
in mg/dL in Patients

The researchers would perform this statistical test in order to determine
whether the confidence interval estimate of the mean falls within the normal range
of total bilirubin in mg/dL. With respect to the aforementioned dataset (ILPD), the
sample size is large, having 583 patients (n ≥ 30), so the Central Limit Theorem (CLT)
applies; and since the sigma is known, the Z-distribution should be used. Because the
CLT applies, the sampling distribution of X̄ will be approximately normal with a mean
equal to the population mean. In this confidence interval estimation, the researchers
would use the conventional confidence level of 95%. Given that the confidence level is
95%, the value of alpha (α) would be 0.05, and α/2 would be 0.025. Then, the sample
mean X̄ must be computed using the formula:

\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} \]

After selecting the sample statistic and confidence level, the margin of error must be
determined. The margin of error is the largest expected difference between the sample
mean and the true population mean at the chosen confidence level. The formula for the
margin of error is:

\[ \text{margin of error} = Z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}} \]
The value derived from the margin of error formula above is
the quantity that must be both subtracted from and added to the sample mean, X̄
(X̄ ± margin of error). The two resulting values are the
lower and upper limits, respectively. These limits define the range of
values that is expected to enclose the true value of the mean. The confidence interval
with 95% confidence should take this form:

\[ \bar{X} - Z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + Z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}} \]
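
The procedure above can be sketched in Python as a rough illustration. The researchers state that the results come from PHStat and Statistica, so this is not their implementation; the function name `z_confidence_interval`, the sample readings, and the value of sigma are assumptions made only for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_confidence_interval(sample, sigma, confidence=0.95):
    """Return (lower, upper) limits of a Z-based confidence interval for the mean."""
    n = len(sample)
    x_bar = mean(sample)                          # sample mean
    alpha = 1 - confidence                        # 0.05 for 95% confidence
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # Z at alpha/2, about 1.96
    margin = z_crit * sigma / sqrt(n)             # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical total bilirubin readings in mg/dL (not taken from the ILPD),
# with an assumed known population standard deviation of 3.0 mg/dL.
readings = [0.7, 1.0, 0.9, 1.1, 0.8, 10.9, 7.3, 1.8, 0.9, 0.6]
lower, upper = z_confidence_interval(readings, sigma=3.0)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f}) mg/dL")
```

The resulting limits can then be compared with the normal range of 0.1 to 1.2 mg/dL cited earlier.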

Calculation of the Average Alkaline Phosphatase Level of Patients Using Single
Parameter Hypothesis Testing

A. Variables

The variable that the researchers would utilize is the alkaline phosphatase (ALP)
level in international units per liter (IU/L), taken from the health records of patients.
These ALP levels would be used to test the researchers'
alternative hypothesis, which aims to find out whether the mean ALP level of the
patients is above the normal range; an abnormal ALP level is associated with
a higher risk of liver disease. The data has
a sample size of 583 along with the ALP levels, taken from the Indian Liver
Patient Dataset collected from the North East of Andhra Pradesh, India.
According to the Cleveland Clinic (2021), a normal ALP level is typically 40 IU/L to 147
IU/L. Furthermore, Balingit (2013) said that an ALP level higher than 150 IU/L is
deemed abnormal, which is consistent with the range given by the Cleveland Clinic. The
researchers would therefore use the upper limit of the normal range, 147 IU/L, as the
hypothesized mean.
The variables are measured using the data gathered from the health records of the
patients through the UCI Machine Learning Repository, where the level of alkaline
phosphatase in each patient's blood is shown individually.
The alkaline phosphatase level of the patients is considered ratio data: there is
a measurable difference between the ALP levels of each patient, the levels can be
ordered from highest to lowest, and there is a “true zero starting point,” so no ALP
level can be negative.

B. Application of Single Parameter Hypothesis Testing

The researchers would use single parameter hypothesis testing to test whether
their null hypothesis would be rejected or would fail to be rejected. The test
will be used to assess the mean alkaline phosphatase level of the population of patients
represented in the data set. The test follows an 8-step procedure. Since the data has a
sample size of 583, the central limit theorem applies and the sampling distribution of
the mean follows an approximately normal curve; since the sigma is known, the Z-test is
used. The 8-step procedure is as
follows. The first step is identifying the null hypothesis of the study, which is
that the mean alkaline phosphatase level of the patients is less than or equal to
147 IU/L (Ho: µ ≤ 147 IU/L). The second step is identifying the alternative hypothesis,
which is that the mean alkaline phosphatase level of the patients is greater than
147 IU/L (Ha: µ > 147 IU/L), above the healthy amount, which would indicate a higher
risk of liver disease. This means the test is an upper-tailed test,
so the alpha is not divided by two. The third step is setting the level of
significance. The researchers would use the conventional level of significance of
α = 0.05. The level of significance determines whether the
null hypothesis would be rejected or would fail to be rejected. Since the test is
upper-tailed, the alpha stays at α = 0.05. The corresponding critical value
from the z-table is 1.645,
which marks the start of the rejection region. The fourth step is identifying the test
statistic; since the sample size is 583 and the sigma is known, the central limit theorem
(CLT) applies and the Z-test is used. The fifth step is the decision rule,
which identifies whether the null hypothesis will be rejected or not. With a critical
value of 1.645, the
researchers would reject Ho if the computed Z-score is greater than 1.645. If
the computed Z is less than or equal to 1.645, the researchers would fail to reject the
null hypothesis. The sixth step is computing the z-score
using the formula:

\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]

The seventh step is making the decision: based on the computed Z-score, the
researchers would either reject or fail to reject the null hypothesis. The final step is
stating the conclusion of the test.
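
As a minimal sketch of steps three to seven, the upper-tailed Z-test can be written in Python as follows. The hypothesized mean of 147 IU/L and the sample size of 583 come from the text; the sample mean, the value of sigma, and the function name `upper_tailed_z_test` are hypothetical and serve only to show the mechanics.

```python
from math import sqrt
from statistics import NormalDist


def upper_tailed_z_test(x_bar, mu_0, sigma, n, alpha=0.05):
    """Return (z, z_crit, reject) for Ho: mu <= mu_0 against Ha: mu > mu_0."""
    z_crit = NormalDist().inv_cdf(1 - alpha)  # about 1.645 when alpha = 0.05
    z = (x_bar - mu_0) / (sigma / sqrt(n))    # computed Z-score
    return z, z_crit, z > z_crit              # reject Ho only if z exceeds z_crit


# Hypothetical sample mean and population standard deviation (assumed values).
z, z_crit, reject = upper_tailed_z_test(x_bar=160.0, mu_0=147.0, sigma=100.0, n=583)
decision = "reject Ho" if reject else "fail to reject Ho"
print(f"Z = {z:.3f}, critical value = {z_crit:.3f}, decision: {decision}")
```

Because the test is upper-tailed, the whole of alpha sits in the right tail, which is why the critical value is taken at 1 - alpha rather than 1 - alpha/2.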

Calculation of the Average Conjugated Bilirubin Level of Liver and Non-Liver
Patients Using Double Parameter Hypothesis Testing

A. Variables

For the double parameter hypothesis testing, the variable used is
the conjugated or direct bilirubin level, in milligrams per deciliter (mg/dL), of the liver
and non-liver patients, taken from the Indian Liver Patient Dataset.
The hypothesis testing will have a sample size of 57 for liver patients and 27 for non-liver
patients. The average direct bilirubin level of each group will be used to
compute the standard deviation of each sample. The variables are measured using data
from the patients' health records, as provided by the UCI Machine Learning
Repository, which displays the level of direct bilirubin in each patient's
blood. These variables will be used to test the alternative hypothesis
that there is a significant difference between the two groups.

In adults, normal blood test results for direct bilirubin range from 0 to 0.2 mg/dL, or
less than 0.3 mg/dL. Bilirubin may also show up in the urine if the blood test results
are higher. Normal, healthy persons do not have bilirubin in their urine (University of
Rochester Medical Center Rochester, 2022).

B. Application of Double Parameter Hypothesis Testing

Double parameter hypothesis testing will be used to determine whether to reject
the null hypothesis or not. Before proceeding with the 8-step
procedure for comparing the means of the two populations, diagnostic checking should
be done first. There are 5 steps in diagnostic checking. The first step is the test for
normality, in which the null hypothesis for each of the two populations is that the
population is normally distributed. The test will be conducted using the Shapiro-Wilk
test in STATISTICA. The decision rule is that if the p-value is less than the level of
significance, then the null hypothesis is rejected. If the null hypotheses for both
populations fail to be rejected, the researchers can continue to the next steps. The
second step is determining whether the two populations are independent, meaning they
do not influence each other. The third step is to make
sure that at least one of the samples has a size less than 30. The fourth
step is confirming that the population variances of both populations are unknown,
because the t-test will be used in the double parameter hypothesis testing. The last step
is testing whether the two population variances are equal or not, where the null
hypothesis is that the two population variances are equal and the alternative hypothesis
is that they are not equal. The F-test will be used for this
step. The researchers will use the conventional level of significance (alpha) of 0.05.
The test is two-tailed, so the alpha is divided by two,
yielding 0.025 in each tail. The upper and lower critical values are obtained differently,
and both use the F table. The null hypothesis is rejected if the computed value is greater
than the upper critical value or less than the lower critical value. The formula for the
F-test, the degrees of freedom (d.f.), and the critical values are shown below:


Upper critical value (α/2 = 0.025):
$$d.f._1\ (\text{numerator}) = n_1 - 1, \qquad d.f._2\ (\text{denominator}) = n_2 - 1$$

Lower critical value (α/2 = 0.025):
$$d.f._2\ (\text{numerator}) = n_2 - 1, \qquad d.f._1\ (\text{denominator}) = n_1 - 1$$
After finding the value from the F table, the reciprocal of the value is the lower
critical value.

$$F = \frac{S_1^2}{S_2^2}, \quad \text{where } S_1^2 > S_2^2$$
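The F-test step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `f_ratio` and `reject_equal_variances` are hypothetical, the sample values are invented and not taken from the dataset, and the two table values must still be looked up in an F table (or obtained from software) for the actual degrees of freedom.

```python
from statistics import variance

def f_ratio(sample_a, sample_b):
    """Return (F, d.f. numerator, d.f. denominator), larger sample variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

def reject_equal_variances(f_value, upper_table_value, lower_table_value):
    """Two-tailed rule: the lower critical value is the reciprocal of the table value."""
    lower_critical = 1.0 / lower_table_value
    return f_value > upper_table_value or f_value < lower_critical

# Hypothetical measurements, used only to show the mechanics.
group_1 = [0.2, 0.9, 3.1, 5.4, 0.4, 7.8]
group_2 = [0.1, 0.2, 0.3, 0.2, 0.4]
f_value, df_num, df_den = f_ratio(group_1, group_2)
print(f"F = {f_value:.4f} with d.f. = ({df_num}, {df_den})")
```

Placing the larger variance in the numerator keeps F at or above 1, so in practice only the upper critical value can be exceeded; the reciprocal rule for the lower critical value is kept here to mirror the procedure in the text.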

The next procedure is the 8-step procedure for reaching a conclusion about the initial
hypotheses. This 8-step procedure is similar to the one used in the single parameter
hypothesis testing. The first step is to state the null hypothesis, which is that there is
no significant difference between the conjugated bilirubin level of liver patients and
non-liver patients. The next step is to state the alternative hypothesis, which is that
there is a significant difference between the conjugated bilirubin level of liver patients
and non-liver patients. Based on the alternative hypothesis, the test will be two-tailed.
Step 3 is setting the level of significance; in this hypothesis testing the researchers
will use a 0.05 level of significance (Alpha). The fourth step is determining which test
statistic will be used. There are two types of t-test that can be used in this hypothesis
testing: the t-test for separate variance and the t-test for pooled variance. The F-test
will determine which t-test will be used: if its null hypothesis is rejected, the t-test
for separate variance will be used, while if it is not rejected, the t-test for pooled
variance will be used. The fifth step is obtaining the critical values for the critical
region of the distribution and stating the decision rule. Since the test is two-tailed,
the Alpha will be divided by two, giving 0.025 in each tail. The decision rule is that if
the absolute value of the computed t-score is greater than the critical value, then the
null hypothesis is rejected. Step 6 is the actual computation of the t-score. The formulas
for the two t-tests and their corresponding degrees of freedom (d.f.) are shown below:

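$$t_{\text{separate variance}} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}, \quad \text{with d.f.} = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}$$

$$t_{\text{pooled variance}} = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{S_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \quad \text{with d.f.} = n_1 + n_2 - 2$$

$$S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}$$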
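The two t statistics and their degrees of freedom can be computed directly from summary statistics, as in the following minimal Python sketch. The function names are hypothetical, and the means and variances in the usage lines are invented for illustration; only the sample sizes (57 and 27) match the ones used in this study. The separate-variance d.f. is rounded down, following the convention used in this paper.

```python
import math

def separate_variance_t(mean1, var1, n1, mean2, var2, n2, hypothesized_diff=0.0):
    """t statistic and rounded-down d.f. when the population variances are unequal."""
    se1, se2 = var1 / n1, var2 / n2
    t = ((mean1 - mean2) - hypothesized_diff) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, math.floor(df)

def pooled_variance_t(mean1, var1, n1, mean2, var2, n2, hypothesized_diff=0.0):
    """t statistic and d.f. when the population variances are assumed equal."""
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t = ((mean1 - mean2) - hypothesized_diff) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical summary statistics (means and variances are not from the dataset).
t_sep, df_sep = separate_variance_t(2.0, 9.0, 57, 0.4, 0.25, 27)
t_pool, df_pool = pooled_variance_t(2.0, 9.0, 57, 0.4, 0.25, 27)
print(f"separate variance: t = {t_sep:.4f}, d.f. = {df_sep}")
print(f"pooled variance:   t = {t_pool:.4f}, d.f. = {df_pool}")
```

When the two sample variances differ greatly, the separate-variance d.f. falls well below n1 + n2 − 2, which makes the critical value slightly larger and the test more conservative.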

In the 7th step, the conclusion is drawn from the decision rule. The conclusion is either
that there is sufficient evidence or that there is insufficient evidence to reject the
null hypothesis. For the final step, based on the conclusion from step 7, the researchers
will summarize the findings of the test in terms of the hypotheses.

Investigating the Relationship between Albumin and Albumin-Globulin Ratio

A. Variables

In this section of the study, the researchers aim to determine and understand the
relationship between Albumin levels and the Albumin-Globulin Ratio. Data will be taken
from 583 patient records, of which 416 have a form of liver disease and 167 have none
(Lichman, 2013). Albumin will be the independent variable while the Albumin-Globulin
ratio will be the dependent variable.

According to an article written by Yazdi, P. (2021), albumin is the most commonly present
protein in the bloodstream. Its main function is maintaining osmotic pressure, which
prevents the leakage of fluids from blood vessels into the tissues that surround them.
Globulin is also a type of protein, but it has more types depending on its function, such
as acting as transporters or antibodies. The Albumin-Globulin ratio is derived from the
total protein test, which is part of the Liver Function test. Normal albumin-globulin
ratios are between 1.1 and 2.5, that is, between 1.1:1 and 2.5:1. Anything that deviates
from this range is a sign of potential disease or risk of disease. Liver disease is
usually more associated with higher levels of globulin and albumin.

B. Application of Simple Linear Regression

Since the normal Albumin-Globulin ratios are between 1.1 and 2.5, imbalances in this
ratio, like a higher albumin count, would indicate a possibility of health complications
such as liver disease (Yazdi, 2021). For this study, the researchers had the goal of
determining the relationship between two variables, albumin and the albumin-globulin
ratio, to see if albumin is a variable that can predict the value of the A-G ratio. This
will be done by applying the Simple Linear Regression test to the independent and
dependent variables taken from a sample of 579 patients from the dataset (Lichman, 2013).
The test that is going to be used consists of two main parts: the ANOVA approach, which
tests whether there is an association between the variables, and the Hypothesis Testing
of Significance of Linear Relationships, which checks whether the linear relationship is
significant.

The first step was to identify the null and alternative hypotheses for each test. For
the ANOVA approach, the null hypothesis indicates that there is no association between
the variables. The alternative hypothesis says otherwise. For the Hypothesis Testing of
Significance of Linear Relationships, the null hypothesis says that there is no significant
relationship.

In the second step, a scatter plot was created to show the linear relationship of
albumin and the A-G ratio. From this, the regression equation used for predicting the
values of the dependent variable can be taken. This is represented by the following
formula:


𝑦 = β0 + β1𝑥
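As a rough illustration of how the coefficients β0 and β1 of this equation are estimated by least squares, the sketch below fits a line in plain Python. The helper name `fit_line` and the data points are hypothetical (the points are chosen to lie exactly on y = 1 + 2x so the result is easy to check); the study itself obtained its coefficients from STATISTICA.

```python
from statistics import fmean

def fit_line(x_values, y_values):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    x_bar, y_bar = fmean(x_values), fmean(y_values)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_values, y_values))
    sxx = sum((x - x_bar) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical points lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(xs, ys)
print(f"y = {b0:.3f} + {b1:.3f}x")
```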

The next step was using a statistical software called STATISTICA to calculate the
t-values, the correlation coefficient, and other results that are important for the
interpretation of the test. In order to properly read the results, conditions were set: if
the estimated coefficient has the value of zero (0), then the result follows the null
hypothesis. Any other value would have it follow the alternative hypothesis. The level of
significance used is α = 0.05. However, since it is a two-tailed test, the test will have
α/2 = 0.025 in each tail. This is important for the decision rule, which guides the group
on whether to reject or fail to reject the null hypothesis. In the ANOVA approach, the
rule is to reject Ho if $F > F_{\alpha;\ 1,\ n-2}$, that is, if the computed F exceeds the
critical value with 1 and n − 2 degrees of freedom.

The second test checks whether the linear relationship is significant. This will be done
by computing the absolute value of “t” with this formula:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$


VI. Data Presentation and Analysis

I. Demographic Profile

Demographic Variable (n = 583)      Sample Size   Mean       Min.   Max.

Age classification
  Young (< 30 years old)            105           21         4      29
  Middle-aged (30-59 y.o.)          345           44         30     58
  Senior (60 and above)             133           66         60     90

Gender
  Male                              443
  Female                            140

Presence of Liver Disease
  Yes (1)                           416
  No (2)                            167

Total Bilirubin
  Normal (0.1 to 1.2 mg/dL)         332           0.8042     0.4    1.2
  Abnormal (> 1.2 mg/dL)            251           6.5984     1.3    75

Alkaline Phosphatase Level
  Normal (40 to 147 IU/L)           55            125.4909   63     147
  Abnormal (> 147 IU/L)             528           307.7727   148    2110

Alanine Aminotransferase Level
  Normal (7-55 IU/L)                417           29.4604    10     55
  Abnormal (above 55 IU/L)          166           209.4639   56     2000

Aspartate Aminotransferase Level
  Normal (8-48 IU/L)                330           28.1939    10     48
  Abnormal (above 48 IU/L)          253           216.4980   49     4929

Albumin
  Normal (3.5 to 5 g/dL)            209           3.9742     3.5    5
  Abnormal (< 3.5 or > 5)           374           2.6767     0.9    5.5

Total Proteins
  Normal (6.3 to 7.9 g/dL)          286           7.0150     6.3    7.9
  Abnormal (< 6.3 or > 7.9)         297           5.9710     2.7    9.6

Direct Bilirubin
  Normal (≤ 0.3 mg/dL)              308           0.1961     0.1    0.3
  Abnormal (> 0.3 mg/dL)            275           2.9309     0.4    19.7

Albumin and Globulin Ratio
  Normal (1 to 2.5)                 286           1.1864     1      2.5
  Abnormal (< 1 or > 2.5)           297           0.7198     0.3    2.8


Figure 1.1 (Descriptive statistics of the data using Statistica)

A. Applying Confidence Interval Estimation of the Mean of Total Bilirubin (mg/dL) in patients

Figure 2.1 - Use of Descriptive Statistics in Statistica for Mean, Variance and Standard Deviation of Total
Bilirubin


Figure 2.2 - Confidence Interval of Mean of Total Bilirubin in patients

Sample mean   Z-value   Standard Error   Margin of Error   Standard Deviation

3.298799      1.96      0.257172         0.50405712        6.209522

Figure 2.3 - Summary Table of Important Values for Confidence Interval Estimation on Mean Total
Bilirubin in Patients

Figure 2.4 - Statistica Results for Confidence Interval Estimation

B. Single parameter Hypothesis testing for the mean ALP level of Patients

Figure 3.1 - Use of descriptive statistics in STATISTICA for the single parameter hypothesis testing


C. Double parameter Hypothesis testing for the Average Conjugated Bilirubin Level of Liver
and Non-Liver Patients.

Figure 4.1 - Use of descriptive statistics in STATISTICA for the double parameter hypothesis testing


VI. Statistical Analysis

A. Discussion of Results in Applying A Confidence Interval Estimation of the Mean of Total
Bilirubin in mg/dL in Patients

In view of the results gathered in the construction of a confidence interval estimate for
the mean total bilirubin (mg/dL) derived from the Indian Liver Patient dataset, the
researchers are 95% confident that the population mean μ lies inside the computed
confidence interval (2.793701, 3.803898). Because the population mean μ is an unknown
parameter, the researchers formed the confidence interval from the sampling distribution
of the point estimator X̄, the sample mean. In essence, the 95% is a confidence level and
not the probability that μ lies inside this particular interval. This specific interval
only gives a plausible range of values for the population mean. Put differently, 95% is
the percentage of all samples of 583 patients that would produce confidence intervals
containing μ; the intervals from the remaining 5% of such samples would not contain the
population mean. As per the Mount Sinai Hospital (n.d.), the normal range of total
bilirubin is from 0.1 to 1.2 mg/dL. The estimated 95% confidence interval for the mean
total bilirubin of both liver and non-liver patients from the North East of Andhra
Pradesh, India, is 2.793701 to 3.803898 mg/dL, which lies entirely above the normal
range. A high amount of total bilirubin in the blood can be one of the symptoms of
liver-related illnesses.
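The interval can be approximately reproduced from the summary statistics with a short Python sketch. The function name `z_confidence_interval` is hypothetical; the inputs are the sample mean, standard deviation, and sample size reported in this study. Because the sketch uses the exact normal critical value, its limits may differ from software output in the third decimal place, depending on which critical value the software applies.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based confidence interval for the mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    standard_error = sigma / math.sqrt(n)
    margin = z * standard_error
    return sample_mean - margin, sample_mean + margin

# Summary statistics for total bilirubin (mg/dL) as reported in this study.
lower, upper = z_confidence_interval(3.298799, 6.209522, 583)
print(f"95% CI: ({lower:.4f}, {upper:.4f}) mg/dL")
```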

B. Statistical analysis for Single parameter hypothesis test of the mean alkaline
phosphatase level of patients.

The 8-step procedure will be used in testing whether the population parameter is different
from the hypothesized value. The researchers first formulated two hypotheses using a
hypothesized value of 147 IU/L, which is the upper limit of the normal range of the ALP
level. The null hypothesis is Ho: µ ≤ 147 IU/L (the mean alkaline phosphatase level of the
patients is in the normal range) and the alternative hypothesis is Ha: µ > 147 IU/L (the
mean alkaline phosphatase level of the patients is considered abnormal). With these
hypotheses, the researchers decided to use an upper-tailed test at the conventional
significance level of 0.05, which corresponds to a critical value of 1.645. The mean ALP
level of 290.5763 and the standard deviation of 242.9380 will be used in computing the
Z-score to find out whether the null hypothesis is rejected or not.

Figure 5.1 - Use of PHSTAT in Z-test of hypothesis for the mean

Discussion of results of Single Parameter Hypothesis Testing of the mean Alkaline
Phosphatase Level of Patients

Based on the results of the 8-step procedure, the researchers computed a Z value of 14.27,
which is greater than the critical value of 1.645. The calculated p-value, reported as
0.000, is also less than the Alpha of 0.05. Therefore, the researchers reject the null
hypothesis, which states that the average alkaline phosphatase level of the patients is at
most 147 IU/L and therefore within the normal range. The researchers conclude that there
is strong evidence of a concerning problem with the patients' ALP levels. Lowe D. and
Sanvictores T. (2021) stated that ALP levels above the normal range indicate that a
patient has a high chance of having liver disease. With reference to this, and based on
the computed data, the majority of patients show an abnormal amount of ALP, so the
researchers can say that they have a high chance of having liver problems. This conclusion
was also supported by the use of statistical tools such as PHSTAT, wherein the researchers
obtained a reported p-value of 0.000, meaning that the results are highly significant and
very unlikely to have occurred by chance alone.
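The Z value and the decision can be checked with a minimal Python sketch. The function name `upper_tailed_z_test` is hypothetical, and the inputs are the sample mean and standard deviation from the appendix computation together with the hypothesized value of 147 IU/L.

```python
import math
from statistics import NormalDist

def upper_tailed_z_test(sample_mean, hypothesized_mean, sigma, n, alpha=0.05):
    """Return (z, critical value, p-value, reject?) for Ha: mean > hypothesized_mean."""
    z = (sample_mean - hypothesized_mean) / (sigma / math.sqrt(n))
    critical = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    p_value = 1 - NormalDist().cdf(z)           # upper-tail area beyond z
    return z, critical, p_value, z > critical

z, critical, p_value, reject = upper_tailed_z_test(290.5763, 147, 242.9380, 583)
print(f"Z = {z:.2f}, critical value = {critical:.3f}, reject Ho: {reject}")
```

A Z value this far into the upper tail has a p-value too small to display at three decimal places, which is why the software reports it as 0.000.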

C. Statistical analysis for Double Parameter Hypothesis Test of the Average Conjugated
Bilirubin Level of Liver and Non-Liver Patients

In double parameter hypothesis testing, diagnostic checking is a must in order for the
researchers to know whether the two independent populations are valid for the hypothesis
testing. It consists of a 5-step process. In the first step, the Shapiro-Wilk test is
utilized to determine if the two populations are normally distributed.

Figure 6.1 - Use of Shapiro-Wilk test in STATISTICA for the test of normality for liver patients’ data


Figure 6.2 - Use of Shapiro-Wilk test in STATISTICA for the test of normality for non-liver patients’ data

The liver patients’ data can be treated as normally distributed, because the p-value is
greater than the Alpha. The non-liver patients’ data, on the other hand, is not normally
distributed, because its p-value is less than the Alpha. Since that p-value is close to
the Alpha and the acquired data does not give much of a choice for the sample size, the
diagnostic checking can still proceed to the next steps. Next is making sure that the two
populations are independent of each other. Steps 3 and 4 indicate that the hypothesis
testing will use a t-test, because at least one of the sample sizes is less than 30 and
sigma is unknown. The final step is crucial in determining which t-test to use. The null
hypothesis states that the population variances of the two populations are equal.


Figure 6.3 - Use of PHSTAT in F-test for Differences in Two Variances

Since the value of F is greater than the upper critical value, the null hypothesis is
rejected. The p-value obtained is also less than the Alpha, which supports rejecting the
null hypothesis. This means that the test statistic that will be used for the 8-step
procedure is the t-test for separate variance.

After conducting the diagnostic checking, the test can now proceed to the 8-step
procedure. The first and second steps are stating the null and alternative hypotheses:
Ho: µ1 - µ2 = 0 (There is no significant difference between the conjugated bilirubin level
of liver patients and non-liver patients.) is the null hypothesis, while Ha: µ1 - µ2 ≠ 0
(There is a significant difference between the conjugated bilirubin level of liver
patients and non-liver patients.) is the alternative hypothesis. Next is setting the
conventional level of significance (Alpha) of 0.05.


After that is identifying which t-test to use, which has already been identified with the
help of the F-test in the previous 5-step procedure. Step 5 is the decision rule.
According to the alternative hypothesis, it is a two-tailed test, therefore α/2 = 0.025 is
used for the t-table. The degrees of freedom (d.f.) for the t-test for separate variance
are computed as 71.3163, and since the d.f. should always be rounded down, the final d.f.
is 71. Using PHSTAT, the critical values are ± 1.9939. The computed value from the t-test
will determine the action on the null hypothesis.

Figure 6.4 - Use of PHSTAT in Separate-Variances t Test for the Difference Between Two Means


Discussion of results of Double Parameter Hypothesis Test of the Average Conjugated
Bilirubin Level of Liver and Non-Liver Patients

The computed t-value is 4.2689. Its absolute value is greater than the critical value,
therefore the null hypothesis is rejected. In addition to this, the p-value is 0.0001,
which is less than the level of significance. The p-value also supports rejecting the
null hypothesis. That means that there is a significant difference between the average
conjugated bilirubin level of liver and non-liver patients. According to Mayo Foundation
for Medical Education and Research (2020), normal results for direct bilirubin are
generally 0.3 mg/dL or below. The direct bilirubin levels of liver patients are higher
compared to those of non-liver patients, which is consistent with the result of the
hypothesis testing.

Statistical Analysis of the Relationship between Albumin and Albumin-Globulin Ratio

The scatter plot of Albumin against the A-G ratio shows a linear relationship between
them. From the graph constructed by the program STATISTICA, the equation for the
regression is y = 1.515 + 1.7143x. This equation is used to predict the values of the
dependent variable from the predictor variable. In the STATISTICA regression summary
(Figure 7.2), the dependent variable y is Albumin and the predictor x is the A-G ratio.


Figure 7.1 - Scatter plot graph done in STATISTICA

From the results of the test, the researchers were able to establish the following.
First, there is a linear relationship between Albumin and the Albumin-Globulin ratio.
This was shown by the first part of the Simple Linear Regression test (the ANOVA
approach). The F distribution is the basis for the decision rule, which states that the
null hypothesis is rejected if $F > F_{\alpha = 0.05;\ 1,\ 577} = 3.858$. Since the
computed F = 523.3 is greater than 3.858, the null hypothesis was rejected. Retaining the
null hypothesis would have meant that there is no association between the two variables,
while accepting the alternative hypothesis indicates the opposite. The results were in
favor of the alternative hypothesis.

Regression Statistics

Statistic               Value

Multiple R              0.68963234189663
Multiple R²             0.475592766989831
Adjusted R²             0.474683915632794
F(1,577)                523.289934385422
df                      1, 577
p                       0
Std.Err. of Estimate    0.575795889906194

Figure 6.6 - Regression statistics done in STATISTICA

The second part of the statistical analysis for the relationship between Albumin and the
Albumin-Globulin ratio tests how significant their relationship is. This was done by
conducting the Hypothesis Testing of Significance of Linear Relationships. The null
hypothesis assumes that the relationship is not significant, while the alternative
hypothesis assumes that it is significant. The results favored the alternative hypothesis
because the p-value, reported as 0.00, is less than the 0.05 level of significance. The
null hypothesis was also rejected under the decision rule, which states that it is
rejected if the absolute value of “t” is greater than 1.984; the computed t(577) of
22.87553 exceeds this value.

Regression Summary for Dependent Variable: Albumin (indian_liver_patientfinal)
N = 579   R = .68963234   R² = .47559277   Adjusted R² = .47468392
F(1,577) = 523.29   p < 0.0000   Std. Error of estimate: .57580

                             b*         Std. error of b*   b          Std. error of b   t(577)     p-value

Intercept                                                  1.514989   0.074898          20.22747   0.00
Albumin and Globulin Ratio   0.689632   0.030147           1.714272   0.074939          22.87553   0.00

Figure 7.2 - Regression summary for the Dependent Variable computed in STATISTICA
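The reported t and F values can be cross-checked against the correlation coefficient with a few lines of Python. The helper names `t_from_r` and `f_from_r_squared` are hypothetical; the inputs r = 0.68963234 and n = 579 are taken from the regression summary above. In simple linear regression the slope's t statistic squared equals the ANOVA F statistic, so both tests lead to the same decision.

```python
import math

def t_from_r(r, n):
    """t statistic for the significance of a linear relationship, with n - 2 d.f."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def f_from_r_squared(r_squared, n):
    """ANOVA F statistic for simple linear regression, with (1, n - 2) d.f."""
    return r_squared / (1 - r_squared) * (n - 2)

r, n = 0.68963234, 579
t_value = t_from_r(r, n)
f_value = f_from_r_squared(r ** 2, n)
print(f"t({n - 2}) = {t_value:.4f}, F(1, {n - 2}) = {f_value:.2f}")
```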

To repeat, it was concluded that these variables have a significant linear relationship.
What this implies for the research is that Albumin is a component in the bloodstream that
should be given significant importance when reading the liver function test of a patient.
This specific type of protein is created by the liver. It is responsible for numerous
functions, such as the movement of molecules through the blood and the prevention of
fluids in the blood from leaking into the surrounding tissues (Mount Sinai, n.d.). The
linear aspect of its relationship with the Albumin-Globulin ratio is another indicator of
its importance, since an increase in Albumin is associated with a change in the ratio.


Conclusion and Recommendations

Four statistical methodologies have been carried out for this statistical research,
namely, confidence interval estimation for the mean of total bilirubin in patients, single
parameter hypothesis testing regarding the mean alkaline phosphatase level of patients,
double parameter hypothesis testing to compare the amount of conjugated bilirubin of
liver and non-liver patients, and simple linear regression to determine the relationship
between the albumin and globulin ratio and the amount of albumin. In view of the collated
results and in-depth analyses of the various test results, the derived conclusions show
that there is an abnormal amount of specific chemical compounds in the blood of liver
patients, specifically total bilirubin, alkaline phosphatase, and conjugated bilirubin.
The World Health Organization (WHO) stated that India is considered the world’s capital
of liver diseases. With reference to this, it is an alarming matter that liver-related
illnesses are rampant in the said country. These results imply that abnormal levels of
these particular chemical compounds in an individual’s bloodstream can signal the early
onset of liver disease, and monitoring them can therefore serve as a preventative measure
for the citizens of India. Ultimately, in reference to the gathered results and
interpretations, the variables studied do reflect an individual’s liver functionality.

Based on the conclusions drawn from the findings of this study, the researchers would
like to state the following recommendations to further enhance this study. Firstly, let
this study be proof of the lack of healthcare action across the world, especially in
countries with a poor healthcare system. Governments should focus more on their
healthcare systems to lessen the burden of these different types of health concerns,
especially liver diseases, which are sometimes overlooked. Relating to poor healthcare
systems, programs focusing on liver health should be implemented where there is an
abundance of people who suffer from liver diseases, as this is a very concerning matter.
This is supported by the study conducted by Sumeet, A. et al. (2019), which states that
liver diseases account for approximately 2 million deaths per year worldwide. With this,
free blood tests should also be implemented so that early signs of liver diseases can be
detected, which would help in lessening the burden of these types of diseases in the
future. Natalia, O. et al. (2017) stated that excessive alcohol consumption impacts the
liver to the greatest degree, which could lead to liver problems in the long run. Natalia,
O. et al. added that 35% of problem drinkers develop liver problems. Factors that affect
liver health, like alcohol, should be lessened so that liver diseases would not be
prevalent.

Future researchers who plan to make a study relating to this topic should do further
investigations in this area of research and see if there are changes within the scope of
this research, namely people who have liver diseases. Many years from now, the number of
liver disease patients may change drastically, whether for the better or for the worse,
which is why further extensive and more intricate research is recommended. It is highly
critical to identify a liver infection early on in order to reduce the intensity and
frequency of the illness. Future work must include the gravity and importance of
diagnosis at the preliminary stage to promote prevention measures, which can lessen the
consistently increasing number of liver patient cases. Other researchers who would like
to take up similar studies should focus more on the variables that significantly affect
the liver health of a person.


VII. Appendices

A. Constructing a Confidence Interval Estimation of the Mean of Total Bilirubin in mg/dL
in Patients

1. Upon looking at the Indian Liver Patient dataset, the sample size (n) is equal to
583, which is large (n ≥ 30), and the value for sigma is treated as known. Hence, the
researchers conclude that the sample mean would follow a normal distribution.

2. The value of α would be 0.05 and α/2 would be 0.025.

3. Using Statistica, a statistical software, the sample mean was computed as
X̄ = 3.298799. Since the Central Limit Theorem is valid and can be applied, the standard
deviation from Statistica is used as the value of sigma, σ = 6.209522.

4. With this, the standard error can be calculated with the formula:

$$SE = \frac{\sigma}{\sqrt{n}}$$

5. Inputting the respective values we have:

$$SE = \frac{6.209522}{\sqrt{583}} = 0.257172$$

6. Following this and looking at the information that is given, the lower and upper
limits can be solved by subtracting the margin of error from the sample mean
X̄ = 3.298799 and by adding the margin of error to the sample mean.

7. The formula for the margin of error is:

$$E = Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$

Here α/2 = 0.025, and this value is important since it must be found in the body of the
Z-table to find its corresponding Z-score.


Figure 8.1 - Z-Table for the Z-score of α/2


The Z-score corresponding to an area of 0.025 is -1.96. Since the confidence level is 95%
and the normal distribution is symmetric, the critical Z-scores are ±1.96.

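8. Now that the Z-score, standard deviation, and sample size are known, plug in the
corresponding numerical values into the margin of error formula, multiplying the Z-score
of 1.96 by the standard error SE = 0.257172:

$$E = Z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.96 \cdot \frac{6.209522}{\sqrt{583}} = 1.96 \cdot 0.257172 = 0.50405712$$

9. Now that the margin of error has been calculated, subtract it from and add it to the
sample mean, X̄ = 3.298799 mg/dL:

X̄ − E = 3.298799 − 0.50405712 = 2.794742 mg/dL

X̄ + E = 3.298799 + 0.50405712 = 3.802856 mg/dL

The limits reported by Statistica, 2.793701 mg/dL (lower limit) and 3.803898 mg/dL (upper
limit), are slightly wider than this hand computation. The difference is consistent with
the software using a t critical value (about 1.964 for 582 degrees of freedom) instead of
1.96. The Statistica limits are the ones used in the rest of this study.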


10. With this data gathered, it can be inferred that the confidence interval (2.793701,
3.803898) contains the true mean of the total bilirubin (mg/dL) in the patients
with a 95% confidence level.

2.793701 (lower limit)      X̄ = 3.298799      3.803898 (upper limit)

Confidence Interval for Mean Total Bilirubin in Patients with a 95% confidence level

B. 8-step procedure for Single parameter hypothesis test of the mean alkaline phosphatase
level of patients
1. Null Hypothesis

Ho: µ ≤ 147 IU/L (The Alkaline phosphatase level of the patient is in the normal
range)

2. Alternative hypothesis

Ha: µ > 147 IU/L (The Alkaline Phosphatase Level of the patient is considered
abnormal)

3. Alpha level

α = 0.05

4. Test statistics


Test statistics: (CLT IS OK)

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

5. Decision Rule: Reject Ho if Z is greater than 1.645

Figure 9.1 - Normal distribution curve for the single parameter testing of patients ALP levels.

6. Computation for the Value of Z


Computed value of the sample mean: X̄ = 290.5763
Computed value of the standard deviation: σ = 242.9380

Using the Z-score formula:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{290.5763 - 147}{242.9380 / \sqrt{583}}$$

Z = 14.27 with a p-value < 0.001

7. Conclusion
Since Z = 14.27 is greater than the critical value of 1.645, we have sufficient
evidence to reject the null hypothesis Ho.


8. Implication of the conclusion


The mean alkaline phosphatase level of the patients is abnormally high, which
suggests that the patients need immediate medical attention.
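The computation in step 6 can be reproduced as follows. This is a minimal Python sketch, assuming the upper-tailed Z test stated in steps 1 to 5; the function name `one_sample_z` is illustrative, and the p-value is obtained from the standard normal tail using `math.erfc` rather than from STATISTICA.

```python
import math


def one_sample_z(sample_mean, mu0, sd, n):
    """Return the Z statistic and the upper-tail p-value P(Z > z)."""
    se = sd / math.sqrt(n)
    z = (sample_mean - mu0) / se
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))  # standard normal upper tail
    return z, p_upper


# Values taken from the text: sample mean, hypothesized mean, standard deviation, n
z, p_value = one_sample_z(290.5763, 147, 242.9380, 583)
reject_h0 = z > 1.645  # decision rule from step 5
print(f"Z = {z:.2f}, p-value = {p_value:.3g}, reject Ho: {reject_h0}")
```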

C. 5-step Procedure for Diagnostic Checking in Double Parameter Hypothesis
Test of the Average Conjugated Bilirubin Level of Liver and Non-Liver
Patients.
1. Test for Normality

Test Statistic: Shapiro-Wilk test
α = 0.05
Decision Rule: Reject Ho if p-value < α

Liver patients
Ho: The liver patients’ data is normally distributed.
Ha: The liver patients’ data is not normally distributed.
p-value = 0.05658
Since the p-value is greater than α, we have insufficient evidence to reject Ho.
The liver patients’ data can be treated as normally distributed.

Non-liver patients
Ho: The non-liver patients’ data is normally distributed.
Ha: The non-liver patients’ data is not normally distributed.
p-value = 0.0405
Since the p-value is less than α, we have sufficient evidence to reject Ho.
The non-liver patients’ data is not normally distributed.

2. Independence:
A liver patient cannot be a non-liver patient at the same time, so the two
populations do not influence each other.

3. Sample size: at least one sample has n < 30
n1 = 57
n2 = 27

4. Raw data: σ₁² and σ₂² are unknown (CLT NOT OK)


5. Equal Population Variance

Ho: σ₁² = σ₂²
Ha: σ₁² ≠ σ₂²

α = 0.05

Test Statistic:

$$F = \frac{S_1^2}{S_2^2}, \quad \text{where } S_1^2 > S_2^2$$

Decision Rule: Reject Ho if F > 2.0355 or F < 1/1.8744 ≈ 0.5335

Upper critical value (α/2 = 0.025):
d.f. (numerator) = n1 − 1 = 57 − 1 = 56
d.f. (denominator) = n2 − 1 = 27 − 1 = 26

Lower critical value (α/2 = 0.025):
d.f. (numerator) = n2 − 1 = 27 − 1 = 26
d.f. (denominator) = n1 − 1 = 57 − 1 = 56
After finding the value from the F table (1.8744), the reciprocal of that value is
the lower critical value.


Figure 10.1 - F distribution curve for the F Test for Differences in Two Variances.

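Computation for the F-value

$$F = \frac{S_1^2}{S_2^2} = \frac{0.032726}{0.015328}, \quad \text{where } S_1^2 > S_2^2$$

F = 2.1351 with p-value of 0.0369

Since F = 2.1351 is greater than the upper critical value of 2.0355, we have
sufficient evidence to reject Ho. The population variances are not equal.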
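The variance-ratio check can be sketched in a few lines of Python. This is a minimal illustration, assuming the two F-table values quoted in the text (2.0355 for d.f. 56 and 26, and 1.8744 for d.f. 26 and 56); the function name `variance_ratio_f` is hypothetical, and the critical values are hard-coded rather than computed.

```python
def variance_ratio_f(s1_sq, s2_sq):
    """Return F with the larger sample variance in the numerator."""
    larger, smaller = max(s1_sq, s2_sq), min(s1_sq, s2_sq)
    return larger / smaller


# Sample variances taken from the text
f_stat = variance_ratio_f(0.032726, 0.015328)

upper_critical = 2.0355       # F-table value quoted in the text, d.f. (56, 26)
lower_critical = 1 / 1.8744   # reciprocal of the table value for d.f. (26, 56)

reject_equal_variances = f_stat > upper_critical or f_stat < lower_critical
print(f"F = {f_stat:.4f}, lower = {lower_critical:.4f}, upper = {upper_critical}")
print(f"Reject Ho (equal variances): {reject_equal_variances}")
```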

8-step Procedure for Double Parameter Hypothesis Test of the Average Conjugated
Bilirubin Level of Liver and Non-Liver Patients.

1. Null Hypothesis

Ho: µ1 - µ2 = 0 (There is no significant difference between the conjugated bilirubin
level of liver patients and non-liver patients.)

2. Alternative Hypothesis


Ha: µ1 - µ2 ≠ 0 (There is a significant difference between the conjugated bilirubin
level of liver patients and non-liver patients.)

3. Level of Significance

α = 0.05

4. Test Statistic

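Based on the results of the F-test, the t-test for separate variances will be used.

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

with

$$d.f. = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}$$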

5. Decision Rule: Reject Ho if |t| > 1.9939

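$$d.f. = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{1}{n_1 - 1}\left(\dfrac{S_1^2}{n_1}\right)^2 + \dfrac{1}{n_2 - 1}\left(\dfrac{S_2^2}{n_2}\right)^2}$$

$$d.f. = \frac{\left(\dfrac{0.032726}{57} + \dfrac{0.015328}{27}\right)^2}{\dfrac{1}{57 - 1}\left(\dfrac{0.032726}{57}\right)^2 + \dfrac{1}{27 - 1}\left(\dfrac{0.015328}{27}\right)^2}$$

d.f. = 71.3163 ≈ 71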
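The degrees-of-freedom formula is easy to mistype by hand, so a short check is useful. The following is a minimal Python sketch of the separate-variance (Welch-Satterthwaite) approximation, using the sample variances and sizes from the text; the function name `welch_df` is illustrative.

```python
import math


def welch_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom for the separate-variance t test."""
    a = s1_sq / n1  # variance of the first sample mean
    b = s2_sq / n2  # variance of the second sample mean
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))


# Values taken from the text: liver patients (n1 = 57), non-liver patients (n2 = 27)
df = welch_df(0.032726, 57, 0.015328, 27)
df_rounded_down = math.floor(df)  # the text rounds 71.3163 to 71
print(f"d.f. = {df:.4f}, used as {df_rounded_down}")
```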


Figure 10.2 - t distribution curve for the Separate-Variances t Test for the Difference Between Two Means

6. Computation for the t-value

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}}$$

$$t = \frac{(0.436842 - 0.292593) - (0)}{\sqrt{\dfrac{0.032726}{57} + \dfrac{0.015328}{27}}}$$

t = 4.2689 with p-value of 0.0001

7. Conclusion

We have sufficient evidence to reject Ho.


8. Implication of the Conclusion

There is a significant difference between the conjugated bilirubin level of liver
patients and non-liver patients.
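The t-value in step 6 can be verified with the sketch below. It is a minimal Python illustration of the separate-variance t statistic, assuming the sample means, variances, and sizes quoted in the text and the critical value 1.9939 from step 5; the function name `separate_variance_t` is hypothetical. Because the inputs are rounded to six decimals, the result may differ from the reported 4.2689 in the last digit.

```python
import math


def separate_variance_t(mean1, mean2, s1_sq, n1, s2_sq, n2, hypothesized_diff=0.0):
    """Return the separate-variance t statistic for two independent samples."""
    standard_error = math.sqrt(s1_sq / n1 + s2_sq / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error


# Values taken from the text
t_stat = separate_variance_t(0.436842, 0.292593, 0.032726, 57, 0.015328, 27)
reject_h0 = abs(t_stat) > 1.9939  # two-tailed decision rule from step 5
print(f"t = {t_stat:.4f}, reject Ho: {reject_h0}")
```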

Simple Linear Regression

For this study, the researchers used simple linear regression to analyze the
relationship between Albumin and the Albumin-Globulin ratio, with the program
STATISTICA as the main tool. The dependent (response) variable is the Albumin
level of the population, whereas the independent (predictor) variable is the
Albumin-Globulin ratio. This is represented by:

𝑦 = β0 + β1𝑥

Regression Equation: 𝑦 = 1.515 + 1.7143𝑥

𝑥 = 𝐼𝑉 = Albumin-Globulin ratio

𝑦 = DV = Albumin
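To illustrate how the fitted line is used, the sketch below evaluates the regression equation for a given Albumin-Globulin ratio. It is a minimal Python example, assuming the unrounded coefficients from the STATISTICA regression summary (intercept 1.514989, slope 1.714272); the function name `predict_albumin` is illustrative, and the input of 1.0 is an arbitrary example value rather than a record from the dataset.

```python
INTERCEPT = 1.514989  # b0 from the STATISTICA regression summary
SLOPE = 1.714272      # b1 from the STATISTICA regression summary


def predict_albumin(ag_ratio):
    """Predicted Albumin for a given Albumin-Globulin ratio."""
    return INTERCEPT + SLOPE * ag_ratio


# Example value chosen only for illustration
example_ratio = 1.0
predicted = predict_albumin(example_ratio)
print(f"Predicted Albumin at A/G ratio {example_ratio}: {predicted:.4f}")
```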

1. ANOVA Approach

● Null and Alternative Hypothesis

○ Ho: β1 = 0; The level of albumin does not have a correlation with the
Albumin-Globulin ratio.
○ Ha: β1 ≠ 0; The level of albumin does have a correlation with the
Albumin-Globulin ratio.
● Level of significance: α = 0.05
● Choice of Test Statistic:

$$F = \frac{MS_{Reg}}{MS_{Res}}$$


● Decision Rule:
○ Reject Ho if F > F(α = 0.05, df = 1, 577) = 3.858

● STATISTICA results:

Regression Statistics

Statistic               Value
Multiple R              0.68963234189663
Multiple R²             0.475592766989831
Adjusted R²             0.474683915632794
F(1,577)                523.289934385422
df                      1, 577
p                       0
Std.Err. of Estimate    0.575795889906194

Regression Summary for Dependent Variable: Albumin (indian_liver_patientfinal)

N = 579   R = .68963234   R² = .47559277   Adjusted R² = .47468392
F(1,577) = 523.29   p < 0.0000   Std. Error of estimate: .57580

                             b*         Std. error of b*   b          Std. error of b   t(577)     p-value
Intercept                                                  1.514989   0.074898          20.22747   0.00
Albumin and Globulin Ratio   0.689632   0.030147           1.714272   0.074939          22.87553   0.00

Computation for values:

Total Variation:

$$S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 6068.1 - \frac{(1817.2)^2}{579} = 364.7911226$$

$$SS_{Reg} = b_1 S_{xy} = 1.714272\,(101.204475) = 173.4919978$$

$$SS_{Res} = S_{yy} - SS_{Reg} = 364.7911226 - 173.4919978 = 191.2991248$$

ANOVA Table

Source of Variation    Sum of Squares         Df           Mean Squares
Due to regression      SSReg = 173.4919978    1            MSReg = 173.4919978
Due to residuals       SSRes = 191.2991248    577 (n-2)    MSRes = 0.331540944
Total                  Syy = 364.7911226      578 (n-1)

$$F = \frac{MS_{Reg}}{MS_{Res}} = \frac{173.4919978}{0.331540944} = 523.29$$

This agrees with the STATISTICA result F(1,577) = 523.29.

F > 3.858

● Conclusion: We have sufficient evidence to reject Ho.

● Implication: Albumin and the Albumin-Globulin ratio have a linear relationship
with one another, and it is possible to predict Albumin using the regression
equation and the value of the Albumin-Globulin ratio.
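The ANOVA quantities can be recomputed from the sums quoted in the computation above. The following is a minimal Python sketch under two labeled assumptions: that 101.204475 is the cross-product sum S_xy, and that the slope is the b value 1.714272 from the STATISTICA regression summary. The function name `regression_anova` is illustrative. Under these assumptions the F ratio comes out close to the STATISTICA value F(1,577) = 523.29.

```python
def regression_anova(sum_y, sum_y_sq, s_xy, b1, n):
    """Return the ANOVA quantities for a simple linear regression."""
    s_yy = sum_y_sq - sum_y ** 2 / n   # total variation
    ss_reg = b1 * s_xy                 # variation explained by the regression
    ss_res = s_yy - ss_reg             # residual variation
    ms_reg = ss_reg / 1                # one predictor, so df = 1
    ms_res = ss_res / (n - 2)
    return {
        "s_yy": s_yy,
        "ss_reg": ss_reg,
        "ss_res": ss_res,
        "ms_res": ms_res,
        "f": ms_reg / ms_res,
        "r_squared": ss_reg / s_yy,
    }


# Sums from the text; S_xy and b1 are the assumptions stated above
anova = regression_anova(sum_y=1817.2, sum_y_sq=6068.1,
                         s_xy=101.204475, b1=1.714272, n=579)
for name, value in anova.items():
    print(f"{name} = {value:.6f}")
print("Reject Ho:", anova["f"] > 3.858)
```

The `r_squared` entry gives a quick cross-check against the Multiple R² reported by the software.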

2. Hypothesis testing of significance of Linear Relationship


● Null and Alternative Hypothesis

○ Ho: ρ = 0; The level of albumin does not have a significant correlation
with the Albumin-Globulin ratio.
○ Ha: ρ ≠ 0; The level of albumin does have a significant correlation with
the Albumin-Globulin ratio.
● Level of significance: α = 0.05
● Choice of Test Statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

● Decision Rule:
○ Reject Ho if |t| > t(α/2 = 0.025, df = 577) = 1.984

● Computation:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.6896\sqrt{579-2}}{\sqrt{1-0.4756}} = 22.87$$

● Conclusion: We have sufficient evidence to reject Ho.


● Implication: There is a significant relationship between the Albumin and the
Albumin-Globulin ratio.
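The t statistic for the correlation can be checked with the sketch below. It is a minimal Python example, assuming the value r = 0.68963234 from the STATISTICA output and the critical value 1.984 used in the decision rule; the function name `correlation_t` is illustrative. In simple linear regression the square of this t statistic equals the ANOVA F statistic, which gives a convenient cross-check against F(1,577) = 523.29.

```python
import math


def correlation_t(r, n):
    """t statistic for testing whether the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Values taken from the STATISTICA output
t_stat = correlation_t(0.68963234, 579)
reject_h0 = abs(t_stat) > 1.984  # decision rule from the text
print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.2f}, reject Ho: {reject_h0}")
```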


Bibliography

Indian liver patient records. (n.d.). Kaggle: Your Machine Learning and Data Science
Community. https://www.kaggle.com/uciml/indian-liver-patient-records

Liver problems - Symptoms and causes. (2020, February 21). Mayo Clinic.
https://www.mayoclinic.org/diseases-conditions/liver-problems/symptoms-causes/syc-20374502

Liver: What it does, disorders & symptoms, staying healthy. (n.d.). Cleveland Clinic.
https://my.clevelandclinic.org/health/articles/21481-liver

Confidence intervals - SAGE research methods. (2007). SAGE Research Methods: Find
resources to answer your research methods and statistics questions.
https://methods.sagepub.com/reference/encyclopedia-of-measurement-and-statistics/n102.xml

Simple linear regression. (n.d.).
https://www.jmp.com/en_ph/statistics-knowledge-portal/what-is-regression.html

Statistics - Hypothesis testing. (n.d.). Encyclopedia Britannica.
https://www.britannica.com/science/statistics/Hypothesis-testing

A study on the temporal trends in the etiology of cirrhosis of liver in coastal eastern
Odisha. (2020, January). PubMed Central (PMC).
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7376596/


Bilirubin test: Test details & results. (n.d.). Cleveland Clinic.
https://my.clevelandclinic.org/health/diagnostics/17845-bilirubin

Total bilirubin (Blood) - Health encyclopedia - University of Rochester Medical Center.
(n.d.). Welcome to URMC - Rochester, NY - University of Rochester Medical Center.
https://www.urmc.rochester.edu/encyclopedia/content.aspx?contenttypeid=167&contentid=total_bilirubin_blood

Bilirubin blood test. (n.d.). Mount Sinai Health System.
https://www.mountsinai.org/health-library/tests/bilirubin-blood-test#:~:text=Normal%20Results,1.71%20to%2020.5%20µmol%2FL)

University of Michigan Health. (2020, September 23). Liver Function Panel.
https://www.uofmhealth.org/health-library/tr6148#tr6149

Rosenthal, P., Pincus, M., & Fink, D. (1984). Sex- and age-related differences in
bilirubin concentrations in serum. Clinical Chemistry, 30(8), 1380–1382.
https://doi.org/10.1093/clinchem/30.8.1380

Guy, J., & Peters, M. G. (2013). Liver disease in women: the influence of gender on
epidemiology, natural history, and patient outcomes. Gastroenterology & Hepatology,
9(10), 633–639.

Kim, I. H., Kisseleva, T., & Brenner, D. A. (2015). Aging and liver disease. Current
Opinion in Gastroenterology, 31(3), 184–191.
https://doi.org/10.1097/MOG.0000000000000176

Chernoff, R. (2004). Protein and older adults. Journal of the American College of
Nutrition, 23(6 Suppl), 627S–630S. https://doi.org/10.1080/07315724.2004.10719434.
PMID: 15640517.

Wiwanitkit, V. (2001). High serum alkaline phosphatase levels, a study in 181 Thai adult
hospitalized patients. BMC Family Practice, 2(1), 1–4.

Lowe, D., Sanvictores, T., & John, S. (2021). Alkaline Phosphatase. In StatPearls.
StatPearls Publishing, Treasure Island (FL). PMID: 29083622.
