
Introduction

Body mass index (BMI) is a widely used measure of body fat. It is defined as a person's
weight (kg) divided by the square of their height (m²). Usually, an adult is classified as
overweight if their BMI is over 25 and as obese if it is over 30. Since 1975, the number of
obese people in the world has nearly tripled. According to data provided by WHO (2015), in
2016, 36% of adults were overweight and 13% were obese. Being overweight or obese
increases a person’s risk of developing non-communicable diseases, such as heart
disease, several forms of cancer and other chronic diseases [Fontaine et al. (2003)].
Therefore, it is crucial to find the factors that lead to overweight and obesity, in other
words, the contributors to rising BMI. In the past few decades, a large number of scholars
have studied this topic. For instance, Yu (2012) found that the level of education affects
the BMI of different genders, ages and races differently. In particular, compared with
college graduates, less educated white people and young black women are more likely to
be obese. Besides the level of education, many other factors affect BMI, including physical
activity [Galani and Schneider (2007)], drinking [Colditz et al. (1991)], smoking habits
[Cawley and Scholder (2013)], socio-economic factors [Cohen et al. (2013)], and other
factors, such as marital status [Sobal, Rauschenbach and Frongillo (1992)], length of
residence in the United States [Oza-Frank and Cunningham (2010)] and depression
[Faith et al. (2011)].
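
The definition and cut-offs above can be sketched in a few lines of Python. This is only an illustration: the function names `bmi` and `classify` and the example person (80 kg, 1.75 m) are assumptions made here, not part of the study.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def classify(bmi_value: float) -> str:
    """Label a BMI using the adult cut-offs quoted in the text (over 25, over 30)."""
    if bmi_value > 30:
        return "obese"
    if bmi_value > 25:
        return "overweight"
    return "not overweight"


# Hypothetical person: 80 kg and 1.75 m tall.
example = bmi(80.0, 1.75)
print(round(example, 2), classify(example))
```

For the hypothetical person above, the BMI is about 26.12, which falls in the overweight range.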

What causes obesity

One of the simplest explanations for obesity is that a person's calorie intake is much
greater than the calories they expend. To explore the factors that lead to obesity, calorie
intake and calorie expenditure can therefore be analyzed separately. Calorie intake is
related to people’s eating habits: people have to eat a certain amount of calories to
maintain the energy required for daily life. However, different kinds of food bring
different levels of fullness and calories. Some low-calorie foods, such as oats, corn and
potatoes, bring a strong sense of fullness, while other foods, such as chocolate and fried
chicken, contain far more calories for the same level of fullness. Therefore, different eating
habits can have a large impact on people's BMI. The other part of the analysis is calorie
expenditure: to increase it beyond basic energy needs, more physical exercise is required.
Thus, exercise of different intensity and frequency can also have a large impact on
people's BMI. For example, in the choice of commuting method, more energy is expended
if people walk or ride a bicycle rather than travel by car.
Empirical
(Figure 1)
(Here is the model data based on the factors in the reference paper, with a little
explanation)

Insights

Gao et al. (2017), in ‘Variable selection for a categorical varying-coefficient model with
identifications for determinants of body mass index’, list many factors that affect BMI.
These factors are grouped into three categories: lifestyle factors, socio-economic factors
and other factors. Some of them directly influence calorie intake and expenditure, while
others are indirectly related; they are summarized in the table below (Table 1). These
factors are used in the first stage of the BMI modelling.

Based on medical research, the frequency of smoking and drinking is expected to influence
calorie intake directly. According to Mineur et al. (2011) and Chiolero et al. (2008),
nicotine suppresses appetite; since food is the main source of daily energy, suppressing
appetite is equivalent to reducing calorie intake. In addition, the medical journal Diabetes
pointed out that smoking is one of the simplest and most reliable ways to lose weight for
light smokers (<= 10 cigarettes/day). Alcohol intake also promotes weight loss to some
extent. According to Colditz et al. (1991), as alcohol intake increases, people tend to
reduce their sucrose intake, so carbohydrate intake decreases as well. The frequency and
intensity of physical activity directly affect calorie expenditure. According to Hill and
Peter (1998), people who exercise regularly or engage in manual labor are less likely to be
obese than those who are often sedentary or hardly exercise.

Besides those direct factors, the first stage of modelling treats the levels of education and
income as indirect contributors to people’s calorie intake and expenditure. It is commonly
assumed that people with more education know more about health and that those with
higher incomes are more willing to adopt a healthy lifestyle. People with higher education
and income levels are therefore expected to live a healthier life and to be more disciplined
about keeping fit, so they may have a lower risk of obesity.

Variable    Definition
Smk_ed      1 if current every day smoker, 0 otherwise
Smk_sd      1 if current some day smoker, 0 otherwise
Smk_f       1 if former smoker, 0 otherwise
Cigsday     Number of cigarettes per day
Alclyr      1 if ever had 12+ drinks in any one year, 0 otherwise
Alc_life    1 if had 12+ drinks in entire life, 0 otherwise
Alc_c1      1 if current infrequent drinker, 0 otherwise
Alc_c2      1 if current light drinker, 0 otherwise
Alc_c3      1 if current moderate drinker, 0 otherwise
Alc_c4      1 if current heavier drinker, 0 otherwise
Vig_l1      1 if do vigorous activities less than once per week, 0 otherwise
Vig_l2      1 if do vigorous activities more than one time and less than three times per week, 0 otherwise
Vig_l3      1 if do vigorous activities more than three times per week, 0 otherwise
Mod_l1      1 if do light/moderate activities less than once per week, 0 otherwise
Mod_l2      1 if do light/moderate activities more than one time and less than three times per week, 0 otherwise
Mod_l3      1 if do light/moderate activities more than three times per week, 0 otherwise
Str_l1      1 if do strength activities less than once per week, 0 otherwise
Str_l2      1 if do strength activities more than one time and less than three times per week, 0 otherwise
Str_l3      1 if do strength activities more than three times per week, 0 otherwise
Educ1       Number of years of school completed
lnincome    Natural logarithm of earnings last year
(Table 1)

Therefore, combining all the factors related to calorie intake and expenditure, the first
stage of the model (Model 1) is:

\[ BMI = \beta_0 + \beta_{smk\_ed} \times Smk\_ed + \beta_{smk\_sd} \times Smk\_sd + \beta_{smk\_f} \times Smk\_f + \beta_{cigsday} \times Cigsday + \beta_{alclyr} \times Alclyr + \beta_{alc\_life} \times Alc\_life + \cdots \]

After processing the dataset, the estimation results are summarized as:


(Figure II)

Plugging the estimated coefficients into the model gives:

\[ BMI = 26.88655 - 0.11831 \times Smk\_ed + 0.33812 \times Smk\_sd + 0.66830 \times Smk\_f - 0.02719 \times Cigsday + 0.94337 \times Alc\ldots \]

The results show that, among the smoking-related factors, the more cigarettes smoked per
day, the lower the BMI. In terms of smoking status, only current every-day smokers have a
lower BMI. As for the alcohol-related factors, surprisingly, those who have had 12+ drinks
and infrequent drinkers show a positive relationship with BMI, whereas light, moderate
and heavier drinking have negative effects on BMI. In terms of calorie expenditure,
vigorous and strength activities have a negative effect on BMI, while moderate activities
are associated with a higher BMI. In the model, education level has a negative relationship
with BMI, whereas income level has a positive one.
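
As a small worked example of reading these coefficients, the sketch below adds up only the smoking-related terms of the fitted equation, with all other variables held fixed. The function name `smoking_contribution` and the two example people are hypothetical; the coefficients are the ones reported in the fitted Model 1.

```python
# Smoking-related coefficients copied from the fitted Model 1 equation.
COEF = {
    "Smk_ed": -0.11831,   # current every day smoker
    "Smk_sd": 0.33812,    # current some day smoker
    "Smk_f": 0.66830,     # former smoker
    "Cigsday": -0.02719,  # per cigarette smoked per day
}


def smoking_contribution(smk_ed: int, smk_sd: int, smk_f: int, cigsday: float) -> float:
    """Sum of the smoking terms only; every other regressor is held fixed."""
    return (
        COEF["Smk_ed"] * smk_ed
        + COEF["Smk_sd"] * smk_sd
        + COEF["Smk_f"] * smk_f
        + COEF["Cigsday"] * cigsday
    )


# Hypothetical every-day smoker with 10 cigarettes per day.
daily_smoker = smoking_contribution(1, 0, 0, 10)
# Hypothetical former smoker who no longer smokes.
former_smoker = smoking_contribution(0, 0, 1, 0)
print(round(daily_smoker, 5), round(former_smoker, 5))
```

Under these assumptions the every-day smoker's predicted BMI is about 0.39 lower, and the former smoker's about 0.67 higher, than that of an otherwise identical person in the baseline smoking category.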

F-Test
An F-test is any statistical test in which the test statistic has an F-distribution under the
null hypothesis. It is most often used to compare statistical models that have been fitted to a
data set, in order to identify the model that best fits the population from which the data were
sampled. Exact "F-tests" mainly arise when the models have been fitted to the data using the
method of least squares. The name was coined by George W. Snedecor in honour of Ronald
Fisher, who first developed the statistic as the variance ratio in the early 20th century.
Most F-tests are constructed from a decomposition of the variability in the data into sums of
squares. The test statistic in an F-test is the ratio of two scaled sums of squares reflecting
different sources of variability. These sums of squares are constructed so that the statistic
tends to be greater when the null hypothesis is not true. For the statistic to follow the
F-distribution under the null hypothesis, the sums of squares should be statistically
independent and each should follow a scaled χ²-distribution. The latter condition is
guaranteed if the data values are independent and normally distributed with a common
variance.
Multiple-comparison ANOVA problems
The F-test in one-way analysis of variance (ANOVA) is used to assess whether the expected
values of a quantitative variable differ across several pre-defined groups. Consider, for
example, a medical trial that compares four treatments. The ANOVA F-test can be used to
assess whether any of the treatments is on average superior or inferior to the others, against
the null hypothesis that all four treatments yield the same mean response. This is an example
of an omnibus test: a single test performed to detect any of several possible differences.
Alternatively, we could carry out pairwise tests among the treatments (for instance, in the
medical trial example with four treatments we could carry out six tests among pairs of
treatments). The advantage of the ANOVA F-test is that we do not need to pre-specify which
treatments are to be compared, and we do not need to adjust for multiple comparisons. Its
disadvantage is that, if we reject the null hypothesis, we do not know which treatments are
significantly different from the others. In addition, if the F-test is performed at level α, we
cannot conclude that the treatment pair with the largest mean difference is significantly
different at level α.
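
The one-way ANOVA F statistic described above is the between-group mean square divided by the within-group mean square. The following is a minimal sketch using only the Python standard library; the function name `one_way_anova_f` and the four small treatment samples are made up for illustration and are not data from this study.

```python
from itertools import combinations
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean(x for g in groups for x in g)
    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variability of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k


# Hypothetical responses under four treatments.
treatments = [[1, 2, 3], [2, 3, 4], [5, 6, 7], [6, 7, 8]]
f_stat, df1, df2 = one_way_anova_f(treatments)
n_pairs = len(list(combinations(range(len(treatments)), 2)))
print(f_stat, df1, df2, n_pairs)
```

With four groups there are six possible pairwise comparisons, which is the count quoted in the text; the single omnibus F statistic replaces those six separate tests.
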
Consider two regression models, model 1 and model 2, where model 1 is "nested" within
model 2. Model 1 is the restricted model and model 2 is the unrestricted one. That is, model 1
has p1 parameters and model 2 has p2 parameters, where p1 < p2, and for any choice of
parameters in model 1, the same regression curve can be achieved by some choice of the
parameters of model 2.
One common use is to decide whether a model fits the data significantly better than a naive
model. In a naive model, the only explanatory term is the intercept, so all predicted values of
the dependent variable equal that variable's sample mean. The naive model is the restricted
model, since the coefficients of all potential explanatory variables are restricted to equal
zero.
Another common use is to decide whether there is a structural break in the data. Here the
restricted model uses all the data in one regression, while the unrestricted model uses
separate regressions for two different subsets of the data. This use of the F-test is known as
the Chow test. The model with more parameters will always be able to fit the data at least as
well as the model with fewer parameters. Thus model 2 typically gives a better (lower error)
fit to the data than model 1. The question is whether model 2 gives a significantly better fit
to the data, and an F-test is one way to answer it.
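
For nested least-squares models, the standard F statistic compares the residual sums of squares of the two fits: F = ((RSS1 - RSS2) / (p2 - p1)) / (RSS2 / (n - p2)), where n is the number of observations. The sketch below computes it from values that are assumed to be already available; the function name `nested_f_statistic` and the example numbers are hypothetical.

```python
def nested_f_statistic(rss1: float, rss2: float, p1: int, p2: int, n: int) -> float:
    """F statistic for a restricted model 1 nested in an unrestricted model 2.

    rss1, rss2: residual sums of squares of model 1 and model 2
    p1, p2:     number of parameters in each model (p1 < p2)
    n:          number of observations (n > p2)
    """
    if not (p1 < p2 < n):
        raise ValueError("require p1 < p2 < n")
    numerator = (rss1 - rss2) / (p2 - p1)   # improvement per added parameter
    denominator = rss2 / (n - p2)           # residual variance of model 2
    return numerator / denominator


# Hypothetical fits: adding two parameters lowers the RSS from 120 to 100.
print(nested_f_statistic(rss1=120.0, rss2=100.0, p1=2, p2=4, n=54))
```

Under the null hypothesis that model 2 fits no better than model 1, this statistic follows an F-distribution with (p2 - p1, n - p2) degrees of freedom, so a large value is evidence in favour of the larger model.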

Methodology

The F-test for overall significance indicates whether a linear regression model fits the data
better than a model that does not include any independent variables. This section relates the
F-test to R-squared (R2) and other regression statistics. R2 measures how well the model fits
the given data, and the overall F-test assesses whether that fit is statistically significant.
F-tests are flexible and have a wide variety of applications. Because an F-test can assess
several model terms at the same time, it can be used to compare the fits of different linear
models. In contrast, a t-test can assess only a single term at a time.
Compare the p-value of the F-test with the chosen significance level. If the p-value is lower
than the significance level, the sample data provide sufficient evidence that the regression
model fits the data better than the model that does not include any independent variables.
Typically, if none of the individual coefficients is significant, the overall F-test is not
significant either. However, the separate tests do not always agree with one another. The
F-test examines all of the coefficients jointly, while each t-test examines one coefficient on
its own, so the F-test can find the coefficients jointly significant even when none of the
individual t-tests is significant.
These results may seem to contradict one another, but they do not. The F-test indicates that
it is very unlikely that all of the coefficients are zero when the predictive power of all of
the independent variables is combined. However, each variable may not be predictive enough
on its own to be statistically significant. In other words, the sample provides sufficient
evidence to conclude that the model is significant, but not enough to conclude that any
individual variable is significant.
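
The link between R2 and the overall F-test can be made explicit: for a linear model with an intercept, k independent variables and n observations, the overall F statistic equals (R2 / k) / ((1 - R2) / (n - k - 1)). The sketch below evaluates that identity; the function name `overall_f_from_r2` and the example values are assumptions chosen for illustration.

```python
def overall_f_from_r2(r2: float, k: int, n: int) -> float:
    """Overall-significance F statistic from R-squared.

    r2: coefficient of determination of the fitted model (0 <= r2 < 1)
    k:  number of independent variables (excluding the intercept)
    n:  number of observations
    """
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    if k < 1 or n <= k + 1:
        raise ValueError("need k >= 1 and n > k + 1")
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


# Hypothetical model: 5 regressors, 106 observations, R-squared of 0.5.
print(overall_f_from_r2(0.5, k=5, n=106))
```

The resulting value is compared with an F-distribution with (k, n - k - 1) degrees of freedom to obtain the p-value discussed above.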

Discussion

We first used (2.5) to compute the optimal bandwidths, and the results are summarized.
Although all three demographic variables are relevant, their influence on body mass index (BMI)
differs considerably. Gender and race have a more substantial impact than age, since their
smoothing parameters are substantially smaller. We next use our technique to determine, based on
these smoothing parameters, which regressors of BMI are important and which are not. The
weight parameter is set to 3.2 based on the updated BIC criterion (2.13). Only 24 of the
48 regressors are identified as significant predictors of BMI. Our assessment suggests an
association between physical activity and BMI, although the intensity and frequency of exercise
do matter (Obozinski et al., 2011). For instance, there is little to no difference in BMI between
performing this sort of exercise less than once a week and never doing it, but there is a small
but noticeable difference in BMI if you begin doing it more than once a week. People must
engage in light to moderate activity more than three times weekly to see a change in their BMI.
Encouraging individuals to exercise vigorously or moderately and more frequently, rather than
simply urging them to exercise more often, is one policy recommendation from our study that
might help decrease overweight and obesity.
Whether and how often someone consumes alcohol or smokes also has an impact on their
body mass index. Computer use has no apparent effect. An individual's BMI may be partly
predicted by their socioeconomic status, with level of education, income, OSC, and whether or
not they have recently seen a doctor being significant regressors. In contrast, job conditions,
working hours, homeownership, health insurance, and medical care expenditure are not
associated with BMI. The main determinants of body
mass index (BMI) are citizenship, region of residence, and length of time in the United States.
However, there is no correlation between BMI and citizenship status, area of the home, or length
of time in the U.S (Jacob et al., 2009).
To draw comparisons, we also computed results using the group LASSO method, the LASSO applied
to models (3.1) and (3.2), and the stepwise approach applied to models (3.1) and (3.2). The X
and Z in models (3.1) and (3.2) are as specified in Section 4.1. It is important to note that
our approach and the group LASSO method use the same variables. For each model, we calculate
the root-mean-squared leave-one-out prediction error (RME),
\[ \mathrm{RME} = \sqrt{\sum_{i=1}^{N} (Y_i - \hat{Y}_{(-i)})^2 / N}, \]
where \hat{Y}_{(-i)} is the prediction for the i-th subject when subject i is removed from the
analysis. Our technique outperforms the other five methods, with the lowest RME. The group
LASSO approach comes second, and the LASSO applied to model (3.2) comes third. Model (3.1)
performs poorly when compared to its LASSO-based counterpart. The stepwise approach performs
worst. These results demonstrate the advantage of our technique and the wide range of factors
that might influence body mass index.
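
The leave-one-out RME can be sketched as follows. To keep the example self-contained, it uses an ordinary one-regressor least-squares line as the fitted model, which is a stand-in chosen here and not the varying-coefficient model of the study; the function names and the toy data are likewise hypothetical.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def loo_rme(xs, ys):
    """Root-mean-squared leave-one-out prediction error."""
    squared_errors = []
    for i in range(len(xs)):
        # Refit the model with subject i removed.
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # Predict the held-out subject and record the squared error.
        squared_errors.append((ys[i] - (a + b * xs[i])) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))


# Toy data: an exactly linear response and a noisy version of it.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
exact = [3.0, 5.0, 7.0, 9.0, 11.0]
noisy = [3.2, 4.7, 7.4, 8.8, 11.1]
print(loo_rme(xs, exact), loo_rme(xs, noisy))
```

A lower RME means better out-of-sample prediction, which is the sense in which the methods above are ranked.
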
An unregularized post-selection estimate of the varying-coefficient model is computed to
determine the impact of the significant regressors on BMI. Because of space limitations, point
estimates and confidence interval estimates for the effects of the important factors on body
mass index (BMI) across demographic groups are provided in the supplementary file by Gao et al.
(2017). Taken as a whole, the estimated coefficients show that the selected variables have
significant impacts on BMI. None of these regressors has zero effect across all 32 groups: the
95% CIs for their effects in the 32 groups do not all include zero.
We next consider the effect of the U.S.-born regressor on body mass index across the 32
demographic groups, with the demographic groups shown along the horizontal axis. First, the
post-selection estimates of the effect of being U.S.-born on body mass index are positive
across all demographic groups. This supports the view that being born in the United States
plays a significant role in determining body mass index (Tibshirani & Taylor, 2011). Second,
the effect of being U.S.-born on BMI varies among the 32 groups. When comparing groups of the
same age and race, such as groups 1 and 17, groups 2 and 18, and so on, the effects are more
pronounced for males. In addition, these differences are larger for Asians than for other
races. There is no overlap between the 95% confidence intervals (CIs) for groups 3, 19, 7, 23,
11, 27, and 31. When two groups of the same age and gender are compared, Asians are more likely
to show a significant difference due to being U.S.-born. Among the four youngest male age
groups, being born in the United States raises the body mass index (BMI) of Asians by 12.78
percentage points; this is more than the 6.11 percentage point, 11.24 percentage point, and
8.69 percentage point increases seen among whites and people of color, respectively.
Conclusion

We provide a variable selection approach for the categorical varying-coefficient model to
address several modeling and statistical challenges in the existing body of research on body
mass index (BMI). We use data from the 2013 National Health Interview Survey (NHIS), conducted
in the United States (Zheng & Peng, 2017), to investigate the relationship between a wide
variety of potential determinants found in the research literature and BMI, as well as obesity.
We employ a group LASSO approach to identify the relatively important BMI determinants, and a
varying-coefficient setup to account for and quantify the different effects of the various BMI
determinants. By analyzing the optimal bandwidths of the demographic variables, we can explain
the relative relevance of each demographic variable in differentiating the effects that
putative determinants have on BMI (Oelker et al., 2014).
We also establish some asymptotic properties of the data-driven technique examined in this
study. Our theoretical results show that, under relatively mild conditions, the correct model
is selected with probability approaching one. The proposed estimator also achieves the same
asymptotic normality as the estimator based on the true (oracle) model, in which no irrelevant
variables are present (Tang et al., 2013).
The asymptotic behavior in the scenario where both P and R tend to infinity has not been
explored. If the indicator function is used in place of all of the kernel functions and the
optimal bandwidth selection is ignored (while also ignoring that the number of demographic
groups increases exponentially with r), the theoretical investigation reduces to the one
carried out by Lounici et al. (2011). However, to the best of our knowledge, the optimal
bandwidths for model (2.2) in the high-dimensional setting are not yet known. Studying this
question further is one of our objectives for the future.
The quantile regression model offers an alternative approach when attention is focused on a
particular range of BMI values, such as low or high BMI, as pointed out by Koenker (2017).
Zhao et al. (2014) use the quantile categorical varying-coefficient model to investigate a
variable selection problem similar to the one discussed here. Our method emphasizes the
differences in effect across categories by using a group LASSO penalty, whereas theirs
emphasizes the fusion of categories for each regressor by using a penalized approach with both
LASSO and fused LASSO penalties (Tibshirani et al., 2005).
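
To make the contrast between the penalties concrete, the sketch below evaluates the textbook forms of the LASSO, group LASSO and fused LASSO penalties on small coefficient vectors. These are the generic unweighted definitions, not necessarily the exact penalties used by Gao et al. (2017) or Zhao et al. (2014); the function names and example coefficients are hypothetical.

```python
from math import sqrt


def lasso_penalty(beta, lam):
    """lam * sum of absolute coefficients: shrinks coefficients one by one."""
    return lam * sum(abs(b) for b in beta)


def group_lasso_penalty(groups, lam):
    """lam * sum of Euclidean norms of each group (unweighted form).

    A whole group of coefficients is kept or dropped together.
    """
    return lam * sum(sqrt(sum(b * b for b in g)) for g in groups)


def fused_lasso_penalty(beta, lam1, lam2):
    """LASSO term plus lam2 * sum of absolute differences of neighbours.

    The second term encourages adjacent coefficients to be equal (fused).
    """
    differences = sum(abs(b - a) for a, b in zip(beta, beta[1:]))
    return lasso_penalty(beta, lam1) + lam2 * differences


# Hypothetical coefficients: one regressor's effects in two groups of categories.
print(group_lasso_penalty([[3.0, 4.0], [0.0, 0.0]], lam=1.0))
print(fused_lasso_penalty([1.0, 1.0, 3.0], lam1=1.0, lam2=2.0))
```

In this form, the group penalty is zero for a group only when every coefficient in it is zero, while the fusion term is zero only when neighbouring coefficients coincide, which mirrors the difference in emphasis described above.
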
Our method uses kernel functions with selected bandwidths, which is a crucial difference from
the quantile regression procedure of Zhao et al. (2014) with respect to explaining the relative
relevance of demographic factors. For studies that are interested only in particular BMI
ranges, it would be beneficial to enable the corresponding quantile categorical
varying-coefficient model to extract information on demographic characteristics. Combining a
suitable bandwidth selection procedure with a group LASSO-type penalty is one approach that
might accomplish this goal. We leave this for future work (Gertheiss & Tutz, 2012).
