


Descriptive Statistics

Group 4:
Vũ Tiến Hải - 11219357
Nguyễn Đình Hiệp - 11212209
Đặng Thuỳ Anh - 11219353
Vũ Kim Ngân - 11219365
Dương Nhật Anh - 11219352

Hanoi, 2023
Table of Contents

A. INTRODUCTION
I. DEFINITION OF DESCRIPTIVE STATISTIC:
1. The necessity of Descriptive Statistics
2. Understanding Descriptive Statistics
III. APPLICATION
B. ARTICLE AND SOURCES
I. ARTICLE SUMMARY
1. The case
2. The purpose
3. The method
Conclusion
II. OTHER RESEARCH AND TECHNIQUES
1. Article 1: Financial Development and Economic Growth relationship
2. Article 2: Application of outlier mining in insider identification based on Boxplot method
III. DATA ANALYSIS
1. Methodology
2. Research question
3. Data Analysis
Conclusion
IV. Conclusion:
List of tables

Table 1. Details of Demographic Factors used in the Present Study
Table 2: Work-life Score Descriptive Statistics
Table 3: Grouped frequency tables of BMI

List of figures

Figure 1: Boxplot of the Daily Trading Volume
Figure 2: The Trend of Daily Closing Price and Turnover Rate
Figure 3: Histogram and descriptive statistics of age
Figure 4: Histogram BMI of policyholders
Figure 5: Histogram of charges paid by policyholders
Figure 6: Box plot of charges in categories of the smoker, sex, region, and children
Figure 7: Correlations between BMI and charges and box plots of charges in categories of BMI

Descriptive statistics is a branch of statistics that deals with methods of organizing,
summarizing, and presenting data in a convenient and informative way. It aims to
provide an overview of the data, allowing for easy interpretation and communication
of the key characteristics of the dataset.

The goal of descriptive statistics is to provide a concise and meaningful summary
of the data. This involves calculating measures such as mean, median, mode, range,
standard deviation, and variance, as well as visualizing the data through histograms,
box plots, and scatter plots.

Descriptive statistics is a fundamental aspect of statistics and is widely used in
many fields, including business, science, medicine, and the social sciences. It can
be used to answer questions such as: What is the average salary of employees in a
company? What is the range of test scores for a group of students? What is the most
common type of car on the road? By answering these questions, descriptive statistics
helps to provide insight into a dataset.

Descriptive statistics can be presented in various forms, such as tables, graphs, or
numerical summaries. Examples of numerical summaries include measures of central
tendency (such as mean, median, and mode), measures of variability (such as range,
standard deviation, and variance), and measures of shape (such as skewness and
kurtosis). Examples of graphical summaries include histograms, box plots, scatter
plots, and frequency polygons.


a. The necessity of Descriptive Statistics

Raw data is collected data that has not yet been processed or cleaned. Descriptive
statistics provides methods of organizing, summarizing, and presenting such
data in a convenient and informative way. This practice makes the data easier to
visualize and to present in a meaningful and understandable
way, and so simplifies the interpretation of the data set in question. Other
benefits of descriptive statistics include:

● Descriptive statistics allows users to present the data in graphical formats, such
as bar charts, pie charts, dot plots, and histograms, which are much easier to
understand and interpret.
● Various statistical measures (e.g., mean, median, and mode) allow users to
summarize the central characteristics of the data. This gives a
rough understanding of where the data values lie.
● The measures of dispersion (e.g., standard deviation) help users understand
how far the data values are spread away from each other.
● The computation of skewness illustrates the shape of the distribution.
● Correlation analysis compares two different characteristics and checks whether
there is any relation between them.

b. Understanding Descriptive Statistics

Determining the statistics you want to output is typically the first step in applying
advanced statistics, and delivering them in the proper manner is the final step. The
steps involved in descriptive statistics are as follows:

Step 1: Define the research question or problem: The first step in descriptive
statistics is to define the research question or problem that you want to investigate.
The research question or problem should be focused and specific, and should provide
a clear direction for the collection and analysis of data. It should be framed in such a
way that it can be answered using quantitative data. It should also be specific enough
to allow for the selection of appropriate variables, data collection methods, and
statistical techniques for analysis. Once the research question or problem has been
identified, researchers can start collecting data.

Step 2: Collecting and Organizing Data: The second step in descriptive statistics is
to collect data. There are many ways to collect data, including surveys, experiments,
interviews, and observational studies.

After the data is collected, a researcher should clean the data by removing
any errors or inconsistencies that could affect the results. Once the data is
clean, it should be entered into a spreadsheet or statistical software program.

Once the data is collected and sorted, one should organize the data in a way that is
easy to analyze. This could involve sorting the data by variables, creating groups or
categories, or summarizing the data with measures of central tendency and variability.

Step 3: Calculate measures of central tendency: Measures of central tendency
describe the center of the data. Three measures are commonly used:

● Mean: The mean is the sum of all the values in a dataset divided by the total
number of values. It is also known as the arithmetic average.
● Mode: The mode is the value that occurs most frequently in a dataset. The
mode can identify the most common value or values in the dataset, but it may
not provide a good representation of the overall distribution of the data.
● Median: The median is the middle value in a dataset when the values are
arranged in order of magnitude. The median is not affected by extreme values,
so it often represents the center better than the mean when the distribution is
skewed or has outliers.
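
As a minimal sketch of the three measures above, the following Python snippet uses the standard-library statistics module on a small list of scores. The data are hypothetical and were invented only to show how one extreme value pulls the mean above the median.

```python
import statistics

# Hypothetical test scores, invented for illustration only.
# The last value (25) is an extreme value.
scores = [4, 7, 7, 8, 9, 10, 25]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print(mean_score, median_score, mode_score)
```

In this made-up example the mean is larger than the median because of the single extreme value, which is the behaviour described in the list above.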

Step 4: Measures of Dispersion: These are statistical tools used to describe the spread
or distribution of a dataset. Here is a brief overview of each measure of dispersion:

● Range: The range is the difference between the largest and smallest values in a
dataset. It provides a rough estimate of the spread of the data, but it can be
affected by extreme values and may not be a robust measure of variability.
● Variance: The variance is a measure of how far the values in a dataset are
spread out from the mean. It is calculated by taking the average of the squared
differences between each value and the mean.
● Standard deviation: This is the square root of the variance. It provides a more
intuitive measure of dispersion that is expressed in the same units as the
original data. The standard deviation provides a good estimate of the spread of
the data, but it may not be a robust measure of variability in the presence of
outliers.
● Coefficient of variation: The coefficient of variation is the ratio of the standard
deviation to the mean, expressed as a percentage. It provides a measure of
relative variability that can be used to compare the variability of different
datasets.
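
The four measures of dispersion can be sketched in a few lines of Python. The values below are hypothetical, and the variance is computed as the average of the squared differences from the mean (the population form described above), which is an assumption about which form the text intends.

```python
import statistics

# Hypothetical data values, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)   # largest value minus smallest value
variance = statistics.pvariance(values)  # average squared distance from the mean
std_dev = statistics.pstdev(values)      # square root of the variance

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv_percent = std_dev / statistics.mean(values) * 100

print(data_range, variance, std_dev, cv_percent)
```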

Step 5: Data Visualization: The data can be visualized using various graphical
representations such as histograms, scatter plots, bar graphs, and pie charts. This helps
to get a quick understanding of the distribution and pattern of the data. Several
types of data visualization are commonly used in descriptive statistics, including:

● Bar graphs: a chart that represents the data using rectangular bars with lengths
proportional to the values they represent.
● Scatter plots: a graph that represents the relationship between two variables.
Each point on the plot represents a single data point, and the position of the
point on the plot corresponds to the value of the two variables.
● Box plots: a chart that represents the distribution of the data using a box with
whiskers. The box represents the interquartile range, while the whiskers
represent the range of the data.
● Histograms: a chart that represents the distribution of the data using bars. The
bars represent the frequency or proportion of the data that falls within each
interval.
Other data visualizations include heat maps, pie charts, and line graphs.

Step 6: Interpretation: Interpretation of descriptive statistics involves making sense

of the results obtained from analyzing a dataset using various statistical techniques.
The interpretation of descriptive statistics requires understanding the context in which
the data was collected, the specific research question being investigated, and the
statistical methods used, including the measures of central tendency, measures of
dispersion, and data visualization.

 Besides description of a data set, we can also read and interpret data through
describing the relationship between two variables using covariance and
correlation coefficient.

There are two types of covariance: sample covariance and population covariance.
Sample covariance is calculated using a sample of data, while population covariance
uses the entire population data. Sample covariance is an estimator of population
covariance. Dividing the sum of cross-products by the sample size n gives a biased
estimate, so the sample formula divides by n - 1 to obtain an unbiased estimate.
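
In standard textbook notation, the population and sample covariance are usually written as shown here. The symbols N, n, \mu and \bar{x} are assumed for this sketch, since the text does not define its own notation.

```latex
\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y),
\qquad
s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
```

Here N is the population size with means \mu_x and \mu_y, and n is the sample size with sample means \bar{x} and \bar{y}.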

Covariance can be positive, negative, or zero. A positive covariance indicates that
the two variables tend to move together in the same direction, while a negative
covariance indicates they tend to move in opposite directions. A covariance of zero
means that the two variables have no linear relationship, although it does not by
itself guarantee that they are independent.

However, the magnitude of the covariance depends on the scale and units of the
variables: the covariance of two variables with large values
or different units of measurement may be much larger than the covariance of two
variables with smaller values or similar units of measurement, even if the strength of
the relationship is the same. To overcome this disadvantage, the correlation coefficient is
more commonly used, as it standardizes the covariance by dividing it by the product of
the standard deviations of the two variables, which allows for more meaningful
comparisons.

The correlation coefficient is a measure that indicates the strength and direction of
the linear relationship between two variables. It ranges between -1 and 1, where a value of -1
indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a
perfect positive correlation.
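
The difference between covariance and correlation can be illustrated with a short Python sketch. The height and weight values are hypothetical, and the helper functions covariance and correlation are written here for illustration (using the sample form that divides by n - 1). Changing the unit of one variable from metres to centimetres multiplies the covariance by 100 but leaves the correlation coefficient unchanged.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equally long lists (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements, invented for illustration only.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [50, 58, 65, 74, 80]
heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)

print(cov_m, cov_cm)  # the covariance changes with the unit
print(r_m, r_cm)      # the correlation does not
```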

- Weather Forecasting
Statistics is used heavily in the field of weather forecasting. Analysis of trends can
be useful in depicting and predicting the changing patterns and erraticism of some
climatic parameters. This analysis gives a proper knowledge about the changing
conditions of the climate and its effects, by the evaluation of meteorological
parameters. In particular, probability is used by weather forecasters to assess how
likely it is that there will be rain, snow, clouds, etc. on a given day in a certain area.
Forecasters will regularly state that “there is a 90% chance of rain today
after 5 PM” to indicate that there’s a high likelihood of rain during certain hours.

- Health care
Health statistics are used to understand risk factors for communities, track and
monitor diseases, see the impact of policy changes, assess the quality and safety of
health care and determine how likely it is that certain individuals will spend a certain
amount on healthcare each year. Health statistics are a form of evidence, or facts that
can support a conclusion.
About Health Statistics Modules:
o Correlates: See how to measure the risk factors and protective factors that
impact our health.
o Conditions: Learn to assess how often and how badly diseases impact a
community.
o Care: Dig into how healthcare is delivered to the communities that need it, to
treat disease and illness.
o Costs: Get more information on what health care costs, and why.
For example, an actuary at a health insurance company might use factors like age,
existing medical conditions, current health status, etc. to determine that there’s an 80%
probability that a certain individual will spend $10,000 or more on healthcare in a
given year.

- Traffic
Traffic engineers regularly use statistics to monitor total traffic in different areas of a
city and how it changes over time, which allows them to decide whether to add or remove roads to
optimize traffic flow. Also, statistical
methods enable planners to construct and validate microsimulations of traffic
congestion and explore alternative designs. These models are also useful in planning
for emergencies, such as managing the evacuation of a major city threatened by a

- Investing
An investor wants to assess the likelihood of a certain investment paying off. Using
statistics and probability, they can calculate the expected return and the risk associated
with the investment. For instance, an investor might use descriptive statistics to
determine the average return of a particular stock over the last five years, the volatility
of its returns, and the correlation between its returns and those of other stocks. Based
on this information, the investor can decide how much of their portfolio to allocate to
the stock.

- Medical Studies
Medical professionals use descriptive statistics to understand how different factors are
related in a population. For example, they may use correlation to analyze the
relationship between smoking habits and the risk of developing lung cancer. By
looking at a large sample of people, they can determine how strong this correlation is
and estimate the likelihood of an individual developing lung cancer based on their
smoking habits.

- Manufacturing
Manufacturing engineers use descriptive statistics to monitor the efficiency of
different production processes. For instance, they might take a random sample of
products from a production line and calculate the proportion of defective items. If the
proportion of defective items is higher than a certain acceptable level, they can use
descriptive statistics to identify the causes of the defects and implement corrective
actions to improve the process.



Arun Raj R., M.Pharm., MBA, Lecturer, Dept. of Pharmaceutical Science, Mahatma
Gandhi University, RIMSR, Kottayam, Kerala, India.
Dr. Hareesh N Ramanathan, MBA, M.Phil, Ph.D., Professor & Head, Dept. of
Management Studies, Toc H Institute of Science & Technology, Cochin, Kerala,
India.

1. The case
Employee engagement is a critical factor for organizational success. Many of the
factors that affect employee engagement have been identified, and one of them is
work-life balance. Work-life balance is the ability to balance work obligations and
personal life responsibilities, which is crucial for individuals to achieve physical,
emotional, and mental well-being. Without a healthy work-life balance, paramedical
employees may experience burnout, stress, and other negative health outcomes, and may
struggle to balance these responsibilities with their work. In turn, employers are
recognizing the need to create a work environment that supports employees' well-being
and promotes productivity.

This article introduces the work-life balance scores and their influences on the
equilibrium between the private life and work life of paramedical employees. The data
collected through questionnaires will be analyzed to identify common themes and
patterns related to work-life balance and to provide insights into the challenges and
opportunities for improving employees' work-life balance.

2. The purpose
The aim of this paper is to find out the work-life status of paramedical employees
in a private hospital and the variations in work-life among different categories of
medical workers. The study can provide insights into the specific challenges and
opportunities related to this field and identify effective strategies for supporting
employees' well-being and promoting a healthy work-life balance.

3. The method
To carry out the research, 116 paramedical employees were
chosen from a private hospital. Questionnaires were circulated, and the data were collected
and analyzed using appropriate statistical tools.
The employees' answers are converted into the characteristics of a data set, which are
summarized through descriptive statistics.

Descriptive statistics are brief informational coefficients that summarize a given
data set, which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central tendency
and measures of dispersion. The measure of center included in the article is the mean,
while the measures of dispersion included are the range, standard deviation, and
variance.

c. Measures of central tendency

A measure of central tendency is a summary measure that attempts to describe a
whole set of data with a single value that represents the middle or center of its
distribution.
d. Mean
The mean is the sum of the values of all observations in a dataset divided by the
number of observations. It can be used to get an
overall idea or picture of the data set.

e. Measures of dispersion
In statistics, measures of dispersion describe how spread out the values of
a distribution are around its center. In simple terms, they show how squeezed
or scattered the variable is.

a. Range
Range is the difference between the largest and the smallest observation in the
data. The range generally gives you a good indicator of variability when you have a
distribution without extreme values. When paired with measures of central tendency,
the range can tell you about the span of the distribution. To calculate the range, you
need to find the largest observed value of a variable (the maximum) and subtract the
smallest observed value (the minimum).

Range = Maximum Value - Minimum Value.

b. Variance
Variance measures how far each number in the set is from the mean, and thus from
every other number in the set. It is calculated by taking the differences between each
number in the data set and the mean, then squaring the differences to make them
positive, and finally dividing the sum of the squares by the number of values in the
data set.

c. Standard Deviation
Standard deviation is a statistic that measures the dispersion of a dataset relative to
its mean and is calculated as the square root of the variance. A
high standard deviation means that values are generally far from the mean, while a
low standard deviation indicates that values are clustered close to the mean.

f. Graphical techniques
Descriptive statistics also provide a way to visualize data using graphs and charts.
Specifically, in this paper, histogram, box plot and frequency table are used to
highlight important features of the data and communicate findings in a more
accessible way.

a. Histogram
In statistics, a histogram is a graphical representation of the distribution of data.
The histogram is drawn as a set of adjacent rectangles, where each
bar represents the number of observations that fall within one interval of values. It is
used to summarize discrete or continuous data that
are measured on an interval scale. It is often used to illustrate the major features of the
distribution of the data in a convenient form.

b. Box plot
In descriptive statistics, a box plot is a method for graphically demonstrating the
locality, spread, and skewness of groups of numerical data through their quartiles. In
addition to the box on a box plot, there can be lines extending from the box indicating
variability outside the upper and lower quartiles. Box plots are used to show
distributions of numeric data values, especially when you want to compare them
between multiple groups.
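
The quantities behind a box plot can be computed without drawing anything. The sketch below uses hypothetical scores (not the study's data) and assumes the common convention that points lying more than 1.5 times the interquartile range beyond the box are flagged as outliers; the text itself does not state which rule was used.

```python
import statistics

# Hypothetical scores, invented for illustration (not the study's data).
scores = [12, 21, 22, 23, 24, 25, 26, 27, 28, 29, 33]

# Quartiles from the statistics module's default method; other software
# may interpolate slightly differently and give slightly different values.
q1, median, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1  # interquartile range: the height of the box

# Assumed convention: values beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(q1, median, q3, outliers)
```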

c. Frequency table
Frequency refers to the number of times an event or a value occurs. A frequency
table is a table that lists items and shows the number of times the items occur. A
frequency table shows the distribution of observations based on the options in a
variable. Frequency tables are helpful to understand which options occur more or less
often in the dataset. This is helpful for getting a better understanding of each variable
and deciding if variables need to be recoded or not.
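
A frequency table can be built by counting how often each option occurs and converting the counts to percentages. The following is a minimal sketch; the category names and responses are hypothetical and are not taken from the article's data.

```python
from collections import Counter

# Hypothetical survey responses, invented for illustration only.
responses = ["young", "young", "middle-aged", "young",
             "middle-aged", "young", "young", "young"]

counts = Counter(responses)  # number of times each option occurs
total = len(responses)

# Map each option to (count, percentage of all responses).
table = {option: (count, round(100 * count / total, 1))
         for option, count in counts.items()}

for option, (count, percent) in table.items():
    print(option, count, percent)
```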

The application of descriptive statistics in the article:

Table 1. Details of Demographic Factors used in the Present Study

Research data were analyzed by using a frequency table. Table 1 summarizes data
by grouping it into intervals and showing how many times each interval occurs.
Through this method, it is easier to identify patterns and trends in the data. As we can
see, 71.6% of respondent employees are youngsters whereas only 28.4% are middle
aged. This is understandable since becoming a medical professional typically requires
a significant amount of education and training, which can take many years to
complete. Younger people may have an advantage in this regard, as they have more
time to invest in their education and training before settling into a long-term career.
Moreover, it may be easier for them to balance work-life status since they are just
starting their careers and may have fewer responsibilities outside of work. Table 1
also shows that 69.8% of employees have their parents staying with them, while 30.2%
do not. A higher percentage of people may live with their parents because it is a
cost-effective way to save money, or because they have a strong emotional bond with
their parents or a desire for companionship and support. People who live with family
members seek a work-life balance because they value their personal life
and want to maintain a healthy and fulfilling relationship with their family. Achieving
work-life balance can help people maintain their motivation, focus, and job
satisfaction, which can in turn help them be more present and engaged with their
family members.

In order to measure the work-life of paramedical employees, a Likert-type scale
was developed by the researcher. It had 15 statements touching on different dimensions
of work-life balance. Each statement is rated on a scale of 1 to 5, where 1 is not
satisfied and 5 is very satisfied. This generates a total score between 15 (15 x 1) and
75 (15 x 5). The benchmark score was therefore set at 45, that is, (15 + 75)/2, the
minimum plus the maximum divided by two. If the mean work-life score is
significantly greater than 45, we can claim that the work-life of paramedical
employees is good (above average).
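
The scoring arithmetic can be written out as a short Python sketch. The number of statements and the rating scale come from the description above; the individual answers of the single respondent are hypothetical and serve only to show how one total score is compared with the benchmark.

```python
# Values taken from the text: 15 statements, each rated from 1 to 5.
N_STATEMENTS = 15
SCALE_MIN, SCALE_MAX = 1, 5

lowest_total = N_STATEMENTS * SCALE_MIN    # every statement rated 1
highest_total = N_STATEMENTS * SCALE_MAX   # every statement rated 5
benchmark = (lowest_total + highest_total) / 2  # midpoint of the score range

# Hypothetical answers of one respondent, invented for illustration.
answers = [4, 3, 4, 5, 3, 4, 4, 2, 3, 4, 5, 3, 4, 3, 4]
total_score = sum(answers)
above_benchmark = total_score > benchmark

print(lowest_total, highest_total, benchmark, total_score, above_benchmark)
```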

Table 2: Work-life Score Descriptive Statistics

The descriptive statistics are presented in Table 2. The
work-life balance mean was found to be 51.1 with a standard deviation of 4.34. The
mean is used to represent the typical value and therefore serves as a yardstick for all
observations. As the mean score is higher than the benchmark score (51.1 compared to
45), it can be concluded that the average paramedical employee from the private hospital
is satisfied with their work-life status.
The minimum score is 38, which is fairly close to the benchmark of 45; this suggests
that nobody is totally dissatisfied with their work-life balance. There are
only some aspects of work-life balance that they are not pleased with, which is normal
for any employee. The maximum score is 61, which is fairly near the mean but well below
the highest score that can be achieved (75). This means that although the work-life of
paramedical employees is good, it is not perfect, and employers still have to improve
the quality of the work-life balance if they want to maintain work efficiency and
employee engagement.
From the table, the variance equals 18.92. However, because the variance is based on
squared deviations, it gives extra weight to outliers, whose values are far from the
mean, and it is expressed in squared units. The standard deviation is therefore often
easier to visualize and apply. The standard deviation in the
table is 4.34974, which means that scores typically deviate from the mean by about 4.35 points.
Standard deviation has no concept of good or bad. It only shows how
tightly or widely the data are spread, here the spread between satisfaction and
dissatisfaction with work-life balance.
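
The link between the two reported figures can be checked directly, since the standard deviation is the square root of the variance. The sketch below uses only the two values quoted in the text and confirms that they agree up to rounding.

```python
import math

# Values reported in the text for the work-life score.
reported_variance = 18.92
reported_std_dev = 4.34974

# The standard deviation is the square root of the variance,
# so each reported figure can be recovered from the other.
std_from_variance = math.sqrt(reported_variance)
variance_from_std = reported_std_dev ** 2

print(round(std_from_variance, 2), round(variance_from_std, 2))
```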

Fig. 1 Histogram

The histogram above is bell-shaped, with only one peak, and is roughly symmetric
around the mean, which suggests that the scores are nearly normally distributed. In
real-life settings, most distributions are not perfectly normal, so the histogram in Figure
1 is only considered nearly normal. It indicates that values near the mean
occur more frequently than values that are farther away from the mean. Moreover, it
shows the central tendency of the dataset. The peak of the
bell-shaped curve sits at the mean, and about 50% of the data falls on either side of it. In
a normal distribution, the mean, median, and mode are equal or nearly the same (here 51
compared to 51.1897).

Fig. 2 Box plot diagram

The box plot diagram shows one outlier, with a score below
40. Since there is only one outlier and it does not affect the normality of the data,
this outlier is disregarded. The figures shown through the box plot are the same as the

According to the analysis, the article came to the conclusion that the work-life
balance of paramedical employees is good. To sum up, this study used descriptive
statistics to present the work-life balance score in order to determine paramedical
employees’ equilibrium between personal and work life. The descriptive statistics used in
the article show the basic features of the dataset and present them in a summary that
describes the data sample and its measurements. Through this method, not only
professional analysts but also general readers can understand the data better and draw
their own conclusions about paramedical employees’ work-life balance.

The work-life balance analysis of paramedical employees helps to know about the
employees’ working conditions, environment, and their present situation of balancing
their personal life with work. Therefore, organizations should facilitate the balance of
employees’ personal and work lives, which will improve work efficiency, result in
higher job satisfaction and pave the way for better performance. These factors will
contribute to enhancing the organization’s performance and profitability. The findings
in this paper should not be universally applied to all disciplines as all data collected
comes from the healthcare sector only. The results may also differ in case of
employees in other functional areas.

The benefits of descriptive statistics in finding out about the work-life status of
medical employees
Descriptive statistics is well suited to analyzing and understanding the work-life
status of medical employees, as it can be used to identify trends in work-life status,
such as changes in job satisfaction, burnout rates, or work hours. By identifying these
trends, medical organizations can take action to address problems and improve
working conditions. Moreover, it can compare work-life status across different groups
of medical employees, such as physicians, nurses, and administrative staff, which
helps identify differences in work-life balance and job satisfaction. Descriptive
statistics can also be used to measure the outcomes of interventions aimed at
improving work-life status, such as reducing work hours or providing mental health
support. By tracking key measures such as job satisfaction and burnout rates, medical
organizations can evaluate the effectiveness of these interventions. In addition, it helps
medical organizations plan interventions to improve work-life status. Since descriptive
statistics can identify areas of concern and clarify the factors that contribute to poor
work-life balance, organizations can develop targeted interventions that address the
specific needs of their employees.

Reasons why descriptive statistics should be used in research and the medical field

Descriptive statistics is an essential tool used in research to summarize and
describe the key features of a dataset. It is used to provide an overview of the data,
including the central tendency, variability, distribution, and shape of the data. It
provides a summary of the key features of the dataset, including the mean, median,
mode, range, and standard deviation and makes it easier to understand and interpret
the data. Moreover, descriptive statistics can identify patterns in the data, including
the shape of the distribution, outliers, and skewness, which can be used to guide
further analysis and investigation. Insights into the characteristics of the data,
including the range of values, the spread of the data, and the level of variability can
also be provided by descriptive statistics. Overall, descriptive statistics is a critical tool
in research, as it helps researchers to understand and interpret the data, and to draw
valid conclusions based on the findings. It is often the first step in analyzing data and
is essential for designing and conducting effective research studies.
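
The measures listed above can be computed in a few lines. The following is a minimal sketch using Python's standard library; the function name `describe` and the sample of weekly work hours are hypothetical and serve only as an illustration.

```python
import statistics

def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

# Hypothetical weekly work hours for eight employees (illustrative data only).
hours = [38, 40, 40, 42, 45, 48, 50, 57]
summary = describe(hours)
print(summary)
```

Here the mean (45) lies above the median (43.5) because of the single large value, which is the kind of pattern a summary table makes visible at a glance.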

Medical field
Descriptive statistics plays a critical role in the medical field. It is used to
summarize and describe the characteristics of patient populations, clinical trials, and
other medical data. The use of descriptive statistics in the medical field has several
important applications such as describing the prevalence and incidence of diseases in
populations, which can be used to identify risk factors and develop preventive
strategies. Moreover, it is used in clinical research to summarize and describe the
characteristics of study participants, including age, sex, medical history, and other
relevant factors. It can also help improve the quality of medical care by tracking the
performance of hospitals, clinics, and individual practitioners. Another benefit of
descriptive statistics is summarizing and describing the results of clinical trials: it is
used to report measures of efficacy, safety, and tolerability, which are used to
evaluate new drugs.


1. Article 1: Financial Development and Economic Growth relationship
The Case
The article challenges the widely held view that financial development
promotes economic growth by providing empirical evidence that indicates a negligible
or weakly negative association between the two variables.

The Purpose
Because recent studies have suggested a positive relationship between financial
development and economic growth, evidence on how reliable that relationship is has
particular value. The author aims to provide evidence that supports the idea
of a weak or negative correlation between these two variables.

The Method
Firstly, data from 95 individual countries show a weak or negative correlation
between financial development and growth. Secondly, this pattern contrasts with the
large, positive correlation found in cross-country data. Thirdly, multiple regression
estimates of simple growth equations from individual-country data indicate the same
pattern. Fourthly, when using multiple regression estimates from cross-country
averaged data, a significant structural heterogeneity is observed, with a weak or
negligible relationship between financial development and growth in most cases. It is
noteworthy that the techniques related to our analysis topic are used in Steps 1 and 2;
hence our focus will be on the analysis of the tables for these two steps.

The technique used in the article

Coefficient of correlation:
The article analyzes the individual-country evidence of the relationship between
financial development and economic growth. The study uses the ratio of liquid
liabilities to GDP (DEPTH) as the prime indicator of financial development and
calculates the correlation coefficient between DEPTH and growth of real GDP per
capita for each country in the sample. The equation for calculating correlation can be
expressed as follows:
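
Assuming the standard Pearson product-moment coefficient is meant, and writing $x_i$ for DEPTH and $y_i$ for growth of real GDP per capita in year $i$ of a given country (these symbols are our own choice for illustration), the coefficient is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

where $\bar{x}$ and $\bar{y}$ are the sample means and $n$ is the number of observations for that country.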

In hypothesis testing, the null hypothesis is that there is no correlation between two
variables, and the alternative hypothesis is that there is a correlation. The p-value is
used to assess the strength of the evidence against the null hypothesis. A low p-value
(typically below 0.05) indicates that the observed correlation is unlikely to have
occurred by chance alone, and therefore provides evidence against the null hypothesis.
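
As a rough illustration of this test, the sketch below computes Pearson's r and the usual test statistic t = r·sqrt((n − 2) / (1 − r²)), which is compared with a t distribution with n − 2 degrees of freedom to obtain the p-value. The function names and the five data points are hypothetical; they are not values from the article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    """Test statistic for H0: no correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical yearly observations for one country (illustrative only).
depth = [1, 2, 3, 4, 5]
growth = [2, 1, 4, 3, 5]

r = pearson_r(depth, growth)
t = t_statistic(r, len(depth))
print(round(r, 3), round(t, 3))
```

With only five observations there are 3 degrees of freedom, and the two-sided 5% critical value is about 3.18, so even a correlation of 0.8 would not be judged significant in this toy sample. This is why the article restricts the sample to countries with at least ten observations.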

The period of the analysis is from 1960 to 1989, and the sample includes 95
countries with at least ten observations. The correlations are positive in 39 cases, of
which 9 show statistical significance at the conventional five percent level.
Meanwhile, the correlations are negative in the remaining 56 cases, and 16 of these
are significant at the five percent level. The mean of the 95 correlation coefficients is
-0.06, suggesting a negligible or weakly negative association between economic
growth and financial development.

In addition, the study found a positive correlation between financial development
and GDP growth in the cross-country data for the 95 countries, whereas the
individual-country analysis showed a weak negative association. The study suggests
that the cross-country correlation is not reflective of the true effect of financial
development on growth. The growth model used in the study included real GDP growth,
population, exports, gross domestic investment to GDP ratio, and DEPTH.

The correlation is used to identify the relationship between the depth of financial
development and the growth of real GDP per capita for each country. Additionally,
the study compares the individual-country correlation patterns with the cross-country
correlation between the same variables. The impact of cyclical factors on the
correlations is also assessed. To gain a better understanding of the growth effects of
financial development on each country, the study examines the DEPTH parameter
from individual-country regressions of simple growth models. Overall, the study
focuses on investigating the correlations between financial development and economic
growth in individual countries to uncover any patterns or relationships that may exist.

The conclusion
The technique of analyzing correlations between financial development and
economic growth is essential for organization managers because it provides valuable
insights into the relationship between these two important factors.
Understanding this relationship can help managers make informed decisions about
investment opportunities, financial strategies, and economic policies. Additionally,
this information can help them identify potential risks and opportunities and develop
more effective policies to support economic growth. In conclusion, correlation
analysis can help managers make better-informed decisions and drive organizational
success.

2. Article 2: Application of outlier mining in insider identification based on Boxplot method
The case
With the development of the Chinese trading market, illegal activities such as insider
trading are expected to occur. It is noted that insiders who engage in fraudulent
activities are likely to display abnormal behaviour that can be detected through outlier
identification techniques. Thus, the researchers suggested using the Boxplot method as
a technique to detect outliers in the data and identify any potential fraud.

The purpose
Because the Chinese market is relatively young and its market discipline mechanisms
are not yet finalized, fraud such as insider trading is prone to happen. By using
outlier detection techniques, this paper could improve fraud detection and help
organizations prevent financial losses. Furthermore, it could help enhance the
security measures and protect
the business’s stakeholders (such as investors, customers, and employees) from the

The method
A boxplot is a graphical representation of a dataset's distribution, showing its
location, spread, and skewness through a five-number summary (minimum, first
quartile, median, third quartile, and maximum). The key quantities are:
o Median: the value that separates the data into two halves, so that 50% of the
data lie below it and 50% lie above it when arranged in ascending or descending
order. The median is also referred to as the second quartile (Q2).
o First quartile (Q1): the value below which 25% of the data fall. The third
quartile (Q3) is the value below which 75% of the data fall.
o Interquartile range (IQR): the difference between Q3 and Q1, which spans the
middle 50% of the data.
o Outlier: any value that lies above Q3 + 1.5 × IQR or below Q1 − 1.5 × IQR.
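
The outlier rule above can be sketched as follows. This is a minimal illustration, not the article's implementation: the function name and the sample volumes are hypothetical, and the "inclusive" quartile method is an assumption, since software packages use slightly different quartile conventions.

```python
import statistics

def boxplot_outliers(values):
    """Apply the 1.5 * IQR rule and return the quartiles, fences and outliers."""
    # method="inclusive" is an assumed convention; other tools may differ slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical daily trading volumes (illustrative units only).
volumes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
result = boxplot_outliers(volumes)
print(result["outliers"])
```

In this toy sample the upper fence is 14.5, so only the value 100 is flagged, which mirrors how a single day of unusually heavy trading stands out beyond the whisker.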

The technique used in the article

Certain characteristics of the trading market can be used to detect fraud.
Firstly, in terms of timing, insider trading may occur during sensitive periods for the
company. This is because during these periods there may be private or
confidential information that can affect the company’s stock price.

Secondly, insider transactions are likely to produce a recognizable pattern. Normally,
traders would want to make a large gain through fraud. However, they will avoid
buying and selling huge amounts so as not to arouse the suspicion of the government
and other traders. On the other hand, many small purchases incur large fees.
Thus, traders tend to trade moderate volumes in batches and through
multiple accounts. This helps divert public attention, so they
can earn their money without being suspected.

The research used Huang’s insider trading case in 2007 as a case study for this
measure. The data cover transactions in Zhongguancun stock from January to
September 2007. To study the fraud, the researchers chose the trading
volume, daily closing price, and stock turnover rate as the criteria.

Figure 1: Boxplot of the Daily Trading Volume

The presented figure depicts the trading volume over a period of time, with the
vertical axis representing the volume of trading and the horizontal axis representing the
months. The chart exhibits four outliers that fall outside the whiskers, indicating a
significantly high trading volume in January, April, June, and September. Further
analysis reveals that these months correspond to sensitive periods for the company. In
January, a reform of the split share structure was implemented, so this outlier can be
considered inconsequential. During April, a confidential asset replacement plan
between Zhongguancun and Pengtai Company was in progress, and it was noted that
Huang was involved in the decision-making process. In addition, Zhongguancun
underwent a rearrangement process with Pengtai Company in July and August, which
was not disclosed to the public. The findings suggest that from April to June, Huang
utilized his insider position to direct others to purchase approximately 1,000,000,000
shares, and from August to September, he purchased nearly 150,000,000 shares for
himself. Notably, the former chairman of Zhongguancun, Xu, also acquired around
31,660,000 shares during this time. The temporal alignment of the identified outliers
with Huang's purchase patterns further supports the likelihood of his involvement in
insider trading.

Figure 2: The Trend of Daily Closing Price and Turnover Rate

Another method that the researchers used to discover any fraud was to use the daily
closing price and the stock turnover rate. In this figure, the blue line represents the
daily closing price, while the green one represents the daily turnover rate.

The stock turnover rate is the ratio of the number of shares traded over a specific
period to the number of shares outstanding (or tradable) in that period. It should not
be confused with the inventory turnover rate, which measures how many times a
company's inventory is sold and replaced. A high turnover rate may indicate increased
market activity and liquidity, while a low turnover rate may suggest lower market
interest and potential price stagnation. Thus, under normal conditions the turnover
rate and the closing price are expected to move consistently with each other.

Normally, the daily stock turnover rate was between 3% and 7%, and it was consistent
with the stock price. However, there were some dates when the turnover rate was
noticeably high, such as in April, June, and September. The sharp fluctuations
on these dates implied large-scale batch trading. Moreover,
these periods match the periods of the Huang case. It is worth noting that the
stock price of Zhongguancun fell from May to July (from 12.6 yuan to 6.84
yuan), in contrast to the active trading of the company’s shares. This
information indicated that the possibility of insider trading was high.
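
A simple screen of this kind can be sketched as below, assuming the turnover rate is computed as shares traded divided by tradable shares and using the 3% to 7% band mentioned above as the normal range. The function names and all figures are hypothetical and are not taken from the article.

```python
def turnover_rate(volume, tradable_shares):
    """Daily turnover rate: shares traded divided by tradable shares."""
    return volume / tradable_shares

def flag_unusual_days(daily_volumes, tradable_shares, low=0.03, high=0.07):
    """Return (day, rate) pairs whose turnover rate falls outside [low, high]."""
    flagged = []
    for day, volume in daily_volumes.items():
        rate = turnover_rate(volume, tradable_shares)
        if rate < low or rate > high:
            flagged.append((day, rate))
    return flagged

# Hypothetical figures for illustration only.
tradable = 1_000_000
volumes = {"day 1": 40_000, "day 2": 55_000, "day 3": 180_000, "day 4": 62_000}

unusual = flag_unusual_days(volumes, tradable)
print(unusual)
```

Only "day 3", with a turnover rate of 18%, falls outside the band. A flagged day is a prompt for further investigation rather than proof of insider trading.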

The conclusion
Detecting fraud in the trading market has been a top concern for traders, as it is a way
to protect the company’s money and the customers’ interests. This research suggests
that the boxplot outlier detection method can effectively help identify insider
trading activities. The study found that the method can successfully detect abnormal
trading volumes and price fluctuations during sensitive periods of the company. The
study also highlights the need for continuous development of efficient and reliable
methods for detecting insider trading activities to maintain the fairness and
transparency of financial markets.


1. Methodology
The research design for this study will be a quantitative research design, which
will involve the analysis of numerical data collected from the insurance firm.
The data for this study will be obtained from the insurance firm's internal database.
The data will be collected using a purposive sampling technique, where only the data
related to the research objective will be collected. The data set describes the
information of individuals paying charges to an insurance company. The insurance
company may care about descriptive statistics because of their procedures of risk
assessment. An insurance firm can analyze the risk involved in insuring a single
person or a group of people with the use of descriptive statistics. By analyzing the data
on factors such as age, gender, occupation, health history, and other relevant
variables, an insurance company can calculate the likelihood of an individual making
a claim and determine the appropriate premium to charge. In this case, there are 7
variables: age, sex, BMI, children, smoker, region, and charges. Detailed information
is given in the table below.

Variable    Explanation

Age         Age of the primary beneficiary

Sex         Gender of the insurance contractor: female or male

BMI         Body mass index (kg/m^2), an objective index of body weight relative
            to height that shows whether weight is relatively high or low for a
            given height; ideally 18.5 to 24.9

Children    Number of children covered by health insurance / Number of

Smoker      Whether the policyholder smokes or not

Region      The beneficiary's residential area in the US: northeast, southeast,
            southwest, northwest

Charges     Individual medical costs billed by health insurance

2. Research question
a. Do policyholders who carry health risks caused by a high BMI or
smoking have to pay more than those who do not?
b. Do beneficiaries who have more children have to pay higher charges?

3. Data Analysis
Descriptive statistics will be used to summarize the collected data. Numerical
techniques such as measures of central location, variability, and linear relationship
will be used to describe a variable or the relationship between two variables and to make
predictions about future trends in the insurance industry. Graphical techniques such as
histograms, box plots, and bar charts will be used to visualize the numerical
techniques. The statistical software package, such as SPSS, will be used to analyze the

Figure 3: Histogram and descriptive statistics of age

As we can see from the figure, the mean is approximately equal to the median
(39.21 compared to 39) and the coefficient of skewness is 0.056, which is close
to 0. Besides, the histogram curve is roughly bell-shaped, so we treat the
distribution of age as approximately normal. These results can help firms to more
accurately estimate the probability of claims made by different age groups. By
modeling the age distribution using a normal distribution, the firm can identify which
age groups are more likely to make claims and adjust their premiums accordingly to
account for the increased risk. In addition, the mode of age is 18, so the most common
age among the insured is 18, which suggests that young people increasingly pay for
insurance. The firm may target young customers to boost revenue from its insurance
products.
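
To make the role of the skewness coefficient concrete, the sketch below computes the moment coefficient of skewness for a symmetric and a right-skewed sample. The function name and both samples are hypothetical, and statistical packages often apply a small-sample adjustment, so their reported values can differ slightly from this formula.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical samples (illustrative only).
symmetric = [20, 30, 40, 50, 60]
right_skewed = [1, 2, 2, 3, 3, 3, 20]

print(skewness(symmetric), skewness(right_skewed))
```

A value near 0, as for the symmetric sample, is consistent with a symmetric distribution, while a clearly positive value signals a long right tail in which the mean is pulled above the median.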

Figure 4: Histogram BMI of policyholders

To illustrate data for an ordinal variable, BMI values were collapsed into ordinal
categories based on US standards: <18.5 underweight, [18.5, 25.0) normal weight,
[25.0, 30.0) overweight, [30.0, 35.0) class I obesity, [35.0, 40.0) class II obesity, and
≥40.0 extreme obesity.
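
The grouping can be expressed as a small function. This is a sketch that assumes each interval includes its lower bound and excludes its upper bound, with the top class starting at 40.0; the function name is our own.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to an ordinal category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal weight"
    if bmi < 30.0:
        return "overweight"
    if bmi < 35.0:
        return "class I obesity"
    if bmi < 40.0:
        return "class II obesity"
    return "extreme obesity"

# Boundary values fall into the higher category under this convention.
for value in (18.4, 18.5, 24.9, 30.0, 40.0):
    print(value, bmi_category(value))
```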


Table 3: Grouped frequency tables of BMI

The frequency distribution of BMI categories is shown in Table 3. Note that few
participants were underweight: only 21 of 1338 (1.6%). Another 226 (16.9%) were normal
weight, 386 (28.8%) were overweight, and 389 (29.1%) had class I obesity, with 225
(16.8%) and 91 (6.8%) classified as having class II or extreme obesity, respectively.

Referring to cumulative statistics, one sees that 18.5% were underweight or normal
weight but that 52.7% (i.e., 100% − 47.3%) were obese.
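
The percentages and cumulative percentages of a grouped frequency table follow directly from the category counts. The sketch below recomputes them from the counts quoted in the text, taking the extreme-obesity count as 91 so that the six counts sum to the stated total of 1338; that count and the function name are assumptions made for this illustration.

```python
def frequency_table(counts):
    """Return rows of (category, count, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for category, count in counts.items():
        cumulative += count
        rows.append((
            category,
            count,
            round(100 * count / total, 1),
            round(100 * cumulative / total, 1),
        ))
    return rows

# Counts as quoted in the text; 91 for extreme obesity is an assumption
# chosen so that the categories sum to 1338.
bmi_counts = {
    "underweight": 21,
    "normal weight": 226,
    "overweight": 386,
    "class I obesity": 389,
    "class II obesity": 225,
    "extreme obesity": 91,
}

table = frequency_table(bmi_counts)
for row in table:
    print(row)
```

Under these counts the cumulative share through "overweight" is 47.3%, so the three obesity classes together account for the remaining 52.7%.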

The firm may choose to increase premiums. If the data suggest that
policyholders who are obese are more likely to file claims, the firm may consider
increasing premiums for these individuals. While this may be contentious, it may
help to balance the expense of reimbursements for obesity-related health conditions.
However, instead of simply increasing premiums for policyholders who are obese,
the firm could consider offering discounts for those who adopt healthy lifestyle
choices, for example those who participate in regular exercise,
attend weight loss programs, or track their food intake. This approach incentivizes
healthy behaviors rather than penalizing those who are struggling with obesity.

Figure 5: Histogram of charges paid by policyholders

As we can see from the statistical table, the mean is much larger than the median
and the coefficient of skewness is 1.516 (greater than 0), so the distribution of
charges is positively skewed. The chart shows that policyholders paying small and
medium amounts (ranging from 2000 to 15000) account for the majority. This can help
the company identify potential customers with a low average contract value. From
there, the company can offer policies to expand the group of customers who want to
buy expensive insurance. Besides, the firm needs to retain and take better care of
its main customers.

Figure 6: Box plot of charges in categories of the smoker, sex, region, and children

From these box plots, it is quite obvious that smokers pay higher insurance
premiums. Therefore, we can say that smoking is a characteristic that clearly affects
policyholders' charges. Besides, we see a slight increase in insurance premiums as the
number of children increases but, surprisingly, a decrease when there are four or more
children. A possible explanation for this result is an insurance
policy which states that when a family has more than three children under the age of
21, it only pays for the three oldest. This effectively means that a policyholder with
four or more children only has to pay a premium for three children. Thus, we assume
that the policyholders in this data set only have children under the age of 21. The sex
and region variables seem to show no effect on the insurance premium.
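
The comparison that the box plots make visually can also be checked numerically by grouping charges by category and comparing the group medians. The sketch below uses a handful of hypothetical records, not the actual data set, and the function name is our own.

```python
import statistics
from collections import defaultdict

def median_by_group(records, group_key, value_key="charges"):
    """Group records by one field and return the median value per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: statistics.median(values) for group, values in groups.items()}

# Hypothetical policyholder records (illustrative only).
records = [
    {"smoker": "yes", "charges": 32000},
    {"smoker": "no", "charges": 4000},
    {"smoker": "no", "charges": 6000},
    {"smoker": "yes", "charges": 28000},
    {"smoker": "no", "charges": 9000},
]

medians = median_by_group(records, "smoker")
print(medians)
```

The same function can be applied with `group_key` set to sex, region, or children to compare the other categories shown in the figure.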

Insurance companies need to assess the risk of insuring a person, which includes
evaluating the likelihood of the person making a claim. Smokers are considered to be
at higher risk of developing health issues, such as heart disease, lung cancer, and
respiratory illnesses, which can lead to higher insurance claims. Insurance
companies are also businesses that need to make a profit. Based on this result, by
charging higher premiums for smokers, they can offset the higher risk of insurance
claims and maintain profitability.

Figure 7: Correlations between BMI and charges and box plots of charges in categories of BMI

In order to investigate the relationship between BMI and charges paid, we apply the
coefficient of correlation. The key value is Pearson’s r, i.e., the correlation
coefficient. That is the Pearson Correlation figure (inside the square red box, above),
which in this case is 0.198.

Pearson’s r varies between +1 and -1, where +1 is a perfect positive correlation
and -1 is a perfect negative correlation; 0 means there is no linear correlation at all.
Our figure of 0.198 indicates a very weak positive correlation: the higher the BMI,
the higher the charges, but the effect is very small.

The insurance firm may consider one of the following decisions. One possibility is to
raise rates for consumers with a higher BMI or who smoke. This would help to
balance the higher risk while keeping the insurance firm profitable. However, it may
make insurance more costly for some people. Another option is to create targeted
policies that are designed specifically for individuals with a higher BMI or who smoke.
This insurance may carry higher premiums, but it may also provide extra benefits such as
coverage for weight reduction programs or smoking cessation assistance. A third
option is to refuse coverage for individuals who have a higher BMI or who smoke.
This could be a controversial decision, but it would effectively eliminate the risk
associated with these individuals.

The data set gives a basic picture of personal medical costs through the
characteristics of, and relationships among, 7 variables: age, sex, BMI,
children, smoker, region, and charges. Based on the result that policyholders with a
higher risk of health problems, caused by either a high BMI or smoking, pay higher
charges, the firm may increase premiums, create targeted policies, or refuse coverage.
Besides, we find that if a family has more than three children under the age of 21,
it only pays for the three oldest according to insurance policies. This information can
be used to tailor insurance products and services to better meet the needs of different
customer segments. By analyzing descriptive statistics such as mean, median, mode,
and standard deviation, insurance firms can better understand the characteristics of
their insured population. Descriptive statistics can help insurance firms identify trends
and patterns in their data, such as changes in claim frequency or severity over time.
This information can be used to adjust insurance pricing and underwriting practices to
better manage risk.


In conclusion, descriptive statistics is a fundamental branch of statistics that involves
the use of statistical methods to summarize and describe data. The primary goal of
descriptive statistics is to provide an overview of the characteristics of a dataset, such
as its central tendency, variability, and distribution. The various measures used in
descriptive statistics, including measures of central tendency, measures of variability,
and measures of shape, are critical tools that help analysts make sense of data and
draw meaningful conclusions from it.

Descriptive statistics is widely used in many fields, including finance, medicine, social
sciences, and engineering, among others. It is an essential tool in data analysis and
reporting, and it is often the first step in any statistical analysis. Moreover, it provides
a framework for the more advanced statistical analyses, including inferential statistics
and predictive modeling.

Overall, descriptive statistics is a critical tool for any researcher or analyst who wants
to understand and interpret data accurately. By using the right descriptive statistics,
one can gain valuable insights into a dataset and make informed decisions based on
those insights. As such, descriptive statistics will remain an essential tool in data
analysis for many years to come.


1. Ram, R. (1999). Financial development and economic growth: Additional
evidence. Journal of Development Studies, 35(4), 164–174.
2. Li, A., Feng, M., Li, Y., & Liu, Z. (2016). Application of Outlier Mining in Insider
Identification Based on Boxplot Method. Procedia Computer Science, 91, 245–251.
ISSN 1877-0509.
3. Arun Raj, R., & Ramanathan, H. N. (2022). A Study on Work-Life
Balance of Paramedical Employees with Special Reference to A Private
Hospital. Indian Journal of Commerce and Management Studies, 3(3), 74–79.
Retrieved from
4. Insurance Premium Data | Kaggle
5. How does family size impact my insurance cost? |
