-------***-------
STATISTICS REPORT
Descriptive statistics
Group 4:
Vũ Tiến Hải - 11219357
Nguyễn Đình Hiệp - 11212209
Đặng Thuỳ Anh - 11219353
Vũ Kim Ngân - 11219365
Dương Nhật Anh - 11219352
Hanoi, 2023
Table of Contents
A. INTRODUCTION
I. DEFINITION OF DESCRIPTIVE STATISTICS
II. NECESSITY AND UNDERSTANDING OF DESCRIPTIVE STATISTICS
1. The necessity of Descriptive Statistics
2. Understanding Descriptive Statistics
III. APPLICATION
B. ARTICLE AND SOURCES
I. ARTICLE SUMMARY
1. The case
2. The purpose
3. The method
Conclusion
II. OTHER RESEARCH AND TECHNIQUES
1. Article 1: Financial Development and Economic Growth relationship
2. Article 2: Application of outlier mining in insider identification based on Boxplot method
III. DATA ANALYSIS
1. Methodology
2. Research question
3. Data Analysis
Conclusion
IV. Conclusion
REFERENCES
List of tables
List of figures
Unlike raw data, which is collected data that has not yet been processed or cleaned,
descriptive statistics deals with methods of organizing, summarizing, and presenting
data in a convenient and informative way. This practice allows for easy data
visualization, which helps data to be presented in a meaningful and understandable
way. Thus, it allows for a simplified interpretation of the data set in question. Other
benefits of descriptive statistics may include:
● Descriptive statistics allows users to present the data in graphical formats, such
as bar charts, pie charts, dot plots, and histograms, which are much easier to
understand and interpret.
● Various statistical measures (e.g., mean, median, and mode) allow users to
summarize the central characteristics of the data. This allows us to obtain a
rough understanding of where the data values lie.
● The measures of dispersion (e.g., standard deviation) help users understand
how far the data values are spread away from each other.
● The computation of skewness illustrates the shape of the distribution.
● Correlation analysis compares two different characteristics and checks whether
there is any relation between them.
Determining the statistics you want to output is typically the first step in applying
advanced statistics, and delivering them in the proper manner is the final step. The
steps involved in descriptive statistics are as follows:
Step 1: Define the research question or problem: The first step in descriptive
statistics is to define the research question or problem that you want to investigate.
The research question or problem should be focused and specific, and should provide
a clear direction for the collection and analysis of data. It should be framed in such a
way that it can be answered using quantitative data. It should also be specific enough
to allow for the selection of appropriate variables, data collection methods, and
statistical techniques for analysis. Once the research question or problem has been
identified, researchers can start collecting data.
Step 2: Collecting and Organizing Data: The second step in descriptive statistics is
to collect data. There are many ways to collect data, including surveys, experiments,
interviews, and observational studies.
After the data is collected, a researcher should clean the data by removing
any errors or inconsistencies that could affect the results. Once the data is
clean, it should be entered into a spreadsheet or statistical software program.
Once the data is collected and sorted, one should organize the data in a way that is
easy to analyze. This could involve sorting the data by variables, creating groups or
categories, or summarizing the data with measures of central tendency and variability.
Step 3: Measures of Central Tendency: These measures describe the center or typical
value of a dataset:
● Mean: The mean is the sum of all the values in a dataset divided by the total
number of values. It is also known as the arithmetic average.
● Mode: The mode is the value that occurs most frequently in a dataset. The
mode can identify the most common value or values in the dataset, but it may
not provide a good representation of the overall distribution of the data.
● Median: The median is the middle value in a dataset when the values are
arranged in order of magnitude. The median is not affected by extreme values,
so it often represents the center of a skewed distribution or a dataset with
outliers better than the mean does.
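The three measures above can be illustrated with a minimal Python sketch. The scores below are hypothetical values invented for illustration; they are not data from any study discussed in this report.

```python
import statistics

# Hypothetical scores, invented for illustration only.
scores = [38, 47, 49, 51, 51, 51, 53, 55, 61]

mean_score = statistics.mean(scores)      # sum of the values divided by their count
median_score = statistics.median(scores)  # middle value of the sorted data
modes = statistics.multimode(scores)      # every value tied for most frequent

print(f"mean={mean_score:.2f}, median={median_score}, modes={modes}")
```

The low value 38 pulls the mean slightly below the median and the mode, which shows why the mean is the measure most sensitive to extreme values.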
Step 4: Measures of Dispersion: These are statistical tools used to describe the spread
or distribution of a dataset. Here is a brief overview of each measure of dispersion:
● Range: The range is the difference between the largest and smallest values in a
dataset. It provides a rough estimate of the spread of the data, but it can be
affected by extreme values and may not be a robust measure of variability.
● Variance: The variance is a measure of how far the values in a dataset are
spread out from the mean. It is calculated by taking the average of the squared
differences between each value and the mean.
● Standard deviation: This is the square root of the variance. It provides a more
intuitive measure of dispersion that is expressed in the same units as the
original data. The standard deviation provides a good estimate of the spread of
the data, but it may not be a robust measure of variability in the presence of
outliers.
● Coefficient of variation: The coefficient of variation is the ratio of the standard
deviation to the mean, expressed as a percentage. It provides a measure of
relative variability that can be used to compare the variability of different
datasets.
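The measures of dispersion listed above can be computed with Python's standard library, as in the sketch below. The data are hypothetical and chosen so that the results are round numbers (mean 5, population variance 4, standard deviation 2).

```python
import statistics

# Hypothetical values, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(values) - min(values)          # largest minus smallest value
pop_variance = statistics.pvariance(values)     # average squared deviation (divide by N)
sample_variance = statistics.variance(values)   # same sum of squares, divided by n - 1
pop_std = statistics.pstdev(values)             # square root of the population variance
cv_percent = pop_std / statistics.mean(values) * 100  # spread relative to the mean, in %

print(data_range, pop_variance, round(sample_variance, 3), pop_std, cv_percent)
```

Note that the sample variance (dividing by n - 1) is always a little larger than the population variance (dividing by N) for the same data; which one is appropriate depends on whether the data are a sample or the whole population.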
Step 5: Data Visualization: The data can be visualized using various graphical
representations such as histograms, scatter plots, bar graphs, and pie charts. This helps
to get a quick understanding of the distribution and pattern of the data. There are
several types of data visualizations that are commonly used in descriptive statistics,
including:
● Bar graphs: a chart that represents the data using rectangular bars with lengths
proportional to the values they represent.
● Scatter plots: a graph that represents the relationship between two variables.
Each point on the plot represents a single data point, and the position of the
point on the plot corresponds to the value of the two variables.
● Box plots: a chart that represents the distribution of the data using a box with
whiskers. The box represents the interquartile range, while the whiskers
extend toward the smallest and largest values, with outliers often drawn as
separate points.
● Histograms: a chart that represents the distribution of the data using bars. The
bars represent the frequency or proportion of the data that falls within each
interval.
There are some other data visualizations such as heat maps, pie charts, and line graphs.
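The idea behind a histogram, counting how many values fall within each interval, can be sketched without any plotting library. The scores and the interval width of 5 below are assumptions made only for illustration.

```python
from collections import Counter

# Hypothetical scores, invented for illustration only.
scores = [38, 44, 47, 48, 49, 50, 51, 51, 52, 53, 54, 55, 57, 61]
bin_width = 5  # assumed interval width

# Map each value to the lower edge of its interval, e.g. 47 -> 45.
bin_counts = Counter((value // bin_width) * bin_width for value in scores)

# Print one row of '#' marks per non-empty interval, from the lowest interval up.
for lower_edge in sorted(bin_counts):
    upper_edge = lower_edge + bin_width - 1
    print(f"{lower_edge}-{upper_edge}: {'#' * bin_counts[lower_edge]}")
```

Each row of marks plays the role of one bar: the longer the row, the more observations fall in that interval.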
Besides describing a single data set, we can also read and interpret data by
describing the relationship between two variables using covariance and
the correlation coefficient.
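There are two types of covariance: sample covariance and population covariance.
Sample covariance is calculated using a sample of data, while population covariance
uses the entire population data. Sample covariance is an estimator of population
covariance; dividing the sum of products by n would give a biased estimate, so the
sample formula divides by n - 1 to obtain an unbiased one. The formulas to calculate
covariance are as follows:
\[ \sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y) \quad \text{(population)} \]
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{(sample)} \]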
However, although covariance indicates the direction of the linear relationship
between two variables, its magnitude depends on their scale: the covariance of two
variables with large values or different units of measurement may be much larger than
the covariance of two variables with smaller values or similar units of measurement,
even if the strength of the relationship is the same. To overcome this disadvantage,
the correlation coefficient is more commonly used, as it standardizes the covariance
by dividing it by the product of the standard deviations of the two variables, which
allows for more meaningful comparisons.
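The scale dependence of covariance, and the way the correlation coefficient removes it, can be seen in a small sketch. The paired observations and the variable names (hours_worked, balance_score) are hypothetical, and the helper functions are written here only for illustration.

```python
import statistics

# Hypothetical paired observations, invented for illustration only.
hours_worked = [35, 40, 42, 45, 50, 55]
balance_score = [58, 55, 54, 50, 47, 42]

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Covariance divided by the product of the two sample standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

# Changing units (hours -> minutes) multiplies the covariance by 60
# but leaves the correlation coefficient unchanged.
minutes_worked = [h * 60 for h in hours_worked]
cov_hours = sample_covariance(hours_worked, balance_score)
cov_minutes = sample_covariance(minutes_worked, balance_score)
r_hours = correlation(hours_worked, balance_score)
r_minutes = correlation(minutes_worked, balance_score)

print(cov_hours, cov_minutes, round(r_hours, 3), round(r_minutes, 3))
```

The two covariances differ by a factor of 60 even though the relationship is identical, while the correlation stays the same and always lies between -1 and 1.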
III. APPLICATION
- Weather Forecasting
Statistics is used heavily in the field of weather forecasting. Analysis of trends can
be useful in depicting and predicting the changing patterns and erraticism of some
climatic parameters. This analysis gives a proper knowledge about the changing
conditions of the climate and its effects, by the evaluation of meteorological
parameters. In particular, probability is used by weather forecasters to assess how
likely it is that there will be rain, snow, clouds, etc. on a given day in a certain area.
Forecasters will regularly state that “there is a 90% chance of rain today
after 5 PM” to indicate that there’s a high likelihood of rain during certain hours.
- Health care
Health statistics are used to understand risk factors for communities, track and
monitor diseases, see the impact of policy changes, assess the quality and safety of
health care and determine how likely it is that certain individuals will spend a certain
amount on healthcare each year. Health statistics are a form of evidence, or facts that
can support a conclusion.
About Health Statistics Modules:
o Correlates: See how to measure the risk factors and protective factors that
impact our health.
o Conditions: Learn to assess how often and how badly diseases impact a
community.
o Care: Dig into how healthcare is delivered to the communities that need it, to
treat disease and illness.
o Costs: Get more information on what health care costs, and why.
For example, an actuary at a health insurance company might use factors like age,
existing medical conditions, current health status, etc. to determine that there’s an 80%
probability that a certain individual will spend $10,000 or more on healthcare in a
given year.
- Traffic
Traffic engineers regularly use statistics to monitor total traffic in different areas of a
city and to analyze how traffic changes over time, which allows them to decide whether
they should add or remove roads to optimize traffic flow. Also, statistical
methods enable planners to construct and validate microsimulations of traffic
congestion and explore alternative designs. These models are also useful in planning
for emergencies, such as managing the evacuation of a major city threatened by a
hurricane.
- Investing
An investor wants to assess the likelihood of a certain investment paying off. Using
statistics and probability, they can calculate the expected return and the risk associated
with the investment. For instance, an investor might use descriptive statistics to
determine the average return of a particular stock over the last five years, the volatility
of its returns, and the correlation between its returns and those of other stocks. Based
on this information, the investor can decide how much of their portfolio to allocate to
the stock.
- Medical Studies
Medical professionals use descriptive statistics to understand how different factors are
related in a population. For example, they may use correlation to analyze the
relationship between smoking habits and the risk of developing lung cancer. By
looking at a large sample of people, they can determine how strong this correlation is
and estimate the likelihood of an individual developing lung cancer based on their
smoking habits.
- Manufacturing
Manufacturing engineers use descriptive statistics to monitor the efficiency of
different production processes. For instance, they might take a random sample of
products from a production line and calculate the proportion of defective items. If the
proportion of defective items is higher than a certain acceptable level, they can use
descriptive statistics to identify the causes of the defects and implement corrective
actions to improve the process.
B. ARTICLE AND SOURCES
I. ARTICLE SUMMARY
1. The case
Employee engagement is a critical factor for organizational success. Many of the
factors that affect employee engagement have been identified, and one of them is
work-life balance. Work-life balance is the ability to balance work obligations and
personal life responsibilities, which is crucial for individuals to achieve physical,
emotional, and mental well-being. Without a healthy work-life balance, paramedical
employees may experience burnout, stress, and other negative health outcomes, and
may struggle to balance personal responsibilities with their work. In turn, employers are
recognizing the need to create a work environment that supports employees' well-
being and promotes productivity.
This article introduces the work-life balance scores and their influences on the
equilibrium between the private life and work life of paramedical employees. The data
collected through questionnaires will be analyzed to identify common themes and
patterns related to work-life balance and to provide insights into the challenges and
opportunities for improving employees' work-life balance.
2. The purpose
The aim of this paper is to find out the work-life status of paramedical employees
in a private hospital and the variations in the work-life among different categories of
medical workers. The study can provide insights into the specific challenges and
opportunities related to this field and identify effective strategies for supporting
employees' well-being and promoting a healthy work-life balance.
3. The method
To carry out the research, 116 paramedical employees were
chosen from a private hospital. Questionnaires were circulated, and the data were
collected and analyzed using appropriate statistical tools.
Employees’ answers are converted into the characteristics of a data set, which are
summarized through descriptive statistics.
e. Measures of dispersion
In statistics, measures of dispersion describe how spread out the values of
a distribution are around its central tendency. In simple terms, they show how squeezed
or scattered the variable is.
a. Range
Range is the difference between the largest and the smallest observation in the
data. The range generally gives you a good indicator of variability when you have a
distribution without extreme values. When paired with measures of central tendency,
the range can tell you about the span of the distribution. To calculate the range, you
need to find the largest observed value of a variable (the maximum) and subtract the
smallest observed value (the minimum).
b. Variance
Variance measures how far each number in the set is from the mean, and thus from
every other number in the set. It is calculated by taking the differences between each
number in the data set and the mean, then squaring the differences to make them
positive, and finally dividing the sum of the squares by the number of values in the
data set.
c. Standard Deviation
Standard deviation is a statistic that measures the dispersion of a dataset relative to
its mean and is calculated as the square root of the variance. A
high standard deviation means that values are generally far from the mean, while a
low standard deviation indicates that values are clustered close to the mean. Because
the variance is built from each data point's squared deviation from the mean, taking
its square root returns the measure to the original units of the data.
f. Graphical techniques
Descriptive statistics also provide a way to visualize data using graphs and charts.
Specifically, in this paper, histogram, box plot and frequency table are used to
highlight important features of the data and communicate findings in a more
accessible way.
a. Histogram
In statistics, a histogram is a graphical representation of the distribution of data.
The histogram is represented by a set of rectangles, adjacent to each other, where each
bar represents the frequency of values falling within one interval. It is used to
summarize discrete or continuous data that are measured on an interval scale. It is
often used to illustrate the major features of the distribution of the data in a
convenient form.
b. Box plot
In descriptive statistics, a box plot is a method for graphically demonstrating the
locality, spread, and skewness of groups of numerical data through their quartiles. In
addition to the box on a box plot, there can be lines extending from the box indicating
variability outside the upper and lower quartiles. Box plots are used to show
distributions of numeric data values, especially when you want to compare them
between multiple groups.
c. Frequency table
Frequency refers to the number of times an event or a value occurs. A frequency
table is a table that lists items and shows the number of times the items occur. A
frequency table shows the distribution of observations based on the options in a
variable. Frequency tables are helpful to understand which options occur more or less
often in the dataset. This is helpful for getting a better understanding of each variable
and deciding if variables need to be recoded or not.
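A frequency table of this kind can be built in a few lines of Python. The sketch below uses a short list of hypothetical survey answers invented for illustration; it does not reproduce the article's data.

```python
from collections import Counter

# Hypothetical survey answers, invented for illustration only.
age_groups = ["young", "young", "middle-aged", "young", "middle-aged",
              "young", "young", "young", "middle-aged", "young"]

counts = Counter(age_groups)  # number of times each category occurs
total = len(age_groups)

# One entry per category: absolute frequency and relative frequency in percent.
frequency_table = {
    category: (count, round(100 * count / total, 1))
    for category, count in counts.most_common()
}

for category, (count, percent) in frequency_table.items():
    print(f"{category:<12} {count:>3} {percent:>6.1f}%")
```

Reporting both the count and the percentage, as the table in the article does, makes it easy to compare categories regardless of the sample size.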
Table 1. Details of Demographic Factors used in the Present Study
Research data were analyzed using a frequency table. Table 1 summarizes the data
by grouping it into categories and showing how many times each category occurs.
Through this method, it is easier to identify patterns and trends in the data. As we can
see, 71.6% of respondent employees are youngsters whereas only 28.4% are middle
aged. This is understandable since becoming a medical professional typically requires
a significant amount of education and training, which can take many years to
complete. Younger people may have an advantage in this regard, as they have more
time to invest in their education and training before settling into a long-term career.
Moreover, it may be easier for them to balance work and life since they are just
starting their careers and may have fewer responsibilities outside of work. The table
also shows that 69.8% of employees say their parents are staying with them and 30.2%
say they are not. A higher percentage of people may live with their parents because it
can be a cost-effective way to save money, or because they have a strong emotional
bond with their parents or a desire for their companionship and support. People seek
work-life balance when they live with family members because they value their personal life
and want to maintain a healthy and fulfilling relationship with their family. Achieving
work-life balance can help people maintain their motivation, focus, and job
satisfaction, which can in turn help them be more present and engaged with their
family members.
The descriptive statistics are presented in Table 2. The
work-life balance mean was found to be 51.1 with a standard deviation of 4.34. The
mean is used to represent the typical value and therefore serves as a yardstick for all
observations. As the mean score is higher than the benchmark score (51.1 compared to
45), it can be concluded that the average medical employee from the private hospital
is satisfied with their work-life status.
The minimum score is 38, which is fairly close to the benchmark of 45, suggesting
that nobody is totally dissatisfied with their work-life balance. There are
only some aspects of work-life balance that they are not pleased with, which is normal
for any employee. The maximum score is 61, which is quite near the mean but far from
the highest score that can be achieved. This means that although the work-life balance
of paramedical employees is good, it is not perfect, and employers still have to improve
the quality of the work-life balance if they want to maintain work efficiency and
employee engagement.
From the table, the variance equals 18.92. However, because the variance is based on
squared deviations, it gives extra weight to outliers, whose values are far from the
mean, and it is expressed in squared units. For this reason, the standard deviation is
often easier to visualize and apply. The standard deviation in the
table is 4.34974, which means that scores typically deviate from the mean by about
4.35 points. Standard deviation has no concept of good or bad. This value just shows
how tightly or widely spread the data is, that is, the spread between satisfaction and
dissatisfaction with work-life balance.
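As a quick arithmetic check on the figures reported above, the standard deviation should equal the square root of the variance. The sketch below uses only values quoted from the article (variance 18.92, standard deviation 4.34974, and the mean of 51.1897 quoted alongside Fig. 1); the coefficient of variation is our own derived figure, not one reported in the article.

```latex
s = \sqrt{s^{2}} = \sqrt{18.92} \approx 4.35
\qquad
CV = \frac{s}{\bar{x}} \times 100\% \approx \frac{4.35}{51.19} \times 100\% \approx 8.5\%
```

A coefficient of variation of roughly 8.5% would suggest that the scores vary only modestly relative to their mean.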
Fig. 1 Histogram
The histogram above is bell-shaped, with only one peak, and is symmetric
around the mean, which means that the variable is nearly normally distributed. In
real-life settings, most distributions are not perfectly normal, so the histogram in
Figure 1 is only considered nearly normally distributed. It indicates that values near
the mean occur more frequently than values that are farther away from the mean.
Moreover, it tells you about the central tendency of the dataset, which is the mean.
The peak of the bell-shaped curve represents the mean, and 50% of the data falls on
either side of it. In a normal distribution, the mean, median, and mode are equal or
nearly the same (51 compared to 51.1897).
Fig. 2 Box plot diagram
The box plot diagram shows one outlier, with a score below
40. Since there is only one outlier and it does not affect the normality of the data,
this outlier is disregarded. The figures shown by the box plot are consistent with the
histogram.
Conclusion
According to the analysis, the article came to the conclusion that the work-life
balance of paramedical employees is good. To sum up, this study used descriptive
statistics to present the work-life balance score in order to determine paramedical
employees’ equilibrium between personal and work life. The descriptive statistics used
in the article show the basic features of the dataset and present them in a summary that
describes the data sample and its measurements. Through this method, not only
professional analysts but also any readers can understand the data better and draw
their own conclusions about paramedical employees’ work-life balance.
The work-life balance analysis of paramedical employees sheds light on the
employees’ working conditions, environment, and their present situation of balancing
their personal life with work. Therefore, organizations should facilitate the balance of
employees’ personal and work lives, which will improve work efficiency, result in
higher job satisfaction and pave the way for better performance. These factors will
contribute to enhancing the organization’s performance and profitability. The findings
in this paper should not be universally applied to all disciplines, as all data collected
come from the healthcare sector only. The results may also differ for
employees in other functional areas.
The benefits of descriptive statistics in finding out about the work-life status of
medical employees
Descriptive statistics is well suited to analyzing and understanding the work-life
status of medical employees, as it can be used to identify trends in work-life status,
such as changes in job satisfaction, burnout rates, or work hours. By identifying these
trends, medical organizations can take action to address problems and improve
working conditions. Moreover, it can compare work-life status across different groups
of medical employees, such as physicians, nurses, and administrative staff, which
helps identify differences in work-life balance and job satisfaction. Descriptive
statistics can also be used to measure the outcomes of interventions aimed at
improving work-life status, such as reducing work hours or providing mental health
support. By tracking key measures such as job satisfaction and burnout rates, medical
organizations can evaluate the effectiveness of these interventions. In addition, it helps
medical organizations plan interventions to improve work-life status. Since descriptive
statistics identifies areas of concern and helps explain the factors that contribute to
poor work-life balance, organizations can develop targeted interventions that address
the specific needs of their employees.
Reasons why descriptive statistics should be used in research and the medical field
Research
Descriptive statistics is an essential tool used in research to summarize and
describe the key features of a dataset. It provides an overview of the data,
including its central tendency, variability, and shape, through measures such as the
mean, median, mode, range, and standard deviation, which makes the data easier to
understand and interpret. Moreover, descriptive statistics can identify patterns in the
data, including the shape of the distribution, outliers, and skewness, which can be
used to guide further analysis and investigation. Overall, descriptive statistics is a
critical tool in research, as it helps researchers to understand and interpret the data,
and to draw valid conclusions based on the findings. It is often the first step in
analyzing data and is essential for designing and conducting effective research studies.
Medical field
Descriptive statistics plays a critical role in the medical field. It is used to
summarize and describe the characteristics of patient populations, clinical trials, and
other medical data. The use of descriptive statistics in the medical field has several
important applications such as describing the prevalence and incidence of diseases in
populations, which can be used to identify risk factors and develop preventive
strategies. Moreover, it is used in clinical research to summarize and describe the
characteristics of study participants, including age, sex, medical history, and other
relevant factors. It can also help improve the quality of medical care, since it can
track the performance of hospitals, clinics, and individual practitioners. Another
benefit of descriptive statistics is summarizing and describing the results of clinical
trials: it is used to report measures of efficacy, safety, and tolerability, which are
used to evaluate new drugs.
The Purpose
Evidence of a weak connection between financial development and economic growth
is valuable because it contrasts with recent studies that have suggested a positive
relationship between them. The author aims to provide evidence that supports the idea
of a weak or negative correlation between these two variables.
The Method
Firstly, data from 95 individual countries show a weak or negative correlation
between financial development and growth. Secondly, this pattern contrasts with the
large, positive correlation found in cross-country data. Thirdly, multiple regression
estimates of simple growth equations from individual-country data indicate the same
pattern. Fourthly, when using multiple regression estimates from cross-country
averaged data, a significant structural heterogeneity is observed, with a weak or
negligible relationship between financial development and growth in most cases. The
techniques related to our analysis topic are used in Steps 1 and 2,
hence our focus will be on the analysis of the two corresponding tables.
In hypothesis testing, the null hypothesis is that there is no correlation between two
variables, and the alternative hypothesis is that there is a correlation. The p-value is
used to assess the strength of the evidence against the null hypothesis. A low p-value
(typically below 0.05) indicates that the observed correlation is unlikely to have
occurred by chance alone, and therefore provides evidence against the null hypothesis.
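One common way to carry out such a test is to convert the sample correlation r into a t statistic with n - 2 degrees of freedom. The summarized article does not state which test its author used, so the sketch below is only an illustration under that assumption; the values r = -0.40 and n = 30, and the function name, are hypothetical.

```python
import math

def correlation_t_statistic(r: float, n: int) -> float:
    """t statistic for H0: no correlation, given sample correlation r and n pairs."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical example: r = -0.40 computed from 30 yearly observations.
t_value = correlation_t_statistic(-0.40, 30)

# Two-sided 5% critical value of Student's t with 28 degrees of freedom (from tables).
critical_value = 2.048
is_significant = abs(t_value) > critical_value

print(round(t_value, 3), is_significant)
```

Comparing the absolute t value with the critical value is equivalent to checking whether the p-value falls below 0.05.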
The period of the analysis is from 1960 to 1989, and the sample includes 95
countries with at least ten observations. The correlations are positive in 39 cases, of
which 9 show statistical significance at the conventional five percent level.
Meanwhile, the correlations are negative in the remaining 56 cases, and 16 of these
are significant at the five percent level. The mean of the 95 correlation coefficients is
-0.06, suggesting a negligible or weakly negative association between economic
growth and financial development.
The correlation is used to identify the relationship between the depth of financial
development and the growth of real GDP per capita for each country. Additionally,
the study compares the individual-country correlation patterns with the cross-country
correlation between the same variables. The impact of cyclical factors on the
correlations is also assessed. To gain a better understanding of the growth effects of
financial development on each country, the study examines the DEPTH parameter
from individual-country regressions of simple growth models. Overall, the study
focuses on investigating the correlations between financial development and economic
growth in individual countries to uncover any patterns or relationships that may exist.
The conclusion
Analyzing correlations between financial
development and economic growth is essential for organization managers because
it provides valuable insights into the relationship between these two important factors.
Understanding this relationship can help managers make informed decisions about
investment opportunities, financial strategies, and economic policies. Additionally,
this information can help them identify potential risks and opportunities, optimize
financial strategies, and develop more effective policies to support economic growth,
which in turn helps drive organizational success.
The purpose
In a relatively young market whose market discipline mechanisms are not yet fully
established, frauds such as insider trading are prone to happen. By using outlier
detection techniques, this paper could improve fraud detection and help organizations
prevent financial losses. Furthermore, it could help enhance security measures and
protect the business’s stakeholders (such as investors, customers, and employees) from
such frauds.
The method
A boxplot is a graphical representation of a dataset's distribution, showing its
location, spread, and skewness through a five-number summary (minimum, first
quartile, median, third quartile, and maximum). The quantities used here are:
o Median: The value that separates the data into two halves, which means 50% of
the data lies below it and 50% of the data lies above it when arranged in
ascending or descending order. The median is also referred to as the second
quartile (Q2).
o First quartile (Q1) is the value below which 25% of the data fall, while the third
quartile (Q3) is the value below which 75% of the data fall.
o Interquartile range (IQR): The range between Q1 and Q3 (IQR = Q3 - Q1), which
covers the middle 50% of the data.
o An outlier is any value that lies above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR.
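The outlier rule above can be sketched in Python as follows. This is only a minimal illustration and not the procedure used in the cited research: the function name `boxplot_outliers` and the sample volumes are hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with its "inclusive" method (other quartile conventions give slightly different fences).

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), and Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical daily trading volumes (illustrative numbers only).
volumes = [10, 12, 11, 13, 12, 14, 11, 95, 12, 13]
lower, upper, flagged = boxplot_outliers(volumes)
print(lower, upper, flagged)
```

With these made-up numbers, only the single very large volume falls outside the fences; every other value stays inside the whiskers.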
The research used Huang’s insider trading case in 2007 as a case study for this
method. The data covers the transactions in Zhongguancun stock from January 1st to
September 30th, 2007. To study the fraud, the researchers chose the trading
volume, the daily closing price, and the stock turnover rate as the criteria.
The presented figure depicts the trading volume over a period of time, with the
vertical axis representing the trading volume and the horizontal axis representing the
months. The chart exhibits four outliers that fall outside the whiskers, indicating a
significantly high trading volume in January, April, June, and September. Further
analysis reveals that these months correspond to sensitive periods for the company. In
January, a reform of the split share structure was implemented, so this outlier can be
considered inconsequential. During April, a confidential asset replacement plan
between Zhongguancun and Pengtai Company was in progress, and it was noted that
Huang was involved in the decision-making process. In addition, Zhongguancun
underwent a rearrangement process with Pengtai Company in July and August, which
was not disclosed to the public. The findings suggest that from April to June, Huang
utilized his insider position to direct others to purchase approximately 1,000,000,000
shares, and from August to September, he purchased nearly 150,000,000 shares for
himself. Notably, the former chairman of Zhongguancun, Xu, also acquired around
31,660,000 shares during this time. The temporal alignment of the identified outliers
with Huang's purchase patterns further supports the likelihood of his involvement in
insider trading.
Figure 2: The Trend of Daily Closing Price and Turnover Rate
Another method that the researchers used to detect fraud was to examine the daily
closing price and the stock turnover rate. In this figure, the blue line represents the daily
closing price, while the green one represents the daily turnover rate.
The stock turnover rate measures how actively a stock is traded: it is the number of
shares traded over a specific period divided by the number of shares outstanding.
A high turnover rate may indicate increased market activity
and liquidity, while a low turnover rate may suggest lower market interest and
potential price stagnation. Thus, the turnover rate and the closing price are likely to
move consistently with each other.
Normally, the stock turnover rate was between 3% and 7% daily, and it was consistent
with the stock price. However, there were some dates when the turnover rate was
noticeably high, such as in April, June, and September. The sharp fluctuations on
these dates imply that shares were traded in large batches. Moreover,
these periods match the periods of the Huang case. It is worth noting that the
stock price of Zhongguancun fell from May to July (from 12.6 yuan to 6.84
yuan), which contradicts the active trading of the company’s shares. This
information indicates that the possibility of insider trading was high.
The conclusion
Detecting fraud in trading markets has been a top concern for traders, as it is a way
to protect the company’s money and the customers’ interests. This research shows
that the boxplot outlier detection method can effectively flag potential insider
trading activities. The study found that the method can successfully detect abnormal
trading volumes and price fluctuations during sensitive periods of the company. The
study also highlights the need for continuous development of efficient and reliable
methods for detecting insider trading activities to maintain the fairness and
transparency of financial markets.
Variables Explanation
BMI Body mass index, an objective index of body weight (kg/m^2) computed as
weight divided by the square of height. It indicates whether body weight is
relatively high or low relative to height; the ideal range is 18.5 to 24.9.
2. Research question
a. Do policyholders who carry health risks caused by a high BMI or
smoking have to pay more than those who do not?
b. Do beneficiaries who have more children have to pay higher charges?
3. Data Analysis
Descriptive statistics will be used to summarize the collected data. Numerical
techniques such as measures of central location, variability, and linear relationship will
be used to describe a variable or the relationship between two variables and to make
predictions about future trends in the insurance industry. Graphical techniques such as
histograms, box plots, and bar charts will be used to visualize the data. A statistical
software package, such as SPSS, will be used to analyze the data.
Figure 3: Histogram and descriptive statistics of age
As we can see from the figure, the mean is approximately equal to the median
(39.21 compared to 39), and the coefficient of skewness is 0.056, which is close
to 0. Besides, the histogram curve is bell-shaped, so we conclude that the
distribution of age is approximately normal. These results can help firms to more
accurately estimate the probability of claims made by different age groups. By
modeling the age distribution using a normal distribution, the firm can identify which
age groups are more likely to make claims and adjust their premiums accordingly to
account for the increased risk. In addition, the mode of age is 18, so the largest
insured age group is the youngest one, which suggests that young people increasingly
pay for insurance. The firm may target young customers to increase its revenue from
insurance products.
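The measures quoted for age (mean, median, mode, and skewness) can be reproduced with a short script such as the one below. It is a sketch under stated assumptions: the ages are hypothetical rather than the insurance data, the helper `sample_skewness` is our own name, and it uses the adjusted Fisher-Pearson coefficient, which may differ slightly from the formula a given statistical package reports.

```python
import statistics

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness coefficient (needs at least 3 values)."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

# Hypothetical ages, not the insurance data set.
ages = [18, 18, 25, 31, 39, 40, 47, 52, 58, 64]

summary = {
    "mean": statistics.fmean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "skewness": sample_skewness(ages),
}
print(summary)
```

A mean close to the median together with a skewness near 0 points to a roughly symmetric distribution, while a clearly positive skewness points to a long right tail.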
Figure 4: Histogram of BMI of policyholders
To illustrate data for an ordinal variable, BMI values were collapsed into ordinal
categories based on US standards: <18.5 underweight, [18.5, 25.0) normal weight,
[25.0, 30.0) overweight, [30.0, 35.0) class I obesity, [35.0, 40.0) class II obesity, and
≥40.0 class III (extreme) obesity.
Referring to the cumulative statistics, one sees that 20.1% were underweight or normal
weight but that 23.6% (i.e., 100% − 76.4%) were obese.
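The collapsing of BMI values into categories and the cumulative percentages can be sketched as below. The sample values and the names `bmi`, `bmi_category`, and `cumulative_percentages` are hypothetical. The cut-offs follow the categories listed above, with every upper bound treated as exclusive, which is an assumption about the intended boundary at 40.0.

```python
from collections import Counter

# (lower bound, label) pairs in ascending order; upper bounds are exclusive.
BMI_CATEGORIES = [
    (0.0, "underweight"),
    (18.5, "normal weight"),
    (25.0, "overweight"),
    (30.0, "class I obesity"),
    (35.0, "class II obesity"),
    (40.0, "extreme obesity"),
]

def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Return the label of the highest category whose lower bound is <= value."""
    label = BMI_CATEGORIES[0][1]
    for lower_bound, name in BMI_CATEGORIES:
        if value >= lower_bound:
            label = name
    return label

def cumulative_percentages(values):
    """Return (label, cumulative percent) pairs in category order."""
    counts = Counter(bmi_category(v) for v in values)
    running = 0
    table = []
    for _lower_bound, name in BMI_CATEGORIES:
        running += counts[name]
        table.append((name, 100 * running / len(values)))
    return table

# Hypothetical BMI values, not the insurance data set.
sample = [17.9, 22.4, 24.9, 27.3, 29.9, 31.0, 33.5, 36.2, 41.7, 26.0]
for name, percent in cumulative_percentages(sample):
    print(f"{name}: {percent:.1f}%")
```

The share of obese policyholders is then 100% minus the cumulative percentage at the "overweight" row, which is the same subtraction used in the text.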
The firm may increase premiums. If the data suggest that
policyholders who are obese are more likely to file claims, the firm may consider
increasing premiums for these individuals. While this may be contentious, it may
help to offset the expense of reimbursements for obesity-related health conditions.
However, instead of simply increasing premiums for policyholders who are obese,
the firm could consider offering discounts to those who adopt healthy lifestyle choices.
For example, it could offer discounts to those who participate in regular exercise,
attend weight loss programs, or track their food intake. This approach incentivizes
healthy behaviors rather than penalizing those who are struggling with obesity.
As we can see from the statistical table, the mean is much larger than the median
and the coefficient of skewness is 1.516 (greater than 0), so the distribution of
the amounts paid is positively skewed. From the chart, it can be seen that
policyholders paying a small or medium amount (ranging from 2000 to 15000)
account for the majority. This can help the company identify potential
customers who have a low average contract value. From there, the company can offer
policies to expand the group of customers who want to buy expensive insurance.
Besides, the firm needs to retain and take better care of its main customers.
From these box plots, it is quite obvious that smokers pay higher insurance
premiums. Therefore, we can say that smoking is a characteristic that clearly affects
policyholders' charges. Besides, we see a slight increase in insurance premiums as the
number of children increases, but a decrease when there are 4 or more children,
which is surprising. A possible explanation for this result is the insurance
policy, which states that when a family has more than three children under the age of 21,
the policyholder pays only for the three oldest. This effectively means that a policyholder
with 4 or more children only has to pay a premium for three children. For this
explanation to hold, we assume that the policyholders in this data only have children
under the age of 21. The sex
and region variables seem to show no effect on the insurance premium.
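The comparison read from the box plots can also be checked numerically by computing the median charge per group. The sketch below uses hypothetical records and a hypothetical helper `median_by_group`; the field names `smoker`, `children`, and `charges` mirror the variables in the report, but the values are made up.

```python
import statistics
from collections import defaultdict

def median_by_group(records, group_key, value_key="charges"):
    """Group records by record[group_key] and return the median of value_key per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {group: statistics.median(values) for group, values in groups.items()}

# Hypothetical policyholder records for illustration only.
policyholders = [
    {"smoker": "yes", "children": 0, "charges": 32000},
    {"smoker": "yes", "children": 2, "charges": 38000},
    {"smoker": "no", "children": 0, "charges": 4000},
    {"smoker": "no", "children": 1, "charges": 6500},
    {"smoker": "no", "children": 2, "charges": 8000},
    {"smoker": "no", "children": 4, "charges": 5500},
]

by_smoker = median_by_group(policyholders, "smoker")
by_children = median_by_group(policyholders, "children")
print(by_smoker)
print(by_children)
```

The median is used here because it is the center line of each box in a box plot and is less sensitive to extreme charges than the mean.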
Insurance companies need to assess the risk of insuring a person, which includes
evaluating the likelihood of the person making a claim. Smokers are considered to be
at higher risk of developing health issues, such as heart disease, lung cancer, and
respiratory illnesses, which can lead to higher insurance claims. In addition,
insurance companies are businesses that need to make a profit. By charging higher
premiums for smokers, they can offset the higher risk of insurance claims and
maintain profitability.
In order to investigate the relationship between BMI and charges paid, we apply the
coefficient of correlation. The value of Pearson’s r, i.e., the correlation
coefficient, is the Pearson Correlation figure (inside the red box in the output above),
which in this case is 0.198. This indicates a weak positive linear relationship between
BMI and charges.
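For reference, Pearson's r can be computed directly from its definition, as in the sketch below. The function name `pearson_r` and the paired values are hypothetical, so the result will not equal the 0.198 reported for the real data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)

# Hypothetical paired observations (BMI, charges) for illustration only.
bmi_values = [22.0, 27.5, 31.2, 35.8, 24.3, 29.9]
charges = [4200, 6100, 3900, 12500, 8800, 5200]
r = pearson_r(bmi_values, charges)
print(round(r, 3))
```

The coefficient always lies between -1 and 1; values near 0 indicate a weak linear relationship, and values near the extremes indicate a strong one.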
The insurance firm may consider one of the following decisions. One possibility is to
raise rates for consumers with a higher BMI or who smoke. This would help to
balance the higher risk while keeping the insurance firm profitable. However, it may
make insurance too costly for some people. Another option is to create targeted policies
that are designed specifically for individuals with a higher BMI or who smoke. This
insurance may carry higher premiums, but it may also provide extra benefits such as
coverage for weight reduction programs or smoking cessation assistance. A third
option is to refuse coverage for individuals who have a higher BMI or who smoke.
This could be a controversial decision, but it would effectively eliminate the risk
associated with these individuals.
Conclusion
The data set gives us a basic picture of personal medical costs through
the characteristics and the relationships of 7 variables, which are age, gender, BMI,
children, smokers, region, and charges. Based on the result that policyholders with a
higher risk of health problems caused by either a high BMI or smoking have to pay
higher charges, the firm may increase premiums, create targeted policies, or refuse coverage.
Besides, we find that if a family has more than three children under the age of 21,
the policyholder only pays for the three oldest according to insurance policies. This information can
be used to tailor insurance products and services to better meet the needs of different
customer segments. By analyzing descriptive statistics such as mean, median, mode,
and standard deviation, insurance firms can better understand the characteristics of
their insured population. Descriptive statistics can help insurance firms identify trends
and patterns in their data, such as changes in claim frequency or severity over time.
This information can be used to adjust insurance pricing and underwriting practices to
better manage risk.
IV. Conclusion:
The purpose of descriptive statistics is to provide an overview of the characteristics of a dataset, such
as its central tendency, variability, and distribution. The various measures used in
descriptive statistics, including measures of central tendency, measures of variability,
and measures of shape, are critical tools that help analysts make sense of data and
draw meaningful conclusions from it.
Descriptive statistics is widely used in many fields, including finance, medicine, social
sciences, and engineering, among others. It is an essential tool in data analysis and
reporting, and it is often the first step in any statistical analysis. Moreover, it provides
a framework for the more advanced statistical analyses, including inferential statistics
and predictive modeling.
Overall, descriptive statistics is a critical tool for any researcher or analyst who wants
to understand and interpret data accurately. By using the right descriptive statistics,
one can gain valuable insights into a dataset and make informed decisions based on
those insights. As such, descriptive statistics will remain an essential tool in data
analysis for many years to come.
REFERENCES
Procedia Computer Science, Volume 91, 2016, Pages 245-251, ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2016.07.069.
3. Arun Raj. R., & Dr. Hareesh N Ramanathan. (2022). A Study on Work-Life
Balance of Paramedical Employees with Special Reference to A Private
Hospital. Indian Journal of Commerce and Management Studies, 3(3), 74–79.
Retrieved from https://ijcms.in/index.php/ijcms/article/view/536
4. Insurance Premium Data