
Management of Health

Infrastructures
Degree of Biomedical Engineering
Universitat Rovira i Virgili
Hatem A. Rashwan

1
In Loving Memory
and Gratitude
• Before we delve into today's
lecture, I would like to take
a moment to remember and
express our heartfelt
gratitude to the esteemed
professor who previously
taught this course.
• Unfortunately, Dr. David
Riaño is no longer with us,
but his influence and
dedication continue to
inspire us.

2
Healthcare data management is critical for improving
patient care, patient safety, operational efficiency, research,
population health management, and evidence-based
decision-making for healthcare organizations.

3
Presentation
• Computer Science (CS) and Artificial
Intelligence (AI) technologies are being
incorporated into health care centers for
better treatment of patients. Database
and knowledge-base technologies are at
the core of this revolution.
• In this course, we will introduce some of
the most relevant standards for clinical
data codification, data structuration,
electronic health records, and semantic
annotation. Moreover, the course will
introduce notions on data analysis,
knowledge representation and
management in medicine, and will
present decision support systems and
artificial intelligence (AI) techniques in
medicine.
4
Contents
1. Introduction
2. Clinical Data
   – Data Sources
   – Types of Data, Variables, and Transformations
   – Data Standards: Coding Systems and Terminologies
   – Patient Records
   – Interoperability
3. Clinical Data Analysis
   – A Data Science Project
   – Data Pre-Processing
   – Statistical Clinical Data Analysis
   – Artificial Intelligence Clinical Data Analysis
   – Quality of the Analysis
4. Clinical Knowledge
   – Knowledge Representation: An Introduction
   – K. Representation: FOL
   – K. Representation: Production Rules
   – K. Representation: Objects
   – K. Representation: Ontologies
   – Knowledge Life Cycle
5. Clinical Decision Support Systems
   – Differential Diagnosis Generators
   – Drug Interaction Checkers
   – Alarm and Surveillance Systems
6. Ethics and Security Issues
   – Spanish Legislation on health records
   – Code of ethics of Health Inf. Systems
   – General Data Protection Regulation
7. Conclusions

5
1. Introduction
• Health Care and Health Care Systems
Def (Health Care, HC): Efforts made to maintain or restore physical, mental, or
emotional well-being, especially by trained and licensed professionals.
Def (Health Care System): The organization of people, institutions, and resources
that deliver health care services to meet the health needs of target populations.
• Health Care Center
Def (Health Care Center, HCC): a building or establishment housing local medical
services or the practice of a group of doctors.

We will distinguish between three types of Health Care Centers:
Primary HCC: Place where family doctors, general practitioners, and other clinical
staff provide first-contact care.
Secondary HCC: Place where one or more specialists and additional staff provide care.
Tertiary HCC: Place where the highest level of specialization is provided (General Hospitals).
6
People in Health Care
Def (Patient): A person under health care.
Alternative classifications of patients:
– Acute/Chronic patient: a patient whose health
condition develops suddenly and lasts a short
time –days or weeks (acute), or whose
condition evolves slowly and may worsen over
an extended period of time –months, years, or
forever (chronic).
– Inpatient/Outpatient: a patient who is admitted
to a hospital with a doctor’s order (inpatient), or
who is not hospitalized (outpatient). Home-care
patients, who receive care at home, are a type
of outpatient.
• Health Care Human Resources and
Professionals: physicians, nurses,
administration, etc.

7
Computer Technology in Health Care
Def (Information System): an integrated set of components for
collecting, storing, and processing data and for providing
information, knowledge, and digital products.
• Data and Information
– Single Clinical Data Units: Textual vs. Codified
– Information Structures: Unstructured (Textual), Semi-structured (Forms), Structured
(EHR).
– Data uses
• Primary Use: Clinical Care
• Secondary Use: Research
– Clinical Data Analysis
• Statistical
• Machine Learning (Artificial Intelligence)
[Figure: DATA (“39”) —units + meaning→ INFORMATION (“Temp = 39”) —generalization→ KNOWLEDGE (“Temp > 38 => Fever”)]
• Knowledge
– Medical Knowledge Representation
– Clinical Decision Support System (CDSS): health information technology system that is
designed to provide physicians and other health professionals with clinical decision
support (CDS), that is, assistance with clinical decision-making tasks.

8
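The data → information → knowledge hierarchy on the slide can be sketched in a few lines of Python; the threshold comes from the slide's fever rule, while the function name and dictionary shape are illustrative:

```python
# Sketch of the data -> information -> knowledge hierarchy from the slide.
# The value 39 alone is data; attaching units and meaning ("Temp = 39 °C")
# makes it information; a general rule ("Temp > 38 => Fever") is knowledge.

FEVER_THRESHOLD_C = 38.0  # threshold taken from the slide's example rule


def interpret_temperature(value_c: float) -> dict:
    """Attach meaning (information) and apply a rule (knowledge) to raw data."""
    return {
        "data": value_c,                        # raw measurement
        "information": f"Temp = {value_c} °C",  # data + units + meaning
        "knowledge": "Fever" if value_c > FEVER_THRESHOLD_C else "No fever",
    }


print(interpret_temperature(39.0)["knowledge"])  # Fever
```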
Simplified Global View

[Diagram: the Health Care System at three levels.
Upper level: Research Unit (Clinical Data Analysis) receives data from computers (secondary use).
Intermediate level: the Decision Support System (CDSS) uses knowledge (Clinical Practice Guidelines) to give support and exchanges data with the Hospital Information System (HIS, Information Structures); devices provide monitoring data (primary use) and the CDSS returns recommendations.
Basic level: Health Care Professionals (HCP) use resources and data to deliver care to the patient within the Health Care Center (HCC).]
10
2. Clinical Data
• Data Sources in Health Care
• Data Uses in Health Care
• Types of Data
• Variables
• Data Transformations
• Wrong, Missing, and Censoring Data
• Big Data in Health Care
• Standards of Biomedical Data
• Patient Records
• Interoperability
• EHR Systems
11
Data Sources in Health Care
• Primary Source: data producers (where the data is produced)
– Visits and professional encounters
– Monitoring Equipment (e.g., ECG, Pulse Oximeter)
– Laboratories (e.g., blood analysis)
– Clinical Center Devices (e.g., X-Ray)
– Biosensors and wearable devices (e.g., sensors)
– Internet of Things (IoT) (e.g., smart watches)
– Etc.
• Secondary Source: data storages (where the data is stored)
– Health Care Record
– Repositories
– Etc.

12
Primary Data Sources in Health Care

[Figure: primary data sources around the patient, including insurance companies.]

13
Secondary Data Sources in Health Care
Def. (Health Care Record, HR): Structured and systematic documentation about one
single patient’s medical history and care over time.
Def (Clinical Data Repository, CDR): Real-time database that consolidates data from a
variety of clinical sources to present a unified view of a single patient. It is optimized
to allow clinicians to retrieve data for a single patient rather than to identify a
population of patients with common characteristics or to facilitate the management of
a specific clinical department.
– Clinical Data Warehouse (CDW) including the CDR and HR.

14
Secondary Data Sources in Health Care
– Clinical Data Warehouse (CDW)
Topic                   | Clinical Data Repository                     | Clinical Data Warehouse
Specificity of the data | Detail-oriented, focused on the individual patient | Aggregated data (summarized to decision-making levels)
User’s data access      | Read/write                                   | Non-volatile, read-only access
Updates                 | Real-time from operational systems           | Periodically (static) by operational systems
Data normalization      | Normalized data, no redundant data           | De-normalized and often redundant data
Data contained          | Integrated clinical data (only)              | Integrated operational, clinical, and financial data
Data comes from         | Clinical systems                             | Clinical, financial, and administrative systems

• Big Data: According to Oracle, Big data is data that contains greater variety arriving
in increasing volumes, with ever-higher velocity, and of questionable veracity.
These are known as the four Vs. 15
CDW Basic Architecture

16
Data Uses in Health Care

• Primary Use (Patient Care): the main purpose is to deliver health care to the
patient. Access rights by caregivers are implicit (no need to ask the patient for
permission).
• Secondary Use (Data Exploitation): data is used for purposes other than health
care provision (e.g., research), normally from a secondary source. Possible uses:
quality assurance, clinical and medical research, public health, etc. Access rights
must be explicit and subject to the law (you need to ask the patient for
permission).

17
Types of Data: Clinical Sense
• According to the Health Sciences Library at the University of
Washington, clinical data falls into six major types:
– Health care record data: Data are obtained, for a single patient, at the point of care at a
medical facility, hospital, clinic or practice. It is generally not available to outside
researchers, and the data collected includes administrative and demographic
information, diagnosis, treatment, prescription drugs, laboratory tests, physiologic
monitoring data, hospitalization, patient insurance, etc.
– Administrative data: Often associated with electronic health records, these data are
primarily hospital discharge information (summary) reported to a government agency.
– Claims data: Claims data describe the billable interactions (insurance claims) between
insured patients and the healthcare delivery system. Claims data falls into four general
categories: inpatient, outpatient, pharmacy, and enrollment.
– Patient-Disease registry data: These registries are clinical information systems that track
a narrow range of key data for certain chronic conditions such as Alzheimer's disease,
cancer, diabetes, heart disease, and asthma. Registries often provide critical information
for managing patient conditions.
– Health surveys data: In order to provide an accurate evaluation of the population
health, national surveys of the most common chronic conditions are generally
conducted to provide prevalence estimates. These surveys are specific for research
purposes and policy decisions.
– Clinical trials data: Data on a subset of subjects for the purpose of testing a treatment
before it is introduced in a health care system.

https://guides.lib.uw.edu/hsl/data/findclin 18
Data Types: A Complete View
[Figure: data sources feeding the clinical record — Patient, Professional, Center, Devices, Logistics, Agencies.]
19
Types of Data: Format
• Unstructured data — Images: X-ray, Computed Tomography (CT), Mammography,
Positron-Emission Tomography (PET), Magnetic Resonance Imaging (MRI),
Ultrasound, pathology images.
• Unstructured data — Charts: Electrocardiogram (ECG, EKG), Spirometry,
Electroencephalogram (EEG), Glycemic index (GI) chart.
• Structured data — Values and Variables: values can be obtained directly from the
patient (e.g., temperature) or from the analysis of an image/chart (e.g., cancer
stage). Values are stored in variables.

21
Cumulative Data
• Clinical structured data sometimes contains both
cumulative and non-cumulative data.
• The difference is that
– cumulative data displays the total amount of
information gathered over a period of time,
– whereas non-cumulative data shows the amount
of information gathered at a certain point in
time.

22
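The distinction can be illustrated with a short Python sketch; the daily admission counts are made-up values, not from the slides:

```python
# Minimal sketch: the same daily admission counts viewed as non-cumulative
# (per-day values) and cumulative (running total over the period).
from itertools import accumulate

daily_admissions = [3, 5, 2, 4]                  # non-cumulative: value at each point in time
cumulative = list(accumulate(daily_admissions))  # total gathered over the whole period
print(cumulative)  # [3, 8, 10, 14]
```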
Cumulative Data
• Count(*): indicates a quantity. e.g., number of female patients.
• Ratio: a number divided by another number (e.g., body mass
index BMI = weight/height2, in kg/m2)
• Proportion: a ratio of counts where the numerator is a subset
of the denominator (e.g., 30 patients out of 50 are depressed,
30/50 or 0.6) – Range [0.0, 1.0]
• Percentage: proportion expressed as a percentage (e.g., 0.6 is
expressed as 60%) – Range [0.0, 100.0]
• Risk: a proportion where the numerator counts events that
happen prospectively (e.g., 80 patients started the clinical trial
but only 50 remain, the risk of censoring is 30/80 or 37.5%).
• Rate: proportion that involves a time (e.g., an ICU has a
mortality rate of 10% if in one year it receives 1500 patients of
which 150 died).

(*) Be careful: absolute numbers (counts) are not related to the universe of work. For example, knowing that 50 female
nurses smoke versus 3 male nurses does not necessarily mean that female nurses smoke more than male nurses.
23
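These measures can be recomputed directly from the slide's examples in Python; the BMI inputs (70 kg, 1.75 m) are illustrative values not given in the slide:

```python
# Recomputing the slide's cumulative measures; the proportion, risk, and rate
# numbers come from the examples above, the BMI inputs are assumed.
bmi = 70 / (1.75 ** 2)        # ratio: weight/height^2 in kg/m^2
proportion = 30 / 50          # 30 depressed patients out of 50 -> 0.6
percentage = proportion * 100  # 60%
risk = 30 / 80                # 30 of 80 trial patients censored -> 0.375 (37.5%)
mortality_rate = 150 / 1500   # 150 deaths among 1500 ICU patients in a year -> 0.10

print(round(proportion, 2), round(risk, 3), mortality_rate)
```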
Variables
• Clinical Variables according to their relationship:
– Independent: we can control or change (e.g., dosage)
– Dependent: variable that we measure (e.g., temperature)

• Types of variables according to their possible values (domain):

24
Variables
• Types of variables according to their possible values (domain):
– Quantitative or Numeric: there’s a distance function
defined on all the pairs of values of the variable (e.g.,
temperature, heart rate).
• Continuous: “all” values are possible (e.g., temperature =
37.4957 °C)
• Discrete: only “some” values are possible (Integer) (e.g., HR = 100 beats
per minute, but not 100.35 bpm)
– Qualitative or Categorical: they don’t have a numerical
interpretation (e.g., gender, pain score).
• Nominal: their values are labels (e.g., gender, drug-name)
• Ordinal: their values represent an order (e.g., pain score 0-10, but
distance(2,4) is not necessarily equal to distance(8,10)).

25
Changing Variable Types
[Diagram: a Variable is either Quantitative (Numeric) — Continuous or Discrete — or Qualitative (Categorical) — Nominal or Ordinal, with Binary and N-ary subtypes. Discretization converts Quantitative to Qualitative; Continuity converts Qualitative to Quantitative.]
26
Data Transformation I: Numeric to …
• Numeric to (Binary) Categorical: based on whether the data satisfies a restriction or not.
E.g., has the patient got a temperature > 37.5 °C?
• Numeric to (n-Ary) Categorical: discretization process
– By value (careful): E.g., family group (gi) depending on the number of values (i).
0.36668, 0.36667, 1.7823, 1.9995 (continuous) → {0, 0, 1, 2} (discrete)
0,0,0,1,1,1,1,1,2,2,3,4,4,5,5 (discrete) → g0,g0,g0,g1,g1,g1,g1,g1,g2,g2,g3,g4,g4,g5,g5 (nominal)
0,0,0,1,1,1,1,1,2,2,3,4,4,5,5 (discrete) → 0,0,0,1,1,1,1,1,2,2,3,4,4,5,5 (ordinal)
– By rounding or by truncation: E.g., we’re interested in the rounded temperature, not in the
decimals (or in the tens and not in the units, etc.)
37.51,38.3,36.47,39.0,39.4,36.8,37.8 (numeric) → (round) 38,38,36,39,39,37,38 / (truncation) 37,38,36,39,39,36,37 (discrete, ordinal)
– By user specification: E.g., allowed ages are baby (<2y), infant (2-16y), young (17-22y), adult (23-65y),
elder (>65y).
1,2,2,5,12,13,18,18,20,30,35,35,40,70,76,76 (numeric) → B,I,I,I,I,I,Y,Y,Y,A,A,A,A,E,E,E (categorical)
– By size: E.g., separate values in groups of five (5) values, keeping equal values in the same group.
1,2,2,5,12,13,18,18,20,30,35,35,40,70,76,76 (numeric) → (5) g1,g1,g1,g1,g1,g2,g2,g2,g2,g2,g3,g3,g3,g3,g4,g4 (nominal)
1,2,2,5,12,13,18,18,20,30,35,35,40,70,76,76 (numeric) → (5) 1,1,1,1,1,2,2,2,2,2,3,3,3,3,4,4 (ordinal)
– By binning: E.g., separate values in five equally wide intervals (or bins).
1,2,2,5,12,13,18,18,20,30,35,35,40,70,76,76 (numeric) → (76-1)/5=15 → [1,16),[16,31),[31,46),[46,61),[61,76] →
g1,g1,g1,g1,g1,g1,g2,g2,g2,g2,g3,g3,g3,g5,g5,g5 (categorical; no value falls in g4)
– By frequency: E.g., separate values in five equally frequent intervals.
1,2,2,5,12,13,18,18,20,30,35,35,40,70,76,76 (numeric) → 16/5 ≈ 3 values per bin →
g1,g1,g1,g2,g2,g2,g3,g3,g3,g4,g4,g4,g5,g5,g5,g5 (nominal; the last group absorbs the leftover value)
27
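Discretization by user specification can be sketched with a sorted list of boundaries; the age groups follow the example above, and the exact boundary handling (upper bounds exclusive) is an assumption:

```python
# Sketch of discretization "by user specification" using the slide's age groups:
# baby (<2), infant (2-16), young (17-22), adult (23-65), elder (>65).
import bisect

BOUNDARIES = [2, 17, 23, 66]  # first age belonging to the NEXT group
LABELS = ["B", "I", "Y", "A", "E"]


def age_group(age: int) -> str:
    """Return the label of the group the age falls into."""
    return LABELS[bisect.bisect_right(BOUNDARIES, age)]


ages = [1, 2, 2, 5, 12, 13, 18, 18, 20, 30, 35, 35, 40, 70, 76, 76]
print([age_group(a) for a in ages])
```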
Discretization methods (input: the list of all N values, with repetitions, in order):

BY USER SPECIFICATION
Input: boundary values v1, v2, …, vk.
Output: k+1 groups — g1: x ≤ v1, g2: v1 < x ≤ v2, …, gk+1: x > vk
(or, alternatively, g1: x < v1, g2: v1 ≤ x < v2, …, gk+1: x ≥ vk).

BY SIZE
Input: size of each group = n.
Output: ~N/n groups of about n values each. Equal values go to the same group,
which is why not all groups are necessarily of size n, particularly the last one.

BY BINNING
Input: number of bins = n.
Output: n groups g1: m1 ≤ x ≤ M1, g2: m2 < x ≤ M2, …, gn: mn < x ≤ Mn, where
Max and min are the greatest and lowest values in the list, S = (Max − min)/n is
the size of each bin, and [mi, Mi] = [min + (i−1)·S, min + i·S] is the i-th group.

BY FREQUENCY
Input: number of bins = n.
Output: n groups of about N/n values each. Equal values go to the same group,
which is why not all groups are necessarily of size N/n.
28
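The binning rule above (bin width S = (Max − min)/n, i-th bin [min + (i−1)·S, min + i·S]) can be sketched as follows; the function name is illustrative, and the maximum value is assigned to the last bin:

```python
# Sketch of equal-width binning: each value x goes to bin floor((x - min)/S) + 1,
# where S = (Max - min)/n; the maximum value is folded into the last bin.

def equal_width_bins(values, n):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    # bin i covers [lo + (i-1)*width, lo + i*width); Max falls into bin n
    return [min(int((x - lo) / width), n - 1) + 1 for x in values]


values = [1, 2, 2, 5, 12, 13, 18, 18, 20, 30, 35, 35, 40, 70, 76, 76]
print(equal_width_bins(values, 5))  # no value lands in bin 4, as on the slide
```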
Data Transformation II: Categorical to …
• Binary to Continuous: one of the binary values (false) is converted to 0,
the other one (true) is converted to 1.
• n-Ary to Numeric: conversion processes
– Unique Integers (careful): E.g., drugs acting on the cardiovascular system are
codified as numbers.
Antihypertensives (2), Diuretics (3), Peripheral vasodilators (4), Vasoprotectives (5), Beta blocking
agents (7), Calcium channel blockers (8), Agents acting on the renin–angiotensin system (9), Lipid
modifying agents (10), …
– Dummy Coding (without comparison group): E.g., the use of a drug d in a
treatment is marked with 1/0 in the corresponding new numeric column d.
d1, d2, d1, d1, d3, d4, d2, d4, d3 → 1000, 0100, 1000, 1000, 0010, 0001, 0100, 0001, 0010
– Dummy Coding (with comparison group g): E.g., the use of a drug d≠g in a
treatment is marked with 1/0 in the corresponding new numeric column d. Drug g
does not have a column and is coded with all columns to 0.
d1, d2, d1, d1, d3, d4, d2, d4, d3 (comparison group d2) → 100, 000, 100, 100, 010, 001, 000, 001, 010
– Effect Coding (with comparison group g): E.g., the use of a drug d≠g in a treatment
is marked with 1/0 in the corresponding new numeric column d. Drug g does not
have a column and is coded with all d≠g columns to −1.
d1, d2, d1, d1, d3, d4, d2, d4, d3 (comparison group d2) → 100, -1-1-1, 100, 100, 010, 001, -1-1-1, 001, 010

29
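The three coding schemes can be sketched as small helper functions; `dummy_code` and `effect_code` are hypothetical names for this illustration, not a library API:

```python
# Sketch of dummy and effect coding for a categorical drug variable.

def dummy_code(seq, levels, comparison=None):
    """One 0/1 column per level; the comparison level (if given) gets all zeros."""
    cols = [l for l in levels if l != comparison]
    return [[1 if x == l else 0 for l in cols] for x in seq]


def effect_code(seq, levels, comparison):
    """Like dummy coding with a comparison group, but that group is coded -1 everywhere."""
    cols = [l for l in levels if l != comparison]
    return [[-1] * len(cols) if x == comparison
            else [1 if x == l else 0 for l in cols]
            for x in seq]


drugs = ["d1", "d2", "d1", "d3", "d4"]
print(dummy_code(drugs, ["d1", "d2", "d3", "d4"]))          # 4 columns
print(dummy_code(drugs, ["d1", "d2", "d3", "d4"], "d2"))    # 3 columns, d2 -> 0 0 0
print(effect_code(drugs, ["d1", "d2", "d3", "d4"], "d2"))   # d2 -> -1 -1 -1
```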
Worked example: categorical sequence (11 observations): V1, V2, V3, V1, V2, V1, V3, V2, V2, V3, V1.

UNIQUE INTEGERS
Conversion table: V1→1, V2→2, V3→3, V4→4.

DUMMY CODING (without comparison group)
Conversion table: V1→(1,0,0,0), V2→(0,1,0,0), V3→(0,0,1,0), V4→(0,0,0,1).

DUMMY CODING (with comparison group = V2)
Conversion table: V1→(1,0,0), V2→(0,0,0), V3→(0,1,0), V4→(0,0,1).

EFFECT CODING (with comparison group = V2)
Conversion table: V1→(1,0,0), V2→(−1,−1,−1), V3→(0,1,0), V4→(0,0,1).

https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
30
Benefits of Discretization and Continuity
… Discretization
The goal of discretization is to reduce the number of values a continuous variable
assumes by grouping them into a number of n intervals or bins.
• Data space reduction (memory saving): R → n
• Data density increment (sparseness reduction)
• Data simplification (improved interpretation)
• Application of discrete methods
• Easier to model (find models)
• Reduces unnecessary precision or overfitting (e.g., values captured by patient
monitors in ICUs are too precise)
• Can improve the performance of patient classification
• Etc.
… Continuity
The goal of continuity is to convert a nominal variable into a numeric variable by
projecting it into a continuous space.
• Application of numeric methods (regression)
31
Wrong, Missing, and Censoring Data
• Wrong data: The value of the data is incorrect because …
– The person introducing it made a mistake (e.g., age=38 instead of age=83,
both ages are possible)
– Wrong default values (e.g., patient married = true by default)
– Values out of range (e.g., heart rate = 09 = 9 when the range is [70, 180])
• Missing data: The value of the data is unknown because …
– A clinical device is disconnected
– The data was not taken
– The data is protected/private/unavailable (for some individuals)
• Censoring data: The value of the data is partially unknown because …
– The data was approximated during introduction (e.g., age of the patient)
– The device does not reach the values measured (e.g., a scale registering
weights up to 150 kg when the patient weighs more).

32
Wrong, Missing, and Censoring Data
• Dealing with wrong, missing, and censoring data:
1. Detect: not always possible
• Values out of range
• The average of the sample is different from epidemiological/population
studies
2. Correct: There are several alternatives, some of which are …
• Remove instances (high % of wrong/missing/censoring) (Complete Cases)
• Remove feature (high % of wrong/missing/censoring)
• Encoding: transform it to special values (e.g., -1, 99999, 0)
• Imputation (Replace): introduce the mean/median/mode value or
min/max
• Imputation (Predict): calculate the value with regression, k-nearest
neighbors (KNN), or linear interpolation (LI) algorithms
• Leave as NA (NaN or ?) and use algorithms capable of managing
missing values (e.g., XGBoost supports missing values by default)

33
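Two of the correction strategies listed above, mean imputation and complete-case removal, can be sketched in plain Python; the toy heart-rate values and helper names are illustrative:

```python
# Minimal sketch of two correction strategies: mean imputation (replace missing
# values with the mean of the observed ones) and complete-case analysis
# (remove any instance with a missing value), using None for missing data.

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def complete_cases(rows):
    """Drop any row (instance) that contains a missing value."""
    return [r for r in rows if None not in r]


heart_rates = [72, None, 88, 80, None]
print(impute_mean(heart_rates))   # [72, 80.0, 88, 80, 80.0]
print(complete_cases([[72, 36.5], [None, 37.0], [88, None]]))  # [[72, 36.5]]
```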
Wrong, Missing, and Censoring Data

34
Big Data in Health Care

Source: Minor L.B, Harnessing the Power of Data in Health, Stanford University School of Medicine, 2017. 35
The four Vs of Big Data

Biomedical Data
satisfies all four Vs

36
Let’s put some order …

37
Standards of Biomedical Data

• Codification, Terminologies, and Ontologies


• Electronic Health Records
• Interoperability

38
Codification, Terminologies, and Ontologies
Clinical Data Standards: Standards like ICD, SNOMED CT, LOINC, and CPT are
used to codify clinical information, diagnoses, procedures, and laboratory
tests, ensuring consistent terminology across healthcare systems.

• International Classification of Diseases (ICD): ICD9, ICD10, ICD11


• Diagnostic Related Groups (DRG)
• International Classification of Primary Care (ICPC-2)
• Classification of Functioning, Disability and Health (ICF)
• Current Procedural Terminology (CPT, HPCPS)
• Anatomical Therapeutic Chemical Classification System (ATC)
• Logical Observation Identifiers Names and Codes (LOINC): identifies health measurements, observations, and documents
• SNOMED CT and the Unified Medical Language System (UMLS)

39
What do we want to code?
Health Care Processes divide into Biomedical/Clinical Processes — Case Description
(patient signs & symptoms), Case Classification (patient diagnosis), Case Management
(patient treatment) — and Administrative Processes.

Standard | Description | Classification | Management | Administrative
ICD      |             | X              |            | X
DRG      |             |                | X          | X
ICPC     | X           | X              | X          |
ICF      | X           |                |            |
CPT      |             |                | X          |
ATC      |             |                | X          |
LOINC    | X           |                |            |
SNOMED   | X           | X              | X          | X
40
Why do we want to code?
1. Standardization of health, clinical vocabulary, and meaning
2. Facilitates communication and multiple languages
3. Improves reports’ precision
4. Normalization of data structures and cross reference
5. Contributes to reduce error, inaccuracy, and
misunderstanding
6. Billing and cost analysis
7. Secondary use of data
8. Facilitates automated applications and data exchange and
sharing

41
International Classification of Diseases (ICD)

Def (ICD): The International Statistical
Classification of Diseases and Related Health
Problems (commonly known as the ICD) provides
alpha-numeric codes to classify diseases and a
wide variety of signs, symptoms, abnormal
findings, complaints, social circumstances and
external causes of injury or disease.

ICD-11 code format: NN AANN.X
Example: 11 BA00.1 Isolated diastolic hypertension
11 Diseases of the circulatory system
BA00 Essential hypertension
BA00.1 Isolated diastolic hypertension
42
ICD Versions
• ICD-9 (1978): 17,000 codes
ICD-9-CM (CM stands for Clinical Modification) is based on
the ICD-9 but it provides for additional morbidity detail. It
consists of 3 volumes: V1-V2 for diagnosis codes, V3 for
codes about surgical, diagnostic, and therapeutic procedures.
– http://icd9.chrisendres.com/
• ICD-10 (1990): +155,000 codes
ICD-10-CM for diagnosis codes, replacing volumes 1 and 2.
ICD-10-PCS (PCS stands for Procedure Coding System) for
procedure codes, replacing volume 3.
• ICD-11 (2018)
• https://www.findacode.com/

Used abbreviations: NEC (Not Elsewhere Classifiable), NOS (Not Otherwise Specified).
43
ICD-9-CM
Code legend — A: alphabetic (A,B,C,…); N: numeric (0,1,2,…); X: alphanumeric (A ∪ N)

• http://icd9.chrisendres.com/
• http://www.icd9data.com/
• https://www.findacode.com/

• 17,000 codes

• Volumes 1-2 - Diseases:
– 17 chapters + 2 supplementary
– Format: NNN.N[N]
– Examples:
• 540.9 - Acute appendicitis without mention of peritonitis
• 410.00 - Acute myocardial infarction of anterolateral wall, episode of care unspecified
• Volume 3 - Procedures:
– 18 chapters
– Format: NN.N[N]
– Examples:
• 30.1 - Hemilaryngectomy
• 38.34 - Resection of vessel with anastomosis, aorta

Example entry:
493.2 Chronic obstructive asthma [0-2]
Asthma with chronic obstructive pulmonary disease [COPD]
Chronic asthmatic bronchitis
Excludes:
acute bronchitis (466.0)
chronic obstructive bronchitis (491.20-491.22)
44
ICD-10-CM & ICD-10-PCS

• http://www.icd9data.com/
• https://www.findacode.com/

• +155,000 codes

• ICD-10-CM - Diseases:
– 21 chapters
– Format: ANN.N[NNNA]
– Examples:
• K40.1 Bilateral inguinal hernia, with gangrene
• L20.81 - Atopic neurodermatitis
• M24.541 - Contracture, right hand
• M24.542 - Contracture, left hand
• S12.110A - Anterior displaced Type II dens fracture,
initial encounter for closed fracture
• ICD-10-PCS - Procedures:
– 17 chapters
– Format: XXXXXXX
– Examples:
• 0016070 Bypass Cerebral Ventricle to Nasopharynx
with Autologous Tissue Substitute, Open Approach
• 30243G0 - Transfusion of Autologous Bone Marrow
into Central Vein, Percutaneous Approach
• BV18ZZZ Fluoroscopy of Vasa Vasorum

45
ICD-11

• https://icd.who.int/browse11/l-m/en
• https://www.findacode.com/

• Will come into effect in 2022

• Diseases:
– 28 chapters
– Format: XXXX.X[X]
– Examples:
• 1C12.0 Whooping cough due to Bordetella
pertussis
• KB00.0 Perinatal arterial stroke
• HA01.10 Male erectile dysfunction, lifelong,
generalized
• LA85.20 Double outlet right ventricle with
subpulmonary ventricular septal defect,
transposition type

46
Overall comparison of ICD9 and ICD10
• Extension: Issues today with the ICD-9 diagnosis
and procedure code sets are addressed in ICD-10.
• More detailed Clinical information (laterality,
temporality, etc.): One concern today with ICD-9 is
the lack of specificity of the information conveyed
in the codes. For example, if a patient is seen for
treatment of a burn on the right arm, the ICD-9
diagnosis code does not distinguish that the burn is
on the right arm. If the patient is seen a few weeks
later for another burn on the left arm, the same
ICD-9 diagnosis code would be reported. Additional
documentation would likely be required for a claim
for the treatment to explain that the burn treated
at this time is a different burn from the one that
was treated previously. In the ICD-10 diagnosis code
set, characters in the code identify right versus left,
initial encounter versus subsequent encounter, and
other clinical information.
• Resizing: Another issue with ICD-9 is that some
chapters are full and impede the ability to add new
codes. In some cases, new codes have been
assigned to different chapters making it difficult to
locate all available codes. ICD-10 codes have
increased character length, which greatly expands
the number of codes that are available for use.
With more available codes, it is less likely that
chapters will run out of codes in the future.
• Updating: Other issues that are addressed in ICD-10
include the use of full code titles and appropriately
reflecting advances in medical knowledge and
technology.

Source: https://www.unitypoint.org/waterloo/filesimages/for%20providers/icd9-icd10-differences.pdf
47
Overall comparison of ICD10 and ICD11
ICD has been revised to accommodate the needs of multiple use cases
and users in recording, reporting, and analysis of health information. ICD-11
comes with:

• Improved usability: more clinical detail with less training time


• Updated scientific content
• Clinical detail: Enables coding of all clinical detail
• Computerization: Made eHealth ready for use in electronic environments
• Interoperability: Linked to relevant other classifications and terminologies
• Multilingual: Full multilingual support (translations and outputs)

Source: https://www.who.int/classifications/icd/en/#page21

48
Diagnosis-Related Groups (DRG)
Def. (DRG): is a classification of the “services” that
acute health care centers can provide. DRGs were
intended to describe all types of patients in an acute
hospital setting. DRGs have been used in the US since
1982 to determine how much Medicare pays the
hospital for each “service", since patients within each
category are clinically similar and are expected to use
the same level of hospital resources. DRGs are assigned
to patients by a "grouper" program based on ICD
diagnoses (primary diagnosis), procedures, age, sex,
discharge status, and the presence of complications or
comorbidities (secondary diagnoses).
49
Int’l Classification of Primary Care (ICPC-2)

Def. (ICPC-2): it classifies patient data and clinical activity in the domains of General/Family Practice
and primary care, taking into account the frequency distribution of problems seen in these domains.
It allows classification of the patient’s reason for encounter (RFE), the problems/diagnosis managed,
interventions, and the ordering of these data in an episode of care structure.
• It has a biaxial structure consisting of 17 chapters, each one divided into 7 components.
• ~1,300 codes.
• Abbreviated ICPC-2 in Spanish: https://www.iqb.es/patologia/ciap/ciap_toc.htm
Chapters
A General and unspecified Components
B Blood, blood forming organs, lymphatics, spleen 1: symptoms and complaints
D Digestive 2: diagnostic, screening and preventive procedures
F Eye 3: medication, treatment and procedures
H Ear 4: test results
K Circulatory 5: administrative
L Musculoskeletal 6: referrals and other reasons for encounter
N Neurological 7: diseases
P Psychological
R Respiratory
S Skin
T Endocrine, metabolic and nutritional
U Urology
W Pregnancy, childbirth, family planning
X Female genital system and breast
Y Male genital system
Z Social problems
50
51
Source: http://docpatient.net/3CGP/QC/ICPC_desk.pdf
Overall Comparison of ICD and ICPC
• Oriented to Primary Care: ICD covers the needs of hospital care where patients normally present for a
single episode of care and mostly with one, often clearly differentiated, problem. In primary care,
however, healthcare providers deal with multiple episodes of care over time, and deal with many, often
undifferentiated, problems simultaneously. Therefore, ICPC allows capturing information on episodes of
care (EoC) over time. It does so by allowing the simple recording of the first contact between patient and
healthcare provider concerning a certain health problem, and ends with the last contact relating to this
same problem. The EoC allows for grouping of information over time. Healthcare providers can use this to
improve continuity and coordination of care. The ability to collect data using the EoC also creates more
insight into the processes related to certain conditions over time, and so a greater understanding of what
is needed and the associated costs.
• Reflects the content of primary care: ICPC is a classification system which aims to reflect the content of
primary care. The ICPC contains codes that are mainly based on the frequencies with which they are
encountered in primary care and with a level of detail that is appropriate for primary care. It is possible to
tailor ICPC to match local epidemiological needs, which enables easy and consistent coding. The whole of
ICPC, all codes, fit the front and back of one A4 sheet of paper.
• Reduced size: ICPC-2 has around 1,300 codes whereas ICD has between 14,000 and 140,000 codes with a
complex coding system.
• Formal structure: The components that form part of each ICPC chapter permit considerable specificity for
all three elements of the encounter (i.e., findings, diagnosis, and treatment), yet their symmetrical
structure and largely uniform numbering across all chapters also facilitate usage even in manual recording
systems.
• Multilingual: ICPC is available in Catalan, Chinese, Croatian, Danish, Dutch, English, Finnish, French,
German, Greek, Italian, Japanese, Norwegian, Portuguese, Romanian, Russian, Serbian, Slovenian, and
Spanish.

52
Source: https://www.globalfamilydoctor.com/site/DefaultSite/filesystem/documents/Groups/WICC/International%20Classification%20of%20Primary%20Care%20Dec16.pdf
Episode of Care
An "Episode of Care" is a healthcare management and billing concept that refers to a
specific period during which a patient receives a sequence of healthcare services related
to a particular health issue, condition, or treatment. This concept is primarily used in
healthcare reimbursement and administration to organize and account for the care
provided to a patient over a defined period. Here are some key points to understand
about episodes of care:
• Episode of Care = (RFE; Diagnosis; Procedure)
– RFE = Patient’s Reason for Encounter (e.g., headache) Symptoms
– Diagnosis = differential knowledge acquired about the patient’s physical or
psychical state.
– Procedure = treatment action(s) or test(s) started on the patient.
• Simplified examples:
Encounter | RFE | Diagnosis | Procedure
EoC1 | “I’m feeling tired” | “A04 tiredness” | “-34: blood test”
EoC2 | “What’s the (blood) test result?” | “B80 iron deficiency anemia” | “-40 diagnostic endoscopy (colonoscopy)”
EoC3 | “What is the (colonoscopy) test result?” | “D75 malignant neoplasm colon/rectum” | “-67 referral to physician/specialist/…”

53
Episode of Care
• Episode of Care = (RFE; Diagnosis; Procedure)
– RFE = Patient’s Reason for Encounter (e.g., headache)
– Diagnosis = differential knowledge acquired about the patient’s physical or
psychical state.
– Procedure = treatment action(s) or test(s) started on the patient.
• Simplified examples:
Encounter | RFE | Diagnosis | Procedure
EoC1 | “I’m feeling tired” | “A04 tiredness” | “-34: blood test”
EoC2 | “What’s the (blood) test result?” | “B80 iron deficiency anemia” | “-40 diagnostic endoscopy (colonoscopy)”
EoC3 | “What is the (colonoscopy) test result?” | “D75 malignant neoplasm colon/rectum” | “-67 referral to physician/specialist/…”

• Complete example (complex medical considerations):


https://prezi.com/6llsvfu_dk_o/caso-clinico-1-icpc-vi/

54
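The (RFE; Diagnosis; Procedure) triple above can be sketched as a small data model. This is a toy illustration only: the `Encounter` tuple and field names are hypothetical, and the values are taken from the simplified examples on the slide.

```python
from collections import namedtuple

# Toy model: each encounter in an Episode of Care is an (RFE; Diagnosis; Procedure)
# triple, and the EoC groups all encounters about the same health problem over time.
Encounter = namedtuple("Encounter", ["rfe", "diagnosis", "procedure"])

episode_of_care = [
    Encounter("I'm feeling tired", "A04 tiredness", "-34 blood test"),
    Encounter("What's the (blood) test result?", "B80 iron deficiency anemia",
              "-40 diagnostic endoscopy (colonoscopy)"),
    Encounter("What is the (colonoscopy) test result?",
              "D75 malignant neoplasm colon/rectum",
              "-67 referral to physician/specialist"),
]

# Grouping by episode lets us follow how the diagnosis evolves across contacts.
diagnoses = [e.diagnosis for e in episode_of_care]
```

The point of the grouping is visible in `diagnoses`: the same problem moves from a symptom code (A04) to a differentiated diagnosis (D75) across encounters.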
Classification of Functioning, Disability and
Health (ICF)
Def (ICF): it is the WHO framework for measuring health and disability at both
individual and population levels. ICF was officially endorsed by all 191 WHO
Member States in 2001 as the international standard to describe and
measure health and disability.
• http://apps.who.int/classifications/icfbrowser/
• ICF has 4 components, identified by letter prefixes: b (body functions), s (body structures), d (activities and participation), e (environmental factors)
• A-N-NN-N

d4100 Lying down


Getting into and out of a lying down position or changing body position,
from horizontal to any other position, such as standing up or sitting down.
Inclusions: getting into a prostrate position
55
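The A-N-NN-N code structure can be decomposed mechanically. A minimal sketch follows, using d4100 ("Lying down") as the worked example; the `parse_icf` function and its field names are hypothetical, while the component letters and their meanings are standard ICF usage.

```python
# The four ICF components and their letter prefixes.
ICF_COMPONENTS = {
    "b": "Body functions",
    "s": "Body structures",
    "d": "Activities and participation",
    "e": "Environmental factors",
}

def parse_icf(code: str) -> dict:
    """Split an ICF code of the form A-N-NN-N into its hierarchical levels."""
    return {
        "component": ICF_COMPONENTS[code[0]],
        "chapter": code[1],        # e.g., d4 = Mobility
        "second_level": code[:4],  # e.g., d410 = changing basic body position
        "full_code": code,         # e.g., d4100 = lying down
    }

print(parse_icf("d4100"))
```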
Current Procedural Terminology (CPT)
Def. (Current Procedural Terminology, CPT): it is a medical code
system describing clinical procedures and services (either medical,
surgical, and diagnostic) addressed to physicians, health insurance
companies and accreditation organizations.
• CPT coding is similar to ICD-9 and ICD-10 coding, except that it
identifies the services rendered, rather than the diagnoses.
• Three categories of CPT codes:
– Category I: coding procedures and contemporary medical practices. It is
organized in 6 groups (NNNNN)
1. Codes for evaluation and management
2. Codes in anesthesiology
3. Codes for surgery
4. Codes for radiology
5. Codes for pathology and laboratory
6. Codes for medicine
– Category II: Clinical laboratory services (NNNNA). Complementary to
category I, but with no costs attached.
– Category III: Emerging technologies, services, and procedures (NNNNA).
New codes that are expected to be incorporated into category I.

56
Anatomical Therapeutic Chemical
Classification System (ATC)
Def. (ATC): Codification system in which drugs are classified at five levels.

A-NN-A-A-NN
Level 1: Anatomical Group (A): 14 groups
Level 2: Therapeutic Subgroup (NN)
Level 3: Therapeutic/Pharmacological Subgroup (A)
Level 4: Chemical/therapeutic/pharmacological Subgroup (A)
Level 5: Chemical Substance (NN)

ATC includes daily doses (DDDs): The DDD is the


assumed average maintenance dose per day for a
drug used for its main indication in adults. 57

https://www.whocc.no/atc_ddd_index/
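The five-level A-NN-A-A-NN structure can be read off directly from a code's prefixes. A minimal sketch, using C07AB03 (atenolol, which also appears in the HL7 exercise later in these slides) as the worked example; the function and key names are hypothetical.

```python
def atc_levels(code: str) -> dict:
    """Split a 7-character ATC code into its five hierarchical levels."""
    assert len(code) == 7, "ATC codes have the form A-NN-A-A-NN (7 characters)"
    return {
        "level1_anatomical_group": code[0],           # C  = cardiovascular system
        "level2_therapeutic_subgroup": code[:3],      # C07 = beta blocking agents
        "level3_pharmacological_subgroup": code[:4],  # C07A
        "level4_chemical_subgroup": code[:5],         # C07AB
        "level5_chemical_substance": code,            # C07AB03 = atenolol
    }

print(atc_levels("C07AB03"))
```

Because each level is a prefix of the next, prefix matching is enough to group drugs by therapeutic class.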
Logical Observation Identifiers Names and
Codes (LOINC)
Def. (LOINC): it is an international standard to assist in the electronic exchange and gathering of
clinical results (such as laboratory tests, clinical observations, outcomes management and
research).
• https://loinc.org/
• > 71,000 observation terms
• LOINC has two parts:
– Laboratory: to describe results of laboratory and microbiology tests,
– Clinical: to refer to a variety of non-lab concepts (e.g., ECG, cardio echo, ultrasound). The
clinical part has:
• Terms for clinical documents: to be incorporated in clinical reports such as discharge summaries.
• Terms for survey instruments: to be used in standard surveys such as the Glasgow Coma Scale.
• Each code has six dimensions or parts:
– Component (Analyte): the substance or entity being measured or observed.
– Property: the characteristic or attribute of the analyte.
– Time: the interval of time over which an observation was made.
– System (Specimen): The specimen or thing upon which the observation was made.
– Scale: how the observation value is qualified or expressed: quantitative, ordinal, nominal.
– Method (optional): how the observation was made. https://loinc.org/806-0/
• Example: code 806-0 (parts Leukocytes: NCnc: Pt: CSF: Qn: Manual count) stands for
“manual count of white blood cells in cerebral spinal fluid specimen”. NCnc = number
concentration, Pt = point in time, CSF = cerebral spinal fluid, Qn = quantitative.
58
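The six-part structure makes LOINC names easy to split programmatically. A minimal sketch using the 806-0 example above; the `LOINC_PARTS` list mirrors the six dimensions from the slide.

```python
# The six parts (dimensions) of a LOINC fully specified name, in order.
LOINC_PARTS = ["Component", "Property", "Time", "System", "Scale", "Method"]

# Parts of LOINC code 806-0, colon-separated as on the slide.
fsn = "Leukocytes:NCnc:Pt:CSF:Qn:Manual count"
observation = dict(zip(LOINC_PARTS, fsn.split(":")))

# observation["System"] is the specimen: "CSF" (cerebral spinal fluid)
print(observation)
```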
Systematized Nomenclature of Medicine (SNOMED CT)
Def. (Systematized Nomenclature of Medicine - Clinical Terms, SNOMED CT): it is the most comprehensive,
multilingual and codified clinical terminology developed in the world. SNOMED CT is also a terminology product
that can be used to encode, retrieve, communicate, and analyze clinical data, enabling healthcare professionals to
represent information in an appropriate, accurate, and unambiguous way. At its core, the terminology is made
up of concepts, descriptions, and relationships. These items are intended to accurately represent clinical
knowledge and information in the healthcare setting.

• https://browser.ihtsdotools.org
• Poly-hierarchical structure of
clinical concepts (IS-A relations):

• Semantic definition of concepts


(properties):
59
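The poly-hierarchy of IS-A relations is what makes subsumption queries ("is concept X a kind of Y?") possible. A minimal sketch follows; the concept names are illustrative only, not actual SNOMED CT identifiers or content, and `subsumes` is a hypothetical helper.

```python
from collections import deque

# Toy poly-hierarchy: a concept may have several IS-A parents.
parents = {
    "bacterial pneumonia": ["pneumonia", "bacterial infectious disease"],
    "pneumonia": ["lung disease"],
    "bacterial infectious disease": ["infectious disease"],
    "lung disease": ["disease"],
    "infectious disease": ["disease"],
}

def subsumes(ancestor: str, concept: str) -> bool:
    """True if `ancestor` is reachable from `concept` via IS-A links."""
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        if current == ancestor:
            return True
        queue.extend(parents.get(current, []))
    return False
```

For instance, "bacterial pneumonia" is subsumed both by "lung disease" and by "infectious disease" — exactly the kind of multi-parent retrieval a strict tree could not express.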
SNOMED-CT Diagrams
Diagram symbols and their meanings:
• Concept
• Defined concept
• Attribute
• Is-a relationship
• Attribute group
• Conjunction (AND)
• Equivalence
• Subsumption (inclusion)
• Unidirectional/bidirectional connectors
60
https://confluence.ihtsdotools.org/download/attachments/29951081/doc_DiagrammingGuideline_Current-en-US_INT_20140131.pdf?api=v2
SNOMED CT: A Big Ontology

61
Unified Medical Language System (UMLS)

Def. (UMLS): The UMLS is a set of files and software that brings together
many health and biomedical vocabularies and standards to enable
interoperability between computer systems.
• The UMLS integrates and distributes key terminology, classification and
coding standards, and associated resources to promote creation of more
effective and interoperable biomedical information systems and services,
including electronic health records.
• It contains 3 knowledge sources:
– Metathesaurus: Terms and codes from many vocabularies, including CPT, ICD-10-
CM, LOINC, MeSH, RxNorm, and SNOMED CT. Hierarchies, definitions, and other
relationships and attributes.
– Semantic Network: Broad categories (semantic types) and their relationships
(semantic relations).
– SPECIALIST Lexicon and Lexical Tools: A large syntactic lexicon of biomedical and
general English and tools for normalizing strings, generating lexical variants, and
creating indexes.

62
Electronic Health Records
• Definition and Related Terms
– Electronic Medical Record
– Electronic Health Record
– Electronic Personal Health Records
• Parts of an EHR
• EHR Software Strategies
• Health Information Systems: Past, Present, and Future
• EHR Standards
– HL7
– OpenEHR
– EHRcom
• EHR Systems
– Proprietary solutions: SAP Health-Care, HPCIS (DXC), Selene (CGM),
Millennium (Cerner)
– Ad hoc solutions: Diraya, Jimena, Abucasis, Ianus, e-Osabide.

63
Electronic Health Record (EHR)
Def. (Electronic Health Record, EHR): An electronic
(digital) collection of medical information about a
person that is stored on a computer.
An electronic health record includes information about
a patient’s health history, such as diagnoses, medicines,
tests, allergies, immunizations, and treatment plans.
Electronic health records can be seen by all healthcare
providers who are taking care of a patient and can be
used by them to help make recommendations about
the patient’s care.
“Also called electronic medical record”.

64
Office of the National Coordinator for
Health Information Technology (ONC)
within the Office of the Secretary for the U.S. Department of Health and Human Services (HHS).

65
Parts of an EHR

• Parts proposed by the Committee on Data Standards for


Patient Safety (2003) of the Institute of Medicine(*) in the US:
1. Health Data and Information
2. Results Management
3. Order Entry and Management
4. Clinical Decision Support
5. Electronic Communication and Connectivity
6. Patient Support
7. Administrative Processes
8. Reporting and Population Health Management

(*) Since 2015, National Academy of Medicine

Sources: Kelley T. Electronic Health Records for Quality Nursing & Health Care. DEStech Publications, Inc. 2016.
Institute of Medicine (US) Committee on Data Standards for Patient Safety. Key Capabilities of an Electronic Health Record System: Letter Report.
Washington (DC): National Academies Press (US); 2003. Available from: https://www.ncbi.nlm.nih.gov/books/NBK221802/ doi: 10.17226/10781
66
EHR: 1. Health Care Data and Information
• An EHR must contain certain data about patients. The input-output access
to this data by care providers must be efficient.
• Difference between data and information:
– Data: symbols representing numbers, letters, words, abbreviations, etc.
– Information: assignment of a (clinical) meaning to data that allows decision making
and knowledge generation.
• Health care data and information can be introduced in EHRs in either a
structured or unstructured format. E.g., text vs. SNOMED codes.
• Possible categories of EHR data and information:
– Patient demographics: e.g., name, date of birth, gender, ethnicity, race, medical
record number, etc.
– Patient list of problems and diagnoses: primary and secondary causes of treatment.
– List of medications and prescriptions: pharmacological treatment of the patient.
– List of allergies: allergies to medication, food, substances, etc.
– Clinical documentation: e.g., electronic forms, templates, spreadsheets, notes, etc.
– Patient orders: list of patient actions required by a health care professional for the
management of the patient.
– Medication and administration record (MAR): Nurses, pharmacists, and providers
use the MAR to administer patient medications and subsequently monitor the
patient’s response.

67
EHR: 2. Results Management
• An EHR must contain the results of the clinical tests performed on the
patient.
• This is a broad category that incorporates results from tests performed in
clinical areas, other diagnostic tests, and consultative exams.
– Result: outcome of a test or exam ordered by a health care provider and performed on
the patient. Used to evaluate the patient’s condition and to make clinical decisions
about the patient’s care.
• Examples: Complete blood count (CBC), Potassium level (K+), Chest X-Ray
(CXR), Electrocardiogram (ECG/EKG), Pulmonary function tests (PFTs),
Biopsies, Strep Test, Urinalysis (UA), Mononucleosis Test (Mono), Magnetic
Resonance Imaging (MRI), Nuclear Medicine Scans (NM), Computerized
Tomography (CT), etc.

68
EHR: 3. Order Entry and Management
• An EHR must contain the treatment orders asked by the health care
professionals. In modern HC systems order entry is computerized.
• Computerized Provider Order Entry (CPOE) as part of (or connected to)
the EHR.
• CPOE facilitates:
– Reduction of errors caused by handwriting (orders are easier to read).
– Automatic filling of fields in the order (e.g., patient/doctor name, date, etc.)
– Automatic incorporation of orders in the EHR of the patient.
– Electronic prescription.
– Time reduction.
– Automatic detection of medication errors (drugs, dosages, frequencies, etc.)
and interactions (drug-drug, drug-allergy, etc. checkers).
– Paper elimination and cost reduction.

69
EHR: 4. Clinical Decision Support
• An EHR can contain (or use) some tools for decision support.
• Clinical Decision Support (CDS) component in EHR was not possible in
paper-based health records.
• CDS is meant to assist the provider, nurse, and other health care
professionals to make optimal decisions about a patient’s treatment plan.
• CDS uses patient’s data and information in this purpose.
• CDS and CPOE have proved to be a beneficial symbiosis when working together.
• Monitoring whether values stay within normal ranges is another use of CDS

TO BE CONSIDERED LATER

70
EHR: 5. Electronic Communication and
Connectivity
• An EHR must facilitate secure data communication between different
units in the same health care center or among health care centers.
• Communication: EHR data and information moves
– Data sources: where EHR data is produced
– Data storages: where EHR data is stored (secondary sources)
– Data uses: where EHR data is consumed/required

Source 1 Storage 1 Use 1



Source m Storage n Use n

• Connectivity: provide EHR access via Internet


Source Internet Use 1
Platform
• Related to IT security
Storage 1
71
EHR: 6. Patient Support
• Some EHR modules can provide support to the patients (e.g., educational
information)
• These modules can be accessed by patients providing patient-oriented
support.

72
EHR: 7. Administrative Process
• This EHR component supports the organization of health care around the patient.
• It is normally used during admission, discharge (inpatients) or visit
(outpatients).
• It is usually centered on demographic information (patient’s name, date
of birth, gender, race, ethnicity, language), the assignment of a medical
record identifier (MRI), the patient’s insurance or payer information (social
security number, insurance number, credit card number), and patient
locations (exam rooms, emergency rooms, operating rooms, ICU rooms,
inpatient care rooms, etc.)

73
EHR: 8. Reporting and Population Health
Management
• Some EHRs incorporate a module to extract summary information
required by national health care systems.
• Health care centers regularly report their activities and information about
their patients to federal, state, and local governmental institutions for global
supervision of patient safety and health care quality.
• Units and departments in hospitals also have to report to the hospital as
an input for global management and decision making.
• The EHR becomes an outstanding tool in these processes.
• EHR data extraction and arrangement for these purposes should be
automated with EHR modules implementing the extraction and
processing algorithms.
• These components reduce time and costs and improve accuracy during reporting.

74
EHR Software Strategies
• Ad hoc strategy: the hospital IT staff develops their own system.
• Off-the-shelf strategy: the hospital purchases a software and customizes
it.
– Single EHR: a single software is available with modules for financial, billing,
human resources, material management, etc. It’s a neat compact solution.
– Best-of-breed strategy: many vendors’ solutions are analyzed and the best
components of each are purchased and integrated. It involves intensive
interface development between components.
– Best-of-suite strategy: there is a core EHR to which other software systems are
integrated.

EHR EHR

Ad hoc Single EHR Best-of-breed Best-of-suite 75


Source: Ford, EW; Menachemi, N; Huerta, TR; Yu, F (2010). Hospital IT Adoption Strategies Associated with Implementation Success: Implications for Achieving Meaningful Use.
Health Information Systems
(Past, Present, and Future Challenges)
1. Shift from paper-based to computer-based processing and
storage, as well as the increase of data in health care settings
2. Shift from institution-centered departmental and, later, hospital
information systems towards regional and global HIS
3. Inclusion of patients and health consumers as HIS users, besides
health care professionals and administrators
4. Use of HIS data not only for patient care and administrative
purposes, but also for health care planning as well as clinical and
epidemiological research
5. Shift from focusing mainly on technical HIS problems to those of
change management as well as of strategic information
management
6. Shift from mainly alpha-numeric data in HIS to images and now
also to data on the molecular level
7. Steady increase of new technologies to be included, now starting
to include ubiquitous computing environments and sensor-based
technologies for health monitoring
Source: Haux R. Health information systems - past, present, future. Int J Med Inform. 2006 Mar-Apr;75(3-4):268-81.
76
EHR Standards
• Outstanding EHR standards (not the only ones):
– HL7: Health Level 7
– OpenEHR
– EHRcom: Electronic Health Record Communication

Same EHR System Diverse EHR Systems


(each HCC can have a different EHR)

HCC 1 HCC 1

HCC 4 HCC 4
… HCC 2 HCC 2

HCC 5
HL7
HCC 3 HCC 5

HCC 3

impossible current approach


77
HL7

Def. (HL7): it is a set of international standards for transfer of clinical and


administrative data between software applications that are used by various
healthcare providers.

• They are produced and maintained by Health Level Seven International.


• HL7 version 3 (as opposed to v2) offers a Reference Information Model (RIM)
to standardize the format of HL7 messages and documents.
– HL7 v3 messages are not as widely used as HL7 v2 messages.
– HL7 v3 documents are based on the Clinical Document Architecture (CDA) which
specifies document types for sharing, exchange, and reuse. For example, operative
reports, discharge summaries, laboratory reports, or quality measure report on
patient cohort. It uses XML notation.
– Unlike messages, documents do not contain rules for transmission.
– Documents can be combined with other standards.
Messages are generally used to support an ongoing process in a real-time fashion. They convey status
information and updates related to one and the same dynamic business object. Messages are about
"control" - they can represent requests that can be accepted or refused by the system and there are
clear sets of expectations about what the receiver must do.
Documents are persistent in nature, have “static” content and tend to be used post occurrence, i.e. once
the actual process is done. Documents are about persisting "snapshots" as understood at a particular time.
78
HL7 v3 RIM

79
Simplified HL7 v3 RIM

• Entities: elements (either objects or agents) participating. Ex: person, organization, place,
material, etc.
• Roles: provide the kind of participation and liabilities of entities. Ex: patient, employee (for
doctors, nurses, ...), etc.
• Participations: information about the involvement of a role in an act.
• Acts: health-care past/present/future action. Ex: procedure, patient encounter, observations,
invoice, etc.
Example (operative report): Dr. Smith (entity) as physician (role) observes uncontrolled diabetes (observation act) in patient Mr. Jones (entity, role
patient) and prescribes metformin (supply act) to stabilize the situation (act). [Both the doctor and the patient (roles) participate in this clinical situation
(participation)].
80
HL7 messages: v2 vs. v3

81
HL7 v2 Messages

• Most common HL7 messages and codes:


ADT – Admission, Discharge, and Transfer messages.
SIU – Scheduling messages for clinical appointments.
ORM – Order entry messages to make the order.
ORU – Order result messages with results of an ORM message.
• HL7 messages are made of segments (lines)
• The initial segment is a MSH (message header) containing type, version, and important information about the message.
• Segments are divided into fields by pipe characters ‘|’
• Fields can also be divided into subfields by ‘^’ (e.g., JONES^WILLIAM^A^III)
• HL7 messages are made for computer communication, not for human interpretation

• Editor: http://7edit.com/home

82
• Others: https://hl7latam.blogspot.com/2018/07/editor-de-mensajes-hl7.html
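The segment/field/subfield layout above can be parsed with ordinary string splitting. A minimal sketch on a hypothetical two-segment ADT message (application and facility names are made up; a real message has many more fields):

```python
# Hypothetical minimal HL7 v2 message: segments separated by carriage returns,
# fields by '|', subfields by '^'.
raw = (
    "MSH|^~\\&|SENDAPP|SENDFAC|RECVAPP|RECVFAC|20200204123000||ADT^A01|MSG0001|P|2.5\r"
    "PID|1||123456||JONES^WILLIAM^A^III||19971215|M"
)

# Split into segments, then each segment into fields.
segments = [line.split("|") for line in raw.split("\r")]
msh = segments[0]
pid = next(s for s in segments if s[0] == "PID")

message_type = msh[8]                  # 'ADT^A01' (admission/visit notification)
family, given = pid[5].split("^")[:2]  # subfields of the patient name field
```

As the slide notes, these messages are made for computer communication: a receiving system locates data purely by segment name and field position, as above.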
7Edit: HL7 v2 Message Editor
Types of Segments

HL7 Message
HL7 RIM
HL7 List of Messages

83
Case Example 0: Use of 7Editor
1. Create a message of the sort ADT_01 (admission/visit notification).
2. Indicate that the admission was on Feb 4, 2020 at 12:30 h, by writing a segment of the sort EVN (event).
3. Incorporate personal information about the patient as a PID segment: Mary Higgins, born on Dec 15, 1997,
female, Asian, who lives in 1122 Alberton Av and home phone number 111222, and married. She receives
the patient id 123456.
4. Introduce a contact person: create a next of kin (NK1) segment with the information of the husband, John
Carter, with phone num. 333444.
5. Introduce the information that Mary has allergy (AL1 segment) to penicillin (ATC code J01C).
6. Mary was diagnosed (DG1 segment) of essential hypertension (ICD-9CM code 401.0) on Aug 10, 2019.
7. One hour after arrival, she received a procedure (PR1 segment) of anamnesis by doctor Charles Cannot
(ROL segment), who observed (OBX segment) a systolic blood pressure (SNOMED-CT code 163020007) of
140 mmHg, and prescribed (PR1 segment) atenolol 100 mg, with ATC code C07AB03.

84
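The exercise steps can be sketched as a message skeleton. This is a hedged sketch only: the segment names follow the exercise, but the exact field positions are simplified and would need checking against the HL7 v2 specification (e.g., in 7Edit) before real use.

```python
# Sketch of the Case Example 0 message, one segment per exercise step.
segments = [
    "MSH|^~\\&|HIS|HOSPITAL|||20200204123000||ADT^A01|0001|P|2.5",  # 1. header
    "EVN|A01|20200204123000",                                        # 2. admission Feb 4 2020, 12:30
    "PID|1||123456||HIGGINS^MARY||19971215|F||Asian|1122 Alberton Av||111222|||M",  # 3. patient
    "NK1|1|CARTER^JOHN|SPO||333444",                                 # 4. next of kin (husband)
    "AL1|1|DA|J01C^penicillin^ATC",                                  # 5. drug allergy
    "DG1|1||401.0^essential hypertension^I9CM|||20190810",           # 6. diagnosis
    "PR1|1||anamnesis|||20200204133000",                             # 7. procedure, one hour later
    "OBX|1|NM|163020007^systolic blood pressure^SCT||140|mmHg",      # 7. observation
]
message = "\r".join(segments)
```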
HL7 CDA
Def. (HL7 CDA): it provides a standard for the representation,
persistence and communication of clinical documents for exchange
between systems.
• A CDA document is a tree-structure in which higher levels can
contain lower level CDA structures. They are both human readable
and machine processable.
• A CDA document is an XML file consisting of:
– Header: identifies the patient, provider, document type, etc.
– Body: mandatory human-readable part containing the complete content
and an optional encoded part that can be safely ignored by recipients
which are unable to process it.
• Depending on the body, there are three classes of CDA documents
(called levels):
– Level 1: with a NonXMLBody (free style) body. Ex., PDF file.
– Level 2: with a StructuredBody body containing textual sections.
– Level 3: with a StructuredBody body containing textual sections with
codified entries (clinical information) for machine processing.
• Level 3 allows semantic interoperability. 85
https://web.archive.org/web/20081026071806/http://hl7book.net/index.php?title=CDA
CDA Document Complete Structure
Header: metadata about the document
• Who created the document
• Who is the document about
Header • When was the document created
• Where was the document created
• Etc.

Body Body (level 1): composed of two parts:


1. Human readable representation of the document
2. Machine processable part of the document structured in sections.
Section
Section (level 2): each one of the body components (ex., observation, medication, etc.)
Entry
Entry (level 3): fully structured machine-processable representation of a clinical
information item on the CDA (ex., vital signs, laboratory and allergy observations,
substance administrations, clinical procedures, etc.)

Entry Example:
This entry is representing …

• an observation (OBS),

• of a finding (body weight): code 363808001 of


363808001 the SNOMED CT codification,
Measured body weight (observable entity)
• took at date-time 07/04/2000 14:30h,

• with value 93 kg

86
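The body-weight entry above can be sketched as XML built with Python's standard library. This is a simplified, non-validated fragment, not a complete CDA instance: element and attribute names follow common CDA usage, and the timestamp assumes the slide's date means 7 April 2000, 14:30.

```python
import xml.etree.ElementTree as ET

# Sketch of a level-3, machine-processable observation entry.
obs = ET.Element("observation", classCode="OBS", moodCode="EVN")
ET.SubElement(obs, "code",
              code="363808001",
              codeSystem="2.16.840.1.113883.6.96",  # SNOMED CT code system OID
              displayName="Measured body weight (observable entity)")
ET.SubElement(obs, "effectiveTime", value="200004071430")
ET.SubElement(obs, "value", value="93", unit="kg")

xml_fragment = ET.tostring(obs, encoding="unicode")
print(xml_fragment)
```

Because the entry is fully coded (SNOMED CT code, timestamp, value with unit), a receiving system can process it without parsing the human-readable text, which is what level 3 semantic interoperability means.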
CDA Header: Example
XML header
CDA declaration

CDA header

Type of document: laboratory


LOINC consultation
Confidentiality: the document
is not confidential

The document is about this


Name: Henry Levin the 7th patient: has name Henry Levin
the 7th, male, and born on September
24th, 1932.
male

Birthday: 1932/09/24

The information about the patient is provided by the health care center 2.16.840.1.113883.19.5

87
CDA Body Examples
Non Structured Body:
• Expressed with element <text>
• Examples: plain text, an attached file patient.txt, or a reference such as http://www.hcc.com/patients/patient0012453.txt
• The patient is described in a local (or remote) text file patient.txt (or patient0012453.txt) which is attached to the body. A full URL reference to the remote document could be provided.
Structured Body:
• Composed of 1+ elements <component>, each one with 0+ elements <section>, composed of 0+ elements <entry>
• Example: the document represents a patient’s textual anamnesis that states “Patient with significant alterations …”
88
CDA Documents Can Automate Print out

89
HL7 CDA Viewer: Backbeach
https://backbeachsoftware.com.au/challenge/index.htm
XML view of
CDA document

Examples in XML
Examples in blocks

CDA Header

CDA Body Sections

Section
Visibility
Control

90
OpenEHR
• OpenEHR is an open standard that specifies all the architectural components
needed to create health information systems that are interoperable, highly
maintainable, and very flexible. It describes the management and storage,
retrieval, and exchange of health data in EHRs. It has three basic components:
– The Reference Model (RM) (or Information Model): is a hierarchy of data structures. NON-MODIFIABLE
– The Knowledge or Archetype Model (AM): composed of models to describe archetypes and templates.
– The Service Model (SM) under development, it includes definitions of basic services in the health
information environment, centered around the EHR.

91
https://specifications.openehr.org/releases/1.0.1/html/architecture/overview/Output/overviewTOC.html
openEHR: The Reference Model
• openEHR RM defines all the basic classes in openEHR
• Everything in openEHR (archetypes, templates, etc.) is based on the classes of openEHR RM.
• Classes are organized in the following packages:

Template DM
openEHR Archetype Profile
Archetype DM

– Package ehr_extract: defines the semantics of Extracts (i.e., things that can be shared) from openEHR data sources,
including EHRs.
– Package ehr: contains the top level structure, the EHR.
– Package demographic: expresses attributes and relationships of demographic entities (e.g. contact address) which exist
regardless of particular clinical involvements or participations in particular events. Ex., concepts as PARTY, ROLE, etc.
– Package integration: for legacy and other data integration situations.
– Package composition: defines the containment and context semantics of the key concepts COMPOSITION, SECTION,
and ENTRY.
– Package common: defines abstract concepts and design patterns used in higher level openEHR models.
– Package data_structures: describes generic path-addressable data structures (e.g., single, list, table, tree) and a generic
notion of linear history (i.e., time-series structure), for recording events in past time.
– Package data_types: contains classes to define openEHR basic data types (ex., text, date_time, URI, etc.)
– Package support: defines the semantics respectively for constants, terminology access, access to externally defined
scientific units and conversion information 92
Source: https://specifications.openehr.org/releases/RM/latest/ehr.html
OpenEHR AM: Archetypes
OpenEHR archetypes are clinical content specifications that formalize the patterns and requirements for
representation of detailed, computable health-related concepts. Each archetype defines a topic-related set of
data groups and elements. For example, there are separate archetypes for recording symptoms, a blood
pressure measurement, an ultrasound report, and a medication order.
• Metadata (Archetype Header): each archetype describing a concept must have a concept name, a concept
description, a purpose, a use, and possible misuses.
• Archetype Classes: there are four basic classes to describe archetypes
– Composition: container class. All information stored within the EHR will be contained within a Composition. For example, an
encounter, a health summary, or a report. Similar to HL7’s body.
– Section: organizing class, usually contained within a composition. They correspond to the headings that you might find on a blank
piece of paper. They are most commonly used to provide a framework in which to place the smaller Entry and Cluster class
archetypes which hold most of the detailed clinical content.
– Entry: standalone 'semantic unit' of information. Entries can be grouped together usefully and re-used in many different
settings. The information within an Entry will mean the same thing no matter where it is used. Entries can be an observation
(without interpretation, ex., blood pressure), an evaluation (an interpretation of an observation, ex., gender or contraindication),
an action (ex., procedure), an instruction (ex., medication order), or admin (administrative, ex., patient admission).

– Cluster: reusable archetypes to be used in Entries or other Clusters. They can capture recursive concepts (e.g., observation).
They represent common and fundamental domain patterns that are required in many archetypes and clinical scenarios (e.g.,
size, symptom, inspection, and relative location).
• Archetype data contents: data units and archetypes that define a concrete archetype.
Some archetype examples follow … 93
Source: https://openehr.atlassian.net/wiki/spaces/healthmod/overview
Observation Archetype: Blood Pressure
Data Protocol

Information on what data Information related to the


components represent BP: process of taking BP: method,
systolic, diastolic, pulse, … formula, device, …

94
Evaluation Archetype: Gender
Data Protocol

Data units related to the Information related to the


evaluation of gender: procedure of evaluating the
administrative gender, legal patient’s gender: last updated.
gender, gender assigned at
birth, … 95
Action Archetype: Procedure
Description Protocol

What are the features of this Information about the protocol


procedure: urgency, body site, to perform the procedure:
procedural details, … requestor, receiver of the
order, …
96
Instruction Archetype: Medication order
Activity: Order Protocol

Elements describing the Elements related to the


activity: medication item protocol of prescribing a drug:
(drug), route (of order identifier, …
administration), …
97
Admin Archetype: Patient admission
Data

Relevant components to get


captured at admission time: point of
care/unit, room, bed, building,
address, admission type, prior
patient location, etc.

98
OpenEHR AM: Templates
OpenEHR templates are combinations and constraints on archetypes to create context-specific
clinical data sets and documents such as clinical notes, discharge summary documents, or
messages that will be used in EHR systems.
• Making templates is a “document design process”:
– Determine which organizational models are used, in which order and which ‘primary’ archetypes these models will
contain.
– Set appropriate default values in the primary archetypes, if required
– Specialize some archetypes, if required.
• For example, an Antenatal Examination Template:
– The template will contain 4 sections: History (Symptoms, Concerns), Physical Examination (Blood Pressure, Fetal Heart Rate, Palpation of Abdomen), Assessment, Plan.
– Sections will contain archetypes.
– Some archetypes may require refinement.
– Some archetypes’ slots may require default values.

• OpenEHR Template examples:


https://openehr.atlassian.net/wiki/spaces/healthmod/pages/2949132/Example+openEHR+Templates

99
Clinical Knowledge Manager (CKM)
The openEHR Clinical Knowledge Manager (CKM) is an international, online clinical
knowledge resource to manage openEHR archetypes and templates that has gathered
an active community of interested and motivated individuals from around the world
focused on furthering an open and international approach to clinical informatics for
sharing health information between individuals, clinicians and organizations; between
applications, and across regional and national borders.

All contributions to CKM are made on a voluntary basis, and all CKM content is open source
and freely available under a Creative Commons license.

Concretely:
• CKM is a library of clinical knowledge artefacts - currently predominantly openEHR
archetypes and templates;
• CKM supports the full life cycle management of openEHR archetypes and
templates through a review and publication process;
• CKM provides governance of the knowledge artefacts.

• https://ckm.openehr.org/ckm/

100
Clinical Knowledge Manager Tool

101
Generating New Archetypes
• New archetypes can be generated:
– From scratch: we can generate an archetype for a new
concept.
– By Specialization: we can take an existing archetype (ex.,
the archetype general laboratory observation) and refine it
to represent a more specific archetype (ex., laboratory
observation of blood glucose level).
– By Composition: we can take several existing archetypes
and combine them to form a new more complex
archetype. Composition is done by declaring archetype
slots of type allow_archetype and assigning an
archetype to each slot. For example, if we have
archetypes A1, …, An, we can create archetype B(A1, …, An).

102
EHRcom
• The Health informatics - Electronic Health Record
Communication (EN 13606) was the European Standard
for an information architecture to communicate
Electronic Health Records (EHR) of a patient. The
standard was later adopted as ISO 13606, and its parts
have since been revised (e.g., ISO 13606-2 and, most
recently, ISO 13606-5:2019).
• This standard was intended to support the
interoperability of systems and components that need to
communicate (access, transfer, add or modify) EHR data
via electronic messages or as distributed objects:
– preserving the original clinical meaning intended by the author;
– reflecting the confidentiality of that data as intended by the
author and patient.

103
ISO 13606 standard
• ISO 13606 is a standard from the International Standardization
Organization (ISO), originally designed by the European Committee for
Standardization (CEN).
• ISO 13606 defines a standard, rigorous, and stable information
architecture for communicating part or all of the electronic health record
(EHR) of a single subject of care (patient) between EHR systems, or
between EHR systems and a centralized EHR data repository. It may also
be used for EHR communication between an EHR system and clinical
applications or middleware components (such as decision support
components) that need to access EHR data, or as the representation of
EHR data within a distributed (federated) record system.
• ISO 13606 follows a Dual Model architecture separating information from
knowledge. Information is structured through a Reference Model with the
basic entities for representing any information of the EHR. Knowledge is
based on archetypes, which are formal definitions of clinical information
models, such as discharge report, glucose measurement or family history,
in the form of structured and constrained combinations of the entities of a
Reference Model.

Source: http://www.en13606.org/information.html 104


Interoperability
• All EHR standards pursue semantic interoperability.

• Interoperability is the ability of diverse systems and


organizations to work together seamlessly. That means
exchanging information and using the information that has
been exchanged.
– Syntactic Interoperability: when two or more systems are capable of
communicating and exchanging data by using the same data formats
or communication protocols. XML and JSON are examples of data
exchange formats.
– Semantic Interoperability: ability to automatically interpret the
information exchanged meaningfully and accurately in order to
produce useful results for end users. To achieve semantic
interoperability, both sides must obey a common information
exchange reference model. The content of the information exchange
requests are unambiguously defined: what is sent is the same as what
is understood.

[Diagram: Health Information System A (Health Care Center A) ←interoperability→ Health Information System B (Health Care Center B)]

105
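The two interoperability levels can be illustrated with a small sketch: both centers sharing JSON gives syntactic interoperability, while resolving the exchanged code against a common terminology gives semantic interoperability. The function names are hypothetical and the terminology dictionary is a stand-in for a real reference model or coding system.

```python
import json

# Stand-in for a shared reference terminology (illustrative code/label pair).
SHARED_TERMINOLOGY = {"38341003": "hypertension"}

def center_a_export(patient_id: str, code: str) -> str:
    # Syntactic layer: both centers agree on JSON as the exchange format.
    return json.dumps({"patient": patient_id, "diagnosis_code": code})

def center_b_import(message: str) -> str:
    record = json.loads(message)      # parsing succeeds: syntactic interoperability
    code = record["diagnosis_code"]
    # Semantic layer: the code is resolved against the common terminology,
    # so what is sent is the same as what is understood.
    return SHARED_TERMINOLOGY[code]

msg = center_a_export("P-001", "38341003")
print(center_b_import(msg))   # -> hypertension
```

Without the shared terminology, center B could still parse the message (syntax) but could not know what the code means (semantics).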
EHR Systems

106
Source: https://www.capterra.com/infographics/top-emr-software
107
The Top 5
• By total customers (hospitals using it)
• By total users (health care professionals using it)
• By social media followers (presence on social media: Facebook, LinkedIn, Twitter)
• By vendor size (in number of employees)
108
By total customers and users

109
By social media followers and vendor size

110
EHR Systems in Spain
Source: Miguel Ángel Montero, Health Care & Social Services Director, IECISA. 21-July-2020.

• Ad-hoc solutions: Diraya, Jimena, Abucasis, Ianus, e-Osabide
• Proprietary solutions: SAP HealthCare, DXC HPCIS (previously from HP), CGM Selene (previously from Siemens/Cerner), Cerner Millennium
111
3. Clinical Data Analysis
• A Data Science Project
• Statistical Analysis of Health Care Data
– Descriptive Statistics
– Inferential Statistics
– Regression
• Artificial Intelligence Analysis of Health Care Data
– Unsupervised Machine Learning
– Supervised Machine Learning

112
A Data Science Project
• Data science is an inter-disciplinary field that uses
scientific methods, processes, algorithms and systems to
extract knowledge and insights from many structural and
unstructured data. It unifies statistics, data analysis,
machine learning, domain knowledge and their related
methods in order to understand and analyze actual
phenomena, with data. Wikipedia

• A DS project follows the next steps:


1. Goals and Objectives Setting
2. Data Extraction
3. Data Cleaning
4. Feature Engineering
5. Model Creation
6. Impact Analysis

113
Data Cleaning
• Data cleaning (or data cleansing) is the process of detecting
and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database. Wikipedia
• The main issues in data cleaning are:
– Missing Values: some data can be absent. For example, the blood pressure of some
patients can be unknown.
– Outliers: some data can be highly atypical. For example, some patients may be older
than 100.
– Errors: some data can be corrupt. For example, heart rate can become null because the
connected machine disconnects while moving the patient.
– Duplicated Data: some data can be redundant. For example, we could have birth date,
age, and admission date (birthdate = admission_date – age).
– Pre-Calculation: some required data can be calculated from the available data. For
example, body mass index can be calculated from the patient’s height and weight
(BMI = weight (in kg) / height (in m)²).
– Useless Features: some features can be irrelevant to the current DS project. For
example, some study may not need patient’s gender information.
– Useless Cases: some cases can be irrelevant to the current DS project. For example,
pediatric studies may remove patients older than 18.
114
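The cleaning issues above can be sketched with pandas on an invented toy table; all column names, thresholds, and values are hypothetical, chosen only to exercise each step.

```python
import numpy as np
import pandas as pd

# Toy patient table (hypothetical columns; values invented for illustration).
df = pd.DataFrame({
    "age":       [34, 52, 130, 18, np.nan],   # 130 is an outlier; one value is missing
    "height_m":  [1.70, 1.65, 1.80, 1.75, 1.60],
    "weight_kg": [70, 80, 90, 60, 55],
    "gender":    ["F", "M", "M", "F", "F"],
})

df = df.dropna(subset=["age"])                    # missing values: drop incomplete rows
df = df[df["age"] <= 100]                         # outliers: remove implausible ages
df["bmi"] = df["weight_kg"] / df["height_m"]**2   # pre-calculation: BMI = kg / m^2
df = df.drop(columns=["gender"])                  # useless feature for this project
df = df[df["age"] >= 18]                          # useless cases: keep adults only
print(df.round(1))
```

Each line maps to one of the issues listed on the slide (missing values, outliers, pre-calculation, useless features, useless cases); duplicated data would be handled analogously with `drop_duplicates` or by dropping a redundant column.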
Feature Engineering and Model Creation
• Def (Feature engineering) the process of using domain knowledge to
extract features from raw data via data mining techniques. Wikipedia
• Def (Scientific modelling) the process of making a particular part or
feature of the world easier to understand, define, quantify, visualize, or
simulate by referencing it to existing and usually commonly accepted
knowledge. It identifies relevant aspects of a situation in the real world
and then uses different types of models for different aims, such as
conceptual models to better understand, operational models to
operationalize, mathematical models to quantify, and graphical models to
visualize the subject.

• In this course, Model Creation refers to the use of artificial intelligence


algorithms to process available data and produce mathematical models
that can be used to infer facts about new data.

TO BE CONSIDERED LATER
115
Statistical Analysis of Health Care Data
• Sample Tools
– MS Excel
– Python

• Statistical Data Analysis: Descriptive Statistics


• Descriptive Statistics with Python
• Case Study 1: Data Description with Python

• Statistical Data Analysis: Inferential Statistics


• Inferential Statistics with Python
• Case Study 2: Data Analysis with Python

116
Sample Tool: Python

• Rationale:
– Accessibility: it is open access.
– Programmable: analyses can be embedded within
computer programs.
– Simple: with a few pointers, doing statistics with Python is
easy.
– Powerful: statistic functions are fast.
– Complete: python provides a great variety of statistical
functions implemented and ready to be used.

118
Statistical Data Analysis: Descriptive Statistics
Descriptive statistics is a branch of statistics that aims to quantitatively
describe or summarize features of a collection of data.
• Qualitative variables: proportion or percentage of occurrence of
each variable value (e.g., percentage of patients taking one drug).
• Quantitative variables:
– Measures of central tendency
• Mean: arithmetic average of the values: mean = (Σⁿᵢ₌₁ xᵢ) / n
• Median: middle value of the set of values.
• Mode: most commonly observed value of the set of values.
– Measures of dispersion or variability
• Variance and standard deviation: st. dev = square-root(variance); s = √( Σⁿᵢ₌₁ (xᵢ − x̄)² / (n − 1) )
– ~68% of the cases are in the interval [mean ± st.dev]
– ~95% of the cases are in the interval [mean ± 2·st.dev]
• Confidence intervals: e.g., 95% CI = [mean ± 1.96·st.dev/√n], 99% CI = [mean ± 2.58·st.dev/√n]
• Interquartile range: obtain the first and third quartiles Q1 and Q3; then [Q1, Q3] is the interquartile range, containing 50% of the data.

119
Statistical Description of Data

Population Description (N = 451)

Variable              Type       Mean      St. Dev.  95% CI
Age                   Numeric    46.4080   16.4298   [44.8916, 47.9243]
Sex                   Categoric  male 44.79%, female 55.21%
Height                Numeric    166.1353  37.1946   [162.7025, 169.5681]
Weight                Numeric    68.1441   16.5998   [66.6121, 69.6762]
QRS duration          Numeric    88.9224   15.3814   [87.5028, 90.3420]
P-R interval          Numeric    155.0953  44.8755   [150.9537, 159.2370]
Q-T interval          Numeric    367.2239  33.4208   [364.1395, 370.3084]
T interval            Numeric    169.9335  35.6711   [166.6413, 173.2257]
P interval            Numeric    89.9756   25.8480   [87.5900, 92.3612]
Heart Rate            Numeric    74.4634   13.8707   [73.1833, 75.7436]
Ragged R wave         Categoric  exists 0.22%, not exists 99.78%
Diphasic der R valve  Categoric  exists 1.11%, not exists 98.89%

Comparative Population Description: the same table computed separately for males (N = 202, 44.79%) and females (N = 249, 55.21%), with the overall values (N = 451) as reference, giving mean, st. dev., and 95% CI per subgroup.
120
Descriptive Statistics with Python
• Context
import numpy as np     # Numpy: Python library adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
import statistics
import scipy           # Scipy: Python library built on Numpy that provides additional functions for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.
import math
import collections

• Qualitative Variables
data = ['a','a','a','e','e','e','e','e','e','i','o','o','o','o','o','u','u','u','u','u']
Frequency, all categories:   collections.Counter(data)
Frequency of one category:   collections.Counter(data)['a']
Percentage:                  collections.Counter(data)['a']/len(data)*100

• Quantitative Variables
data = [1,2,3,3,3,4,4,4,4,5,5,5,6,6,7,7,7,7,7,8,10,10,10]
N:               len(data)
Mean:            statistics.mean(data)
Median:          statistics.median(data)
Mode:            statistics.mode(data)
Variance:        statistics.variance(data)
Std. Deviation:  statistics.stdev(data)
95% CI (normal): from scipy import stats
                 cv = stats.norm.ppf(0.975)
                 error = cv * stdev / math.sqrt(n)
                 CI = (mean - error, mean + error)
Quartiles:       q1 = np.quantile(data, 0.25)
                 q3 = np.quantile(data, 0.75)

Output for the example data:
n = 23
mean = 5.565217391304348
median = 5
mode = 7
variance = 6.3478260869565215
standard deviation = 2.5194892512087685
CI (95%) = (4.5355506551396925, 6.594884127469003)
Quartiles = 4.0 5.0 7.0
121
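The fragments above can be combined into one runnable script. As a standard-library alternative to scipy, `statistics.NormalDist` supplies the 1.96 critical value and `statistics.quantiles` the quartiles; this sketch reproduces the example output for the same data.

```python
import math
import statistics
import collections

# Qualitative variable: frequencies and percentages
letters = ['a','a','a','e','e','e','e','e','e','i','o','o','o','o','o','u','u','u','u','u']
counts = collections.Counter(letters)
pct_a = counts['a'] / len(letters) * 100          # percentage of 'a' values

# Quantitative variable: central tendency, dispersion, 95% CI, quartiles
data = [1,2,3,3,3,4,4,4,4,5,5,5,6,6,7,7,7,7,7,8,10,10,10]
n = len(data)
mean = statistics.mean(data)
stdev = statistics.stdev(data)                    # sample standard deviation
z = statistics.NormalDist().inv_cdf(0.975)        # ~1.96, stdlib alternative to scipy
error = z * stdev / math.sqrt(n)
ci95 = (mean - error, mean + error)
q1, q2, q3 = statistics.quantiles(data, n=4)      # quartiles -> 4.0, 5.0, 7.0
print(f"mean={mean:.4f} stdev={stdev:.4f} CI95=({ci95[0]:.4f}, {ci95[1]:.4f})")
```

The printed values match the slide’s output (mean ≈ 5.5652, st. dev. ≈ 2.5195, 95% CI ≈ (4.5356, 6.5949)).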
Case Study 1: Data Description
Heart Disease Data Set

https://archive.ics.uci.edu/ml/datasets/Heart+Disease

N = 303
Natt = 14
122
Case Study 1: Variable Types
Position  Short name  Type             Description
1         age         Quantitative     Patient’s age in years
2         sex         Qualitative      Patient’s gender (1: male, 0: female)
3         cp          Qualitative      Chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
4         trestbps    Quantitative     Resting blood pressure (in mm Hg on admission to the hospital)
5         chol        Quantitative     Serum cholesterol in mg/dl
6         fbs         Qualitative      Fasting blood sugar > 120 mg/dl (1: true, 0: false)
7         restecg     Qualitative      Resting electrocardiographic results (0: normal, 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: probable or definite left ventricular hypertrophy by Estes’ criteria)
8         thalach     Quantitative     Maximum heart rate achieved
9         exang       Qualitative      Exercise-induced angina (1: yes, 0: no)
10        oldpeak     Quantitative     ST depression induced by exercise relative to rest
11        slope       Qualitative      The slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
12        ca          Qualitative (*)  Number of major vessels (0–3) colored by fluoroscopy
13        thal        Categorical      3: normal, 6: fixed defect, 7: reversible defect
14        num         Categorical      Diagnosis of heart disease (angiographic disease status) (0: < 50% diameter narrowing, 1: > 50% diameter narrowing)

123
CS1: Data Description with Python
1. Download the .csv file in a local folder
2. Take a look at the particularities of this data set
3. Make a table with all the attributes
4. In python, load the file in a pandas DataFrame structure indicating the names of
the columns
5. Look at the types of the attributes
6. Declare categorical attributes according to the previous table
7. Obtain the description of all the variables
8. Obtain frequency of all categories
9. Observe from the previous result that there are some missing values ?
10. Calculate the mean, median, and mode of age
11. Calculate the mean, median, and mode of age for males and females separately
12. Calculate 95% CI of chol
13. Calculate 95% CI of chol for women and men separately
14. Calculate 95% CI of all numeric attributes
15. Calculate interquartile ranges of chol
16. Calculate interquartile ranges of all numeric attributes
124
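Steps 4–10 of the exercise could look like the following pandas sketch. A three-row inline sample stands in for the downloaded file; the rows are adapted from the dataset format, with one value replaced by '?' to illustrate a missing value.

```python
import io
import pandas as pd

cols = ["age","sex","cp","trestbps","chol","fbs","restecg","thalach",
        "exang","oldpeak","slope","ca","thal","num"]

# In the exercise you would read the downloaded file instead:
#   df = pd.read_csv("processed.cleveland.data", names=cols, na_values="?")
# Here a 3-row inline sample stands in for it ('?' marks a missing value).
sample = """63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,?,2
41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,1"""
df = pd.read_csv(io.StringIO(sample), names=cols, na_values="?")

# Step 6: declare the qualitative attributes as categorical
for c in ["sex","cp","fbs","restecg","exang","slope","ca","thal","num"]:
    df[c] = df[c].astype("category")

print(df.describe())                           # step 7: numeric summary
print(df["thal"].value_counts(dropna=False))   # step 8: frequencies; '?' shows as NaN
print(df["age"].mean(), df["age"].median())    # step 10: central tendency of age
```

The remaining steps (per-gender statistics, confidence intervals, interquartile ranges) combine `df.groupby("sex")` with the descriptive-statistics functions from the previous slide.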
Statistical Data Analysis: Inferential Statistics
Inferential statistics is a branch of statistics that aims to infer properties of an underlying
population (assumed to follow a probability distribution) from sample data.
• Examples:
– One single population: e.g., probability of developing a lung cancer being a smoker.
– Two populations: e.g., treated patients vs. not-treated patients.
– N populations: e.g., effect of one drug depending on the patient’s disease.

• Related concepts:
– Confidence Intervals (CI): “P% CI [a, b]” means that we are P% confident that the population
parameter is in the interval [a, b] (it is not that P% of the population is within the interval).
– Hypothesis test: null hypothesis vs. alternative hypothesis.
• Equality: e.g., a treatment has no effect.
• Improvement: e.g., one treatment is better than another.
– P-value (or significance): probability that an observation is due to random chance.
• P-value < 0.05: we reject the null hypothesis and accept the alternative (e.g., the treatment has an influence
on the evolution of the patient).
• P-value > 0.05: there is no statistical evidence to reject the null hypothesis (e.g., we cannot conclude that the
drug improves the evolution of the patient).
– Test Statistics:
• Quantitative variables: Student’s t-test, Welch’s t-Test.
• Qualitative variables: Pearson’s Chi-squared test, Fisher’s test.
– ANOVA

125
Concepts Similarity

① Confidence Interval (CI)


(needs data and tables)

② Hypothesis Test
(needs data and tables)

③ P-value
(needs only data)

126
Student’s t-Test and Welch’s t-Test
A t-test is a type of inferential statistic used to determine if there
is a significant difference between the means of two groups.
• The sample (quantitative values) follows a normal distribution.
• One sample: To test whether the mean of a population has a
specific value. E.g., the average temperature of patients with
flu is 39.5 oC.
• Two samples: To compare the mean values of two samples.
– Variance of the Samples
• Same variance: use Student’s t-Test.
• Different or unknown variances: use Welch’s t-Test.
– Samples relationship
• Unpaired Samples: The two samples are independent. E.g., smoker mothers
affect the mean weight of newborns (we have separate independent samples of
smoking women and non-smoking women).
• Paired Samples: The two samples are dependent (they belong to the same
subject) E.g., the mean temperature of the patients after the treatment went
down (we compare the temperatures before and after the treatment of the
same patient).

William Sealy Gosset “Student”


(1876-1937) 127
Examples

• One-sample Student’s t-test
– Hypothesis: the average temperature is 39.5 °C.
– Data: 39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6 (n = 8, mean = 39.55, st. dev. = 0.4175).
– Statistic: t = (x̄ − μ0) / (s/√n) = 0.3387.
– P-value = 0.7447 (> α) → we accept that the average temperature is 39.5 °C.

• Two-sample unpaired Student’s t-test (equal variances)
– Hypothesis: smoking while pregnant reduces the mean weight of newborns.
– Data: smokers: 2.7, 2.9, 2.7, 2.8, 3.0, 2.7, 2.5, 3.1 (n = 8, mean = 2.80, st. dev. = 0.1927); non-smokers: 3.4, 2.8, 3.5, 3.2, 3.6, 2.9, 3.5, 3.2, 3.6, 2.9 (n = 10, mean = 3.26, st. dev. = 0.3062).
– Statistic: t = (x̄1 − x̄2) / (s·√(1/n1 + 1/n2)), with pooled s = √(((n1−1)·s1² + (n2−1)·s2²) / (n1 + n2 − 2)); t = −3.6918.
– P-value = 0.00198 (< α) → smoking while pregnant affects the newborn weight (reduction).

• Two-sample paired Student’s t-test
– Hypothesis: the mean temperature of the patients went down after the treatment.
– Data (before–after): 39.0–37.5, 40.0–39.5, 38.9–37.8, 39.9–39.4, 39.4–37.5, 39.7–39.9, 39.9–38.9, 39.6–39.7 (n = 8; differences d = x1 − x2, with d̄ = 0.8 and s_d = 0.7382).
– Statistic: t = (d̄ − μ0) / (s_d/√n) = 2.9693.
– P-value = 0.02083 (< α) → after the treatment, the temperature went down.

• Welch’s t-test (unknown variances)
– Hypothesis: smoking while pregnant reduces the mean weight of newborns.
– Data: the same smoker/non-smoker samples as in the unpaired example above.
– Statistic: t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2) = −3.8848.
– P-value = 0.0014 (< α) → smoking while pregnant reduces the newborn weight.
128
Student’s/Welch’s t-Tests with Excel
(Functions in Excel)

df = n - 1

Table values (α → t):   1.645 = INV.T(1-0.05; df)                     1.96 = INV.T(1-0.025; df)
                        1.645 = INV.T.2C(0.05*2; df)                  1.96 = INV.T.2C(0.025*2; df)
(t → P-value):          0.05 = DISTR.T.CD(1.645; df)                  0.025*2 = DISTR.T.2C(1.96; df)
                        0.05 = 1 - DISTR.T.N(1.645; df; VERDADERO)    0.025 = DISTR.T.CD(1.96; df)
                                                                      0.025 = 1 - DISTR.T.N(1.96; df; VERDADERO)
(data → P-value):       0.05 = PRUEBA.T.N(D1; D2; 1; type)            0.05 = PRUEBA.T.N(D1; D2; 2; type)

type = 1 (paired), 2 (unpaired same variance), 3 (unknown variances)

129
Student’s & Welch’s t-Tests with Excel
(Example)

One-sample test: Temperature data = 39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6; x0 = 39.5

n = 8
mean = 39.55        PROMEDIO(data)
std.dev = 0.4175    DESVEST.M(data)
error = 0.1476      std.dev/SQRT(n)
df = 7              n-1
t = 0.3388          ABS(mean-x0)/error
α = 0.05

                         1 TAIL                             2 TAIL
(1) t-table (t=)         1.8946 = INV.T(1-α; df)            2.3646 = INV.T.2C(α; df)
(2) P-value (from t)     0.3724 = DISTR.T.CD(t; df)         0.7447 = DISTR.T.2C(t; df)
(3) P-value (from data)  0.3724 = PRUEBA.T(D1; D2; 1; 1)    0.7447 = PRUEBA.T.2C(D1; D2; 2; 1)

(4) Data > Data Analysis add-in, “t-Test: Paired Two Sample for Means”:
    mean = 39.55, variance = 0.17428571, observations = 8, df = 7,
    t statistic = 0.33875374, P(T<=t) one-tail = 0.37236486, critical t (one-tail) = 1.89457861,
    P(T<=t) two-tail = 0.74472973, critical t (two-tail) = 2.36462425.

Options:
(1) Manual calculation of the hypothesis test
(2) Based on P-values if we know or we want to calculate the t value (DISTR.T.* functions)
(3) Based on P-values if we don’t know the t value (PRUEBA.T.* functions)
(4) Using the Data > Data Analysis add-in of Excel
130
Student’s & Welch’s t-Tests with Python
import numpy as np
import scipy.stats as stats

# one-sample Student’s t-test
data = np.array([39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6])
fvalue, pvalue = stats.ttest_1samp(data, 39.5)
print(fvalue, pvalue)
# 0.33875374294705973 0.7447297281621488
# Conclusion: we cannot discard Avg. Temp = 39.5

# two-sample paired Student’s t-test
before = np.array([39.0, 40.0, 38.9, 39.9, 39.4, 39.7, 39.9, 39.6])
after = np.array([37.5, 39.5, 37.8, 39.4, 37.5, 39.9, 38.9, 39.7])
fvalue, pvalue = stats.ttest_rel(before, after)
print(fvalue, pvalue)
# 2.969261484185574 0.020829549985541648
# Conclusion: we can discard before = after

# two-sample unpaired (equal variances) Student’s t-test
smokers = np.array([2.7, 2.9, 2.7, 2.8, 3.0, 2.7, 2.5, 3.1])
non_smokers = np.array([3.4, 2.8, 3.5, 3.2, 3.6, 2.9, 3.5, 3.2, 3.6, 2.9])
fvalue, pvalue = stats.ttest_ind(smokers, non_smokers, equal_var=True)
print(fvalue, pvalue)
# -3.6918328279635917 0.0019761593646508767
# Conclusion: we can reject smokers = non_smokers

# Welch’s t-test (unknown variances)
fvalue, pvalue = stats.ttest_ind(smokers, non_smokers, equal_var=False)
print(fvalue, pvalue)
# -3.8848476429716334 0.0014179799721141778
# Conclusion: we can reject smokers = non_smokers
131
Pearson’s Chi-Square Test
Chi-square test: when the values in the sample are categorical, apply the Chi-square
test instead of the t-test.
• E.g., smoking while pregnant affects the weight of newborns
(classified as normal-weight and under-weight).
• Used to determine if there is a significant difference
between the expected frequencies and the observed
frequencies in one or more categories:
  expected[i, j] = obs[i, total] · obs[total, j] / obs[total, total]
• Chi-square value:
  χ² = Σⁿᵢ₌₁ (Oᵢ − Eᵢ)² / Eᵢ
• P-value < α ➔ there is a statistically significant
association between maternal smoking and the
probability of having an underweight newborn.

Karl Pearson
(1857-1936) 132
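The expected-frequency and χ² formulas can be checked by hand on the smoking/newborn-weight table used in this example, before turning to Excel or scipy.

```python
# Manual chi-square computation for the smoking / newborn-weight table.
obs = [[86, 44],   # normal weight: smoker, non-smoker
       [29, 30]]   # under weight:  smoker, non-smoker

row_tot = [sum(r) for r in obs]          # [130, 59]
col_tot = [sum(c) for c in zip(*obs)]    # [115, 74]
total = sum(row_tot)                     # 189

# expected[i][j] = row_total[i] * col_total[j] / grand_total
expected = [[row_tot[i] * col_tot[j] / total for j in range(2)]
            for i in range(2)]

chi2 = sum((obs[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
print(round(chi2, 4))   # ≈ 4.9237, matching the worked example
```

With df = (r−1)(c−1) = 1 and α = 0.05, this χ² exceeds the critical value 3.8415, so the association is statistically significant.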
Chi-Square Test with Excel
(Functions in Excel)

df = 3

Table Values 6.251 = INV.CHICUAD(1-0.10; df)


( → t) 6.251 = INV.CHICUAD.CD(0.10; df)
(2 → P-value) 0.10 = DISTR.CHICUAD.CD(6.251; df)
0.10 = 1 – DISTR.CHICUAD(6.251; df; VERDADERO)
(data → P-value) 0.10 = PRUEBA.CHICUAD(D-OBSERVED; D-EXPECTED)

133
Chi-Square Test with Excel
(Example)

Observed smoker no total


normal 86 44 130
under 29 30 59
total 115 74 189

Expected smoker no
normal 79.1005 50.8995
under 35.8995 23.1005

2 4.9237 S r S s (obs[i,j]-exp[i,j]) 2 /exp[i,j]


 0.05
df 1 (r-1)(c-1)
(1)
 2-table ( 2=) 3.8415 INV.CHICUAD.CD( ; df)
(2)
P-value (from  2 value) 0.0265 DISTR.CHICUAD.CD( 2; df)
(3)
P-value (from data) 0.0265 PRUEBA.CHICUAD(obs;exp)

134
Chi-Square Test with Python
COMPARED VARIABLE \ GROUP   Smoker   Non-smoker

Normal-weight                 86        44
Under-weight                  29        30

import numpy as np
import scipy.stats as stats

# Chi-square test
data = np.array([[86, 44],[29, 30]])

fvalue, pvalue, df, expected = stats.chi2_contingency(data , correction=False)


print(fvalue, pvalue, df, expected)
4.923705434361292 0.026490642530502487 1 [[79.1005291 50.8994709] [35.8994709 23.1005291]]

Conclusion: there’s an association between the two variables (smoking during pregnancy affects normality of the
weight of the newborns)

135
ANOVA
ANOVA or Analysis of Variance (between group means): used for
testing two or more groups to see whether there’s a difference
between their mean values. For example, testing if several
treatments cause the same result.
• Types of ANOVA tests:
– One-way ANOVA: used when you want to test several groups considering one
single factor (or category) to see if there’s a difference between them. For
example, testing whether a drug has the same effect in children, adults, and
elders (1 factor: age; 3 groups: children, adults, elders).
– Two-way ANOVA: used when you want to test several groups considering two
factors to see if there’s a difference between groups for each one of the
factors. For example, testing whether a drug has the same effect in children,
adults, and elders, but also if there’s any different effect on male and female
patients (2 factors: age, gender; 6 groups: children-male, children-female, …).
• With/without replication: whether the groups have m > 1 cases (with replication) or m = 1 case (without replication).

Ronald Aylmer Fisher


(1890-1962) 136
Understanding ANOVA
• One-way ANOVA groups:
– H0: there is no difference between the groups.
– H1: some groups are different.
• Two-way ANOVA groups (without replication): two pairs of hypotheses.
– Factor 1 — H0: there is no difference between groups for factor 1 (e.g., no difference in the drug effect between males and females); H1: some groups are different for factor 1.
– Factor 2 — H0: there is no difference between groups for factor 2 (e.g., no difference in the drug effect between age groups); H1: some groups are different for factor 2.
• Two-way ANOVA groups (with replication): the same hypotheses as without replication, plus an interaction pair.
– H0: there is no interaction between the factors. For example, following a concrete treatment (f1) does not interact with gender (f2).
– H1: there is some interaction between the factors. For example, taking contraceptives (f1) interacts with gender (f2) (e.g., women are more sensitive).

137
One-Way ANOVA
One-way analysis of variance (ANOVA): used to determine whether there are any
statistically significant differences between the means of two or more independent
(unrelated) groups.
– E.g., compare the efficacy of different dosages of a drug against high blood pressure.

Data layout: Xij = value i of group j (k groups, nj values in group j, N values in total), with group means X̄1, …, X̄k and grand mean X̄.

Source of       Sum of Squares                 Degrees of   Mean Squares       F
Variation                                      Freedom      (MS)
Between Groups  SSb = Σj nj·(X̄j − X̄)²         k − 1        MSb = SSb/(k−1)    F = MSb/MSw
Within Groups   SSw = Σj Σi (Xij − X̄j)²       N − k        MSw = SSw/(N−k)
Total           SSt = Σj Σi (Xij − X̄)²        N − 1

138
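The sums of squares can be verified numerically; this sketch reproduces SSb, SSw, and F for the five-treatment blood-pressure data used in the Excel example of this section (5 groups of 5 patients).

```python
groups = [
    [180, 173, 175, 182, 181],  # no treatment
    [172, 158, 167, 160, 175],  # salt-free diet
    [163, 170, 158, 162, 170],  # 100 mg
    [158, 146, 160, 171, 155],  # 150 mg
    [147, 152, 143, 155, 160],  # 200 mg
]
k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N

# Between-groups and within-groups sums of squares
ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

ms_b = ss_b / (k - 1)   # 502.66
ms_w = ss_w / (N - k)   # 44.72
F = ms_b / ms_w         # ≈ 11.2402
print(round(ss_b, 2), round(ss_w, 2), round(F, 4))
```

The results (SSb = 2010.64, SSw = 894.4, F ≈ 11.24) match the Excel ANOVA table and the scipy/Pingouin outputs shown later in this section.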
One-Way ANOVA with Excel
“We want to evaluate the efficacy of different doses of a drug against high blood pressure,
comparing it with that of a salt-free diet. For this, 25 hypertensive patients are randomly
selected and distributed randomly in 5 groups. The first of them is not given any treatment,
the second a diet without salt, the third a drug dose 100 mg, the fourth the drug at 150 mg
dose and the fifth the same drug at 200 mg dose. The systolic blood pressures of the 25
subjects at the end of the treatments are captured.”
TREATMENTS
null salt free 100 mg 150 mg 200 mg
1 180 172 163 158 147
2 173 158 170 146 152
3 175 167 158 160 143
4 182 160 162 171 155
5 181 175 170 155 160

One-factor ANOVA

SUMMARY
Group n sum mean variance
null 5 891 178.2 15.7
salt free 5 832 166.4 54.3
100 mg 5 823 164.6 27.8
150 mg 5 790 158 81.5
200 mg 5 757 151.4 44.3

ANALYSIS OF THE VARIANCE

Variation source  Sum square  df  Mean square  F          P-value    F crit
Between groups    2010.64     4   502.66       11.240161  0.0000606  2.8660814
Within groups     894.4       20  44.72
Total             2905.04     24

139
One-Way ANOVA with Python
Install Pingouin (for option ②): pip install --upgrade pingouin

① Using scipy, with the data in wide format (one column per treatment):

import numpy as np
import pandas as pd
import scipy.stats as stats

data = np.array([[180,172,163,158,147],
                 [173,158,170,146,152],
                 [175,167,158,160,143],
                 [182,160,162,171,155],
                 [181,175,170,155,160]])
df = pd.DataFrame(data=data, columns=["null", "salt free", "100 mg", "150 mg", "200 mg"])

fvalue, pvalue = stats.f_oneway(df['null'], df['salt free'],
                                df['100 mg'], df['150 mg'], df['200 mg'])
print(fvalue, pvalue)
# 11.240161001788922 6.0616159772040674e-05

② Using Pingouin, with the data in long format (one Group column, one Value column):

import numpy as np
import pandas as pd
import pingouin as pg

data = np.array([[1,180],[1,173],[1,175],[1,182],[1,181],
                 [2,172],[2,158],[2,167],[2,160],[2,175],
                 [3,163],[3,170],[3,158],[3,162],[3,170],
                 [4,158],[4,146],[4,160],[4,171],[4,155],
                 [5,147],[5,152],[5,143],[5,155],[5,160]])
df = pd.DataFrame(data=data, columns=["Group","Value"])
aov = pg.anova(dv='Value', between='Group', data=df, detailed=True)
print(aov)
#    Source       SS  DF      MS          F     p-unc       np2
# 0   Group  2010.64   4  502.66  11.240161  0.000061  0.692121
# 1  Within   894.40  20   44.72        NaN       NaN       NaN

CONCLUSION: one or more treatments have a different mean efficacy 140


Two-way ANOVA without Replication
Without replication: the intersection between the two factors defines groups of only one value.
– E.g., compare the effect of three different drugs on the systolic blood pressure of patients, depending on their gender, when only one patient is involved per group in the study.
– Two null hypotheses: (1) the means of the genders are the same, and (2) the means of the treatments are the same.

Data layout: Xij = value of row group Ai (r rows) and column group Bj (c columns), with row means X̄i., column means X̄.j, and grand mean X̄.

Source of        Sum of Squares                         Degrees of     Mean Squares             F
Variation                                               Freedom        (MS)
Between Group A  SSA = c · Σi (X̄i. − X̄)²               r − 1          MSA = SSA/(r−1)          F = MSA/MSw
Between Group B  SSB = r · Σj (X̄.j − X̄)²               c − 1          MSB = SSB/(c−1)          F = MSB/MSw
Within Groups    SSw = Σj Σi (Xij − X̄i. − X̄.j + X̄)²   (r−1)·(c−1)    MSw = SSw/((r−1)(c−1))
Total            SSt = Σj Σi (Xij − X̄)²                 r·c − 1

141
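A numeric sketch of these formulas on a hypothetical 2×3 table (rows = gender as factor A, columns = treatment as factor B, one value per cell); the data are invented for illustration.

```python
# Hypothetical 2x3 table, one measurement per cell (no replication).
X = [[150, 140, 130],
     [160, 148, 134]]
r, c = len(X), len(X[0])

grand = sum(sum(row) for row in X) / (r * c)
row_means = [sum(row) / c for row in X]
col_means = [sum(X[i][j] for i in range(r)) / r for j in range(c)]

ss_a = c * sum((m - grand) ** 2 for m in row_means)               # factor A
ss_b = r * sum((m - grand) ** 2 for m in col_means)               # factor B
ss_w = sum((X[i][j] - row_means[i] - col_means[j] + grand) ** 2   # residual
           for i in range(r) for j in range(c))

ms_a, ms_b = ss_a / (r - 1), ss_b / (c - 1)
ms_w = ss_w / ((r - 1) * (c - 1))
F_a, F_b = ms_a / ms_w, ms_b / ms_w
print(round(F_a, 3), round(F_b, 3))   # F_a ≈ 17.286, F_b ≈ 56.714 for this invented data
```

Each F value would then be compared against the F distribution with the corresponding degrees of freedom to test the two null hypotheses.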
Two-way ANOVA (w/o replication) with Excel
“We want to evaluate the efficacy of different doses of a drug, and the patient’s gender, against
high blood pressure. For this, we recruit 15 males and 15 females and randomly assign three
individuals of each gender to each of the 5 given treatments. These treatments are: (1) no
treatment (placebo), (2) a salt-free diet, (3) drug at a 100 mg dose,
(4) drug at a 150 mg dose, and (5) drug at a 200 mg dose. The systolic blood
pressures of the 30 subjects are measured and tabulated at the end of the treatments.”

TREATMENTS
null salt free 100 mg 150 mg 200 mg
1 men 180 172 163 158 147
2 173 158 170 146 152
3 175 167 158 160 143
4 women 182 160 162 171 155
5 181 175 170 155 160
6 179 163 150 148 140

Two-Way ANOVA

Source of variation    Sum of squares    df    Mean square    F            P-value        F crit
Gender                 28.0333333        1     28.0333333     0.4740699    0.499030358    4.3512435
Treatment              2813.53333        4     703.383333     11.8948703   4.14415E-05    2.8660814
Interaction            63.1333333        4     15.7833333     0.26691094   0.895750572    2.8660814
Within groups          1182.66667        20    59.1333333

Total                  4087.36667        29

142
Two-way ANOVA with Replication
With replication: the intersection of the two factors defines groups of more than one value.
– E.g., compare the effect of three different drugs on the systolic blood pressure of patients, depending on their gender,
  when there are several (the same number of) males/females receiving each treatment.
– Three null hypotheses: (1) the means of the genders are the same, (2) the means of the treatments are the same, and
  (3) there is no interaction between treatment and gender.
Data layout (m replicates per cell; X_ij. is the mean of cell (i,j), X_i.. and X_.j. are the row
and column means, and X̄ is the grand mean):

            Group B1          Group B2          …   Group Bc
Group A1    X_111,…,X_11m     X_121,…,X_12m     …   X_1c1,…,X_1cm   | X_1..
…           …                 …                 …   …               | …
Group Ar    X_r11,…,X_r1m     X_r21,…,X_r2m     …   X_rc1,…,X_rcm   | X_r..
            X_.1.             X_.2.             …   X_.c.           | X̄

X_ij. = (1/m) · Σ_{k=1..m} X_ijk

Source of     Sum of Squares                                 Degrees of     Mean Squares (MS)            F
Variation                                                    Freedom
Between       SS_A = m·c · Σ_{i=1..r} (X_i.. − X̄)²           r − 1          MS_A = SS_A/(r−1)            F = MS_A / MS_w
Group A
Between       SS_B = m·r · Σ_{j=1..c} (X_.j. − X̄)²           c − 1          MS_B = SS_B/(c−1)            F = MS_B / MS_w
Group B
Interaction   SS_AB = m · Σ_{j=1..c} Σ_{i=1..r}              (r−1)·(c−1)    MS_AB = SS_AB/((r−1)(c−1))   F = MS_AB / MS_w
AB                    (X_ij. − X_i.. − X_.j. + X̄)²
Within        SS_w = Σ_{k=1..m} Σ_{j=1..c} Σ_{i=1..r}        r·c·(m−1)      MS_w = SS_w/(r·c·(m−1))
Groups               (X_ijk − X_ij.)²
Total         SS_t = Σ_{k=1..m} Σ_{j=1..c} Σ_{i=1..r}        r·c·m − 1
                     (X_ijk − X̄)²
143
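As a check on the formulas, the sketch below (an illustrative NumPy computation added here, not part of the original slides) applies them to the gender x treatment data of the Excel example above; the sums of squares reproduce the Gender, Treatment, Interaction, and Within-group values of that ANOVA table:

```python
import numpy as np

# 2 genders (rows) x 5 treatments (cols) x 3 replicates, from the data table above
X = np.array([[[180, 173, 175], [172, 158, 167], [163, 170, 158],
               [158, 146, 160], [147, 152, 143]],                   # men
              [[182, 181, 179], [160, 175, 163], [162, 170, 150],
               [171, 155, 148], [155, 160, 140]]], dtype=float)     # women
r, c, m = X.shape
grand = X.mean()
cell = X.mean(axis=2)                  # X_ij. (cell means)
row = X.mean(axis=(1, 2))              # X_i.. (gender means)
col = X.mean(axis=(0, 2))              # X_.j. (treatment means)

ss_a = m * c * np.sum((row - grand) ** 2)                            # Gender
ss_b = m * r * np.sum((col - grand) ** 2)                            # Treatment
ss_ab = m * np.sum((cell - row[:, None] - col[None, :] + grand) ** 2)  # Interaction
ss_w = np.sum((X - cell[:, :, None]) ** 2)                           # Within groups
```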
Two-way ANOVA with Python

import numpy as np
import pandas as pd
import pingouin as pg

# Factor A (gender): 1 = men, 2 = women
# Factor B (treatment): 1 = null, 2 = salt free, 3 = 100 mg, 4 = 150 mg, 5 = 200 mg
data = np.array([[1,1,180],[1,1,173],[1,1,175],[2,1,182],[2,1,181],[2,1,179],
                 [1,2,172],[1,2,158],[1,2,167],[2,2,160],[2,2,175],[2,2,163],
                 [1,3,163],[1,3,170],[1,3,158],[2,3,162],[2,3,170],[2,3,150],
                 [1,4,158],[1,4,146],[1,4,160],[2,4,171],[2,4,155],[2,4,148],
                 [1,5,147],[1,5,152],[1,5,143],[2,5,155],[2,5,160],[2,5,140]])
df = pd.DataFrame(data=data, columns=["A","B","Value"])

aov = pg.anova(dv='Value', between=['A','B'], data=df, detailed=True)
print(aov)

     Source           SS  DF          MS          F     p-unc       np2
0         A    28.033333   1   28.033333   0.474070  0.499030  0.023155
1         B  2813.533333   4  703.383333  11.894870  0.000041  0.704052
2     A * B    63.133333   4   15.783333   0.266911  0.895751  0.050677
3  Residual  1182.666667  20   59.133333        NaN       NaN       NaN

• There is no evidence to conclude that men and women (factor A) have different SBP.
• There is evidence that the treatments (factor B) cause different SBP (p = 0.000041).
• There is no evidence that gender and treatment have some interaction (for example,
  men disliking some treatment and failing to follow it).
144
Post-Hoc Analysis: Tukey’s Test
• T-tests and ANOVA allow us to detect whether there are differences between the group mean
  values, but they do not provide insight into the pairwise comparison of groups: the means can be
  declared equal or different as a whole, but not which ones are equal and which ones are different.
Post-hoc analysis: a kind of study that allows pairwise comparison of the groups of a multi-group test.
Tukey’s test: a post-hoc analysis used to find means that are significantly different from each
other.
Which treatments (column B) can be considered different to which other treatments?
import numpy as np
import pandas as pd
import pingouin as pg

data = np.array([[1,1,180],[1,1,173],[1,1,175],[2,1,182],[2,1,181],[2,1,179],
                 [1,2,172],[1,2,158],[1,2,167],[2,2,160],[2,2,175],[2,2,163],
                 [1,3,163],[1,3,170],[1,3,158],[2,3,162],[2,3,170],[2,3,150],
                 [1,4,158],[1,4,146],[1,4,160],[2,4,171],[2,4,155],[2,4,148],
                 [1,5,147],[1,5,152],[1,5,143],[2,5,155],[2,5,160],[2,5,140]])
df = pd.DataFrame(data=data, columns=["A","B","Value"])

pt = pg.pairwise_tukey(data=df, dv='Value', between='B')   # 'B' is the treatment column
print(pt)

   A  B     mean(A)     mean(B)       diff        se       tail         T   p-tukey    hedges
0  1  2  178.333333  165.833333  12.500000  4.121219  two-sided  3.033083  0.026411  1.616448
1  1  3  178.333333  162.166667  16.166667  4.121219  two-sided  3.922788  0.001503  2.090605
2  1  4  178.333333  156.333333  22.000000  4.121219  two-sided  5.338227  0.001000  2.844948
3  1  5  178.333333  149.500000  28.833333  4.121219  two-sided  6.996312  0.001000  3.728606
4  2  3  165.833333  162.166667   3.666667  4.121219  two-sided  0.889704  0.891345  0.474158
5  2  4  165.833333  156.333333   9.500000  4.121219  two-sided  2.305143  0.154899  1.228500
6  2  5  165.833333  149.500000  16.333333  4.121219  two-sided  3.963229  0.001293  2.112158
7  3  4  162.166667  156.333333   5.833333  4.121219  two-sided  1.415439  0.599289  0.754342
8  3  5  162.166667  149.500000  12.666667  4.121219  two-sided  3.073524  0.023565  1.638000
9  4  5  156.333333  149.500000   6.833333  4.121219  two-sided  1.658086  0.462465  0.883658

Among others (p-tukey < 0.05):
 Treatment 1 (null) and 3 (100 mg),
 Treatment 1 (null) and 4 (150 mg),
 Treatment 1 (null) and 5 (200 mg), and
 Treatment 2 (salt free) and 5 (200 mg) …
… all can be considered to have different means.

John Wilder Tukey
(1915-2000)
145
Normality test
• Values do not necessarily follow a normal distribution
• Testing normality:
  – Rule of thumb: draw a histogram of the data and observe whether it follows a normal distribution.
  – Shapiro-Wilk method:

        W = (Σ_{i=1..n} a_i · x_(i))² / Σ_{i=1..n} (x_i − x̄)²

  • Excel:
    – Install “Real Statistics”
    – Access Real Statistics with CTRL-M
    – Select the Shapiro-Wilk test
    – Observe the P-values obtained
    – P-value > α => normality cannot be rejected
  • Example:
    – Data: 11.25, 10.00, 9.68, 10.52, 8.77, 9.92, 8.62, 10.21, 9.09, 10.36.
    – P-value (=SWTEST(data)): 0.8122 => The data follow a Normal distribution

  – Kolmogorov-Smirnov method: compares the values with some reference distribution
    (in this case, the normal distribution is taken).
  • Example: given the same data above.

146
source: https://synapse.koreamed.org/Synapse/Data/PDFData/1010JRD/jrd-26-5.pdf
Shapiro-Wilk Test
[table: Shapiro-Wilk critical values]
                                          Samuel Sanford Shapiro (1930-)
A.
1. Calculate the SW statistic W for the sample:

        W = (Σ_{i=1..n} a_i · x_(i))² / Σ_{i=1..n} (x_i − x̄)²

2. Obtain the critical value CV from the table:
        CV = table(n, α)
3. Conclude about normality:
        If W < CV, we reject normality
        Otherwise we cannot reject normality

B.
1. Calculate the P-value of the data:
        P-value = SWTEST(data)
2. Conclude about normality:
        If P-value < α, we reject normality
        Otherwise we cannot reject normality

Martin B. Wilk (1922–2013)
147
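Procedure B can also be reproduced with SciPy (a sketch added here for convenience, assuming `scipy.stats.shapiro` is available; the data are those of the Excel example):

```python
from scipy import stats

# Same data as in the Excel example; stats.shapiro returns (W, p-value)
data = [11.25, 10.00, 9.68, 10.52, 8.77, 9.92, 8.62, 10.21, 9.09, 10.36]
W, p = stats.shapiro(data)

if p < 0.05:          # alpha = 0.05
    print("reject normality")
else:
    print("normality cannot be rejected")
```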
A-Coefficients for the Shapiro-Wilk Test

148
Example Shapiro-Wilk Test in Excel

Data   Ordered   (x-mean)²   ai (n=19)   Ordered Inv   Ord − OrdInv   ai·(Ord − OrdInv)
2.4    -10       95.01       0.4808      9.9           -19.9          -9.5679             n = 19
4.5    -9        76.52       0.3232      7.8           -16.8          -5.4298             mean = -0.25
-5.5   -7.9      58.48       0.2561      7.1           -15.0          -3.8415
-0.8   -5.5      27.53       0.2059      5.8           -11.3          -2.3267
-7.9   -4.8      20.68       0.1641      4.6           -9.4           -1.5425
9.9    -3.8      12.58       0.1271      4.5           -8.3           -1.0549
-1     -3.3      9.29        0.0932      2.4           -5.7           -0.5312
7.8    -2.3      4.19        0.0612      1.2           -3.5           -0.2142
0.3    -1        0.56        0.0303      0.3           -1.3           -0.0394
-10    -0.8      0.30        0.0000      -0.8          0.0            0.0000
4.6    0.3       0.31                    -1                           TOTAL  = -24.5482
5.8    1.2       2.11                    -2.3                         TOTAL² = 602.6117
-3.3   2.4       7.04                    -3.3
1.2    4.5       22.59                   -3.8          W = TOTAL² / Σ(x-mean)² = 0.9730
-4.8   4.6       23.55                   -4.8          P-value = 0.831   Table(p-v,19) = 0.97
7.1    5.8       36.63                   -5.5          => Normality
-2.3   7.1       54.06                   -7.9
-3.8   7.8       64.84                   -9
-9     9.9       103.08                  -10
       TOTAL =   619.35

149
Kolmogorov-Smirnov Test

A.
1. Calculate the KS maximum deviation D for the sample:

        D = max_x |F_n(x) − F(x)|

2. Obtain the critical value CV from the table:
        CV = table(n, α)
3. Conclude about normality:
        If D > CV, we reject normality
        Otherwise we cannot reject normality

Andrey Nikolaevich Kolmogorov (1903 – 1987)
Nikolai Vasilyevich Smirnov (1900 – 1966)
150
Example Kolmogorov-Smirnov Test
Data Ordered Position Position/n (Pos-1)/n NORM.S.INV(Pos/n) NORM.DIST(Ord) Difference
8.7 0.6 1 0.06 0.00 -1.5932 0.2160 0.2160 n= 18
51.5 1.2 2 0.11 0.06 -1.2206 0.2222 0.1667 mean= 23.02
24.6 1.2 3 0.17 0.11 -0.9674 0.2222 0.1111 std.dev= 28.53
35.7 1.4 4 0.22 0.17 -0.7647 0.2243 0.0577
31.9 1.5 5 0.28 0.22 -0.5895 0.2254 0.0032 alpha 0.05
67 2.3 6 0.33 0.28 -0.4307 0.2339 0.0439
79 4.6 7 0.39 0.33 -0.2822 0.2593 0.0740
82.7 5.1 8 0.44 0.39 -0.1397 0.2650 0.1239
6.4 6.4 9 0.50 0.44 0.0000 0.2801 0.1643
8.9 8.7 10 0.56 0.50 0.1397 0.3079 0.1921
1.5 8.9 11 0.61 0.56 0.2822 0.3104 0.2452
4.6 24.6 12 0.67 0.61 0.4307 0.5221 0.0890
2.3 31.9 13 0.72 0.67 0.5895 0.6222 0.0444
0.6 35.7 14 0.78 0.72 0.7647 0.6717 0.0505
1.4 51.5 15 0.83 0.78 0.9674 0.8409 0.0632
1.2 67 16 0.89 0.83 1.2206 0.9384 0.1051
5.1 79 17 0.94 0.89 1.5932 0.9751 0.0862
1.2 82.7 18 1.00 0.94 0.9818 0.0373
MAX = KS = 0.24518
Table(n, α) = 0.30936  => Normal

Note: Shapiro-Wilk says NO

151
Normality test with Python

Shapiro-Wilk (pingouin):

import numpy as np
import pandas as pd
import pingouin as pg

data = np.array([2.4,4.5,-5.5,-0.8,-7.9,9.9,-1,7.8,0.3,
                 -10,4.6,5.8,-3.3,1.2,-4.8,7.1,-2.3,-3.8,-9])
df = pd.DataFrame(data=data)

sw = pg.normality(df)
print(sw)

          W     pval  normal
0  0.972835  0.83139    True

=> Can’t discard that the data follow a normal distribution

Kolmogorov-Smirnov (scipy):

import numpy as np
from scipy import stats

data = np.array([2.4,4.5,-5.5,-0.8,-7.9,9.9,-1,7.8,0.3,
                 -10,4.6,5.8,-3.3,1.2,-4.8,7.1,-2.3,-3.8,-9])

ks, p_value = stats.kstest(data, 'norm')
print(ks, p_value)

0.4103285215572715 0.00208587800956761

=> The data do not follow a normal distribution

CONTRADICTION (Shapiro-Wilk preferred)
152
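One caveat worth noting (an addition to the original slides): SciPy's `kstest(data, 'norm')` compares the sample against a *standard* normal with mean 0 and standard deviation 1, so data on a different scale is rejected even if it is bell-shaped. Passing the sample mean and standard deviation as the distribution's parameters removes the contradiction for this data:

```python
import numpy as np
from scipy import stats

data = np.array([2.4, 4.5, -5.5, -0.8, -7.9, 9.9, -1, 7.8, 0.3,
                 -10, 4.6, 5.8, -3.3, 1.2, -4.8, 7.1, -2.3, -3.8, -9])

# Compare against N(mean, std) estimated from the sample instead of N(0, 1)
ks, p_value = stats.kstest(data, 'norm', args=(data.mean(), data.std(ddof=1)))
print(ks, p_value)   # now normality is not rejected, in agreement with Shapiro-Wilk
```

Estimating the parameters from the same sample makes the nominal KS p-value optimistic (the Lilliefors correction addresses this), which is one more reason why Shapiro-Wilk is preferred for small samples.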
Non Parametric Tests
• Student’s t-test, Welch’s test, and ANOVA assume data normality.
• If normality is rejected (with Shapiro-Wilk or Kolmogorov-Smirnov),
  alternative (non-parametric) tests must be used:
  – For two samples (t-test): Wilcoxon’s rank sum test (or Mann-Whitney U test)
  – For two or more samples (ANOVA): Kruskal-Wallis test
  – For post-hoc analysis (Tukey): Nemenyi-Damico-Wolfe-Dunn test, Dwass-Steel-
    Critchlow-Fligner test, or Conover-Inman test.

Frank Wilcoxon Henry Berthold Mann D R Whitney William Henry "Bill" Kruskal Wilson Allen Wallis
(1892 –1965) (1905-2000) (? - ? ) (1919 –2005) (1912 – 1998) 153
Wilcoxon’s Rank Sum Test in Python
“Consider a Phase II clinical trial designed to investigate the effectiveness of a
new drug to reduce symptoms of asthma in children. A total of n=10
participants are randomized to receive either the new drug or a placebo.
Participants are asked to record the number of episodes of shortness of breath
over a 1 week period following receipt of the assigned treatment.”
Placebo 7 5 6 4 12 3 1 3 2 1
New Drug 3 6 4 2 1 8 9 12 1 10

import numpy as np
import pingouin as pg

placebo  = np.array([7, 5, 6, 4, 12, 3, 1, 3, 2, 1])
new_drug = np.array([3, 6, 4, 2, 1, 8, 9, 12, 1, 10])

# normality test - Shapiro-Wilk
all = np.concatenate((placebo, new_drug))
sw = pg.normality(all)
print(sw)

          W     pval  normal
0  0.899552  0.04045   False

=> The sample is not normal

# Wilcoxon's rank sum test = Mann-Whitney U test
mwu = pg.mwu(placebo, new_drug)
print(mwu)

     U-val       tail    p-val   RBC  CLES
MWU   42.0  two-sided  0.56812  0.16  0.53

=> We cannot discard that placebo and new_drug give equal results


154
Kruskal-Wallis Test in Python
“Consider a Phase II clinical trial designed to investigate the effectiveness of two
drugs to reduce symptoms of asthma in children. A total of n=30 participants
are randomized to receive either the previous drug (group=1), the new drug
(group=2), or a placebo (group=3). Participants are asked to record the number
of episodes of shortness of breath over a 1 week period following receipt of the
assigned treatment.”
Old Drug 1 2 5 3 2 1 1 3 2 1
New Drug 4 3 6 5 2 6 1 6 5 4
Placebo 9 6 7 7 5 1 8 9 6 5
import numpy as np
import pandas as pd
import pingouin as pg

data = np.array([1,2,5,3,2,1,1,3,2,1,4,3,6,5,2,6,1,6,5,4,9,6,7,7,5,1,8,9,6,5])
group = np.concatenate((np.full(10,1), np.full(10,2), np.full(10,3)))

# normality test
sw = pg.normality(data)
print(sw)

          W      pval  normal
0  0.924408  0.034952   False

=> The sample is not normal

# Kruskal-Wallis test
df = pd.DataFrame(data={'Value':data, 'Group':group})
kw = pg.kruskal(data=df, dv='Value', between='Group')
print(kw)

         Source  ddof1         H     p-unc
Kruskal   Group      2  13.67538  0.001073

=> There are differences in the treatments
155
Testing Variances Between Groups

• F-test
• Levene’s test
• Bartlett’s test
• Brown-Forsythe test

156
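These tests can be run directly with SciPy. A minimal sketch (the two groups are illustrative; `stats.levene` and `stats.bartlett` are assumed to be available):

```python
from scipy import stats

# Two illustrative groups (e.g., SBP values under two treatments)
g1 = [180, 173, 175, 182, 181, 179]
g2 = [147, 152, 143, 155, 160, 140]

lev_stat, lev_p = stats.levene(g1, g2)      # robust to non-normality
bar_stat, bar_p = stats.bartlett(g1, g2)    # assumes normality
# p-value < alpha  =>  reject equality of variances
```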
Comparing Two-Variable Groups
• Concept: y = f(x), where x is the predictor variable (or group), and y the
  prediction variable or event occurrence variable (or factor to study).
• Consider y a binary variable (e.g., death, developing a disease).
  – Ex1. survival = f(taking_drug_d)
  – Ex2. develop_comorbidity = f(has_index_disease_D)
  – Ex3. goes_to_hospital = f(patients_age)
  – Ex4. receives_the_correct_treatment = f(has_disease_D)

• Risk and Odds: quantifying one binary factor for one group (e.g., drug treatment):

                                   y: event occurrence (factor to study)
                                   Yes      No       Total
  x: predictor variable    Yes     a        b        a+b
  (group)

  Pr(factor | group) vs. Poss(factor | group)

• Risk (=probability) of y [given x]:                      Pr(y|x) = a/(a+b)            Range = [0, 1]
• Odds (=possibility) of y [compared to not y][given x]:   Pr(y|x)/Pr(no y|x) = a/b     Range = [0, ∞)

Example: what’s the risk of a diabetic patient developing a second disease?   Pr(develop|DM)
Example: what are the odds of a diabetic patient developing a second disease? Pr(develop|DM)/Pr(no develop|DM)
157
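The two definitions translate directly into code (a trivial sketch; a and b are the counts of the contingency row above):

```python
def risk(a, b):
    # probability of the event in the group: a events out of a+b subjects
    return a / (a + b)

def odds(a, b):
    # events per non-event in the group
    return a / b

# e.g., 1 event among 4 subjects: risk = 0.25, odds = 1/3
r, o = risk(1, 3), odds(1, 3)
# the two measures are linked: odds = risk / (1 - risk)
```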
Comparing two groups: Risk Ratio
• Risk ratio (RR) or relative risk: measures how many times more likely it is to observe one property p
  (event or factor) in a group A than in a group B (the groups are the predictor variable). For
  example, how many times riskier it is to develop breast cancer (property) if there have been
  close family antecedents (group A) than if there have not (group B).
       RR = Risk(p in A) / Risk(p in B) = Pr(p | A) / Pr(p | B)
  – If RR = 1: the risk of A and the risk of B are the same.
  – If RR < 1: the risk of A is lower than the risk of B.          Comparison of one same
  – If RR > 1: the risk of A is higher than the risk of B.         risk between two groups

       RR = (a/(a+b)) / (c/(c+d))

• Relative Risk Reduction (RRR): measures how much risk is reduced between a
  treatment group and a control group. For example, how much risk is reduced when
  a chemotherapy drug is tested with respect to the usual treatment.
       RRR = (Risk(p in ctrl) – Risk(p in treat)) / Risk(p in ctrl) = (Pr(p | ctrl) – Pr(p | treat)) / Pr(p | ctrl)

       RRR = (a/(a+b) − c/(c+d)) / (a/(a+b))       (control group in row (a, b), treatment in row (c, d))
158
Comparing two groups: Odds Ratio
• Odds ratio (OR): measures the association between a certain property p1 and a second
  property p2 in a population A, telling how the presence or absence of p2 affects the presence
  or absence of p1. For example, how does having diabetes mellitus (p1) or not (p2) affect
  the development of hypertension (A)?
       OR = Odds(p1 in A) / Odds(p2 in A)
       Odds(p of A with respect to B) = Pr(p|A) / Pr(not p|A)

  – If OR = 1: the observation of p2 does not affect the odds of p1.
  – If OR > 1: the observation of p2 increases the odds of p1.     Comparison of one same
  – If OR < 1: the observation of p2 decreases the odds of p1.     odds between two groups

       OR = (a/b) / (c/d)
159
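The three measures over a 2×2 table can be sketched as simple helper functions (illustrative code; rows are (a, b) for the first group and (c, d) for the second, as in the formulas above, and the numbers below match the worked examples on the following slides):

```python
def risk_ratio(a, b, c, d):
    # RR = risk in group 1 / risk in group 2
    return (a / (a + b)) / (c / (c + d))

def relative_risk_reduction(a, b, c, d):
    # (a, b) = control group, (c, d) = treatment group
    risk_ctrl, risk_treat = a / (a + b), c / (c + d)
    return (risk_ctrl - risk_treat) / risk_ctrl

def odds_ratio(a, b, c, d):
    # OR = (a/b) / (c/d)
    return (a / b) / (c / d)

# 20% of controls vs 10% of treated patients get worse => RRR = 0.5
rrr = relative_risk_reduction(20, 80, 10, 90)
```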
RR and OR Examples
• What is the risk of developing secondary effects when a patient is
treated with drug D1 in comparison to the treatment with drug D2, Groups = {D1, D2}
if half of the D1 patients and one third of the D2 patients observed Factors = {2ary effect Y/N}
secondary effects after their treatments? Pr(2ary effect Y | D1) = 0.5
RR = 0.5 / 0.3333 = 1.5
(i.e., the risk of having secondary effects with D1 is 1.5 times greater than with
D2)
• When we test a new drug on breast cancer patients we observe
that 20% of the control group (i.e., patients receiving the regular Groups = {ctrl, treat}
treatment) get worse, and 10% of the experimental group (i.e., Factors = {worse, better}
patients receiving the new drug) get worse. How much risk is Pr(worse | ctrl) = 20%
reduced with the use of this new drug? Pr(worse | treat) = 10%
RRR = (0.2-0.1)/0.2 = 0.5 = 50%
(i.e., the risk of getting worse with the new drug is reduced to one half)

• In a population of people who frequently go to the beach, 5% of
  those who use a protection cream develop skin melanoma, and         Groups = {protection Y/N}
  15% of those not using protection cream develop melanoma. What      Factors = {Melanoma Y/N}
  is the odds ratio of developing skin melanoma not using a
  protection cream with respect to using protection?                  Group \ Factor   Melanoma Y   Melanoma N
  OR = 0.1765 / 0.053 = 3.33                                          Protection Y     5%           95%
  (i.e., among people going to the beach, the odds of developing      Protection N     15%          85%
  skin melanoma are 3.33 times higher for those who don’t use a
  protection cream than for those who do)                             Odds (Prot-N in melanoma) = 15/85
                                                                      Odds (Prot-Y in melanoma) = 5/95
160
RR and OR in Python
                                          Effect    No Effect
2x2 Contingency Table:      Placebo       a         b
                            Treatment     c         d

              2ndary    No 2ndary                             Develop     No
              Effect    Effect                                Melanoma    Melanoma
Drug D1       50        50               No Sun Protection    3           20
Drug D2       30        60               Use Sun Protection   15          300

import statsmodels.api as sm

RR = sm.stats.Table2x2([[50, 50], [30, 60]]).riskratio
print(RR)
1.5

OR = sm.stats.Table2x2([[3,20],[15,300]]).oddsratio
print(OR)
3.0

Risk Ratio = 1.5: the risk of having secondary effects with D1 (the “placebo” row)
is 1.5 times that of D2 (the “treatment” row).

Odds Ratio = 3: not using sun protection (the “placebo” row) triples the odds of
developing skin melanoma with respect to using protection (the “treatment” row).
161
Survival Analysis: Kaplan-Meier
• Understanding survival analysis (over a study period that begins with a starting event):
  Patient Type 1: dies before the study ends
  Patient Type 2: survives the study period (censored)
  Patient Type 3: withdraws from the study (censored)
  Patient Type 4: enters in the middle of the study
• Use Kaplan-Meier to:
  – Measure the fraction of patients living for a certain amount of time after treatment.
  – Calculate the probability of one patient surviving a certain amount of time after treatment.
  – Determine the probability of one patient being discharged after an amount of time after diagnosis.
  – Etc.
• Elements of Survival Analysis:
  1. Kaplan-Meier Survival table: what is the probability of surviving after a certain time?
  2. Kaplan-Meier Survival curve: what’s the survival evolution of one group? And the median survival time?
  3. Log-Rank test: can we compare the survival of two (or more) groups (treatments)? What’s the p-value?
  4. Hazards Ratio: can we quantify the survival difference between two (or more) groups? Can we provide
     a confidence interval?
  5. Proportional Hazards or Cox’s Hazards regression: to analyze the contribution of different variables to
     the prediction of survival.

Edward Lynn Kaplan (1920 – 2006)    Paul Meier (1924 – 2011)
162
Kaplan-Meier Survival Table and Curve
1. Study description (patients)

2. Kaplan-Meier Survival Table

       S(t) = Π_{ti ≤ t} (ni − di) / ni

   S(t) = probability of a patient surviving time t
        = proportion of patients surviving after time t

                  Remained    Withdrawn   Deaths   at Risk               Pr of death   Pr surviving     Cumulative Pr surviving
   i    time ti   ni = ri-1   wi          di       ri = (ni − wi − di)   di/ri         pi = 1 − di/ri   S(ti) = pi · S(ti-1)
   0    0         12                               12                    0.0000        1.0000           1.0000
   1    4         12                      1        11                    0.0909        0.9091           0.9091
   2    5         11                      1        10                    0.1000        0.9000           0.8182
   3    6         10          1                    9                     0.0000        1.0000           0.8182
   4    7         9                       1        8                     0.1250        0.8750           0.7159
   5    8         8                       2        6                     0.3333        0.6667           0.4773
   6    9         6           1           1        4                     0.2500        0.7500           0.3580
   7    11        4                       1        3                     0.3333        0.6667           0.2386
   8    12        3                                3                     0.0000        1.0000           0.2386

3. Kaplan-Meier Survival Curve [step plot of S(t): survival probability (0.0–1.0) vs. time (0–14)]

   The median survival time corresponds to the time whose
   survival probability is 0.5. In this case, 8 (months).
163
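The cumulative survival column of the table can be reproduced with a few lines of Python. This is a sketch using the slide's convention r_i = n_i − w_i − d_i; the event list encodes the (withdrawn, deaths) pairs of rows 1–8:

```python
def km_survival(n0, events):
    # events: list of (w_i, d_i) per event time; the number at risk is
    # r_i = n_i - w_i - d_i and the next row starts with n_{i+1} = r_i
    s, n = 1.0, n0
    out = []
    for w, d in events:
        r = n - w - d
        s *= 1 - d / r           # S(t_i) = p_i * S(t_i-1)
        out.append(round(s, 4))
        n = r
    return out

# (withdrawn, deaths) at times t = 4, 5, 6, 7, 8, 9, 11, 12
events = [(0, 1), (0, 1), (1, 0), (0, 1), (0, 2), (1, 1), (0, 1), (0, 0)]
print(km_survival(12, events))
```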
Survival Analysis with Python
• Not this year
• If interested: check https://medium.com/towards-
artificial-intelligence/survival-analysis-with-python-
tutorial-how-what-when-and-why-19a5cfb3c312

164
Case Study 2: Data Analysis with Python
1. Download the .csv file in a local folder
2. Take a look at the particularities of this data set
3. Make a table with all the attributes
4. In Python, load the file into a pandas DataFrame structure, indicating the names of the columns
5. Declare categorical attributes according to the previous table
6. Calculate the mean cholesterol level of all the patients, but also of males and females separately
7. Can the mean of men be considered equal to the global mean?
8. Can we consider the mean chol value in women, 22.15 mg/dl (=261.75−239.6) higher than for men?
9. Oldpeak measures the ST depression induced by exercise relative to rest. Let us assume that the half-bigger
   depressions correspond to patients before doing exercise, and the half-smaller values to the same patients after
   doing exercise. Could we conclude that doing exercise increases their maximum heart rate (thalach)?
   a) We have to calculate the mean value of oldpeak
   b) We use this mean oldpeak value to split thalach into two groups of values: one before doing exercise and another one after doing
      exercise
   c) We adjust the two sets to have the same number of values
   d) We apply the paired Student’s t-test to reach a conclusion
10. We could be interested in analyzing whether the type of chest pain (cp) is indicative of the sort of heart disease
    (num), or not.
11. Could we consider that the maximum heart rate (thalach) depends on the patient's age, when grouped into young (<45),
    middle (45-65), and old (>65)?
12. Are these age groups and the patient's gender affecting the average serum cholesterol (chol)?
13. In a previous analysis we got that the maximum heart rate (thalach) is different between the patients’ age groups, but
    is it different between the young and the middle-aged? And between the middle-aged and the elders?
14. In our initial analyses we tested cholesterol mean differences, but never checked whether the chol values follow a normal
    distribution
15. We should recalculate whether men and women have different chol mean values (recall that women had +22.15 mg/dl
    more than men, under the normality assumption)
16. Is the cholesterol’s lack of normality affecting the conclusions that age and gender affect the chol mean value?

165
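For steps 4–5, a possible sketch, assuming the UCI `processed.cleveland.data` layout (comma-separated, no header row, `?` marking missing values); the two inline rows are only illustrative samples:

```python
import io
import pandas as pd

# Two illustrative rows in the Cleveland file format (normally read from disk)
csv = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
)
cols = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach",
        "exang", "oldpeak", "slope", "ca", "thal", "num"]
df = pd.read_csv(csv, header=None, names=cols, na_values="?")

# Step 5: declare the categorical attributes
for c in ["sex", "cp", "fbs", "restecg", "exang", "slope", "thal", "num"]:
    df[c] = df[c].astype("category")
```

With the real file, `pd.read_csv("processed.cleveland.data", header=None, names=cols, na_values="?")` would be used instead of the `StringIO` buffer.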
Conclusions

Task                    Normal         Same Variances       Samples   Quantitative   Test
                        (parametric)   (homoscedasticity)
                        Y              Y                    1, 2      Y              Student’s
                        Y              N                    2         Y              Welch’s
Compare means           -              -                    1, 2      N              Chi-square
between groups          Y              Y                    2+        Y              ANOVA
                                                                                     Tukey’s (2 samples)
                        N              -                    2         Y              Wilcoxon-Mann-Whitney
                        N              -                    2+        Y              Kruskal-Wallis
Check normality         ?              -                    -         Y              Shapiro-Wilk
                                                                                     Kolmogorov-Smirnov
Association of one      -              -                    -         -              Odds Ratio
property between
complementary groups
Relative relation of    -              -                    -         -              Risk Ratio
one property
between two groups
Survival analysis       -              -                    -         -              Kaplan-Meier
166
Artificial Intelligence Analysis of Health Care Data

• Sample Tool:
– Rapid Miner
• Data Preprocessing
• Case Study 3: Data Preparation with Rapid Miner
• AI Data Analysis: Unsupervised Machine Learning
• AI Data Analysis: Supervised Machine Learning
• Quality Assessment
• Case Study 4: Data Modelling with Rapid Miner

167
Rapid Miner

• Rationale:
– Accessibility: complete free version for students.
– Capability: RM can analyze large sets of data.
– Simplicity: all the AI-based Data Analysis involved in the
course is easily performed with RM.

168
Data Structure (Cross Sectional Study)
• Data Analysis deals with data organized as a matrix
  (set) where
  – Each column represents a (clinical) variable
  – Each row describes an instance (e.g., patient) in terms of
    the variables in the columns

    Matrix    Clinics     Analysis
    Row       Patient     Instance
    Column    Variable    Feature

169
Example Data Set
• Heart Disease Data Set
• https://archive.ics.uci.edu/ml/datasets/Heart+Disease
• N = 303
• Features: 14
# Name Explanation Units Type Missing
1 Age Age of the patient Years Numeric
2 Sex Gender of the patient 1=male, 0=female Binary
3 CPT Chest Pain Type 1=typical angina, 2=atypical angina, 3=non-angina pain, Categorical
4=asymptomatic
4 BP Resting Blood Pressure on admission to hospital mmHg Numeric
5 Chol Serum Cholesterol mg/dl Numeric
6 FBS Fasting Blood Sugar > 120 mg/dl 1=true, 0=false Binary
7 ECG Resting Electrocardiographic Results 0=normal, 1=having ST-T wave abnormality, 2=showing Categorical
probable or definite LV hypertrophy
8 maxHR Maximum Heart Rate Achieved Numeric
9 ExAng Exercise Induced Angina 1=yes, 0=no Binary
10 OldPeak ST depression induced by exercise relative to rest Numeric
11 Slope slope of the peak exercise ST segment 1=upsloping, 2=flat, 3=downsloping Categorical
12 Ca Number of major vessels colored by fluoroscopy 0, 1, 2, 3 Numerical 4
13 Thal 3=normal, 6=fixed defect, 7=reversable defect Categorical 2

14 Diag diagnosis of heart disease 0=<50% diameter narrowing, 1=>50% diameter narrowing Categorical
170
Data (Descriptive) Statistics
• Binary data: proportion
• Quantitative data: average, min, max
• Qualitative data: histogram

171
Data Visualizations

[screenshots: value statistics, plots, box-plots, pie charts … and many more!]

172
Data Pre-Processing
(for Intelligent Data Analysis)

Data preprocessing is used to transform the raw data into a useful and efficient format
for data analysis.
• Data Cleaning: raw data can have irrelevant or missing parts, among other issues.
  Data cleaning deals with these issues.
• Data Transformation: data can come in suboptimal or wrong forms. Data
  transformation converts data into forms that are suitable for mining.
• Data Reduction: on the one hand, data mining results improve as more and more
  data are available; on the other hand, the mining processes can be time-consuming.
  Data reduction applies trade-off solutions to minimize data
  size without compromising quality too much.
• Data Balancing: in the data, one subset can dominate over the other subsets,
  causing predictions to be biased towards the majority class. Under the name
  data balancing there is a set of methods that increase or reduce the data set towards more
  balanced data.

173
Source: https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
Data Preprocessing Summary Chart

Data Pre-Processing
• Data Cleaning
  – Missing Data: remove instance, remove feature, encoding, imputation (replace),
    imputation (predict), NA/NaN
  – Noisy Data: binning, clustering
• Data Transformation
  – Normalization: statistical norm., specific range norm., proportional norm.,
    interquartile range norm.
  – Attr. Selection: remove useless attrs., transform attrs., combine attrs.
  – Discretization: by size, by binning, by frequency, by user specification, by entropy
  – (Concept Hierarchy) Generalization
• Data Reduction: aggregation, attr. subset selection, cardinality reduction,
  dimensionality reduction
• Data Balancing: collect more data, penalized models, new models & algorithms,
  resample, change performance metric
174
Data Cleaning: Missing Data
Missing data: When some data is detected not available in the matrix.
• Six alternative solutions are proposed:
– Remove instance: get rid of the instance that contains the missing value.
– Remove feature: get rid of the column that contains many missing values.
– Encoding: indicate missing values with a special symbol (e.g., 99999).
– Imputation (Replace): use the mean/median/mode or most frequent
categorical value.
– Imputation (Predict): apply a regression function to calculate the value.
– NA: leave it as not available value and only use data analysis algorithms that
allow missing values.

175
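Two of these solutions (remove instance, impute with the mean) can be sketched with pandas; the values below are only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"chol": [233.0, np.nan, 286.0, 250.0]})

dropped = df.dropna()                               # remove instance
imputed = df.fillna({"chol": df["chol"].mean()})    # imputation (replace) with the mean
```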
Missing Data with Rapid Miner
Solution RM Module Explanation
Remove Instance Filter Examples: The Operator returns those examples that match the given condition.
The conditions are defined by the user. Several pre-defined conditions also exist as
advanced options. One of the available conditions is no_missing_attributes that match
only the examples that have no missing values.

Remove Feature Select Attributes: The Operator provides different filter types to make Attribute
selection easy. Possibilities are for example: Direct selection of Attributes. Selection by
a regular expression or selecting only Attributes without missing values.

Encode Replace Missing Values: Missing values can be replaced by the minimum, maximum or
average value of that Attribute. Zero can also be used to replace missing values. Any
replenishment value can also be specified as a replacement of missing values.

Imputation Replace Missing Values with default field set to average, or value with replenishment
value the pre-calculated mean, median, or mode.
(Replace)
Imputation Impute Missing Values: This is a nested operator i.e. it has a subprocess. This
subprocess should always accept an ExampleSet and return a model. The Impute
(Predict) Missing Values operator estimates values for missing values by learning models for
each attribute (except the label) and applying those models to the ExampleSet. The
learner for estimating missing values should be placed in the subprocess of this
operator.

NA Declare Missing Value: The Declare Missing Value operator replaces the specified
values of the selected attributes by Double.NaN, thus these values will become missing
values. These values will be treated as missing values by the subsequent operators.
The desired values (to be converted) can be selected through nominal, numeric or
regular expression mode.

176
Data Cleaning: Noisy Data
Noisy data: when some data is detected to be wrong (meaningless) in the
matrix. It can appear due to faulty data collection, data entry errors, etc.
Noisy data can affect the class label (label noise) –this occurs when an
example is incorrectly labelled, causing either contradictory examples or
instance misclassification– or regular attributes (attribute noise) –this occurs
when a regular attribute has an erroneous value. Three actions are possible:
do nothing and use robust ML methods (avoid overfitting), detect & remove
noise, or detect & correct noise.
• The following approaches to noise cleaning are proposed:
– Data Binning: The original data values are used to generate small intervals (bins), then all
the values in the bin are replaced by the mean value in the bin (detect & correct).
– Clustering: Groups of similar data are made. The outliers fall out of the classes, and they
can be removed (detect & remove).

177
Binning Methods for Data Smoothing
• Smoothing data by equal frequency bins
Example: 15, 21, 27, 8, 16, 9, 21, 24, 30, 26, 30, 34
1. Sort the data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
2. Make N equal frequency bins:
Bin 1: [8, 9, 15, 16]; Bin 2: [21, 21, 24, 26]; Bin 3: [27, 30, 30, 34]
3. Calculate mean within each bin:
Bin 1: 12; Bin 2: 23; Bin 3: 30
4. Replace the values in the bins by the bin mean:
Sorted: 12, 12, 12, 12, 23, 23, 23, 23, 30, 30, 30, 30
Original sequence becomes: 12, 23, 30, 12, 12, 12, 23, 23, 30, 23, 30, 30

• Smoothing data by bin boundaries


Example: 15, 21, 27, 8, 16, 9, 21, 24, 30, 26, 30, 34
1. Sort the data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
2. Make N equal frequency bins:
Bin 1: [8, 9, 15, 16]; Bin 2: [21, 21, 24, 26]; Bin 3: [27, 30, 30, 34]
3. Replace the bin internal values to the closest boundary value:
Bin 1: [8, 9→8, 15→16, 16] = [8, 8, 16, 16]; Bin 2: [21, 21→21, 24→26, 26] = [21, 21, 26, 26]; Bin 3: [27,
30→27, 30→27, 34] = [27, 27, 27, 34]
Original sequence becomes: 16, 21, 27, 8, 16, 8, 21, 26, 27, 26, 27 ,34

178
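The first method (smoothing by equal-frequency bin means) can be sketched in plain Python, rounding bin means to integers as the slide does. It assumes the number of values is divisible by the number of bins:

```python
def equal_frequency_smooth(values, n_bins):
    # Sort positions, split into equal-frequency bins, and replace each
    # value by the (rounded) mean of its bin.
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) // n_bins          # assumes len(values) % n_bins == 0
    smoothed = values[:]
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        mean = round(sum(values[i] for i in idx) / len(idx))
        for i in idx:
            smoothed[i] = mean
    return smoothed

data = [15, 21, 27, 8, 16, 9, 21, 24, 30, 26, 30, 34]
print(equal_frequency_smooth(data, 3))
# reproduces the slide's result: 12, 23, 30, 12, 12, 12, 23, 23, 30, 23, 30, 30
```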
Clustering: Local Outlier Factor (LOF)

Example:
1. 13 points in a 2-dimensional space.
2. Find the 3-nearest neighbors of A:
       {n1, n2, n3}
3. Calculate the 3-density of A:
       dA = 3 / max_{i=1..3} dist(A, ni) = 0.23
4. Calculate the 3-density of the neighbors n1, n2, n3:
       dn1 = 2.1; dn2 = 2.7; dn3 = 2.4
5. Calculate the average of dn1, dn2, and dn3:
       Avg = (2.1 + 2.7 + 2.4) / 3 = 2.4
6. Calculate LOF = Avg / dA:
       LOF = 2.4 / 0.23 = 10.43
7. If LOF > 1, A is an outlier.
179
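A simplified version of this computation in Python. This is only a sketch of the idea above, with k-density defined as k divided by the distance to the k-th nearest neighbour; a full LOF implementation uses reachability distances instead:

```python
import math

def k_density(p, points, k):
    # k / distance to the k-th nearest neighbour of p
    dists = sorted(math.dist(p, q) for q in points if q != p)
    return k / dists[k - 1]

def lof(p, points, k):
    # ratio of the neighbours' average k-density to p's own k-density
    dists = sorted((math.dist(p, q), q) for q in points if q != p)
    neighbours = [q for _, q in dists[:k]]
    own = k_density(p, points, k)
    avg = sum(k_density(q, points, k) for q in neighbours) / k
    return avg / own

# A tight cluster plus one far-away point: the far point gets LOF >> 1
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
```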
Noisy Data with Rapid Miner

Solution → RM Module: Explanation

• Binning
– Discretize by Binning: This operator discretizes the selected numerical attributes to nominal attributes. The number of bins parameter is used to specify the required number of bins. This discretization is performed by simple binning. The range of numerical values is partitioned into segments of equal size. Each segment represents a bin. Numerical values are assigned to the bin representing the segment covering the numerical value.
– Generate Aggregation: This operator generates a new attribute which consists of a function of several other attributes. These 'other' attributes can be selected by the attribute filter type parameter and other associated parameters. The aggregation function is selected through the aggregation function parameter. Several aggregation functions are available, e.g., count, minimum, maximum, average, mode, etc. The attribute name parameter specifies the name of the new attribute.

• Clustering
– Detect Outlier: Rapid Miner implements several algorithms to detect outliers (based on distances, densities, LOF –local outlier factors– and COF –class outlier factors–)

• NA
– Generate Attribute: This operator constructs new attributes from the attributes of the input data set and arbitrary constants using mathematical expressions. It can be used to change the values of the attributes that satisfy a certain condition.

180
Data Transformation: Normalization
Normalization: Normalization is used to scale values so they fit in a specific range
(e.g., [-1.0, +1.0] or [0.0, 1.0]). Adjusting the value range is very important when
dealing with attributes of different units and scales. For example, when using the
Euclidean distance all attributes should have the same scale for a fair comparison.
Normalization is useful to compare attributes that vary in size.

• Four common normalizations are:


– Statistical normalization: subtract the mean of the data from all values and then divide them by the
standard deviation. Afterwards, the distribution of the data has a mean of zero and a variance of
one. This is a common and very useful normalization technique. It preserves the original distribution
of the data and is less influenced by outliers.
– Specific Range normalization: it normalizes all attribute values to a specified value range [min, max]. The largest value is set to max, the smallest value to min, and the other values are scaled linearly in between. It preserves the shape of the original distribution of the data but can be influenced by outliers.
– Proportional normalization: each attribute value is divided by the total sum of all the values of the
attribute.
– Interquartile range normalization: Normalization is performed using the interquartile range. The
interquartile range (IQR) is the difference between Q3 and Q1. The final formula for the interquartile
range normalization is then: (value - median) / IQR. It is less influenced by outliers.
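The four normalizations can be sketched with Python's statistics module (illustrative data; the "inclusive" quantile method is chosen because it interpolates quartiles the same way as the worked example on the next slide):

```python
import statistics

def z_normalize(xs):
    """Statistical normalization: zero mean, unit (sample) standard deviation."""
    m, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / sd for x in xs]

def range_normalize(xs, a=0.0, b=1.0):
    """Specific range normalization into [a, b]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (b - a) + a for x in xs]

def proportional_normalize(xs):
    """Each value divided by the total sum of the attribute."""
    s = sum(xs)
    return [x / s for x in xs]

def iqr_normalize(xs):
    """(value - median) / IQR."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]

data = [66.88, 30.03, 79.58, 23.80, 45.43, 24.83, 79.03, 51.17]
print([round(v, 4) for v in z_normalize(data)])
print([round(v, 2) for v in range_normalize(data, 0, 100)])
```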

181
Normalization Examples (1)

Statistical Normalization

Data: 66.88, 30.03, 79.58, 23.80, 45.43, 24.83, 79.03, 51.17
1. Calculate the mean: m = 50.09
2. Calculate the standard deviation: st.dev = 23.13
3. Apply the normalization to all the values: n = (x − m) / st.dev
Normalized data: 0.7258, -0.8676, 1.2751, -1.1370, -0.2015, -1.0925, 1.2513, 0.0465

Specific Range Normalization

Data: 66.88, 30.03, 79.58, 23.80, 45.43, 24.83, 79.03, 51.17
1. Propose a range (e.g., a percentage): [a, b] = [0, 100]
2. Find the min value: min = 23.80
3. Find the max value: max = 79.58
4. Apply the normalization to all the values: n = (x − min) / (max − min) · (b − a) + a
Normalized data: 77.23, 11.17, 100.00, 0.00, 38.78, 1.85, 99.01, 49.07

182
Normalization Examples (2)

Proportional Normalization

Data: 66.88, 30.03, 79.58, 23.80, 45.43, 24.83, 79.03, 51.17
1. Calculate the total sum: S = 400.75
2. Apply the normalization to all the values: n = x / S
Normalized data: 0.1669, 0.0749, 0.1986, 0.0594, 0.1134, 0.0620, 0.1972, 0.1277

Interquartile Range Normalization

Data: 66.88, 30.03, 79.58, 23.80, 45.43, 24.83, 79.03, 51.17
1. Calculate quartiles Q1 and Q3: Q1 = 28.73, Q3 = 69.92
2. Calculate the interquartile distance: d = Q3 − Q1 = 41.19
3. Calculate the median (quartile Q2): median = 48.30
4. Apply the normalization to all the values: n = (x − median) / d
Normalized data: 0.4510, -0.4436, 0.7594, -0.5949, -0.0696, -0.5699, 0.7461, 0.0696

183
Normalization with Rapid Miner

Solution → RM Module

• Statistical Normalization → Normalize, with method = Z-transformation
• Specific Range Normalization → Normalize, with method = range transformation
• Proportional Normalization → Normalize, with method = proportion transformation
• Interquartile Range Normalization → Normalize, with method = interquartile range
• Denormalization → De-Normalize: in order to show the results in the normal range of values of the attributes, Rapid Miner provides this de-normalization operator.

184
Data Transformation: Attribute Selection
Attribute Selection: the process of choosing a subset of representative attributes and getting rid of useless ones. Extending the idea, we can also transform attributes, or combine several attributes to form a new replacing one.
• Types of attribute selection:
– Remove useless attributes: Some attributes can become useless because of
the existence of other attributes and they can be removed. Ex., attributes with
a predominant value or with a low standard deviation (low variability), or
highly correlated attributes.
– Transform attributes: Some attributes may require a conversion of their
values. For example, change date of birth to years, change temperature in
Fahrenheit to Celsius, etc.
– Combine attributes: Some irrelevant attributes may be combined to form
other relevant attributes. Ex., patient’s height and weight can be combined to
describe the patient’s body mass index (BMI).
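Outside Rapid Miner, the three kinds of attribute selection can be sketched in plain Python. The patient records, attribute names, and thresholds below are made up for illustration:

```python
import statistics

patients = [
    {"sex": "F", "height_m": 1.62, "weight_kg": 58.0, "temp_f": 98.6},
    {"sex": "F", "height_m": 1.75, "weight_kg": 80.0, "temp_f": 100.4},
    {"sex": "F", "height_m": 1.80, "weight_kg": 95.0, "temp_f": 99.1},
]

def remove_useless(rows, min_stdev=0.01):
    """Remove useless attributes: numeric ones with low variability and
    nominal ones with a single (predominant) value."""
    keep = []
    for attr in rows[0]:
        vals = [r[attr] for r in rows]
        if all(isinstance(v, (int, float)) for v in vals):
            if statistics.stdev(vals) > min_stdev:
                keep.append(attr)
        elif len(set(vals)) > 1:
            keep.append(attr)
    return [{a: r[a] for a in keep} for r in rows]

def fahrenheit_to_celsius(rows):
    """Transform an attribute: convert temp_f to temp_c."""
    return [{**{k: v for k, v in r.items() if k != "temp_f"},
             "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)} for r in rows]

def add_bmi(rows):
    """Combine attributes: weight and height become a BMI attribute."""
    return [{**r, "bmi": round(r["weight_kg"] / r["height_m"] ** 2, 1)} for r in rows]

print(remove_useless(patients)[0])      # "sex" is constant, so it is dropped
print(add_bmi(patients)[0]["bmi"])      # 58 / 1.62^2 ≈ 22.1
```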

185
Attribute Selection with Rapid Miner
Solution → RM Module: Explanation

• Remove Useless Attributes
– Remove Useless Attributes: removes four kinds of useless attributes: (1) Nominal attributes where the most frequent value is contained in more than the specified ratio of all examples. This is used for removing nominal attributes in which one value dominates all other values. (2) Nominal attributes where the most frequent value is contained in less than the specified ratio of all examples. This is used for removing nominal attributes with too many possible values. (3) Numerical attributes where the standard deviation is less than or equal to a given threshold. This is used for removing attributes with low variability of their values. (4) Nominal attributes where the value of all examples is unique. This property is used to remove id-like attributes.
– Remove Correlated Attributes: removes attributes which are correlated above/below a given threshold.

• Transform Attributes
– Map: maps specified values of selected attributes to new values according to a conversion table.

• Combine Attributes
– Generate Attributes: constructs new attributes from the attributes of the example set and arbitrary constants using mathematical expressions.

186
Data Transformation: Discretization
Discretization: Discretization is the process by which a continuous
valued attribute is transformed into a discrete (categorical) attribute.
• Five alternative types of discretization are considered:
– By Size: converts the selected numerical attribute into a nominal attribute by discretizing
the numerical attribute into bins of user-specified size. All the values of the attribute
(including all the repetitions) are ordered and grouped into bins of size N (predefined
values). Each bin defines the new discretized value of the attribute. All bins contain the
“same” number of examples. Notice that equal values are kept in the same bin.
– By Binning: discretizes the selected numerical attributes into a user-specified number of
bins N representing intervals of equal size. The number of the values in bins may vary.
– By Frequency: discretizes the selected numerical attributes into a user-specified number
of bins N. Bins of equal number of the values are made.
– By User Specification: discretizes the selected numerical attributes to nominal attributes.
The numerical values are mapped to the classes according to the thresholds specified by
the user in the classes parameter. Classes are specified by their upper limit.
– By Entropy: converts the selected numerical attributes into nominal attributes. The
boundaries of the bins are chosen so that the entropy is minimized in the induced
partitions.
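Two of these strategies, discretization by binning (equal-width intervals) and by frequency (equal-count bins), can be sketched as follows (example data and bin labels are illustrative):

```python
def discretize_by_binning(values, n_bins):
    """Equal-width intervals; the count per bin may vary."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        labels.append(f"bin{idx + 1}")
    return labels

def discretize_by_frequency(values, n_bins):
    """Equal-frequency bins; the interval widths may vary."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    size = len(values) // n_bins
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = f"bin{min(rank // size, n_bins - 1) + 1}"
    return labels

data = [8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34]
print(discretize_by_binning(data, 3))    # intervals [8, 16.67), [16.67, 25.33), [25.33, 34]
print(discretize_by_frequency(data, 3))  # four values per bin
```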

187
Discretization with Rapid Miner
Solution RM Module Explanation
By Size Discretize by Size: converts the selected numerical attributes
into nominal attributes by discretizing the numerical attribute
into bins of user-specified size. Thus each bin contains a user-
defined number of examples.
By Binning Discretize by Binning: discretizes the selected numerical
attributes into user-specified number of bins. Bins of equal range
are automatically generated, the number of the values in
different bins may vary.
By Frequency Discretize by Frequency: converts the selected numerical
attributes into nominal attributes by discretizing the numerical
attribute into a user-specified number of bins. Bins of equal
frequency are automatically generated, the range of different
bins may vary.
By User Discretize by User Specification: discretizes the selected
numerical attributes into user-specified classes. The selected
Specification numerical attributes will be changed to nominal attributes.
By Entropy Discretize by Entropy: converts the selected numerical attributes
into nominal attributes. The boundaries of the bins are chosen so
that the entropy is minimized in the induced partitions.

188
Data Transformation: Concept Hierarchy Generalization
Concept Hierarchy Generation: If we have a hierarchy for the
values an attribute can take, then this method allows replacing
the values in the attribute by other values which are in higher
positions of the hierarchy (generalization).
• For numerical values, these can be generalized to ranges. For
example, numerical age to infant (<4), child (4-12), young (13-
18), adult (19-60), and elder (>60).
• For categorical values, if there is a hierarchy of such concepts,
they can be generalized to upper levels in the hierarchy. For
example, Chlorhexidine (ATC code R02AA05) can be
generalized to its group Antiseptic (ATC code R02AA).
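Both generalizations amount to simple lookups, sketched below. The age cut-offs follow the slide; the ATC prefix lengths used are the standard 1/3/4/5/7-character levels of the ATC classification:

```python
def generalize_age(age):
    """Generalize a numerical age to the slide's age ranges."""
    if age < 4:
        return "infant"
    elif age <= 12:
        return "child"
    elif age <= 18:
        return "young"
    elif age <= 60:
        return "adult"
    else:
        return "elder"

def generalize_atc(code, level):
    """ATC codes are hierarchical by prefix, so generalizing a code to a
    higher level is prefix truncation (levels are 1, 3, 4, 5, 7 chars long)."""
    lengths = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}
    return code[:lengths[level]]

print(generalize_age(30))              # adult
print(generalize_atc("R02AA05", 4))    # R02AA, the Antiseptics group
```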

189
Concept Hierarchy Generalization with
Rapid Miner

Solution RM Module Explanation


• Numerical Generalization → Discretize by User Specification: This operator discretizes the selected numerical attributes to nominal attributes. The numerical values are mapped to the classes according to the thresholds specified by the user in the classes parameter. The user can define the classes by specifying the upper limit of each class. The lower limit of every class is automatically defined as the upper limit of the previous class. The lower limit of the first class is assumed to be negative infinity.

• Categorical Generalization → NA

190
Data Reduction
Data Reduction: Some data mining technologies are slow when handling huge amounts of
data. In order to cope with this, we can use a data reduction technique.
• Some common data reduction techniques are:
– Aggregation: process by which information is summarized. For example, if you have the
weights of all patients in a database, you could consider to reduce the size of the data by
grouping all the patients with the same primary disease and using the mean weight of
the patients with one same diagnosis to represent the group’s weight.
– Attribute Subset Selection: some attributes are more relevant than others with respect
to the classification label. There are multiple alternatives to calculate the relevance
(weight) of attributes: information gain, information gain ratio, deviation, correlation,
Chi-square, Gini index, Tree (by random forest), Relief, Support Vector Machine (SVM),
Principal Component Analysis (PCA), etc.
– Cardinality reduction: The number of examples (rows) in the database can be reduced if
they do not satisfy one or several conditions. For example, remove examples with
missing data, or outliers.
– Dimensionality reduction: not covered in this course.

191
Data Reduction with Rapid Miner
Solution → RM Module: Explanation

• Aggregation
– Aggregate: performs the aggregation functions known from SQL. This operator provides a lot of functionalities in the same format as provided by the SQL aggregation functions. SQL aggregation functions and GROUP BY and HAVING clauses can be imitated using this operator. Aggregation functions include SUM, COUNT, MIN, MAX, AVERAGE and many others.

• Attribute Subset Selection
– Weight by …: a group of modules to calculate the weight of the attributes by different approximations: information gain, info gain ratio, rule, average, deviation, correlation, chi-square, GINI, etc.
– Select By Weight: selects only those attributes of an input data set whose weights satisfy the specified criterion with respect to the input weights. Several criteria are possible: greater, less, top_k, bottom_k, all_but_bottom_k, top_p%, etc.

• Cardinality Reduction
– Filter Examples: returns those examples that match the given condition. The conditions are defined by the user, and each condition consists of an attribute, a comparison function and a value to match. Several pre-defined conditions also exist as advanced options.
– Sample: selects a subset of examples of a data set. The samples are selected randomly. The number of examples in the sample can be specified on absolute, relative or probability basis (sample parameter). The class distribution of the sample can be controlled by the balance data parameter.

• Dimensionality Reduction → NA

192
Data Balancing
Data Balancing: Data imbalance takes place when the majority classes dominate (i.e., have
many more examples) over the minority classes in a supervised matrix. Most ML
algorithms are designed to improve accuracy by reducing the error, so they do not take
the balance of classes into account. In such cases, the predictive model developed using
conventional machine learning algorithms could be biased and inaccurate.
• Five alternative solutions are proposed:
– Collect more data: a larger data set might turn an unbalanced data set into a balanced one.
– Penalized models: use penalizing ML methods (e.g., cost matrix) to increase the cost of
classification mistakes on the minority class.
– New models and algorithms: Imbalanced data can be solved using an appropriate model. For
example, XGBoost model internally takes care that the bags it trains on are not imbalanced.
– Resample:
• Over-sampling: When the quantity of data is insufficient, it tries to balance by incrementing the size of rare samples.
• Under-sampling: Reduce the size of the class which is dominant among the other classes.
IMPORTANT!!!!: USE RESAMPLING WITH THE UNBALANCED CLASS AND NOT WITH THE WHOLE DATASET.
– Change the performance metric: It is important to choose the evaluation metric of the model
correctly or one would end up optimizing a useless parameter. One should try to change the
performance metric while solving the problem of imbalanced data.
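Random over- and under-sampling can be sketched in a few lines; note that, as the slide stresses, only the class in question is resampled, never the whole data set. Labels and sizes below are made up:

```python
import random

def oversample(examples, labels, minority, target_size, seed=0):
    """Over-sampling: draw extra minority examples WITH replacement
    until that class reaches target_size."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    extra = [rng.choice(minority_idx) for _ in range(target_size - len(minority_idx))]
    keep = list(range(len(labels))) + extra
    return [examples[i] for i in keep], [labels[i] for i in keep]

def undersample(examples, labels, majority, target_size, seed=0):
    """Under-sampling: keep only target_size majority examples,
    drawn WITHOUT replacement; other classes are untouched."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    kept_major = rng.sample(majority_idx, target_size)
    keep = sorted(set(range(len(labels))) - set(majority_idx) | set(kept_major))
    return [examples[i] for i in keep], [labels[i] for i in keep]

X = list(range(10))
y = ["neg"] * 8 + ["pos"] * 2           # 8 vs 2: imbalanced
_, yo = oversample(X, y, "pos", 8)
print(yo.count("pos"), yo.count("neg"))  # 8 8
```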

193
Data Balancing with Rapid Miner
Solution → RM Module: Explanation

• Collect More Data
– Append: This operator builds a merged example set from two or more compatible example sets by adding all examples together. All the example sets in the input must have the same attribute signature.

• Penalization with Cost Matrix
– MetaCost: The MetaCost operator makes its base classifier cost-sensitive by using the cost matrix specified in the cost matrix parameter. The MetaCost operator is a nested operator; i.e., it has a subprocess. The subprocess must have a learner; i.e., an operator that expects a data set and generates a model. This operator tries to build a better model using the learner provided in its subprocess.

• Over-sampling
– Sample (Bootstrapping): This operator uses sampling with replacement. At every step all examples have equal probability of being selected. Once an example has been selected for the sample, it remains a candidate for selection and it can be selected again in any other coming steps. Thus, the final sample can have the same example multiple times. This operator can be used to generate a sample that is greater in size than the original data set.

• Under-sampling
– Sample: This operator creates a sample from an ExampleSet by selecting examples randomly. The size of the sample can be specified on absolute, relative and probability basis. The class distribution of the sample can be controlled by the balance data parameter.

• Change Performance Metric → NA

194
Case Study 3: Data Preparation with Rapid Miner
1. Download the .csv file in a local folder
2. Import the data from the csv file and save it as a RM file
3. Check that data is correctly imported
4. Spend some time with the statistical description of the
different features
5. Visual analysis of data
6. Perform some data preprocessing actions
• Missing Values
• Noisy Data
• Normalization
• Attribute Selection
• Discretization
• Data Reduction
• Data Balance
7. The data set is ready for data analysis

195
AI Data Analysis: Machine Learning
• Machine Learning: subarea of artificial intelligence that
uses computer algorithms to build mathematical models
based on sample data, known as "training data", in order
to make predictions or decisions without being explicitly
programmed to do that.
• Types of ML:
– Unsupervised ML: the training data is not labelled; in this sense,
nobody supervised the data to say which data correspond to
which decision group. For example, data on breast cancer
patients is captured, but no information is given on survival.
– Supervised ML: the training data is composed of examples of
decision groups, each example annotated with the sort of
decision taken. Annotating data requires somebody to supervise
the data, providing them with a meaning. For example, data on
breast cancer patients providing a column on whether the
patient survived or not.

196
AI Data Analysis: Unsupervised ML
• Some of the best-known unsupervised ML algorithms:
– k-means clustering
– Hierarchical Clustering
• Agglomerative


197
k-means

K-means looks for a predefined


number (k) of clusters in a
dataset. Data in the same cluster
are similar.

Concept:
1. Select k data at random
2. Make one cluster with each data
as centroid
3. Assign the rest of the data to the
cluster with a closer centroid.
4. Recalculate centroids within each
cluster
5. Repeat steps 3-4 N times or until
the centroids stabilize

Attributes: qualitative and quantitative.


Missing values: accepted 198
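The five steps above can be sketched as a minimal 2-D k-means with Euclidean distance (the points are made up; real implementations also handle convergence tests and multiple restarts):

```python
import random

def kmeans(points, k, iters=20, seed=1):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # steps 1-2: k random data as centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):                       # step 5: repeat N times
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 3: assign to closest centroid
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        for j, cl in enumerate(clusters):        # step 4: recalculate centroids
            if cl:
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = kmeans(pts, 2)
print([len(c) for c in clusters])    # the two natural groups of three points
```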
Hierarchical Clustering
Hierarchical clustering (or Connectivity based
clustering): algorithm that builds a hierarchy of
clusters based on the idea that related objects are
nearby while unrelated objects are farther away.
- Agglomerative: strategy by which nearby objects are
combined in superclusters.
- Divisive: strategy by which distant objects are split
into sub-clusters.

Concept (Agglomerative):
1. Calculate the pairwise distance between objects
2. Make singletons (clusters of one object)
3. Combine closer clusters into superclusters applying a linkage operation
4. Repeat step 3 until a single supercluster remains
(Figure: a dendrogram; a horizontal cut of the dendrogram selects the final clusters)
Linkage operations: the distance between 2 clusters is …
– Single linkage: … the distance between the closest objects
– Complete-linkage: …the distance btwn the farthest objects.
– Average-linkage: … the average distance between all pairs.
Attributes: qualitative and quantitative. – Centroid-linkage: … the distance between centroids.
199
Missing values: accepted
AI Data Analysis: Supervised ML
• Some of the best-known supervised ML algorithms:
– k-Nearest Neighbor
– Logistic Regression
– Decision trees
– Naïve Bayes Classifier
– Classification Rules
– Neural Networks
– Support Vector Machines
– Discriminant Analysis
– Etc.

• Ensembles

200
K-Nearest Neighbor (k-NN)
K-NN: algorithm that stores all
available cases and classifies new
cases based on a similarity measure
(e.g., distance function) by a majority
vote of its k closest neighbors, with
the case assigned to the most
common class.

Concept:
1. Let S be the training set
2. Let O be a new object
3. Get N, the set of the k closest objects to O in S
4. Let C be the most frequent class in N
5. Return “O classifies in C”
(Figure: the predicted class can depend on k; e.g., 3-NN and 5-NN may yield different majorities)

Attributes: qualitative and quantitative.


Class: qualitative. 201
Missing values: accepted
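The whole concept fits in a few lines (Euclidean distance on numeric attributes; the toy training set and labels are made up):

```python
import math
from collections import Counter

def knn_classify(train, new_obj, k):
    """train: list of (features, label) pairs.
    Return the majority label among the k nearest neighbours."""
    neighbours = sorted(train, key=lambda t: math.dist(t[0], new_obj))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

train = [((0, 0), "healthy"), ((0, 1), "healthy"), ((1, 0), "healthy"),
         ((9, 9), "sick"), ((9, 10), "sick"), ((10, 9), "sick")]
print(knn_classify(train, (8, 8), 3))   # sick
print(knn_classify(train, (1, 1), 3))   # healthy
```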
Logistic Regression
Logistic regression: algorithm to describe
data and to explain the relationship
between one dependent binary variable
and one or more qualitative or quantitative
independent variables.

Concept: since the dependent variable is
binary, only two contrary values (1 and 0)
are possible, with p the probability of 1. A
linear regression is calculated for the log-
odds:

log_b(p / (1 − p)) = β0 + Σ(i=1..n) βi·xi

(n is the number of independent variables, b
the base of the logarithm). Betas are
calculated with the training set data, and p
for a new instance (x1, …, xn) with:

p = 1 / (1 + b^−(β0 + Σ(i=1..n) βi·xi))

For p > 0.5, the probability of 1 is higher than
the probability of 0.
Attributes: qualitative and quantitative.
Class: qualitative binary. 202
Missing values: accepted
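Applying a fitted model to a new instance is just the formula above. The beta values below are invented for illustration (not fitted to any data), and the natural base b = e is assumed:

```python
import math

def logistic_probability(betas, xs):
    """betas[0] is the intercept β0; p = 1 / (1 + e^-(β0 + Σ βi·xi))."""
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return 1 / (1 + math.exp(-z))

# Hypothetical model: intercept -4.0, coefficients for age and smoker status
p = logistic_probability([-4.0, 0.05, 1.2], [70, 1])
print(p, p > 0.5)   # classify as 1 when p > 0.5
```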
Decision Trees (DT)
Decision Tree: predictive model expressed as
tree structure that, for each intermediate
node, a feature is used to partition the local
space into subspaces. Terminal nodes or
leaves of the tree are labelled with one class
or its probability.

Concept: Several algorithms exist to construct


decision trees from training sets, for example,
choosing recursively the feature (and
partition) that minimizes the entropy of the
domain space.

1. Let S be the training set
2. If “all” objects in S are of class C, return a leaf of class C.
3. Otherwise, choose feature F to partition S into {S1, …, Sk} by respective values V1, …, Vk.
4. Make an intermediate node with feature F and k branches, with the i-th branch labelled with value Vi and connected to the tree returned by this same algorithm when applied to Si (i = 1, …, k).

(Figure: a decision tree whose leaves contain the prediction of survival and the percentage of elements in the leaf)

Attributes: qualitative and quantitative.
Class: qualitative (and quantitative).
203
Missing values: accepted
Naïve Bayes Classifier
Naïve Bayes Classifier: algorithm that uses the
Bayes theorem
p(x|y) = p(x) · p(y|x) / p(y)

combined with the “naïve” conditional independence assumption between all variables given the class:

p(xi | Cj, xj) = p(xi | Cj)

Under such circumstances, the probability of an object (x1, …, xn) to be in class Cj can be expressed as

p(Cj | x1, …, xn) = p(Cj) · Π(i=1..n) p(xi | Cj) / p(x1, …, xn)

where all the probabilities can be calculated from the data in the training set.

(Figure: a table with one row per class C1, …, Cm and one column per variable-value pair x1, …, xn, holding the conditional probabilities p(xi | Cj) and the class priors p(Cj))

Attributes: qualitative and quantitative.
Class: qualitative. 204
Missing values: accepted
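Estimating those probabilities from a training set can be sketched for nominal attributes as follows (toy symptom data, no Laplace smoothing; real implementations smooth zero counts):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors p(Cj) and conditional counts for p(xi | Cj)."""
    n = len(labels)
    priors = {c: k / n for c, k in Counter(labels).items()}
    cond = defaultdict(Counter)          # (attribute index, class) -> value counts
    for xs, y in zip(rows, labels):
        for i, v in enumerate(xs):
            cond[(i, y)][v] += 1
    return priors, cond, Counter(labels)

def classify_nb(model, xs):
    """Pick the class maximizing p(Cj) · Π p(xi | Cj) (the denominator is constant)."""
    priors, cond, class_counts = model
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(xs):
            score *= cond[(i, c)][v] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get)

rows = [("fever", "cough"), ("fever", "no_cough"),
        ("no_fever", "no_cough"), ("no_fever", "no_cough")]
labels = ["flu", "flu", "healthy", "healthy"]
model = train_nb(rows, labels)
print(classify_nb(model, ("fever", "cough")))   # flu
```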
Classification Rules
Classification Rule: if-then expression in
which the premise is a condition on a
subset of features and the conclusion is
one of the classes in the training set.

Concept: Several algorithms exist to


construct sets of classification rules from
training sets.
1. Let S be the training set
2. For each class Ci in S,
2.1. Make the empty rule true → Ci (a rule with an empty premise)
2.2. Copy S in Si
2.3. If there are objects in Si not in class Ci,
2.3.1. Choose the “best” feature restriction R that
rejects Si objects not in Ci and accepts Si objects
in Ci.
2.3.2. Add R to the premise of the rule.
2.3.3. Remove all objects in Si not in Ci that R rejects
2.3.4. Repeat 2.3.* until “all” objects in Si are of class Ci
2.4. Add the rule to the rule set
2.5. Remove Si objects from S
2.6. Repeat 2.* until S does not contain objects of class Ci.
2.7. Restore S to the original set

Attributes: qualitative and quantitative.


Class: qualitative. 205
Missing values: accepted
Artificial Neural Networks (ANN)
Artificial Neural Networks: artificial system
simulating neural connections. Neurons
receive signals from their input connections
that can activate them. Activated neurons
propagate signals in their output connections.
The weights of the connections are adjusted
with a backward propagation method based
on the data in the training set. If an object
presented at the input layer generates the
correct class at the output layer, this is a
positive reinforcement; otherwise, a negative
reinforcement.

(Figure: an artificial neuron that combines its weighted inputs and fires through an activation function)

Attributes: qualitative and quantitative.


Class: qualitative and quantitative. 206
Missing values: accepted
Support Vector Machine (SVM)
Support Vector Machine: uses a
representation of the examples as points in
space, and maps them to a new space (by the
use of kernels) so that the examples of the
separate categories are divided by a clear gap
that is as wide as possible. New examples are
then mapped into that same new space and
predicted to belong to a category based on
the side of the gap on which they fall.

Attributes: quantitative.
Class: qualitative.
Missing values: accepted 207
Quality of Data Analysis
• Recommender systems can make two types of error:
– Type I Error: The system recommends something that should not be recommended. For
example, a dangerous drug is recommended.
– Type II Error: The system fails to recommend something that should have been
recommended. For example, a needed drug is not recommended.

• Confusion matrix

True Positives: objects of class P that the predictor classifies in class P.


True Negatives: objects in class N that the predictor classifies in class N.

False Positives: objects in class N that are classified in class P. Equivalent to Type I error. 208
False Negatives: objects in class P that are classified in class N. Equivalent to Type II error.
Quality Ratios
• Accuracy: proportion of cases correctly classified. A system has a good accuracy if it detects
positive cases and rejects negative cases. For example, if required drugs are recommended
and unnecessary drugs are not recommended.
• Sensitivity (recall): proportion of positive cases correctly classified. A system has good
sensitivity if it detects many positive cases. For example, if required drugs are recommended.
• Specificity: proportion of negative cases correctly classified. A system has good specificity if it
rejects negative cases. For example, if unnecessary drugs are not recommended.
• Positive Predictive Value (precision): proportion of correct positive classified cases. A system
has good precision if it does not recommend unnecessary things. For example, if among the
recommended drugs, all are necessary.
• Negative Predictive Value: proportion of correct negative classified cases. A system has a
good negative predictive value if what it rejects is indeed unnecessary. For example, if among
the drugs not recommended, none was actually needed.
• Measures are usually given in pairs:
– Sensitivity + Specificity
– Precision + Recall
• In order to have one single quality measure:
– F-score: the harmonic mean of precision and recall, i.e., two times the product of precision and recall divided by their sum.
– AUC (area under the ROC curve): The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR)
as the threshold for a binary classification varies. AUC value computes the area under the ROC curve.

209
Testing: n-fold cross-validation
The data set is split into n folds of equal size. In iteration i (i = 1, 2, …, n), fold i is used as the test set and the remaining n−1 folds as the training set, producing a quality ratio qi. The final quality of the model is the average of q1, q2, …, qn.

210
Quality Ratios and Test with Rapid Miner
• Performance (Binomial Classification): accuracy,
sensitivity, specificity, positive predictive value,
negative predictive value, AUC, etc.
• Performance (Classification): accuracy, normalized
absolute error, root mean square error, etc.

• Cross Validation:

211
A Data Science Project (recall)
(Previously introduced)
A Data Science project is defined as a cycling process
which combines the following steps:
1. Goals and Objectives Setting
2. Data Extraction
3. Data Cleaning
4. Feature Engineering
5. Model Creation and Assessment
6. Impact Analysis

212
Case Study 4: Data Modelling with Rapid Miner
1. Download the .csv file in a local folder
2. Insert the attribute names
3. Import the hepatitis.txt with rapid miner and save it as a RM
file.
4. Get rid of missing values
5. Obtain clusters of similar cases
6. Modeling for prediction 1: k-NN
7. Modeling for prediction 2: Logistic Regression
8. Modeling for prediction 3: Decision Trees
9. Modeling for prediction 4: Naïve Bayes Classifier
10. Modeling for prediction 5: Neural Network
11. Modeling for prediction 6: Support Vector Machine
12. Observe that Support Vector Machine is the best approach
13. Calculate sensitivity and specificity for all the approaches

213
4. KR and KM in Medicine
• What is knowledge?
• Types of knowledge
• Representation of knowledge with logic
• Representation of knowledge with production rules
• Representation of knowledge with objects
• Representation of knowledge with ontologies
• Knowledge management: Knowledge Lifecycle

214
Knowledge in Knowledge Representation
• Concepts
– Data: values without any meaning. They can be operated.
– Information: values with a meaning. They can be interpreted and combined.
– Knowledge: actionable generalization.

• Types of Knowledge
– Declarative (know what) ex. blood contains red cells
– Procedural (know how) ex. hand surgery consists of (1) anesthesia, (2) the
incision, (3) closing the incision, and (4) seeing the results.
(Source: https://www.plasticsurgery.org/reconstructive-procedures/hand-surgery/procedure)

“Coronavirus disease 2019 (COVID-19) is an infectious disease(1) caused by severe acute respiratory syndrome
coronavirus 2 (SARS-CoV-2)(2). […] Common symptoms include fever, cough, fatigue, shortness of breath, and loss
of smell and taste(3). While most people have mild symptoms(4), some people develop acute respiratory distress
syndrome (ARDS) possibly precipitated by cytokine storm, multi-organ failure, septic shock, and blood clots(5).
[…] As is common with infections, there is a delay between the moment a person first becomes infected and the
appearance of the first symptoms(1’). This delay is called the incubation period(6). The median incubation period for
COVID-19 is four to five days. Most symptomatic people experience symptoms within two to seven days after
exposure, and almost all symptomatic people will experience one or more symptoms before day twelve.”
(Source: Wikipedia)
215
Knowledge Representation: Logic
• First Order Logic (FOL)

Declarative Knowledge:
• Jane is pregnant: Pregnant(Jane)
• Male patients cannot be pregnant: ∀x: Male(x) ∧ Patient(x) → ¬Pregnant(x)
• Pregnancy complications can only occur to pregnant patients: ∀x, y: ICD9_Chapter11(x) ∧ has(y, x) →
Pregnant(y)
• Severe headaches, nose bleed, and fatigue are signs of hypertension: ∀x: diagnosed(x, hypertension) →
has(x, severe_headache) ∧ has(x, nosebleed) ∧ has(x, fatigue)

Procedural knowledge:
• After anesthesia the surgeon takes over the operation: ∀x ∀t ∀y ∃t′: anesthetized(x, t) ∧ surgeon(y) → after(t′,
t) ∧ takesOver(y, x, t′) 216
Knowledge Representation: Production Rules
Working Memory

Knowledge Tuple

Assert Knowledge Tuple Retract

Knowledge Tuple

IF knowledge_tuple* THEN (ASSERT/RETRACT knowledge_tuple)*

Declarative Knowledge:
• Positive cases of COVID19 must follow quarantine: IF (COVID19_positive :x) THEN (ASSERT (quarantine :x))
• After overcoming COVID19, the patient has antibodies: IF (has :x COVID19) (overcomes :x COVID) THEN
(RETRACT 1) (ASSERT (has_antibodies :x))

Procedural knowledge:
• ER admission must follow triage: IF (admitted :x ER) THEN (ASSERT (on_triage :x))

217
Knowledge Representation: Objects

Knowledge is attached to objects (frames) through slots; each slot can carry facets such as :DEFAULT (value assumed if none is given), :IF-NEEDED (procedure run when the value is read), and :IF-ADDED (procedure run when a value is written).

Declarative Knowledge:
• The current age of a patient is the difference between the current date and the patient’s birth date: IF-NEEDED
• If a patient is declared to have COVID19, he/she has fever: IF-ADDED
• In general, patients arriving to a hospital do not have COVID19: DEFAULT

Procedural knowledge:
• The COVID-protocol starts with a triage of the patient to decide if he/she needs homecare, general hospitalization or ICU hospitalization: Script

Example frames:

Object Sign
  Slot Name: String

Object Fever
  Slot Subclass: Sign

Object Patient
  Slot Name: String
  Slot Birth: Date
  Slot Age: Integer
    :IF-NEEDED to_integer(− Current_Date Self:Birth)
  Slot Has_COVID: Boolean
    :DEFAULT False
    :IF-ADDED insert((create Fever) :Self:Signs) & insert((create COVID-Treat :Self) :Self:Treatments)
  Slot Signs: list of Sign
  Slot Treatments: list of Treatment

Script COVID-Treat
  Slot Subject: Patient
  Slot Starts: Triage
  Slot Steps: (Triage … (Homecare …) (General_hosp …) (ICU_hosp …))

218
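The three facets can be mimicked with ordinary Python mechanisms. In this sketch (class and slot names taken from the slide, everything else assumed), an attribute default plays the role of :DEFAULT, a computed property plays :IF-NEEDED, and a setter side effect plays :IF-ADDED:

```python
from datetime import date

class Patient:
    def __init__(self, name, birth):
        self.name = name
        self.birth = birth
        self.signs = []
        self.treatments = []
        self._has_covid = False          # :DEFAULT False

    @property
    def age(self):                       # :IF-NEEDED -> computed on demand
        today = date.today()
        return today.year - self.birth.year - (
            (today.month, today.day) < (self.birth.month, self.birth.day))

    @property
    def has_covid(self):
        return self._has_covid

    @has_covid.setter
    def has_covid(self, value):          # :IF-ADDED -> side effects on write
        self._has_covid = value
        if value:
            self.signs.append("Fever")
            self.treatments.append("COVID-Treat")

p = Patient("Jane", date(1990, 5, 1))
p.has_covid = True
print(p.signs, p.treatments)   # ['Fever'] ['COVID-Treat']
```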
Knowledge Representation: Ontologies
An ontology organizes knowledge into classes and instances connected by IS-A (subclass) relationships; properties relate classes to one another through a domain and a range.
Declarative Knowledge:
• COVID19 is an infectious disease: class (Infectious-Disease), class (COVID19), subclass (COVID19, Infectious-Disease)
• Infectious diseases are caused by disease causing agents, like viruses: class (Disease-Causing-Agent), class (Virus), subclass (Virus,
Disease-Causing-Agent), property (caused-By Domain(Infectious-Disease) Range(Disease-Causing-Agent))
• COVID19 is caused by SARS COVID2 virus: class (SARS-COVID-2), subclass (SARS-COVID-2, Virus), property (caused-By
Domain(COVID19) Range(SARS-COVID-2))
Procedural knowledge:
• Ad hoc solutions implementing chains as next properties in action objects: property (Next Domain(Action) Range(Action)),
minCardinality (Next, 0)
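The class, subclass, and property statements above can be held in plain Python data with a transitive IS-A check. This is an illustrative sketch only, not the API of any ontology library:

```python
# The slide's ontology as Python data: child class -> parent class.
subclass_of = {
    "COVID19": "Infectious-Disease",
    "Virus": "Disease-Causing-Agent",
    "SARS-COVID-2": "Virus",
}
properties = {
    # property name: (domain class, range class)
    "caused-By": ("Infectious-Disease", "Disease-Causing-Agent"),
}

def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)
    return False

# The statement "COVID19 caused-By SARS-COVID-2" respects the property's
# declared domain and range:
dom, rng = properties["caused-By"]
print(is_a("COVID19", dom), is_a("SARS-COVID-2", rng))   # True True
```

In practice this bookkeeping is delegated to an ontology language such as OWL and its reasoners; the sketch only shows what subsumption checking amounts to.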
219
Knowledge Life Cycle
Def (Knowledge life cycle): the cyclic process of transforming information
into knowledge within an organization.

• Create
Knowledge is identified and represented in a
formal way
• Share
Created knowledge and its possible uses are
introduced to the final users
• Store
Created knowledge is stored in (formal)
knowledge bases for a later use
• Use
Stored knowledge is exploited in the daily
practice of the knowledge-based organization
• Update
From daily practice, needs for new, more specific, or corrected
knowledge can be identified, triggering a new loop of the whole process.

220
5. Clinical Decision Support Systems (CDSS) and Artificial
Intelligence in Medicine (AIM)
• Def. (Biomedicine): the branch of medicine concerned with the application of the principles
of biology and biochemistry to medical research or practice.
• Def. (Medical practice): the exercise of medicine; a complex problem involving multiple
cognitive tasks such as diagnosis, treatment, and prognosis.
– Diagnosis: Knowledge of the physical and mental state of the patient, acquired by
observing the signs and symptoms of the disease that they present.
– Treatment: The management and care of a patient for the purpose of combating disease, injury, or
disorder.
– Prognosis: The likely outcome or course of a disease; the chance of recovery or recurrence.

• These are decisional issues that can be supported by Clinical Decision Support Systems and
Artificial Intelligence tools.
– Clinical Decision Support System (CDSS): health information technology system that is designed to
provide health-care professionals with clinical decision support; i.e., assistance with clinical decision-
making tasks.
– Artificial Intelligence in Medicine (AIM): the use of artificial intelligence technology and automated
processes in the diagnosis, treatment, and prognosis of patients who require care.
• Some outstanding CDSS
– Electronic Differential Diagnosis (DDx) Generators
– Drug Interaction Checkers (DIG)
– Alert and Surveillance Systems
221
Further Reading: https://www.ncbi.nlm.nih.gov/books/NBK543516/pdf/Bookshelf_NBK543516.pdf
Electronic Differential Diagnosis (DDx) Generators
Def. (Differential Diagnosis (DDx) Generators): Electronic tools that
facilitate the diagnostic process: the user introduces the observed signs and
symptoms and obtains a ranked list of possible diagnoses.
• Examples:
https://symptomchecker.isabelhealthcare.com/
https://symptoms.webmd.com/

Typical workflow:
① Insert demographic data and symptoms
② Obtain ranking of possible causes
③ Select disease(s) and obtain detailed information on them
④ Select disease(s) and obtain information on where to get care
222
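The ranking step can be illustrated with a toy scorer that orders diseases by the overlap between their known signs and the observed ones. The disease–symptom table below is invented for illustration and is in no way clinical knowledge:

```python
# Toy DDx ranking: score each disease by the fraction of its known signs
# that were actually observed, then sort in descending order.

knowledge = {
    "hypertension": {"severe_headache", "nosebleed", "fatigue"},
    "influenza": {"fever", "fatigue", "cough"},
    "COVID19": {"fever", "cough", "anosmia", "fatigue"},
}

def rank(observed):
    """Return (disease, score) pairs sorted from best to worst match."""
    scores = {d: len(signs & observed) / len(signs)
              for d, signs in knowledge.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank({"fever", "cough", "fatigue"}))
# influenza (3/3 signs observed) ranks above COVID19 (3/4) and hypertension (1/3)
```

Real DDx generators combine much richer evidence (prevalence, demographics, severity); this only shows the shape of the computation.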
Drug Interaction Checkers (DIG)
Def. (Drug Interaction Checker (DIG)): Electronic tools that help detect
(and resolve) interactions between groups of drugs and other substances
that prevent the drugs from performing as expected.
• Examples:
https://reference.medscape.com/drug-interactionchecker
https://www.webmd.com/interaction-checker/default.htm

① Insert drugs for interaction detection


② Obtain list of interactions
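The core of such a checker is a pairwise lookup of the entered drugs against an interaction table. The table below is a tiny illustrative stand-in for a real pharmacological database:

```python
# Toy drug interaction lookup: check every pair of entered drugs against a
# small interaction table keyed by unordered drug pairs.
from itertools import combinations

interactions = {
    frozenset({"warfarin", "aspirin"}): "increased bleeding risk",
    frozenset({"sildenafil", "nitroglycerin"}): "severe hypotension",
}

def check(drugs):
    """Return the known interactions among the given list of drugs."""
    return [(a, b, interactions[frozenset({a, b})])
            for a, b in combinations(drugs, 2)
            if frozenset({a, b}) in interactions]

print(check(["warfarin", "aspirin", "paracetamol"]))
# [('warfarin', 'aspirin', 'increased bleeding risk')]
```

Using `frozenset` as the key makes the lookup order-independent, which matches how interaction pairs are usually reported.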

223
Alert and Surveillance Systems
Def. (Clinical Surveillance and Alert System): a system that utilizes
multivariate, continuous, real-time data from multiple monitoring devices,
applies advanced analytics to provide a quantitative and qualitative
estimate of a patient's condition over time, and communicates clinically
relevant alerts to the appropriate clinician.
• Examples:

• Alarm fatigue and false positives are the main practical challenges.

• General conceptual architecture:
Patient → Device → Intelligent Alarm System → Health-care Professionals
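One simple way to fight alarm fatigue is to require several consecutive out-of-range samples before firing. The sketch below does this for an assumed SpO2 stream; the threshold and persistence window are illustrative, not clinical settings:

```python
# Threshold alert with a persistence requirement: a single transient dip
# below the threshold is ignored; only sustained episodes fire an alert.

def alerts(spo2_stream, threshold=90, persistence=3):
    """Yield the sample index at which a sustained desaturation alert fires."""
    below = 0
    for i, value in enumerate(spo2_stream):
        below = below + 1 if value < threshold else 0
        if below == persistence:         # fire once per sustained episode
            yield i

stream = [97, 96, 88, 95, 89, 88, 87, 86, 94]
print(list(alerts(stream)))   # [6] -> the single dip at index 2 is ignored
```

Real systems combine many signals and model-based analytics, but even this one-signal debounce illustrates the trade-off between sensitivity and false positives.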
224
6. Legality, Security and Ethics
• Legal Aspects of Health Care Administration
– USA
– Spain
• Code of Ethics of Medical Informatics
– International Medical Informatics Association
• Other proposals
– General Data Protection Regulation (GDPR)
– Spanish Ley Orgánica de Protección de Datos (LOPDGDD)
– Artificial Intelligence and Law Enforcement - Impact on
Fundamental Rights

225
Legal Aspects of Health Care Administration

• Introduction to the legal system


• Tort law on negligence
• Criminal aspects of health care
• Contracts and antitrust
• Civil Procedure and trial practice
• Corporate structure and liabilities
• Information management
• Patient consent
• Legal reporting requirements
• Patient rights and responsibilities
• Health-care ethics
• Procreation and ethical dilemmas
• End-of-life issues
• HCP as an employee

226
Legal Aspects of Health Care Administration

• Injury report (Parte de Lesiones)
– Código Penal (art 147)
– Ley de Enjuiciamiento judicial (art 262, 355)
• Clinical record (Historia Clínica, HC)
– Ley General de Sanidad (art 10, 61)
– Código de Ética y Deontología Médica (art 13)
– Ley 41/2002, basic law regulating patient autonomy and the
rights and obligations regarding clinical information and
documentation
• Use of the HC
• Conservation of the HC
• Right of access to the HC
• Ownership of the HC
Source: https://www.elsevier.es/es-revista-medicina-familia-semergen-40-pdf-13072713
227
Legislation on the Clinical Record in Spain (I)

Ley General de Sanidad
• Art 10: "Everyone has the following rights with respect to the different public health administrations: […] 11. To have their whole care process recorded in writing. At the end of the user's stay in a hospital institution, the patient, a family member or a person close to the patient will receive the discharge report."
• Art 61: "In each Health Area, the greatest possible integration of the information relating to each patient must be pursued; the principle of a single health record per person must therefore be maintained, at least, within the limits of each care institution. It shall be available to the patients and to the physicians directly involved in the diagnosis and treatment of the patient, as well as for purposes of medical inspection or for scientific purposes, with full guarantees of the patient's right to personal and family privacy and of the duty of secrecy of whoever, by virtue of their competences, has access to the clinical record. The public authorities shall adopt the necessary measures to guarantee these rights and duties."

Código de Ética y Deontología Médica
• Art 13:
1. Medical acts shall be recorded in the corresponding clinical record. The physician has the duty and the right to write it.
2. The physician and, where applicable, the institution for which he or she works are obliged to keep the clinical records and the material elements of diagnosis. If they are no longer kept with the passing of time, the cited material that is not considered relevant may be destroyed, without prejudice to the provisions of special legislation. In case of doubt, the Deontology Commission of the Medical College must be consulted.
3. When a physician ceases his or her private practice, the archive may be transferred to the succeeding colleague, unless the patients express their will against it. When no such succession takes place, the archive may be destroyed in accordance with the previous paragraph.
4. Clinical records are written and kept for the care of the patient or for another purpose that complies with the rules of medical secrecy and has the authorization of the physician and the patient.
5. The scientific and statistical analysis of the data contained in the records, and the presentation of some specific cases for teaching purposes, can provide very valuable information, so their publication and use conform to deontology provided that the confidentiality and the right to privacy of the patients are rigorously respected.
6. The physician is obliged, upon request and for the benefit of the patient, to provide another colleague with the data necessary to complete the diagnosis, as well as to facilitate the examination of the tests performed.
228
Legislation on the Clinical Record in Spain (II)
Ley 41/2002, basic law regulating patient autonomy and the rights and
obligations regarding clinical information and documentation
• Article 14. Definition and filing of the clinical record
• Article 15. Content of the clinical record of each patient
• Article 16. Uses of the clinical record
• Article 17. Conservation of the clinical documentation
• Article 18. Rights of access to the clinical record
• Article 19. Rights related to the custody of the clinical record

229
Summary of Spanish Legislation on the HC (I)

A. Use of the clinical record
1. By the care professionals of the medical center where the diagnosis or treatment of the patient is carried out.
2. By the administration and management staff of the health center (only the data related to their functions).
3. By health personnel with inspection, evaluation, accreditation and planning functions, to verify the quality of care or the respect of the patient's rights.
4. For judicial, epidemiological, public-health, research or teaching purposes. Bear in mind:
– A specific legal authorization is needed.
– The authorization will always be restrictive, separating the patient's identification data from the clinical-care data.
– For judicial purposes, the complete clinical record is requested.

B. Conservation of the clinical record
– The duty falls on the health professional when practicing individually, and on the health center when medical care is provided within it.
– When the duty of custody falls on the physician, it is a patrimonial duty and therefore transmissible upon death to his or her heirs. Ley 41/2002 provides a minimum conservation period of 5 years from the date of care discharge. The maximum will vary depending on the type of care provided, the expected complications, the persistence of the need for certain treatments, etc.
– The conservation of the HC is fundamental in the case of liability claims. Failure to produce it can have adverse effects on the professional (there are several convicting judgments against the physician or the institution for not producing the clinical record while alleging that it had been lost, had been given to the patient, or had never been created).
230
Summary of Spanish Legislation on the HC (II)

C. Right of access to the clinical record
• The main holder of the right of access is the patient, who may exercise it personally or through a duly accredited representative.
• In case of death, the right of access passes to the persons linked to the deceased for family or de facto reasons, unless the deceased had expressly forbidden it.
• Ley 41/2002 also provides for access to the clinical record by a third party motivated by a risk to his or her health, limited in this case to the data strictly related to that health risk.

D. Ownership of the clinical record
• In a private practice: the HC belongs to the physician who creates it.
• In a private institution (employed physician): the HCs are the property of the institution or entity for which the physician works.
• In a private institution (physician renting the consultation rooms): the patients belong to the physician, not to the entity; for this reason the physician is the owner of the HC.
• Physician under civil-servant or statutory regime (public health system): the ownership of the HC corresponds to the institution.

Practical conclusions that can help avoid legal problems:
• Be careful when writing the clinical record: disorder or illegible handwriting give a bad impression when the clinical record reaches the court because of liability claims.
• The physician must never forget that, although it is a care document, it has a fundamental medico-legal importance.
• Everything that the patient is asked, everything that is done to the patient, and everything that is prescribed must be recorded in the clinical record. In principle, and as a general rule, whatever is not reflected in the clinical record was not done.
231
Code of Ethics
• International Medical Informatics Association (IMIA)
• 2016 IMIA Code of Ethics for Health Information Professionals

• Introduction: general principles


A. Set of fundamental ethical principles (6)
B. List of general principles of informatics ethics (7)
• Rules of ethical conduct: particular set of ethical rules that should
guide the behavior of a Health Informatics Professional (HIP)
A. Subject-centered duties (12)
B. Duties towards Health Care Professionals (HCPs) (6)
C. Duties towards institutions, employers and agencies (10)
D. Duties towards society (5)
E. Self-regarding duties (7)
F. Duties towards the profession (5)

Source: https://imia-medinfo.org/wp/imia-code-of-ethics/
232
Fundamental Ethical Principles
1. Principle of Autonomy: All persons have a fundamental right to self-determination.
2. Principle of Equality and Justice: All persons are equal as persons and have a right
to be treated accordingly.
3. Principle of Beneficence: All persons have a duty to advance the good of others
where the nature of this good is in keeping with the fundamental and ethically
defensible values of the affected party.
4. Principle of Non-Malfeasance: All persons have a duty to prevent harm to other
persons insofar as it lies within their power to do so without undue harm to
themselves.
5. Principle of Impossibility: All rights and duties hold subject to the condition that it
is possible to meet them under the circumstances that obtain.
6. Principle of Integrity: Whoever has an obligation has a duty to fulfil that obligation
to the best of their ability.

233
General Principles of Informatics Ethics
1. Principle of Information-Privacy and Disposition: All persons and groups of persons have a fundamental
right to privacy, and hence to control over the collection, storage, access, use, communication,
manipulation, linkage and disposition of data about themselves.
2. Principle of Openness: The collection, storage, access, use, communication, manipulation, linkage and
disposition of personal data must be disclosed in an appropriate and timely fashion to the subject or
subjects of those data.
3. Principle of Security: Data that have been legitimately collected about persons or groups of persons should
be protected by all reasonable and appropriate measures against loss, degradation, unauthorized
destruction, access, use, manipulation, linkage, modification or communication.
4. Principle of Access: The subjects of electronic health records have the right of access to those records and
the right to correct them with respect to their accuracy, completeness and relevance.
5. Principle of Legitimate Infringement: The fundamental right of privacy and of control over the collection,
storage, access, use, manipulation, linkage, communication and disposition of personal data is conditioned
only by the legitimate, appropriate and relevant data-needs of a free, responsible and democratic society,
and by the equal and competing rights of others.
6. Principle of the Least Intrusive Alternative: Any infringement of the privacy rights of a person or group of
persons, and of their right of control over data about them, may only occur in the least intrusive fashion
and with a minimum of interference with the rights of the affected parties.
7. Principle of Accountability: Any infringement of the privacy rights of a person or group of persons, and of
the right to control over data about them, must be justified to the latter in good time and in an appropriate
fashion.

234
Rules of Ethical Conduct
A. Subject-centered duties (12): duties derived from the relationship
between EHRs, the data contained in them, and the subjects of those
records
B. Duties towards HCPs (6): HIPs' obligations to assist the HCPs with
whom they are associated, compatible with the duties towards the
subjects of the EHRs
C. Duties towards institutions, employers and agencies (10): HIPs'
obligations towards the organizations they work with
D. Duties towards society (5): HIPs' social obligations
E. Self-regarding duties (7): HIPs' obligations towards themselves
F. Duties towards the profession (5): obligations of HIPs as a
professional group

Please read the original document at least once.


235
General Data Protection Regulation (GDPR)

• Regulation on personal data protection and privacy in the European Union


(EU) and the European Economic Area (EEA)
– Only applicable to personal data.
– Personal data is information that relates to an identified or identifiable individual
(e.g., name, ID number, credit card number, IP address, etc.).
– If it is possible to identify an individual directly from the information you are
processing, then that information may be personal data.
– If you cannot directly identify an individual from that information, then you need
to consider whether the individual is still identifiable if your information were
processed together with some other means reasonably likely to be used by either
you or any other person to identify that individual.

• In Spain, the GDPR is complemented by Organic Law 3/2018, of December
5, on the Protection of Personal Data and Guarantee of Digital Rights
(LOPDGDD)
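As a small illustration of reducing direct identifiability before processing, direct identifiers can be replaced by salted hashes (pseudonymization). Note that under the GDPR pseudonymized data generally remains personal data, since re-identification may still be possible; the field names and salt below are illustrative:

```python
# Sketch of pseudonymizing a record's direct identifiers with a salted hash
# so the remaining fields can be processed without directly naming the
# individual. The salt must be kept secret and separate from the dataset.
import hashlib

SECRET_SALT = b"keep-me-out-of-the-dataset"   # illustrative; store securely

def pseudonymize(record, identifier_fields=("name", "national_id")):
    out = dict(record)
    for field in identifier_fields:
        if field in out:
            digest = hashlib.sha256(SECRET_SALT + str(out[field]).encode())
            out[field] = digest.hexdigest()[:12]   # short stable pseudonym
    return out

rec = {"name": "Jane Doe", "national_id": "12345678Z", "diagnosis": "I10"}
print(pseudonymize(rec))   # identifiers replaced, clinical data kept
```

The same input always maps to the same pseudonym, so records can still be linked for analysis without exposing the identity directly.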

236
Artificial Intelligence and Law Enforcement
• Not this course

237
Source: https://www.europarl.europa.eu/thinktank/en/document.html?reference=IPOL_STU(2020)656295
Conclusions
In this course we learned…
• About clinical data…
… what are the sources and the sinks of data; what types of clinical data we can find and how to store them in variables; how to convert
variables between different types and why; some undesired issues with variables (wrong values, missing values, noise, etc.) and how to deal with
them; basic concepts on big data; what are the most frequently used standards of clinical data codification; what an EHR is, its parts,
uses, its past, present and future, standards, and some of the current EHR software systems; what interoperability is and its relation
to clinical data sharing.
• About clinical data description and analysis…
… what are the parts of a data science project; how to use descriptive statistics to describe clinical data and how to do that in Python;
practical example on clinical data description (Case Study 1); analyze data with inferential statistics; Student’s and Welch’s t-tests;
Pearson's Chi-square test; ANOVA (one-way, two-way with/out replication); post-hoc analysis with Tukey’s test; how to check data
normality (Shapiro-Wilk and Kolmogorov-Smirnov tests); how to perform non-parametric tests (Wilcoxon's rank-sum, Kruskal-Wallis
tests); how to calculate and interpret risk ratio and odds ratio; how to do all this with Python; practical example on clinical data inference
(Case Study 2); perform survival analysis with Kaplan-Meier table, curve, and log-rank test.
• About artificial intelligence for clinical data analysis…
… practical use of Rapid Miner; basics on data statistics and visualization; identify data preprocessing issues and alternative solutions for
data cleaning, transformation, reduction, and balancing; practical application of these technologies with Rapid Miner (Case Study 3); use
of ML algorithms for data analysis and modeling; introduction to unsupervised ML (k-Means and Agglomerative Clustering); introduction
to supervised ML (k-NN, Logistic Regression, DT, NB classifier, Rules, ANN, SVM, discriminant analysis); know the main quality measures
to test clinical predictive models; develop a data science project for the analysis of hepatitis (Case Study 4).
• About knowledge representation in medicine…
… what is the difference between clinical data, information, and knowledge; practical distinction between declarative and procedural
knowledge; 4 alternative ways to represent clinical knowledge (logic, production rules, objects, and ontologies); the steps of the
knowledge life cycle.
• About clinical decision support systems…
… what is a CDSS; practical introduction to 3 CDSS: differential diagnosis generators (diagnosis), drug interaction checkers (treatment),
and alert & surveillance systems (follow-up).
• About clinical legal, security, and ethics…
… legal aspects of HC administration in the USA and Spain; code of ethics of medical informatics; the European General Data Protection
Regulation (GDPR) and the Spanish LOPDGDD organic law implementation.
238
References
• Lloyd B. (2017) Stanford Medicine 2017 Health Trends Report Harnessing the Power of Data in
Health.
• Smith A., Nelson M. (1999) Data Warehouses and Clinical Data Repositories. In: Ball M.J., Douglas
J.V., Garets D.E. (eds) Strategies and Technologies for Healthcare Information. Health Informatics.
Springer, New York, NY
• Coding Systems for Categorical Variables in Regression Analysis. UCLA: Statistical Consulting Group,
Institute for Digital Research & Education, https://stats.idre.ucla.edu/spss/faq/coding-systems-for-
categorical-variables-in-regression-analysis/ (accessed Jan 2020).
• Hammond W.E., Cimino J.J. (2006) Standards in Biomedical Informatics. In: Shortliffe E.H., Cimino
J.J. (eds) Biomedical Informatics. Health Informatics. Springer, New York, NY.
• Hammond W.E., Jaffe C., Cimino J.J., Huff S.M. (2014) Standards in Biomedical Informatics. In:
Shortliffe E., Cimino J. (eds) Biomedical Informatics. Springer, London.
• Kelley T. Electronic Health Records for Quality Nursing & Health Care. DEStech Publications, Inc.
2016.
https://books.google.es/books?hl=es&lr=&id=BhqUCwAAQBAJ&oi=fnd&pg=PR11&dq=Electronic+Health+Records+for+Quality+Nursing+and+Health+Care+pdf&ots=riUu7qPBdD&sig=5t1NrCDZIRrRiwPWyS_OSw00Suw#v=onepage&q&f=false
• Machin D., Campbell M.J., Walters S.J. (2007) Medical Statistics. John Wiley & Sons Ltd 4th Edition.
• Material docente de la Unidad de Bioestadística Clinica. Hospital Universitario Ramón y Cajal.
http://www.hrc.es/bioest/M_docente.html

239
