
The Role of Statistics in Discovery
Ram Ramaswamy
President, Indian Academy of Sciences, Bangalore
Professor, Jawaharlal Nehru University, New Delhi
Knowledge Discovery
is the non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data
Outline
Gary Taylor and the poem
Kosambi and the discovery of numismatics
Statistics in the service of public health: Breast
cancer and the discovery of its subtypes. Lasso
DESI-MS
One of the leading statisticians of
the present time is
Professor C Radhakrishna Rao
Shall I die? Shall I fly
Lover's baits and deceits
sorrow breeding?
Shall I tend? Shall I send?
Shall I sue, and not rue
my proceeding?
In all duty her beauty
Binds me her servant for ever.
If she scorn, I mourn,
I retire to despair, joining never.
On 14 November 1985, Gary Taylor found a
poem: 9 stanzas, 429 words.
• The poem was signed “W.S.”.
• Was it by Winston Smith, Wesley Snipes, or William Shakespeare?
• Or Roger Bacon or Christopher Marlowe?
Shakespeare used a total of 884,697
words
As discussed by C R
Rao, based on the
frequencies of word
usages in the complete
works of Shakespeare,
the canon, one can
infer that the newly
discovered poem is
probably by
Shakespeare…
Zipf’s Law: N(k) ~ 1/k
 In any natural language the frequency of
any word is inversely proportional to its rank.
Thus the most frequent word will occur
approximately twice as often as the second
most frequent word, three times as often as
the third most frequent word, etc
In English (Brown Corpus counts), the word "the" is the most
frequently occurring word (69,971 occurrences). The
second-place word "of" has 36,411
occurrences, followed by "and" (28,852).
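As a quick check of how closely the three quoted counts follow N(k) ~ 1/k, the sketch below compares each observed count with the Zipf prediction N(1)/k. The function name `zipf_table` is introduced here purely for illustration; only the three counts come from the slide.

```python
# Counts quoted on the slide for the three most frequent words.
OBSERVED = [("the", 69971), ("of", 36411), ("and", 28852)]


def zipf_table(observed):
    """Compare observed counts with the Zipf prediction N(k) = N(1) / k.

    `observed` is a list of (word, count) pairs sorted by decreasing count.
    Returns rows of (word, rank, count, predicted, count / predicted).
    """
    top_count = observed[0][1]
    rows = []
    for rank, (word, count) in enumerate(observed, start=1):
        predicted = top_count / rank
        rows.append((word, rank, count, predicted, count / predicted))
    return rows


if __name__ == "__main__":
    for word, rank, count, predicted, ratio in zipf_table(OBSERVED):
        print(f"{word:>4}  rank={rank}  observed={count}  "
              f"predicted={predicted:.0f}  ratio={ratio:.2f}")
```

On these numbers "of" lies roughly 4% above the 1/k prediction and "and" roughly 24% above it, which illustrates that the law is an approximate empirical regularity rather than an exact rule.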
Zipf’s Law (1935) is statistical and
empirical
• An early critic was D D Kosambi, who applied it to ancient
Kannada texts and wrote a critical paper in 1942, “On valid tests
of linguistic hypotheses”, in the New Indian Antiquary, 5, 21–24.
Zipfian analysis seems to apply quite
generally
To natural (as well as computer) languages
To the analysis of city sizes
To patterns in DNA: looking for “words” in DNA
regions that code for genes, for instance

But why does one have these “power laws”?
Sample-space-reducing stochastic processes
Survival probability: P(k) ~ 1/k
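One way to see where a 1/k law can come from is to simulate a simple sample-space-reducing process. The sketch below assumes the simplest version: the process starts in the highest state N, jumps uniformly at random to any strictly lower state, and restarts once it reaches state 1. The function name, the parameters and the reading of P(k) as the fraction of runs that visit state k are assumptions made for this illustration.

```python
import random


def ssr_visit_frequencies(n_states, n_runs, seed=0):
    """Simulate a sample-space-reducing process.

    Each run starts in state `n_states` and repeatedly jumps to a state
    chosen uniformly from those strictly below the current one, until it
    reaches state 1. Returns a list `freq` where freq[k] is the fraction
    of runs in which state k was visited (index 0 is unused, and the
    starting state itself is not counted as a visit).
    """
    rng = random.Random(seed)
    visits = [0] * (n_states + 1)
    for _ in range(n_runs):
        state = n_states
        while state > 1:
            state = rng.randint(1, state - 1)  # uniform over lower states
            visits[state] += 1
    return [v / n_runs for v in visits]


if __name__ == "__main__":
    freq = ssr_visit_frequencies(n_states=20, n_runs=20000, seed=1)
    for k in (1, 2, 4, 5, 10):
        print(f"k={k:2d}  simulated={freq[k]:.3f}  1/k={1 / k:.3f}")
```

For this particular process the probability of visiting state k is exactly 1/k for every k below the starting state, so the simulated frequencies should approach 1, 1/2, 1/3, and so on as the number of runs grows.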
Kosambi: Mathematics+Statistics and History
• D D Kosambi is well-known for introducing statistical
methodology into the study of Indian history.
• In explaining the beginnings of his experimental work with
early Indian coins, Kosambi mentions that he took up two
initial problems “to teach myself statistics”.
“…every hoard of coins bears the
signature of its society”
 Kosambi's research in numismatics beginning in the 1940s
marked a radical departure in the field from the practices and
interests in the previous 100 years.
 There can be questions about how historians, including
Kosambi, may have used coins as markers of socio-economic
change.
But Kosambi's use of numismatics was such that historians
can no longer ignore numismatic evidence for societal history.

B D Chattopadhyaya, EPW
Scientific Numismatics

Ancient coins have provided much information about the sites
in which they were found and about the societies that produced
them.
They can be made to yield even more information by modern
statistical methods.
Circulation of coins
induces variation in
their weights
The longer coins have
been in use, the greater the
variance of their weights and
the lower their mean weight.
Kosambi was able to
show this by weighing the
coins of the “Taxila Hoard”
BDC, EPW
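The link between circulation time and weight statistics can be made concrete with a very simple wear model. This is an illustrative assumption, not necessarily the model Kosambi fitted: a coin is minted with weight W_0 (mean \mu_0, variance \sigma_0^2) and loses a random amount X_i in each year i of circulation, the losses being independent of each other and of W_0, with mean m and variance s^2. All symbols are introduced here for illustration.

```latex
W_t = W_0 - \sum_{i=1}^{t} X_i, \qquad
\mathbb{E}[W_t] = \mu_0 - t\,m, \qquad
\operatorname{Var}[W_t] = \sigma_0^2 + t\,s^2 .
```

Under these assumptions the mean weight falls and the variance grows linearly with the time t in circulation, so both statistics of a hoard carry information about how long its coins were in use.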
Statistics in Public Health
 Integration of large numbers and varieties of information and
data to provide better analysis.
 Leads to better (and more cost-effective) treatment.
 Spectacular advances in the treatment of breast cancer.
Breast cancer incidence
 Most commonly diagnosed
cancer after nonmelanoma skin
cancer
 Second leading cause of
cancer deaths after lung cancer.
1 in 8 chance of developing breast cancer (BCA)
1 in 33 chance of death from it
5-10% of cases are hereditary
BCA has been one of the diseases
subjected to early molecular profiling
Global gene expression patterns are measured in order to identify
individual genes that mediate particular aspects of cellular physiology.
• DNA microarrays are a systematic method for studying mRNA variation
between cancerous and healthy cells. They make it possible to identify clinically
relevant tumor entities and subclasses, and therefore suggest potential
therapeutic targets.
Initial studies by Botstein and colleagues gave “molecular” portraits of
human breast tumors and identified multiple tumor classes which differ in
ER expression: Luminal A, Luminal B, ERBB2+, Basal
http://genome-www.stanford.edu/breast_cancer
Patterns, Models, Classifiers

[Figure panels: Positive Patterns, Negative Patterns, Model]
Biomedical data
• Breast cancer data (Stanford &
Norway): cDNA gene expression
data, 122 breast cancer samples
(112+10). Sorlie et al., PNAS 2003
• 552 “intrinsic genes” of high
variability; hierarchical clustering
shows 5 major subgroups of
samples / genes
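To show what hierarchical clustering of expression profiles does in practice, here is a minimal pure-Python sketch of agglomerative clustering. The choice of average linkage with correlation distance is an assumption (the slide does not state which linkage or metric was used), and the four toy profiles are invented for illustration. They are not taken from the Sorlie et al. data.

```python
from itertools import combinations
from math import sqrt


def correlation_distance(x, y):
    """Return 1 - Pearson correlation (profiles must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    Returns clusters as sorted lists of sample indices.
    """
    clusters = [[i] for i in range(len(profiles))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(correlation_distance(profiles[i], profiles[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]],
                                            clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]  # a < b, so b is still valid
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: 4 samples, each measured on 4 genes.
TOY_PROFILES = [
    [2.0, 1.8, -1.0, -1.2],
    [1.9, 2.1, -0.8, -1.1],
    [-1.0, -1.3, 2.2, 1.9],
    [-1.2, -0.9, 1.8, 2.0],
]

if __name__ == "__main__":
    print(average_linkage(TOY_PROFILES, n_clusters=2))
```

On the toy data, samples 0 and 1 end up in one cluster and samples 2 and 3 in the other, because each pair has strongly correlated profiles. Real studies run the same idea over hundreds of genes and usually rely on an optimized library implementation.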
Real-time diagnosis for oncosurgery
In the past two years, Richard
Zare, Livia Eberlin, George
Poultsides and Robert Tibshirani
of Stanford University have
collaborated to use
desorption electrospray
ionization mass spectrometry
to distinguish cancer cells from
normal cells.
This technique, DESI-MS, can
give “real time” information to
assist in surgery.
A statistical approach is vital
• The technique measures the combination of molecules present in a sample.
From this measurement one needs to decide, as quickly as possible and with
as high a probability of being correct as possible, whether a cell or group
of cells is cancerous.
LASSO method: the Least Absolute Shrinkage and Selection
Operator (LASSO) is a regression method that performs both
variable selection and regularization in order to enhance the
prediction accuracy and interpretability of the statistical
model it produces.
Simple setting: Regression + constraint
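"Regression + constraint" can be read as least-squares regression with a bound on the sum of the absolute values of the coefficients. In its equivalent penalized form, one minimizes (1/2n) * ||y - X b||^2 + lam * sum_j |b_j|. The sketch below solves this by cyclic coordinate descent with soft thresholding, one standard way to fit the lasso. It assumes centered data with no intercept, and the function names and toy data are hypothetical, not the DESI-MS analysis itself.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize (1/2n) * ||y - X b||^2 + lam * sum_j |b_j|.

    X is a list of rows, y a list of responses. Assumes centered data,
    no intercept, and no all-zero column. Returns the coefficient list.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iters):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k]
                                    for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Hypothetical toy data: y = 3 * x1 + 0.1 * x2, with orthogonal columns.
TOY_X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
TOY_Y = [3.1, -2.9, 2.9, -3.1]

if __name__ == "__main__":
    print(lasso_coordinate_descent(TOY_X, TOY_Y, lam=0.5))
```

With lam = 0.5 the weak second coefficient is set exactly to zero while the strong first one is shrunk from 3 to about 2.5, which shows variable selection and regularization happening together. In a diagnostic setting the same mechanism would pick out a small set of informative features from a large number of candidates.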
Knowledge Discovery
is the non-trivial process of identifying
valid, novel, potentially useful, and
ultimately understandable patterns in
data
Particularly in this age of
“Big Data”, statistics can
provide an increasingly
useful means of discovering
knowledge.
