
PHARMACEUTICAL STATISTICS

Pharmaceut. Statist. 2007; 6: 155–160


Published online in Wiley InterScience
(www.interscience.wiley.com) DOI: 10.1002/pst.303

The difficult and ubiquitous problems of multiplicities
Donald A. Berry*,†
Department of Biostatistics, M.D. Anderson Cancer Center, Houston, TX, USA

Multiplicities are ubiquitous. They threaten every inference in every aspect of life. Despite the focus in
statistics on multiplicities, statisticians underestimate their importance. One reason is that the focus is
on methodology for known multiplicities. Silent multiplicities are much more important and they are
insidious. Both frequentists and Bayesians have important contributions to make regarding problems
of multiplicities. But neither group has an inside track. Frequentists and Bayesians working together
is a promising way of making inroads into this knotty set of problems. Two experiments with
identical results may well lead to very different statistical conclusions. So we will never be able to
use a software package with default settings to resolve all problems of multiplicities. Every
problem has unique aspects. And all problems require understanding the substantive area of
application. Copyright © 2007 John Wiley & Sons, Ltd.

Keywords: multiplicities; Bayesian statistics; multiple comparisons; subset analyses

*Correspondence to: Donald A. Berry, Department of Biostatistics, M.D. Anderson Cancer Center,
Houston, TX, USA.
†E-mail: dberry@mdanderson.org

Most scientists are oblivious to the problems of multiplicities. Yet they are everywhere. In one
or more of its forms, multiplicities are present in every statistical application. They may be out
in the open or hidden. And even if they are out in the open, recognizing them is but the first step
in a difficult process of inference. Problems of multiplicities are the most difficult that we
statisticians face. They threaten the validity of every statistical conclusion.

Some statisticians do not understand that the possibility of multiplicities may be a problem.
Others recognize the problem and overreact. I will suggest my approach below, but I do not claim
that it is an ideal compromise. And indeed, it may not be a compromise at all.

Multiplicities humble us all. The way we deal with them separates good statisticians from the
less good. I have no qualms about my own abilities in this regard and unhesitatingly place myself
in the less-good category. Too frequently I carry out statistical tests or make posterior
probability calculations based on the ‘data’ sitting in front of me. But the fact that these
particular data are sitting in front of me at all is part of the data! I may choose to ignore this
critical information because I do not know what probability model to use for the process that
brought the data to me.
But it nags at me. I sometimes try to bring this information into the process at the very end,
when I report my conclusions. Multiplicities caveat all of my analyses. The only favorable light I
can shine on my ability to handle them is that I recognize their importance and their ubiquity.

My first exposure to multiplicities was in my first-grade class, which was my first time away
from home. There were two redheads in the class. They were from different families and they were
both very bright. I came to associate red hair with intelligence. It took several years of meeting
redheads of normal intelligence and smart brunettes and blondes for me to come to understand that
I had read too much into the early data.

The two brightest kids in any class are necessarily similar in some other way – perhaps in
several other ways. Perhaps they are both girls, both boys, both tall, both short, extreme in
height (one may be very tall and the other very short), both of the same nationality or religion
or ethnic group, both overweight, both underweight, have similar hair color, have buck teeth, have
freckles, speak with a lisp, can run fast, cannot run fast, are handsome, are not handsome, etc.
So I was doomed. I was bound to learn something that was wrong!

Part of the reason I drew the wrong conclusions and was slow to unlearn was that I did not know
Bayes’ theorem. However, knowing Bayes’ theorem is not enough: I might have applied it badly. Like
everyone else, Bayesians have great difficulty properly interpreting observational data – whether
they know it or not! I dare say that most applications of Bayes’ theorem are ‘bad’! For example,
the third redhead I met might have been of normal or subnormal intelligence. But by this time I
had a pretty clear notion of the implications of redheadedness and I was entrapped by my early
observations. So it would be easy to explain away my latest observation. This apparent dolt was
probably bright after all, but just shy. Or maybe his hair was not quite as red as the others. Or
maybe it is only redheaded girls who are smart. Or maybe the third redhead was just weird and not
representative of his brethren. A mathematical application of Bayes’ theorem for all three
observations at once gives the same answer as three sequential applications of Bayes’ theorem,
one after each observation. However, in practice people will end up with different conclusions.

There are countless examples of this phenomenon in drug development. And it has the same two
traps: discounting valid observations, and subsetting. So important are these phenomena that they
may account for as much as 30% wastage of research expenditures in the pharmaceutical industry.
(This subjective assessment is subject to the vagaries of observational interpretation, and you
may reasonably wonder whether my inferential abilities have improved since the first grade!) Drug
developers are entrapped by early results and can usually find characteristics of more recent
results that ‘explain’ why they are different.

Sometimes there are no known covariates to explain the differences. In such a case one would
think that there is no recourse for the entrappee. However, here is a real counterexample, with
names withheld to protect me from libel suits. A company runs a randomized phase II study. At the
halfway point they observe a statistically significant advantage for their drug over placebo. The
trial continues. In the second half the data completely flip-flop, with a statistically
significant advantage for placebo. Overall, no difference. The patient characteristics in the two
halves were the same. The company was sure that the first half was the truth!

A related phenomenon is a subset analysis, which is a difficult but reasonably well-understood
multiplicity [1,2]. There is always a subset of patients that does better on the experimental drug
than does the complementary subset. A Bayesian might observe a subset difference, assume a
noninformative prior distribution within the subset, and find the corresponding posterior
distribution. This calculation is flawed. It fails to model what is sometimes the most important
data of all: the process of determining the subset. Of course, Bayesians are not unique in this
regard. Unfortunately, many researchers and many drug developers fall prey to exactly the same
problem.
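
The subset point can be illustrated with a small simulation. What follows is only a minimal sketch under stated assumptions, not a method from this article: it simulates trials in which the drug has no effect at all, attaches ten irrelevant binary covariates to each patient, and reports the largest apparent drug advantage found in any covariate-defined subset. The function names, sample sizes, and number of covariates are all hypothetical choices made for illustration.

```python
import random
from statistics import fmean


def simulate_null_trial(n_patients=200, n_covariates=10, seed=0):
    """Return (overall_effect, best_subset_effect) for a trial with NO true effect."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n_patients):
        treated = rng.random() < 0.5          # randomized 1:1 to drug or placebo
        outcome = rng.gauss(0.0, 1.0)         # outcome is unrelated to treatment
        covariates = [rng.random() < 0.5 for _ in range(n_covariates)]  # pure noise
        patients.append((treated, outcome, covariates))

    def effect(group):
        """Difference in mean outcome, drug minus placebo, or None if an arm is tiny."""
        drug = [y for t, y, _ in group if t]
        placebo = [y for t, y, _ in group if not t]
        if len(drug) < 2 or len(placebo) < 2:
            return None
        return fmean(drug) - fmean(placebo)

    overall = effect(patients)

    # Look at every subset defined by one covariate and keep the most flattering one.
    subset_effects = []
    for j in range(n_covariates):
        for level in (True, False):
            subset = [p for p in patients if p[2][j] == level]
            e = effect(subset)
            if e is not None:
                subset_effects.append(e)
    return overall, max(subset_effects)


def average_effects(n_trials=200):
    """Average the overall and best-subset effects over many simulated null trials."""
    results = [simulate_null_trial(seed=s) for s in range(n_trials)]
    avg_overall = fmean([o for o, _ in results])
    avg_best = fmean([b for _, b in results])
    return avg_overall, avg_best


if __name__ == "__main__":
    avg_overall, avg_best = average_effects()
    print(f"average overall effect:     {avg_overall:+.3f}")
    print(f"average best-subset effect: {avg_best:+.3f}")
```

Across repeated null trials the overall effect averages out near zero, while the best subset shows a positive "advantage" on average even though none exists. An analysis that starts from the selected subset and ignores how it was selected is treating that selection process as if it carried no information.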

DOOMED BY SILENT MULTIPLICITIES

All data analysis is subject to the curse of multiplicity. Some multiplicities are reasonably
well understood. Examples include subgroup analyses and multiple comparisons of k > 2 treatment
groups [3,4]. I do not mean to suggest that these ‘visible multiplicities’ are easy to handle.
However, they pale in difficulty compared with ‘silent multiplicities’ [5]. The latter are aspects
of the ‘data’ that are hidden from the analyst. And there are always some that are hidden! You may
think this statement is hyperbole. After all, the analyst can design and run every aspect of the
experiment. But even in this case the analyst must rely on his or her objectivity and memory to
ensure that nothing is hidden or discounted. None of us can do this. And did the analyst consider
when he took a lunch break, why he took a lunch break then and what he had for lunch?!

There is a fundamental distinction between missing data or incomplete information and silent
multiplicities. The latter are not known to be missing. They can reflect the data structure, or
aspects that are not part of the experiment under consideration and therefore seem to be
ancillary.

I distinguish two categories of silent multiplicities, depending on whether or not it is
possible to bring them into the open.

I collaborate with physicians and scientists in the substantive fields. They frequently ask for
an analysis of a data set that they provide in a spreadsheet. The data have usually been
‘cleaned.’ Perhaps the investigators have removed duplicates, which seems innocuous enough. Or
they have averaged some observations, which starts to get dicey. First, this process inflates the
precision of the measurements. More importantly, it may bias the results in a number of ways
because some extra measurements may have been made ‘because the original measurements were
unusual.’ Or they may have removed outliers, which is an easier problem with which to deal because
it is so obvious what they have done and why. Or they may have restricted to experimental units of
most interest – perhaps based on a perusal of the data! Sometimes the most important statistical
analyses have been carried out before I get to see the data. I have learned to handle silent
multiplicities by asking lots of questions. I enquire about aspects of the experiment all the way
back to the time it was conceived. Especially important are methods of data collection and data
processing. The answers are usually revealing. In the worst case, the answers make clear that the
data as presented are useless for inference.

Some types of silent multiplicities are avoidable, depending on the circumstances. An extreme
approach is for the statistician to run the experiment and coordinate all data collection and
handling. However, this does not make good use of statisticians’ abilities. (Moreover, as I have
already indicated, it does not completely resolve the problem.) We would get little else done.
However, it is possible and wise for us to be involved in the data collection process and we can
supervise some of the critical aspects of this process.

More difficult to handle – and perhaps even impossible to handle! – are aspects that have
nothing to do with the experiment per se. An investigator conducts an experiment and for some
reason does not like the results. You never get to see the data. Perhaps the investigator throws
them out and so no one else gets to see them. The fact that you did not see the results of a
particular experiment is informative and should be included in future inferences that you might
make. How on Earth to do that? Good luck in building a model based on things you have not seen,
and that might not even exist!

Statisticians react to this dilemma by becoming pessimists. It is an occupational hazard! Just
as with the publication bias, study results we are given to analyze tend to be positive. Because
there are likely to be negative studies addressing the same question but that never saw the light
of day, the next study is likely to be negative – or at least less positive. (This effect is in
addition to but confounded with regression to the mean.) Our pessimism leads us to give greater
credibility to the null hypothesis than would usually be appropriate. However, some statisticians
overreact.

As regards silent multiplicities, pharmaceutical statisticians are better off than academic
statisticians. Firstly, they are more likely to be involved
from the start. Secondly, the pharmaceutical industry is much more closely regulated. So studies
have protocols. However, pharmaceutical statisticians are not immune to the problem. For example,
in the preclinical setting an investigator might restart an experiment because the first try ‘was
not going very well.’

Even in a heavily regulated clinical setting there are insidious silent multiplicities to plague
pharmaceutical statisticians. Suppose a clinical trial shows a positive result. There are likely
to be trials sponsored by other companies in similar diseases and with similar drugs. They matter!
But information about them, and even about their existence, is lacking. It is hard to tell the
size of the wave from the perspective of the wave’s peak. The regression effect (dampening the
estimated positivity of the result) is stronger if this were one of many studies.

Not all information about ‘other studies’ is silent, of course. For example, the disease may be
one with the history of failed therapies (the common cold, stroke, sepsis, and pancreatic cancer
come to mind). In such circumstances, investigators should be more cautious. Unfortunately, there
are no frequentist statistical methods for dealing with ‘other studies that may be related.’ The
Bayesian approach can address this problem, but the art of doing so is not well understood by
anyone, including by Bayesians. We have to rely on the subjectivity of regulators in such matters.
As an example, despite many attempts, as of this writing there is no approved vaccine for the
treatment of cancer. If there were to be a positive vaccine trial, it should have a higher hurdle
for approval than if the results were identical but the experimental drug is one in a known class
of cytotoxic agents.

BAYESIAN VS FREQUENTIST APPROACHES

Based on my observations of statisticians in action, frequentists are more likely to overreact
in adjusting what they see based on other things that seem ancillary to less sensitive eyes. They
tend to give more credence – sometimes too much – to null hypotheses than do Bayesians. For
example, a standard frequentist approach is to adjust inferences (such as Type I error rate)
assuming that the null hypothesis is true. In the extreme when there is a large number of
multiplicities, adjustments to Type I error rates in this ‘familywise’ approach make it nigh
impossible to reject any of the various null hypotheses.

On the other hand, Bayesians tend to be too casual in handling multiplicities. In effect, their
approach gives too much prior credibility to alternative hypotheses. They calculate posterior
probabilities based on the superficial data. They assume some probability model for calculating
the likelihood function. However, this model is usually inadequate in that it ignores
multiplicities – sometimes because it is difficult or impossible to do otherwise, as I mentioned
above in the context of silent multiplicities. Too frequently Bayesians misunderstand the ‘data.’
In addition to the superficial data they should consider what they have not observed. (Former U.S.
Secretary of Defense Donald Rumsfeld was ridiculed for saying something that is actually quite
insightful: ‘Reports that say that something hasn’t happened are always interesting to me, because
as we know, there are known knowns; there are things we know we know. We also know there are known
unknowns; that is to say we know there are some things we do not know. But there are also unknown
unknowns – the ones we don’t know we don’t know.’)

This is more than a theoretical problem. Bayesians make major errors in analyzing existing data,
paying no heed to how they came to analyze it in the first place. For example, Bayesian analyses
of ‘classical data sets’ are problematic. The fact that a classical data set became classical is
related to the various conclusions of the data. Perhaps Bayesians can be forgiven because
frequentists are similarly duped. However, Bayesian posterior probabilities refer to the truth of
the original scientific question while frequentist measures are much more limited in inferential
scope, referring only to the degree of concordance of the superficial observations with
hypothetical
distributions. So they are immune from the most serious Bayesian mistakes.

Bayesians are wont to claim that they address questions of direct scientific and clinical
importance, and they do. However, this ability has a huge price tag. Contrary to conventional
wisdom it is not because they must assume a prior probability. It is that they have to work very
hard to assess their prior probabilities and they have to work even harder to ensure that the
likelihood function they are using reflects all aspects of the ‘data’ at hand – including why they
have it and why they are analyzing it.

If I had to choose between the conservatism and pessimism of the frequentist and the naiveté of
the Bayesian, it could be a difficult choice. I would favor one in some instances and the other in
others. On balance I would choose the frequentist. The naiveté that characterizes many Bayesians
and their approaches to inference too often leads to serious errors.

HIERARCHICAL BAYESIAN APPROACH

There are compromises that dominate both the naïve Bayesian approach and the conservative
frequentist approach. None of them is a panacea, but such compromises elucidate the issues and may
enable us to begin understanding how to appropriately deal with multiplicities. One such
compromise is hierarchical Bayesian modeling [6–11]. The idea is to borrow across groups within
each level of the hierarchy based on prior information and on the extent to which the data in the
groups are concordant. A hierarchical Bayesian approach is superior to both the frequentist and
naïve Bayesian approaches. But it is not perfect. Nothing is perfect.

The hierarchical Bayesian approach is especially helpful for handling inferences when the
dimension of the visible multiplicities is huge. For example, genes on a cDNA microarray may
number in the tens of thousands. And there are an estimated 10^18 drug-like molecules and so the
number of possibilities is effectively limitless in protein/drug interaction experiments.
(Exploiting the elegance of hierarchical modeling is also important in designing such
experiments.) Borrowing via modeling is critical. Understanding the biology or the chemistry or
the other relevant science is essential for good modeling, of both the likelihood and the prior
distribution.

Although the hierarchical Bayesian approach is an effective method of handling visible
multiplicities, there is no panacea for silent multiplicities. Understanding and effectively
addressing these require both experience and knowledge. Experience is necessary almost by
definition for it teaches the range of possibilities that may be unseen in any particular
experiment. The issue is what might have happened that is not stated and that is not obvious from
the numerical results. Knowledge of the subject matter is necessary to place the experimental
results in context.

In addition, knowledge of the subject matter is necessary for questioning the investigator in
the process of assessing whether there are unseen multiplicities. For example, I know the culture
in some laboratory sciences is to redo parts of experiments if the results do not seem to fit. The
investigators tend to drop the anomalous observations and present me with those that ‘seem right.’
We statisticians understand that this can bias the results, and it always leads to underestimates
of variability. However, few scientists have our kind of training. Since I know that this kind of
thing can happen, in any particular experiment I know to ask if it did happen – and to worry even
if the answer is ‘no.’

THE BOTTOM LINE

Both frequentists and Bayesians have important contributions to make regarding problems of
multiplicities. Neither group has an inside track to the answer. Frequentists and Bayesians
working together is a promising way to make inroads into this knotty set of problems. Two
experiments with identical results may well lead to very different statistical conclusions. So we
will never be able to
use a software package with default settings to resolve all problems of multiplicities. Every
problem has unique aspects. And all problems require understanding the substantive area of
application.

REFERENCES

1. Berry DA. Multiple comparisons, multiple tests, and data dredging: a Bayesian perspective
   (with discussion). In Bayesian statistics, Bernardo JM et al. (eds.), vol. 3. Oxford
   University: Oxford, 1988.
2. Berry DA. Subgroup analyses. Biometrics 1990; 47:1227–1230.
3. Berry DA, Hochberg Y. Bayesian perspectives on multiple comparisons. Journal of Statistical
   Planning and Inference 1999; 82:215–227.
4. Gopalan R, Berry DA. Bayesian multiple comparisons using Dirichlet process priors. Journal of
   the American Statistical Association 1998; 93:1130–1139.
5. Berry DA. Statistics: a Bayesian perspective. Duxbury: Belmont, CA, 1996.
6. Berry DA. Bayesian clinical trials. Nature Reviews Drug Discovery 2006; 5:27–36.
7. Berry DA, Stangl DK. Bayesian biostatistics. Marcel-Dekker: New York, 1996.
8. Berry SM, Berry DA. Accounting for multiplicities in assessing drug safety: a three-level
   hierarchical mixed model. Biometrics 2004; 60:418–426.
9. Carlin BP, Louis TA. Bayes and empirical Bayes methods for data analysis. Chapman & Hall: New
   York, 1996.
10. Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian data analysis. Chapman & Hall: New York,
    1995.
11. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian approaches to clinical trials and health-care
    evaluation. Wiley: Chichester, 2004.
