
(Mis)Using Evidence[1]

So far, our attention in CAT has been largely confined to deductive arguments. That is, you started from some premises or assumptions and came to a conclusion by connecting them logically. Sometimes the premises concerned the objectives of one party or another; other premises were statements of fact (and, in some cases, statements of uncertain fact, as in "Perhaps the Chinese government will, in the future, do thus and such").

The note on Arguments described another sort of argument, an inductive argument. In an inductive argument, the conclusion is supported by evidence that is brought forward by the individual constructing the argument, evidence that (it is hoped) convinces the listener or reader that the conclusion is probably true. In most cases, producing the evidence that would be needed to "prove" the conclusion is not possible.[2] But a mass of evidence that suggests or is consistent with the conclusion can go a long way to convincing others (or yourself) that the conclusion is true.

And what is true for the grand conclusion of an argument is true, pari passu, for the
premises of arguments. If a premise is of the form, “Google should try never to do evil,”
it is hard to imagine bringing evidence that convinces the reader[3] that the premise is
correct. But if the premise is of the form, “The Chinese government is likely to do thus
and such,” one can justify this premise by showing how, in similar circumstances in the
past, the Chinese government did the equivalent of thus and such, adapted to the
circumstances that then prevailed.[4]

It is not always easy to judge when evidence supports (or refutes) a statement, be it a final
conclusion, a premise, or a statement intermediate to reaching a final conclusion.
Evidence may seem supportive, but as a critical analytical thinker, you must ask yourself
whether the evidence is, in a word, “cooked.”

Let's be clear: "Cooked" is a pejorative term, indicating that the person making the argument is deliberately trying to mislead the reader. Sometimes that is true; the intent really is to deceive. But perhaps as often, there is no intent to mislead; the person
constructing the argument is citing inappropriate data in completely good faith, perhaps
out of ignorance, and perhaps out of enthusiasm for the conclusion being supported.
We’ll use the term “cooked” to cover all cases.

[1] Copyright DMK 2008. For use in CAT, 2009. Do not use, reproduce, or disseminate without the written permission of David M. Kreps.

[2] It is possible to gather all the evidence needed to prove the claim that "all Stanford MBAs in the class of 2011 have undergraduate degrees," but it's impossible to gather all the evidence needed to prove the claim that "students with an undergraduate degree will always perform better in an MBA program than those without such a degree."

[3] Arguments can be read and they can be heard. It is tiresome to use listener/reader repeatedly, so we'll use "reader" to cover both cases throughout.

[4] Note that if you justify one of your (empirical) premises by presenting evidence that supports it, in effect you've transformed it into an intermediate conclusion. As the note on "Building Arguments" indicates, a true premise is a statement that isn't justified in the argument at hand.

Also, there are (broadly speaking) two types of data that can be used to support the types of arguments we'll encounter in CAT, as well as in many business situations. One is so-called field data, evidence that arises from examining the real world. The other is experimental data, gathered in relatively controlled experiments.[5] In most cases experiments are carefully constructed and, especially when they are reported in scholarly journals, the vetting or refereeing process to which such papers are subjected eliminates some of the problems discussed below. (The refereeing process also applies to studies using field data, but because those data are relatively harder to collect and control, corners are sometimes cut.) But some is not all, as we will see.

[5] The word "controlled" refers to protecting against inferential errors: controlled experiments are designed to reduce the frequency of making mistakes in inductive arguments.

A complete catalog of the ways data can be cooked is far beyond the scope of this short
note. In the winter quarter, most of you will take a Foundations course in Data Analysis
that will go into these matters in much greater detail. But, to start you on your way to
becoming a sophisticated consumer of evidence, here are four common ways evidence is
cooked.

1. Irrelevant data: What is the "conclusion," and does the evidence really speak to it?

The first category of cooking is to cite evidence that doesn’t really address the conclusion
that (it is alleged) is being supported. It is rare that data can be found that precisely
addresses the issue at hand. Often one resorts to “analogous” situations and what
transpired there. If the situations are analogous in the right ways, this is fine. But as the
connections between the evidence and the desired conclusion become more and more
tenuous, you should attach less and less credence to the evidence, at least as evidence in
favor of the conclusion.

This is particularly a problem in cases where the conclusion is broad and somewhat
ambiguous. It is easy to speak in broad generalities and to bring to the evidentiary table
a host of somewhat loosely connected situations that fit, perhaps somewhat awkwardly or
partially, inside the broad tent of the conclusion. It can be easy to slide from data in one sort of situation to a conclusion about a somewhat related second sort of situation, even when what you see in the data depends on characteristics of the first situation that differ markedly from characteristics of the second. As a skeptical consumer of evidence, you
should always be asking yourself: What is the precise conclusion being asserted, and
does this evidence really speak to it? (And when you are making inductive arguments,
unless your intent is to mislead, be as precise as possible in stating your conclusion.)

A variation on this form of "cooking" concerns the interpretation you yourself give to data. Suppose you were handed some customer survey data, in which customers were asked, "Which three features of Product X are most satisfactory? Which three features are least satisfactory?" You find that a very large fraction of your customers rate two features of the product among the least satisfactory. You interpret this as saying that you should devote time and effort to improving those two features. Right?

Maybe! You don't know how important those features are to your customers' choice of product. Indeed, these are your customers; maybe you should figure out what turned off people who were potential customers but chose not to be. Perhaps the question would be better phrased: "What (deficient) features of the product imperil the possibility that you will buy from us again?" And knowing that some feature is least satisfactory doesn't mean that you can, in a cost-effective manner, make it more satisfactory. Especially with survey data, where the way the questions are phrased can dramatically affect how respondents interpret or frame your question and, therefore, how they respond, you have to be careful about the conclusions you jump to from the answers you receive.

2. Non-representative data: anecdote, sample selection bias, cherry-picking, data-dredging

Getting to more specific ways in which evidence can fail to support an alleged
conclusion, we have a variety of ways in which data can be non-representative of the
population meant to be covered by the conclusion. To explain: Evidence is typically
brought to bear on random (stochastic) phenomena. The “fact” in question doesn’t hold
in all instances; there are examples where it holds and where it doesn’t, and the
conclusion is one about the “usual” case, as in: Entrepreneurs are more likely to be
successful if they refuse to give in to adversity. This isn't saying that every entrepreneur will be more successful if she is stubborn. But it is asserting that this is true more often than not. What sort of evidence might we look at to support that conclusion? Are the
data presented representative of the full population out there?

Anecdotal evidence

Perhaps the most common method of using non-representative evidence is to rely on the
“illustrative anecdote.” William Hewlett, David Packard, Phil Knight, and Sam Walton
all faced adversity in their early years, all persevered, and look where they (and the
companies they built) wound up! One recounts in vivid detail the story of each of these
cases – what adversity the various individuals faced, how they refused to give in, how
large are their organizations and their personal fortunes – and leaves the reader to draw
the desired conclusion. This isn't very good evidence, because the examples were chosen on two grounds: (a) they have famous names, which will impress the reader; and (b) they confirm the hypothesis: they overcame early adversity and were tremendously successful. No one can claim that they are representative of the
population of entrepreneurs; they are instead exemplars of that population, and successful
exemplars. Their stories may inspire us, and they may illustrate the point being made.
But, if the evidentiary value of these sorts of anecdotes isn’t nil when it comes to
establishing the proposition, it is close to nil.

Having said that, it has to be observed that a lot of instruction in the GSB is via the case
method, and cases are nothing more than illustrative anecdotes. The explanation for this
is: When your instructor uses a case to illustrate or illuminate a particular general point,
it is a pedagogical exercise. A vivid case study often brings home the point more readily than a careful empirical analysis of many data points. But one hopes and expects that the generalizations offered in courses taught here are based on more careful analysis and justification, whether inductive or deductive. (And, if you find yourself doubting a generalization offered up in a class, you should feel completely comfortable in asking your instructor for the "scientific" justification for what is being taught. Just don't always expect a simple answer on the spot.)

Sample-selection bias

Rather than relying on anecdotes, suppose we took the more careful approach of looking
at a sample of start-ups, choosing for our sample all those high-tech firms in Silicon
Valley that have undertaken an IPO (initial public offering) in the past five years. We
rely on this sample because, while it is hard if not impossible to find all start-ups, getting a list of (registered) IPOs is relatively easy. We go to this full sample of firms and ask,
Did the founder(s) meet with early adversity? (If the founders did not meet with early
adversity, the firm isn’t relevant to the question being investigated.) We partition this set
(firms that have recently had an IPO, with a founder who faced early adversity) into two
subsets:

• those firms where the founder persevered; and


• those where the founder gave up and let others take over.

And then we carefully measure the return (for the founder) to his or her investment of
personal funds and sweat equity, finding that the founder’s average personal return on
investment is higher in the first set than in the second. Thus, when faced with adversity,
founders do better (personally) to persevere.

There are (at least) two problems with this evidence, but for now focus on the fact that
this is a sample of firms that, eventually, succeeded. By looking (only) at firms that
reached the stage of IPO, you are looking at a sample of start-ups that overcame whatever
initial adversities they faced. When the founder persevered, he or she probably benefited
personally from the success. It is the firms that fail for which “perseverance” translates
(in the end) into “stubbornly throwing good money after bad”; by taking as our sample of
firms those that get to IPO, we have biased our sample to exclude those cases.
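
To see how strong this selection effect can be, here is a minimal simulation sketch in Python. Every number in it (the chance of reaching an IPO, the payoffs at IPO, the extra money a persevering founder sinks into a failing firm) is invented purely for illustration, and in this made-up world perseverance has no effect at all on whether the firm succeeds.

import random

random.seed(0)

# Invented parameters, for illustration only.
P_IPO = 0.10                        # chance a start-up reaches an IPO; assumed NOT to
                                    # depend on whether the founder persevered
PAYOFF_IPO_PERSEVERE = 5_000_000    # founder's personal payoff at IPO if she persevered
PAYOFF_IPO_HAND_OFF = 2_000_000     # payoff at IPO if she handed the firm to others (diluted)
LOSS_FAIL_PERSEVERE = -500_000      # loss if the firm fails after she kept pouring money in
LOSS_FAIL_HAND_OFF = -100_000       # loss if she cut her losses early

def simulate_founder():
    persevered = random.random() < 0.5      # half of all founders persevere
    reached_ipo = random.random() < P_IPO   # success is unrelated to perseverance here
    if reached_ipo:
        payoff = PAYOFF_IPO_PERSEVERE if persevered else PAYOFF_IPO_HAND_OFF
    else:
        payoff = LOSS_FAIL_PERSEVERE if persevered else LOSS_FAIL_HAND_OFF
    return persevered, reached_ipo, payoff

founders = [simulate_founder() for _ in range(100_000)]

def average_payoff(rows):
    return sum(payoff for _, _, payoff in rows) / len(rows)

# The biased comparison: only firms that reached an IPO are in the sample.
ipo_only = [f for f in founders if f[1]]
print("IPO sample, persevered:  ", round(average_payoff([f for f in ipo_only if f[0]])))
print("IPO sample, handed off:  ", round(average_payoff([f for f in ipo_only if not f[0]])))

# The full population, including the firms that died.
print("All founders, persevered:", round(average_payoff([f for f in founders if f[0]])))
print("All founders, handed off:", round(average_payoff([f for f in founders if not f[0]])))

In the IPO-only sample, persevering founders appear to do far better. Across all founders in this invented population, however, perseverance loses money on average, because the firms where it amounted to throwing good money after bad never show up in the sample.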

So, to address this problem, suppose we looked at a random sample of high-tech start-ups
from Silicon Valley, including firms that have died. (Getting your hands on such a
sample and getting responses to survey questions, especially for firms that died, might be
a problem, but suppose that, somehow, we could do this.) Suppose that these data
support the hypothesis. We'd still need to worry: All our data are from high-tech start-ups located in Silicon Valley. Perhaps this sample is not representative of all
entrepreneurial ventures, everywhere. Perhaps low-tech start-ups are different. Perhaps
start-ups in Europe or Asia or North Carolina are different. This isn’t to say that we have
a good reason to think there is something special about high-tech firms from Silicon
Valley in terms of this hypothesis. Our prior, in this case, is that there is little or no
connection – we’d be relatively convinced by such evidence that the general proposition
is true. But one has to worry, at some level, whether the effect we see in the data is
somehow related to the specific geography or firm type that characterizes this sample.
This is often called a problem in non-representative samples, although it isn’t that
different in character from a sample-selection bias problem.

A special case of sample-selection bias is self-selection bias, where the sample being
studied self-selects. Imagine, for instance, that the GSB’s Center for Entrepreneurial
Studies surveys all GSB alumni from the Class of 1997, asking them three questions:

1. Did you ever start your own business? (Yes/No)

2. If your answer to question 1 is yes, how successful do you consider your efforts to
have been? (Unsuccessful/Moderately successful/Substantially Successful/Hugely
Successful)

3. In the course of running this business, do you feel that you persevered in the face
of adversity? (Yes/No/Never faced adversity)

(If you started more than one business, answer questions 2 and 3 separately for each
of those businesses.)

The question to ask yourself is, What percentage of members of the Class of 1997 who
consider themselves to have persevered in the face of adversity and who were
unsuccessful do you think will fill out the survey? What percentage of the Class who
consider themselves to have persevered and been substantially or hugely successful do
you think will fill out the survey? Our guess is that the response rate will be higher
among the more successful entrepreneurs, giving the same effect, through self-selection,
that one gets by looking at firms that have made it to IPO.
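
A quick back-of-the-envelope sketch shows how large this self-selection effect can be. The response rates below are invented for illustration; the only point is the mechanism.

# Hypothetical population: among alumni who persevered, exactly half were successful.
true_success_rate = 0.5

# Invented response rates: the successful are assumed far more likely to return the survey.
response_rate_successful = 0.80
response_rate_unsuccessful = 0.20

# Among those who actually respond, what fraction report success?
responding_successes = true_success_rate * response_rate_successful
responding_failures = (1 - true_success_rate) * response_rate_unsuccessful
observed_success_rate = responding_successes / (responding_successes + responding_failures)

print(f"true success rate:     {true_success_rate:.0%}")       # 50%
print(f"observed success rate: {observed_success_rate:.0%}")   # 80%

With these made-up numbers, a 50% true success rate shows up as an 80% success rate among respondents, purely because of who chose to answer.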

Given the recent presidential election, it may be worthwhile to mention the sample
selection bias problems faced by pollsters, who call people on the phone and ask for
whom they are planning to vote. The list of possible biases in the sample of voters so
polled is long: Is a random sample of people with telephones representative of the
electorate? (Nowadays, probably yes. But in the past, this was not true and substantially
affected the pre-election polling results.) Is a random sample of people who will take the
time to answer the pollster’s questions representative? If the pollster calls on a Sunday
morning, is the sample of people at home at that time representative? And is the sample
of people who are at home, have a phone, answer the phone, and are willing to take the
time to answer the questions representative of those citizens who will take the time and
trouble to go to the polls (or file an absentee ballot)? Pollsters know about all these problems; doing something about them is a matter of art and is what distinguishes a good
pollster from one who, election after election, doesn’t do so well in his or her predictions.

Cherry picking and data dredging

Suppose we find 70 members of the MBA Class of 1997 who have started companies,
and we ask them each 50 yes-or-no questions about what they see as important to success
as an entrepreneur. Examples might be

• Is perseverance in the face of adversity important to success as an entrepreneur?

• Is a broad array of managerial skills more important to success as an entrepreneur than deep skills in a particular area of management?

• To have success as an entrepreneur, is it more important to have a good product or service to sell, or to run an organization that commands the loyalty of your employees?

And so forth.

Assume, for the sake of argument, that in the vast population of entrepreneurs, half would
answer yes to each of these questions and half would answer no. Assume also that the
answers to these questions are independent: If you know that a randomly selected
entrepreneur answered Yes to each of the first three questions, this wouldn’t affect the
odds of a Yes answer to any of the others.

Under these circumstances, the Truth (with a capital T) is that none of these statements,
and none of their negations, is true for “a substantial majority” of all entrepreneurs.
Under the assumption that our sample of 70 is a random sample, what do you think are the odds that on some question, either more than 45 out of 70 (about 65%) or fewer than 25 out of 70 (about 35%) would answer Yes? The probability of this is around 0.43. That means that if you run this survey on a random sample of 70 people out of a large population where exactly half say Yes to each question, there is a roughly 43% chance that you'll generate a headline such as:

• Nearly (or Over) two-thirds of entrepreneurs say that a good product or service is
more important than loyal employees for success!; or
• Nearly (or Over) two-thirds of entrepreneurs say that employee loyalty is more
important for success than a good product or service!
There are, of course, 100 such headlines that could be written – the point is that just from
random selection, the probability that one (or more) out of the 100 is “verified by the
survey” is nearly 0.43. But you won’t read the headlines that weren’t verified by the
survey – only the one (or more) that the data “confirmed.”
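
If you want to check the arithmetic, here is a short sketch in Python, using only the standard library. It computes the chance that any single question comes out lopsided purely by luck (46 or more Yes answers, or 24 or fewer, out of 70), and then the chance that at least one of the 50 questions does so.

from math import comb

N_RESPONDENTS = 70   # sample size
N_QUESTIONS = 50     # independent yes-or-no questions
HEADLINE_HI = 46     # "nearly two-thirds (or more) say Yes"
HEADLINE_LO = 24     # "nearly two-thirds (or more) say No"

def prob_yes_count(k, n=N_RESPONDENTS, p=0.5):
    """Probability that exactly k of n respondents answer Yes when the true split is 50/50."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Chance that a single question produces a headline-worthy split by luck alone.
p_one_question = sum(prob_yes_count(k) for k in range(HEADLINE_HI, N_RESPONDENTS + 1))
p_one_question += sum(prob_yes_count(k) for k in range(0, HEADLINE_LO + 1))

# Chance that at least one of the 50 questions does so.
p_some_question = 1 - (1 - p_one_question) ** N_QUESTIONS

print(f"per-question chance of a lopsided split: {p_one_question:.4f}")            # about 0.0115
print(f"chance of at least one headline in 50 questions: {p_some_question:.2f}")   # about 0.44

The exact binomial calculation gives roughly 0.44, in line with the figure of about 0.43 quoted above.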

This general phenomenon has many names, including cherry-picking (reporting the one survey result that appears significant) and data-dredging or data-mining (combing through tons of data, looking for something significant). You might think that it is the product of deliberate misrepresentation, but more often than not, no malice is intended:

• Writers and researchers like to report “significant results.” If they run lots of
experiments, the odds are that something will look significant and that’s what
they write about.
• Magazine and journal editors don’t see much point in publishing articles that
show that “it doesn’t matter.” Even if authors are submitting all the results they
get, it is the rare and unlikely result that will appear in print.

So how does a skeptical consumer of data deal with this problem? One thing is to ask, How hard did the author have to look to find that significant result? Or you can and should ask, After the author saw that significant relationship, did he test it on a fresh set of data?[6]

[6] This is how academic authors usually try to deal with this problem. After they arrive at their hypothesis, perhaps by sifting through mounds of data, they test their hypothesis on a "hold-out" sample. This is good technique, but it isn't fully adequate to the problem of journal editors who accept only "significant results": if 1 survey question in 100 generates what looks like a significant result in the initial data, then roughly 1 in 10,000 will first generate a significant result and then look statistically significant when tested on a hold-out sample. That is one reason academic editors often ask for multiple sets of data supporting a paper's hypothesis, and it is one reason important results in the scholarly literature are often retested, to see whether other experimenters can replicate the original findings.

One form of this problem can sometimes be the result of over-enthusiasm for one’s
hypothesis. A very general theory is often supported by citing multiple pieces of
evidence: field data from source A, experimental evidence from lab B, different
experimental evidence from C, different field data from D, and so forth. In fact,
scholarly authors who do this sort of thing are said to be engaged in meta-analysis; their
theory knits together the results published in a variety of situations by others, and their
evidence is all the earlier results that are being knit together.

The problem is, Did the author really look hard for field data E and experiment F that
might contradict his theory? Or did he cherry-pick the specific studies that confirm what
he believes to be true? More often than not, and especially in publications that are meant
to excite and entertain a lay readership, the author chooses the studies and evidence that
are most confirming of his theory. You have to watch out for this.

3. Attributing causation in view of correlation: Omitted variables and otherwise

The conclusion at issue is: Entrepreneurs are more likely to be successful if they refuse to give in to adversity. This can be read in at least two different ways.

a. One interpretation is that there is a statistical correlation between being an
entrepreneur who refused to give in to adversity and being successful. In a
large population of entrepreneurs who faced adversity, if we partition the
population into those who gave in to adversity and those who did not, the
second set will, on average, have met with more personal financial success.
b. Or you can read this prescriptively: If you are an entrepreneur facing
adversity, you will improve your chances of personal financial success by
persevering rather than giving in.

The first is a statement of correlation. The second is a statement of causation. They aren't the same and, in particular, evidence which normally takes the first form can be misinterpreted as implying the second.

What’s the difference? Imagine that when an entrepreneur faces adversity, she may or
may not see a way through the adversity. If the entrepreneur sees a way through the
adversity, she perseveres. If she does not, she gives up. The act of persevering indicates
that the entrepreneur sees a way out of her problems, which means success. If all we
observe is whether or not she perseveres (and not if she sees a way through her troubles)
and whether she succeeds, we will see a correlation, because perseverance indicates a
solution to the problems, which means success.

That’s correlation. But if that is what is going on, and if the entrepreneur, facing
adversity, does NOT see a way through her difficulties, the correlation seen in the sample
is not a good reason for her to beat her head against the wall, trying to succeed when she
sees no path to success.

Put in the form of a caricature, great and successful chefs at 3-star restaurants often write
cookbooks. This does not imply that a chef opening a brand new restaurant should
invest a lot of time in writing his cookbook, in an attempt to increase the chances of all
those stars.

A related effect has the name spurious correlation. For instance, significant correlation
can be observed between marital status and annual consumption of candy. Specifically,
people who are single are more likely to eat a lot of candy. Does this mean that being
married dulls one’s appetite for sweet stuff? No, it doesn’t. Consumption of candy is
negatively correlated with age, and people who are married tend, on average, to be older.
Control for age, and the correlation between marital status and candy consumption
disappears.
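
Here is a minimal sketch, with entirely invented numbers, of how such a spurious correlation appears and then vanishes once you control for the confounding variable (here, age).

import random

random.seed(1)

def simulate_person():
    age = random.randint(18, 80)
    # Invented relationships: older people are more likely to be married, and older
    # people eat less candy.  Candy consumption does NOT depend on marriage directly.
    married = random.random() < (age - 18) / 80
    candy_kg_per_year = max(0.0, 60 - 0.6 * age + random.gauss(0, 5))
    return age, married, candy_kg_per_year

people = [simulate_person() for _ in range(200_000)]

def mean_candy(rows):
    return sum(candy for _, _, candy in rows) / len(rows)

married = [p for p in people if p[1]]
single = [p for p in people if not p[1]]
print(f"married, all ages: {mean_candy(married):5.1f} kg/year")
print(f"single,  all ages: {mean_candy(single):5.1f} kg/year")

# "Control for age" by comparing married and single people within narrow age bands.
for lo in (20, 40, 60):
    band_married = [p for p in married if lo <= p[0] < lo + 5]
    band_single = [p for p in single if lo <= p[0] < lo + 5]
    print(f"ages {lo}-{lo + 4}: married {mean_candy(band_married):5.1f}, "
          f"single {mean_candy(band_single):5.1f}")

With these invented numbers, married people appear to eat noticeably less candy overall, but within any narrow age band the difference essentially disappears: age, not marriage, is doing the work.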

4. Survey data: response biases and honesty

We’ve given a number of examples with survey data, for the simple reason that a lot of
data that you will deal with in a business context comes from surveys. We’ve more or
less assumed that the answers given to survey questions are "honest," where we are using scare quotes here for the same reason that we used them around the term "cooked": Respondents may not be giving misleading or incorrect answers intentionally, but still the answers given may not reflect the facts. This can happen for a variety of reasons,
including the following:

• In many if not most surveys, respondents have little incentive to think the matter
through. Especially when the issues are complex, the answers given have little
correlation with what might have been the response, had the response really mattered
to the respondent. In some cases, there may be bias in the answers (that is, they tend
toward a particular sort of “incorrect” response); in other cases, the answers given
may simply be somewhat noisy versions of the “truth.”

• Answers given to survey questions about the respondent him or herself (or friends,
relatives, or associates of the respondent) may be biased towards how the respondents
would like to think of themselves. Ask “If you found a $100 bill lying on the ground
in Arbuckle, would you post a note on GSB Unofficial asking the person who lost it
to claim it?,” and we venture to say that more people would answer “Yes” than would
in fact take that action. Please note that we are not saying that no one would post
such a note. But we suspect that fewer would post the note than would say that they
would. And we are not (necessarily) saying that those who say they would post such a note are actively trying to deceive with their answer. You might well believe
that, having found the hypothetical $100 bill, you would “do the right thing.” But
when it changes from a hypothetical to a real $100 bill, other things might change as
well.

• In some cases, people answer survey questions in a way intended to improve their
image in the eyes of others, including the survey taker. A classic example, often mentioned in the months leading up to the last presidential race, is the so-called Bradley
Effect in political polling. This effect is named for Tom Bradley, an African-
American candidate for the governorship of California. (Bradley was mayor of Los
Angeles.) Pre-election polls predicted a Bradley victory, but in the election itself, he
lost. The Bradley Effect offers the hypothesis that survey respondents who fully
intended to vote against Bradley misrepresented their intentions to pollsters, to avoid
appearing racist to the pollster.
Note that [at least] two explanations can be given for the fact that Bradley did
worse in the election than in the polls: The Bradley Effect as described above
concerns a voter who plans to vote against Bradley but responds otherwise. An
alternative, which is in the category of the bullet point just before this one, concerns
someone who, at the time of the survey, really intends to vote for Bradley but, once in
the voting booth, finds himself or herself making a different choice, perhaps for
racially based reasons. The first is an example of conscious misrepresentation and is accurately described as dishonesty, without the scare quotes. The second, while certainly not laudable, is not "dishonest" in the full sense of that word, at least at the
moment the survey is being taken.

5. Internal inconsistencies in data


Suppose we asked MBAs two questions in a survey:

• If you found a $100 bill lying on the ground in Arbuckle, would you post a note
on GSB Unofficial asking for the owner to step forward, or take some similar
action to try to find who lost the money?

• If one of your classmates found a $100 bill lying on the ground in Arbuckle, what
do you think are the chances (stated as a probability, please) that he or she would
post a note on GSB Unofficial asking for the owner to step forward, or take some
similar action to try to find who lost the money?

Suppose in a (suitably large) sample of respondents, you found 80% of the respondents
answering the first question with Yes, while the average answer given to the second
question was only 50%. How do we explain the discrepancy?

• It could be a matter of (bad) luck: We just happened, in the sample of students, to find a disproportionate number of "saints" who would try to restore the money to its original owner, even though in the larger population only half the students would do so.

• It could be a matter of bad survey technique: Something in how the random sample
was drawn could be biased in favor of “saints.” For instance, we could have sampled
students in Birds Courtyard at a time when an elective in philanthropy had just let out.

• It could be the sort of self-deception mentioned in the previous section: 80% of MBAs would like to believe that they would try to restore the money to its original owner, but in fact, only half would.

• It could be that MBAs know themselves well – the 80% figure is accurate – but they
are unduly cynical about the behavior of their peers.

Any of these is possible, and several of them may simultaneously be at work. But the
point is, The data reported are internally inconsistent. Often, when you look at data
(especially survey data), you can see inconsistencies. And, when you do, the first task in
front of you, before you try to draw conclusions from the data, is to try to understand why
they are inconsistent, so that you don’t draw wrong conclusions.

The bottom line
You will spend a lot of time this winter studying these issues in depth. For now, we
reiterate what was said in the note on Arguments: When it comes to evidence in support
of a conclusion, caveat emptor. Be a skeptical (and informed) consumer of what you
read and what you hear.
