The Design of Experiments

By Sir Ronald A. Fisher, Sc.D., F.R.S.

Honorary Research Fellow, Division of Mathematical Statistics, C.S.I.R.O., University of Adelaide; Foreign Associate, United States National Academy of Sciences; and Foreign Honorary Member, American Academy of Arts and Sciences; Foreign Member of the Swedish Royal Academy of Sciences, and the Royal Danish Academy of Sciences and Letters; Member of the Pontifical Academy; Member of the German Academy of Sciences (Leopoldina); formerly Galton Professor, University of London, and Arthur Balfour Professor of Genetics, University of Cambridge.

HAFNER PRESS
A Division of Macmillan Publishing Co., Inc., New York
COLLIER MACMILLAN PUBLISHERS, London

Copyright © 1971, The University of Adelaide. Reprinted by Arrangement.

First Published 1935
Second Edition 1937
Third Edition 1942
Fourth Edition 1947
Fifth Edition 1949
Sixth Edition 1951
Reprinted 1953
Seventh Edition 1960
Eighth Edition 1966
Ninth Edition 1971
Reprinted 1974

All rights reserved. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

HAFNER PRESS, A Division of Macmillan Publishing Co., Inc., 866 Third Avenue, New York, N.Y. 10022. Collier Macmillan Canada, Ltd. Printed in the United States of America. printing number M12 13 14 15 16 17 18 19 20

PREFACE TO FIRST EDITION

In 1925 the author wrote a book (Statistical Methods for Research Workers) with the object of supplying practical experimenters and, incidentally, teachers of mathematical statistics, with a connected account of the applications in laboratory work of some of the more recent advances in statistical theory.
Some of the new methods, such as the analysis of variance, were found to be so intimately related with problems of experimental design that a considerable part of the eighth chapter was devoted to the technique of agricultural experimentation, and these sections have been progressively enlarged with subsequent editions, in response to frequent requests for a fuller treatment of the subject. The design of experiments is, however, too large a subject, and of too great importance to the general body of scientific workers, for any incidental treatment to be adequate. A clear grasp of simple and standardised statistical procedures will, as the reader may satisfy himself, go far to elucidate the principles of experimentation; but these procedures are themselves only the means to a more important end. Their part is to satisfy the requirements of sound and intelligible experimental design, and to supply the machinery for unambiguous interpretation. To attain a clear grasp of these requirements we need to study designs which have been widely successful in many fields, and to examine their structure in relation to the requirements of valid inference.

The examples chosen in this book are aimed at illustrating the principles of successful experimentation; first, in their simplest possible applications, and later, in regard to the more elaborate structures by which the different advantages sought may be combined. Statistical discussion has been reduced to a minimum, and all the processes required will be found more fully exemplified in the previous work. The reader is, however, advised that the detailed working of numerical examples is essential to a thorough grasp, not only of the technique, but of the principles by which an experimental procedure may be judged to be satisfactory and effective.

GALTON LABORATORY
July 1935

PREFACE TO SEVENTH EDITION

The second edition differed little from the first, published a year earlier.
Apart from numerical corrections the principal changes were the fuller treatment of completely orthogonal squares in Section 35, and the addition of examples in Section 47.1, representing some of the newly developed combinatorial arrangements, which had attracted considerable interest.

In the third edition Sections 45.1 and 45.2 were added, giving a more comprehensive view of the possibilities of confounding with many factors, and introducing the method of double confounding.

In the fourth edition, Section 62.1 has been added on the fiducial limits of a ratio.

In the fifth edition, Section 35.01 on configurations in three or more dimensions has been added.

In the sixth edition attention may be called to the addition which has been made to Section 65, “Comparisons with Interactions,” with a view to clarifying the differences in logical status between different sorts of categories which may appear in a factorial analysis. The numbers of sections have not been changed.

In the seventh edition, 1960, Sections 12.1 and 21.1 are new, while smaller additions and clarifications are scattered throughout.

DEPARTMENT OF STATISTICS, C.S.I.R.O., ADELAIDE, AUSTRALIA
Oct. 1959

NOTE TO THIS EDITION

The Ninth Edition, 1971, is the same as the eighth, except for additions and clarifications (mostly in Chapter X), introduced from notes written for this purpose by Sir Ronald Fisher some time before his death.

CONTENTS

I. INTRODUCTION
1. The Grounds on which Evidence is Disputed
2. The Mathematical Attitude towards Induction
3. The Rejection of Inverse Probability
4. The Logic of the Laboratory

II. THE PRINCIPLES OF EXPERIMENTATION, ILLUSTRATED BY A PSYCHO-PHYSICAL EXPERIMENT
5. Statement of Experiment
6. Interpretation and its Reasoned Basis
7. The Test of Significance
8. The Null Hypothesis
9. Randomisation; the Physical Basis of the Validity of the Test
10. The Effectiveness of Randomisation
11. The Sensitiveness of an Experiment. Effects of Enlargement and Repetition
12. Qualitative Methods of increasing Sensitiveness
12.1. Scientific Inference and Acceptance Procedures

III. A HISTORICAL EXPERIMENT ON GROWTH RATE
13.
14. Darwin’s Discussion of the Data
15. Galton’s Method of Interpretation
16. Pairing and Grouping
17. “Student’s” t Test
18. Fallacious Use of Statistics
19. Manipulation of the Data
20. Validity and Randomisation
21. Test of a Wider Hypothesis
21.1. “Nonparametric” Tests

IV. AN AGRICULTURAL EXPERIMENT IN RANDOMISED BLOCKS
22. Description of the Experiment
23. Statistical Analysis of the Observations
24. Precision of the Comparisons
25. The Purposes of Replication
26. Validity of the Estimation of Error
27. Bias of Systematic Arrangements
28. Partial Elimination of Error
29. Shape of Blocks and Plots
30. Practical Example

V. THE LATIN SQUARE
31. Randomisation subject to Double Restriction
32. The Estimation of Error
33. Faulty Treatment of Square Designs
34. Systematic Squares
35. Græco-Latin and Higher Squares
35.01. Configurations in Three or more Dimensions
35.1. An Exceptional Design
36. Practical Exercises

VI. THE FACTORIAL DESIGN IN EXPERIMENTATION
37. The Single Factor
38. A Simple Factorial Scheme
39. The Basis of Inductive Inference
40. Inclusion of Subsidiary Factors
41. Experiments without Replication

VII. CONFOUNDING
42. The Problem of Controlling Heterogeneity
43. Example with 8 Treatments, Notation
44. Design suited to Confounding the Triple Interaction
45. Effect on Analysis of Variance
45.1. General Systems of Confounding in Powers of 2
45.2. Double Confounding
46. Example with 27 Treatments
47. Partial Confounding
47.1. Practical Exercises

VIII. SPECIAL CASES OF PARTIAL CONFOUNDING
48.
49. Dummy Comparisons
50. Interaction of Quantity and Quality
51. Resolution of Three Comparisons among Four Materials
52. An Early Example
53. Interpretation of Results
54. An Experiment with 81 Plots

IX. THE INCREASE OF PRECISION BY CONCOMITANT MEASUREMENTS. STATISTICAL CONTROL
55. Occasions suitable for Concomitant Measurements
56. Arbitrary Corrections
57. Calculation of the Adjustment
58. The Test of Significance
58.1. Missing Values
59. Practical Examples

X. THE GENERALISATION OF NULL HYPOTHESES. FIDUCIAL PROBABILITY
60. Precision regarded as Amount of Information
61. Multiplicity of Tests of the same Hypothesis
62. Extension of the t Test
62.1. Fiducial Limits of a Ratio
63. The χ² Test
64. Wider Tests based on the Analysis of Variance
65. Comparisons with Interactions

XI. THE MEASUREMENT OF AMOUNT OF INFORMATION IN GENERAL
66. Estimation in General
67. Frequencies of Two Alternatives
68. Functional Relationships among Parameters
69. The Frequency Ratio in Biological Assay
70. Linkage Values inferred from Frequency Ratios
71. Linkage Values inferred from the Progeny of Self-fertilised or Intercrossed Heterozygotes
72. Information as to Linkage derived from Human Families
73. The Information elicited by Different Methods of Estimation
74. The Information lost in the Estimation of Error

INDEX

I am very sorry, Pyrophilus, that to the many (elsewhere enumerated) difficulties which you may meet with, and must therefore surmount, in the serious and effectual prosecution of experimental philosophy I must add one discouragement more, which will perhaps as much surprise as dishearten you; and it is, that besides that you will find (as we elsewhere mention) many of the experiments published by authors, or related to you by the persons you converse with, false and unsuccessful (besides this, I say), you will meet with several observations and experiments which, though communicated for true by candid authors or undistrusted eye-witnesses, or perhaps recommended by your own experience, may, upon further trial, disappoint your expectation, either not at all succeeding constantly, or at least varying much from what you expected.

ROBERT BOYLE, 1673, Concerning the Unsuccessfulness of Experiments.

The only means of preventing these errors is to suppress, or at least to simplify as far as possible, the reasoning which is our own and which alone can lead us astray; to put it continually to the test of experiment; to retain only the facts, which are truths given by nature and which cannot deceive us; to seek truth only in the linking together of experiments and observations, above all in the order in which they are presented, in the same way that mathematicians arrive at the solution of a problem by the simple arrangement of the data, and by reducing the reasoning to operations so simple, to judgments so brief, that they never lose sight of the evidence which serves as their guide.

Méthode de Nomenclature chimique, A. L. LAVOISIER, 1787.

THE DESIGN OF EXPERIMENTS

I

INTRODUCTION

1.
The Grounds on which Evidence is Disputed

When any scientific conclusion is supposed to be proved on experimental evidence, critics who still refuse to accept the conclusion are accustomed to take one of two lines of attack. They may claim that the interpretation of the experiment is faulty, that the results reported are not in fact those which should have been expected had the conclusion drawn been justified, or that they might equally well have arisen had the conclusion drawn been false. Such criticisms of interpretation are usually treated as falling within the domain of statistics. They are often made by professed statisticians against the work of others whom they regard as ignorant of or incompetent in statistical technique; and, since the interpretation of any considerable body of data is likely to involve computations, it is natural enough that questions involving the logical implications of the results of the arithmetical processes employed, should be relegated to the statistician. At least I make no complaint of this convention. The statistician cannot evade the responsibility for understanding the processes he applies or recommends. My immediate point is that the questions involved can be dissociated from all that is strictly technical in the statistician’s craft, and, when so detached, are questions only of the right use of human reasoning powers, with which all intelligent people, who hope to be intelligible, are equally concerned, and on which the statistician, as such, speaks with no special authority. The statistician cannot excuse himself from the duty of getting his head clear on the principles of scientific inference, but equally no other thinking man can avoid a like obligation.

The other type of criticism to which experimental results are exposed is that the experiment itself was ill designed, or, of course, badly executed.
If we suppose that the experimenter did what he intended to do, both of these points come down to the question of the design, or the logical structure of the experiment. This type of criticism is usually made by what I might call a heavyweight authority. Prolonged experience, or at least the long possession of a scientific reputation, is almost a pre-requisite for developing successfully this line of attack. Technical details are seldom in evidence. The authoritative assertion “His controls are totally inadequate” must have temporarily discredited many a promising line of work; and such an authoritarian method of judgment must surely continue, human nature being what it is, so long as theoretical notions of the principles of experimental design are lacking, notions just as clear and explicit as we are accustomed to apply to technical details.

Now the essential point is that the two sorts of criticism I have mentioned are aimed only at different aspects of the same whole, although they are usually delivered by different sorts of people and in very different language. If the design of an experiment is faulty, any method of interpretation which makes it out to be decisive must be faulty too. It is true that there are a great many experimental procedures which are well designed in that they may lead to decisive conclusions, but on other occasions may fail to do so; in such cases, if decisive conclusions are in fact drawn when they are unjustified, we may say that the fault is wholly in the interpretation, not in the design. But the fault of interpretation, even in these cases, lies in overlooking the characteristic features of the design which lead to the result being sometimes inconclusive, or conclusive on some questions but not on all. To understand correctly the one aspect of the problem is to understand the other.
Statistical procedure and experimental design are only two different aspects of the same whole, and that whole comprises all the logical requirements of the complete process of adding to natural knowledge by experimentation.

2. The Mathematical Attitude towards Induction

In the foregoing paragraphs the subject-matter of this book has been regarded from the point of view of an experimenter, who wishes to carry out his work competently, and having done so wishes to safeguard his results, so far as they are validly established, from ignorant criticism by different sorts of superior persons. I have assumed, as the experimenter always does assume, that it is possible to draw valid inferences from the results of experimentation; that it is possible to argue from consequences to causes, from observations to hypotheses; as a statistician would say, from a sample to the population from which the sample was drawn, or, as a logician might put it, from the particular to the general. It is, however, certain that many mathematicians, if pressed on the point, would say that it is not possible rigorously to argue from the particular to the general; that all such arguments must involve some sort of guesswork, which they might admit to be plausible guesswork, but the rationale of which, they would be unwilling, as mathematicians, to discuss. We may at once admit that any inference from the particular to the general must be attended with some degree of uncertainty, but this is not the same as to admit that such inference cannot be absolutely rigorous, for the nature and degree of the uncertainty may itself be capable of rigorous expression. In the theory of probability, as developed in its application to games of chance, we have the classic example proving this possibility.
If the gamblers’ apparatus are really true or unbiased, the probabilities of the different possible events, or combinations of events, can be inferred by a rigorous deductive argument, although the outcome of any particular game is recognised to be uncertain. The mere fact that inductive inferences are uncertain cannot, therefore, be accepted as precluding perfectly rigorous and unequivocal inference.

Naturally, writers on probability have made determined efforts to include the problem of inductive inference within the ambit of the theory of mathematical probability, developed in discussing deductive problems arising in games of chance. To illustrate how much was at one time thought to have been achieved in this way, I may quote a very lucid statement by Augustus de Morgan, published in 1838, in the preface to his essay on probabilities in The Cabinet Cyclopædia. At this period confidence in the theory of inverse probability, as it was called, had reached, under the influence of Laplace, its highest point. Boole’s criticisms had not yet been made, nor the more decided rejection of the theory by Venn, Chrystal, and later writers. De Morgan is speaking of the advances in the theory which were leading to its wider application to practical problems.

“There was also another circumstance which stood in the way of the first investigators, namely, the not having considered, or, at least, not having discovered the method of reasoning from the happening of an event to the probability of one or another cause. The questions treated in the third chapter of this work could not therefore be attempted by them.
Given an hypothesis presenting the necessity of one or another out of a certain, and not very large, number of consequences, they could determine the chance that any given one or other of those consequences should arrive; but given an event as having happened, and which might have been the consequence of either of several different causes, or explicable by either of several different hypotheses, they could not infer the probability with which the happening of the event should cause the different hypotheses to be viewed. But, just as in natural philosophy the selection of an hypothesis by means of observed facts is always preliminary to any attempt at deductive discovery; so in the application of the notion of probability to the actual affairs of life, the process of reasoning from observed events to their most probable antecedents must go before the direct use of any such antecedent, cause, hypothesis, or whatever it may be correctly termed. These two obstacles, therefore, the mathematical difficulty, and the want of an inverse method, prevented the science from extending its views beyond problems of that simple nature which games of chance present.”

Referring to the inverse method, he later adds: “This was first used by the Rev. T. Bayes, and the author, though now almost forgotten, deserves the most honourable remembrance from all who treat the history of this science.”

3. The Rejection of Inverse Probability

Whatever may have been true in 1838, it is certainly not true to-day that Thomas Bayes is almost forgotten. That he seems to have been the first man in Europe to have seen the importance of developing an exact and quantitative theory of inductive reasoning, of arguing from observational facts to the theories which might explain them, is surely a sufficient claim to a place in the history of science. But he deserves honourable remembrance for one fact, also, in addition to those mentioned by de Morgan.
Having perceived the problem and devised an axiom which, if its truth were granted, would bring inverse inferences within the scope of the theory of mathematical probability, he was sufficiently critical of its validity to try to avoid the axiomatic approach, and, perhaps for the same reason, to withhold his entire treatise from publication until his doubts should have been satisfied. In the event, the work was published after his death by his friend, Price, and we cannot say what views he ultimately held on the subject.

The discrepancy of opinion among historical writers on probability is so great that to mention the subject is unavoidable. It would, however, be out of place here to argue the point in detail. I will only state three considerations which will explain why, in the practical applications of the subject, I shall not assume the truth of Bayes’ axiom. Two of these reasons would, I think, be generally admitted, but the first, I can well imagine, might be indignantly repudiated in some quarters.

The first is this: The axiom leads to apparent mathematical contradictions. In explaining these contradictions away, advocates of inverse probability seem forced to regard mathematical probability, not as an objective quantity measured by observable frequencies, but as measuring merely psychological tendencies, theorems respecting which are useless for scientific purposes.

My second reason is that it is the nature of an axiom that its truth should be apparent to any rational mind which fully apprehends its meaning. The axiom of Bayes has certainly been fully apprehended by a good many rational minds, including that of its author, without carrying this conviction of necessary truth. This, alone, shows that it cannot be accepted as the axiomatic basis of a rigorous argument.
My third reason is that inverse probability has been only very rarely used in the justification of conclusions from experimental facts, although the theory has been widely taught, and is widespread in the literature of probability. Whatever the reasons are which give experimenters confidence that they can draw valid conclusions from their results, they seem to act just as powerfully whether the experimenter has heard of the theory of inverse probability or not.

4. The Logic of the Laboratory

In fact, in the course of this book, I propose to consider a number of different types of experimentation, with especial reference to their logical structure, and to show that when the appropriate precautions are taken to make this structure complete, entirely valid inferences may be drawn from them, without using the disputed axiom. If this can be done, we shall, in the course of studies having directly practical aims, have overcome the theoretical difficulty of inductive inferences.

Inductive inference is the only process known to us by which essentially new knowledge comes into the world. To make clear the authentic conditions of its validity is the kind of contribution to the intellectual development of mankind which we should expect experimental science would ultimately supply. Men have always been capable of some mental processes of the kind we call “learning by experience.” Doubtless this experience was often a very imperfect basis, and the reasoning processes used in interpreting it were very insecure; but there must have been in these processes a sort of embryology of knowledge, by which new knowledge was gradually produced. Experimental observations are only experience carefully planned in advance, and designed to form a secure basis of new knowledge; that is, they are systematically related to the body of knowledge already acquired, and the results are deliberately observed, and put on record accurately.
As the art of experimentation advances the principles should become clear by virtue of which this planning and designing achieve their purpose.

It is as well to remember in this connection that the principles and method of even deductive reasoning were probably unknown for several thousand years after the establishment of prosperous and cultured civilisations. We take a knowledge of these principles for granted, only because geometry is universally taught in schools. The method and material taught is essentially that of Euclid’s text-book of the third century B.C., and no one can make any progress in that subject without thoroughly familiarising his mind with the requirements of a precise deductive argument. Assuming the axioms, the body of their logical consequences is built up systematically and without ambiguity. Yet it is certainly something of an accident historically that this particular discipline should have become fashionable in the Greek Universities, and later embodied in the curricula of secondary education. It would be difficult to overstate how much the liberty of human thought has owed to this fortunate circumstance. Since Euclid’s time there have been very long periods during which the right of unfettered individual judgment has been successfully denied in legal, moral, and historical questions, but in which it has, none the less, survived, so far as purely deductive reasoning is concerned, within the shelter of apparently harmless mathematical studies.

The liberation of the human intellect must, however, remain incomplete so long as it is free only to work out the consequences of a prescribed body of dogmatic data, and is denied the access to unsuspected truths, which only direct observation can give. The development of experimental science has therefore done much more than to multiply the technical competence of mankind; and if, in these introductory lines, I have seemed to wander far from the immediate purpose of this book, it is only because the two topics with which we shall be concerned, the arts of experimental design and of the valid interpretation of experimental results, in so far as they can be technically perfected, must constitute the core of this claim to the exercise of full intellectual liberty.

The chapters which follow are designed to illustrate the principles which are common to all experimentation, by means of examples chosen for the simplicity with which these principles are brought out. Next, to exhibit the principal designs which have been found successful in that field of experimentation, namely agriculture, in which questions of design have been most thoroughly studied, and to illustrate their applicability to other fields of work. Many of the most useful designs are extremely simple, and these deserve the greatest attention, as showing in what ways, and on what occasions, greater elaboration may be advantageous. The careful reader should be able to satisfy himself not only, in detail, why some experiments have a complex structure, but also how a complex observational record may be handled with intelligibility and precision.

The subject is a new one, and in many ways the most that the author can hope is to suggest possible lines of attack on the problems with which others are confronted. Progress in recent years has been rapid, and the few sections devoted to the subject in the author’s Statistical Methods for Research Workers, first published in 1925, have, with each succeeding edition, come to appear more and more inadequate. On purely statistical questions the reader must be referred to that book; on logic, and the analysis of meaning, to Statistical Methods and Scientific Inference.
The present volume is an attempt to do more thorough justice to the problems of planning and foresight with which the experimenter is confronted.

REFERENCES AND OTHER READING

T. BAYES (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, liii. 370.
A. DE MORGAN (1838). An essay on probabilities and on their application to life contingencies and insurance offices. Preface, vi. Longman & Co.
R. A. FISHER (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, xxvi. 528-535.
R. A. FISHER (1932). Inverse probability and the use of likelihood. Proceedings of the Cambridge Philosophical Society, xxvii. 257-262.
R. A. FISHER (1935). The logic of inductive inference. Journal Royal Statistical Society, xcviii. 39-54.
R. A. FISHER (1936). Uncertain inference. Proceedings of the American Academy of Arts and Sciences, 71. 245-258.
R. A. FISHER (1925-1963). Statistical methods for research workers. Oliver and Boyd Ltd., Edinburgh.
R. A. FISHER (1956, 1959). Statistical methods and scientific inference. Oliver and Boyd Ltd., Edinburgh.

II

THE PRINCIPLES OF EXPERIMENTATION, ILLUSTRATED BY A PSYCHO-PHYSICAL EXPERIMENT

5. Statement of Experiment

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup. We will consider the problem of designing an experiment by means of which this assertion can be tested. For this purpose let us first lay down a simple form of experiment with a view to studying its limitations and its characteristics, both those which appear to be essential to the experimental method, when well developed, and those which are not essential but auxiliary.

Our experiment consists in mixing eight cups of tea, four in one way and four in the other, and presenting them to the subject for judgment in a random order.
The subject has been told in advance of what the test will consist, namely that she will be asked to taste eight cups, that these shall be four of each kind, and that they shall be presented to her in a random order, that is in an order not determined arbitrarily by human choice, but by the actual manipulation of the physical apparatus used in games of chance, cards, dice, roulettes, etc., or, more expeditiously, from a published collection of random sampling numbers purporting to give the actual results of such manipulation. Her task is to divide the 8 cups into two sets of 4, agreeing, if possible, with the treatments received.

6. Interpretation and its Reasoned Basis

In considering the appropriateness of any proposed experimental design, it is always needful to forecast all possible results of the experiment, and to have decided without ambiguity what interpretation shall be placed upon each one of them. Further, we must know by what argument this interpretation is to be sustained. In the present instance we may argue as follows. There are 70 ways of choosing a group of 4 objects out of 8. This may be demonstrated by an argument familiar to students of “permutations and combinations,” namely, that if we were to choose the 4 objects in succession we should have successively 8, 7, 6, 5 objects to choose from, and could make our succession of choices in 8×7×6×5, or 1680 ways. But in doing this we have not only chosen every possible set of 4, but every possible set in every possible order; and since 4 objects can be arranged in order in 4×3×2×1, or 24 ways, we may find the number of possible choices by dividing 1680 by 24. The result, 70, is essential to our interpretation of the experiment. At best the subject can judge rightly with every cup and, knowing that 4 are of each kind, this amounts to choosing, out of the 70 sets of 4 which might be chosen, that particular one which is correct.
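As an illustration that is not part of the original text, the counting argument above may be checked by a short computation. The following is a minimal sketch, assuming Python 3.8 or later; the function name `count_sets_by_ordering` is hypothetical and chosen only for this example. It follows the two steps of the argument (ordered choices, then division by the number of orderings) and compares the result with the binomial coefficient supplied by the standard library.

```python
import math


def count_sets_by_ordering(total: int, chosen: int) -> int:
    """Count unordered sets of `chosen` objects out of `total`, as argued in the text."""
    # Successive choices: with total=8 and chosen=4 there are 8, 7, 6, 5 objects to choose from.
    ordered = math.prod(range(total, total - chosen, -1))
    # Each set has been counted once for every order in which its members can be arranged.
    orderings = math.factorial(chosen)
    return ordered // orderings


ordered_ways = math.prod(range(8, 4, -1))  # 8 x 7 x 6 x 5
print(ordered_ways)                        # 1680
print(math.factorial(4))                   # 24
print(count_sets_by_ordering(8, 4))        # 70
print(math.comb(8, 4))                     # 70, the same count as a binomial coefficient
```

The agreement of the last two lines is only a check on the arithmetic; the interpretation placed on the figure 70 rests on the argument of the text.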
A subject without any faculty of discrimination would in fact divide the 8 cups correctly into two sets of 4 in one trial out of 70, or, more properly, with a frequency which would approach 1 in 70 more and more nearly the more often the test were repeated. Evidently this frequency, with which unfailing success would be achieved by a person lacking altogether the faculty under test, is calculable from the number of cups used. The odds could be made much higher by enlarging the experiment, while if the experiment were much smaller even the greatest possible success would give odds so low that the result might, with considerable probability, be ascribed to chance.

7. The Test of Significance

It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. Thus, if he wishes to ignore results having probabilities as high as 1 in 20 (the probabilities being of course reckoned from the hypothesis that the phenomenon to be demonstrated is in fact absent) then it would be useless for him to experiment with only 3 cups of tea of each kind. For 3 objects can be chosen out of 6 in only 20 ways, and therefore complete success in the test would be achieved without sensory discrimination, i.e. by “pure chance,” in an average of 5 trials out of 100.
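The dependence of this frequency on the number of cups may be illustrated by a brief computation, which is an addition to the text and not part of it. The sketch below assumes Python 3.8 or later; the names `chance_of_perfect_score` and `REQUIRED` are hypothetical, and the requirement is taken, as in the passage above, to be a probability strictly smaller than 1 in 20.

```python
import math
from fractions import Fraction

# Assumed standard for illustration: results with probability as high as 1 in 20 are ignored.
REQUIRED = Fraction(1, 20)


def chance_of_perfect_score(cups_per_kind: int) -> Fraction:
    """Chance that a subject with no discrimination classifies every cup correctly."""
    # Only one of the equally likely selections of n cups out of 2n is the correct one.
    return Fraction(1, math.comb(2 * cups_per_kind, cups_per_kind))


for n in (3, 4, 6):
    p = chance_of_perfect_score(n)
    # An experiment is useful only if its best possible result is rarer than the standard.
    print(n, "cups of each kind:", p, "sufficient" if p < REQUIRED else "useless")
```

With 3 cups of each kind the chance is exactly 1 in 20, which does not fall below the standard; with 4 it is 1 in 70, and with 6 it is 1 in 924, which shows how quickly the odds rise as the experiment is enlarged.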
No such selection can eliminate the whole of the possible effects of chance coincidence, and if we accept this convenient convention, and agree that an event which would occur by chance only once in 70 trials is decidedly “significant,” in the statistical sense, we thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon; for the “one chance in a million” will undoubtedly occur, with no less and no more than its appropriate frequency, however surprised we may be that it should occur to us. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.

Returning to the possible results of the psycho-physical experiment, having decided that if every cup were rightly classified a significant positive result would be recorded, or, in other words, that we should admit that the lady had made good her claim, what should be our conclusion if, for each kind of cup, her judgments are 3 right and 1 wrong? We may take it, in the present discussion, that any error in one set of judgments will be compensated by an error in the other, since it is known to the subject that there are 4 cups of each kind. In enumerating the number of ways of choosing 4 things out of 8, such that 3 are right and 1 wrong, we may note that the 3 right may be chosen, out of the 4 available, in 4 ways and, independently of this choice, that the 1 wrong may be chosen, out of the 4 available, also in 4 ways. So that in all we could make a selection of the kind supposed in 16 different ways.
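The count of 16 can be confirmed by listing all 70 possible selections and scoring each against the true arrangement; the same tally gives the frequency of every other score. This is a sketch under stated assumptions: the cup labels 0 to 7 and the choice of cups 0 to 3 as the milk-first set are arbitrary, and the helper `chance_at_least` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb

milk_first = {0, 1, 2, 3}  # arbitrary labelling of the 4 cups truly made milk first

# Score every possible selection of 4 cups by the number rightly chosen.
tally = Counter(
    len(milk_first & set(selection)) for selection in combinations(range(8), 4)
)

# The product rule of the text: choose k of the 4 right, and 4 - k of the 4 wrong.
by_product_rule = {k: comb(4, k) * comb(4, 4 - k) for k in range(5)}


def chance_at_least(k):
    """Chance frequency of k or more right out of 4, all 70 selections equally likely."""
    return Fraction(sum(tally[j] for j in range(k, 5)), sum(tally.values()))


for k in range(4, -1, -1):
    print(k, "right:", tally[k], "ways")
print("3 or more right:", chance_at_least(3))
```

The function `chance_at_least` adds to the frequency of a given score the frequencies of all better scores, which is the quantity considered in the discussion that follows.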
A similar argument shows that, in each kind of judgment, 2 may be right and 2 wrong in 36 ways, 1 right and 3 wrong in 16 ways and none right and 4 wrong in 1 way only. It should be noted that the frequencies of these five possible results of the experiment make up together, as it is obvious they should, the 70 cases out of 70.

It is obvious, too, that 3 successes to 1 failure, although showing a bias, or deviation, in the right direction, could not be judged as statistically significant evidence of a real sensory discrimination. For its frequency of chance occurrence is 16 in 70, or more than 20 per cent. Moreover, it is not the best possible result, and in judging of its significance we must take account not only of its own frequency, but also of the frequency of any better result. In the present instance “3 right and 1 wrong” occurs 16 times, and “4 right” occurs once in 70 trials, making 17 cases out of 70 as good as or better than that observed. The reason for including cases better than that observed becomes obvious on considering what our conclusions would have been had the case of 3 right and 1 wrong only 1 chance, and the case of 4 right 16 chances of occurrence out of 70. The rare case of 3 right and 1 wrong could not be judged significant merely because it was rare, seeing that a higher degree of success would frequently have been scored by mere chance.

8. The Null Hypothesis

Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretations. Tests of significance are of many different kinds, which need not be considered here.
Here we are only concerned with the fact that the easy calculation in permutations which we encountered, and which gave us our test of significance, stands for something present in every possible experimental arrangement; or, at least, for something required in its interpretation. The two classes of results which are distinguished by our test of significance are, on the one hand, those which show a significant discrepancy from a certain hypothesis; namely, in this case, the hypothesis that the judgments given are in no way influenced by the order in which the ingredients have been added; and on the other hand, results which show no significant discrepancy from this hypothesis. This hypothesis, which may or may not be impugned by the result of an experiment, is again characteristic of all experimentation. Much confusion would often be avoided if it were explicitly formulated when the experiment is designed. In relation to any experiment we may speak of this hypothesis as the “null hypothesis,” and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

It might be argued that if an experiment can disprove the hypothesis that the subject possesses no sensory discrimination between two different sorts of object, it must therefore be able to prove the opposite hypothesis, that she can make some such discrimination. But this last hypothesis, however reasonable or true it may be, is ineligible as a null hypothesis to be tested by experiment, because it is inexact. If it were asserted that the subject would never be wrong in her judgments we should again have an exact hypothesis, and it is easy to see that this hypothesis could be disproved by a single failure, but could never be proved by any finite amount of experimentation.
It is evident that the null hypothesis must be exact, that is free from vagueness and ambiguity, because it must supply the basis of the “problem of distribution,” of which the test of significance is the solution. A null hypothesis may, indeed, contain arbitrary elements, and in more complicated cases often does so: as, for example, if it should assert that the death-rates of two groups of animals are equal, without specifying what these death-rates actually are. In such cases it is evidently the equality rather than any particular values of the death-rates that the experiment is designed to test, and possibly to disprove.

In cases involving statistical “estimation” these ideas may be extended to the simultaneous consideration of a series of hypothetical possibilities. The notion of an error of the so-called “second kind,” due to accepting the null hypothesis “when it is false” may then be given a meaning in reference to the quantity to be estimated. It has no meaning with respect to simple tests of significance, in which the only available expectations are those which flow from the null hypothesis being true. Problems of the more elaborate type involving estimation are discussed in Chapter IX.

9. Randomisation; the Physical Basis of the Validity of the Test

We have spoken of the experiment as testing a certain null hypothesis, namely, in this case, that the subject possesses no sensory discrimination whatever of the kind claimed; we have, too, assigned as appropriate to this hypothesis a certain frequency distribution of occurrences, based on the equal frequency of the 70 possible ways of assigning 8 objects to two classes of 4 each; in other words, the frequency distribution appropriate to a classification by pure chance.
We have now to examine the physical conditions of the experimental technique needed to justify the assumption that, if discrimination of the kind under test is absent, the result of the experiment will be wholly governed by the laws of chance. It is easy to see that it might well be otherwise. If all those cups made with the milk first had sugar added, while those made with the tea first had none, a very obvious difference in flavour would have been introduced which might well ensure that all those made with sugar should be classed alike. These groups might either be classified all right or all wrong, but in such a case the frequency of the critical event in which all cups are classified correctly would not be 1 in 70, but 35 in 70 trials, and the test of significance would be wholly vitiated. Errors equivalent in principle to this are very frequently incorporated in otherwise well-designed experiments.

It is no sufficient remedy to insist that “all the cups must be exactly alike” in every respect except that to be tested. For this is a totally impossible requirement in our example, and equally in all other forms of experimentation. In practice it is probable that the cups will differ perceptibly in the thickness or smoothness of their material, that the quantities of milk added to the different cups will not be exactly equal, that the strength of the infusion of tea may change between pouring the first and the last cup, and that the temperature also at which the tea is tasted will change during the course of the experiment. These are only examples of the differences probably present; it would be impossible to present an exhaustive list of such possible differences appropriate to any one kind of experiment, because the uncontrolled causes which may influence the result are always strictly innumerable.
When any such cause is named, it is usually perceived that, by increased labour and expense, it could be largely eliminated. Too frequently it is assumed that such refinements constitute improvements to the experiment. Our view, which will be much more fully exemplified in later sections, is that it is an essential characteristic of experimentation that it is carried out with limited resources, and an essential part of the subject of experimental design to ascertain how these should be best applied; or, in particular, to which causes of disturbance care should be given, and which ought to be deliberately ignored. To ascertain, too, for those which are not to be ignored, to what extent it is worth while to take the trouble to diminish their magnitude. For our present purpose, however, it is only necessary to recognise that, whatever degree of care and experimental skill is expended in equalising the conditions, other than the one under test, which are liable to affect the result, this equalisation must always be to a greater or less extent incomplete, and in many important practical cases will certainly be grossly defective. We are concerned, therefore, that this inequality, whether it be great or small, shall not impugn the exactitude of the frequency distribution, on the basis of which the result of the experiment is to be appraised.

10. The Effectiveness of Randomisation

The element in the experimental procedure which contains the essential safeguard is that the two modifications of the test beverage are to be prepared “in random order.” This, in fact, is the only point in the experimental procedure in which the laws of chance, which are to be in exclusive control of our frequency distribution, have been explicitly introduced.
The phrase “random order” itself, however, must be regarded as an incomplete instruction, standing as a kind of shorthand symbol for the full procedure of randomisation, by which the validity of the test of significance may be guaranteed against corruption by the causes of disturbance which have not been eliminated. To demonstrate that, with satisfactory randomisation, its validity is, indeed, wholly unimpaired, let us imagine all causes of disturbance (the strength of the infusion, the quantity of milk, the temperature at which it is tasted, etc.) to be predetermined for each cup; then since these, on the null hypothesis, are the only causes influencing classification, we may say that the probabilities of each of the 70 possible choices or classifications which the subject can make are also predetermined. If, now, after the disturbing causes are fixed, we assign, strictly at random, 4 out of the 8 cups to each of our experimental treatments, then every set of 4, whatever its probability of being so classified, will certainly have a probability of exactly 1 in 70 of being the 4, for example, to which the milk is added first. However important the causes of disturbance may be, even if they were to make it certain that one particular set of 4 should receive this classification, the probability that the 4 so classified and the 4 which ought to have been so classified should be the same, must be rigorously in accordance with our test of significance.

It is apparent, therefore, that the random choice of the objects to be treated in different ways would be a complete guarantee of the validity of the test of significance, if these treatments were the last in time of the stages in the physical history of the objects which might affect their experimental reaction.
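The argument that randomisation secures the 1 in 70, whatever the disturbing causes may be, can be illustrated by direct enumeration. In the sketch below the subject's classification is governed by an arbitrary, predetermined set of probabilities, while the treatments are assigned strictly at random; the function name and the particular probabilities are hypothetical illustrations, not part of the original text.

```python
from fractions import Fraction
from itertools import combinations

# The 70 equally likely random assignments of the milk-first treatment.
ASSIGNMENTS = [frozenset(c) for c in combinations(range(8), 4)]


def chance_of_full_agreement(classification_probs):
    """Probability that the set classified as milk first is the set so treated.

    `classification_probs` maps each set of 4 cups to the predetermined
    probability that the subject classifies that set as milk first; on the
    null hypothesis it does not depend on the assignment actually made.
    """
    chance_per_assignment = Fraction(1, len(ASSIGNMENTS))
    return sum(
        chance_per_assignment * classification_probs.get(assigned, Fraction(0))
        for assigned in ASSIGNMENTS
    )


# Extreme disturbance: one particular set of 4 is certain to be chosen.
certain = {frozenset({0, 1, 2, 3}): Fraction(1)}

# Milder disturbance: two sets favoured unequally, all others never chosen.
lopsided = {
    frozenset({0, 1, 2, 3}): Fraction(3, 4),
    frozenset({4, 5, 6, 7}): Fraction(1, 4),
}

print(chance_of_full_agreement(certain), chance_of_full_agreement(lopsided))
```

Because every assignment carries the same weight of 1 in 70, the result is 1 in 70 for any set of classification probabilities that sum to unity.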
The circumstance that the experimental treatments cannot always be applied last, and may come relatively early in their history, causes no practical inconvenience; for subsequent causes of differentiation, if under the experimenter’s control, as, for example, the choice of different pipettes to be used with different flasks, can either be predetermined before the treatments have been randomised, or, if this has not been done, can be randomised on their own account; and other causes of differentiation will be either (a) consequences of differences already randomised, or (b) natural consequences of the difference in treatment to be tested, of which on the null hypothesis there will be none, by definition, or (c) effects supervening by chance independently from the treatments applied. Apart, therefore, from the avoidable error of the experimenter himself introducing with his test treatments, or subsequently, other differences in treatment, the effects of which the experiment is not intended to study, it may be said that the simple precaution of randomisation will suffice to guarantee the validity of the test of significance, by which the result of the experiment is to be judged.

11. The Sensitiveness of an Experiment. Effects of Enlargement and Repetition

A probable objection, which the subject might well make to the experiment so far described, is that only if every cup is classified correctly will she be judged successful. A single mistake will reduce her performance below the level of significance. Her claim, however, might be, not that she could draw the distinction with invariable certainty, but that, though sometimes mistaken, she would be right more often than not; and that the experiment should be enlarged sufficiently, or repeated sufficiently often, for her to be able to demonstrate the predominance of correct classifications in spite of occasional errors.
An extension of the calculation upon which the test of significance was based shows that an experiment with 12 cups, six of each kind, gives, on the null hypothesis, 1 chance in 924 for complete success, and 36 chances for 5 of each kind classified right and 1 wrong. As 37 is less than a twentieth of 924, such a test could be counted as significant, although a pair of cups have been wrongly classified; and it is easy to verify that, using larger numbers still, a significant result could be obtained with a still higher proportion of errors. By increasing the size of the experiment, we can render it more sensitive, meaning by this that it will allow of the detection of a lower degree of sensory discrimination, or, in other words, of a quantitatively smaller departure from the null hypothesis. Since in every case the experiment is capable of disproving, but never of proving this hypothesis, we may say that the value of the experiment is increased whenever it permits the null hypothesis to be more readily disproved.

The same result could be achieved by repeating the experiment, as originally designed, upon a number of different occasions, counting as a success all those occasions on which 8 cups are correctly classified. The chance of success on each occasion being 1 in 70, a simple application of the theory of probability shows that 2 or more successes in 10 trials would occur, by chance, with a frequency below the standard chosen for testing significance; so that the sensory discrimination would be demonstrated, although, in 8 attempts out of 10, the subject made one or more mistakes. This procedure may be regarded as merely a second way of enlarging the experiment and, thereby, increasing its sensitiveness, since in our final calculation we take account of the aggregate of the entire series of results, whether successful or unsuccessful.
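Both ways of enlarging the experiment can be verified with exact arithmetic. The following sketch assumes independent repetitions with a chance of success of 1 in 70 on each occasion, as stated above; the helper `chance_of_at_least` is a hypothetical name for the ordinary binomial sum.

```python
from fractions import Fraction
from math import comb

LEVEL = Fraction(1, 20)  # the 5 per cent. standard

# Enlargement: 12 cups, six of each kind.
ways_in_all = comb(12, 6)                           # 924
ways_one_pair_wrong = comb(6, 5) * comb(6, 1)       # 36
chance_enlarged = Fraction(1 + ways_one_pair_wrong, ways_in_all)

# Repetition: the original 8-cup test on 10 occasions.
p_success = Fraction(1, 70)


def chance_of_at_least(successes, trials, p):
    """Binomial probability of `successes` or more in `trials` independent trials."""
    return sum(
        comb(trials, j) * p**j * (1 - p) ** (trials - j)
        for j in range(successes, trials + 1)
    )


chance_repeated = chance_of_at_least(2, 10, p_success)

print(chance_enlarged, float(chance_enlarged))
print(float(chance_repeated))
```

A single success in 10 occasions would not suffice, since one or more successes occur by chance with a frequency well above 1 in 20; it is the second success that carries the result below the standard.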
It would clearly be illegitimate, and would rob our calculation of its basis, if the unsuccessful results were not all brought into the account.

12. Qualitative Methods of increasing Sensitiveness

Instead of enlarging the experiment we may attempt to increase its sensitiveness by qualitative improvements; and these are, generally speaking, of two kinds: (a) the reorganisation of its structure, and (b) refinements of technique. To illustrate a change of structure we might consider that, instead of fixing in advance that 4 cups should be of each kind, determining by a random process how the subdivision should be effected, we might have allowed the treatment of each cup to be determined independently by chance, as by the toss of a coin, so that each treatment has an equal chance of being chosen. The chance of classifying correctly 8 cups randomised in this way, without the aid of sensory discrimination, is 1 in 2⁸, or 1 in 256 chances, and there are only 8 chances of classifying 7 right and 1 wrong; consequently the sensitiveness of the experiment has been increased, while still using only 8 cups, and it is possible to score a significant success, even if one is classified wrongly. In many types of experiment, therefore, the suggested change in structure would be evidently advantageous. For the special requirements of a psycho-physical experiment, however, we should probably prefer to forego this advantage, since it would occasionally occur that all the cups would be treated alike, and this, besides bewildering the subject by an unexpected occurrence, would deny her the real advantage of judging by comparison.

Another possible alteration to the structure of the experiment, which would, however, decrease its sensitiveness, would be to present determined, but unequal, numbers of the two treatments.
Thus we might arrange that 5 cups should be of the one kind and 3 of the other, choosing them properly by chance, and informing the subject how many of each to expect. But since the number of ways of choosing 3 things out of 8 is only 56, there is now, on the null hypothesis, a probability of a completely correct classification of 1 in 56. It appears in fact that we cannot by these means do better than by presenting the two treatments in equal numbers, and the choice of this equality is now seen to be justified by its giving to the experiment its maximal sensitiveness.

With respect to the refinements of technique, we have seen above that these contribute nothing to the validity of the experiment, and of the test of significance by which we determine its result. They may, however, be important, and even essential, in permitting the phenomenon under test to manifest itself. Though the test of significance remains valid, it may be that without special precautions even a definite sensory discrimination would have little chance of scoring a significant success. If some cups were made with India and some with China tea, even though the treatments were properly randomised, the subject might not be able to discriminate the relatively small difference in flavour under investigation, when it was confused with the greater differences between leaves of different origin. Obviously, a similar difficulty could be introduced by using in some cups raw milk and in others boiled, or even condensed milk, or by adding sugar in unequal quantities.
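The two changes of structure considered earlier in this section, independent coin tosses and fixed but unequal numbers, can be compared by the same kind of counting. This is a minimal sketch; the variable names are illustrative only.

```python
from fractions import Fraction
from math import comb

LEVEL = Fraction(1, 20)  # the 5 per cent. standard

# Each of 8 cups assigned independently by the toss of a coin.
arrangements_by_coin = 2**8                      # 256
ways_all_right = comb(8, 8)                      # 1
ways_seven_right = comb(8, 7)                    # 8
chance_seven_or_more = Fraction(
    ways_all_right + ways_seven_right, arrangements_by_coin
)

# Fixed numbers announced in advance: k cups of one kind, 8 - k of the other.
ways_by_split = {k: comb(8, k) for k in range(1, 8)}
most_sensitive_split = max(ways_by_split, key=ways_by_split.get)

print(chance_seven_or_more, float(chance_seven_or_more))
print(ways_by_split, most_sensitive_split)
```

With coin tossing, 7 or more right has a chance frequency of 9 in 256, which is below 1 in 20; with fixed numbers, the count of possible arrangements, and with it the sensitiveness, is greatest for the equal division of 4 and 4.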
The subject has a right to claim, and it is in the interests of the sensitiveness of the experiment, that gross differences of these kinds should be excluded, and that the cups should, not as far as possible, but as far as is practically convenient, be made alike in all respects except that under test.

How far such experimental refinements should be carried is entirely a matter of judgment, based on experience. The validity of the experiment is not affected by them. Their sole purpose is to increase its sensitiveness, and this object can usually be achieved in many other ways, and particularly by increasing the size of the experiment. If, therefore, it is decided that the sensitiveness of the experiment should be increased, the experimenter has the choice between different methods of obtaining equivalent results; and will be wise to choose whichever method is easiest to him, irrespective of the fact that previous experimenters may have tried, and recommended as very important, or even essential, various ingenious and troublesome precautions.

12.1. Scientific Inference and Acceptance Procedures

In “The Improvement of Natural Knowledge”, that is, in learning by experience, or by planned chains of experimentation, conclusions are always provisional and in the nature of progress reports, interpreting and embodying the evidence so far accrued. Convenient as it is to note that a hypothesis is contradicted at some familiar level of significance such as 5% or 2% or 1% we do not, in Inductive Inference, ever need to lose sight of the exact strength which the evidence has in fact reached, or to ignore the fact that with further trial it might come to be stronger, or weaker. The situation is entirely different in the field of Acceptance Procedures, in which irreversible action may have to be taken, and in which, whichever decision is arrived at, it is quite immaterial whether it is arrived at on strong evidence or on weak.
All that is needed is a Rule of Action which is to be taken automatically, and without thought devoted to the individual decision. The procedure as a whole is arrived at by minimising the losses due to wrong decisions, or to unnecessary testing, and to frame such a procedure successfully the cost of such faulty decisions must be assessed in advance; equally, also, prior knowledge is required of the expected distribution of the material in supply. In the field of pure research no assessment of the cost of wrong conclusions, or of delay in arriving at more correct conclusions can conceivably be more than a pretence, and in any case such an assessment would be inadmissible and irrelevant in judging the state of the scientific evidence; moreover, accurately assessable prior information is ordinarily known to be lacking. Such differences between the logical situations should be borne in mind whenever we see tests of significance spoken of as “Rules of Action”. A good deal of confusion has certainly been caused by the attempt to formalise the exposition of tests of significance in a logical framework different from that for which they were in fact first developed.

REFERENCES AND OTHER READING

R. A. Fisher (1925-1963). Statistical methods for research workers. Chap. III., §§ 15-19.
R. A. Fisher (1926). The arrangement of field experiments. Journal of Ministry of Agriculture, xxxiii. 503-513.

II

A HISTORICAL EXPERIMENT ON GROWTH RATE

13. We have illustrated a psycho-physical experiment, the result of which depends upon judgments, scored “right” or “wrong,” and may be appropriately interpreted by the method of the classical theory of probability. This method rests on the enumeration of the frequencies with which different combinations of right or wrong judgments will occur, on the hypothesis to be tested.
We may now illustrate an experiment in which the results are expressed in quantitative measures, and which is appropriately interpreted by means of the theory of errors.

In the introductory remarks to his book on “The effects of cross and self-fertilisation in the vegetable kingdom,” Charles Darwin gives an account of the considerations which guided him in the design of his experiments and in the presentation of his data, which will serve well to illustrate the principles on which biological experiments may be made conclusive. The passage is of especial interest in illustrating the extremely crude and unsatisfactory statistical methods available at the time, and the manner in which careful attention to commonsense considerations led to the adoption of an experimental design, in itself greatly superior to these methods of interpretation.

14. Darwin’s Discussion of the Data

“I long doubted whether it was worth while to give the measurements of each separate plant, but have decided to do so, in order that it may be seen that the superiority of the crossed plants over the self-fertilised does not commonly depend on the presence of two or three extra fine plants on the one side, or of a few very poor plants on the other side. Although several observers have insisted in general terms on the offspring from intercrossed varieties being superior to either parent-form, no precise measurements have been given; and I have met with no observations on the effects of crossing and self-fertilising the individuals of the same variety.
Moreover, experiments of this kind require so much time (mine having been continued during eleven years) that they are not likely soon to be repeated.

“As only a moderate number of crossed and self-fertilised plants were measured, it was of great importance to me to learn how far the averages were trustworthy. I therefore asked Mr Galton, who has had much experience in statistical researches, to examine some of my tables of measurements, seven in number, namely those of Ipomœa, Digitalis, Reseda lutea, Viola, Limnanthes, Petunia, and Zea. I may premise that if we took by chance a dozen or score of men belonging to two nations and measured them, it would I presume be very rash to form any judgment from such small numbers on their average heights. But the case is somewhat different with my crossed and self-fertilised plants, as they were of exactly the same age, were subjected from first to last to the same conditions, and were descended from the same parents. When only from two to six pairs of plants were measured, the results are manifestly of little or no value, except in so far as they confirm and are confirmed by experiments made on a larger scale with other species. I will now give the report on the seven tables of measurements, which Mr Galton has had the great kindness to draw up for me.”

15. Galton’s Method of Interpretation

“I have examined the measurements of the plants with care, and by many statistical methods, to find out how far the means of the several sets represent constant realities, such as would come out the same so long as the general conditions of growth remained unaltered. The principal methods that were adopted are easily explained by selecting one of the shorter series of plants, say of Zea mays, for an example.

“The observations as I received them are shown in columns II. and III., where they certainly have no primâ facie appearance of regularity.
But as soon as we arrange them in the order of their magnitudes, as in columns IV. and V., the case is materially altered. We now see, with few exceptions, that the largest plant on the crossed side in each pot exceeds the largest plant on the self-fertilised side, that the second exceeds the second, the third the third, and so on. Out of the fifteen cases in the table, there are only two exceptions to this rule.* We may therefore confidently affirm that a crossed series will always be found to exceed a self-fertilised series, within the range of the conditions under which the present experiment has been made.

“Next as regards the numerical estimate of this excess. The mean values of the several groups are so discordant, as is shown in the table just given, that a fairly precise numerical estimate seems impossible. But the consideration arises, whether the difference between pot and pot may not be of much the same order of importance as that of the other conditions upon which the growth of the plants has been modified. If so, and only on that condition, it would follow that when all the measurements, either of the crossed or the self-fertilised plants, were combined into a single series, that series would be statistically regular. The experiment is tried in columns VII. and VIII., where the regularity is abundantly clear, and justifies us in considering its mean as perfectly reliable.

* Galton evidently did not notice that this is true also before rearrangement.

TABLE 1. Zea mays (Young Plants) [columns: Crossed and Self-fert. heights, in inches, as recorded by Mr Darwin and as arranged in order of magnitude]

I have protracted these measurements, and revised them in the usual way, by drawing a curve through them with a free hand, but the revision barely modifies the means derived from the original observations. In the present, and in nearly all the other cases, the difference between the original and revised means is under 2 per cent. of their value. It is a very remarkable coincidence that in the seven kinds of plants, whose measurements I have examined, the ratio between the heights of the crossed and of the self-fertilised ranges in five cases within very narrow limits. In Zea mays it is as 100 to 84, and in the others it ranges between 100 to 76 and 100 to 86.

TABLE 2 [columns: Crossed, Self-fert., Difference]

“The determination of the variability (measured by what is technically called the ‘probable error’) is a problem of more delicacy than that of determining the means, and I doubt, after making many trials, whether it is possible to derive useful conclusions from these few observations. We ought to have measurements of at least fifty plants in each case, in order to be in a position to deduce fair results. . . .”

“Mr Galton sent me at the same time graphical representations which he had made of the measurements, and they evidently form fairly regular curves. He appends the words ‘very good’ to those of Zea and Limnanthes.
He also calculated the average height of the crossed and self-fertilised plants in the seven tables by a more correct method than that followed by me, namely by including the heights, as estimated in accordance with statistical rules, of a few plants which died before they were measured; whereas I merely added up the heights of the survivors, and divided the sum by their number. The difference in our results is in one way highly satisfactory, for the average heights of the self-fertilised plants, as deduced by Mr Galton, is less than mine in all the cases excepting one, in which our averages are the same; and this shows that I have by no means exaggerated the superiority of the crossed over the self-fertilised plants.”

16. Pairing and Grouping

It is seen that the method of comparison adopted by Darwin is that of pitting each self-fertilised plant against a cross-fertilised one, in conditions made as equal as possible. The pairs so chosen for comparison had germinated at the same time, and the soil conditions in which they grew were largely equalised by planting in the same pot. Necessarily they were not of the same parentage, as it would be difficult in maize to self-fertilise two plants at the same time as raising a cross-fertilised progeny from the pair. However, the parents were presumably grown from the same batch of seed. The evident object of these precautions is to increase the sensitiveness of the experiment, by making such differences in growth rate as were to be observed as little as possible dependent from environmental circumstances, and as much as possible, therefore, from intrinsic differences due to their mode of origin.

The method of pairing, which is much used in modern biological work, illustrates well the way in which an appropriate experimental design is able to reconcile two desiderata, which sometimes appear to be in conflict.
On the one hand we require the utmost uniformity in the biological material, which is the subject of experiment, in order to increase the sensitiveness of each individual observation; and, on the other, we require to multiply the observations so as to demonstrate so far as possible the reliability and consistency of the results. Thus an experimenter with field crops may desire to replicate his experiments upon a large number of plots, but be deterred by the consideration that his facilities allow him to sow only a limited area on the same day. An experimenter with small mammals may have only a limited supply of an inbred and highly uniform stock, which he believes to be particularly desirable for experimental purposes. Or, he may desire to carry out his experiments on members of the same litter, and feel that his experiment is limited by the size of the largest litter he can obtain. It has indeed frequently been argued that, beyond a certain moderate degree, further replication can give no further increase in precision, owing to the increasing heterogeneity with which, it is thought, it must be accompanied. In all these cases, however, and in the many analogous cases which constantly arise, there is no real dilemma. Uniformity is only requisite between the objects whose response is to be contrasted (that is, objects treated differently). It is not requisite that all the parallel plots under the same treatment shall be sown on the same day, but only that each such plot shall be sown so far as possible simultaneously with the differently treated plot or plots with which it is to be compared. If, therefore, only two kinds of treatments are under examination, pairs of plots may be chosen, one plot for each treatment; and the precision of the experiment will be given its highest value if the members of each pair are treated closely alike, but will gain nothing from similarity of treatment applied to different pairs, nor lose anything if the conditions in these are somewhat varied. In the same way, if the numbers of animals available from any inbred line are too few for adequate replication, the experimental contrasts in treatments may be applied to pairs of animals from different inbred lines, so long as each pair belongs to the same line. In these two cases it is evident that the principle of combining similarity between controls to be compared, with diversity between parallels, may be extended to cases where three or more treatments are under investigation. The requirement that animals to be contrasted must come from the same litter limits, not the amount of replication, but the number of different treatments that can be so tested. Thus we might test three, but not so easily four or five treatments, if it were necessary that each set of animals must be of the same sex and litter. Paucity of homogeneous material limits the number of different treatments in an experiment, not the number of replications. It may cramp the scope and comprehensiveness of an experimental enquiry, but sets no limit to its possible precision.

17. “Student’s” t Test *

Owing to the historical accident that the theory of errors, by which quantitative data are to be interpreted, was developed without reference to experimental methods, the vital principle has often been overlooked that the actual and physical conduct of an experiment must govern the statistical procedure of its interpretation.
In using the theory of errors we rely for our conclusion upon one or more estimates of error, derived from the data, and appropriate to the one or more sets of comparisons which we wish to make. Whether these estimates are valid, for the purpose for which we intend them, depends on what has been actually done. It is possible, and indeed it is all too frequent, for an experiment to be so conducted that no valid estimate of error is available. In such a case the experiment cannot be said, strictly, to be capable of proving anything. Perhaps it should not, in this case, be called an experiment at all, but be added merely to the body of experience on which, for lack of anything better, we may have to base our opinions. All that we need to emphasise immediately is that, if an experiment does allow us to calculate a valid estimate of error, its structure must completely determine the statistical procedure by which this estimate is to be calculated. If this were not so, no interpretation of the data could ever be unambiguous; for we could never be sure that some other equally valid method of interpretation would not lead to a different result.

* A full account of this test in more varied applications, and the tables for its use, will be found in Statistical Methods for Research Workers. Its originator, who published anonymously under the pseudonym “Student,” possesses the remarkable distinction that, without being a professed mathematician, but a research chemist, he made early in life this revolutionary refinement of the classical theory of errors.
The object of the experiment is to determine whether the difference in origin between inbred and cross-bred plants influences their growth rate, as measured by height at a given date; in other words, if the numbers of the two sorts of plants were to be increased indefinitely, our object is to determine whether the average heights, to which these two aggregates of plants will tend, are equal or unequal. The most general statement of our null hypothesis is, therefore, that the limits to which these two averages tend are equal. The theory of errors enables us to test a somewhat more limited hypothesis, which, by wide experience, has been found to be appropriate to the metrical characters of experimental material in biology. The disturbing causes which introduce discrepancies in the means of measurements of similar material are found to produce quantitative effects which conform satisfactorily to a theoretical distribution known as the normal law of frequency of error. It is this circumstance that makes it appropriate to choose, as the null hypothesis to be tested, one for which an exact statistical criterion is available, namely that the two groups of measurements are samples drawn from the same normal population. On the basis of this hypothesis we may proceed to compare the average difference in height, between the cross-fertilised and the self-fertilised plants, with such differences as might be expected between these averages, in view of the observed discrepancies between the heights of plants of like origin.

We must now see how the adoption of the method of pairing determines the details of the arithmetical procedure, so as to lead to an unequivocal interpretation. The pairing procedure, as indeed was its purpose, has equalised any differences in soil conditions, illumination, air-currents, etc., in which the several pairs of individuals may differ.
Such differences having been eliminated from the experimental comparisons, and contributing nothing to the real errors of our experiment, must, for this reason, be eliminated likewise from our estimate of error, upon which we are to judge what differences between the means are compatible with the null hypothesis, and what differences are so great as to be incompatible with it. We are therefore not concerned with the differences in height among plants of like origin, but only with differences in height between members of the same pair, and with the discrepancies among these differences observed in different pairs. Our first step, therefore, will be to subtract from the height of each cross-fertilised plant the height of the self-fertilised plant belonging to the same pair. The differences are shown below in eighths of an inch. With respect to these differences our null hypothesis asserts that they are normally distributed about a mean value at zero, and we have to test whether our 15 observed differences are compatible with the supposition that they are a sample from such a population.

TABLE 3
Differences in eighths of an inch between cross- and self-fertilised plants of the same pair

     49      23      56
    -67      28      24
      8      41      75
     16      14      60
      6      29     -48

The calculations needed to make a rigorous test of the null hypothesis stated above involve no more than the sum, and the sum of the squares, of these numbers. The sum is 314, and, since there are 15 plants, the mean difference is 20 14/15 in favour of the cross-fertilised plants. The sum of the squares is 26,518, and from this is deducted the product of the total and the mean, or 6573, leaving 19,945 for the sum of squares of deviations from the mean, representing discrepancies among the differences observed in the 15 pairs. The algebraic fact here used is that

$S(x - \bar{x})^2 = S(x^2) - \bar{x}\,S(x)$

where $S$ stands for summation over the sample, and $\bar{x}$ for the mean value of the observed differences, $x$.
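The arithmetic of this section can be checked mechanically. The following is a minimal sketch in Python and is not part of the original text. It assumes that the fifteen differences of Table 3 include the two negative values -67 and -48, a reading that reproduces the stated total of 314 and sum of squares of 26,518. The function name `paired_t` is illustrative only. The sketch also carries the calculation through to the ratio t described in the next paragraphs.

```python
import math

# Differences (crossed minus self-fertilised) in eighths of an inch, Table 3.
# Assumption: the two negative entries are -67 and -48, which reproduces the
# total (314) and the sum of squares (26,518) quoted in the text.
DIFFERENCES = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]


def paired_t(diffs):
    """Return (mean, sum of squared deviations, standard error of mean, t)."""
    n = len(diffs)
    total = sum(diffs)
    sum_sq = sum(x * x for x in diffs)
    mean = total / n
    # Identity used in the text: S(x - mean)^2 = S(x^2) - mean * S(x)
    ss_dev = sum_sq - mean * total
    # Variance of a single difference uses n - 1; dividing again by n
    # gives the variance of the mean of n differences.
    var_of_mean = ss_dev / (n - 1) / n
    std_err = math.sqrt(var_of_mean)
    return mean, ss_dev, std_err, mean / std_err


mean, ss_dev, std_err, t = paired_t(DIFFERENCES)
print(f"mean = {mean:.3f}, sum of squared deviations = {ss_dev:.0f}")
print(f"standard error = {std_err:.3f}, t = {t:.3f} "
      f"on {len(DIFFERENCES) - 1} degrees of freedom")
```

Only the sum and the sum of squares enter the calculation, which is why the text remarks that nothing more is needed for a rigorous test.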
We may make from this measure of the discrepancies an estimate of a quantity known as the variance of an individual difference, by dividing by 14, one less than the number of pairs observed. Equally, and what is more immediately required, we may make an estimate of the variance of the mean of 15 such pairs, by dividing again by 15, a process which yields 94.976 as the estimate. The square root of the variance is known as the standard error, and it is by the ratio which our observed mean difference bears to its standard error that we shall judge of its significance. Dividing our difference, 20.933, by its standard error 9.746, we find this ratio (which is usually denoted by t) to be 2.148.

The object of these calculations has been to obtain from the data a quantity measuring the average difference in height between the cross-fertilised and the self-fertilised plants, in terms of the observed discrepancies among these differences; and which, moreover, shall be distributed in a known manner when the null hypothesis is true. The mathematical distribution for our present problem was discovered by “Student” in 1908, and depends only upon the number of independent comparisons (or the number of degrees of freedom) available for calculating the estimate of error. With 15 observed differences we have among them 14 independent discrepancies, and our degrees of freedom are 14. The available tables of the distribution of t show that for 14 degrees of freedom the value 2.145 is exceeded by chance, either in the positive or negative direction, in exactly 5 per cent. of random trials. The observed value of t, 2.148, thus just exceeds the 5 per cent. point, and the experimental result may be judged significant, though barely so.

18. Fallacious Use of Statistics

We may now see that Darwin’s judgment was perfectly sound, in judging that it was of importance to learn how far the averages were trustworthy, and that this could be done by a statistical examination of the tables of measurements of individual plants, though not of their averages. The example chosen, in fact, falls just on the border-line between those results which can suffice by themselves to establish the point at issue, and those which are of little value except in so far as they confirm or are confirmed by other experiments of a like nature. In particular, it is to be noted that Darwin recognised that the reliability of the result must be judged by the consistency of the superiority of the crossed plants over the self-fertilised, and not only on the difference of the averages, which might depend, as he says, on the presence of two or three extra-fine plants on the one side, or of a few very poor plants on the other side; and that therefore the presentation of the experimental evidence depended essentially on giving the measurements of each independent plant, and could not be assessed from the mere averages.

It may be noted also that Galton’s scepticism of the value of the probable error, deduced from only 15 pairs of observations, though, as it turned out, somewhat excessive, was undoubtedly right in principle. The standard error (of which the probable error is only a conventional fraction) can only be estimated with considerable uncertainty from so small a sample, and, prior to “Student’s” solution of the problem, it was by no means clear to what extent this uncertainty would invalidate the test of significance. From “Student’s” work it is now known that the cause for anxiety was not so great as it might have seemed. Had the standard error been known with certainty, or derived from an effectively infinite number of observations, the 5 per cent. value of t would have been 1.960.
When our estimate is based upon only 15 differences, the 5 per cent. value, as we have seen, is 2.145, or less than 10 per cent. greater. Even using the inexact theory available at the time, a calculation of the probable error would have provided a valuable guide to the interpretation of the results.

19. Manipulation of the Data

A much more serious fallacy appears to be involved in Galton’s assumption that the value of the data, for the purpose for which they were intended, could be increased by rearranging the comparisons. Modern statisticians are familiar with the notions that any finite body of data contains only a limited amount of information, on any point under examination; that this limit is set by the nature of the data themselves, and cannot be increased by any amount of ingenuity expended in their statistical examination: that the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue. If the results of an experiment, as obtained, are in fact irregular, this evidently detracts from their value; and the statistician is not elucidating but falsifying the facts, who rearranges them so as to give an artificial appearance of regularity.

In rearranging the results of Darwin’s experiment it appears that Galton thought that Darwin’s experiment would be equivalent to one in which the heights of pairs of contrasted plants had been those given in his columns headed VI. and VII., and that the reliability of Darwin’s average difference of about 2 5/8 inches could be fairly judged from the constancy of the 15 differences shown in column VIII. How great an effect this procedure, if legitimate, would have had on the significance of the result, may be seen by treating these artificial differences as we have treated the actual differences given by Darwin.
Applying the same arithmetical procedure as before, we now find t equals 5.171, a value which would be exceeded by chance only about once or twice in 10,000 trials, and is far beyond the level of significance ordinarily required. The falsification, inherent in this mode of procedure, will be appreciated if we consider that the tallest plant, of either the crossed or the self-fertilised series, will have become the tallest by reason of a number of favourable circumstances, including among them those which produce the discrepancies between those pairs of plants, which were actually grown together. By taking the difference between these two favoured plants we have largely eliminated real causes of error which have affected the value of our observed mean. We have, in doing this, grossly violated the principle that the estimate of error must be based on the effects of the very same causes of variation as have produced the real errors in our experiment. Through this fallacy Galton is led to speak of the mean as perfectly reliable, when, from its standard error, it appears that a repetition of the experiment would often give a mean quite 50 per cent. greater or less than that observed in this case.

20. Validity and Randomisation

Having decided that, when the structure of the experiment consists in a number of independent comparisons between pairs, our estimate of the error of the average difference must be based upon the discrepancies between the differences actually observed, we must next enquire what precautions are needed in the practical conduct of the experiment to guarantee that such an estimate shall be a valid one; that is to say that the very same causes that produce our real error shall also contribute the materials for computing an estimate of it. The logical necessity of this requirement is readily apparent, for, if causes of variation which do not
influence our real error are allowed to affect our estimate of it, or equally, if causes of variation affect the real error in such a way as to make no contribution to our estimate, this estimate will be vitiated, and will be incapable of providing a correct statement as to the frequency with which our real error will exceed any assigned quantity; and such a statement of frequency is the sole purpose for which the estimate is of any use.

Nevertheless, though its logical necessity is easily apprehended, the question of the validity of the estimates of error used in tests of significance was for long ignored, and is still often overlooked in practice. One reason for this is that standardised methods of statistical analysis have been taken over ready-made from a mathematical theory, into which questions of experimental detail do not explicitly enter. In consequence the assumptions which enter implicitly into the bases of the theory have not been brought prominently under the notice of practical experimenters. A second reason is that it has not until recently been recognised that any simple precaution would supply an absolute guarantee of the validity of the calculations.

In the experiment under consideration, apart from chance differences in the selection of seeds, the sole source of the experimental error in the average of our fifteen differences lies in the differences in soil fertility, illumination, evaporation, etc., which make the site of each crossed plant more or less favourable to growth than the site assigned to the corresponding self-fertilised plant. It is for this reason that every precaution, such as mixing the soil, equalising the watering and orienting the pot so as to give equal illumination, may be expected to increase the precision of the experiment.
If, now, when the fifteen pairs of sites have been chosen, and in so doing all the differences in environmental circumstances, to which the members of the different pairs will be exposed during the course of the experiment, have been predetermined, we then assign at random, as by tossing a coin, which site shall be occupied by the crossed and which by the self-fertilised plant, we shall be assigning by the same act whether this particular ingredient of error shall appear in our average with a positive or a negative sign. Since each particular error has thus an equal and independent chance of being positive or negative, the error of our average will necessarily be distributed in a sampling distribution, centred at zero, which will be symmetrical in the sense that to each possible positive error there corresponds an equal negative error, which, as our procedure guarantees, will in fact occur with equal probability.

Our estimate of error is easily seen to depend only on the same fifteen ingredients, and the arithmetical processes of summation, subtraction and division may be designed, and have in fact been designed, so as to provide the estimate appropriate to the system of chances which our method of choosing sites had imposed on the data. This is to say much more than merely that the experiment is unbiased, for we might still call the experiment unbiased if the whole of the cross-fertilised plants had been assigned to the west side of the pots, and the self-fertilised plants to the east side, by a single toss of the coin. That this would be insufficient to ensure the validity of our estimate may be easily seen; for it might well be that some unknown circumstance, such as the incidence of different illumination at different times of the day, or the desiccating action of the air-currents prevalent in the greenhouse, might systematically favour all the plants on one side over those on the other.
The effect of any such prevailing cause would then be confounded with the advantage, real or apparent, of cross-breeding over inbreeding, and would be eliminated from our estimate of error, which is based solely on the discrepancies between the differences shown by different pairs of plants. Randomisation properly carried out, in which each pair of plants are assigned their positions independently at random, ensures that the estimates of error will take proper care of all such causes of different growth rates, and relieves the experimenter from the anxiety of considering and estimating the magnitude of the innumerable causes by which his data may be disturbed. The one flaw in Darwin’s procedure was the absence of randomisation.

Had the same measurements been obtained from pairs of plants properly randomised the experiment would, as we have shown, have fallen on the verge of significance. Galton was led greatly to overestimate its conclusiveness through the major error of attempting to estimate the reliability of the comparisons by rearranging the two series in order of magnitude. His discussion shows, in other respects, an over-confidence in the power of statistical methods to remedy the irregularities of the actual data. In particular, the attempt mentioned by Darwin to improve on the simple averages of the two series “by a more correct method . . . by including the heights, as estimated in accordance with statistical rules, of a few plants which died before they were measured,” seems to go far beyond the limits of justifiable inference, and is one of many indications that the logic of statistical induction was in its infancy, even at a time when the technique of accurate experimentation had already been notably advanced.
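The contrast drawn in this section, between an independent coin toss for every pair and a single toss governing all pairs, can be made concrete. The following Python sketch is an illustration only and not part of the original text. The function names, the labels "crossed" and "selfed", and the use of a seeded pseudo-random generator in place of a physical coin are all assumptions.

```python
import random


def assign_sites(n_pairs, rng):
    """Toss a coin independently for each pair of sites.

    Returns a list of (first_site, second_site) tuples, each holding the
    labels "crossed" and "selfed" in an independently randomised order.
    """
    layout = []
    for _ in range(n_pairs):
        if rng.random() < 0.5:  # one independent toss per pair
            layout.append(("crossed", "selfed"))
        else:
            layout.append(("selfed", "crossed"))
    return layout


def single_toss_layout(n_pairs, rng):
    """The insufficient alternative: one toss fixes the side for every pair.

    Any cause that favours one side of every pot is then confounded with
    the treatment and never appears in the estimate of error.
    """
    if rng.random() < 0.5:
        order = ("crossed", "selfed")
    else:
        order = ("selfed", "crossed")
    return [order] * n_pairs


rng = random.Random(1935)  # arbitrary seed, used only for reproducibility
for pot, sites in enumerate(assign_sites(15, rng), start=1):
    print(pot, sites)
```

Under `assign_sites` each ingredient of error enters the average with a sign decided by its own toss. Under `single_toss_layout` all fifteen signs are decided together, so a systematic advantage of one side would shift the average without contributing to the observed discrepancies between pairs.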
21. Test of a Wider Hypothesis

It has been mentioned that “Student’s” t test, in conformity with the classical theory of errors, is appropriate to the null hypothesis that the two groups of measurements are samples drawn from the same normally distributed population. This is the type of null hypothesis which experimenters, rightly in the author’s opinion, usually consider it appropriate to test, for reasons not only of practical convenience, but because the unique properties of the normal distribution make it alone suitable for general application. There has, however, in recent years, been a tendency for theoretical statisticians, not closely in touch with the requirements of experimental data, to stress the element of normality, in the hypothesis tested, as though it were a serious limitation to the test applied. It is, indeed, demonstrable that, as a test of this hypothesis, the exactitude of “Student’s” t test is absolute. It may, nevertheless, be legitimately asked whether we should obtain a materially different result were it possible to test the wider hypothesis which merely asserts that the two series are drawn from the same population, without specifying that this is normally distributed.

In these discussions it seems to have escaped recognition that the physical act of randomisation, which, as has been shown, is necessary for the validity of any test of significance, affords the means, in respect of any particular body of data, of examining the wider hypothesis in which no normality of distribution is implied. The arithmetical procedure of such an examination is tedious, and we shall only give the results of its application in order to show the possibility of an independent check on the more expeditious methods in common use.
On the hypothesis that the two series of seeds are random samples from identical populations, and that their sites have been assigned to members of each pair independently at random, the 15 differences of Table 3 would each have occurred with equal frequency with a positive or with a negative sign. Their sum, taking account of the two negative signs which have actually occurred, is 314, and we may ask how many of the $2^{15}$ numbers, which may be formed by giving each component alternatively a positive and a negative sign, exceed this value. Since ex hypothesi each of these $2^{15}$ combinations will occur by chance with equal frequency, a knowledge of how many of them are equal to or greater than the value actually observed affords a direct arithmetical test of the significance of this value.

It is easy to see that if there were no negative signs, or only one, every possible combination would exceed 314, while if the negative signs are 7 or more, every possible combination will fall short of this value. The distribution of the cases, when there are from 2 to 6 negative values, is shown in the following table:

TABLE 4
Number of combinations of differences, positive or negative, which exceed or fall short of the total observed

Number of
negative values     > 314     = 314     < 314      Total
0                       1        ..        ..          1
1                      15        ..        ..         15
2                      94         1        10        105
3                     263         3       189        455
4                     302        11     1,052      1,365
5                     138        12     2,853      3,003
6                      22         1     4,982      5,005
7 or more              ..        ..    22,819     22,819
Total                 835        28    31,905     32,768

In just 863 cases out of 32,768 the total deviation will have a positive value as great as or greater than that observed. In an equal number of cases it will have as great a negative value. The two groups together constitute 5.267 per cent. of the possibilities available, a result very nearly equivalent to that obtained using the t test with the hypothesis of a normally distributed population.
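The enumeration summarised in Table 4, tedious by hand, is a small task for a computer. The sketch below, in Python and not part of the original text, visits all $2^{15}$ assignments of signs directly. It assumes the fifteen differences of Table 3 with -67 and -48 as the two negative values, and the function name `sign_flip_counts` is illustrative.

```python
from itertools import product

# Differences of Table 3 in eighths of an inch (assumed signs: -67 and -48).
DIFFERENCES = [49, -67, 8, 16, 6, 23, 28, 41, 14, 29, 56, 24, 75, 60, -48]


def sign_flip_counts(diffs):
    """Count the sign assignments whose total exceeds, equals, or falls
    short of the total actually observed."""
    observed = sum(diffs)
    magnitudes = [abs(x) for x in diffs]
    greater = equal = less = 0
    # Every difference is given a positive and a negative sign in turn,
    # each of the 2**n combinations being equally likely on the hypothesis.
    for signs in product((1, -1), repeat=len(magnitudes)):
        total = sum(s * m for s, m in zip(signs, magnitudes))
        if total > observed:
            greater += 1
        elif total == observed:
            equal += 1
        else:
            less += 1
    return greater, equal, less


greater, equal, less = sign_flip_counts(DIFFERENCES)
n_total = greater + equal + less
one_sided = greater + equal  # as great as or greater than that observed
print(f"> {greater}, = {equal}, < {less}, total {n_total}")
print(f"two-sided proportion: {100 * 2 * one_sided / n_total:.3f} per cent.")
```

Doubling the one-sided count relies on the symmetry of the sign-flip distribution about zero, which is the same symmetry the text invokes when it says that an equal number of cases give as great a negative value.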
Slight as it is, indeed, the difference between the tests of these two hypotheses is partly due to the continuity of the t distribution, which effectively counts only half of the 28 cases which give a total of exactly 314, as being as great as or greater than the observed value.

Both tests prove that, in about 5 per cent. of trials, samples from the same batch of seed would show differences just as great, and as regular, as those observed; so that the experimental evidence is scarcely sufficient to stand alone. In conjunction with other experiments, however, showing a consistent advantage of cross-fertilised seed, the experiment has considerable weight; since only once in 40 trials would a chance deviation have been observed both so large, and in the right direction.

How entirely appropriate to the present problem is the use of the distribution of t, based on the theory of errors, when accurately carried out, may be seen by inserting an adjustment, which effectively allows for the discontinuity of the measurements. This adjustment is not usually of practical importance, with the t test, and is only given here to show the close similarity of the results of testing the two hypotheses, in one of which the errors are distributed according to the normal law, whereas in the other they may be distributed in any conceivable manner. The adjustment * consists in calculating the value of t as though the total difference between the two sets of measurements were less than that actually observed by half a unit of grouping; i.e. as if it were 313 instead of 314, since the possible values advance by steps of 2. The value of t is then found to be 2.139 instead of 2.148.

* This adjustment is an extension to the distribution of t of Yates’ adjustment for continuity, which is of greater importance in the distribution of χ², for which it was developed.
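The adjusted value of t can be reproduced from the total and the sum of squares alone. The Python sketch below is not part of the original text and rests on one assumption that the text does not spell out: the sum of squares, 26,518, is held fixed while the total is reduced from 314 to 313. That reading does yield the two values quoted. The function name `t_from_total` is illustrative.

```python
import math


def t_from_total(total, sum_sq, n):
    """Student's t for the mean of n differences, computed from their
    total and their sum of squares."""
    # Sum of squared deviations: S(x^2) - (total * mean)
    ss_dev = sum_sq - total * total / n
    std_err = math.sqrt(ss_dev / (n - 1) / n)
    return (total / n) / std_err


N_PAIRS = 15
SUM_SQ = 26518  # sum of squares of the 15 differences, as stated in the text

t_unadjusted = t_from_total(314, SUM_SQ, N_PAIRS)
# Possible totals advance by steps of 2, so half a unit of grouping is 1.
t_adjusted = t_from_total(313, SUM_SQ, N_PAIRS)

print(f"unadjusted t = {t_unadjusted:.3f}")
print(f"adjusted t   = {t_adjusted:.3f}")
```

The steps of 2 arise because reversing the sign of any single difference d changes the total by 2d, so every attainable total has the same parity as 314.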
The following table shows the effect of the adjustment on the test of significance, and its relation to the test of the more general hypothesis.

TABLE 5

                                            Probability of a positive difference
                                     t      exceeding that observed
Normal hypothesis    unadjusted    2.148    2.485 per cent.
                     adjusted      2.139    2.529 per cent.
General hypothesis                  ..      2.634 per cent.

The difference between the two hypotheses is thus equivalent to little more than a probability of one in a thousand.

21.1. “Non-parametric” Tests

In recent years tests using the physical act of randomisation to supply (on the Null Hypothesis) a frequency distribution, have been largely advocated under the name of “Non-parametric” tests. Somewhat extravagant claims have often been made on their behalf. The example of this Chapter, published in 1935, was by many years the first of its class. The reader will realise that it was in no sense put forward to supersede the common and expeditious tests based on the Gaussian theory of errors. The utility of such non-parametric tests consists in their being able to supply confirmation whenever, rightly or, more often, wrongly, it is suspected that the simpler tests have been appreciably injured by departures from normality.

They assume less knowledge, or more ignorance, of the experimental material than do the standard tests, and this has been an attraction to some mathematicians who often discuss experimentation without personal knowledge of the material. In inductive logic, however, an erroneous assumption of ignorance is not innocuous; it often leads to manifest absurdities. Experimenters should remember that they and their colleagues usually know more about the kind of material they are dealing with than do the authors of text-books written without such personal experience, and that a more complex, or less intelligible, test is not likely to serve their purpose better, in any sense, than those of proved value in their own subject.

REFERENCES AND OTHER READING

C. Darwin (1876). The effects of cross- and self-fertilisation in the vegetable kingdom. John Murray, London.
R. A. Fisher (1925). Applications of “Student’s” distribution. Metron, v. 90-104.
R. A. Fisher (1925-1963). Statistical methods for research workers. Chap. V., §§ 23-24.
R. A. Fisher (1956, 1959). Statistical methods and scientific inference. Oliver and Boyd Ltd., Edinburgh.
“Student” (1908). The probable error of a mean. Biometrika, vi. 1-25.

IV

AN AGRICULTURAL EXPERIMENT IN RANDOMISED BLOCKS

22. Description of the Experiment

In pursuance of the principles indicated by the discussions in the previous chapters we may now take an example from agricultural experimentation, the branch of the subject in which these principles have so far been most explicitly developed, and in which the advantages and disadvantages of the different methods open to the experimenter may be most clearly discussed.

We will suppose that our experiment is designed to test the relative productivity, or yield, of five different varieties of a farm crop; and that a decision has already been arrived at as to what produce shall be regarded as yield. In the case of cereal crops, for example, we may decide to measure the yield as total grain, or as grain sufficiently large not to pass a specified sieve, or as grain and straw valued together at predetermined prices, or in whatever method may be deemed appropriate for the purposes of the experiment. Our object is to determine whether, on the soil or in the climatic conditions experienced by the test, any of the varieties tested yield more than others, and, if so, to evaluate the differences with a determinate degree of precision.

We shall suppose that the experimental area is divided into eight compact, or approximately square, blocks, and that each of these is divided into five plots running from end to end of the block, and lying side
by side, making forty plots in all. Apart from the differences in variety to be used, the whole area is to have uniform agricultural treatment. At harvest, narrow edges about a foot in width for cereal crops, or the width of a single row for larger plants, such as roots and potatoes, are to be discarded from experimental yields; the central portions, cut to be of equal area, are to be harvested, and the produce weighed, or, if preferred, measured in some other manner.

In each block the five plots are assigned one to each of the five varieties under test, and this assignment is made at random. This does not mean that the experimenter writes down the names of the varieties, or letters standing for them, in any order that may occur to him, but that he carries out a physical experimental process of randomisation, using means which shall ensure that each variety has an equal chance of being tested on any particular plot of ground. A satisfactory method is to use a pack of cards numbered from 1 to 100, and to arrange them in random order by repeated shuffling. The varieties are then numbered from 1 to 5, and any card such as number 33, for example, is deemed to correspond to variety number 3, because on dividing by 5 this number is found as the remainder. Numbers divisible by 5 will correspond to variety number 5. The order of varieties in each block may then be quickly determined from the order of the cards in the pack, after thoroughly shuffling. The remainder corresponding to any variety is disregarded after its first occurrence in the block. Since 5 is a divisor of a hundred, each variety will be represented by 20 cards, and the probabilities of each appearing in any particular place will be equal. If we had been randomising six varieties we should have used a number of cards divisible by 6, for example
