
# Logical Models in Social Science


05/25/2014



Version of April 2012

Logical Models in Social Sciences: How to Begin
In 2008: Beginners’ Logical Models in Social Sciences

The version at http://www.psych.ut.ee/stk/Beginners_Logical_Models.pdf may still be the one from October 2009. The differences are fairly minor, except in B4. Compared to October 2010, the changes are minimal.

Contents and Expanded contents

Preface: Science Walks on Two Legs – Observation and Thinking

A. Getting a Feel for Logical Models: Examples and Tools
A1. A Trivial Game with Serious Consequences
A2. What Are “Logical Models”? Gorilla and Gravitation
A3. Relative Error: Percent and “By a Factor of”
A4. The Largest Component: Between Mean and Total
A5. Forbidden and Allowed Regions: Logarithmic Scales
A6. Volatility: Allowed Regions, Anchor Points, and Simplification
A7. Duration of Cabinets: The Number of Communication Channels
A8. How to Draw Graphs on Regular Scales
A9. How to Draw Graphs on Logarithmic Scales
A10. Fermi’s Piano Tuners: “Exact” Science and Approximations
A11. Approximations and Cumulating Error
A12. How to Test Logical Models: Logical testing and Testing with Data
A13. Examples of Models across Social Sciences

B. Linear Regression and Logical Models
B1. The Basics of Linear Regression and Correlation Coefficient $R^2$
B2. Symmetric Regression and its Relationship to $R^2$
B3. When is Linear Fit Justified?
B4. Graph More than just the Data!
B5. Getting a Feel for Exponentials and Logarithms
B6. When to Fit with What
B7. Medians and Geometric Means
B8. What to Look for and Report in Multivariable Linear Regression
B9. Substantive vs. Statistical Significance

C. Examples of Gradual Model Building
C1. Choice of Examples
C2. Ingredients of Federalism: Allowed Areas and Regression Lines
C3. Capitalism and Democracy: Applying $Y=X^k$
C4. The Importance of Slopes in Model Building
C5. Support for Democrats in US States: Applying $Y=X^k$
C6. How the Number of Parties Connects: Exponential Fit
C7. Moderate Districts, Extreme Representatives: Complex fits
C8. Centrist Voters, Leftist Elites: A Cautionary Note
C9. Toward Social SCIENCE

References

Expanded contents

A. Getting a Feel for Logical Models: Examples and Tools

A1. A Trivial Game with Serious Consequences
  A guessing game
  Skytte prize 2008
  Means and median
  Does this answer make sense?
A2. What Are “Logical Models”? Gorilla and Gravitation
  Quantitatively predictive logical models
  The invisible gorilla
  Gravitation undetected
  The gorilla moment for the number of seat-winning parties
A3. Relative Error: Percent and “By a Factor of”
  Absolute and relative differences and errors
  Large relative differences and errors
  E: Addition-subtraction versus multiplication-division – Percent differences as a mixed bag
A4. The Largest Component: Between Mean and Total
  Between mean and total
  Directional and quantitative models
  A cautioning example: Center of the room
  E: The smallest share
A5. Forbidden and Allowed Regions: Logarithmic Scales
  Regular scale and its snags
  Logarithmic scale
  Logarithms of numbers
A6. Volatility: Allowed Regions, Anchor Points, and Simplification
  A coarse model for volatility
  Testing the model with data
  Testing the model for logical consistency
  How to measure the number of parties
A7. Duration of Cabinets: The Number of Communication Channels
  The number of communication channels among n actors
  Average duration of governmental cabinets
  Dimensional consistency
  Laws and models
A8. How to Draw Graphs on Regular Scales
  Constructing the framework on regular scale
  Placing data and theoretical curves on regular-scale graph
  Making sense of the graph
  Converting to logarithms and graphing on regular scales
  Using the graph of logarithms to find constant k
  How I found the inverse square law of cabinet duration
A9. How to Draw Graphs on Logarithmic Scales
  Finding logarithms of numbers
  Fully logarithmic or log-log graphs
  Semilog graphs
  Regular, semilog and log-log graphs – when to use which?
A10. Fermi’s Piano Tuners: “Exact” Science and Approximations
  As exact as possible – and as needed
  How many piano tuners?
  The range of possible error
  E: Dimensional consistency
A11. Approximations and Cumulating Error
  Addition: Absolute errors add
  Subtraction: Absolute errors still add
  Multiplication: Small percent errors add
  Multiplication with large relative errors: Errors “by a factor of” multiply
  Division: Errors “by a factor” multiply, small percent errors add
  Do standard formulae overdo cumulating error?
  E1: Combined error on additions
  E2: Combined error on multiplications
  E3: Untested corrections for standard formulae on cumulating error
A12. How to Test Logical Models: Logical testing and Testing with Data
  Logical testing
  Testing with data
  The tennis match between data and models
  Why would the simplest forms prevail?
  What can we see in this graph?
A13. Examples of Models across Social Sciences
  Sociology: The number of speakers of a language and the number of journals
  Political history: Growth of empires
  Demography: The critical date around 2025
  Economics: Trade/GDP ratio

B. Linear Regression and Logical Models

B1. The Basics of Linear Regression and Correlation Coefficient $R^2$
  Regression of y on x
  Reverse regression of x on y
  Directionality of the two OLS lines
  Non-transitivity of OLS regression
  Correlation coefficient $R^2$
  The $R^2$ for the best fit: To which line does it correspond?
B2. Symmetric Regression and its Relationship to $R^2$
  From minimizing the sum of squares to minimizing the sum of rectangles
  The relationships between $R^2$ and the slopes of regression lines
B3. When is Linear Fit Justified?
  Many data clouds do not resemble ellipses
  Grossly different patterns can lead to the same regression lines and $R^2$
  Sensitivity to outliers
  Empirical configuration and logical constraints
B4. Graph More than just the Data!
  Graphing just the data and regressing linearly
  Graph the equality line, if possible
  Graph the conceptually allowed area
  Find the equation of the curve joining the anchor points
  Why linear fit may lack comparison power
B5. Getting a Feel for Exponentials and Logarithms
  Exponents of 10
  Fractional exponents of 10
  Decimal logarithms
  What are logarithms good for?
  Calculation of constants in $y=ax^k$
  Exponents of numbers other than 10
  Logarithms on other bases than 10
B6. When to Fit with What
  Unbounded field – try linear fit
  Calculating the parameters of a line in unbounded field
  One quadrant allowed – try fixed exponent fit
  Two quadrants allowed – try exponential fit
  How to turn curves into straight lines, before regression
  Calculating the parameters of fixed exponent equation in a single quadrant
  Calculating the parameters of exponential equation in two quadrants
B7. Medians and Geometric Means
  Arithmetic mean and normal distribution
  Geometric mean and lognormal distribution
  Means and the median
  The sticky case of almost lognormal distributions
  Conclusion: The conceptual range matters
B8. What to Look for and Report in Multivariable Linear Regression
  Making use of published multi-variable regression tables: A simple example
  Guarding against colinearity
  Running a multi-variable linear regression
  Processing data prior to exploratory regression
  Running exploratory regression
  Lumping less significant variables: The need to report all medians and means
  Re-running exploratory regression with fewer variables
  Report the domains of all variables!
  Model-testing regression
B9. Substantive vs. Statistical Significance

C. Examples of Gradual Model Building

C1. Choice of Examples
  Graph more than the data!
C2. Ingredients of Federalism: Allowed Areas and Regression Lines
  Constitutional rigidity and judicial review
  Degree of federalism and central bank independence
  Bicameralism and degree of federalism
  Conversion to scale 0 to 1
C3. Capitalism and Democracy: Applying $Y=X^k$
  Support for democracy and capitalism
  Fit with fixed exponent function
  Is this model building?
C4. The Importance of Slopes in Model Building
  Many variables are interdependent, not “independent” or “dependent”
  Notation for slopes
  Assembly size which minimizes communication load
  The model “slope proportional to size” – exponential growth
  The slope of fixed exponent function
  How slopes combine when functions are combined
C5. Support for Democrats in US States: Applying $Y=X^k$
  Solution
C6. How the Number of Parties Connects: Exponential Fit
  Interest group pluralism
  Fitting an exponential curve to interest group pluralism data
  Frequency of minimal winning and/or one-party cabinets
  Electoral disproportionality
  The slope of exponential function (Appendix, temporary)
C7. Moderate Districts, Extreme Representatives: Complex fits
  The simplest fit of data to conceptual constraints
  Why do the representatives’ positions vary as they do?
  The envelopes of the data cloud
  Conclusions
C8. Centrist Voters, Leftist Elites: A Cautionary Note
C9. Toward Social SCIENCE

How to determine the value of k in $X^k$

Preface: Science Walks on Two Legs – Observation and Thinking


While expanding on the published graphs, I gradually reinforce the ability to use formats such as exponential and fixed exponent (and the ability to recognize when to use them!). This makes use of tools developed in Part B, such as the ability to handle logarithms and exponents.

The more I teach, the more I become aware of small but essential tricks of the trade that I have taken for granted but that are not self-evident to students. Hence, further revisions of this volume are likely to insert even more methodological notes.

It might be said that what has been presented here does not amount at all to a logical model construction methodology – it is just a mix of basic algebra, logarithms and exponentials, statistics, differential calculus, and so on – often even on a primitive rather than a “basic” level. So it is indeed. There is no separate logical model construction methodology in the sense of distinct mathematics. It is a matter of making use of the usual mathematical tools in possibly less usual ways than is the case in automatic application of formulas or in push-button computation. This is why I stress those “primitive” stages which involve thinking, and which students often overlook or forget in the rush towards ever more complex equations. Just ask a student proficient in running OLS to explain it to an elementary school kid at the level of Figure B1.2 – and you’ll find out if he really understands. Or try slopes, at the level of Figure C4.1. But this is the level most needed for the type of thinking involved in choosing the suitable logical model. When the nature of the problem needs more complex mathematics, the student should be able to recognize it and know where to go for help.

If one should feel that all this is not so different from the contents of usual methods courses in social sciences, one would be right. It’s just a matter of emphasis on basic understanding.

Exercises are interspersed throughout the text. When this book is used as a textbook in courses, the exercises are meant to be partly solved by students in class and partly assigned as homework to be turned in and corrected. I do have a set of brief answers for part of the exercises, but I am reluctant to distribute them electronically. In most cases fairly unique answers exist and should be clear to an instructor who has studied this book and Making… Indeed, I would not want this book “taught” by someone who needs a set of answers. If an issue looks ambiguous, one might discuss it with students. Ability to think is served better by an inconclusive discussion than by a machine-readable set of answers.

Many students at the University of Tartu and at the University of California, Irvine have contributed to the present draft by their questions and also by their mistakes when solving the problems presented in exercises. These showed me where my wordings were faulty or where more intermediary steps were needed. Many colleagues also have helped, wittingly and unwittingly. Mirjam Allik, who co-taught with me a course based on a previous draft, first comes to mind, but the long list of people acknowledged in the Preface of Making… could well be repeated. Special thanks go to colleagues who graciously agreed to have their graphs subjected to further analysis in Part C: Arend Lijphart, Russell J. Dalton along with Doh Chull Shin, and Richard Johnston along with Michael G. Hagen, and Kathleen H. Jamieson. I still have not had time to ask Pippa Norris and Ronald Inglehart for permission to include one of their graphs. As this draft gets further testing in the classroom, the list will expand.

Rein Taagepera

Science walks on two legs: Observation and Thinking

[Scheme, p. 8 of the original: SCIENCE stands on two legs. The Observation leg asks how things ARE and rises from measurement to data, data analysis (statistical etc.) and empirical relationships. The Thinking leg asks how things SHOULD BE on logical grounds and rises from directional prediction to quantitatively predictive logical models. The two legs meet in statistical testing of quantitatively predictive logical models.]

Today’s social science hops on one leg: Observation

[Scheme, p. 8 of the original: Today’s SOCIAL SCIENCE has the same Observation leg (measurement, data, statistical data analysis, empirical relationships), but its Thinking leg stops at directional prediction. The two legs meet in statistical testing of directional prediction.]

A. Getting a Feel for Logical Models: Examples and Tools
A1. A Trivial Game with Serious Consequences

So you did have some information, after all – you knew the lower and upper limits, beyond which the answer cannot be on logical grounds. At this point make a guess in the range 1 to 100 and write it down. Call this number p, the number of parties.

Now, let us proceed systematically. When such conceptual limits are given, our best guess would be halfway between the limits. In the absence of any further information, nothing else but the mean of the limits could be justified. However, there are many kinds of means. The good old arithmetic mean of 1 and 100 would be around 50: (1+100)/2=50.5. But having 50 parties getting seats would mean that on the average they would win only two seats each. Only two seats per party? This might look rather low. If so, then let’s ask which number would not look too low.

How many seats would you expect those parties to win on the average? How would you proceed now? Yes, you should again think about conceptual limits. The average number of seats per party must be at least 1 (when every party wins only one seat) and at most 100 (when one party wins all the seats). At this point, make a guess in the range 1 to 100 and write it down. Call this number s, the number of seats per party.

Now, if you really think that p parties win an average of s seats each, then the total number of seats must be the product p times s. Calculate this product for the two guesses you have written down. If this product isn’t 100 seats, your two guesses do not fit together. In particular, guessing at the two arithmetic means runs into gross inconsistency. Indeed, about 50 parties winning about 50 seats each – this would require about 2500 seats – way above the 100 we started with! This approach clearly does not work.

If the product of your two guesses came out as 100 seats, congratulations – your guesses are mutually consistent. But these guesses might still be less than optimal. Suppose someone guessed at 5 parties winning an average of 20 seats each, while someone else guessed at 20 parties winning an average of 5 seats each. What would be the justification for assuming that there are more parties than seats per party – or vice versa? In the absence of any further information on which way the tilt goes, the neutral assumption is that the two are equal, meaning 10 parties winning an average of 10 seats each. This is what we call the geometric mean.

The geometric mean of two numbers is the square root of their product. This means that the square of the geometric mean equals the product of these two numbers. In the present case, 10 times 10 is the same as 1 times 100. It will be explained later on why we should use the geometric mean (rather than the arithmetic) whenever we deal with quantities that logically cannot go negative. This is certainly the case for numbers of parties or seats.
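The consistency check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `arithmetic_mean` and `geometric_mean` and the constant `TOTAL_SEATS` are introduced here and do not come from the text.

```python
import math

def arithmetic_mean(low, high):
    """Midpoint of the two limits on an additive scale."""
    return (low + high) / 2

def geometric_mean(low, high):
    """Square root of the product; both limits must be positive."""
    if low <= 0 or high <= 0:
        raise ValueError("geometric mean needs positive limits")
    return math.sqrt(low * high)

TOTAL_SEATS = 100            # assembly size used in the text
LOW, HIGH = 1, TOTAL_SEATS   # conceptual limits for parties and for seats per party

for name, mean in (("arithmetic", arithmetic_mean), ("geometric", geometric_mean)):
    parties = mean(LOW, HIGH)           # guess for p
    seats_per_party = mean(LOW, HIGH)   # guess for s
    implied_total = parties * seats_per_party
    print(f"{name}: p={parties:g}, s={seats_per_party:g}, p*s={implied_total:g}")
```

Only the geometric pair of guesses multiplies back to the 100 seats we started with; the arithmetic pair implies roughly 2500 seats.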

Do we have data to test the guess that 10 parties might win seats? Yes, from 1918 to 1952 The Netherlands did have a first chamber of 100 seats, allocated on the basis of nationwide vote shares. Over these 9 elections the number of seat-winning parties ranged widely, from 8 up to as many as 17. But the geometric mean was 10.3 parties, with an average of 9.7 seats per party. This is pretty close to 10 parties with an average of 10 seats. As you see, we could make a prediction with much less information than you may have thought necessary. And this ignorance-based approach actually worked!

This is what I call an ignorance-based logical model. It is based on nearly complete ignorance. All we knew were the conceptual limits 1 and 100. More broadly, if something can conceptually range only from 1 to M, then our best guess is the geometric mean of 1 and M, which is the square root of M. It is written as $M^{1/2}$ or $M^{0.5}$. Even more broadly, when the conceptual limits are A and B, both positive, then our best guess is the geometric mean of A and B:

Best guess between limits A and B = $(AB)^{1/2}$.

This is so “in the absence of any further information.” This is a term we will encounter often in this book. It’s the key for building parsimonious logical models.

Skytte Prize 2008

Why have I dwelled so long on this simple guessing game? In 2008, I received the Skytte Prize, one of the highest in political science, worth 500,000 Swedish kronor (roughly 50,000 euros or 75,000 US dollars). And I basically received it for this guessing game! Of course, lots of further work and useful results followed, but the breakthrough moment came around 1990, when I was puzzled about the number of seat-winning parties and suddenly told myself: Consider the mean of the extremes.

Using this approach twice enabled me to calculate the number of parties in the entire representative assembly when a country allocates assembly seats in many districts. All I needed was assembly size and the number of seats allocated in the average electoral district. In turn, the number of parties could be used to determine the average duration of governmental cabinets. Here the logical model is quite different from the previous – we’ll come to that. The overall effect is that we can design for a desired cabinet duration by manipulating the assembly size and the number of seats allocated in the average district. This is of practical use, even while the range of error is as yet quite large.

Means and median

The median is the point where half the values are smaller and half are larger. E.g., for the 5 numbers 2, 3, 5, 7 and 10, it is 5. The arithmetic mean of n values adds the values, then divides by n. For the 5 numbers above, it is (2+3+5+7+10)/5=27/5=5.40. In other words, 5.4+5.4+5.4+5.4+5.4=2+3+5+7+10.
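The median and arithmetic mean of the five numbers can be checked with Python's standard `statistics` module. A minimal sketch, assuming Python 3.8 or later (needed for `statistics.geometric_mean`); the geometric mean of the same five numbers is added here only for comparison and is not a figure quoted in the text.

```python
import statistics

values = [2, 3, 5, 7, 10]  # the five numbers used in the text

median = statistics.median(values)
arithmetic = statistics.mean(values)
# Not quoted in the text: geometric mean of the same values, for comparison.
geometric = statistics.geometric_mean(values)

print(f"median          = {median}")
print(f"arithmetic mean = {arithmetic:.2f}")
print(f"geometric mean  = {geometric:.2f}")

# Dutch figures quoted above: 10.3 parties times 9.7 seats per party
# should come close to the 100 seats available.
print(f"10.3 * 9.7 = {10.3 * 9.7:.1f} seats")
```

For positive values that are not all equal, the geometric mean comes out below the arithmetic mean, which is one reason the choice of mean matters in the guessing game.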


A2. What Are “Logical Models”? Gorilla and Gravitation

constructions. So, when we think we have a model, simple or complex, we should verify whether it really holds. This is what testing a model means, in science.

Science walks on two legs

Science walks on two legs – see the scheme on p. 8. One leg refers to the question: How things are? It leads to careful observation, description, measurement, and statistical analysis. The other leg refers to the question: How things should be, on logical grounds? That question guides the first one. The question “How things are?” assumes that we know which aspects of things are worth paying attention to. But we largely see only what we look for. And it’s the question “How things should be?” that tells us what to look for. This is the question we asked about the number of seat-winning parties, even while it may not look so.

That science walks on two legs is a notion as old as social sciences. Auguste Comte, one of the initiators of social studies, put it as follows, two centuries ago, in his Plan of Scientific Studies Necessary for Reorganization of Society:

If it is true that every theory must be based upon observed facts, it is equally true that facts cannot be observed without the guidance of some theory. Without such guidance, our facts would be desultory and fruitless; we could not retain them: for the most part we could not even perceive them.

We largely see only what we look for. It almost seems in this quote as if the logical model ("theory") would come first, but actually a continuous interaction is meant: "some theory" as guidance, some observation, some further model refinement… The chicken and the egg evolve conjointly.

The invisible gorilla

We largely see only what we look for. A gorilla taught me so. At a science education conference in York we were shown a film clip. People were playing basketball, and we were instructed to count the number of passes. The action was too fast for me. I soon gave up counting and just watched idly, waiting for the end. Thereafter we were asked if we had noticed anything special. A few laughed knowingly and shouted: “The gorilla!” Oh my! I vaguely recalled reading something about a gorilla experiment. Was I among the ones taken in? I surely was. The clip was run again. While the game went on in the background, a person in a gorilla suit slowly walked across in the foreground. Center stage it stopped, turned and looked at us, then continued at a slow pace and exited. It was as plain as anything could be, once we were given a hint – but without such a hint most of us had not seen him!

In science, the word “should” – as in “How things should be?” – is often the gorilla word. We see only what we think we should look for. If science were reduced to supposedly hard-boiled observation and analysis of facts but nothing else, we might improve our ability to count the passes while still missing the gorilla. The following test illustrates it.

Gravitation undetected

I sent three dozen social scientists data where the output was calculated exactly from the formula for the universal law of gravitation – but I didn’t tell them that. The law is $F = GMm/r^2$ – the force of attraction F between two bodies is proportional to their masses (M and m) and inversely proportional to the square of their distance (r). G is a universal constant. I simply sent my colleagues a table of values of y, $x_1$, $x_2$, and $x_3$, and told them that y might depend on the other three variables. Where did I get the x-values? I picked essentially random values, using a telephone book. Then I used $y = 980\,x_1 x_3 / x_2^2$ to calculate y.

What was the purpose of this experiment? If data analysis sufficed to detect how things are connected, some of my colleagues should have found the actual form of the relationship. All of them were highly competent in data analysis. Social data usually comes with large random variation, which makes detection of regularities so much more difficult. My pseudo-data had no such scatter. Yet no one found the form of the relationship. They tried out standard formulas used by statisticians, usually of the type $y = a + bx_1 - cx_2 + dx_3$. This linear expression uses addition or subtraction, while the actual equation involves multiplication and division.

If only they had found no connection! But it was worse than that. All those who responded found quite satisfactory results by the usual criteria of data analysis. Why is it worse than finding nothing? If we do not get a satisfactory result, we may keep on working. But if we get a result that looks satisfactory, yet is off the mark, then we stop struggling to find anything better. I cannot blame my colleagues for not detecting the law of gravity – I only gave them plenty of “what is” but no clue about “what should be”. Small wonder they missed it.

The gorilla moment for the number of seat-winning parties

Of course, the distinction between “is” and “should” isn’t always so clean-cut. Most often they enter intermixed. Consider how the number of seat-winning parties eluded me (and everyone else) for decades. The obvious part was that this number depends on how people vote – but pinning down typical voting patterns was even harder than counting the number of passes in a basketball game. In hindsight, this difficulty applied only to districts with many seats. In one-seat districts obviously one and only one party would win the single seat available, regardless of how many parties competed. This observation looked trivial and not even worth stating – and this was the problem. It was “obviously” irrelevant to the puzzle in multi-seat districts. But suppose someone had spelled out the following: “In a one-seat district, the number of seat-winning parties is one, regardless of how voters vote.” It would have been like someone shouting: “The gorilla!” Indeed, if the way voters vote is completely
overridden by something else in one-seat districts, could it be the same in multi-seat districts, at least partially? We are directed to see that the total number of seats available should matter, at least on the average. The rest was easy.

When did the shift take place, from fact-oriented “is” toward “should”? It’s hard to say. The observation that a district of 100 seats offers room for more parties than a district of 10 seats or a district of 1 seat is sort of factual. Yet it brings in a fact that previously was thought irrelevant. It supplied the jumping board for the first sentence above where the word “should” explicitly enters: “The total number of seats available should matter.” Moreover, don’t overlook the expression “on the average”. Dwarfs and giants may occur, but first try to pin down the usual.

______________________________________________________________________
Exercise A2.1
As I was completing secondary school in Marrakech, in 1953, the first Moroccan uprisings against the French “protectorate” took place in Casablanca. The French-controlled newspapers wrote that 40 people were killed. Our Moroccan servant, however, reported rumors that several thousand were. Take this to mean around 4,000. My friend Jacques, with family ties in high military circles, asked me to guess how many people actually were killed, according to classified army reports.
a) Which estimate did I offer, in the absence of any further information? Write down your best reasoned guess.
b) What do you think is the broad purpose of this exercise?
______________________________________________________________________
Exercise A2.2
There is an animal they call lmysh in Marrakech. What is your best guess at roughly how much a lmysh weighs? The only information is that lmysh is a mammal. The smallest mammal is a shrew (about 3 grams), and the largest is a blue whale (30 metric tons).
a) Convert grams and tons into kilograms. 1 ton = 1000 kg. 1 gram = 1/1000 kg.
b) Estimate the weight of lmysh.
c) What do you think is the broad purpose of this exercise?
______________________________________________________________________
Exercise A2.3
It is easy to give lip service to this notion of model-observation interweaving, yet quickly deviate from it. This is illustrated in a remarkable book on How Math Explains the World (Stein 2008: 30-31). It approvingly offers the quote of Comte above, yet continues on the very next page:

Most of the great theories of physics represent the scientific method in full flower. Experiments are conducted, data is gathered, and a mathematical framework explaining the data is constructed. Predictions are made…

a) Specify what comes first and what follows, in this description.
b) Why could it be argued that "mathematical framework explaining the data" is ambiguous? HINT: Compare "explaining" and "accounting for", as well as "data" and "phenomenon".
______________________________________________________________________
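The gravitation experiment described earlier in this chapter can be imitated in a short Python sketch. Everything about the setup below is an assumption made for illustration: the sample size of 40, the range 1 to 10 for the x-values, the random seed, and the helper names `least_squares` and `r_squared` (the text says only that the x-values came from a telephone book). The sketch fits the additive form that the colleagues tried and then the same least-squares routine on logarithms, where a multiplicative law becomes a straight line.

```python
import math
import random

def least_squares(rows, targets):
    """Ordinary least squares with intercept, solved via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = []
    for i in range(k):
        row = [sum(x[i] * x[j] for x in X) for j in range(k)]
        row.append(sum(x[i] * t for x, t in zip(X, targets)))
        A.append(row)
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r_squared(rows, targets, coefs):
    """Share of the variance in targets accounted for by the fitted equation."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, preds))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1 - ss_res / ss_tot

random.seed(0)  # arbitrary seed, only for repeatability
xs = [[random.uniform(1, 10) for _ in range(3)] for _ in range(40)]
ys = [980 * x1 * x3 / x2 ** 2 for x1, x2, x3 in xs]  # formula from the text

# Additive fit: y = a + b*x1 + c*x2 + d*x3
linear = least_squares(xs, ys)

# Multiplicative fit: log y = log a + b*log x1 + c*log x2 + d*log x3
log_xs = [[math.log(v) for v in r] for r in xs]
log_ys = [math.log(y) for y in ys]
loglog = least_squares(log_xs, log_ys)

print("additive fit R^2:", round(r_squared(xs, ys, linear), 3))
print("log-log exponents:", [round(c, 3) for c in loglog[1:]])
print("log-log constant:", round(math.exp(loglog[0]), 1))
```

The routine is identical in both fits; only the form of the equation changes. Taking logarithms is worth trying only if one already suspects multiplication and division, which is the "should be" question that the data alone do not raise.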

A3. Relative Differences: Percent and “By a Factor of”
___________________________________________________________________
• When relative differences are small, it is convenient to express them in percent.
• Percent differences lack symmetry: They cannot be lower than -100%, but they can be very much higher than +100%. Hence a percent error of ±5% is fairly clear, but an error of ±50% becomes ambiguous. Use percent difference or error only when it is much less than 50%.
• When relative differences are large, it is more meaningful to express error ranges by multiplicative factors, such as ×÷2 (“multiply or divide by 2”).
____________________________________________________________________

Exercises in the previous chapter involve estimates. Whenever we have estimates or imperfect measurements, the question arises of how close we are to the actual values. So we have to express the extent of the difference between the two. There are several ways to do so. Indicating the range of error is an inescapable part of any scientific reporting (and thinking), so take it seriously. It will crop up throughout this book.

Absolute and relative differences and errors

Suppose we estimate the size of a crowd at 50 people, while a detailed count yields 75. The absolute difference in our rough measurement is 75-50=25 people. Now suppose we estimate the size of another crowd at 500 people, while a detailed count yields 525. The absolute difference is again 25 people. But the implication is quite different. Being off by 25 out of 75 is pretty coarse, while being off by 25 out of 525 is pretty good:

[Number line on a regular scale, with 0, 50, 75, 500 and 525 marked]

To express this distinction, we use relative difference. It is often expressed in percentages. We take the absolute difference, divide it by the actual figure, and then multiply by 100%. Thus [(75-50)/75]×100%=33% while [(525-500)/525]×100%=5%, which reflects a much smaller undercount. (Your calculator shows 33.333… and 4.7619…, respectively, but this is pointless precision. Even 30% and 5% would tell all we need to know about the extent of the difference.)

The difference between something estimated or measured and the actual value is called measurement error. If we repeatedly measure the length of a building using a yardstick, our results vary. If they vary from 53 to 59 meters, we may express the result as L=56 meters ± 3 meters. The range of absolute error is plus-or-minus 3 meters. If we measure the length of a table with the same yardstick, the absolute error is likely to be smaller. Maybe our measurements range from 1.85 to 1.93 meters. So the length is L’=1.89 meters ± 0.04 meters.

Relative error in terms of percent divides the absolute error by the mean value. For the building, it is 100%×(3/56)=5.4%≈5%. (The wavy symbol ≈ indicates approximate equality.) For the table it is 100%×(0.04/1.89)=2.1%≈2%. So the measurement is more precise for the table than for the building. We write L=56 meters ±5% and L’=1.89 meters ±2%.
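The percent calculations above can be reproduced with a short Python sketch. The function names `percent_difference` and `percent_error_of_range` are introduced here for illustration only; the numbers are the ones given in the text.

```python
def percent_difference(estimate, actual):
    """Absolute difference as a percentage of the actual value."""
    return abs(actual - estimate) / actual * 100

def percent_error_of_range(low, high):
    """Mean value, absolute error and percent error for a range of measurements."""
    mean = (low + high) / 2
    absolute = (high - low) / 2
    return mean, absolute, absolute / mean * 100

# Crowd sizes: (estimate, detailed count)
for estimate, actual in ((50, 75), (500, 525)):
    print(f"crowd: off by {actual - estimate} people, "
          f"{percent_difference(estimate, actual):.0f}% of the count")

# Repeated length measurements in meters: (label, lowest, highest)
for label, low, high in (("building", 53, 59), ("table", 1.85, 1.93)):
    mean, absolute, percent = percent_error_of_range(low, high)
    print(f"{label}: {mean:.2f} m ± {absolute:.2f} m, or ±{percent:.0f}%")
```

Rounding to whole percent in the printout follows the advice in the text: extra decimals would be pointless precision.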


Absolute differences use a philosophy of addition-subtraction: expected value minus actual (or vice versa). Relative differences “by a factor of” use a philosophy of multiplication-division: expected value divided by the actual (or vice versa). Percentage differences mix the two: they subtract and then divide: (expected-actual)/actual. This is where they lose symmetry. (The subsequent multiplication by 100% matters little.) They also are more confusing: Should I divide by the “actual” or by the “expected”? The choice is not always clear-cut. We have now encountered several pairs of categories: different means, differences, errors. Can you tell which members of these pairs go with addition and which go with multiplication?
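The loss of symmetry in percent differences shows up in a two-line comparison. A minimal sketch, assuming positive values throughout; the function names and the example values 50, 100 and 200 are chosen here for illustration and are not from the text.

```python
def percent_diff(expected, actual):
    """Subtract, then divide: (expected - actual) / actual, in percent."""
    return (expected - actual) / actual * 100

def factor_diff(expected, actual):
    """Multiplicative gap between two positive values, reported as a factor >= 1."""
    ratio = expected / actual
    return ratio if ratio >= 1 else 1 / ratio

actual = 100
for expected in (50, 200):  # half and double the actual value
    print(f"expected={expected}: "
          f"{percent_diff(expected, actual):+.0f}%, "
          f"off by a factor of {factor_diff(expected, actual):g}")
```

Being off by half and being off by double are the same error "by a factor of 2", yet in percent they read as -50% and +100%.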

A4. The Largest Component: Between Mean and Total
____________________________________________________________________
• The share of the largest component is often close to the total size divided by the square root of the number of components. This is a quantitative model.
• In contrast, “The largest share tends to go down when the number of components increases” is a directional model.
• We should try to go beyond directional models, because quantitative models have vastly more predictive power.
___________________________________________________________________

The United States has about 300 million people (294 million in 2004) divided among 50 states. (Make it simple and ignore DC and Puerto Rico.) What could the population of the largest state be? Call it $P_1$. The US also has an area of 10 million square kilometers (9.88, if one wishes to be overly precise). What could the area of the largest state be? Call it $A_1$.

Between mean and total

Once again, the first reaction might be that one cannot know, short of running for an almanac. But think about the conceptual limits. Obviously, the largest state cannot exceed the total. But what is the least size it could have? Stop and think. Here the number of states enters. If all states had the same population and area, each would have 300/50=6 million people and 10/50=0.2 million square kilometers. If they are unequal, the largest state must have more than that. So we know that 6 million < $P_1$ < 300 million, and 0.2 million km² < $A_1$ < 10 million km². (Note that the largest by area need not be the largest by population!)

In the absence of any other information, our best estimate is the mean of the extremes. Since population and area cannot go negative, the geometric mean applies. The result is $P_1$=42 million and $A_1$=1.4 million km². Actually, the most populous state was California, 36 million in 2004, and the largest by area is Alaska, 1.7 million km². Our estimates are within a factor of 1.25 of the actual figures – or within 25%.
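The two estimates above can be recomputed in a few lines of Python. A minimal sketch using the rounded figures from the text; the names `largest_component_estimate` and `factor_off` are introduced here for illustration.

```python
import math

def largest_component_estimate(total, n_components):
    """Geometric mean of the lower limit (total / n) and the upper limit (total)."""
    lower = total / n_components
    return math.sqrt(lower * total)

def factor_off(estimate, actual):
    """How far apart two positive values are, as a factor >= 1."""
    ratio = estimate / actual
    return max(ratio, 1 / ratio)

N_STATES = 50
cases = {
    # name: (US total, actual largest state, unit), figures as quoted in the text
    "population": (300, 36, "million people"),
    "area": (10, 1.7, "million km^2"),
}
for name, (total, actual, unit) in cases.items():
    estimate = largest_component_estimate(total, N_STATES)
    print(f"{name}: estimate {estimate:.1f} vs actual {actual} {unit}, "
          f"off by a factor of {factor_off(estimate, actual):.2f}")
```

Both estimates land within the factor of 1.25 stated in the text.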
______________________________________________________________________
Exercise A4.1
Are you still uneasy about the need for the geometric mean? After all, we used the arithmetic means 300/50 = 6 million and 10/50 = 0.2 million to establish the lower limits. OK, take the arithmetic mean of 6 and 300 and see what population it would predict for California. Do the same for area. Roughly how large would the largest component be, compared to the total? Would it agree with your gut feeling?
_____________________________________________________________________

We can repeat the estimates for Canada or Australia, and the same error limits hold. This approach fails, however, when the federal units are purposefully made fairly equal (such as in Austria) or a large hegemonic state has added small satellites (Prussia in Imperial Germany). On the other hand, the approach applies much more widely than to the sizes of federal units. In particular, it applies to parties. If 100 seats are distributed among 10 parties, their mean share is 10 seats. The largest party can be expected to have $(100 \times 10)^{1/2} \approx 32$ seats. In the aforementioned case of The Netherlands 1918-1952, the actual largest share over 9 elections ranged from 28 to 32, with an average of 30.6. The fit isn’t always that good, but the estimate works on the average of many countries. Indeed, this relationship was a major link in my series of models leading to prediction of cabinet duration. It may be useful for some other sociopolitical problems.

Up to now, we have worked out individual cases. It’s time to establish the general formula. When a total size C is divided among N components, then the largest component ($C_1$) can range from C/N to C. The best estimate for $C_1$ is the geometric mean of these extremes:

$C_1 = [(C/N)C]^{1/2} = [C^2/N]^{1/2} = C/N^{1/2}$.

In short: $C_1 = C/N^{1/2}$. The largest component is often close to the total size divided by the square root of the number of components.
____________________________________________________________________
Exercise A4.2
As of 2002, Australia had a population of 20 million. How large would you expect the population of its largest federal component to be? Your guess would depend on the number of components, but there is a snag. Australia has 6 states plus a large and empty Northern Territory with limited self-rule and also a separate Capital Territory (Canberra). Should we base our estimate on 6 or 8 components?
a) Estimate the largest component both for N=6 and N=8.
b) The actual figure for the most populous component (New South Wales) was 6.3 million. By how much are your two estimates off?
c) Draw some conclusions.
______________________________________________________________________

Directional and quantitative models

A logical model can be merely directional, such as “If the number of components goes up, the share of the largest component goes down.” But how fast does it go down? We get much more out of a quantitative model such as $C_1 = C/N^{1/2}$. It tells us all the directional model does, but also much more. For instance, if N increases 4-fold, then $C_1$ is reduced by one-half. Not just a directional “If N increases 4-fold, then $C_1$ is reduced”. We should always aim at going beyond a directional model, because a quantitative model has vastly more predictive power.

A cautioning example: Center of the room

Now that I may have convinced you of the usefulness of the geometric mean, let me introduce a counter-example, before you swallow the geometric mean too deeply. Suppose I hide a needle somewhere in the room where you cannot see it. You get a bonus or a penalty, depending on how close your guess is. A measuring tape which extends into the hallway reads 1 meter at one wall of the room and 9 meters at the other wall. At what location along this tape would it be advisable to place your bet? Write down your best guess.

This is another example of a model of (almost complete) ignorance. But if you have gotten brainwashed into automatically applying the geometric mean, think again, and draw a picture:

        |_______________________________|
    0   1   2   3   4   5   6   7   8   9
       Wall             C              Wall

The geometric mean of 1 and 9 is 3. This would be merely 2 meters from the closest wall and hence a full 6 meters from the other. But would you not be safer guessing at the center of the room (C), 4 meters from each wall? Then your maximum error could be 4 meters, rather than 6. (This is called minimax strategy: minimizing the maximal possible error.) Here we are advised to use the arithmetic mean: (1+9)/2 = 5 meters along the tape.

So what is different in this case, compared to the previous? Here the numbers refer to arbitrary measuring units with no fixed zero. One could shift the tape so that its 0 point corresponds to the wall, or even reverse it to start from the other wall. The zero is arbitrary. In contrast, 40 people and 4,000 people refer to non-arbitrary counting units (1 person) where the zero point is fixed: "not a single person". These numbers cannot be shifted or flipped.

If you don’t quite follow this abstract reasoning, there is no problem. At this point, just realize that we have to think about which mean makes more sense, geometric or arithmetic. (Sometimes neither does!)

EXTRA: The smallest share
One might think that it would be as easy to establish an estimate for the smallest among N shares ($C_N$), but it’s tricky. What is the largest value $C_N$ could have? It’s obviously C/N. But what is the smallest value $C_N$ could have? Here we run into trouble.

I finally got a handle on it after writing the first draft of this book, in fall 2008, pointing out the problem. Such a public statement of a problem can be a significant step toward solving it, because one thus challenges oneself. I was motivated to think about this problem, off and on, always running into roadblocks but, in retrospect, still making tiny advances in locating new ways to look at the problem. The breakthrough came in June 2009, but I still struggled for two months to understand what it was that I had found. Go and try to explain something to others when it isn’t clear to oneself! In August 2009 a clean draft was completed: “The largest, smallest and median components: Their expected sizes”.

So how large is the smallest among N components? The answer is much more complex and fuzzy than the simple $C_1 = C/N^{1/2}$. The smallest component can be expected to be not smaller than about

$C_N = C/[e^{0.14(N-1)} N^{1.89}]$.
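To get a feel for the two expressions side by side, here is a small Python sketch. It is an illustration only, under one loudly stated assumption: the formula for the smallest share is read as C divided by e raised to 0.14(N-1), times N raised to 1.89. The function names are invented for this example.

```python
import math

def largest_component(total, n):
    """C1 = C / N**0.5, the geometric mean of C/N and C."""
    return total / math.sqrt(n)

def smallest_component_floor(total, n):
    """Approximate floor for the smallest share.

    Assumption: the formula is read as C / (e**(0.14*(N-1)) * N**1.89).
    """
    return total / (math.exp(0.14 * (n - 1)) * n ** 1.89)

# Illustration: 100 seats among 10 parties, as in the party example earlier.
print(round(largest_component(100, 10), 1))         # about 31.6 seats
print(round(smallest_component_floor(100, 10), 2))  # well below the mean of 10 seats

# Sanity check: with a single component, both expressions return the total.
print(largest_component(100, 1), smallest_component_floor(100, 1))
```

The single-component check is the kind of logical test applied throughout: a formula that did not return the total for N=1 would have to be rejected.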

A5. Forbidden and Allowed Regions: Logarithmic Scales
____________________________________________________________
• Some values for a quantity may be forbidden on conceptual grounds, others on grounds of information supplied. What remains is the allowed region.
• The center of the allowed region is often easier to visualize when showing the regions on a logarithmic scale.
• The logarithmic scale shows 0.1, 1, 10 and 100 at equal intervals.
________________________________________________________________

In all the previous examples, we could represent our options on an axis ranging from minus to plus infinity (-∞ to +∞). How did we narrow down the options, say, for the weight of the lmysh? Let us proceed by systematic elimination, shown in Figure A5.1.

Regular scale and its snags

Negative weights are inconceivable, even in fairy tales. Therefore, we should first mark off the negative part of the scale as a conceptually forbidden region for any weights whatsoever. Thereafter, we could mark the weights of shrews and blue whales and exclude what’s outside this range as “informationally” forbidden regions for mammals. This would leave the allowed region, between Shrew and Whale. In Figure A5.1, the forbidden regions are dashed off, so that the allowed region stands out. The dashing is heavier for conceptually forbidden regions, to visualize a stronger degree of exclusion.

Figure A5.1. Weights of mammals on a regular scale: Allowed region

    CONCEPTUALLY          | INFORM. |        ALLOWED REGION         | INFORMATIONALLY
    FORBIDDEN             | FORBID. |                               | FORBIDDEN
    ///////////////////// | / / / / | surprise  central    surprise | / / / / / / / /
    ----------------------|---------|-------------------------------|----------------
    -∞                    0       Shrew                           Whale            +∞

All that’s left is to make the best estimate within this region. This is still quite a job, but without pinning down the allowed region, we could not even start.
Within the allowed region we could still distinguish a central region, where we are most likely to find the lmysh, and marginal regions of surprise, where we would be surprised to find it, if we know that few mammals are as small as shrews or as large as whales. Here the “central region” would not be in the visual center between Shrew and Whale, because the visual center (the arithmetic mean of shrew and the largest whale) would correspond to a small whale. In this picture, the region around the geometric mean would be rather close to Shrew, itself almost indistinguishable from the zero point. (I had to expand this part of the graph to make it visible.) We would get a more accurate picture if we shifted to a logarithmic scale.

Logarithmic scale

Logarithmic scale is a scale where you add equal distances as you add zeros to the number. On this scale, the distances from 1 to 10, from 10 to 100, and from 100 to 1000 are equal, because each time you add one 0 to the number. In the reverse direction, the same distance takes you from 1 to 0.1 and from 0.1 to 0.01. Logarithmic scale goes by the “order of magnitude” of numbers, which roughly means the number of zeros they have.

Let us show the weights of mammals on such a logarithmic scale (Figure A5.2). When using kilograms, the shrew is at 0.003, between 0.001 and 0.01, and the blue whale at 30,000, between 10,000 and 100,000. Now the visual center of the allowed region is around the geometric mean of shrew and blue whale, 10 kg.

But what happens to the conceptually forbidden region? It vanishes, because the zero point shifts to the left, infinitely far. Indeed, you may add as many zeros as you wish behind the decimal point, and you are still higher than zero. Just as the notion of weight disallows negative values, so does the logarithmic scale. The two are well suited for each other.

Figure A5.2. Weights of mammals on a logarithmic scale: Allowed region

    FORBID. |                    ALLOWED REGION                     | FORBIDDEN
    / / / / | surprise               central               surprise | / / / / /
0 ← ----|-------|-------|-------|-------|-------|-------|-------|-------|---- → +∞
      0.001   0.01     0.1      1      10      100    1,000  10,000  100,000
            ^                           ^                           ^
          Shrew                    Geom. mean                     Whale

_______________________________________________________________________
Exercise A5.1
a) Make a list of coins and banknotes used in your country. Mark their locations on a logarithmic scale.
b) Comment on the distances between these locations.
c) Now suppose I asked you to mark their locations on a regular scale. In your response, please do not use too many impolite words. “Go fly a kite” is acceptable.
d) Comment on the statement "We do not encounter the logarithmic scale in everyday life".
______________________________________________________________________
Exercise A5.2
a) Show the forbidden and allowed areas for the number of seat-winning parties in a 100-seat district, both on regular and on logarithmic scales. Do not worry

b) Do the same for Casablanca deaths, on logarithmic scale only.
c) Do the same for the population and area of the largest US state, on logarithmic scale only.
d) Try to do the same on regular scale, for parts b and c, if you feel it’s simpler.
e) Draw some conclusions.
_____________________________________________________________________

Logarithms of numbers

Why do we call the scale in Figure A5.2 “logarithmic”? Figure A5.3 has the same scale as the previous figure. Underneath, it has the number of zeros following “1”. The number of zeros preceding “1” carries a minus sign. This is a shorter way to label this scale. And this number of zeros is called the (decimal) logarithm of the number. Thus $\log 100 = 2$ while $\log 0.01 = -2$.

Figure A5.3. Numbers and their logarithms

      0.001   0.01     0.1      1      10      100    1,000  10,000  100,000
0 ← ----|-------|-------|-------|-------|-------|-------|-------|-------|-------|---- → +∞
       -3      -2      -1       0       1       2       3       4       5       6

What if a number begins with something else than “1”? Consider 30. What can you say about $\log 30$? Well, try to place it on the scale shown. Since 10 < 30 < 100, we must have $\log 10 < \log 30 < \log 100$. Hence $\log 30$ must be somewhere between 1 and 2. If you understand that, you’ll understand everything about logarithms when we need them. And do we need them!
_______________________________________________________________________
Exercise A5.3
What are the lower and upper limits for $\log 537.5$? For $\log 0.74$? Can you devise a general rule for approximate placement of logarithms of numbers?
______________________________________________________________________
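The bracketing of log 30 between 1 and 2 can also be checked numerically. The short Python sketch below is an illustration rather than part of the original text; the helper name `log10_bounds` is invented here, and it relies only on the standard `math.log10` and `math.floor` functions.

```python
import math

def log10_bounds(x):
    """Return the two whole numbers that bracket log10(x), plus log10(x) itself."""
    value = math.log10(x)
    lower = math.floor(value)
    return lower, lower + 1, value

lo, hi, value = log10_bounds(30)
print(f"{lo} <= log 30 < {hi}; calculator value {value:.3f}")
```

The whole-number bounds are what one reads off the scale in Figure A5.3; the decimal value only says where between the two tick marks the number falls.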

A6. Volatility: Allowed Regions, Anchor Points, and Simplification

Next, note that there is a conceptual extreme case. Suppose only one party runs at the first election, and only the same party runs at the next one. This means that N=1, and switching to another party is impossible. Hence volatility must be zero. This point (N=1, V=0) is marked in Figure A6.1 with a triangular symbol. It is a conceptual anchor point. At N=1, even a slight deviation of V away from zero would violate logic. Of course, democratic countries practically always have more than one party running. Logical models, however, must not predict absurdities even under extreme conditions.

Figure A6.1. Individual-level volatility of votes vs. effective number of electoral parties: conceptually forbidden regions (areas), anchor point, and expected zone. (Note: The legend “Anchor point” at top left is confusing; the actual anchor point is the triangle at lower left.)
[Figure A6.1 graph. Horizontal axis: effective number of electoral parties (0 to 10). Vertical axis: individual level volatility (%), 0 to 100. Shown: the lines V=20(N-1) and V=10(N-1), the conceptually forbidden areas, the surprise zone, the expected zone, and the anchor point.]

At this point, we make two simplifying assumptions. First, we assume that at least one party obtains some votes in both elections. This restriction excludes the unlikely situation where a single party has all the votes in one election but loses them all to a brand new party in the next election. Second, we assume that the same voters vote at both elections. This simplification is serious, because real elections always have some voters drop out and some others entering, from one election to the next. We should not forget about it, but model building best proceeds by stages. “As a first approximation”, let us assume that negligibly few voters drop out or join. Let us work out this simple situation first. If successful, we can go to a second approximation, where we take into account more factors: not only the shift in voters but also some other factors that might influence volatility, besides the number of parties.

This “number of parties” also needs clarification. Suppose the percent vote shares of parties are 45, 35, 10, 9 and 1. How many meaningful parties are there: 5 or 4 or 2? As explained at the end of this chapter, the “effective number” is 2.9. This means that for some purposes the party constellation 45-35-10-9-1 behaves like a constellation of 3 equal-sized parties. From now on, N will stand for this effective number of parties.

If V increases with N, our simplest tentative assumption could be linear increase: V=a+bN. But the anchor point adds a constraint. All acceptable lines must pass through the anchor point. How do we build this constraint into V=a+bN? For N=1, we must have V=0. Plug these values into V=a+bN, and you get 0=a+b. This means that a=-b, so that V = -b+bN = b(N-1). Among the infinite number of upward sloping straight lines, only those will do where the initial constant equals the negative of slope.

Without any input of data, the conceptual anchor point approach has already narrowed down the range of possibilities. Instead of having to look for two unknowns (a and b), we have only one. This is a tremendous simplification.

Now we move to shakier grounds. The effective number of parties rarely reaches N=6. You may share a gut feeling that even with so many parties to choose from, not all voters will switch. If so, then V=100 percent at N=6 would be a highly surprising outcome, although it is not conceptually impossible. The line V=b(N-1) that passes through this point is shown in Figure A6.1. How do we find the equation of this particular line? Plug the values (6,100) into V=b(N-1). The result is 100=b(6-1); hence b=100/(6-1)=20. Thus, the equation of this line is V=20(N-1). Any data point located above this line would be highly surprising, although we cannot completely exclude the possibility, in contrast to the conceptually forbidden areas. Hence this zone is marked as a surprise zone in Figure A6.1.

So V=20(N-1) is roughly the highest value of V that would not utterly surprise us. Do we also have a lowest value? No. Even with a very high number of parties, it is still conceivable that party loyalty of voters could be complete. Thus no limit higher than V=0 can be proposed, meaning a horizontal line in Figure A6.1.

Without any real data input, we have now narrowed down the reasonably expected zone where data points could occur. It’s the cone between the lines V=20(N-1) and V=0. In the absence of any other knowledge, we have no reason to expect the actual line to be closer to either of these two extremes. Therefore, our best “minimax bet” would be the average of the likely extremes. We really should write it with a wavy equality sign “≈”, because it is quite approximate:

V ≈ 10(N-1).

Still, without resorting to any data, we have gone beyond a directional model to a quantitative one. It is based on near-complete ignorance and is shown in Figure A6.1. This model makes two distinct predictions, one of them very firm, the other quite hesitant.
1) If any straight line fits at all, it absolutely must have the form V ≈ b(N-1), so as to respect the anchor point.
2) The slope b would be around 10, very approximately. It means: “If you force me to guess at a specific number, I would say 10.”

Testing the model with data

Once constructed, such a model needs testing in two different ways: logical testing to guard against any absurd consequences, and testing with actual data. Let us start with the latter. Data are not that easy to come by, but a uniform data set is available from Oliver Heath (2005), for state-level elections in India, 1998-1999. Many parties competed in some of these states, while few did in some others. The mean values were N=3.65 and V=31.6. They lead to b=31.6/(3.65-1)=11.9. Thus our very coarse expectation of 10 was off by only 20 percent. This is pretty good for a prediction not based on data.

So, at least for India, V=11.9(N-1)=-11.9+11.9N. Heath (2005) reports the best statistical fit as V=-9.07+11.14N [$R^2$=0.50]. Here the “correlation coefficient” $R^2$ indicates the goodness of fit to the best possible straight line. Perfect fit of data points to the line would lead to $R^2$=1.00, while utter scatter would lead to $R^2$=0.00. So $R^2$=0.50 reflects appreciable scatter but still a clear trend, as can be seen in Figure A6.2. This figure also shows the lines corresponding to the two equations. Neglect for the moment the curved pattern.

Which equation should we prefer? Note first that they are very close to each other, compared to the wide scatter of data points. They fit practically equally well. Actually, both represent statistical best-fit lines, but for different ways of fitting the data.
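The arithmetic of this section is easy to retrace. The Python sketch below is only an illustration: the helper name `slope_through_anchor` is invented here, and the numbers are those quoted above (the surprising point (6, 100), Heath's means N=3.65 and V=31.6, and his reported fit V=-9.07+11.14N).

```python
def slope_through_anchor(n, v):
    """Slope b of the line V = b(N - 1) through the point (n, v) and the anchor (1, 0)."""
    return v / (n - 1)

b_ceiling = slope_through_anchor(6, 100)    # line through the surprising point (6, 100)
b_guess = (b_ceiling + 0) / 2               # halfway between the extreme slopes 20 and 0
b_india = slope_through_anchor(3.65, 31.6)  # from the mean N and mean V for Indian states

print(f"ceiling slope {b_ceiling:.0f}, data-free guess {b_guess:.0f}")
print(f"slope from Indian means: {b_india:.1f}")
print(f"the data-based slope exceeds the guess by {b_india / b_guess - 1:.0%}")

# Anchored line versus the unconstrained fit reported by Heath (2005).
for n in range(1, 7):
    anchored = b_india * (n - 1)
    unconstrained = -9.07 + 11.14 * n
    print(n, round(anchored, 1), round(unconstrained, 1))
```

Over the usual range of N the two columns stay within a few percentage points of each other, which is small compared to the scatter of the data.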
Equation V=-9.07+11.14N results from the assumption that any straight line is acceptable, meaning that any values of a and b in V=a+bN are just fine. For N=1, it yields V=2.07 rather than the conceptually required V=0. On a scale 0 to 100, the difference is small, but it is absurd nonetheless. In contrast, equation V=-11.9+11.9N results from the assumption that only the lines passing through the conceptual anchor point are acceptable. This line is the best statistical fit subject to this additional condition.

If we want to predict the results of future elections in Indian states, both are as good (or as bad, given the scatter in previous data), but conceptually we are better off with the line that respects the conceptual anchor point. This is even more so, if we want to guess at volatility in elections elsewhere, because we are certain that the anchor point holds universally. However, we should be prepared to find that the slope might differ appreciably from 11.9 when it comes to countries with political cultures different from India’s. So we might be cautious and offer a universal quantitative prediction with a wide range of error, such as V=(12±3)(N-1).

Figure A6.2. Individual-level volatility of votes vs. effective number of electoral parties: data and best linear fit from Heath (2005), plus coarse and refined predictive models
[Figure A6.2 graph. Horizontal axis: effective number of electoral parties (0 to 10). Vertical axis: individual level volatility (%), 0 to 100. Shown: the data points, the lines V=11.9(N-1) and V=-9.07+11.14N, the curve $V = 100[1 - e^{-0.143(N-1)}]$, the conceptually forbidden areas, and the anchor point.]

Testing the model for logical consistency

So this takes care of testing the model with actual data. But we still need logical testing, so as to guard against any absurd consequences. Look again at Figure A6.2. What volatility does our model V=-11.9+11.9N predict for N=10? It predicts more than 100 percent! This is absurd. We cannot plead that such large numbers of parties practically never materialize. A logical model must not predict absurdities even under extreme conditions. If it does, it must be modified.

Indeed, in addition to the anchor point (1,0), we must satisfy another extreme condition. There is a conceptual ceiling: When N becomes very large, V may approach 100 but not surpass it. Mathematically: When N→∞ then V→100 percent.

The curve that bends off below the ceiling in Figure A6.2 corresponds to the “exponential” equation $V = 100[1 - e^{-0.143(N-1)}]$. This equation may look pretty complex. Yet, it represents the simplest curve that satisfies both extreme conditions (anchor point and ceiling) and best fits the data. What does this equation stand for, and how was it obtained? This will be discussed later.

We can say that V=-11.9+11.9N is a coarser model, a first approximation, and that $V = 100[1 - e^{-0.143(N-1)}]$ is a more refined model, a second approximation. Even when we have a conceptually more refined model, we might prefer to use the simpler one because it’s easier to work with. In the usual range of N it works as well as the more refined model, as is visible in Figure A6.2. There is nothing wrong with such simplification, as long as we do not forget its limitations. If ever we should get a value of N larger than 6 or so, we should have enough good sense to switch to the refined model.
____________________________________________________________________
Exercise A6.1
In Figure A6.2, the single data point above N=6 actually agrees with the coarse linear model better than with the exponential one. Then how can I say that for N>6 we should predict on the basis of the exponential curve, when the straight line fits better?
____________________________________________________________________
Exercise A6.2
In Figure A6.1, we took the arithmetic mean of slopes 20 and 0, to get V ≈ 10(N-1), despite my pushing the geometric mean in previous chapters. What’s the justification?
____________________________________________________________________

How to measure the number of parties

We have been talking of parties as if they were all of equal size. Actually, we face a mix of large parties and others with few votes. How do we measure the number of parties when some are large and some are small? Most often the “effective number of components” is used:

$N = 1/\sum p_i^2$,

where $p_i$ is the fractional share of the i-th component and the symbol Σ (sigma) stands for SUM. Suppose the vote shares of 4 parties are 40, 30, 20 and 10, for a total of 100. Then $N = 1/(0.40^2 + 0.30^2 + 0.20^2 + 0.10^2) = 1/0.30 = 3.3$. The effective number is never larger than the “physical” number (which here is 4).
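The effective number is quick to compute by hand, but a few lines of code make it easy to try other constellations. This is a minimal Python sketch, not part of the original text; the function name `effective_number` is chosen here for illustration, and it accepts raw votes, seats or percentages by first converting them to fractional shares.

```python
def effective_number(shares):
    """Effective number of components, N = 1 / sum(p_i**2).

    `shares` may be votes, seats or percentages; they are first
    converted to fractional shares that add up to 1.
    """
    total = sum(shares)
    fractions = [s / total for s in shares]
    return 1 / sum(p ** 2 for p in fractions)

print(round(effective_number([40, 30, 20, 10]), 1))  # the worked example: 3.3
print(round(effective_number([25, 25, 25, 25]), 1))  # four equal parties: 4.0
```

With equal shares the effective number equals the physical number; any inequality pulls it below.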
The values of N in Table A7.1 are such effective numbers of electoral parties, i.e., numbers based on votes. For other purposes, one might use seat shares.

For volatility, we compare two elections, where the effective number of parties usually differs somewhat. Which one do we pick? Take the mean of the two. Arithmetic or geometric mean? It hardly matters when the two values of N are quite similar, as they usually are.
_____________________________________________________________________
Exercise A6.3
a) Calculate the effective number when the seat shares of parties are 45, 35, 10, 9 and 1.
b) Calculate the effective number when the numbers of seats for three parties are 100, 80 and 20, respectively. (CAUTION: To get the fractional shares, one first has to divide by the total number of seats.)

______________________________________________________________________
Exercise A6.4
Second and further approximations can take many directions. Here we have specified the relationship between V and N. We could also consider other factors that might affect volatility. We could also consider possibly better ways to measure both volatility and the number of parties. What about people who vote only in the first or only in the second of the two elections on which volatility figures are based? How might these occasional voters affect our model? I have no answers to offer. If you find any, it would be impressive, but don’t spend too much time on this exercise.
______________________________________________________________________
Exercise A6.5
Our graphs (Figures A6.1 and A6.2) show no upper limit for N, as if N could reach infinity. Student Valmar Valdna pointed out to me (2.9.09) that the adult population (P) of the country would impose an upper limit. The maximum N would correspond to each person forming a separate party and voting for this party. Such people most likely would stick to their very own parties the next time around, and so volatility should drop to zero! This would be another constraint. Can you work it into the model? Hint: Find a simple expression in N and P that would equal 1 when N=0 but would drop to 0 when N=P; then multiply the existing V by this expression. [In more mathematical language: find a function f(N) such that f(0)=1 and f(P)=0. This is how I often translate constraints into models.]
___________________________________________________________________
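Looking back at the coarse and refined models of this chapter, their predictions can be tabulated side by side. The Python sketch below is an illustration only; the function names are invented here, and it assumes the slope 11.9 and the exponent constant 0.143 quoted above for the Indian data.

```python
import math

def v_linear(n, b=11.9):
    """Coarse model: V = b(N - 1)."""
    return b * (n - 1)

def v_exponential(n, k=0.143):
    """Refined model: V = 100[1 - exp(-k(N - 1))]."""
    return 100 * (1 - math.exp(-k * (n - 1)))

# Columns: N, coarse prediction, refined prediction (both in percent).
for n in (1, 2, 3, 4, 6, 10):
    print(n, round(v_linear(n), 1), round(v_exponential(n), 1))
```

Both models return zero at the anchor point N=1 and stay close through the usual range of N. At N=10 the straight line passes the conceptual ceiling of 100 percent, while the exponential curve stays below it.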

A7. Duration of Cabinets: The Number of Communication Channels
________________________________________________________
• The number of communication channels increases roughly as the square of the number of actors.
• The inverse square law of cabinet duration is one of the consequences.
• A law, in the scientific sense, combines empirical regularity and explanation through a logical model.
________________________________________________________________

There is more to logical model building than the ignorance-based approach. Now we take on a problem where a very different approach is needed. Start with an issue close to home. One of the tasks of parents is to adjudicate squabbles among their children. The more children, the more squabbles. This would be a directional model: when x up, then y up. But how fast does the frequency of squabbles increase as the number of children increases? Let us establish a quantitative model.

The number of communication channels among n actors

With no children or with just one child (A), no conflicts among children can arise. With two children (A and B), they can. When a third child (C) is added, conflict frequency triples, because in addition to conflict channel AB there is also AC and BC. You’d better believe me: We felt this tripling the moment our third child began to walk. What would have happened if we had a fourth child? You figure it out.

It’s the same with parties in a political system. The number of potential conflict channels increases faster than the number of parties. It should affect the duration of governmental cabinets: The more parties, the shorter the cabinet durations. But the broad issue is much wider. It involves not only conflict channels but any sort of communication channels among social actors. And interaction among individuals is the very definition of what society is. The number of communication channels increases enormously as more individuals are added, and much of social organization addresses the issue of how to cut down on this number. The shift from direct democracy to representative democracy is one such attempt.
How many communication channels (c) are there among n actors? Draw pictures and count the channels, for a small number of actors. You’ll find the following:

Number of actors              0   1   2   3   4   5   6
Number of conflict channels   0   0   1   3   6   10  …

What is the general model for n actors? Each actor extends a channel toward each of the (n-1) remaining ones. So the total for n actors is n(n-1). But we have double counted, approaching each channel from both ends. So we have to divide by 2. The result is c = n(n-1)/2.
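The counting argument can be verified by brute force. In the Python sketch below (an illustration, with function names made up for this purpose), one function lists every unordered pair of actors using the standard `itertools.combinations`, and the other applies the formula c = n(n-1)/2.

```python
from itertools import combinations

def channels_by_counting(n):
    """Count every unordered pair of actors explicitly."""
    return sum(1 for _ in combinations(range(n), 2))

def channels_by_formula(n):
    """c = n(n - 1)/2."""
    return n * (n - 1) // 2

# Columns: number of actors, channels counted pair by pair, channels from the formula.
for n in range(6):
    print(n, channels_by_counting(n), channels_by_formula(n))
```

The two columns agree, and they reproduce the values found above by drawing pictures for 0 to 5 actors.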


d) Would some number of departments reduce this load even further? If so, then to what level?
e) Would reduction in communication load seem to justify reorganization, assuming that our way to measure it is valid?
_______________________________________________________________________

Average duration of governmental cabinets

How can $c \approx n^2/2$ be used to determine the average duration of governmental cabinets? Cabinet breakdowns are caused by conflicts. When the number of potential conflict channels (c) among parties is reduced by one-half, the breakdown frequency of cabinets should also be reduced by one-half. Hence cabinet duration (capital C) should double. But the number of conflict channels itself grows as the square of the number of parties (N). Thus, if the number of parties is reduced by one-half, cabinet duration should become 4 times longer. Such reasoning leads to

$C = k/N^2$,

where k is a “constant”, the value of which we do not know.
_____________________________________________________________________
Exercise A7.3
“If the number of parties is reduced by one-half, cabinet duration should become 4 times longer.”
a) Verify that equation $C = k/N^2$ does lead to this outcome. You do so by plugging in N/2 instead of N, and 4C instead of C. Are the two sides of the equation still equal?
b) To carry out such verification, do we have to know the value of constant k?
_____________________________________________________________________

Does this equation enable us to predict duration of cabinets, for a given number of parties? No, because the value of constant k is not known. How do we find it? We can turn $C = k/N^2$ into $k = N^2 C$. (If you don’t know how, speak up. Otherwise you’ll be lost from now on.) Pick a country which feels typical, and plug into this equation its number of parties and cabinet duration. Take Finland, where the average values over many decades are N=5.0 and C=1.5 years. If Finland is typical, then $k = 5.0^2 \times 1.5 = 37.5 \approx 38$ years.

But other countries may lead to different values. Consider the five countries in Table A7.1. I have arranged the countries by increasing N. Note first that the coarse directional model “N up, C down” is satisfied. But how about the more demanding quantitative model $C = k/N^2$? If we complete the table for values of $k = N^2 C$, we find that each country yields a different value of k. What should we do next? Do Exercises A7.4 and 5. My doing them in your place will not teach you what to do. Make sure to include proper units, because 1.5 years is not the same as 1.5 months!


Table A7.1. Number of parties and cabinet duration for representative countries.
___________________________________________________
Country      N      C (years)   $k = N^2 C$
___________________________________________________
Botswana     1.35   40          ……..     sbatlet
Bahamas      1.7    14.9        ………    arbhar
Greece       2.2    4.9         ………    zuzneg hroah
Colombia     3.3    4.7         ………
Finland      5.0    1.5         38 years
___________________________________________________
________________________________________________________________________
Exercise A7.4
Use Table A7.1.
a) Calculate the values of $k = N^2 C$ for all the countries and enter them in the table.
b) Graph k vs. N. Does k seem to be steady, with random ups and downs, or does it seem to change as N increases? (If k steadily increased or decreased with increasing N, this would mean that our model does not express the trend well.)
c) Find the median and the arithmetic mean of the 5 values of k, with no decimals. arbhar
d) Express the mean and the range of k for this set of countries, adding to the mean such error limits that they include all 5 cases: k = …±… years.
arbsta, zuzsba

________________________________________________________________________
Exercise A7.5
a) Also calculate the geometric mean of the 5 values of k. How do you do it? Multiply the 5 values together and make a note of this product. Now figure out which number multiplied 5 times by itself reproduces this product. You might well take the arithmetic mean as the starting point for trial-and-error. (Yes, there are quicker ways to calculate the geometric mean – see Chapter A1 EXTRA. But the Little Bear approach gives you a firmer sense of what the geometric mean is all about. Because I can do it in this coarse and direct way, I can work on models even in the absence of a pocket calculator.)
b) Express k for these countries, indicating by which factor the geometric mean must be multiplied or divided so as to include all 5 cases: k = … years ×÷… arbtlet, oah.tme
c) Compare this geometric mean to the arithmetic mean and to the median. This means asking: Do they differ appreciably, compared to the extreme values of k in the table above?
_______________________________________________________________________

The actual study included 35 countries. The 5 countries in Table A7.1 were chosen so as to represent fairly well this broader set. For the 35 countries, the best fit is with k=42 years. So, within a large margin of error of ×÷2, the model becomes $C = 42 \text{ years}/N^2$.
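The trial-and-error search for a geometric mean described in Exercise A7.5 can also be written as a small Python sketch. To leave the exercise itself to you, the five numbers below are invented for illustration and are NOT the k values of Table A7.1. The halving of the search interval is merely one tidy way to organize the guessing; it is an assumption of this sketch, not a method prescribed in the text.

```python
# Trial-and-error geometric mean: find the number g which, multiplied by
# itself as many times as there are values, reproduces their product.
# Made-up illustration values, NOT the k values of Table A7.1.
values = [2.0, 4.0, 8.0, 16.0, 32.0]

product = 1.0
for v in values:
    product *= v

# The geometric mean must lie between the smallest and largest value.
low, high = min(values), max(values)
for _ in range(60):                      # each pass halves the interval
    guess = (low + high) / 2
    if guess ** len(values) > product:
        high = guess                     # guess was too large
    else:
        low = guess                      # guess was too small (or exact)

geometric_mean = (low + high) / 2
arithmetic_mean = sum(values) / len(values)
print(f"geometric mean  ~ {geometric_mean:.3f}")
print(f"arithmetic mean = {arithmetic_mean:.3f}")
```

For values that stretch over a wide range, as these do, the geometric mean sits well below the arithmetic mean. Whether the same happens for the k values is part of what the exercise asks you to find out.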


Once constructed, this model too needs logical testing and testing with data. Here the latter part involves two stages. First, we have to make sure that the pattern C versus N actually does follow the pattern $C = k/N^2$ rather than something else – like $C = k/N$ or $C = k/N^3$ or still something quite different. Second, we have to determine the best fitting value of constant k. Testing a model can be quite tricky. We’ll come to that later. It will be seen that the cabinet duration model passes the test and thus becomes the inverse square law of cabinet duration.

Cabinet duration depends on the number of parties. The number of parties, in turn, can be deduced from the assembly size and the number of seats allocated in the average district. (Calculating the number of seat-winning parties in Chapter A1 and the share of the largest party in Chapter A4 were important steps for this broader model.) Hence we can deduce cabinet duration from the number of seats in the assembly and in the district. This means that we can design for a desired cabinet duration by manipulating the assembly size and the number of seats allocated in the average district. Thus, this model building has reached a stage which could be of practical use, even while the range of error is as yet quite large.

Dimensional consistency

Kilometers cannot equal hours, because length and time are different dimensions. Our equations must be consistent in dimensions. In $k = N^2 C$, N is a pure number (without units) and C is in years. Hence the product k must also be in years – k is NOT a pure number! We’ll return to this issue.

Laws and models

What is a law, in the scientific sense? It’s a regularity that exists, and we know why it holds. A law can come about in two ways. Some regularity may be first observed empirically and later receives an explanation through a logical model. Conversely, a model may be deduced logically, and later is confirmed by testing with data.
The latter path seems to be followed here, but it is more complex than that. I first had some limited data, and graphing them (that’s the next chapter!) showed something close to an inverse square relationship. This is what urged me to look for a model. Once the model came about, more data were needed for testing the model. For the purpose of learning model building, the main point here is that this logical model is quite different from the ones previously discussed. This model is based on the idea of communication channels. It has nothing to do with the ignorance-based approach used previously for the number of seat-winning parties and for the largest share. We will soon return to the ignorance-based approach, and you may start thinking that the ignorance-based approach is all there is to model construction. It isn’t, even while being extremely useful. The example of cabinet duration should serve as a warning that logical models come in an infinite variety of ways, depending on the issue on hand.


A8. How to Draw Graphs on Regular Scales
______________________________________________________________
• Most of what follows in this book involves some graphing. The deeper your understanding of graphing is, the more you will be able to understand model construction and testing.
• This chapter has no highlights that can be packaged into a few sentences – everything matters.
_______________________________________________________________

We already felt the need to place numbers on a scale in Chapter A5. When we want to visualize relations between two variables, we need to draw these scales at right angles to each other and place data points on this field. This is so for the largest share and the number of shares. Do the actual cases fall on the curve $C_1 = C/N^{1/2}$? This is also so for cabinet duration and number of parties: Do the actual cases fall on any curve of form $C = k/N^2$, and if so, do they fit $C = 42 \text{ years}/N^2$?

We also found in Chapter A5 that a logarithmic scale is preferable to the regular one when one needs to place tiny and huge quantities on the same scale. The same applies to two-dimensional graphs. In the next chapter, we’ll see that logarithmic scales sometimes offer advantages even when the quantities do not vary from tiny to huge. But the present chapter deals with graphing on regular scales. The data in Table A7.1 are used. The task is: Graph C vs. N so as to test the general model $C = k/N^2$ and the more specific model $C = 42 \text{ years}/N^2$.

We’ll go slowly. Indeed, we’ll go so slowly that you may feel like bypassing the first half of this chapter, thinking that you already know all that. Don’t bypass it. While reading, try to tell apart things you know, tidbits that are new and may come in handy, and – things you have been doing while not quite knowing why. (I myself found some of those while writing this chapter!)

Construct the graphs by hand – NOT by computer. This way you learn more about the problem at hand, and you do not have to fight the peculiarities of computer programs. Computer-drawn graphs have their proper place, but you have to do enough graphing by hand before you can become the master of computer programs rather than slave to their quirks and rigidities.

Constructing the framework on regular scales

Try to be precise – otherwise important conclusions may be missed. This includes the following.
• Get some square paper (or special graph paper with small squares within larger ones). Make the graphs at least the size of a regular half-page. Tiny freehand sketches on blank paper are NOT precise enough. (The graphs you see in books are mostly reduced from larger originals.)
• By convention, “C vs. N” means C on the vertical “y-axis” and N on the horizontal “x-axis”. Do not reverse them.
• Inspect the data table to see how much space you need. In Table A7.1, the y scale must accommodate values from 0 to close to 50 years, to fit Botswana and Finland. The x scale must accommodate values from 0 to 5, but better include the range 0 to 6, to be on the safe side.
• At equal distances, mark up locations on both scales. For y, one could indicate 0, 10, … 40, 50. For x, 0, 1, … 5, 6 impose themselves. DO NOT laboriously write in 0, 1, 2, … on the y scale – such crowding blurs the overview.
• Make sure that intervals between these main divisions include 2, 5 or 10 squares of the squared paper – then you can easily place the data points precisely. DON’T use intervals of 3 or 7 squares – if you do, just try later to place 1.6 on a scale with 7 squares between “1” and “2”!
• DO include the point (0,0) in the graph, if at all possible, and DO label it on the axes with “0” just as you do for “1” or “2”. Zeros are numbers too, and you hurt their feelings when you omit them. (Indeed, arithmetic really got going only when the Hindus invented a symbol for “nothing”, something that eluded the “practical” Romans.)
• NEVER use unequal interval lengths for equal distances, just because some intervals have data points and others don’t. (One encounters published graphs where, say, population is plotted against time, with 1950, 1980, 1990, 2000, 2005, 2006 and 2007 at equal intervals. Population growth seems to slow down, even when it actually does not. This is lying with graphs.)
• Label the axes. In the present case show “Number of parties (N)” along the x axis, and “Cabinet Duration (C)” along the y axis.
• When drawing straight lines (such as y and x axes), use a ruler – do not do it freehand. If you don’t have a ruler at hand, the back of a book will do.

Placing data and theoretical curves on a regular-scale graph

• Indicate data points by small dots at the precise location, complemented by a larger mark around it – an “O” or an “X”, etc. Do NOT use just a small dot that looks accidental or a blob the center of which is fuzzy.
• For a theoretical curve such as $C = 42 \text{ years}/N^2$, calculate C for simple values of N such as 1, 2, 4, 6, mark them on the graph with tiny symbols, and draw a smooth curve through them. DON’T make life hard for yourself by calculating C at N=1.35 or 1.7 just because Botswana and Bahamas happen to have these values. But you might need to calculate C at N=1.5 because the curve drops so steeply between N=1 and N=2.
• Before drawing the curve, turn your graph on the side so that the y scale is on top. Your natural wrist movement readily allows you to join the theoretical points smoothly.
• Special for these data: The number of parties cannot be less than 1. So draw a vertical line at N=1 and mark its left side by shading or slanted lines – this is a “conceptually forbidden area”.
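The handful of curve points recommended above can be tabulated before you pick up the pencil. The sketch below is only a calculating aid, in Python, for the model with k=42 years; the names `K_YEARS`, `simple_n` and `curve_points` are invented here. The drawing itself should still be done by hand.

```python
# Points for drawing the theoretical curve C = k / N**2 by hand.
K_YEARS = 42.0                  # best-fit constant given in the text, in years
simple_n = [1, 1.5, 2, 4, 6]    # convenient N values, plus 1.5 for the steep part

curve_points = [(n, K_YEARS / n**2) for n in simple_n]

for n, c in curve_points:
    print(f"N = {n:<4} C = {c:5.1f} years")
```

Note how much of the total drop happens between N=1 and N=2. This is why the extra point at N=1.5 is worth the trouble, while extra points beyond N=4 add little.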

Making sense of the graph

We have constructed the graph. Now comes the main part: Making sense of it. How do the actual data points compare with the theoretical curve (the model)? If all the points are above the curve, the model might be in trouble – but maybe we just need to replace the constant k=42 years by a higher value. If the data points at low N are below the curve, while the points at high N are above the curve, the model would be in really deep trouble, because now the very exponent 2 in $C = k/N^2$ would be disconfirmed. In the actual case, some points are above and some below the theoretical curve – so the model holds, with some random scatter. A mere 5 data points do not suffice to test a model, but recall that these points are representative of the trend and degree of scatter for a wider set of 35 countries.
_______________________________________________________________________
Exercise A8.1
a) Graph data in Table A7.1 (also reproduced in Table A8.1 below) and also the model $C = 42 \text{ years}/N^2$ on a regular scale, C vs. N. Compare data and model. NOTE: For the model, find C at convenient values of N, such as 1, 2, 3, 4, 5 – not at N values for data points.
b) Also graph the curves $C = 24 \text{ years}/N^2$ and $C = 73 \text{ years}/N^2$. Compare to the data, with special attention to Greece and Botswana. Compare to values of k calculated in Exercise A7.4.
c) Calculate the arithmetic mean for the N values of the 5 countries, and similarly for their C values. Enter this mean point on the graph. Where is it located relative to the curve $C = 42 \text{ years}/N^2$? How come all 5 countries fit the model better than their mean? (If you can’t answer this question, don’t worry. But don’t stop being puzzled.) zu.sbaoah, ohatlet.zuz
d) Now calculate the geometric means for N and C of these countries and enter this mean point on the graph. Where is it located relative to the curve $C = 42 \text{ years}/N^2$? How come? (If you can’t answer this question, don’t worry. But don’t stop being puzzled.) zu.arbazuz, sba.zutsõud
e) Try to draw some conclusions from these disparate questions: What is the purpose of this exercise, besides practicing graphing?
______________________________________________________________________

Table A8.1. Number of parties and cabinet duration, and their logarithms.
____________________________________________________
Country     N      C (yrs.)   x=logN   y=logC
____________________________________________________
Botswana    1.35   40         0.130    1.602
Bahamas     1.7    14.9       0.230    1.173
Greece      2.2    4.9        0.342    0.690
Colombia    3.3    4.7        0.519    0.672
Finland     5.0    1.3        0.699    0.114
____________________________________________________


Converting to logarithms and graphing on regular scales

In the graph you constructed for Exercise A8.1, the point for Botswana is twice as high as the next one, forcing us to extend the C scale. Also the curves for $C = 42 \text{ years}/N^2$ etc. are bent, which makes it hard to compare data with respect to these curves. However, taking logarithms on both sides of $C = 42 \text{ years}/N^2$ turns it into a straight line logC vs. logN: $\log C = \log 42 - 2\log N$, when C is in years. (Why? We’ll come to that.) We’ll also see that log42=1.62. With y=logC and x=logN, we then have y=1.62-2x. We can graph this line (on regular scale), and we can add the logarithms of data points in Table A7.1. Table A8.1 shows them.
_______________________________________________________________________
Exercise A8.2
a) On a regular scale, graph y=logC vs. x=logN for countries in Table A8.1. Use the same scale on both axes. Make sure your axes cross at (0,0).
b) Use a ruler, preferably a transparent one, so you can see all the data points. By eye, draw the best fit line through these points. This means that there should be about the same number of points above and below the line. Moreover, such balance should hold both at low and at high ends of the data cloud. So balance the 3 lowest points and the 3 highest points (thus using Greece twice).
c) Determine the constants in y=a+bx. How is it done? See Figure A8.1.
d) How close is your b to -2? How close is your a to log42=1.62?
_________________________________________________________________

Figure A8.1. For the line y=a+bx, intercept a is the left side of the triangle, and slope b is the ratio of the left side (a) to the bottom (-a/b), taken with a negative sign.

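[Figure A8.1: the line y=a+bx drawn on regular x and y axes; it crosses the y axis at height a and the x axis at -a/b, forming a triangle with the origin (0,0).]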
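The triangle rule of Figure A8.1 amounts to two divisions, which can be sketched in Python as follows. To avoid giving away your own result in Exercise A8.2, the sketch is applied to the model line y=1.62-2x, whose axis crossings (1.62 on the y axis, 0.81 on the x axis) follow from the equation itself. The function name `line_constants` is hypothetical.

```python
import math


def line_constants(y_crossing, x_crossing):
    """Return (a, b) for the line y = a + b*x from its two axis crossings.

    The line crosses the y axis at (0, a) and the x axis at (-a/b, 0),
    so b is minus the ratio of the left side to the bottom of the triangle.
    """
    a = y_crossing
    b = -a / x_crossing
    return a, b


# Model line y = 1.62 - 2x: crosses the y axis at 1.62 and the x axis at 0.81.
a, b = line_constants(1.62, 0.81)
k = 10 ** a    # undoing the logarithm, because log k = a

print(f"a = {a}, b = {b:.2f}, k = {k:.1f} years")
print(f"log 42 = {math.log10(42):.3f}")
```

Because 1.62 is a rounded value of log42, the k recovered this way falls slightly short of 42 years. Readings taken off a hand-drawn line are rougher still, so do not expect more than two meaningful digits.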

Using the graph of logarithms to find constant k

In Exercise A7.4, we calculated 5 values for constant k, each of which would perfectly fit the data for one country. We then took their mean, which was around 42 years. This is a tedious process. Moreover, we cannot see which countries are typical and which may be unusual. To see, we have to graph. Your graphs in Exercises A8.1 and A8.2 easily show that Greece looks low and Colombia looks high, compared to the general trend.

Assume that we have no logical model – just data. It is hard to describe the general trend in any detail when looking at the graph C vs. N, apart from the coarse directional description “N up, C down”. On the graph logC vs. logN, however, something new jumps to the eye: the pattern visibly is very close to linear. This means that $\log C = a + b\log N$ and $C = kN^b$ (with b close to -2, this is $C = k/N^2$). Thus b is the same in both equations while logk=a. We can calculate the constants a and b in $\log C = a + b\log N$, as you did in Exercise A8.2. Hence we get the b in $C = kN^b$ directly and k indirectly. (How do you find k when you know logk? We’ll get to that.)
_______________________________________________________________________
Exercise A8.3
a) On top of your data graph in Exercise A8.2 also graph the line y=1.62-2x.
b) Also graph the lines y=1.38-2x and y=1.86-2x, which correspond to $C = 24 \text{ years}/N^2$ and $C = 73 \text{ years}/N^2$, respectively. Compare the three lines and the data points.
c) Compare this graph and the one in Exercise A8.1. What remains the same and what is different?
d) EXTRA: Try to place on the graph the arithmetic and geometric means calculated in Ex. A8.1. (At this stage, you are not expected to know how to calculate the logarithms for their coordinates, but locations of the means on the graph drawn in Ex. A8.1 should give some hints.)
_________________________________________________________________

How I found the inverse square law of cabinet duration
Now you are in a position to follow the history of a research issue. This may be useful in perceiving how quantitative research proceeds more generally. I graphed data on cabinet duration and number of parties, as calculated and tabulated by someone else, who just observed that duration decreased when the number of parties increased (Lijphart 1984: 83, 122, 124-126). I did so, because I wanted to have a more detailed picture of their relationship. It was clearly curved (like your graph in Exercise A8.1). So I graphed their logarithms (like your graph in Exercise A8.2). Moreover, the slope in this graph was suspiciously close to -2.0. Such simple integer values do not happen just like that – maybe they want to tell us something, I said to myself. For a long time, I could not figure out what the data graph was trying to tell me – it isn’t such a straightforward process. And then the model with communication channels occurred to me. But it started with graphing the data, and this is why I have explained the process here in some detail, so that you can not only admire it but also use it.

How did I locate the constant k? I assumed that the slope on the “log-log” graph is exactly -2 and kept moving a transparent ruler, keeping it at slope -2. This leads to parallel lines (like in your graph in Exercise A8.3, part b). I picked the one that best fitted the data points.
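The ruler procedure just described can be imitated numerically, which may help you see what “sliding at fixed slope” means. The Python sketch below uses the logarithms listed in Table A8.1. Taking the median intercept as “the best fit” is an assumption of this sketch, chosen because it leaves as many points above the line as below it; it is not a claim about how the original fit to 35 countries was made.

```python
import statistics

# Logarithms from Table A8.1: x = log N, y = log C.
log_n = [0.130, 0.230, 0.342, 0.519, 0.699]
log_c = [1.602, 1.173, 0.690, 0.672, 0.114]

SLOPE = -2.0   # kept fixed, like the ruler held at slope -2

# A line of slope -2 passing through the point (x, y) has intercept
# a = y - SLOPE * x. Each country thus "votes" for one parallel line.
intercepts = [y - SLOPE * x for x, y in zip(log_n, log_c)]

# Assumption of this sketch: the best line has the median intercept,
# so that two points lie above it and two below.
a_best = statistics.median(intercepts)
k_best = 10 ** a_best   # because log k = a

print(f"intercepts: {[round(a, 3) for a in intercepts]}")
print(f"a = {a_best:.3f}, so k is about {k_best:.0f} years")
```

With only these five countries the resulting k lands in the low forties, close to but not identical with the 42 years found for all 35 countries. The spread of the five intercepts also shows at a glance which countries sit high or low relative to the trend.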


A9. How to Draw Graphs on Logarithmic Scales
______________________________________________________________
• Most graphs in this book involve logarithmic scales. They have to be mastered. But understanding gradually increases as one uses them over and over.
• This chapter, too, has few highlights that can be packaged into a few sentences – everything matters.
• The (decimal) logarithm of a number such as 10,000 – “1” followed by zeros – is simply the number of zeros.
• The (decimal) logarithm of any positive integer is the number of zeros that would follow the first cipher, if all the other ciphers were replaced by zeros, plus a fraction.
_______________________________________________________________

When we want to graph logarithms, it’s a nuisance to have to calculate the logarithms for each data point, the way we did in Table A8.1. There is a shorter way. But now we really need a handle on those logarithms briefly introduced in Chapter A5 and toward the end of A8. Have no illusion: This will not be the last time. You had better brush up on them. Not just the way you may have memorized formulas for school tests. Make an effort to understand what they stand for. This chapter helps you to do so, starting with the following simple notions (cf. earlier Exercise A5.3):

The (decimal) logarithm of a number such as 10,000 – “1” followed by zeros – is simply the number of zeros.

The logarithms of intermediary numbers logically must be intermediary. Hence the (decimal) logarithm of a positive integer is the number of zeros that would follow the first cipher, if all the other ciphers were replaced by zeros, plus a fraction.

Thus, 728 is more than 100 but less than 1,000. Hence its logarithm must be more than 2 and less than 3. Using the method above for log728, replace 2 and 8 by zeros. This leads to 700. Count the zeros. Hence log728=2 plus a fraction smaller than 1. But how large is this fraction? This comes next.

Finding logarithms of a number

Let us use a scale where 1, 10, 100, etc. are at equal distances. For such numbers, their logarithms are simply the number of zeros that follow “1”. Figure A9.1 shows how it looks. This is a repeat of Figure A5.3.

Figure A9.1. Powers of 10 and their logarithms.
x       0.001   0.01   0.1   1   10   100   1,000   10,000   100,000   1,000,000
log x   -3      -2     -1    0   1    2     3       4        5         6


How did we get the minus values for logarithms of 0.001 etc.? Each time you move one notch down from 1 million, you divide by 10 and subtract 1 from the logarithm. Right? Well, then just continue doing it, past 10 and 1 – this is the only consistent way to proceed. We already used such a scale in Figure A5.2, for the weights of mammals.

If $10$, $10^2$, $10^3$ and $10^4$ are at equal distances, and we want to be consistent, then $2$, $2^2$, $2^3$ and $2^4$ should also be at equal distances, and the same for $3$, $3^2$, $3^3$ and $3^4$ etc. This is so indeed on logarithmic scale, provided we assign equal distances of 0.301 for 2 and equal distances of 0.477 for 3. This means that log2=0.301 and log3=0.477.
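The claim of equal distances is easy to check for yourself. The Python sketch below lists the positions of the first four powers of 2 and of 3 on a logarithmic scale, together with the gaps between neighbors; the helper name `log_positions` is made up for this illustration.

```python
import math


def log_positions(base, n_powers=4):
    """Decimal logarithms of base, base**2, ..., base**n_powers."""
    return [math.log10(base ** p) for p in range(1, n_powers + 1)]


for base in (2, 3):
    positions = log_positions(base)
    gaps = [later - earlier for earlier, later in zip(positions, positions[1:])]
    print(base, [round(p, 3) for p in positions], [round(g, 3) for g in gaps])
```

For each base all the gaps come out the same, and each gap equals the logarithm of the base itself. Multiplying by a fixed number always moves you the same distance along a logarithmic scale, wherever you start.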

_____________________________________________________________
Exercise A9.1
Calculate log5. Hint: note that 2×5=10.
______________________________________________________________

For powers of 2, the magnified picture is shown in Figure A9.2. Note that $2^{10}$ is awfully close to $10^3$. This means that 10 times log2 nearly equals 3 times log10: 10log2≈3. This is why log2≈0.3 – more precisely log2=0.301 because 1024>1000. It follows that log4=0.602 and log8=0.903. We can verify that this agrees with their locations in Figure A9.2.

Figure A9.2. Powers of 10 and 2, and their logarithms.
[Figure A9.2: a logarithmic scale running from x=0.1 to x=1,000 (log x from -1 to 3), with the powers of 2 from 0.125 to 1024 marked among the powers of 10.]

The happy outcome is that, instead of calculating logarithms for each number we want to graph on regular scale, we just graph the numbers themselves on this logarithmic scale. How do we do it, say, for the data in Table A7.1? This comes next.

Fully logarithmic or log-log graphs

First look at how many periods from 1 to 10 you need. In Table A7.1, the number of parties ranges from 1 to 5, so one period will suffice. Duration ranges from 1 to 40 years, so we need two periods – 1 to 10 and 10 to 100. Mark a big distance 0 to 1 on the x axis (for N) and two big distances 0 to 1 to 2 on the y axis (for C). Make sure to use an easy number of squares – 5, 10 or 20 – for duration. Lightly mark the positions 0.2, 0.4 etc. Then mark the locations of log 1.5, 2, etc. from Table A9.1. For 15, 20, etc., simply add 1 to the logarithm of 1.5, 2, etc. – as also shown in Table A9.1.


Table A9.1. Logarithms of integers from 1 to 10.
____________________________________________________________________________________________
x          1   1.5     2       3       4       5       6       7       8       9       10
logx       0   .176    .301    .477    .602    .699    .778    .845    .903    .954    1
log(10x)   1   1.176   1.301   1.477   1.602   1.699   1.778   1.845   1.903   1.954   2
____________________________________________________________________________________________

Instead of labeling these positions as log1.5, 2, etc., simply label them 1.5, 2, etc. Now you have those numbers on logarithmic scales. Draw horizontal and vertical lines at these locations. The result should look roughly like Figure A9.3 below – but yours should be much larger, more detailed and more precise. This is called a doubly logarithmic or “log-log” graph, because both axes are on logarithmic scales. Now the data points in Table A7.1 can be placed on this grid.

Figure A9.3. Grid for doubly logarithmic or log-log graph y vs. x.
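[Figure A9.3: log-log grid with grid lines at 1, 2, 3, 5, 10, 20, 30, 50 and 100 on the y axis (two periods) and at 1, 2, 3, 5 and 10 on the x axis (one period).]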
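When constructing such a grid on squared paper, it helps to know in advance how many squares from the “1” mark each grid line belongs. The Python sketch below does this arithmetic. It assumes 10 squares per period, which is one of the easy choices mentioned in the text; the function name `tick_position` and the tick lists are illustrative, not prescribed.

```python
import math


def tick_position(value, squares_per_period=10):
    """Distance in squares from the '1' mark to `value` on a log scale."""
    return squares_per_period * math.log10(value)


x_ticks = [1, 1.5, 2, 3, 5, 10]                 # one period, for N
y_ticks = [1, 2, 3, 5, 10, 20, 30, 50, 100]     # two periods, for C

print("x axis:")
for value in x_ticks:
    print(f"  {value:>5}: {tick_position(value):5.2f} squares")

print("y axis:")
for value in y_ticks:
    print(f"  {value:>5}: {tick_position(value):5.2f} squares")
```

The second period simply repeats the first one shifted by 10 squares. The position of 20 is that of 2 plus one full period, which is the “add 1 to the logarithm” rule of Table A9.1 in action.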

___________________________________________________________________
Exercise A9.2
Use data in Table A8.1.
a) Construct the log-log grid for 1 period on x axis and 2 periods on y axis.
b) If you can, make copies for further use – it’s a universal log-log grid. If you can find printed log-log paper, so much the better.
c) Place the data points (N,C) in Table A8.1 on this grid. CAUTION: Do NOT graph logN and logC – the grid itself is taking care of logging.
d) Compare to your graph in Exercise A8.2. Are they exactly the same, apart from different magnification?
_____________________________________________________________________


Commercial log-log papers used to be available, with different numbers of periods: 1 by 1, 2 by 3, 3 by 5, etc. Computerization has driven them out, but they still are useful – for educating students in handling logarithmic scales, and for rough model building schemes. I keep some and make copies of them as the need arises. Note that all period squares are exactly the same. Therefore, you can paste together copies of your grid in Exercise A9.2 and get as many periods as you need for fitting your data.
___________________________________________________________________
Exercise A9.3
To graph population vs. area of countries in the world, one must reach from 21 square kilometers (Nauru) to 17 million (Russia), and from 13,000 for Nauru to 1.3 billion for China on the population scale.
a) How many 1-to-10 periods do we need on the area axis? sba
b) How many 1-to-10 periods do we need on the population axis? hat
NOTE: We do not have to start with 1 square kilometer or 1 person.
__________________________________________________________________

Figure A9.4. Grid for log-log graph y vs. x, with x pulled out to twice the length.
[Figure A9.4: the same log-log grid (y from 1 to 100, x from 1 to 10), with the x period drawn twice as long as a y period.]

The periods on the two scales do not have to be equal. In Figure A9.4 the x scale has been pulled out to twice the length, compared to Figure A9.3. Our C vs. N data showed a steep decrease in the previous log-log plot (slope -2). In the new grid, the resulting line would decrease at 45 degrees (slope -1). Sometimes such a moderate slope is preferable for inspecting and fitting data.

Recall CAUTION in Exercise A9.2c: Graph log y on regular scale OR y itself on logarithmic scale – but not log y on logarithmic scale!


Semilog graphs

On log-log graphs both axes are on logarithmic scales. But one can also use a log scale on only one axis, the other being on a regular scale. This is often called a “semilog” graph. Its grid is shown in Figure A9.5.

Figure A9.5. Grid for semilog graph y vs. x.
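[Figure A9.5: semilog grid with a logarithmic y axis marked 1, 2, 3, 5, 10, 20, 30, 50 and 100, and a regular x axis marked 0, 2, 4, 6, 8, 10 and 12.]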

________________________________________________________________
Exercise A9.4
Suppose the national currency of a country uses 11 different coins and bank notes, and we rank them by increasing value:

Rank    1      2      3      4   5   6   7    8    9    10    11
Value   0.10   0.20   0.50   1   2   5   10   25   50   100   500

a) Graph value on log scale and rank on a regular scale.
b) Draw in a line or curve, if that seems to make sense.
c) Draw conclusions regarding regularities and irregularities in this pattern.
d) Make a list of coins and bank notes used in your country and repeat the exercise. Comment.
e) Does your country have some coins or bank notes that exist but you rarely see them? Would the pattern be more clear-cut without them?
__________________________________________________________________


Regular, semilog and log-log graphs – when to use which?

When data include both very small and very large values, a logarithmic scale is the only way to tell apart the median countries from the tiny ones. But the number of parties in Table A7.1 did not vary that much. We still graphed it on log scale. We did so because the logical model suggested that then the expected curve would turn into a straight line.

How do we know then which way to graph, in general? Some guidelines will be given in Part B. But often I have no idea. Then I graph the data in several ways, and sometimes a linear pattern appears for some way of graphing. Then it’s time to ask: What is the reason behind this regularity?
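The habit of graphing the same data in several ways can be mimicked numerically. The Python sketch below takes the N and C values as listed in Table A8.1 and scores three ways of graphing them by the squared correlation coefficient, a measure taken up in Part B. Using that number as a stand-in for “looks linear to the eye” is an assumption of this sketch, and a crude one; it does not replace looking at the graph.

```python
import math


def r_squared(xs, ys):
    """Squared correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# N and C (in years) as listed in Table A8.1.
parties = [1.35, 1.7, 2.2, 3.3, 5.0]
duration = [40, 14.9, 4.9, 4.7, 1.3]

log_parties = [math.log10(n) for n in parties]
log_duration = [math.log10(c) for c in duration]

scores = {
    "regular (C vs. N)": r_squared(parties, duration),
    "semilog (log C vs. N)": r_squared(parties, log_duration),
    "log-log (log C vs. log N)": r_squared(log_parties, log_duration),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

With these five points the log-log version comes out as the most nearly linear of the three, which is what the model leads one to expect. A high score is only a prompt to ask why the pattern is there, not an answer to that question.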


A10. Fermi’s Piano Tuners: “Exact” Science and Approximations
______________________________________________________________
• Exact sciences mean sciences that strive to present their results as exactly as possible at the given stage of research. In later applications they use no more precision than is needed.
• While still building a logical model, one might initially apply “no more exact than needed”.
• Superficially impossible estimates can be decomposed into a sequence of simpler estimates that can be answered approximately.
• One should develop a sense for typical sizes of things, which could then be fed into such approximations.
_______________________________________________________________

We have repeatedly used approximations, and you may feel uneasy about it. Aren’t natural sciences (and physical sciences in particular) “exact sciences” and shouldn’t social sciences try to become more of an exact science? Exact science, however, does not mean that every result is given with three decimals. It rather means striving to be as exact as possible, under the given conditions, and specifying the likely range of error. In the beginning, this range of possible error may be huge. It’s acceptable, if there is some basis for gradually improving our measurements and conceptual models.

As exactly as possible – and as needed

Nothing would stifle such advance more than advice to give up on quantitative approaches just because our first measurements involve a wide range of fluctuation or our conceptual model does not agree with the measurements. A three-decimal precision will never be reached, if one refuses to work out problems approximately, at first. Exact science means being as exact as possible at the given stage of research and thus making it possible to be more exact in the future.

But even “as exact as possible” must be qualified. In later applications there is little point in using more precision than is needed for the given purpose. And while building a logical model, one might initially apply “no more exact than needed”. What does it mean? Take volatility. We restricted ourselves to a single input variable that could affect it – the number of parties. Even here we at first built a linear model that ignored a glaring constraint: volatility cannot surpass 100 per cent. We simplified and aimed at an approximation that might work at the usual volatilities of much less than 100 per cent. Then we approximated the limit of a “surprise zone”. We got a fair fit to this coarse model (Figure A6.2). The refined exponential model would become more efficient (conform to reality) only if we located party systems with a huge effective number of parties. If we had tackled the refined model first, I might have lost you in needless complexities. And when I first developed the model, I most likely would have confused myself.


c) A student of mine around 1975 looked up the US census. Believe it or not, it did have the number of piano tuners in New York: about 400. By what percentage or by what factor was your estimate off?
_______________________________________________________________________

The range of possible error

How good is our guess for the number of piano tuners? Now this is a question we cannot avoid. How much off are we likely to be? Recall the discussion of relative error in Chapter A3. Let us look realistically at the possible errors at each step.

• How many people in New York? You might be off by a factor of 2: relative error ×÷2.
• How many households in New York? Once we ask this, we discover that we first have to ask: How many people per average household? It could again be ×÷2.
• What share of households might have a piano? This is where students in the 1960s and 1970s were liable to be far off, depending on whether their own parents did or didn’t have a piano (and you have to think back into no-computers surroundings). And compared to the number of household pianos, how many pianos are there elsewhere (concert halls etc.)? This adds to the uncertainty. You might be off by ×÷4.
• How often does a piano have to be tuned? You may have no idea, and phoning a piano tuner would give you only a lower limit, because: how often is the average piano tuned in reality? I know a piano that hasn’t been tuned for 20 years. You might be off by ×÷3.
• How long does it take to tune a piano? You might be off by ×÷3.
• So what is the total work load for piano tuners, in hours, during one year? You have a long string of multiplications but no new error, unless you make a computation error.
• Assuming a 40 hour work week, how many piano tuners would it keep busy? No new error here either, unless you make a computation error.

By what factor are you likely to be off on the number of piano tuners?

Suppose you overestimate every factor in the multiplication by its maximal error and also underestimate every factor you divide by, by its maximal error. By the error estimates above, we then have a total overestimate by a factor of 2×2×4×3×3=144≈150. Instead of 400 tuners, we could have found 150×400=60,000 tuners. Or if we underestimated to the same degree, we could propose 400/150=2.7≈3 tuners. An estimate ranging from 3 to 60,000: this was not the point Fermi was trying to make.

When you did the exercise above, without any cheating (so as to get it closer to 400), you probably got much closer, most likely to within ×÷10 or ×÷15. Why? You were most likely to err in random directions. At some steps your error boosted the number of tuners, and at some other steps it reduced it. If you were very lucky, you could have horrendous errors at each step, yet end up with the actual value, if your errors perfectly cancelled out.
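
The contrast between the worst case and errors in random directions can be checked by brute force. The following is a minimal sketch, not part of the original argument: it assumes that at each of the five steps the estimate is off by exactly its maximal factor, either upward or downward, and that all 32 up/down patterns are equally likely. The names `step_factors` and `combined_errors` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Maximal error factors for the five uncertain steps listed above
# (population, household size, piano share, tuning frequency, tuning time).
step_factors = [2, 2, 4, 3, 3]


def combined_errors(factors):
    """Return the combined error factor for every over/under pattern.

    Assumption for illustration: each step is off by exactly its maximal
    factor, either too high (exponent +1) or too low (exponent -1).
    """
    outcomes = []
    for signs in product((1, -1), repeat=len(factors)):
        total = Fraction(1)
        for factor, sign in zip(factors, signs):
            total *= Fraction(factor) ** sign
        outcomes.append(total)
    return outcomes


outcomes = combined_errors(step_factors)
worst_over = max(outcomes)    # every step errs in the same direction
worst_under = min(outcomes)
within_15 = sum(1 for r in outcomes if Fraction(1, 15) <= r <= 15)

print("patterns:", len(outcomes))
print("worst overestimate factor:", worst_over)
print("patterns within a factor of 15:", within_15)
```

Under these assumptions only 2 of the 32 patterns reach the full factor of 144, while 22 of the 32 stay within ×÷15, which is the sense in which errors in random directions partly cancel.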

What is the most likely error range on multiplicative sequences of estimates? My educated guess is: Take the square root of the maximal combined error. For the piano tuners, $144^{1/2}=12$. Indeed, being off by ×÷12 fits with my average experience with untutored students. However, the combined error cannot be smaller than that for any component. If you combine ×÷2 and ×÷4 to ×÷8, then the actual likely error is at least ×÷4, even though $8^{1/2}=2.8$.

EXTRA: Dimensional consistency

We encountered the dimensional consistency requirement in Chapter A7. It looked like another nuisance of a formality. In the present problem, it actually becomes extremely useful. But if it mystifies you, leave it for later. It’s not essential, just helpful.

We started with the city population. The larger it is, the more piano tuners there must be. But so as to find the number of households, should we multiply or divide by the number of people in a single household? We may use our good sense. If we multiply, we get more households than people, so I guess we must divide. And so on.

But introduce units, or quasi-units. City population is not just so many million; it’s so many million people. What is the unit for the number of people in the household? The unit is people/household. When multiplying, the units also multiply, resulting in [x people][y people/household] = xy people²/household. People squared? No, this is not what we want. So let us try dividing. By the usual rules of arithmetic, people/people cancel out, and 1/[1/household]=household. So [x people]/[y people/household] = (x/y) households. Yes, this is the unit we want to have.

Overall, we want a sequence where, after the rest cancels out, only the unit “tuners” remains. In the following, I’ll drop the numbers (such as x and y above) and show only units. I’ll go in three steps. The first is for the total number of pianos:

$$\frac{[\text{people}]\,[\text{pianos/household}]}{[\text{people/household}]\,[\text{household pianos/all pianos}]} = \text{pianos}.$$

Check that, indeed, everything else cancels out, leaving pianos. The second step is for the total work hours/year needed:

$$[\text{pianos}]\,[\text{tunings}/(\text{year}\times\text{piano})]\,[\text{work hours/tuning}] = \text{work hours/year}.$$

The third step is for the number of piano tuners:

$$\frac{[\text{work hours/year}]}{[\text{work hours}/(\text{tuner}\times\text{week})]\,[\text{weeks/year}]} = \text{tuners}.$$

Check that everything else cancels out, leaving tuners. We could do the whole operation in one mammoth step:

$$\frac{[\text{people}]\,[\text{pianos/household}]\,[\text{tunings}/(\text{year}\times\text{piano})]\,[\text{work hours/tuning}]}{[\text{people/household}]\,[\text{household pianos/all pianos}]\,[\text{work hours}/(\text{tuner}\times\text{week})]\,[\text{weeks/year}]} = \text{tuners}.$$

Here we just have to insert the numbers in their proper places (multiplying or dividing). But maybe such a long sequence is too much of a mouthful.
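
The unit bookkeeping above can also be done mechanically. The following is a small sketch under stated assumptions, not a method from the text: each quantity is represented only by its units, written as a dictionary from a unit name to its exponent, and the helper `combine` (a hypothetical name) adds exponents for multiplied quantities and subtracts them for divided ones. The labels `household_piano` and `piano` are illustrative stand-ins for “household pianos” and “all pianos”.

```python
from collections import defaultdict


def combine(numerator, denominator):
    """Multiply the numerator units, divide by the denominator units,
    and return only the units that do not cancel out."""
    total = defaultdict(int)
    for units in numerator:
        for name, power in units.items():
            total[name] += power
    for units in denominator:
        for name, power in units.items():
            total[name] -= power
    return {name: power for name, power in total.items() if power != 0}


# Units only, no numbers (exponent +1 = "per nothing", -1 = "per ...").
people = {"person": 1}
pianos_per_household = {"household_piano": 1, "household": -1}
people_per_household = {"person": 1, "household": -1}
household_share = {"household_piano": 1, "piano": -1}
tunings_per_year_piano = {"tuning": 1, "year": -1, "piano": -1}
hours_per_tuning = {"hour": 1, "tuning": -1}
hours_per_tuner_week = {"hour": 1, "tuner": -1, "week": -1}
weeks_per_year = {"week": 1, "year": -1}

# Wrong move: multiplying population by household size.
wrong = combine([people, people_per_household], [])

# Right move: dividing population by household size.
households = combine([people], [people_per_household])

# The whole chain in one step.
tuners = combine(
    [people, pianos_per_household, tunings_per_year_piano, hours_per_tuning],
    [people_per_household, household_share, hours_per_tuner_week, weeks_per_year],
)

print(wrong)       # people squared per household: not what we want
print(households)  # households
print(tuners)      # only "tuner" should remain
```

If a factor is placed on the wrong side of the fraction, some unit other than “tuner” survives in the result, which flags the mistake before any numbers are inserted.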

A11. Approximations and Cumulating Error

Subtraction: Absolute errors still add

For subtraction, it’s trickier. Suppose you subtract 50±2 from 60±3. The outcome can vary as widely as 63-48=15 to 57-52=5, i.e., 10±5. More generally, (a±Δa)-(b±Δb) = a-b±(Δa+Δb). When you subtract two quantities, their error ranges do NOT subtract; they add. But this is nothing compared to what can happen to relative error. In our example it is boosted to (3+2)/(60-50)=5/10=50%, hugely more than the 4% and 5% for the components. In subtraction, relative error can explode. This is something to really watch out for when we measure differences between almost equal entities. Differences between vote shares and seat shares of parties offer one example.

Multiplication: Small percent errors add

Calculate the area of the table discussed above. It is (60±3 cm)(20±2 cm). The average estimate is 60×20=1200 cm². The maximum is 63×22=1386, above the average by 186. The minimum is 57×18=1026, below the average by 174. We cannot express it as 1200 plus or minus the same amount.

Now consider the relative error. It is 3/60=5% for the length, 2/20=10% for the width, and for the product +15.5% and -14.5%, both fairly close to 5+10=15%. It can be shown that this is always the case, as long as the relative errors on components are small. More generally, relative error on a quantity a can be designated as Δa/a. If you want it in percent, multiply by 100%. If such errors are small, then term-by-term multiplication leads to

(a±Δa)(b±Δb) = ab±(bΔa+aΔb+ΔaΔb) ≈ ab±(bΔa+aΔb),

because ΔaΔb is tiny, compared to bΔa or aΔb. Dividing by ab results in

(1±Δa/a)(1±Δb/b) ≈ 1±(Δa/a+Δb/b).

Thus (a±Δa)(b±Δb) leads to a relative error of ±(Δa/a+Δb/b). Here relative errors add, as the example above illustrates. If you multiply by a third quantity (c±Δc), add its relative error too: 1±(Δa/a+Δb/b+Δc/c).
____________________________________________________________________
Exercise A11.1
Verify that ΔaΔb is tiny, compared to bΔa or aΔb. Draw a rectangle with sides a±Δa and b±Δb. Separate it into 4 regions corresponding to ab, bΔa, aΔb and ΔaΔb, respectively. When Δa<<a and Δb<<b, what does this do to ΔaΔb?
______________________________________________________________________

Multiplication with large relative errors: Errors “by a factor of” multiply

But what happens when the relative errors are large? First of all, the count by percentages becomes awkward anyway (cf. Chapter A3), as +90% and -90% take on quite different meanings. If we add 90% to our wealth, we merely increase it by a factor close to 2. If we lose 90% of our wealth, we are down by a drastic factor of 10. We have to switch terminology to “by a factor”.

Suppose we estimate the value of a herd of cows we see on a pasture, but we have only a hazy idea both of how many there are and what an average cow costs. Our best guess is that there are 200 cows, but it could easily be one-half or double. Our estimate is good within a factor of 2: n=200×÷2. We are even more uncertain about the average price for a cow, if we are not in the cow business and some cows are in much better shape than some others. Our best guess for this price is 1000 money units (M), but it could easily be one-quarter or quadruple of that: p=1000M×÷4. Our estimate of the value of the herd is np = [200×÷2][1000M×÷4] = 200,000M×÷8. We could be off by a factor of 2×4=8. The value of the herd could be as low as 25,000M or as high as 1,600,000M.

More generally, when the error ranges are indicated as “multiply or divide by a certain factor” and quantities are multiplied, then these factors are also multiplied. When designating this factor on quantity a as A,

[a×÷A][b×÷B] = ab×÷[AB].

What happens when you apply this rule to small relative errors? It still works and, in fact, is more accurate than adding percent errors. It’s just that you also can use percent error addition as long as the relative errors are small.

Division: Errors “by a factor” multiply, small percent errors add

We observed that errors on subtractions cumulated like those on additions. Similarly, errors on divisions cumulate like those on multiplications, with one happy difference. When a≈b, the relative error on a-b becomes huge, but the relative error on a/b remains manageable. It is easy to see, by rearranging the terms, that “by a factor” relative errors multiply:

[a×÷A]/[b×÷B] = (a/b)×÷[AB].

As long as the relative error is small, we can express a division in the format

(a±Δa)/(b±Δb) = (a/b)(1±Δa/a)/(1±Δb/b).

The relative error term can be approximated as

(1±Δa/a)/(1±Δb/b) ≈ 1±(Δa/a+Δb/b).

Thus small percent errors are seen to add. All this also applies when more than two quantities enter multiplications and divisions.

Do standard formulas overdo cumulating error?

When we combine many more than 2 quantities, the standard equations above may overestimate the likely error. For the value of the herd of cows in the example above, we could be off by a factor of 2×4=8. However, underestimates and overestimates at the two steps may counterbalance each other. So we are likely to be off by a factor between 4 and 8. The more steps are involved, the more they are likely to cancel each other out. In Exercise A10.1 you most likely were off by much less than a factor of 150.
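
The three rules of this chapter (absolute errors add in subtraction, small relative errors add in multiplication, errors “by a factor” multiply) can be checked against the worked numbers. This is a minimal sketch; the function names are illustrative, and the functions compute worst-case ranges only.

```python
def subtract_with_error(a, da, b, db):
    """Worst case for (a ± da) - (b ± db): the absolute errors add."""
    return a - b, da + db


def product_bounds(a, da, b, db):
    """Central, minimum and maximum value of (a ± da)(b ± db)."""
    return a * b, (a - da) * (b - db), (a + da) * (b + db)


def multiply_by_factor(a, A, b, B):
    """[a ×÷ A][b ×÷ B] = ab ×÷ AB: the error factors multiply."""
    return a * b, A * B


# Subtraction example: (60 ± 3) - (50 ± 2)
diff, diff_err = subtract_with_error(60, 3, 50, 2)
relative_diff_err = diff_err / diff            # relative error explodes

# Table area: (60 ± 3 cm)(20 ± 2 cm)
area, area_min, area_max = product_bounds(60, 3, 20, 2)
rel_up = (area_max - area) / area              # compare with 5% + 10%
rel_down = (area - area_min) / area

# Herd of cows: [200 ×÷ 2][1000 ×÷ 4]
herd, herd_factor = multiply_by_factor(200, 2, 1000, 4)
herd_low, herd_high = herd / herd_factor, herd * herd_factor

print(diff, diff_err, relative_diff_err)
print(area, area_min, area_max, rel_up, rel_down)
print(herd, herd_factor, herd_low, herd_high)
```

The asymmetric bounds on the area (+15.5% and -14.5%) come from the ΔaΔb term, which the small-error approximation drops.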

What is the most likely error range on multiplicative sequences of estimates? My gut feeling is to take the larger of the following two: the largest single error, or the square root of the maximal combined error.

EXTRA 1: Combined error on additions

Reconsider the example with length and width of the table. When we write 20±2 cm we do not really mean that all values from 18 to 22 cm may occur with equal likelihood. Rather, 19, 20 and 21 are likely to be more frequent than 18 or 22. Moreover, 17 and 23 may also slip in with very low likelihood. But let us assume 20±2 to mean that all values from 18 to 22 can occur with equal likelihood. See what happens when we add 20±2 and 60±3. The following table shows all 35 combinations that could result.

          57   58   59   60   61   62   63
     _____________________________________
18 |      75   76   77   78   79   80   81
19 |      76   77   78   79   80   81   82
20 |      77   78   79   80   81   82   83
21 |      78   79   80   81   82   83   84
22 |      79   80   81   82   83   84   85

The frequency of the sums is:

Sum:         75   76   77   78   79   80   81   82   83   84   85 | Total
______________________________________________________________________
Frequency:    1    2    3    4    5    5    5    4    3    2    1 |    35

The extreme combinations occur quite rarely. This is a very coarse procedure, but it gives an idea of how application of the rule (60±3)+(20±2)=80±5 overdoes the actual error range. We might feel like excluding the extremes that occur rarely and say (60±3)+(20±2)=80±3. This range excludes only 75, 76, 84 and 85. Thus it still includes 29/35=83% of the outcomes. Note that $3^{1/2}+2^{1/2}=3.1$ would yield such a range.

EXTRA 2: Combined error on multiplications

Similar observations can be made for multiplication. Addition of small percent errors may overdo the actual error to be expected, and so may multiplication of terms “by a factor of” in the general case. The price of the cow herd in the previous example leads to the following multiplication table, where the products are in thousands.

      p:    250    500   1000   2000   4000
n:   ______________________________________
100 |        25     50    100    200    400
200 |        50    100    200    400    800
400 |       100    200    400    800   1600

The frequency of the products is:

Product:      25    50   100   200   400   800  1600 | Total cases
__________________________________________________________________
Frequency:     1     2     3     3     3     2     1 |          15

Yes, we may be off by a factor of 8, but only in 2 cases out of 15. The range ×÷4 would include 87% of all the combinations. Compared to the error ×÷4 on the price of a cow, the error ×÷2 on the number of cows becomes almost negligible.

EXTRA 3: Untested corrections for standard formulae on cumulating error
Such concerns lead to the question: If (a±Δa)+(b±Δb) = a+b±(Δa+Δb) overestimates the likely absolute error on the sum, which formula would be more realistic? This section is really far out. I have not searched the literature, so I may reinvent the wheel, or be mistaken. The corrected formula should satisfy the following conditions.

• It should yield an error less than ±(Δa+Δb) but more than either ±Δa or ±Δb.
• For Δb=0 the error on the sum must be just Δa.
• For Δb=Δa, the error on the sum must be between Δa and 2Δa, and $2^{0.5}\Delta a=1.41\Delta a$ is in the right ballpark in view of the previous table for (60±3)+(20±2).

The following satisfies most of the conditions:

$$(a\pm\Delta a)+(b\pm\Delta b) \approx a+b \pm \Delta a\,\frac{(1+x)^{1.5}}{1+x^{0.5}}, \quad\text{where } x=\Delta b/\Delta a,$$

with Δa being the largest single error. The fly in the ointment is that when 0<Δb<0.375Δa, then the error on the sum comes out as slightly less than Δa! The maximum drop is to 0.875Δa when Δb=0.079Δa. One can keep the error on the sum above Δa, without changing the outcome at Δb=0 and at Δb=Δa, by multiplying $\Delta a(1+x)^{1.5}/(1+x^{0.5})$ by $0.572[2-(x^{0.273}-0.5)^2]$, but this is a rather ad hoc way out. It might be preferable to calculate the error on the sum by the formula above and just raise it to the largest single error if it falls below it. Even simpler, one could always use $2^{0.5}\Delta a=1.41\Delta a$ without underestimating the error range.

The formula can readily be extended to addition of more than two items. For n items added, $\pm n^{0.5}\Delta a$ would apply when all items have comparable absolute errors. If some among the errors are smaller, $\pm n^{0.5}\Delta a$ would be a safe upper estimate.

Similar adjustments could be made for error ranges on multiplication. For small relative errors, the error on (a±Δa)(b±Δb) might be only $\pm 2^{0.5}\Delta a/a$ rather than ±(Δa/a+Δb/b) when the two components have comparable relative errors. When relative errors are large, the error on [a×÷A][b×÷B] might be $×÷[2^{0.5}A]$ rather than ×÷[AB], when the two components have similar relative errors.

For the example of the cow herd, this coarse approximation leads to ×÷5.7 rather than ×÷8. Making use of $(1+x)^{1.5}/(1+x^{0.5})$ cuts it down to ×÷4.3, which underestimates the error range unless the further correction $0.572[2-(x^{0.273}-0.5)^2]$ is also introduced, with x=2/4=0.5. This correction is 1.08, so the outcome is ×÷4.7. Once more, the results in this Extra need to be checked further. What is certain is that the standard equations for cumulating error overdo it.
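
The counts in Extras 1 and 2 and the tentative formula of Extra 3 can be reproduced in a few lines. This is a sketch under the same simplifying assumption as the tables (every listed value is equally likely); `tentative_error` is an illustrative name for the untested expression above, with the suggested floor at the largest single error.

```python
from collections import Counter

# Extra 1: all 35 sums of a length (60 ± 3) and a width (20 ± 2).
lengths = range(57, 64)
widths = range(18, 23)
sum_counts = Counter(l + w for l in lengths for w in widths)
share_within_3 = sum(
    count for total, count in sum_counts.items() if abs(total - 80) <= 3
) / 35

# Extra 2: all 15 products of herd size and price per cow.
herd_sizes = (100, 200, 400)
prices = (250, 500, 1000, 2000, 4000)
value_counts = Counter(n * p for n in herd_sizes for p in prices)
best_guess = 200 * 1000
share_within_4 = sum(
    count
    for value, count in value_counts.items()
    if best_guess / 4 <= value <= best_guess * 4
) / 15


def tentative_error(largest, smaller):
    """Untested correction from Extra 3:
    largest * (1 + x)**1.5 / (1 + x**0.5), with x = smaller / largest,
    raised to the largest single error whenever it falls below it."""
    x = smaller / largest
    raw = largest * (1 + x) ** 1.5 / (1 + x ** 0.5)
    return max(raw, largest)


print(sorted(sum_counts.items()))
print(round(share_within_3, 2))
print(sorted(value_counts.items()))
print(round(share_within_4, 2))
print(round(tentative_error(4, 2), 1))   # cow herd, factors 4 and 2
```

The last line applies the formula to the error factors of the cow herd in the same way the text does (x=2/4), before the further ad hoc correction.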

A12. How to Test Models: Logical Testing and Testing with Data
________________________________________________________
• Logical models must not predict absurdities even under extreme circumstances. Logical testing checks the model for extreme or special situations.
• Most models apply only under special conditions, which should be stated.
• In testing models with data, eyeballing must precede statistical tests; otherwise, the wrong type of test might be used.
• Data often must be transformed prior to statistical testing, guided by the logical model. Logarithms often enter.
• Study the graphs of data and model-predicted curves very carefully. They may supply much more information than meets the eye at the first glance.
________________________________________________________________

We already touched on model testing in Chapters A6 and A7. Testing a logical model involves two aspects:

• Logical testing, to guard against any absurd consequences.
• Testing with actual data.

Testing with data is often thought to be the only test a model needs, but it alone does not suffice. The coarse linear model of volatility (Chapter A6) ran into trouble at a large number of parties, predicting volatilities of more than 100%. It had to be modified into an exponential model that avoids this absurdity. Let us review all the models previously offered.

Logical testing

In Chapter A1, the model for the number of seat-winning parties (p) in a district with M seats is $p=M^{1/2}$. Look for extreme cases. The lowest possible M is 1. Does the model yield a reasonable result? Yes, it predicts p=1, which is indeed the only logically acceptable possibility. There is no conceptual upper limit on M, and no contradictions can be seen even for very high values. Of course, both M and p come in integer numbers. Yet for most integer values of M, the formula yields a fractional p. This is no problem, because we deal with average expectations. For M=14, we calculate p=3.7. It means that 4 parties winning seats is somewhat more likely than 3. Note that it is quite possible that only 2 or as many as 5 or more parties win seats; we just consider it much less likely.

Does the model for the largest component (Chapter A4) fit even under extreme conditions? The model is $S_1=S/N^{1/2}$. Suppose there is only one component: N=1. The formula correctly yields $S_1=S$. Now suppose we have a federal assembly of 100 seats and the number of federal units is also 100. If the term “federal unit” has any meaning, each would have one and only one seat, including the largest. Yet the formula yields $S_1=100/100^{1/2}=10$, leaving only 90 seats for the other 99 federal units! How should we refine the model so as to avoid such an outcome?
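
The extreme-case checks just described can be written down as a tiny script. This is a minimal sketch with illustrative function names; it only evaluates the two coarse models at the special values mentioned above and flags the case in which the remaining units could not each receive one seat.

```python
def seat_winning_parties(M):
    """Coarse model from Chapter A1: p = M**(1/2)."""
    return M ** 0.5


def largest_share(S, N):
    """Coarse model from Chapter A4: S1 = S / N**(1/2)."""
    return S / N ** 0.5


# Extreme case M = 1: only one party can win the single seat.
p_single = seat_winning_parties(1)

# A typical case: M = 14 gives a fractional average expectation.
p_fourteen = seat_winning_parties(14)

# Extreme case N = 1: the only component holds everything.
s1_single = largest_share(100, 1)

# Extreme case S = N = 100: one seat per unit is the only sensible outcome.
s1_federal = largest_share(100, 100)
seats_left_per_unit = (100 - s1_federal) / 99
absurd = seats_left_per_unit < 1   # the others cannot each get one seat

print(p_single, round(p_fourteen, 1))
print(s1_single, s1_federal, absurd)
```

The flag `absurd` turning out true is the signal that the coarse model needs a stated condition of validity or a refinement.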

When establishing the model, we started with $S/N<S_1<S$. Actually, we should specify that $S/N<S_1<S-(N-1)m$, where m is the minimal amount that must be left to the other N-1 components. If this minimum is 1 seat, then the geometric mean of the extremes is $S_1=(S-N+1)^{1/2}(S/N)^{1/2}$. This is a pretty messy expression, but yes, plug in S=100 and N=100, and we do get $S_1=(1)^{1/2}(100/100)^{1/2}=1$, as we should.

Can we still use $S_1=S/N^{1/2}$? Yes, as long as the number of items allocated (S) is much larger than the number of components (N) multiplied by the minimal share (m): $S \gg Nm$. Thus we should specify the model as follows:

$S_1=S/N^{1/2}$ $[S \gg Nm]$.

If this condition is not fulfilled, switch to a more refined model. Most models apply only under special conditions, which should be stated.

The relationship is analogous for the exponential and linear models for volatility. One is more refined, yet the other is so much easier to use and most often works with sufficient precision. One just has to know when we can approximate to what degree. Note that Exercise A6.5 introduces a further limit.

Consider next the model for cabinet duration (Chapter A7). We already noted that the basic model for the number of communication channels, c=n(n-1)/2, works also for the extreme cases of n=0 and n=1. But what about the model for cabinet duration, $C=k/N^2$? The lowest limit on the number of parties is N=1. Then C=k. The best fitting value of k has been found to be 42 years. If a one-party democracy came to exist, the model predicts a cabinet duration of about 42 years. For a pure two-party system (N=2), the model predicts about 10 years. Does it feel right?

Here it is not a matter of clear conceptual limits but of having surprise zones (as in Figure A6.1 for volatility). By the loosest definition, a cabinet is considered to continue as long as it consists of the same party or parties. The ministers or even the prime minister could change. Even so, we might be surprised if the same cabinet continued beyond human life spans, say 80 years. At the low side, consider a two-party system. One party is bound to have a slight majority and form the cabinet, which is likely to last until the next elections, short of a rather unusual rift within the ruling party. Elections typically take place every 4 years. Hence average durations of less than 4 years for two-party constellations would surprise us. Plug C=4 years and N=2 into $C=k/N^2$, and out pops $k=CN^2=16$ years.

In sum, values of k outside the range 16 years<k<80 years might surprise us. The geometric mean of the non-surprise zone is 36 years. The actual 42 years is remarkably close to 36 years, but this is sheer luck. After all, the limits 16 and 80 years are not conceptual anchor points but quite hazy limits of vague zones of surprise.

Does $C=k/N^2$ make conceptual sense at the other extreme, a very large number of parties? As $N\to\infty$ (N tends to infinity), $C=k/N^2\to 0$. Yes, we can imagine cabinet durations becoming ever shorter as the number of parties increases. Of course, just as one-party democracies are hard to imagine (although Botswana comes close, with effective number of parties N=1.35), very short durations are hard to imagine in practice. But models must not predict absurdities even under unreal circumstances.
______________________________________________________________________
Exercise A12.1
a) Suppose we draw a line at a situation where each day sees a new cabinet formed. How many parties would this correspond to?
b) Suppose we actually feel that no democratic regime could withstand more cabinet changes than once a month. How many parties would this correspond to?
c) Suppose we draw a line at a situation where the effective number of parties is 30. To what cabinet duration might it lead?
____________________________________________________________________

Testing with data

Models may avoid logical inconsistencies, yet fail to agree with reality. We may deduce logically that the pattern $C=k/N^2$ should prevail, but we may discover that the actual pattern is closer to $C=k/N^3$ or to $C=k/N^{1.5}$, or to no equation of the form $C=k/N^n$. How do we discover what the actual pattern is?

One way is to graph all the raw data we have and compare to what the model predicts. This is what we did in Exercise A8.1, for a few highly selected countries. We may graph the curves $C=k/N^2$ for selected values of k such as k=30, 40 and 50 years and see if the clouds of data points fit along them or in between them. If they do, the model fits. By trial and error, we might find the best fitting value of k, but this is a slow process. If, on the contrary, the data cloud crosses the curves $C=k/N^2$, then the model does not fit, and we have to find a family of curves to which the cloud does fit.

Curves are messy. It is so much easier to work with straight lines. With the latter, the eyeball test often works.
At a glance, we can see whether the data points follow a straight line and what its slope is. Fortunately, whenever a model involves only multiplication, division and fixed exponents, certain routine transformations change it into a linear model. They use logarithms. We had fair exposure to logarithms in Chapter A9. For the moment, all we need to add is the following rule:

When $y=a^A b^B/c^C$, then $\log y = A\log a + B\log b - C\log c$.
______________________________________________________________________
Exercise A12.2
a) Calculate $y=2^5 5^1/3^4$ directly and also by using $\log y = A\log a + B\log b - C\log c$, to convince yourself you get the same result. (Use the log values derived in Chapter A9: log2=0.301, log3=0.477, log5=0.699.)
b) In $y=a^A b^B/c^C$ and $\log y = A\log a + B\log b - C\log c$ make c=1 and A=B=1. Which simple correspondence do you get?
c) In $y=a^A b^B/c^C$ and $\log y = A\log a + B\log b - C\log c$ make b=1 and A=C=1. Which simple correspondence do you get?
____________________________________________________________________

Most of the models discussed up to now have this form (Table A.12.1). Why? We live very much in a multiplicative world (rather than additive). For all these models, conversion to logarithms turns the curved pattern into a linear one. Thus it is most useful to graph them on log-log paper. The coarse model for volatility is different; testing it does not need logarithms. However, logarithms enter the refined model in a different way, to which we’ll come later.

Table A.12.1. Many models have the form $y=a^A b^B/c^C$.
_________________________________________________________________
Generic form: $y=a^A b^B/c^C$ → $\log y = A\log a + B\log b - C\log c$
__________________________________________________________________
The number of seat-winning parties: $p=M^{1/2}$ → $\log p = 0.5\log M$
The largest share: $S_1=S/N^{1/2}$ → $\log S_1 = \log S - 0.5\log N$
Cabinet duration: $C=k/N^2$ → $\log C = \log k - 2\log N$
__________________________________________________________________

Figure A12.1 shows the result for cabinet duration in 35 democracies (data from Lijphart 1999) graphed against the effective number of parties, both on logarithmic scales. (Compare it to the curved pattern you drew in Exercise A8.1 and the straightened patterns in Exercises A8.2 and A9.1.)

Figure A12.1. Mean cabinet duration vs. effective number of legislative parties – Predictive model and regression line. Source: Taagepera and Sikk (2007).

Notes: Thin solid line: best fit between logarithms. Bold solid line: theoretically based prediction [C=42 years/N2]. Dashed lines: one-half and double the expected value.

______________________________________________________________________
Exercise A12.3
What can we see in graph A12.1? Stop reading, focus on the graph, and make short notes on what you can see directly or indirectly, and what you can conclude on the basis of this graph. Students trained to limit themselves to directional models often see only that “when N increases, C decreases.” There is much more. This is one of the most important exercises in the book. This is why, as an exception to the usual practice, the suggested answer is given at the end of the chapter. Do not peek before making your own list! Only this way will you learn to see more than up to now.
________________________________________________________________________

The tennis match between data and models

What has been done here may look like the time-honored simple recipe: “hypothesis (model) → data collection → testing → acceptance/rejection”. However, this would oversimplify the process. The actual interaction between model and data looks like a tennis match. Superficial data inspire the first coarse logical model. The model may suggest looking for different data that better correspond to what the model is about. But the same discrepancies between model and data may also motivate search for a more refined model. Some hidden assumptions may have entered the first round of model building and must be explicitly stipulated. For instance, the coarse model for volatility implicitly assumed that any positive values of volatility are acceptable, even those surpassing 100 percent.

In sum, the process of scientific research most often looks like an ascending spiral. The simple recipe above represents a single cycle within this spiral:

Initial hunch (qualitative hypothesis) → limited data collection → quick testing → quantitatively predictive model (quantitative hypothesis) → further data collection → testing → refined model → testing → further refining of model or data → testing ...
The essential part of a predictive model is the predicted functional form of relationship among the variables. The model may include a constant or parameter (like k for cabinet duration) which must be determined empirically. Due to conceptual constraints, predictive models rarely are linear. Linear approximations are useful in preliminary work, along with graphical representations, to get a feel for the empirical pattern. They are also useful at the very end, as practical simplifications. In order to know when a simplification can be used, one must be aware of the refined model.

Why would the simplest forms prevail?

This book has highlighted the models based on ignorance, or rather near-ignorance, teasing the most out of what we know about constraints. Conceptually forbidden zones, anchor points, and continuity in-between those points are important parts of our knowledge. Asking what would happen under extreme conditions can lead to insights, even when we agree that such extremes will never materialize. When the impossible is eliminated, the possible emerges with more clarity.

But why should we expect the simplest mathematical formats to apply, among the many formats that also satisfy some obvious constraints? Addressing a similar issue, physicist Eugene Wigner (1960) observed that the physicist is a somewhat irresponsible character. If the relationship between two variables is close to some well-known mathematical function, the physicist jumps to the conclusion that this is it, simply because he does not know any better options. Yet, it is eerie how often this irresponsible approach works out. It is as if mathematics were indeed the language in which nature speaks to us.

What can we see in this graph?

At least the following can be seen in graph A12.1.

• The data cloud visibly follows a roughly linear pattern.
• The exponent 2 in the denominator of $C=k/N^2$ corresponds to slope -2 in $\log C=\log k-2\log N$. Regardless of the value of k, all lines that fit the model have this slope. The value of log k just raises or lowers the line.
• The graph shows in bold the line that corresponds to k=42 years and in dotted lines the lines that correspond to k=84 and 21 years, respectively, i.e., the double and one-half of 42. It can be seen that all data points but one fall into the zone between the dotted lines, and most points crowd along the central line.

Thus,

• The data cloud does follow a straight line (on a log-log graph).
• This line does have a slope close to -2.
• This line corresponds to a value of k around 42 years.
• Nearly all data points are located in a zone along the line that corresponds to an error of ×÷2 for C.

The “parameter” k is not part of the logical prediction; it is determined precisely from this graph. When you put N=1 into $C=k/N^2$, it becomes C=k. This means that you can read off the values of k for the various lines on the vertical scale at N=1, at the left side of the graph. Please note that, in this graph, the first unmarked notch above 1 stands for 2, the next one is 3, etc. For easier reading of the graph, one might also label the locations for either 3 or for 2 and 5.

Recall our estimate that values of k outside the range 16 years<k<80 years might surprise us. Well, this is pretty much the same zone as the zone between the dotted lines in the graph (21 years<k<84 years). Only Mauritius (MRT) surprises us. It seems to have much too short-lived cabinets, given its rather moderate number of parties. Was it unstable? To the contrary, it had an extremely long-lasting prime minister. He cleverly played the parties against each other, including and dropping some all the while, which technically made it a different cabinet. Whenever we have deviating cases, we must look for reasons. Sometimes we find them, sometimes not.

Still more is to be seen in the graph.

• The line labeled $C=31.3/N^{1.757}$ is the best fit line determined by statistical means. Never mind for the moment how it’s determined. Note the following: It is visually very close to the line with slope -2; its own slope is -1.757. It would reach the left scale (N=1) at the height 31.3 years, appreciably below 42 years. We can see that just a small drop in the steepness of the slope (from 2 to about 1.8) can change the intercept (the C value at N=1, where logN=0) quite a lot, but this is an area with no data points. Slope is more informative than the intercept.
• Finally, consider the values of R-square. As stated previously, $R^2=1$ expresses a perfect fit and $R^2=0$ perfect scatter. Never mind for the moment how it’s measured. Just observe that the fit for the best fitting line is quite good ($R^2=0.79$) but the one for the best fitting line with the predicted slope -2 is almost as good ($R^2=0.77$).

In view of the existence of a logical model and its empirical confirmation, we have here a law, in the scientific sense of the term: the inverse square law of cabinet duration, relative to the number of parties.

Did you see most of this in the graph? Did you draw most of these conclusions? If you did not, don’t blame yourself. It’s a learning process. The main message is: Study the graphs of data and model-predicted curves very carefully. They may supply much more information than meets the eye at the first glance.

Testing for the number of seat-winning parties and the largest share follows very much the same pattern as the duration of cabinets, with each presenting different snags (like the issue of 100 seats for 100 subunits). The example of cabinet duration suffices for the moment. The main message is: In testing models with data, eyeballing must precede statistical tests; otherwise the wrong type of statistical approach might be adopted.
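
Several of the readings of Figure A12.1 listed above can be mimicked numerically. The sketch below uses only values quoted in this chapter (k=42 years with exponent 2, and the fitted line 31.3/N^1.757); the function names and the choice of N=2 and N=5 as sample points are illustrative assumptions.

```python
import math


def predicted_duration(N, k=42.0, exponent=2.0):
    """Cabinet duration C = k / N**exponent, in years."""
    return k / N ** exponent


def loglog_slope(N1, N2, k=42.0, exponent=2.0):
    """Slope of log C against log N between two values of N."""
    c1 = predicted_duration(N1, k, exponent)
    c2 = predicted_duration(N2, k, exponent)
    return (math.log10(c2) - math.log10(c1)) / (math.log10(N2) - math.log10(N1))


def within_factor(observed, predicted, factor=2.0):
    """True if the observed value lies within ×÷factor of the prediction."""
    return predicted / factor <= observed <= predicted * factor


# The exponent of the model is the slope on the log-log graph.
slope_model = loglog_slope(2, 5)
slope_fit = loglog_slope(2, 5, k=31.3, exponent=1.757)

# k can be read off at N = 1, where log N = 0.
k_model = predicted_duration(1)
k_fit = predicted_duration(1, k=31.3, exponent=1.757)

# The two lines differ a lot at N = 1 but cross inside the usual range of N.
crossing_N = (42.0 / 31.3) ** (1 / (2.0 - 1.757))

print(round(slope_model, 3), round(slope_fit, 3))
print(k_model, k_fit)
print(round(crossing_N, 2))
print(within_factor(10, predicted_duration(2)))
```

That the two lines cross between N=3 and N=4 while differing markedly at N=1 illustrates why the intercept, located where there are no data points, is less informative than the slope.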

A13. Examples of Models across Social Sciences
__________________________________________________________
• Sociology offers examples where the number of communication channels is useful.
• Political history offers examples where reasoning by extremes and differential equations, especially exponential, are useful.
• Demography offers examples where differential equations, exponential and beyond, are useful.
• Economics offers examples where allowed areas and anchor points, plus differential equations, exponential and beyond, are useful.
__________________________________________________________

Do the models presented apply to social sciences beyond political science? A student asked me so. Indeed, the examples presented up to now belong largely in a subfield of political science: electoral and party studies. This is so only because I found the simplest examples in that subfield. Most models in social sciences require further ideas and tools, some of which will be introduced in Parts B and C. But it may be good to sketch some of these models right away, so as to show their scope beyond electoral studies. Here are some examples from my own work.

Figure A13.1. Dependence of weighted per capita circulation of journals on the number of speakers of a language. Data from Taagepera (1999: 401-402).

[Figure A13.1 axes, both logarithmic: weighted per capita circulation (0.02 to 2) against population (10,000 to 1,000,000).]


Sociology: The number of speakers of a language and the number of journals
How would the circulation of journals (J) published in a language increase as the number of speakers of the language increases? Language means communication. As the population (P) increases, the number of communication channels (C) increases in proportion to P squared. Journals might expand at a similar rate, because with more speakers more specialized journals can be introduced. Thus we might tentatively test the model $J = kP^2$. The circulation per capita would be $j = J/P = kP$; it would increase in proportion to the population. Graphing on log-log scale, the slope must be 1, if the model fits.
Such a test needs a group of languages under rather similar conditions, so as to eliminate a multitude of other factors. I found such a group in the form of languages belonging to the Finno-Ugric language family and spoken in the Soviet Union, where the authoritarian regime made the conditions highly uniform. Circulation was weighted, giving as much weight to a thick monthly magazine as to a daily newspaper. Graphing on log-log scale does indeed lead to slope 1 for nations with more than 10,000 speakers (Figure A13.1). For smaller populations figures become erratic. The best fit is around $j = 2\times10^{-6}P$ (Taagepera 1999: 401-402). This is an example of how the basic notion of the number of communication channels can apply in sociology.

Political history: Growth of empires
As history unfolds, more advanced technology enables larger “polities” (political entities) to form, so that the total number of polities decreases. What is the pattern of this decrease? First, we have to define the number of polities, given that some are large while others are tiny. The aforementioned effective number of components again comes in handy. Should we consider size by area or by population? Let us do both. But the two are interrelated, because large polities do not form in empty space; they tend to form where population is the densest.
Indeed, it can be shown that the effective number of polities by area can be expected to be the square of their number by population: $N_A = N_P^2$. This follows from consideration of geometric means of extremes. As we graph logN against time, over 5,000 years (Figure A13.2), we find that the broad pattern is linear, which means that N decreases exponentially with time. (We’ll deal with exponential change in Parts B and C.) This is not surprising, because exponential change is the simplest pattern around. It simply says that the rate of change (a small change dN during a small time interval dt) is proportional to the existing amount: $dN/dt = kN$. (We’ll deal with such “differential equations” in Part C.) The slope for areas ($N_A$) is close to double the slope for population ($N_P$). Thus the model $N_A = N_P^2$ is confirmed. But why is the decrease rate constant k (the slope in Figure A13.2) the way it is? Why isn’t the slope shallower or steeper? We don’t know as yet.
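The step from the square relationship to “double the slope” can be written out. This is a short sketch, assuming that the number by population declines exponentially with a positive rate constant here called $k_P$ (a symbol introduced only for this illustration; in the notation dN/dt=kN above it corresponds to $k = -k_P$):

```latex
\begin{align*}
N_P(t) &= N_P(0)\, e^{-k_P t} \\
N_A(t) &= N_P(t)^2 = N_P(0)^2\, e^{-2 k_P t} \\
\log N_A(t) &= 2 \log N_P(t)
\end{align*}
```

On a graph of logN against time, the line for $N_A$ then has twice the slope of the line for $N_P$, and the two lines can meet only where logN=0, that is, at N=1.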


Figure A13.2. Decrease of effective number of polities over time, based on area ($N_A$) and population ($N_P$) in Taagepera (1997).

Logically, the two lines in Figure A13.2 must cross at N=1, which corresponds to a single polity encompassing the entire world. The statistical best-fit lines cross around N=2. It would take a minor shift to make them cross at N=1, at around year 4,000. So don’t expect (or be afraid of) a single world empire any time soon. This is an example of how the exponential model and the use of geometric means of extremes can apply in political history.

Demography: World population growth over one million years
Differential equations (like dN/dt=kN above) are widely used in demography. Growth of populations often follows the exponential pattern. However, the growth of the world population over thousands of years followed an even steeper pattern: $P = A/(D-t)^m$, where A, D, and m are constants. The nasty thing about such a “quasi-hyperbolic” growth is that when time reaches t=D, population would tend toward infinity. When time is counted backwards from D, the equation corresponds to a straight line on a log-log graph, because $\log P = \log A - m\log(D-t)$. How do we determine D? By trial and error, until the data curve straightens out. The best fit is found with D roughly around 2025 (Figure A13.3).

However, world population veered off this model and clearly slowed down around 1970. In 2011, I found a way to modify the quasi-hyperbolic growth to include this slowdown, and it projects toward a ceiling at 10 billion. This work has not yet been published. This is an example of how the exponential and more complex models can apply in demography.
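The trial-and-error search for D in the quasi-hyperbolic pattern can be sketched in a few lines. The numbers below are synthetic: they are generated from the formula itself with made-up values of A and m (only D=2025 is taken from the text), so they illustrate the straightening of the curve and are not actual population data. The function names are hypothetical.

```python
import math


def quasi_hyperbolic(t, A, D, m):
    """Population P = A / (D - t)**m, meaningful only for t < D."""
    return A / (D - t) ** m


def loglog_fit(years, pops, d_trial):
    """Fit log10(P) = c + s*log10(d_trial - t); return (slope s, R-squared)."""
    xs = [math.log10(d_trial - t) for t in years]
    ys = [math.log10(p) for p in pops]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


# Synthetic illustration only: A and m are made-up values, not fitted ones.
A_DEMO, D_DEMO, M_DEMO = 2.0e11, 2025.0, 1.0
YEARS = [-1000, 0, 1000, 1500, 1800, 1900, 1950]
POPS = [quasi_hyperbolic(t, A_DEMO, D_DEMO, M_DEMO) for t in YEARS]

# With the right D the log-log pattern is a straight line of slope -m.
slope_true, r2_true = loglog_fit(YEARS, POPS, 2025.0)
# With a wrong trial value of D the pattern is curved, so the fit is poorer.
slope_off, r2_off = loglog_fit(YEARS, POPS, 2500.0)
```

In practice one would try several values of D on real data and keep the one for which the log-log pattern looks straightest; R-squared is used here only as a convenient stand-in for eyeballing the graph.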

Figure A13.3. World population increase over time prior to 1970, both on logarithmic scales (Taagepera 1979).


Economics: Trade/GDP ratio
Some countries export most of what they produce, so they have a high Exports/GDP ratio and a correspondingly high Imports/GDP ratio. Some other countries have little foreign trade, compared to their Gross Domestic Product. Could it depend on country size?
Let us carry out a thought experiment. If a country came to include the entire inhabited world, what would be its Trade/GDP ratio? It must be 0, since this country has no one to trade with. So we have an anchor point: $P = P_{world} \Rightarrow$ Exports/GDP = Imports/GDP = 0. At the opposite extreme, if a country consisted of a single person, what would its Trade/GDP ratio be? It would be 1, since all this person’s monetary transactions would be with people outside her own country. So we have another anchor point: $P = 1 \Rightarrow$ Exports/GDP = Imports/GDP = 1.
It is now time to draw a graph of forbidden areas and anchor points (Figure A13.4). Visibly, P can range only from 1 to $P_{world}$, and Exports/GDP can range from 0 to 1 (or 100%). Given the huge difference between 1 and $P_{world}$, we had better graph P on a logarithmic scale. How can we move from the top-left anchor point to the one at lower right? The simplest way would be a straight line, and indeed, a tentative entropy-based model (Taagepera 1976) predicts just that: A fraction $\log P/\log P_{world}$ of the GDP is not exported, and hence Exports/GDP $= 1 - \log P/\log P_{world}$.

Figure A13.4. Exports/GDP graphed against population.
[Figure A13.4: Exports/GDP (0 to 1) against population on a logarithmic scale (from 1 to $P_{world}$, with $10^5$ marked); it shows the straight line Exports/GDP $= 1 - \log P/\log P_{world}$ and a curve labeled “Actual, roughly”.]

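A quick check that the straight-line model respects both anchor points can be sketched as follows. The world population of 4 billion is an assumed round figure used only for illustration, and the function name is hypothetical.

```python
import math

# Assumption: a round world population of 4 billion, for illustration only.
P_WORLD = 4e9


def export_share_entropy(P, P_world):
    """Tentative model: Exports/GDP = 1 - log P / log P_world."""
    if not 1 <= P <= P_world:
        raise ValueError("P must lie between 1 and P_world")
    return 1.0 - math.log(P) / math.log(P_world)


share_one_person = export_share_entropy(1, P_WORLD)        # anchor point at P = 1
share_whole_world = export_share_entropy(P_WORLD, P_WORLD)  # anchor point at P = P_world
# Halfway on the logarithmic scale means P = sqrt(P_world), about 63,000 people here.
share_halfway = export_share_entropy(math.sqrt(P_WORLD), P_WORLD)
```

The base of the logarithm does not matter, because only the ratio of two logarithms enters the formula.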

Trouble is, actual data do not fit. Countries with populations below one million tend to have much higher trade ratios than predicted, while countries with populations above 10 million tend to have lower trade ratios than predicted. The curve shown in the graph very roughly indicates the actual trend.
From this point on, skills are used which are introduced later, in Parts B or C, or are beyond the reach of this book. If you do not follow, don’t worry. The main thing is to note that it all starts with Figure A13.4 (allowed area and anchor points), and to note why and how the skills introduced in Parts B or C are needed.
Going beyond the approach based on extremes, Taagepera (1976) started out from the classic differential equation for absorption in physics, $dI/dr = -kI$. Here I is flow intensity, r is distance from source (of neutrons or of goods), and constant k reflects how rapidly the stuff produced is absorbed. This means simple exponential decrease (see Part B): $I = I_0 e^{-kr}$, where $I_0$ is the intensity at the source, such as a factory. Whatever flow reaches beyond the country border is counted as export. There are many sources (factories) spread across the country, so that the equations become more complex, but in one dimension they can be solved. Unfortunately, countries are two-dimensional, which makes the equations even more complex. An approximate solution can be worked out, and it fits the data cloud. This model explains how trade depends on population, on the average.
For practical purposes, however, it was found simpler to replace the model by the following empirical approximations (Taagepera and Hayes 1977):
Imports/GDP $= 40/P^{1/3}$.
Exports/GDP $= 30/P^{1/3}$.
How could imports exceed exports? Countries also obtain revenue from shipping, tourism, exporting labor, etc. As seen in Figure A13.5, the Imports equation fits within a factor of 2, while scatter is much wider for Exports, due to the variety of the other aforementioned revenue items. Note that previous Figure A13.4 has the trade ratio on regular scale and in fractions of 1, while Figure A13.5 has it on logarithmic scale and in percent. How do those approximations fit with the logical anchor points?
At populations equal to world population (around 4 billion in 1970), the equations predict Imports/GDP =0.0006=0.06% and Exports/GDP=0.04%, rather than the conceptually required 0. For a logical model, any non-zero value would be unacceptable, but for an approximation, this is quite close, as long as we keep in mind that this is an approximation. At the other extreme, Imports/GDP would reach 1 when P=6,400 rather than 1, and Exports/GDP would do so at 2,700 persons. This is far from the anchor point of 1 person. The detailed model and its approximations in Figure A13.5 could be part of an S-shaped curve (see Chapters C7 and C8) joining the anchor points and going steeply down around P=1 million ($10^6$). Indeed, we could get a fair fit of data to logical anchor points by using the equation used in Figure C7.7, which corresponds to adding a third central anchor point: $Y/(1-Y) = [X/(1-X)]^k$.


Figure A13.5. Empirical approximations for dependence of Imports/GDP and Exports/GDP on population (Taagepera and Hayes 1977).

Figure A13.6 is the same as Figure A13.4, except that the transformation $X = \log P/\log P_{world}$ has been carried out on the x-scale, so that P=1 becomes X=0, and $P = P_{world}$ becomes X=1. We could get a fair fit of data to logical anchor points by using an equation similar to the one used in Figure C7.7. It adds a third central anchor point. Here $(1-Y)/Y = [X/(1-X)]^k$ represents a symmetric “drawn-out S” curve joining the anchor points and passing through the central bending point (0.5, 0.5). This curve drops too soon, compared to the data cloud. A better fit is obtained by adding a “bias exponent” b, as also introduced in Chapter C7: $(1-Y)/Y = [X^b/(1-X^b)]^k$. It shifts the bending point away from (0.5, 0.5), to get a better fit.

Figure A13.6. Exports/GDP graphed against population.
[Figure A13.6: Y = Exports/GDP (0 to 1) against $X = \log P/\log P_{world}$ (0 to 1, with 0.25, 0.5, and 0.75 marked); it shows the straight line Exports/GDP $= 1 - \log P/\log P_{world} = 1 - X$ and a curve labeled “Actual, roughly”, which is roughly $(1-Y)/Y = [X^b/(1-X^b)]^k$.]

Why would the Export/GDP ratio take such a path between the two anchor points, rather than a simpler path such as $Y = X^k$? This remains to be explained. Imagine a country with only 300 inhabitants, so that X=0.25 in Figure A13.6. We can well imagine that it would still export almost all of what it produces and import almost all of what it needs, in line with the dashed part of the curve. On the other hand, consider a country with 30 million people, so that X=0.75 in Figure A13.6. We can visualize that it could easily produce much more than 75% of what it needs, in line with the curve shown. Thus this curve makes sense. But it’s a long way from what makes sense to a quantitatively predictive logical model. This is an example of how allowed areas, anchor points, and equations more complex than the exponential can apply in economics.
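The “drawn-out S” equation can be solved for Y and checked against the anchor points. In the sketch below the values of k and b are made up for illustration (the text gives none), the world population of 4 billion is an assumed round figure, and the function names are hypothetical.

```python
import math


def x_from_population(P, P_world):
    """Transformed population X = log P / log P_world, between 0 and 1."""
    return math.log(P) / math.log(P_world)


def s_curve(X, k, b=1.0):
    """Solve (1-Y)/Y = [X**b / (1 - X**b)]**k for Y, with 0 <= X <= 1."""
    if X <= 0.0:
        return 1.0  # anchor point: a one-person country trades everything
    if X >= 1.0:
        return 0.0  # anchor point: a world-wide country trades nothing
    u = X ** b
    odds = (u / (1.0 - u)) ** k
    return 1.0 / (1.0 + odds)


# Hypothetical parameter values, chosen only to show the shape.
K_DEMO, B_DEMO = 3.0, 2.0

y_center_symmetric = s_curve(0.5, K_DEMO)             # b = 1: bending point at X = 0.5
x_bend_biased = 0.5 ** (1.0 / B_DEMO)                 # b = 2: bending point moves to X = 0.707
y_center_biased = s_curve(x_bend_biased, K_DEMO, B_DEMO)

x_small = x_from_population(300, 4e9)   # the 300-inhabitant country of the text
x_large = x_from_population(3e7, 4e9)   # the 30-million country of the text
```

With b above 1 the bending point moves to the right of X=0.5, which is the direction needed when the symmetric curve drops too soon.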


B. Linear Regression and Logical Models
B1. The Basics of Linear Regression and Correlation Coefficient R2 ________________________________________________________
• The Ordinary Least Squares (OLS) procedure minimizes the sum of the squares of deviations between data points and the line.
• Regressing y on x minimizes the squares of vertical distances. Regressing x on y minimizes the squares of horizontal distances. Two different OLS lines result.
• If we first estimate y from x and then x from y, we don’t get back the original x: the two OLS lines are directional.
• If we first estimate y from x and then z from y, we don’t get the same value as when estimating z directly from x: OLS regression is not transitive.
• Correlation coefficient $R^2$ expresses the degree of lack of scatter of data points. Utter scatter means $R^2=0$. Points perfectly on a line mean $R^2=1$.
• In contrast to OLS slopes, $R^2$ is symmetric in x and y.
________________________________________________________________
By now we have seen some examples of logical models, and some inevitable quantitative tools have been introduced. It is time to present some basic mathematical formats for models in a more systematic way: why they arise so frequently, and how to deal with them. In this connection, logarithms and exponential functions have to be grappled with. There is no escape, much as I would like to present an introductory book without them.
Testing models with data involves use of statistics, and at least the basics of linear regression have to be presented here, even while developing statistical skills is outside the scope of this book. You should acquire those skills, but somewhere else. What is very much within our present scope is when and how to use these skills, and when NOT to use them blindly. Statistics books tend to go light on the latter aspect, and even when they point out inappropriate uses, students tend to overlook these sections. The damage can be serious.

Regression of y on x
So let us proceed to linear regression. Suppose that we have 5 moderately scattered data points in a field with no constraints such as forbidden areas or anchor points, as shown in Figure B1.1:
x   -3.0   0.5   2.5   3.5   4.5
y   -0.5   2.0   0.5   3.0   5.0
These might be, respectively, the growth rates of Gross Domestic Product (GDP) and of population of 5 countries. They usually are positive but can turn negative. We want to pass the “best fitting” line through these points.

75

What do I mean by “best fitting”? That’s the catch. There are many ways to define it. The Ordinary Least Squares (OLS) procedure does the following: It minimizes the sum of the squares of deviations between data points and the line.
What does this mean? Draw a somewhat haphazard line through the data cloud. Draw vertical lines joining the data points to the line (thick lines in Figure B1.1). This shows how far the data points are from the line. Then draw in squares having these lines as one side. Some of the squares are large while some others are so tiny I cannot even show them. Measure the areas of squares and add them. Now try to shift and tilt the line so that the sum of the squares is reduced.

Figure B1.1. Vertical distances of 5 points to a somewhat random line.

The middle point in Figure B1.1 has a large square, larger than the other squares combined, so it may pay to lower the line so as to reduce this square. It also pays to tilt the line more, so that the square on the far left is reduced. But if we overdo it, then the squares of the two intermediary data points start to grow. Maybe the line shown in Figure B1.2 is close to optimal. It passes near the point on the far left, and it balances off the two large squares on the right. Any shifting or tilting of this line might increase the sum of the squares. So this line is close to the best fit by the OLS procedure.
We don’t have to do it by such graphical trial and error. There are regression equations which do it exactly. Just feed in the coordinates (x, y) of all data points, and out pops the best fit OLS line y=a+bx. The calculations are rather simple, but they become quite tedious when there are many data points. So these calculations are best left to computer programs.
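For the five points listed earlier in this chapter, the calculation can be sketched in a few lines. The code uses the standard textbook formulas for OLS y on x, which the chapter does not spell out; the function names are hypothetical.

```python
X_DATA = [-3.0, 0.5, 2.5, 3.5, 4.5]
Y_DATA = [-0.5, 2.0, 0.5, 3.0, 5.0]


def ols_y_on_x(xs, ys):
    """Return (a, b) of the line y = a + b*x minimizing squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x  # the line passes through (mean x, mean y)
    return a, b


def sum_of_squares(a, b, xs, ys):
    """Sum of the squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a_fit, b_fit = ols_y_on_x(X_DATA, Y_DATA)
best = sum_of_squares(a_fit, b_fit, X_DATA, Y_DATA)
# Shifting or tilting the fitted line can only increase the sum of squares.
shifted = sum_of_squares(a_fit + 0.1, b_fit, X_DATA, Y_DATA)
tilted = sum_of_squares(a_fit, b_fit + 0.05, X_DATA, Y_DATA)
```

For these five points the result is roughly y = 1.06 + 0.59x, and both the shifted and the tilted line give a larger sum of squares, which is the numerical counterpart of the graphical trial and error described above.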


The danger is that by leaving everything to the computer one does not develop a “finger tip feel” for what OLS does, and what it cannot do. If you apply it to improper data (what is improper will be explained later), the computer program for OLS does not protest; it still calculates a line. If you draw mistaken conclusions from this line, don’t blame the method. Blame the one who misapplied a perfectly good method.

Figure B1.2. Vertical distances of 5 points to the line which roughly minimizes the sum of squares: best fit y on x.

Reverse regression of x on y
Note that we drew vertical distances from points to the line, and this direction matters very much. We would get a different picture if we considered the horizontal distances. Figure B1.3 shows the same line as in Figure B1.2, plus horizontal distances to the data points and their squares. Visibly, this line no longer is the best fit line: some squares are so large they hardly can be shown. Tilting the line could strongly reduce the sum of these horizontally based squares. This is done in Figure B1.4. It shows both the previous best-fit line (dashed), labeled “OLS y-on-x”, and a new one, along with its squares, labeled “OLS x-on-y”. This line passes close to the point on the upper right, and it balances off the large squares for the lowest two points. The sum of squares is visibly smaller than in the previous figure.
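The reverse regression can be computed the same way, by swapping the roles of x and y. A minimal sketch for the same five points, with hypothetical function names:

```python
X_DATA = [-3.0, 0.5, 2.5, 3.5, 4.5]
Y_DATA = [-0.5, 2.0, 0.5, 3.0, 5.0]


def ols_x_on_y(xs, ys):
    """Return (c, d) of the line x = c + d*y minimizing squared horizontal distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    d = sxy / syy
    c = mean_x - d * mean_y
    return c, d


def as_y_of_x(c, d):
    """Rewrite x = c + d*y as y = a + b*x, for drawing on the same axes."""
    return -c / d, 1.0 / d


c_fit, d_fit = ols_x_on_y(X_DATA, Y_DATA)
a_rev, b_rev = as_y_of_x(c_fit, d_fit)
```

Measured against the x axis, the slope of this line comes out near 0.89, visibly steeper than the slope near 0.59 obtained when regressing y on x.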


Figure B1.3. Previous best fit line y on x and horizontal distances to the 5 points: The sum of squares clearly is not minimized.

Figure B1.4. Horizontal distances of 5 points to the line that roughly minimizes the sum of squares based on them: the best fit x on y (labeled “OLS x-on-y”). The best fit y on x is the dashed line (labeled “OLS y-on-x”).


Directionality of the two OLS lines
The important conclusion is that the OLS lines depend on the direction in which we proceed: Regressing y on x (minimizing the squares of vertical distances) leads to one OLS line; regressing x on y (minimizing the squares of horizontal distances) leads to a different OLS line.
Directionality of OLS lines is sometimes considered good. Suppose we are told that another country has a given growth rate x. What is our best estimate for its other growth rate (y)? Mark this value of x on the graph and see what the corresponding value of y is on the OLS line y on x. If for another country we are given the value of y and asked to estimate x, we’ll use the other regression line, OLS x on y.
Now suppose we first estimate y from x and then use this value of y to estimate x. We do not get back the original value of x! This is what directionality means.
________________________________________________________________
Exercise B1.1
The best fit line y on x in Figure B1.4 is close to y=1.1+0.6x. This is what we would use to estimate y from x. The best fit line x on y is close to y=0.50+1.00x. But this line is used to estimate x from y, so we have to transpose it to x=-0.50+1.00y. To keep track of which equation goes in which direction, we had better write them as y←1.1+0.6x and x←-0.50+1.00y.
a) Graph the presumed data (from p. 75) and the two lines. They cross at G, where x=1.5, y=2.0. Show the points A, B, and C on the graph, as well as arrows A to B and B to C.
b) If another country has x=4.0, what would be the corresponding value of y? Call this point A. tle.ot
c) Use this value of y to calculate the corresponding value of x. Call this point B. har.n
d) Use this value of x and calculate the corresponding y. Call this point C. zu.tso
e) What would happen, if we continue this game? Where would we eventually end up?
f) If we write and apply a computer program for the process above, how much time might the process take, with today’s computer capabilities, to get there?
g) What would happen, if we start with a very low value of x? Where would we now end up?
________________________________________________________________________
But what if we are asked what the general relationship between x and y would be, without random fluctuation? Suppose we could reduce random error to zero, and suppose that all data points then fall neatly on a straight line. Which line would it be: OLS y on x, OLS x on y, or something else? Growth rates of population and GDP mutually affect each other, so neither OLS direction is to be preferred. Some compromise between them should be looked for. Moreover, most logical models are not directional. Hence the same line should work in both directions.
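Directionality can also be seen numerically. The sketch below fits both OLS lines to the five data points of this chapter (not to the rounded lines of the exercise) and follows one round trip from x to y and back; the names are hypothetical.

```python
X_DATA = [-3.0, 0.5, 2.5, 3.5, 4.5]
Y_DATA = [-0.5, 2.0, 0.5, 3.0, 5.0]


def ols(us, vs):
    """Regress v on u; return (intercept, slope) of v = intercept + slope*u."""
    n = len(us)
    mean_u, mean_v = sum(us) / n, sum(vs) / n
    suv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    suu = sum((u - mean_u) ** 2 for u in us)
    slope = suv / suu
    return mean_v - slope * mean_u, slope


A_YX, B_YX = ols(X_DATA, Y_DATA)  # used to estimate y from x
C_XY, D_XY = ols(Y_DATA, X_DATA)  # used to estimate x from y
MEAN_X = sum(X_DATA) / len(X_DATA)


def round_trip(x0):
    """Estimate y from x0, then estimate x back from that y."""
    y_est = A_YX + B_YX * x0
    return C_XY + D_XY * y_est


# One round trip multiplies the distance from mean x by this factor.
SHRINK = B_YX * D_XY
x_back = round_trip(4.0)
```

The returned value differs from the starting x unless we start exactly at the mean of x. Each round trip multiplies the distance from the mean by the product of the two OLS slopes, which for these data is about 0.66.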


Indeed, a regression line intermediary between the two standard OLS lines can be defined, such that it works in both directions. This symmetric linear regression will be presented in the next chapter. For the moment, just keep in mind that it matters in which direction one carries out standard OLS.

Non-transitivity of OLS regression
Suppose logical considerations suggest that x has an impact on y, which in turn has an impact on z. Symbolically: x→y→z. Average cabinet duration (C) is one such example. For a given assembly size, the number of seats in the district (district magnitude M) largely determines the effective number of parties (N), which in turn largely determines cabinet duration: M→N→C. Sometimes we may already have calculated N from M, and we want to use it to calculate C. This means M→N, followed by N→C. Some other times we may wish to estimate C directly from M: M→C.
We’d rather expect that it should not matter which way we go. The outcome should be the same, regardless of whether it is M→N→C or M→C. This is what transitivity means, and it applies to the “algebraic” equations, meaning pretty much all the equations one encounters in secondary school mathematics. Symbolically:
[x→y→z] = [x→z]    algebraic equations
The trouble with OLS regression is that, in contrast to algebraic equations, it is not transitive. When we regress the number of parties on district magnitude and then regress cabinet duration on the number of parties, we get one relationship between district magnitude and cabinet duration. We get a different one when regressing cabinet duration directly on district magnitude: M→N→C is not the same as M→C. At low scatter, the difference is negligible. At high scatter, it can become enormous. Symbolically:
[x→y→z] ≠ [x→z]    OLS regression, high scatter
Why is this so? This follows from directionality.
If we cannot go back to the same value of x, after passing through y (Exercise B1.1), then we cannot reach the same value of z when going there directly or through y. Most logical models are transitive, like M→N→C. Thus OLS regression works in model testing only when scatter is low, which often works out in physics but rarely is the case in social sciences.

Correlation coefficient $R^2$
Along with the coefficients a and b in the linear regression equation y=a+bx, one usually reports $R^2$. It expresses the degree of lack of scatter of data points. If the points line up perfectly, $R^2$ is 1. If the points form a blob without any direction, then $R^2$ is 0. We have seen an example of half-way scatter in-between these extremes in Figure A6.2, where $R^2=0.51$. It is scarcely higher for the data in the present chapter. Figure A12.1 shows a much less scattered data cloud, $R^2=0.79$.
Both notation and names in the literature are confusing. Kvålseth (1985) presents no less than 8 different expressions for $R^2$ that appear throughout the literature. They most often yield approximately the same result, but for some odd data constellations they can differ. Some are called coefficients of dispersion or of determination. Some sources distinguish between $R^2$ and $r^2$. At the risk of omitting significant differences, this book uses just “correlation coefficient $R^2$”. It is always positive, when applied to the scatter of the data cloud as such.
One can also use just R. In addition to scatter, it indicates whether the slope is up or down. In Figure A12.1, $R = -0.787^{1/2} = -0.89$, while in Figure A6.2, it is $R = +0.51^{1/2} = +0.71$. Why is $R^2$ used more frequently than R, when the latter gives more information? It can be said that $R^2$ expresses the share of variation in y that is accounted for by the variation in x.

Figure B1.5. Linear regression presumes a roughly elliptic data cloud, without curvature.

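[Figure B1.5: three elliptic data clouds on x-y axes, with the center of gravity C marked; the panel labels, in the order printed, are R≈+0.7, $R^2$≈0.5; R≈-0.4, $R^2$≈0.15; R≈-0.95, $R^2$≈0.9.]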

Figure B1.5 offers some schematic examples. Imagine that the ellipses shown are rather uniformly filled with data points, and there are none outside the ellipse. The flatter the ellipse, the higher $R^2$ is. The first one is akin to Figure A6.2, tilting up (R positive) and moderately flat. The last one is akin to Figure A12.1, tilting down (R negative) and quite flat. We can attach an $R^2$ value to any line, with respect to the data points. Thus, Figure A12.1 shows two values of $R^2$: 0.787 for best fit, and 0.770 for the logical model. Here the model fits almost as well as the best possible fit. However, $R^2$ actually can go negative when comparing the data cloud to some poorly fitting line, such as might result from a totally inadequate logical model.
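How $R^2$ can turn negative for a poorly fitting line can be illustrated with the five points of this chapter. The sketch assumes one common definition, $R^2$ = 1 - (sum of squared residuals)/(sum of squared deviations of y from its mean); the book does not say which of the expressions found in the literature it uses for arbitrary lines, so this choice is an assumption. The poorly fitting line y = 5 - x is made up.

```python
X_DATA = [-3.0, 0.5, 2.5, 3.5, 4.5]
Y_DATA = [-0.5, 2.0, 0.5, 3.0, 5.0]


def r_squared_for_line(a, b, xs, ys):
    """1 - SS_residual / SS_total for the line y = a + b*x (assumed definition)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def ols_y_on_x(xs, ys):
    """Return (a, b) of the OLS line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


A_FIT, B_FIT = ols_y_on_x(X_DATA, Y_DATA)
R2_BEST = r_squared_for_line(A_FIT, B_FIT, X_DATA, Y_DATA)
# A deliberately wrong line that slopes downward through an upward-tilting cloud.
R2_POOR = r_squared_for_line(5.0, -1.0, X_DATA, Y_DATA)
```

Under this definition the best-fit line gives a value between 0 and 1, while the wrong line gives a negative value, because its squared residuals exceed the total squared variation of y around its mean.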


The $R^2$ for the best fit: To which line does it correspond?
The main axis of the ellipse (dashed line) is visibly the best fit line for data clouds in Figure B1.5. Is this line the OLS regression line? But if so, which one would it be: y-on-x or x-on-y? It is actually an intermediary line, the symmetric regression line. All three pass through the center of gravity C of the data cloud. (The coordinates of C are mean x and mean y of all the data points.) For the flat data cloud on the left in Figure B1.5, the three regression lines are practically the same. But for the almost round data cloud in the center they diverge, as is shown in Figure B1.6. Here the y-on-x line has a much shallower slope than the central line, while the line x-on-y has a much steeper slope.

Figure B1.6. When scatter is extensive, the two OLS lines diverge from the main axis of an elliptic data cloud.
[Figure B1.6 labels: Symmetric line, the main axis of the data cloud; $R^2$, a measure of flatness of the data cloud; C, the center of gravity; OLS y on x, scatter-reduced slope; OLS x on y, scatter-enhanced slope.]

This is a general feature of OLS regression. As one randomly increases scatter, the slope of y on x is reduced, while the slope of x on y is enhanced.


B2. Symmetric Regression and its Relationship to R2 ________________________________________________________
• Symmetric regression line minimizes the sum of rectangles (or triangles) formed by vertical and horizontal distances from data points to line. Its slope B is a pure measure of slope, independent of scatter.
• Similarly, $R^2$ is a pure measure of lack of scatter, independent of slope. Together, B and $R^2$ tell us everything about how tilted and how flat the data cloud is: B expresses the slope of the main axis, and $R^2$ the relative width of the ellipse.
• The slopes of the two OLS lines result from a combination of these pure measures. They are mixtures of slope B and scatter $R^2$.
• Symmetric regression lines are multi-directional and transitive. In this sense, they are algebraic equations.
________________________________________________________________
We have seen that there are two OLS lines, and they are directional and non-transitive. They are so because they treat x and y asymmetrically, minimizing squares of deviations either in the vertical or in the horizontal direction. Testing of logical models might be on safer grounds with a regression method that treats x and y in a symmetric way.

Figure B2.1. Vertical and horizontal distances of 5 points to the line of best fit y on x.

From minimizing the sum of squares to minimizing the sum of rectangles
How could we regress so that x and y enter in a symmetric way? Ask first the reverse question: What caused the asymmetry in the OLS procedure? It came about because we measured either the vertical distances between data points and the line, or the horizontal.


Well, take them both into account. Start with the OLS line y on x in previous Figure B1.2, but show both the vertical and horizontal distances of points to line (Figure B2.1). These lines form the two sides of rectangles. The two remaining sides are shown as dashed lines. Now look at the areas of these rectangles. Could we reduce their sum? Visibly, a more tilted line could reduce the areas of the two largest rectangles. Compared to the two OLS lines in Figure B1.4, an intermediary line minimizes the sum of rectangles, as seen in Figure B2.2. To minimize clutter, only two sides of the rectangles are shown. In fact, the entire argument could be made on the basis of the areas of triangles delineated by the vertical and horizontal distances and the line.
Symmetric regression line minimizes the sum of rectangles (or triangles) formed by vertical and horizontal distances from data points to line. Symmetric regression lines are multi-directional and transitive. In this sense, they are algebraic equations. In terms of the example in the preceding chapter, symmetric regression M→N, followed by regression N→C, yields the same result as direct regression M→C.

Figure B2.2. Vertical and horizontal distances of 5 points to the line that roughly minimizes the sum of rectangles (or triangles) based on them: the best fit symmetric in x and y.

[Figure B2.2 line labels: OLS y-on-x; Symmetric regression line; OLS x-on-y.]
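Transitivity can be checked on a small made-up data set with three variables. The values below are invented purely for illustration and do not stand for real district magnitudes, party numbers, or cabinet durations. The symmetric slope is computed as the square root of the ratio of summed squared deviations, with the sign of the correlation; the function names are hypothetical.

```python
import math


def centered(vals):
    """Deviations of the values from their mean."""
    m = sum(vals) / len(vals)
    return [v - m for v in vals]


def ols_slope(us, vs):
    """Slope of the OLS line regressing v on u."""
    du, dv = centered(us), centered(vs)
    return sum(a * b for a, b in zip(du, dv)) / sum(a * a for a in du)


def symmetric_slope(us, vs):
    """Slope of the symmetric regression line of v against u."""
    du, dv = centered(us), centered(vs)
    cross = sum(a * b for a, b in zip(du, dv))
    magnitude = math.sqrt(sum(b * b for b in dv) / sum(a * a for a in du))
    return math.copysign(magnitude, cross)


# Made-up, fairly scattered data for the chain x -> y -> z.
X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 5]
Z = [1, 3, 2, 5, 4]

ols_chain = ols_slope(X, Y) * ols_slope(Y, Z)   # x -> y, then y -> z
ols_direct = ols_slope(X, Z)                    # x -> z directly
sym_chain = symmetric_slope(X, Y) * symmetric_slope(Y, Z)
sym_direct = symmetric_slope(X, Z)
```

Since all these lines pass through the means, comparing slopes is enough. With this scattered data set the chained OLS slope (0.24) is far from the direct one (0.8), while the chained and direct symmetric slopes coincide. The magnitudes of symmetric slopes always chain exactly; the signs agree here because all three correlations are positive.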

The relationships between $R^2$ and the slopes of regression lines
This section is more mathematical than the rest. You can get the formulas for the OLS lines in any basic statistics book, but few of the statistics texts addressed to social scientists even mention symmetric regression. The present section enables you to make the connection.
All three regression lines pass through the center of gravity C of the data cloud. The coordinates of C are mean x ($X = \sum x_i/n$) and mean y ($Y = \sum y_i/n$) of all the data points. Indices i refer to the coordinates of individual data points.
The slope B of the symmetric regression line can be calculated directly from data (Taagepera 2008: 173-174):
$B = \pm[\sum(y_i - Y)^2/\sum(x_i - X)^2]^{1/2}$.
With indices dropped and the axes shifted so that X=Y=0,
$B = \pm[\sum y^2/\sum x^2]^{1/2}$.
The sign of B (+ or -) is the same as for correlation coefficient R. This formula may not look symmetric in x and y, given that y is on top and x at the bottom. But keep in mind that any slope stands for dy/dx. When we introduce B=dy/dx, the result is symmetric in x and y: $(dy)^2/\sum y^2 = (dx)^2/\sum x^2$. This slope B is a pure measure of slope in the sense that, if we increase random scatter, this slope does not systematically shift up or down.
The formula for $R^2$ is also symmetric in x and y:
$R^2 = [\sum(y_i - Y)(x_i - X)]^2/[\sum(x_i - X)^2 \sum(y_i - Y)^2]$.
With indices dropped and X=Y=0, it becomes
$R^2 = [\sum yx]^2/[\sum x^2 \sum y^2]$.
It is a pure measure of lack of scatter in the sense that it does not depend on slope.
In previous pictures of roughly elliptic data clouds (Figures B1.5 and B1.6), B and $R^2$ tell us everything about how tilted and how flat or roundish the data cloud is: B expresses the slope of the main axis, and $R^2$ the relative width of the ellipse. (For fuller description, we also need the coordinates of the center and the range of values of x or y.) This means that the slope of the symmetric regression line is the scatter-independent complement to $R^2$.
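As a worked instance, the formulas for B and $R^2$ can be applied to the five data points listed in chapter B1 (x = -3.0, 0.5, 2.5, 3.5, 4.5 and y = -0.5, 2.0, 0.5, 3.0, 5.0). The arithmetic below is computed from those points as printed and is offered as an illustration only:

```latex
\begin{align*}
X &= \tfrac{1}{5}\sum x_i = 1.6, \qquad Y = \tfrac{1}{5}\sum y_i = 2.0 \\
\sum (x_i - X)^2 &= 35.2, \qquad \sum (y_i - Y)^2 = 18.5, \qquad \sum (y_i - Y)(x_i - X) = 20.75 \\
B &= +\left[\frac{18.5}{35.2}\right]^{1/2} \approx 0.72 \\
R^2 &= \frac{(20.75)^2}{35.2 \times 18.5} \approx 0.66
\end{align*}
```

The sign of B is positive because the cross sum, and hence R, is positive.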

The slopes of the two OLS lines result from a combination of these pure measures. They are mixtures of slope B and scatter $R^2$. We usually measure the slope of OLS y on x relative to the x axis (b=dy/dx). But the slope of OLS x on y might be measured either relative to the y axis (b’=dx/dy) or relative to the x axis (b”=dy/dx), so that b’b”=1. So we have three relationships, as pictured in previous Figure B1.6:
$b = |R|B$: OLS y on x, scatter-reduced slope.
$b' = |R|/B$: OLS x on y, slope relative to y axis, scatter-reduced.
$b'' = B/|R|$: OLS x on y, slope relative to x axis, scatter-enhanced.
It follows that $b/b'' = R^2$, when the slopes of both OLS lines are measured with respect to the x axis. If further random fluctuation is imposed on a given data set, $R^2$ is reduced. This means that the ratio of the slopes of the OLS lines is reduced. How does this reduction come about? The slope of the OLS line y-on-x (b) is reduced, while the slope of the OLS line x-on-y (b”) is enhanced, so that both contribute to reduction in $R^2$. But this means that the slope of each OLS line is affected by the degree of scatter. Hence the OLS slope is a mixed indicator of steepness of linear trend and of scatter around this trend.
Note that $R^2$ can be visualized as degree of lack of scatter, either along the symmetric line or along the combination of the two OLS lines. It would be misleading to visualize $R^2$ as degree of lack of scatter along a single OLS line. This can be seen in Figure B1.6: The scatter expressed by $R^2$ is distributed lopsidedly around the OLS line, while it is distributed evenly around the symmetric regression line. When $R^2$ is low, reporting a single OLS slope along with $R^2$ effectively means counting the real slope at half-weight (B in |R|B) and the degree of scatter at one-and-a-half weights ($R^2$ plus |R| in |R|B).
It also follows from the equations above that the symmetric slope B is the geometric mean of the slopes of the two standard OLS lines in relation to the same axis (b and b”): $B = \pm(bb'')^{1/2}$, the sign being the same as for b and b”. When the slope of x-on-y is measured with respect to the y axis (b’=1/b”), the relationship is $B = \pm(b/b')^{1/2}$.
______________________________________________________________________
Exercise B2.1
On top of your graph in Exercise B1.1 also graph the symmetric regression line and determine its equation. Proceed as follows.
a) Assume the two OLS lines are y=1.1+0.6x and y=0.50+1.00x (they are close). Calculate the slope (B) of the symmetric line as the geometric mean of the slopes of the two OLS lines. .hetsbaöt
b) Pass the line with that slope through the center of gravity G(1.5, 2.0).
c) The symmetric line has the form y=a+Bx. To find a, plug x=1.5, y=2.0 into this equation and solve for a. .tmehartme
_______________________________________________________________________
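The relationships between the OLS slopes, B, and R can be verified numerically. The sketch below uses the five data points listed in chapter B1 (not the rounded lines of the exercise); the variable names are hypothetical.

```python
import math

X_DATA = [-3.0, 0.5, 2.5, 3.5, 4.5]
Y_DATA = [-0.5, 2.0, 0.5, 3.0, 5.0]

n = len(X_DATA)
mean_x, mean_y = sum(X_DATA) / n, sum(Y_DATA) / n
sxx = sum((x - mean_x) ** 2 for x in X_DATA)
syy = sum((y - mean_y) ** 2 for y in Y_DATA)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X_DATA, Y_DATA))

B = math.copysign(math.sqrt(syy / sxx), sxy)  # symmetric slope, sign of R
R2 = sxy ** 2 / (sxx * syy)                   # correlation coefficient squared
abs_R = math.sqrt(R2)

b = sxy / sxx         # OLS y on x, slope relative to the x axis
b_prime = sxy / syy   # OLS x on y, slope relative to the y axis
b_dprime = syy / sxy  # OLS x on y, slope relative to the x axis

# Relationships stated in the text (B is positive for these data).
check_b = abs_R * B                # should equal b
check_b_prime = abs_R / B          # should equal b_prime
check_b_dprime = B / abs_R         # should equal b_dprime
geometric_mean = math.sqrt(b * b_dprime)  # should equal B
```

For these points b is about 0.59 and b” about 0.89, with B about 0.72 lying between them as their geometric mean, and their ratio b/b” reproduces $R^2$, about 0.66.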

B3. When is Linear Fit Justified?
___________________________________________________________________
• A linear fit of data is acceptable only if the data cloud is uniformly dispersed within a roughly elliptic area, with no visible curvature or structure. One cannot be sure that this applies, unless one graphs the data. Even then the linear fit must not violate logical constraints.
• The use of R^2 is subject to the same limitations.
• The least squares regression method is quite sensitive to extreme values (outliers), and so are symmetric regression and R^2.
________________________________________________________________
When we are given just the equation of regression line y=a+bx plus R^2, but no graph, we tend to imagine what was shown in Figure B1.5: a roughly ellipse-shaped data cloud, tilted according to the slope b in y=a+bx, with ellipse flatness corresponding to the value of R^2. This is an adequate mental picture under the following conditions:
• the center of gravity C (mean x, mean y) does express the center of the data cloud in a meaningful way;
• the regression line does express the main axis of the data cloud in a meaningful way; and
• R^2 does express the dispersion of data around this line in a meaningful way.
Recall that the main axis of the data cloud is the line of symmetric fit. If instead, y=a+bx stands for OLS regression y-on-x and R^2 is low, we would tend to underestimate the actual slope of the ellipse main axis. But this is a relatively minor distortion. The real trouble is that all too often data clouds do not look at all like ellipses.
Figure B3.1. An example where linear regression would be misleading. [Figure not reproduced: axes x and y, with the center of gravity marked C.]

Many data clouds do not resemble ellipses
Data clouds may look like bent sausages or even like croissants, as in Figure B3.1 (which is inspired by a published graph of data and regression line). Here linear regression is not justified at all because
• the center of gravity C lies in a zone where few data points occur;
• the regression line (dashed line in Figure B3.1) passes through a zone with few data points and does not express the configuration, conjuring in our minds the false image of a tilted ellipse; and
• the value of R^2, low as it is bound to be in Figure B3.1, wrongly conjures the image of an almost circular data cloud (as in Figure B1.5, center) rather than the actual complex configuration.
No simple curve fits here, but a roughly parabolic fit (dotted curve) would be appreciably more expressive of the pattern than a straight line.
How do we know whether our data cloud is roughly elliptic or not? The blunt advice is: Always graph all data y vs. x before pushing the regression button. Then carry out linear regression only when it makes sense. Whenever the data cloud has even a slight curvature (bent sausage rather than a straight one), consider some data transformation so as to straighten the pattern, before regressing. How do we carry out such a transformation? We’ll come to that. Right now the point is to impress on you that improper application of linear regression to data configurations not suited for it can lead to monumental mischaracterization of the data. This is so important that further cautionary examples are given.
Grossly different patterns can lead to the same regression lines and R^2
Consider the examples in Figure B3.2. Assume we have no information on any conceptual limitations. Those four data configurations have been chosen such that they all would lead to exactly the same linear regression lines: OLS y on x yields y=3.00+0.50x, OLS x on y yields y=0.75+0.75x, and the symmetric regression line is y=2.0+0.61x. The center of gravity is the same: x=9.00, y=7.50. The four configurations also would lead to the same correlation coefficient, a pretty high one (R^2=0.67), provided that one deemed it appropriate to apply a linear fit. But in which of the cases does a linear fit make sense?
• Constellation y1: A linear fit looks acceptable because the data cloud is uniformly dispersed, with hardly any visible curvature. One could draw an ellipse around the data points, and the crowdedness of points would be roughly the same throughout the ellipse. (True, one might detect an empty region in the lower center, meaning that a slightly bent pattern would fit better. But in the absence of conceptual constraints we might gloss over it.)

Figure B3.2. Data that lead to exactly the same linear regression lines and R-square (Anscombe 1973, Taagepera 2008: 201). But does a linear fit make sense? [Figure not reproduced: four panels, y1, y2, y3 and y4 vs. x.]

• Constellation y2: The points fit neatly on a parabolic-looking curve, and a corresponding transformation should be applied before statistical testing. Applying a linear fit would be ludicrous, and random deviation from a regular pattern is much less than intimated by R^2=0.67. The parabolic transformation could be based on statistical considerations, but this would also be prime time for asking why y first rises and then falls with increasing x.
• Constellation y3: It has 10 points perfectly aligned, while one point is a blatant outlier. It clearly does not belong and should be omitted before carrying out regression. (The statistical justification for deletion is that it deviates by more than 3 standard deviations.) When this outlier is omitted, the slopes of all regression lines are lower than previously calculated. Also R^2 approaches 1.00. One should try to figure out how the outlier came to be included in the first place. Maybe there was a typo in the data table; it happens.
• Constellation y4: The pattern is far from a rising straight line. We observe two distinct populations where y actually decreases with increasing x, plus an isolate. This pattern should make us wonder about the underlying structure: Why is there such an odd pattern? The rising overall regression line would clearly misrepresent the data.

Note that none of the peculiarities of the three latter cases would be noticed, if one just used tabulated data and went on to passive regression. One must graph the data! The use of R^2 is subject to the same limitations.
___________________________________________________________________
Exercise B3.1
Copy Figure B3.2 and add the symmetric regression line y=2.0+0.61x to all 4 graphs. [Note that the center of gravity (9.00,7.50) is on all regression lines.] The discrepancies characterized above will become more apparent. (CAUTION: The scales on all four graphs are somewhat different, so one must scale off the distances on each of them separately. Sorry for this inconvenience.)
___________________________________________________________________
Exercise B3.2
All four configurations in Figure B3.2 have the same arithmetic means for x and y: x=9.00, y=7.50.
a) In which cases, if any, would the use of these particular values of arithmetic means be justified, because they adequately characterize something about the data cloud?
b) In which cases, if any, would calculation of arithmetic means be justified once something has been done with the data?
c) In which cases, if any, should one use means different from arithmetic?
d) In which cases, if any, should one give up on trying to calculate any kind of means, because they would leave a mistaken impression of the actual configuration?
Do not assume that one and only one case fits each question!
_______________________________________________________________________
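That quite different patterns can produce the same regression output is easy to verify on Anscombe’s (1973) published numbers. The sketch below, in Python, uses his first three data sets, which appear to correspond to constellations y1 to y3 in Figure B3.2; that correspondence is my assumption. The fourth panel is described above differently from Anscombe’s fourth set, so it is left out. The helper name ols is hypothetical.

```python
def ols(xs, ys):
    """Return intercept a, slope b and R^2 of the OLS line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    r2 = sxy ** 2 / (sxx * syy)
    return a, b, r2

# Anscombe (1973), data sets I to III: they share the same x values.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
}

results = {name: ols(X, ys) for name, ys in ANSCOMBE.items()}
for name, (a, b, r2) in results.items():
    print(f"{name}: y = {a:.2f} + {b:.2f}x, R^2 = {r2:.2f}")
```

The three printed lines should be practically identical, which is exactly why the numbers alone cannot tell us whether a linear fit makes sense.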

Sensitivity to outliers
Linear regression is trickier business than some social scientists realize. In particular, the least squares regression method is quite sensitive to extreme values (outliers), and so is R^2 (Kvålseth 1985), and symmetric linear regression shares the same problem. This is illustrated by the third example in Figure B3.2. While the single outlier affects the center of gravity only slightly, it makes the slope of the regression lines much steeper and lowers R^2 from nearly 1.00 to 0.67. In our simple terms, any point far out of a generally ellipse-shaped data cloud can mess up the results. Outliers can legitimately be excluded from a set by some statistical considerations, such as being off by three (or even just two) standard deviations, in an otherwise normal distribution. There are also more refined regression methods that minimize their impact, such as Tukey’s outlier-resistant method (Kvålseth 1985). The main thing is to know when trouble looms, so that one can consult. How does one smell trouble? By graphing and eyeballing.
Empirical configuration and logical constraints
Linear regression, be it standard OLS or symmetric, may be used only when linear fit is justified in the first place. When is it, and when isn’t it? One must check for acceptable configuration of the empirical data, as has been done above. In addition, conceptual agreement with constraints is needed. Conceptual agreement includes passing through anchor points and avoidance of forbidden areas. Thus for volatility in Chapter A6 the best linear fit of V on N narrowly misses the anchor point at N=1. (The reverse fit of N on V would err in the opposite direction, and the symmetric regression line would be in-between.) The linear model does respect the anchor point but goes through the ceiling at V=100, so we have to treat it with great caution, as a merely local model. Linear fit is acceptable only when volatility is much below 50 (which it most often is). Compatibility of linear fits with logical constraints will surface again and again later on.
______________________________________________________________________
Exercise B3.3
In the top graph (y1) of Figure B3.2, join the top 6 points with a curve, then do the same for the bottom 5 points. Does the pattern still look straight? Compare to the next pattern (y2).
_______________________________________________________________________
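The effect of a single outlier, described under “Sensitivity to outliers”, can be made concrete with Anscombe’s (1973) third data set, which I assume corresponds to constellation y3. In this Python sketch the point with the largest residual is simply dropped. That is a crude stand-in for the three-standard-deviations criterion mentioned in the text, not a recommended procedure, and the helper name ols is hypothetical.

```python
def ols(xs, ys):
    """Return intercept a, slope b and R^2 of the OLS line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b, sxy ** 2 / (sxx * syy)

# Anscombe (1973), data set III: ten aligned points plus one outlier.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

a_all, b_all, r2_all = ols(X, Y3)

# Locate the point that lies farthest from the fitted line and drop it.
residuals = [abs(y - (a_all + b_all * x)) for x, y in zip(X, Y3)]
worst = residuals.index(max(residuals))
x_kept = X[:worst] + X[worst + 1:]
y_kept = Y3[:worst] + Y3[worst + 1:]

a_kept, b_kept, r2_kept = ols(x_kept, y_kept)
print("dropped point:", (X[worst], Y3[worst]))
print(f"all points:      slope {b_all:.2f}, R^2 {r2_all:.2f}")
print(f"outlier omitted: slope {b_kept:.2f}, R^2 {r2_kept:.2f}")
```

Of course, the code finds the culprit only because we already know from the graph that there is exactly one. Graphing comes first.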

B4. Graph More than just the Data! ________________________________________________________

For x=2, our regression line would predict y=-0.70. A negative approval rating? Approval ratings usually do not go below zero. If this is the case here, then such a prediction makes no sense. Yet, our equations must not predict absurdities, if we want to be taken seriously as scientists. They should not do so even outside the empirical range of input variables. Regarding broader context, we should ask why the intercept is around a=-4.06. What does this number tell us? Is it large or small compared to the intercepts in some other data sets? Unless one asks contextual questions, it is pretty pointless to calculate and report a precise regression equation that predicts absurdities for some values of x. It is dead on arrival on the printed page (Taagepera 2008: 82-91). It is high time to graph more than just the data.
Figure B4.1. Graphing only the data and regressing linearly. Dotted line: y=x. [Figure not reproduced: y vs. x.]
Graph the equality line, if possible
We might start by graphing the equality line (y=x), if equality can be defined. It certainly can, if x and y are percentages. In contrast, if x is a country’s area and y its population, then no equality can be defined. Here it can, provided that both ratings are on the same scale.

By thinking of equality, we add a conceptual comparison line: equal support for both institutions. In Figure B4.1, it’s the dotted line, top left. Now we can see that support for assembly always falls short of support for president, even while it seems to be catching up at high values of x. One may say that this was obvious even without graphing the equality line, but I have seen too many published graphs where this line was not drawn in and the “obvious” implication was missed. Would y ever catch up with x? To answer this question, it is time to graph even more, besides the data.
Figure B4.2. Introducing the limits of the conceptually allowed area and the logical anchor points. [Figure not reproduced: y vs. x on scales from 0 to 5, with the equality line y=x and the anchor points marked by triangles.]
Graph the conceptually allowed area
I should have told you right at the start about the scale on which people were asked to rank the president and the assembly. It matters whether the given data refer to variables that can range from 0 to 5 or from 0 to 10 or, worse, from 1 to 5. The range can even differ on the two axes. Published articles all too often hide or omit this all-important information. Suppose the conceptually allowed range is 0 to 5 on both axes. Indicate the resulting limits of the area where data points could possibly occur (Figure B4.2). Now the same data take on a different appearance: The curvature somehow looks more pronounced. We are also motivated to ask: What are the extreme possibilities? Our data suggest that the assembly’s ratings do not surpass the president’s. They also cannot fall below 0. In the absence of any other information, what could the assembly’s rating be when even the president’s rating is 0? The only answer that agrees with the information above is that the assembly also must have zero rating. Similarly, when even the assembly is rated a full 5, then the information above does not allow the president to have any less than 5. These points (0,0 and 5,5) are logical anchor points for such data. They are indicated in Figure B4.2 by triangular symbols reminiscent of letter A for “Anchor”.
Now we come to a corollary of the previous anti-absurdity norm: A data fit is meaningful only when it remains within the conceptually allowed area and includes the logical anchor points. The linear data fit (dashed line in Figure B4.2) violates logic. For presidential ratings below 2.4 it would predict negative ratings for the assembly. For assembly ratings above 4.4, it would suggest presidential ratings above 5. We must use a different format. We should keep this format as simple as possible while still satisfying the basic constraints.
Find the equation of the curve joining the anchor points
The simplest algebraic format that takes us smoothly from 0,0 to 1,1 is the “fixed exponent” equation
Y = X^k
This is generally the most manageable format for quantities with clear lower and upper limits. Physicists often call such an equation a power equation, but this term confuses and angers some political scientists who deal with political power. So I use “fixed exponent equation”. In our case the ranges are from 0 to 5, which would make even the simplest form more complex: y=5(x/5)^k. It is less confusing to shift to Y=y/5 and X=x/5. Table B4.1 and Figure B4.3 show these “normalized” values. It can be checked that k=3.6 yields a fair fit to data (plus exact fit to anchor points). Indeed, deviations of Y=X^3.6 from actual data seem to fluctuate up or down fairly randomly. But how did I get the value k=3.6? The equation Y=X^k implies logY=klogX and hence k=logY/logX. Pick a location on the graph that approximates the central trend, such as 0.60,0.16. (It need not be an actual data point.) Then k=log0.16/log0.60=(-0.796)/(-0.222)=3.587≈3.6. If you feel uneasy with logarithms, see next chapter.

Table B4.1. Original and normalized data, and fit with Y=X^k.
______________________________________________________
x               0    2.0    2.5    3.0    3.5    4.0    4.5    5
Actual y        0    0.2    0.2    0.9    1.1    2.4    3.6    5
y=5(x/5)^3.6    0    0.18   0.41   0.79   1.38   2.24   3.42   5
X               0    0.40   0.50   0.60   0.70   0.80   0.90   1
Actual Y        0    0.04   0.04   0.18   0.22   0.48   0.72   1
Y=X^3.6         0    0.037  0.08   0.16   0.28   0.45   0.68   1
Deviation       + +
_______________________________________________________
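The estimate of k and the fitted row of Table B4.1 can be reproduced in a few lines. This is a sketch in Python, using the central-trend location 0.60,0.16 and the data values given in the text; the names fitted_y, xs and ys are mine. The sign printed for each deviation is computed here as actual minus fitted, which is my assumption about how the deviation is defined.

```python
import math

# Central-trend location on the normalized graph, as given in the text.
X_C, Y_C = 0.60, 0.16
k_exact = math.log10(Y_C) / math.log10(X_C)   # k = logY / logX
k = round(k_exact, 1)

# Original data (anchor points 0,0 and 5,5 left out).
xs = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
ys = [0.2, 0.2, 0.9, 1.1, 2.4, 3.6]

def fitted_y(x, k, top=5.0):
    """Fixed exponent curve y = top*(x/top)^k through 0,0 and top,top."""
    return top * (x / top) ** k

print("k =", round(k_exact, 3), "rounded to", k)
for x, y in zip(xs, ys):
    fit = fitted_y(x, k)
    sign = "+" if y > fit else "-"
    print(f"x={x:.1f}  actual y={y:.2f}  fitted y={fit:.2f}  deviation {sign}")
```

A different choice of central location would give a slightly different k. That is fine at this stage: a coarse value that respects the anchor points is all we are after.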

Figure B4.3. Normalizing to range 0 to 1, and fitting with Y=X^3.6. [Figure not reproduced: Y vs. X, both on scales from 0 to 1.]

If we want a more precise fit, we can run linear regression of logY against logX; its slope is k. But how much precision do we really need? Even a coarse fit that respects the conceptual anchor points is vastly preferable to a 3-decimal linear fit that predicts absurdities. We need enough precision to compare these data to some other data of a similar type. These could be ratings at a different time. They could also introduce a different variable, such as support for the cabinet, or refer to a different country. Only such comparisons lend substantive meaning to the numerical value k=3.6, by placing it in a wider context.
Note that regression of logY against logX becomes difficult when some values of X or Y are very low, and a single zero value blows it up. In such cases extra help may be needed. Sometimes (1-Y)=(1-X)^k yields a better fit than Y=X^k. An example will be given in Part C of the book.
__________________________________________________________________
Exercise B4.1
Lori Thorlakson (2007) defines three measures of centralization of power in federations, all running from 0 to 1: Revenue centralization (R); Expenditure centralization (E); Index of federal jurisdiction (J). Their mean values in 6 federations, from the 1970s to 1990s, are as follows. The averages of the three measures are also shown, as Index of federal power (IFP).
               R     E     J     IFP
Austria       .73   .70   .95   .79
Australia     .73   .59   .87   .73
Germany       .65   .58   .96   .73
Switzerland   .58   .48   .92   .66
US            .60   .56   .80   .65
Canada        .48   .42   .85   .58
a) In what order are the countries listed? What are the advantages and disadvantages of this order, compared to alphabetical listing?
b) Graph E vs. R, along with what else is needed.
c) On the same graph, enter the points J vs. R.
d) Establish the likely shape of the relationships, and calculate approximately the relevant parameters.
e) Graph the corresponding curves. Comment on the degree of fit to the data points.
f) From the previous parameters (NOT directly from the data!), calculate the approximate relationship between E and J.
g) Graph the resulting curve E vs. J, and also the data points. Comment on the degree of fit.
h) What do you think are the purposes of this exercise?
____________________________________________________________________
Exercise B4.2
Use the data in the previous exercise.
a) Graph E vs. R and J vs. R on log-log scale, along with the equality line. Given that all values are between 0.4 and 1, you need only one period on log-log graph paper. You may even magnify it, to get more detail.
b) Draw in the best fit lines that pass through (1,1). These correspond to curves Y=X^k. CAUTION: While the curves Y=X^k converge at 0,0, the lines logY=k logX cannot, because log0 tends to minus infinity.
c) Determine their slopes, by measuring vertical and horizontal distances, then dividing them. These are the values of k. How do they compare with values observed in the previous exercise?
____________________________________________________________________
Why linear fit may lack comparison power
When both variables have a floor and a ceiling, they can be normalized to X and Y ranging from 0 to 1. If they are anchored at 0,0 and 1,1, then Y=X^k or (1-Y)=(1-X)^k often fits. It is then easy to compare various data sets, ranking them by values of k, which express the degrees of deviation from a straight line.
Could such comparisons also be made using linear fits? It is easy to compare data sets on the basis of a single parameter k in Y=X^k, but it is much messier to do so for two (a and b) in y=a+bx. More serious, if the real pattern is curved, then values of a and b can vary wildly for different parts of the same data set. If we calculate k in Y=X^k separately for the lower and upper halves of the data in Figure B4.3, little will change. In contrast, a and b in the linear fit will change beyond recognition. Comparisons with other data sets cannot be made on that basis. In this sense, the “simple” linear fit is not simple at all. It is well worthwhile to master the rather simple mathematics of the fixed exponent equation rather than grind out linear parameter values meaningless for comparison purposes. This is what we’ll do soon, but before we do, we must take a serious look at logarithms and exponentials.
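The claim about the lower and upper halves of the data can be tried out on the values of Table B4.1. The Python sketch below makes two assumptions of mine: the “halves” are the three lowest and the three highest x values (anchor points left out), and k is obtained by regressing logY on logX with the line forced through the anchor point 1,1, where both logarithms are zero. The function names fit_k and fit_line are hypothetical.

```python
import math

# Data of Table B4.1, anchor points 0,0 and 5,5 left out.
xs = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
ys = [0.2, 0.2, 0.9, 1.1, 2.4, 3.6]

def fit_k(xs, ys, top=5.0):
    """Slope k of logY = k*logX, forced through logX = logY = 0."""
    lx = [math.log10(x / top) for x in xs]
    ly = [math.log10(y / top) for y in ys]
    return sum(u * v for u, v in zip(lx, ly)) / sum(u * u for u in lx)

def fit_line(xs, ys):
    """OLS intercept a and slope b of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b

k_low, k_high = fit_k(xs[:3], ys[:3]), fit_k(xs[3:], ys[3:])
a_low, b_low = fit_line(xs[:3], ys[:3])
a_high, b_high = fit_line(xs[3:], ys[3:])

print(f"fixed exponent: k = {k_low:.2f} (lower half), {k_high:.2f} (upper half)")
print(f"linear, lower half: y = {a_low:.2f} + {b_low:.2f}x")
print(f"linear, upper half: y = {a_high:.2f} + {b_high:.2f}x")
```

With only three points per half this is an illustration, not a test. Still, the two values of k stay close to each other, while the two linear fits hardly look like descriptions of the same data set.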

B5. Getting a Feel for Exponentials and Logarithms ______________________________________________________

• We have 10^a·10^b=10^(a+b), because it simply adds the number of zeros. Also 10^a/10^b=10^(a-b), which subtracts the number of zeros, and (10^a)^b=10^(ab), which multiplies the number of zeros. It results that 10^(-a)=1/10^a and 10^0=1.
• Decimal logarithms are fractional exponents of 10 that lead to the given number: 10^(logA)=A.
• It follows that when numbers are multiplied, their logarithms add: AB=C → logA+logB=logC. Also y=A^m → logy=mlogA.
• Keep in mind that log10=1, log1=0, and log0 → -∞.
• The formulas established for exponents of 10 apply to any other number n too: n^a·n^b=n^(a+b), n^a/n^b=n^(a-b), (n^a)^b=n^(ab), n^(-a)=1/n^a, n^0=1, and the b-th root of n is n^(1/b).
• Natural logarithm (lnx) is just a multiple of logx: lnx=2.3026 logx. Conversely, logx=0.434 lnx.
________________________________________________________________

If you know how to use logarithms and exponentials, you can bypass this chapter. But make sure you really understand them, beyond using them by rote. We have introduced logarithms very gradually, only to the extent they were indispensable for the problem at hand. This approach also gave time for some basic notions to sink in, before being flooded by more. Chapter A5 said that the logarithmic scale is a scale that shows 0.1, 1, 10 and 100 at equal intervals. Chapter A8 showed that curves like C=42 years/N^2 can be turned into straight lines by taking logarithms on both sides of the equation, leaving explanation for later. So did Table A8.1, which listed logarithms for various numbers. Chapter A9 stated that the (decimal) logarithm of a number such as 10,000 (“1” followed by zeros) is simply the number of zeros. The (decimal) logarithm of any positive integer is the number of zeros that would follow the first cipher, if all the other ciphers were replaced by zeros, plus a fraction. It further showed how powers of 2 could be placed logically on the scale that shows 0.1, 1, 10 and 100 at equal intervals. It emerged that log2 must be around 0.30. Furthermore, log3 must be just short of 0.5 and log5 must be around 0.70 (Exercise A9.1). In Chapter A12, we needed the following rule, and stated it without showing how it comes about: When y=a^A·b^B/c^C, then logy=Aloga+Blogb-Clogc. In Chapter B4, we again needed logarithms, to calculate the values in Table B4.1. By now you should be persuaded that there is no escape: One cannot do basic logical models without logarithms and their counterparts, the exponentials. I’ll still try to keep it as simple as possible, and previous gradual introduction may have reduced anxiety. I do not want you to memorize “When numbers are multiplied, their logarithms add” (although it’s true and useful) without internalizing what it means. Only then are you prepared to use them without hesitation, knowing what they mean.

Then, if you have made a mistake in calculations (as I often do), you’d be able to smell it and correct it.
Exponents of 10
We use 10^3 as shorthand for 10×10×10. Thus, 10^3=1,000. More generally, the “exponent” a in 10^a is the number of zeros that come after “1”. It follows that 10^1=10. Also 10^0=1, given that here “1” is followed by no zeros. If we multiply 100 by 1,000, we get 100,000. Using exponent notation, we have 10^2×10^3=10^5, which is 10^(2+3). The numbers of zeros are added. This is how multiplication turns into addition. When powers of 10 are multiplied, their exponents are added: 10^a·10^b=10^(a+b).
When multiplying 100 by itself 3 times, we get 100×100×100=1,000,000. In exponent notation, 10^2×10^2×10^2=(10^2)^3=10^6. Thus (10^a)^b=10^(ab). If we divide 10,000 by 10, we get 1,000. Using exponents: 10^4/10^1=10^3, which is 10^(4-1). Hence division of numbers leads to subtraction of exponents: 10^a/10^b=10^(a-b). Now consider the reverse division: 10/10,000 yields 1/1,000=0.001. The previous rule makes it correspond to 10^1/10^4=10^(1-4)=10^(-3). Note that the “-3” corresponds to the number of zeros that precede “1”. We must also conclude that 10^(-a)=1/10^a. If we divide 100 by 100, we get 1. This corresponds to 10^2/10^2=10^(2-2)=10^0. Thus, it is confirmed that 10^0=1.
For future use, all the general equations above (those written with a and b) may be worth memorizing, but only if you understand what’s behind them. This means you can prove these relationships, if challenged. Otherwise, memorization does you little good. Conversely, if you understand, all this may look so natural that no memorization is needed.
_____________________________________________________________________
Exercise B5.1
OK, now quickly, without looking at the above: Why is 10^0 equal to 1? Why is 10^(-a) equal to 1/10^a? Why is 10^3 times 10^4 equal to 10^7? If you cannot respond, in terms of numbers of zeros, return to the beginning of the section.
___________________________________________________________________
Fractional exponents of 10
The next question may sound crazy: What could 10^(1/2) or 10^0.5 stand for? If you take the previous rule seriously, it would mean “1” followed by one-half of a zero! It seems to make no sense. But hold it! Also consider the previous rule 10^a·10^b=10^(a+b). When multiplying 10^0.5 by itself, we would get 10^0.5·10^0.5=10^(0.5+0.5)=10^1=10. But this is the very definition of the square root of 10, which is approximately 3.16, given that 3.16×3.16≈10.

Thus 10^(1/2)=10^0.5 stands for the square root of 10. Yes, it’s as if 3.16 were “1” followed by one-half of a zero. Now consider the cube root of 10, which is 2.154, because, as you can check, 2.154^3≈10. We could then say that 2.154 is somehow like “1” followed by one third of a zero, because 10^(1/3)×10^(1/3)×10^(1/3)=10^(1/3+1/3+1/3)=10^1, which is “1” followed by a full zero. What about exactly 3? It is somewhat less than 3.16 but much more than 2.154. So it should be “1” followed by somewhat less than one-half of a zero but much more than one third of a zero. By now you may get the message that we can assign an exponent of 10, a sort of a “fractional number of zeros”, to any number between 1 and 10. For instance, 2 is 10 with exponent 0.30. How can we prove it? Recall Chapter A9. Note that 2^10=1,024. This is quite close to 1,000=10^3. Thus 2^10≈10^3. Take the 10th root on both sides: (2^10)^(1/10)≈(10^3)^(1/10). Multiply through, and we get 2≈10^0.3.
Decimal logarithms
This “fractional number of zeros to follow 1” is what the decimal logarithm is. Thus, log3.16=0.500, log2.154=0.333, and log2=0.30. Hence, by definition 10^(log2)=2. More generally, for any positive number A, 10^(logA)=A. When numbers are multiplied, their logarithms add. Indeed, consider AB=C. We can write it as 10^(logA)·10^(logB)=10^(logC). It follows from 10^a·10^b=10^(a+b) that logA+logB=logC.
AB=C → logA+logB=logC.
Also log(A^2)=log(AA)=logA+logA=2logA. More generally, log(A^m)=mlogA. Note that m enters here, not logm. In other words,
y=A^m → logy=mlogA.
When we know the logarithms for numbers between 1 and 10, we effectively know them for all numbers. Take 316. We can break it down into 3.16×100. Hence log316=log100+log3.16=2+log3.16=2+0.50=2.50. It works for fractions, too. Take 0.002. We have 0.002=2/1000. Hence log0.002=log2-log1000=0.30-3=-2.7. The “.3” in 0.30 seems lost. How come? The inverse of 0.002 is 1/0.002=1000/2=500. And sure enough, log500=log5+log100=2+log5=+2.7. This is how the “.3” leads to “.7”.
What could the logarithm of 0 be? Recall that log0.001=log(1/1000)=-3. Each time we divide a number by 10 we subtract “1” from its logarithm. How many times do we have to divide 1 by 10 so as to obtain 0? We’d have to do it infinite times. Thus log0 tends toward minus infinity: log0 → -∞.
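If you prefer to see these numbers come out of a machine, the following sketch in Python checks the examples above with the standard math module (math.log10 is the decimal logarithm). The variable names are mine. The last part shows what Python does when asked for the logarithm of 0.

```python
import math

# 2^10 is close to 10^3, hence log2 is close to 3/10.
print(2 ** 10, "vs", 10 ** 3)
log_2 = math.log10(2)
print("log2 =", round(log_2, 4))

# 10^(logA) = A, here for A = 2.
print("10^(log2) =", round(10 ** log_2, 6))

# Breaking numbers into a part between 1 and 10 and a power of 10.
log_316 = 2 + math.log10(3.16)      # 316 = 3.16 x 100
log_0002 = math.log10(2) - 3        # 0.002 = 2 / 1000
print("log316 =", round(log_316, 2), " direct:", round(math.log10(316), 2))
print("log0.002 =", round(log_0002, 2), " direct:", round(math.log10(0.002), 2))

# Zero has no logarithm: Python refuses to compute it.
try:
    math.log10(0)
    zero_has_log = True
except ValueError:
    zero_has_log = False
print("Does 0 have a logarithm?", zero_has_log)
```

The refusal in the last step is the computer’s way of saying that log0 runs off toward minus infinity.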

Hence 0 cannot be placed on a logarithmic scale. What about logarithms of negative numbers? Let us say that they do not have any. Earlier, we approached logarithms from a different direction. Understanding them from different directions reinforces our ability to use them, making few mistakes and, most important, realizing quickly when we have made a mistake.
What are logarithms good for? They turn multiplications into additions, but who needs going through logarithms when one can just multiply? However, they also turn exponent expressions into multiplications, and this is where one cannot do without logarithms. And all too many logical models involve exponents. Take the expression y=5(x/5)^3.6 in previous Table B4.1. How did I get y=0.18 when x=2.0? For x=2, y=5(0.4)^3.6. Here we cannot go ahead without applying logarithms: logy=log5+3.6log0.4. A pocket calculator with a LOG (or LOG/10^x or log) key comes in handy. (I can estimate the logs by other means, but it takes time.) On a usual pocket calculator, enter 5, push LOG and get 0.699. (On some calculators, you must first push LOG, then 5 and then “=”.) This way you get logy=0.699+3.6×(-0.398)=-0.734. Now take the “antilog” of -0.734, which means taking 10^(-0.734). On most pocket calculators, once you have -0.734 entered, push “2nd function” and “LOG”, and you get y=10^(-0.734)=0.1845≈0.18. This is a most important property of logarithms: They turn exponent expressions, which we cannot compute directly, into multiplications:
A=B^m → logA=m logB.
Many pocket calculators offer a shortcut for 5(0.4)^3.6: the “y^x” key. Enter 5, push “×”, enter 0.4, push “y^x”, enter 3.6, push “=”, and you get 0.18 directly. Does such a calculator bypass taking logarithms? No. It just automatically takes log0.4, multiplies by 3.6, and takes the antilog.
Exponents of numbers other than 10
The formulas established for exponents of 10 apply to any other number n too:
n^a·n^b=n^(a+b)
n^a/n^b=n^(a-b)
(n^a)^b=n^(ab)
n^(-a)=1/n^a
n^0=1.
It follows that the b-th root of n is n^(1/b). Also the b-th root of n^a is n^(a/b). These relationships frequently enter model building and testing.
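The pocket calculator steps for y=5(0.4)^3.6 translate directly into code. Here is a sketch in Python, where math.log10 plays the role of the LOG key and 10**… the role of the antilog; the ** operator corresponds to the “y^x” shortcut. Variable names are mine.

```python
import math

# Step by step, as on the calculator: logy = log5 + 3.6*log0.4
log_5 = math.log10(5)
log_04 = math.log10(0.4)
log_y = log_5 + 3.6 * log_04
y_via_logs = 10 ** log_y            # the "antilog"

# The shortcut: the power operator does the same job in one go.
y_direct = 5 * 0.4 ** 3.6

print("log5 =", round(log_5, 3), " log0.4 =", round(log_04, 3))
print("logy =", round(log_y, 3))
print("y via logs =", round(y_via_logs, 4), " y direct =", round(y_direct, 4))

# A root is just a fractional exponent: the cube root of 10 is 10^(1/3).
cube_root_10 = 10 ** (1 / 3)
print("cube root of 10 =", round(cube_root_10, 3))
```

Both routes must give the same y, apart from the rounding that creeps in when we copy three-decimal logarithms by hand.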

Logarithms on other bases than 10
Logarithms can be established on bases other than 10, but this may make it confusing. The only other type of logarithms needed for most models is the “natural” one, designated as “ln”. It is based on the number e=2.718… instead of 10. By definition, lne=1, just as log10=1. What’s so “natural” about 2.718…? We’ll come to that in Part C. When logx means logarithm to the base 10, then lnx is a fixed multiple of logx:
lnx = 2.3026 logx. Conversely, logx = 0.434 lnx.
The previously established relationships still apply. In particular, AB=C → lnA+lnB=lnC, and ln(A^m)=mlnA. Many pocket calculators have separate keys for LOG (and 10^x for antilog) and LN (and e^x for antilog). We’ll address the natural logs when the need arises.
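Where do 2.3026 and 0.434 come from? The first is simply ln10 and the second is its inverse. A short Python check, in which math.log is the natural logarithm and math.log10 the decimal one; the test value 316 is an arbitrary choice of mine.

```python
import math

LN10 = math.log(10)          # natural logarithm of 10
print("ln10 =", round(LN10, 4), "  1/ln10 =", round(1 / LN10, 3))

x = 316.0                    # any positive number would do
ln_from_log = LN10 * math.log10(x)       # lnx = 2.3026 logx
log_from_ln = math.log(x) / LN10         # logx = 0.434 lnx
print("lnx:", round(math.log(x), 4), " from logx:", round(ln_from_log, 4))
print("logx:", round(math.log10(x), 4), " from lnx:", round(log_from_ln, 4))
```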

B6. When to Fit with What
________________________________________________________
• The conceptually forbidden areas inevitably constrain relationships between two variables. Conceptual anchor points and ceilings add further constraints.
• When the entire field y vs. x is open, try linear fit. Graph y vs. x and see if the pattern is straight.
• When only one quadrant is allowed, try fixed exponent fit. Graph logy vs. logx and see if the pattern is straight.
• When two quadrants are allowed, try exponential fit. Graph logy vs. x and see if the pattern is straight.
• When the pattern is not linear, look for further constraints.
________________________________________________________________
The example in Chapter B4 (satisfaction with head of state and with national assembly) concluded that a linear fit violated logical anchor points. In that particular case it was advisable to fit the data with Y=X^k. This meant graphing on log-log scales and trying a linear fit for logY vs. logX, rather than y vs. x. But more broadly, how does one know which equation to use? As a first step, do what was done in Chapter B4: consider the conceptually allowed area. The following presents some of the simplest and most prevalent constraints. A more systematic review of frequently occurring forms is given in Taagepera (2008), along with cautionary notes.
Our basic observation is that the conceptually forbidden areas inevitably constrain relationships between two variables. Conceptual anchor points and ceilings add further constraints. We should start with the simplest mathematical format that satisfies such constraints. We should add more complexity only when data do not fit the simplest format. Such lack of fit usually means that further logical constraints have been overlooked. It is amazing, though, how often nature (including social nature) conforms to the simplest forms. (Recall “Why would the simplest forms prevail?” in Chapter A12).
Unbounded field – try linear fit
The only situation where a linear model is justified is when both x and y can conceivably take any values – from minus infinity to plus infinity. The allowed area is an unbounded field (Figure B6.1). Apart from time, such quantities are rather rare. Their zero point tends to be arbitrary, and hence there are no logical anchor points.

Calculating the parameters of a line in unbounded field
If we expect a linear pattern y=a+bx because the field is unbounded, no transformation is needed. Just graph y vs. x. If the data cloud is linear, then y=a+bx applies. Then we can regress y vs. x. How can we find the coefficients a and b, using the visual best-fit line?
• Intercept a is the value of y where the line crosses the y axis (because here x=0).
• Slope b is the ratio -a/c, c being the value of x where the line crosses the x axis (because here y=0).


However, we can find the coefficient values in y=a+bx from any two suitable points.
• Take two points, far away from each other: x1,y1 and x2,y2. These should be “typical” points in the sense of being located along the axis of the data belt, not high or low compared to neighboring points.
• For y=a+bx we have b=(y1-y2)/(x1-x2). Then a=y1-bx1.
• When a=0 is imposed, so as to force the line to go through 0,0, the equation is reduced to y=bx. Then b=y1/x1.

Figure B6.1. When to use linear regression on unmodified data.
[Figure: unbounded field (any x, any y), with both axes running from -∞ to +∞ and the zero point often arbitrary. Try fitting with y=a+bx (linear pattern). Two example lines are shown, with slopes b=+1 and b=-1/2, each marked with its y-intercept a and x-intercept c.]
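The two-point recipe for a straight line can be sketched in a few lines of code. This is only an illustration: the function name line_through and the two sample points are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
def line_through(p1, p2):
    """Return (a, b) of the line y = a + b*x through two points (x, y)."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points need different x values")
    b = (y1 - y2) / (x1 - x2)   # slope
    a = y1 - b * x1             # intercept
    return a, b

# Two hypothetical "typical" points read off a data belt.
a, b = line_through((-4.0, 0.0), (6.0, 5.0))
c = -a / b                      # x value where the line crosses the x axis
print(a, b, c)

# Cross-check with the graphical rule: slope equals -a/c.
assert b == -a / c
```

Choosing points far from each other matters in practice: a small reading error in y1 or y2 is divided by a large x1-x2, so the slope is less affected.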

One quadrant allowed – try fixed exponent fit
Most often, however, we deal with quantities that cannot go negative: populations, votes, parties. In a y vs. x graph, only one quadrant of the open field is allowed (Figure B6.2), and there is a natural zero that cannot be shifted: zero persons, zero parties, etc. E.g., for a country with zero square kilometers, it is reasonable to expect zero population and zero parties. Here one should consider the fixed exponent pattern y=Ax^k. It has an anchor point at 0,0 when k is positive. It does not lead to absurdities. In contrast, most straight lines do, predicting a negative value of y for some positive values of x. For the straight line, we calculated the parameters a and b expressing a particular line. How do we calculate the parameters A and k for a fixed exponent curve? This will be done soon.


Figure B6.2. When to try the fixed exponent format.
[Figure: one quadrant ALLOWED (x>0, y>0), the others FORBIDDEN; zero point not arbitrary (natural zero). Try fitting with y=Ax^k (fixed exponent pattern). Example curves are shown for k=+2, k=+1 and k=-1, each passing through y=A at x=1.]

Two quadrants allowed – try exponential fit
An intermediary situation arises when one of the quantities can range from minus infinity to plus infinity (such as time), while the other cannot go negative (such as population). In a y vs. x graph, two quadrants of the open field are now allowed (Figure B6.3). There is a natural floor at zero persons, but no natural zero on time and hence no anchor point. Here one should consider the exponential pattern, y=A(B^x). It can also be expressed as y=A(e^(kx)), where e=2.71… is the base of natural logarithms.

The fixed exponent equation y=A(x^k) and the exponential equation y=A(B^x) may look confusingly similar. The first has x to a fixed exponent (power), while the second has a constant to the exponent x. The difference in outcomes is huge.

Further constraints often occur. For instance, a growth pattern in time may be squeezed into the space between a conceptual floor and ceiling. And an unbiased seats-votes relationship for a two-party system has anchor points not only at 0,0 and 1,1 but also at 0.5,0.5. Such constraints impose more complex patterns – see Taagepera (2008). We’ll address them only when the need arises.


Figure B6.3. When to try the exponential format.
[Figure: two quadrants ALLOWED (any x, but y>0), the lower half FORBIDDEN; natural zero on y, arbitrary zero on x. Try fitting with y=A(B^x)=A(e^(kx)) (exponential pattern). A rising and a falling curve are shown, labelled k>1 and k<1 in the original, each crossing the y axis at a.]

How to turn curves into straight lines, before regression
Humans are pretty good at telling whether a line is straight, while various curves may look all the same to us. Suppose the conceptually allowed area suggests a fixed exponent or exponential relationship. It helps if we can transform the data in such a way that they would form a straight line – if the relationship is truly a fixed exponent or exponential. Then we could graph the transformed data and see at a glance whether the transformed data cloud looks roughly like an ellipse. When this is the case, linear regression of the transformed data is justified. (It should preferably be symmetric regression.) If the transformed data cloud still looks bent or otherwise odd, we would have to ponder why this is so and what we have to add to the model so as to straighten out the data cloud.

When the transformed graph does show a linear relationship, we should calculate or estimate its two parameters. From these we can determine the parameters of the original model – A and k in y=Ax^k, and similarly for exponentials. It can be done without push-button regression, and in fact, this hands-on approach is sometimes preferable. My experience is that students cannot understand and interpret computer-generated regression outputs unless they have acquired the ability to do rough graphs by hand and calculate the parameters using nothing more than a pocket calculator. The description of how to proceed is presented next.


Calculating the parameters of fixed exponent equation in a single quadrant
If we expect fixed exponent pattern y=Ax^k because only one quadrant is allowed, taking logarithms leads to linear relationship between log y and log x: log y=log A+k log x. Designating log A as a takes us to the familiar linear form (log y)=a+k(log x). Hence we should graph log y vs. log x. If the transformed data cloud is linear, then y=Ax^k applies. Then we can regress log y vs. log x. How can we find the coefficients A and k in y=Ax^k? We can do it in two ways.

Finding the coefficient values in y=Ax^k from the log-log graph:
• Coefficient A is the value of y where the line crosses the log y axis (because here log x=0 and x=1).
• Exponent k is the ratio -a/c, where a=log A and c is the value of log x where the line crosses the log x axis (because here log y=0).

Finding the coefficient values in y=Ax^k from two points on the original curved graph y vs. x:
• Take two “typical” points of the data belt, far away from each other: x1,y1 and x2,y2.
• For y=Ax^k we have k=log(y1/y2)/log(x1/x2). Then A=y1/(x1^k).
• When A=1 is imposed, the equation is reduced to y=x^k. Then k=log y1/log x1.

Calculating the parameters of exponential equation in two quadrants
If we expect exponential pattern y=A(B^x) because only two quadrants are allowed, taking logarithms leads to linear relationship between log y and non-logged x: log y=log A+x(log B). Designating log A as a and log B as b takes us to the familiar linear form (log y)=a+bx. Hence we should graph log y vs. x itself. If the data cloud is linear, then y=A(B^x) applies. Then we can regress log y vs. x itself.

There are often good reasons to use the alternative exponential expression y=A(e^(kx)) and natural logarithms (ln). By definition, ln e=1. Hence the logarithms are related as ln y=ln A+kx=a+kx. We again graph ln y vs. x itself. If the data cloud is linear, then y=Ae^(kx) applies. Then we can regress ln y vs. x itself.
Recall that natural (ln x) and decimal (log x) logarithms relate as ln x=2.30 log x and, conversely, log x=0.434 ln x. Often we can use either logarithm. How can we find the coefficients A and B in y=A(B^x) or A and k in y=Ae^(kx)? In principle, we can again do it in two ways – using the semilog graph or using two points on the original curved graph y vs. x. However, on the semilog paper you may get confused between log and ln, so better use the two-point formula:
• Take two “typical” points of the data belt, far away from each other: x1,y1 and x2,y2.
• For y=A(B^x) we have log B=log(y1/y2)/(x1-x2). Then B=10^(log B) and A=y1/(B^x1).
• For y=A(e^(kx)) we have k=ln(y1/y2)/(x1-x2). Then A=y1(e^(-kx1)). This is often the more useful form.
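The two-point formulas for the fixed exponent and exponential formats can be sketched as follows. The function names and the sample points are hypothetical; the points are generated from y=3x^2 and y=5e^(0.1x) so that the recovered parameters can be checked against known values.

```python
import math

def fixed_exponent_through(p1, p2):
    """Return (A, k) of y = A * x**k through two points with x > 0, y > 0."""
    (x1, y1), (x2, y2) = p1, p2
    k = math.log10(y1 / y2) / math.log10(x1 / x2)
    A = y1 / x1 ** k
    return A, k

def exponential_through(p1, p2):
    """Return (A, k) of y = A * exp(k*x) through two points with y > 0."""
    (x1, y1), (x2, y2) = p1, p2
    k = math.log(y1 / y2) / (x1 - x2)   # natural logarithm of the y ratio
    A = y1 * math.exp(-k * x1)
    return A, k

# Hypothetical points lying exactly on y = 3 * x**2.
A_fix, k_fix = fixed_exponent_through((2.0, 12.0), (8.0, 192.0))

# Hypothetical points lying exactly on y = 5 * exp(0.1 * x).
A_exp, k_exp = exponential_through((0.0, 5.0), (10.0, 5.0 * math.e))
B = math.exp(k_exp)                     # the same curve written as y = A * B**x

print(round(A_fix, 6), round(k_fix, 6), round(A_exp, 6), round(k_exp, 6))
```

Note that in both formulas the logarithm is taken of the ratio y1/y2 alone; the division by log(x1/x2) or by (x1-x2) happens outside the logarithm.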

For these recipes to start to make sense, examples should be worked out. A few examples are given here. We’ll encounter many more as we consider actual data graphs, in Part C of the book. When the exponential format applies, use y=A(e^(kx)) rather than y=A(B^x).
_____________________________________________________________________
Exercise B6.1
Consider the following two data sets. The first is selected and rounded off from Lijphart (1999: 76-77 and 177) and is representative for democracies around 1995. We have dealt with effective number of parties (N) earlier. Interest group pluralism (I) is a compound indicator that can range from 0 to 4. These data sets are not the best ones for a first exercise where one must choose the functional form – they have their peculiarities. I’ll try to locate less ambiguous sets.

Country         N      I
Australia       2.2    2.7
Botswana        1.4    2.6
Canada          2.4    3.6
Germany         2.9    1.4
India           4.1    2.3
Malta           2.0    3.3
Norway          3.4    0.4
Papua-NG        6.0    2.1
Switzerland     5.2    1.0
Venezuela       3.4    1.9
Arithm. mean    3.30   2.13
Geom. mean      3.02   1.83

The second set is scaled off from Dalton and Shin (2006: 250), where D=mean % level of popular support for democracy, and C=mean % level of popular support for capitalism around 2000 in Northern Pacific.

Country         D      C
Canada          81     57
China           60     29
Indonesia       54     48
Japan           82     52
Korea, S.       79     57
Philippines     42     36
Singapore       83     58
US              80     63
Vietnam         60     44
Arithm. mean    69.2   49.8
Geom. mean      67.6   48.4


a) For both sets separately, graph the allowed regions and the data points on regular scales. (For the first, make sure to graph N from 0 on, even while we know that values below 1 cannot occur.)
b) Given the constraints, which of the three basic patterns (linear, fixed exponent, and exponential) must be excluded and which may apply, in each case? (It should be straightforward for Dalton & Shin data. It may be more ambiguous for Lijphart data.)
c) On the graphs, indicate possible anchor points. By eye, draw in a best fitting curve.
d) For each data set, give the general basic equation format which respects the constraints. How should we transform the variables so that the simplest format would apply? (It should be straightforward for Dalton & Shin data. It may be more ambiguous for Lijphart data.)
e) Which combination of means should we graph in each case? Comment on its location with respect to your curves. ketzhu oahket
f) Now graph each data set on the graph paper which would turn the pattern into a linear one. Also place the suitable combinations of means on these graphs.
g) Draw the lines through these mean points and the relevant anchor points.
h) Pick at least 2 points on these lines, transfer them back to your original graphs, and draw a curve through them. Comment on how they differ or fail to differ from the curves you originally drew by eye.
i) What must have been the purposes of this exercise?
_____________________________________________________________________
Exercise B6.2
For the Lijphart data, use anchor point and the point determined by the suitable mean values to calculate the parameters in the equation. arb -.tletoah -egy
_______________________________________________________________________
Exercise B6.3
For the Dalton & Shin data, use anchor point and the point determined by the suitable mean values to calculate the parameters in the equation.
______________________________________________________________________
EXTRA: Two quadrants allowed, with a ceiling – try "simple logistic" fit
This is a "drawn-out S" pattern. When the ceiling is at y=C and size is y0 at time t0, then y = C/{1 + [(C-y0)/y0]exp[-k(t-t0)]}, where larger k means steeper growth from near-zero to near-ceiling. It follows that y/(C-y) = [y0/(C-y0)]exp[k(t-t0)] and hence ln[y/(C-y)] = ln[y0/(C-y0)] + k(t-t0), which is linear in t with slope k. (When t0 is taken as the time at which y reaches C/2, this reduces to ln[y/(C-y)] = kt-kt0.) To test fit to the simple logistic model, graph log[y/(C-y)] vs. t and see if the pattern is straight. This is the basis for statistical packages of LOGIT regression.
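The straightening effect of the transformation can be verified numerically. In this sketch the ceiling C, starting size y0, starting time t0, rate k and the time points are all hypothetical values chosen for illustration; natural logarithms are used so that the slope comes out as k itself.

```python
import math

def simple_logistic(t, C, y0, t0, k):
    """Simple logistic curve with ceiling C that passes through (t0, y0)."""
    return C / (1 + ((C - y0) / y0) * math.exp(-k * (t - t0)))

def logit(y, C):
    """Transformation ln[y/(C-y)] that should straighten a logistic curve."""
    return math.log(y / (C - y))

C, y0, t0, k = 100.0, 5.0, 0.0, 0.3       # hypothetical parameter values
ts = [0, 5, 10, 15, 20]                   # hypothetical observation times
zs = [logit(simple_logistic(t, C, y0, t0, k), C) for t in ts]

# Slopes between successive transformed points; a straight line has equal slopes.
slopes = [(zs[i + 1] - zs[i]) / (ts[i + 1] - ts[i]) for i in range(len(ts) - 1)]
print([round(s, 6) for s in slopes])
```

With real data the ceiling C is usually not known exactly, and a wrong guess for C bends the transformed graph near the top, so it is worth trying more than one plausible ceiling.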


B7. Medians and Geometric Means
____________________________________________________________________
Arithmetic means are often misleading. Geometric means are often more meaningful. Take three incomes, 4, 10 and 100 units. The arithmetic mean (38) depends too much on the largest income. The geometric mean (16) is closer to the central figure (median).
• To calculate the arithmetic mean of n numbers, add them, then divide by n: (100+10+4)/3=38. For geometric mean, multiply them: 100×10×4=4000, then take the n-th root, for instance as follows. On a pocket calculator, push key ‘y^x’, enter n, push key ‘1/x’, push key ‘=’ and get 4000^(1/3)=15.87≈16.
• When x can conceptually take any positive and negative values, its distribution may well be normal, and the median agrees with the arithmetic mean.
• When x can conceptually take only positive values, its distribution tends to be lognormal rather than normal, and the median agrees with the geometric mean.
• Whenever running a normal distribution yields a standard deviation larger than one-half of the mean, one should dump the normal fit and try a lognormal fit instead.
• When x cannot conceptually take negative values but can drop to 0, then a pseudo-geometric mean may reflect the median.
___________________________________________________________________
Geometric means entered right in the first chapter – they could not be avoided. Now it is time to compare the median and various means more systematically. When talking about means or averages, we instinctively tend to visualize a value such that half the items are smaller and half the items are larger. This is the definition of the median. Instead of the median, we actually often deal with the arithmetic mean or the geometric mean. Depending on the nature of data, either one or the other approximates the median. Why don’t we just use the median, if this is what we really are after? The median itself is often awkward to handle, as will be explained later on.
But we should know when to use which mean, and also know when neither is satisfactory. What difference does it make? Suppose we are told that the monthly mean income in a country is 10,000 euros. This statement may leave the impression that half the people earn more than 10,000 while one-half earn less – but we wonder why many more than one-half the people we know earn less. Well, if 10,000 is the arithmetic mean of a few millionaires and many more common people, the median might be just around 5,000. Indeed, take three incomes, 1,000, 3,000 and 26,000 euros. The arithmetic mean (10,000) depends too much on the largest income. The geometric mean (4,300) is closer to the median (3,000).

In Chapter B6, it mattered very much whether a variable could take only positive values or any values from minus to plus infinity. Precisely the same consideration enters here. For variables ranging from minus to plus infinity, the arithmetic mean is the likeliest to reflect the median. For variables restricted to positive values, the geometric mean is the likeliest to reflect the median.
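The two income examples can be recomputed in a few lines. This is a minimal sketch; the helper names arithmetic_mean and geometric_mean are introduced here for illustration, and the geometric mean is computed through logarithms rather than with the calculator keys described in the text.

```python
import math
import statistics

def arithmetic_mean(values):
    """Sum of the values divided by their number."""
    return sum(values) / len(values)

def geometric_mean(values):
    """n-th root of the product, computed as exp of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

small = [4, 10, 100]             # the three incomes from the summary box
euros = [1000, 3000, 26000]      # the three incomes in euros

for incomes in (small, euros):
    print(incomes,
          round(arithmetic_mean(incomes), 1),
          round(geometric_mean(incomes), 1),
          statistics.median(incomes))
```

In both sets the geometric mean lands between the median and the arithmetic mean, and much nearer to the median.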

Arithmetic mean and normal distribution
Suppose 35 items are distributed as in Table B7.1. These are the frequencies of measurements in Chapter A11, Extra 1. Most cases are in the center. It is most important that the two wings are symmetric. The median is 80, and so is the arithmetic mean (A): A = Σxi/n = (75+2×76+3×77+… +85)/35 = 2800/35 = 80.

Table B7.1. Hypothetical sizes of 35 items.
____________________________________________________________________
Size              75  76  77  78  79  80  81  82  83  84  85
Number of cases    1   2   3   4   5   5   5   4   3   2   1
____________________________________________________________________
One particular symmetric distribution that occurs frequently is the bell-shaped normal distribution, roughly shown in Figure B7.1. This is a logically based model for how sizes can be expected to distribute themselves under certain conditions, when no constraints are imposed. No proof will be given here. Normal distribution has long symmetric tails in both directions, and it is characterized by the arithmetic mean, which is also the median, plus a typical width, standard deviation (σ, sigma). The tail that goes beyond one standard deviation on either side includes about 16 % of all the cases.

Figure B7.1. Normal and lognormal distributions.
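The mean and median of Table B7.1 can be checked by expanding the frequency table into individual items. The sketch also evaluates the one-sided tail of a normal distribution beyond one standard deviation, using the standard complementary error function; the variable names are illustrative.

```python
import math
import statistics

sizes = range(75, 86)                           # sizes 75, 76, ..., 85
counts = [1, 2, 3, 4, 5, 5, 5, 4, 3, 2, 1]      # number of cases per size

# Expand the frequency table into a list with one entry per item.
items = [size for size, count in zip(sizes, counts) for _ in range(count)]

mean = sum(items) / len(items)
median = statistics.median(items)

# Share of a normal distribution lying more than one sigma above the mean.
tail_beyond_one_sigma = 0.5 * math.erfc(1 / math.sqrt(2))

print(len(items), sum(items), mean, median, round(tail_beyond_one_sigma, 3))
```

Because the two wings of the table are symmetric, the mean and the median coincide at the central size.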

Beyond a couple standard deviations, normal distribution soon falls to very low values. However – and this is important – it falls to utter zero only at plus and minus infinity. Thus, in principle, it does not apply to quantities that cannot go negative. If we applied it to heights of people, it would suggest that, in extremely rare cases, people with negative heights would occur. We must not apply models to cases where absurdities result.

If the mean is far away from zero and standard deviation is much less than the mean, then normal distribution still works out as a pretty good approximation. Suppose that men’s mean height is 170 cm and the standard deviation is 17 cm. Then the zero point is so far away (10 standard deviations) that it might as well be at minus infinity. However, suppose it is reported that the mean number of telephones per 1000 people in various countries is 60, with standard deviation 70. (I have seen such reports in print.) This would mean that more than 18% of the countries have negative numbers of telephones! Obviously, the actual distribution is not normal, and the normal model must not be applied. Our models must not predict absurdities.

In sum, when the distribution is symmetric, which is the case for normal distribution among others, the arithmetic mean yields the median. Strictly taken, normal distribution applies only to quantities which can range from plus to minus infinity. For quantities that cannot go negative, we can still try it. If standard deviation turns out to be much less than the mean, then the normal distribution is still a good approximation. When, to the contrary, standard deviation exceeds one quarter of the mean, then normal distribution must not be used. What do we do instead? Try lognormal distribution.

Geometric mean and lognormal distribution
Recall Chapter A1: To calculate the arithmetic mean of n numbers, we add them, then divide by n: (100+10+4)/3=38. For geometric mean, we multiply them: 100×10×4=4000, then take the n-th root, for instance as follows. On a pocket calculator, push key ‘y^x’, enter n, push key ‘1/x’, push key ‘=’ and get 4,000^(1/3)=15.87≈16. Suppose we have 10 items of sizes shown in Table B7.2. These might be the weights of dogs in US pounds. The median is (10+11)/2=10.5.
The arithmetic mean is much larger – 21.9 – because it is much too much influenced by the few largest components. But the geometric mean is quite close to the median: G = (Πxi)^(1/n) = (1×2×5×…×100)^(1/10) = 10,010,000,000^(0.1) = 10. Here the geometric mean reflects the median. This is so because we have many small and few large entries. If we group the data by size brackets of 0 to 19.9, 20 to 39.9, etc., we get 7 – 1 – 1 – 0 – 0 – 1 items per group – the distribution is far from symmetric, with peak at the low end.

Table B7.2. Hypothetical sizes of 10 items.
____________________________________________________________________
x        1     2     5     7     10     11     13     20     50     100
log x    .00   .30   .70   .85   1.00   1.04   1.11   1.30   1.70   2.00
____________________________________________________________________
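The figures quoted for Table B7.2 can be reproduced as follows. This is a sketch only; the helper bracket_counts is a hypothetical name, and math.prod requires Python 3.8 or later.

```python
import math
import statistics

x = [1, 2, 5, 7, 10, 11, 13, 20, 50, 100]      # sizes from Table B7.2
n = len(x)

A = sum(x) / n                                 # arithmetic mean
median = statistics.median(x)
G_root = math.prod(x) ** (1 / n)               # n-th root of the product
logs = [math.log10(v) for v in x]
G_logs = 10 ** (sum(logs) / n)                 # antilog of the mean logarithm

def bracket_counts(values, width, n_brackets):
    """Count values falling in successive brackets of equal width, from 0 up."""
    counts = [0] * n_brackets
    for v in values:
        counts[min(int(v // width), n_brackets - 1)] += 1
    return counts

raw_groups = bracket_counts(x, 20, 6)          # brackets 0-19.9, 20-39.9, ...
log_groups = bracket_counts(logs, 0.5, 5)      # brackets 0-0.49, 0.5-0.99, ...

print(A, median, round(G_root, 2), round(G_logs, 2), raw_groups, log_groups)
```

The two routes to the geometric mean agree, which is the numerical counterpart of saying that the geometric mean of numbers corresponds to the arithmetic mean of their logarithms.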


The geometric mean corresponds to the median when the distribution is lognormal. This is a distribution where all entries are positive, and it has a long drawn out tail toward the large values. Why is it that this distribution is called lognormal? Because the distribution of the logarithms of such data is normal. Take the logarithms of the 10 items above, as also shown in Table B7.2. Group these logarithms by size brackets of 0 to 0.49, 0.5 to 0.99, etc., and we get 2 – 2 – 4 – 1 – 1. This is more balanced, with a peak at the center. (By juggling the data a bit, I could get a perfectly symmetric distribution, but this would be somewhat misleading. With so few data points, we obtain only approximation to smooth distributions.) What is the meaning of going by brackets 0 to 0.49, 0.5 to 0.99, etc.? We effectively divide data into multiplicative slots: going from 1 to 3.16, then from 3.16 to 10, then from 10 to 31.6, and so on.

The geometric mean has a similar relationship to the arithmetic mean: The geometric mean of numbers corresponds to the arithmetic mean of the logarithms of those numbers. Indeed, G=(Πxi)^(1/n) leads to log G = (1/n)log(Πxi) = (1/n)log(x1 x2 x3…) = (1/n)(log x1+log x2+log x3…) = Σ(log xi)/n. When our pocket calculator does not have a “y^x” key but has “log x” and its reverse, “10^x”, then we can take the logarithms of all numbers, calculate log G, and put 10 to the power log G.

Means and the median
When talking about means or averages, we most often are really interested in the median, the value such that half the items are smaller and half the items are larger. When the distribution is normal, the median equals the arithmetic mean – this was the case for numbers in Table B7.1. When the distribution is lognormal, the median equals the geometric mean – this was approximately the case for numbers in Table B7.2. But if we are really interested in the median, then why don’t we just calculate the median?
The median is often awkward to handle for the following reason. For arithmetic and geometric means, we just have to add or multiply the numbers in any random order. This is easily done on a pocket calculator. For median, we would have to write them out, arranged by size. This is a minor hassle, and of course, computers can handle it easily. But suppose someone else had a data set like the one in Table B7.2. He reports that for these 10 items, A=21.9, G=10.0, median=10.5. Much later, you find two other items which fit in, with values 6 and 8. They clearly reduce the median and the means, but by how much? What are the new values of the median and the means? For the arithmetic mean, it’s simple. You restore the previous total (10×21.9=219) and add the new items. The new arithmetic mean is A’=(10×21.9+6+8)/(10+2)=233/12=19.4. Similarly, the new geometric mean is G’=(10.0^10×6×8)^(1/(10+2))=(4.8×10^11)^(1/12)=9.4. For the median, in contrast, you are stuck.


Having its previous value does not help you. You’d have to start arranging the items by size – but you don’t have the original data! You can’t find the author, or he may have already discarded the data. This is a major reason why medians are harder to handle than means. We can build on previous means, while with medians we may have to start from scratch.

The sticky case of almost lognormal distributions
Suppose again that it is reported that the mean number of telephones per 1000 people in various countries is 60, with standard deviation 70. The distribution clearly cannot be normal. We are well advised to try lognormal distribution and calculate the geometric mean rather than the arithmetic. But suppose there is one country without a single telephone. Single-handedly, it sinks the geometric mean to zero! What can we do? Suppose we have the data in Table B7.3. The arithmetic mean clearly exceeds the median, but G=0 would under-represent it outrageously. Quite hesitatingly, I suggest the following approach: Lend all items the value of the smallest non-zero item, calculate G, and then take the loan back. Here we would shift from x to x+2. Re-subtracting 2 results in a “pseudo-geometric” mean of 9.3 – and it is close to the median.

Table B7.3. Hypothetical sizes of 6 items.
____________________________________________________________________
                                        Median   A      G
x      0    2    7    11    20    100   9        23.3   0
x+2    2    4    9    13    22    102   11       25.3   11.3
-2:                                     9        23.3   9.3
____________________________________________________________________
Conclusion: The conceptual range matters
Chapters B6 and B7 are very much interrelated. The conceptually allowed range matters. When any values are possible, from minus to plus infinity and things are randomly distributed, they are likely to follow a “normal” pattern around a median – this is why this pattern is called normal. Its equation (not shown here) is established on sheer probabilistic grounds.
Also, when two variables can range from minus to plus infinity, their relationship may well be linear. When only positive values are conceptually possible, this constraint “squeezes” the negative side of the normal distribution to zero, and lognormal distribution results. Also, when two variables are constrained to only positive values, it “squeezes” their relationship from linear to fixed exponent.

It works even more clearly in the opposite direction. Take the logarithms of 1 and of 0. Log 1 is pulled down to 0, while log 0 is pulled way down, to minus infinity. We are back to the unbounded range, where the logarithms of values from 1 to 0 fill the entire negative side. This is how the lopsided lognormal distribution of x, be it people’s incomes or country sizes, is pulled out into nicely symmetric normal distribution of log x. The geometric mean of x then corresponds to the arithmetic mean of log x. And the curve y=Ax^k, limited to x>0 and y>0, is unbent and unleashed into the straight line log y=log A+k log x, which can take negative values.

When we notice these broad correspondences, many seemingly arbitrary and unconnected features start making sense in a unified way. Physical and social relationships respect the resulting patterns, not because of some conscious decision to do so but precisely because these patterns describe what happens when no conscious decision is taken. Randomness has its own rules.
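Two hand calculations from earlier in this chapter lend themselves to a short sketch: updating reported means when new items turn up, and the tentatively suggested pseudo-geometric mean for data that include a zero. The function names are hypothetical, introduced here only for illustration.

```python
import math

def updated_means(n_old, a_old, g_old, new_items):
    """Update arithmetic and geometric means from reported values alone."""
    n_new = n_old + len(new_items)
    a_new = (n_old * a_old + sum(new_items)) / n_new
    # Restore the log of the old product, add the logs of the new items.
    log_product = n_old * math.log(g_old) + sum(math.log(v) for v in new_items)
    g_new = math.exp(log_product / n_new)
    return a_new, g_new

def pseudo_geometric_mean(values):
    """Lend every item the smallest non-zero value, take G, take the loan back."""
    loan = min(v for v in values if v > 0)
    shifted = [v + loan for v in values]
    g_shifted = math.exp(sum(math.log(v) for v in shifted) / len(shifted))
    return g_shifted - loan

# Reported summary of 10 items (A=21.9, G=10.0), plus two later items 6 and 8.
a_new, g_new = updated_means(10, 21.9, 10.0, [6, 8])

# Six items including one zero.
pg = pseudo_geometric_mean([0, 2, 7, 11, 20, 100])

print(round(a_new, 1), round(g_new, 1), round(pg, 1))
```

No comparable update exists for the median: it cannot be rebuilt from the old median and the new items alone.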


B8. What to Look for and Report in Multivariable Linear Regression
_________________________________________________________________
• Use regression only for exploratory research or for testing logical models. Don’t even think of using regression for model construction itself.
• Graph possibly meaningful relationships, so as to avoid running linear regression on curved data.
• Guard against colinearity or “co-curvilinearity”.
• Use Occam’s Razor: Cut down on the number of variables.
• Distinguish between statistical significance (the “stars”) and substantive meaningfulness.
• Report not only the regression coefficients and the intercept but also the domains, medians and means for all input variables.
__________________________________________________________________
In social science literature we often encounter tables of multi-variable linear regression. Those numerous numbers seem to involve useful information – why else would they be published? But what can you actually read out of them? We’ll first approach the issue from the viewpoint of someone who wants to make use of multi-variable regression tables so as to gain social insights. Thereafter, we’ll ask what we should do when running such a regression ourselves. When reporting the results, we should give enough information so that readers can understand – and possibly carry out some further analysis. Only this way can bits and pieces of knowledge become cumulative science.

The principle of multi-variable regression is the same as for single variable regression (y vs. x), but more than one input variable is fed in. The output is usually construed as y=a+b1x1+b2x2+…, but it really should be shown as y ← a+b1x1+b2x2+…, because the relationship is valid in only one direction (recall Chapter B1). Symmetric multivariable regression is difficult. The equations I tentatively offered (Taagepera 2008: 174-175) are plain wrong. So we are reduced to directional regression, with all its risks. There is still quite a lot we can do, provided that major pitfalls are avoided.
The main purpose of determining a regression line is to estimate the output for given inputs. Given the values of inputs x1, x2,…, we should be able to deduce the corresponding most likely value of y from y ← a+b1x1+b2x2+…. But this is not all we are interested in. Maybe scatter of y, for given x1, is so wide that the impact of x1 is “statistically insignificant” and we should omit x1. On the other hand, maybe x2 is highly significant in the statistical sense, but it varies over such a short range that it hardly affects y, given the wide range of y. Could we then overlook x2? We may also be interested in what the median value of y is, and how large or small y can get. Maybe we have the values of x2, but the values of x1 are hard to get. Can we still estimate y, and how well? The importance of such questions will become clearer as we walk through an actual example.


Making use of published multi-variable regression tables: A simple example
Table B8.1 shows a part of an example presented in Making Social Sciences More Scientific (Taagepera 2008: 207), based on a regression in Lijphart (1994). The output NS is the effective number of assembly parties. It may be logically expected to depend on two factors: How high is the “effective threshold” (T) of votes at which a party is likely to win a seat; and the size of the representative assembly (S), meaning the number of seats available. As the latter can vary over a wide range (from 60 to 650), it is likely to be distributed lognormally rather than normally, and hence logS is used in regression, rather than S itself. What do these numbers mean, in Table B8.1? What can we conclude or deduce from them?

Table B8.1. Effective number of assembly parties (NS) regressed on effective threshold (T) and logged assembly size (logS).
_____________________________________________________________________
Independent variables       Domain (Range)   Mean   Median   Coefficients for NS
______________________________________________________________________
Effective threshold (T)     0.1 to 35        11.6   7.0      -0.05**
Log assembly size (logS)    1.8 to 2.8       2.2    2.2      0.12
Intercept                                                    3.66
R2                                                           0.30
Adjusted R2                                                  0.28
______________________________________________________________________
* : statistically significant at the 5 per cent level.
** : statistically significant at the 1 per cent level.

The coefficients mean that NS can be estimated from NS ← 3.66-0.05T+0.12logS. The two R2 values are measures of scatter around this directional equation. This equation accounts for 30% (28% by the adjusted measure) of the variation in NS. The remaining 70% (72%) remain random scatter in the sense that threshold and assembly size cannot account for it, at least the way they are entered in the regression. (“Accounting for” in a statistical sense must be distinguished from “explaining” – only a logical model can explain the process by which T and S affect NS.)
The two stars at T indicate that this factor has a definite impact on the output. The lack of stars at logS indicates that scatter is so large that it is statistically uncertain whether logS has any systematic impact on the output. If so, then we might as well omit logS altogether and estimate from T alone. Scatter might be hardly reduced. It might be tempting to use the equation above, just dropping the logS term: NS ← 3.66-0.05T. Wrong. This would imply that we assume that logS=0, hence S=1. This would be ludicrous: no national assembly is that small. We need to replace the omitted factor not by 0 but by its mean value.

Unfortunately, mean values are all too often omitted from reported regression results. The result is that, to calculate the output for a given threshold value, we have also to locate the assembly size, even while it is stated that assembly size lacks statistical significance! The table above does add the mean, so that we can drop logS, if we so wish. This mean is 2.2, corresponding to assembly size S=160. The best estimate of the output becomes NS ← 3.66-0.05T+0.12(2.2). Hence NS ← 3.92-0.05T. The difference between 3.92 and 3.66 may not appear large, but given that few countries have fewer than 2.0 or more than 5.0 parties, a difference of 0.3 is 10% of the entire range.

The table further adds the median and the domain, meaning the range from the smallest to the largest value. Why are they needed? Regression makes sense only when the variables are fairly normally distributed, so that their medians and arithmetic means coincide. This is the case for logS. (For S itself the mean would exceed the median appreciably.) For T, the mean exceeds the median by almost 5 units. Is the gap excessive? It depends on how widely a variable is observed to range – its domain. The domain of T goes from near-zero to 35, so a 5-unit discrepancy is appreciable. We might obtain a better fit for the number of parties, if we carried out the linear regression on the square root of T rather than T itself.
(Actually, the distribution of T is not just a peak with a longer tail in one direction – it has two separate peaks. Hence the very use of regression becomes problematic. Once we omit logS, leaving only one input variable, it would be high time to graph NS against T, see what the pattern looks like, and try to express it as an equation.)
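The step of dropping logS by replacing it with its mean, rather than with zero, is a single line of arithmetic. The following is a minimal Python sketch using only the numbers quoted in the text; the function and variable names are illustrative and not part of the original.

```python
def lump_into_intercept(intercept, dropped_terms):
    """Fold dropped inputs into the intercept.

    dropped_terms: list of (coefficient, mean value of the input) pairs.
    """
    return intercept + sum(coef * mean for coef, mean in dropped_terms)

# Values quoted in the text: intercept 3.66, coefficient 0.12 on logS, mean logS = 2.2.
naive_intercept = 3.66  # simply deleting the term assumes logS = 0
lumped_intercept = lump_into_intercept(3.66, [(0.12, 2.2)])

print(round(lumped_intercept, 2))  # 3.92
print(round(10 ** 2.2))            # 158, i.e. an assembly of roughly 160 seats
```

The same helper accepts several dropped terms at once, which is how more than one insignificant input would be lumped.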

There is another reason for reporting the domain. Researchers sometimes neglect specifying the measures they use. For S, were decimal or natural logarithms used as “logS”? When the domain is given, it becomes clear that decimal logarithms are used, because the corresponding range of S would be 60 to 650 seats, which is reasonable. If logS stood for natural logarithms, S would range from 2 to 16 seats!
Often there are several ways to measure what looks the same conceptually. For instance, cabinet duration is observed to range from a few months to 40 years by a fairly lenient measure, but only to about 5 years by a more stringent one (which resets the clock whenever there are elections). How am I to know? Well, if the domain for cabinet duration is given as 0 to 5, then I have a pretty good idea of which indicator has been used. Also, one often talks of “corruption index” when actually using an index of lack of corruption (so that honest countries have highest scores). Authors often are so used to a given measure that they neglect to specify it – or they even mis-specify, as for corruption.

Exercise B8.1 shows what further information one can glean from regression results, provided that the mean, median and domain are included.

_____________________________________________________________________
Exercise B8.1
Use Table B8.1 for the following.
a) Calculate NS for the median values of T and logS. One can expect the median value of NS to be close to this result.
b) What is the lowest value of NS that could result from the extreme values of T and S on the basis of this regression line?
c) What is the highest value of NS that could result from the extreme values of T and S on the basis of this regression line?
d) By how much do these extreme values of NS differ from its presumed median? Comment on what it implies.
e) Compare the extents to which T and S, respectively, are able to alter NS. Could you have predicted it just by looking at their respective coefficients?
f) Given that the impact of logS is not statistically significant, we should be able to ignore it and still get basically the same result. How would the expected range of NS change?
g) Which of the previous questions could you answer, if only the columns for “Coefficients” were reported?
______________________________________________________________________

Guarding against colinearity

Suppose we have an output z that might depend on some factor x and possibly also on another factor y. (I owe this example from actual published work to Kalev Kasemets.) We run simple OLS z on x and find

z = 0.634(±0.066)x + 0.789(±0.197)    R²=0.828, p(x)<0.001.

The “(±0.066)” indicates the possible range of error on the coefficient of x. The low value of p(x) says roughly that chances are extremely low that the correlation is random chance. And R²=0.828 indicates that 83% of the variation in z is accounted for by variation in x. This looks pretty good. Now run OLS z on y, and the outcome is

z = 0.325(±0.027)y + 0.602(±0.169)    R²=0.887, p(y)<0.001.

This looks like even a slightly better fit: R²=0.887 indicates that 89% of the variation in z is accounted for by variation in y. We get greedy and feed both of them into multi-variable linear regression. With two inputs, we should obtain an even better fit. We get

z = -0.208(±0.268)x + 0.426(±0.133)y + 0.0577(±0.174)    R²=0.890, p(x)=0.447, p(y)=0.005.

This R²=0.890 is practically no higher than 0.887 for y alone. The impact of x is now negative! It reduces z rather than increasing it! The p(y) is higher than before, meaning a higher probability that the correlation between y and z is random chance. The p(x) is disastrously higher than before, showing an almost 50-50 probability that the correlation between x and z is random chance.
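The pattern in these three regressions can be reproduced with made-up data. The sketch below is an illustration only: the data are synthetic, generated under the assumption that y is nearly proportional to x and that z depends on y alone, and none of the numbers come from the published example. It uses plain Python so that each step of the least-squares arithmetic stays visible.

```python
import random

def r_squared(observed, fitted):
    """Share of the variation in `observed` accounted for by `fitted`."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

def fit_one(x, z):
    """OLS of z on one input; returns (slope, intercept)."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    slope = sxz / sxx
    return slope, mz - slope * mx

def fit_two(x, y, z):
    """OLS of z on two inputs; returns (coef_x, coef_y, intercept)."""
    n = len(x)
    mx, my, mz = sum(x) / n, sum(y) / n, sum(z) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxz = sum((a - mx) * (c - mz) for a, c in zip(x, z))
    syz = sum((b - my) * (c - mz) for b, c in zip(y, z))
    det = sxx * syy - sxy ** 2  # close to zero when x and y are highly correlated
    bx = (sxz * syy - syz * sxy) / det
    by = (syz * sxx - sxz * sxy) / det
    return bx, by, mz - bx * mx - by * my

# Synthetic data (assumed for illustration): x -> y -> z.
random.seed(1)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [2 * xi + random.gauss(0, 0.3) for xi in x]    # y is nearly proportional to x
z = [0.4 * yi + random.gauss(0, 0.3) for yi in y]  # z depends on y only

bx1, a1 = fit_one(x, z)
by1, a2 = fit_one(y, z)
bx, by, a = fit_two(x, y, z)

r2_x = r_squared(z, [a1 + bx1 * xi for xi in x])
r2_y = r_squared(z, [a2 + by1 * yi for yi in y])
r2_both = r_squared(z, [a + bx * xi + by * yi for xi, yi in zip(x, y)])

print(f"z on x alone:  slope {bx1:.2f}, R2 {r2_x:.2f}")
print(f"z on y alone:  slope {by1:.2f}, R2 {r2_y:.2f}")
print(f"z on x and y:  coef x {bx:.2f}, coef y {by:.2f}, R2 {r2_both:.2f}")
```

Each input alone accounts for most of the variation in z, yet the joint regression adds almost nothing to R², and the coefficient of x in the joint run shrinks toward zero and may take either sign.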

So which way is it? Does x enhance z in a highly clear-cut and significant way, or does it reduce z, in a quite uncertain way? We cannot have both. The answer is that the two inputs must be highly correlated themselves. Indeed, the two inputs together cannot account for 82.8+88.7=171.5% of the variation in the output! The inputs must be correlated by at least 71%, roughly speaking.

What happens if we ignore this “colinearity” of x and y, and plug them both into a multi-variable regression? They destroy and confuse each other’s impact. Among the two, y has a slightly higher correlation with z. It sort of cannibalizes the effect of x, reducing it to nothing. The small negative and uncertain coefficient of x does NOT show the total effect of x on z – it shows the residual effect of x, once its effect through y has been accounted for. We effectively have a causal chain x→y→z or maybe rather (y≈kx)→z. By plugging both x and y into the same regression we arbitrarily assumed that causality is x→z←y – meaning x and y affecting z separately, which is here a blatantly false assumption. The computer program does not argue with you. If you feed in junk, it obediently processes it, but “junk in, junk out”.

What are we supposed to do? In the present case, where both inputs account for more than 80% of variation in output, it’s fairly simple: use only one. Which one? One might pick the one with the higher R², in the present case y, which tacitly implies that x acts through y: x→y→z. But this is not the only consideration. If we have logical reasons to assume that the causal process is y→x→z, then a small shortfall in R² should not deter us from using x. Also graph z vs. x, z vs. y and y vs. x. Curvatures or irregularities in the data clouds may give you hints on how the variables are related.

The real hassle comes when the inputs are only mildly correlated to each other – not R²=0.70 but R²=0.30. Here one of the inputs may act on the output both indirectly and also directly: x→y→z←x. There are statistical ways to handle such situations, but also graph the data. The main thing is: If it looks odd, it probably is – then it’s time to double check or ask for help. Do not report absurdities without expressing doubts.

Running a multi-variable linear regression

Suppose that many factors come to mind which could conceivably have an impact on the values of some output variable y. Among these, A, B, C, D, E and F are the prime suspects. Furthermore, the output might be different for women and men. As a first exploratory step, we might run multi-variable OLS regression. It may suggest which factors have a definite impact and thus guide our search for a logical model. Once a model is constructed, its testing may need another regression. The process includes the following stages.
• Processing data prior to exploratory regression,
• Running exploratory regression,
• Reporting its results,
• Re-running exploratory regression with fewer variables,
• Model-testing regression, once a model is devised.

Processing data prior to exploratory regression

It matters what we feed into the computer. Not all relationships are linear. Before applying linear analysis we had better do some thinking. Instead of feeding in x, maybe linearity is more likely with 1/x – or even 1/x². (The latter is the case between cabinet duration and the number of parties, for instance.) When there is linearity between y and 1/x², then there is no linearity between y and x. If we still regress on x, we would fit a curved data cloud with a straight line. We would obtain some correlation but not to the full extent possible. Most important, we would miss the logical nature of the connection.

But what can we do, short of working out elaborate logical models? At the very least, consider the conceptually allowed ranges. If factors A and D can in principle range from minus to plus infinity, enter them as they are. But if factors B and C can take only positive values, better enter logB and logC.

Also consider whether all factors are mutually independent. We talked about guarding against colinearity. It is worth repeating. Suppose input variables D and E are strongly connected through D=a-bE. Then one of them should be left out. Which one? Before answering this question, also consider “co-curvilinearity”: Maybe D and E are even more strongly connected through D=a/E². How do we know? Graph each potential input variable against each other. If the data cloud looks like a fat ellipse, almost circular, then the two variables are independent and can be used together. But if the data cloud looks like a thin tilted ellipse or bent sausage, then the variables are interconnected and we had better keep only one of the two. But which one should it be?

If y may be affected by D and E, which are interrelated, then it might be that one of them acts through the other: E→D→y or D→E→y. Which one could it be? Make graphs and run correlations y vs. E and also y vs. D. The one with the higher R² (once data clouds are straightened out) is likely to be more directly connected, and this is most often the one to keep. True, it could be that one of the factors acts on y both indirectly and also directly: E→D→y←E, but one of them is likely to predominate. Let us keep it simple, if we can.

Running exploratory regression

Suppose co-curvilinearity eliminates factors E and F. We are left with A, logB, logC and D. We run multivariable regression on them. Suppose we report the results, using today’s standard format (Table B8.2). Instead of just R², there may be a somewhat differently labeled coefficient. Stars indicate the strength of supposed significance in a statistical sense, which often is misinterpreted (Taagepera 2008: 77-78). The computer printout may have other statistical features that will not be discussed here.
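The earlier point about processing data, that a relationship linear in 1/x² yields only partial correlation when regressed on x itself, can be checked numerically. This is a minimal sketch with hypothetical noise-free data; the constant 40 and the x values are assumptions chosen purely for illustration.

```python
def r_squared(u, v):
    """Squared Pearson correlation between two equally long lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv ** 2 / (suu * svv)

# Hypothetical data following y = 40 / x**2 exactly (illustration only).
x = [2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0]
y = [40 / xi ** 2 for xi in x]

r2_raw = r_squared(x, y)                                  # straight line through a curved cloud
r2_transformed = r_squared([1 / xi ** 2 for xi in x], y)  # input entered as 1/x**2

print(round(r2_raw, 2), round(r2_transformed, 2))
```

Even without any scatter, the untransformed input leaves a sizable share of the variation unaccounted for, while the transformed input accounts for all of it.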

Table B8.2. Typical minimal reporting of multi-variable linear regression analysis.
________________________
Factor A          -0.03***
Factor B (log)     0.12**
Factor C (log)     0.28
Factor D           3.77*
Dummy (F=1)        0.74*
Intercept          4.07
R²                 0.39
_________________________

What this table means is that the values of y can be best predicted from values of A etc. by applying the equation of format y ← a+b1x1+b2x2+… and plugging in the coefficient values shown in the table. For males, it is y ← 4.07 - 0.03A + 0.12logB + 0.28logC + 3.77D. For females, add 0.74. Together, these variables account for 39% of the variation in y, as R² tells us. The number of stars suggests that A affects y most certainly, followed by logB, while the impact of D is less certain, and the impact of logC could well be random chance. Whether gender may have some impact also remains in doubt.

Lumping less significant variables: The need to report all medians and means

Occam’s Razor is a principle that tells us to try to prune off everything that is not essential. (Recall Albert Einstein’s advice: make your models as simple as possible – and no simpler.) Which factors should we discard? On the face of it, we should keep only the two most significant. Then the equation above might seem to be reduced to y ← 4.07 - 0.03A + 0.12logB – but not so fast! By so doing, we would assume that the mean values of logC and of D are 0, which might be widely off the mark. As pointed out earlier, we must plug in the average values of logC and of D. Also, assuming roughly equal numbers of females and males, we should add 0.74/2=0.37. Thus the reduced equation would result from y ← 4.07 - 0.03A + 0.12logB + 0.28(aver. logC) + 3.77(aver. D) + 0.37. But what are “averages” – medians or arithmetic means? And have we reported their values? Without these values, anyone who wants to make use of our findings to predict y would have to enter into the equation not only the values A and B but also C and D.
He would have to dig up the values of C and D, even while our analysis concludes that they are rather insignificant! Indeed, for this reason all too many published regression results are useless for prediction: Too many variables are shown, and their average values are not.

So we had better report the averages. As pointed out earlier, we had better report both median and arithmetic mean, to give the user a choice and also for the following reason. If median and arithmetic mean differ appreciably, this would indicate that the distribution of values cannot be normal. This means that the assumption of a linear relationship is on shaky grounds – and we should warn the readers. Actually, we should try to transform our data, prior to linear regression, so that median and arithmetic mean would be roughly equal.

Table B8.3. Multi-variable linear regression results, when also reporting averages. RC=Regression Coefficient. Factor median weights (median×RC) emerge.
___________________________________________________________________
                 Median   Mean   RC         Median×RC   RC after lumping
Factor A         1.20     1.23   -0.03***   -0.036      -0.03***
Factor B (log)   0.50     0.57   0.12**     0.06        0.12**
Factor C (log)   6.5      6.4    0.28       1.82        ---
Factor D         0.40     0.37   3.77*      1.51        ---
Dummy (F=1)      0.5      0.5    0.74*      0.37        ---
Intercept                        4.07                   7.77 (=4.07+1.82+1.51+0.37)
R²                               0.39
_______________________________________________________________

Suppose we have done the needed transformations and can happily report the results in Table B8.3, where medians and means roughly agree. I have also included the product of each factor’s median and its regression coefficient. This is the median weight it contributes to y. If we drop factors C, D and gender, then we must add these weights to the intercept. The new intercept value is 4.07+1.82+1.51+0.37=7.77. So the equation with some variables lumped is y ← 7.77 - 0.03A + 0.12logB, as reflected in the last column of Table B8.3.

Re-running exploratory regression with fewer variables

If at all possible, we should now carry out a new regression, using only A and B. It should yield an intercept close to 7.77 and coefficients close to the previous ones. If this is not so, there is something in the data constellation that we should check more closely. The value of R² can be expected to drop below 0.39 because we no longer account for the variation due to C, D or gender. If the drop is appreciable, we may have to reconsider.

Indeed, the drop in R² may be serious. Look at the factor weights in the table above: They are large for C and D (1.82 and 1.51), while tiny for A and B (-0.036 and 0.06). How much impact could such tiny inputs have on the output?
The question is justified, but to answer it, we must also take into account how widely the variables are observed to range. We saw that this is called their domain. Report the domains of all variables! Table B8.4 adds the observed domains of variables. How widely could the estimates of y vary? We must consider the extremes of inputs, keeping track of the signs of their extreme values and coefficients. The lowest value of y results from

ymin = 4.07 + 1.5(-0.030) - 2(0.12) + 4(0.28) - 0.4(3.77) + 0 = 3.40.

The highest value of y results from

ymax = 4.07 + 0.9(-0.030) + 3(0.12) + 9(0.28) + 1.2(3.77) + 0.74 = 12.19.

The actual domain of y is likely to be somewhat smaller, because extreme values of inputs rarely coincide, but clearly y can vary over several units.

Table B8.4. Multi-variable linear regression analysis, also reporting the domains.
___________________________________________________________________
                 Median   Mean   Domain         Regr.coeff.   Span   Span×RC
Factor A         1.20     1.23   0.9 to 1.5     -0.030***     0.6    -0.018
Factor B (log)   0.50     0.57   -2 to +3       0.12**        5      0.60
Factor C (log)   6.5      6.4    4 to 9         0.28          5      1.40
Factor D         0.40     0.37   -0.4 to +1.2   3.77*         1.6    6.03
Dummy (F=1)      0.5      0.5    0 to 1         0.74*         1      0.74
Intercept                                       4.07
R²                                              0.39
_______________________________________________________________

The “span” is the extent of the domain, the difference between the largest and smallest values. Then (Span×Regression Coefficient) is the extent by which the given factor could alter y. While it is highly likely that A has an impact on y (3 stars!), this impact is tiny. Even the extreme values of A would alter y by only 0.018. If we dropped A, our ability to predict y would hardly be affected. In contrast, variation in Factor D alone determines most of the domain of values y could take, if the impact of D is real. Trouble is, it is uncertain whether D has any definite impact.

We should distinguish between statistical significance (the “stars”) and substantive meaningfulness. In terms of health issues, suppose presently 54,000 people per year die of a given disease. Suppose some rather expensive and painful medication reliably reduces mortality – to 53,000. Is a drop of 1,000 worth subjecting 54,000 people to this treatment? Now suppose that the uncertain Factor D can easily be altered – like making drinking water slightly more or slightly less acidic. If it works, it would have a large impact on mortality, but we are not quite certain it would even have an effect (single star). Which factor should we focus on in the present example? I have no single answer.
It depends on many other considerations. But one thing is certain: Reporting only the information in Table B8.2 could badly mislead readers into thinking that Factor A is the best predictor, just because it has the most stars. Whenever one runs a regression, one has the possibility to determine the domains, medians and means of all the input variables. Do report this information. Omitting the explanatory columns (Span and Span×RC) in Table B8.4, our table of regression results should look like the one in Table B8.5. Then the reader can play around with it, weighing the impact vs. significance of various factors. When she has access to the values of only some of the factors, she can still make predictions, using the mean or median values for missing factors.
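The arithmetic behind Tables B8.3 and B8.4 can be retraced in a few lines, which is also what a reader of a table in the recommended format could do on her own. The sketch below copies the medians, domains and coefficients from those tables; the variable names are illustrative. Note that the extreme values computed here include the intercept of 4.07; without it both would be lower by 4.07, while their difference stays the same.

```python
# Rows copied from Tables B8.3 and B8.4: (median, domain low, domain high, coefficient).
INTERCEPT = 4.07
factors = {
    "A":      (1.20,  0.9, 1.5, -0.030),
    "logB":   (0.50, -2.0, 3.0,  0.12),
    "logC":   (6.5,   4.0, 9.0,  0.28),
    "D":      (0.40, -0.4, 1.2,  3.77),
    "female": (0.5,   0.0, 1.0,  0.74),  # the dummy, F=1
}

# Median weight of each factor: median times regression coefficient.
median_weight = {name: med * rc for name, (med, lo, hi, rc) in factors.items()}

# Lumping logC, D and the gender dummy into the intercept.
lumped_intercept = INTERCEPT + sum(median_weight[k] for k in ("logC", "D", "female"))

# Span times coefficient: how far each factor can move y across its observed domain.
span_effect = {name: (hi - lo) * rc for name, (med, lo, hi, rc) in factors.items()}

# Extreme predictions: for each factor, take the end of its domain that lowers (or raises) y.
y_min = INTERCEPT + sum(min(lo * rc, hi * rc) for (med, lo, hi, rc) in factors.values())
y_max = INTERCEPT + sum(max(lo * rc, hi * rc) for (med, lo, hi, rc) in factors.values())

print(round(lumped_intercept, 2))   # 7.77
print(round(span_effect["A"], 3))   # -0.018
print(round(span_effect["D"], 2))   # 6.03
print(round(y_min, 2))              # 3.4
print(round(y_max, 2))              # 12.19
print(round(y_max - y_min, 2))      # 8.79
```

The total width of about 8.79 is the sum of the absolute Span×RC values, of which Factor D alone supplies 6.03.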

Table B8.5. Recommended format for reporting multi-variable linear regression analysis.
______________________________________________
                 Median   Mean   Domain         Regr.coeff.
Factor A         1.20     1.23   0.9 to 1.5     -0.030***
Factor B (log)   0.50     0.57   -2 to +3       0.12**
Factor C (log)   6.5      6.4    4 to 9         0.28
Factor D         0.40     0.37   -0.4 to +1.2   3.77*
Dummy (F=1)      0.5      0.5    0 to 1         0.74*
Intercept                                       4.07
R²                                              0.39
________________________________________________

Model-testing regression

All the preceding refers to exploratory regression – trying to get some idea of what might affect a given output. It may help to focus on just a couple of inputs. The next step would be to ponder how these factors might impact the output. This means trying to build a logical model. More graphing may be involved – and don’t even think of using regression for model construction itself!

It is unlikely that the predictive model would include more than 1 to 3 input variables. All others are likely to act through these or be negligible, at least in a first approximation. By this time, we no longer accept just any regression coefficient values. Suppose the logical model is y=KAB/C². We know we have to regress logy against a=logA, b=logB and c=logC. But we also know that the result must be logy=k+1.00a+1.00b-2.00c – only the intercept k=logK is not predicted. What do we do when the coefficients found differ appreciably from 1, 1, and -2? The answer depends on the specific situation. For the moment, let us just keep in mind the difference between preliminary regression (preliminary to model building effort) and final regression (testing the model).
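What such a model-testing regression looks like when the model holds exactly can be sketched as follows. The data are hypothetical and noise-free, generated from y=KAB/C² with an assumed K=5, so the regression of logy on the logarithms of the inputs must return the predicted coefficients 1, 1 and -2. The small solver is written out in plain Python only to keep the example self-contained; a statistics package would normally do this step.

```python
import math

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def ols(rows, target):
    """Least-squares coefficients (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in rows]
    size = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(size)] for i in range(size)]
    xty = [sum(d[i] * t for d, t in zip(design, target)) for i in range(size)]
    return solve(xtx, xty)

# Hypothetical noise-free data generated from y = K*A*B/C**2 with K = 5.
K = 5.0
A = [1, 2, 4, 8, 3, 5, 7, 9]
B = [2, 3, 1, 5, 8, 4, 6, 7]
C = [1, 4, 2, 3, 5, 9, 6, 8]
y = [K * a * b / c ** 2 for a, b, c in zip(A, B, C)]

logs = [(math.log10(a), math.log10(b), math.log10(c)) for a, b, c in zip(A, B, C)]
k, coef_a, coef_b, coef_c = ols(logs, [math.log10(v) for v in y])

print([round(v, 3) for v in (coef_a, coef_b, coef_c)])  # [1.0, 1.0, -2.0]
print(round(10 ** k, 3))                                # 5.0
```

With real, scattered data the fitted coefficients would deviate from 1, 1 and -2, and the size of those deviations is what the test of the model is about.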

B9. Substantive vs. Statistical Significance
High statistical significance alone could be pointless for making sense of the world. We must look for substantive significance. This point was briefly made in the previous chapter as well as in Taagepera (2008: 77-78 etc.). The table below is an attempt to visualize a similar message in Professor McCloskey's The cult of statistical significance (<http://www.amazon.com/Cult-Statistical-Significance-EconomicsCognition/dp/0472050079/ref=sr_1_1?ie=UTF8&s=books&qid=1265515998&sr=8-1>).

The table crosses statistical significance with the extent (size) of effect, e.g., by how much a cure reduces the death rate for a disease.

Low statistical significance (p<.1), tiny effect (e.g., from 39% to 38%):
No substantive significance. Forget it.

Low statistical significance (p<.1), appreciable effect (e.g., from 39% to 18%):
Medium substantive significance. If it were real, it would have high substantive significance. So try to refine and add data, so as to raise statistical significance – the potential payoff is high! Also try to build a logical model, as it would help you to refine the data.

High statistical significance (p<.01), tiny effect (e.g., from 39% to 38%):
Low substantive significance. A reduction by 1 percent point is peanuts, regardless of how certain it is. BUT don't yet give up. Try to elucidate a plausible underlying mechanism, i.e., build a logical model. Then you might find ways to enhance the extent of the effect.

High statistical significance (p<.01), appreciable effect (e.g., from 39% to 18%):
High substantive significance. Congratulations! BUT beyond happy application, also try to elucidate the underlying mechanism, i.e., build a logical model. Otherwise you would be at a loss when the effect suddenly weakens.*

* At a refractive materials plant, they had a product vastly superior to those of all competitors. The management was just happy about it and never bothered to find out why. Then a foreman died, and the quality plunged, never to rise again, no matter what they tried. By this time it was too late to ask "why?"

C. Examples of Gradual Model Building
C1. Choice of Examples
___________________________________________________________________
• Even the best graphs published in social sciences can be improved on.
• Graph more than the data – graph the entire conceptually allowed area and anchor points so as to make sure any data analysis makes sense from a substantive viewpoint.

________________________________________________________________

Part C presents examples from published social science literature, where data analysis and modeling stopped short of maximal payoff. It shows what more could be done. No full-fledged quantitatively predictive logical model could be developed in these cases – it is not that easy! The point is to show that more could be done in the direction of model building than at first met the eye.

How did I pick these examples? This was done in the course of reading works that interest me, by scholars whom I highly respect. The task is to show how to improve on the best that social sciences offer. I thank the authors of these examples for their kind permission to make use of their work.

Historically, this part was written long before the present Part B, and it shows. Some of the items Part B discusses in detail are introduced as if they were something new. I have not deleted all repetitions in the present draft, for various reasons. For one, I am not only lazy but also pressed for time. Second, some repetition is good. Third, I want to use this text in teaching for a while, before deciding which aspects need repetition, in order to really sink in. Fourth, for a quicker overview, one might want to teach only Parts A and C. In case of doubt, the possibly repetitious sections are given in smaller script.

Graph more than the data

This section largely repeats what has been said earlier, especially in Chapter B4.
It is always a good idea to graph the data. This is not very original. All good statistics books recommend it. Nonetheless, social scientists all too often rush off to run regressions and report the results without graphing the data. Having many input variables that cannot all be graphed at the same time is a poor excuse. One can always graph the main output against the main input variable – or successively against two, if two inputs seem to have equal strength. The rest is usually noise, even if further input variables happen to be “statistically significant” to a mild degree. Here we deal with examples where a data graph was published. The next step advised may be more puzzling: Graph more than just the data! (Taagepera 2008: 202). It may sound nonsensical. What else is there to graph? There is. Many features can be added that put the data in better perspective. Delineate the entire region where data could possibly occur. If both x and y are percentages, the conceptually allowed region consists of the square ranging from 0 to 100 on both axes. Once this square is drawn in, one may discover that the actual data points are located in a specific region within this square, in a manner that makes us wonder why – and asking “why?” is the first step toward a logical model. Drawing in the equality line y=x makes sense in some cases. Once

this is done, one may realize that the data cloud has a much shallower slope than the one for y=x within some range of x and a much steeper one within some other, again raising the question “why?”. Graphing more than just the data is often easy to do, yet it is so often neglected. Avoiding this shortfall alone could advance our understanding of social processes and relationships appreciably. The examples presented lead to asking “why?” when it would not be asked in the absence of graphing more than the data. Sometimes we can proceed much further, but even if we can’t do so at the moment, we have a broader view of the issue on hand. Graphing more than just the data sometimes means picking a fight with ready-made social science computer programs that insist on showing only the space needed to accommodate the data. The worst programs even refuse to allow marking the scales on the axes as 0, 5, 10 etc. (or as 1, 10, 100 etc.). Instead, they insist on interval markings derived from the range or standard deviation of the data on hand, such as 3.47, 5.63, 7.79… This makes sense only when one is mesmerized by these particular data, oblivious to any context, but it makes it awfully hard to compare these data with other data sets – and with logical models. What is one to do in such a case? One must stop being a servant of one’s computer and become a scientist. Computers are supposed to help us, not to order us around. If the program restricts you, look for another. If you cannot find a science-friendly program, graph by hand. A coarse graph that makes sense is worth more than a neat but nonsensical one. Graphing only the space where data points happen to occur all too often makes little sense.

C2. Ingredients of Federalism: Allowed Areas and Regression Lines
___________________________________________________________________
• Placing a data graph in a frame which exceeds the logically allowed area can distort our impression of the data.
• Showing a regression line which extends into a logically forbidden area can distort our impression of the data.
• Recall that there is no single OLS regression line. When data are scattered, the slopes of the two OLS lines (y on x, and x on y) diverge wildly. A symmetric regression line with an intermediary slope is available.
• When the conceptually possible range of x goes from a to a larger number b, it can be converted to X ranging from 0 to 1 by X=(x-a)/(b-a).
________________________________________________________________

Arend Lijphart’s Patterns of Democracy (1999) is a landmark in the study of democratic institutions. Among these it deals with various aspects connected to federalism. The accompanying graphs also offer food for thought from a methodological viewpoint.

Constitutional rigidity and judicial review

One might expect that federalism calls for an appreciable degree of “constitutional rigidity”, meaning a written constitution that is hard to change, because this is the only way to specify and protect the rights of the federal subunits. Rather than measured quantities, Lijphart uses here informed estimates. He does so on a 1 to 4 point scale, where 1 stands for no written constitution and 4 for one that is extremely hard to modify.

A related concern is “judicial review of laws”, meaning that courts can pass judgment on whether a law conforms to the constitution. Lijphart again uses a 1 to 4 scale, where 1 stands for no judicial review and 4 for a very powerful one. Judicial review and constitutional rigidity should go together at least in the following aspect: One cannot declare a law unconstitutional when there is no constitution! Thus a rating “1” on constitutional rigidity should exclude anything above “1” on judicial review.
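Both rating scales described above run from 1 to 4, so the conversion X=(x-a)/(b-a) from the chapter summary applies with a=1 and b=4. A minimal Python sketch follows; the function name and the range check are illustrative choices, not part of the original.

```python
def rescale(x, a, b):
    """Map x from the conceptually allowed range [a, b] onto [0, 1]."""
    if not a <= x <= b:
        raise ValueError(f"{x} lies outside the allowed range {a} to {b}")
    return (x - a) / (b - a)

# The four ratings on a 1-to-4 scale.
print([round(rescale(v, 1, 4), 3) for v in (1, 2, 3, 4)])  # [0.0, 0.333, 0.667, 1.0]
```

Raising an error for values outside the range is one way to catch data that stray into a conceptually forbidden area before any fitting is attempted.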
But how are they related otherwise, if at all? Figure C2.1 shows Lijphart’s (1999: 229) graph of judicial review vs. constitutional rigidity and the same graph with some additions. Cover the latter one, for the moment, and focus on the first. What do we see? There is an almost spherical data cloud in the center of the field, while the extremes are uninhabited. And there is a regression line, which is only slightly tilted. It suggests that the extent of judicial review increases with constitutional rigidity, but only mildly. The text in Lijphart (1999: 229) states that the correlation coefficient is 0.39. This corresponds to a very low R2=0.15, which reflects the extreme scatter of data points. Note: SOME OF THE SCANNED-IN FIGURES IN PART C MAY NOT COPY IN NORTH AMERICA EVEN WHILE THEY DO IN EUROPE.

Figure C2.1. Original graph (Lijphart 1999: 229), and addition of conceptually allowed region and of several regression lines.
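How far apart the regression lines in such a graph lie follows from the correlation coefficient alone. The sketch below rests on two assumptions that are not taken from the figure: the two variables are given equal standard deviations, and the symmetric line is taken to be the one whose slope is the geometric mean of the two OLS slopes. Only the correlation of 0.39 is the value reported for this graph.

```python
import math

def regression_slopes(r, sd_x, sd_y):
    """Slopes, in the same x-y plane, of the two OLS lines and an intermediate line."""
    y_on_x = r * sd_y / sd_x    # standard OLS: y regressed on x
    x_on_y = sd_y / (r * sd_x)  # reverse OLS (x regressed on y), drawn with y vertical
    symmetric = math.copysign(math.sqrt(y_on_x * x_on_y), r)  # geometric mean of the two
    return y_on_x, x_on_y, symmetric

# r = 0.39 as reported; equal standard deviations are an assumption.
y_on_x, x_on_y, symmetric = regression_slopes(0.39, 1.0, 1.0)

print(round(y_on_x, 2), round(x_on_y, 2), round(symmetric, 2))  # 0.39 2.56 1.0
print(round(y_on_x / x_on_y, 2))                                # 0.15, the value of R2
```

The ratio of the two OLS slopes equals R², so with R²=0.15 one line is more than six times as steep as the other.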

However, when the borders of the conceptually allowed region are drawn in (from 1 to 4 on both axes), as is done in the second graph, the picture changes. What looked like areas devoid of data points now can be seen as areas where no data points can possibly occur. As for the conceptually allowed region, almost all of it can be seen to be inhabited fairly uniformly. The only combination that does not occur is stringent judicial review based on a nonwritten constitution. Except for this top left zone, anything seems to go. All three other corners are inhabited. Stringent constitutional rigidity can go with absolutely no judicial review (Switzerland, SWI). Most surprising, countries with no unified constitutional document can still have appreciable judicial review (Colombia, COL).

But why is the impression from the two graphs so different? It’s because the meaningless frame in the original graph misleadingly intimates that data points could run from 0 to 5, rather than the actual 1 to 4. Moreover, the regression line is extended into the forbidden region. This elongated line increases the impression of fair correlation when there is almost none. How did this happen? Lijphart tells me that he drew his graphs by hand using a 1 to 4 square. The person who redrew the graphs on computer thought it looked nicer with an expanded frame. It may – but it leaves a wrong impression. In science, truth must come before beauty. Placing a data graph in a frame which exceeds the logically allowed area can distort our impression of the data – and so can showing a regression line which extends into a logically forbidden area.

There are further problems with the regression line. As pointed out in Chapter B1, there are two OLS regression lines for any data. The one used in the top figure regresses Review on Rigidity.
This means drawing a tentative line, measuring the vertical distances between the line and all the points, taking the squares of those distances, adding them, and shifting the line so as to minimize this sum of the squares (Chapter B1). The points close to the line contribute small squares. The ones far away contribute large squares. On the right side of Figure C2.1, the large square for SWI pulls the line down, the large square for the US pulls the line up, and the small square for AUL hardly matters. Now turn the graph by 90 degrees so that the judicial review axes becomes horizontal and constitutional rigidity increases upwards. Try the same balancing act, up against down. Compared to the previous OLS line, the data points IND, GER and US at the far left all pull downwards. The balanced position is close to GER. At the far right, all the points, ranging from SWI to ISR, NZ and UK, pull upwards, compared to the previous OLS line. The balanced position might be in between NET and UK. The overall balance among all the points is approximately as shown in the lower graph of Figure B2.1. This is the reverse OLS regression of Rigidity on Review. Here the two OLS lines diverge wildly, because scatter is huge, as reflected in R2=0.15. The bottom graph also adds the symmetric regression line, which treats the two variables on an even basis (Chapter B2). This is the real “trend line”, if any is to be


shown at all, given such a heavy scatter. In sum, there is no single OLS regression line. When data are scattered, the slopes of the two OLS lines (y on x, and x on y) diverge wildly. A symmetric regression line with an intermediary slope is available.

But is a linear fit acceptable on conceptual grounds? In particular, does the figure have any anchor points through which an acceptable fit must pass? No firm anchor points can be seen. It is true that one cannot declare a law unconstitutional when there is no constitution, but lack of a single written document still doesn’t mean total absence of basic norms. This is why Iceland (ICE) can have a mild degree of judicial review. On the other hand, even an utmost degree of constitutional rigidity doesn’t call for judicial review (witness Switzerland) – it just makes such a review possible and highly likely.

In sum, no firm conceptual constraints exist, but (1,1) still might be our best guess in the absence of any other information. The symmetric regression line passes close to it, while the standard OLS lines deviate widely, in opposite directions. A similar vague claim might be made for the opposite corner (4,4), and a fixed exponent curve Y=X^k might be fitted to join them. (This would need a change in scales!) Compared to the diagonal (UK-US), the best fit would curve somewhat down, to account for Switzerland and countries close to it, while the opposite corner is empty. It would account for the general trend better than any of the linear regression lines. Still, it has not been drawn in, because, with such scatter, any curve would be misleading.

So, what have we achieved, in terms of logical model building? We merely have cleared away some possible misconceptions, by delineating the allowed area. This by itself has value. As for positive model building, meet the real world. Most relationships observed in nature and society are fuzzy.
Reproducing them in cleaner form in a lab helps, but this is easier in physics than in sociology. The other approach is building logical models of what should be. In the present case, we may ask if there might be a third factor that enters and which is different for countries toward the top left and for countries toward the bottom right. Inspection of our graph reveals no such factor. Maybe some day such a factor will be discovered, but it may also be that the scatter remains random noise, based on past history.

Degree of federalism and central bank independence

Figure C2.2 shows another similar case in Lijphart (1999: 241). Here the degree of federalism and decentralization is estimated on a scale from 1 to 5. The central bank independence is a compound of many estimates, which makes it a quasi-continuous measure. In principle, it could range from 0 to 1.0 (100 %).


Figure C2.2. Original graph (Lijphart 1999: 241), and addition of conceptually allowed region and the reverse regression line.


In the original graph the elongated data cloud seems to occupy most of the space, up and down, while not reaching the extremes on left and right. Introducing the conceptually allowed region alters the picture. It shrinks the field left and right, while expanding it somewhat at the bottom and markedly at the top. It becomes visible that both extremes on the x-scale occur widely – utter centralization as well as complete federalism. In contrast, the extremes of complete or no central bank independence do not occur. Compared to the conceptually allowed range from 0 to 1.0, a relatively limited band of central bank independence is inhabited, from 0.2 to 0.7.

Correlation is somewhat stronger than in the previous case, with R^2=0.32, hence the two standard OLS lines are closer to each other. (The symmetric regression line, in between, is not shown.) No conceptual anchor points can be established. Once again, we have made only little headway toward a quantitatively predictive logical model, by establishing at least the conceptually allowed region.
____________________________________________________________________
Exercise C2.1
a) In Figure C2.2, draw the lines joining the two uppermost data points (MAU and GER) and the two lowermost (NOR and IND).
b) Construct the line halfway between the two extreme lines. Comment on its location as compared to the two OLS lines – and to the likely location of the symmetric regression line.
c) Determine the equation of this line (i.e., the parameters a and b in y=a+bx), using two points on the line (see Chapter B6).
d) Express the zone of occurrence of data points in the form of error range around this line: y=a+bx±… (cf. Exercise A7.4).
e) Has your impression about the data cloud changed, as compared to what you could see in the original graph? If it has, make a mental note, for use in further work.
______________________________________________________________________

Bicameralism and degree of federalism

Figure C2.3 (from Lijphart 1999: 214) might allow us to go slightly further. Bicameralism is estimated on a scale from 1 to 4. Delimiting the allowed region shows two empty corners. Accordingly, correlation is higher (R^2=0.41), and the two OLS lines are relatively close to each other. But here we might claim that full-fledged federalism does call for full bicameralism – two equally powerful chambers – so that both population and federal subunits can be represented on an equal basis. Indeed, for x=5, 4 out of 5 data points have y=4. It is trickier for x=1. Even some purely unitary countries have two chambers on historical grounds. (The second chamber used to represent aristocracy.) We may tentatively accept an anchor at x=1, y=1. This location is heavily populated with empirical data points, but it isn’t unique in this respect. If we accept an anchor at (1,1), then a fit with Y=X^k could be tried.


Figure C2.3. Original graph (Lijphart 1999: 214), and addition of conceptually allowed region, reverse regression line, and a fit with Y=X^k.

To do so, we must first convert the ranges 1 to 5 for x and 1 to 4 on y to ranges from 0 to 1. How do we do that? First, we must pull the lower limit from 1 to 0, by subtracting 1. The total span of possible values on x is 5-1=4. On y, it is 4-1=3. So we have to divide by these spans. The result is X=(x-1)/4 and Y=(y-1)/3. Check that now the lower anchor point corresponds to X=0, Y=0, and the upper anchor point corresponds to X=1, Y=1.

Any curve Y=X^k would join these points, but most of them leave most data points on one side of the curve. By trial and error, we find that Y=X^0.5 has about an equal number of data points on either side. This curve is shown in Figure C2.3. Statistically more elegant ways to determine the best fitting value of k exist. But we are not after a statistically precise value, which would change whenever a single data point is added or removed. We want to make general sense of the pattern.

Do we now have a logically grounded model? This is hardly so. Conceptually, the anchor points aren’t strongly imposed, and some data points deviate markedly from the central curve. For X=0, we often have Y>0. Conversely, for Y=0, several cases X>0 also occur. Still, the points (0,0) and (1,1) are the most heavily populated points in their neighborhoods, a fact Y=X^0.5 acknowledges, while the linear regression lines do not – not even the symmetric one (not shown in the figure). So Y=X^0.5 has advantages. It has vastly more chances of eventually finding a theoretical justification or explanation.

In all those graphs, we were dealing with subjective estimates rather than objective measurements. It’s better than nothing, although we feel on firmer grounds with measurements. Another common feature that applies to all previous examples is that they have conceptual constraints on all 4 sides. This makes a fit with the format Y=X^k conceivable, even while it clearly does not work out in Figure C2.2 and is highly doubtful in Figure C2.1 too. Better examples for fit with Y=X^k will be given in the next two chapters. Thereafter, it is time to offer other examples, before you start believing that this is the universal format. All relationships are not linear, nor do they all follow Y=X^k.
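The rescaling used for Figure C2.3 can be sketched in a few lines of Python. This is only an illustration: the function name `rescale` and the sample values are assumptions made here, not part of the original analysis. The last loop maps a few points of the curve Y=X^0.5 back to the original 1-to-4 scale.

```python
def rescale(value, low, high):
    """Map a value from the conceptual range [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must exceed low")
    return (value - low) / (high - low)

# Federalism (x) runs from 1 to 5; bicameralism (y) runs from 1 to 4.
for x in (1, 3, 5):
    print(f"x={x} -> X={rescale(x, 1, 5):.2f}")
for y in (1, 2.5, 4):
    print(f"y={y} -> Y={rescale(y, 1, 4):.2f}")

# Points on the curve Y = X**0.5, mapped back to the original y scale.
for x in (1, 2, 3, 4, 5):
    X = rescale(x, 1, 5)
    Y = X ** 0.5
    y = 1 + 3 * Y  # undo Y=(y-1)/3
    print(f"x={x}: X={X:.2f}, Y={Y:.2f}, y={y:.2f}")
```

Both anchor points come out as expected: the lowest values map to 0 and the highest to 1 on either axis.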
Conversion to scale 0 to 1

In the last example, we had to convert the ranges 1 to 5 for x and 1 to 4 on y to ranges from 0 to 1, so as to be able to apply the simple format Y=X^k. We reasoned it through, using these particular ranges. But the need to convert to scale 0 to 1 occurs so frequently that the general conversion formula might be useful.

Suppose the original conceptually possible range of x goes from a to a larger number b. Those numbers can be negative or positive. First, we pull the lower limit from a to 0, by subtracting a from all values of x. We obtain x-a. The total span of possible values is b-a. So we have to divide the values x-a by this span. The result is X=(x-a)/(b-a). Check that the lowest possible value, x=a, leads to X=0, and the highest possible value, x=b, leads to X=1.
____________________________________________________________________
Exercise C2.2
A frequent scale goes from -10 to +10 (e.g. for left-right placements). Convert it to scale 0 to 1. NOTE: Keep in mind that subtracting a negative number is not the same as subtracting the corresponding positive number!


____________________________________________________________________
Exercise C2.3
Using a scale 1 to 10 can be awfully confusing, because it is so close to 0 to 10. Respondents may place themselves at 5 when they wish to place themselves at the center.
a) Convert scale 1-to-10 to scale 0-to-1.
b) Convert the value 5 on the scale 1-to-10 to scale 0-to-1.
c) Have I cured you from ever using scales 1-to-10 or 1-to-5, in preference to 0-to-10 or 0-to-5?
____________________________________________________________________


C3. Capitalism and Democracy: Applying Y=X^k
___________________________________________________________________
• If the equality line makes sense on an x-y graph, enter it as a reference line.
• If both x and y can extend from 0 to a conceptual limit (such as 100), normalize them to go from 0 to 1.
• If X=0 and Y=0 go together conceptually, and so do X=1 and Y=1, then Y=X^k is the simplest model to be tested.
• If Y=X^k does not fit, try (1-Y)=(1-X)^k next.
________________________________________________________________

Support for democracy and capitalism

Figure C3.1 is reproduced from Citizens, Democracy, and Markets Around the Pacific Rim (Dalton and Shin 2006: 250). It graphs the level of popular support for capitalism against the level of support for democracy. What can we see in this graph? The general trend is that y increases with increasing x, but with some scatter. One can comment on the individual countries, but let us for the moment overlook the country labels and focus on the cloud of data points as such. What else can we learn from it?

A knee-jerk reaction of some social scientists is to pass the best fit line through the cloud. Take an erasable pencil and a straight-edge, and lightly draw in the approximate best fit line. You don’t have a straight-edge handy? Of course you do. Use the edge of a book or writing pad. You don’t have a computer handy to run a regression? To the extent of precision needed for most purposes, you can do it visually. Just shift the line until you have an equal number of points on both sides, at both ends. This line may roughly pass through Korea and slightly below Vietnam. One could use this line for prediction of democracy/capitalism relations in countries not shown in this graph.

There is one problem. If support for democracy is less than 13 % or so, the linear fit would predict support for capitalism lower than 0 percent! Oh well, you might say, be realistic – support for democracy never drops that low.
Never mind, science often proceeds beyond what is being considered realistic. It asks what would happen under extreme circumstances. A basic norm is: a conceptual model must not predict absurdities even under extreme circumstances. For every conceivable level of support for democracy a proper model must predict a non-absurd level of support for capitalism – and vice versa. Could we say that the best-fit line applies only down to about 13 %, and below that level support for capitalism is expected to be zero? This would introduce a kink in the predicted average pattern, and most physical and social relationships are smooth. In sum, this is about all we can do with the best fit line approach, apart from adding a measure of scatter, like R-squared, which may be around 0.5. For the given level of support of democracy, China looks low on support for capitalism, while Philippines look high. What else can we do? Follow the approach introduced in Chapter B4 (Graph More than just the Data!)


Figure C3.1. The starting point: Data alone, as shown in Dalton and Shin (2006: 250).

First of all, delineate the allowed region where data points can conceivably occur. This means taking a straightedge and drawing in the horizontal lines at support for capitalism 0 % and 100 %, and the vertical lines at 0 and 100 % support for democracy. The new lines are parallel to the ones already shown. They produce a square inside the earlier square – and this square is what physics texts would show when graphing such data. This inner square has conceptual meaning. The square shown in Figure C3.1 is mere decoration, a more austere version of mermaids embellishing the margins of ancient maps.

Next, draw in the equality line, y=x, if you can. It isn’t always possible. If we graph populations of countries against their areas, there is no equality line between


people and square kilometers. But here it’s possible, because both axes have the same units (or quasi-units) – percent.

Figure C3.2. Data plus allowed region and equality line

These two additions are shown in Figure C3.2. At once, we see more clearly that all the data points are below the equality line. It would seem that people in all countries tend to voice more support for democracy than for capitalism. The US, Singapore and Japan now look closer to the right border than they did on the original graph. We realize they are pretty close to the maximum possible. These are useful insights.


Next consider conceptual anchor points. These are mandatory or highly plausible extreme points through which a continuous curve expressing the relationship must pass. If support for capitalism is 0 % in a country, what level of support for democracy might we expect? It would be hard to offer anything but 0 %. We cannot have less than 0 %, and for proposing something more than 0 % we would have to bend away heavily from the observed pattern. At the other extreme, what level of support for democracy might we expect when the support for capitalism is 100 %? Again it would be hard to settle on anything but 100 %. Oh yes, we could offer all sorts of other patterns. Maybe at 90 % support for democracy, support for capitalism would suddenly shoot up to 100 %. But why would it be at 90 %, rather than at 95 %? Or could it be that people never become 100 % supportive of capitalism, even at 100 % support for democracy? But would they stop at 80 % or at 90 %?

A basic rule in science is: Keep things as simple as you can. Do not assume complexities unless evidence makes you do so. There are enough real complexities to deal with, without the need to introduce imaginary ones. Our simplest assumption is that the curve y vs. x starts at the point (0,0) and ends at (100,100). This means that for any value of x that can conceptually occur we have a value of y that can conceptually occur – and vice versa. Our predictions may turn out to be wrong, but at least we do have a prediction for any value of x or y, a prediction we can subject to verification. This is better than saying “We don’t know”.

Fit with fixed exponent function

Now determine the simplest curve that respects the anchor points and the data. The simplest curve to join (0,0) and (100,100) would be the equality line y=x. But it clearly does not fit the data. What is the next simplest option? It’s a curve that initially keeps below the line y=x and bends up to join it at (100,100). What is a simple curve?
It’s a curve with the simplest possible description. For curves starting at (0,0) and ending at (1,1), it’s the curve corresponding to the fixed exponent function or power function Y=X^k. Here the exponent k expresses the deviation from the equality line Y=X. For k=1, Y=X^k becomes the equality line Y=X. The more the value of k increases above 1, the more the curve bends downwards. For k<1, it bends upwards.

In the graph on hand, however, the scales go from 0 to 100 rather than 0 to 1. We have to change scales so that the previous 100 becomes 1. How do we do it? We have to switch from x and y to X=x/100 and Y=y/100. In other words, we shift from percent shares (out of 100) to fractional shares (out of 1). Why do we do it? If we kept the percent, Y=X^k would correspond to y=100(x/100)^k. Such a long expression would be more confusing than it is worth.
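To get a feel for how the exponent bends the curve, here is a small Python sketch. The function names and the sample exponents 0.5, 1 and 2 are illustrative choices made here, not values taken from the data.

```python
def power_curve(X, k):
    """Fixed exponent function Y = X**k on the 0-to-1 scale."""
    return X ** k

def power_curve_percent(x, k):
    """The same curve written for percent scales: y = 100*(x/100)**k."""
    return 100 * power_curve(x / 100, k)

# Each row lists Y at X = 0, 0.25, 0.5, 0.75 and 1 for one exponent.
for k in (0.5, 1, 2):
    row = [round(power_curve(X, k), 2) for X in (0, 0.25, 0.5, 0.75, 1)]
    print(f"k={k}: {row}")

# At 50 % on the x axis, k=2 gives 25 % on the y axis.
print(power_curve_percent(50, 2))
```

Every exponent passes through both anchor points; only the values in between differ, lying below the equality line for k>1 and above it for k<1.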


The fixed exponent format may look complex. Would it not be simpler to avoid curves and use two straight line segments? The first one could run from (0,0) roughly to the data point for Canada, at about x=79%, y=58%, and the second one from Canada to (100,100). Let us consider how to describe it. We’d have to write y=a+bx, 0<x<79, y=a’+b’x, 79<x<100. Count the letter spaces. The two straight line expressions have 26 letter spaces. That’s messy, compared to 4 letter spaces in Y=X^k (Y, =, X and the exponent k). Better get used to fixed exponent functions, X^k.

Two straight line segments would be considered simpler than a single curve only if one takes as an article of faith that all relationships in the world are or should be linear. But they aren’t, and for good conceptual reasons (see Chapter B6). Belief in straight line relationships is even more simplistic than the belief of Ptolemaic astronomers that all heavenly bodies must follow circular paths. The world isn’t flat, and all relationships are not linear.

Which value of k would fit best the data in Figure C3.2? Recall Chapter B6. Fitting to Korea as the central point (X=.78, Y=.58) leads to k=log.58/log.78=2.19. Some points on the curve Y=X^2.19 are shown below:

X   0   .2   .4   .6   .8   1
Y   0   .03  .13  .33  .61  1

Figure C3.3 complements the previous one by adding the anchor points (shown as triangle symbols) and the curve Y=X^2.19 – and… it visibly isn’t a good fit at low values of X! Only China is close to the curve, while Vietnam, Philippines and Indonesia are all above it.

What could we do differently? Note that we based our variables on support for democracy and capitalism. We could as well reverse the axes and consider the lack of support. This “avoidance of democracy/capitalism” would correspond to variables 1-X and 1-Y, respectively. Instead of Y=X^k, the simplest format now would be (1-Y)=(1-X)^k and hence Y=1-(1-X)^k. Fitting again to Korea leads to k=log(1-.58)/log(1-.78)=0.57 (equivalently, (1-X)=(1-Y)^1.75, since 1/0.57=1.75).
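As a rough check on the two single-point fits, the following Python sketch recomputes each exponent from the point (X=.78, Y=.58) and tabulates both curves. Variable names are illustrative only. Evaluated numerically, log(1-.58)/log(1-.78) comes out near 0.57, whose reciprocal is about 1.75.

```python
import math

X0, Y0 = 0.78, 0.58  # the central point used in the text

# Format Y = X**k: taking logs gives k = log(Y)/log(X).
k_support = math.log(Y0) / math.log(X0)

# Format (1-Y) = (1-X)**k: taking logs gives k = log(1-Y)/log(1-X).
k_avoid = math.log(1 - Y0) / math.log(1 - X0)

print(f"k for Y=X^k:         {k_support:.2f}")
print(f"k for (1-Y)=(1-X)^k: {k_avoid:.2f} (reciprocal {1 / k_avoid:.2f})")

print(" X    Y=X^k   Y=1-(1-X)^k")
for X in (0, 0.2, 0.4, 0.6, 0.8, 1):
    y_support = X ** k_support
    y_avoid = 1 - (1 - X) ** k_avoid
    print(f"{X:.1f}   {y_support:.2f}    {y_avoid:.2f}")
```

Both curves pass through the chosen point and through the two anchor points by construction; they differ mainly at low X, where the second format lies higher.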
Some points on the curve Y=1-(1-X)^0.57 (0.57=1/1.75) are shown below (see next chapter for how to calculate them):

X   0   .2   .4   .6   .8   1
Y   0   .12  .25  .40  .60  1

This curve, too, is shown in Figure C3.3, and it visibly offers a better balance between China on the one hand and Philippines and Indonesia on the other. So this is the curve we might take as a guide for predicting (i.e., offering our best guesses) for further countries. We are now close to the best quantitative description one could offer when respecting the anchor points. We could improve on it slightly by taking into account not just Korea but all data points and running a regression. But if we wish to run a linear


regression we first have to transform the data so that a linear form is expected (recall Chapter B6). We have to take the logarithms, on both sides of (1-Y)=(1-X)^k: log(1-Y)=k log(1-X).
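Because the logarithmic form has no constant term, the corresponding straight line passes through the origin, and its least-squares slope is the sum of products divided by the sum of squares. The Python sketch below illustrates this under stated assumptions: the points are hypothetical, generated to lie exactly on a curve with k=0.6, and are not survey data; the function name is made up for this example.

```python
import math

def fit_k_through_origin(points):
    """Least-squares slope of log(1-Y) on log(1-X), intercept fixed at zero."""
    u = [math.log(1 - X) for X, Y in points]
    v = [math.log(1 - Y) for X, Y in points]
    return sum(ui * vi for ui, vi in zip(u, v)) / sum(ui * ui for ui in u)

# Hypothetical points lying exactly on (1-Y) = (1-X)**0.6 -- NOT real data.
true_k = 0.6
points = [(X, 1 - (1 - X) ** true_k) for X in (0.3, 0.5, 0.7, 0.8, 0.9)]

k_hat = fit_k_through_origin(points)
print(f"recovered k = {k_hat:.3f}")
```

With real, scattered data the recovered value would only approximate the underlying exponent, and any point with X or Y equal to 1 would have to be excluded, since log(0) is undefined.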

Figure C3.3. Model based on continuity between conceptual anchor points and fitted to data. (NB! The label “1.25” should read “1.75”.)

This equation means that log(1-Y) is a linear function of log(1-X). Hence, if we want to run a linear regression, we must do it not “Y on X” but “log(1-Y) on log(1-X)”. Moreover, we have to do it subject to the condition that the intercept term must be zero, because we need log(1-Y)=k log(1-X), not just log(1-Y)=a+k log(1-X) with any odd value of a. The value of k thus found might be somewhat different from the value fitted to Korea alone, but not by very much.

Note that we would not know which transformation to undertake, prior to regression, unless we first think about anchor points. Even so, the simplest expression we thought of (Y=X^k) did not pan out. We had to realize that what is simplest differs depending on whether we thought in terms of support or avoidance. Only then could we run a regression that would make conceptual sense and also fit the data.

CAUTION: If any data points are close to 0 when fitted with Y=X^k, or close to 1 when fitted with (1-Y)=(1-X)^k, then taking logarithms blows even small deviations from the average pattern to huge proportions. Graphing on log scales can’t be done, nor can linear regression. In such cases linear testing calls for more complex approaches.

Is this model building?


C4. The Importance of Slopes in Model Building
___________________________________________________________________
• Algebraically formulated models do not take a stand on which came first, chicken or the egg – they are just said to go together. Thus the distinction between “independent” or “dependent” variables rarely matters – the chicken and the egg are interdependent. Depending on the issue considered, either could be an input or an output.
• The direction and steepness of slopes of curves y=f(x) is expressed as dy/dx. The slope of y=f(x) is often designated as f’(x).
• For fixed exponent function y=Ax^k, the slope is dy/dx=kAx^(k-1).
• When functions are added/subtracted, their slopes add/subtract: y=Af(x)+Bg(x)+C → dy/dx=Af’(x)+Bg’(x).
• When functions are multiplied together, the resulting slope cross-multiplies slopes and functions: y=f(x)×g(x) → dy/dx=f’(x)×g(x)+g’(x)×f(x).
• When two functions are divided, it becomes more complex: y=f(x)/g(x) → dy/dx=[f’(x)×g(x)-g’(x)×f(x)]/[g(x)]^2.
• The value of the fixed exponent k in Y=X^k can be calculated from data for X and Y as k=logY/logX. To calculate Y from Y=X^k, enter the value of X, push the key “y^x”, enter the value of k, and push “=”.

___________________________________________________________________
The previous example applied what was learned in Chapter B4 and systematized in Chapter B6. No new concepts were introduced. Now we use the same example to introduce the notation for slopes and ways to make use of it. As a preliminary, the notion of algebraic equations, introduced in Chapter B2 (Symmetric Regression), must be stressed.

Many variables are interdependent, not “independent” or “dependent”

If Y=X^2.19 had worked out, it would have meant that support for capitalism is sort of an attenuated version of support for democracy. Or vice versa, that support for democracy is sort of a reinforced version of support for capitalism. Algebraically formulated models do not take a stand on which came first, chicken or the egg – they are just said to go together. Zero support for democracy goes with zero support for capitalism, according to this model, and vice versa. Zero avoidance of democracy and capitalism also go together. In between these extremes, one is a reinforced/attenuated version of the other.

We often think of processes in terms of some factors entering as “inputs” and producing some “outputs”. When graphing an output variable and an input variable, the convention is to place the input on the horizontal “x-axis” and the output on the vertical “y-axis” (cf. Chapter A8). When writing equations, the convention is to put the output first: C=k/N^2, rather than k/N^2=C. But what was an input in one context may become output in some other. Thus, in Exercise A7.4, we wanted to determine the value of parameter k, when N and C are given for some countries. Then k was for the moment the “output”, and we wrote k=N^2×C.


It is even more difficult to propose a unique direction when considering support for democracy and capitalism. Dalton and Shin (2006) decided to graph Capitalism vs. Democracy (cf. our Figures C3.1 to C3.3). Did they imply that support for democracy comes first and somehow causes support for capitalism? Not necessarily. It is more likely that both are affected by a large number of other factors (X, Y, Z) and also boost each other:

(X, Y, Z) → C
(X, Y, Z) → D
C ↔ D

The relationship of C and D may well be symmetric, but when graphing, we are stuck with the choice between two asymmetric ways: either we graph C vs. D or D vs. C. We have to do it one way or the other, but don’t mistake it for a causal direction.

The same goes for equations. We have used Y for support for capitalism and X for support of democracy, leading to Y=X^2.19. But we could as well transform it into X=Y^0.457, where 0.457=1/2.19 (cf. Exercise B4.1). Such transformations are inherent whenever we use algebraic equations, where the equality sign “=” is valid in both directions, in contrast to one-directional regression equations such as y ← a+b1x1+b2x2+….

Some social scientists pay much attention to whether variables are “independent” or “dependent”, but physical scientists are accustomed to think in terms of interdependent variables. Causal direction may vary. Sometimes the existing number of teachers affects literacy, and sometimes existing literacy affects the number of teachers. Thus it makes more sense to talk of input and output variables under the given circumstances rather than inherently “independent” and “dependent” ones. The terms “input” and “output variables” indicate their respective roles in the moment's context, without passing judgment on some inherent direction.

Notation for slopes

It matters by how much the slope of a straight line is up or down, but this slope is constant. For curves, slope steadily changes. Suppose a function y=f(x) is graphed, y vs. x.
The notation for the steepness of slope is dy/dx, which means the ratio of a tiny change (dy) for y which takes place over a tiny change (dx) for x.
• When dy/dx=0, it means that y does not change with x – the curve is horizontal.
• When dy/dx=+0.3, the slope is moderately up.
• When dy/dx=-10, the slope is steeply down.
Straight lines have a constant slope. Slope is the “b” in y=a+bx. Curves have changing slopes. The slope of a curve at a given point is the slope of the line that barely touches the curve at this point but does not cross it – the tangent to the curve.

Just like the curve itself has an equation, so has the slope. For instance, the slope for Y=X^k is dY/dX=kX^(k-1). How did we obtain this expression? For mathematical proof, see a book on differential calculus. In the following, we simply check on whether this equation makes sense, by yielding credible results.
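One way to check the slope formula without calculus is to compare it with a direct estimate: take a tiny step on either side of X and divide the change in Y by the change in X. The Python sketch below does this for k=2.19; the step size h and the sample values of X are arbitrary choices made for illustration.

```python
def slope_formula(X, k):
    """Slope of Y = X**k according to dY/dX = k * X**(k-1)."""
    return k * X ** (k - 1)

def slope_numeric(X, k, h=1e-6):
    """A tiny change in Y divided by the tiny change in X that caused it."""
    return ((X + h) ** k - (X - h) ** k) / (2 * h)

k = 2.19
for X in (0.25, 0.5, 0.75):
    formula = slope_formula(X, k)
    numeric = slope_numeric(X, k)
    print(f"X={X}: formula {formula:.4f}, numeric {numeric:.4f}")
```

The two columns should agree to several decimals, which is all the "tiny change" definition of dy/dx promises.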


Take Figure C3.3. Using the slope formula above for Y=X^2.19 yields dY/dX=2.19X^1.19. Plug in simple values of X. When X=0, we obtain dY/dX=0. This means that the curve is completely flat, the way it looks indeed in Figure C3.3. At the other end, when X=1, we obtain dY/dX=2.19. This means that the slope is much steeper than 1 (which is the slope of the line Y=X). This is the way it looks indeed in Figure C3.3. Now look for the location where the tangent to the curve would be parallel to Y=X, so that its slope is 1. I find that this is very roughly the case at X=0.48. The slope formula yields dY/dX=2.19×0.48^1.19=2.19×0.42=0.92 – close enough to 1, given my imperfect visual placement of the tangent.

Why bother about slopes, when building and testing logical models? Two reasons are foremost, among others:
• Finding minimum or maximum values for models that do not involve slopes, and
• Building models on the basis of slopes themselves.
This is why slopes are important in model building. Examples of both follow.
____________________________________________________________
Exercise C4.1
So what would be the exact location where the slope is 1.00? We would have to solve the equation 2.19X^1.19=1. It can be transformed into logX=-log2.19/1.19. Then logX=-0.286, and X=0.518. Now find the value of X where the slope is 2.00. Just replace 1 by 2, in 2.19X^1.19=1, and start juggling the quantities.
____________________________________________________________________
Exercise C4.2
Suppose people have been asked to express their degree of trust in their country’s legal system (L, rated on a scale from 0 to 1) and in their country’s representative assembly (A, also rated on a scale from 0 to 1). We wonder how L and A might be interrelated. This means defining the function L=f(A) or, in the reverse direction, A=f(L).
a) Which of L and A might be the independent and which the dependent variable? Or would it make more sense to assume that they are just mutually interdependent?
b) Graph L vs. A or A vs. L, depending on your previous response. Label the locations 0, 0.5 and 1 on both scales. Mark off the areas where data points could not possibly be – the forbidden regions.
c) If the variable you graphed on the x-axis increases, what can one expect of the other, on the average? Indicate it in the form dy/dx<0, dy/dx=0, or dy/dx>0.
d) When L=0, what can one expect of A? When A=0, what can one expect of L? Keep it as simple as possible. Mark the resulting point(s) on the graph.
e) When L=1, what can one expect of A? When A=1, what can one expect of L? Keep it as simple as possible. Mark the resulting point(s) on the graph.
f) What is the simplest way to join these points? Draw it on the graph, and give its equation. Let’s call it the simplest model allowed by the anchor points.
g) Suppose we are given two data points, (L=0.45, A=0.55) and (0.55,0.50). Show them on the graph. Is it likely that our simplest model holds? Why?


h) … using symbols different from the previous. (E.g., if you previously used small circles, then now use crosses.) Is it likely that our simplest model holds? Why?
i) Pass a smooth curve through the points in part (h). What might be the form of the corresponding equation?
j) Assuming that this shape holds, how would you express in words trust in legal system as compared to trust in assembly?
____________________________________________________________________

Assembly size which minimizes communication load

The sizes (S) of legislative assemblies empirically tend to follow a cube root pattern: S=(2P)^(1/3), where P is the adult literate population of the country. The reason for it is that, as S increases, the communication channel burden on a single representative decreases in some ways and increases in others. The number of constituents they must satisfy (P/S) goes down, but their number of communication channels in the assembly (S^2/2) goes up (recall Chapter A7). It can be shown (see Taagepera 2007:199) that the total number of channels (c) is close to c=2P/S+S^2/2. As S increases, c first goes down, but then begins to increase again. The optimal size is the one where c is lowest – this is the size which minimizes the communication load.

How do we determine this optimal assembly size? We could draw the curves c vs. S for each population size and look at which S we get the lowest c. We can make it easier for us by observing that minimal c corresponds to the location where the curve is horizontal, meaning dc/dS=0. It will be seen that we can calculate the equation for the slope from the equation for the curve, c=2P/S+S^2/2. It turns out to be dc/dS=-2P/S^2+S. This applies to slope at any assembly size – anywhere on the curve. But now comes the clincher: We also require that dc/dS=0. The result is -2P/S^2+S=0. Rearranging leads to S=(2P)^(1/3).
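The two routes to the optimal size, looking for the lowest point on the curve and setting the slope to zero, can be compared in a short Python sketch. The population P below is a hypothetical round number, picked only so that 2P is a perfect cube; the function names are likewise illustrative.

```python
def channels(S, P):
    """Total communication channels c = 2P/S + S**2/2 for assembly size S."""
    return 2 * P / S + S ** 2 / 2

def slope(S, P):
    """Slope of the curve: dc/dS = -2P/S**2 + S."""
    return -2 * P / S ** 2 + S

P = 4_000_000  # hypothetical adult literate population (2P = 200**3)

# Route 1: try every whole-number size and keep the one with the lowest c.
best_S = min(range(1, 2001), key=lambda S: channels(S, P))

# Route 2: the closed form obtained by requiring dc/dS = 0.
S_cube_root = (2 * P) ** (1 / 3)

print(best_S, round(S_cube_root, 2), slope(best_S, P))
```

Both routes land on the same assembly size, and the slope there is zero: negative for smaller assemblies, positive for larger ones.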
Having a model to support empirical observation, this relationship now qualifies as a law in the scientific sense – the cube root law of assembly sizes.

This is an example where the model as such did not include slopes. Calculating the slope and then requiring it to be zero was just a device to locate the optimal value. Here we tried to minimize the output. Sometimes, to the contrary, we try to maximize some output. Both minima and maxima share the same feature: the slope is flat – dy/dx=0. Quite frequently, however, the logical model itself is built in terms of slopes. Exponential growth is an important example.

The model “slope proportional to size” – exponential growth

As a preparatory step, think about the rate at which a wall under construction increases in length (L) over time (t). Suppose a steady number of masons work on it and produce k meters per day. Starting from the day construction began, L=kt. The slope is the rate of construction: dL/dt=k.

Now consider the way a bacterial colony grows when food and space are plentiful. At regular intervals, each bacterium splits into two. Hence the growth rate is proportional to the number (N) of bacteria. When this number doubles, the rate of growth also doubles. This means that here we have the slope $dN/dt$ proportional to N itself: $dN/dt=kN$. Compare this equation to the one above. When something is built from the outside, at a steady rate, $dL/dt=k$ applies. In contrast, when something is building itself from the inside, at a steady rate, $dN/dt=kN$ applies. So these “differential equations” express the deep meaning of the process.

These differential equations, however, do not enable us to predict how much stuff there is at a given time. For this purpose we must use their “integrated” form. For $dL/dt=k$, it is simply where we started from: $L=kt$. For $dN/dt=kN$, it is $N=N_0e^{kt}$ – an exponential equation, where $N_0$ is the number of bacteria at the time chosen as t=0.

Table C4.1 summarizes the picture for quantities built from the outside and from the inside. The differential and integrated equations are fully equivalent expressions for the same model, but they may look at times quite different, and they serve different purposes. The differential equation expresses and explains the process, while the integrated equation enables us to predict.

Table C4.1. Equations for external construction and self-propelled growth – and constraint patterns.
______________________________________________________________________
Type of process          Differential           Integrated        Corresponding
                         equation          ⇔    equation          constraints
______________________________________________________________________
External construction    $dL/dt=k$         ⇔    $L=kt$            Unbounded field
Self-propelled growth    $dN/dt=kN$        ⇔    $N=N_0e^{kt}$     Only two quadrants allowed
(No process in time)     $dy/dx=k(y/x)$    ⇔    $y=Ax^k$          Only one quadrant allowed
______________________________________________________________________

Starting from a very different angle, we are back to some equations which emerged in Chapter B6. Unbounded field led to trying the linear approach, which emerges in $L=kt$. Only two quadrants allowed led to trying the exponential approach, which emerges in $N=N_0e^{kt}$. For good measure, Table C4.1 also includes the two forms of the fixed exponent equation $y=Ax^k$, which was suggested when only one quadrant is allowed. For the moment, accept that the curve $y=Ax^k$ has for slope, at any given x, $dy/dx=kAx^{k-1}$. Note that this expression includes $Ax^k$, which is the value of y. Hence, $dy/dx=kAx^{k-1}=(Ax^k)(k/x)=ky/x$.
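The equivalence of the differential and integrated forms in Table C4.1 can be made tangible with a rough numerical sketch in Python. It adds up "slope times a small time step" many times over and compares the outcome with the integrated equation; it also checks that the slope of the fixed exponent curve equals ky/x. The rate constant, the starting number of bacteria and the values of A and x are hypothetical, picked only for illustration, and the step-by-step summation is a crude device of mine, not a method used in this book.

```python
import math


def integrate(slope, start, t_end, steps=100_000):
    """Add up slope * dt in many small time steps, from time 0 to t_end."""
    dt = t_end / steps
    value = start
    for _ in range(steps):
        value += slope(value) * dt
    return value


k = 0.5       # hypothetical rate constant
N0 = 100.0    # hypothetical number of bacteria at t = 0
t_end = 4.0

wall = integrate(lambda L: k, 0.0, t_end)         # dL/dt = k
colony = integrate(lambda N: k * N, N0, t_end)    # dN/dt = kN

print("wall:  ", round(wall, 3), "vs. L = kt:", k * t_end)
print("colony:", round(colony, 1), "vs. N = N0*exp(kt):",
      round(N0 * math.exp(k * t_end), 1))

# Fixed exponent curve y = A*x**k: the slope k*A*x**(k-1) should equal k*y/x.
A, x = 3.0, 2.0   # hypothetical values
y = A * x ** k
print(k * A * x ** (k - 1), "vs.", k * y / x)
```

The step-by-step sums come close to the integrated equations, and they come closer as the steps are made smaller. This is all that "integrating" means here.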

Thus, it turns out that here the slope depends on the ratio of y and x. Does this pattern correspond to another growth pattern in time? No, it cannot deal with time, because time extends from minus to plus infinity, while here x is limited to positive values.

The slope of fixed exponent function

It was claimed that $y=Ax^k$ has for slope $dy/dx=kAx^{k-1}$. I will omit mathematical proofs. Rather, I’ll try to show why and how it makes sense. Does the formula give the right result for the straight line $y=x$? We know that the slope of this line is 1. The formula yields $dy/dx=1\cdot x^{1-1}=x^0=1$. This is indeed the slope of $y=x$. For the parabola $y=x^2$, the formula yields $dy/dx=2x$. Does it make sense? Look at Figure C4.1.

Figure C4.1. Parabola $y=x^2$ and its slope $dy/dx=2x$.

At x=0, dy/dx=2x yields dy/dx=0, which certainly is the case: the curve is horizontal at this point. At x=2, dy/dx=2x yields dy/dx=4. On the graph, as x increases by ½ units, y increases by 2 units, so 2/(1/2)=4. At x=3, dy/dx=2x yields dy/dx=6. On the graph, as x increases by 1/3 units, y increases by 2 units, so 2/(1/3)=6. At x=-2, dy/dx=2x yields dy/dx=-4. On the graph, as x increases by ½ units, y decreases by 2 units, so -2/(1/2)=-4.
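The same check can be run without a graph. The sketch below, in Python, measures the rise of the parabola over a very small run around each of the four points above and compares it with what the formula claims. The helper names are my own, and the step size is an arbitrary small number.

```python
def slope(func, x, dx=1e-6):
    """Approximate dy/dx by the rise over a small symmetric run around x."""
    return (func(x + dx) - func(x - dx)) / (2 * dx)


def power_rule(A, k, x):
    """Slope of y = A*x**k as claimed in the text: k*A*x**(k-1)."""
    return k * A * x ** (k - 1)


def parabola(x):
    return x ** 2


for x in (0.0, 2.0, 3.0, -2.0):
    print(x, round(slope(parabola, x), 4), power_rule(1, 2, x))
```

The measured and the claimed slopes should agree at every point, including the negative slope at x=-2.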

How slopes combine when functions are combined

It can be seen that slopes are important building blocks and tools for logical models. The following introduces some basic relationships, omitting mathematical proofs. At the start, I’ll try to show why and how they make sense, but when reaching for division, I give up. You’ll begin understanding as you make use of these formulas.

Let us add a constant to $y=x$, maybe 3, so that it becomes $y=x+3$. By so doing, we just raise the line by 3 units – the slope does not change. Now multiply x by some constant, maybe 2. By so doing, we make the slope twice as steep. More generally, $d(a+bx)/dx=b$.

The same applies to any other curve, when we add or multiply by constants. Take the parabola $y_1=x^2$, multiply it by b and add a: $y=bx^2+a$. The slope of $y_1=x^2$ was $dy_1/dx=2x$. The slope of $y=bx^2+a$ is $dy/dx=2bx$.

Suppose we now add together a straight line, $y_2=a+bx$, and a parabola, $y_1=cx^2$. We get a curve $y=a+bx+cx^2$. Its slope is $dy/dx=b+2cx$. This rule applies to sums of any functions. We often designate the slopes of functions by the same letter followed by an apostrophe. Thus, for the functions $y_1=f(x)$ and $y_2=g(x)$, we have $dy_1/dx=f'(x)$ and $dy_2/dx=g'(x)$. When such functions are multiplied by constants and then added, the same happens to their slopes:

When $y=Af(x)+Bg(x)+C$, then $dy/dx=Af'(x)+Bg'(x)$.

What happens to slopes when two functions are multiplied together? Before adding the slopes, each is multiplied by the value of the other function:

When $y=f(x)\times g(x)$, then $dy/dx=f'(x)\times g(x)+g'(x)\times f(x)$.

Figure C4.2. Area increase results from cross-multiplication of lengths and length increases. Cf. Exercise A11.1. [Rectangle with sides x and y, extended by Δx and Δy; its parts have areas xy, xΔy, yΔx and ΔxΔy.]

How can this be visualized? Look at the rectangle in Figure C4.2, with sides x and y. Its area is $A=xy$. Extend both sides by a tiny bit, Δx and Δy, respectively. The area increases by $\Delta A=x\Delta y+y\Delta x$ when we neglect the even tinier corner ΔxΔy. The cross-multiplication in $dy/dx=f'(x)\times g(x)+g'(x)\times f(x)$ has the same origin.

By the way, when do we use Δx and when dx? When the difference in x is appreciable, Δx is used. When this difference is made ever smaller ($\Delta x\to 0$) and becomes “infinitesimally small”, then we use dx. In Figure C4.2, the corner ΔxΔy still has some area, but dx dy tends to 0.

Figure C4.2 can be used to make it more credible that the slope of $y=x^k$ is $dy/dx=kx^{k-1}$. In this figure, make y equal to x. Now we have a square with area $A=x^2$. Extend both sides by Δx. The area increases by $\Delta A=x\Delta x+x\Delta x=2x\Delta x$ when we neglect ΔxΔx. Hence $\Delta A/\Delta x=2x$. Using y instead of A, and going from Δx to dx, the result is that $y=x^2$ leads to $dy/dx=2x$. This is how it comes about, in $y=x^k$, that for slope we multiply by k and reduce $x^k$ to $x^{k-1}$.

This rule applies to negative k too. For $y=1/x=x^{-1}$, $dy/dx=(-1)x^{-2}=-1/x^2$. The negative sign makes sense: as x increases, y decreases – the slope is negative. It is harder to visualize that the slope is sort of “stronger” than the curve itself. Believe it – or pick simple examples and check it.

Finally, combining the outcomes for $y=1/x$ and $y=f(x)\times g(x)$ leads to the formula for division of two functions:

When $y=f(x)/g(x)$, then $dy/dx=[f'(x)\times g(x)-g'(x)\times f(x)]/[g(x)]^2$.

We may not need all these formulas in the examples that follow, but this is the full set for basic combinations. Some of them are likely to dumbfound you at first. They will become clearer with use. It helps if you give yourself simple examples where you can figure out the result by other means – and discover that the formula yields the same result.
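Here is one such simple example, sketched in Python. I take $f(x)=x^2$ and $g(x)=1/x$, a choice of my own, because then the product is simply x (slope 1) and the ratio is $x^3$ (slope $3x^2$), so the result can be figured out by other means. The constants A, B, C and the point x=1.5 are arbitrary.

```python
def slope(func, x, dx=1e-6):
    """Approximate the slope of func at x with a small symmetric step."""
    return (func(x + dx) - func(x - dx)) / (2 * dx)


def f(x):
    return x ** 2          # its slope is f'(x) = 2x


def f_prime(x):
    return 2 * x


def g(x):
    return 1 / x           # its slope is g'(x) = -1/x**2


def g_prime(x):
    return -1 / x ** 2


x = 1.5                       # arbitrary point
A, B, C = 3.0, -2.0, 7.0      # arbitrary constants for the sum rule

sum_numeric = slope(lambda t: A * f(t) + B * g(t) + C, x)
sum_formula = A * f_prime(x) + B * g_prime(x)

product_numeric = slope(lambda t: f(t) * g(t), x)
product_formula = f_prime(x) * g(x) + g_prime(x) * f(x)

quotient_numeric = slope(lambda t: f(t) / g(t), x)
quotient_formula = (f_prime(x) * g(x) - g_prime(x) * f(x)) / g(x) ** 2

print("sum:     ", round(sum_numeric, 4), round(sum_formula, 4))
print("product: ", round(product_numeric, 4), round(product_formula, 4))
print("quotient:", round(quotient_numeric, 4), round(quotient_formula, 4))
```

In each row the two numbers should coincide. For the product they should both equal 1, and for the quotient $3\times 1.5^2=6.75$, as direct reasoning predicts.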
______________________________________________________________________
Exercise C4.3
The previous model for cabinet duration in years, $C=42/N^2$, has the form $Y=AX^k$. Keep in mind that $1/N^2=N^{-2}$.
a) What would be the expression for $dC/dN$?
b) Do such slopes agree with the graph in Exercise A8.1?
c) Why doesn’t it apply to Figure A12.1?
________________________________________________________________________

How to determine the value of k in $Y=X^k$

This section reviews the method introduced in Chapter B4 and expanded in B6. Pick a data point that looks representative of the center of the data cloud, such as Korea’s X=0.78, Y=0.58 in Figure C3.3. Take logarithms on both sides of $Y=X^k$. It leads to $\log Y=k\log X$. Hence

$k=\log Y/\log X=\log 0.58/\log 0.78=(-0.237)/(-0.108)=2.19$.

Note that we must use the fractional values 0.58 and 0.78, NOT the percent values 58 and 78. The latter would lead to a wrong value, k=0.93.

On a pocket calculator, log X is usually obtained by entering the value of X and then pushing LOG. (Some calculators, though, require LOG first, and then the value of X.) The obverse problem is to calculate $Y=X^k$ for various values of X, so as to have points through which the curve can be passed. Enter the value of X, push the key “$y^x$”, enter the value of k, and push “=”.
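If a computer is closer at hand than a pocket calculator, the same two steps can be sketched in Python as follows. The function name is my own; the Korea values are those quoted above, and the X values chosen for the curve are arbitrary.

```python
import math


def exponent_k(X, Y):
    """Exponent k in Y = X**k that makes the curve pass through (X, Y)."""
    return math.log10(Y) / math.log10(X)


k = exponent_k(0.78, 0.58)             # fractional values, as required
k_wrong = exponent_k(78, 58)           # percent values: a misleading exponent

print(round(k, 2))         # 2.19
print(round(k_wrong, 2))   # 0.93

# The obverse problem: points through which the curve can be passed.
for X in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(X, round(X ** k, 2))
```

Natural logarithms (`math.log`) would give the same k, because the ratio of two logarithms does not depend on the base.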

C5. Support for Democrats in US States: Applying $Y=X^k$
___________________________________________________________________
• This chapter offers more examples where the fixed exponent model applies.
________________________________________________________________

Having worked through the previous example, you should now be in a position to tackle the following. Figure C5.1 shows the Democratic percentages of votes in various US states in the 2000 presidential election compared to what they were in 1996, as reproduced from Johnston, Hagen and Jamieson (2004: 50). The utterly dark label on the line shown reads “Vote2000 = -11.4 + 1.1*Vote1996”. It represents the OLS (Ordinary Least Squares) regression line, which is one way to fit a line to these data points. Several features can be added that might add to our understanding of the changes from 1996 to 2000.

Figure C5.1. The starting point: Data and regression line, as shown in Johnston, Hagen and Jamieson (2004: 50).

Before you turn the page, do the following, and write it down. a) State which features of interest are missing. b) Add these missing parts to the graph, as precisely as possible. c) Try to find what the graph then tells you. Try to do it on your own, before looking at my solution. Think in terms of allowed areas, anchor points, continuity, equality lines, baselines, and simplest curves joining anchor points. This is not to say that all of them enter here in a useful way.

Solution

Let x stand for “1996 Democratic percentage” and y for “2000 Democratic percentage”. For both, the conceptually allowed range is from 0 to 100. Draw it in, as precisely as you can. How can we do it? Take a piece of paper with a straight edge. Mark on it the equal distances 30, 40, 50 … on the x axis. Move it first so as to reach 0 on the left and then 100 on the right. Mark those spots. This was done at the level y=20. Repeat it at the top level of the graph. Use these two points to draw the vertical line x=0. Do the same at x=100. Then do the same for the y axis. (Watch out: the scale is not quite the same for y as for x.)

You may find that you do not have enough space around the original graph to fit in the entire allowed region. If so, copy the graph, reducing it. If there still isn’t enough space, then tape extra paper on all sides. Just don’t say “It can’t be done” for technical reasons. You are in charge, not technology. Don’t be afraid to use less than modern means, if they serve the scientific purpose.

Now draw in the equality line, joining (0,0) and (100,100). Check that it does pass through the (30,30) and (50,50) points visible in the original graph. It’s a natural comparison line, the line of no change. It becomes visible that 1) support for Democrats decreased from 1996 to 2000, and 2) not a single state went against this trend.

What about anchor points? If a state existed where support for Democrats already was 0 % in 1996, it would not be expected to buck the trend and go up in 2000. Hence (0,0) is an anchor point. Also, if a state existed where support for Democrats still was 100 % in 2000, it would not be expected to buck the trend and be less than that in 1996. Hence (100,100) is also an anchor point. The simplest curve that joins these anchor points again is $Y=X^k$, with $X=x/100$ and $Y=y/100$.
At X=0.50 we have approximately Y=0.44, in agreement with the regression line shown in Figure C5.1. Hence $k=\log 0.44/\log 0.50=1.18$, so that the resulting curve is $Y=X^{1.18}$. Some points on this curve are shown below:

X    0    .1    .3    .5    .7    .9    1
Y    0    .07   .24   .44   .66   .88   1

They enable us to draw in the approximate curve. Figure C5.2 shows these additions. The graph might look a bit nicer when done on computer, but it might be less precise, even if you could scan in the original graph. Mules and computer programs can be stubborn. It’s better to have it your way and correct than their way and incorrect.

The curve $Y=X^{1.18}$ that respects the anchor points almost coincides with the linear regression line for y above 50 %. It differs appreciably for low values of y. Although the curve $Y=X^{1.18}$ seems to fit the data, we also should keep in mind the opposite approach to support for Democrats – the avoidance of them, $(1-Y)=(1-X)^k$. In this case $k=\log(1-0.44)/\log(1-0.50)=0.837$. For X=.30 and X=.70, $Y=1-(1-X)^{0.837}$ yields the points (.30,.26) and (.70,.63). For y above 50 %, this curve (not shown) falls below most data points. Thus $Y=X^k$ is the preferable approach in this case.
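The numbers above can be reproduced with a short Python sketch, which also places the regression line from Figure C5.1 next to the two curves. The function names are my own. The regression coefficients are the ones printed on the original graph, and the anchoring point (0.50, 0.44) is the one chosen above.

```python
import math


def fit_exponent(X, Y):
    """Exponent k that makes the curve Y = X**k pass through (X, Y)."""
    return math.log(Y) / math.log(X)


k_support = fit_exponent(0.50, 0.44)          # support:   Y = X**k
k_avoid = fit_exponent(1 - 0.50, 1 - 0.44)    # avoidance: (1-Y) = (1-X)**k


def ols_line(x_percent):
    """Regression line quoted from the original graph, in percent."""
    return -11.4 + 1.1 * x_percent


print("k for support:", round(k_support, 2), " k for avoidance:", round(k_avoid, 3))
print("X, support curve, avoidance curve, regression line")
for X in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    support = X ** k_support
    avoid = 1 - (1 - X) ** k_avoid
    print(X, round(support, 2), round(avoid, 2), round(ols_line(100 * X) / 100, 3))
```

The last column shows where the straight line runs into trouble: at X=0 it yields a negative vote share, while both curves stay within the allowed region from 0 to 1.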

Figure C5.2. Allowed region, anchor points, and model based on continuity between conceptual anchor points, fitted to data.

If it were a pure two-party constellation, we would face a dilemma, because then 1-X and 1-Y would stand for the Republican shares of votes – and there is no reason to prefer either party as a standard. Actually, however, third parties also received some votes, so we may be off the hook.

What have we learned that wasn’t evident from the original graph? It may seem from Figure C5.1 that states varied fairly widely in their support for Democrats. In contrast, Figure C5.2 pins down that this variation was rather modest when compared to the conceptually possible range. [Johnston, Hagen and Jamieson (2004) explicitly excluded Washington, DC from their graph, because its lopsidedly Democratic vote in both years was outside the usual range.] We can see that no state bucked the trend away from Democrats. Compared to the regression line, the average trend is expressed in a way that does not predict absurdities at extreme values of x or y.

A major payoff is that we express information in a more compact form (recall end of Chapter B4 and exercise B4.1). Indeed, instead of the two numerical values in the regression equation (intercept -11.4 and slope 1.1) we make do with a single one (exponent 1.18). This would be extremely important if we continued to study change in support during other periods. The equations of regression lines would be hard to compare systematically because we’d have to keep track of both intercept and slope – and a single outlier could alter the intercept, even while it is just noise from the viewpoint of systematic comparison. The relationship of the single exponent k to the average nationwide shift might tell us something of interest.

In more formal terms, compared to the linear fit $Y=a+bX$, the format $Y=X^k$ not only respects conceptual constraints but also makes do with a single parameter (k) instead of two (a and b). This parsimony makes comparisons of phenomena much easier and hence helps to answer “why?”
________________________________________________________________
Exercise C5.1
The data in Figure C3.1 are roughly as follows (these are the same numbers as in Exercise B6.1).
___________________________________________________________
Country         Support for   Support for   Opposition to   Opposition to
                Democracy     Capitalism    Democracy       Capitalism
___________________________________________________________
Canada          81            57
China           60            29
Indonesia       54            48
Japan           82            52
Korea, S.       79            57
Philippines     42            36
Singapore       83            58
US              80            63
Vietnam         60            44
Arithm. mean    69.2          49.8
Geom. mean      67.6          48.4
___________________________________________________________

a) Calculate Opposition to Democracy / Capitalism on a scale from 0 to 1 (NOT %!), and enter these numbers in the table.
b) Graph Opposition to Capitalism (Y) against Opposition to Democracy (X) and enter the data points from (a).
c) Which form of equation could be conceptually acceptable, when assuming anchor points (0,0) and (1,1)?
d) Calculate the free parameter in this equation when X=0.20 yields Y=0.47.
e) Insert this parameter value into the general equation, and calculate the values of Y when X=0.2, 0.5, 0.7 and 0.9.
f) Mark these points on the graph and pass a smooth curve through them.
g) To what extent does this curve agree with the data points? Could some other smooth curve fit much better?
h) For graphing the curve, what made me suggest the values of Y at X=0.2, 0.5, 0.7 and 0.9 rather than the more uniform set 0.2, 0.4, 0.6 and 0.8?
i) How does this exercise connect to the discussion of Figures C3.1 to 3? What is its most specific technical purpose?
_____________________________________________________________________
Exercise C5.2
Literacy percentages for males and females in three developing countries are (M=45;F=35), (60;45) and (70;60). Build a simple model that avoids absurdities. Show it graphically and as an explicit equation. Draw conclusions. Up to now, I have presented lots of hoops for you to jump through. Now it’s time for you to set the hoops for yourself.
_____________________________________________________________________
Exercise C5.3
The graph on the next page is reproduced from Norris and Inglehart (2003: 27). The Gender Equality scale runs from 0 to 100. Approval of Homosexuality scale runs from 0 to 10. The line shown visibly is the OLS regression line, Gender Equality regressed on Approval of Homosexuality. Do with it what you can. Comment on data, existing analysis and on yours.

C6. How the Number of Parties Connects: Exponential Fit
___________________________________________________________________
• The exponential function $y=Ae^{k(x-B)}$ is a most prevalent function in natural and social phenomena. This is the simplest pattern where the curve approaches a floor or a ceiling without ever reaching it.
• To determine k in $y=Ae^{k(x-B)}$, take one typical point on the curve as (B, A) and plug a second typical point (x, y) into $k=\ln(y/A)/(x-B)$.
• To calculate ln y, enter y and push “LN”.
• To calculate values of $y=Ae^{k(x-B)}$, calculate $k(x-B)$. Push the key “$e^x$” or “exp x”, push “multiply”, enter the value of A and push “=”.
________________________________________________________________

We are back to Arend Lijphart’s Patterns of Democracy (1999). A central measure for the majoritarian-consensus continuum of democratic institutions is the number of parties in the legislative assembly. More specifically, it’s their effective number (N), which takes into account the relative sizes of the parliamentary parties – cf. Chapter A6. This number has a definite lower limit at N=1, but no clear-cut upper limit. True, an assembly of S seats could fit at most S parties, but S itself varies, and actual values of N are so much smaller (i.e., N<<S) that we might as well say that N has no finite upper limit.

Interest group pluralism

Consider Figure C6.1, which graphs interest group pluralism (I) versus N. Here I is a compound index based on estimates of so many different aspects that it is a quasi-continuous quasi-measure, ranging from 0 to at most 4. At the top, we see the original graph (Lijphart 1999: 183). It shows a slightly tilting OLS line (I vs. N) in the midst of such high scatter ($R^2=0.30$) that much of the entire field is filled with points. Delineating the allowed region changes little. The original limits shown on the left, top and bottom are indeed the conceptual limits. It’s just that the field is wide open to the right. We have no data points at N=1.
A democracy with only one party is rare, to put it mildly, although Botswana comes close, with N=1.35 (one huge and one tiny tribal party). However, the general trend seen in Figure C6.1 suggests that if N=1 ever materialized, then I might be close to its upper limit of 4. No other value is more likely, so let us take (1,4) as a tentative anchor point through which a reasonable data fit should pass.

Note that the OLS line (y on x) falls below nearly all data points at very low N (i.e., N<2), because it is pulled down by the low values at medium N. In contrast, it hovers above nearly all data points at high N, because it is heavily affected by the outliers ITA and PNG. The other OLS line (x on y), shown in the lower graph, is no better – it errs in the opposite directions.

With conceptual finite limits on 3 sides, only one anchor point usually can be proposed, plus a conceptual bottom or ceiling. Here a conceptual bottom at I=0 complements the anchor point at (1,4). The simplest pattern is a curve starting at (1,4) and gradually approaching I=0 at large N, without ever reaching it. As was seen in Chapter B6, the simplest format in such cases is an exponential function: $y=Ae^{kx}$, where e=2.71… is a mathematical constant analogous to the well-known “pi”. The constant k indicates how rapidly y changes with increasing x. A positive value of k means increasing y. A negative value of k means a decreasing y.

Figure C6.1. Original graph (Lijphart 1999: 183) and addition of an exponential fit.

In the present case y is replaced with I and x with N. It will soon be shown how we reach the curve

$I=4e^{-0.31(N-1)}$.

This curve is shown in the bottom part of Figure C6.1. It passes below most data points at low N, so the fit is not good there. Indeed, a simple straight line passed through (1,4) might seem to fit the data cloud better, and the symmetric regression line (not shown) would come close. Trouble is, such a line would predict a negative value of I for N larger than 6. Our data fits should not predict absurdities, even though such high values of N do not occur in that particular data set. So the exponential fit still is preferable. Moreover, it may help us find out why interest group pluralism is connected to the number of parties. But how can we determine a fit like $I=4e^{-0.31(N-1)}$?

Fitting an exponential curve to interest pluralism data

We can start in two ways: the basic exponential format $y=Ae^{kx}$, or a format which looks slightly more complex, $y=Ae^{k(x-B)}$. We’ll do it both ways – and what looks more complex turns out simpler.

The starting point is the one suggested in Chapter B6: pick two “typical” points and fit the equation to them. If we insist on the anchor point (1,4), then this would be one of them. Then pick a data point at a moderately large N that seems to be close to the central trend. In Figure C6.1 something close to VEN might do (3.4,1.9).

The first format means here $I=Ae^{kN}$. Plugging in the two points yields $4=Ae^{1k}$ and $1.9=Ae^{3.4k}$. Dividing member by member cancels out A: $1.9/4=e^{3.4k}/e^{1k}=e^{(3.4-1)k}$. Hence $0.475=e^{2.4k}$. Take logarithms – and here it pays to push the ln x button rather than log x, because $\ln(e^x)=x$ by definition. So here $\ln 0.475=2.4k$. Hence $k=\ln 0.475/2.4=-0.744/2.4=-0.3102$, which we round off to -0.31. The value is negative, because the curve goes down.

To find A, we plug k=-0.31 into one of the starting equations. Actually, I often do the calculations with both, to guard against mistakes. Here $4=Ae^{1\times(-0.31)}$ and $1.9=Ae^{3.4\times(-0.31)}=Ae^{-1.054}$. They yield the following, respectively, by using the $e^x$ or 2ndF/ln x button. (Note that the exponential goes to the other side and hence changes sign!) $A=4e^{+0.31}=4\times 1.3634=5.454$ and $A=1.9e^{+1.054}=1.9\times 2.869=5.451$. The difference is due to rounding off, and A=5.45 is sufficiently precise. The final result is $I=5.45e^{-0.31N}$.
It can be used to calculate points on the curve shown in the bottom part of Figure C6.1, but the anchor point (1,4) does not stand out. We can easily see what the value of I would be at N=0 – it would be 5.45, but this is pointless information when N≥1 is imposed. This is where the second format comes in handy.

The second format means here $I=Ae^{k(N-B)}$. The constants A and B go together in the following way: When N=B, then I=A, given that $e^0=1$. As values of A and B, it is convenient to plug in the anchor point coordinates: $I=4e^{k(N-1)}$. To find k, we again plug in (3.4,1.9): $1.9=4e^{2.4k}$, and again k=-0.31 results. The final result now is $I=4e^{-0.31(N-1)}$.
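Both routes can be retraced with a few lines of Python. This is only a sketch of the two-point fit described above; the variable and function names are my own, and the point (3.4, 1.9) is the visually chosen one near VEN, not a statistical estimate.

```python
import math

anchor = (1.0, 4.0)     # anchor point (N, I)
typical = (3.4, 1.9)    # point close to VEN, read off the graph

# Dividing the two equations cancels A, which leaves k.
k = math.log(typical[1] / anchor[1]) / (typical[0] - anchor[0])

# First format, I = A*exp(k*N): A is the value of I at N = 0.
A = anchor[1] * math.exp(-k * anchor[0])


def first_format(N):
    return A * math.exp(k * N)


def second_format(N):
    """I = 4*exp(k*(N-1)): the anchor point coordinates serve as A and B."""
    return anchor[1] * math.exp(k * (N - anchor[0]))


print("k =", round(k, 4), " A =", round(A, 2))
for N in (1.0, 2.0, 3.4, 5.0, 8.0):
    print(N, round(first_format(N), 3), round(second_format(N), 3))
```

The two columns should agree at every N, and the rows for N=1 and N=3.4 should return the two points we started from.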

It yields exactly the same values as $I=5.45e^{-0.31N}$, but it also keeps the anchor point in evidence. And it was easier to calculate – the value of A was there without any calculations. Given the degree of scatter, no precise fit needs to be attempted. If we wanted to have more precision, a statistical fit, respectful of the anchor point, can be applied – see chapter appendix.

What about a logical model? It is hard even to argue that interest group pluralism should decrease with increasingly consensus-minded politics (of which N is a good indicator). It is even harder to explain why it should decrease at the rate observed (as reflected in k=-0.31). These are questions we should ask, even if we do not reach answers.

If we do not impose the anchor point (1,4), we might just graph the data on semilog paper (or log I vs. N on regular paper) and see if the data cloud is linear or bent. The corresponding data are available in Lijphart (1999: 312-313). Given the wide scatter, probably no curvature emerges. We can then place the best fit line visually, or calculate it, preferably using the symmetric regression approach. My guess is that this exponential curve would be slightly steeper, passing through TRI at the top left and BEL at bottom right.

Frequency of minimal winning and/or one-party cabinets

Figure C6.2 offers a clearer anchor point. Lijphart (1999) distinguishes between majoritarian and consensual types of democracy. The majoritarian ideal is a one-party majority cabinet. The second best preferences are either a one-party minority cabinet or a multi-party minimal winning coalition, i.e. one that has majority and includes no members superfluous for achieving majority. Thus the percentage of minimal winning and/or one-party cabinets (y) is a measure of how much a country tilts toward majoritarianism rather than consensus. (The label in Figure C6.2 omits the important “and/or”.)
This percentage y has clear limits at 0 and 100, while the effective number of parties (N) again reaches upward from 1. The OLS line (y on N) in the original graph (Lijphart 1999: 112) seems to express the relationship quite well. Scatter is quite low ($R^2=0.76$), so what more would a social scientist wish to have?

Let’s check for logical consistency. To begin with, this OLS line predicts a negative y for N larger than 5.7. But there is more. A value of N less than 2 can come about only when one party has absolute majority of seats and thus can form a single-party cabinet. For any N<2, therefore, y must be a clear 100 %, and we can see that this is so indeed for all data points at N<2. Yet the OLS line predicts values of y either less than 100 (for N>1.4) or more than 100 (for N<1.4). We must not predict absurdities.
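To see these absurdities in numbers, one can rebuild an approximate version of the straight line from the two crossing points just mentioned: y=100 at N=1.4 and y=0 at N=5.7. The Python sketch below does so. Note the assumption: the slope and intercept are back-calculated from these two points, they are not coefficients reported by Lijphart, and the helper names are my own.

```python
def line_through(p, q):
    """Slope and intercept of the straight line through points p and q."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    return slope, p[1] - slope * p[0]


# Crossing points quoted in the text, not published regression coefficients.
slope, intercept = line_through((1.4, 100.0), (5.7, 0.0))


def linear_prediction(N):
    """Percentage y predicted by the reconstructed straight line."""
    return intercept + slope * N


for N in (1.0, 1.4, 2.0, 4.0, 5.7, 7.0):
    y = round(linear_prediction(N), 1)
    note = "" if 0 <= y <= 100 else "  <- outside the allowed 0 to 100"
    print(N, y, note)
```

At N=1 the line yields more than 100 percent, at N=2 it already falls below the 100 percent that the concept requires, and at N=7 it turns negative.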

Figure C6.2. Original graph (Lijphart 1999: 112), and addition of an exponential fit with a kink.

Actually, y=100 for N=2 is a very clear-cut conceptual anchor point. With a conceptual bottom at y=0, the simplest format that respects these constraints is again an exponential function that passes through the anchor point: $y=100e^{k(N-2)}$. Fitting the curve to pass through a central data point such as ICE leads approximately to

$y=100e^{-0.5(N-2)}$. [N≥2]

However, it applies only at N≥2. We must complement it with

$y=100$. [N<2]

This means a kink at N=2, and I protested in Chapter C3 against such kinks in fitting data. (Kinks mean discontinuities in slope dy/dx.) Here the situation is different, because a real conceptual discontinuity does occur at N=2: Below this value a single-party majority cabinet is always possible. True, at N slightly above 2, a single-party majority cabinet is also highly likely. We could use a curve that bends down smoothly, roughly through GRE, CAN and US, but its equation would be quite messy. Unless we obtain much better data, and the best exponential fit then imposes reaching y=100 at N slightly larger than 2, we might stick with a kink at N=2.

Once again a more precise statistical fit to the exponential format can be established, but in view of the scatter it doesn’t add much to the picture. A straight line passed through (2,100) might fit the data at N>2 even better than the exponential, but it would predict negative y at high N.

Electoral disproportionality

Figure C6.3 deals with electoral disproportionality (D). This is the difference between vote and seat shares of parties, and I will not describe how it is determined. While previous graphs had something else graphed against N, here Lijphart (1999: 169) has preferred the opposite direction, N against D. If these data had been graphed D vs. N, the OLS line would have looked like the x-on-y line in the bottom graph of Figure C6.3. Either way, $R^2=0.25$, and it’s quite visible that neither OLS line does justice to the data cloud, because the cloud has a clear curvature.

The allowed region again has finite limits on 3 sides, even while it may not look so in the graph. While N can range upward from 1, with no clear upper limit, D can range upward from 0, and no higher than 100 percent. This limit just isn’t visible in the graph. But the situation is more fluid than that. Consider the situation at N=1, meaning that a single party has won all the seats. What level of disproportionality could be expected? If it also had all the votes, D would be zero. At the other extreme, D would be 100 % only if that party obtained no votes at all.
It is argued in Taagepera and Shugart (1989: 109-110) that the average outcome can be expected to be around D=25 %.

At this point, the graphical representation may be confusing. In the previous two examples the three-sided allowed region was open to the right, while now it is open to the top (and the limit at D=100 on the right is not visible). You may find it easier to turn the page by 90 degrees, so that the D axis goes up.

Taking N=1, D=25 as an anchor point and D=0 as the conceptual bottom (which is vertical in the actual graph!) leads to the exponential format $D=25e^{k(N-1)}$. Picking NOR as a typical central point leads to k=-0.66 so that

$D=25e^{-0.66(N-1)}$.

Here this fit is clearly more satisfactory than any linear fit could be.

How close are we to a logical model? The anchor point is somewhat fluid. It makes sense that higher N leads to lower D, given that higher N tends to reflect more proportional electoral rules. Also, as N increases, its further impact on D should gradually become milder. In this respect the exponential pattern makes conceptual sense. Actually, detailed modeling (Taagepera 2007: 232) suggests a more complex pattern. Besides N, D would also depend on the total number of seats in the assembly.

Figure C6.3. Original graph (Lijphart 1999: 169), and addition of exponential fit with a vague anchor point.
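The three fits proposed in this chapter can be placed side by side in a short Python sketch, so as to check that none of them predicts absurdities. The function names and the N values at which the curves are evaluated are my own choices; the constants are the approximate, visually fitted ones given in the text.

```python
import math


def pluralism(N):
    """Interest group pluralism: I = 4*exp(-0.31*(N-1)), anchored at (1, 4)."""
    return 4 * math.exp(-0.31 * (N - 1))


def minimal_winning(N):
    """Percent of minimal winning and/or one-party cabinets, with a kink at N = 2."""
    if N < 2:
        return 100.0
    return 100 * math.exp(-0.5 * (N - 2))


def disproportionality(N):
    """Electoral disproportionality: D = 25*exp(-0.66*(N-1)), anchored at (1, 25)."""
    return 25 * math.exp(-0.66 * (N - 1))


print("N, I, y, D")
for N in (1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 10.0):
    print(N, round(pluralism(N), 2), round(minimal_winning(N), 1),
          round(disproportionality(N), 2))
```

However large N becomes, all three values approach their conceptual bottom at zero without ever crossing it, which no straight line could achieve.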

_________________________________________________________________
Exercise C6.1
At the ECPR (European Consortium for Political Research) Workshops in Rennes 2008, Rense from Nijmegen told me the following. They help the unemployed to write better CVs and hence be more successful in seeking jobs. They think they do something socially useful, but they are told that this isn’t so. They do not affect the number of jobs available. If they help one person to get a job, then simply someone else fails to get this job. They supposedly just help one person at the expense of another, and the social benefit is nil. Rense and his colleagues disagree but find it hard to respond. So who is right, or is there a third way out? Does it make sense to help the unemployed to write better CVs? Close your eyes for the moment and ponder: How would I tackle this issue? (If you omit this step you might lose more than you think!) Scribble down a few ideas. And now proceed to the following.
a) Which variables are involved? OK, the number of positions filled (F) is one variable. What else? No idea? Well, over which variable does F change?
b) Label this other variable T. Sketch a graph, F vs. T.
c) Introduce a simplifying assumption. Assume that initially all positions the economy needs ($F_M$) are filled ($F=F_M$) but then a recession sets in and some workers lose their jobs, so that the number of filled positions drops to $F_R$. Later, at T=0, the recession suddenly goes away and $F_M$ positions again need to be filled, but only $F_R$ positions are filled. Enter these levels ($F_M$ and $F_R$) on your graph, at T=0. Prior to T=0, $F=F_R$. (Actually, recessions recede only slowly. But taking this into account would make the issue pretty complex right at the start. Better assume instantaneous recovery, for the moment, and leave the more complex situation for later.)
d) Also enter the ceiling, i.e., the level $F_M$ toward which the employers try to raise F during the period T>0. What is the simplest model to express that the number of positions filled tends toward the number of positions available the faster, the larger the gap between the two? Offer the format of the corresponding equation (with no specific numbers), and sketch the approximate graph.
e) Why aren’t the vacant positions filled instantaneously? It’s because all employers and job seekers suitable for each other do not meet. The mutual fitting together takes time.
How would this time change for those who present themselves with improved CVs? Enter (always on the same graph) the curve for those with improved CVs.
f) Which curve would tend to be followed immediately after T=0? To which curve would it shift later on?
g) Then how do those who help to write better CVs alter the situation? Are they of any help to job seekers as a group? Are they of any help to the economy as such?
h) Would these conclusions change if we made the description more realistic by assuming that the job market expands from $F_R$ to $F_M$ only gradually?
i) And now to the most important step. Stop for a while and think back: How would you have tackled this issue on your own, compared to what I have put you through? Would you agree with my approach? If you don’t, fine – as long as you take me as seriously as you take yourself. At the end of the task, just enter: “Yes, I have pondered it.”
________________________________________________________________________
Exercise C6.2
At the ECPR (European Consortium for Political Research) Workshops in Rennes 2008, Staticia presented a paper that connected the subventions (S) paid by the European Union to various Euroregions to the per capita GNP of those regions (G). She graphed S against logG, found a pretty nice linear fit, and concluded: support decreases as wealth increases. However, the model inherent in this fit violates at both extremes the maxim “Models must not predict absurdities”. Explain the nature of the absurdities that result, and offer a better model. The intervening steps are as follows.
a) Show logG on the x axis, G being GNP per capita in euros. This means marking 1, 10, 100, 1000 and 10,000 on the axis, at equal intervals.
b) Show subventions S on the y axis, S being in units of million euros. This means marking 0, 100, 200, 300… on the axis, at equal intervals.
c) Enter a straight line at a reasonable negative slope. This is how Staticia’s graph looked, with data points crowded around it.
d) Now statistics must yield to thinking. What is the subvention predicted by this straight line at extremely large GNP/cap? Is it possible?
e) And how large would the subvention be when a region’s GNP/cap is zero? Could the European Union afford to pay it? (Hint: Where on the x axis do you place 1, 0.1, 0.01, 0.001, …?)
f) What shape does Staticia’s straight line assume when we graph S vs. G, rather than S vs. logG? Hint: Graph S vs. G, starting both from 0, and enter the predictions from parts (d) and (e).
g) Which model would you recommend in view of the following: (1) neither S nor G can be negative; (2) when G=0, then S is some pretty high figure; and (3) when G increases, then S decreases. Sketch the corresponding curve on the graph S vs. G. (NOT S vs. logG!)
h) Write the simplest equation that satisfies those conditions.
i) If Staticia wishes to run a linear test for this logically supported model, which quantities should she graph? Compare to the quantities she did graph (S and logG).
j) But Staticia did get a satisfactory statistical fit, by social science norms. How is this possible, if she graphed in an illogical way? That’s an unfair question, unless I also give you her data so you can visualize it. No need to answer.
NOTE: Actually, what we have expressed here should stand for per capita subvention. For total subvention, the population of the region should also be taken into account, not only its wealth. Staticia could ignore it only because the regions have roughly similar populations.

__________________________________________________________
The slope of the exponential function
The simplest exponential function is $y=e^x$. This is what we get when we plug A=1, k=1 and B=0 into $y=Ae^{k(x-B)}$. It has the unique property that, at any value of x, its slope equals the function itself: $d(e^x)/dx=e^x$. Apart from a constant multiplier, this is the only function to have this property. Indeed, require that the slope must equal the function (and that the function equals 1 at x=0), and the function $f(x)=e^x$ results – and this is what defines the value of e in the first place.


This is when k=1. With any other value of k, the slope is proportional to the function itself: $d(Ae^{k(x-B)})/dx=kAe^{k(x-B)}$. This is why the curve approaches zero without ever reaching it: As y becomes small, so does its further decrease.
Whenever something grows or decreases in proportion to its existing size, it follows the exponential pattern. Growth of bacterial colonies or human populations and growth of capital at fixed interest follow this pattern, with a positive k. The amount of radioactive material decreases according to this pattern, with a negative k. Indeed, the exponential function pops up whenever some variable x can take any value, from minus to plus infinity, while the connected variable y can take only positive values. This makes it a most prevalent function in natural and social phenomena, on a par with the linear $y=a+bx$ and the fixed exponent $y=A(x/B)^k$ that can be reduced to $Y=X^k$.
In the refined model for volatility (Chapter A6), the approach to a ceiling (of 100) replaces the approach to the floor (0). Just replace V with y=100-V, and you are back to the previous situation. The exponential pattern still applies.

Appendix: How to determine the value of k in the exponential equation
This section is an earlier version of “Fitting an exponential curve to interest pluralism data”, kept for the moment, just in case. It uses a slightly different notation.

The general format of the exponential equation is $y=Ye^{k(x-X)}$. Once you have plugged in the coordinates of an anchor point, such as $y=4e^{k(x-1)}$ in the case of I and N in Figure C6.1, pick a data point that looks representative of the center of the data cloud, such as VEN (3.4; 1.9). Feed in its coordinates: $1.9=4e^{k(3.4-1)}=4e^{2.4k}$. Divide by Y: $1.9/4=e^{2.4k}$, hence $e^{2.4k}=0.475$. Take the natural logarithm on both sides. The natural logarithm of x is designated as ln x, while log x usually stands for the logarithm based on 10. By definition, ln e=1. So 2.4k=ln 0.475=-0.744. Hence k=-0.744/2.4=-0.31.
Instead of a numerical example, let’s play it through in an abstract format. Feed any two points (X;Y) and (x;y) into $y=Ye^{k(x-X)}$. Then k=ln(y/Y)/(x-X). On a pocket calculator, ln y is usually obtained by entering the value of y and then pushing “LN”. (Some calculators, though, require LN first, and then the value of y.)
The obverse problem is to calculate $y=Ye^{k(x-X)}$ for various values of x, so as to have points through which the curve can be passed. To calculate $y=Ye^{k(x-X)}$, calculate first k(x-X). Then push the key “e^x” or “exp x”, push “multiply”, enter the value of Y and push “=”.
Rather than bet on one single central point, a better exponential fit to a data set is obtained as follows. Calculate ln y for all data points. Graph ln y vs. x. To the extent the exponential fit applies, the pattern would look linear. Pass the best fitting line through it. If there is an anchor point, force the line to go through it. The slope of this line is k, and its height at x=X is ln Y. To read off the slope, pick two points on that line, (X;Y) and (X’;Y’) in the original units of y, and use k=ln(Y’/Y)/(X’-X).
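The calculator steps above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and using least squares for the "best fitting line" forced through the anchor point is an assumption, since the text does not say how that line is to be chosen.

```python
import math

def k_from_two_points(anchor, point):
    """k in y = Y*exp(k*(x - X)), from anchor (X, Y) and one point (x, y)."""
    X, Y = anchor
    x, y = point
    return math.log(y / Y) / (x - X)

def k_through_anchor(anchor, points):
    """Slope of ln(y) vs. x for a line forced through (X, ln Y).

    Assumption: 'best fitting' is taken to mean least squares on ln(y).
    """
    X, Y = anchor
    num = sum((x - X) * (math.log(y) - math.log(Y)) for x, y in points)
    den = sum((x - X) ** 2 for x, _ in points)
    return num / den

def predict(anchor, k, x):
    """The obverse problem: y = Y*exp(k*(x - X)) for a given x."""
    X, Y = anchor
    return Y * math.exp(k * (x - X))

anchor = (1.0, 4.0)   # anchor point used in the text
ven = (3.4, 1.9)      # the representative data point VEN from the text
k = k_from_two_points(anchor, ven)
print(round(k, 2))    # -0.31
```

With a single data point the forced least-squares slope reduces to the two-point formula, so the two functions agree on the VEN example.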


C7. Moderate Districts, Extreme Representatives: Complex Fits
___________________________________________________________________
• When both x and y are conceptually limited to the range from 0 to 1, with anchor points (0,0) and (1,1), the simplest fit is with $Y=X^k$. When a further anchor point is imposed at (0.50,0.50), the simplest fit is with $Y=X^k/[X^k+(1-X)^k]$.
• When this third anchor point is shifted away from (0.50,0.50), the simplest fit becomes even more complex.
• Different logical approaches sometimes can be applied to the same problem and may fit data about equally well. One still has to ponder which one makes more sense.
• All other things being equal, a model with no adjustable parameters is preferable to one with such parameters. All other things being equal, a smooth model is preferable to a kinky one.
________________________________________________________________
From this point on, we reach situations with more than two constraints, and the models involve more complex equations. Moreover, these equations are not explained in as much detail as was the case for exponential and fixed exponent equations. Why not? You are less likely to encounter them in further research. But if you do, it’s good to recognize the general patterns.
Our starting point is a graph in Russell Dalton’s Citizen Politics (2006: 231) reproduced here as Figure C7.1. It shows the degree of conservatism of representatives (on y-axis) as compared to the conservatism of their districts (on x-axis) in US House elections. The book explains the variables in the graph and adds: “There is a very strong congruence between district and representative opinions (r=.78) [thus $R^2$=0.61], as one would expect if the democratic process is functioning”. Yes, we can see that “When x goes up, y goes up”. But what else is there to be seen?
A puzzle emerges as soon as we also graph the equality line, y=x, as shown in Figure C7.2. A disconnect between district and representative opinions becomes evident, except near the 50-50 point. In conservative districts, representatives are more conservative than their average constituents, and in liberal districts (in the US meaning of “liberal”), representatives are more liberal than their average constituents. How do they differ from their constituents in their degree of conservatism/liberalism, and why precisely to such a degree?
By now you should be used to what follows. Delineate the entire conceptually allowed area. It both reduces and expands on the field in the original graph, as x and y can both range from 0.00 to 1.00. Next, consider conceptual anchor points. In a 100 percent conservative district, the representative has a strong incentive to vote 100 percent conservative, while in a 100 percent liberal district, the representative has a strong incentive to vote 100 percent liberal. No such data points can be seen, but the most extreme data points do approach the corners (0,0) and (1,1). Assuming continuity (or at least quasi-continuity), the mean curve should start at (0,0) and proceed to (1,1). All this is also shown in Figure C7.2. So far, this pattern repeats what we have seen in previous examples. Here, however, there is a third anchor point, half-way up.

Figure C7.1. Original graph (Dalton 2006:231): Data alone.

The simplest fit of data to conceptual constraints
How would a representative tend to vote in a 50-50 district? It depends on how the details of measuring a representative’s position are set up. But in the absence of any further information (the ceteris paribus principle), we have no reason to guess at more than 0.50 or at less than 0.50. So the only unbiased expectation is that the curve passes through the point (0.5,0.5). It agrees roughly with the data cloud.


Note that the data cloud itself appears somewhat different once the equality line, forbidden areas and anchor points are introduced. The free-floating bent blob of Figure C7.1 appears in Figure C7.2 as visibly squeezed between the y=x and y=0 lines at low x. The same applies more roughly above the equality line, at high x. The cloud rises sharply around the central anchor point.

Figure C7.2. Original graph (Dalton 2006:231) plus conceptually allowed region, equality line and 3 anchor points.

When x and y can range from 0 to 1, and three conceptual anchor points impose themselves – (0,0), (0.5,0.5) and (1,1) – the simplest family of curves passing through these anchor points is (Taagepera 2008: 107-109)
$Y = X^k/[X^k+(1-X)^k]$. [3 anchor points, no bias]
Here parameter k can take any positive value. This equation can also be expressed more symmetrically as
$Y/(1-Y) = [X/(1-X)]^k$.


When k=1, we obtain Y=X – a straight line. It would mean that the representatives are as extreme as their districts. Values of k exceeding 1 lead to curves in the shape of a “drawn-out s”. It would mean that the representatives are more extreme than their districts. This is what we observe. (Values of k less than 1 lead to curves in the shape of a “compressed s”. It would mean that the representatives are more moderate than their districts.) The parameter k expresses the steepness of the slope at X=0.5. For the given data, k=3.0 is close to best fit. Figure C7.3 shows the corresponding curve. It is visually close to what might look like a best fit curve. We will not go here into the messy issue of best statistical fit for nonlinear patterns.

Figure C7.3. Previous graph and further addition of a smooth model based on continuity between three anchor points, fitted to the data.
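The role of k can be checked numerically. The sketch below evaluates the unbiased three-anchor-point family, confirms that it passes through (0,0), (0.5,0.5) and (1,1), and estimates the slope at X=0.5 by a small finite difference. The function names and the step size h are illustrative choices, not part of the original text.

```python
def smooth_curve(X, k):
    """Y = X^k / [X^k + (1-X)^k], for 0 <= X <= 1 and k > 0."""
    return X ** k / (X ** k + (1.0 - X) ** k)

def central_slope(k, h=1e-6):
    """Finite-difference estimate of dY/dX at X = 0.5."""
    return (smooth_curve(0.5 + h, k) - smooth_curve(0.5 - h, k)) / (2.0 * h)

for k in (0.5, 1.0, 3.0):
    anchors = [smooth_curve(X, k) for X in (0.0, 0.5, 1.0)]
    print(k, anchors, round(central_slope(k), 3))

# For k=3, a district at X=0.4 corresponds to a representative near Y=0.23,
# i.e. more extreme than the district, as the text describes.
print(round(smooth_curve(0.4, 3.0), 3))  # 0.229
```

The estimated central slope comes out equal to k itself, which is one way to read the statement that k expresses the steepness at X=0.5.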


Why do the representatives’ positions vary as they do?
We have described the data, connecting it to a mathematical pattern imposed by conceptual anchor points. This is the simplest pattern possible without doing violence to anchor points. More complex patterns might offer a better fit, but it could be only slightly better, given the appreciable scatter of data points. This means that we cannot easily escape the general form $Y=X^k/[X^k+(1-X)^k]$. But what is it that imposes k=3 rather than k=1 (equality line Y=X) or k=2 (a shallower central slope) or k=4 (a still steeper slope)?

Figure C7.4. Graph in Figure C7.2, plus a kinky model based on extremes of representative behavior and their means.

Simplify the issue as much as you can, as a first approximation; leave more realistic complexities to second and later approximations. Consider a two-party constellation with a purely Conservative Party and a purely Radical Party. The US Republican and Democratic Parties are quite imperfect approximations to such ideologically pure parties. Under such conditions we may assume that all districts with


founded upper and lower values. Its predictions do not depend on any free parameter. Its central slope is slightly shallower than the observed one.

Figure C7.5. Graph in Figure C7.2, plus combination of two kinky models, based on different ways to take means.

Figure C7.6 compares the two models. Note that both satisfy the three anchor points. The first model has the advantage of continuity of slope. Nature rarely offers sharp kinks like those in the second model. However, at very low and high X, the kinky model agrees better with data. In the center it does less well than the continuous model, but it is unfair to compare a model with no adjustable parameters to one that has one. Of course the model with an adjustable parameter can do better – but that still leaves the need to explain why the parameter has the value it has. For a model with no adjustable parameter, the kinky model does remarkably well – and nothing further is left to be explained. [NB! In my Figure the central section of the kinky model is drawn a bit steeper than it should be. This means that the fit to data is a bit less good than it looks in the graph!]


Thus both models have advantages. Testing with further data (other US Congresses and other two-party parliaments) might show how stable the central pattern is. Variations in the parameter value of the continuous model might also cast some light on what it depends on. The US House has lately been relatively highly polarized. How would the pattern differ in earlier, less polarized times? It should be kept in mind that the kinky model presumes clearly radical and clearly conservative parties. This presumption breaks down for the times the Democrats had a large conservative southern wing.

Figure C7.6. Comparison of smooth and kinky models.

The envelopes of the data cloud
The envelopes of a data cloud are the curves that express this cloud’s upper and lower limits. Here the cloud is visibly widest at central values of X and narrows down at extreme values. Like the median curve, the upper and lower envelopes tend to follow a drawn-out s shape, starting at (0,0) and ending at (1,1). However, they do so with a bias: They reach the mean level of Y (i.e., Y=0.50) around X=0.35 and X=0.65, respectively, meaning a bias of B=-0.15 and B=+0.15 compared to the unbiased X=0.50.

The simplest family of curves passing through these anchor points is (Taagepera 2008: 109-110)
$Y = X^{bk}/[X^{bk}+(1-X^b)^k]$, [3 anchor points, bias exponent b]
which can also be expressed more symmetrically as
$Y/(1-Y) = [X^b/(1-X^b)]^k$.
Here the exponent b is connected to bias B as b=-log2/log(0.5+B). For an unbiased system, b=1. For B=-0.15 and B=+0.15, b=0.66 and b=1.61, respectively. We should keep the same value k=3 that we used for the mean curve, unless the data strongly hints at something else.

Figure C7.7. Envelope curves.
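The relation between b and B follows in one step from the equation. The derivation below assumes, as the text implies, that B is the horizontal shift of the point where the curve crosses Y=0.50, so that the crossing occurs at X=0.5+B, and that k is positive.

```latex
\begin{align*}
Y = \tfrac{1}{2}
  &\iff X^{bk} = (1 - X^{b})^{k}
   \iff X^{b} = 1 - X^{b}
   \iff X^{b} = \tfrac{1}{2}, \\
X = 0.5 + B
  &\implies b \,\log(0.5 + B) = \log\tfrac{1}{2} = -\log 2, \\
b &= -\frac{\log 2}{\log(0.5 + B)} .
\end{align*}
```

The crossing point does not depend on k, which is why the same k can be kept for the mean curve and for both envelopes while only b changes.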

Figure C7.7 shows the resulting envelope curves, along with the average continuous curve. Fairly few data points remain outside the area they delimit, and fairly few blank areas remain inside. [Extending the bias range to B=±0.19 would make the area include almost all outliers. Maybe it should be done during a revision.] Both the average trend and the limits of the data cloud are expressed reasonably well by
$Y = X^{3b}/[X^{3b}+(1-X^b)^3]$,
where b=1 for the average curve and b=0.66 and b=1.61, respectively, for the upper and lower envelope curves. Table C7.1 shows some numerical values for the three curves.

Table C7.1. Values of Y for upper envelope, mean pattern and lower envelope
_________________________________
                   Y
  X       upper    mean    lower
_________________________________
  0       0        0       0
  0.1     .020     .001    .000
  0.2     .125     .015    .001
  0.3     .353     .073    .005
  0.4     .631     .229    .025
  0.5     .835     .500    .104
  0.6     .939     .771    .325
  0.7     .982     .927    .682
  0.8     .996     .985    .925
  0.9     1.000    .999    .994
  1       1        1       1
_________________________________
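The three columns of Table C7.1 can be recomputed from the envelope equation. The sketch below is a hedged illustration: the function names are hypothetical, and b is rounded to two decimals as in the text. The recomputed values agree with the printed table to within about 0.01; the small differences in the upper column at low X presumably reflect rounding in the original calculation.

```python
import math

def bias_exponent(B):
    """b = -log 2 / log(0.5 + B), with B the shift of the Y=0.50 crossing."""
    return -math.log(2.0) / math.log(0.5 + B)

def biased_curve(X, k=3.0, b=1.0):
    """Y = X^(bk) / [X^(bk) + (1 - X^b)^k]; b=1 gives the unbiased curve."""
    Xb = X ** b
    return Xb ** k / (Xb ** k + (1.0 - Xb) ** k)

b_upper = round(bias_exponent(-0.15), 2)  # crossing at X=0.35
b_lower = round(bias_exponent(+0.15), 2)  # crossing at X=0.65

rows = []
for i in range(11):
    X = i / 10
    rows.append((X,
                 biased_curve(X, b=b_upper),  # upper envelope
                 biased_curve(X),             # mean curve
                 biased_curve(X, b=b_lower))) # lower envelope

print("X     upper   mean    lower")
for X, upper, mean, lower in rows:
    print(f"{X:.1f}   {upper:.3f}   {mean:.3f}   {lower:.3f}")
```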


Conclusions
This example has two purposes, one methodological and one substantive. On the methodological side, it has shown that much more could be extracted from published data than its previous publication did. The data could be described more specifically than just “When x goes up, y goes up”. This more specific description offered a better starting point for asking why the data followed this particular pattern.
Recall the following. When both x and y are conceptually limited to the range from 0 to 1 (or are normalized to fit that range), with anchor points (0,0) and (1,1), then the simplest model fit is with $Y=X^k$. When a further anchor point is imposed at (0.50,0.50), the simplest fit is with $Y=X^k/[X^k+(1-X)^k]$. When this third anchor point is shifted away from (0.50,0.50), the simplest fit becomes even more complex. Different logical approaches sometimes can be applied to the same problem and may fit data about equally well. One still has to ponder which one makes more sense. All other things being equal, a model with no adjustable parameters is preferable to one with such parameters. All other things being equal, a smooth model is preferable to a kinky one.
On the substantive side, this more detailed data analysis leads to a logically based answer to the question “Why are representatives more extreme than their districts?” It also leads to a more specific follow-up question: “Why are they more extreme by the extent they are, and not more or less so?” Here our kinky model offers a clear answer that does not quite fit. The continuous model can be made to fit better, thanks to its adjustable parameter, but the specific value of this parameter remains to be explained.


C8. Centrist Voters, Leftist Elites: A Cautionary Note
___________________________________________________________________
• “Always” start scales from zero, not from “1”.
• Among competing models, the one that covers a wider range of phenomena carries it.
________________________________________________________________
The next graph in Dalton (2006: 233), reproduced in Figure C8.1, seems to represent much the same issue as in Figure C7.1. Instead of representatives and districts, we have elites and voters. Instead of conservatism in the US, we have opinions on Left/Right scale in Europe. At first glance, we may see a quite similar upward curve in both figures. This time, the equality line has already been inserted in the original figure, and like in Figure C7.2, some data points rise above this line on the right. But there are differences.

Figure C8.1. Original graph (Dalton 2006:233): Data plus equality line.


First, take a quick look at Figure C8.1, turn away, and answer the two following questions. If you wish to place yourself at the center of the Left/Right scale, which score would you offer? Also: How many scale intervals are there on the x scale?
If you said 10 scale intervals, look again and count them. There are 9 intervals, because the lowest possible rating is 1, not 0. If you placed yourself at 5, you didn’t declare yourself a pure centrist but ever-so-slightly Left-leaning, because the arithmetic mean of 1 and 10 is 5.5, not 5.0. One can suspect that many respondents may have placed themselves at 5 when they really meant 5.5. For them, the space from center to extreme Left would extend over 4 scale intervals, while the space from center to extreme Right would extend over 5 intervals. It would introduce some distortion – and can we be sure that voters and elites would make this mistake in equal proportions? The relationship between their opinions may be a bit distorted.
We already met this distortion in Chapter C2, and the formula for converting to scale 0 to 1 was given. Here, there’s cause to repeat the advice: “Always” start scales from zero, not from “1”. Still, “always” is in quotation marks, because exceptions occur. For one, you cannot start a log scale from zero.
In the present case, if we continued with the scale 1 to 10, we would continuously be confused, because models such as $Y=X^k$ presume that X and Y start from 0. We had better convert immediately to a scale X that runs from 0 to 1, and similarly for Y. Chapter C2 showed how. Designate the existing scale as (lower-case) x. To obtain X, we must subtract 1 from x, so as to bring the origin to zero. Then we must divide by 9, so as to reduce 10-1=9 to 1. Hence X=(x-1)/9. Similarly, Y=(y-1)/9. These scales are introduced in Figure C8.2. We further demarcate, of course, the allowed region and two anchor points (at 0,0 and 1,1), and extend the equality line to the right top corner.
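The conversion X=(x-1)/9 is a special case of shifting the lowest rating to zero and dividing by the range. A minimal sketch follows; the function name and the default bounds are illustrative, matching the 1-to-10 scale discussed here.

```python
def rescale(x, low=1.0, high=10.0):
    """Map a rating x on [low, high] to X on [0, 1]: X = (x - low)/(high - low)."""
    return (x - low) / (high - low)

print(rescale(1))            # 0.0  (extreme Left)
print(rescale(10))           # 1.0  (extreme Right)
print(rescale(5.5))          # 0.5  (the true center of the 1-to-10 scale)
print(round(rescale(5), 3))  # 0.444 (a self-placement at 5 is slightly Left)
```

The last line makes the distortion visible: a respondent who answers 5 while meaning "center" lands at 0.444 rather than 0.5 on the converted scale.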
Now we may notice a difference, compared to the previous example. It is made even more evident in Figure C8.2 by drawing in the vertical and horizontal lines passing through the center point (0.5,0.5). This center point is outside the data cloud. The data are biased toward the right. It means that, with 3 exceptions, party elites are left of the voters – even in right-wing parties! Instead of a symmetric curve, we have a biased one, like the lower envelope curve in Figure C7.7. We can fit the data to the previous equation that includes bias:
$Y = X^{bk}/[X^{bk}+(1-X^b)^k]$. [3 anchor points, bias exponent b]
The equation has two adjustable parameters, so we have to fit with two points in between 0 and 1. Let us fit a central location in the data cloud on the right, such as a spot below the point labeled CSU, and another on the left, such as in between the points labeled PS and LAB. By trial and error, we reach approximately b=1.73 and k=1.04. This k barely differs from 1.00, which would correspond to proportionality in the absence of bias. The corresponding curve is shown in Figure C8.2. It still surpasses the equality line at very high values of X (like in Figure C7.3), but to such a limited extent that it cannot be seen.
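The trial-and-error search for b and k can be made systematic. In the symmetric form, logit(Y) = k·logit(X^b), so for any trial b each of the two chosen points implies a value of k; the sketch below bisects on b until the two implied values agree. This is only one possible approach, not the author's procedure. The coordinates of the two fitting points are not given in the text, so the points used here are hypothetical: they are generated from the quoted values b=1.73 and k=1.04 merely to show that the procedure recovers them. The bracket for b must not contain a value where X^b equals 0.5 for either point, since logit(X^b) is zero there.

```python
import math

def logit(p):
    """Log-odds ln[p/(1-p)], for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def biased_curve(X, k, b):
    """Y = X^(bk) / [X^(bk) + (1 - X^b)^k]."""
    Xb = X ** b
    return Xb ** k / (Xb ** k + (1.0 - Xb) ** k)

def k_given_b(point, b):
    """k implied by one point (X, Y) for a trial value of b."""
    X, Y = point
    return logit(Y) / logit(X ** b)

def fit_two_points(p1, p2, b_low=1.0, b_high=3.0, steps=60):
    """Bisect on b until both points imply the same k; return (b, k)."""
    def gap(b):
        return k_given_b(p1, b) - k_given_b(p2, b)
    lo, hi = b_low, b_high
    if gap(lo) * gap(hi) > 0:
        raise ValueError("no sign change in the bracket; shift or widen it")
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    b = 0.5 * (lo + hi)
    return b, k_given_b(p1, b)

# Hypothetical fitting points, placed on the curve with b=1.73 and k=1.04.
p_left = (0.3, biased_curve(0.3, 1.04, 1.73))
p_right = (0.8, biased_curve(0.8, 1.04, 1.73))
b_fit, k_fit = fit_two_points(p_left, p_right)
print(round(b_fit, 2), round(k_fit, 2))  # 1.73 1.04
```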


Figure C8.2. Original graph (Dalton 2006:233) plus corrected scales and a model including bias.

This is not the statistical best fit of $Y=X^{bk}/[X^{bk}+(1-X^b)^k]$ to data, but it is sufficiently close to get the picture. Given the degree of scatter, we would be as well off with a fit to $Y=X^k$, statistically speaking – and it would be much simpler. Then why don’t we do it? Because we would lose comparability with the previous case – and comparisons are important for theory building. If we switched to a different model, we’d imply that the European and the US situations are not comparable – and this would be the end of theorizing. If we stick to the same general format, however, we imply that similar processes might be involved. Then we can decompose the question into several. Why is it that the US picture is unbiased while the European elites pull toward the Left (b=1.00 vs. b=1.73)? Why is it that the swing at the crossing point is high for US elites and almost nil for the European (k=3 vs. k=1.04)? Or are these two features related? Is it a difference of countries or a difference in the question asked (voting on issues vs. opinion)? As further data of a roughly similar type are collected, we may be able to answer such questions.
In contrast, it would be a dead end if we limited ourselves to the original graphs (Figures C7.1 and C8.1). We’d simply observe that “y increases with increasing x” and that’s it. Not only would we not have answers – we wouldn’t even have questions begging for answers.
What about our kinky model in the previous chapter? It plainly fails here, on the right side of the field. This is instructive about competing models. The one that applies over a wider range of phenomena carries it. Here neither model from the previous chapter works in a direct way. But the continuous model can be adjusted, by introducing a bias parameter. I see no way to adjust the kinky model.
_______________________________________________________________________
Exercise C8.1
The graphs below have the same format as Figure C8.1: elite opinion vs. voter opinion. They come from an earlier book by Russ Dalton (1988) and include 14 parties in France, Germany and UK. Unlike Figure C8.1, they present opinions on specific issues. The first graph has abortion (Dalton 1988: 215). In the second I have superimposed graphs on three issues where the patterns look rather similar: nuclear energy; further nationalization of industry; and aid to third world nations (Dalton 1988: 214, 216, 217).


a) What type of models could be used here?
b) Specify the assumptions on which this model is based.
c) Do everything you can, approximately – no detailed data fitting expected.
d) Compare to any previous graphs you deem relevant.
_____________________________________________________________________


C9. Toward Social SCIENCE


References
Anscombe, Francis J. (1973) Graphs in statistical analysis. The American Statistician 27: 17-21.
Dalton, Russell J. (1988) Citizen Politics in Western Democracies. Chatham, NJ: Chatham House.
Dalton, Russell J. (2006) Citizen Politics. Washington, DC: CQ Press.
Dalton, Russell J. and Shin, Doh Chull (2006) Citizens, Democracy, and Markets Around the Pacific Rim: Congruence Theory and Political Culture. Oxford: Oxford University Press.
Hawking, Stephen (2010) The Grand Design.
Heath, Oliver (2005) Party systems, political cleavages and electoral volatility in India: A state-wise analysis, 1998-1999. Electoral Studies 24: 177-99.
Johnston, Richard, Hagen, Michael G. and Jamieson, Kathleen H. (2004) The 2000 Presidential Election and the Foundations of Party Politics. Cambridge: Cambridge University Press.
Kvålseth, Tarald O. (1985) Cautionary note about R2. The American Statistician 39: 279-85.
Lijphart, Arend (1984) Democracies: Patterns of Majoritarianism and Consensus Government. New Haven, CT: Yale University Press.
Lijphart, Arend (1994) Electoral Systems and Party Systems. Oxford: Oxford University Press.
Lijphart, Arend (1999) Patterns of Democracy: Government Forms and Performance in Thirty-Six Countries. New Haven, CT: Yale University Press.
McCloskey, ….. (2009) The cult of statistical significance. <http://www.amazon.com/Cult-Statistical-Significance-EconomicsCognition/dp/0472050079/ref=sr_1_1?ie=UTF8&s=books&qid=1265515998&sr=8-1>
Norris, Pippa, and Inglehart, Ronald (2003) Islamic Culture and Democracy: Testing the “Clash of Civilizations” Thesis, pp. 5-33 in Ronald Inglehart, editor, Human Values and Social Change. Leiden & Boston: Brill.
Stein, James D. (2008) How Math Explains the World. New York, NY: HarperCollins/Smithsonian Books.
Taagepera, Rein (1976) Why the trade/GNP ratio decreases with country size. Social Science Research 5: 385-404.
Taagepera, Rein (1979) People, skills and resources: An interaction model for world population growth. Technological Forecasting and Social Change 13: 13-30.
Taagepera, Rein (1997) Expansion and contraction patterns of large polities: Context for Russia. International Studies Quarterly 41: 475-504.
Taagepera, Rein (1999) The Finno-Ugric republics and the Russian state. London: Hurst.
Taagepera, Rein (2007) Predicting Party Sizes: The Logic of Simple Electoral Systems. Oxford: Oxford University Press.
Taagepera, Rein (2008) Making Social Sciences More Scientific: The Need for Predictive Models. Oxford: Oxford University Press.
Taagepera, Rein (2010) Adding Meaning to Regression. European Political Science 10: 73-85.
Taagepera, Rein and Sikk, Allan (2007) Institutional determinants of mean cabinet duration. Early manuscript for Taagepera and Sikk (2010).
Taagepera, Rein and Sikk, Allan (2010) Parsimonious model for predicting mean cabinet duration on the basis of electoral system. Party Politics 16: 261-81.
Taagepera, Rein and Hayes, James P. (1977) How trade/GNP ratio decreases with country size. Social Science Research 6: 108-32.
Thorlakson, Lori (2007) An institutional explanation of party system congruence: Evidence from six federations. European Journal of Political Research 46: 69-95.

