
by Michel Jambu

Length: 432 pages (4 hours)

With a useful index of notations at the beginning, this book explains and illustrates the theory and application of data analysis methods, from univariate to multidimensional, and shows how to learn and use them efficiently. It is well illustrated and offers a well-documented review of the most important data analysis techniques. It describes, in detail, exploratory data analysis techniques from the univariate to the multivariate ones. It features a complete description of correspondence analysis and factor analysis as multidimensional statistical data analysis techniques, illustrated with concrete and understandable examples. It includes a modern and up-to-date description of clustering algorithms and their many properties, which gives clustering a new role among data analysis techniques.

Publisher: Academic Press. Released: Sep 9, 1991. ISBN: 9780080923673. Format: book.


**Exploratory and Multivariate Data Analysis **

First Edition

Michel Jambu

*National Centre for Telecommunications Studies, Paris, France *

ACADEMIC PRESS, INC.

Harcourt Brace Jovanovich, Publishers

Boston San Diego New York

London Sydney Tokyo Toronto

**Why this book? **

**What is in this book? **

**For whom is the book written? **

**What is the prerequisite knowledge needed? **

**Acknowledgments **

**1 Introduction **

**2 Examples of Applications **

**3 Steps in Data Exploration: Management, Analysis, Synthesis **

**4 Computer Aspects **

**1 Statistics **

**2 Fields of Statistical Data Exploration **

**3 Statistics and Experiments **

**4 Data Analysis, Inductive and Deductive Statistics **

**5 Variables, Statistical Sets, and Data Sets **

**1 Introduction **

**2 1-D Analysis of a Quantitative Variable **

**3 1-D Analysis of a Categorical Variable **

**4 1-D Analysis of a Categorical Variable with Multiple Forms **

**5 1-D Analysis of Time Series or Chronological Variables **

**6 Statistical Maps or Cartograms **

**1 Introduction **

**2 2-D Analysis of Two Categorical Variables **

**3 2-D Analysis of Two Quantitative Variables **

**4 2-D Analysis of a Quantitative Variable and a Categorical Variable **

**5 2-D Analysis of a Quantitative Variable and a Categorical Variable with Multiple Forms **

**6 Conclusion **

**1 Introduction **

**2 Joint 3-D Statistical Data Analysis **

**3 Joint N-D Statistical Data Analysis **

**4 Cartograms and N-D Analysis **

**1 Introduction **

**2 From Linear Adjustment to Factor Analysis **

**3 From the Origin of Factor Analysis to Modern Factor Analysis Techniques **

**4 Mathematical Description of Modern Factor Analysis **

**5 Factor Analysis Formulas **

**1 Basic Data Sets **

**2 Different Patterns of Principal Components Analysis **

**3 Standardized Principal Components Analysis **

**4 Interpretation of Principal Components Analysis **

**5 Classifying Supplementary Points into Graphics **

**6 Rules for Selecting Significant Axes and Elements **

**7 Standardized Principal Components Analysis Formulas **

**8 Applications and Case Studies **

**1 Introduction **

**2 Basic Correspondence Data Sets **

**3 Mathematical Description of Correspondence Analysis **

**4 Geometric Representation of the Sets I and J **

**5 Interpretation of the 2-D Correspondence Analysis **

**6 Factor Graphics **

**7 Classifying Supplementary Points into Graphics **

**8 Rules for Selecting Significant Axes and Elements **

**9 2-D Correspondence Analysis Formulas **

**10 Patterns of Clouds of Points **

**11 Patterns of Acceptable Data Sets **

**12 Case Studies **

**1 Introduction **

**2 Basic Data Sets **

**3 Equivalence between the Analyses of bJJ and kIJ **

**4 Interpretation of N-D Correspondence Analysis **

**5 Factor Graphics **

**6 Classifying Supplementary Points into Graphics **

**7 Rules for Selecting Significant Axes and Points of N(I), N(J), and N(Q) **

**8 N-D Correspondence Analysis Formulas **

**9 Patterns of Acceptable Data Sets **

**10 Case Studies **

**1 Introduction **

**2 Basic Data Sets **

**3 The Mathematical Description of Classifications **

**4 Partitioning Methods **

**5 Hierarchical Classification Methods **

**6 Specific Applications **

**7 Case Studies **

**1 Introduction **

**2 Proximities Data Sets **

**3 Proximities Data Sets from Individuals–Variables Data Sets **

**4 Elementary Description of Proximities Data Sets **

**5 Factor Analysis of Proximities Data Sets **

**6 Classification of Proximities Data Sets **

**7 Computation of Contributions **

**8 Conclusion **

**1 Place of Exploratory and Multivariate Data Analysis in Statistics **

**2 Basic Features for an Exploratory and Multivariate Data Analysis Software **

**3 Data Analysis Libraries **

**4 Future Prospects **

**1 General Notations **

**2 Specific Notations from Chapters 1 to 5 **

**3 Specific Notations from Chapters 6 to 9 **

**4 Specific Notations from Chapters 10 and 11 **

**1 Cars Models **

**2 Marks of Students **

**3 Statistics of Patents Registration **

**4 Preferences Given by Students **

**5 Responses to a Questionnaire on New Services in Telecommunications **

**6 Financial Data Set **

**7 Measurements Data Set on Skulls **

**8 Steel Samples **

**9 Economic Data Set Concerning Investments Abroad **

**10 Family Timetables **

**11 Semantic Field Associated with Colors **

**12 Proximities Data Set from the Family Timetables Data Set **

**13 Barataria Data of Grain-Size Measurements **

**14 Quality of Service in the Telephone Network **

**15 Crimes Data in the United States of America For 1977 **

**16 Table of percentage points of the χ² distribution **

English translation copyright © 1991 by Academic Press, Inc.

© BORDAS et C.N.E.T.-E.N.S.T., Paris 1989

All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher.

ACADEMIC PRESS, INC.

1250 Sixth Avenue, San Diego, CA 92101

*United Kingdom Edition published by *

ACADEMIC PRESS LIMITED

24–28 Oval Road, London NW1 7DX

Library of Congress Cataloging-in-Publication Data:

Jambu, Michel.

[Exploration informatique et statistique des données. English]

Exploratory and multivariate data analysis/Michel Jambu.

p. cm.—(Statistical modeling and decision science)

Translation of: Exploration informatique et statistique des

données.

Includes bibliographical references and index.

ISBN 0-12-380090-0 (alk. paper)

I. Mathematical statistics—Data processing. I. Title.

II. Series.

QA276.4.J3613 1991

519.5'0285—dc20 90-23003

CIP

Printed in the United States of America

91 92 93 94 9 8 7 6 5 4 3 2 1

*To Catherine*, *Hugo Sébastien*, *Thomas *

L’essence de codage des données est de traduire fidèlement les relations observées entre les choses par des relations entre êtres mathématiques, de telle sorte qu’en réduisant par le calcul la structure mathématique choisie pour image du réel, on ait de celui-ci un dessin simplifié accessible à l’intuition et à la réflexion avec la garantie d’une critique mathématique.

J.P. Benzécri, in *Les Cahiers de l’Analyse des Données*, Vol. II, 1977, no. 4, 369–406

**Preface **

After travelling around the world, studying many kinds of data, listening to many lectures on subjects of data analysis, and giving seminars, it became clear that the way data analysis is studied in France, with exploration by Benzécri and his associates, is actually different from data analysis anywhere else in the world.

When I published *Data Analysis and Clustering* in 1983, correspondence analysis and related topics were known worldwide to French-speaking people but not in the English-speaking world. It was one of the first attempts to present correspondence analysis and associated methods of data analysis to English-speaking readers. Several colleagues then encouraged me to publish a textbook on correspondence analysis and the French method of data analysis. I was not actually satisfied by this proposal, because data analysis is the same around the world, even if the techniques associated with it vary. Finally, I gathered data analysis materials from different sources. There were so many connections and interactions among them that I combined them in order to propose a modern way of thinking about and practicing data analysis; the point is not only to use techniques but to use the interactions and relations between them with a view to summarizing data in order to improve knowledge, draw valid conclusions, and aid decision making. The way was found; it remained to write the book.

The heart of this book contains methods of exploring data from a statistical data analysis point of view, from the most elementary, associated with univariate and bivariate statistical description, to the most advanced, associated with multivariate statistical description, factor analysis, correspondence analysis, and clustering. They are presented in such a manner that they correspond to the step-by-step exploration of data sets, to allow readers to build their own data analysis strategies from their data sets. The titles of the chapters and the general plan of the book are as follows: The first chapter presents a general introduction to the basic principles and steps of statistical data analysis with some case studies. The following chapters are presented in the order of the data analysis process: elaboration of data sets (Chapter 2), 1-D statistical data analysis (Chapter 3), 2-D statistical data analysis (Chapter 4), N-D statistical data analysis (Chapter 5), factor analysis of individuals–variables data sets (Chapter 6), principal components analysis (Chapter 7), 2-D correspondence data analysis (Chapter 8), N-D correspondence data analysis (Chapter 9), classification of individuals–variables data sets (Chapter 10), and analysis and classification of proximities data sets (Chapter 11). Chapter 12 is devoted to the computer aspects of data analysis. A list of notations, an appendix containing the data sets used as examples, and, as usual, references conclude the book.

This book is written for anyone who analyzes data or expects to do so in the future, including students, statisticians, scientists, engineers, managers, and teachers. The material presented here is relevant for applications in various fields, such as physics, chemistry, medicine, business, management, marketing, economics, psychology, sociology, geosciences, biology, astronomy, quality control, engineering, computer science, education, linguistics, and virtually any other field where there are data to be analyzed, synthesized, or explored with the goal of improving knowledge or decision making. This book can also be used as a reference or as a supplement to any course in applied statistics, or in applied sciences courses where statistics is taught.

Chapters 1–5 do not assume any previous knowledge. The material can be understood by anyone who wants to learn it and who has some experience or interest in quantitative thinking. Chapters 6–9 assume a knowledge of the previous chapters and an understanding of data in terms of interactions between multiple data sets. These chapters are devoted to methods for solving complex problems involving complex data sets. The mathematical background needed is the first level in any linear algebra course. Chapter 10 assumes an interest in taxonomic problems but no specific knowledge; the mathematical background needed is the first level in any university. Chapter 11 assumes a knowledge of Chapters 6–10. It is an introduction to a more general case of data often used in taxonomy and in multidimensional scaling. Chapter 12 assumes an interest in monitoring computer software on real data. It contains some recommendations to users in data analysis. In conclusion, no mathematical, statistical, or computer knowledge is required; just common sense.

I would need many pages to thank all the people who have led directly or indirectly to the publication of this book. I have dedicated this book to Professor J. P. Benzécri in acknowledgment of the role he played in my data analysis education. To all those who encouraged me to publish a textbook devoted to data analysis, correspondence analysis, and related topics, I extend my warmest thanks: I. Olkin, C. Hayashi, J. Kruskal, R. Sokal, N. Ohsumi, P. Tukey, J. R. Kettenring, D. Carroll, and D. Merriam, to name a few. Particular thanks are given to H. Teil and F. Murtagh for their critical reading and revising of the manuscript; to G. André, Chief Director of the Centre National d’Etudes des Télécommunications, who controlled efficiently the realization of the manuscript; to the staff of Academic Press for their excellent collaboration in passing the book through the press; last, but not least, to Mrs N. Tissédre, for her patient work on the painstaking preparation of the manuscript. Final thanks go to the Centre National d’Etudes des Télécommunications and the Société Francophone de Classification for their generous financial help, and the S.C.C.M. Inc. for its excellent realization of figures.

Paris, 1990

**Chapter 1 **

**General Presentation **

The aim of data analysis is to discover the structure of a set of multivariate observations without the assumption of any mathematical hypotheses on the structure of these observations or variables. Because of the size and complexity of the data sets, this structure cannot be discovered directly; specific data processing methods are therefore required to manage, explore, analyze, synthesize, and communicate the results of data processing. These methods are oriented according to the desired goal: improving basic knowledge of a field; diagnosis; forecasting; planning; decision making. Whatever the goal, the statistical features of the observed data sets need to be highlighted. Data analysis methods are the most appropriate ones for doing this.

What does "data" mean?

Data is a set of organized information of any type, covering all aspects of a domain related to a specific goal (forecasting, improving knowledge, causal analysis, decision making, etc.). It is a quantification of the real world into an image, acceptable to the human brain, and then to the computer. For example, when the quality of cars is studied, the quality is initially defined in terms of certain criteria; the information concerning these criteria observed on a selected set of cars (a sample) is then gathered. For example, criteria such as mileage, number of repairs, headroom, weight, length, turn circle, and gear ratio are collected and recorded in a data file or data base. All the information is stored in a data set that, in general, contains heterogeneous data. Examine the data set given in Table 1.1. It is in the form of the rows and columns of a matrix. Each column and each row has a label; at the intersection of a column and a row is the information related to one variable observed on one car model. Naturally, there are many types of data sets. For example, consider the first column of the data set given in Table 1.1. It concerns the price of cars at a given time. This is a simple, or 1-D, data set as only one variable is observed. The whole data set given in Table 1.1 concerns the simultaneous observation of 12 variables on a given set of cars, and so it is a multiple, or N-D, data set.

The complexity of data depends on the field of study and/or on the initial aim, and/or on the degree of detail associated with the study. Thus, the data sets studied by data analysis involve quantitative information (measurements, ratios, marks, indicators, etc.) or qualitative (also called categorical) information (categories, logical attributes, intervals of quantitative information, etc.). A data set can involve homogeneous or heterogeneous information. Finally, depending on the goal, a data set can be divided into explanatory and explainable information. Generally, when the domain is large enough, the reference data sets contain all the different types of information. This is true in information systems or data bases. The problem is how to explore and process the data.

**Table 1.1 **

**Car models data set (extract). **

(From *Graphical Methods for Data Analysis*, by J.M. Chambers, W.S. Cleveland, B. Kleiner, and P A. Tukey. Copyright © 1983 by Bell Telephone Laboratories Incorporated, Murray Hill, NJ. Reprinted by permission of Wadsworth & Brooks/Cole Advanced Books & Software, Pacific Grove, CA 93950.)
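The row-and-column organization described above can be sketched in a few lines of Python. This is only an illustration: the row labels, column labels, and values below are made up and are not the figures of Table 1.1, and the helper names `cell`, `one_d`, and `n_d` are hypothetical.

```python
# Hypothetical extract: labels and values are invented for illustration;
# they are NOT the figures of Table 1.1.
row_labels = ["model_a", "model_b", "model_c"]      # one row per car model
column_labels = ["price", "mileage", "weight"]      # one column per variable
values = [
    [4100, 22, 2930],
    [6300, 17, 3670],
    [3800, 35, 2050],
]


def cell(row, column):
    """Value at the intersection of one row (car) and one column (variable)."""
    return values[row_labels.index(row)][column_labels.index(column)]


def one_d(column):
    """1-D data set: a single variable observed on every car."""
    j = column_labels.index(column)
    return {row: values[i][j] for i, row in enumerate(row_labels)}


def n_d(columns):
    """N-D data set: several variables observed simultaneously on every car."""
    idx = [column_labels.index(c) for c in columns]
    return {row: [values[i][j] for j in idx] for i, row in enumerate(row_labels)}


price_only = one_d("price")                     # simple (1-D) data set
price_and_weight = n_d(["price", "weight"])     # multiple (N-D) data set
```

Selecting one column gives the simple data set of the text; selecting several columns at once gives a multiple data set.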

What does "to analyze data" mean?

To analyze data means to synthesize the content of data in a data base or a data file, by selecting specific data sets on which data analysis methods can be applied. Obviously, no method can analyze a disorganized data set. To be described, data must follow specific rules such as homogeneity, exhaustivity, and comparability. Thus, the first step of data analysis is to extract relevant data sets that can be analyzed whilst having in mind the objectives, which may vary. In an example about the quality of telephone service, the problem is to study levels of quality and to select statistically determined units from a given range of quality. In medicine, the problem is to study how different variables interact on a group of patients. In marketing, the problem is how to forecast consumer behavior by observing selected variables on selected users. Basically, to analyze data means to choose data sets on which data analysis methods can be applied, with a view to decision making, selection, planning, forecasting, or understanding. And since data are too complex, too large, and too numerous, specific tools are needed to dissect data and to make either numerical or graphical summaries. This specific type of data processing follows a logical process described in Section 3.1 on the different steps of data exploration.

What does "to synthesize data" mean?

In any statistical study, there are two steps: analysis and synthesis. To synthesize data means to gather the most significant or the most telling features within the data. The results are presented in a way that is convenient for the user. Thus, the problem is not only to analyze data in depth, but also to communicate the results in terms of valid conclusions that can be used to make reasonable decisions. When data analysis was first used, analysis meant both analysis and synthesis. But, according to recent developments in methods, size, and complexity of data, analysis and synthesis must be distinguished again. The basic principles of data analysis are presented here in comparison with other scientific trends.

Data analysis belongs to Statistics in the following sense: "Statistics is concerned with scientific methods for collecting, organizing, summarizing, presenting, analyzing data, as well as drawing valid conclusions and making reasonable decisions on the basis of such analysis" (*cf*. Spiegel, 1961). It is opposed to experimental methods based on observing the variations of one variable with respect to all the others involved. Statistics and data analysis are based on data as they are collected. All of the possible variations for all of the variables cannot be studied and, most of the time, the control of these variables is impossible, as in economics, marketing, sociology, meteorology, geology, etc. Experimental methods are appropriate for specific classes of measurements. Statistics or data analysis methods can process a larger class of data than those used in experimental methods.

In Statistics, there are two currents: the inductive process and the deductive process. Data analysis is concerned with the deductive process; it means *to deduce* only from gathered data, and not to build a model first. The basic data analysis principles are expressed as follows:

(a) To extract structures from data, and not the reverse.

(b) To process simultaneously information involving multiple variables.

(c) To elaborate statistical information systems with a view to computer data processing.

(d) To use all the resources of a computer, particularly graphical tools.

Certain remarks can be made:

(a) Often the opposite is done; models smooth out data. Thus, it is taken as real what is purely a mathematical construct. It often happens that data are mutilated because it is thought that they cannot be processed by computer. But, it should be kept in mind that methods and software are now able to process data in depth.

(b) To analyze data variable by variable takes time and does not provide a synthesis. To do so, interactions between pieces of information must be studied globally.

(c) Sometimes data are built in successive layers, producing incoherency. Even if data are elaborated independently from data processing, they must be elaborated with a view to data processing.

(d) Graphics give more information than numerical tables. A histogram highlights the shape of a distribution; factor maps give more information than correlation matrices; dispersion box plots represent more than any single statistical measure. In the following, some examples of real applications are given.
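Remark (d) can be illustrated with a minimal sketch, assuming two made-up samples that share the same mean. A single numerical summary cannot tell them apart, while a crude text histogram shows one cluster in the first sample and two separated groups in the second. The function name `text_histogram` and both samples are hypothetical.

```python
from statistics import mean


def text_histogram(sample, low, high, n_bins):
    """Count values per equal-width bin and draw one line of '#' per bin."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in sample:
        k = min(int((x - low) / width), n_bins - 1)  # x == high goes in last bin
        counts[k] += 1
    lines = []
    for k, c in enumerate(counts):
        left = low + k * width
        lines.append(f"[{left:4.1f}, {left + width:4.1f}) " + "#" * c)
    return counts, "\n".join(lines)


# Made-up samples with the same mean but different shapes.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
two_groups = [2, 2, 3, 3, 6, 9, 9, 10, 10]

counts_sym, picture_sym = text_histogram(symmetric, 0, 12, 12)
counts_two, picture_two = text_histogram(two_groups, 0, 12, 12)

print("mean:", mean(symmetric), "and", mean(two_groups))
print(picture_sym)
print()
print(picture_two)
```

The two printed means are identical; only the pictures reveal the difference in structure.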

To study the economic quality of cars, 37 cars were selected as a representative set. The variables observed were the price, mileage, repair record, headroom, rear seat, trunk space, weight, length, turn circle, displacement, and gear ratio. These variables are assumed to influence both the economic quality and the price of a car (the data are given in Table 1.1). Figures 1.1 and 1.2 give the results of principal components analysis and its hierarchical classification performed on the car data set. The principal components analysis highlights two factors, and the resulting factor map shows the cars and the main criteria as points. To the right of the first axis are found the smaller cars and more generally the Japanese ones (Datsun, Honda) with high gear ratio and mileage; to the left of the first axis occur the larger cars and more generally the American cars, which are comfortable (rear seat, trunk space, headroom) but heavy and more expensive than the smaller cars. This is confirmed by the hierarchical classification given in Fig. 1.2.

**Figure 1.1 **Representation in the first two factors (principal components analysis).

**Figure 1.2 **Car models classification. Hierarchical clustering of a principal components data set.
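To give an idea of what lies behind a factor map such as Fig. 1.1, here is a minimal sketch of a standardized principal components analysis restricted to its first axis. It is an illustration under stated assumptions, not the procedure used for the figures: the five cars and their values are made up, and the first axis is obtained by plain power iteration on the correlation matrix, a technique chosen here only because it needs no external library.

```python
from math import sqrt


def standardize(table):
    """Center each column and scale it to unit standard deviation."""
    n = len(table)
    cols = list(zip(*table))
    means = [sum(c) / n for c in cols]
    sds = [sqrt(sum((x - m) ** 2 for x in c) / n) for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in table]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[j] * r[k] for r in z) / n for k in range(p)] for j in range(p)]


def first_axis(matrix, iterations=200):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    p = len(matrix)
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(matrix[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[j][k] * v[k] for k in range(p)) for j in range(p)]
    return sum(a * b for a, b in zip(v, mv)), v


# Made-up cars: columns are price, mileage, weight (hypothetical units).
cars = [
    [4.0, 22.0, 2.9],
    [6.0, 17.0, 3.6],
    [3.0, 35.0, 2.0],
    [8.0, 14.0, 4.1],
    [5.0, 25.0, 3.0],
]

z = standardize(cars)
eigenvalue, axis = first_axis(correlation_matrix(z))
# Coordinate of each car on the first axis (its position on the factor map).
coords = [sum(x * a for x, a in zip(row, axis)) for row in z]
share = eigenvalue / len(axis)   # part of the total variance carried by the axis
```

In this toy example price and weight load on the axis with one sign and mileage with the other, so light, economical cars and heavy, expensive cars fall at opposite ends. The orientation of an axis is arbitrary: which group appears on the right depends only on the sign convention.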

The number of patents is considered a good indicator of industrial activity. Two branches are studied: the telecommunications branch and all of the branches mixed. The data set is organized into two subsets (*cf*. Appendix 2, §3) simultaneously analyzed by correspondence analysis. The factorial map is given in Fig. 1.3. This map shows the relative position of each country for its own telecommunication branch with respect to the total of all branches. During the period 1980–1986, the number of patents registered increased for the USA, Japan, Italy, and Sweden, and was stable or decreased for FRG, Great Britain, and France. For the telecommunication branch, the movement is expanded. The telecommunication branches of Japan, the Netherlands, USA, and Great Britain are increasing. But the telecommunication branches of France, FRG, Italy, and Switzerland are decreasing. This map is self-explanatory.

**Figure 1.3 **Statistics concerning the patents registration according to the telecommunications branch and all the branches mixed. Correspondence analysis; representation in the two first factors.

To study the behavior and satisfaction (or lack of satisfaction) of users of new services, France Telecom carried out surveys using questionnaires on 1800 people. The new services are the electronic directory and all of the associated distributing services requested using the Minitel, which is a piece of telecommunication equipment resembling a computer terminal. The questionnaire consists of 70 multiple choice questions. The data set analyzed is a logical data set involving 252 dummy variables and 1800 persons (the dummy variables are the replies to the questions). It was analyzed by *N*-D correspondence analysis; here, we give two selected graphics, showing a part of the dummy variables. Figure 1.4 represents user satisfaction; Fig. 1.5 represents the usage of the new Minitel services. These two graphics can be superimposed. Data analysis processing of surveys needs more detailed analysis than for contingency data sets. This will be discussed further in Chapter 9.

**Figure 1.4 **Questionnaire on Minitel. *N*-D correspondence analysis; representation in the first two factors. Representation of the variables concerning the usage and the price of the Minitel.

**Figure 1.5 **Questionnaire on Minitel. *N*-D correspondence analysis; representation in the first two factors. Representation of the variables concerning the knowledge and usage of the Minitel services.
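The logical data set mentioned for the Minitel survey, in which each possible reply to a multiple choice question becomes a dummy variable, can be sketched as follows. The questions, replies, and respondents are invented for illustration, the function name `dummy_coding` is hypothetical, and the sketch assumes that each person gives exactly one reply per question.

```python
def dummy_coding(responses, categories):
    """Code each answer as 0/1 indicators, one indicator per possible reply."""
    columns = [(q, c) for q in categories for c in categories[q]]
    table = [
        [1 if person[q] == c else 0 for q, c in columns]
        for person in responses
    ]
    return columns, table


# Hypothetical questions with their possible replies.
categories = {
    "q1_satisfied": ["yes", "no", "no opinion"],
    "q2_usage": ["daily", "weekly", "never"],
}

# Hypothetical respondents: one reply per question.
responses = [
    {"q1_satisfied": "yes", "q2_usage": "daily"},
    {"q1_satisfied": "no", "q2_usage": "never"},
    {"q1_satisfied": "yes", "q2_usage": "weekly"},
]

columns, table = dummy_coding(responses, categories)
for person, row in enumerate(table):
    print(person, row)
```

With two questions of three replies each, the table has six dummy variables, and every row sums to the number of questions. On the same principle, the 70 questions of the survey yield the 252 dummy variables observed on 1800 persons.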

Krumbein and Aberdeen (1937) collected 98 bottom samples from the Kidal lagoon in Barataria Bay at the margin of the Mississippi delta, with the objective of evaluating the depositional environment of the lagoon. Data were recorded on the grain-size distribution of the samples (*cf*. Appendix 2, §13). Only 69 samples (with complete description) are retained for processing by correspondence analysis (*cf*. Fig. 1.6). The first axis clearly represents the evolution from coarse to fine grained sediments (*cf*. Teil, 1985).

**Figure 1.6 **PH defines the different grain sizes.

In 1965, an international organization wanted to study and compare the lifestyle chosen according to marital status (single or married), sex (male or female), country, and professional activity. In this study, lifestyle was viewed through 10 major activities (professional work, transportation, sleep, household, meals, shopping, children, personal care, TV, leisure). The data set was built by taking into account the number of hours spent by a group on these different activities (*cf*. Appendix 2, §10). A principal components analysis was done (*cf*. Fig. 1.7); it highlights the relationships between the population groups and variables. For example, the second axis opposes the western countries (Europe) to the USA according to two groups of variables: meals and sleep for Europe on the one hand; personal care and shopping for the USA on the other.

**Figure 1.7 **Family timetables. Principal components analysis. Representation in the two first factors.

Data analysis involves several steps from data conception to the use of final results in decision making. We present the steps and the relations among them, set in a network where the vertices are the steps and the edges the relations (*cf*. Fig. 1.8). Ten steps are identified and examined in detail. But keep in mind that data analysis involves interaction with data and with the steps taken to analyze them.

**Figure 1.8 **Data analysis network.

**STEP 1. Data decision.** At the beginning, there is someone who decides on an action. It could be the manager (in business), the scientist (in fundamental sciences), the physician (in medicine), the agronomist (in studying plants), the decision maker (in marketing), etc. What does he decide? To study a field based on some hypotheses. Therefore, he must define the aim and scope of the study, the boundary of the field, and depending on his knowledge, draw the main features and the orientations of what he wants, and then determine the data expected to be necessary to describe or explain the problem he is trying to solve.

**STEP 2. Data conception**,
