DATA SCIENCE
FOUNDATIONS
Geometry and Topology
of Complex Hierarchic Systems
and Big Data Analytics
Chapman & Hall/CRC
Computer Science and Data Analysis Series

The interface between the computer and statistical sciences is increasing, as each
discipline seeks to harness the power and resources of the other. This series aims to
foster the integration between the computer sciences and statistical, numerical, and
probabilistic methods by publishing a broad range of reference works, textbooks, and
handbooks.

SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London

Proposals for the series should be sent directly to one of the series editors above, or submitted to:

Chapman & Hall/CRC
Taylor and Francis Group
3 Park Square, Milton Park
Abingdon, OX14 4RN, UK

Published Titles

Semisupervised Learning for Computational Linguistics
Steven Abney

Visualization and Verbalization of Data
Jörg Blasius and Michael Greenacre

Design and Modeling for Computer Experiments
Kai-Tai Fang, Runze Li, and Agus Sudjianto

Microarray Image Analysis: An Algorithmic Approach
Karl Fraser, Zidong Wang, and Xiaohui Liu

R Programming for Bioinformatics
Robert Gentleman

Exploratory Multivariate Analysis by Example Using R
François Husson, Sébastien Lê, and Jérôme Pagès

Bayesian Artificial Intelligence, Second Edition
Kevin B. Korb and Ann E. Nicholson

Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez

Exploratory Data Analysis with MATLAB®, Third Edition
Wendy L. Martinez, Angel R. Martinez, and Jeffrey L. Solka

Statistics in MATLAB®: A Primer
Wendy L. Martinez and MoonJung Cho

Clustering for Data Mining: A Data Recovery Approach, Second Edition
Boris Mirkin

Introduction to Machine Learning and Bioinformatics
Sushmita Mitra, Sujay Datta, Theodore Perkins, and George Michailidis

Introduction to Data Technologies
Paul Murrell

R Graphics
Paul Murrell

Correspondence Analysis and Data Coding with Java and R
Fionn Murtagh

Data Science Foundations: Geometry and Topology of Complex Hierarchic
Systems and Big Data Analytics
Fionn Murtagh

Pattern Recognition Algorithms for Data Mining
Sankar K. Pal and Pabitra Mitra

Statistical Computing with R
Maria L. Rizzo

Statistical Learning and Data Science
Mireille Gettler Summa, Léon Bottou, Bernard Goldfarb, Fionn Murtagh,
Catherine Pardoux, and Myriam Touati

Music Data Analysis: Foundations and Applications
Claus Weihs, Dietmar Jannach, Igor Vatolkin, and Günter Rudolph

Foundations of Statistical Algorithms: With References to R Packages
Claus Weihs, Olaf Mersmann, and Uwe Ligges
Chapman & Hall/CRC
Computer Science and Data Analysis Series

DATA SCIENCE
FOUNDATIONS
Geometry and Topology
of Complex Hierarchic Systems
and Big Data Analytics

Fionn Murtagh

Boca Raton London New York

CRC Press is an imprint of the
Taylor & Francis Group, an informa business
A CHAPMAN & HALL BOOK
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2018 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20170823

International Standard Book Number-13: 978-1-4987-6393-6 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity
of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized
in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying,
microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com
(http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers,
MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of
users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been
arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents

Preface

I Narratives from Film and Literature, from Social Media and Contemporary Life

1 The Correspondence Analysis Platform for Mapping Semantics
1.1 The Visualization and Verbalization of Data
1.2 Analysis of Narrative from Film and Drama
1.2.1 Introduction
1.2.2 The Changing Nature of Movie and Drama
1.2.3 Correspondence Analysis as a Semantic Analysis Platform
1.2.4 Casablanca Narrative: Illustrative Analysis
1.2.5 Modelling Semantics via the Geometry and Topology of Information
1.2.6 Casablanca Narrative: Illustrative Analysis Continued
1.2.7 Platform for Analysis of Semantics
1.2.8 Deeper Look at Semantics of Casablanca: Text Mining
1.2.9 Analysis of a Pivotal Scene
1.3 Application of Narrative Analysis to Science and Engineering Research
1.3.1 Assessing Coverage and Completeness
1.3.2 Change over Time
1.3.3 Conclusion on the Policy Case Studies
1.4 Human Resources Multivariate Performance Grading
1.5 Data Analytics as the Narrative of the Analysis Processing
1.6 Annex: The Correspondence Analysis and Hierarchical Clustering Platform
1.6.1 Analysis Chain
1.6.2 Correspondence Analysis: Mapping χ² Distances into Euclidean Distances
1.6.3 Input: Cloud of Points Endowed with the Chi-Squared Metric
1.6.4 Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space
1.6.5 Supplementary Elements: Information Space Fusion
1.6.6 Hierarchical Clustering: Sequence-Constrained

2 Analysis and Synthesis of Narrative: Semantics of Interactivity
2.1 Impact and Effect in Narrative: A Shock Occurrence in Social Media
2.1.1 Analysis
2.1.2 Two Critical Tweets in Terms of Their Words
2.1.3 Two Critical Tweets in Terms of Twitter Sub-narratives
2.2 Analysis and Synthesis, Episodization and Narrativization
2.3 Storytelling as Narrative Synthesis and Generation
2.4 Machine Learning and Data Mining in Film Script Analysis
2.5 Style Analytics: Statistical Significance of Style Features
2.6 Typicality and Atypicality for Narrative Summarization and Transcoding
2.7 Integration and Assembling of Narrative

II Foundations of Analytics through the Geometry and Topology of Complex Systems

3 Symmetry in Data Mining and Analysis through Hierarchy
3.1 Analytics as the Discovery of Hierarchical Symmetries in Data
3.2 Introduction to Hierarchical Clustering, p-Adic and m-Adic Numbers
3.2.1 Structure in Observed or Measured Data
3.2.2 Brief Look Again at Hierarchical Clustering
3.2.3 Brief Introduction to p-Adic Numbers
3.2.4 Brief Discussion of p-Adic and m-Adic Numbers
3.3 Ultrametric Topology
3.3.1 Ultrametric Space for Representing Hierarchy
3.3.2 Geometrical Properties of Ultrametric Spaces
3.3.3 Ultrametric Matrices and Their Properties
3.3.4 Clustering through Matrix Row and Column Permutation
3.3.5 Other Data Symmetries
3.4 Generalized Ultrametric and Formal Concept Analysis
3.4.1 Link with Formal Concept Analysis
3.4.2 Applications of Generalized Ultrametrics
3.5 Hierarchy in a p-Adic Number System
3.5.1 p-Adic Encoding of a Dendrogram
3.5.2 p-Adic Distance on a Dendrogram
3.5.3 Scale-Related Symmetry
3.6 Tree Symmetries through the Wreath Product Group
3.6.1 Wreath Product Group for Hierarchical Clustering
3.6.2 Wreath Product Invariance
3.6.3 Wreath Product Invariance: Haar Wavelet Transform of Dendrogram
3.7 Tree and Data Stream Symmetries from Permutation Groups
3.7.1 Permutation Representation of a Data Stream
3.7.2 Permutation Representation of a Hierarchy
3.8 Remarkable Symmetries in Very High-Dimensional Spaces
3.9 Short Commentary on This Chapter

4 Geometry and Topology of Data Analysis: in p-Adic Terms
4.1 Numbers and Their Representations
4.1.1 Series Representations of Numbers
4.1.2 Field
4.2 p-Adic Valuation, p-Adic Absolute Value, p-Adic Norm
4.3 p-Adic Numbers as Series Expansions
4.4 Canonical p-Adic Expansion; p-Adic Integer or Unit Ball
4.5 Non-Archimedean Norms as p-Adic Integer Norms in the Unit Ball
4.5.1 Archimedean and Non-Archimedean Absolute Value Properties
4.5.2 A Non-Archimedean Absolute Value, or Norm, is Less Than or Equal to One, and an Archimedean Absolute Value, or Norm, is Unbounded
4.6 Going Further: Negative p-Adic Numbers, and p-Adic Fractions
4.7 Number Systems in the Physical and Natural Sciences
4.8 p-Adic Numbers in Computational Biology and Computer Hardware
4.9 Measurement Requires a Norm, Implying Distance and Topology
4.10 Ultrametric Topology
4.11 Short Review of p-Adic Cosmology
4.12 Unbounded Increase in Mass or Other Measured Quantity
4.13 Scale-Free Partial Order or Hierarchical Systems
4.14 p-Adic Indexing of the Sphere
4.15 Diffusion and Other Dynamic Processes in Ultrametric Spaces

III New Challenges and New Solutions for Information Search and Discovery

5 Fast, Linear Time, m-Adic Hierarchical Clustering
5.1 Pervasive Ultrametricity: Computational Consequences
5.1.1 Ultrametrics in Data Analytics
5.1.2 Quantifying Ultrametricity
5.1.3 Pervasive Ultrametricity
5.1.4 Computational Implications
5.2 Applications in Search and Discovery using the Baire Metric
5.2.1 Baire Metric
5.2.2 Large Numbers of Observables
5.2.3 High-Dimensional Data
5.2.4 First Approach Based on Reduced Precision of Measurement
5.2.5 Random Projections in High-Dimensional Spaces, Followed by the Baire Distance
5.2.6 Summary Comments on Search and Discovery
5.3 m-Adic Hierarchy and Construction
5.4 The Baire Metric, the Baire Ultrametric
5.4.1 Metric and Ultrametric Spaces
5.4.2 Ultrametric Baire Space and Distance
5.5 Multidimensional Use of the Baire Metric through Random Projections
5.6 Hierarchical Tree Defined from m-Adic Encoding
5.7 Longest Common Prefix and Hashing
5.7.1 From Random Projection to Hashing
5.8 Enhancing Ultrametricity through Precision of Measurement
5.8.1 Quantifying Ultrametricity
5.8.2 Pervasiveness of Ultrametricity
5.9 Generalized Ultrametric and Formal Concept Analysis
5.9.1 Generalized Ultrametric
5.9.2 Formal Concept Analysis
5.10 Linear Time and Direct Reading Hierarchical Clustering
5.10.1 Linear Time, or O(N) Computational Complexity, Hierarchical Clustering
5.10.2 Grid-Based Clustering Algorithms
5.11 Summary: Many Viewpoints, Various Implementations

6 Big Data Scaling through Metric Mapping
6.1 Mean Random Projection, Marginal Sum, Seriation
6.1.1 Mean of Random Projections as a Seriation
6.1.2 Normalization of the Random Projections
6.2 Ultrametric and Ordering of Rows, Columns
6.3 Power Iteration Clustering
6.4 Input Data for Eigenreduction
6.4.1 Implementation: Equivalence of Iterative Approximation and Batch Calculation
6.5 Inducing a Hierarchical Clustering from Seriation
6.6 Short Summary of All These Methodological Underpinnings
6.6.1 Trivial First Eigenvalue, Eigenvector in Correspondence Analysis
6.7 Very High-Dimensional Data Spaces: Data Piling
6.8 Recap on Correspondence Analysis for Following Applications
6.8.1 Clouds of Points, Masses and Inertia
6.8.2 Relative and Absolute Contributions
6.9 Evaluation 1: Uniformly Distributed Data Cloud Points
6.9.1 Computation Time Requirements
6.10 Evaluation 2: Time Series of Financial Futures
6.11 Evaluation 3: Chemistry Data, Power Law Distributed
6.11.1 Data and Determining Power Law Properties
6.11.2 Randomly Generating Power Law Distributed Data in Varying Embedding Dimensions
6.12 Application 1: Quantifying Effectiveness through Aggregate Outcome
6.12.1 Computational Requirements, from Original Space and Factor Space Identities
6.13 Application 2: Data Piling as Seriation of Dual Space
6.14 Brief Concluding Summary
6.15 Annex: R Software Used in Simulations and Evaluations
6.15.1 Evaluation 1: Dense, Uniformly Distributed Data
6.15.2 Evaluation 2: Financial Futures
6.15.3 Evaluation 3: Chemicals of Specified Marginal Distribution

IV New Frontiers: New Vistas on Information, Cognition and the Human Mind

7 On Ultrametric Algorithmic Information
7.1 Introduction to Information Measures
7.2 Wavelet Transform of a Set of Points Endowed with an Ultrametric
7.3 An Object as a Chain of Successively Finer Approximations
7.3.1 Approximation Chain using a Hierarchy
7.3.2 Dendrogram Wavelet Transform of Spherically Complete Space
7.4 Generating Faces: Case Study Using a Simplified Model
7.4.1 A Simplified Model of Face Generation
7.4.2 Discussion of Psychological and Other Consequences
7.5 Complexity of an Object: Hierarchical Information
7.6 Consequences Arising from This Chapter

8 Geometry and Topology of Matte Blanco’s Bi-Logic in Psychoanalytics
8.1 Approaching Data and the Object of Study, Mental Processes
8.1.1 Historical Role of Psychometrics and Mathematical Psychology
8.1.2 Summary of Chapter Content
8.1.3 Determining Depth of Emotion, and Tracking Emotion
8.2 Matte Blanco’s Psychoanalysis: A Selective Review
8.3 Real World, Metric Space: Context for Asymmetric Mental Processes
8.4 Ultrametric Topology, Background and Relevance in Psychoanalysis
8.4.1 Ultrametric
8.4.2 Inducing an Ultrametric through Agglomerative Hierarchical Clustering
8.4.3 Transitions from Metric to Ultrametric Representation, and Vice Versa, through Data Transformation
8.4.4 Practical Applications
8.5 Conclusion: Analytics of Human Mental Processes
8.6 Annex 1: Far Greater Computational Power of Unconscious Mental Processes
8.7 Annex 2: Text Analysis as a Proxy for Both Facets of Bi-Logic

9 Ultrametric Model of Mind: Application to Text Content Analysis
9.1 Introduction
9.2 Quantifying Ultrametricity
9.2.1 Ultrametricity Coefficient of Lerman
9.2.2 Ultrametricity Coefficient of Rammal, Toulouse and Virasoro
9.2.3 Ultrametricity Coefficients of Treves and of Hartman
9.2.4 Bayesian Network Modelling
9.2.5 Our Ultrametricity Coefficient
9.2.6 What the Ultrametricity Coefficient Reveals
9.3 Semantic Mapping: Interrelationships to Euclidean, Factor Space
9.3.1 Correspondence Analysis: Mapping χ² into Euclidean Distances
9.3.2 Input: Cloud of Points Endowed with the Chi-Squared Metric
9.3.3 Output: Cloud of Points Endowed with the Euclidean Metric in Factor Space
9.3.4 Conclusions on Correspondence Analysis and Introduction to the Numerical Experiments to Follow
9.4 Determining Ultrametricity through Text Unit Interrelationships
9.4.1 Brothers Grimm
9.4.2 Jane Austen
9.4.3 Air Accident Reports
9.4.4 DreamBank
9.5 Ultrametric Properties of Words
9.5.1 Objectives and Choice of Data
9.5.2 General Discussion of Ultrametricity of Words
9.5.3 Conclusions on the Word Analysis
9.6 Concluding Comments on this Chapter
9.7 Annex 1: Pseudo-Code for Assessing Ultrametric-Respecting Triplet
9.8 Annex 2: Bradley Ultrametricity Coefficient

10 Concluding Discussion on Software Environments
10.1 Introduction
10.2 Complementary Use with Apache Solr (and Lucene)
10.3 In Summary: Treating Massive Data Sets with Correspondence Analysis
10.3.1 Aggregating Similar or Identical Profiles Is Welcome
10.3.2 Resolution Level of the Analysis Carried Out
10.3.3 Random Projections in Order to Benefit from Data Piling in High Dimensions
10.3.4 Massive Observation Cardinality, Moderate Sized Dimensionality
10.4 Concluding Notes

Bibliography

Index
Preface

This is my motto: Analysis is nothing, data are everything. Today, on the web, we
can have baskets full of data . . . baskets or bins?
Jean-Paul Benzécri, 2011

This book describes solid and supportive foundations for the data science of our times,
with many illustrative cases. Core to these foundations are mathematics and computational
science. Our thinking and decision-making in regard to data can follow the insightful
observation by the physicist Paul Dirac that physical theory and physical meaning have to
follow behind the mathematics (see Section 4.7). The hierarchical nature of complex reality
is part and parcel of this mathematically well-founded way of observing and interacting
with physical, social and all realities.
Quite wide-ranging case studies are used in this book. The text, however, is written in an
accessible and easily grasped way, for a reader who is knowledgeable and engaged, without
necessarily being an expert in all matters. Ultimately this book seeks to inspire, motivate and
orientate our human thinking and acting regarding data, associated information and derived
knowledge. This book seeks to give the reader a good start towards practical and meaningful
perspectives. Also, by seeking to chart out future perspectives, this book responds to current
needs in a way that is unlike other books of some relevance to this field, books that may be
great in their own specialisms.
The field of data science has come into its own, in a high-profile way, in recent times.
Ever increasing numbers of employees are required nowadays, as data scientists, in sectors
that range from retail to regulatory, and so much besides. Many universities, having started
graduate-level courses in data science, are now also starting undergraduate courses. Data
science encompasses traditional disciplines of computational science and statistics, data
analysis, machine learning and pattern recognition. But new problem domains are arising.
Back in the 1970s and into the 1980s, one had to pay a lot of attention to available memory
storage when working with computers. Therefore, the focus of attention at that time was on
stored data, directly linked to computational processing power. By the beginning of the 1990s,
communication and networking had become the focus of attention. Against the background
of regulatory and proprietary standards, and open source communication protocols (ISO
standards, Decnet, TCP/IP protocols, and so on), data access and display protocols became
central (File Transfer Protocol, gopher, Veronica, Wide Area Information Server, and
Hypertext Transfer Protocol). So the focus back in those times was on: firstly, memory
and computer power; and secondly, communications and networking. Now we have, thirdly,
data as the prime focus. Such waves of technology developments are exciting. They motivate
the tackling of new problems, and there may well also be a requirement for new ways of
addressing problems. Such a requirement for new perspectives and new approaches is always
due to contemporary inadequacies, limitations and underperformance. Now, we move on to
our interacting with data.
This book targets rigour, and mathematics, and computational thinking. Through available
data sets and R code, reproducibility by the reader of results and outcomes is facilitated.
Indeed, understanding is also facilitated through “learning by doing”. The case studies and
the available data and software codes are intended to help impart the data science philosophy
in the book. In that sense, dialoguing with data, and “letting the data speak”
(Jean-Paul Benzécri), are the perspective and the objective. To the foregoing quotations,
the following will be added: “visualization and verbalization of data” (cf. [34]).
Our approach is influenced by how the leading social scientist, Pierre Bourdieu, used the
most effective inductive analytics developed by Jean-Paul Benzécri. This family of geometric
data analysis methodologies, centrally based on correspondence analysis encompassing
hierarchical clustering, and statistical modelling, not only organizes the analysis
methodology and domain of application but, most of all, integrates them. An inspirational set of
principles for data analytics, listed in [24] (page 6), included the following: “The model
should follow the data, and not the reverse. . . . What we need is a rigorous method that
extracts structures from data.” Closely coupled to this is that “data synthesis” could be
considered equally important as, if not more important than, “data analysis” [27]. Analysis and
synthesis of data and information obviously go hand in hand.
A very minor note is the following. Analytics refers to general and generic data processing,
obtaining information from data, while analysis refers to specific data processing.
We have then the following. “If I make extensive use of correspondence analysis, in
preference to multivariate regression, for instance, it is because correspondence analysis is a
relational technique of data analysis whose philosophy corresponds exactly to what, in my
view, the reality of the social world is. It is a technique which ‘thinks’ in terms of relation,
as I try to do precisely in terms of field” (Bourdieu, cited in [133, p. 43]).
“In Data Analysis, numerous disciplines need to collaborate. The role of mathematics,
although essential, is modest, in the sense that one uses almost exclusively classical
theorems or elementary demonstration techniques. But it is necessary that certain abstract
conceptions enter into the spirits of the users, the specialists who collect the data and who
should orientate the analysis according to fundamental problems that are appropriate to
their science” [27].
No method is fruitful unless the data are relevant: “analysing data is not the collecting
of disparate data and seeing what comes out of the computer” [27]. Statistics can be seen
as “technical control” of process, certifying that work has been carried out in conformance
with rules, with primacy accorded to being statistically correct, even asking whether such
and such a procedure has the right to be used. In contradistinction to that, there is
relevance: asking whether there is interest in using such and such a procedure.
Another inspirational quotation is that “the construction of clouds leads to the mastery
of multidimensionality, by providing ‘a tool to make patterns emerge from data’” (this is
from Benzécri’s paper at the 1968 Honolulu conference, published in the 1969 proceedings as
“Statistical analysis as a tool to make patterns emerge from data”). John Tukey (developer of
exploratory data analysis, i.e. visualization in statistics and data analysis, the fast Fourier
transform, and many other methods) expressed this as follows: “Let the data speak for
themselves!” This can be kept in mind relative to direct, immediate, unmediated statistical
hypothesis testing that relies on a wide range of assumptions (e.g. normality,
homoscedasticity, etc.) that are often unrealistic and unverifiable.
The foregoing and the following are in [130]. “Data analysis, or more particularly ge-
ometric data analysis is the multivariate statistical approach, developed by J.-P. Benzécri
around correspondence analysis, in which data are represented in the form of clouds of
points and the interpretation is first and foremost on the clouds of points.”
While these are our influences, it would be good, too, to note how new problem areas of
Big Data are of concern to us, and also issues of Big Data ethics. A possible ethical issue,
entirely due to technical aspects, lies in the massification and reduction through scale effects
that are brought about by Big Data. From [130]: “Rehabilitation of individuals. The context
Preface xv

model is always formulated at the individual level, being opposed therefore to modelling at
an aggregate level for which the individuals are only an ‘error term’ of the model.”
Now let us look at the importance of homology and field, concepts that are inherent
to Bourdieu’s work. The comprehensive survey of [108] sets out new contemporary issues
of sampling and population distribution estimation. An important take-home message is
this: “There is the potential for big data to evaluate or calibrate survey findings . . . to help
to validate cohort studies”. Examples are discussed of “how data . . . tracks well with the
official”, and contextual, repository or holdings. It is well pointed out how one case study
discussed “shows the value of using ‘big data’ to conduct research on surveys (as distinct
from survey research)”. Therefore, “The new paradigm means it is now possible to digitally
capture, semantically reconcile, aggregate, and correlate data.”
Limitations, though, are clear [108]: “Although randomization in some form is very
beneficial, it is by no means a panacea. Trial participants are commonly very different
from the external . . . pool, in part because of self-selection”. This is because “One type of
selection bias is self-selection (which is our focus)”.
Important points towards addressing these contemporary issues include the following
[108]: “When informing policy, inference to identified reference populations is key”. This is
part of the bridge which is needed between data analytics technology and deployment of
outcomes. “In all situations, modelling is needed to accommodate non-response, dropouts
and other forms of missing data.”
While “Representativity should be avoided”, here is a fundamental way to address what is
needed [108]: “Assessment of external validity, i.e. generalization to the population from
which the study subjects originated or to other populations, will in principle proceed via
formulation of abstract laws of nature similar to physical laws”.
The bridge between the data that is analysed, and the calibrating Big Data, is well
addressed by the geometry and topology of data. Those form the link between sampled data
and the greater cosmos. Pierre Bourdieu’s concept of field is a prime exemplar. Consider, as
noted in [132], how Bourdieu’s work involves “putting his thinking in mathematical terms”,
and that it “led him to a conscious and systematic move toward a geometric frame-model”.
This is a multidimensional “structural vision”. Bourdieu’s analytics “amounted to the global
[hence Big Data] effects of a complex structure of interrelationships, which is not reducible
to the combination of the multiple [effects] of independent variables”. The concept of field,
here, uses geometric data analysis that is core to the integrated data and methodology
approach used in the correspondence analysis platform [177].
In addressing the “rehabilitation of individuals”, which can be considered as address-
ing representativity both quantitatively as well as qualitatively, there is the potential and
relevance for the many ethical issues related to Big Data, detailed in [199]. We may say
that in the detailed case study descriptions in that book, what is unethical is the arbitrary
representation of an individual by a class or group.
The term analytics platform for the science of data, which is quite central to this book,
can be associated with an interesting article by New York Times author Steve Lohr [146]
on the “platform thinking” of the founders of Microsoft, Intel and Apple. In this book
the analytics platform is paramount, over and above just analytical or software tools. In his
article [146] Lohr says: “In digital-age competition, the long goal is to establish an industry-
spanning platform rather than merely products. It is platforms that yield the lucrative
flywheel of network effects, complementary products and services and increasing returns.” In
this book we describe a data analytics platform. It is to have the potential to go way beyond
mere tools. It is to be accepted that software tools, incorporating the needed algorithms,
can come to one’s aid in the nick of time. That is good. But for a deep understanding of
all aspects of potential (i.e. having potential for further usage and benefit) and practice,
“platform” is the term used here for the following: potential importance and relevance, and
a really good conceptual understanding or role. The excellent data analyst does not just
come along with a software bag of tricks. The outstanding data analyst will always strive
for full integration of theory and practice, of methodology and its implementation.
An approach to drawing benefit from Big Data is precisely as described in [108]. The
observation of the need for the “formulation of abstract laws” that bridge sampled data
and calibrating Big Data can be addressed, for the data analyst and for the application
specialist, as geometric and topological.
In summary, then, this book’s key points include the following.

• Our analytics are based on letting the data speak.


• Data synthesis, as well as information and knowledge synthesis, is as important as data
analysis.
• In our analytics, an aim too is to rehabilitate the individual (see above).
• We have as a central focus the seeking of, and finding, homology in practice. This is
very relevant for Big Data calibration of our analytics.
• In high dimensions, all becomes hierarchical. This is because, as dimensionality tends
to infinity, metric becomes ultrametric; this is also a nice perspective on unconscious
thought processes.
• A major problem of our times may be addressed in both geometric and algebraic ways
(remembering Dirac’s quotation about the primacy of mathematics even over physics).
• We need to bring new understanding to bear on the dark energy and dark matter of
the cosmos that we inhabit, and of the human mind, and of other issues and matters
besides. These are among the open problems that haunt humanity.
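
The key point above about high dimensions can be explored numerically. The following is a minimal sketch in Python (the book's own code is in R), assuming independent Gaussian coordinates; that data model, the function name `distance_spread` and the parameter values are illustrative assumptions, not taken from the text. It measures how much the pairwise distances vary relative to their mean. When all pairwise distances are nearly equal, every triangle is nearly equilateral, and equilateral triangles satisfy the ultrametric inequality.

```python
import math
import random
from itertools import combinations

def distance_spread(dim, n_points=20, seed=0):
    """Coefficient of variation (std / mean) of all pairwise Euclidean
    distances among n_points Gaussian random points in `dim` dimensions."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for p, q in combinations(pts, 2)]
    mean = sum(dists) / len(dists)
    var = sum((v - mean) ** 2 for v in dists) / len(dists)
    return math.sqrt(var) / mean

# The relative spread of distances shrinks as dimensionality grows.
for dim in (2, 20, 200, 2000):
    print(dim, round(distance_spread(dim), 3))
```

Under this assumed data model the spread falls steadily with dimensionality, which is one simple way to see triangles tending towards the equilateral case.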

One major motivation for some of this book’s content, related to the fifth item here, is
to see, and draw benefit from, the remarkable simplicity of very high dimensions, and even
infinite dimensionality. With reference to the last item here, there is a very nice statement
by Immanuel Kant, in Chapter 34 of Critique of Practical Reason (1788): “Two things fill
the mind with ever newer and increasing wonder and awe, the more often and lasting that
reflection is concerned with them: the starry sky over me, and the moral law within me.”

The Book’s Website


The website accompanying this book, which can be found at
http://www.DataScienceGeometryTopology.info
has data sets which are referred to and used in the text. It also has accessible R code which
has been used in the many and varied areas of work that are at issue in this book. In some
cases, too, there are graphs and figures from outputs obtained.
Provision of data and of some R software, and in a few cases, other software, is with
the following objective: to facilitate learning by doing, i.e. carrying out analyses, and re-
producing results and outcomes. That may be both interesting and useful, in parallel with
the more methodology-related aspects that can be, and that ought to be, revealing and
insightful.

Collaborators and Benefactors: Acknowledgements


Key collaborating partners are acknowledged when our joint work is cited throughout the
book.
A key stage in this overall work was PhD scholarship funding, with support from the
Smith Institute for Industrial Mathematics and System Engineering, and with company
support for that, from ThinkingSafe.
Further background was provided by the courses, based on all or very considerable parts of this
work, that were taught in April–May 2013 at the First International Conference on Models
of Complex Hierarchic Systems and Non-Archimedean Analysis, Cinvestav, Abacus Cen-
ter, Mexico; and in August 2015 at ESSCaSS 2015, the 14th Estonian Summer School on
Computer and Systems Science, Nelijärve Puhkekeskus, Estonia.
Among far-reaching applications of this work there has been a support framework for
creative writing that resulted in many books being published. Comparative and qualita-
tive data and information assessment can be well and truly integrated with actionable
decision-making. Section 2.7 contains a short overview of these outcomes with quite major
educational, publishing and related benefits. It is nice to note that this work was awarded
a prestigious teaching prize in 2010, at Royal Holloway University of London. Colleagues
Dr Joe Reddington and Dr Douglas Cowie and I, well linked to this book’s qualitative
and quantitative analytics platform, obtained this award with the title, “Project TooMany-
Cooks: applying software design principles to fiction writing”.
A number of current collaborations and partnerships, including with corporate and
government agencies, will perhaps deliver paradigm-shift advances.

Brief Introduction to Chapters


The chapters of this book are quite largely self-contained, meaning that in a summary way,
or sometimes with more detail, there can be essential material that is again presented in any
given chapter. This is done so as to take into account the diversity of application domains.

• Chapter 1 relates to the mapping of the semantics, i.e. the inherent meaning and sig-
nificance of information, underpinning and underlying what is expressed textually and
quantitatively. Examples include script story-line analysis, using film script, national
research funding, and performance management.
• Chapter 2 relates to a case study of change over time in Twitter. Quantification, includ-
ing even statistical analysis, of style is motivated by domain-originating stylistic and
artistic expertise and insight. Also covered is narrative synthesis and generation.
• Those two chapters comprise Part I, relating to film and movie, literature and docu-
mentation, some social media such as Twitter, and the recording, in both quantitative
and qualitative ways, of some teamwork activities.
• The accompanying website has as its aim to encourage and to facilitate learning and
understanding by doing, i.e. by actively undertaking experimentation and familiarization
with all that is described in this book.

• Next comes Part II, relating to underpinning methodology and vantage points.
Paramount are geometry for the mapping of semantics, and, based on this, tree or
hierarchical topology, for lots of objectives.
• Chapter 3 relates to how hierarchy can express symmetry. Also at issue is how such
symmetries in data and information can be so revealing and informative.
• Chapter 4 is a review chapter, relating to fundamental aspects that are intriguing, and
maybe with great potential, in particular for cosmology. This chapter relates to the
theme that analytics through real-valued mathematics can be very beneficially com-
plemented by p-adic and, relatedly, m-adic number theory. There is some discussion of
relevance and importance in physics and cosmology.
• Part III relates to outcomes from somewhat more computational perspectives.
• Chapter 5 explains the operation of, and the great benefits to be derived from, linear-
time hierarchical clustering. Lots of associations with other techniques and so on are
included.
• The focus in Chapter 6 is on new application domains such as very high-dimensional
data. The chapter describes what we term informally the remarkable simplicity of very
high-dimensional data, and, quite often, very big data sets and massive data volumes.
• Part IV seeks to describe new perspectives arising out of all of the analytics here, with
relevance for various application domains.
• Chapter 7 relates to novel definitions and usage of the concept of information.
• Then Chapter 8 relates to ultrametric topology expressing or symbolically representing
human unconscious reasoning. Inspiration for this most important and insightful work
comes from the eminent psychoanalyst Ignacio Matte Blanco’s pursuit of bi-logic, the
human’s two modes of being, conscious and unconscious.
• Chapter 9 takes such analytics further, with application to very varied expressions of
narrative, embracing literature, event and experience reporting.
• Chapter 10 discusses a little the broad and general application of methods at issue here.
Part I

Narratives from Film and Literature, from Social Media and Contemporary Life

1 The Correspondence Analysis Platform for Mapping Semantics

1.1 The Visualization and Verbalization of Data


All-important for the big picture to be presented is an introductory description of the geometry
of data, and of how we can proceed both to visualizing and to interpreting data. We can
even claim to be verbalizing our data. To begin with, the embedding of our data in a metric
space is our very central interest in the geometry of data. This metric space provides a latent
semantic representation of our data. Semantics, or meaning, comes from the sum total of
the interrelations of our observations or objects, and of their attributes or properties. Our
particular focus is on mathematical description of our data analysis platform (or framework).
We then move from the geometry of metric spaces to the hierarchical topology that
allows our data to be structured into clusters.
We address both the mathematical framework and underpinnings, and also algorithms.
Hand in hand with the algorithms goes implementation in R (typically).
Contemporary information access is very often ad hoc. Querying a search engine ad-
dresses some user needs, with content that is here, there and anywhere. Information re-
trieval results in bits and pieces of information that are provided to the user. On the other
hand, information synthesis can refer to the fact that multiple documents and information
sources will provide the final and definitive user information. This challenge of Big Data is
now looming (J. Mothe, personal communication): “Big Data refers to the fact that data
or information is voluminous, varied, and has velocity but above all that it can lead to
value provided that its veracity has been properly checked. It implies new information sys-
tem architecture, new models to represent and analyse heterogeneous information but also
new ways of presenting information to the user and of evaluating model effectiveness. Big
Data is specifically useful for competitive intelligence activities.” It is this outcome that is
a good challenge, that is to be addressed through the geometry and topology of data and
information: “aggregating information from heterogeneous resources is unsolved.”
We can and we will anticipate various ways to address these interesting new challenges.
Jean-Paul Benzécri, who was ahead of his time in so many ways, indicated (including in
[27]) that “data synthesis” could be considered as equally if not more important relative to
“data analysis”. Analysis and synthesis of data and information obviously go hand in hand.
Data analytics are just one side of what we are dealing with in this book. The other side,
we could say, is that of inductive data analysis. In the context or framework of practical
data-related and data-based activity, the processes of data synthesis and inductive data
analysis are what we term a narrative. In that sense, we claim to be tracing and tracking
the lives of narratives. That is, in physical and behavioural activities, and of course in
mental and thought processes.

4 Data Science Foundations

1.2 Analysis of Narrative from Film and Drama


1.2.1 Introduction
We study two aspects of information semantics: (i) the collection of all relationships; (ii)
tracking and spotting anomaly and change. The first is implemented by endowing all relevant
information spaces with a Euclidean metric in a common projected space. The second is
modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding
of different information spaces based on cross-tabulation counts (and from other input data
formats) is provided by correspondence analysis. From there, the induced ultrametric that
we are particularly interested in takes a sequential (e.g. temporal) ordering of the data
into account. We employ such a perspective to look at narrative, “the flow of thought and
the flow of language” [45]. In application to policy decision-making, we show how we can
focus analysis in a small number of dimensions.
The data mining and data analysis challenges addressed are the following.

• Great masses of data, textual and otherwise, need to be exploited and decisions need
to be made. Correspondence analysis handles multivariate numerical and symbolic data
with ease.

• Structures and interrelationships evolve in time.


• We must consider a complex web of relationships.
• We need to address all these issues from data sets and data flows.

Various aspects of how we respond to these challenges will be discussed in this chapter,
complemented by the annex to the chapter. We will look at how this works, using the
Casablanca film script. Then we return to the data mining approach used, to propose that
various issues in policy analysis can be addressed by such techniques also.

1.2.2 The Changing Nature of Movie and Drama


McKee [153] bears out the great importance of the film script: “50% of what we understand
comes from watching it being said.” And: “A screenplay waits for the camera. . . . Ninety
percent of all verbal expression has no filmic equivalent.”
An episode of a television series costs [177] $2–3 million per hour of television, or
£600,000–800,000 for a similar series in the UK. Generally screenplays are written spec-
ulatively or commissioned, and then prototyped by the full production of a pilot episode.
Increasingly, and especially availed of by the young, television series are delivered via the
Internet.
Originating in one medium – cinema, television, game, online – film and drama series are
increasingly migrated to another. So scriptwriting must take account of digital multimedia
platforms. This has been referred to in computer networking parlance as “multiplay” and
in the television media sector as a “360 degree” environment.
Cross-platform delivery motivates interactivity in drama. So-called reality TV has a
considerable degree of interactivity, as well as being largely unscripted.
There is a burgeoning need for us to be in a position to model the semantics of film script:
its most revealing structures, patterns and layers. With the drive towards interactivity, we
also want to leverage this work towards more general scenario analysis. Potential applica-
tions are to business strategy and planning; education and training; and science, technology
and economic development policy. We will discuss initial work on the application to policy
decision-making in Section 1.3 below.

1.2.3 Correspondence Analysis as a Semantic Analysis Platform


For McKee [153], film script text is the “sensory surface of a work of art” and reflects
the underlying emotion or perception. Our data mining approach models and tracks these
underlying aspects in the data. Our approach to textual data mining has a range of novel
elements.
Firstly, a novelty is our focus on the orientation of narrative through correspondence
analysis [24, 171] which maps scenes (and sub-scenes) and words used, in a largely automated
way, into a Euclidean space representing all pairwise interrelationships. Such a space is
ideal for visualization. Interrelationships between scenes are captured and displayed, as well
as interrelationships between words, and mutually between scenes and words. In a given
context, comprehensive and exhaustive data, with consequent understanding and use of
one’s actionable data, are well and truly integrated in this way.
The starting point for analysis is frequency of occurrence data, typically the ordered
scenes crossed by all words used in the script.
If the totality of interrelationships is one facet of semantics, then another is anomaly or
change as modelled by a clustering hierarchy. If, therefore, a scene is quite different from
immediately previous scenes, then it will be incorporated into the hierarchy at a high level.
This novel view of hierarchy will be discussed further in Section 1.2.5 below.
We draw on these two vantage points on semantics – viz. totality of interrelationships,
and using a hierarchy to express change.
Among further work that is covered in Section 1.2.9 and further in Section 2.5 of Chapter
2 is the following. We can design a Monte Carlo approach to test statistical significance
of the given script’s patterns and structures as opposed to randomized alternatives (i.e.
randomized realizations of the scenes). Alternatively, we examine caesuras and breakpoints
in the film script, by taking the Euclidean embedding further and inducing an ultrametric
on the sequence of scenes.

1.2.4 Casablanca Narrative: Illustrative Analysis


The well-known movie Casablanca serves as an example for us. Film scripts, such as for
Casablanca, are partially structured texts. Each scene has metadata, and the body of the
scene contains dialogue and possibly other descriptive data. The Casablanca script was half
completed when production began in 1942. The dialogue for some scenes was written while
shooting was in progress. Casablanca was based on an unpublished 1940 screenplay [43]. It
was scripted by J.J. Epstein, P.G. Epstein and H. Koch. The film was directed by M. Curtiz
and produced by H.B. Wallis and J.L. Warner. It was shot by Warner Bros. between May
and August 1942.
As an illustrative first example we use the following. A data set was constructed from
the 77 successive scenes crossed by attributes: Int[erior], Ext[erior], Day, Night, Rick, Ilsa,
Renault, Strasser, Laszlo, Other (i.e. minor character), and 29 locations. Many locations
were met with just once; and Rick’s Café was the location of 36 scenes. In scenes based in
Rick’s Café we did not distinguish between “Main room”, “Office”, “Balcony”, etc. Because
of the plethora of locations other than Rick’s Café we assimilate these to just one location,
“other than Rick’s Café”.
In Figure 1.1, 12 attributes are displayed. If useful, the 77 scenes can be displayed as
dots (to avoid overcrowding of labels). Approximately 34% (for factor 1) + 15% (for factor
2) = 49% of all information, expressed as inertia explained, is displayed here. We can study
interrelationships between characters, other attributes, and scenes, for instance closeness of
Rick’s Café with Night and Int (obviously enough).

[Figure 1.1: scatter plot. Horizontal axis: Factor 1, 34% of inertia. Vertical axis: Factor 2, 15% of inertia. Labelled points: Strasser, Ilsa, Renault, NotRicks, Other, Laszlo, Rick, Int, Day, Night, RicksCafe, Ext. Unlabelled dots appear among them.]

FIGURE 1.1: Correspondence analysis of the Casablanca data derived from the script.
The input data are presences/absences for 77 scenes crossed by 12 attributes. Just the 12
attributes are displayed. For a short review of the analysis methodology, see the annex to
this chapter.

1.2.5 Modelling Semantics via the Geometry and Topology of Information
Some underlying principles are as follows. We start with the cross-tabulation data, scenes
× attributes. Scenes and attributes are embedded in a metric space. This is how we are
probing the geometry of information, which is a term and viewpoint used by [236].
Underpinning the display in Figure 1.1 is a Euclidean embedding. The triangle inequality
holds for metrics. An example of a metric is the Euclidean distance, exemplified in Figure
1.2(a), where each and every triplet of points satisfies the relationship d(x, z) ≤ d(x, y) +
d(y, z) for distance d. Two other relationships also must hold: symmetry (d(x, y) = d(y, x))
and positive definiteness (d(x, y) > 0 if x ≠ y, d(x, y) = 0 if x = y).
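
The three conditions just stated can be checked mechanically on a small set of points. The following is a minimal sketch in Python (the book's own code is in R); the function names `euclidean` and `is_metric` and the sample coordinates are illustrative assumptions, not values read off Figure 1.2.

```python
import math
from itertools import permutations

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def is_metric(points, d, tol=1e-12):
    """Check positive definiteness, symmetry and the triangle inequality
    for distance function d on the given (distinct) points."""
    for x in points:
        if abs(d(x, x)) > tol:          # d(x, x) = 0
            return False
    for x, y in permutations(points, 2):
        if d(x, y) <= 0:                # d(x, y) > 0 if x != y
            return False
        if abs(d(x, y) - d(y, x)) > tol:  # symmetry
            return False
    for x, y, z in permutations(points, 3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True

# Hypothetical coordinates for three points x, y, z.
x, y, z = (2.0, 5.0), (4.0, 4.0), (5.0, 3.0)
print(is_metric([x, y, z], euclidean))
```

For contrast, the squared Euclidean distance fails the triangle inequality on three equally spaced points on a line, so it is a dissimilarity but not a metric.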
Further underlying principles used in Figure 1.1 are as follows. The axes are the principal
axes of inertia. Principles identical to those in classical mechanics are used. The scenes are
located as weighted averages of all associated attributes, and vice versa.
Huyghens’ theorem (see Figure 1.2(b)) relates to decomposition of inertia of a cloud of
points. This is the basis of correspondence analysis.
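
Huyghens’ theorem can be stated for a cloud of points x_i with masses m_i, total mass M and centre of gravity g: the inertia about any point a equals the inertia about g plus M times the squared distance from g to a. The following is a minimal numerical check in Python; the points, masses and function names are hypothetical values chosen for illustration only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points, masses):
    """Mass-weighted centre of gravity of the cloud."""
    total = sum(masses)
    dim = len(points[0])
    return tuple(sum(m * p[k] for p, m in zip(points, masses)) / total
                 for k in range(dim))

def inertia(points, masses, about):
    """Sum of mass times squared distance to the reference point."""
    return sum(m * sq_dist(p, about) for p, m in zip(points, masses))

# Hypothetical cloud of four weighted points and an arbitrary reference point.
points = [(1.0, 2.0), (3.0, 0.0), (4.0, 5.0), (0.0, 1.0)]
masses = [0.1, 0.4, 0.3, 0.2]
g = centroid(points, masses)
a = (2.0, -1.0)

lhs = inertia(points, masses, a)
rhs = inertia(points, masses, g) + sum(masses) * sq_dist(g, a)
print(lhs, rhs)  # the two values agree up to rounding
```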
We come now to a different principle: that of the topology of information. The particular
topology used is that of hierarchy. Euclidean embedding provides a very good starting point
to look at hierarchical relationships. One particular innovation in this work is as follows:
the hierarchy takes sequence (e.g. timeline) into account. This captures, in a more easily
understood way, the notions of novelty, anomaly or change.

[Figure 1.2, panel (a): three points x, y and z plotted against Horizontal and Vertical axes. Panel (b): portrait image.]

(a) The triangle inequality defines a metric: every triplet of points satisfies the relationship d(x, z) ≤ d(x, y) + d(y, z) for distance d.
(b) Christiaan Huyghens, 1629–1695, from [24]. Towards the bottom on the right there is a depiction of the decomposition of the inertia of a hyperellipsoid cloud.

FIGURE 1.2: (a) Depiction of the triangle inequality. Consider a journey from location x
to location z, but via y. (b) A poetic portrayal of Huyghens.

Let us take an informal case study to see how this works. Consider the situation of
seeking documents based on titles. If the target population has at least one document that
is close to the query, then this is (let us assume) clear-cut. However, if all documents in the
target population are very unlike the query, does it make any sense to choose the closest?
Whatever the answer, here we are focusing on the inherent ambiguity, which we will note
or record in an appropriate way. Figure 1.3(a) illustrates this situation, where the query is
the point to the right. By using approximate similarity the situation can be modelled as an
isosceles triangle with small base.
As illustrated in Figure 1.3(a), we are close to having an isosceles triangle with small
base, with the red dot as apex, and with a pair of the black dots as the base. In practice,
in hierarchical clustering, we fit a hierarchy to our data. An ultrametric space has proper-
ties that are very unlike a metric space, and one such property is that the only triangles
allowed are either equilateral, or isosceles with small base. So Figure 1.3(a) can be taken as
representing a case of ultrametricity. What this means is that the query can be viewed as
having a particular sort of dominance or hierarchical relationship vis-à-vis any pair of target
documents. Hence any triplet of points here, one of which is the query (defining the apex
of the isosceles, with small base, triangle), defines local hierarchical or ultrametric struc-
ture. Further general discussion can be found in [169], including how established nearest
neighbour or best match search algorithms often employ such principles.
It is clear from Figure 1.3(a) that we should use approximate equality of the long sides
of the triangle. The further away the query is from the other data, the better is this ap-
proximation [169].
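
The effect described here can be sketched numerically. The Python fragment below uses hypothetical planar coordinates for two nearby target documents and a query that is moved progressively further away; the names and values are assumptions for illustration. It reports the ratio of the two long sides of the triangle, which approaches 1 as the query recedes.

```python
import math

def dist(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def long_side_ratio(query, t1, t2):
    """Ratio (>= 1) of the two sides joining the query to the two targets."""
    a, b = dist(query, t1), dist(query, t2)
    return max(a, b) / min(a, b)

# Hypothetical targets, close together, forming the small base.
t1, t2 = (10.0, 5.0), (12.0, 7.0)

# Move the query further to the right and watch the ratio shrink towards 1.
ratios = [long_side_ratio((x, 20.0), t1, t2) for x in (20.0, 40.0, 80.0, 160.0)]
print([round(r, 3) for r in ratios])
```

The reason is the ordinary triangle inequality: the two long sides can differ by at most the length of the base, which becomes negligible relative to them.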
What sort of explanation does this provide for our example here? It means that the
query is a novel, or anomalous, or unusual “document”. It is up to us to decide how to treat
such new, innovative cases. It raises, though, the interesting perspective that here we have
a way to model and subsequently handle the semantics of anomaly or innocuousness.

[Figure 1.3, panel (a): scatter plot with axes Property 1 (horizontal) and Property 2 (vertical), annotated “Isosceles triangle: approx equal long sides”. Panel (b): hierarchy with a Height axis from 1.0 to 3.5.]

(a) The query is on the upper right. While we can easily determine the closest target (among the three objects represented by the dots on the left), is the closest really that much different from the alternatives?
(b) The strong triangle inequality defines an ultrametric: every triplet of points satisfies the relationship d(x, z) ≤ max{d(x, y), d(y, z)} for distance d. Check by reading off the hierarchy, how this is verified for all x, y, z: d(x, z) = 3.5, d(x, y) = 3.5, d(y, z) = 1.0. In addition, the symmetry and positive definiteness conditions hold for any pair of points.

FIGURE 1.3: (a) graphical depiction, and (b) hierarchy, or rooted tree, depiction.
The strong triangle inequality, or ultrametric inequality, holds for tree distances: see
Figure 1.3(b). The closest common ancestor distance is such an ultrametric.
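
As a small check, the distances quoted in the caption of Figure 1.3(b) can be tested against the strong triangle inequality. The following Python sketch is illustrative only: the function names are assumptions, and the three points on a line are an added counter-example that is not in the text.

```python
from itertools import combinations, permutations

def is_ultrametric(labels, d, tol=1e-12):
    """True if d(x, z) <= max(d(x, y), d(y, z)) for every ordered triplet."""
    return all(d(x, z) <= max(d(x, y), d(y, z)) + tol
               for x, y, z in permutations(labels, 3))

def triangles_isosceles_small_base(labels, d, tol=1e-12):
    """True if, in every triangle, the two longest sides are equal
    (equilateral, or isosceles with small base)."""
    for x, y, z in combinations(labels, 3):
        sides = sorted([d(x, y), d(y, z), d(x, z)])
        if sides[2] - sides[1] > tol:
            return False
    return True

# Distances quoted in the caption of Figure 1.3(b).
tree = {frozenset("xy"): 3.5, frozenset("xz"): 3.5, frozenset("yz"): 1.0}

def d_tree(p, q):
    return tree[frozenset((p, q))]

# Counter-example (an assumption for contrast): three points on a line.
line = {"a": 0.0, "b": 1.0, "c": 3.0}

def d_line(p, q):
    return abs(line[p] - line[q])

print(is_ultrametric("xyz", d_tree), triangles_isosceles_small_base("xyz", d_tree))
print(is_ultrametric("abc", d_line))
```

The tree distances pass both checks, while the points on a line satisfy the ordinary triangle inequality but not the strong one.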

1.2.6 Casablanca Narrative: Illustrative Analysis Continued


Figure 1.4 uses a sequence-constrained complete link agglomerative algorithm. It shows up
scenes 9 to 10, and progressing from 39 to 40 and 41, as major changes. The sequence-
or chronology-constrained algorithm (i.e. agglomerations are permitted between adjacent
segments of scenes only) is described in the annex to this chapter, and in greater detail in
[167, 19, 135]. The agglomerative criterion used, that is subject to this sequence constraint,
is a complete link one.
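
To make the idea concrete, here is a minimal Python sketch of a sequence-constrained complete link agglomeration. It is not the implementation described in the annex or in [167]; it simply assumes that, at each step, the pair of adjacent segments with the smallest complete link dissimilarity (largest pairwise distance between their members) is merged. The function names and the one-dimensional toy values are hypothetical.

```python
def complete_link(seg_a, seg_b, d):
    """Complete link dissimilarity: largest distance between members."""
    return max(d(i, j) for i in seg_a for j in seg_b)

def sequence_constrained_complete_link(n, d):
    """Agglomerate items 0..n-1, merging only adjacent segments.
    Returns (left_segment, right_segment, height) tuples in merge order."""
    segments = [[i] for i in range(n)]
    merges = []
    while len(segments) > 1:
        # Dissimilarity between each pair of neighbouring segments.
        heights = [complete_link(segments[k], segments[k + 1], d)
                   for k in range(len(segments) - 1)]
        # Pick the closest adjacent pair (first one in case of ties).
        k = min(range(len(heights)), key=heights.__getitem__)
        merges.append((tuple(segments[k]), tuple(segments[k + 1]), heights[k]))
        segments[k:k + 2] = [segments[k] + segments[k + 1]]
    return merges

# Hypothetical one-dimensional values for seven items, in sequence order.
values = [0.0, 0.2, 0.1, 5.0, 5.3, 5.1, 0.3]
d = lambda i, j: abs(values[i] - values[j])

merges = sequence_constrained_complete_link(len(values), d)
for left, right, h in merges:
    print(left, right, round(h, 2))
```

In this toy input the last item has a value close to the first three, yet the sequence constraint keeps it from joining them until the final merge. A large jump in merge height between neighbouring segments is what marks a change point in the sequence.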

1.2.7 Platform for Analysis of Semantics


Correspondence analysis supports the following:

• analysis of multivariate, mixed numerical/symbolic data;


• web (viz. pairwise links) of interrelationships;
• evolution of relationships over time.

[Figure 1.4: dendrogram. Vertical axis ticks from 0 to 30; the leaves are scenes 1 to 77, in sequence.]

FIGURE 1.4: The 77 scenes clustered. These scenes are in sequence: a sequence-constrained
agglomerative criterion is used for this. The agglomerative criterion itself is a complete link
one. See [167] for properties of this algorithm.

Correspondence analysis is in practice a tale of three metrics [171]. The analysis is based
on embedding a cloud of points from a space governed by one metric into another. The
cloud of observables is inherently related to the cloud of attributes of those observables.
Observables are defined by their attributes, and each attribute is, de facto, specified by its
associated observables. So – in the case of film script – for any one of the metrics we can
effortlessly pass between the space of film script scenes and attribute set. The three metrics
are as follows.
• Chi-squared (χ2 ) metric, appropriate for profiles of frequencies of occurrence.
• Euclidean metric, for visualization, and for static context.
• Ultrametric, for hierarchic relations and for dynamic context, as we operationally have
it here, also taking the chronology into account.
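
As an illustration of the first of these metrics, the following minimal sketch computes the squared chi-squared distance between two row profiles of a frequency table, using the standard definition in which each squared difference of profile values is divided by the corresponding column mass. The function name is hypothetical, and the sketch assumes a table with no empty rows or columns.

```python
def chi2_distance_sq(table, i, k):
    """Squared chi-squared distance between the profiles of rows i and k."""
    total = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Column masses: the share of the grand total falling in each column.
    col_mass = [sum(row[j] for row in table) / total for j in range(n_cols)]
    row_i, row_k = sum(table[i]), sum(table[k])
    # A profile is a row divided by its own total.
    return sum((table[i][j] / row_i - table[k][j] / row_k) ** 2 / col_mass[j]
               for j in range(n_cols))
```

Two rows with proportional counts, such as (2, 2) and (4, 4), have identical profiles and therefore a distance of zero, whatever their totals.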
In the analysis of semantics, we distinguish two separate aspects.
1. Context – the collection of all interrelationships.
   • The Euclidean distance makes a lot of sense when the population is homogeneous.
   • All interrelationships together provide context, relativities – and hence meaning.
2. Hierarchy tracks anomaly.
   • Ultrametric distance makes a lot of sense when the observables are heterogeneous,
     discontinuous.
   • The latter is especially useful for determining anomalous, atypical, innovative cases.

1.2.8 Deeper Look at Semantics of Casablanca: Text Mining


The Casablanca script has 77 successive scenes. In total there are 6710 words in these scenes.
We define words here as consisting of at least two letters. Punctuation is first removed. All
upper case is set to lower case. We analyse frequencies of occurrence of words in scenes, so
the input is a matrix crossing scenes by words.
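
The preprocessing just described can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name is hypothetical, punctuation is assumed to be replaced by a space rather than deleted, and only the unaccented letters a to z are treated as letters.

```python
import re
from collections import Counter


def scene_word_matrix(scenes):
    """Return (vocabulary, matrix); matrix[s][w] counts word w in scene s."""
    counts = []
    for text in scenes:
        # Lower-case, turn every run of non-letters into a space (this drops
        # punctuation and digits), then keep words of at least two letters.
        words = re.sub(r"[^a-z]+", " ", text.lower()).split()
        counts.append(Counter(w for w in words if len(w) >= 2))
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, matrix


# Two invented strings stand in for scenes; they are not taken from the script.
vocabulary, matrix = scene_word_matrix(["The visa, the VISA!", "A plane to Lisbon."])
```

Each row of `matrix` is one scene and each column one word, so the result has the scenes-by-words layout described above.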

1.2.9 Analysis of a Pivotal Scene


As a basis for a deeper look at Casablanca we have taken the comprehensive but qualitative
discussion by McKee [153] and sought a quantitative and algorithmic implementation.
Casablanca is based on a range of miniplots. For McKee its composition is “virtually
perfect”.
Following McKee [153], we will carry out an analysis of Casablanca’s “mid-act climax”,
scene 43. McKee divides this scene, relating to Ilsa and Rick seeking black market exit visas,
into 11 “beats”.
1. Beat 1 is Rick finding Ilsa in the market.
2. Beats 2, 3, 4 are rejections of him by Ilsa.
3. Beats 5, 6 express rapprochement by both.
4. Beat 7 is guilt-tripping by each in turn.
5. Beat 8 is a jump in content: Ilsa says she will leave Casablanca soon.
6. In beat 9, Rick calls her a coward, and Ilsa calls him a fool.
7. In beat 10, Rick propositions her.
8. In beat 11, the climax, all goes to rack and ruin: Ilsa says she was married to
Laszlo all along. Rick is stunned.
Figure 1.5 shows the evolution from beat to beat rather well. In these 11 beats or sub-
scenes 210 words are used. Beat 8 is a dramatic development. Moving upwards on the
ordinate (factor 2) indicates distance between Rick and Ilsa. Moving downwards indicates
rapprochement.
In the full-dimensional space we can check some of McKee’s other guidelines. Beat lengths
get shorter leading up to the climax: the word counts of the final five beats in scene 43 are
50, 44, 38, 30, 46. A style analysis of scene 43 based on McKee [153] can be Monte Carlo
tested against 999 uniformly randomized sets of the beats. In the great majority of cases
(against 83% and more of the randomized alternatives) we find the style in scene 43 to be
characterized by: small variability of movement from one beat to the next; greater tempo
of beats; and high mean rhythm. There is further description of these attributes in Section
2.5.
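
The general shape of such a Monte Carlo test can be sketched as below. The sketch makes several assumptions: randomization is taken to mean a uniformly random reordering of the items, `mean_step` is only a stand-in for the style attributes (which are defined in Section 2.5, not here), and both function names are hypothetical.

```python
import random


def monte_carlo_rank(items, statistic, n_random=999, seed=0):
    """Fraction of random reorderings whose statistic exceeds the observed one."""
    rng = random.Random(seed)  # fixed seed so that the result is repeatable
    observed = statistic(items)
    shuffled = list(items)
    exceed = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)  # uniformly random reordering
        if statistic(shuffled) > observed:
            exceed += 1
    return exceed / n_random


def mean_step(seq):
    """Mean absolute change between successive items (a stand-in statistic)."""
    return sum(abs(b - a) for a, b in zip(seq, seq[1:])) / (len(seq) - 1)


# An ordered sequence moves as little as possible from one item to the next,
# so nearly every random reordering has a larger mean step.
fraction = monte_carlo_rank(list(range(1, 12)), mean_step)
```

A returned fraction of 0.83 or more would correspond to the kind of statement made above, namely that the observed sequence shows smaller variability than at least 83% of the randomized alternatives.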
The planar representation in Figure 1.5 accounts for 12.6% + 12.2% = 24.8% of the
inertia, and hence the total information. We will look at the evolution of scene 43, using
hierarchical clustering of the full-dimensional data – but based on the relative orientations,
or correlations with factors. This is because of what we have found in Figure 1.5, namely
that change of direction is most important.
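
Percentages of inertia such as those quoted for Figure 1.5 come from the eigenvalues of the correspondence analysis. The following is a minimal sketch under the standard formulation, in which the eigenvalues are the squared singular values of the matrix of standardized residuals. It assumes NumPy is available and that the table has no empty rows or columns; the function name is hypothetical.

```python
import numpy as np


def inertia_shares(table):
    """Share of total inertia carried by each factor, largest first."""
    counts = np.asarray(table, dtype=float)
    p = counts / counts.sum()                # correspondence matrix
    r = p.sum(axis=1)                        # row masses
    c = p.sum(axis=0)                        # column masses
    expected = np.outer(r, c)                # values under independence
    residuals = (p - expected) / np.sqrt(expected)
    singular_values = np.linalg.svd(residuals, compute_uv=False)
    eigenvalues = singular_values ** 2       # inertia of each factor
    return eigenvalues / eigenvalues.sum()
```

The first two entries of the returned array, added together, give the share of inertia carried by the principal factor plane, which is the kind of figure reported above as 12.6% + 12.2% = 24.8%.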
Figure 1.6 shows the hierarchical clustering, based on the sequence of beats. Input data
are of full dimensionality so there is no approximation involved. Note the caesura in moving
from beat 7 to 8, and back to 9. There is less of a caesura in moving from 4 to 5, but it is
still quite pronounced.
have taken the food out of it. Much of this water passes out through
the leaves.
You know when you are very warm, you feel a moisture come on
your skin. That was once water in your blood. It creeps out through
tiny pores over all your skin.
The plant skin has such pores. The water goes off through them.
When the plant breathes out this water, then more hurries up through
the cells to take its place. So the sap keeps running up and down all
the time.
Plants not only send out water through the pores of the leaves, but
also a kind of air or gas. If they did not do that, we should soon all be
dead. Can I make that plain to you?
Did you ever hear your mother say, “The air here is bad or close”?
Did you ever see your teacher open a door or a window, to “air” the
schoolroom? If you ask why, you will be told “So many people
breathing here make the air bad.”
How does our breathing make the air bad? When our blood runs
through our bodies it takes up little bits of matter that our bodies are
done with. This stuff makes the blood dark and thick. But soon the
blood comes around to our lungs.
Now as we breathe out, we send into the air the tiny atoms of this
waste stuff. It is carbonic acid gas. As we breathe in, we take from
the fresh air a gas called oxygen. That goes to our lungs, and lo! it
makes the blood fresh and clean, and red once more.
So you can see, that when many people breathe in one room they
will use up all the good clean air. At the same time they will load the
air of the room with the gas they breathe out.
That is why the window is opened. We wish to sweep away the bad
air, and let in good air.
But at this rate, as all men and other animals breathe out carbonic
acid gas, why does not all the air in the world get bad? Why, when
they all use oxygen, do they not use up all the oxygen that is in the
world?
Just here the plants come in to help. Carbonic acid gas is bad for
men, it is food to plants. Oxygen is needed by animals, but plants
want to get rid of it. Animals breathe out a form of carbon and
breathe in oxygen. Plants do just the other thing. They breathe out
oxygen and take in carbonic acid gas.
The air, loaded with this, comes to the plant. At once all the little leaf-
mouths are wide open to snatch out of the air the carbonic acid gas.
And, as the plants are very honest little things, they give where they
take away. They take carbon from the air, and breathe into the air a
little oxygen.
Where did they get that? The air they breathe has both carbon and
oxygen in it. So they keep what they want,—that is, carbon,—and
send out the oxygen.
Now it is only the green part of the plant that does this fine work for
us. It is the green parts, chiefly the leaves, that send out good
oxygen for us to breathe. It is the green leaf that snatches from the
air those gases which would hurt us.
It is the green leaf that changes the harmful form of carbon into good
plant stuff, which is fit for our food. How does it do that? Let us see.
What makes a leaf green? Bobby who crushed a leaf to see, told me
“a leaf was full of green paint.”
Inside the green leaves is a kind of green paste, or jelly. Now it is this
“leaf-green” that does all the work. The “leaf-green” eats up carbon.
The “leaf-green” turns carbon into nice safe plant material. It is “leaf-
green” that sets free good oxygen for us.
“Leaf-green” is a good fairy, living in every little cell in the leaf. Leaf-
green is a fairy which works only in the day-time. Leaf-green likes
the sun. Leaf-green will not work in the dark, but goes to bed and
goes to sleep!
In such simple lessons as these, I can tell you only a little of what is.
The deep “how” and “why” of things I cannot explain. Even the very
wisest men do not know all the how and the why of the “leaf-green”
fairy.
I have told you these few things that you may have wonders to think
of when you see green leaves. After this lesson, will you not care
more for seeds and leaves than you ever did before?

LESSON VIII.
THE COLOR OF PLANTS.

Almost the first thing that you will notice about a plant is its color. The
little child, before it can speak, will hold out its hands for a bright red
rose, or a golden lily. I think the color is one of the most wonderful
things about a plant.
Come into the field. Here you see a yellow buttercup, growing near a
white daisy. Beside them is a red rose. Close by, blooms a great
purple flower. All grow out of the same earth, and breathe the same
air. Yet how they differ in color.
Some flowers have two or three colors upon each petal. Have you
not seen the tulip with its striped blossoms, and the petunias spotted
with white and red?
The flower of the cotton plant changes in color. Within a few days
this flower appears in three distinct hues. The chicory blossom
changes from blue to nearly white as the day grows warm.
Look at your mother’s roses. Some are white, others are red, pink, or
yellow. None are ever blue.
Then look at a wild-rose tree. The root and stem are brown. The
green color is in the leaves, and in some of the stems. The petals
are red. The stamens and pistils are yellow.
You never saw the red color get astray and run into the leaves. The
leaf-green did not lose itself, and travel up to the petals. The
stamens and pistils did not turn brown instead of golden.
Does not that seem a wonder, now that you think of it? Perhaps you
never noticed it before. It is one thing to see things, and another to
notice them so that you think about them.
Here is another fact about color in plants. All summer you see that
the leaves are green. In the autumn they begin to change. You wake
up some fine frosty morning and the tree leaves are all turned red,
yellow, brown, or purple. It is a fine sight.
It is the going away of the leaf-green from the leaf that begins the
change of leaf-color in the fall. The leaves have done growing. Their
stems are hard and woody. They do not breathe as freely as they
did. The sap does not run through them as it did early in the season.
The leaf-green shrinks up in the cells. Or, it goes off to some other
part of the plant. Sometimes part of it is destroyed. Then the leaves
begin to change.
Sometimes a red sap runs into the leaf cells. Or, an oily matter goes
there, in place of the “leaf-green.”
The leaf-green changes color if it gets too much oxygen. In the
autumn the plant does not throw out so much oxygen. What it keeps
turns the leaf-green from green to red, yellow, or brown.
The bright color in plants is not in the flower alone. You have seen
that roots and seeds have quite as bright colors as blossoms. What
flowers are brighter than many fruits are?
The cherry is crimson, or pink, or nearly black. What a fine yellow,
red, purple, we find in plums! Is there any yellow brighter than that of
the Indian corn? Is there a red gayer than you find on the apples you
like so well? What is more golden than a heap of oranges?
If you wish to find splendid color in a part of a plant, look at a water-
melon. The skin is green marked with pale green, or white. Next,
inside, is a rind of pale greenish white. Then comes a soft, juicy,
crimson mass. In that are jet black seeds.
Oh, where does all this color come from? Why is it always just in the
right place? The melon rind does not take the black tint that belongs
to the seeds. The skin does not put on the crimson of the pulp. See,
too, how this color comes slowly, as the melon ripens. At first the
skin is of the same dark green as the leaves, and inside all is of a
greenish white.
Let us try to find out where all this color comes from. Do you know
we ourselves can make changes in the color of flowers? Take one of
those big hydrangeas. It has a pink flower. But give it very rich black
earth to grow in. Mix some alum and iron with the earth. Water it with
strong bluing water. Lay soot and coal-dust upon the earth it grows
in. Very soon your hydrangea will have blue flowers, instead of pink
ones.
Once I had a petunia with large flowers of a dirty white color. I fed it
with soot and coal-dust. I watered it with strong bluing water. After a
few weeks my petunia had red or crimson flowers. Some of the
flowers were of a very deep red. Others were spotted with red and
white.
Now from this you may guess that the plant obtains much of its color
from what it feeds on in the soil.
But you may give the plant very good soil, and yet if you make it
grow in the dark, it will have almost no color. If it lives at all, even the
green leaves will be pale and sickly.
This will show you that the light must act in some way on what the
plant eats, to make the fine color.
The plant, you know, eats minerals from the earth. In its food it gets
little grains of coloring stuff.
But how the color goes to the right place we cannot tell. We cannot
tell why it is, that from the same earth, in the same light, there will be
flowers of many colors. We cannot tell why flowers on the same
plant, or parts of the same flower, will have different colors. That is
one of the secrets and wonders that no one has found out.
There are many plants which store up coloring matter, just as plants
store up starch, or sugar. The indigo, which makes our best blue
dye, comes from a plant. Ask your mother to show you some indigo.
When the plant is soaked in water the coloring stuff sinks to the
bottom of the water, like a blue dust.
Did you ever notice the fine red sumac? That gives a deep yellow
dye. The saffron plant is full of a bright orange color. Other plants
give other dyes.
Sometimes children take the bright petals of plants, or stems, that
have bright color in them, to paint with. Did you ever do that? You
can first draw a picture, and then color it, by rubbing on it the colored
parts of plants.
Some trees and plants, from which dyes are made, have the coloring
stuff in the bark or wood. That is the way with the logwood tree. The
best black dye is made from that.
You have seen how much dark red juice you can find in berries. Did
you ever squeeze out the red juice of poke or elder berries? It is like
red ink. Did you ever notice how strawberries stain your fingers red?
Grapes and blackberries make your lips and tongue purple.
No doubt you have often had your hands stained brown, for days,
from the husks of walnuts. All these facts will show you what a deal
of color is taken up from the soil by plants, changed by the sun, and
stored up in their different parts.
But the chief of all color in the plant is the leaf-green. We cannot
make a dye out of that.
Leaf-green is the color of which there is the most. It is the color
which suits the eye best of all. How tired we should be of crimson or
orange grass!
Though leaves and stems are generally green, there are some
plants which have stems of a bright red or yellow color. Yellow is the
common color for stamens and pistils. In some plants, as the tulip,
the peach, and others, the stamens are of a deep red-brown, or
crimson, or pink, or even black color.

LESSON IX.
THE MOTION OF PLANTS.

If I ask you what motion plants have, I think you will tell me that they
have a motion upward. You will say that they “grow up.” You will not
say that they move in the wind. You know that that is not the kind of
motion which I mean.
Some plants grow more by day, some by night. On the whole, there
is more growing done by day than by night. At night it is darker,
cooler, and there is more moisture in the air. The day has more heat,
light, and dryness. For these causes growth varies by day and by
night.
Warmth and moisture are the two great aids to the growth of plants.
Heat, light, and wet have most to do with the motion of plants. For
the motion of plants comes chiefly from growth.
The parts of the plant the motion of which we shall notice, are, the
stems, leaves, tendrils, and petals. Perhaps you have seen the
motion of a plant stem toward the sunshine.
Did you ever notice in house plants, that the leaves and branches
turn to the place from which light comes to them? Did you ever hear
your mother say that she must turn the window plants around, so
that they would not grow “one-sided”?
Did you ever take a pot plant that had grown all toward one side, and
turn it around, and then notice it? In two or three weeks you would
find the leaves, stems, branches, bent quite the other way. First they
lifted up straight. Then they slowly bent around to the light.
Perhaps you have noticed that many flower stems stoop to the east
in the morning. Then they move slowly around. At evening you find
them bending toward the west.
This is one motion of stems. Another motion is that of long, weak
stems, such as those of the grape-vine or morning-glory. They will
climb about a tree or stick.
Such vines do much of their climbing by curling around the thing
which supports them. If you go into the garden, and look at a bean-
vine, you will see what fine twists and curves it makes about the
beanpole.
Such twists or curves can be seen yet more plainly in a tendril. A
tendril is a little string-like part of the plant, which serves it for hands.
Sometimes tendrils grow out of the tips of the leaves.
Sometimes they grow from the stem. Sometimes they grow from the
end of a leaf-stem in place of a final leaf.
Tendrils, as I told you before, are twigs, leaves, buds, or other parts
of a plant, changed into little, long clasping hands.
Now and then the long slender stem of a leaf acts as a tendril. It
twists once around the support which holds up the vine. Thus it ties
the stem of the vine to the support.
You have seen not only climbing plants, such as the grape-vine. You
have seen also creeping plants, as the strawberry and ground-ivy.
You will tell me that a climbing plant is one which travels up
something. You will say, also, that a creeping plant is a vine which
runs along the ground.
The climbing plant helps itself along by tendrils. The creeping plant
has little new roots to hold it firm.
Look at the strawberry beds. Do you see some long sprays which
seem to tie plant to plant? Your father will tell you that they are
“runners.”
The plant throws out one of these runners. Then at the end of the
runner a little root starts out, and fastens it to the ground. A runner is
very like a tendril. There are never any leaves upon it. But the end of
a tendril never puts out a bud. The end of the runner, where it roots,
puts out a bud.
This bud grows into a new plant. The new plant sends out its
runners. These root again, and so on. Thus, you see, a few
strawberry plants will soon cover a large space of ground.
There is a very pretty little fern, called the “walking fern,” which has
an odd way of creeping about. When the slender fronds[8] reach their
full length, some of the tallest ones bend over to the earth. The tip of
the frond touches the ground. From that tip come little root-like
fibres, and fix themselves in the earth. A new plant springs up from
them.
When the new plant is grown, a frond of that bends over and takes
root again. So it goes on. Soon there is a large, soft, thick mat of
walking fern upon the ground.
This putting out new roots to go on by is also the fashion of some
climbing plants. Did you ever notice how the ivy will root all along a
wall? Little strong roots put out at the joints of the stem, and hold the
plant fast.
All this motion in plants is due to growth. In very hot lands where
there is not only much heat, but where long, wet seasons fill the
earth with water, the growth of plants is very rapid.
In these hot lands, there are more climbing plants than in cool lands.
Some trees, which, in cool lands where they grow slowly, never
climb, turn to climbers in hot lands.
Some plants will twine and climb in hot weather, and stand up
straight alone in cool weather. This shows that in hot weather they
grow so fast that they cannot hold themselves up. When it is cool,
they grow slowly, and make more strong fibre. But we must leave the
stem motions of plants and speak of the motion of other parts.
Let me tell you how to try the leaf motion of plants. Take a house
plant to try, as that is where wind will not move the leaf. Get a piece
of glass about four or five inches square. Smoke it very black.
Lay it under the leaf, so that the point of the leaf bent down will be
half an inch from the glass.
Then take a bristle from a brush and put it in the tip of the leaf. Run
the bristle in the leaf so that the end will come beyond the leaf, and
just touch the glass. Leave it a night and a day. Then you will find the
story of the leaf’s travels written on the glass. As the leaf moves, the
bristle will write little lines in the black on the glass. Try it.
As you have proved the motion of the leaf with your smoked glass,
let us look at leaf motion. There is, first, that motion which unfolds or
unrolls the leaf from the bud. That is made because, by feeding, the
plant is growing larger, and the leaf needs more room.
The leaf often has, after it is grown, a motion of opening and
shutting. Other leaves have a motion of rising and falling. But of
these motions I will tell you in another lesson.
Flowers have, first, the motion by which the flower-bud unfolds to the
full, open blossom. That, as the leaf-bud motion, comes from
growing. Did you ever watch a rose-bud, or a lily-bud, unfold?
Then the flowers of many plants have a motion of opening and
shutting each day. I shall tell you of that, also, in another lesson.
Besides these motions in plants, there are others. Did you ever see
how a plant will turn, or bend, to grow away from a stone, or
something, that is in its way?
If you watch with care the root of one of your bean-seeds, you will
see that it grows in little curves, now this way, now that. It grows so,
even when it grows in water, or in air, where nothing touches it.
People who study these changes tell us that the whole plant, as it
grows, has a turning motion. In this motion all the plant, and all its
parts, move around as they grow.
The curious reasons for this motion of plants, you must learn when
you are older. I can now tell you only a little about it. I will tell you that
the plant moves, because the little cells in it grow in a one-sided way.
Thus the air, light, heat, moisture, cause the cells on one side of the
plant to grow larger than the others. Then the plant stoops, or is
pulled over, that way. It is bent over by the weight. Then that side is
hidden, and the other side has more light, heat, and wet. And as the
cells grow, it stoops that way.
This is easy to understand in climbing plants. Their long, slim stems
are weak. They bend with their own weight. They bend to the side
that is slightly heavier. Their motion then serves to find them a
support. As they sweep around, they touch something which will hold
them up. Then they cling to it.
Now, there is another reason for a tendril taking hold of anything.
The skin of the tendril is very soft and fine. As it lies against a string,
or stick, or branch, the touch of this object on its fine skin makes the
tendril bend, or curl.
It keeps on bending or curling, until it gets quite around the object
which it touches. Then it still goes on bending, and so it gets around
a second time, and a third, and so on. Thus the tendril makes curl
after curl, as closely and evenly as you could wind a string on a stick.
Some plants, as the hop, move around with the sun; other plants
move in just the other direction. It is as if some turned their faces,
and some their backs, to the sun.

FOOTNOTES:
[8] What you call the leaf of a fern is, properly speaking, a frond.

LESSON X.
PLANTS AND THEIR PARTNERS.

Did I not tell you that the plants had taken partners and gone into
business? I said that their business was seed-growing, but that the
result of the business was to feed and clothe the world.
In our first lessons we showed you that we get all our food, clothes,
light, and fuel, first or last, from plants. “Stop! stop!” you say. “Some
of us burn coal. Coal is a mineral.” Yes, coal is a mineral now, but it
began by being a vegetable. All the coal-beds were once forests of
trees and ferns. Ask your teacher to tell you about that.
If all these things which we need come from plants, we may be very
glad that the plants have gone into business to make more plants.
Who are these partners which we told you plants have? They are the
birds and the insects. They might have a sign up, you see, “Plant,
Insect & Co., General Providers for Men.”
Do let us get at the truth of this matter at once! Do you remember
what you read about the stamens and pistils which stand in the
middle of the flower? You know the stamens carry little boxes full of
pollen. The bottom of the pistil is a little case, or box, full of seed
germs.
You know also that the pollen must creep down through the pistils,
and touch the seed germs before they can grow to be seeds. And
you also know, that unless there are new seeds each year the world
of plants would soon come to an end.
Now you see from all this that the stamens and pistils are the chief
parts of the flower. The flower can give up its calyx, or cup, and its
gay petals, its color, honey, and perfume. If it keeps its stamens and
pistils, it will still be a true seed-bearing flower.
It is now plain that the aim of the flower must be to get that
pollen-dust safely landed on the top of the pistil.
You look at a lily, and you say, “Oh! that is very easy. Just let those
pollen boxes fly open, and their dust is sure to hit the pistil, all right.”
But not so fast! Let me tell you that many plants do not carry the
stamens and pistils all in one flower. The stamens, with the pollen
boxes, may be in one flower, and the pistil, with its sticky cushion to
catch pollen, may be in another flower.
[Illustration: THE THREE PARTNERS.]
More than that, these flowers, some with stamens, and some with
pistils, may not even be all on one plant! Have you ever seen a
poplar-tree? The poplar has its stamen-flowers on one tree, and its
pistil-flowers on another. The palm-tree is in the same case.
Now this affair of stamen and pistil and seed making does not seem
quite so easy, does it? And here is still another fact. Seeds are the
best and strongest, and most likely to produce good plants, if the
pollen comes to the pistil, from a flower not on the same plant.
This is true even of such plants as the lily, the tulip, and the
columbine, where stamens and pistils grow in one flower.
Now you see quite plainly that in some way the pollen should be
carried about. The flowers being rooted in one place cannot carry
their pollen where it should go. Who shall do it for them?
Here is where the insect comes in. Let us look at him. Insects vary
much in size. Think of the tiny ant and gnat. Then think of the great
bumble bee, or butterfly. You see this difference in size fits them to
visit little or big flowers.
You have seen the great bumble bee busy in a lily, or a trumpet
flower. Perhaps, too, you have seen a little ant, or gnat, come
crawling out of the tiny throat of the thyme or sage blossom. And you
have seen the wasp and bee, busy on the clover blossom or the
honeysuckle.
Insects have wings to take them quickly wherever they choose to go.
Even the ant, which has cast off its wings,[9] can crawl fast on its six
nimble legs.
Then, too, many insects have a long pipe, or tongue, for eating. You
have seen such a tongue on the bee.[10] In this book you will soon
read about the butterfly, with its long tube which coils up like a watch
spring.
With this long tube the insect can poke into all the slim cups, and
horns, and folds, of the flowers of varied shapes.
Is it not easy to see that when the insect flies into a flower to feed, it
may be covered with the pollen from the stamens? Did you ever
watch a bee feeding in a wild rose? You could see his velvet coat all
covered with the golden flower dust.
Why does the insect go to the flower? He does not know that he is
needed to carry pollen about. He never thinks of seed making. He
goes into the flower to get food. He eats pollen sometimes, but
mostly honey.
In business, you know, all the partners wish to make some profit for
themselves. The insect partner of the flower has honey for his gains.
The flower lays up a drop of honey for him.
In most flowers there is a little honey. Did you ever suck the sweet
drop out of a clover, or a honeysuckle? This honey gathers in the
flower about the time that the pollen is ripe in the boxes. Just at the
time that the flower needs the visit of the insects, the honey is set
ready for them.
Into the flower goes the insect for honey. As it moves about, eating,
its legs, its body, even its wings, get dusty with pollen. When it has
eaten the honey of one flower, off it goes to another. And it carries
with it the pollen grains.
As it creeps into the next flower, the pollen rubs off the insect upon
the pistil. The pistil is usually right in the insect’s way to the honey.
The top of the pistil is sticky, and it holds the pollen grains fast. So
here and there goes the insect, taking the pollen from one flower to
another.
But stop a minute. The pollen from a rose will not make the seed
germs of a lily grow. The tulip can do nothing with pollen from a
honeysuckle. The pollen of a buttercup is not wanted by any flower
but a buttercup. So of all. The pollen to do the germ any good must
come from a flower of its own kind.
What is to be done in this case? How will the insect get the pollen to
the right flower? Will it not waste the clover pollen on a daisy?
Now here comes in a very strange habit of the insect. Insects fly
“from flower to flower,” but they go from flowers of one kind to other
flowers of the same kind. Watch a bee. It goes from clover to clover,
not from clover to daisy.
Notice a butterfly. It flits here and there. But you will see it settle on a
pink, and then on another pink, and on another, and so on. If it
begins with golden rod, it keeps on with golden rod.
God has fixed this habit in insects. They feed for a long time on the
same kind of flowers. They do this, even if they have to fly far to
seek them. If I have in my garden only one petunia, the butterfly
which feeds in that will fly off over the fence to some other garden to
find another petunia. He will not stop to get honey from my sweet
peas.
Some plants have drops of honey all along up the stem to coax ants
or other creeping insects up into the flower.
But other plants have a sticky juice along the stem, to keep crawling
insects away. In certain plants the bases of the leaf-stems form little
cups, for holding water. In this water, creeping insects fall and drown.
Why is this? It is because insects that would not properly carry the
pollen to another flower, would waste it. So the plant has traps, or
sticky bars, to keep out the kind of insects that would waste the
pollen, or would eat up the honey without carrying off the pollen.
I have not had time to tell you of the many shapes of flowers. You
must notice that for yourselves.
Some are like cups, some like saucers, or plates, or bottles, or bags,
or vases. Some have long horns, some have slim tubes or throats.
Some are all curled close about the stamens and pistils.
These different kinds of flowers need different kinds of insects to get
their pollen. Some need bees with thick bodies. Some need
butterflies with long, slim tubes. Some need wasps with long, slender
bodies and legs. Some need little creeping ants, or tiny gnats.
Each kind of flower has what will coax the right kind of insects, and
keep away the wrong ones. What has the plant besides honey to
coax the insect for a visit? The flower has its lovely color, not for us,
but for insects. The sweet perfume is also for insects.
Flowers that need the visits of moths, or other insects that fly by
night, are white or pale yellow. These colors show best at night.
Flowers that need the visits of day-flying insects, are mostly red,
blue, orange, purple, scarlet.
There are some plants, as the grass, which have no sweet perfume
and no gay petals. I have told you of flowers which are only a small
brown scale with a bunch of stamens and pistils held upon it. And
they have no perfumes. These flowers want no insect partners. Their
partner is the summer wind! The wind blows the pollen of one plant
to another. That fashion suits these plants very well.
So, by means of insect or wind partners, the golden pollen is carried
far and wide, and seeds ripen.
But what about the bird partners? Where do they come in?
If the ripe seed fell just at the foot of the parent plant, and grew
there, you can see that plants would be too much crowded. They
would spread very little. Seeds must be carried from place to place.
Some light seeds, as those of the thistle, have a plume. The maple
seeds have wings. By these the wind blows them along.
But most seeds are too heavy to be wind driven. They must be
carried. For this work the plant takes its partner, the bird.
To please the eye of the bird, and attract it to the seed, the plant has
gay-colored seeds. Also it has often gay-colored seed cases. The
rose haws, you know, are vivid red. The juniper has a bright blue
berry. The smilax has a black berry. The berries of the mistletoe are
white, of the mulberry purple.
These colors catch the eye of the bird. Down he flies to swallow the
seed, case, and all. Also many seed cases, or covers, are nice food
to eat. They are nice for us. We like them. But first of all they were
spread out for the bird’s table.
Birds like cherries, plums, and strawberries. Did you ever watch a
bird picking blackberries? The thorns do not bother him. He swallows
the berries fast,—pulp and seed.
You have been told of the hard case which covers the soft or germ
part of the seed, and its seed-leaf food. This case does not melt up
in the bird’s crop or gizzard, as the soft food does. So when it falls to
the ground the germ is safe, and can sprout and grow.
Birds carry seeds in this way from land to land, as well as from field
to field. They fly over the sea and carry seeds to lonely islands,
which, but for the birds, might be barren.
So by means of its insect partners, the plant’s seed germs grow, and
perfect seeds. By means of the bird partners, the seeds are carried
from place to place. Thus many plants grow, and men are clothed,
and warmed, and fed.

FOOTNOTES:
[9] See Nature Reader, No. 2, Lessons on Ants.
[10] No. 1, Lesson 18.

LESSON XI.
AIR, WATER, AND SAND PLANTS.
Most of the plants which you see about you grow in earth or soil. You
have heard your father say that the grass in some fields was scanty
because the soil was poor. You have been told that wheat and corn
would not grow in some other field, because the soil was not rich
enough.
You understand that. The plant needs good soil, made up of many
kinds of matter. These minerals are the plant’s food. Perhaps you
have helped your mother bring rich earth from the forest, to put
about her plants.
But beside these plants growing in good earth in the usual way, there
are plants which choose quite different places in which to grow.
There are air-plants, water-plants, sand-plants. Have you seen all
these kinds of plants?
You have, no doubt, seen plants growing in very marshy, wet places,
as the rush, the iris, and the St. John’s-wort. Then, too, you have
seen plants growing right in the water, as the water-lilies, yellow and
white; the little green duck-weed; and the water crow-foot.
If you have been to the sea-shore, you have seen green, rich-looking
plants, growing in a bank of dry sand. In the West and South, you
may find fine plants growing in what seem to be drifts, or plains of
clear sand.
Air-plants are less common. Let us look at them first. There are
some plants which grow upon other plants and yet draw no food
from the plant on which they grow. Such plants put forth roots,
leaves, stems, blossoms, but all their food is drawn from the air.
I hope you may go and see some hot-house where orchids are kept.
You will see there splendid plants growing on a dead branch, or
