
Visualizing Texts: Determining Best Visual Metaphors

Fionn Murtagh, Adam Ganz

March 24, 2009

1 Introduction

There may or may not be evident visual metaphors that come to mind for a given text. Here we proceed by first asking what the “backbone” of the text is, and what its most dominant components are, in terms of both structure and content. Can we go as far as saying that a text is more or less summarized by parts of it? These parts would need to be representative of the text. To the extent that a subset of a text provides good coverage of the full text, we have a ready way of summarizing that text. Therefore we first try to summarize a text by its most salient sections. Next we look at these sections to pick out the most interesting visual metaphors. An aid here is to look at text sections together with the terms that are closely associated with them, and perhaps characterize them.

All of this work is based on presence/absence and frequency of occurrence data determined for the set of document sections and the set of terms appearing in the text. The terms are words, ignoring punctuation, of at least two characters in length. Unsurprisingly, “stop words” (“the”, “of”, “and”, etc.), which we retain, appear high on the list of terms ordered by frequency of occurrence. No lemmatization or other form of text normalization is applied. It is unusual to make use of all words like this. However, commonly used words do reveal style, with implications, for example, for the emotional content of text: consider [1], who explore the emotional content of financial news and its implications for investment decisions.

The technique we use here is Correspondence Analysis, which takes as input a data matrix containing the frequencies of occurrence of terms in text sections (sections crossed by terms), and produces Euclidean embeddings of the text sections and terms. Being metric, these embeddings can also be displayed visually. See [2, 4, 5, 6] for background and discussion of various text analyses.

Taking Marx’s text [3], we subdivided it into 21 consecutive paragraphs.
Word counts varied from 512 words in paragraph 6 to just 25 words in the one-sentence paragraph 5.

We will now look at the visualizations of Figures 1, 2 and 3. They are planar projections of the space of paragraphs (21 to start with, and then 6) and of the high dimensional space of terms. The percentage inertia describes how good the fit of such a projection is. The percentage inertia explained by the most important plane is the sum of the percentages of inertia explained by factors 1 and 2: see the figures. Factors 1 and 2, and subsequent ones (neither shown nor discussed), do not differ greatly in inertia explained, i.e. they express approximately the same quantities of information. It should be kept in mind that we are discussing a planar approximation to a high dimensional information space. Nonetheless, it is the best such approximation.
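The pipeline described in this section (terms, the sections-by-terms frequency table, and the factors with their percentage inertia) can be sketched as follows. This is a minimal illustration and not the software used for the figures: the function names `tokenize`, `term_frequency_table` and `correspondence_analysis` are hypothetical, the regular expression is one possible reading of "words, ignoring punctuation", case is left untouched on the assumption that "no normalization" includes no case folding, and the singular value decomposition of standardized residuals is a standard formulation of Correspondence Analysis that we assume here.

```python
import re
from collections import Counter

import numpy as np


def tokenize(text):
    """Words of at least two characters; punctuation and digits are ignored.

    Assumption: no case folding, since the text states that no normalization is applied.
    """
    return [w for w in re.findall(r"[^\W\d_]+", text) if len(w) >= 2]


def term_frequency_table(sections):
    """Return (counts, terms) with counts[i, j] = occurrences of terms[j] in sections[i]."""
    bags = [Counter(tokenize(s)) for s in sections]
    terms = sorted(set().union(*bags))
    counts = np.array([[bag[t] for t in terms] for bag in bags], dtype=float)
    return counts, terms


def correspondence_analysis(counts):
    """Correspondence Analysis via SVD of the standardized residuals.

    Assumes every section contains at least one term (no all-zero row).
    Returns row coordinates, column coordinates and percentage inertia per factor.
    """
    P = counts / counts.sum()            # relative frequencies
    r = P.sum(axis=1)                    # row (section) masses
    c = P.sum(axis=0)                    # column (term) masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                    # drop numerically null factors
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    pct_inertia = 100.0 * sv**2 / np.sum(sv**2)
    return row_coords, col_coords, pct_inertia


if __name__ == "__main__":
    # Toy sections, invented purely for illustration.
    toy = [
        "value use value exchange",
        "ancient society world society",
        "value exchange ancient world of",
    ]
    counts, terms = term_frequency_table(toy)
    rows, cols, pct = correspondence_analysis(counts)
    print(terms)
    print(np.round(pct, 2))  # percentage inertia of each factor
```

The first two columns of the row and column coordinates give a planar display of the kind shown in the figures, and the sum of the first two percentages is the inertia explained by that plane.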

2 Structure of Text from Correspondence Analysis Displays

Figure 1, which uses all the data, shows the following. Most paragraphs are bunched near the origin, i.e. near the average paragraph profile. (One speaks of “profiles” in Correspondence Analysis because paragraphs, the rows, are normalized in the analysis to even out the different numbers of terms per paragraph. Similarly, term profiles, for the columns, arise through the same evening out of the different numbers of paragraphs characterized by each term.) So the paragraphs that seem to really matter are 15, 21, 20, 1, 19 and 11. What “really matter” means can be quantified through the mathematically defined correlations with, and contributions to, the factors.

As a consequence of these findings in Figure 1, we proceed with just the restricted set of 6 paragraphs as constituting the “backbone” of Marx’s text. We find that Figure 2, which uses these 6 paragraphs instead of the given 21, and hence 482 associated terms rather than 974, is not unlike Figure 1. It is important to note that the factors of Correspondence Analysis are not fixed in their orientations. Hence it is equally acceptable to have reflections in the axes, so long as there is consistency. In both figures, paragraphs 20 and 21 are closely located, and these two paragraphs are counterposed to paragraph 11. In both figures, paragraphs 20 and 21 (taken together), 11 and 15 constitute approximately the three vertices of a triangle. What does change in location in proceeding from Figure 1 to Figure 2 is paragraph 19; and paragraph 1 has come closer to the origin (and hence to the average paragraph). Figure 3 shows just one term on the display, “superstition”, which will be discussed below.

If one wished to have the most salient paragraphs to summarize the overall text, we therefore find this job to be done quite well by paragraphs 1, 11, 15, 19, 20 and 21. The reader of the text on commodity fetishism who is in a hurry should concentrate on these paragraphs!
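The profiles, contributions and correlations referred to above can be written out as follows. The notation is an assumption on our part, following common Correspondence Analysis usage, since no formulas are given in the text: k_{ij} is the frequency of term j in paragraph i, and F_α(i) is the projection of paragraph i on factor α.

```latex
% Relative frequencies and marginal masses, with k = \sum_{i,j} k_{ij}
f_{ij} = \frac{k_{ij}}{k}, \qquad
f_{i\cdot} = \sum_j f_{ij}, \qquad
f_{\cdot j} = \sum_i f_{ij}

% Profile of paragraph i, and its squared chi-squared distance from the average profile
\left( \frac{f_{ij}}{f_{i\cdot}} \right)_{j}, \qquad
\rho^2(i) = \sum_j \frac{1}{f_{\cdot j}} \left( \frac{f_{ij}}{f_{i\cdot}} - f_{\cdot j} \right)^2
          = \sum_\alpha F_\alpha(i)^2

% Inertia of factor \alpha, then contribution and correlation of paragraph i
\lambda_\alpha = \sum_i f_{i\cdot} \, F_\alpha(i)^2, \qquad
\mathrm{CTR}_\alpha(i) = \frac{f_{i\cdot} \, F_\alpha(i)^2}{\lambda_\alpha}, \qquad
\mathrm{COR}_\alpha(i) = \frac{F_\alpha(i)^2}{\rho^2(i)}
```

Under these definitions the percentage inertia of factor α is 100 λ_α / Σ_β λ_β. Both the contribution and the correlation depend only on the square of F_α(i), so they are unchanged when an axis is reflected, which is consistent with the factors having no fixed orientation. The same expressions, with rows and columns exchanged, apply to the terms.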
The factors, as already noted, are very similar in importance (as quantified by percentage of inertia explained), even if factor 1 is defined from the best projection and factor 2 from the second best. Looking at the terms that most explain these factors (using the mathematically defined contributions), and ignoring (by not reporting them here) very commonly occurring words, we find the following. Factor 1 is strongly characterized, and defined, by “value”, “use” and “exchange”, among other terms. In other words, Factor 1 accounts for the core terms of exchange value and use value. Factor 2 is strongly characterized, and defined, by “ancient”, “society” and “world”. In other words, Factor 2 is taken up with the implications for our understanding of the time-evolving context.
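Ranking terms by their contribution to a factor, while leaving very common words out of the report, might be sketched as below. The names `column_contributions` and `top_terms`, the `ignore` set and the toy frequency table are all hypothetical and serve only to illustrate the calculation; the contributions are obtained from the same SVD formulation of Correspondence Analysis that we assume elsewhere, not from the authors' own code.

```python
import numpy as np


def column_contributions(counts):
    """Contribution of each term (column) to each factor.

    Returns an array of shape (n_terms, n_factors); each column sums to 1.
    """
    P = counts / counts.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12
    # Term coordinates are G = V * sv / sqrt(c), so the contribution
    # c_j * G_j**2 / sv**2 reduces to the squared right singular vector entry.
    return Vt[keep].T ** 2


def top_terms(counts, terms, factor, n=3, ignore=frozenset()):
    """The n terms contributing most to one factor, skipping terms in `ignore`."""
    ctr = column_contributions(counts)[:, factor]
    order = np.argsort(ctr)[::-1]
    ranked = [(terms[j], float(ctr[j])) for j in order if terms[j] not in ignore]
    return ranked[:n]


if __name__ == "__main__":
    # Toy table (rows: paragraphs, columns: terms), invented for illustration only.
    terms = ["ancient", "exchange", "of", "society", "the", "use", "value", "world"]
    counts = np.array([
        [0, 3, 2, 0, 4, 3, 5, 0],
        [4, 0, 2, 3, 4, 0, 0, 3],
        [0, 2, 1, 0, 3, 2, 4, 1],
        [3, 1, 2, 4, 3, 0, 1, 2],
    ], dtype=float)
    for factor in (0, 1):
        print(factor + 1, top_terms(counts, terms, factor, ignore={"the", "of"}))
```

The common words still take part in the analysis; they are only omitted when the ranked list is reported, as in the discussion above.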

3 Visual Expressions for the Retained Subset of Paragraphs

In brief, the contents of the paragraphs are as follows.

• Paragraph 1: commodity’s attributes
• Paragraph 11: Robinson Crusoe story
• Paragraph 15: religion and worship
• Paragraph 19: on provenance of monetary system
• Paragraphs 20, 21: use value and/or exchange value

Marx’s writing can be very visual, with plenty of illustrations. In paragraph 1, there is reference to tables made from wood. Paragraphs 11 and 15 have ready-made visual metaphors. For paragraph 19, the term “superstition” may be a good basis for a visual metaphor. (In this paragraph the word is used in relation to the social relation, and not the natural property, of gold or silver as money.) See Figure 3, where this term, “superstition”, and paragraph 19 are acceptably close. The terms characterizing paragraphs 20 and 21 include, reasonably enough, “value”, “use” and “exchange”. Here it may be necessary to introduce new symbolic expressions for these fundamental terms in Marx’s text, “use value” and “exchange value”.

References

[1] A. Devitt and K. Ahmad, “Sentiment analysis: the languages of emotion and financial news” (article in preparation for ACM Transactions on Information Systems), 2009.
[2] L. Lebart, A. Salem and L. Berry, Exploring Textual Data, Kluwer, 1998. (L. Lebart and A. Salem, Analyse Statistique des Données Textuelles, Dunod, 1988.)


Figure 1: Paragraphs 1 to 21 are noted. The 974 terms that characterize these paragraphs are displayed as dots. (Axes: Factor 1, 6.76% inertia; Factor 2, 6.67% inertia.)


Figure 2: Compared to Figure 1 a smaller set of paragraphs is used here: paragraphs 1, 11, 15, 19, 20, 21. There are 482 terms in these paragraphs, displayed as dots. (Axes: Factor 1, 22.0% inertia; Factor 2, 21.4% inertia.)


Figure 3: As Figure 2 but with term “superstition” noted as “s”. (Axes: Factor 1, 22.0% inertia; Factor 2, 21.4% inertia.)


[3] K. Marx, Capital Volume 1, Part I: Commodities and Money, Chapter 1: Commodities, Section 4, The Fetishism of Commodities and the Secret Thereof, online edition: http://www.marxists.org/archive/marx/works/1867-c1/ch01.htm
[4] F. Murtagh, Correspondence Analysis and Data Coding with R and Java, Chapman and Hall/CRC Press, 2005.
[5] F. Murtagh, A. Ganz and S. McKie, “The structure of narrative: the case of film scripts”, Pattern Recognition, 42, 302–312, 2009. (See discussion in Z. Merali, “Here’s looking at you, kid. Software promises to identify blockbuster scripts.”, Nature, 453, p. 708, 4 June 2008.)
[6] F. Murtagh, “The Correspondence Analysis platform for uncovering deep structure in data and information”, Sixth Boole Lecture, Computer Journal, forthcoming, 2008. Advance Access 9 Sept. 2008 (doi:10.1093/comjnl/bxn045).
