
Analyzing the k-core entropy of word co-occurrence

networks: applications to cross-linguistic comparisons

xxx* 1 and xxx†1

1
xxx

September 9, 2021

Abstract

We study a large sample of languages across xxx families xxx, by means of the k-core
decomposition algorithm. xxx Our main goal is to xxx

Keywords: Computational typology, Co-occurrence graphs, K-core decomposition, Entropy

1 Introduction

• short- and long-range correlations between words → networks!!!

• solution → entropy + networks as a natural approach to long-range correlations

• how can we approximate the k-core decomposition of a language, to make plausible cross-linguistic comparisons?
* email

† email

This paper addresses some intriguing questions: To what extent is it possible to unveil the underlying structure of co-occurrence networks? Is it possible to use this kind of information to make comparisons across different linguistic families? Here, we explore a network-based approach to answering these questions. xxx

xxx xxx xxx xxx

Recently, Statistical Mechanics has been a fruitful framework to study statistical properties

of human language (see, among other studies, [1, 2, 3, 4]). xxxx

xxx

In order to answer the proposed questions, this paper focuses on an examination of global properties of networks (each representing a language across the world) that may capture features of the network as a whole, rather than averaging individual node properties such as node degree or clustering. This allows us to propose a simple and interpretable network-based approach to corpus-based linguistic typology (CITA). To do this, we examine the k-core decomposition, which consists of identifying layers of increasing connectivity by a simple pruning strategy. This algorithm defines inner subsets of networks that are formed not only by central nodes but also by densely connected ones. The k-shell decomposition thus provides a computationally tractable procedure (O(|V| + |E|) in time, where V and E are the node and edge sets [? ]) to study hierarchical properties of large-scale networks.

In particular, the k-core decomposition algorithm has been used to describe the hierarchical organization of the Internet and of neural networks. As already seen in two intriguing studies [? ? ], the k-core decomposition algorithm has proven fruitful for characterizing networks beyond the degree distribution and for uncovering hierarchies due to the specific structure of Internet graphs. Based on this algorithm, [? ] provide a profound characterization of the hierarchical structure of the human cortical network, as a densely connected nucleus alongside shells of increasing connectivity.

xxx xxx xxx

2 Materials and methods

2.1 Textual corpora

To compare languages from a network-based perspective, we need a comparable source of information that represents (to a certain extent) the main structural features of each language. For the sake of simplicity and to avoid some genre distortions, we based our experiments on a freely available parallel corpus.1

Figure 1: Basic description of the parallel corpus. The figure displays a summary of the
number of tokens and types.

A word type is a unique string (a sequence of Unicode characters2) delimited by white spaces.

A word token is then any repetition of a word type. Details about the corpus are shown in Fig. 1

and Table 1.
1 https://www.unicode.org/udhr/index.html
2 https://home.unicode.org/
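The type/token distinction defined above can be made concrete in a few lines of Python. This is a minimal illustration; the toy sentence is ours, not drawn from the corpus:

```python
# A toy sentence illustrating the type/token distinction
text = "the cat saw the dog"
tokens = text.split()   # word tokens: every white-space-delimited string
types = set(tokens)     # word types: the unique strings among the tokens
# "the" occurs twice, so there are 5 tokens but only 4 types
```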

Table 1: Basic description of the studied linguistic families (based on Glottolog [5]).

linguistic family glottocode languages


Indo-European indo1319 83
Atlantic-Congo atla1278 72
Austronesian aust1307 33
Sino-Tibetan sino1245 16
Quechuan quec1387 13
Afro-Asiatic afro1255 10
Turkic turk1311 10
Other families 125
Total 362

2.2 Information-theoretic entropy

The seminal works of C. E. Shannon [6] have suggested that human language is mainly based

on the statistical choice of linguistic units. A natural mathematical approach to this kind of

uncertainty is precisely Information Theory. Indeed, there is a precise measure of the average amount of choice associated with words: the word entropy.

To define this quantity, we start with a text T formed by word-types taken from the finite set Wt. In probabilistic terms, word-type probabilities are distributed according to p(w), w ∈ Wt. The average amount of choice of word-types (or simply the entropy) reads [6]

H = −Σ_{w∈Wt} p(w) log(p(w))    (1)

If we denote by fw the frequency of the word-type w, then p(w) can be estimated using the so-called maximum likelihood estimator:

p̂(w) = fw / Σ_{w′∈Wt} fw′    (2)
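The maximum likelihood (plug-in) estimate of the word entropy can be sketched as follows. This is a minimal illustration; the `word_entropy` helper and the toy token list are ours:

```python
from collections import Counter
from math import log

def word_entropy(tokens):
    """Plug-in estimate of Eq. (1): H = -sum p(w) log p(w),
    with p(w) estimated by relative frequency as in Eq. (2)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum((f / n) * log(f / n) for f in counts.values())

# Two equiprobable word types give H = log 2 (natural log, i.e. nats)
h = word_entropy(["the", "cat", "the", "cat"])
```

Note that this plug-in estimator is exactly the quantity affected by the finite-sample problems discussed below, which motivate more sophisticated estimators.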

xxxxx

As noticed by [7], the estimation of word entropy involves two main problems. First,

NSB estimator [8].

2.3 Basic concepts on Network Theory

Each language is viewed here as an undirected network G = (V, E), where V represents the set of word-types, while E is formed by pairs of word-types occurring within a fixed-size sliding window. This simple procedure captures, in a statistical sense, the distances at which relationships between linguistic units occur. The neighborhood of the node u ∈ V is the set Vu = {v ∈ V : uv ∈ E}. The degree of the node u ∈ V is the size of Vu. An induced subnetwork of G is a network formed by a subset V′ of V and all the edges of G joining nodes of V′.

2.4 k-shell decomposition

The k-shell decomposition algorithm is an iterative process that at each step identifies the connectivity of the most “external” nodes, kmin = min_{u∈V} d(u), and removes the nodes with degree less than or equal to kmin, until the core of the network is revealed [? ? ]. More precisely, the algorithm reads (see Figure 2):

Step 1. Start with the network G and the minimum node degree kmin = min_{u∈V} d(u).

Step 2. Remove all nodes with d(u) ≤ kmin, resulting in a pruned set V′ that induces the subgraph G′.

Step 3. Replace G by G′ and go back to Step 1.

Stop condition. Stop when G′ is the null network (the network without nodes).

The k-shell is the subnetwork induced by the nodes removed at a given step k. All nodes of the k-shell are associated with the index k. The shell index of a node u ∈ V is denoted k(u). In our analysis, the k-crust of the graph G is the subgraph induced by the set V \ Vk, where Vk is the set of nodes of the k-shell.
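The pruning procedure in Steps 1–3 can be sketched directly in Python. This is a minimal illustration on a toy graph; `k_shell_indices` is our name for the sketch, and in practice NetworkX's built-in `core_number` computes the same indices:

```python
import networkx as nx

def k_shell_indices(G):
    """Steps 1-3 above: repeatedly prune nodes of minimum degree
    until the null network remains; returns the shell index k(u)."""
    G = G.copy()
    shell = {}
    k = 0
    while G.number_of_nodes() > 0:
        # Step 1: current minimum degree (the index k never decreases)
        k = max(k, min(d for _, d in G.degree()))
        # Step 2: prune all nodes with d(u) <= k; repeat, since each
        # removal may drop other degrees to <= k (Step 3 loops back)
        to_prune = [u for u, d in G.degree() if d <= k]
        while to_prune:
            for u in to_prune:
                shell[u] = k
            G.remove_nodes_from(to_prune)
            to_prune = [u for u, d in G.degree() if d <= k]
    return shell

# Toy network: a 4-clique (innermost 3-shell) with a pendant chain
G = nx.complete_graph(4)
G.add_edges_from([(3, 4), (4, 5)])
shells = k_shell_indices(G)   # agrees with nx.core_number(G)
```

Here the chain nodes are pruned in the first pass (shell index 1), while the clique survives until k = 3.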


Figure 2: k-shell decomposition in a simple network. At each step, the k-shell decomposition algorithm searches for the nodes of lowest degree k. Then, all nodes with degree ≤ k are pruned. From the remaining set of nodes, the degree is computed for each node. If nodes have degree ≤ k, the pruning phase is repeated. Next, the process searches again for nodes of lowest degree. In this example, the k-shell decomposition reveals three layers of increasing connectivity. Red nodes are associated with the first step. Then, these nodes are pruned from the network and define the 1-shell. The next step searches for nodes with degree ≤ 2 (yellow nodes), which define the 2-shell. Finally, pruning this set of nodes reveals the 3-shell (blue nodes).

2.5 Information-theoretic approaches to structural patterns in networks

• von Neumann entropy

• k-core entropy (our proposal)

2.6 Network construction and implementation details

For basic text preprocessing (whitespace tokenization, punctuation removal and conversion to lower case), we used NLTK [9]. Network-theoretic techniques (in particular, the k-core decomposition) were implemented using NetworkX [10].

For each language, its associated network was built along the following steps:

Step 1. Preprocess each sentence (equivalent to a line) by whitespace tokenization, punctuation removal and conversion to lower case.

Step 2. Define the set of word-types Wt of the entire text.

Step 3. Through an iterative process, inspect each sentence in order to find word-types occurring within a fixed-size window (based on the fact that dependency relationships occur in general at small distances [11]). Each new co-occurrence between pairs of word-types from Wt defines an edge of the network. Repetitions of bigrams increase the weight of the respective edge.

Fixed-size co-occurrence windows are varied from 1 to 5.
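The construction above can be sketched as follows. This is a minimal illustration with NetworkX; the `cooccurrence_network` helper and the toy sentences are ours, and we assume (as one plausible reading of Step 3) that every pair of word-types at most `radius` positions apart is linked:

```python
import networkx as nx

def cooccurrence_network(sentences, radius=2):
    """Link every pair of word-types occurring at most `radius`
    positions apart in a sentence; repeated co-occurrences
    increase the edge weight (cf. Step 3)."""
    G = nx.Graph()
    for tokens in sentences:
        for i, u in enumerate(tokens):
            for j in range(i + 1, min(i + radius + 1, len(tokens))):
                v = tokens[j]
                if u == v:
                    continue  # skip self-loops from repeated types
                if G.has_edge(u, v):
                    G[u][v]["weight"] += 1
                else:
                    G.add_edge(u, v, weight=1)
    return G

# Already-preprocessed toy sentences (lower-cased, no punctuation)
sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
G = cooccurrence_network(sents, radius=2)
```

In this toy example the pair ("the", "sat") co-occurs in both sentences, so its edge carries weight 2.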

3 Results

3.1 Basic description of the k-core entropy for our sample of languages

We first shed light on a basic description of the word co-occurrence networks, based on the k index. As explained above, at each step of the k-core decomposition algorithm nodes (or word-types in our case) are associated with an integer number (the k index), describing their level of connectivity. To give a global characterization of the distribution of k indexes, we simply calculate the average over all the nodes of the word co-occurrence networks: k̄ = (1/n) Σ_{i∈V} ki, where ki denotes the k index for the node i. A first important question concerns the possible influence of the radius on our results. Fig. 3 displays histograms of the calculated average k index for all XX languages and different window sizes. The average k index is distributed as follows: (radius 1) around a mean of 2.42 (SD = 0.53); (radius 2) around a mean of 4.5 (SD = 1.03); (radius 3) around a mean of 6.33 (SD = 1.47); (radius 4) around a mean of 7.93 (SD = 1.82); and (radius 5) around a mean of 9.37 (SD = 2.13). It is clear from these simple calculations that there is a linear relationship between the average k index and the window size. A simple linear fit yields a slope of 1.73 for the relationship between radius and average k index. This fact suggests that the choice of a particular window size only rescales the average k index by a roughly constant factor.
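The quoted slope can be checked directly from the five per-radius means reported above, via an ordinary least-squares line:

```python
import numpy as np

# Mean average-k-index values reported above, for radii 1 to 5
radii = np.array([1, 2, 3, 4, 5])
mean_k = np.array([2.42, 4.50, 6.33, 7.93, 9.37])

# Ordinary least-squares line through the five (radius, mean) points
slope, intercept = np.polyfit(radii, mean_k, 1)
# slope comes out close to the 1.73 quoted in the text
```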

Figure 3: Histograms of the average core number. The figure displays histograms of the average k index across our sample of languages, for each window size.

3.2 Comparisons with previous approaches to computational typology

To compare the previous observations with other approaches to the global quantitative comparison between languages (for example, [12]), we stress the fact that the type-token ratio is positively correlated with morphological complexity. With this in mind, it seems reasonable to explore the relationship between the hierarchical information provided by the k-core algorithm and a simple corpus-based measure. As shown in Fig. 5, there is a clear exponential decay of the average k index as the type-token ratio increases. Remarkably, despite the linear influence of the radius, this exponential behavior does not seem to be affected by the choice of the fixed window size.
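One simple way to quantify such an exponential decay is a least-squares fit in log space. The sketch below uses synthetic data standing in for the per-language (type-token ratio, average k index) pairs, purely to illustrate the procedure; the model parameters here are ours, not the fitted values of the paper:

```python
import numpy as np

# Synthetic stand-in for the (TTR, average k index) pairs of Fig. 5:
# k_bar = a * exp(-b * TTR) with a = 6, b = 3, plus mild
# multiplicative noise (illustrative data only)
rng = np.random.default_rng(0)
ttr = rng.uniform(0.05, 0.6, 100)
k_bar = 6.0 * np.exp(-3.0 * ttr) * np.exp(rng.normal(0.0, 0.05, 100))

# The decay is linear in log space: log k_bar = log a - b * ttr,
# so ordinary least squares recovers the decay rate b and scale a
neg_b, log_a = np.polyfit(ttr, np.log(k_bar), 1)
a_hat, b_hat = np.exp(log_a), -neg_b
```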

Figure 4: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.

Figure 5: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.

3.3 k-core entropy

3.4 Applications of k-core entropy to the analysis of two large linguistic families

To examine in more depth the appearance of the exponential decay of the average k index as a function of the type-token ratio, we study in detail word co-occurrence networks for a radius of 2. As shown in Fig. 6, there are radical differences between low and high type-token ratio languages regarding the average k index. A low average k index is observed for Quechuan languages (mean of 3.7; SD = 0.3); by contrast, morphologically simpler languages (like the Austronesian family) display a mean of 5.24 (SD = 1). This fact suggests in principle that the average k index is lower for families exhibiting higher morphological complexity. In other terms, families displaying a low average k index have a large proportion of wordforms (due to their rich morphological rules). It is thus reasonable to hypothesize that the detection of a large type-token ratio is strong evidence of a simpler hierarchical network structure.

Figure 6: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.

3.5 Average k index across our sample of languages

xxxx

References

[1] Jin Cong and Haitao Liu. Approaching human language with complex networks. Physics

of Life Reviews, 11(4):598–618, 2014.

[2] Yuyang Gao, Wei Liang, Yuming Shi, and Qiuling Huang. Comparison of directed and

weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and

its Applications, 393:579–589, 2014.

[3] Ricard V. Solé, Bernat Corominas-Murtra, Sergi Valverde, and Luc Steels. Language net-

works: Their structure, function, and evolution. Complexity, 15(6):20–26, 2010.

[4] Luı́s F. Seoane and Ricard Solé. The morphospace of language networks. Scientific Re-

ports, 8(1):10465, 2018.

[5] Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. Glottolog

4.3. Jena, 2020.

[6] C. E. Shannon. A mathematical theory of communication. The Bell System Technical

Journal, 27(3):379–423, 1948.

[7] Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. The

entropy of words—learnability and expressivity across more than 1000 languages. Entropy,

19(6), 2017.

[8] Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14 (NIPS 2001). Neural Information Processing Systems Foundation, 2002.

[9] Edward Loper and Steven Bird. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, USA, 2002. Association for Computational Linguistics.

[10] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure,

dynamics, and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod

Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15,

Pasadena, CA USA, 2008.

[11] R. Ferrer-i-Cancho, C. Gómez-Rodríguez, J. L. Esteban, and L. Alemany-Puig. The optimality of syntactic dependency distances. Under review, 2020.

[12] Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. A compari-

son between morphological complexity measures: Typological data vs. language corpora.

In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity

(CL4LC), pages 142–153, Osaka, Japan, December 2016. The COLING 2016 Organizing

Committee.

