
Analyzing the k-core entropy of word co-occurrence

networks: applications to cross-linguistic comparisons

xxx* 1 and xxx†1

1
xxx

September 9, 2021

Abstract

We study a large sample of languages across xxx families xxx, by means of the k-core
decomposition algorithm. xxx Our main goal is to xxx

Keywords: Computational typology, Co-occurrence graphs, K-core decomposition, Entropy

1 Introduction

• short- and long-range correlations between words → networks!!!

• solution → entropy + networks as a natural approach to long-range correlations

• how can we approximate the k-core decomposition of a language, to make plausible cross-linguistic comparisons?
* email

† email

This paper addresses some intriguing questions: To what extent is it possible to unveil the underlying structure of co-occurrence networks? Is it possible to use this kind of information to make comparisons across different linguistic families? Here, we explore a network-based approach to answering these questions. xxx

xxx xxx xxx xxx

Recently, Statistical Mechanics has been a fruitful framework to study statistical properties

of human language (see, among other studies, [1, 2, 3, 4]). xxxx

xxx

In order to answer the proposed questions, this paper focuses on an examination of global properties of networks (each representing a language across the world) that may capture features of the network as a whole, rather than averaging individual node properties such as node degree or clustering. This allows us to propose a simple and interpretable network-based approach to corpus-based linguistic typology (CITA). To do this, we examine the k-core decomposition, which consists of identifying layers of increasing connectivity by a simple pruning strategy. This algorithm defines inner subsets of networks that are formed not only by central nodes but also by densely connected ones. The k-shell decomposition thus provides a computationally tractable procedure (O(|V| + |E|) in time, where V and E are the node and edge sets [? ]) to study hierarchical properties of large-scale networks.

In particular, the k-core decomposition algorithm has been used to describe the hierarchical organization of the Internet and of neural networks. As already seen in two intriguing studies [? ? ], the k-core decomposition algorithm has proven fruitful for characterizing networks beyond the degree distribution and for uncovering hierarchies due to the specific structure of Internet graphs. Based on this algorithm, [? ] provide a profound characterization of the hierarchical structure of the human cortical network, as a densely connected nucleus alongside shells of increasing connectivity.

xxx xxx xxx

2 Materials and methods

2.1 Textual corpora

To compare languages from a network-based perspective, we need a comparable source of information that represents (to a certain extent) the main structural features of each language. For the sake of simplicity and to avoid some genre distortions, we based our experiments on a freely available parallel corpus.1

Figure 1: Basic description of the parallel corpus. The figure displays a summary of the
number of tokens and types.

A word type is a unique string (a sequence of Unicode characters2) delimited by white spaces.

A word token is then any repetition of a word type. Details about the corpus are shown in Fig. 1

and Table 1.
1 https://www.unicode.org/udhr/index.html
2 https://home.unicode.org/
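The type/token distinction defined above can be made concrete in a few lines of Python. This is a minimal illustration; the toy sentence is ours, not drawn from the corpus:

```python
# A toy sentence illustrating the type/token distinction
text = "the cat saw the dog"
tokens = text.split()   # word tokens: every white-space-delimited string
types = set(tokens)     # word types: the unique strings among the tokens
# "the" occurs twice, so there are 5 tokens but only 4 types
```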

Table 1: Basic description of the studied linguistic families (based on Glottolog [5]).

linguistic family glottocode languages


Indo-European indo1319 83
Atlantic-Congo atla1278 72
Austronesian aust1307 33
Sino-Tibetan sino1245 16
Quechuan quec1387 13
Afro-Asiatic afro1255 10
Turkic turk1311 10
Other families 125
Total 362

2.2 Information-theoretic entropy

The seminal works of C. E. Shannon [6] have suggested that human language is mainly based

on the statistical choice of linguistic units. A natural mathematical approach to this kind of

uncertainty is precisely Information Theory. Indeed, there is a precise measure of the average amount of choice associated with words: the word entropy.

To define this quantity, we start with a text T formed by word-types taken from the finite set Wt. In probabilistic terms, word-type probabilities are distributed according to p(w), w ∈ Wt. The average amount of choice of word-types (or simply the entropy) reads [6]

H = −Σ_{w∈Wt} p(w) log(p(w))    (1)

If we denote by fw the frequency of the word-type w, then p(w) can be estimated using the so-called maximum likelihood estimator:

p̂(w) = fw / Σ_{w′∈Wt} fw′    (2)
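The maximum likelihood (plug-in) estimate of the word entropy can be sketched as follows. This is a minimal illustration; the `word_entropy` helper and the toy token list are ours:

```python
from collections import Counter
from math import log

def word_entropy(tokens):
    """Plug-in estimate of Eq. (1): H = -sum p(w) log p(w),
    with p(w) estimated by relative frequency as in Eq. (2)."""
    counts = Counter(tokens)
    n = sum(counts.values())
    return -sum((f / n) * log(f / n) for f in counts.values())

# Two equiprobable word types give H = log 2 (natural log, i.e. nats)
h = word_entropy(["the", "cat", "the", "cat"])
```

Note that this plug-in estimator is exactly the quantity affected by the finite-sample problems discussed below, which motivate more sophisticated estimators.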

xxxxx

As noticed by [7], the estimation of word entropy involves two main problems. First,

NSB estimator [8].

2.3 Basic concepts on Network Theory

Each language is viewed here as an undirected network G = (V, E), where V represents the set of word-types, while E is formed by pairs of word-types occurring within a fixed-size sliding window. This simple procedure captures, in a statistical sense, the distances at which relationships between linguistic units occur. The neighborhood of the node u ∈ V is the set Vu = {v ∈ V : uv ∈ E}. The degree of the node u ∈ V is the size of Vu. An induced subnetwork of G is a network formed by a subset V′ of V and all the edges of G joining nodes of V′.

2.4 k-shell decomposition

The k-shell decomposition algorithm is an iterative process that at each step identifies the connectivity of the most “external” nodes, kmin = min_{u∈V} d(u), and removes the nodes with degree less than or equal to kmin, until the core of the network is revealed [? ? ]. More precisely, the algorithm reads (see Figure 2):

Step 1. Start with the network G and the minimum node degree kmin = min_{u∈V} d(u).

Step 2. Remove all nodes with d(u) ≤ kmin, resulting in a pruned set V′ that induces the subgraph G′.

Step 3. Replace G by G′ and go back to Step 1.

Stop condition. Stop when G′ is the null network (the network without nodes).

The k-shell is the subnetwork induced by the nodes removed at a given step k. All nodes of the k-shell are associated with the index k. The shell index of a node u ∈ V is denoted k(u). In our analysis, the k-crust of the graph G is the subgraph induced by the set V \ Vk, where Vk is the set of nodes of the k-shell.
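The pruning procedure in Steps 1–3 can be sketched directly in Python. This is a minimal illustration on a toy graph; `k_shell_indices` is our name for the sketch, and in practice NetworkX's built-in `core_number` computes the same indices:

```python
import networkx as nx

def k_shell_indices(G):
    """Steps 1-3 above: repeatedly prune nodes of minimum degree
    until the null network remains; returns the shell index k(u)."""
    G = G.copy()
    shell = {}
    k = 0
    while G.number_of_nodes() > 0:
        # Step 1: current minimum degree (the index k never decreases)
        k = max(k, min(d for _, d in G.degree()))
        # Step 2: prune all nodes with d(u) <= k; repeat, since each
        # removal may drop other degrees to <= k (Step 3 loops back)
        to_prune = [u for u, d in G.degree() if d <= k]
        while to_prune:
            for u in to_prune:
                shell[u] = k
            G.remove_nodes_from(to_prune)
            to_prune = [u for u, d in G.degree() if d <= k]
    return shell

# Toy network: a 4-clique (innermost 3-shell) with a pendant chain
G = nx.complete_graph(4)
G.add_edges_from([(3, 4), (4, 5)])
shells = k_shell_indices(G)   # agrees with nx.core_number(G)
```

Here the chain nodes are pruned in the first pass (shell index 1), while the clique survives until k = 3.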


Figure 2: k-shell decomposition in a simple network. At each step, the k-shell decomposition algorithm searches for the nodes of lowest degree k. Then, all nodes with degree ≤ k are pruned. From the remaining set of nodes, the degree is computed for each node. If nodes have degree ≤ k, the pruning phase is repeated. Next, the process searches again for nodes of lowest degree. In this example, the k-shell decomposition reveals three layers of increasing connectivity. Red nodes are associated with the first step. Then, these nodes are pruned from the network and define the 1-shell. The next step searches for nodes with degree ≤ 2 (yellow nodes), which define the 2-shell. Finally, pruning this set of nodes reveals the 3-shell (blue nodes).

2.5 Information-theoretic approaches to structural patterns in networks

• von Neumann entropy

• k-core entropy (our proposal)

2.6 Network construction and implementation details

For basic text preprocessing (whitespace tokenization, punctuation removal and conversion to lower case), we used NLTK [9]. Network-theoretic techniques (in particular, the k-core decomposition) were implemented using NetworkX [10].

For each language, its associated network was built along the following steps:

Step 1. Preprocess each sentence (equivalent to a line) by whitespace tokenization, punctuation removal and conversion to lower case.

Step 2. Define the set of word-types Wt of the entire text.

Step 3. Through an iterative process, inspect each sentence in order to find word-types occurring within a fixed-size window (based on the fact that dependency relationships occur in general at small distances [11]). Each new co-occurrence between pairs of word-types from Wt defines an edge of the network. Repetitions of bigrams increase the weight of the respective edge.

Fixed-size co-occurrence windows are varied from 1 to 5.
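The construction above can be sketched as follows. This is a minimal illustration with NetworkX; the `cooccurrence_network` helper and the toy sentences are ours, and we assume (as one plausible reading of Step 3) that every pair of word-types at most `radius` positions apart is linked:

```python
import networkx as nx

def cooccurrence_network(sentences, radius=2):
    """Link every pair of word-types occurring at most `radius`
    positions apart in a sentence; repeated co-occurrences
    increase the edge weight (cf. Step 3)."""
    G = nx.Graph()
    for tokens in sentences:
        for i, u in enumerate(tokens):
            for j in range(i + 1, min(i + radius + 1, len(tokens))):
                v = tokens[j]
                if u == v:
                    continue  # skip self-loops from repeated types
                if G.has_edge(u, v):
                    G[u][v]["weight"] += 1
                else:
                    G.add_edge(u, v, weight=1)
    return G

# Already-preprocessed toy sentences (lower-cased, no punctuation)
sents = [["the", "cat", "sat"], ["the", "dog", "sat"]]
G = cooccurrence_network(sents, radius=2)
```

In this toy example the pair ("the", "sat") co-occurs in both sentences, so its edge carries weight 2.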

3 Results

3.1 Basic description of the k-core entropy for our sample of languages

We first shed light on a basic description of the word co-occurrence networks, based on the k index. As explained above, at each step of the k-core decomposition algorithm nodes (or word-types in our case) are associated with an integer number (the k index), describing their level of connectivity. To give a global characterization of the distribution of k indexes, we simply calculate the average over all the nodes of the word co-occurrence networks: k̄ = (1/n) Σ_{i∈V} ki, where ki denotes the k index for the node i. A first important question concerns the possible influence of the radius on our results. Fig. 3 displays histograms of the calculated average k index for all XX languages and different window sizes. The average k index is distributed as follows: (radius 1) around a mean of 2.42 (SD = 0.53); (radius 2) around a mean of 4.5 (SD = 1.03); (radius 3) around a mean of 6.33 (SD = 1.47); (radius 4) around a mean of 7.93 (SD = 1.82); and (radius 5) around a mean of 9.37 (SD = 2.13). It is clear from these simple calculations that there is a linear relationship between the average k index and the window size. A simple linear fit yields a slope of 1.73 for the relationship between radius and average k index. This fact suggests that the choice of a particular window size only rescales the average k index by a roughly constant factor.
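The quoted slope can be checked directly from the five per-radius means reported above, via an ordinary least-squares line:

```python
import numpy as np

# Mean average-k-index values reported above, for radii 1 to 5
radii = np.array([1, 2, 3, 4, 5])
mean_k = np.array([2.42, 4.50, 6.33, 7.93, 9.37])

# Ordinary least-squares line through the five (radius, mean) points
slope, intercept = np.polyfit(radii, mean_k, 1)
# slope comes out close to the 1.73 quoted in the text
```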

Figure 3: Histograms of the average core number. The figure displays histograms of the average k index across our sample of languages, for each window size.

3.2 Comparisons with previous approaches to computational typology

To compare the previous observations with other approaches to the global quantitative comparison between languages (for example, [12]), we stress the fact that the type-token ratio is positively correlated with morphological complexity. With this in mind, it seems reasonable to explore the relationship between the hierarchical information provided by the k-core algorithm and a simple corpus-based measure. As shown in Fig. 5, there is a clear exponential decay of the average k index as the type-token ratio increases. Remarkably, despite the linear influence of the radius, this exponential behavior does not seem to be affected by the choice of the fixed window size.
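One simple way to quantify such an exponential decay is a least-squares fit in log space. The sketch below uses synthetic data standing in for the per-language (type-token ratio, average k index) pairs, purely to illustrate the procedure; the model parameters here are ours, not the fitted values of the paper:

```python
import numpy as np

# Synthetic stand-in for the (TTR, average k index) pairs of Fig. 5:
# k_bar = a * exp(-b * TTR) with a = 6, b = 3, plus mild
# multiplicative noise (illustrative data only)
rng = np.random.default_rng(0)
ttr = rng.uniform(0.05, 0.6, 100)
k_bar = 6.0 * np.exp(-3.0 * ttr) * np.exp(rng.normal(0.0, 0.05, 100))

# The decay is linear in log space: log k_bar = log a - b * ttr,
# so ordinary least squares recovers the decay rate b and scale a
neg_b, log_a = np.polyfit(ttr, np.log(k_bar), 1)
a_hat, b_hat = np.exp(log_a), -neg_b
```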

Figure 4: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.

Figure 5: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.

3.3 k-core entropy

3.4 Applications of k-core entropy to the analysis of two large linguistic families

To examine in more depth the appearance of the exponential decay of the average k index as a function of the type-token ratio, we study in detail word co-occurrence networks for a radius of 2. As shown in Fig. 6, there are radical differences between low and high type-token ratio languages regarding the average k index. A low average k index is observed for Quechuan languages (mean of 3.7; SD = 0.3); by contrast, morphologically simpler languages (like the Austronesian family) display a mean of 5.24 (SD = 1). This fact suggests in principle that the average k index is lower for families exhibiting higher morphological complexity. In other terms, families displaying a low average k index have a large proportion of wordforms (due to their rich morphological rules). It is thus reasonable to hypothesize that the detection of a large type-token ratio is strong evidence of a simpler hierarchical network structure.

Figure 6: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.

3.5 Average k index across our sample of languages

xxxx

References

[1] Jin Cong and Haitao Liu. Approaching human language with complex networks. Physics

of Life Reviews, 11(4):598–618, 2014.

[2] Yuyang Gao, Wei Liang, Yuming Shi, and Qiuling Huang. Comparison of directed and

weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and

its Applications, 393:579–589, 2014.

[3] Ricard V. Solé, Bernat Corominas-Murtra, Sergi Valverde, and Luc Steels. Language net-

works: Their structure, function, and evolution. Complexity, 15(6):20–26, 2010.

[4] Luı́s F. Seoane and Ricard Solé. The morphospace of language networks. Scientific Re-

ports, 8(1):10465, 2018.

[5] Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. Glottolog

4.3. Jena, 2020.

[6] C. E. Shannon. A mathematical theory of communication. The Bell System Technical

Journal, 27(3):379–423, 1948.

[7] Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. The

entropy of words—learnability and expressivity across more than 1000 languages. Entropy,

19(6), 2017.

[8] Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14 (NIPS 2001). Neural Information Processing Systems Foundation, 2002.

[9] Edward Loper and Steven Bird. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, USA, 2002. Association for Computational Linguistics.

[10] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure,

dynamics, and function using networkx. In Gaël Varoquaux, Travis Vaught, and Jarrod

Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15,

Pasadena, CA USA, 2008.

[11] R. Ferrer-i-Cancho, C. Gómez-Rodríguez, J. L. Esteban, and L. Alemany-Puig. The optimality of syntactic dependency distances. Under review, 2020.

[12] Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. A compari-

son between morphological complexity measures: Typological data vs. language corpora.

In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity

(CL4LC), pages 142–153, Osaka, Japan, December 2016. The COLING 2016 Organizing

Committee.

