xxx
September 9, 2021
Abstract
We study a large sample of languages across xxx families xxx, by means of the k-core decomposition algorithm. xxx Our main goal is to xxx entropy.
1 Introduction
This paper addresses some intriguing questions: To what extent is it possible to unveil the
Recently, Statistical Mechanics has been a fruitful framework to study statistical properties
xxx
In order to answer the proposed questions, this paper focuses on an examination of global
properties of networks (each representing a language of the world) that may capture features
of the network as a whole, rather than averaging individual node properties such as node
degree or clustering. This allows us to propose a simple and interpretable network-based approach
This algorithm defines inner subsets of networks which are formed not only by central nodes
but also by densely connected ones. The k-shell decomposition thus provides a computationally
tractable procedure (O(|V| + |E|) in time, where V and E are the node and edge sets [? ]) to study
In particular, the k-core decomposition algorithm has been used to describe the hierarchical
organization of the Internet and of neural networks. As shown in two intriguing studies [? ? ],
the k-core decomposition algorithm has proven fruitful for characterizing networks beyond the degree
distribution and for uncovering hierarchies due to the specific structure of Internet graphs. Based on
2 Materials and methods
formation that represents (to a certain extent) the main structural features of each language. For
the sake of simplicity and to avoid some genre distortions, we based our experiments on a freely
Figure 1: Basic description of the parallel corpus. The figure displays a summary of the
number of tokens and types.
A word type is a unique string (a sequence of Unicode characters²) delimited by white spaces.
A word token is then any occurrence of a word type. Details about the corpus are shown in Fig. 1
and Table 1.
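These definitions translate directly into code. A minimal sketch in Python (the function name is ours; plain whitespace tokenization, without the punctuation handling described in the methods below):

```python
def count_types_and_tokens(text):
    """Count word tokens (all occurrences) and word types (unique strings),
    using whitespace tokenization on lower-cased text."""
    tokens = text.lower().split()   # word tokens: every occurrence
    types = set(tokens)             # word types: unique strings
    return len(tokens), len(types)

# Example sentence: 11 tokens but only 8 distinct types
n_tokens, n_types = count_types_and_tokens(
    "all human beings are born free and equal all are equal"
)
ttr = n_types / n_tokens  # type-token ratio
```

The type-token ratio computed this way is the corpus-based measure used later as a proxy for morphological complexity.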
1 https://www.unicode.org/udhr/index.html
2 https://home.unicode.org/
Table 1: Basic description of the studied linguistic families (based on Glottolog [5]).
The seminal works of C. E. Shannon [6] suggested that human language is mainly based
on the statistical choice of linguistic units. A natural mathematical approach to this kind of
uncertainty is precisely Information Theory. Indeed, there is a precise quantity capturing the average
To define this quantity, we start with a text T formed by word-types taken from the
finite set W_t. In probabilistic terms, word-type probabilities are distributed according to p(w),
w ∈ W_t. The average amount of choice of word-types (or simply the entropy) reads [6]
H = -\sum_{w \in W_t} p(w) \log p(w)    (1)
If we denote by f_w the frequency of the word-type w, then p(w) can be estimated using the so-called maximum-likelihood (plug-in) estimator

\hat{p}(w) = \frac{f_w}{\sum_{w' \in W_t} f_{w'}}    (2)
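For illustration, Eqs. (1) and (2) can be implemented in a few lines (a sketch; the function name is ours):

```python
import math
from collections import Counter

def plugin_entropy(tokens):
    """Estimate the word entropy of Eq. (1), with p(w) replaced by the
    empirical frequency estimate of Eq. (2)."""
    counts = Counter(tokens)       # f_w for every word-type w
    total = sum(counts.values())   # normalizing sum over W_t
    return -sum((f / total) * math.log(f / total) for f in counts.values())

# Four equally frequent types give the maximal entropy log(4)
h = plugin_entropy(["a", "b", "c", "d"])
```

Note that this naive plug-in estimate suffers precisely from the estimation problems discussed next.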
xxxxx
As noticed by [7], the estimation of word entropy involves two main problems. First,
2.3 Basic concepts of Network Theory
Each language is viewed here as an undirected network G = (V, E), where V represents the
set of word-types, while E is formed by pairs of word-types occurring within a fixed-size sliding
window. This simple procedure embodies a statistical approach to the distance at which
relationships between linguistic units occur. The neighborhood of the node u ∈ V is the set
of nodes adjacent to u. An induced subnetwork of G is a network formed by a subset V′ of V and all the
edges of G between nodes of V′.
The k-shell decomposition algorithm is an iterative process that at each step identifies the connectivity
of the most “external” nodes, k_min = min_{u∈V} d(u), and removes the nodes with degree
less than or equal to k_min, until the core of the network is revealed [? ? ]. More precisely, the algorithm proceeds as follows:
Step 1. Start with the network G and the minimum node degree k_min = min_{u∈V} d(u).
Step 2. Remove all nodes with d(u) ≤ k_min, resulting in a pruned set V′ that induces the
subgraph G′.
The k-shell is the subnetwork induced by the nodes removed at a given step k. All nodes of
the k-shell are associated with the index k. The shell index for a node u ∈ V is denoted k(u). In
our analysis, the k-crust of the graph G is the subgraph induced by the set V \ V_k, where V_k is
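The iterative pruning described above can be sketched in pure Python (an illustrative implementation; in practice NetworkX's `core_number` computes the same indices):

```python
def shell_index(adj):
    """Compute the shell index k(u) of every node in an undirected graph,
    given as an adjacency dict {node: set of neighbors}."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    k_of = {}
    k = 0
    while adj:
        k = max(k, min(len(vs) for vs in adj.values()))  # current k_min
        pruned = [u for u, vs in adj.items() if len(vs) <= k]
        while pruned:  # repeat pruning while some degree drops to <= k
            for u in pruned:
                k_of[u] = k
                for v in adj.pop(u):
                    adj.get(v, set()).discard(u)
            pruned = [u for u, vs in adj.items() if len(vs) <= k]
    return k_of

# Triangle {1, 2, 3} with a pendant node 0: node 0 lies in the 1-shell,
# the triangle nodes in the 2-shell.
ks = shell_index({0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}})
```

Averaging the resulting indices over all nodes yields the mean shell index used in the Results section.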
Figure 2: k-shell decomposition in a simple network. At each step, the k-shell decomposition
algorithm searches for the nodes of lowest degree k. Then, all nodes with degree ≤ k are pruned.
From the remaining set of nodes, the degree is recomputed for each node. If some nodes still have degree
≤ k, the pruning phase is repeated. Next, the process searches again for the nodes of lowest degree.
In this example, the k-shell decomposition reveals three layers of increasing connectivity. Red
nodes are associated with the first step. These nodes are pruned from the network and define
the 1-shell. The next step searches for nodes with degree ≤ 2 (yellow nodes), which define the
2-shell. Finally, pruning this set of nodes reveals the 3-shell (blue nodes).
For basic text preprocessing (whitespace tokenization, punctuation removal and conversion to
lower case), we used NLTK [9]. Network-theoretic techniques (in particular, k-core decomposition) were implemented with NetworkX [10].
For each language, its associated network was built along the following steps:
Step 3. Through an iterative process, inspect each sentence in order to find word-types
occurring within a fixed-size window (based on the fact that dependency relationships
occur in general at small distances [11]). Each new co-occurrence between a pair of word-types
from W_t defines an edge of the network. Repetitions of bigrams increase the weight of the corresponding edge.
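This construction step can be sketched with the standard library alone (names are ours; `radius` plays the role of the fixed-size window, and edge weights count repeated co-occurrences):

```python
from collections import Counter

def cooccurrence_edges(sentences, radius=2):
    """Build weighted, undirected co-occurrence edges: two word-types are
    linked whenever they occur at most `radius` positions apart within a
    sentence; repeated co-occurrences increase the edge weight."""
    weights = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            for v in sent[i + 1 : i + 1 + radius]:  # types inside the window
                if w != v:                          # skip self-loops
                    weights[frozenset((w, v))] += 1
    return weights

edges = cooccurrence_edges([["a", "b", "a"], ["b", "c"]], radius=1)
```

From such a weight table, a NetworkX graph can then be assembled with `G.add_edge(u, v, weight=w)`.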
3 Results
3.1 Basic description of the k-core entropy for our sample of languages
We first shed light on a basic description of the word co-occurrence networks, based on the
k index. As explained above, at each step of the k-core decomposition algorithm nodes (or
word-types in our case) are associated with an integer number (the k index), describing their level
of connectivity. To summarize this information, we calculate the average over all the nodes of the word co-occurrence networks:

\bar{k} = \frac{1}{n} \sum_{i \in V} k_i,

where k_i denotes the k index for the node i. A first important question concerns the possible
influence of the radius on our results. Fig. 3 displays histograms for the calculated average
k index for all XX languages and different window sizes. The average k index is distributed
as follows: (radius 1) around a mean of 2.42 (SD = 0.53); (radius 2) around a mean of 4.5
(SD = 1.03); (radius 3) around a mean of 6.33 (SD = 1.47); (radius 4) around a mean of 7.93
(SD = 1.82); and (radius 5) around a mean of 9.37 (SD = 2.13). It is clear from these simple
calculations that there is a linear relationship between the average k index and the window size.
A simple calculation shows a slope of 1.73 for the relationship between radius and average k
index. This fact suggests that the choice of a particular window size affects the results only by a constant factor.
Figure 3: Histograms of the average core number. The figure displays histograms of the
average k index across our sample of languages, for different window sizes.
To compare the previous observations with other approaches to the global quantitative comparison
between languages (for example, [12]), we stress the fact that the type-token ratio is positively
correlated with morphological complexity. With this in mind, it seems reasonable to explore the
relationship between the hierarchical information provided by the k-core algorithm and this simple
corpus-based measure. As shown in Fig. 5, there is a clear exponential decay of the average
k index as the type-token ratio increases. Remarkably, despite the linear influence of the radius, this
exponential behavior does not seem to be affected by the choice of the fixed window size.
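Such an exponential decay, k̄(x) ≈ a·exp(−b·x) with x the type-token ratio, can be checked by an ordinary least-squares fit on log(k̄) (an illustrative sketch; the data below are synthetic, not the paper's):

```python
import math

def fit_exponential_decay(x, y):
    """Fit y ≈ a * exp(-b * x) by ordinary least squares on log(y)."""
    logy = [math.log(v) for v in y]
    n = len(x)
    mx, my = sum(x) / n, sum(logy) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, logy))
             / sum((xi - mx) ** 2 for xi in x))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic data drawn from y = 5 * exp(-2x) recovers a = 5, b = 2
xs = [0.1, 0.2, 0.3, 0.4]
ys = [5 * math.exp(-2 * xi) for xi in xs]
a, b = fit_exponential_decay(xs, ys)
```

Fitting once per radius would make the claimed insensitivity to the window size directly testable.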
Figure 4: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.
Figure 5: average core number vs. type-token ratio. The figure displays the average core
number versus type-token ratio across our sample of languages.
3.4 Applications of k-core entropy to the analysis of two large linguistic families
To delve deeper into the exponential decay of the average k index as a function of
type-token ratio, we study in detail the word co-occurrence networks for radius 2. As shown in
Fig. 6, there are radical differences between low and high type-token ratio languages regarding
the average k index. A low average k index is observed for Quechuan languages (mean of
3.7; SD = 0.3); by contrast, morphologically simpler languages (like those of the Austronesian family)
display a mean of 5.24 (SD = 1). This fact suggests in principle that the average k index is
lower for families exhibiting higher morphological complexity. In other terms, for such families
displaying a low average k index there is a large proportion of wordforms (based on their high
Figure 6: average core number vs. type-token ratio for two linguistic families. The figure displays
the average core number versus type-token ratio for the Quechuan and Austronesian families.
xxxx
References
[1] Jin Cong and Haitao Liu. Approaching human language with complex networks. Physics of Life Reviews, 2014.
[2] Yuyang Gao, Wei Liang, Yuming Shi, and Qiuling Huang. Comparison of directed and weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and its Applications, 2014.
[3] Ricard V. Solé, Bernat Corominas-Murtra, Sergi Valverde, and Luc Steels. Language networks: Their structure, function, and evolution. Complexity, 2010.
[4] Luís F. Seoane and Ricard Solé. The morphospace of language networks. Scientific Reports, 2018.
[5] Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. Glottolog. Jena: Max Planck Institute for the Science of Human History.
[6] Claude E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 623–656, 1948.
[7] Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. The
entropy of words—learnability and expressivity across more than 1000 languages. Entropy,
19(6), 2017.
[8] Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. In Advances in Neural Information Processing Systems (NIPS 2001). Neural Information Processing Systems Foundation, January 2002.
[9] Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, pages 63–70, USA, 2002. Association for Computational Linguistics.
[10] Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. Exploring network structure, dynamics, and function using NetworkX. In Gaël Varoquaux, Travis Vaught, and Jarrod Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA, 2008.
[12] Christian Bentz, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardžić. A comparison between morphological complexity measures: Typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153, Osaka, Japan, December 2016. The COLING 2016 Organizing Committee.