net/publication/369451528
3 authors, including:
Alicia I. Figueroa-Barra
University of Chile
All content following this page was uploaded by Alicia I. Figueroa-Barra on 23 March 2023.
SUPPLEMENT ARTICLE
C. Palominos et al
shown that usage of definite NPs distinguishes Sz groups from controls, and Sz groups with and without FTD from each other, in Spanish and English,7,8 with related patterns recently found in Turkish.9 Tovar et al.10 studied a rare sample of highly thought-disordered patients with Sz and report that word-level (lexical-conceptual) anomalies were very rare, while referential anomalies were pervasive. Interestingly, determiners have also made a

narrative time might provide an analytical substrate for understanding language dysfunction in psychosis. By letting nodes represent referential NPs rather than words, speech graphs contribute to a linguistic model of psychosis that can illuminate future computational studies and neurocognitive models alike. Our basic hypothesis was that referential anomalies previously noted in Sz would be manifest at the level of topological measures of
N 20 20 20
Gender (%female) 40% 40% 40%
Age 35.7 ± 7.5 18.4 ± 2.8 32.6 ± 11.8 9.62 .000* −5.23 .000* 0.99 .330
Note: PANSS, Positive and Negative Syndrome Scale; SIPS, Structured Interview of Prodromal Syndromes; SOPS, Scale of Prodromal Symptoms; SIPS/SOPS, diagnoses of prodromal syndromes based on SIPS; GAF, Global Assessment of Functioning; Sz, schizophrenia; CHR, clinical high risk.
* Significance level < .05.
Speech Graph Analysis
All clinical interviews were recorded on audio and transcribed by a team from the LEPSI (Language, Psychosis and Intersubjectivity, University of Chile) group. The interviews of the ESECH corpus were already transcribed. All NPs were identified and annotated, as well as their relative position with respect to the consecutive words of the speech. As in the work of Mota et al.,15,16 we adopted a speech graph approach, representing the speech produced by each individual as a graph composed of nodes and directed edges. The nodes were the NPs, which allows us to study the temporal and topological structure of referential meaning. They are connected by edges, which represent the subsequent occurrence of these NPs in the narrative. The number of words between nodes was calculated as a variable of interest (defining distance). Because the assignment of NPs to nodes is surjective, different NPs can be assigned to the same node when the same entity is co-referenced by these NPs. In other words, each node represents an entity present in the discourse, which can be instantiated through different NPs.
To make longer interviews comparable to shorter ones, the interviews were limited to a maximum of 1000 words (fully including the last utterance before reaching 1000 words). Figure 1 exemplifies a graph for a speech fragment. Each node in the graph represents a specific entity. Recurring entities are the man and the woman.

Graph-Theoretical Analysis
A distinction between recurrent and nonrecurrent entities was made, the former being those that appear at least twice during the speech, while nonrecurrent entities are referenced only once. As can be deduced from the graph representation, the total number of nodes coincides with the sum of recurrent and nonrecurrent entities. Our analysis proceeded from basic descriptive characteristics of the graphs (1), to topological measures relating these descriptive measures to narrative time (2), and finally measures of topological distance in all referential chains (ie, sequences of recurrences to the same entity) (3). Descriptive characteristics (1) were the number of nodes, number of NPs (equal to the number of edges plus one), number of recurrent (and nonrecurrent) entities, and the average degree (number of connections through edges with other nodes) of recurrent entities. The average degree of nonrecurrent entities was not calculated, since these can be connected to maximally two nodes, and only connected to one in case they occupy the first or the last position of the graph. The normalized measures (2) involved the density of NPs (number of NPs/number of words), which we broke down into the three elements that make it up, namely the normalized numbers of recurrent entities, of their recurrences, and of nonrecurrent entities. The number of recurrences was measured in two ways, first as the average number of recurrences per entity, then as the total number of recurrences throughout the speech. In group (3), we measured the production of NPs over narrative time. We calculated first the distance between NPs as the number of words between consecutive NPs.
Since we were mainly interested in the position of the NPs corresponding to recurrent entities, we measured the topological distance between recurrent NPs, defined as the number of nodes between a recurrent entity node and the next node referring to the same entity, or equivalently, the number of edges in a directed path that starts at a recurrent entity node and ends at the same
(A) General speech graph representation of a discourse with 22 nodes and 30 edges. Letters represent entities as referenced by NPs (eg,
a man, the street, etc.). (B) A fragment of the same speech graph with 6 nodes and 7 edges: There was [a man][m] on [the street][st]
waiting for [someone][so]. Later, [a woman][w] met [him][m]. Although [they][th] went to [a cafe][ca], [she][w] seemed to be busy.
(C) Four edges in black (and three nodes in between) depicting a referential chain for the entity m. (D, E) Two depictions of referential
chains. After identifying re-appearances of the same entity, we calculate the topological distance as the number of edges in between
them: here there are four edges between a man and him (D), and four (different) edges between a woman and she (E). Nodes [m] and [w]
are divided in (D) and (E) only to visualize topological distances.
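
The fragment in panel (B) can be turned into a small worked example. The sketch below is not the authors' code: the function names are hypothetical, and it uses only the standard library rather than NetworkX. It treats each NP as one step in the narrative, adds one directed edge per pair of consecutive NPs, and measures the topological distance of a referential chain as the number of edges between two successive mentions of the same entity, as in panels (D) and (E). The labels max-max, mean-max, max-mean and mean-mean are read here as "aggregate across entities of the per-entity aggregate", which is an assumption based on the wording of the figure captions.

```python
from collections import defaultdict
from statistics import mean

# Entity label of each NP in the caption's fragment, in order of mention:
# [a man][m] [the street][st] [someone][so] [a woman][w]
# [him][m] [they][th] [a cafe][ca] [she][w]
NP_SEQUENCE = ["m", "st", "so", "w", "m", "th", "ca", "w"]


def build_speech_graph(np_sequence):
    """Return (nodes, edges): one node per entity, one directed edge per consecutive NP pair."""
    nodes = list(dict.fromkeys(np_sequence))  # unique entities, in order of first mention
    edges = list(zip(np_sequence, np_sequence[1:]))
    return nodes, edges


def topological_distances(np_sequence):
    """Map each recurrent entity to the edge counts between its successive mentions."""
    last_position = {}
    distances = defaultdict(list)
    for position, entity in enumerate(np_sequence):
        if entity in last_position:
            distances[entity].append(position - last_position[entity])
        last_position[entity] = position
    return dict(distances)


def distance_summaries(distances):
    """Aggregate per-entity max and mean distances across entities (naming is an assumption)."""
    per_entity_max = [max(d) for d in distances.values()]
    per_entity_mean = [mean(d) for d in distances.values()]
    return {
        "max-max": max(per_entity_max),
        "mean-max": mean(per_entity_max),
        "max-mean": max(per_entity_mean),
        "mean-mean": mean(per_entity_mean),
    }


nodes, edges = build_speech_graph(NP_SEQUENCE)
chains = topological_distances(NP_SEQUENCE)
print(len(nodes), len(edges))  # 6 7
print(chains)                  # {'m': [4], 'w': [4]}
print(distance_summaries(chains))
```

The output agrees with the caption: 6 nodes, 7 edges (8 NPs minus one), and four edges between a man and him as well as between a woman and she.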
node. The final variables were obtained by calculating the maximum and mean values of the topological distances of each entity (see Table 2 for a summary of these measures).

Statistical Analysis
Basic descriptive characteristics were compared using a Mann-Whitney U test. To compare the values of normalized and distance variables between groups, a one-way ANCOVA test was run controlling for possible confounding variables such as age and years of education, followed by Tukey post hoc (simultaneous) pairwise comparisons correcting for family-wise error rate. P-values and the honestly-significant-difference (HSD) statistic are reported, with the significance level of the HSD test set to .05. Only significant results are graphed. In addition, a sensitivity analysis considering different windows of narrative time (as measured in number of words) was run. Windows from 200 to 800 words (with increments of 100) were considered (with window length equalized across groups). For each window length, 20 random samples of that length were selected and we calculated the percentage of these samples in which there were significant results between groups. Details are reported in the Supplementary Materials (Table S2). Next, a logistic regression was run to distinguish between Sz and control groups based on only
Coreference Delays in Psychotic Discourse
three variables of interest: density of recurrent entities, density of nonrecurrent entities, and average number of recurrences by entity, which together determine the density of NPs; after which we added the mean distance between NPs and max-max entities distance as a second step in the regression. Post hoc, a preliminary and exploratory analysis of semantic similarity was added to further elucidate our topological distance results. For this, we used a fastText word embedding from Spanish Unannotated Corpora.23 The embedding used contains 1 313 423 vectors of dimension 300. All the words that matched the embedding were included. In each case, the semantic similarity was calculated as the cosine similarity of the vectors for a moving window of 10 words. Subsequently, the value was averaged across all windows. Results are provided in Supplementary Figure S4. Statistical analysis was performed using Stata and Python (Python 3.9.4), using the SciPy and scikit-learn libraries. The libraries Seaborn and NetworkX were used to generate the graphics.

Results
Results for group comparisons across all variables are summarized in Table 3 and Figure 2. There was a significant difference in the density of NPs, with a higher density in Sz than in both CHR and controls (Figure 2A). This higher density of NPs in Sz could be caused by a higher number of (1) recurrent entities, (2) nonrecurrent entities, or (3) recurrences, or some combination of these three. However, there were no significant group differences in any of these numbers. On the other hand, when considering the total number of recurrences by words, there were significant differences between Sz and controls (Figure 2B). Furthermore, the effect of the higher density of NPs in Sz than controls was also reflected in a significantly smaller average distance between NPs in Sz (Figure 2C). Although no significant differences in the mean distance between NPs were observed between Sz and CHR, the mean distance was less in both Sz (4.7 words) and CHR (5.2 words) relative to controls (6.6 words).
Furthermore, on average, distances between NPs referring to the same entity, counted in terms of the number of nodes in between them, were larger in Sz than in controls (Table 3 and Figure 2D and E). Figure 2D and E show the difference in the mean-max and max-max distance between recurring entities, which indicates a widening of the temporal window between two links of a referential chain when maximal values of these chains are taken. Figure 2F shows the cumulative average distribution of distances between recurrent entities across groups. Interestingly, the control group had a larger proportion of distances less than ten nodes than both Sz and CHR. Thus, on average, 93% of the distances were ten or fewer nodes in controls, while it was 86% and 83% for Sz and CHR, respectively. This indicates that for these last two groups, there is a higher proportion of
Descriptive Topological Measures         Mean (SD)                                       Sz vs. CHR          CHR vs. Control     Sz vs. Control
                                         Sz              CHR             Control         U         P         U         P         U         P
Number of nodes                          84.6 (32.1)     73.1 (21.7)     86.1 (37.4)     158       .188      130       .109      169.5     .385
Number of NPs                            118.1 (38.2)    99.2 (26.9)     112.1 (37.8)    131       .050      133.5     .130      167.5     .363

Normalized Topological Measures          Mean (SD)                                       Sz vs. CHR                 CHR vs. Control            Sz vs. Control
                                         Sz              CHR             Control         Mean Difference  HSD-Test  Mean Difference  HSD-Test  Mean Difference  HSD-Test
Density of NPs                           0.142 (0.037)   0.114 (0.03)    0.114 (0.038)   0.034            4.1747a   0.000            0.0472    0.034            4.1275a
Density of recurrent entities            0.012 (0.005)   0.010 (0.004)   0.009 (0.003)   0.003            2.7048    0.000            0.2432    0.003            2.9480
Density of nonrecurrent entities         0.090 (0.033)   0.075 (0.025)   0.078 (0.039)   0.021            2.7367    0.004            0.46371   0.017            2.2696
Average number of recurrences by entity  4.3 (1.4)       4.2 (2.3)       3.7 (1.0)       0.138            0.3794    0.500            1.3706    0.638            1.7500
Total number of recurrences by words     0.052 (0.021)   0.039 (0.021)   0.036 (0.018)   0.014            2.8864    0.003            0.6721    0.017            3.5585a
Mean distance between NPs                4.7 (0.8)       5.2 (0.8)       6.6 (2.6)       0.584            1.5356    1.399            3.6799a   1.983            5.2155a
Mean-mean entities distance              7.2 (4.1)       6.3 (4.2)       4.8 (4.0)       1.156            1.1239    1.555            1.5110    2.711            2.6349
Max-mean entities distance               28.2 (23.4)     18.9 (16.2)     15.0 (16.6)     10.865           2.3461    3.896            0.8414    14.761           3.1874
Mean-max entities distance               15.4 (9.7)      12.7 (8.7)      8.3 (6.7)       3.336            1.5932    4.454            2.1272    7.790            3.7204a
Max-max entities distance                52.0 (27.0)     35.1 (23.1)     28.7 (24.1)     20.145           3.3095    6.383            1.0486    26.528           4.3581a

Note: NPs, noun phrases; Sz, schizophrenia; CHR, clinical high risk; HSD, honestly-significant-difference.
a HSD-test for significance level < 0.05.
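
The normalized measures in the table can be illustrated on the Figure 1 fragment. The following is a minimal sketch, not the authors' implementation: the function name and dictionary keys are hypothetical, and it assumes that every mention of a recurrent entity (including the first) counts as a recurrence. That reading is an assumption, chosen because it makes the three components add up to the density of NPs, which is roughly what the Sz means in the table show (0.090 + 0.012 × 4.3 ≈ 0.142).

```python
import math
from collections import Counter


def normalized_measures(np_sequence, n_words):
    """Compute NP-density measures for one speech sample.

    np_sequence: entity label of each NP, in order of mention.
    n_words: number of words in the sample (the normalizer).
    Assumption: a 'recurrence' is any mention of an entity mentioned at least twice.
    """
    counts = Counter(np_sequence)
    recurrent = {entity: c for entity, c in counts.items() if c >= 2}
    n_nonrecurrent = sum(1 for c in counts.values() if c == 1)
    total_recurrences = sum(recurrent.values())
    return {
        "density_nps": len(np_sequence) / n_words,
        "density_recurrent": len(recurrent) / n_words,
        "density_nonrecurrent": n_nonrecurrent / n_words,
        "avg_recurrences_by_entity": (
            total_recurrences / len(recurrent) if recurrent else 0.0
        ),
        "recurrences_by_words": total_recurrences / n_words,
    }


# Fragment and entity labels taken from the Figure 1 caption.
fragment = (
    "There was a man on the street waiting for someone. "
    "Later, a woman met him. Although they went to a cafe, "
    "she seemed to be busy."
)
n_words = len(fragment.split())  # whitespace tokenization, a simplification
measures = normalized_measures(["m", "st", "so", "w", "m", "th", "ca", "w"], n_words)

# Under the stated assumption, the components determine the density of NPs:
# nonrecurrent density + recurrent density * average recurrences per entity.
reconstructed = (
    measures["density_nonrecurrent"]
    + measures["density_recurrent"] * measures["avg_recurrences_by_entity"]
)
print(math.isclose(measures["density_nps"], reconstructed))  # True
```

For a single sample the decomposition is exact. For group means it holds only approximately, since the mean of a product is not the product of the means.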
long distances greater than ten nodes. A post hoc sensitivity analysis using different windows of words showed that some significant differences between groups were already found in samples of 300 or more words, but also that some other topological measures required at least 800 words until differences could be observed between groups (for details, see the Supplementary Table S2).
In the logistic regression analysis, the accuracy of classifying Sz (compared to controls) based on three analytical variables (density of recurrent entities, density of nonrecurrent entities, and average number of recurrences by entity) was 63%. In a post hoc analysis, the area under the curve for the receiver operating characteristics curve (ROC) was 76.9. In a second step, adding the mean distance between NPs and max-max entities distance as variables for the regression, the accuracy improved to 84.2% and the ROC was 87.7. This same analysis allowed us to distinguish between CHR and controls with 83.8% accuracy, and between Sz and CHR with 74.4% accuracy. Finally, after applying a 10-fold cross-validation in each comparison, the latter accuracies changed as follows: 80% (Sz and controls), 71.7% (CHR and controls), and 57.5% (Sz and CHR).

Discussion
This is the first study to target the referential structure of meaning in psychotic discourse directly, using graph theory. Results confirmed that speakers with Sz deviate
Fig. 2. (A) Density of NPs (number of NPs by number of words). (B) Total number of recurrences by number of words. (C) Mean
distance between NPs (distance in number of words). (D) Mean of max distances between recurrent entities (mean-max) (distance in
number of nodes). (E) Max of max distances between recurrent entities (max-max) (distance in number of nodes). (F) Cumulative
distribution of distances.
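
The cumulative proportions reported for panel (F), such as the share of distances of ten or fewer nodes, can be sketched as follows. This is an illustration only: the function names are hypothetical, the input distances are made up and are not data from the study, and averaging the per-speaker share across a group is an assumed reading of "cumulative average distribution".

```python
from statistics import mean


def share_within(distances, threshold=10):
    """Fraction of one speaker's topological distances that are <= threshold nodes."""
    if not distances:
        raise ValueError("no distances to summarize")
    return sum(d <= threshold for d in distances) / len(distances)


def group_average_share(per_speaker_distances, threshold=10):
    """Average the per-speaker share across a group (assumed aggregation)."""
    return mean(share_within(d, threshold) for d in per_speaker_distances)


# Hypothetical distances (in nodes) for two made-up speakers.
speakers = [
    [1, 2, 2, 4, 7, 12, 30, 3],  # 6 of 8 distances are <= 10
    [2, 5, 11, 1],               # 3 of 4 distances are <= 10
]
print(group_average_share(speakers))  # 0.75
```

Evaluating `group_average_share` over a range of thresholds gives one point of the cumulative curve per threshold.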
from controls as well as people at CHR in this particular dimension of meaning, confirming previous suggestions,8–10 yet substantiating them at the level of topological measures of distance between coreferential NPs. Results specifically showed that the Sz group produced more NPs over narrative time than both CHR and control groups; and at the same time, there was a widening of the temporal window of coreference between recurring entities in Sz compared to controls. Put differently, not merely the NP type, as shown in previous studies, but also the timing of NPs matters for understanding the language profile of Sz.
As noted, coreference is a necessary condition for coherence and narrative of any kind: maintaining an entity in the computational workspace of language as a discourse is generated is fundamental. What we specifically show here is that, in Sz, entities “linger around” for
longer. As referential chains are established, pairs of links in these chains are separated by the production of other NPs/entities, and mainly by recurrences of other entities, as can be inferred from the higher number of recurrences and higher topological distances in Sz. Even though, by contrast to these recurrences, the density of recurrent and nonrecurrent entities, and the average number of recurrences by entity, did not differ significantly between groups,

entity can be more delayed in Sz, as shown when using NPs as nodes, while, at the same time, the recurrence to a certain word can be closer (in number of nodes that are words), even considering the longest component (LSC). This apparent contradiction is explained by considering that speech graphs are constructed at different scales (using NPs versus words as nodes). It is possible that there is a relationship between both measures and it remains to
Social Sciences), vol 91. Dordrecht, Netherlands: Springer; 2001.
18. First MB, Gibbon M, Spitzer RL, Williams JBW, Benjamin LS. Structured Clinical Interview for DSM-IV Axis II Personality Disorders (SCID-II). Washington, DC: American Psychiatric Press, Inc.; 1997.
19. Kay SR, Opler LA, Fiszbein A. Manual of the Evaluation of Psychiatric Symptoms Used the Positive and Negative Syndrome Scale (PANSS). Tokyo: Seiwa Shoten Co., Ltd.; 1991.
22. San Martín A, Guerrero S. Estudio Sociolingüístico del Español de Chile (ESECH): recogida y estratificación del corpus de Santiago. Boletín de filología 2015;50(1):221–247. doi:10.4067/S0718-93032015000100009.
23. Cañete J. Spanish Word Embeddings [Data set]. Zenodo; 2019. doi:10.5281/zenodo.3255001.
24. NeuralCoref 4.0. Coreference Resolution in spaCy with Neural Networks. GitHub. https://github.com/huggingface/neuralcoref.