Philosophers and cognitive scientists have argued that learning a new word requires sorting through a vast, potentially infinite set of candidate meanings (1–3). For instance, when a child hears the word “ball” in an utterance, how do they learn to associate this word with round, bouncy objects (i.e., the correct visual referents), rather than with other features, objects, or events? Young children are highly adept word learners: At 6 to 9 months, they begin connecting words to their visual counterparts (4). By 18 to 24 months, they can comprehend 300 words on average, mostly nouns (5, 6). How do children get started on word learning? What ingredients (e.g., learning mechanisms and representational commitments) are needed to learn word-referent mappings from early experience?

The nature of these ingredients is the subject of intense interest and debate. One prominent theory is that word learning is driven by simple, domain-general, associative learning mechanisms (7–11), such as tracking the co-occurrences between stimuli in ways nonspecific to language. Alternative theories point to stronger constraints on word learning (e.g., innate knowledge or domain-specific inductive biases; for instance, that different words have different meanings) (12–14), or to other emerging cognitive abilities that actively support word learning (e.g., reasoning and social cognition) (3, 15). Each account is well supported by empirical studies in the lab (3, 9, 13, 14, 16, 17), but acknowledging the evidence for multiple learning mechanisms, often measured across different developmental time points, does not reveal their relative importance. Nor does it provide sufficient guidance for building computational models that, like children, aim to learn outside the lab. If a model could perceive the world through a child’s eyes and ears, would it need strong inductive biases or additional cognitive capacities to get word learning underway? Or would a simpler account driven by associative learning, in conjunction with feature-based representation learning (18), suffice?

In this article, we put the simplest theories of word learning to an unprecedented test: We examine the learnability of word-referent mappings from a single child’s longitudinal head-mounted video recordings. To do so, we introduce the Child’s View for Contrastive Learning model (CVCL, as shown in Fig. 1B), which instantiates a form of associative learning that is cross-situational, tracking the co-occurrences between words and possible visual referents to determine their mappings (10, 19–22). CVCL interprets this idea through recent advances in multimodal (e.g., vision-and-language) machine learning that integrate representation learning and associative learning (23–26), using a contrastive objective that coordinates two neural networks, a vision encoder and a language encoder. Trained in a self-supervised manner (i.e., using only the recordings from the child’s view and no outside labels), the contrastive objective brings together the embeddings (vectors) of video frames and linguistic utterances that temporally co-occur (treating the co-occurrences as positive evidence), while separating those that do not (treating the absence of co-occurrence as implicit negative evidence), as shown in Fig. 1B. Assuming that spoken utterances correlate with observable visual referents, CVCL converts these temporal associations into a smooth learning signal for learning and aligning its multimodal representations. Without strong constraints on word meaning or advance knowledge of possible visual referents, this combination of representation learning and associative learning enables the recovery of many, although not all, of the underlying word-referent mappings from a child’s recorded input.
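To make this objective concrete, the sketch below shows one standard way to write such a contrastive loss over co-occurring frame-utterance pairs (an illustrative PyTorch-style implementation with placeholder names, not CVCL’s released code; the vision and language encoders producing these embeddings are assumed elsewhere):

```python
# Illustrative sketch of a symmetric, InfoNCE-style contrastive objective over
# temporally co-occurring frame-utterance pairs (placeholder names, not the
# authors' released implementation).
import torch
import torch.nn.functional as F

def contrastive_loss(frame_embs, utterance_embs, temperature=0.07):
    """frame_embs, utterance_embs: (batch, dim) tensors; row i of each is a
    temporally co-occurring (positive) frame-utterance pair."""
    f = F.normalize(frame_embs, dim=-1)       # cosine-normalize both modalities
    u = F.normalize(utterance_embs, dim=-1)
    logits = f @ u.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(f.size(0), device=f.device)
    loss_f2u = F.cross_entropy(logits, targets)      # frame -> utterance
    loss_u2f = F.cross_entropy(logits.t(), targets)  # utterance -> frame
    return 0.5 * (loss_f2u + loss_u2f)
```

In this formulation, each diagonal entry of the similarity matrix is a co-occurring pair that is pulled together, while every other pairing in the batch acts as an implicit negative that is pushed apart, mirroring the positive and negative evidence described above.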
We find that CVCL can learn powerful multimodal representations from limited slices of one child’s experience. In the following sections, we show that CVCL is capable of matching a range of everyday words to their corresponding visual referents in categorization tasks, generalizing to highly novel visual exemplars not seen during training, and achieving broad-scale alignment between visual and linguistic conceptual systems. Our results suggest that multimodal representation learning paired with domain-general, associative learning mechanisms provides a computational foundation for breaking into word learning.

Evaluating acquired word-referent mappings

After training was completed, we evaluated CVCL and various alternative models for the quality of their learned word-referent mappings. Adapting a common procedure for testing children (Fig. 1, C and D) (31), models were prompted with a target category label and selected the corresponding visual referent among four candidate images (based on their cosine similarity to the label). Figure 2A shows the results for Labeled-S: an evaluation dataset of frames annotated with 22 visual concepts that were jointly present in this child’s visual and linguistic experience. This dataset was adapted from (32) [see supplementary materials (SM) S5 for additional details]. Overall, CVCL’s classification accuracy was 61.6%. Figure 2D shows the breakdown in performance across the different evaluation categories, where CVCL’s performance for 11 out of the 22 concepts was found to be within 5% of the upper-bound estimate from CLIP (25), a similar image-text contrastive neural network, but one trained on several orders of magnitude more data (400 million image-text pairs from the web). To address any potential issues related to category overlap in the evaluation frames, we also conducted a follow-up evaluation using a manually filtered subset with 15 mutually exclusive categories (see SM S5 and fig. S3).
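A single evaluation trial of this kind can be sketched as follows (an illustrative example with hypothetical function and variable names; the full evaluation protocol is described in the SM):

```python
# Illustrative sketch of one four-alternative trial: the model "chooses" whichever
# candidate image embedding is most cosine-similar to the target label embedding.
import torch.nn.functional as F

def evaluate_trial(label_emb, candidate_image_embs, target_index):
    """label_emb: (dim,) embedding of the target category label.
    candidate_image_embs: (4, dim) embeddings of the four candidate images.
    Returns 1 if the most similar candidate is the target image, else 0."""
    sims = F.cosine_similarity(label_emb.unsqueeze(0), candidate_image_embs, dim=-1)
    return int(sims.argmax().item() == target_index)
```

Averaging this 0/1 outcome over many trials gives the classification accuracies reported in Fig. 2, with chance performance at 25% for four candidates.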
CVCL was compared to alternatives (see SM S2 for details) that aimed to capture meaningful lower and upper bounds on performance (Fig. 2A). To lesion the visual-linguistic co-occurrences, we trained a variant using a training dataset in which the co-occurring frames and utterances were randomly shuffled and instead paired with other frames and utterances from the training set (CVCL-Shuffled), breaking the original co-occurrence links while preserving the information from each independent modality. This model performed at chance (mean, M = 26.7%), showing the critical role of consistent visual and verbal co-occurrence for learning.
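A shuffled-pairing control of this kind can be constructed by permuting which utterance accompanies which frame, as in the minimal sketch below (hypothetical helper, assuming parallel lists of co-occurring frames and utterances):

```python
# Minimal sketch of building a shuffled control set: every frame is re-paired with
# an utterance drawn from elsewhere in the training set, breaking temporal
# co-occurrence while leaving the content of each modality untouched.
import random

def shuffle_pairs(frames, utterances, seed=0):
    """frames, utterances: parallel lists; (frames[i], utterances[i]) co-occurred."""
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)   # plain shuffle; a derangement would guarantee that
                            # no original pairing survives by chance
    return list(zip(frames, shuffled))
```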
To lesion the use of strong visual embeddings (CVCL-Random Features), CVCL’s vision encoder was randomly initialized and frozen during training. Again, performance dropped substantially (M = 38.0%), although a few concepts such as “sand” and “car” were partially acquired (Fig. 2D). We also estimated two upper bounds on performance based on models that use either outside or oracle training data, beyond what a child has access to. Evaluating CLIP (25) out-of-the-box achieved 66.7% accuracy, a 5.1% improvement over CVCL, owing to the relative improvement of a few concepts such as “kitchen,” “toy,” and “basket.” Thus, CVCL’s performance is comparable to a strong web-scale contrastive model when tested within-distribution. Finally, to examine the performance achievable with direct supervision from individual category labels (from the manually annotated Labeled-S evaluation set) rather than child-directed utterances, we trained a Linear Probe model. This Linear Probe was constructed by fitting a linear classifier on top of the frozen pretrained vision encoder (initialized from self-supervision) and achieved 81.6% accuracy based on thousands of within-distribution supervised examples.
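The Linear Probe comparison can be sketched as follows (placeholder module and parameter names; the actual backbone and training details are in SM S2):

```python
# Minimal sketch of a linear probe: a single linear classifier trained with
# category labels on top of a frozen, pretrained vision encoder.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, vision_encoder, emb_dim, num_classes=22):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False          # keep the pretrained encoder frozen
        self.classifier = nn.Linear(emb_dim, num_classes)

    def forward(self, images):
        with torch.no_grad():
            feats = self.vision_encoder(images)  # frozen features
        return self.classifier(feats)            # only this head is trained
```

Because only the linear head receives gradient updates, the probe measures how much category information is already present in the frozen visual features.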
As a follow-up, we aimed to quantify the value of a word occurring in a natural utterance versus in a directly labeled example. As shown in Fig. 2B, we trained additional Linear Probes with fewer labeled examples (using 10 and 1% of the available labeled data), with the number of natural language examples for CVCL and directly labeled examples for the Linear Probes displayed in table S2. Reducing the amount of directly labeled supervision resulted in an expected decrease in classification accuracy to 77.2 and 65.9%, respectively (with per-category performance in fig. S2). Despite the limited number of labeled examples in the 1% Linear Probe, its performance was marginally better than and most comparable to that of CVCL (Fig. 2B). Furthermore, by comparing their relative frequencies, we can conservatively estimate that one directly labeled example is worth at least seven examples from natural language. Nevertheless, natural language supervision has the advantage of more accurately representing what children learn from, and enabling a flexible representational scheme that accommodates an unbounded number of visual concepts.
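As a rough illustration of how such a worth estimate is formed (the counts below are hypothetical placeholders; the actual per-concept counts appear in table S2):

```python
# Hypothetical illustration: if a probe trained on a small number of labeled
# examples roughly matches CVCL, the ratio of example counts bounds how many
# natural-language examples one directly labeled example is "worth."
def labeled_example_worth(n_natural_examples, n_labeled_examples):
    return n_natural_examples / n_labeled_examples

print(labeled_example_worth(700, 100))  # placeholder counts -> 7.0
```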
[Fig. 2, panels A to D (bar charts of classification accuracy). (A) CVCL 61.6%, CVCL (Shuffled) 26.7%, CVCL (Rand. Features) 38.0%, CLIP 66.7%, Linear Probe 81.6%. (B) CVCL 61.6%, Linear Probe (1%) 65.9%, Linear Probe (10%) 77.2%, Linear Probe (100%) 81.6%. (C) CVCL 61.6%, CVCL (LSTM) 58.9%, CVCL (Single Frame) 59.3%, CVCL (Scratch) 58.3%. (D) Per-category accuracy, with examples including sand, crib, car, paper, stairs, and puzzle.]
Fig. 2. Image classification accuracy from Labeled-S evaluation. (A) Performance of CVCL (green) compared to alternative models that represent upper
and lower bounds. The performance of the upper bounds cannot be directly compared to CVCL because they are trained with much more (or cleaner) data.
(B) Performance of CVCL compared to multiple Linear Probes trained with varying levels of direct supervision. (C) Performance of CVCL compared to CVCL
model variants. (D) Performance broken down by target category. In each graph, error bars represent standard error across models trained with three different
random seeds, and the dashed line represents chance accuracy.
[Fig. 3, panel A (two rows of bar charts): per-category classification accuracy across the 64 Konkle object categories, with colored dashed lines marking overall accuracy for CLIP, the Linear Probe, CVCL, CVCL (Shuffled), and CVCL (Rand. Features). Panel B (example stimuli): training utterances include “There’s the bucket,” “You want to pour that in the bucket?,” “You need a spoon,” and “Want a bit more on your spoon?”]
Fig. 3. Zero-shot classification accuracy from Konkle objects evaluation. (A) Performance of CVCL as evaluated using 64 visual categories. Error bars represent standard error across three models trained with different random seeds, and the black dashed line represents chance accuracy. The colored dashed lines represent the overall accuracy across all trials for CVCL, as well as the other upper and lower bounds. CVCL performed significantly above chance, and better than either lower-bound estimate, but still struggled across many categories, whereas both upper bounds were close to ceiling (owing to training on these types of images). (B) In each row, two randomly selected training examples (image-utterance pairs) for four different visual concepts (in bold) are shown, alongside four test examples corresponding (left to right) to the top two, the median, and the worst exemplars. The percent correct below each generalization example refers to performance when this image is the target.
Separately, to examine whether other factors influenced the learnability of word-referent mappings, we also trained and evaluated additional variants of the CVCL model, varying either aspects of the model architecture or the training procedure, although none performed better than CVCL itself (see Fig. 2C and SM S2 for details). Overall, our findings suggest that many of the earliest word-referent mappings can be acquired from as few as 10 to a hundred naturally occurring word-referent pairs.

Generalizing to novel visual exemplars

[…] can explain how infants generalize to novel visual stimuli inside the lab.

The organization of multimodal representations

In this section, we present three families of analyses exploring the structure of the learned multimodal representations in CVCL. First, we examined the extent to which CVCL’s visual and linguistic conceptual systems align. For example, if both the vision and word embeddings for “car” are independently more similar to “road” than “ball,” this would indicate good multimodal alignment (36, 37). Using the 22 concepts from Labeled-S, we computed a visual prototype for each concept […]
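One way to make this alignment analysis concrete is sketched below (placeholder data structures and names, not the authors’ exact analysis code): a visual prototype is the average of a concept’s frame embeddings, and a triplet such as “car”/“road”/“ball” counts as aligned when the visual and linguistic spaces agree about which neighbor is closer.

```python
# Illustrative sketch of a multimodal alignment check: average frame embeddings
# into per-concept visual prototypes, then test whether vision and language
# agree on relative similarities (placeholder inputs, hypothetical helper names).
import torch.nn.functional as F

def visual_prototypes(frame_embs_by_concept):
    """frame_embs_by_concept: dict of concept name -> (n_frames, dim) tensor."""
    return {c: F.normalize(e.mean(dim=0), dim=-1)
            for c, e in frame_embs_by_concept.items()}

def triplet_aligned(word_embs, prototypes, a="car", b="road", c="ball"):
    """True if concept `a` is closer to `b` than to `c` in BOTH embedding spaces."""
    sim = lambda x, y: F.cosine_similarity(x, y, dim=0).item()
    vision_agrees = sim(prototypes[a], prototypes[b]) > sim(prototypes[a], prototypes[c])
    language_agrees = sim(word_embs[a], word_embs[b]) > sim(word_embs[a], word_embs[c])
    return vision_agrees and language_agrees
```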
[…] shown in fig. S6. We find that CVCL learns to represent different sets of visually similar items from a concept as distinct subclusters, despite using a single vector per word. For example, the word embedding for “stairs” most strongly activates two separate clusters representing indoor versus outdoor stairs, whereas “puzzle” produces two other clusters that represent alphabet versus animal puzzles. Previous psychological theories of concept learning often required explicit, built-in mechanisms to capture substructure within concepts (39, 40), but in CVCL, we find that multicluster representations emerge implicitly through contrastive learning.

We also qualitatively examined CVCL’s ability […]
CVCL’s successes have broader implications for theories of word learning. Alternative theories posit reliance on strong inductive biases, specialized language machinery, or other cognitive abilities, in part, because word learning was assumed to be too hard otherwise (2, 3, 13, 14) (with different perspectives focusing on evidence from different developmental ages). CVCL’s focus on learnability with minimal ingredients shows how representation learning and associative, cross-situational mechanisms (9, 19, 22, 48) are sufficient to acquire word-referent mappings from one child’s first-person experiences. Rather than counting co-occurrences between discrete symbolic entities like traditional cross-situational models, CVCL encodes both words and images using distributed vectors (49–52). The contrastive objective leverages the temporal co-occurrence of words and images as an associative learning signal, enabling the incremental acquisition and alignment of multiple word-referent mappings jointly, and sidestepping previous conceptualizations of word learning as a discrete search over a vast number of candidate hypotheses (1, 53). Contrastive learning is also a broadly applicable and domain-general learning strategy (26, 52, 54), allowing CVCL to learn representations informed by both within-domain [e.g., word-to-word (49–51)] and across-domain (e.g., word-to-image) correlations (36).

Although our primary aim was establishing the learnability of word-referent mappings with minimal ingredients, CVCL’s successes do not rule out more sophisticated forms of representation and reasoning, especially ones that might emerge in later development (55). These include mutual exclusivity (13), the principle of contrast (12), the shape bias (56), syntactic cues (57), social or gestural cues (15), or hypothesis generation (58). Each of these additional factors has empirical support and their inclusion may further improve learning, to the extent that they do not already emerge from training. Subsequent research could systematically test for their contributions on top of CVCL, by incorporating these biases into the architecture or training procedure (59, 60).
Nevertheless, our findings suggest that they are not essential for making genuine progress on word learning from one child’s experience.

Future work could aim to bring learning in children and models closer together by incorporating more cognitively plausible assumptions into the model. For example, children learn from temporally extended episodes (61), whereas CVCL learns from independent still frames, likely affecting the learnability of verbs and other abstract words. Second, children are fundamentally active, embodied learners, whereas CVCL must learn passively from recorded visual-linguistic experience. CVCL’s successes are implicitly supported, in part, by the child’s actions, attention, and social engagement […]

References and notes

21. M. C. Frank, N. D. Goodman, J. B. Tenenbaum, Psychol. Sci. 20, 578–585 (2009).
22. A. Fazly, A. Alishahi, S. Stevenson, Cogn. Sci. 34, 1017–1063 (2010).
23. A. Lazaridou, G. Chrupała, R. Fernández, M. Baroni, in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016), pp. 387–392.
24. D. Harwath et al., in Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 649–665.
25. A. Radford et al., Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV] (2021).
26. M. Nikolaus, A. Alishahi, G. Chrupała, Trans. Assoc. Comput. Linguist. 10, 922–936 (2022).
27. J. Sullivan, M. Mei, A. Perfors, E. Wojcik, M. C. Frank, Open Mind (Camb.) 5, 20–29 (2021).
28. We obtain this estimate by approximating the total number of hours during this time span as 1.583 × 365 × 24 = 13,867 hours, and assuming that half of these are waking hours, then the proportion of time that SAYCam-S covers is […]
48. B. McMurray, J. S. Horst, L. K. Samuelson, Psychol. Rev. 119, 831–877 (2012).
49. J. L. Elman, Cogn. Sci. 14, 179–211 (1990).
50. T. K. Landauer, S. T. Dumais, Psychol. Rev. 104, 211–240 (1997).
51. T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781 [cs.CL] (2013).
52. S. Chopra, R. Hadsell, Y. LeCun, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2005), pp. 539–546.
53. S. Pinker, Learnability and Cognition: The Acquisition of Argument Structure (MIT Press, 1989).
54. T. Chen, S. Kornblith, M. Norouzi, G. Hinton, in International Conference on Machine Learning (PMLR, 2020), pp. 1597–1607.
55. S. C. Meylan, E. Bergelson, Annu. Rev. Linguist. 8, 77–99 (2022).
56. B. Landau, L. B. Smith, S. S. Jones, Cogn. Dev. 3, 299–321 (1988).
57. L. Gleitman, Lang. Acquis. 1, 3–55 (1990).
58. J. C. Trueswell, T. N. Medina, A. Hafri, L. R. Gleitman, Cogn. […]