Speak Memory - ChatGPT
Example:
Input: The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
Output: <name>Gerty</name>
Input: My back’s to the window. I expect a stranger, but it’s [MASK] who pushes open the door, flicks on the light. I can’t place that, unless he’s one of them. There was always that possibility.
Output:
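The masking illustrated above can be sketched in code. This is an illustrative reconstruction, not the authors' pipeline: it assumes the token span of the single person name is already known (in the paper it comes from BookNLP's entity tagging), and the tokenization shown is simplified.

```python
# Illustrative reconstruction of name-cloze construction (not the authors'
# pipeline): replace the single person-name span in a passage with [MASK].
# The token span of the name is assumed to be known in advance.
def make_name_cloze(tokens, name_span):
    """Return (cloze_text, answer) for a passage with one name span."""
    start, end = name_span                      # token indices of the name
    answer = " ".join(tokens[start:end])
    masked = tokens[:start] + ["[MASK]"] + tokens[end:]
    return " ".join(masked), answer

tokens = "The door opened , and Gerty , dressed and hatted , entered".split()
cloze, answer = make_name_cloze(tokens, (5, 6))
print(cloze)   # The door opened , and [MASK] , dressed and hatted , entered
print(answer)  # Gerty
```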
2022) and have highlighted the role of both verbatim (e.g., Lee et al., 2022; Ippolito et al., 2022) and subtler forms of memorization (Zhang et al., 2021). Our analysis and experimental findings add to this scholarship but differ significantly in one key respect: our focus is on ChatGPT and GPT-4 — black-box models whose training data and protocols are not fully specified.

Data contamination. A related issue noted upon critical scrutiny of uncurated large corpora is data contamination (e.g., Magar and Schwartz, 2022), raising questions about the zero-shot capabilities of these models (e.g., Blevins and Zettlemoyer, 2022) and worries about security and privacy (Carlini et al., 2021, 2023). For example, Dodge et al. (2021) find that text from NLP evaluation datasets is present in C4, attributing performance gains partly to train-test leakage. Lee et al. (2022) show that C4 contains repeated long running sentences and near duplicates whose removal mitigates data contamination; similarly, deduplication has also been shown to alleviate privacy risks in models trained on large corpora (Kandpal et al., 2022b).

3 Task

We formulate our task as a cloze: given some context, predict a single token that fills in a mask. To account for different texts being more predictable than others, we focus on a hard setting of predicting the identity of a single name in a passage of 40–60 tokens that contains no other named entities. Figure 1 illustrates two such examples.

In the following, we refer to this task as a name cloze, a version of exact memorization (Tirumala et al., 2022). Unlike other cloze tasks that focus on entity prediction for question answering/reading comprehension (Hill et al., 2015; Onishi et al., 2016), no names at all appear in the context to inform the cloze fill. In the absence of information about each particular book, this name should be nearly impossible to predict from the context alone; it requires knowledge not of English, but rather of the work in question. Predicting the most frequent name in the dataset (“Mary”) yields an accuracy of 0.6%, and human performance (on a sample of 100 random passages) is 0%.

We construct this evaluation set by running BookNLP[1] over the dataset described below in §4, extracting all passages between 40 and 60 tokens with a single proper person entity and no other named entities. Each passage contains complete sentences and does not cross sentence boundaries. We randomly sample 100 such passages per book, and exclude from our analyses any books with fewer than 100 such passages.

We pass each passage through the prompt listed in figure 2, which is designed to elicit a single-word, proper-name response wrapped in XML tags; two short input/output examples are provided to illustrate the expected structure of the response.

[1] https://github.com/booknlp/booknlp

4 Data

We evaluate 5 sources of English-language fiction:

• 91 novels from LitBank, published before 1923.
• 90 Pulitzer prize nominees from 1924–2020.
• 95 Bestsellers from the NY Times and Publishers Weekly from 1924–2020.
• 101 novels written by Black authors, either from the Black Book Interactive Project[2] or Black Caucus American Library Association award winners from 1928–2018.
• 95 works of Global Anglophone fiction (outside the U.S. and U.K.) from 1935–2020.
• 99 works of genre fiction, containing science fiction/fantasy, horror, mystery/crime, romance and action/spy novels from 1928–2017.

[2] http://bbip.ku.edu/novel-collections

Pre-1923 LitBank texts are born digital on Project Gutenberg and are in the public domain in the United States; all other sources were created by purchasing physical books, scanning them, and OCR'ing them with Abbyy FineReader. As of the time of writing, books published after 1928 are generally in copyright in the U.S.

5 Results

GPT-4 ChatGPT BERT Date Author Title
0.98 0.82 0.00 1865 Lewis Carroll Alice’s Adventures in Wonderland
0.76 0.43 0.00 1997 J.K. Rowling Harry Potter and the Sorcerer’s Stone
0.74 0.29 0.00 1850 Nathaniel Hawthorne The Scarlet Letter
0.72 0.11 0.00 1892 Arthur Conan Doyle The Adventures of Sherlock Holmes
0.70 0.10 0.00 1815 Jane Austen Emma
0.65 0.19 0.00 1823 Mary W. Shelley Frankenstein
0.62 0.13 0.00 1813 Jane Austen Pride and Prejudice
0.61 0.35 0.00 1884 Mark Twain Adventures of Huckleberry Finn
0.61 0.30 0.00 1853 Herman Melville Bartleby, the Scrivener
0.61 0.08 0.00 1897 Bram Stoker Dracula
0.61 0.18 0.00 1838 Charles Dickens Oliver Twist
0.59 0.13 0.00 1902 Arthur Conan Doyle The Hound of the Baskervilles
0.59 0.22 0.00 1851 Herman Melville Moby Dick; Or, The Whale
0.58 0.35 0.00 1876 Mark Twain The Adventures of Tom Sawyer
0.57 0.30 0.00 1949 George Orwell 1984
0.54 0.10 0.00 1908 L. M. Montgomery Anne of Green Gables
0.51 0.20 0.01 1954 J.R.R. Tolkien The Fellowship of the Ring
0.49 0.16 0.13 2012 E.L. James Fifty Shades of Grey
0.49 0.24 0.01 1911 Frances H. Burnett The Secret Garden
0.49 0.12 0.00 1883 Robert L. Stevenson Treasure Island
0.49 0.16 0.00 1847 Charlotte Brontë Jane Eyre: An Autobiography
0.49 0.22 0.00 1903 Jack London The Call of the Wild

Appendix presents the same for books published after 1928.[3] Of particular interest in this list is the dominance of science fiction and fantasy works, including Harry Potter, 1984, Lord of the Rings, Hunger Games, Hitchhiker’s Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune — 12 of the top 20 most memorized books in copyright fall in this category. Table 2 explores this in more detail by aggregating the performance by the top-level categories described above, including the specific genre for genre fiction.

GPT-4 and ChatGPT are widely knowledgeable about texts in the public domain (included in pre-1923 LitBank); they know little about works of Global Anglophone fiction, works in the Black Book Interactive Project, and Black Caucus American Library Association award winners.
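Scoring the name cloze can be sketched as follows. This is an illustrative sketch, not the authors' released code: it parses the XML-tagged response the prompt requests and scores exact-match accuracy; the responses and gold names below are invented.

```python
import re

# Illustrative sketch (not the authors' released code): pull the prediction
# out of the XML-tagged response the name-cloze prompt requests, then score
# exact-match accuracy against the gold name. Data below are invented.
def parse_name(response):
    """Return the text inside <name>...</name>, or None if absent."""
    match = re.search(r"<name>(.*?)</name>", response)
    return match.group(1).strip() if match else None

def accuracy(responses, gold):
    hits = sum(parse_name(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)

responses = ["<name>Gerty</name>", "<name>Mary</name>", "no idea"]
gold = ["Gerty", "Offred", "Tom"]
print(round(accuracy(responses, gold), 3))  # 0.333
```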
Table 5: Date of first publication performance, with 95% bootstrap confidence intervals around MAE. Spearman
ρ across all data and all differences between top and bottom 10% within a model are significant (p < 0.05).
a valid year prediction at all (or refrains, noting the year is “uncertain”); if it makes a valid prediction, we calculate the prediction error for each book as the absolute difference between the predicted year and the true year (|ŷ − y|). In order to assess any disparities in performance as a function of a model’s knowledge about a book, we focus on the differential between the books in the top 10% and bottom 10% by GPT-4 name cloze accuracy.

As table 5 shows, we see strong disparities in performance as a result of a model’s familiarity with a book. ChatGPT refrains from making a prediction in only 4% of books it knows well, but 9% of books it does not (GPT-4 makes a valid prediction in all cases). When ChatGPT makes a valid prediction, the mean absolute error is much greater for the books it has not memorized (29.8 years) than for books it has (3.3 years); the same holds for GPT-4 as well (14.5 years for non-memorized books; 0.6 years for memorized ones). Over all books with predictions, increasing GPT name cloze accuracy is linked to decreasing error rates at year prediction (Spearman ρ = −0.39 for ChatGPT, −0.37 for GPT-4, p < 0.001).

Predicting the date of first publication is typically cast as a textual inference problem, estimating a date from linguistic features of the text alone (e.g., mentions of airplanes may date a book to after the turn of the 20th century; thou and thee offer evidence of composition before the 18th century, etc.). While it is reasonable to believe that memorized access to an entire book may lead to improved performance when predicting the date from a passage within it, another explanation is simply that GPT models are accessing encyclopedic knowledge about those works — i.e., identifying a passage as Pride and Prejudice and then accessing a learned fact about that work. For tasks that can be expressed as facts about a text (e.g., time of publication, genre, authorship), this presents a significant risk of contamination.

7.2 Predicting narrative time

Our second experiment predicts the amount of time that has passed in the same 250-word passage sampled above, using the conceptual framework of Underwood (2018), explored in the context of GPT-4 in Underwood (2023). Unlike the year prediction task above, which can potentially draw on encyclopedic knowledge that GPT-4 has memorized, this task requires reasoning directly about the information in the passage, as each passage has its own duration. Memorized information about a complete book could inform the prediction about all passages within it (e.g., through awareness of the general time scales present in a work, or the amount of dialogue within it), in addition to having exact knowledge of the passage itself.

As above, our core question asks: do we see a difference in performance between books that GPT-4 has memorized and those that it has not? To assess this, we again identify the top and bottom decile of books by their GPT-4 name cloze accuracy. We manually annotate the amount of time passing in a 250-word passage from that book using the criteria outlined by Underwood (2018), including the examples provided to GPT-4 in Underwood (2023), blinding annotators to the identity of the book and which experimental condition it belongs to. We then pass those same passages through GPT-4 using the prompt in figure 4, adapted from Underwood (2023). We calculate accuracy as the Spearman ρ between the predicted times and true times and calculate a 95% confidence interval over that measure using the bootstrap.

We find a large but not statistically significant difference between TOP10 (ρ = 0.50 [0.25–0.72]) and BOT10 (ρ = 0.27 [−0.02–0.59]), suggesting that GPT-4 may perform better on this task for books it has memorized, but not offering conclusive evidence that this is so. Expanding this same process to the top and bottom quintile sees no effect (TOP20: ρ = 0.47 [0.29–0.63]; BOT20: ρ = 0.46 [0.25–0.64]); if an effect exists, it is only among the most memorized books. This suggests caution; more research is required to assess these disparities.
8 Discussion

We carry out this work in anticipation of the appeal of ChatGPT and GPT-4 for both posing and answering questions in cultural analytics. We find that these models have memorized books, both in the public domain and in copyright, and that the capacity for memorization is tied to a book’s overall popularity on the web. This differential in memorization leads to a differential in performance on downstream tasks, with better performance on popular books than on those not seen on the web. As we consider the use of these models for questions in cultural analytics, we see a number of issues worth considering.

The virtues of open models. This archaeology is required only because ChatGPT and GPT-4 are closed systems, and the underlying data on which they have been trained is unknown. While our work sheds light on the memorization behavior of these models, the true training data remains fundamentally unknowable outside of OpenAI. As others have pointed out with respect to these systems in particular, there are deep ethical and reproducibility concerns about the use of such closed models for scholarly research (Spirling, 2023). A preferred alternative is to embrace the development of open-source, transparent models, including LLaMA (Touvron et al., 2023), OPT (Zhang et al., 2022) and BLOOM (Scao et al., 2022). While open-source models will not alleviate the disparities in performance we see by virtue of their openness (LLaMA is trained on books in the Pile, and all use information from the general web, each of which allocates attention to popular works), they address the issue of data contamination, since the works in the training data will be known; works in any evaluation data could then be queried for membership (Marone and Van Durme, 2023).

Popular texts are likely not good barometers of model performance. A core concern about closed models with unknown training data is test contamination: data in an evaluation benchmark (providing an assessment of measurement validity) may be present in the training data, leading an assessment to be overconfident in a model’s abilities. Our work here has shown that OpenAI models know about books in proportion to their popularity on the web, and that their performance on downstream tasks is tied to that popularity. When benchmarking the performance of these models on a new downstream task, it is risky to draw expectations of generalization from their performance on Alice in Wonderland, Harry Potter, Pride and Prejudice, and so on — they simply know much more about these works than about the long tail of literature (both in the public domain and in copyright).

Whose narratives? At the core of questions surrounding the data used to train large language models is one of representation: whose language, and whose lived experiences mediated through that language, is captured in their knowledge of the world? While previous work has investigated this in terms of which varieties of English pass GPT-inspired content filters (Gururangan et al., 2022), we can ask the same question about the narrative experiences present in books: whose narratives inform the implicit knowledge of these models? Our work here has surfaced not only that ChatGPT and GPT-4 have memorized popular works, but also what kinds of narratives count as “popular” — in particular, works in the public domain published before 1928, along with contemporary science fiction and fantasy. Our work here does not investigate how this potential source of bias might become realized — e.g., how generated narratives are pushed to resemble text the models have been trained on (Lucy and Bamman, 2021), or how the values of the memorized texts are encoded to influence other behaviors — and so any discussion of it can only be speculative, but future work in this space can take the disparities we find in memorization as a starting point for this important work.

9 Conclusion

As research in cultural analytics is poised to be transformed by the affordances of large language models, we carry out this work to uncover the works of fiction that two popular models have memorized, in order to assess the risks of using closed models for research. In our finding that memorization affects performance on downstream tasks in disparate ways, closed data raises uncertainty about the conditions under which a model will perform well and when it will fail; while our work tying memorization to popularity on the web offers a rough guide to assessing the likelihood of memorization, it does not solve that fundamental challenge. Only open models with known training sources would do so. Code and data to support this work can be found at https://github.com/bamman-group/gpt4-books.
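One simple form the membership query envisioned above could take is verbatim n-gram overlap between a candidate passage and a known training corpus. This sketch is illustrative only, with toy strings; it is not the data-portraits method of Marone and Van Durme (2023).

```python
# Illustrative sketch of a membership query against known training data:
# report the fraction of a passage's word n-grams that appear verbatim in
# a training corpus. Toy strings only; not the data-portraits method.
def ngrams(text, n=8):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(passage, corpus, n=8):
    """Fraction of the passage's n-grams found verbatim in the corpus."""
    p = ngrams(passage, n)
    return len(p & ngrams(corpus, n)) / max(1, len(p))

corpus = "it was the best of times it was the worst of times"
print(overlap("it was the best of times", corpus, n=4))  # 1.0
```

A real implementation would index the corpus (e.g., with hashed n-grams or a suffix structure) rather than rebuilding the set per query.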
Acknowledgments

The research reported in this article was supported by funding from the National Science Foundation (IIS-1942591) and the National Endowment for the Humanities (HAA-271654-20).

References

Terra Blevins and Luke Zettlemoyer. 2022. Language contamination helps explains the cross-lingual capabilities of English pretrained models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3563–3574, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Elizabeth F. Evans and Matthew Wilkens. 2018. Nation, ethnicity, and the geography of British fiction, 1880–1940. Cultural Analytics.

Kathleen Fitzpatrick. 2012. The humanities, done digitally. Debates in the Digital Humanities, pages 12–15.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Communications of the ACM, 64(12):86–92.

Lauren M. E. Goodlad and Wai Chee Dimock. 2021. AI and the Human. PMLA, 136(2):317–319.

John Guillory. 1993. Cultural Capital: The Problem of Literary Canon Formation. University of Chicago Press, Chicago.

Suchin Gururangan, Dallas Card, Sarah K. Drier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, and Noah A. Smith. 2022. Whose language counts as high quality? Measuring language ideologies in text data selection. arXiv preprint arXiv:2201.10474.

Gary Gutting. 1989. Michel Foucault's Archaeology of Scientific Reason: Science and the History of Reason. Cambridge University Press.

Sil Hamilton and Andrew Piper. 2022. The COVID that wasn't: Counterfactual journalism using GPT. In Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature.

Daphne Ippolito, Florian Tramèr, Milad Nasr, Chiyuan Zhang, Matthew Jagielski, Katherine Lee, Christopher A. Choquette-Choo, and Nicholas Carlini. 2022. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546.

F. M. Jong, Henning Rode, and Djoerd Hiemstra. 2005. Temporal language models for the disclosure of historical text. In Proceedings of the XVIth International Conference of the Association for History and Computing (AHC 2005).

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2022a. Large language models struggle to learn long-tail knowledge. arXiv preprint arXiv:2211.08411.

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022b. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning, pages 10697–10707. PMLR.

Folgert Karsdorp and Lauren Fonteyn. 2019. Cultural entrenchment of folktales is encoded in language. Palgrave Communications, 5(1):25.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations.

Abhimanu Kumar, Matthew Lease, and Jason Baldridge. 2011. Supervised language modeling for temporal resolution of texts. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 2069–2072.

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland. Association for Computational Linguistics.

Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2021. Question and answer test-train overlap in open-domain question answering datasets. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1000–1008, Online. Association for Computational Linguistics.

Li Lucy and David Bamman. 2021. Gender and representation bias in GPT-3 generated stories. In Proceedings of the Third Workshop on Narrative Understanding, pages 48–55, Virtual. Association for Computational Linguistics.

Inbal Magar and Roy Schwartz. 2022. Data contamination: From memorization to exploitation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 157–165, Dublin, Ireland. Association for Computational Linguistics.

Marc Marone and Benjamin Van Durme. 2023. Data portraits: Recording foundation model training data.

Fatemehsadat Mireshghallah, Archit Uniyal, Tianhao Wang, David Evans, and Taylor Berg-Kirkpatrick. 2022. An empirical analysis of memorization in fine-tuned autoregressive language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1816–1826, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2016. Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235, Austin, Texas. Association for Computational Linguistics.

Andrew Piper, Richard Jean So, and David Bamman. 2021. Narrative theory for computational narrative understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 298–311, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Stephen Ramsay and Geoffrey Rockwell. 2012. Developing things: Notes toward an epistemology of building in the digital humanities. Debates in the Digital Humanities, pages 75–84.

Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Nicolas Ruth, Andreas Niekler, and Manuel Burghardt. 2022. Peeking inside the DH toolbox – Detection and classification of software tools in DH publications. In Proceedings of the Computational Humanities Research Conference 2022, 1613:0073.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Henning Schmidgen, Bernhard Dotzler, and Benno Stein. 2023. From the archive to the computer: Michel Foucault and the digital humanities. Journal of Cultural Analytics, 7(4).

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE.

Arthur Spirling. 2023. Why open-source generative AI models are an ethical way forward for science. Nature, 616(7957):413–413.

Dominik Stammbach, Maria Antoniak, and Elliott Ash. 2022. Heroes, villains, and victims, and GPT-3: Automated extraction of character roles without training data. In Proceedings of the 4th Workshop of Narrative Understanding (WNU2022), pages 47–56, Seattle, United States. Association for Computational Linguistics.

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, Online. Association for Computational Linguistics.

Kushal Tirumala, Aram Markosyan, Luke Zettlemoyer, and Armen Aghajanyan. 2022. Memorization without overfitting: Analyzing the training dynamics of large language models. Advances in Neural Information Processing Systems, 35:38274–38290.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Ted Underwood. 2018. Why literary time is measured in minutes. ELH, 85(2).

Ted Underwood. 2023. Using GPT-4 to measure the passage of time in fiction. https://tedunderwood.com/2023/03/19/using-gpt-4-to-measure-the-passage-of-time-in-fiction/.

Appendix

Author contributions

• Kent Chang: performed extrinsic analysis; annotated narrative time duration data; wrote paper.
• David Bamman: conducted experiments; annotated narrative time duration data; wrote paper.
Example:
Input: He felt his face. ‘I miss sneering at something I love. How we used to love to gather in the checker-tiled kitchen in front
of the old boxy cathode-ray Sony whose reception was sensitive to airplanes and sneer at the commercial vapidity of broadcast
stuff.’
Output: <year>1996</year>
Input: The multiplicity of its appeals—the perpetual surprise of its contrasts and resemblances! All these tricks and turns of
the show were upon him with a spring as he descended the Casino steps and paused on the pavement at its doors. He had not
been abroad for seven years—and what changes the renewed contact produced! If the central depths were untouched, hardly
a pin-point of surface remained the same. And this was the very place to bring out the completeness of the renewal. The
sublimities, the perpetuities, might have left him as he was: but this tent pitched for a day’s revelry spread a roof of oblivion
between himself and his fixed sky.
Output: <year>1905</year>
Input: He did not speak to Jack or G.H., nor they to him. He lit a cigarette with shaking fingers and watched the spinning billiard
balls roll and gleam and clack over the green stretch of cloth, dropping into holes after bounding to and fro from the rubber
cushions. He felt impelled to say something to ease the swelling in his chest. Hurriedly, he flicked his cigarette into a spittoon
and, with twin eddies of blue smoke jutting from his black nostrils, shouted hoarsely, “Jack, I betcha two bits you can’t make it!”
Jack did not answer; the ball shot straight across the table and vanished into a side pocket. “You would’ve lost,” Jack said. “Too
late now,” Bigger said. “You wouldn’t bet, so you lost.” He spoke without looking. His entire body hungered for keen sensation,
something exciting and violent to relieve the tautness. It was now ten minutes to three and Gus had not come. If Gus stayed away
much longer, it would be too late. And Gus knew that. If they were going to do anything, it certainly ought to be done before
folks started coming into the streets to buy their food for supper, and while the cop was down at the other end of the block.
Output:
Table 6: Top 50 books in copyright (published after 1928) by GPT-4 name cloze accuracy.
Read the following passage of fiction. Then do five things.
1: Briefly summarize the passage.
2: Reason step by step to decide how much time is described in the passage. If the passage doesn’t include any explicit reference
to time, you can guess how much time the events described would have taken. Even description can imply the passage of time by
mentioning the earlier history of people or buildings. But characters’ references to the past or future in spoken dialogue should
not count as time that passed in the scene. Report the time using units of years, weeks, days, hours, or minutes. Do not say zero
or N/A.
3: If you described a range of possible times in step 2 take the midpoint of the range. Then multiply to convert the units into
minutes.
4: Report only the number of minutes elapsed, which should match the number in step 3. Do not reply N/A.
5: Given the amount of speculation required in step 2, describe your certainty about the estimate–either high, moderate, or low.
Input:
TWENTY-FIVE It was raining again the next morning, a slanting gray rain like a swung curtain of crystal beads. I got up feeling
sluggish and tired and stood looking out of the windows, with a dark harsh taste of Sternwoods still in my mouth. I was as empty
of life as a scarecrow’s pockets. I went out to the kitchenette and drank two cups of black coffee. You can have a hangover
from other things than alcohol. I had one from women. Women made me sick. I shaved and showered and dressed and got my
raincoat out and went downstairs and looked out of the front door. Across the street, a hundred feet up, a gray Plymouth sedan
was parked. It was the same one that had tried to trail me around the day before, the same one that I had asked Eddie Mars about.
There might be a cop in it, if a cop had that much time on his hands and wanted to waste it following me around. Or it might be a
smoothie in the detective business trying to get a noseful of somebody else’s case in order to chisel a way into it. Or it might be
the Bishop of Bermuda disapproving of my night life.
Output:
1: A detective wakes up ’the next morning,’ looks out a window for an undefined time, drinks (and presumably needs to make)
two cups of coffee, then shaves and showers and gets dressed before stepping out his front door and seeing a car.
2: Making coffee, showering, and getting dressed take at least an hour. There’s some ambiguity about whether to count the
implicit reference to yesterday (since this is ’the next morning’) as time elapsed in the passage, but let’s say no, since yesterday
is not actually described. So, an hour to 90 minutes.
3: 1.25 hours have elapsed. Multiplying by 60 minutes an hour that’s 75 minutes.
4: 75 minutes.
5: Low confidence, because of ambiguity about a reference to the previous day.
Input:
CHAPTER I "TOM!" No answer. "TOM!" No answer. "What’s gone with that boy, I wonder? You TOM!" No answer. The old
lady pulled her spectacles down and looked over them about the room; then she put them up and looked out under them. She
seldom or never looked through them for so small a thing as a boy; they were her state pair, the pride of her heart, and were built
for "style," not service–she could have seen through a pair of stove-lids just as well. She looked perplexed for a moment, and
then said, not fiercely, but still loud enough for the furniture to hear: "Well, I lay if I get hold of you I’ll–" She did not finish,
for by this time she was bending down and punching under the bed with the broom, and so she needed breath to punctuate the
punches with. She resurrected nothing but the cat. "I never did see the beat of that boy!" She went to the open door and stood in
it and looked out among the tomato vines and "jimpson" weeds that constituted the garden. No Tom. So she lifted up her voice at
an angle calculated for distance and shouted: "Y-o-u-u TOM!" There was a slight noise behind her and she turned just in time to
seize a small boy by the slack of his roundabout and arrest his flight.
Output:
1: An old lady calls for a boy named Tom, checks for him under the bed, goes to the open door and calls for him — then finally
catches him.
2: The lady’s actions are described minutely and seem rapid; they probably took two to four minutes. The lady also alludes to
Tom’s past bad behavior, but references to the past in dialogue should not count as time passing in the scene.
3: Three minutes have elapsed.
4: 3 minutes.
5: High confidence.
Input: [text]
Output:
Figure 4: Sample prompt for narrative time prediction, from Underwood (2023).
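Steps 2–3 of the prompt above reduce to a small calculation: take the midpoint of the reported time range and convert its unit to minutes. A minimal sketch, with the unit table as our assumption for illustration:

```python
# Sketch of steps 2-3 of the narrative-time prompt: midpoint of the
# reported range, converted to minutes. Unit table is an assumption.
MINUTES_PER = {"minutes": 1, "hours": 60, "days": 24 * 60,
               "weeks": 7 * 24 * 60, "years": 365 * 24 * 60}

def to_minutes(low, high, unit):
    """Midpoint of [low, high] in `unit`, converted to minutes."""
    return (low + high) / 2 * MINUTES_PER[unit]

# "an hour to 90 minutes" -> midpoint 1.25 hours -> 75 minutes,
# matching the first worked example above.
print(to_minutes(1, 1.5, "hours"))  # 75.0
```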