
History Unclassified

Peering down the Memory Hole: Censorship, Digitization, and the Fragility of Our Knowledge Base

Downloaded from https://academic.oup.com/ahr/article/124/2/550/5426383 by Fondation Nationale Des Sciences Politiques user on 16 January 2022
GLENN D. TIFFERT

THE DIGITAL DEEPFAKES, bot farms, and troll factories assaulting our public sphere have
thrown an overdue light on the radical changes sweeping through our ecosystem of
knowledge.1 New technologies and platforms are making the manipulation of our infor-
mation space feasible at a scale and with an ease that would scarcely have been imagin-
able a generation ago. In particular, the crude artisanal and industrial forms of publica-
tion and censorship familiar to us from centuries past are yielding to an individuated,
dynamic model of information control powered by adaptive algorithms that operate in
ways even their creators struggle to understand.2 These algorithms curate every facet of
our online lives, recursively intermediating our realities according to evolving internal
logics that we cannot see. Lately they have even expanded into authorship by indepen-
dently synthesizing for public consumption new content from archival collections.3 As
their performance improves, “the idea that there’s one article for everyone is going to
quickly change to the one article for me,” and the practice of history, to say nothing of
other empirical disciplines, may never be the same.4
The Lieberthal-Rogel Center for Chinese Studies at the University of Michigan and the Hoover Institution
generously supported this study. Mary Gallagher, Steven Abney, Fred Gibbs, and Kerby Shedden offered
valuable feedback. Fu Liangyu helped to locate and acquire materials. Luo Fusheng, Margaret Orton, Ar-
den Shapiro, Yan Wei, and Charlotte Yin provided essential research assistance.
1
Samantha Bradshaw and Philip N. Howard, “Challenging Truth and Trust: A Global Inventory of Or-
ganized Social Media Manipulation,” Working Paper 2018.1, Project on Computational Propaganda, Ox-
ford Internet Institute, 2018, https://comprop.oii.ox.ac.uk/research/cybertroops2018/; Zeynep Tufekci,
“The Road from Tahrir to Trump,” MIT Technology Review 121, no. 5 (2018): 10–17.
2
Robert Darnton, Censors at Work: How States Shaped Literature (New York, 2014); David Weinber-
ger, “Our Machines Now Have Knowledge We’ll Never Understand,” Wired, April 18, 2017, https://
www.wired.com/story/our-machines-now-have-knowledge-well-never-understand/.
3
Chris Merriman, “BBC 4.1 Joins the AI Revolution with Two Nights of Neural Network Generated Clips,”
The Inquirer, September 4, 2018, https://www.theinquirer.net/inquirer/news/3061268/bbc-41-joins-the-ai-revolu
tion-with-two-nights-of-neural-network-generated-clips; “Xinhua Publishes 1st MGC Video on Two Sessions,”
Xinhua News Agency, March 2, 2018, https://www.youtube.com/watch?v=IE8JzO7eyPQ; Ahmed Elgammal,
Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone, “CAN: Creative Adversarial Networks, Generating
‘Art’ by Learning about Styles and Deviating from Style Norms,” arXiv, June 21, 2017, https://arxiv.org/abs/
1706.07068 [cs.AI]; Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever,
“Language Models Are Unsupervised Multitask Learners,” OpenAI, February 14, 2019, https://d4mucfpksywv.
cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
4
Sean Gourley, founder of the machine intelligence company Primer, quoted in Kelsey Ables, “What
Happens When China’s State-Run Media Embraces AI?,” Columbia Journalism Review, June 21, 2018,
https://www.cjr.org/analysis/china-xinhua-news-ai.php.
© The Author(s) 2019. Published by Oxford University Press on behalf of the American Historical
Association. All rights reserved. For permissions, please e-mail journals.permissions@oup.com.

We can peer into that future with the aid of a case study from the People’s Republic
of China (PRC), where online platforms comparable to JSTOR are rewriting the histori-
cal record by stealthily redacting their holdings. Using a combination of qualitative and
computational methods, I analyze a sample of this censorship, reverse-engineer its logic,
and consider where it may take us. My findings expose the leading edge of a coming
storm, and because the practices I identify are easily emulated and refined, no corner of
the knowledge economy lies beyond their reach. As digitization advances around
the globe, the opportunities and temptation to exploit the vulnerabilities described here
will multiply, with potentially devastating consequences not just for the reliability of
our source base and the knowledge we derive from it, but also for the civic life these
public goods sustain. Acquainting ourselves with the threat is essential to preempting
that outcome.
Most analyses of the digital turn suffer from a common blind spot: they generally
presume that the custodians of our digital collections are neutral third parties who have
no reason to alter or allow others to alter the records in their care.5 That trust is unwise.
In a growing number of countries, the digital domain resembles less a marketplace of
ideas than an arena for combat, and the PRC, to take one example, welcomes this fight
and weaponizes our credulity.6
True to its Leninist roots, the Chinese Communist Party (CCP) describes the online
realm as a “battlefield” on which a tightly disciplined political struggle must be waged
and won.7 Intent on seizing the initiative, it exploits the openness of democratic socie-
ties to project its influence abroad, while vigilantly policing its own walled garden with
5
“Authenticity Task Force Report,” in The Long-Term Preservation of Authentic Electronic Records:
Findings of the InterPARES Project (2002), http://www.interpares.org/book/interpares_book_d_part1.pdf,
21; Roy Rosenzweig, “Scarcity or Abundance? Preserving the Past in a Digital Era,” American Historical
Review 108, no. 3 (June 2003): 735–762; Tim Hitchcock, “Confronting the Digital: Or How Academic
History Writing Lost the Plot,” Cultural and Social History 10, no. 1 (2013): 9–23; Ludmilla Jordanova,
“Historical Vision in a Digital Age,” Cultural and Social History 11, no. 3 (2014): 343–348; Lara Putnam,
“The Transnational and the Text-Searchable: Digitized Sources and the Shadows They Cast,” American
Historical Review 121, no. 2 (April 2016): 377–402; Abby Smith Rumsey, When We Are No More: How
Digital Memory Is Shaping Our Future (New York, 2016); Trevor Owens, The Theory and Craft of Digi-
tal Preservation (Baltimore, 2018).
6
Bradshaw and Howard, “Challenging Truth and Trust”; United States Department of Justice, Report
of the Attorney General’s Cyber Digital Task Force, July 2, 2018, https://www.justice.gov/ag/page/file/
1076696/download; Justin Clark, Robert Faris, Ryan Morrison-Westphal, Helmi Noman, Casey Tilton,
and Jonathan Zittrain, “The Shifting Landscape of Global Internet Censorship,” Berkman Klein Center for
Internet & Society Research Publication, June 2017, http://nrs.harvard.edu/urn-3:HUL.InstRe
pos:33084425; Adrian Shahbaz, “Freedom on the Net: The Rise of Digital Authoritarianism,” Freedom
House, October 2018, https://freedomhouse.org/report/freedom-net/freedom-net-2018, 29; Paul M. Barrett,
Tara Wadhwa, and Dorothée Baumann-Pauly, Combating Russian Disinformation: The Case for Stepping
Up the Fight Online, NYU Stern Center for Business and Human Rights, July 2018, https://issuu.com/
nyusterncenterforbusinessandhumanri/docs/nyu_stern_cbhr_combating_russian_di.
7
“Zhangwo yulun zhanchang zhudongquan: Xi Jinping yaoqiu sanweidu dazao meiti xin qijian” 掌握
舆论战场主动权: 习近平要求三维度打造媒体新旗舰 [Grasp the Initiative on the Public Opinion Bat-
tlefield: Xi Jinping Demands the Three-Dimensional Forging of a New Flagship for Media], China Central
Television (CCTV), February 18, 2017, http://news.cctv.com/2017/02/18/ARTIrsoCDdYTIbTLWNRW
p2ii170218.shtml; “Zai quanguo xuanchuan sixiang gongzuo huiyi shang de jianghua” 在全国宣传思想
工作会议上的讲话 [Speech at the National Propaganda Thought Work Conference], in Zhonggong zhon-
gyang wenxian yanjiushi 中共中央文献研究室 [CCP Central Committee Documents Research Office],
ed., Xi Jinping guanyu quanmian shenhua gaige lunshu zhaibian 习近平关于全面深化改革论述摘编
[Extracts of Xi Jinping on Comprehensively Deepening Reform] (Beijing, 2014), 83; Chai Yifei 柴逸扉,
“Xi Jinping de xinwen yulun guan” 习近平的新闻舆论观 [Xi Jinping’s Views on News and Public Opin-
ion], Zhongguo gongchandang xinwen wang 中国共产党新闻网 [CCP News Network], February 25,
2016, http://theory.people.com.cn/n1/2016/0225/c40531-28148369.html.

AMERICAN HISTORICAL REVIEW APRIL 2019


a pervasive system of networked authoritarianism that showcases how illiberal regimes
the world over can turn the technologies of the information age to their advantage.8
History is a revealing case in point. Orwell famously observed, “Who controls the
past controls the future; who controls the present controls the past.”9 Mindful of that,
the CCP vigorously suppresses domestic “attempts to distort or smear socialism with
Chinese characteristics, Party history, the history of the PRC, the history of the people’s
armed forces, Party leaders, and acclaimed heroes and role models.”10 And lately it has
begun to export this censorship regime beyond its borders, leveraging its economic and
technological resources to co-opt foreigners, sometimes without their knowledge or
consent, in an audacious campaign to sanitize the historical record and globalize its own
competing narratives.11
The CCP’s timing is impeccable. Economic and technological disruptions to our
ecosystem of knowledge are eroding our capacity to detect, much less combat, this in-
formation war, and nowhere is that more apparent than in our libraries. Motivated by
thrift and efficiency, many academic libraries are deaccessioning volumes and outsourc-
ing growing parts of their collections to online platforms, trusting these platforms to
supply full replacement value and to guarantee the integrity of their products. In 2014
alone, the University of California, Santa Cruz eliminated nearly 60 percent of the
books from its Science and Engineering Library, and the University of California,
Berkeley’s Haas School of Business transferred virtually its entire print collection to
storage.12 Nearly all of the more than 100,000 scholarly journals that Berkeley’s librar-
ies receive each year arrive digitally, and have for more than a decade.13
Much can go wrong with this bargain, especially since many of the publishers and
platforms that now aggregate and deliver our knowledge are market-driven ventures
subject to commercial pressures.14 They may adhere to different values, priorities, and
8
Shahbaz, “Freedom on the Net,” 6–10; Rebecca MacKinnon, “China’s ‘Networked Authoritarian-
ism,’” Journal of Democracy 22, no. 2 (2011): 32–46; Wen-Hsuan Tsai, “How ‘Networked Authoritarian-
ism’ Was Operationalized in China: Methods and Procedures of Public Opinion Control,” Journal of
Contemporary China 25, no. 101 (2016): 731–744; Paul Mozur, “With Cameras and A.I., China Closes
Its Grip,” New York Times, July 8, 2018, A1; Jack Goldsmith and Stuart Russell, “Strengths Become Vul-
nerabilities: How a Digital World Disadvantages the United States in Its International Relations,” Aegis
Series Paper No. 1806, 2018, https://www.hoover.org/sites/default/files/research/docs/381100534-
strengths-become-vulnerabilities.pdf; Chris C. Demchak and Yuval Shavitt, “China’s Maxim—Leave No
Access Point Unexploited: The Hidden Story of China Telecom’s BGP Hijacking,” Military Cyber Affairs
3, no. 1 (2018): 1–9; Margaret E. Roberts, Censored: Distraction and Diversion inside China’s Great
Firewall (Princeton, N.J., 2018); Louisa Lim and Julia Bergin, “Inside China’s Audacious Global Propa-
ganda Campaign,” The Guardian, December 7, 2018, https://www.theguardian.com/news/2018/dec/07/
china-plan-for-global-media-dominance-propaganda-xi-jinping.
9
George Orwell, 1984 (1949; repr., New York, 1961), 248.
10
“Guanyu xin xingshi xia dangnei zhengzhi shenghuo de ruogan zhunze” 关于新形势下党内政治生
活的若干准则 [Certain Norms for Intra-Party Political Life under the New Circumstances], Xinhua she
新华社 [Xinhua News Agency], November 2, 2016, http://www.xinhuanet.com/politics/2016-11/02/c_
1119838382_2.htm; “China Is Struggling to Keep Control over Its Version of the Past,” The Economist
421, no. 9013 (2016): 37–38; Yan Lianke, “On China’s State-Sponsored Amnesia,” New York Times,
April 1, 2013, https://www.nytimes.com/2013/04/02/opinion/on-chinas-state-sponsored-amnesia.html.
11
Christopher Walker, “What Is ‘Sharp Power’?,” Journal of Democracy 29, no. 3 (2018): 9–23.
12
Teresa Watanabe, “Universities Redesign Libraries for the 21st Century: Fewer Books, More Space,”
Los Angeles Times, April 19, 2017, http://www.latimes.com/local/lanow/la-me-college-libraries-20170419-
story.html.
13
Matthew Quinlan, “Five Questions for UC Berkeley Librarian Jeffrey Mackie-Mason,” California
Magazine, Summer 2017, http://alumni.berkeley.edu/california-magazine/summer-2017-adaptation/five-
questions-uc-berkeley-librarian-jeffrey-mackie-mason.
14
Maria Bustillos, “Erasing History,” Columbia Journalism Review 57, no. 1 (2018): 112–118.

standards of stewardship than traditional libraries, and they may be accountable to dif-
ferent constituencies, such as shareholders.15 Powerful interest groups or the threat of
litigation can influence their decisions. Bankruptcies, corporate restructurings, licensing
disputes, and state action can snuff out their online collections without warning.16 And
things can go spectacularly wrong when they confront the demands of a mercurial cen-
sorship regime and the authoritarian government behind it.17
Not long ago, it might have seemed preposterous to suggest that some of our most
respected academic publishers and technology firms would be complicit in state censor-
ship. But times have changed.18 In the summer of 2017, acting on a request from its Chi-
nese importer, Cambridge University Press (CUP) quietly removed 315 articles and book
reviews from the online edition of the respected British academic journal The China
Quarterly, without consulting the journal’s editors or the affected authors. For subscribers
in China, the items simply disappeared, though they remained accessible elsewhere. After
exposure led to negative publicity, CUP ultimately reversed itself and rebuffed a concur-
rent request to censor approximately 100 articles from the online edition of the Journal of
Asian Studies, the flagship journal of the U.S.-based Association for Asian Studies. By
contrast, Springer Nature, which bills itself as the largest academic publisher in the world,
capitulated to Chinese requests, effectively arguing that its censorship of over 1,000 of its
own publications was a cost of doing business. Enticed by the Chinese market, Google,
Apple, and Facebook have proffered similar concessions.19
Traditional post-publication censorship is notoriously toilsome and inefficient. Tear-
ing out passages or seizing and destroying entire volumes demands physical control of
the relevant texts, and copies often slip through the net to bear witness. Digitization,
however, mitigates these deficiencies. It encodes knowledge not in tangible objects dis-
persed redundantly among libraries and collectors, but in effortlessly mutable bitstreams
delivered from distant servers along a centralized distribution chain. As the CUP and
Springer episodes illustrate, the providers who control these servers can silently alter
our knowledge base at its source without ever leaving their back offices, making one al-
teration after another, each with the potential to propagate instantaneously around the
globe. They can apply these alterations as broadly or as narrowly as they wish, forking
the sources under their control into myriad editions, each defined by the idiosyncratic
circumstances of a given audience. They have proven by their example that it matters
15
Kate Klonick, “The New Governors: The People, Rules, and Processes Governing Online Speech,”
Harvard Law Review 131 (2018): 1598–1670.
16
Bustillos, “Erasing History,” 117–118.
17
“At Beijing Book Fair, Publishers Admit to Self-Censorship to Keep Texts on Chinese Market,”
South China Morning Post, August 24, 2017, https://www.scmp.com/news/china/policies-politics/article/
2108095/beijing-book-fair-publishers-admit-self-censorship-keep; Jacqueline Williams, “A Book on Chi-
nese Sway in Australia Hits a Nerve,” New York Times, November 20, 2017, A8.
18
Association for Asian Studies, “Update on Chinese Censorship of Academic Publications,” Novem-
ber 7, 2017, http://www.asian-studies.org/asia-now/entryid/103/update-on-chinese-censorship-of-aca
demic-publications; Elizabeth Redden, “An Unacceptable Breach of Trust,” Inside Higher Ed, October 3,
2018, https://www.insidehighered.com/news/2018/10/03/book-publishers-part-ways-springer-nature-over-
concerns-about-censorship-china; Ben Bland, “China Censorship Drive Splits Leading Academic Publish-
ers,” Financial Times, November 6, 2017, 4; Nicholas Loubere and Ivan Franceschini, “How the Chinese
Censors Highlight Fundamental Flaws in Academic Publishing,” Chinoiresie, October 16, 2018, http://
www.chinoiresie.info/how-chinese-censors-highlight-fundamental-flaws-in-academic-publishing/.
19
Farhad Manjoo, “Apple’s Silence in China Sets Dangerous Precedent,” New York Times, July 31,
2017, B1; Paul Mozur, “China Exerts Digital Control beyond Its Borders,” New York Times, March 2,
2018, A1.

not when an item was originally published, since they can digitally modify or wipe it at
any later point in time, leaving no trace of their handiwork behind.
For censors, the possibilities are mouthwatering. Digital platforms offer them dy-
namic, fine-grained mastery over memory and identity, and in the case of China, they
are capitalizing on this to engineer a pliable version of the past that can be tuned algo-
rithmically to always serve the CCP’s present. Dazzled by the abundance of sources on
these platforms, we have failed to grasp these Potemkin-like possibilities, much less
their historiographical implications. Let us attend to that now.

POLITICAL-LEGAL RESEARCH 政法研究 and Law Science 法学 were the two dominant aca-
demic law journals published in the PRC during the 1950s. The original print editions
of these publications document the construction of China’s post-1949 socialist legal sys-
tem and the often savage debates that seized it. Few libraries outside the PRC possess
complete original print runs of these journals, and with the advent of convenient digital
editions, those that do have typically relegated the fragile paper volumes to off-site stor-
age. For most users, online access is now the norm.
Unfortunately, the online editions of Political-Legal Research and Law Science
have been redacted in ways that materially distort the historical record but are invisible
to the end user. The consequences are as unsettling as they are deliberate: the more
faithful scholars are to this adulterated source base and the sanitized reality it projects,
the more they may unwittingly promote the agendas of the censors.
Consider the issues originally published in the PRC from 1956 through 1958, which
chronicle how budding debates over matters such as judicial independence, the tran-
scendence of law over politics and class, the presumption of innocence, and the herita-
bility of law abruptly gave way to vituperative denunciations of those ideas and their
sympathizers. Currently, only two online platforms—China National Knowledge Infra-
structure 中国知网 and the National Social Sciences Database 国家哲学社会科学学
术期刊数据库—offer full-text coverage of these issues, and their holdings are identi-
cal, down to their silent omission of exactly the same sixty-three articles, a coincidence
that suggests a common blacklist.
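The comparison behind this observation can be pictured as a set difference between a print table of contents and each platform's holdings. What follows is a minimal sketch under stated assumptions, not the procedure used in this study: the `missing_from` helper, the identifier scheme, and the toy data are all hypothetical.

```python
def missing_from(print_toc, platform_holdings):
    """Return the print-edition items absent from a platform, in print order."""
    held = set(platform_holdings)
    return [item for item in print_toc if item not in held]


# Hypothetical identifiers of the form "journal/year.issue/first-page".
print_toc = ["LS/1957.4/1", "LS/1957.4/7", "LS/1957.4/12", "LS/1957.4/20"]
platform_a = ["LS/1957.4/12", "LS/1957.4/20"]
platform_b = ["LS/1957.4/20", "LS/1957.4/12"]

gaps_a = missing_from(print_toc, platform_a)
gaps_b = missing_from(print_toc, platform_b)

# Identical gaps on two separate platforms are what would point to a shared blacklist.
same_gaps = gaps_a == gaps_b
print(gaps_a, same_gaps)
```

A check of this kind only works when a trustworthy print record is available to compare against, which is why the original paper editions matter.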
The temporal distribution of the omissions is striking. (See Figure 1.) They start

FIGURE 1: Pages censored by publication date, Political-Legal Research and Law Science (1956–1958). [Axes: Pages, 0%–100%, by Year.Month, 1956.2–1958.12; series: Uncensored, Censored.]

FIGURE 2: Articles missing from the online edition of the August 1957 issue of Law Science.

abruptly in the summer of 1957, when a wave of political persecutions known as the
Anti-Rightist Campaign began, then crest over the next few months before generally ta-
pering off as the campaign wound down.20 For the three years in question, more than
8 percent of the articles and 11 percent of the total page count have been erased from
the online editions of these journals. Notably, the gaps are concentrated at the tops of
their tables of contents, which means that the censors are today suppressing the articles
these journals once proudly led with. For instance, the online edition of the October
1957 issue of Political-Legal Research omits seven out of eleven main articles, reduc-
ing a fifty-seven-page issue to twenty-three pages. Likewise, the online edition of the
August 1957 issue of Law Science omits the first nine articles, reducing a seventy-two-
page issue to forty-two pages. The table of contents belonging to that issue appears here
in Figure 2, scanned from an original paper edition. The articles missing online are
marked with arrows, and their translated titles appear alongside.
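The scale of the two cuts just described is easier to compare as a share of each issue. The arithmetic below uses only the page counts reported above; the `share_removed` helper is a hypothetical name introduced for illustration.

```python
def share_removed(original_pages, remaining_pages):
    """Fraction of an issue's pages that no longer appear online."""
    return (original_pages - remaining_pages) / original_pages


# Page counts as reported in the text.
october_1957_plr = share_removed(57, 23)  # Political-Legal Research
august_1957_ls = share_removed(72, 42)    # Law Science

print(f"{october_1957_plr:.1%}", f"{august_1957_ls:.1%}")
```

On these figures, roughly 60 percent of the October 1957 issue of Political-Legal Research and roughly 42 percent of the August 1957 issue of Law Science are missing online.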
As one might expect, the search engines on both platforms are blind to the missing con-
tent, returning only sanitized results and leaving the end user none the wiser. Similarly, the
online tables of contents for affected issues display unbroken lists of articles with no place-
20
The small spike in October 1958 corresponds to the conclusion of the Fourth National Judicial Work
Conference, which formalized the purge of many of the cadres who had led the PRC judicial system dur-
ing the preceding decade, and laid down a corrective, strongly leftist ideological line.

FIGURE 3: Cosine similarity of combined corpora with (un)censored facets.

holders or notations for the omissions. On one site, the only hint would be unexplained
gaps in the page number sequence of the articles, a detail whose significance is easy to
miss. The other site omits even this clue by forgoing page numbers altogether.
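Where page numbers are displayed, the one surviving clue can be checked mechanically. The sketch below is illustrative only: the `find_page_gaps` helper and the page ranges are hypothetical, chosen to be consistent with a seventy-two-page issue whose first thirty pages are absent.

```python
def find_page_gaps(articles, first_page=1):
    """Given (start_page, end_page) pairs for the articles listed online,
    return the page ranges that no listed article accounts for."""
    gaps = []
    expected = first_page
    for start, end in sorted(articles):
        if start > expected:
            gaps.append((expected, start - 1))
        # Articles may share a page, so never move the cursor backwards.
        expected = max(expected, end + 1)
    return gaps


# Hypothetical online listing for one issue.
online_listing = [(31, 38), (39, 45), (46, 60), (61, 72)]
print(find_page_gaps(online_listing))
```

A gap is a prompt for inspection rather than proof of censorship, since front matter, advertisements, and illustrations can also leave pages unaccounted for.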
There is little doubt that the omissions are content-driven. Computational tools illus-
trate this clearly. Figure 3 plots the articles from both journals. The spatial arrangement
of the markers denotes the relative proximity of the texts to one another based on their
discursive (cosine) similarity. Each dot represents a document present in the online
corpora, and each triangle represents a document missing from them. If the criteria for
the omission of an article were essentially random, one would expect to see both
markers distributed similarly across the document space. If, on the other hand, the crite-
ria were deterministic, one would expect instead to see structure in the data, manifested
as clustering or separation between the markers. The actual results are unmistakable.
The uncensored facet is evenly distributed, while the censored facet is not. This finding
strongly suggests that the omission of texts is not random, but rather involves a discrimi-
nating logic, though we must go deeper to determine what that is.
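For readers unfamiliar with the measure, cosine similarity compares two documents by the angle between their term-count vectors: 1 means identical proportions of terms, 0 means no shared terms. The sketch below is a minimal illustration, not the pipeline behind Figure 3. It assumes raw term counts and pre-tokenized input (Chinese text would first need word segmentation), and it does not attempt the two-dimensional layout of the figure. The function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, pre-tokenized toy documents.
doc_1 = ["rightist", "element", "attack", "party"]
doc_2 = ["rightist", "element", "attack", "law"]
doc_3 = ["contract", "law", "inheritance", "property"]

sim_12 = cosine_similarity(doc_1, doc_2)
sim_13 = cosine_similarity(doc_1, doc_3)
print(sim_12, sim_13)
```

Documents that share a polemical vocabulary score close to one another, which is the kind of structure that would make a censored cluster visible in a plot of pairwise similarities.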
I have read all 737 articles in these corpora closely, and I have devoted years of
study to the domain they describe. But I am also mindful of my human subjectivity, es-


Political-Legal Research
Rightist element: 7.52, p=.006
Wang Han: 6.84, p=.009
Rightist: 6.09, p=.014
Wang Jixin: 5.33, p=.021
Lu Mingjian: 5.24, p=.022
Campaign to Eliminate Counterrevolutionaries: 5.23, p=.022
To weigh the evidence: 4.79, p=.029
Deduce: 4.58, p=.032
Qian Duansheng: 4.27, p=.039
Jia Qian: 4.19, p=.041
Wu Chuanyi: 3.95, p=.047

Law Science
Yang Zhaolong: 45.53, p=1.50E-11
Wang Zaoshi: 10.11, p=.001
Rule of law: 6.35, p=.012
Kuomintang: 5.84, p=.016
Campaign to Eliminate Counterrevolutionaries: 5.62, p=.018
Rightist element: 5.05, p=.025
Roscoe Pound: 4.95, p=.026
News: 4.90, p=.027
Schools & Departments: 4.84, p=.028
Scientific: 4.82, p=.028
Law: 4.76, p=.029
Rule of man: 4.73, p=.030
Luo Jiaheng: 4.64, p=.031
China Democratic League: 4.48, p=.034
Rightist: 4.48, p=.034
Legislation: 4.34, p=.037
Fascist: 4.04, p=.045
(Nationalist) Six Codes: 3.87, p=.049
FIGURE 4: χ2 feature selection (χ2 score, p-value, α=.05).

pecially my susceptibility to ordinary cognitive bias, and the possibility that salient
details and relationships among their nearly four million characters of text might elude
me. A procedure called χ2 (chi-squared) feature selection mitigates these limitations by
measuring the strength of the dependence between the appearance of a term in a text
and the membership of that text in the censored class. The higher the score, the stronger
the correlation. The results are highly informative and allow me, broadly speaking, to
reverse-engineer the logic behind the censorship.
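The logic of the test can be sketched with a two-by-two table that counts, for a single term, how many censored and uncensored articles do and do not contain it. This is a minimal illustration under stated assumptions, not the implementation used in this study: it scores presence or absence rather than term frequency, and the function names and toy corpus are hypothetical. With one degree of freedom, the scores reported in Figure 4 are consistent with its p-values (a score of 7.52 corresponds to roughly .006).

```python
import math


def chi2_presence(docs, term):
    """Chi-squared score for term presence versus censorship status.

    docs is a list of (set_of_terms, is_censored) pairs.
    """
    a = b = c = d = 0
    for terms, censored in docs:
        present = term in terms
        if censored and present:
            a += 1  # censored, term present
        elif censored:
            b += 1  # censored, term absent
        elif present:
            c += 1  # uncensored, term present
        else:
            d += 1  # uncensored, term absent
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # the term or the class never varies, so there is nothing to test
    return n * (a * d - b * c) ** 2 / denom


def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical toy corpus: 10 censored and 90 uncensored documents.
docs = (
    [({"rightist"}, True)] * 8
    + [(set(), True)] * 2
    + [({"rightist"}, False)] * 10
    + [(set(), False)] * 80
)
score = chi2_presence(docs, "rightist")
print(round(score, 2), p_value_1df(score) < 0.05)
```

The statistic measures the strength of an association, not its direction, so a high score still has to be checked against the counts to confirm that the term is overrepresented in the censored class rather than the uncensored one.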
Figure 4 lists the terms (features) most closely correlated with the articles censored
from the online editions of each journal, the relative strength of those correlations, and
their respective degrees of statistical significance. For the initiated, these feature lists
read like keyword tags from the ferocious ideological debates that seized the Chinese le-
gal system in the Anti-Rightist period. Notably, every name that appears on them, save
one, identifies a prominent figure who was flagrantly persecuted as a personification of
heterodoxy and an example to others. The sole exception is Roscoe Pound, the former


FIGURE 5: Yang Zhaolong and Roscoe Pound (1947). Historical & Special Collections, Harvard Law School Library.

dean of Harvard Law School, who appears on the Law Science list by virtue of his past
association with the top feature, Yang Zhaolong.21 (See Figure 5.)
21
When Pound served as a legal advisor to the Nationalist Ministry of Justice from 1946 to 1948,
Yang was frequently his interpreter and translator, and the two collaborated closely on field surveys of the
Chinese judiciary and on plans for the reform of Chinese legal education. Ai Yongming 艾永明 and Lu
Jinbi 陆锦璧, eds., Yang Zhaolong faxue wenji 杨兆龙法学文集 [Collected Legal Writings of Yang Zhao-
long] (Beijing, 2005), 467–558.


FIGURE 6: “Rightist element,” Political-Legal Research (1956–1958). [Axes: tf-idf score, 0–0.3, by Year.Issue, 1956.1–1958.6; series: Uncensored, Censored.]

To get a taste of how censors are purposefully rewriting the history of this period,
let us briefly examine the term most closely correlated with censorship in each journal.
Figures 6 and 7 plot the weight of these terms over time, arranged in the sequential or-
der of the articles in which they appear, and color-coded according to the censorship sta-
tus of those articles.22 This casts a spotlight on exactly which arguments the censors are
selecting for and against, and the discursive effect that sorting has.
The term most highly correlated with censorship in Political-Legal Research is
“rightist element.” As one might expect, anyone labeled a “rightist element” was singled
out for persecution during the Anti-Rightist Campaign. It turns out that censors have cut
the weight of “rightist element” in my three-year sample of this journal by 41 percent,
which at the very least warps our sense of how the term was actually used, whom it described,
how usage may have changed over time, and why.
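The weights in question are tf-idf scores, which rise with a term's frequency in a document and fall as the term becomes common across the corpus. The sketch below shows one common variant (relative term frequency multiplied by the natural log of inverse document frequency) and one plausible way to express how much of a term's total weight sits in censored documents. Both choices are assumptions made for illustration, since the exact weighting scheme and the calculation behind the 41 percent figure are not specified here; the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in corpus:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def weight_removed(weights, censored_flags, term):
    """Share of a term's summed tf-idf weight that falls in censored documents."""
    total = sum(w.get(term, 0.0) for w in weights)
    removed = sum(
        w.get(term, 0.0) for w, censored in zip(weights, censored_flags) if censored
    )
    return removed / total if total else 0.0


# Hypothetical, pre-tokenized toy corpus; only the first document is censored.
corpus = [
    ["rightist", "element", "law"],
    ["rightist", "court", "law"],
    ["contract", "court", "law"],
    ["contract", "property", "law"],
]
flags = [True, False, False, False]

weights = tf_idf(corpus)
print(weight_removed(weights, flags, "rightist"))
```

In this toy corpus, a term that appears in every document ("law") receives a weight of zero, while removing one of the two documents that mention "rightist" removes half of that term's total weight.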
The warping impact of the censors is still more pronounced for the top term from
Law Science: Yang Zhaolong. Yang (1904–1979) was one of the most internationally
respected Chinese jurists of his generation. A graduate of Harvard Law School (SJD
’35), he held a variety of high academic and governmental positions in the Nationalist
era (1927–1949), including chief of the Ministry of Justice’s Criminal Division, where
he directed Chinese participation in the Tokyo war crimes trials. In 1949, underground
CCP operatives persuaded him to remain on the mainland to serve the incoming com-
munist regime, though his background soon foreclosed that possibility.
During a brief political thaw from 1956 to early 1957, Yang was invited to join
Fudan University’s law faculty and the editorial board of Law Science. In those months,
he contributed articles to the journal that cogently refuted CCP orthodoxy on the class
nature and heritability of law, and on cause and effect in criminal law.23 He also drew
22
Weight refers to a term’s tf-idf score, a standard statistical measurement of the importance of a term
in a given document and corpus.
23
Yang Zhaolong 杨兆龙, “Falü de jiejixing he jichengxing” 法律的阶级性和继承性 [On the Class
Nature and Heritability of Law], Huadong zhengfa xuebao 华东政法学报 [East China Journal of Politics
and Law] 3 (1956): 26–34; Yang Zhaolong 杨兆龙, “Xingfa kexue zhong yinguo guanxi de jige wenti”
刑法科学中因果的几个问题 [Several Problems in the Relationship between Cause and Effect in the Sci-
ence of Criminal Law], Faxue 法学 [Law Science] 1 (1957): 61–63.


FIGURE 7: “Yang Zhaolong,” Law Science (1956–1958). [Axes: tf-idf score, 0–0.7, by Year.Issue, 1956.1–1958.9; series: Uncensored, Censored.]

attention in public forums to the slow pace of codification in the PRC, the low quality
of CCP legal personnel, and official discrimination against highly trained non-party
experts like himself.24 For his prestige and frankness, Yang paid a heavy price. When
the Anti-Rightist Campaign struck Shanghai, a wave of calumny enveloped him.
The censors have slashed Yang’s footprint in my sample of this journal by 83 percent,
mostly by excising his critiques of the legal system’s practical defects and the searing
rebuttals he endured. (See Figure 7.) In fact, those rebuttals account for the largest
cohort of articles censored from Law Science, which is why his score on the χ2 feature
selection test eclipses all other terms in my corpora by a wide margin. Evidently, the
censors wish us to come away with the breathtaking misconception that Yang had little
that was controversial to say.25 But back in the real world, his ideas made him the top
target of the Anti-Rightist Campaign in Shanghai’s legal community, and ultimately led
to twelve years of imprisonment as a rightist and counterrevolutionary.
Like Yang, the other individuals who appear on these feature lists promoted values
24
“Peiyang xinsheng liliang, hai you bushao wenti” 培养新生力量, 还有不少问题 [Training Up a New
Force, There Are Still Many Problems], Xinmin wanbao 新民晚报 [New People’s Evening Post], May 4, 1957,
1–2; “Shanghai zhishijie tan guanche baijia zhengming wenti” 上海知识界谈贯彻百家争鸣问题 [Shanghai’s
Intellectual Community Discusses the Problem of Implementing “Let One Hundred Flowers Bloom, Let One
Hundred Schools Contend”], Guangming ribao 光明日报 [Enlightenment Daily], May 1, 1957, 2; “Sifa gong-
zuo ‘qiang’ gao ‘gou’ shen, Minmeng shiwei zhaokai sifa zuotanhui pangtingji” 司法工作墙高沟深, 民盟市
委召开司法座谈会旁听记 [The “Walls” Are High and the “Chasm” Deep in Judicial Work: Notes from a Fo-
rum on the Administration of Justice Convened by the Municipal Committee of the China Democratic League],
Xinmin bao 新民报 [New People’s Evening Post], May 19, 1957, 1; Yang Zhaolong 杨兆龙, “Falüjie dang yu
feidang zhi jian” 法律界党与非党之间 [The Split between Party and Non-Party in the Legal Community],
Wenhui bao 文汇报 [Wenhui Daily], May 8, 1957, 2; Yang Zhaolong 杨兆龙, “Woguo zhongyao fadian heyi
chichi hai bu banbu?” 我国重要法典何以迟迟还不颁布? [Why after So Long Have China’s Key Legal
Codes Not Been Promulgated?], Xinwen ribao 新闻日报 [Daily News], May 9, 1957, 2–3; Yang Zhaolong,
“Wo tan jidian yijian” 我谈几点意见 [I Discuss Several Points], Xinwen ribao 新闻日报 [Daily News], June
6, 1957, 3.
25
The record indicates otherwise. Mei Naihan 梅耐寒, “Guanyu ‘fa de jiejixing he jichengxing’ de tao-
lun: Jieshao Shanghai faxuehui dierci xueshu zuotanhui” 关于‘法的阶级性和继承性’的讨论: 介绍上海
法学会第二次学术座谈会 [The Discussion on “The Class Nature and Heritability of Law”: Introducing
the Shanghai Law Society’s Second Academic Forum], Faxue 法学 [Law Science] 2 (1957): 28–30;
Zhang Jinghua 张景华, “Guanyu falü jichengxing zhong de jige wenti” 关于法律继承性中的几个问题
[Several Questions Concerning the Heritability of Law], Faxue 法学 [Law Science] 5 (1957): 18–21.


People’s Judicature (1957–1958): 402, 180
Teaching and Research (1956–1958): 100, 444
[Legend: Censored, Uncensored. Horizontal axis: share of articles, 0%–100%.]
FIGURE 8: Coverage of two other major journals (article count).

associated with the rule of law and greater separation between party and state. Today,
the record of their arguments and the persecutions they endured sticks like a thorn in the
side of a regime that has since not only written the rule of law into its constitution, but
also turned history on its head by presenting the current slogan “Socialist Rule of Law
with Chinese Characteristics” as the culmination of a proud originalist vision.26
It falls to the censors to discreetly resolve such conflicts between past and present,
and they have been busy. In fact, with the CCP intent on publicizing its “China solution”
中国方案 as a proven alternative to liberal democracy, no topic or historical period is
safe from their touch. Other publications, such as People’s Judicature 人民司法(工作),
the official organ of the courts, and Teaching and Research 教学与研究, a leading social
science journal, are missing not just discrete articles but also entire issues. (See Figure 8.)
Additionally, President Xi Jinping’s 2001 doctoral dissertation has vanished from relevant
databases, as has recent scholarship on the systems of secret informants that currently per-
meate Chinese schools and workplaces.
Censorship even encumbers the online archives of the CCP’s official newspaper, the
People’s Daily. Searching for sensitive terms that have appeared in the print edition of
this newspaper can sever a user’s connection and lock out access for several minutes at
a time. Trickier to spot, identical queries can produce different results, depending on
whether the vendors supplying access to the archive host their servers in China or out-
side of it. Similarly, many Chinese state archives are transitioning to digital document
delivery, which allows them to screen requests granularly using nothing more than
metadata and a patron’s profile. They can monitor the files a patron receives and the
specific pages a patron takes an interest in. Archival documents formerly accessible in
paper format are frequently absent from these digital facsimiles.
The custodians of such digital collections, and the Western publishers who make
common cause with them, are plainly not neutral third parties, which serves as a cau-
tionary tale for us all. By stealthily omitting certain topics, voices, and opinions, they
are concealing basic facts and distorting what the discourse on a given subject actually
26
Glenn Tiffert, “Socialist Rule of Law with Chinese Characteristics: A New Genealogy,” in Fu Hua-
ling, John Gillespie, Pip Nicholson, and William Edmund Partlett, eds., Socialist Law in Socialist East
Asia (New York, 2018), 72–96.


looked like, where the weight of opinion on it may have been, and how that might have
changed over time. They not only are complicit in the intentional misrepresentation of
history, but are also contaminating research based on their holdings and violating the
trust of their users. By tendentiously distorting consciousness of China’s past, they are
prejudicing its possible futures.

WE MUST NOT SUPPOSE that these dangers are peculiar to China.27 Rather, they are emblem-
atic of our deepening digital dependence and the redistribution of power it entails. As
knowledge transcends traditional fixed media, it is slipping from our grasp and ever more
under the control of those able to enclose, harness, and commodify it anew. In myriad
ways, this puts us at the mercy of their better angels—reliant on their stewardship, and
more vulnerable than ever to the political, regulatory, commercial, and licensing terms
that may impinge upon it. Seduced by the digital dream, we hasten our submission when
we purge from our shelves the physical evidence necessary to independently monitor the
performance of these new providers and hold them to account.
Unexpectedly, intellectual property law intensifies our disadvantage. Political-Legal
Research and Law Science will be under copyright in the United States until ninety-five
years after their original date of publication, or the early 2050s at the earliest, which pre-
cludes republishing the censored content without the consent of its Chinese rights hold-
ers.28 This means that simply by digitally consolidating sources on servers under its
control, a savvy government or other interested party can adulterate the historical record
with unprecedented ease, not just at home but also universally, the better to achieve in-
formation dominance and shape the global public opinion battlefield. Alternatively, by
flexing its market power or by arranging for proxies to take ownership stakes in content
providers, it could secure similar results. Either way, the conditions for agitation, propa-
ganda, and disinformation to flourish could scarcely be more favorable. We are tempt-
ing fate by passively waiting for those audacious or secure enough to exploit them. The
KGB had a name for such deeds: active measures.29
27
Matthew Connelly, “State Secrecy, Archival Negligence, and the End of History as We Know It,”
Knight First Amendment Institute at Columbia University, September 2018, https://knightcolumbia.org/
content/state-secrecy-archival-negligence-and-end-history-we-know-it; National Archives and Records Ad-
ministration, 2018–2022 Strategic Plan, February 2018, https://www.archives.gov/files/about/plans-
reports/strategic-plan/2018/strategic-plan-2018-2022.pdf; Jennifer Schuessler, “Obama’s Libraryless Li-
brary,” New York Times, February 21, 2019, C1; Bob Clark, “In Defense of Presidential Libraries: Why
the Failure to Build an Obama Library Is Bad for Democracy,” The Public Historian 40, no. 2 (2018):
96–103; Meredith R. Evans, “Presidential Libraries Going Digital,” ibid., 116–121.
28
17 U.S.C. §104A, §108(h). However, there can be a limited right for libraries, archives, and muse-
ums to reproduce a work during the last twenty years of its copyright term. Tyler T. Ochoa, “Copyright
Protection for Works of Foreign Origin,” in Jan Klabbers and Mortimer Sellers, eds., The Internationaliza-
tion of Law and Legal Education (London, 2008), 167–190; Elizabeth Townsend Gard, “Creating a Last
Twenty (L20) Collection: Implementing Section 108(h) in Libraries, Archives and Museums,” SSRN, Oc-
tober 2, 2017, revised December 3, 2017, https://dx.doi.org/10.2139/ssrn.3049158.
29
T. S. Allen and A. J. Moore, “Victory without Casualties: Russia’s Information Operations,” Parame-
ters 48, no. 1 (2018): 59–71; Thomas Boghardt, “Operation INFEKTION: Soviet Bloc Intelligence and
Its AIDS Disinformation Campaign,” Studies in Intelligence 53, no. 4 (2009): 1–24; David King, The
Commissar Vanishes: The Falsification of Photographs and Art in Stalin’s Russia, new ed. (London,
2014); United States Department of State, Active Measures: A Report on the Substance and Process of
Anti-U.S. Disinformation and Propaganda Campaigns, Department of State Publication 9630, August
1986, chap. 5.


I RECKON THAT THE ARTICLES excised from the online editions of Political-Legal Research
and Law Science were personally selected by specialists schooled in the pertinent his-
tory and its bearing on current controversies. Their choices evince discernment and
care. Nevertheless, technology may soon relieve them of this burden. The computa-
tional techniques I have employed to analyze these corpora are double-edged weapons;
they can be used to automate and enhance the work of the censors, too.
In anticipation of this development and as a proof of concept, I built a predictive
text-classification model that uses machine learning to analyze and censor my corpora.
In mere minutes, my model can independently reproduce the choices made by the actual
human censors with an average accuracy of 95 percent. It can also process colossal vol-
umes of text far more easily than unassisted human censors possibly could.30 With ac-
cess to more training data, such as the contents of an online article platform comprising
perhaps billions of characters, I could undoubtedly improve its accuracy, and the impli-
cations are staggering.
My results demonstrate that effective automated manipulation of the past lies within
our grasp. By freely modulating the nearly 600,000 parameters in my model, a censor
could concoct bespoke versions of the historical record on demand, each exquisitely
tuned to the shifting ideological or political requirements of the present, much like a re-
cording engineer amplifies, attenuates, adds, or removes sound by manipulating the con-
trols on a mixing console to achieve the perfect mix. One could furthermore devolve
this task to an artificial intelligence able to roam the breadth of our archives, endlessly
reconstructing them according to preprogrammed templates that can adapt in real time
to the prevailing winds and learn from the behaviors of countless users recorded every
minute of every day. Human reviewers need only weigh in at the margins.
This technology swaddles us now. In China, it powers the most sophisticated regime
of online surveillance and censorship on the planet.31 In the United States, it helps social
media firms assemble our newsfeeds and suppress computational agitprop, copyright
infringements, and odious content.32 In Europe, it facilitates compliance with the “right
to erasure” of personal data codified in the 2016 General Data Protection Regulation
(GDPR).33 Curating history is merely an additional use case in a bulging portfolio of
applications that has firms around the world scrambling for competitive advantage and
market dominance.34 Of course, every advance they make also accrues to the censors. It
30
For a discussion of tools and methods, see the appendix.
31
Zhongguo xintongyuan 中国信通院 [China Academy of Information and Communications Technolo-
gy], “Rengong zhineng anquan baipishu” 人工智能安全白皮书 [White Paper on Artificial Intelligence
Security], September 2018, http://www.caict.ac.cn/kxyj/qwfb/bps/201809/P020180918473525332978.pdf.
32
John Herrman, “Online Platforms Annexed Much of Our Public Sphere, Playacting as Little Democ-
racies—Until Extremists Made Them Reveal Their True Nature,” New York Times, August 21, 2017,
MM18; Mark Zuckerberg, “A Blueprint for Content Governance and Enforcement,” Facebook, November
15, 2018, https://www.facebook.com/notes/mark-zuckerberg/a-blueprint-for-content-governance-and-en
forcement/10156443129621634/.
33
“Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the
Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement
of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation),” Official Journal
of the European Union 59, no. L119 (2016): 1–87, here 43–44.
34
Shoshana Zuboff, “The Secrets of Surveillance Capitalism.” Frankfurter Allgemeine, March 5, 2016,
http://www.faz.net/aktuell/feuilleton/debatten/the-digital-debate/shoshana-zuboff-secrets-of-surveillance-ca
pitalism-14103616.html; Ryan Gallagher, “Google CEO Tells Senators That Censored Chinese Search En-
gine Could Provide ‘Broad Benefits,’” The Intercept, October 12, 2018, https://theintercept.com/2018/10/
12/google-search-engine-china-censorship/.


is a very short hop indeed from the technologies that compose our digital lives to the
nightmare of Orwell’s memory hole, where reality is continuously reinvented by the
powerful at will.

Left unchecked, this next-generation paradigm of information control will spread
and may well drag the discipline of history into the quicksand of post-truth polemics,
where the heroic efforts of individual scholars cannot save us. It will triumph merely by
sowing doubt. After all, who can say for sure that a body of sources has not been compromised?
And even if scholars are generally conscious of tampering with the digital record
as an environmental factor conditioning research, they may not be able to discern the
specific distortions it introduces into their work. Generational change and disruptions to
our knowledge ecosystem are eroding our mastery of the analog backstops.
Identifying and debunking discrete instances of these known unknowns is not a scal-
able solution. It would require compiling parallel trusted baseline corpora for compari-
son, and advanced training in data science and diplomatics (the discipline devoted to
the critical analysis of historical documents). It would necessitate ongoing vigilance as
well, because digital censorship can be a moving target. Chasing that target would di-
vert scarce resources from research and writing, set scholars against one another, sub-
vert the truth value of their claims, and erode the public’s estimation of their compe-
tence—all to the advantage of those in the shadows. More importantly, we have no
reliable mechanism for ensuring that the effort expended in achieving small victories
would translate into anything larger. If the platforms make no amends, then others
would still wander innocently into the same old traps. This is the predicament Chinese
studies confronts today, and it will come to other fields tomorrow.
Historians cannot overcome this singly, and obscurity is no defense. Instead, we
should learn from the lost opportunities to forestall the crises now afflicting social media
and election security. If we are to prevent the practices described here from proliferat-
ing, then we must mobilize to confront them structurally. Otherwise, growing unease
about the integrity of the knowledge we consume and produce will metastasize and fur-
ther sap the trust necessary for robust scholarship and democratic practice.
An institutional subscription to an online knowledge platform can cost tens of thou-
sands of dollars or more annually, and subscribers, as consumers, must insist that they
receive what they are paying for. Demanding that providers make unredacted collec-
tions available on alternate servers beyond the reach of interested censors is an impor-
tant first step. Knowledge creators, archivists, learned societies, rights holders, and con-
tent providers must also design and implement a set of industry-wide best practices to
uphold the integrity of our digital collections, transparently disclose omissions and
modifications, and defend against tampering at the levels of the individual character,
document, and corpus. Such standards must apply not only to the digitization of legacy
analog sources (which are, after all, not eternal), but also to those “born” digital, and it
is imperative that commercial providers in particular adopt them.
In short, we have passed the point where we can naïvely trust; if we truly value the
integrity of the sources on which we depend, and in turn our professional credibility,
then we must now also verify. A variety of solutions, such as digital signatures, block-
chain certification, and ISO 16363 certification, with logos signifying validated stan-
dards compliance, are potentially available to meet those needs. We must engineer such
technical safeguards into the foundations of our burgeoning digital knowledge infrastructure.
However, technical solutions by themselves are not enough. We must also
back them up with mutually reinforcing collective statements of principle, supple-
mented as appropriate by fiduciary obligations stipulated in private contract and public
law.35 The menace is real and already among us. Never before has knowledge been
prone to such sweeping and supple manipulation. Our understanding of ourselves, and
our future, hang in the balance.

Appendix: Tools and Methods


To the best of my knowledge, no libraries in the United States possess complete original
print editions of Political-Legal Research, People’s Judicature, or Teaching and Re-
search, and fewer than a handful possess complete original editions of Law Science. To
assemble the corpora analyzed in this article, I therefore drew on fragmentary holdings
from multiple institutions, supplemented by my personal reference collection and
acquisitions from China.
To ensure commensurability and keep the scope of the project manageable, I de-
cided to concentrate my analysis on articles published in Political-Legal Research and
Law Science from 1956 through the end of 1958, the only period when the two journals
overlapped. Fortunately, this permits one to juxtapose two historically significant
moments: the Hundred Flowers Campaign and the Anti-Rightist backlash that abruptly
followed, when the CCP’s brief solicitation of popular feedback switched to searing ret-
ribution.
Political-Legal Research published sixty-two issues between 1954 and 1966, when
the Cultural Revolution forced its closure. Sponsored by the Chinese Association for
Politics and Law in Beijing, it counted many of the highest legal officials in the central
government among its patrons, and its coverage generally favored their statist priorities.
By contrast, Law Science enjoyed a much shorter life, publishing just eighteen issues,
all between 1956 and 1958, when the Anti-Rightist Campaign forced its closure.36 Law
Science was sponsored by the East China Institute of Politics and Law in Shanghai, one
of a handful of regional academies established in the early 1950s by the PRC state to
train a new generation of socialist cadres for administrative and legal positions.
Preparing these journals for analysis required several labor-intensive steps, the most
basic of which involved converting their printed pages into text files with a high degree
of accuracy.37 To maximize fidelity, I used the best-preserved original print editions I
could find, not reprints or reproductions. Every page of every issue in my three-year
sample, more than 2,000 in total, was scanned at 600 dpi in grayscale without any lossy
compression, resulting in nearly 20 gigabytes of data. Second, I used a commercial opti-
cal character recognition (OCR) package to convert those scans into plain text. I then
sliced the output into individual files, one for each article. My final corpora consisted of
356 files from Political-Legal Research and 381 files from Law Science, comprising
35
An example of such a statement is Association of University Presses, “Facing Censorship: A State-
ment of Guiding Principles,” March 21, 2018, http://www.aupresses.org/news-a-publications/news/1692-
facing-censorship-a-statement-of-guiding-principles.
36
The first three issues, dating from 1956, bear the title Huadong zhengfa xuebao 华东政法学报 [East
China Journal of Politics and Law].
37
My workflow relied on the following principal software: MacOS (10.13.6), ABBYY FineReader
(12.1.11), BBEdit (12.5), Anaconda3 (3.7), Scikit-learn (0.19.1), Keras (2.22), Java 8 (131), Stanford Chi-
nese Segmenter (3.7) with the ctb standard, Mallet (2.08), and Microsoft Excel (16.16.3).


nearly four million characters in total.38 The two PRC platforms hosting these journals
are currently censoring approximately 8 percent of the articles, or 11 percent of the total
page count in my sample, though the latter share exceeds 50 percent for certain issues
straddling the 1957/58 divide.
Third, university-educated native speakers of Chinese compared my Law Science
corpus against the original documents, character by character, to establish the reliability
of the OCR. The test set consisted of 127 pages (approximately 207,000 characters)
from the original issues, and averaged a remarkable 99 percent accuracy, easily suffi-
cient for my purposes. Time and funding precluded an equally exhaustive test of my
Political-Legal Research corpus, but spot checks suggested similar accuracy.
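A character-by-character check of this kind can be approximated in code. The sketch below is illustrative only and is not the procedure the proofreaders followed: it assumes that a hand-verified transcription and the OCR output are both available as strings, and it counts in-order matching characters with Python's standard difflib module. The function name character_accuracy and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters that the OCR output reproduces in order."""
    if not reference:
        raise ValueError("reference text must not be empty")
    # autojunk=False stops difflib from discarding very frequent characters,
    # which matters when comparing long texts.
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


# Hypothetical example: one misread character in a ten-character string.
reference = "法律的阶级性和继承性"
ocr_output = "法律的阶级性和继永性"
accuracy = character_accuracy(reference, ocr_output)
```

Averaging this ratio over a sample of pages would yield a figure comparable in spirit to the accuracy rate reported above, although human proofreading remains the more trustworthy benchmark.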
Fourth, I wrote several programs in Python, the first of which stripped my corpora
of semantically expendable characters, such as punctuation, alphanumerics, Cyrillic,
gremlins, and whitespace. This reduced each file to an unbroken stream of Chinese
characters, which was then fed to a segmentation algorithm that tokenized the texts. Un-
like Western languages, Chinese does not separate semantic units with whitespace. The
segmentation algorithm performs this step, which the natural language processing
(NLP) routines in conventional computational text analysis require.39 It bears mention-
ing that my corpora were originally published just as the PRC was moving from tradi-
tional Chinese characters to the simplified set used today, and the articles in them conse-
quently contain a transitional mixture of both. This confused the segmentation
algorithm, and necessitated the preliminary measure of converting both corpora uni-
formly to simplified characters.
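The stripping step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the program used for this study, and it rests on one labeled assumption: that "Chinese characters" can be approximated by two Unicode ranges, the CJK Unified Ideographs block and its Extension A. The names NON_CHINESE and strip_expendable are hypothetical. Conversion to simplified characters and segmentation were handled by separate tools and are not shown.

```python
import re

# Assumption: Chinese characters are approximated by CJK Unified Ideographs
# (U+4E00 to U+9FFF) plus Extension A (U+3400 to U+4DBF). Everything else,
# including punctuation, digits, Latin, Cyrillic, and whitespace, is dropped.
NON_CHINESE = re.compile(r"[^\u3400-\u4dbf\u4e00-\u9fff]+")


def strip_expendable(text: str) -> str:
    """Reduce a text to an unbroken stream of Chinese characters."""
    return NON_CHINESE.sub("", text)


# Hypothetical sample line mixing scripts, digits, and punctuation.
sample = "法学 (Faxue) 1957年，第1期：“法律的阶级性” Право\n"
cleaned = strip_expendable(sample)
```

The resulting stream is what a segmentation algorithm would then divide into tokens.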
Fifth, the segmentation algorithm performed poorly for named entities, particularly
personal names, which can be idiosyncratic. For example, surnames were commonly
dissociated from given names. I therefore manually compiled a dictionary of several
hundred entries, including every author as well as prominent organizations and individu-
als mentioned in my texts. A Python program then scanned the texts for occurrences of
these names and reconstituted them correctly.
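One way to reconstitute dissociated names is to merge adjacent tokens whenever their concatenation matches a dictionary entry. The sketch below is a hypothetical illustration of that idea, not the program used here; the function name reconstitute_names, the max_span parameter, and the sample tokens are assumptions. It handles only names whose boundaries coincide with token boundaries.

```python
from typing import Iterable, List, Set


def reconstitute_names(tokens: List[str], names: Iterable[str], max_span: int = 4) -> List[str]:
    """Merge runs of adjacent tokens whose concatenation is a known name."""
    known: Set[str] = set(names)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest run first (up to max_span tokens, at least two).
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            if "".join(tokens[i:j]) in known:
                match_end = j
                break
        if match_end is None:
            merged.append(tokens[i])
            i += 1
        else:
            merged.append("".join(tokens[i:match_end]))
            i = match_end
    return merged


# Hypothetical segmenter output in which a surname was split from a given name.
tokens = ["杨", "兆龙", "认为", "法律", "有", "继承性"]
names = ["杨兆龙"]
merged = reconstitute_names(tokens, names)
```

Preferring the longest match keeps a short dictionary entry from pre-empting a longer one that begins with the same characters.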
Sixth, I compiled a series of metadata files necessary for the analytics I intended to
perform. These files are essentially spreadsheets (.csv) with columns for the filenames,
article titles, author names, index number (year-issue#-article#), and censorship status
of every article in various slices of my corpora. I built metadata files for the Political-
Legal Research corpus, the Law Science corpus, a combined corpus, and yet another
38
One could arrive at a slightly different count, depending on how one divides sidebars and forums
with contributions from multiple authors, but the key point is that every character was captured.
39
My methodology and workflow were informed by various texts in computational linguistics, natural
language processing, and digital history, including Charu C. Aggarwal and ChengXiang Zhai, eds., Min-
ing Text Data (New York, 2012); Steven Bird, Ewan Klein, and Edward Loper, Natural Language Pro-
cessing with Python (Cambridge, Mass., 2009); Michael C. Hout, Megan H. Papesh, and Stephen D.
Goldinger, “Multidimensional Scaling,” Wiley Interdisciplinary Review of Cognitive Science 4, no. 1
(2013): 93–103; David Mimno, “Computational Historiography: Data Mining in a Century of Classics
Journals,” ACM Journal on Computing in Cultural Heritage 5, no. 1 (2012): 3:1–3:19; Jason D. M. Ren-
nie, Lawrence Shih, Jaime Teevan, and David R. Karger, “Tackling the Poor Assumptions of Naive Bayes
Text Classifiers,” in Tom Fawcett, ed., Proceedings of the Twentieth International Conference on Ma-
chine Learning (Menlo Park, Calif., 2003), 616–623; David Underhill, Luke K. McDowell, David J. Mar-
chette, and Jeffrey L. Solka, “Enhancing Text Analysis via Dimensionality Reduction,” in Weide Chang
and James B. D. Joshi, eds., Proceedings of the IEEE International Conference on Information Reuse and
Integration (Piscataway, N.J., 2007), 348–353. Portions of my code adapted examples shared by Paul
Vierthaler and countless contributors to Stack Overflow.


combined corpus that substituted the city of publication (Beijing or Shanghai, respec-
tively) for the censorship field, which allowed me to study the legal discourse of this pe-
riod across not just time, but also space. Finally, after approximately five months of
preparation, the corpora were ready for data analysis.
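The structure of such a metadata file can be illustrated with Python's standard csv module. Everything in this sketch is a placeholder: the column names, filenames, titles, and rows are hypothetical and do not reproduce the actual files. Only the year-issue#-article# pattern of the index number is taken from the description above.

```python
import csv
import io
from collections import Counter

FIELDS = ["filename", "title", "author", "index", "censored"]

# Placeholder rows, not real data. "index" follows year-issue#-article#.
ROWS = [
    ["fx_1957_1_01.txt", "Article A", "Author A", "1957-1-1", "0"],
    ["fx_1957_6_03.txt", "Article B", "Author B", "1957-6-3", "1"],
    ["fx_1958_2_05.txt", "Article C", "Author C", "1958-2-5", "1"],
]


def parse_index(index: str) -> tuple:
    """Split an index number into (year, issue, article) integers."""
    year, issue, article = (int(part) for part in index.split("-"))
    return year, issue, article


# Write the table to an in-memory CSV, then read it back as dictionaries.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(FIELDS)
writer.writerows(ROWS)

buffer.seek(0)
censored_by_year = Counter()
for record in csv.DictReader(buffer):
    year, _issue, _article = parse_index(record["index"])
    censored_by_year[year] += int(record["censored"])
```

Swapping the censorship column for a city-of-publication column, as described above, changes only the label under analysis and leaves the rest of the file untouched.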
To perform that analysis, I wrote another Python program that stripped out stop-
words from the preprocessed corpora and transformed them into both word2vec models
and tf-idf matrices against which I could run a battery of exploratory statistical tests and
classification algorithms, only a few of which appear in this article.40 Briefly, the
word2vec models capture the contextual relationships among all of the terms in my corpora
and facilitate semantic analysis of the underlying texts. The tf-idf matrices describe
each of the 737 documents in my corpora in nearly 600,000 dimensions, where each di-
mension measures the importance of a unique unigram or bigram (e.g., “judge,” “Yang
Zhaolong,” “rule of law”).
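To make the tf-idf representation concrete, the following sketch computes weights for unigrams and bigrams in plain Python. It is not the code used in this study, which relied on Scikit-learn's TfidfVectorizer; it reimplements, as an assumption, the smoothed inverse document frequency and L2 normalization that TfidfVectorizer applies by default. The function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str]) -> List[str]:
    """Unigrams plus bigrams of adjacent tokens, joined by a space."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """L2-normalized tf-idf weights with smoothed idf, one dict per document."""
    counts = [Counter(ngrams(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        raw = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors


# Hypothetical segmented documents.
documents = [["法官", "审判"], ["杨兆龙", "法治"], ["法官", "法治", "杨兆龙"]]
vectors = tfidf(documents)
```

Each distinct unigram or bigram becomes one dimension, so a term that occurs in few documents receives a higher weight than an equally frequent term that occurs in many.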
Next, I experimented with three approaches to building predictive models that could
faithfully reproduce the choices made by the human censors. First, I evaluated several
types of neural networks using my word2vec models, but these achieved lackluster
results, perhaps because neural networks perform best with much larger datasets.41 Sec-
ond, using my tf-idf matrices, I evaluated fourteen different classification algorithms
and selected the most promising among them for optimization through grid search
cross-validation, which iterated over thousands of possible configurations and reported
performance for each on several metrics.42 Candidates from the gradient boosting family
of classifiers generally achieved the highest scores. Third, I evaluated various ensemble
stacking classifiers using the same grid search technique. Ensemble stacking classifiers
run competing classification models in parallel, pool the results, and feed those up to a
meta-classifier for final evaluation, on the theory that the whole may do better than the
sum of its parts. This approach achieved the best performance for my dataset.43
Feature engineering played an important part in optimizing the performance of my
prediction models. Recall that just over 8 percent of the documents in my corpora were
censored, which means that nearly 92 percent were not. I used SMOTE synthetic sam-
pling on the training data for my models to compensate for this imbalance, but not on
the validation or testing data used to evaluate them. Likewise, my final stacking classi-
40
The normalized tf-idf values were computed using the TfidfVectorizer class in Scikit-learn 0.19.1,
according to the following function: $\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \left( \log \frac{1 + n_d}{1 + \mathrm{df}(d,t)} + 1 \right)$. The exploratory data analysis
techniques included t-SNE analysis (TruncatedSVD), principal component analysis, Euclidean distance
calculations (MDS), scree plots, cosine similarity calculations (MDS), silhouette coefficient calculations,
spherical k-means clustering (MDS), and hierarchical clustering analysis (Ward dendrogram). I also per-
formed χ2 feature selection and topic modeling (Mallet, LDA) on my corpora.
41
The candidates included recurrent neural networks with(out) LSTM and a convolutional neural net-
work using the Tensorflow backend to Keras.
42
The classification algorithms evaluated were dummy, Gaussian naïve bayes, Bernoulli naïve bayes,
multinomial naïve bayes, k-nearest neighbors, logistic regression, random forest, linear support vector ma-
chine, non-linear support vector machine, decision trees, gradient boosting, light gradient boosting, xtreme
gradient boosting, and a multi-layer perceptron. I evaluated them on their mean accuracy, F1 score, Mat-
thews correlation coefficient, sensitivity rate, and specificity rate.
43
The winning model was an optimized two-level ensemble stacking classifier comprising a gradient
boosting classifier and light gradient boosting machine at level one, and a Bernoulli naïve bayes meta-
classifier at level two. After ten stratified k-fold cross-validations, the model achieved a mean accuracy of
0.95, a mean F1 score of 0.97, a mean Matthews correlation coefficient of 0.69, a mean sensitivity rate of
0.97, and a mean specificity rate of 0.73.


fier used χ2 feature selection to identify the features most highly correlated with censor-
ship, and passed only the feature importances calculated from those by each of the
level-one classifiers to the level-two meta-classifier.
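The idea behind SMOTE can be shown with a simplified sketch. This is not the implementation used in this study: it is a plain-Python approximation that creates each synthetic minority-class point by interpolating between a randomly chosen minority sample and one of its nearest minority neighbors. The function names, parameters, and toy vectors are hypothetical. As noted above, such resampling belongs only on the training data.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority: List[Vector], n_new: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbors of the chosen point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


# Hypothetical minority-class (censored) vectors in two dimensions.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
synthetic = smote(minority, n_new=4, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the technique thickens the minority class without simply duplicating documents.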

I validated all of my models using stratified k-fold cross-validation and ranked them
on the basis of their Matthews correlation coefficients, a metric suited to imbalanced
classes. Stratified k-fold cross-validation divides the corpus into k subsets (called
“folds”), each with the same distribution of censored and uncensored documents as the
corpus at large. It trains the model on k-1 folds, and then evaluates the model’s perfor-
mance against the sole remaining fold, which the model has never seen before. It repeats
this process k times, excluding a different fold from the model-building each time. Finally,
it averages the metrics returned by each iteration.
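The validation procedure can be sketched in plain Python. The code below is a hedged illustration, not the code used in this study: it deals document indices into folds that preserve the class balance, yields one train/test split per fold, and computes the Matthews correlation coefficient from a confusion matrix. Model training is omitted, and the function names and the label vector (8 censored and 92 uncensored documents) are hypothetical.

```python
import math
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[int], k: int, seed: int = 0) -> List[List[int]]:
    """Split document indices into k folds with near-identical class shares."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists, holding out each fold exactly once."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


def matthews_correlation(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = censored)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical labels: 8 censored (1) and 92 uncensored (0) documents.
labels = [1] * 8 + [0] * 92
folds = stratified_folds(labels, k=4)
splits = list(train_test_splits(folds))
```

With k = 4, each fold here holds 2 censored and 23 uncensored documents, and every document is used for testing exactly once. A classifier that labels everything uncensored would score high on accuracy yet earn a Matthews coefficient of zero, which is why the metric suits imbalanced classes.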
There is an art to predictive model-building, and no doubt the potential to extract
still higher performance remains, but my ultimate goal in this study lay elsewhere: to
raise consciousness of an emergent new paradigm of information control that is begin-
ning to encroach on the integrity of the historical record and will soon proceed apace.
Against the background of a political climate that is testing the vigor of liberalism, this
development, coupled with the snowballing aggregation of our global knowledge base
onto platforms beyond our control, promises to be game-changing.

Glenn D. Tiffert is a Visiting Fellow at the Hoover Institution, and a historian of
modern China. His research has centered on Chinese legal history, including publi-
cations on constitutionalism, the construction of a modern judiciary, and the gene-
alogy of the rule of law in the PRC. His current book manuscript, provisionally
entitled “Judging Revolution,” radically reinterprets the Mao era and the 1949 rev-
olution by way of a deep archival dive into the origins of the PRC judicial system.
His current research probes the intersections between information technology and
authoritarianism, and the ramifications of China’s rise for American interests.
