
Staying Safe on the Chinese Internet: A Guide for Open-Source Researchers

Ryan Fedasiuk, October 2021

The goal of this document is to acquaint researchers and analysts with tools, resources, and
best practices to remain safe when collecting or accessing open-source information.

There are three main considerations when collecting open-source information from foreign
government and military websites. In order of priority, they are:

1. Protecting your devices, network, and files from malware.
2. Archiving your sources for posterity.
3. Masking your activities from nosey onlookers.

This guide introduces some tools, resources, and best practices that can help you achieve
these goals.

The Cardinal Rules of Open-Source OpSec


1. Always assume compromise.
2. Always stay connected to a VPN.
3. Never download files locally.
4. Whenever possible, access only the cached or archived versions of webpages.
5. Whenever appropriate, archive sources immediately.
6. Whenever in doubt, scan before you click.

Resources, Tools, and Best Practices


1. VPNs

A virtual private network (VPN) can secure your network by masking your IP address and
encrypting information that is transmitted from your device. Most VPN services will let you select
a server through which to route internet traffic, so that websites see the server’s IP address
rather than your own. For a faster connection, choose a server located near you. For a slower
connection likely to raise fewer eyebrows, choose a server based in Hong Kong, Taiwan, Singapore,
or another nearby location, or use a service that allows you to tunnel into China.

There are many options and considerations: price, number of servers, connection speed,
whether the service keeps logs of your browsing activity, and saturation—whether the
government whose files you are browsing has blocked many of the service’s connection nodes.

2. Cached Webpages

A safer way to access any webpage is to access Google’s cached version of that page, rather
than visiting the website directly. You can think of a cached page as a past version of the
website in question, which Google’s search engine accessed and saved internally while creating
search results and previews. Not every webpage is cached, but you’ll find that most webpages
have this option.

To access the cached version of a webpage, either type cache:[URL] directly into your
browser’s navigation bar, or click on the three dots next to a Google search result to see more
information about the page:

The bottom right-hand corner of the ensuing pop-up will include a button that says “Cached.”
Click on it to access the cached page.

A cached version of a webpage will have a banner at the top identifying it as Google’s cache of
the page and giving the date of the snapshot.

Accessing the cached version of a webpage is not foolproof. It is still possible for a website
owner to track which IP address is viewing a cached webpage, through certain embedded
images and other elements. Accessing the text-only version of a cached page, or the HTML
source code, can mitigate some of these risks, and allow you to more quickly find information on
webpages that are slow to load.
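The cache:[URL] lookup and its text-only variant can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive recipe: the endpoint host and the strip=1 parameter are assumptions based on how Google’s cache has historically been served, the example address is hypothetical, and the code only builds strings without contacting any website.

```python
from urllib.parse import quote

# Assumption: the host Google has historically used to serve cached pages.
CACHE_ENDPOINT = "https://webcache.googleusercontent.com/search"


def cache_query(url: str) -> str:
    """Return the cache:[URL] string that can be typed into the navigation bar."""
    return f"cache:{url}"


def cached_url(url: str, text_only: bool = False) -> str:
    """Build a direct link to the cached copy of a page.

    Characters such as '?' and '&' in the target URL are percent-encoded so
    they are not mistaken for parameters of the cache request itself.
    """
    link = f"{CACHE_ENDPOINT}?q={quote(cache_query(url), safe=':/')}"
    if text_only:
        # Assumption: strip=1 requests the text-only version, which skips
        # embedded images and similar elements.
        link += "&strip=1"
    return link


# Hypothetical address, used purely for illustration.
print(cached_url("http://example.gov.cn/notice.xls", text_only=True))
```

Building the link in a script rather than clicking through search results means the original address is never opened by accident.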

Cached webpages are especially useful for viewing documents that you would otherwise have
to download directly onto your computer—something you should never do. For example, take
this .xls spreadsheet file hosted by the Cyberspace Affairs Commission:

Just clicking on this Google search result would normally result in the file being
automatically downloaded to your computer—a disaster! Grappling with auto-download
links is an eternal challenge when collecting open-source information from foreign websites.

A far safer (and faster) way of getting at the information is to access the cached version of the
webpage that is hosting the file. Rather than downloading something and opening it in Excel,
Google’s cache transforms it into a webpage that you can view in your browser:

This strategy works for common file types such as .doc, .pdf, .xls, and .xlsx, but will
sometimes garble file formatting (especially in PDFs).

3. Archive Services

Archiving sources is incredibly important. Within days or even hours of a research product being
published, the sources it cites frequently disappear, and original website links frequently break.
But there are several reasons you might want to archive a website, beyond ensuring future
access to the material:

● Archive services can serve a similar function to a cached webpage, allowing you to
view a safer version of the page. (Please note that it is also possible to archive a
Google-cached webpage, rather than the original source, for layered protection.)
● Some archive services, like the Wayback Machine (discussed below), will tell you if
someone else has already archived the page, which can be useful to know.
● Some archive services will generate unique links and display the exact timestamps for
when they were generated. This can be helpful in plagiarism disputes and/or tracking
project timelines.
● Running U.S. news media articles through digital archive services can bypass some
paywalls or article limits.

In particular, two free archiving services are useful for making redundant links, creating archives,
quickly checking whether a source might be interesting (when no cached page exists), or
spot-checking for users without Perma.cc accounts. They are:

● The Internet Archive (Wayback Machine): https://web.archive.org/save/
● Archive Today: https://archive.vn/

It’s often worth double- or triple-archiving really valuable documents across more than one
archive service.
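As a rough sketch of how the Wayback Machine links above fit together, the Python below builds a save link from the https://web.archive.org/save/ prefix and reads a response from the Internet Archive’s availability endpoint to see whether someone has already archived a page. The endpoint address and response layout are assumptions about that service, the sample response and example address are hypothetical, and nothing here touches the network.

```python
import json
from urllib.parse import urlencode

SAVE_PREFIX = "https://web.archive.org/save/"
# Assumption: the Internet Archive's availability endpoint.
AVAILABILITY_API = "https://archive.org/wayback/available"


def save_link(url: str) -> str:
    """Link that asks the Wayback Machine to archive the given page."""
    return SAVE_PREFIX + url


def availability_link(url: str) -> str:
    """Link that asks whether a snapshot of the page already exists."""
    return f"{AVAILABILITY_API}?{urlencode({'url': url})}"


def closest_snapshot(response_text: str):
    """Return (snapshot_url, timestamp) from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


# Hypothetical response, shaped like the endpoint's JSON output.
SAMPLE_RESPONSE = """
{"archived_snapshots": {"closest": {
    "available": true,
    "status": "200",
    "timestamp": "20211001120000",
    "url": "http://web.archive.org/web/20211001120000/http://example.gov.cn/notice"
}}}
"""

print(save_link("http://example.gov.cn/notice"))
print(closest_snapshot(SAMPLE_RESPONSE))
```

Recording the returned timestamp alongside each link makes it easier to show later when a source was captured.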

Please note that most archiving services will “ping” the website with a U.S.-based IP
address. This can ruin your attempts to remain stealthy, for example, with a China- or
Hong Kong-based VPN. Please also note it may be possible for website owners to
retroactively break archive links you have already established. For these reasons,
web-based digital archive services may not always be the best option.

To maintain maximum privacy, security, and long-term access, it is often worthwhile to save local
copies of webpages as PDFs to your computer, then copy them to cloud storage or an
external hard drive. Please note this is not the same thing as downloading a PDF from the
website itself, which you should never do. Rather, when you are viewing a webpage, follow
these steps:

● First, attempt to “print” the webpage by opening the print interface (press Ctrl+P, or Cmd+P on a Mac).
● Then, instead of actually printing it out, change the destination to “Save as PDF.”
● Finally, consider duplicating the saved file to external flash drives or uploading to the
cloud.
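The final duplication step can be scripted. The sketch below, which uses only the Python standard library, copies a PDF you have already saved through the print dialog into one or more backup folders, prefixes each copy with a UTC timestamp, and confirms by SHA-256 hash that every copy matches the original. The function names and folder paths are hypothetical, and the folders are assumed to be a mounted flash drive or a cloud-synced directory.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_pdf(saved_pdf: Path, backup_dirs: list[Path]) -> list[Path]:
    """Copy a saved PDF into each backup folder and verify every copy."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    original_hash = sha256_of(saved_pdf)
    copies = []
    for directory in backup_dirs:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / f"{stamp}_{saved_pdf.name}"
        shutil.copy2(saved_pdf, target)  # copy2 also preserves file times
        if sha256_of(target) != original_hash:
            raise IOError(f"Copy at {target} does not match the original")
        copies.append(target)
    return copies


# Example call with hypothetical paths:
# duplicate_pdf(Path("notice.pdf"), [Path("/Volumes/USB/archive"), Path.home() / "CloudDrive"])
```

The timestamp prefix keeps repeated captures of the same page from overwriting one another.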

4. URL and File Scanners

Sometimes, there will be a potentially valuable source of information that resists archiving and
has no cached webpage. It’s a gamble to directly access these kinds of links. But you can
conduct due diligence: Whenever in doubt, scan before you click.

VirusTotal is a free service that scans files and URLs for malware by checking them against 79
different antivirus software services, including well-known consumer brands like Bitdefender
and Kaspersky.

VirusTotal collects information about the files and URLs uploaded to its interface. It is essentially
a testing platform for antivirus services. It has access to 79+ antivirus services because it
provides diagnostic information to improve their products based on the scans that users like you
generate. It does not require an account.
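For readers who prefer scripting to the web interface, the sketch below prepares the two identifiers that VirusTotal lookups are commonly keyed on: a URL identifier and a file’s SHA-256 hash. It assumes the conventions of VirusTotal’s v3 API, where a URL identifier is the unpadded URL-safe base64 encoding of the address. Actually sending a request would additionally require an API key, which is not shown. The example address is hypothetical and the code makes no network calls.

```python
import base64
import hashlib

# Assumption: base address of VirusTotal's v3 API.
VT_API = "https://www.virustotal.com/api/v3"


def vt_url_id(url: str) -> str:
    """Unpadded URL-safe base64 of the address (assumed v3 URL identifier)."""
    return base64.urlsafe_b64encode(url.encode()).decode().rstrip("=")


def sha256_hex(data: bytes) -> str:
    """SHA-256 hash of a file's bytes, usable to look up an existing report."""
    return hashlib.sha256(data).hexdigest()


def vt_url_report(url: str) -> str:
    """Report address for a URL (assumed endpoint layout)."""
    return f"{VT_API}/urls/{vt_url_id(url)}"


def vt_file_report(digest: str) -> str:
    """Report address for a file hash (assumed endpoint layout)."""
    return f"{VT_API}/files/{digest}"


# Hypothetical address, used purely for illustration.
print(vt_url_report("http://example.gov.cn/notice"))
print(vt_file_report(sha256_hex(b"example file contents")))
```

Looking up a hash checks whether a report already exists without uploading the file itself, which matters given that VirusTotal collects information about what is submitted to it.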

5. Browser Sandbox

If you’re sitting down for an extended session of information-hunting, it’s best to do all of your
searching inside of a virtual sandbox (or a virtual machine, VM). There are several applications
that can isolate the programs you choose to run, such as web browsers like Google Chrome and
Firefox, from the rest of your computer.

A web browser session run inside the sandbox will close when the sandbox is closed. Any files
downloaded from the browser will remain inside the sandbox, and can be wiped when the
sandbox is closed, without being saved to your actual computer. You can still give express
permission to transfer individual files outside of the sandbox.

There are different sandbox options available for PC and Mac users, and many are free,
open-source, and relatively lightweight applications:

● For PC Users: Sandboxie is the gold standard in virtual sandbox applications.
● For Mac Users: A common Sandboxie alternative is Cuckoo Sandbox.

6. Antivirus Software

If you’re conducting open-source research, it behooves you to have a subscription to
high-quality antivirus software. However, if you do not already have antivirus on your personal
computer, there are some free options worth downloading and running regularly:

● Malwarebytes offers free, relatively lightweight, on-demand malware scans. It can be run
in conjunction with other antivirus software products.
● Bitdefender is often cited as a high-quality antivirus product, but there are
alternatives, such as Norton, McAfee, AVG, and Kaspersky.

If at any point you break one of the six cardinal rules outlined in this guide, or accidentally click
on an auto-download link, it’s worth running a quick Malwarebytes scan. But remember: the
best practice in this line of work is to assume compromise. If a state wants to track your
browsing and research activity, it will surely be able to do so.
