Making sense of Internet censorship using automation

posted by Daniel Tkacik
August 3, 2017

Depending on which country you are in, various parts of the Internet may be censored for various reasons – some lawful and some not. For now, the life of an Internet censorship researcher is a hard one: finding out which websites are censored and which are not can be incredibly cumbersome.

“Imagine you have a list of 10,000 URLs. Your eyes glaze over scrolling through them and you don’t notice the one or two URLs that are critical,” says Zack Weinberg, a CyLab researcher and a Ph.D. student in the department of Electrical and Computer Engineering. “But if this job were performed by a computer – a computer does not get bored.”

Last week, Weinberg presented a new automated method for studying Internet censorship, in which the computer takes on a bigger role than the human in scanning the Internet for censored (or not censored) material. Weinberg presented the study at the 17th Privacy Enhancing Technologies Symposium in Minneapolis.

Using the new automated method, Weinberg and his team analyzed the content and longevity of 760,000 websites found on actual blacklists of censored content in particular countries as well as “probe lists” – hand-curated lists of websites that, based on previous studies, may have an elevated chance of being censored. These probe lists are a key component of Internet censorship research.

The researchers found that the actual blacklists had few similarities with the probe lists.
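To make "few similarities" concrete, here is a minimal sketch of one way two URL lists could be compared: the Jaccard overlap of their hostnames. This is an illustration only, not the researchers' method. The function names and the example URLs are hypothetical, and matching on hostnames is an assumption made for simplicity.

```python
from urllib.parse import urlsplit


def hostname(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def jaccard_overlap(list_a, list_b) -> float:
    """Share of distinct hostnames that appear in both lists (0.0 to 1.0)."""
    # Entries that do not parse to a hostname are dropped (assumption).
    hosts_a = {hostname(u) for u in list_a} - {""}
    hosts_b = {hostname(u) for u in list_b} - {""}
    union = hosts_a | hosts_b
    if not union:
        return 0.0
    return len(hosts_a & hosts_b) / len(union)


# Hypothetical sample data; real blacklists and probe lists are far larger.
blacklist = ["https://www.example-social.test/home", "http://news.example.test/a"]
probe_list = ["http://example-social.test/", "https://blog.example.test/post"]
print(jaccard_overlap(blacklist, probe_list))
```

A score near 0 means the two lists share almost no sites, and a score near 1 means they cover nearly the same sites.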

“The number one thing that gets censored the most worldwide is social media,” Weinberg says. “These days, censorship tends to be aimed at making it harder for mass popular movements to organize. Social media is pretty good at that.”

Yet, despite the prominence of social media censorship worldwide, Weinberg says social media sites were relatively under-represented in the hand-curated probe lists the team examined.

“Using this automated method could better inform those lists in the future,” Weinberg says.

The researchers also found that webpages on controversial topics tend to have much shorter lifetimes (i.e., they are up for some amount of time and then are censored) than pages with non-controversial content.

“This says to us that probe lists need to be continuously updated to be useful,” Weinberg says.

Other authors on the study included ECE Ph.D. students Mahmood Sharif and Janos Szurdi, and Engineering and Public Policy and Institute for Software Research professor Nicolas Christin. 

