
CyLab Chronicles

Anupam Datta on Impact and Implications of IEEE Award Winning Paper, "Bootstrapping Privacy Compliance in Big Data Systems"

posted by Richard Power

Anupam Datta is an Associate Professor at Carnegie Mellon University (CMU) and a leading researcher at CMU CyLab. CyLab Chronicles recently sat down with him to discuss his team's latest successes and the future direction of their vital research. Here is our conversation.

In the 21st century, the privacy space has become a challenging one for government, for business, and for each of us as individuals, fraught with complex issues. Which particular issue does the work in "Bootstrapping Privacy Compliance in Big Data Systems" seek to address?

Anupam Datta: To allay privacy concerns, Web services companies, such as Facebook, Google and Microsoft, all make promises about how they will *use* personal information they gather. But ensuring that *millions of lines of code* in their systems *respect* these *privacy promises* is a challenging problem. Recent work with my PhD student in collaboration with Microsoft Research addresses this problem.  We present a workflow and tool chain to automate privacy policy compliance checking in big data systems. The tool chain has been applied to check compliance of over a million lines of code in the data analytics pipeline for Bing, Microsoft’s Web search engine. This is the first time automated privacy compliance analysis has been applied to the production code of an Internet-scale system. The paper, written jointly with my PhD student Shayak Sen and a team from Microsoft Research, was presented at the 2014 IEEE Symposium on Security and Privacy and recognized with a Best Student Paper Award.

Tell us something about the system you and your team designed. Could you briefly describe its components, Legalease and Grok?

Datta: Central to the design of the workflow and tool chain are (a) *Legalease* — a language that allows specification of privacy policies that impose restrictions on how user data flows through software systems; and (b) *Grok* — a data inventory that annotates big data software systems (written in the Map-Reduce programming model) with Legalease’s policy datatypes, thus enabling automated compliance checking. Let me elaborate.

Privacy policies are often crafted by legal teams, while the software that has to respect these policies is written by developers. An important challenge is thus to design privacy policy languages that are *usable* by legal privacy teams, yet have precise *operational meaning* (semantics) that software developers can use to restrict how their code operates on personal information of users. The design of Legalease was guided by these dual considerations. Legalease builds on prior work from my research group that shows that privacy policies often involve *nested allow-deny information flow rules with exceptions* (see DeYoung et al. 2010, Garg et al. 2011 for the first complete logical specification and audit of the HIPAA Privacy Rule for healthcare privacy in the US). For example, a rule might say: "IP address will not be used for advertising, except that it may be used for detecting abuse. In such cases it will not be combined with account information." Our hypothesis was that such rules match the *mental model* of legal privacy policy authors. The results of our *user study*, involving participants from the legal privacy team and privacy champions at Microsoft (who sit between the legal privacy team and software developers), provide evidence in support of our hypothesis: after a short tutorial, the participants were able to encode the entire Bing policy pertaining to how users' personal information will be used on the Bing servers (9 policy clauses) in about 15 minutes with high accuracy.
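To make the "nested allow-deny rules with exceptions" idea concrete, here is a minimal Python sketch of how such a rule structure might be evaluated. This is an illustration of the concept only, not the actual Legalease language or its semantics; the attribute names (`datatype`, `purpose`, `combined_with`) and labels are hypothetical.

```python
# Hypothetical sketch of nested allow-deny rules with exceptions:
# a clause carries an effect (ALLOW/DENY) plus attribute constraints,
# and a matching nested exception overrides its parent's verdict.

def matches(clause, flow):
    """A clause matches a flow if every attribute it constrains agrees."""
    return all(flow.get(attr) == value for attr, value in clause["when"].items())

def evaluate(clause, flow, default="ALLOW"):
    """Return the verdict for a flow, letting exceptions refine the parent."""
    if not matches(clause, flow):
        return default
    verdict = clause["effect"]
    for exc in clause.get("except", []):
        # An applicable exception flips the verdict for its sub-case.
        verdict = evaluate(exc, flow, default=verdict)
    return verdict

# An illustrative encoding of the example rule from the interview:
# "IP address will not be used for advertising, except that it may be used
#  for detecting abuse. In such cases it will not be combined with account
#  information."
policy = {
    "effect": "DENY",
    "when": {"datatype": "IPAddress"},
    "except": [{
        "effect": "ALLOW",
        "when": {"purpose": "AbuseDetection"},
        "except": [{
            "effect": "DENY",
            "when": {"combined_with": "AccountInfo"},
        }],
    }],
}

print(evaluate(policy, {"datatype": "IPAddress", "purpose": "Advertising"}))
```

The nesting mirrors how the prose rule is written: the outer denial is the general case, and each exception narrows the one above it, which is the structure the user-study participants were asked to encode.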

Software systems that perform data analytics over personal information of users are often written without a technical connection to the privacy policies that they are meant to respect. Tens of millions of lines of such code are already in place in companies like Facebook, Google, and Microsoft. An important challenge is thus to *bootstrap* existing software for privacy compliance. Grok addresses this challenge. It annotates software written in programming languages that support the Map-Reduce programming model (e.g., Dremel, Hive, Scope) with Legalease's policy datatypes. We focus on this class of languages because they are the languages of choice in industry for writing data analytics code. A simple way to conduct the bootstrapping process would be to ask developers to manually annotate all code with policy datatypes (e.g., labelling variables as IPAddress, programs as being for the purpose of Advertising, etc.). However, this process is too labor-intensive to scale. Instead, we develop a set of techniques to automate the bootstrapping process. I should add that the development of Grok was led by Microsoft Research and was underway before our collaboration with them began.
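One way to automate that bootstrapping is to seed a small number of columns with policy datatypes and let the labels flow forward along the pipeline's data-flow edges, so downstream columns inherit annotations without manual work. The sketch below illustrates that idea in Python; it is not the actual Grok implementation, and the column names, edges, and labels are all invented for illustration.

```python
# Illustrative sketch of label bootstrapping: seed a few columns with
# policy datatypes and forward-propagate labels along data-flow edges
# (source column -> derived column) to a fixpoint.

from collections import deque

# Hypothetical edges extracted from pipeline jobs.
edges = [
    ("raw_log.ip", "session.ip"),
    ("session.ip", "abuse_report.ip"),
    ("raw_log.query", "session.query"),
]

# The handful of columns annotated by hand (or by naming heuristics).
seeds = {"raw_log.ip": {"IPAddress"}}

def propagate(edges, seeds):
    """Forward-propagate policy-datatype labels to a fixpoint."""
    labels = {col: set(tags) for col, tags in seeds.items()}
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
    work = deque(labels)
    while work:
        col = work.popleft()
        for nxt in succ.get(col, []):
            before = len(labels.setdefault(nxt, set()))
            labels[nxt] |= labels[col]
            if len(labels[nxt]) > before:
                work.append(nxt)  # new labels reached this column
    return labels

final = propagate(edges, seeds)
# Every downstream copy of the IP column now carries the IPAddress label.
print(sorted(col for col, tags in final.items() if "IPAddress" in tags))
```

With labels attached to every column, a compliance checker can then test each job's inputs and stated purpose against the Legalease policy, which is the automated checking step the tool chain performs over the Bing pipeline.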

What do you see as the practical outcome of this research? Is it going to be applicable in the operational environment? What is it leading toward?

Datta: One practical outcome of this research is that it enables companies to check that their big data software systems are compliant with their privacy promises. Indeed the prototype system is already running on the production system for Bing, Microsoft's search engine. So, yes, it is applicable to current operational environments.
I view this result as a significant step forward in ensuring privacy compliance at scale. With the advent of technology that can help companies keep their privacy promises, users can reasonably expect stronger privacy promises from companies that operate over their personal information. Of course, much work remains to be done in expanding this kind of compliance technology to cover a broader set of privacy policies and in its adoption in a wide range of organizations.

What other privacy challenges are you looking to address in your ongoing research? What's next for your team?

Datta: We are deeply examining the Web advertising ecosystem, in particular, its *transparency* and its implications for *privacy* and *digital discrimination*. There are important computational questions lurking behind this ecosystem: Are web-advertising systems transparently explaining what information they collect and how they use that information to serve targeted ads? Are these systems compliant with promises they make that they won't use certain types of information (e.g., race, health information, sexual orientation) for advertising? Can we answer these questions from the "outside", i.e., without gaining access to the source code and data of web advertising systems (in contrast with our privacy compliance work with Microsoft Research)? How can we enhance transparency of the current Web advertising ecosystem? What does digital discrimination mean in computational terms and how can we design systems that avoid it?

As a researcher who has looked long and deeply into the privacy space, what do you think is most lacking in the approaches of government and business? What is most needed? What are people missing?

Datta: First, I believe that there is a pressing need for better *computational tools* that organizations can (and should) use to ensure that their software systems and people are compliant with their privacy promises. These tools have to be based on well-founded computing principles and at the same time have to be usable by the target audience, which may include lawyers and other groups who are not computer scientists.
Second, current policies about collection, use, and disclosure of personal information often make rather *weak promises*. This is, in part, a corollary of my first point: organizations do not want to make promises that they don't have the tools to help them be compliant with. But it is also driven by economic considerations, e.g., extensive use of users' personal information for advertising can result in higher click-through rates. While many companies make promises that they will not use certain types of information (e.g., race, health information, sexual orientation) for advertising, it is far from clear what these promises even mean and how a company can demonstrate they are compliant with these promises.
Third, we need better *transparency mechanisms* from organizations that explain what information they collect and how they use that information. Self-declared privacy policies and mechanisms like ad settings managers are steps in the right direction. However, the transparency mechanisms should themselves be auditable to ensure that they indeed transparently reflect the information handling practices of the organization.
These are all fascinating questions -- intellectually deep from a computer science standpoint and hugely relevant to the lives of hundreds of millions of people around the world. They keep me and my research group up at night!
