Ethical Data Work: Lessons on Technical Data Protection
A beginner-friendly tutorial that explores ethical data science, elucidating how one can work with data while preserving sensitive, private information.
Being data scientists with naturally inquisitive minds, our work is all about going berserk with data. We are literally obsessed with it. Day in and day out, we’re tasked with cleaning messy tables, modelling causal relationships, and developing bulletproof predictions. And most of us do so with passion.
Yet regardless of our task, whether sophisticated or mundane, it’s ultimately all about data insights. Everything we do is geared towards extracting valuable insights for commercial advantage or, in academia, towards pushing scientific frontiers. Thus we have one thing in common as a collective of data professionals: each and every one of us employs their skills to unlock the full potential of data.
But this fixation with unlocking value isn’t virtuous in every circumstance. Our zealous pursuit of insight could inflict great harm if nothing restrains it, as is the case with every endeavour that is devoid of balance. This is why I’m writing this article, which explores the constraints companies and academics have to navigate when they exploit data. In practice, privacy is safeguarded by data protection regulation, which sets the legal boundaries of data work.
Data Protection: Technical Enforcement
Companies need to comply with a variety of rules and regulations to avoid data-protection violations, some of which could easily result in hefty fines and lasting reputational damage, especially in the case of data theft.
To accomplish this, they need to implement mechanisms that give private individuals full control over the data that is stored about them. On top of that, it is also important that they restrict the collection and retention of sensitive information. “How is this done?” you may wonder. The following paragraphs offer definitive answers.
Technique 1: Simple Suppression
Whenever we wish to anonymize a dataset, we first must acquire some information about the statistical traits of the original data. Obtaining the probability distribution of the column (or columns) of interest is thus imperative, so that we can create anonymized replicas of the original data.
A great resource to experiment with is Mockaroo (https://www.mockaroo.com/), which allows software developers as well as data scientists to create artificial datasets. One would not even notice that the datasets are not real. Best of all, they are fully customizable. So, if you need some mock data to start building your app, or just want to mimic real data as closely as possible, Mockaroo is your best bet.
Using Mockaroo, I have assembled a fictional dataset containing names, ages, genders, addresses and credit card details.
As seen in the picture above, we have sensitive PII like the full name of the person and their financial details. That’s not everything there is, though. Non-sensitive PII such as gender and address information encompassing country, city and state is also visible. Such fragments suffice to enable companies or individuals with malicious intent to cause harm: they could create user profiles attached to geographic location to manipulate voter intent, as was done in the Cambridge Analytica scandal. Let’s have a look at how we can anonymize this hypothetically real dataset.
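As a minimal sketch of what simple suppression can look like in pandas: the rows, the column names (full_name, city, credit_card) and the suppress helper below are assumptions made for illustration, not the actual Mockaroo export. Names are fully replaced, card numbers are masked except for their last four digits, and the city column is dropped.

```python
import pandas as pd

# Hypothetical rows standing in for the fictional dataset;
# the column names are assumptions for this example.
df = pd.DataFrame({
    "full_name": ["Jane Doe", "John Smith"],
    "age": [34, 51],
    "gender": ["Female", "Male"],
    "city": ["Austin", "Denver"],
    "credit_card": ["4539148803436467", "6011000990139424"],
})

def suppress(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with direct identifiers hidden or partially masked."""
    out = frame.copy()
    # Full suppression: replace every name with a constant placeholder.
    out["full_name"] = "REDACTED"
    # Partial suppression: star out every digit except the last four.
    out["credit_card"] = out["credit_card"].str.replace(
        r"\d(?=\d{4})", "*", regex=True
    )
    # Column suppression: drop fine-grained location entirely.
    return out.drop(columns=["city"])

anonymized = suppress(df)
print(anonymized)
```

Working on a copy keeps the original frame untouched, which is convenient while experimenting; in a real pipeline the raw data would typically not be kept alongside the masked version.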
Of course, the patterns could be more complex, and the level of detail stripped off will vary in practice. But simple concealment strategies such as these may help with compliance with data-protection laws.
Let’s now turn to the next set of techniques.
Technique 2: Substitute Sampling for Numerics
In any case where we wish to anonymize a dataset, we must acquire some information about the statistical properties of the original data. Obtaining the probability distribution of the column of interest is imperative, so that we can create anonymized versions with the same statistical properties.
The function below uses the SciPy library to fit a skewed normal distribution to our original data and then draws from it a replica that has the same statistical patterns. This anonymizes the individual values while preserving the distribution we need in our endeavour to extract insights.
The shape parameter a varies depending on whether the underlying data exhibits a right skew (a > 0) or a left skew (a < 0). When a = 0, we are dealing with the good old normal distribution. Below we can see the original data on the left and the anonymous, replicated salary data on the right. You can play around with a to achieve the best replica of your underlying distribution.
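One way such a function might look is sketched below. It is not the original function from this article: the name replicate_skewed and the simulated salary column are assumptions for illustration. It relies on scipy.stats.skewnorm, whose fit method estimates the shape parameter a together with the location and scale.

```python
import numpy as np
from scipy import stats

def replicate_skewed(values, random_state=None):
    """Fit a skew-normal to `values` and draw a synthetic sample of equal size."""
    values = np.asarray(values, dtype=float)
    # fit() returns the shape parameter a, followed by loc and scale.
    a, loc, scale = stats.skewnorm.fit(values)
    return stats.skewnorm.rvs(
        a, loc=loc, scale=scale, size=values.size, random_state=random_state
    )

# Hypothetical right-skewed "salary" column, simulated for this example.
rng = np.random.default_rng(42)
original = stats.skewnorm.rvs(
    5, loc=40_000, scale=15_000, size=1_000, random_state=rng
)
synthetic = replicate_skewed(original, random_state=rng)

print(f"original mean:  {original.mean():.0f}")
print(f"synthetic mean: {synthetic.mean():.0f}")
```

The synthetic values are fresh draws and are no longer tied to any individual row, but this kind of resampling is not a formal privacy guarantee, and it only works well when the skew-normal family is a reasonable fit for the column.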
Technique 3: Sensitive Data Replacement with Faker
Turning now to an indispensable tool for ethical data science, we will explore Faker. According to the documentation, Faker is a Python library that generates fake data for you. From names, addresses and credit card details to more complex data such as geographical data, you will find this package very useful and most likely add it to your bookmarks.
So how does it work, you may wonder? In my fictional dataset I have used it to generate realistic names based on a gender column.
I found it interesting to note that some of my fictional individuals had PhD titles, though it would’ve been even more awesome if that were reflected in the salary in some way. At any rate, you may want to write a few utility functions using Faker to substitute sensitive data with fictional values.
Technique 4: Summarization into Bins
The generalization of data results from applying aggregation operations such as binning, rounding, and categorizing in broader ways so as to replace an extremely precise value with a less precise one. In essence, it is as though we are slightly blurring the face of a person, while we can still see in broad terms their characteristics. By doing so, sensitive data and personal identifiers are removed while the data still remains useful for analysis.
Here we use the pandas cut function to group the ages into categorical bins.
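A minimal sketch of this binning step is shown below. The ages, bin edges and labels are illustrative assumptions; in practice the ranges should be chosen to suit the analysis. Setting right=False makes each bin include its lower edge and exclude its upper edge, so an age of 18 falls into "18-29".

```python
import pandas as pd

# Hypothetical ages; in practice this would be the age column of the dataset.
ages = pd.Series([3, 17, 18, 29, 45, 64, 65, 90], name="age")

# Bin edges and labels are assumptions chosen for illustration.
bins = [0, 18, 30, 45, 65, 120]
labels = ["0-17", "18-29", "30-44", "45-64", "65+"]

# right=False gives left-closed intervals: [0, 18), [18, 30), and so on.
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "age_group": age_group}))
```

Once the grouped column is in place, the exact age column can be dropped so that only the coarser category remains in the shared dataset.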
Until not so long ago, scientists used human beings as guinea pigs for experimentation without regard for ethical considerations. To gain as much insight from data as possible, the modern data scientist might be tempted to do the same. Not with human bodies, but with sensitive information of private individuals. It is precisely because of this that governments have enacted laws such as GDPR and the Digital Personal Data Act of 2018. As such, it is our responsibility to employ techniques such as the ones discussed above to handle personal data in an ethical manner.
These techniques are of particular value to data professionals in developing countries such as Somalia. Operating in such environments, where the ownership and production of data remain unclear, presents enormous challenges. Among these is the lack of effective state institutions, which has led to the emergence of a largely unregulated and unrestricted data ecosystem. Anonymization offers technical workarounds.
In this article, we’ve examined a few intriguing techniques that will help the ethical data scientist preserve privacy without losing the statistically insightful gems in the raw data. These techniques are only a few. There are numerous others. But that is a story for another post.