Knowledge and information were assets well before computers and data science, but with the adoption of digital tools we have become more effective at extracting value from them.
At the same time, the increased power generated by data requires that we keep in mind the potential for harm inherent in any rapidly expanding frontier.
As a Millennial who spent my teenage years watching social media gain ground and power, I share with my generation the anxiety and apprehension about our information being exposed, but also the FOMO and the need to be online and to use the next tool or communication channel.
Being connected and sharing data is not necessarily a choice for many of us; arguably, with remote working becoming widespread in 2020, it is not a choice at all.
My passion for technology and my need to bring a different approach to innovation and policy making brought me to law school, and now, almost 10 years later, I am dedicating myself to growing my expertise as a data scientist.
As a professional who has always focused on ethics and privacy, I inevitably find it crucial to adopt a conscious approach to the data I am processing and to how I look at them.
Although the importance of an ethical approach to data is widely recognized in the industry, I was still looking for practical ways of embedding my ethics into my data science projects.
So I decided to write this blog post to show fellow data scientists how to use Deon, a command-line tool that adds an ethics checklist to a data science project, and to discuss some of the real-life issues we could tackle with such a practice.
To install Deon, you need Python 3 on your machine. In a terminal, run the following command:
$ pip install deon
You can then either add a Markdown file to your project folder using the command
$ deon -o ETHICS.md
or add a checklist directly in the last cell of your Jupyter Notebook using
$ deon -o my-notebook.ipynb # append cells to existing output file
And that’s it! It’s a pretty straightforward installation that, in just a minute, will make your work a little more conscious.
So far, following the previous steps, we have embedded a Deon checklist into our project. Now let’s look at what is actually in that checklist.
ETHICS.md is structured in 5 parts: Collection, Storage, Analysis, Modeling, and Deployment. Each section lists general topics and questions that the data scientist should consider while managing the project.
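If you want to double-check that the checklist really landed in your notebook, a small script can look for it. This is only a minimal sketch: it assumes the appended cell contains the title "Data Science Ethics Checklist" and that the notebook is a standard `.ipynb` JSON file. The function name `notebook_has_checklist` is my own and is not part of Deon.

```python
import json
from pathlib import Path

# Assumption: the cell added by the tool contains this title text.
CHECKLIST_TITLE = "Data Science Ethics Checklist"


def notebook_has_checklist(path, title=CHECKLIST_TITLE):
    """Return True if any cell of the notebook at `path` mentions `title`."""
    notebook = json.loads(Path(path).read_text(encoding="utf-8"))
    for cell in notebook.get("cells", []):
        source = cell.get("source", "")
        # A cell's source is stored either as one string or as a list of lines.
        text = source if isinstance(source, str) else "".join(source)
        if title in text:
            return True
    return False


if __name__ == "__main__":
    notebook_path = Path("my-notebook.ipynb")
    if notebook_path.exists():
        print(notebook_has_checklist(notebook_path))
```

Because the script only reads the file, it is safe to run before committing a notebook.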
Data Science Ethics Checklist
A. Data Collection
- A.1 Informed consent: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
- A.2 Collection bias: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
- A.3 Limit PII exposure: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn’t relevant for analysis?
- A.4 Downstream bias mitigation: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
B. Data Storage
- B.1 Data security: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
- B.2 Right to be forgotten: Do we have a mechanism through which an individual can request their personal information be removed?
- B.3 Data retention plan: Is there a schedule or plan to delete the data after it is no longer needed?
C. Analysis
- C.1 Missing perspectives: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
- C.2 Dataset bias: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
- C.3 Honest representation: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
- C.4 Privacy in analysis: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
- C.5 Auditability: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
D. Modeling
- D.1 Proxy discrimination: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
- D.2 Fairness across groups: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
- D.3 Metric selection: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
- D.4 Explainability: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
- D.5 Communicate bias: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
E. Deployment
- E.1 Redress: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
- E.2 Roll back: Is there a way to turn off or roll back the model in production if necessary?
- E.3 Concept drift: Do we test and monitor for concept drift to ensure the model remains fair over time?
- E.4 Unintended use: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
Data Science Ethics Checklist generated with deon.
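Some of these questions translate quite directly into code. Item D.2, for example, mentions testing for disparate error rates. A minimal sketch of what such a test could look like for a binary classifier is below. The function name `error_rates_by_group` and the toy labels are made up for illustration, and a real project would need a more careful definition of fairness than this.

```python
from collections import defaultdict


def error_rates_by_group(y_true, y_pred, groups):
    """Return {group: {"fpr": ..., "fnr": ...}} for binary labels (0 or 1).

    A rate is None when a group has no negatives (fpr) or no positives (fnr).
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        if truth == 1:
            c["pos"] += 1
            if pred == 0:
                c["fn"] += 1  # missed positive
        else:
            c["neg"] += 1
            if pred == 1:
                c["fp"] += 1  # false alarm
    return {
        group: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for group, c in counts.items()
    }


# Made-up toy data: true labels, model predictions, and a group per record.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = error_rates_by_group(y_true, y_pred, groups)
for group in sorted(rates):
    print(group, rates[group])
```

A large gap between groups in false positive or false negative rates is a signal to investigate further, not a verdict on its own.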
Although I see the value of a standardized checklist that could become a standard across the industry, the current checklist takes into account many different aspects that more junior data scientists are not necessarily able to manage autonomously.
So I decided to create my own checklist and share it in this blog, for anyone who is approaching data science and wants a guided, easy-to-implement list of steps towards an ethical data science project.
Below you can find my checklist in HTML:
- A. Awareness: If there are human subjects, I have taken reasonable steps, in good faith, to ensure they are aware of the data treatment and have opted in to it.
- B. Bias: Data have been checked for bias individually (as features) and collectively (as part of the dataset).
- C. Concept Drift: The model will be monitored to ensure that it remains fair over time. Contacts for complaints are available in the repository.
- D. Data Security: This project adopts measures to ensure the security of the data treated.
- E. Explainability: The decisions made by the model are explained and/or interpreted in the documentation.
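If you would like to drop a checklist like this into a project page or a notebook, it can be generated as HTML with a few lines of Python. What follows is just a sketch under my own assumptions: the `render_checklist` helper, the `item-A` style ids, the default title, and the abridged item wording are illustrative and not taken from any tool.

```python
from html import escape

# The five items of the checklist above, with abridged wording.
CHECKLIST = [
    ("A", "Awareness", "Human subjects are aware of the data treatment and have opted in."),
    ("B", "Bias", "Data have been checked for bias as features and as a dataset."),
    ("C", "Concept Drift", "The model will be monitored to ensure it remains fair over time."),
    ("D", "Data Security", "Measures are in place to secure the data treated."),
    ("E", "Explainability", "Model decisions are explained in the documentation."),
]


def render_checklist(items, title="My Data Science Ethics Checklist"):
    """Render (letter, name, text) tuples as an HTML list of checkboxes."""
    rows = []
    for letter, name, text in items:
        item_id = f"item-{escape(letter)}"
        rows.append(
            f'  <li><input type="checkbox" id="{item_id}"> '
            f'<label for="{item_id}"><strong>{escape(letter)}. {escape(name)}:</strong> '
            f"{escape(text)}</label></li>"
        )
    return f"<h2>{escape(title)}</h2>\n<ul>\n" + "\n".join(rows) + "\n</ul>"


checklist_html = render_checklist(CHECKLIST)
print(checklist_html)
```

Escaping every field keeps the output valid even if an item contains characters such as `<` or `&`.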
Adopting an ethics checklist is an important practice that could help the industry maintain fairness and be ready to identify and fix problems in data treatment.
Creating your own checklist and adopting a generic one are both good practices, especially for bigger projects in industries with specific ethical issues that require special care and attention.
To write my checklist, I was inspired by the free ebook “Ethics and Data Science” by Hilary Mason, DJ Patil, and Mike Loukides (O’Reilly, 2018).
Also, for those who are still skeptical about the impact of a lack of ethics in data science, check out this collection of real-life risks for each provision in the ETHICS.md checklist by Deon.
If you have comments or observations about my checklist, reach out!
You can find me at firstname.lastname@example.org