Tony Bain

Building products & teams that leverage data, analytics, AI & automation to do amazing things

Data Scientists – Manage your own Ethical Standards

June 27, 2017 | Tony Bain


It seems the privacy nuts might be right. You know, the type of people who stand on streets warning passers-by that the government is watching them. Or the people who wear tinfoil on their heads because they are worried about some corporation reading their thoughts. The reality, it turns out, may not be all that different.

Earlier this week various news sources reported that the personal details over nearly 200 million US voters was exposed. While much of this data was already public information, in voter registration databases, reportedly the data had also been manipulated to try and understand individuals at a personal level. That is, predicting the answers each individual voter may respond to various questions that were important to understand by the data holder. Presumably this was done so they could target campaigns towards the relevant voters. 

Also, this week another story surfaced about people losing their anonymity when online. If was reported that some people who had been looking up specific medical conditions on the web later received a letter in the mail from a company they had never heard of offering them participation in medical trials that relate to those conditions. The veil of privacy was likely torn off for these people, and conversely the power of data matching techniques became publicly apparent. 

It is clear our digital footprints are becoming so extensive, and are fragmented around the world in various databases and logs. And the organisations that hold this data are realising its value, either to themselves or to others, and as such may be willing to leverage or share this data. And as a result, the power and level of understanding that can be gained though combining multiple data sets is beginning to be demonstrated. By combing and matching data at an individual level we are able to much more fidelity at an granular level the before in generalised aggregate data sets.

By bringing data sets together using clever data matching tools it is becoming possible to piece together a tapestry of information related to individuals where specific demographics are known, or relate to a proxy of an individual where specific demographics such as name may not be known but others (location, age, sex, race etc) are reasonably predicted and then used to answer questions at based around the individual. This de-anonymising of data has been demonstrated in various forms, including examples where anonymous medical data was reverse engineered to identify individuals with a level of accuracy.

Many people likely would be surprised by the volume of data they leave online, but perhaps many others would assume they leave a digital trail with everything they do on the web. But even they might be surprised that this is also occurring in the offline world too. For example, when you go out to a store or walk through malls etc, there is a chance you are creating a digital trail behind you. Your mobile phone is likely to be “pinging” to find WiFi networks nearby even if not connecting to them. This ping includes a unique number for your phone (your MAC address). This unique identifier has been used to trace individual’s movements through a store, how long you spent in a particular department, what other stores you may have gone into and perhaps where you went to lunch. Similarly, in London advertisers reportedly used wifi enabled garbage cans for tracking individual’s movements around the city (although these were ‘scrapped’ after being made public). 

Given the effectiveness of data matching, when it would seem a relatively small hurdle to climb should someone really want to associate this type of data back to individuals so they can target them directly. In Booz Allen’s book “The Mathematical Corporation” the author discusses how some organisations working in this particular field worked hard to establish ethical standards and boundaries to ensure their organisations were seen as credible and trustworthy.  But also noting that not all organisations have necessarily applied such boundaries.

Of course, privacy means different things to different people. I am sure the next generation will care less than the current generation about privacy as they have grown up being told everything they put online is public. Maybe privacy won’t exist as a concept and everyone will assume that all data is public including all personal and medical information. While this is a strong possibility for the future, right now many would find it disconcerting to be individually identified from their anonymous digital footprints. And this is not because they are doing something wrong or have something to hide. But just because it seems creepy and weird and feels like it puts us at a risk of being a potential victim of fraud or other wrongdoing.

The modern day leveraging of data is the result of activities primarily undertaken by of data scientists. We are the ones turning data into “actionable insights”. And while we are largely focused on the technical and computational challenges in solving data problems, we also need to acknowledge that every single data project has a set of ethical considerations, without exception. And while ethics is taught as an important topic in many disciplines, from medical through business and financial it is often overlooked in technology. This is a gap that requires focus given the level of widespread impact data projects can have on individuals as an outcome.

"every single data project has a set of ethical considerations, without exception"

People have widely different personal views on ethics. Some would consider it poor ethics to ingest personal data to try and emotively influence someone into “buying another widget”. Others would consider this just part of a free market where such marketing is helping business to succeed, creating jobs and therefore benefiting everyone. Some would see de-anonymizing medical details as troublesome for privacy reasons, others would see it is a necessary step in bringing relevant data together to truly understand illness and disease that may help save millions of lives.

I am not going to preach my view of what data ethics you should apply. My point is however, data scientists should take time to decide where you sit on the ethical spectrum and what your boundaries are. Sometimes it is all too easy to get caught up in a technical challenge, or trying to impress your peers or organisations that we consider the ethical issues in entirety. We should always maintain our own ethical standards so later in our careers we will be able to look back on our work and feel like we have always “done good not harm”. 

Some ethical factors you may consider include:

-         Are people aware that their data is being collected? Are you authorised to use it?

-         Will individuals be surprised or concerned that their data is being used in your project. How would you feel if your data or was used in this way?

-         Are you taking proper measures to secure all data, both at rest and in transit? What would be the impact to individuals of exposure?.

-         What is the real-world impact of the project. Does this only positively impact individuals or is there potential negative impact on individuals? If your project uses prediction, what is the real world negative impact on individuals if your prediction is wrong? How can true-negatives or false-positives be identified and managed?

-         Is the data being used to influence individuals into making decisions? If so is this visible influence so the individual is aware of it or is influence being applied in a subtle or emotive manner without the individual’s awareness?

-         Also if targeting individuals, are those being targeted a group who are likely to be in an impaired state allowing your influence to be more effective than would normally be expected?

-         If buying data, has this data been sourced ethically and legally? Is this data trusted and accurate?

Of course, there may be more than just moral factors in play. Most countries have extensive legal requirements relating to data, privacy and disclosure that must be considered. Again, as a data scientist we should be aware of the relevant laws within the domain we are operating in. While your organisation will likely defer to law specialists for expert advice, for your own personal sense of professionalism you should understand at a high-level the key legal requirements so you don’t breach such requirements.

Data science is an interesting field due to the high level of variability of knowledge and skills to deliver effectively. Ethical, moral and legal understanding of issues relating to the use of data are part of these key skills and should be considered up there in the same vein as the ability to code in R or design a regression model.

The opinions and positions expressed are my own and do not necessarily reflect those of my employer.