Data Scientists – Manage your own Ethical Standards


It seems the privacy nuts might be right. You know, the type of people who stand on streets warning passers-by that the government is watching them. Or the people who wear tinfoil on their heads because they are worried about some corporation reading their thoughts. The reality, it turns out, may not be all that different.

Earlier this week various news sources reported that the personal details of nearly 200 million US voters were exposed. While much of this data was already public information in voter registration databases, the data had reportedly also been enriched to try to understand individuals at a personal level. That is, predicting how each individual voter might respond to various questions the data holder considered important, presumably so campaigns could be targeted towards the relevant voters.

Also this week, another story surfaced about people losing their anonymity online. It was reported that some people who had been looking up specific medical conditions on the web later received a letter in the mail from a company they had never heard of, offering them participation in medical trials relating to those conditions. The veil of privacy was likely torn off for these people, and at the same time the power of data matching techniques became publicly apparent.

It is clear our digital footprints are becoming extensive, fragmented around the world in various databases and logs. The organisations that hold this data are realising its value, either to themselves or to others, and as such may be willing to leverage or share it. As a result, the power and level of understanding that can be gained through combining multiple data sets is beginning to be demonstrated. By combining and matching data at an individual level, we can achieve far more fidelity at a granular level than was ever possible with generalised aggregate data sets.

By bringing data sets together using clever data matching tools, it is becoming possible to piece together a tapestry of information about individuals whose specific demographics are known. Where specifics such as name are not known, other attributes (location, age, sex, race etc.) can be reasonably predicted and used as a proxy for the individual, allowing questions to be answered at the individual level. This de-anonymising of data has been demonstrated in various forms, including examples where anonymous medical data was reverse engineered to identify individuals with a degree of accuracy.
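To make the mechanics concrete, here is a minimal sketch of de-anonymisation by matching quasi-identifiers across data sets. The data sets, field names and values are all invented for this illustration; real attacks use larger data and fuzzier matching, but the principle is the same: a record with no name can be re-identified when its quasi-identifiers match a single record in a public data set.

```python
# Hypothetical illustration only: both data sets and all fields are invented.

# An "anonymous" medical data set: no names, but quasi-identifiers remain.
medical = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "condition": "diabetes"},
    {"zip": "90210", "birth_year": 1980, "sex": "M", "condition": "asthma"},
]

# A public registration-style data set with names and the same quasi-identifiers.
public = [
    {"name": "A. Smith", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "B. Jones", "zip": "90210", "birth_year": 1980, "sex": "M"},
    {"name": "C. Brown", "zip": "02138", "birth_year": 1990, "sex": "M"},
]

def link(medical, public):
    """Re-identify medical records whose quasi-identifiers match one person."""
    key = lambda r: (r["zip"], r["birth_year"], r["sex"])
    index = {}
    for p in public:
        index.setdefault(key(p), []).append(p["name"])
    matches = []
    for m in medical:
        candidates = index.get(key(m), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0], m["condition"]))
    return matches

print(link(medical, public))
# [('A. Smith', 'diabetes'), ('B. Jones', 'asthma')]
```

Even this toy version shows why "removing the names" is not anonymisation: the combination of a few ordinary attributes is often unique to one person.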

Many people would likely be surprised by the volume of data they leave online, while many others would assume they leave a digital trail with everything they do on the web. But even they might be surprised that this is occurring in the offline world too. For example, when you go out to a store or walk through a mall, there is a chance you are creating a digital trail behind you. Your mobile phone is likely "pinging" to find nearby WiFi networks even when not connecting to them. This ping includes a unique number for your phone (its MAC address). This unique identifier has been used to trace an individual's movements through a store: how long they spent in a particular department, what other stores they went into and perhaps where they went to lunch. Similarly, in London advertisers reportedly used WiFi-enabled garbage cans to track individuals' movements around the city (although these were 'scrapped' after being made public). 

Given the effectiveness of data matching, it would seem a relatively small hurdle to associate this type of data back to individuals should someone really want to target them directly. Booz Allen's book "The Mathematical Corporation" discusses how some organisations working in this field worked hard to establish ethical standards and boundaries to ensure they were seen as credible and trustworthy, while also noting that not all organisations have necessarily applied such boundaries.

Of course, privacy means different things to different people. I am sure the next generation will care less about privacy than the current one, having grown up being told that everything they put online is public. Maybe privacy won't exist as a concept, and everyone will assume all data is public, including personal and medical information. While this is a strong possibility for the future, right now many would find it disconcerting to be individually identified from their anonymous digital footprints. And this is not because they are doing something wrong or have something to hide, but because it seems creepy and weird, and feels like it puts us at risk of becoming victims of fraud or other wrongdoing.

The modern-day leveraging of data is the result of activities primarily undertaken by data scientists. We are the ones turning data into "actionable insights". And while we are largely focused on the technical and computational challenges of solving data problems, we also need to acknowledge that every single data project has a set of ethical considerations, without exception. While ethics is taught as an important topic in many disciplines, from medicine through business and finance, it is often overlooked in technology. This is a gap that requires focus, given the widespread impact data projects can have on individuals.

"every single data project has a set of ethical considerations, without exception"

People have widely different personal views on ethics. Some would consider it poor ethics to ingest personal data to try to emotively influence someone into "buying another widget". Others would consider this just part of a free market, where such marketing helps businesses succeed, creating jobs and therefore benefiting everyone. Some would see de-anonymising medical details as troublesome for privacy reasons; others would see it as a necessary step in bringing relevant data together to truly understand illness and disease, which may help save millions of lives.

I am not going to preach my view of what data ethics you should apply. My point is, however, that data scientists should take the time to decide where they sit on the ethical spectrum and what their boundaries are. Sometimes it is all too easy, caught up in a technical challenge or trying to impress our peers or organisations, to fail to consider the ethical issues in their entirety. We should always maintain our own ethical standards so that later in our careers we can look back on our work and feel we have always "done good, not harm". 

Some ethical factors you may consider include:

- Are people aware that their data is being collected? Are you authorised to use it?

- Will individuals be surprised or concerned that their data is being used in your project? How would you feel if your data was used in this way?

- Are you taking proper measures to secure all data, both at rest and in transit? What would be the impact to individuals of exposure?

- What is the real-world impact of the project? Does it only positively impact individuals, or is there potential for negative impact? If your project uses prediction, what is the real-world negative impact on individuals if your prediction is wrong? How can false negatives or false positives be identified and managed?

- Is the data being used to influence individuals into making decisions? If so, is this influence visible, so the individual is aware of it, or is it applied in a subtle or emotive manner without the individual's awareness?

- If targeting individuals, are those being targeted a group likely to be in an impaired state, allowing your influence to be more effective than would normally be expected?

- If buying data, has it been sourced ethically and legally? Is it trusted and accurate?

Of course, there may be more than just moral factors in play. Most countries have extensive legal requirements relating to data, privacy and disclosure that must be considered. Again, as data scientists we should be aware of the relevant laws within the domain we are operating in. While your organisation will likely defer to legal specialists for expert advice, for your own sense of professionalism you should understand the key legal requirements at a high level so you don't breach them.

Data science is an interesting field due to the broad range of knowledge and skills required to deliver effectively. An ethical, moral and legal understanding of the issues surrounding the use of data is part of that key skill set, and should be considered in the same vein as the ability to code in R or design a regression model.

The opinions and positions expressed are my own and do not necessarily reflect those of my employer.

The Risks of Unfettered AI


I have written before about the potential risks of machine learning when implemented in areas that impact on people’s daily lives. Much of this has been hypothetical, thinking through the possibilities of what “could” happen in the future if we go down various paths. This story is a little different as it is something that is actually happening at the moment.

Like many people I use various cloud software for different purposes. Most of these are paid for on a monthly subscription via credit card. One particular piece of software I use is from a major vendor that has more than 75,000 customers. I have used this software for a few years and have paid monthly via a credit card I have had from my bank (names not important).

Four months ago, I got a text message at 2am from my bank saying that they had flagged a transaction as suspicious and that my card was temporarily blocked. It just so happened that I had a 3am meeting that day, so not long after getting the text I called the bank. It turned out to be the payment for the software mentioned above; the bank's fraud detection system had flagged it as unusual for some reason. No issue: the person on the phone quickly approved the transaction and re-enabled the card, so no harm done, and I could use my card as normal again.

However, a month later I again received the SMS from the bank. Again I called, again they explained it was this transaction, and again they resolved it. I explained that this had also happened the month before, and I was given assurances that it was now resolved. Business as usual again.

Now fast forward to the same time last month. This time there was no notification from the bank. Instead, I started getting messages from other providers saying my payments to them had been declined. So I called the bank. It turned out, you guessed it, that the same transaction had been flagged again, causing my card to be blocked and other payments to fail. This time I made a bit of a fuss, and they provided more assurance that they had updated the notes in the system to say this was a valid transaction.

I am sure by now you can predict where this is heading: of course, it happened again this month. I spoke to the credit card security department, and while my card was again re-enabled, I asked about the likelihood of this transaction causing my card to be blocked again next month. It appears that while staff can add "notes" to the system, they have no way to override the fraud detection system to ensure a valid transaction is not repeatedly flagged incorrectly.

Improved fraud detection is one of the commonly cited areas where machine learning is bringing positive gains. These algorithms "learn" their own patterns from historical data, finding relationships much subtler than was possible before with manually coded rules. This tends to provide a higher level of accuracy overall in detecting potential fraud. I genuinely applaud banks' efforts to continuously improve in this area; having had a credit card number stolen years ago, I am well aware of the extent of the problem they are trying to solve.

However, machine learning can be complex to debug or influence for individual error cases. Machine learning algorithms extract global rules from millions or billions of rows of history, and these learnings are used to make future predictions. Over time, misclassifications may feed back into the learning process as a form of continuous improvement, but this may take some time to occur, and unless the error rate is highly significant it may not actually change the prediction outcomes.

While you can achieve high levels of accuracy overall, there will always be residual false positives, where valid transactions are flagged incorrectly. So, what happens when one customer with one transaction is being classified incorrectly? Implementing machine learning systems with real-world influence, without a "sanity" override, can lead to undesired consequences. We have to remember that errors will still occur no matter our accuracy, and this needs to be managed. A secondary level of assessment using more traditional, user-definable rules may be required to handle these errors and ensure systems can respond appropriately and quickly to individual cases of misclassification.
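The secondary rule layer described above can be sketched very simply. Everything here is an assumption for illustration: the stand-in model, the score threshold, the rule format and the merchant/card names are all invented, but the shape is the point: a human-maintained allowlist is consulted after the model flags a transaction, so a confirmed-valid payment is never blocked twice.

```python
# Sketch of a human-overridable rule layer on top of an ML fraud score.
# The model, threshold and rule format below are invented for illustration.

def ml_fraud_score(txn):
    # Stand-in for the bank's real model: it keeps flagging this recurring
    # subscription payment as suspicious.
    return 0.92 if txn["merchant"] == "CloudSoftwareCo" else 0.05

# User-definable override rules, maintained by fraud staff, not by the model:
# (card, merchant) pairs a human has confirmed as legitimate.
ALLOW_RULES = [
    {"card": "4111-xxxx", "merchant": "CloudSoftwareCo"},
]

def assess(txn, threshold=0.8):
    if ml_fraud_score(txn) < threshold:
        return "approve"  # model sees nothing unusual
    for rule in ALLOW_RULES:
        if txn["card"] == rule["card"] and txn["merchant"] == rule["merchant"]:
            return "approve"  # a human override beats the model's flag
    return "block"

txn = {"card": "4111-xxxx", "merchant": "CloudSoftwareCo", "amount": 49.0}
print(assess(txn))  # prints "approve": the override stops the repeat false positive
```

Had my bank's "notes" field fed into something like `ALLOW_RULES`, one phone call would have fixed the problem permanently instead of recurring every month.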

For now, however, I am caught in the error percentages of a machine learning process. I have no way to make this valid payment without the associated card being blocked on each occurrence. This means I either have to go through this process every month, or move this payment to a card from another bank. 

Given the extent of credit card fraud, perhaps the misclassification of a small percentage of valid transactions is a tolerable impact; globally, credit card fraud is a $16b problem that needs to be solved. Of course, I would be unlikely to move banks over an issue with a single transaction. But if more transactions start to fail because of this limitation, I wouldn't have many other options, as the system starts to degrade the usability of the very service it was designed to protect.

This is of course just one example. The point is that wherever we use machine learning to make predictions, we still need to acknowledge the prediction error rates and provide appropriate measures to limit their ongoing impact.


How Data Science may change Hacking


Malicious hacking today largely consists of exploiting weaknesses in an application's stack to gain access to private data that shouldn't be public, or to corrupt or interfere with the operations of a given application.  Sometimes this is done to expose software weaknesses; other times hackers do it to generate income by trading valuable private information.

Software vendors are now more focused on baking security concepts into their code, rather than treating security as an operational afterthought, although breaches still happen.  In fact, data science is being used positively in intrusion, virus and malware detection, moving us from a reactive response to a more proactive and predictive approach to detecting breaches.

However, as we move into an era where aspects of human decision making are being replaced with data science combined with automation, I think it is of immense importance that we keep the security aspects front of mind from the get-go.  Otherwise we risk again falling into the trap of treating security as an afterthought.  And to do this we really need to consider which aspects of data science open themselves up to security risk.

One key area that immediately springs to mind is "gaming the system", specifically in relation to machine learning.  For example, banks may automate the approval of small loans, using machine learning prediction to determine whether an applicant can service the loan and presents a suitable risk.  The processing and approval of the loan may be performed in real time without human involvement, and funds may be immediately available to the applicant on approval. 

But what might happen if malicious hackers became aware of the models being used to predict risk or serviceability?  What if they could reverse engineer them, and also learn which internal and third-party data sources were being used to feed these models or validate identity?  In this scenario, malicious hackers might, for example, create false identities and exploit weaknesses in upstream data providers to generate fake data that results in positive loan approvals.  Or they might undertake small transactions in certain ways, exploiting model weaknesses to trick the ML into believing an applicant is less of a risk than they actually are.  Real-time processing means this could cause business impact at catastrophic scale in relatively short time frames.

Now, the above scenario is not necessarily all that likely, with banking in particular having a long history of automated fraud detection and an established security-first approach.  But with the commoditisation of machine learning, a rapidly increasing number of businesses are beginning to use this technology to make key decisions.  When doing so, it becomes imperative that we consider not only the positive aspects, but also what could go wrong and the impact misuse or manipulation could cause. 

For example, if the worst-case scenario is that a clever user raising a customer service ticket has all their requests marked as "urgent" because they carefully embed keywords that cause the sentiment analysis to believe they are an exiting customer, you might decide that while this is a weakness, it does not require mitigation.  However, if the potential risk is instead incorrectly granting a new customer a $100k credit limit, you may want to take the downside more seriously.

Potential mitigation techniques may include:

  • Use multiple sources of third-party data.  Avoid becoming dependent on single sources of validation that you don't control.
  • Use multiple models to build layers of validation.  Don't let a single model become a single point of failure; use other models to cross-reference and flag large variances between predictions.
  • Controlled randomness can be a beautiful thing; don't let every aspect of your process be prescribed.
  • Set bounds for what may be confirmed by ML and what requires human intervention.  Bounds may be value-based, but should also take the expected rate of requests into consideration (how many requests per hour/day etc.).
  • Test the "what if" scenarios, and test the robustness and gameability of your models in the same way you test for accuracy.
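Two of the mitigations above, cross-referencing multiple models and bounding what ML may confirm unattended, can be combined in a small gatekeeper. Everything in this sketch is a placeholder assumption: the two stand-in models, the dollar limit and the variance threshold are invented, but they show the shape of a decision path that escalates to a human when the value is high or the models disagree.

```python
# Sketch of layered validation: value bounds plus cross-model variance checks.
# Models, limits and thresholds below are invented placeholders.

def model_a(applicant):
    # Stand-in for, e.g., a gradient-boosted risk score in [0, 1].
    return 0.30

def model_b(applicant):
    # Stand-in for an independent model trained on different features.
    return 0.75

AUTO_APPROVE_LIMIT = 10_000   # value bound: above this, always a human
MAX_MODEL_VARIANCE = 0.20     # large disagreement between models -> a human

def decide(applicant, amount):
    a, b = model_a(applicant), model_b(applicant)
    if amount > AUTO_APPROVE_LIMIT:
        return "human_review"   # ML is not trusted beyond the value bound
    if abs(a - b) > MAX_MODEL_VARIANCE:
        return "human_review"   # models disagree: possible gaming or drift
    return "approve" if (a + b) / 2 < 0.5 else "decline"

print(decide({}, 5_000))   # models disagree by 0.45, so this escalates
```

An attacker who reverse engineers one model now has to fool both, and even a perfectly gamed pair of scores can never unlock more than the bound without a human in the loop.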

The above are just some initial thoughts and are not exhaustive. I think we are at the start of the ML revolution, and it is the right time to get serious about understanding and mitigating the risks surrounding the potential manipulation of ML when combined with business process automation.