Malicious hacking today largely consists of exploiting weaknesses in an application stack to gain access to private data, or to corrupt or interfere with the operation of a given application. Sometimes this is done to expose software weaknesses; other times hackers do it to generate income by trading valuable private information.
Software vendors are now more focused on baking security concepts into their code, rather than treating security as an operational afterthought, although breaches still happen. In fact, data science is being used in a positive way in intrusion, virus and malware detection, moving us from a reactive response to a more proactive and predictive approach to detecting breaches.
However, as we move into an era where aspects of human decision making are being replaced by data science combined with automation, I think it is of immense importance that we have the security aspects front of mind from the get-go. Otherwise we risk again falling into the trap of treating security as an afterthought. To do this, we really need to consider which aspects of data science open themselves up to security risk.
One key area that immediately springs to mind is “gaming the system”, specifically in relation to machine learning. For example, banks may automate the approval of small loans and use machine learning prediction to determine whether an applicant has the ability to service the loan and presents a suitable risk. The processing and approval of the loan may be performed in real time without human involvement, and funds may be immediately available to the applicant on approval.
However, what might happen if malicious hackers became aware of the models being used to predict risk or serviceability? What if they could reverse engineer them and also learn which internal and third-party data sources were being used to feed these models or validate identity? In this scenario malicious hackers may, for example, create false identities and exploit weaknesses in upstream data providers to generate fake data that results in positive loan approvals. Or they may undertake small transactions in certain ways, exploiting model weaknesses to trick the ML into believing the applicant is less of a risk than they actually are. Because processing happens in real time, this could cause catastrophic business impact in a relatively short time frame.
Now, the above scenario is not necessarily all that likely, with banking in particular having a long history of automated fraud detection and an established security-first approach. But as we move forward with the commoditisation of machine learning, a rapidly increasing number of businesses are beginning to use this technology to make key decisions. When doing so, it becomes imperative that we consider not only the positive aspects, but also what could go wrong and the impact that misuse or manipulation could cause.
For example, if the worst-case scenario is that a clever user raising a customer service ticket has all their requests marked as “urgent” because they carefully embed keywords that cause the sentiment analysis to believe they are an exiting customer, you might decide that while this is a weakness, it may not require mitigation. However, if the potential risk is instead incorrectly granting a new customer a $100k credit limit, you may want to take the downside risk more seriously.
Potential mitigation techniques may include:
- Use multiple sources of third-party data. Avoid becoming dependent on single sources of validation that you don’t necessarily control.
- Use multiple models to build layers of validation. Don’t let a single model become a single point of failure; use other models to cross-reference and flag large variances between predictions.
- Controlled randomness can be a beautiful thing. Don’t let all aspects of your process be prescribed.
- Potentially set bounds for what is allowed to be confirmed by ML and what requires human intervention. Bounds may be value based, but should also take the expected rate of requests into consideration (how many requests per hour/day etc.).
- Test the “what if” scenarios, and test the robustness and gamability of your models in the same way that you test for accuracy.
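Several of the mitigations listed above (cross-referencing models, value and rate bounds, and controlled randomness) can be combined into a single gate that sits between the ML prediction and the automated action. The following is only a minimal sketch to illustrate the idea, not a production design. The class name `ApprovalGate`, its fields, and every threshold value are hypothetical, and it assumes each model returns a risk score between 0 and 1.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Routes an ML decision to auto-approval or to a human reviewer.

    All names and default thresholds here are illustrative assumptions.
    """
    max_auto_value: float = 5_000.0    # value bound for auto-approval
    max_disagreement: float = 0.15     # allowed spread between model scores
    max_requests_per_hour: int = 100   # expected upper bound on request rate
    audit_rate: float = 0.05           # share of requests randomly reviewed
    rng: random.Random = field(default_factory=random.Random)
    _recent: deque = field(default_factory=deque, init=False, repr=False)

    def route(self, value, scores, now):
        """Return ("auto", reason) or ("human", reason).

        value  -- amount being requested
        scores -- risk scores in [0, 1] from independent models
        now    -- request time in seconds
        """
        # Rate bound: count requests in the last hour, including this one.
        self._recent.append(now)
        while now - self._recent[0] > 3600:
            self._recent.popleft()
        if len(self._recent) > self.max_requests_per_hour:
            return "human", "request rate above expected bound"

        # Value bound: large amounts always need a person.
        if value > self.max_auto_value:
            return "human", "value above auto-approval bound"

        # Layered validation: flag large variance between model predictions.
        if max(scores) - min(scores) > self.max_disagreement:
            return "human", "models disagree"

        # Controlled randomness: audit a random sample of passing requests.
        if self.rng.random() < self.audit_rate:
            return "human", "random audit"

        return "auto", "within bounds"
```

A caller might create one gate per decision type, for example `ApprovalGate(max_auto_value=2_000)`, and call `gate.route(amount, [score_a, score_b], time.time())` for each request. Checking the request rate first means a sudden burst of otherwise unremarkable requests is still escalated, which is the pattern a real-time attack would be likely to produce.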
These are just some initial thoughts and are not exhaustive. I think we are at the start of the ML revolution, and it is the right time to get serious about understanding and mitigating the risk surrounding the potential manipulation of ML when combined with business process automation.
Author: Tony Bain
Tony has 20 years' experience building software and services businesses using advanced analytics, collaboratively using computers to do what they do best and empowering people to do what they do best.
He is the co-founder of RockSolid SQL (now part of DXC Technology) and has grown the business to over 130 customers globally, and is also an adviser for LiquidityCube, one of the most exciting emerging fintech startups right now. Tony has written numerous books, articles and posts on data driven business and regularly presents at data focused conferences.