How Data Science may change Hacking


Malicious hacking today largely consists of exploiting weaknesses in an application's stack, either to gain access to private data or to corrupt or interfere with the operation of the application.  Sometimes this is done to expose software weaknesses; other times hackers do it to generate income by trading valuable private information.

Software vendors are now more focused on baking security concepts into their code, rather than treating security as an operational afterthought, although breaches still happen.  In fact, data science is already being used in a positive way in intrusion, virus and malware detection, to move us from a reactive response to a more proactive and predictive approach to detecting breaches.

However, as we move forward into an era where aspects of human decision making are being replaced with data science combined with automation, I think it is of immense importance that we keep the security aspects of this front of mind from the get-go.  Otherwise, we are at risk of again falling into the trap of considering security as an afterthought.  To avoid that, we really need to consider which aspects of data science are open to security risk.

One key area that immediately springs to mind is “gaming the system”, specifically in relation to machine learning.  For example, banks may automate the approval of small loans and use machine learning predictions to determine whether an applicant has the ability to service the loan and presents a suitable risk.  The processing and approval of the loan may be performed in real time without human involvement, and funds may be immediately available to the applicant on approval.

However, what might happen if malicious hackers became aware of the models being used to predict risk or serviceability?  What if they could reverse engineer them, and also learn which internal and third party data sources were being used to feed these models or validate identity?  In this scenario malicious hackers may, for example, create false identities and exploit weaknesses in upstream data providers to generate fake data that results in positive loan approvals.  Or they may undertake small transactions in certain ways, exploiting model weaknesses to trick the ML into believing the applicant is less of a risk than they actually are.  Because processing happens in real time, the damage to the business could reach catastrophic scale in a relatively short time frame.

Now the above scenario is not necessarily all that likely, with banking in particular having a long history of automated fraud detection and an established security first approach.  But as we move forward with the commoditisation of machine learning, a rapidly increasing number of businesses are beginning to use this technology to make key decisions.  When doing so, it becomes imperative that we consider not only the positive aspects, but also what could go wrong and the impact that misuse or manipulation could have.

For example, if the worst case scenario is that a clever user raising a customer service ticket has all their requests marked as “urgent”, because they carefully embed keywords that cause the sentiment analysis to believe they are an exiting customer, you might decide that while this is a weakness it does not require mitigation.  However, if the potential risk is instead incorrectly granting a new customer a $100k credit limit, you may want to take the downside risk more seriously.

Potential mitigation techniques may include:

  • Use multiple sources of third party data.  Avoid becoming dependent on single sources of validation that you don’t necessarily control.
  • Use multiple models to build layers of validation.  Don’t let a single model become a single point of failure; use other models to cross reference and flag large variances between predictions.
  • Controlled randomness can be a beautiful thing.  Don’t let all aspects of your process be prescribed.
  • Potentially set bounds for what is allowed to be confirmed by ML and what requires human intervention.  Bounds may be value based, but should also take the expected rate of requests into consideration (how many requests per hour/day etc.).
  • Test the “what if” scenarios, and test the robustness and gameability of your models in the same way that you test for accuracy.
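
To make a few of these ideas concrete, here is a minimal Python sketch of a decision gate that sits between the models and the automated approval.  It is only an illustration under stated assumptions: the class name ApprovalGate, the thresholds and the three outcome labels are all hypothetical, and the scores are assumed to be approval probabilities produced by independently built models.  It combines a value bound, a cross check between models, a cap on approvals per hour and a small random sample sent to a human.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Hypothetical gate deciding whether an ML-scored request may be auto-approved."""
    max_amount: float = 5_000.0     # assumed value bound for full automation
    max_disagreement: float = 0.15  # assumed largest tolerated gap between model scores
    min_score: float = 0.8          # assumed lowest score that counts as an approval
    max_per_hour: int = 50          # assumed ceiling on automated approvals per hour
    audit_rate: float = 0.05        # share of would-be approvals sent to a human at random
    rng: random.Random = field(default_factory=random.Random)
    _recent: deque = field(default_factory=deque)  # timestamps of recent auto-approvals

    def decide(self, amount, scores, now):
        """Return 'auto_approve', 'decline' or 'human_review'.

        amount: requested value.
        scores: approval probabilities (0..1), one per independent model.
        now:    current time in seconds.
        """
        # Forget auto-approvals that are more than an hour old.
        while self._recent and now - self._recent[0] >= 3600:
            self._recent.popleft()

        # Value bound: large requests always go to a person.
        if amount > self.max_amount:
            return "human_review"
        # Cross check: models that disagree strongly are a warning sign.
        if max(scores) - min(scores) > self.max_disagreement:
            return "human_review"
        # The models agree, but the request does not score well enough.
        if min(scores) < self.min_score:
            return "decline"
        # Rate bound: an unusual burst of approvals is escalated.
        if len(self._recent) >= self.max_per_hour:
            return "human_review"
        # Controlled randomness: audit a random sample of approvals.
        if self.rng.random() < self.audit_rate:
            return "human_review"

        self._recent.append(now)
        return "auto_approve"


if __name__ == "__main__":
    gate = ApprovalGate()
    print(gate.decide(amount=1_000, scores=[0.91, 0.88], now=0))
    print(gate.decide(amount=100_000, scores=[0.97, 0.95], now=1))
```

The order of the checks is a design choice rather than a rule.  In this sketch the cheap, hard bounds run first, and the random audit runs last so that it only samples requests that would otherwise have been approved automatically.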

The above are just some initial thoughts and are not exhaustive.  I think we are at the start of the ML revolution, and it is the right time to get serious about understanding and mitigating the risks surrounding the potential manipulation of ML when combined with business process automation.

What is the biggest challenge for Big Data?

I often think about the challenges that organizations face with “Big Data”.  While Big Data is a generic and overused term, what I am really referring to is an organization's ability to disseminate, understand and ultimately benefit from increasing volumes of data.  It is almost without question that in the future customers will be won or lost, competitive advantage will be gained or forfeited, and businesses will succeed or fail based on their ability to leverage their data assets.

You may be surprised by what I think the near term challenges are.  Largely, I don’t think they are purely technical.  There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at pace, in line with the increase in data volume.  Sure, there will continue to be lots of interesting innovation in technology, but when organizations like Google are doing 10PB sorts on 8000 machines in just over 6 hours, we know the technical scope for Big Data exists and will eventually flow down to the masses.  Such scale will likely be achievable by most organizations in the next decade.

Instead, I think the core problem that needs to be addressed relates to people and skills.  There are lots of technical engineers who can build distributed systems, and orders of magnitude more who can operate them and fill them to the brim with captured data.  But where I think we are lacking skills is with people who know what to do with the data, people who know how to make it actually useful.  Sure, a BI industry exists today, but I think it is currently more focused on the engineering challenges of providing an organization with faster and easier access to its existing knowledge than on reaching out into the distance and discovering new knowledge.  People with pure data analysis and knowledge discovery skills are much harder to find, and these are the people who are going to be front and center driving the big data revolution.  They are people you can give a few PB of data to, and who can give you back information, discoveries, trends, factoids, patterns, beautiful visualizations and needles you didn’t even know were in the haystack.

These are people who can make a real and significant impact on an organization's bottom line, or help solve some of the world’s problems when applied to R&D.  Data Geeks are the people to be revered in the future, and hopefully we will see a steady increase in people wanting to grow up to be Data Scientists.