For me the most difficult attribute to find at the moment is experience. There are a lot of schools now producing a better flow of graduates with data science skills, which is great, but in the short term this means that the bulk of the resource pool is very raw with little real-world experience.
Big “data rich” companies may be interested in partnering with innovative startups in win-win relationships that allow the startup to develop its ideas and IP while the company gains new value that may have been very costly, risky or complex to develop internally.
All DBAs will need to be proficient in SQL, including DML and DDL statements, as well as relevant platform-specific configuration commands. A junior DBA is unlikely to need to write complex analytical queries, so it is more important to have a broad knowledge of how to get a range of things done using simpler queries.
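As a rough illustration of the DDL/DML distinction mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for whatever platform a DBA actually manages, and the table, column and city names are hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# DDL: statements that define structure (tables, indexes).
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# DML: statements that change the data held in that structure.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Tom", "Auckland"), ("Mary", "Wellington"), ("Jean", "Auckland")],
)
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Christchurch", "Mary"))
conn.execute("DELETE FROM customers WHERE name = ?", ("Jean",))
conn.commit()

# A simple query of the kind used day to day.
rows = conn.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
print(rows)
```

Platform-specific configuration commands differ between database products, so they are deliberately left out of this sketch.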
Honestly I don’t think everyone should or will learn data science.
For many people I think data science would be mind numbingly dull, and they also lack the mathematical affinity to ever truly be good at it. Also, many couldn’t cope with the long periods of repetitive data gathering, preparation, cleaning, transforming, experimenting… etc etc when the only positive reward may be the occasional slight % improvement in a model or hypothesis. So in my view, for a lot of the population data science could be an unrewarding and boring job that they may not be very good at.
It seems the privacy nuts might be right. You know, the type of people who stand on streets warning passers-by that the government is watching them. Or the people who wear tinfoil on their heads because they are worried about some corporation reading their thoughts. The reality, it turns out, may not be all that different.
Earlier this week various news sources reported that the personal details of nearly 200 million US voters were exposed. While much of this data was already public information held in voter registration databases, the data had reportedly also been manipulated to try to understand individuals at a personal level, that is, to predict how each individual voter might answer various questions that were important to the data holder. Presumably this was done so they could target campaigns towards the relevant voters.
Also, this week another story surfaced about people losing their anonymity when online. It was reported that some people who had been looking up specific medical conditions on the web later received a letter in the mail from a company they had never heard of offering them participation in medical trials that relate to those conditions. The veil of privacy was likely torn off for these people, and conversely the power of data matching techniques became publicly apparent.
It is clear our digital footprints are becoming extensive, and are fragmented around the world in various databases and logs. The organisations that hold this data are realising its value, either to themselves or to others, and as such may be willing to leverage or share it. As a result, the power and level of understanding that can be gained through combining multiple data sets is beginning to be demonstrated. By combining and matching data at an individual level we are able to achieve much more fidelity at a granular level than was possible before with generalised aggregate data sets.
By bringing data sets together using clever data matching tools it is becoming possible to piece together a tapestry of information about individuals where specific demographics are known, or about a proxy of an individual where specific demographics such as name may not be known but others (location, age, sex, race etc) are reasonably predicted and then used to answer questions centred on the individual. This de-anonymising of data has been demonstrated in various forms, including examples where anonymous medical data was reverse engineered to identify individuals with a level of accuracy.
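To make the risk concrete, the sketch below checks how many records in a supposedly anonymous data set are unique on a handful of remaining attributes (often called quasi-identifiers). The records, field names and the choice of quasi-identifiers are all hypothetical, and this is only one simple way of gauging exposure, not a description of any particular matching tool.

```python
from collections import Counter

# Hypothetical "anonymised" records: names removed, quasi-identifiers remain.
records = [
    {"postcode": "1010", "age": 34, "sex": "F", "condition": "asthma"},
    {"postcode": "1010", "age": 34, "sex": "F", "condition": "diabetes"},
    {"postcode": "1010", "age": 61, "sex": "M", "condition": "hypertension"},
    {"postcode": "6011", "age": 27, "sex": "M", "condition": "migraine"},
    {"postcode": "6011", "age": 27, "sex": "M", "condition": "asthma"},
    {"postcode": "6011", "age": 45, "sex": "F", "condition": "eczema"},
]

# Assumed set of attributes an outside party could plausibly match on.
QUASI_IDENTIFIERS = ("postcode", "age", "sex")


def group_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def unique_rows(rows, keys):
    """Return rows whose quasi-identifier combination appears exactly once."""
    sizes = group_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[k] for k in keys)] == 1]


exposed = unique_rows(records, QUASI_IDENTIFIERS)
smallest_group = min(group_sizes(records, QUASI_IDENTIFIERS).values())

print(f"{len(exposed)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
print(f"smallest group size: {smallest_group}")
```

Any record that is unique on these attributes could be linked to a named person by anyone holding a second data set with the same attributes plus a name, which is the matching step described above.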
Many people likely would be surprised by the volume of data they leave online, but perhaps many others would assume they leave a digital trail with everything they do on the web. But even they might be surprised that this is also occurring in the offline world. For example, when you go out to a store or walk through malls etc, there is a chance you are creating a digital trail behind you. Your mobile phone is likely to be “pinging” to find WiFi networks nearby even if not connecting to them. This ping includes a unique number for your phone (your MAC address). This unique identifier has been used to trace individuals’ movements through a store: how long they spent in a particular department, what other stores they went into and perhaps where they went to lunch. Similarly, in London advertisers reportedly used WiFi-enabled garbage cans for tracking individuals’ movements around the city (although these were ‘scrapped’ after being made public).
Given the effectiveness of data matching, it would seem a relatively small hurdle to climb should someone really want to associate this type of data back to individuals so they can target them directly. In Booz Allen’s book “The Mathematical Corporation” the author discusses how some organisations working in this particular field worked hard to establish ethical standards and boundaries to ensure they were seen as credible and trustworthy, while also noting that not all organisations have necessarily applied such boundaries.
Of course, privacy means different things to different people. I am sure the next generation will care less than the current generation about privacy as they have grown up being told everything they put online is public. Maybe privacy won’t exist as a concept and everyone will assume that all data is public including all personal and medical information. While this is a strong possibility for the future, right now many would find it disconcerting to be individually identified from their anonymous digital footprints. This is not because they are doing something wrong or have something to hide, but just because it seems creepy and weird and feels like it puts us at risk of becoming a victim of fraud or other wrongdoing.
The modern day leveraging of data is the result of activities primarily undertaken by data scientists. We are the ones turning data into “actionable insights”. And while we are largely focused on the technical and computational challenges in solving data problems, we also need to acknowledge that every single data project has a set of ethical considerations, without exception. And while ethics is taught as an important topic in many disciplines, from medicine through business and finance, it is often overlooked in technology. This is a gap that requires focus given the widespread impact data projects can have on individuals.
"every single data project has a set of ethical considerations, without exception"
People have widely different personal views on ethics. Some would consider it poor ethics to ingest personal data to try and emotively influence someone into “buying another widget”. Others would consider this just part of a free market where such marketing is helping business to succeed, creating jobs and therefore benefiting everyone. Some would see de-anonymizing medical details as troublesome for privacy reasons, others would see it as a necessary step in bringing relevant data together to truly understand illness and disease that may help save millions of lives.
I am not going to preach my view of what data ethics you should apply. My point, however, is that as data scientists we should take time to decide where we sit on the ethical spectrum and what our boundaries are. Sometimes it is all too easy to get so caught up in a technical challenge, or in trying to impress our peers or organisations, that we fail to consider the ethical issues in their entirety. We should always maintain our own ethical standards so later in our careers we will be able to look back on our work and feel like we have always “done good not harm”.
Some ethical factors you may consider include:
- Are people aware that their data is being collected? Are you authorised to use it?
- Will individuals be surprised or concerned that their data is being used in your project? How would you feel if your data was used in this way?
- Are you taking proper measures to secure all data, both at rest and in transit? What would be the impact to individuals of exposure?
- What is the real-world impact of the project? Does this only positively impact individuals or is there potential negative impact on individuals? If your project uses prediction, what is the real-world negative impact on individuals if your prediction is wrong? How can false-negatives or false-positives be identified and managed?
- Is the data being used to influence individuals into making decisions? If so, is this visible influence so the individual is aware of it, or is influence being applied in a subtle or emotive manner without the individual’s awareness?
- Also if targeting individuals, are those being targeted a group who are likely to be in an impaired state allowing your influence to be more effective than would normally be expected?
- If buying data, has this data been sourced ethically and legally? Is this data trusted and accurate?
Of course, there may be more than just moral factors in play. Most countries have extensive legal requirements relating to data, privacy and disclosure that must be considered. Again, as data scientists we should be aware of the relevant laws within the domain we are operating in. While your organisation will likely defer to legal specialists for expert advice, for your own personal sense of professionalism you should understand the key legal requirements at a high level so you don’t breach them.
Data science is an interesting field due to the high level of variability in the knowledge and skills needed to deliver effectively. Ethical, moral and legal understanding of issues relating to the use of data is part of these key skills and should be considered in the same vein as the ability to code in R or design a regression model.
The opinions and positions expressed are my own and do not necessarily reflect those of my employer.
I have written before about the potential risks of machine learning when implemented in areas that impact on people’s daily lives. Much of this has been hypothetical, thinking through the possibilities of what “could” happen in the future if we go down various paths. This story is a little different as it is something that is actually happening at the moment.
Like many people I use various cloud software for different purposes. Most of these are paid for on a monthly subscription via credit card. One particular piece of software I use is from a major vendor that has more than 75,000 customers. I have used this software for a few years and have paid monthly via a credit card I have had from my bank (names not important).
Now 4 months ago, I got a text message at 2am from my bank saying that they had flagged a transaction as suspicious and that my card was temporarily blocked. It just so happened that I had a 3am meeting that day, so not long after getting the text I called the bank. It turns out it was the payment for the software mentioned above, and the bank’s fraud detection system had flagged it as unusual for some reason. But no issue, the person on the phone quickly ok’d the transaction and re-enabled the card, so no harm done and I could use my card as normal again.
However, a month later I again received the SMS from the bank. And again I called, again they explained it was this transaction and again they resolved it. I explained that this had also happened the month before and I was provided assurances that this was now resolved. Business as usual again.
Now fast forward to the same time last month. But this time no notification from the bank. Instead I started getting messages from other providers saying my payments to them had been declined. So I called the bank. Turns out, you guessed it, the same transaction had been flagged again, causing my card to be blocked and other payments to fail. This time I made a bit of a fuss and they provided more assurance that they had updated the notes in the system to say this was a valid transaction.
I am sure by now you can predict where this is heading: of course it happened again this month. I spoke to the credit card security department and while my card was again re-enabled I asked about the likelihood of this transaction causing my card to be blocked again next month. As it appears, while staff can add “notes” to the system, they do not seem to have any method to override the fraud detection system to ensure a valid transaction is not repeatedly flagged incorrectly.
Improved fraud detection is one of the commonly cited areas where machine learning is bringing positive gains. These algorithms are “learning” their own patterns from historical data, finding relationships much subtler than was possible before when we had to manually code rules. This tends to provide a higher level of accuracy overall in detecting potential fraud. I actually applaud banks’ efforts to continuously improve in this area; having a different credit card number stolen years ago has made me well aware of the extent of the problem they are trying to solve.
However, machine learning can be complex to debug or influence for individual error cases. Global rules are extracted by machine learning algorithms from millions or billions of rows of history and these learnings are what are used to make future predictions. Over time misclassifications may feed back into the learning process as a form of continuous improvement, but this may take some time to occur, and unless the error rate is of high significance it may not actually change the prediction outcomes.
While as a whole you can achieve high levels of accuracy, there will always be residual false-positives, where valid transactions are flagged incorrectly. So, what happens when one customer with one transaction is being classified incorrectly? Implementing machine learning systems with real-world influence, without a “sanity” override, can lead to undesired consequences. We have to remember errors will still occur no matter our accuracy and this needs to be managed. Secondary level assessment using more traditional, user-definable rules may be required to handle these errors to ensure systems are able to respond appropriately and quickly to individual cases of misclassification.
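A minimal sketch of what such a secondary, user-definable rule layer might look like is shown below. Everything here is hypothetical: the scoring function is a stand-in for a trained model, and the threshold, class and identifier names are assumptions for illustration rather than how any bank's system actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    card_id: str
    merchant_id: str
    amount: float


# Hypothetical score above which the model's output blocks the card.
BLOCK_THRESHOLD = 0.8


def model_fraud_score(txn: Transaction) -> float:
    """Stand-in for a trained model; it consistently mis-scores one merchant."""
    return 0.93 if txn.merchant_id == "cloud-software-vendor" else 0.05


class OverrideRules:
    """User-definable allowlist of (card, merchant) pairs confirmed as valid."""

    def __init__(self):
        self._allowed = set()

    def confirm_valid(self, card_id: str, merchant_id: str) -> None:
        self._allowed.add((card_id, merchant_id))

    def is_allowed(self, txn: Transaction) -> bool:
        return (txn.card_id, txn.merchant_id) in self._allowed


def decide(txn: Transaction, rules: OverrideRules) -> str:
    """Check the human-confirmed rules first, then fall back to the model."""
    if rules.is_allowed(txn):
        return "approve"
    return "block" if model_fraud_score(txn) >= BLOCK_THRESHOLD else "approve"


rules = OverrideRules()
payment = Transaction("card-1", "cloud-software-vendor", 49.0)

first = decide(payment, rules)  # the model alone flags the payment
rules.confirm_valid("card-1", "cloud-software-vendor")  # staff confirm it once
second = decide(payment, rules)  # the same payment next month

print(first, second)
```

The override is scoped to one card and one merchant, so the model still assesses every other transaction. A real system would likely want extra guards (for example an amount limit or an expiry on the rule) so the override itself cannot be abused.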
However, for now I am caught in the error percentages of a machine learning process. I have no way to make this valid payment without the associated card being blocked on each occurrence, which means I either have to go through this process every month or look to move this payment to a card from another bank.
Given the extent of credit card fraud, perhaps the misclassification of a small percentage of valid transactions is a tolerable impact. Globally, credit card fraud is a $16b problem which needs to be resolved. Of course, I would be unlikely to move bank because of an issue with a single transaction. However, if more transactions start to fail because of this limitation I wouldn’t have many other options, as the system starts to degrade the usability of the service that it was designed to protect.
This is of course still just an example. The point being, wherever we are using machine learning to make predictions we still need to acknowledge the prediction error rates and provide appropriate measures to limit their ongoing impact.
The technology industry is abuzz with excitement relating to the next industrial revolution, the AI fuelled robotic revolution. The promise is that advances in computer comprehension will bring a new age in terms of decoupling employees from process and provide new levels of efficiency and creativity.
However, while AI is certainly important to this process the other key technology is automation, the technology which provides the pathways for computer initiated actions. Automation technology is what disconnects analytics and understanding from static reports that need to be digested by humans and instead triggers responsive or investigative actions affecting routine business operations. And while AI, what is also referred to as advanced analytics, may serve a key purpose in some automation processes, automation itself has a wider applicability in terms of continuous delivery of the existing, potentially simple, processes which exist in most businesses today.
For over a decade I have led a team building platforms that marry analytics with automation (focused within specific domains). And while we certainly leverage advanced analytics for complex processes, what is interesting is that much of the initial gains (efficiencies, productivity, quality etc.) actually come from automating what is already well known, understood and may be comparatively simple.
A limited resource in any organisation is the number of employees, and every employee is limited by the number of working hours in a day and the number of working days in the week. If you speak to those in operational roles, most will be able to recite lots of things that they would like to be doing but don’t have time to do. Limited time may mean what is urgent gets done, whereas what is ideal gets done at much lower frequencies than optimal.
This is where automation helps, and it becomes itself a mechanism supporting the advancement of automation in an organisation. By automating simple but common key tasks first you achieve several things:
- You can get quick wins, which is important for any initiative.
- You can achieve improvements in efficiencies and quality by doing “what you already know” consistently without it necessarily being complex.
- Doing this frees up resources with domain expertise and now automation experience who can help drive the cycle of progressively more complex processes, which may include decisions leveraging AI.
It seems those going down an automation path often start by trying to think of the hardest problems utilising the most complex of analytics first, as on paper these would seem to be the ones most likely to generate value. However, in practice I have found that automation of simple but poorly attended tasks may bring substantially more value than initially expected. Going forward the “banking” of efficiency gains can then be utilised to fuel a continuous cycle of improvement through automation of increasingly complex processes.
"automation of simple but poorly attended tasks may bring substantially more value than initially expected."
Therefore, I feel automation is currently the most important concept that organisations heading down the robotic path need to consider. This itself has many considerations which should be factored into the IT landscape and the ability to integrate automation may influence future device and software decisions. Triggers and actions may be driven by AI, but equally many processes may continue to be relatively simple workflows of checks and actions based on simple logic or conventional analytics.
Unless you have been living under a rock you will know that machine learning, and more broadly artificial intelligence, is one of the hottest topics in IT right now. And while this topic is certainly overhyped there is actually some meat on the bone, i.e. there is something real underneath all the noise. And because of this, around the globe major tech players are starting to make “bet the company” style investments on the future being a highly AI centric world.
So we have all heard the dream of curing cancer, self-driving cars and you have probably heard commentary on the negative perspective about mass unemployment and social unrest. But what does machine learning mean to the enterprise now and the near future?
Firstly, let’s talk quickly about what machine learning actually is. If you based this off popular media you would think machine learning is sitting down with a computer, which looks like a robot, and teaching it how you do something like accounting. Once done you’re then free to go off and fire all your accountants (sorry accountants, nothing personal, just an example!). Certainly nothing exists in the AI space that I am aware of that would even come close to this today. At the risk of being a downer, machine learning is actually a set of models that use complex mathematics to learn relationships and patterns from data for the purpose of future classification or prediction. That’s it! Machine learning is number crunching code that takes data in and spits numeric predictions/classifications out.
So why all the fuss about machine learning if it is just some form of psychic calculator?
Humans are good at programming computers to do complex things when those complex things can be broken down into a series of steps of reduced complexity that we can "get our head around". However, increasingly, we are expecting more complexity from our computers. We want to talk to them and have them understand. We want them to be able to recognise images and classify them appropriately. We want them to help more effectively diagnose illnesses and eliminate risk and take burden out of our daily lives. To do these things we have to deal with highly complex relationships that can’t easily be represented in conventional ways. Essentially we were trying to have the "human" solve the complexity problem first, then instruct the computer how to replicate our way of thinking so it can solve it too.
But when trying to translate any real world occurrence into something our computers understand our efforts have been good, but sometimes not really good enough for widespread use. The number of variables and relationships has been too complex (sometimes these problems have thousands or millions of variables), and our limited ability to comprehend leads to us telling the computer how to do things with inherent flaws and weaknesses. How many times have you used voice recognition which understands some things but acts like you are speaking gibberish at other times? Or you have written a document and the spell checker fails to find the correct spmelling of a word, as if you're making up your own words as you go? How many times have you let your car drive itself and it has ended up in a paddock (ok, bad example)?
The machine learning revolution has come because we have thrown our hands up in the air and said it is “all too hard, you work it out” to our computers *. Instead of giving our computers specifically coded instructions, we are now giving them data and requesting they "learn" how to best predict the outcomes we need. And computers don't get confused when dealing with immense complexity with data which may have billions of items and thousands of variables – instead complexity translates into longer processing time. Enter clever optimisation methods and hardware (GPU/FPGA) acceleration and boom, you have a fundamental change in how we do things.
* Ok more correctly, machine learning builds on 40+ years of research and development, with modern advances in computing power and scalable algorithms making it a practical solution.
Machine learning is a generic approach that we can apply to a vast set of prediction problems where we have sufficient data available to train. And by prediction I don’t mean trying to guess the lotto numbers, but anytime a computer is trying to “understand” something this is a form of prediction. Spell checking is prediction, shopping recommendations are prediction, your credit risk is prediction, what link you will click on a site is prediction, what marketing offers you will respond to is prediction, the identification of fraudulent transactions is prediction. The list goes on and on including more subtle forms such as the accounting categorisation of a business transaction, the expected delivery time of an order, auto-completing search boxes and so on. All prediction. And by combining this prediction with new forms of input (sensors, devices, IoT) and outputs (automation, robotics) we can do some pretty cool things.
By giving the computer data and guidance and "letting them learn" we are often now able to produce a better outcome than if we had tried to program the specific logic ourselves. In some cases decades of research into specific problem related algorithms have been replaced (or enhanced) by generic machine learning capabilities. For example, in one online machine learning course one of your first projects is to create a handwriting recognition program which translates images of handwritten letters into their equivalent ASCII codes. In the old world this was a massively difficult problem: not understanding printed text but actual handwriting, and being able to deal with the millions of variances between the way people write by hand. In the new world, armed with a large library of correctly labelled source images, we can train a machine learning model on this data that reliably translates new handwritten images to text. All in a few dozen lines of code.
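The toy sketch below gives a feel for the "learn from labelled examples" idea. It is not the course project: it uses tiny 5x5 bitmaps made up for illustration and a simple nearest-neighbour comparison rather than a trained neural network, and every name in it is hypothetical. The point is only that no letter-specific rules are written; the program compares a new image with labelled examples and returns the ASCII code of the closest one.

```python
# Hypothetical 5x5 bitmaps, written row by row ("#" = ink, "." = blank).
L_CLEAN = "#...." "#...." "#...." "#...." "#####"
T_CLEAN = "#####" "..#.." "..#.." "..#.." "..#.."
O_CLEAN = "#####" "#...#" "#...#" "#...#" "#####"


def flip(image: str, *positions: int) -> str:
    """Return a copy of the bitmap with the given pixel indexes inverted."""
    pixels = list(image)
    for p in positions:
        pixels[p] = "." if pixels[p] == "#" else "#"
    return "".join(pixels)


# Labelled training library: each letter appears with slight variations,
# standing in for different people's handwriting.
TRAINING = [
    (L_CLEAN, "L"), (flip(L_CLEAN, 6), "L"), (flip(L_CLEAN, 20), "L"),
    (T_CLEAN, "T"), (flip(T_CLEAN, 0), "T"), (flip(T_CLEAN, 22), "T"),
    (O_CLEAN, "O"), (flip(O_CLEAN, 7), "O"), (flip(O_CLEAN, 17), "O"),
]


def distance(a: str, b: str) -> int:
    """Number of pixels that differ between two bitmaps."""
    return sum(x != y for x, y in zip(a, b))


def recognise(image: str) -> int:
    """Return the ASCII code of the label on the closest training example."""
    _, label = min(TRAINING, key=lambda example: distance(image, example[0]))
    return ord(label)


# New, previously unseen variations of each letter.
print(recognise(flip(L_CLEAN, 2, 12)))
print(recognise(flip(T_CLEAN, 24)))
print(recognise(flip(O_CLEAN, 12)))
```

Real handwriting needs far larger images, far more examples and a more capable model, but the shape of the solution is the same: labelled data in, predicted character code out.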
To support the world’s desire for AI capabilities we are seeing major AI platform vendors start to commoditize machine learning. Commoditization basically means making it usable by a wider audience than a select group of PhDs, statisticians or quants. Commoditization generally also means “black boxing” machine learning in ways that don’t require implementers to understand in great detail why the machine learning models themselves work. Instead they just need to understand how to plug these models into their applications so they can learn from the data generated once deployed and use this to guide application functions. As I have mentioned before, this consumerization carries some risk related to the ethics and astuteness of those building/testing these black-box models, however my general feeling is that this is the way forward for many mainstream requirements.
Ok, so we have covered what this is. What steps will enterprises take to implement machine learning in their organisations?
Do well… nothing
I believe many, if not most, organisations will start receiving the benefits of machine learning by doing nothing. Well, not absolutely nothing but no direct research or investment into machine learning or AI. Instead they will work with existing software vendors to update applications. Over time it will be the software vendors that do the implementing mentioned above and provide unified machine learning capabilities within core application functionality.
Many apps in use today will add machine learning enhanced features and capabilities. Some of this will impact usability: the apps will seem to be more in tune with what users do and how they do it, and apps providing prediction, alerts, notifications and/or guidance will seem to become more accurate over time and give users less noise to deal with. Largely this will be transparent, other than the IT department reporting fewer monitors pushed off desks and keyboards thrown out windows in frustration. This will be fairly universal across the spectrum of app classes, from ERP, CRM, financials, HR, Payroll and so on.
Over time these applications may take this integration further and start to pair machine learning with automation to provide smart workflows that start to fundamentally transform the way in which organisations do business. This is when things may start to become highly disruptive to the status quo and may begin changing jobs, eliminating some and creating others. However, those implementing remain focused on the functional business outcomes rather than needing an in-depth understanding of the AI technology driving it.
Already major vendors from Microsoft, SAP and Salesforce to IBM are working to integrate AI into their existing product lines, and it is this “AI inside” approach that I think is how most enterprise organisations are going to be impacted by machine learning in the near term.
New Classes of Apps
Integrating machine learning with existing applications can start to drive improved usage and support better decision making, but new classes of applications are also becoming available to the enterprise which are only possible because of the advancements in AI. These new classes of applications allow organisations to start driving new efficiencies, improving customer service, strengthening security as well as getting new product ideas to market faster and so on.
One of the key new classes of apps is bots. A bot (or chatbot) is an application that combines natural language processing (NLP) with machine learning to “understand” a request from a customer and then “predict” the most likely correct answer. Bots can be set up to receive questions from a customer via email, web form etc. They process the message and work out their level of confidence in their ability to accurately understand the question. If it is high, the bot may answer the question; otherwise it passes the question through to the customer service team. This may range from questions such as “What time do you close today?” and “What’s your address?” to the more personalised “What’s my account balance?”. Bots can continuously learn from past interactions to improve their ability to answer more questions more accurately in the future. This has the potential to significantly reduce customer waiting time for simple questions and allow customer service teams to spend more time on customers with complex questions or issues.
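The answer-or-escalate decision described above can be sketched in a few lines. This is only an illustration: a crude word-overlap score stands in for a real NLP model's confidence, and the threshold, function names and canned answers are all placeholders.

```python
import re

# Hypothetical known questions with placeholder answers.
FAQ = {
    "what time do you close today": "We close at 5pm today.",
    "what is your address": "We are at 1 Example Street.",
}

# Assumed cut-off below which the bot hands over to a person.
CONFIDENCE_THRESHOLD = 0.6


def tokens(text: str) -> set:
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def confidence(question: str, known: str) -> float:
    """Word overlap (Jaccard similarity), standing in for a model's confidence."""
    q, k = tokens(question), tokens(known)
    return len(q & k) / len(q | k) if q | k else 0.0


def handle(message: str) -> tuple:
    """Answer when confident enough, otherwise route to the service team."""
    best = max(FAQ, key=lambda known: confidence(message, known))
    if confidence(message, best) >= CONFIDENCE_THRESHOLD:
        return ("bot", FAQ[best])
    return ("human", "Passed to the customer service team.")


print(handle("What time do you close today?"))
print(handle("Why was my card blocked again this month?"))
```

Where the threshold sits is a business decision: set it too low and customers get confident wrong answers, set it too high and the service team sees little relief.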
More broadly, new AI enhanced apps are becoming available to support most key forms of enterprise decision making. From HR through marketing, finance and general productivity, new application classes are being created to ensure that when decisions are made, they are the best, unbiased decisions given all the available information.
Some organisations may want to go further than above and look to start driving an enhanced competitive advantage using AI. Maybe the organisation is of such complexity that they are better served by in-house built solutions rather than implementing off the shelf product. Maybe these in-house applications have complex risk calculations, classification and/or segmentation of customers, credit risk, fraud detection, churn prediction, procurement and logistic planning and so on.
Benefit from machine learning may be achieved by taking another look at existing prediction logic that has been “programmed” in traditional ways using business rules and complex logic. However, machine learning is not a magic solution. To best solve problems you still need a detailed understanding of what the problems are and the impact they cause, and this comes best from those with experience and domain knowledge in the business. Leveraging these people to home in on where the real challenges are and pairing them with people who have skills in modern data science could, in my opinion, provide much benefit.
To support this, vendors of enterprise infrastructure software and platforms are busy adding AI capabilities. Microsoft has already included R support in SQL Server and has recently announced upcoming Python support. Microsoft also has their Cortana and Azure AI services, all orientated towards mainstream use and deployment. Amazon AWS has extensive AI platform capabilities, including the recent release of their Alexa voice recognition capabilities for mainstream use. Products such as Matlab, which organisations have been using for many years to understand data, have been enhancing their AI capabilities. More broadly Python and R have already become de facto standards as the languages of choice for machine learning, and decent sized talent pools of people with skills are starting to form, either new graduates or existing BI/data professionals who have cross skilled to round out their data science capabilities.
— Tony Bain (@tonybain) May 3, 2017
For the most part we have been talking about AI technology supporting existing businesses and making them more effective in the marketplace. But what about enterprises that believe the future of their business is in their ability to find new insight in data, or in their ability to solve problems that haven’t been solvable before? Maybe they are a drug company in a race to help cure or improve certain afflictions. Maybe they are a hedge fund that always needs to be one step ahead of the market. These may require a different approach to how machine learning is leveraged.
Organisations that want to go “all in” on machine learning may see a very different level of investment and return to the approaches I have indicated above. They may need to hire top global talent, build numerous data science teams, invest in data orientated solutions and may even invest in building products and services whose primary purpose is generating relevant data to feed into AI processes. I won't go into more detail about this here, but needless to say they would have a critical need for strong teams and top-down support.
Machine learning is coming to the enterprise, and in some forms it is already here. Benefiting from machine learning does not necessarily mean building large teams of data scientists and making huge investments. Often machine learning will be implemented by software vendors who are continuously searching for ways to add value and improve the gains provided by their platforms. However, establishing a leading competitive advantage through machine learning may be more involved and require careful introduction into existing applications and, in some cases, shooting for the stars.
Generally, the data we retain within organisational databases is very factual. “Tom bought 6 cans of blue paint”. “Mary delivered package #abc to Jean at 2.30pm”. These are typical examples of the types of records you might find in various corporate database systems. They record specific events that the company has decided ahead of time are important to keep, and has invested in building applications to create and manage.
This data is by nature very focused on “what” has happened or is happening, which of course is useful for lots of business operations such as logistics, auditing, reporting and so on. However, over time the use of this data for most organisations turns to trying to understand “why” things happened, so we can influence making them happen more (for good things) or less (for undesirable things). To try to answer this we sometimes create data warehouses that pull data from various sources into a single repository, with the view of creating analysis that focuses on giving explanation to data rather than just reporting aggregated fact.
But this is where we start having issues. Our proprietary data represents decisions made by customers and/or staff in the context of their "real world" lives, and real world reasoning is highly complex and influenced by many factors. So complex, in fact, that many organisations may struggle to explain these relationships in any logical fashion. Sure, Tom might buy paint, but why did he need paint, why did he buy blue paint, why did he buy 6 cans, why did he purchase at 2.30pm on a Tuesday, and did Tom need anything else we sell at that time other than paint? The context needed to determine such things may simply not exist in our factual data, so we can stare at this data all day long and we are never going to be able to answer such questions with authority.
So how do we get to an understanding of the why? Well, experienced leaders in the organisation may be able to make gut calls on answers to these questions, but this approach is hard to articulate, lacks consistency, is hard to measure until a long way down the track and is hard to implement in software to leverage at scale. Enter machine learning. Machine learning tries to translate the “gut feeling” into its root elements and then combine these into a defined model by learning the inherent relationships that exist within data. But this of course is the kicker: again, to be successful those relationships need to exist in the data. It is relatively easy to generate “better than guessing” models on most operational enterprise datasets, as usually at some level there are some basic contextual relationships in most data. However, in order to go further and get highly tuned and accurate models you can “put your money on”, you need data that relates those key factors of influence. Machine learning can't learn what isn't there.
Often, to be highly successful we need to combine our proprietary factual data with other sources of data that will add this context. This is where contextual data comes into play. This is data that is published online by various authorities; some of it is open and some of it is proprietary, where organisations need to pay. But there are huge, almost limitless volumes and diversity of open data being published that can take some of the most bland corporate data sets and turn them into a rich treasure trove of insightful information: government data, health data, environmental data, mapping data, imagery, media and so on. These repositories are where we can source the real world context to finally start to understand the “why” of things.
So how do we know which relationships are useful?
Determining which relationships are useful is one of the key tasks of the data scientist. But this is initially driven by domain knowledge, i.e. understanding and knowledge of the specific subject matter being analysed. Generally, those with domain knowledge brainstorm and expand the data to include a rich set of potentially relevant features. Then it is over to the data scientist to crunch the numbers and, using mathematical indicators, determine which features are actually relevant in terms of predicting the desired outcome. If after this the context still isn’t in the data, then it becomes a rinse and repeat effort of trying to improve the available features and/or expanding the data set volumes.
For example, using domain knowledge we might expand our feature set to the point where we could explain, “Tom purchased 6 cans of blue paint because he last painted his house 5 years ago, he lives within 3 miles of the store, the next two weeks are expected to have good weather, he owns the house and he likes the color blue”. If we break this down, the data to source this analysis may come from a combination of proprietary and open data.
Assuming our CRM system keeps basic demographics we may determine that Tom:
- Purchased 6 cans – we could potentially use contextual datasets to determine Tom’s property size and calculate the likely number of cans of paint based on existing formulas.
- His house is more than 10 years old – probably few would repaint a new house.
- He last painted his house 5 years ago – maybe we keep a history of paint purchases, or maybe we process street imagery through a model which detects changes in house color (maybe an extreme hypothetical).
- Within 3 miles of the store – public data sets, calculated driving distances between addresses
- He owns the house – contextual data sets or requested demographic data
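A few of the features in this list can be derived with very little code once the contextual data is in hand. The sketch below is illustrative only: the record layouts, field names, coordinates and the paint coverage figure (350 square feet per can, two coats) are all assumptions made for the example. It also uses straight-line (haversine) distance as a rough stand-in for driving distance, and an "owner occupied" flag as a stand-in for ownership.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (a rough proxy for driving distance)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def estimated_cans(wall_area_sq_ft, coverage_per_can_sq_ft=350.0, coats=2):
    """Cans needed for the given area; the coverage figure is an assumption."""
    return math.ceil(wall_area_sq_ft * coats / coverage_per_can_sq_ft)


def build_features(crm_row, property_row, store, current_year):
    """Combine a proprietary CRM row with a contextual property row."""
    distance = haversine_miles(crm_row["lat"], crm_row["lon"],
                               store["lat"], store["lon"])
    return {
        "years_since_paint": current_year - crm_row["last_paint_purchase_year"],
        "house_older_than_10_years": current_year - property_row["year_built"] > 10,
        "owns_house": property_row["owner_occupied"],
        "within_3_miles": distance <= 3.0,
        "estimated_cans": estimated_cans(property_row["wall_area_sq_ft"]),
    }


# Hypothetical records: one from our own CRM, one from an open property data set.
tom_crm = {"lat": 40.02, "lon": -75.03, "last_paint_purchase_year": 2012}
tom_property = {"year_built": 1995, "owner_occupied": True, "wall_area_sq_ft": 1000}
store = {"lat": 40.00, "lon": -75.00}

features = build_features(tom_crm, tom_property, store, current_year=2017)
print(features)
```

In practice most of the effort goes into matching the CRM row to the right contextual record, not into the arithmetic.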
Feeding these contextual features into a machine learning model along with our proprietary data, we may learn the relationships that give us a model to predict if a customer is in the market for paint. Not all features would necessarily be relevant to our model; however, if we get the right mix then we may be able to predict with high levels of confidence. This would allow us to invest appropriately in direct marketing and offers at an appropriate time to maximise our return.
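As a rough illustration of checking which features are relevant, the sketch below ranks candidate features by the absolute Pearson correlation between each feature and the outcome. The customer rows and feature names are made up for the example, and correlation is only one possible indicator: it captures linear relationships only, so a real project would likely use several measures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, feature_names, target_name):
    """Sort features by absolute correlation with the target, strongest first."""
    target = [row[target_name] for row in rows]
    scores = {
        name: abs(pearson([row[name] for row in rows], target))
        for name in feature_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical training rows: (years since last paint, miles to store,
# shoe size, bought paint). Shoe size is included as a deliberately
# irrelevant feature.
raw = [
    (6, 2.0, 9, 1), (1, 9.0, 10, 0), (7, 1.5, 8, 1), (2, 2.5, 9, 0),
    (5, 8.0, 10, 1), (0, 3.0, 8, 0), (8, 1.0, 9, 1), (3, 7.5, 9, 0),
]
columns = ["years_since_paint", "miles_to_store", "shoe_size", "bought_paint"]
customers = [dict(zip(columns, values)) for values in raw]

ranking = rank_features(customers, columns[:3], "bought_paint")
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Features that score near zero, like shoe size here, are candidates to drop. Features that score highly are the ones worth investing in sourcing reliably.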
While this is a simple and perhaps silly example, it should start to demonstrate how analysis can be greatly improved through the introduction of context from open and public datasets. However, there are some challenges in doing this that need to be addressed, including:
- There is a lot of great data freely available in open data sets. However, many open data sets are currently orientated towards human consumption. That is, they tend to be presented in aggregate form, skewed towards some analysis performed by the publisher. Publishers of open datasets need to understand that in the future it will quite probably never be a human consuming their data directly; instead it will be a computer, for which raw data is more useful.
- Open data sets are not catalogued. There are various sites that try and list open data repositories but none I have seen are doing this very well. You can currently spend a lot of time trying to find good sources of open data.
- Data integration continues to be highly time consuming. Combining data sets remains one of the most time consuming tasks of any data professional, and despite the ETL tools on the market some of the fundamental issues have not been solved. Open data needs to be more self-describing, in ways that allow machines to do the integration so the poor human can focus on better things.
* example of pre-analysed summary data. Suitable for human consumption but less so for machine consumption (the more likely consumer in the future).
Hopefully this summary helps to explain why context is so important to data and our ability to leverage it for making useful predictions. While this is an overly simple example, the key point is that for accurate prediction, useful relationships have to exist in the data.
Malicious hacking today largely consists of exploiting weaknesses in an application stack to gain access to private data that shouldn't be public, or to corrupt or interfere with the operations of a given application. Sometimes this is done to expose software weaknesses; other times it is done so hackers can generate income by trading private information which is of value.
Software vendors are now more focused on baking security concepts into their code, rather than thinking of security as an operational afterthought, although breaches still happen. In fact, data science is being used in a positive way in the areas of intrusion, virus and malware detection to move us from a reactive response to a more proactive and predictive approach to detecting breaches.
However, as we move forward into an era where aspects of human decision making are being replaced with data science combined with automation, I think it is of immense importance that we have the security aspects of this front of mind from the get go. Otherwise we are at risk of again falling into the trap of considering security as an afterthought. And to do this we really need to consider what aspects of data science open themselves up to security risk.
One key area that immediately springs to mind is “gaming the system”, specifically in relation to machine learning. For example, banks may automate the approval of small bank loans and use machine learning prediction to determine if an applicant has the ability to service the loan and presents a suitable risk. The processing and approval of the loan may be performed in real time without human involvement, and funds may be immediately available to the applicant on approval.
However, what may happen if malicious hackers become aware of the models being used to predict risk or serviceability? What if they can reverse engineer them, and also learn which internal and third party data sources are being used to feed these models or validate identity? In this scenario malicious hackers may, for example, create false identities and exploit weaknesses in upstream data providers to generate fake data that results in positive loan approvals. Or they may undertake small transactions in certain ways, exploiting model weaknesses that trick the ML into believing the applicant is less of a risk than they actually are. Because processing is real time, this could cause business impact on a catastrophic scale in relatively short time frames.
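One way a model owner might look for this kind of weakness before an attacker does is to take a declined applicant and check whether small, cheap changes to inputs the applicant controls would flip the decision. The sketch below is a toy: the scoring function, field names, point values and threshold are all hypothetical stand-ins for a real trained model.

```python
from itertools import product


def toy_risk_score(applicant):
    """Hypothetical stand-in for a trained model; higher points mean lower risk."""
    points = 40
    points += 2 * min(applicant["small_repaid_transactions"], 20)
    points += 10 if applicant["years_at_address"] >= 2 else 0
    return points


def find_cheap_flips(applicant, score_fn, threshold, tweaks):
    """List the combinations of tweaked inputs that turn a decline into an approval."""
    names = list(tweaks)
    flips = []
    for values in product(*(tweaks[name] for name in names)):
        change = dict(zip(names, values))
        candidate = {**applicant, **change}
        if score_fn(candidate) >= threshold:
            flips.append(change)
    return flips


APPROVAL_THRESHOLD = 70
applicant = {"small_repaid_transactions": 2, "years_at_address": 0}

# Only inputs the applicant can cheaply influence are varied here.
tweaks = {"small_repaid_transactions": [5, 10, 15, 20]}

flips = find_cheap_flips(applicant, toy_risk_score, APPROVAL_THRESHOLD, tweaks)
print(flips)
```

If a handful of low-cost transactions is enough to move an applicant from decline to approve, that feature probably carries too much weight on its own.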
Now, the above scenario is not necessarily all that likely, with banking in particular having a long history of automated fraud detection and an established security first approach. But as we move forward with the commoditisation of machine learning, a rapidly increasing number of businesses are beginning to use this technology to make key decisions. When doing so it therefore becomes imperative that we not only consider the positive aspects, but also what could go wrong and the impact misuse or manipulation could cause.
For example, if the worst case scenario is that a clever user raising customer service tickets has all their requests marked as “urgent” because they carefully embed keywords that cause the sentiment analysis to believe they are an exiting customer, you might decide that while this is a weakness it may not require mitigation. However, if the potential risk is instead incorrectly granting a new customer a $100k credit limit, you may want to take the downside risk more seriously.
Potential mitigation techniques may include:
- Use multiple sources of third party data. Avoid becoming dependent on single sources of validation that you don’t necessarily control.
- Use multiple models to build layers of validation. Don’t let a single model become a single point of failure; use other models to cross reference and flag large variances between predictions.
- Controlled randomness can be a beautiful thing. Don’t let all aspects of your process be prescribed.
- Potentially set bounds for what is allowed to be confirmed by ML and what requires human intervention. Bounds may be value based, but should also take the expected rate of requests into consideration (how many requests per hour/day etc.).
- Test the “what if” scenarios and test the robustness and gamability of your models in the same way that you test for accuracy.
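Several of these ideas can be combined into a simple gate that sits between the models and the automated approval. The following is a minimal sketch, not a production design: the class name, thresholds, decision labels and the assumption of two independent model scores between 0 and 1 are all hypothetical.

```python
import random
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Decides whether an ML approval can be automated (all limits are assumptions)."""
    max_auto_amount: float = 5_000.0       # value bound for automation
    max_model_disagreement: float = 0.15   # allowed gap between the two models
    min_score: float = 0.8                 # both models must score at least this
    max_auto_per_hour: int = 20            # expected rate of automated approvals
    audit_rate: float = 0.05               # share randomly sent to a human
    rng: random.Random = field(default_factory=random.Random)
    _recent: deque = field(default_factory=deque, init=False, repr=False)

    def decide(self, amount, score_a, score_b, now):
        """Return a decision label; `now` is a timestamp in seconds."""
        # Forget automated approvals that are more than an hour old.
        while self._recent and now - self._recent[0] >= 3600:
            self._recent.popleft()
        if amount > self.max_auto_amount:
            return "human_review: amount above bound"
        if abs(score_a - score_b) > self.max_model_disagreement:
            return "human_review: models disagree"
        if min(score_a, score_b) < self.min_score:
            return "decline"
        if len(self._recent) >= self.max_auto_per_hour:
            return "human_review: rate limit"
        if self.rng.random() < self.audit_rate:
            return "human_review: random audit"
        self._recent.append(now)
        return "auto_approve"


# Audit rate set to zero here only so the example is deterministic.
gate = ApprovalGate(max_auto_per_hour=2, audit_rate=0.0)
decisions = [
    gate.decide(1_000, 0.90, 0.92, now=0),
    gate.decide(50_000, 0.90, 0.92, now=1),
    gate.decide(1_000, 0.95, 0.60, now=2),
    gate.decide(1_000, 0.50, 0.55, now=3),
    gate.decide(1_000, 0.90, 0.90, now=4),
    gate.decide(1_000, 0.90, 0.90, now=5),
]
print(decisions)
```

The random audit is the "controlled randomness" item: because an attacker cannot predict which requests a human will see, a manipulation that works once is harder to repeat at scale.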
The above are just some initial thoughts and are not exhaustive. I think we are at the start of the ML revolution, and it is the right time to get serious about understanding and mitigating the risk surrounding the potential manipulation of ML when combined with business process automation.