For me the most difficult attribute to find at the moment is experience. There are a lot of schools now producing a better flow of graduates with data science skills, which is great, but in the short term this means that the bulk of the resource pool is very raw, with little real-world experience.
Big “data rich” companies may be interested in partnering with innovative startups in win-win style relationships which allow the startup to develop ideas and IP, while the company gets new value that may have been very costly, risky or complex for it to develop internally.
All DBAs will need to be proficient in SQL, including DML and DDL statements, as well as relevant platform specific configuration commands. A junior DBA is unlikely to need to write complex analytical queries, so it is more important to have a broad knowledge of how to get a range of things done using simpler queries.
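As a rough illustration of the DDL/DML split, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for whichever platform you actually administer. The table, index and city values are made up for the example; only the kinds of statement matter.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for a real platform.
conn = sqlite3.connect(":memory:")

# DDL: statements that define structure (hypothetical table and index).
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)"
)
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")

# DML: statements that change or read the data itself.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Tom", "Auckland"), ("Mary", "Wellington"), ("Jean", "Auckland")],
)
conn.execute("UPDATE customers SET city = ? WHERE name = ?", ("Christchurch", "Mary"))
conn.execute("DELETE FROM customers WHERE name = ?", ("Jean",))
conn.commit()

# A simple query: nothing analytical, just checking the result of the changes.
rows = conn.execute("SELECT name, city FROM customers ORDER BY name").fetchall()
print(rows)
```

None of this is complex, which is the point: a broad range of simple statements covers a lot of day-to-day work.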
Honestly I don’t think everyone should or will learn data science.
For many people I think data science would be mind numbingly dull, and many also lack the mathematical affinity to ever truly be good at it. Also, many couldn’t cope with the long periods of repetitive data gathering, preparation, cleaning, transforming, experimenting and so on, when the only positive reward may be the occasional slight % improvement in a model or hypothesis. So in my view, for a lot of the population data science could be an unrewarding and boring job that they may not be very good at.
The technology industry is abuzz with excitement relating to the next industrial revolution, the AI fuelled robotic revolution. The promise is that advances in computer comprehension will bring a new age in terms of decoupling employees from process and provide new levels of efficiency and creativity.
However, while AI is certainly important to this process, the other key technology is automation, the technology which provides the pathways for computer initiated actions. Automation technology is what disconnects analytics and understanding from static reports that need to be digested by humans, and instead triggers responsive or investigative actions affecting routine business operations. And while AI, which is also referred to as advanced analytics, may serve a key purpose in some automation processes, automation itself has a wider applicability in terms of continuous delivery of the existing, potentially simple, processes which exist in most businesses today.
For over a decade I have led a team building platforms that marry analytics with automation (focused within specific domains). And while we certainly leverage advanced analytics for complex processes, what is interesting is that much of the initial gains (efficiencies, productivity, quality etc.) actually come from automating what is already well known, understood and may be comparatively simple.
A limited resource in any organisation is the number of employees, and every employee is limited by the number of working hours in a day and the number of working days in the week. If you speak to those in operational roles, most will be able to recite lots of things that they would like to be doing but don’t have time to do. Limited time may mean that what is urgent gets done, whereas what is ideal gets done at much lower frequencies than optimal.
This is where automation helps, and it becomes itself a mechanism supporting the advancement of automation in an organisation. By automating simple, common but key tasks first you achieve several things:
- You can get quick wins, which is important for any initiative.
- You can achieve improvements in efficiency and quality by doing “what you already know” consistently, without it necessarily being complex.
- Doing this frees up resources with domain expertise, and now automation experience, who can help drive the cycle of automating progressively more complex processes, which may include decisions leveraging AI.
It often seems that those going down an automation path start by trying to think of the hardest problems utilising the most complex of analytics first, as on paper these would seem to be the ones most likely to generate value. However, in practice I have found that automation of simple but poorly attended tasks may bring substantially more value than initially expected. Going forward, the “banking” of efficiency gains can then be utilised to fuel a continuous cycle of improvement through automation of increasingly complex processes.
Therefore, I feel automation is currently the most important concept that organisations heading down the robotic path need to consider. This itself has many considerations which should be factored into the IT landscape and the ability to integrate automation may influence future device and software decisions. Triggers and actions may be driven by AI, but equally many processes may continue to be relatively simple workflows of checks and actions based on simple logic or conventional analytics.
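To make "simple workflows of checks and actions" concrete, here is a minimal sketch of a rule runner. The rule names, metrics and thresholds are all hypothetical, and the actions only return a description rather than doing anything; a real platform would trigger jobs, tickets or scripts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]   # returns True when an action is needed
    action: Callable[[dict], str]   # returns a description of what was done


def run_rules(metrics, rules):
    """Evaluate every check against the metrics and fire the action when it trips."""
    return [rule.action(metrics) for rule in rules if rule.check(metrics)]


# Hypothetical rules: simple logic, no AI involved.
rules = [
    Rule(
        "disk_space",
        lambda m: m["disk_free_pct"] < 10,
        lambda m: f"purge old logs (disk free {m['disk_free_pct']}%)",
    ),
    Rule(
        "backup_age",
        lambda m: m["hours_since_backup"] > 24,
        lambda m: "start backup job",
    ),
]

metrics = {"disk_free_pct": 7, "hours_since_backup": 3}
actions = run_rules(metrics, rules)
print(actions)
```

A check could just as easily call a machine learning model instead of comparing against a fixed threshold; the surrounding workflow would not need to change.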
The opinions and positions expressed are my own and do not necessarily reflect those of my employer.
Unless you have been living under a rock you will know that machine learning, and more broadly artificial intelligence, is one of the hottest topics in IT right now. And while this topic is certainly overhyped there is actually some meat on the bone, i.e. there is something real underneath all the noise. And because of this, around the globe major tech players are starting to make “bet the company” style investments on the future being a highly AI centric world.
So we have all heard the dream of curing cancer and of self-driving cars, and you have probably heard the negative commentary about mass unemployment and social unrest. But what does machine learning mean to the enterprise now and in the near future?
Firstly, let’s talk quickly about what machine learning actually is. If you based this off popular media you would think machine learning is sitting down with a computer, which looks like a robot, and teaching it how you do something like accounting. Once done you’re then free to go off and fire all your accountants (sorry accountants, nothing personal, just an example!). Certainly nothing exists in the AI space that I am aware of that would even come close to this today. At the risk of being a downer, machine learning is actually a set of models that use complex mathematics to learn relationships and patterns from data for the purpose of future classification or prediction. That’s it! Machine learning is number crunching code that takes data in and spits numeric predictions/classifications out.
So why all the fuss about machine learning if it is just some form of psychic calculator?
Humans are good at programming computers to do complex things when those complex things can be broken down into a series of steps of reduced complexity that we can "get our head around". However, increasingly, we are expecting more complexity from our computers. We want to talk to them and have them understand. We want them to be able to recognise images and classify them appropriately. We want them to help more effectively diagnose illnesses, eliminate risk and take burden out of our daily lives. To do these things we have to deal with highly complex relationships that can’t easily be represented in conventional ways. Essentially we were trying to have the "human" solve the complexity problem first, then instruct the computer how to replicate our way of thinking so it can solve it too.
But when trying to translate any real world occurrence into something our computers understand, our efforts have been good, but sometimes not really good enough for widespread use. The number of variables and relationships has been too complex (sometimes these problems have thousands or millions of variables), and our limited ability to comprehend leads to us telling the computer how to do things with inherent flaws and weaknesses. How many times have you used voice recognition which understands some things but acts like you are speaking gibberish at other times? Or you have written a document and the spell checker fails to find the correct spmelling of a word, as if you're making up your own words as you go? How many times have you let your car drive itself and it has ended up in a paddock (ok, bad example)?
The machine learning revolution has come because we have thrown our hands up in the air and said it is “all too hard, you work it out” to our computers *. Instead of giving our computers specifically coded instructions, we are now giving them data and requesting they "learn" how to best predict the outcomes we need. And computers don't get confused when dealing with immense complexity with data which may have billions of items and thousands of variables – instead complexity translates into longer processing time. Enter clever optimisation methods and hardware (GPU/FPGA) acceleration and boom, you have a fundamental change in how we do things.
* Ok more correctly, machine learning builds on 40+ years of research and development, with modern advances in computing power and scalable algorithms making it a practical solution.
Machine learning is a generic approach that we can apply to a vast set of prediction problems where we have sufficient data available to train. And by prediction I don’t mean trying to guess the lotto numbers, but anytime a computer is trying to “understand” something this is a form of prediction. Spell checking is prediction, shopping recommendations are prediction, your credit risk is prediction, what link you will click on a site is prediction, what marketing offers you will respond to is prediction, the identification of fraudulent transactions is prediction. The list goes on and on including more subtle forms such as the accounting categorisation of a business transaction, the expected delivery time of an order, auto-completing search boxes and so on. All prediction. And by combining this prediction with new forms of input (sensors, devices, IoT) and outputs (automation, robotics) we can do some pretty cool things.
By giving the computer data and guidance and "letting them learn" we are often now able to produce a better outcome than if we had tried to program the specific logic ourselves. In some cases decades of research into specific problem related algorithms have been replaced (or enhanced) by generic machine learning capabilities. For example, in one online machine learning course one of your first projects is to create a handwriting recognition program which translates images of handwritten letters into their equivalent ASCII codes. In the old world this was a massively difficult problem: not understanding printed text but actual handwriting, and being able to deal with the millions of variances between the way people write by hand. In the new world, armed with a large library of correctly labelled source images, we can train a machine learning model on this data that reliably translates new handwritten images to text. All in a few dozen lines of code.
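To give a feel for the shape of such a program, here is a deliberately tiny sketch. It is not the course project: instead of real handwriting images and a proper model, it uses hand-made 3x5 "images" and a nearest-neighbour lookup (a very simple learning method that just compares new input against labelled examples). All of the data is invented for illustration.

```python
def to_pixels(rows):
    """Flatten a tiny bitmap ('#' = ink) into a list of 0/1 pixel values."""
    return [1 if ch == "#" else 0 for row in rows for ch in row]


def hamming(a, b):
    """Number of pixels that differ between two equally sized images."""
    return sum(x != y for x, y in zip(a, b))


# Labelled training images: two slightly different examples per letter.
TRAINING = [
    ("L", ["#..", "#..", "#..", "#..", "###"]),
    ("L", ["#..", "#..", "#..", "##.", "###"]),
    ("T", ["###", ".#.", ".#.", ".#.", ".#."]),
    ("T", ["###", ".#.", ".#.", ".#.", "##."]),
    ("O", ["###", "#.#", "#.#", "#.#", "###"]),
    ("O", [".##", "#.#", "#.#", "#.#", "###"]),
]


def train(examples):
    # For nearest-neighbour, "training" is just storing the labelled pixel vectors.
    return [(label, to_pixels(rows)) for label, rows in examples]


def predict(model, rows):
    """Return the ASCII code of the label whose stored image is closest to the input."""
    pixels = to_pixels(rows)
    label, _ = min(model, key=lambda item: hamming(item[1], pixels))
    return ord(label)


model = train(TRAINING)

# An "L" that was not in the training data (one stray pixel).
unseen_l = ["#..", "#..", "#..", "#.#", "###"]
code = predict(model, unseen_l)
print(code, chr(code))
```

Notice that there is no letter-specific logic anywhere in the code. Everything it "knows" about L, T and O comes from the labelled examples, which is the point being made above; a real system swaps in far more data and a far more capable model.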
To support the world's desire for AI capabilities we are seeing major AI platform vendors start to commoditize machine learning. Commoditization basically means making it usable by a wider audience than a select group of PhDs, statisticians or quants. Commoditization generally also means “black boxing” machine learning in ways that don’t require implementers to understand in great detail why their machine learning models work. Instead they just need to understand how to plug these models into their applications so they can learn from the data generated once deployed and use this to guide application functions. As I have mentioned before, this commoditization carries some risk related to the ethics and astuteness of those building/testing these black-box models, however my general feeling is that this is the way forward for many mainstream requirements.
Ok, so we have covered what this is. What steps will enterprises take to implement machine learning in their organisations?
Do well… nothing
I believe many, if not most, organisations will start receiving the benefits of machine learning by doing nothing. Well, not absolutely nothing, but no direct research or investment into machine learning or AI. Instead they will work with existing software vendors to update applications. Over time it will be the software vendors that do the implementing mentioned above and provide unified machine learning capabilities within core application functionality.
Many apps in use today will add machine learning enhanced features and capabilities. Some of this will impact usability: the apps will seem to be more in tune with what users do and how they do it, and apps providing prediction, alerts, notifications and/or guidance will seem to become more accurate over time and give users less noise to deal with. Largely this will be transparent, other than the IT department reporting fewer monitors pushed off desks and keyboards thrown out of windows in frustration. This will be fairly universal across the spectrum of app classes, including ERP, CRM, financials, HR, payroll and so on.
Over time these applications may take this integration further and start to pair machine learning with automation to provide smart workflows that start to fundamentally transform the way in which organisations do business. This is when things may start to become highly disruptive to the status quo and may begin changing jobs, eliminating some and creating others. However, the focus of those implementing remains on the functional business outcomes rather than on needing an in-depth understanding of the AI technology driving it.
Already major vendors from Microsoft, SAP and Salesforce to IBM are working to integrate AI into their existing product lines, and it is this “AI inside” approach that I think is how most enterprise organisations are going to be impacted by machine learning in the near term.
New Classes of Apps
Integrating machine learning with existing applications can start to drive improved usage and support better decision making, but new classes of applications are also becoming available to the enterprise which are only possible because of the advancements in AI. These new classes of applications allow organisations to start driving new efficiencies, improving customer service, strengthening security as well as getting new product ideas to market faster and so on.
One of the key new classes of apps is bots. A bot is an application that combines natural language processing (NLP) with machine learning to “understand” a request from a customer and then “predict” the most likely correct answer. Bots can be set up to receive questions from a customer via email, web form etc. They process the message and work out their level of confidence in their ability to accurately understand the question. If it is high then the bot may answer the question, otherwise it passes it through to the customer service team. Questions may range from “What time do you close today?” and “What’s your address?” to the more personalised “What’s my account balance?”. Bots can continuously learn from past interactions to improve their ability to answer more questions more accurately in the future. This has the potential to significantly reduce customer waiting time for simple questions and allow customer service teams to spend more time on customers with complex questions or issues.
More broadly, new AI enhanced apps are becoming available to support most key forms of enterprise decision making. From HR through marketing, finance and general productivity, new application classes are being created with the aim of ensuring that when decisions are made, they are the best, unbiased decisions given all the available information.
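The "answer if confident, otherwise hand over" flow can be sketched in a few lines. This is only an illustration: the word-overlap score below is a crude stand-in for a real NLP model, and the threshold, FAQ entries and canned answers are all hypothetical.

```python
import re

# Hypothetical known questions and placeholder answers.
FAQ = {
    "what time do you close today": "We close at 5pm today (placeholder answer).",
    "what is your address": "We are at 1 Example Street (placeholder answer).",
}

# Assumed cut-off; a real bot would tune this against past interactions.
CONFIDENCE_THRESHOLD = 0.6


def tokens(text):
    """Lower-case the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def confidence(question, known):
    """Word-overlap (Jaccard) score between 0 and 1; a stand-in for a trained model."""
    q, k = tokens(question), tokens(known)
    if not q or not k:
        return 0.0
    return len(q & k) / len(q | k)


def handle(question):
    """Answer when confident enough, otherwise route to the customer service team."""
    best = max(FAQ, key=lambda known: confidence(question, known))
    if confidence(question, best) >= CONFIDENCE_THRESHOLD:
        return ("bot", FAQ[best])
    return ("human", "Passed to the customer service team.")


print(handle("What time do you close today?"))
print(handle("Why was my parcel delivered to the wrong house?"))
```

The design choice that matters is the threshold: set it too low and the bot answers questions it has misunderstood, set it too high and nearly everything still lands with the customer service team.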
Some organisations may want to go further than above and look to start driving an enhanced competitive advantage using AI. Maybe the organisation is of such complexity that they are better served by in-house built solutions rather than implementing off the shelf product. Maybe these in-house applications have complex risk calculations, classification and/or segmentation of customers, credit risk, fraud detection, churn prediction, procurement and logistic planning and so on.
Benefit from machine learning may be achieved by taking another look at existing prediction logic that has been “programmed” in traditional ways using business rules and complex logic. However, machine learning is not a magic solution. To best solve problems you still need a detailed understanding of what the problems are and the impact they cause, and this comes best from those with experience and domain knowledge in the business. Leveraging these people to home in on where the real challenges are, and pairing them with people who have skills in modern data science, could in my opinion provide much benefit.
To support this, vendors of enterprise infrastructure software and platforms are busy adding AI capabilities. Microsoft has already included R support in SQL Server and has recently announced upcoming Python support. Microsoft also has its Cortana and Azure AI services, all orientated towards mainstream use and deployment. Amazon AWS has extensive AI platform capabilities, including the recent release of its Alexa voice recognition capabilities for mainstream use. Products such as Matlab, which organisations have been using for many years to understand data, have been enhancing their AI capabilities. More broadly, Python and R have already become de facto standards as the languages of choice for machine learning, and decent sized talent pools of people with these skills are starting to form, either new graduates or existing BI/data professionals who have cross skilled to round out their data science capabilities.
— Tony Bain (@tonybain) May 3, 2017
For the most part we have been talking about AI technology supporting existing businesses and making them more effective in the marketplace. But what about enterprises who believe the future of the business is in their ability to find new insight in data, or in their ability to solve problems that haven’t been solvable before? Maybe they are a drug company in a race to help cure/improve certain afflictions. Maybe they are a hedge fund which always needs to be one step ahead of the market. These may require a different approach to how machine learning is leveraged.
Organisations who want to go “all in” on machine learning may see a very different level of investment and return to the approaches I have indicated above. They may need to hire top global talent, build numerous data science teams, invest in data orientated solutions and may even invest in building products and services that have a primary purpose of generating relevant data for feeding into AI processes. I won't really go into more detail about this here, but needless to say they would have a critical need for strong teams and top down support.
Machine learning is coming to the enterprise and in some forms it is already here. Benefiting from machine learning does not necessarily mean building large teams of data scientists and making huge investments. Often machine learning will be implemented by software vendors who are continuously searching for ways to add value and improve the gains provided by their platforms. However establishing a leading competitive advantage through machine learning may be more involved and require careful introduction into existing applications and in some cases, shooting for the stars.
Generally the data we retain within organisational databases is very factual. “Tom bought 6 cans of blue paint”. “Mary delivered package #abc to Jean at 2.30pm”. These are typical examples of the types of records you might find in various corporate database systems. They record specific events that the company has decided ahead of time are important to keep, and has invested in building applications to create and manage.
This data is by nature very focused on “what” has happened or is happening, which of course is useful for lots of business operations such as logistics, auditing, reporting and so on. However, over time the use of this data for most organisations turns to trying to understand “why” things happened so we can influence either making them happen more (for good things) or less (for undesirable things). To try and answer this we sometimes create data warehouses that pull data from various sources into a single repository, with the view of creating analysis that focuses on giving explanation to data rather than just reporting aggregated fact.
But this is where we start having issues. Our proprietary data represents decisions made by customers and/or staff in the context of their "real world" lives, and real world reasoning is highly complex and influenced by many factors. So complex in fact that many organisations may struggle to explain these relationships in any logical fashion. Sure, Tom might buy paint, but why did he need paint, why did he buy blue paint, why did he buy 6 cans, why did he purchase at 2.30pm on a Tuesday, and did Tom need anything else we sell at that time other than paint? From our factual data alone, the context needed to determine such things may simply not exist, so we can stare at this data all day long and we are never going to be able to answer such questions with authority.
So how do we get to the understanding of the why? Well, experienced leaders in the organisation may be able to make gut calls on answers to these questions, but this approach is hard to articulate, lacks consistency, is hard to measure until a long way down the track and is hard to implement in software to leverage at scale. Enter machine learning. Machine learning tries to translate the “gut feeling” into its root elements and then combine these into a defined model by learning inherent relationships that exist within data. But this of course is the kicker: again, to be successful those relationships need to exist in the data. It is relatively easy to generate “better than guessing” models on most operational enterprise datasets, as usually at some level there are some basic contextual relationships in most data. However, in order to go further and get highly tuned and accurate models you can “put your money on”, you need data that relates those key factors of influence. Machine learning can't learn what isn't there.
Often, to be highly successful we need to combine our proprietary factual data with other sources of data that will add this context. This is where contextual data comes into play. This is data that is published online by various authorities; some of it is open and some of it is proprietary, where organisations need to pay. But there are huge, almost limitless, volumes and diversity of open data being published that can take some of the most bland corporate data sets and turn them into a rich treasure trove of insightful information: government data, health data, environmental data, mapping data, imagery, media and so on. These repositories are where we can source the real world context to start to finally understand the “why” of things.
So how do we know which relationships are useful?
Determining which relationships are useful is one of the key tasks of the data scientist. But this is initially driven by domain knowledge, i.e. understanding and knowledge of the specific subject matter being analysed. Generally those with domain knowledge brainstorm and expand out the features of the data to include a rich set of potentially relevant features. Then it is over to the data scientist to crunch the numbers and, using mathematical indicators, determine which features are actually relevant in terms of predicting the desired outcome. If after this the context still isn’t in the data then it becomes a rinse and repeat effort of trying to improve available features and/or expanding the data set volumes.
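One of the simplest "mathematical indicators" is the correlation between each candidate feature and the outcome. The sketch below ranks three features this way. The customer rows are invented purely for illustration, and Pearson correlation is only one of many possible indicators (it only picks up straight-line relationships).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / (spread_x * spread_y)


# Hypothetical, hand-made customer data: one list per column.
data = {
    "years_since_paint": [6, 1, 7, 2, 5, 1, 8, 3],
    "miles_to_store":    [2, 9, 3, 12, 1, 4, 2, 8],
    "shoe_size":         [9, 10, 8, 9, 10, 8, 9, 9],
}
bought_paint = [1, 0, 1, 0, 1, 0, 1, 0]  # the outcome we want to predict

# Score every candidate feature, then rank by strength of relationship.
scores = {name: pearson(values, bought_paint) for name, values in data.items()}
ranking = sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

for name, score in ranking:
    print(f"{name:20s} {score:+.2f}")
```

In this made-up data the time since the last paint job and the distance to the store show a strong relationship with buying paint, while shoe size shows none, so it would be a candidate to drop.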
For example, using domain knowledge we might expand out our feature set to the point where we could explain, “Tom purchased 6 cans of blue paint because he last painted his house 5 years ago, he lives within 3 miles of the store, the next two weeks are expected to have good weather, he owns the house and he likes the color blue”. If we break this down, the data to source this analysis may come from a combination of proprietary and open data.
Assuming our CRM system keeps basic demographics we may determine that Tom:
- Purchased 6 cans – we could potentially use contextual datasets to determine Tom’s property size and calculate the likely number of cans of paint based on existing formulas.
- His house is more than 10 years old – probably few would repaint a new house.
- He last painted his house 5 years ago – maybe we keep a history of paint purchases, or maybe we process street imagery through a model which detects changes in house color (maybe an extreme hypothetical).
- Within 3 miles of the store – public data sets, calculated driving distances between addresses
- He owns the house – contextual data sets or requested demographic data
Feeding these contextual features into a machine learning model along with our proprietary data, we may learn the relationships that give us a model to predict if a customer is in the market for paint. Not all features would necessarily be relevant to our model, however if we get the right mix then we may be able to predict with high levels of confidence. This would allow us to invest appropriately in direct marketing and offers at an appropriate time to maximise our return.
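As a sketch of what "adding context" looks like in practice, the snippet below joins a proprietary CRM record to a contextual property record and derives a few of the features discussed above. Every record, field name and coordinate is hypothetical. It also swaps in straight-line (haversine) distance, since real driving distance would need a routing service.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Straight-line (great-circle) distance in miles between two points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Proprietary data (hypothetical CRM row).
crm_row = {"customer": "Tom", "address_id": "A1", "last_paint_purchase_year": 2012}

# Contextual data (hypothetical open property dataset keyed by address).
property_data = {
    "A1": {"owner_occupied": True, "year_built": 1995, "lat": -36.87, "lon": 174.78},
}
store = {"lat": -36.85, "lon": 174.76}


def build_features(crm, properties, store_location, current_year):
    """Combine one CRM row with its contextual record into model-ready features."""
    prop = properties[crm["address_id"]]
    miles = haversine_miles(
        prop["lat"], prop["lon"], store_location["lat"], store_location["lon"]
    )
    return {
        "years_since_paint": current_year - crm["last_paint_purchase_year"],
        "house_age": current_year - prop["year_built"],
        "owns_house": prop["owner_occupied"],
        "miles_to_store": round(miles, 2),
        "within_3_miles": miles <= 3.0,
    }


features = build_features(crm_row, property_data, store, current_year=2017)
print(features)
```

Rows like this, built for every customer, are what would be fed to the model; which of the columns actually matter is then a question for the data, not for us.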
While this is a simple and perhaps silly example it should start to demonstrate the point of how analysis can be greatly improved through the introduction of context from the use of open and public datasets. However there are some challenges in doing this which need to be addressed which include:
- There is a lot of great data freely available in open data sets. However, many open data sets are currently orientated towards human consumption. That is, they tend to be presented in aggregate form, skewed towards some analysis formed by the publisher. Publishers of open datasets need to understand that in the future it will quite probably never be a human consuming their data directly; instead it will be a computer, for which raw data is more useful.
- Open data sets are not catalogued. There are various sites that try and list open data repositories but none I have seen are doing this very well. You can currently spend a lot of time trying to find good sources of open data.
- Data integration continues to be highly time consuming. Combining data sets remains to this day one of the most time consuming tasks of any data professional, and despite the ETL tools on the market some of the fundamental issues have not been solved. Open data needs to be more self-describing in ways that allow machines to do the integration, so the poor human can focus on better things.
* example of pre-analysed summary data. Suitable for human consumption but less so for machine consumption (the more likely consumer in the future).
Hopefully this summary helps to explain why context is so important to data and our ability to leverage it for making useful predictions. While this is an overly simple example, the key point is that for accurate prediction, useful relationships have to exist in the data.
Self-driving cars are becoming real and various predictions have them being mainstream over the next decade. But it is an interesting question whether we are ready for this innovation in transportation, and how we will react when accidents involving self-driving cars inevitably happen.
I am not a psychologist, but the human element of technology adoption is of course fundamental. And a potential issue for self-driving cars is that, of course, roads are quite dangerous. We hear about people being killed all the time. In fact, based on 2013 figures, 0.02% of the world’s population (1.25 million people) are killed on the roads every year, over 3000 people a day globally. Yet this is a risk that we accept, and we strap our most loved ones into our vehicles for journeys as trivial as going to the beach or getting ice-cream.
So how do we accept and process this risk? As a species we seem to be able to deal with risk by applying a contrived analysis which results in us determining that bad things will never happen to me. We hear about accidents, but we believe we are each a better driver than those involved, we are more observant, we have faster reaction times, there are lots of dangerous people on the road and my job as a good driver is just to avoid them.
I have no doubt that self-driving cars will make the roads safer, probably significantly so. But accidents will still happen; it is unrealistic to think otherwise. The software powering self-driving cars is typically prediction based – prediction = probability = (small/tiny) potential for being wrong. How do we process the risk once I no longer have the advantage of my "better driving" – when everyone’s tech is the same as my tech and the responsibility for a safe journey is out of my hands? For some, perhaps many, this will make the risk a lot more prominent and the act of going out for ice-cream somewhat more concerning.
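The per-day and percentage figures follow from simple arithmetic on the annual number. In the quick check below, the 1.25 million comes from the text, while the world population of roughly 7.2 billion for 2013 is an assumption added for the calculation.

```python
DEATHS_PER_YEAR = 1_250_000        # from the text (2013 figures)
WORLD_POPULATION = 7_200_000_000   # assumption: approximate world population in 2013

deaths_per_day = DEATHS_PER_YEAR / 365
share_pct = DEATHS_PER_YEAR / WORLD_POPULATION * 100

print(f"{deaths_per_day:.0f} deaths per day")
print(f"{share_pct:.3f}% of the population per year")
```

Under that assumption the daily figure lands comfortably over 3000 and the annual share rounds to 0.02%.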
I am the co-founder of the RockSolid SQL business, the primary developer of the technology, and have built the business to include some of the largest and most well known customer logos.
My area of expertise is building solutions that deliver customer value via leveraging big data, machine learning, AI and software automation technologies. I have written numerous books, articles and posts on data driven business and have presented at conferences globally.
As a Director for RockSolid SQL I am responsible for:
- Creation of the core RockSolid technology and technical innovation and development of the product
- Data and analytics strategy/implementation
- Financial performance and growth
- Customer satisfaction, retention and growth
- Business development, sales and marketing and leadership on key opportunities
- Service offerings, pricing and go to market strategies
- Partnering relationships
- Product strategy and direction for our core RockSolid technology
- Technical leadership in key development initiatives
- Team leadership and retention
It has been a few months. As blogging is just a fun “outlet” for me, when things get busy it tends to get put on the back burner. But there is so much to talk about, so much has happened in the world of big data in the past few months. Getting back into the swing of it and working on a few posts!