Data Scientists – Manage your own Ethical Standards


It seems the privacy nuts might be right. You know, the type of people who stand on streets warning passers-by that the government is watching them. Or the people who wear tinfoil on their heads because they are worried about some corporation reading their thoughts. The reality, it turns out, may not be all that different.

Earlier this week various news sources reported that the personal details of nearly 200 million US voters had been exposed. While much of this data was already public information held in voter registration databases, the data had reportedly also been modelled to understand individuals at a personal level; that is, predicting how each voter would respond to questions that were important to the data holder. Presumably this was done so campaigns could be targeted at the relevant voters.

Also this week, another story surfaced about people losing their anonymity online. It was reported that some people who had been looking up specific medical conditions on the web later received a letter in the mail from a company they had never heard of, offering them participation in medical trials relating to those conditions. The veil of privacy was likely torn off for these people, and the power of data matching techniques was made publicly apparent.

It is clear our digital footprints are becoming extensive, fragmented around the world across various databases and logs. The organisations that hold this data are realising its value, either to themselves or to others, and as such may be willing to leverage or share it. As a result, the power and level of understanding that can be gained through combining multiple data sets is beginning to be demonstrated. By combining and matching data at an individual level, we can achieve far more fidelity and granularity than generalised, aggregated data sets ever allowed.

By bringing data sets together using clever data matching tools, it is becoming possible to piece together a tapestry of information about individuals. This works where specific demographics are known, or against a proxy of an individual where some identifiers (such as name) are unknown but others (location, age, sex, race and so on) can be reasonably predicted and then used to answer questions about that individual. This de-anonymising of data has been demonstrated in various forms, including examples where anonymised medical data was reverse engineered to identify individuals with a high degree of accuracy.
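
To make the mechanics concrete, here is a minimal, hypothetical sketch of the kind of matching described above. The data sets, columns and values are entirely invented; the point is simply that two "anonymous" sources sharing a few quasi-identifiers can be joined:

    import pandas as pd

    # Hypothetical "anonymised" medical records: no names, but quasi-identifiers remain.
    medical = pd.DataFrame({
        "postcode": ["2000", "3000", "2000"],
        "birth_year": [1975, 1982, 1990],
        "sex": ["F", "M", "F"],
        "condition": ["diabetes", "asthma", "migraine"],
    })

    # Hypothetical public voter roll: names plus the same quasi-identifiers.
    voters = pd.DataFrame({
        "name": ["A. Citizen", "B. Voter"],
        "postcode": ["2000", "3000"],
        "birth_year": [1975, 1982],
        "sex": ["F", "M"],
    })

    # A simple join on the shared quasi-identifiers re-attaches names to "anonymous" records.
    linked = medical.merge(voters, on=["postcode", "birth_year", "sex"], how="inner")
    print(linked[["name", "condition"]])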

Many people would likely be surprised by the volume of data they leave online, while others would assume they leave a digital trail with everything they do on the web. But even they might be surprised that this is occurring in the offline world too. For example, when you go out to a store or walk through a mall, there is a chance you are creating a digital trail behind you. Your mobile phone is likely "pinging" to find nearby WiFi networks even when not connecting to them, and this ping includes a unique number for your phone (its MAC address). This unique identifier has been used to trace an individual's movements through a store: how long they spent in a particular department, which other stores they went into and perhaps where they went to lunch. Similarly, in London advertisers reportedly used WiFi-enabled garbage cans to track individuals' movements around the city (although these were 'scrapped' after being made public).
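
As a rough sketch of how such tracking could work once probe requests are being logged (the log format, MAC address and locations below are invented for illustration), dwell times can be estimated simply from the first and last time each device is seen at each location:

    from datetime import datetime

    # Hypothetical probe-request log: (timestamp, device MAC address, access point location).
    log = [
        ("2017-06-20 12:01", "aa:bb:cc:dd:ee:ff", "menswear"),
        ("2017-06-20 12:14", "aa:bb:cc:dd:ee:ff", "menswear"),
        ("2017-06-20 12:31", "aa:bb:cc:dd:ee:ff", "food court"),
    ]

    # Estimate dwell time per device per location from first and last sighting.
    seen = {}
    for ts, mac, loc in log:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        first, last = seen.setdefault((mac, loc), [t, t])
        seen[(mac, loc)] = [min(first, t), max(last, t)]

    for (mac, loc), (first, last) in seen.items():
        minutes = (last - first).total_seconds() / 60
        print(f"{mac} spent ~{minutes:.0f} min in {loc}")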

Given the effectiveness of data matching, it would seem a relatively small hurdle to associate this type of data back to individuals should someone really want to target them directly. In Booz Allen's book "The Mathematical Corporation" the authors discuss how some organisations working in this particular field worked hard to establish ethical standards and boundaries to ensure they were seen as credible and trustworthy, while also noting that not all organisations have necessarily applied such boundaries.

Of course, privacy means different things to different people. I am sure the next generation will care less about privacy than the current one, as they have grown up being told that everything they put online is public. Maybe privacy won't exist as a concept, and everyone will assume that all data is public, including personal and medical information. While this is a strong possibility for the future, right now many would find it disconcerting to be individually identified from their anonymous digital footprints. And this is not because they are doing something wrong or have something to hide, but simply because it feels creepy and puts us at risk of becoming victims of fraud or other wrongdoing.

The modern-day leveraging of data is the result of activities primarily undertaken by data scientists. We are the ones turning data into "actionable insights". And while we are largely focused on the technical and computational challenges in solving data problems, we also need to acknowledge that every single data project has a set of ethical considerations, without exception. And while ethics is taught as an important topic in many disciplines, from medicine through business and finance, it is often overlooked in technology. This is a gap that requires focus, given the widespread impact data projects can have on individuals.

"every single data project has a set of ethical considerations, without exception"

People have widely different personal views on ethics. Some would consider it poor ethics to ingest personal data to try and emotively influence someone into "buying another widget". Others would consider this just part of a free market where such marketing helps business succeed, creating jobs and therefore benefiting everyone. Some would see de-anonymising medical details as troublesome for privacy reasons; others would see it as a necessary step in bringing relevant data together to truly understand illness and disease, potentially helping save millions of lives.

I am not going to preach my view of what data ethics you should apply. My point, however, is that as data scientists we should take the time to decide where we sit on the ethical spectrum and what our boundaries are. Sometimes it is all too easy to get caught up in a technical challenge, or in trying to impress our peers or organisation, and fail to consider the ethical issues in their entirety. We should always maintain our own ethical standards so that later in our careers we can look back on our work and feel we have always "done good, not harm".

Some ethical factors you may consider include:

- Are people aware that their data is being collected? Are you authorised to use it?

- Will individuals be surprised or concerned that their data is being used in your project? How would you feel if your data was used in this way?

- Are you taking proper measures to secure all data, both at rest and in transit? What would be the impact on individuals if it were exposed?

- What is the real-world impact of the project? Does it only positively impact individuals, or is there potential for negative impact? If your project uses prediction, what is the real-world negative impact on individuals if your prediction is wrong? How can false negatives or false positives be identified and managed?

- Is the data being used to influence individuals into making decisions? If so, is the influence visible so the individual is aware of it, or is it being applied in a subtle or emotive manner without their awareness?

- If targeting individuals, are those being targeted likely to be in an impaired or vulnerable state, allowing your influence to be more effective than would normally be expected?

- If buying data, has it been sourced ethically and legally? Is it trusted and accurate?

Of course, there may be more than just moral factors in play. Most countries have extensive legal requirements relating to data, privacy and disclosure that must be considered. Again, as data scientists we should be aware of the relevant laws within the domain we are operating in. While your organisation will likely defer to legal specialists for expert advice, for your own sense of professionalism you should understand the key legal requirements at a high level so you do not breach them.

Data science is an interesting field due to the breadth of knowledge and skills required to deliver effectively. An ethical, moral and legal understanding of the issues relating to the use of data is part of these key skills, and should be regarded in the same vein as the ability to code in R or design a regression model.

The opinions and positions expressed are my own and do not necessarily reflect those of my employer.

The Risks of Unfettered AI


I have written before about the potential risks of machine learning when implemented in areas that impact on people’s daily lives. Much of this has been hypothetical, thinking through the possibilities of what “could” happen in the future if we go down various paths. This story is a little different as it is something that is actually happening at the moment.

Like many people I use various cloud software for different purposes. Most of these are paid for on a monthly subscription via credit card. One particular piece of software I use is from a major vendor that has more than 75,000 customers. I have used this software for a few years and have paid monthly via a credit card I have had from my bank (names not important).

Now, four months ago I got a text message at 2am from my bank saying that they had flagged a transaction as suspicious and that my card was temporarily blocked. It just so happened that I had a 3am meeting that day, so not long after getting the text I called the bank. It turned out to be the payment for the software mentioned above, which the bank's fraud detection system had flagged as unusual for some reason. No issue though: the person on the phone quickly okayed the transaction and re-enabled the card, so no harm done and I could use my card as normal again.

However, a month later I again received the SMS from the bank. And again I called, again they explained it was this transaction and again they resolved it. I explained that this had also happened the month before and I was given assurances that it was now resolved. Business as usual again.

Now fast forward to the same time last month. This time there was no notification from the bank. Instead I started getting messages from other providers saying my payments to them had been declined. So I called the bank. It turns out, you guessed it, the same transaction had been flagged again, causing my card to be blocked and other payments to fail. This time I made a bit of a fuss, and they provided more assurance that they had updated the notes in the system to say this was a valid transaction.

I am sure by now you can predict where this is heading; of course it happened again this month. I spoke to the credit card security department and, while my card was again re-enabled, I asked about the likelihood of this transaction causing my card to be blocked again next month. It appears that while staff can add "notes" to the system, they do not have any method to override the fraud detection system to ensure a valid transaction is not repeatedly flagged incorrectly.

Improved fraud detection is one of the commonly cited areas where machine learning is bringing positive gains. These algorithms "learn" their own patterns from historical data, finding relationships far subtler than was possible before with manually coded rules. This tends to provide a higher level of accuracy overall in detecting potential fraud. I actually applaud banks' efforts to continuously improve in this area; having had a credit card number stolen years ago, I am well aware of the extent of the problem they are trying to solve.

However, machine learning can be complex to debug or influence for individual error cases. Global rules are extracted by machine learning algorithms from millions or billions of rows of history, and these learnings are what drive future predictions. Over time misclassifications may feed back into the learning process as a form of continuous improvement, but this may take some time to occur, and unless the error rate is highly significant it may not actually change the prediction outcomes.

While you can achieve high levels of accuracy overall, there will always be residual false positives, where valid transactions are flagged incorrectly. So, what happens when one customer with one transaction is being classified incorrectly? Implementing machine learning systems with real-world influence, without a "sanity" override, can lead to undesired consequences. We have to remember that errors will still occur no matter how accurate we are, and this needs to be managed. A secondary level of assessment using more traditional, user-definable rules may be required to handle these errors, ensuring systems can respond appropriately and quickly to individual cases of misclassification.
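
As a rough illustration of the kind of secondary, user-definable layer described above (the scoring function, threshold and allowlist here are entirely invented), a human-maintained override could be consulted before a model score is allowed to block a card:

    # Hypothetical fraud screening with a user-definable override layer.
    # model_score() stands in for whatever model the bank actually runs; here it is faked.

    def model_score(txn):
        return 0.97  # pretend the model thinks this transaction is very suspicious

    # Allowlist maintained by support staff: (card, merchant) pairs confirmed as valid.
    confirmed_valid = {("card-123", "CloudSoftwareCo")}

    def should_block(txn, threshold=0.9):
        # The override is checked first, so repeat false positives stop hurting the customer.
        if (txn["card_id"], txn["merchant"]) in confirmed_valid:
            return False
        return model_score(txn) >= threshold

    txn = {"card_id": "card-123", "merchant": "CloudSoftwareCo", "amount": 49.00}
    print(should_block(txn))  # False: the override wins despite the high model score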

However, for now I am caught in the error percentages of a machine learning process. I have no way to make this valid payment without the associated card being blocked on each occurrence. Which means I either go through this process every month or move this payment to a card from another bank.

Given the extent of credit card fraud, perhaps the misclassification of a small percentage of valid transactions is a tolerable impact; globally, credit card fraud is a $16b problem that needs to be solved. Of course, I would be unlikely to change banks because of an issue with a single transaction. However, if more transactions start to fail because of this limitation I wouldn't have many other options, as the system would start to degrade the usability of the very service it was designed to protect.

This is of course just one example. The point is that wherever we use machine learning to make predictions, we still need to acknowledge the prediction error rates and provide appropriate measures to limit their ongoing impact.

The opinions and positions expressed are my own and do not necessarily reflect those of my employer.

How Data Science may change Hacking


Malicious hacking today largely consists of exploiting weaknesses in an application's stack, either to gain access to private data that shouldn't be public or to corrupt or interfere with the operation of a given application.  Sometimes this is done to expose software weaknesses; other times hackers do it to generate income by trading private information of value.

Software vendors are now more focused on baking security concepts into their code, rather than treating security as an operational afterthought, although breaches still happen.  In fact, data science is being used in a positive way in the areas of intrusion, virus and malware detection, moving us from reactive response to a more proactive and predictive approach to detecting breaches.

However, as we move forward into an era where aspects of human decision making are being replaced with data science combined with automation, I think it is of immense importance that we have the security aspects of this front of mind from the get go.  Otherwise we are at risk of again falling into the trap of considering security as an afterthought.  And to do this we really need to consider what aspects of data science open themselves up to security risk.

One key area that immediately springs to mind is "gaming the system", specifically in relation to machine learning.  For example, banks may automate the approval of small loans and use machine learning prediction to determine whether an applicant has the ability to service the loan and presents a suitable risk.  The processing and approval of the loan may be performed in real time without human involvement, and funds may be immediately available to the applicant on approval.

However, what may happen if malicious hackers become aware of the models being used to predict risk or serviceability, reverse engineer them, and also learn which internal and third-party data sources are being used to feed these models or validate identity?  In this scenario malicious hackers may, for example, create false identities and exploit weaknesses in upstream data providers to generate fake data that results in positive loan approvals.  Or they may undertake small transactions in certain ways, exploiting model weaknesses that trick the ML into believing the applicant is less of a risk than they actually are.  Combined with real-time processing, this could cause business impact at a catastrophic scale in relatively short time frames.
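
To make "gaming the system" concrete, here is a toy sketch. The scoring function below is a made-up stand-in, not any real credit model; the point is only that an attacker who can query or reverse engineer a model can search for the cheapest change to an application that flips the decision:

    # A toy, invented approval model: weights months of history and reported income.
    def approve(applicant):
        score = 0.4 * min(applicant["months_of_history"], 24) / 24 \
              + 0.6 * min(applicant["reported_income"], 100_000) / 100_000
        return score > 0.5

    fake = {"months_of_history": 3, "reported_income": 40_000}
    print(approve(fake))  # False

    # Probe the model: inflate the reported income (perhaps fed by a weak upstream
    # data source) until the decision flips.
    while not approve(fake):
        fake["reported_income"] += 5_000
    print(fake["reported_income"], approve(fake))  # the cheapest lie that gets approved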

Now, the above scenario is not necessarily all that likely, with banking in particular having a long history of automated fraud detection and an established security-first approach.  But as we move forward with the commoditisation of machine learning, a rapidly increasing number of businesses are beginning to use this technology to make key decisions.  When doing so it therefore becomes imperative that we consider not only the positive aspects, but also what could go wrong and the impact misuse or manipulation could cause.

For example, if the worst-case scenario is that a clever user raising a customer service ticket has all their requests marked as "urgent" because they carefully embed keywords causing the sentiment analysis to believe they are an exiting customer, you might decide that while this is a weakness it may not require mitigation.  However, if the potential risk is instead incorrectly granting a new customer a $100k credit limit, you may want to take the downside more seriously.

Potential mitigation techniques may include:

  • Use multiple sources of third-party data.  Avoid becoming dependent on single sources of validation that you don't necessarily control.
  • Use multiple models to build layers of validation.  Don't let a single model become a single point of failure; use other models to cross-reference and flag large variances between predictions (see the sketch after this list).
  • Controlled randomness can be a beautiful thing; don't let all aspects of your process be prescribed.
  • Set bounds for what is allowed to be confirmed by ML and what requires human intervention.  Bounds may be value based, but should also take the expected rate of requests into consideration (how many requests per hour/day etc.).
  • Test the "what if" scenarios, and test the robustness and gameability of your models in the same way that you test for accuracy.
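
As a minimal sketch of the cross-referencing idea above (the two models and the disagreement threshold are hypothetical stand-ins), one cheap guardrail is to compare predictions from independently built models and route large disagreements to a human:

    # Hypothetical cross-check between two independently trained risk models.
    def primary_risk_model(applicant):
        return 0.12   # predicted probability of default

    def challenger_risk_model(applicant):
        return 0.48   # a second model trained on different features/data

    def assess(applicant, max_disagreement=0.2):
        a = primary_risk_model(applicant)
        b = challenger_risk_model(applicant)
        if abs(a - b) > max_disagreement:
            return "refer to human review"   # the models disagree too much to automate
        return "auto-approve" if max(a, b) < 0.2 else "auto-decline"

    print(assess({"income": 85_000, "requested": 20_000}))  # refer to human review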

The above are just some initial thoughts and are by no means exhaustive. I think we are at the start of the ML revolution, and it is the right time to get serious about understanding and mitigating the risks surrounding the potential manipulation of ML when combined with business process automation.

The Maturing Field of Big Data


I remember when I was 19, working for an electricity utility as a DBA.  I was putting in lots of hours, and partly as a reward, and partly because they didn't know what to do with me, they sent me off to a knowledge management conference.  Well, when I got back I was an "expert" in knowledge management and it was going to change the world forever.  I convinced my boss to get me in front of the executive team during their next board meeting.  I was on fire, delivering what could only be considered a stellar presentation on knowledge management.  At the end of the presentation I looked around the room expecting excited faces and bated-breath questions.  Instead, you could have heard a pin drop.  They stared at me not knowing what to say, until one of the executives jokingly asked if I had learnt why Microsoft Word kept crashing while I was at the conference.  Then they moved on with their meeting, and knowledge management was never discussed again during my tenure.

For a long time afterwards I thought how foolish they were to ignore a technology that was going to change their business. I was literally handing the insight to them on a plate.  However, over time as I became more experienced my view started to change.  At some point, many years after I had left, I realised I had been trying to sell them a solution for a business problem they didn't have.

As a wide-eyed techie I had an assumption of perceived value: they had thousands of user files on network servers, and knowledge management allowed you to structure, access and understand these files in a better way. To me that sounded like immense business value, although I wasn't exactly sure what this business value was and couldn't articulate it any deeper than "insight" or "understanding". And because of this I had failed to put it into any context they cared about: how KM was going to sell more electricity or prevent outages.

Fast forward to the present day and I have seen my experience repeated across the industry in relation to Big Data, and certainly some of the commentary on Palantir resonates.  Big Data has been primarily IT led, on the assumption that if you get enough data into one spot there is significant inherent value in it.  You can find lots of web articles about this: about finding "needles in haystacks", about discovering previously unrealised relationships, about "monetising" data, about understanding customers better.  But when you try to dig into the detail of what this actually is, there is certainly less information available.

Thankfully, Big Data is starting to move out of the hype phase with the related, but separate fields of Machine Learning and AI starting to take over as the topics generating internet buzz. 

[Image: Gartner Hype Cycle]

But hype is always a good thing for a period of time, as out of it we now have the awareness, technology, toolsets and capability to develop business solutions using Big Data.  But as the field has matured we also need a mature view to be successful. This includes:

  • Big Data must be business led rather than IT led.  It must attempt to solve problems that the business cares about and that have meaningful impact.  IT is an important part of the solution but not the driver.
  • Solutions must identify value that is not easily identifiable using simpler, less costly methods.  For example, say you have a factory that makes red doors.  Sales have been great, but over the last few months sales have declined.  To solve this you use Big Data and identify that customers are growing tired of red doors and now prefer to buy blue doors.  Did you really need Big Data to solve this problem?  Do you think if you spoke to your #1 door salesperson they wouldn't be able to give you the exact same information?
  • Big Data solutions must lead to actions that the business can undertake.  If they have a red-door-making plant, perhaps it can be modified to make blue doors.  But they might struggle to start making cheese.  Big Data has to provide insight within business context.

This doesn't mean that Big Data is becoming boring, far from it in fact.  Instead this maturing means we are more focused on delivering data driven solutions that are going to have a real impact on the world around us.  For any analyst/data scientist that has to be more exciting than simply churning data for data’s sake.

Is the Microsoft HoloLens the next big thing in Analytics?

If you haven’t seen it yet, you should check out the Microsoft HoloLens demos. While it is not widely available yet the developer edition is out and Microsoft is working with their partners to get applications built that make use of the holographic and augmented reality potential.

At first the HoloLens may look like an expensive toy designed for gamers, or you may see it as a tool limited to designers. But moving past that, the Microsoft HoloLens has significant potential in the field of data analytics. One of the key challenges of Big Data has been turning the outcome of analytics into a humanly digestible format so it can be easily explored and understood, yet there is a limit to what you can show in 2D within the confines of a computer screen. The HoloLens has an opportunity to change this. Adding an extra dimension to data visualisation, combined with a 360-degree view, may fundamentally change the way we present data in the future. In addition to data exploration, augmented reality may allow the outcome of analytics to be attached to the real-world objects they relate to.

This is of course somewhat dependent on whether Microsoft has got the HoloLens right and doesn't follow the same tease-and-revert path that Google famously did with Google Glass. If the HoloLens really is ready, this is a space Microsoft can own from the get-go with first-mover advantage.

Blockchain, Blockchain, Blockchain, Oi, Oi, Oi!


One of the fastest moving technologies of 2016 is Blockchain.  Put simply, Blockchain is a decentralised trust system for ensuring transaction validity without a central authority.  The use cases for Blockchain are far reaching, as it is essentially a data platform on which any application that requires trust for the "exchange" of information between multiple parties can be built.  And it has just been revealed/confirmed that it has Australian origins!
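
As a toy illustration only of the hash-linked ledger structure that underpins blockchains (the distributed consensus and mining machinery that actually provides the decentralised trust is omitted, and the transactions are invented):

    import hashlib, json

    def make_block(transactions, prev_hash):
        # Each block commits to its transactions and to the hash of the previous block,
        # so tampering with any earlier block breaks the links of every block after it.
        body = json.dumps({"tx": transactions, "prev": prev_hash}, sort_keys=True)
        return {"tx": transactions, "prev": prev_hash,
                "hash": hashlib.sha256(body.encode()).hexdigest()}

    genesis = make_block(["alice pays bob 5"], prev_hash="0" * 64)
    block2 = make_block(["bob pays carol 2"], prev_hash=genesis["hash"])

    # Verification: check that each block references the hash of its predecessor.
    def valid(chain):
        return all(cur["prev"] == prev["hash"] for prev, cur in zip(chain, chain[1:]))

    print(valid([genesis, block2]))  # True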

The Blockchain methodology was designed to underpin Bitcoin but has since gained momentum in its own right; arguably, Blockchain is a more important innovation than Bitcoin itself.  Certainly banking and FSI in general are all over Blockchain, with CBA and 10 of the world's largest banks simulating trading using Blockchain, and the ASX working on a Blockchain-based system for Australian equities.  These are just a couple of examples; the interest, momentum and pace surrounding Blockchain is quite astounding.

Blockchain is still in its early days and doesn't have all the issues solved yet.  Scalability can be a challenge, but these problems will be resolved as the technology evolves.  In terms of resources, there is a lot of technical detail on the web about how Blockchain systems work; however, if you're looking for resources on the "why" of Blockchain, a couple of good ones include:

If you’re involved in building or operating trust based systems that exchange information between multiple parties you absolutely need to get up to speed with Blockchain.

Will Automation take my Job? Well, Maybe….


Automation is a business transformation technology built on innovations in the field itself, but more recently it has also been leveraging innovations in AI, Machine Learning and Big Data. As all of these fields gain maturity, pundits are naturally playing the impact forward and predicting job losses across various industries as a direct result of automation.

Reiterating the title of this post, "will automation take my job?", I think the answer is a clear "maybe". But job loss isn't the only outcome of automation. My experience has shown that many organisations are seeking to increase the value of the output of their internal workings, and often key employees are constrained by low-value tasks. In IT this is particularly true, where many employers are seeking proactive innovation and thought leadership from employees in their respective areas. But often this is not realised because those employees are consumed with lower-skill, high-occurrence tasks that are important, but are not producing an ROI for the business. IT is just one example; the same problem crosses many industries and skill sets.

Automation of Today

Today, automation is good at undertaking pre-planned actions when pre-defined conditions occur. This means certain types of roles, those formulaic in nature, lend themselves to automation. Trying to improve the efficiency of these roles is not necessarily new; many organisations have already spent effort reducing the associated costs, sometimes replacing higher-cost resources with lower-cost alternatives. That transition typically required organisations to document the process aspects of these roles in detail, which naturally feeds well into the foundations of an automation drive. And this is not necessarily limited to the lower end of the pay scale; I am sure there are a number of people in high-paying roles in FSI, trading, banking and so on who are beginning to see components of their role replaced by automation.

Automation of Tomorrow

Looking forward, automation is beginning to become more adaptive and to use machine learning and AI more broadly to make judgement calls. Bots may understand typed and spoken language as input. Routines may use analytics and prediction to select the best course of action for a specific situation. This broadens the scope of automation from tasks with clear black-and-white outcomes to those with shades of grey requiring intuition.

"If you are doing the job of a robot today, then it is logical to think that computers may one day replace you. But the question is, do you want to be doing the job of a robot to begin with?"

So is this all doom and gloom? I think this is definitely an approaching wave of change that is going to impact areas of the workforce. Over time it will phase out some roles, and aspects of others, but it will also result in the creation of new roles and the improvement of others. Contrary to how it can sometimes seem, most organisations are not just trying to cut costs. They are instead usually focused on ensuring value is being created for both their customers and their shareholders, and on driving their competitive advantage. While this does mean reducing costs where practical, it also means investing in areas that continue to drive growth. This should therefore also mean new jobs, new opportunities and more innovation across the board.

What to do?

But it does mean change is likely for some, and change can be very unpleasant. To ensure you are ready for change, I think you need to take an honest look at your current role to determine if it fits the model of a function that could, over time, be automated. If so, take the opportunity to begin preparing for the change, developing skills and experiences that will ultimately be of higher value if and when organisations begin to adopt automation as a means of increasing value.

Is Amazon about to Disrupt the Database Market?

My LinkedIn feed shows me that Shawn Bice, former GM of database systems (SQL Server) at Microsoft, is joining Amazon AWS as VP of Analytics. Assuming accuracy, Shawn joins the likes of Hal Berenson (one of the former Microsoft luminaries behind SQL Server), Raju Gulabani and Sundar Raghavan.

While AWS has for many years provided support for common database platforms via their EC2 and RDS options, more recently they have released their own transactional database platform, AWS Aurora, and the AWS Redshift data warehousing platform. And to get you there, they have also recently released their database migration service for en masse on-premise-to-cloud migration.

AWS seem to have realised that a keystone to winning in the cloud is winning the database. In the data-centric world ahead, data platforms are going to become core to how applications are architected and ultimately deployed. Within the cloud, providing a comprehensive set of data services with (semi-)seamless integration, rapid deployment and on-tap scalability will be compelling in convincing developers and organisations to "buy into" that vendor's stack.

AWS are actively hiring some of the best and brightest in database for what could be a double whammy if they get it right. The last time I looked, the database market on its own was a $30b+ market, but in the cloud winning the database also likely means winning a customer's complete cloud stack.

Of course, Microsoft and Oracle are formidable opposition and are arguably ahead of AWS in terms of developer and enterprise buy-in. So it is not necessarily an easy path ahead.

I think I have been saying this continuously for the last 15 years, but it is (still) an interesting time to be in databases.

What is the biggest challenge for Big Data? (5 years on)

Five years doesn't half fly when you're having fun!  In this post from 2011 I highlighted some of the challenges facing the "big data revolution", centring on a lack of people with the right skills to deliver value on the proposition.  Fast forward to 2016 and this not only remains true, but is likely the key issue holding back the adoption of advanced analytics in many organisations.

While there has been an influx of "Data Scientist" titles across the industry, organisations are generally still taking a technology-driven approach led by IT.  The conversations are still very focused on the how rather than the why; it is still all very v1.0.  There remains a lack of the knowledge required to turn potential into value, value that directly affects an organisation's bottom line.

This will start to sort itself out as the field matures and those who understand the business side of the coin become fluent with big data concepts, to the point where they can direct the engineering gurus.  IBM with Watson is looking to take this a step further by bypassing the data techies and letting analysts explore data without as much consideration for the engineering and plumbing involved.  Cloud services such as AWS and Azure Machine Learning are heading in a similar direction.

In 2016 the biggest challenge for Big Data is turning down the focus on the technical how, and turning up the focus on the business-driven why: engaging and educating those who understand a given business in the capabilities of data science, and motivating them to lead these initiatives in their organisations.

The SQL/NoSQL war is over. The winner is… wait, was there a war?


We are approaching 7 years since the term “NoSQL” re-entered the popular tech vernacular, and 7 years since I wrote the post “Is the Relational Database Doomed?”. During this time, we have experienced a tidal wave of non-relational data management technologies. So, time for an update to my prior article.

At the start of the decade, when the NoSQL buzz was in its heyday, some were predicting the end of the dominance of the relational database platform (RDBMS) within a decade.  The reasoning seemed somewhat sound: the relational database is based on what is now 40+ year-old technology, and things are so much more advanced now than back then, so clearly this was a technology ripe for disruption.

So how has this disruption gone?  Well, all my metrics show there are more relational databases in existence today than at any point in history.  It may be hard for many people to comprehend the volume.  Mid-size enterprises often operate hundreds of relational databases; many large enterprises have thousands to tens of thousands.  These represent the data stores of everything from ERPs, financial systems and content sharing apps to IT tools and so on.

So despite the noise surrounding NoSQL, in a head-to-head comparison of volume of use, NoSQL usage seems very small indeed.  At a guess, I would predict that for every NoSQL database in existence there are at least 1,000 relational databases.  Probably more.  You would be forgiven for thinking NoSQL use was almost insignificant.

So why has there been so little disruption?

  • The relational database has such a massive legacy. The IT world is full of people whose front of mind solution for a new data management requirement is a relational database.  To demonstrate this I looked on LinkedIn. My search showed that over 1,000,000 people list “SQL” as a skill in their profile.  In comparison only 16,000 listed NoSQL and 30,000 listed MongoDB.  That’s a massive skills gap.
  • The RDBMS is very general purpose.  There are very few day-to-day data management requirements that cannot be met with a run-of-the-mill RDBMS.  So why would you not go with what you know, if what you know is suitable?
  • Relational databases solve very complex problems in a balanced way.  There are 40+ years of learnings on how to balance consistency, concurrency, scale and performance.  Many NoSQL initiatives focus on improving some objectives (such as scale or performance) at the expense of others (such as consistency or redundancy), solving their own problems but lacking general-purpose appeal.
  • Relational database vendors have also kept innovating.  With some RDBMS vendors you can now combine SQL and XPath, store and query JSON natively (see the sketch after this list) and handle other non-structured data types.  Also, many RDBMS platforms now support in-memory databases, and others are quickly adding this support.
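
As a small illustration of the native JSON point above, here is a sketch using SQLite from Python (it assumes a SQLite build that includes the JSON1 functions, which most modern builds do; the table and document are invented for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
    con.execute("INSERT INTO orders (doc) VALUES (?)",
                ('{"customer": "Acme", "items": [{"sku": "door-red", "qty": 3}]}',))

    # Plain SQL querying inside the stored JSON document.
    for row in con.execute(
            "SELECT json_extract(doc, '$.customer'), "
            "       json_extract(doc, '$.items[0].qty') FROM orders"):
        print(row)  # ('Acme', 3)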

So with the continued dominance of the relational database, what future is there for the NoSQL alternatives?  That is clear: the same opportunity they have been filling over the last 7 years, the edge cases.  Sure, enterprises have many routine data-in/data-out applications, and these belong on the RDBMS.  But modern enterprises are trying to do more with data than ever before and to leverage data in new ways for competitive advantage.  Maybe they need data to be captured at a massive scale, much greater than is possible with a traditional RDBMS.  Maybe they are looking to deep mine data to identify complex relationships between entities, or to make predictions about how scenarios will transpire.  Maybe they are trying to learn from large and diverse data sets and discover key new ways to improve productivity.  This new world of requirements is where new-world platforms have the opportunity to shine: focus on improving a specific set of key objectives, potentially at the expense of others, and then find the market that needs those objectives.

To summarise, I cannot see a world in the near future where any non-RDBMS gains dominance in supporting the data management needs of most applications.  The vendors people use may change, and the location of those databases may change (on premise to cloud), but they will be relational. However, what are the edge cases of today will of course become more mainstream.  When this happens it will not be at the expense of the RDBMS; instead they will be in addition to it.  The playing field is getting bigger.  Organisations' desire to do more with data, via big data or data science initiatives, is fuelling a market ripe for vendors with clever, yet tightly focused, data management solutions.

So as it turns out, relational (SQL) and non-relational (NoSQL) technologies were not at war at all.  They are in fact allies, working together to deliver organisations both general purpose and special purpose data management solutions.