Data Scientists – Manage your own Ethical Standards


It seems the privacy nuts might be right. You know, the type of people who stand on streets warning passers-by that the government is watching them. Or the people who wear tinfoil on their heads because they are worried about some corporation reading their thoughts. The reality, it turns out, may not be all that different.

Earlier this week various news sources reported that the personal details of nearly 200 million US voters were exposed. While much of this data was already public information held in voter registration databases, the data had reportedly also been manipulated to try to understand individuals at a personal level. That is, it was used to predict how each individual voter might answer various questions that were important to the data holder. Presumably this was done so they could target campaigns towards the relevant voters.

Also this week, another story surfaced about people losing their anonymity online. It was reported that some people who had been looking up specific medical conditions on the web later received a letter in the mail from a company they had never heard of, offering them participation in medical trials relating to those conditions. The veil of privacy was likely torn off for these people, and in turn the power of data matching techniques became publicly apparent.

It is clear our digital footprints are becoming extensive, and are fragmented around the world in various databases and logs. The organisations that hold this data are realising its value, either to themselves or to others, and as such may be willing to leverage or share it. As a result, the power and level of understanding that can be gained through combining multiple data sets is beginning to be demonstrated. By combining and matching data at an individual level we are able to achieve much more fidelity at a granular level than was possible before with generalised aggregate data sets.

By bringing data sets together using clever data matching tools it is becoming possible to piece together a tapestry of information about individuals. This may relate to individuals whose specific demographics are known, or to a proxy of an individual where demographics such as name may not be known but others (location, age, sex, race etc) can be reasonably predicted and then used to answer questions about the individual. This de-anonymising of data has been demonstrated in various forms, including examples where anonymous medical data was reverse engineered to identify individuals with a level of accuracy.
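To make the matching idea concrete, here is a minimal sketch of how an "anonymous" data set can be linked back to named people through shared demographic fields. Everything in it is hypothetical: the records, the field names and the choice of zip code, birth year and sex as the matching key are assumptions made purely for illustration, not a description of any real incident or tool.

```python
from collections import defaultdict

# Hypothetical "anonymised" records: names removed, demographics kept.
medical = [
    {"zip": "02138", "birth_year": 1961, "sex": "F", "condition": "asthma"},
    {"zip": "02139", "birth_year": 1985, "sex": "M", "condition": "diabetes"},
    {"zip": "02139", "birth_year": 1990, "sex": "F", "condition": "migraine"},
]

# Hypothetical public register that does carry names.
register = [
    {"name": "A. Example", "zip": "02138", "birth_year": 1961, "sex": "F"},
    {"name": "B. Sample", "zip": "02139", "birth_year": 1985, "sex": "M"},
    {"name": "C. Sample", "zip": "02139", "birth_year": 1985, "sex": "M"},
]

# Assumed matching key: fields present in both data sets.
MATCH_FIELDS = ("zip", "birth_year", "sex")


def match_key(record):
    """Build the tuple of demographic values used to link records."""
    return tuple(record[field] for field in MATCH_FIELDS)


def reidentify(anonymous, named):
    """Return (name, condition) pairs where exactly one named person matches."""
    index = defaultdict(list)
    for person in named:
        index[match_key(person)].append(person["name"])

    matches = []
    for record in anonymous:
        candidates = index.get(match_key(record), [])
        if len(candidates) == 1:  # a unique match is a likely re-identification
            matches.append((candidates[0], record["condition"]))
    return matches


matches = reidentify(medical, register)
print(matches)
```

In this toy data only the first record is linked to a single name. The second matches two people and the third matches nobody, which is why the rarer a combination of demographics is, the more exposed the individual behind it becomes.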

Many people would likely be surprised by the volume of data they leave online, although many others would assume they leave a digital trail with everything they do on the web. But even they might be surprised that this is also occurring in the offline world. For example, when you go out to a store or walk through a mall, there is a chance you are creating a digital trail behind you. Your mobile phone is likely to be “pinging” to find WiFi networks nearby even if it is not connecting to them. This ping includes a unique number for your phone (your MAC address). This unique identifier has been used to trace individuals’ movements through a store: how long they spent in a particular department, what other stores they may have gone into and perhaps where they went to lunch. Similarly, in London advertisers reportedly used WiFi enabled garbage cans for tracking individuals’ movements around the city (although these were ‘scrapped’ after being made public).

Given the effectiveness of data matching, it would seem a relatively small hurdle to climb should someone really want to associate this type of data back to individuals so they can target them directly. In Booz Allen’s book “The Mathematical Corporation” the author discusses how some organisations working in this particular field worked hard to establish ethical standards and boundaries to ensure their organisations were seen as credible and trustworthy, while also noting that not all organisations have necessarily applied such boundaries.

Of course, privacy means different things to different people. I am sure the next generation will care less than the current generation about privacy as they have grown up being told everything they put online is public. Maybe privacy won’t exist as a concept and everyone will assume that all data is public including all personal and medical information. While this is a strong possibility for the future, right now many would find it disconcerting to be individually identified from their anonymous digital footprints. And this is not because they are doing something wrong or have something to hide. But just because it seems creepy and weird and feels like it puts us at a risk of being a potential victim of fraud or other wrongdoing.

The modern day leveraging of data is the result of activities primarily undertaken by data scientists. We are the ones turning data into “actionable insights”. And while we are largely focused on the technical and computational challenges in solving data problems, we also need to acknowledge that every single data project has a set of ethical considerations, without exception. And while ethics is taught as an important topic in many disciplines, from medical through business and financial, it is often overlooked in technology. This is a gap that requires focus given the widespread impact data projects can have on individuals.

"every single data project has a set of ethical considerations, without exception"

People have widely different personal views on ethics. Some would consider it poor ethics to ingest personal data to try and emotively influence someone into “buying another widget”. Others would consider this just part of a free market where such marketing is helping business to succeed, creating jobs and therefore benefiting everyone. Some would see de-anonymizing medical details as troublesome for privacy reasons, while others would see it as a necessary step in bringing relevant data together to truly understand illness and disease in a way that may help save millions of lives.

I am not going to preach my view of what data ethics you should apply. My point, however, is that as data scientists we should take time to decide where we sit on the ethical spectrum and what our boundaries are. Sometimes it is all too easy to get so caught up in a technical challenge, or in trying to impress our peers or organisations, that we fail to consider the ethical issues in their entirety. We should always maintain our own ethical standards so later in our careers we will be able to look back on our work and feel like we have always “done good not harm”.

Some ethical factors you may consider include:

-         Are people aware that their data is being collected? Are you authorised to use it?

-         Will individuals be surprised or concerned that their data is being used in your project? How would you feel if your data was used in this way?

-         Are you taking proper measures to secure all data, both at rest and in transit? What would be the impact to individuals of exposure?

-         What is the real-world impact of the project? Does this only positively impact individuals or is there potential negative impact on individuals? If your project uses prediction, what is the real-world negative impact on individuals if your prediction is wrong? How can false negatives or false positives be identified and managed?

-         Is the data being used to influence individuals into making decisions? If so, is this influence visible so the individual is aware of it, or is it being applied in a subtle or emotive manner without the individual’s awareness?

-         Also if targeting individuals, are those being targeted a group who are likely to be in an impaired state allowing your influence to be more effective than would normally be expected?

-         If buying data, has this data been sourced ethically and legally? Is this data trusted and accurate?
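The question above about wrong predictions can be made concrete by counting the two ways a yes/no prediction can fail. The sketch below is illustrative only: the function name and the sample labels are invented, and a real project would also need to weigh what each type of error costs the individual concerned.

```python
def error_counts(actual, predicted):
    """Count false positives and false negatives for boolean predictions."""
    pairs = list(zip(actual, predicted))
    # False positive: predicted True when the real outcome was False.
    false_positives = sum(1 for a, p in pairs if p and not a)
    # False negative: predicted False when the real outcome was True.
    false_negatives = sum(1 for a, p in pairs if a and not p)
    return {"false_positives": false_positives,
            "false_negatives": false_negatives}


# Hypothetical outcomes for five individuals.
actual = [True, False, False, True, False]
predicted = [True, True, False, False, False]

counts = error_counts(actual, predicted)
print(counts)
```

Tracking these two numbers separately matters because they rarely carry the same consequence for the person on the receiving end.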

Of course, there may be more than just moral factors in play. Most countries have extensive legal requirements relating to data, privacy and disclosure that must be considered. Again, as data scientists we should be aware of the relevant laws within the domain we are operating in. While your organisation will likely defer to law specialists for expert advice, for your own personal sense of professionalism you should understand the key legal requirements at a high level so you don’t breach them.

Data science is an interesting field due to the wide variety of knowledge and skills required to deliver effectively. Ethical, moral and legal understanding of issues relating to the use of data are part of these key skills and should be considered in the same vein as the ability to code in R or design a regression model.

The opinions and positions expressed are my own and do not necessarily reflect those of my employer.

How Data Science may change Hacking


Malicious hacking today largely consists of exploiting weaknesses in an application’s stack to gain access to private data that shouldn't be public, or to corrupt or interfere with the operations of a given application.  Sometimes this is done to expose software weaknesses, other times it is done so hackers can generate income by trading valuable private information.

Software vendors are now more focused on baking security concepts into their code, rather than thinking of security as an operational afterthought, although breaches still happen.  In fact, data science is being used in a positive way in the areas of intrusion, virus and malware detection to move us from a reactive response to a more proactive and predictive approach to detecting breaches.

However, as we move forward into an era where aspects of human decision making are being replaced with data science combined with automation, I think it is of immense importance that we have the security aspects of this front of mind from the get go.  Otherwise we are at risk of again falling into the trap of considering security as an afterthought.  And to do this we really need to consider what aspects of data science open themselves up to security risk.

One key area that immediately springs to mind is “gaming the system”, specifically in relation to machine learning.  For example, banks may automate the approval of small bank loans and use machine learning prediction to determine if an applicant has the ability to service the loan and presents a suitable risk.  The processing and approval of the loan may be performed in real-time without human involvement, and funds may be immediately available to the applicant on approval.

However, what may happen if malicious hackers became aware of the models being used to predict risk or serviceability, if they could reverse engineer them and also learn what internal and third party data sources were being used to feed these models or validate identity?  In this scenario malicious hackers may, for example, create false identities and exploit weaknesses in upstream data providers to generate fake data that results in positive loan approvals.  Or they may undertake small transactions in certain ways, exploiting model weaknesses that trick the ML into believing the applicant is less of a risk than they actually are.  Because processing happens in real time, the result could be catastrophic business impact in a relatively short time frame.

Now the above scenario is not necessarily all that likely, with banking in particular having a long history of automated fraud detection and an established security first approach.  But as we move forward with the commoditisation of machine learning, a rapidly increasing number of businesses are beginning to use this technology to make key decisions.  When doing so it becomes imperative that we not only consider the positive aspects, but also what could go wrong and the impact misuse or manipulation could cause.

For example, if the worst case scenario is that a clever user raising a customer service ticket has all their requests marked as “urgent” because they carefully embed keywords causing the sentiment analysis to believe they are an exiting customer, you might decide that while this is a weakness it may not require mitigation.  However, if the potential risk is instead incorrectly granting a new customer a $100k credit limit, you may want to take the downside risk more seriously.

Potential mitigation techniques may include:

  • Use multiple sources of third party data.  Avoid becoming dependent on single sources of validation that you don’t necessarily control.
  • Use multiple models to build layers of validation.  Don’t let a single model become a single point of failure; use other models to cross reference and flag large variances between predictions.
  • Controlled randomness can be a beautiful thing; don’t let all aspects of your process be prescribed.
  • Potentially set bounds for what is allowed to be confirmed by ML and what requires human intervention.  Bounds may be value based, but should also take the expected rate of requests into consideration (how many requests per hour/day etc.).
  • Test the “what if” scenarios and test the robustness and gamability of your models in the same way that you test for accuracy.
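A couple of the ideas in this list (cross referencing models, and value and rate bounds that hand off to a human) can be sketched in a few lines. This is a rough illustration under stated assumptions rather than a recommended design: the class name, the thresholds, and the notion that each model returns a score between 0 and 1 for likelihood of repayment are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Decides whether an ML decision may be auto-confirmed (all limits assumed)."""
    max_auto_amount: float = 5_000.0    # above this value, a human decides
    max_disagreement: float = 0.15      # allowed gap between the two model scores
    approve_threshold: float = 0.7      # minimum score from both models to approve
    max_requests_per_window: int = 3    # expected request rate per applicant
    window_seconds: float = 3600.0
    _recent: dict = field(default_factory=dict, repr=False)

    def decide(self, applicant_id, amount, score_a, score_b, now):
        # Keep only this applicant's request times inside the sliding window.
        history = self._recent.setdefault(applicant_id, deque())
        while history and now - history[0] > self.window_seconds:
            history.popleft()
        history.append(now)

        if amount > self.max_auto_amount:
            return ("human_review", "amount above automatic bound")
        if len(history) > self.max_requests_per_window:
            return ("human_review", "unusual request rate")
        if abs(score_a - score_b) > self.max_disagreement:
            return ("human_review", "models disagree")
        if min(score_a, score_b) >= self.approve_threshold:
            return ("approve", "both models agree")
        return ("decline", "scores below threshold")


gate = ApprovalGate()
print(gate.decide("applicant-1", 1_000, 0.90, 0.88, now=0.0))
print(gate.decide("applicant-2", 1_000, 0.90, 0.50, now=0.0))
print(gate.decide("applicant-3", 10_000, 0.95, 0.95, now=0.0))
```

The point of the sketch is that the gate never trusts a single score: a large gap between models, an unusually large amount or an unusual burst of requests all route the decision to a person.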

The above are just some initial thoughts and are not exhaustive.  I think we are at the start of the ML revolution and it is the right time to get serious about understanding and mitigating the risk surrounding the potential manipulation of ML when combined with business process automation.

What is the biggest challenge for Big Data? (5 years on)

Five years doesn't half fly when you’re having fun!  In this post from 2011 I highlighted some of the challenges facing the “big data revolution”, centring on a lack of people with the right skills to deliver value on the proposition.  Fast forward to 2016 and this not only remains true, but is likely the key issue holding back the adoption of advanced analytics in many organisations.

While there has been an influx of “Data Scientist” titles across the industry, generally organisations are still adopting a technology driven approach led by IT.  The conversations are still very focused on the how rather than the why; it is still all very v1.0.  There is still a lack of the knowledge required to turn potential into value, value that directly affects an organisation’s bottom line.

This will start to sort itself out as the field matures and those who understand the business side of the coin become fluent with big data concepts, to the point they can direct the engineering gurus.  IBM with Watson is looking to take this a step further by bypassing the data techies and letting analysts explore data without as much consideration for the engineering/plumbing involved.  This is similar to the direction in which cloud services such as AWS and Azure Machine Learning are heading.

In 2016 the biggest challenge for Big Data is turning down the focus on the technical how, and turning up the focus on the business driven why.  Engaging and educating those who understand a given business in the capabilities of data science, motivating them to lead these initiatives in their organisations.

Webinar: NoSQL, NewSQL, Hadoop and the future of Big Data management

Join me for a webinar where I discuss how the recent changes and trends in big data management affect the enterprise.  This event is sponsored by Red Rock and RockSolid.


It is an exciting and interesting time to be involved in data. More influential change has occurred in database management in the last 18 months than in the last 18 years. New technologies such as NoSQL & Hadoop, and radical redesigns of existing technologies such as NewSQL, will dramatically change how we manage data moving forward.

These technologies bring with them possibilities both in terms of the scale of data retained but also in how this data can be utilized as an information asset. The ability to leverage Big Data to drive deep insights will become a key competitive advantage for many organisations in the future.

Join Tony Bain as he takes us through both the high level drivers for the changes in technology, how these are relevant to the enterprise and an overview of the possibilities a Big Data strategy can start to unlock.


What is the biggest challenge for Big Data?

Often I think about challenges that organizations face with “Big Data”.  While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data.  It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

It may be surprising what I think are the near term challenges.  Largely I don’t think these are purely technical.  There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at pace in-line with the increase in data volume.  Sure, there will continue to be lots of interesting innovation with technology, but when organizations like Google are doing 10PB sorts on 8000 machines in just over 6 hours – we know the technical scope for Big Data exists and eventually will flow down to the masses, and such scale will likely be achievable by most organizations in the next decade.

Instead I think the core problem that needs to be addressed relates to people and skills.  There are lots of technical engineers who can build distributed systems, and orders of magnitude more who can operate them and fill them to the brim with captured data.  But where I think we are lacking skills is with people who know what to do with the data.  People who know how to make it actually useful.  Sure, a BI industry exists today but I think this is currently more focused on the engineering challenges of providing an organization with faster/easier access to their existing knowledge rather than reaching out into the distance and discovering new knowledge.  The people with pure data analysis and knowledge discovery skills are much harder to find, and these are the people who are going to be front and center driving the big data revolution.  People who you can give a few PB of data to and they can provide you back information, discoveries, trends, factoids, patterns, beautiful visualizations and needles you didn’t even know were in the haystack.

These are people who can make a real and significant impact on an organization’s bottom line, or help solve some of the world’s problems when applied to R&D.  Data Geeks are the people to be revered in the future and hopefully we see a steady increase in people wanting to grow up to be Data Scientists.

SQL Server to discontinue support for OLE-DB

ODBC was first created in 1992 as a generic set of standards for providing access to a wide range of data platforms using a standard interface.  ODBC used to be a common interface for accessing SQL Server data in earlier days.  However, over the last 15 years ODBC has played second fiddle as a provider for SQL Server application developers, who have usually favoured the platform specific OLE-DB provider and the interfaces built on top of it, such as ADO.

Now, in an apparent reversal of direction, various Microsoft blogs have announced that the next version of SQL Server will be the last to support OLE-DB, with the emphasis returning to ODBC.  Why this is the case isn’t entirely clear but various people have tried to answer this, the primary message being that ODBC is an industry standard whereas OLE-DB is Microsoft proprietary.  And as they are largely equivalent, it makes sense to continue to support only the more generic of the two providers.

After years of developers moving away from ODBC to OLE-DB, as you would expect this announcement is being met with much surprise in the community.  But to be fair I suspect most developers won’t notice as they use higher level interfaces, such as ADO.NET, which abstract the specifics of the underlying providers.  C/C++ developers on the other hand may need to revisit their data access interfaces if they are directly accessing SQL Server via OLE-DB.


NSA, Accumulo & Hadoop

I was reading yesterday that the NSA has submitted a proposal to Apache to incubate their Accumulo platform.  This, according to the description, is a key/value store built over Hadoop which appears to provide similar function to HBase, except it provides “cell level access labels” to allow fine grained access control.  This is something you would expect as a requirement for many applications built at government agencies like the NSA.  But it is also very important for organizations in health care, law enforcement and similar fields, where strict control is required over large volumes of privacy sensitive data.

An interesting part of this is how it highlights the acceptance of Hadoop.  Hadoop is no longer just a new technology scratching at the edges of the traditional database market.  Hadoop is no longer just used by startups and web companies.  This is highlighted by outputs like this from organizations such as the NSA.  This is also further highlighted by the amount of research and focus on Hadoop by the data community at large (such as last week at VLDB).  No, Hadoop has become a proven and trusted platform and is now being used by traditional and conservative segments of the market.  


Reply to The Future of the NoSQL, SQL, and RDBMS Markets

Conor O'Mahony over at IBM wrote a good post on a favorite topic of mine “The Future of the NoSQL, SQL, and RDBMS Markets”.  If this is of interest to you then I suggest you read his original post.  I replied in the comments but thought I would also repost my reply here.


Hi Conor, I wish it was as simple as SQL & RDBMS is good for this and NoSQL is good for that.  For me at least, the waters are much muddier than that.

The benefit of SQL & RDBMS is that its general purpose nature has meant it can be applied to a lot of problems, and because of its applicability it has become mainstream to the point every developer on the planet can probably write basic SQL.  And it is justified, there aren’t many data problems you can’t throw an RDBMS at and solve.

The problem with SQL & RDBMS, well essentially I see two.  Firstly, distributed scale is a problem in a small number of cases.  This can be solved by losing some of the generic nature of RDBMS and keeping SQL such as with MPP or attempts like Stonebraker’s NewSQL.  The other way is to lose RDBMS and SQL altogether to achieve scale with alternative key/value methods such as Cassandra, HBase etc.  But these NoSQL databases don’t seem to be the ones gaining the most traction.  From my perspective, the most “popular” and fastest growing NoSQL databases tend to be those which aren’t entirely focused on pure scale but instead focus first on the development model, such as Couch and MongoDB.  Which brings me to my second issue with SQL & RDBMS.

Without a doubt the way in which we build applications has changed dramatically over the last 20 years.  We now see much greater application volumes, much smaller developer teams, shorter development timeframes and faster changing requirements.  Much of what the RDBMS has offered developers, such as strong normalization, enforced integrity, strong data definition and documented schemas, has become less relevant to applications and developers.  Today I would suspect most applications use a SQL database purely as an application specific dumb datastore.  Usually there aren’t multiple applications accessing that database, there aren’t lots of direct data import/exports into other applications, no third party application reporting, no ad-hoc user queries, and the data store is just a repository for a single application to retain data purely for the purpose of making that application function.  Even several major ERP applications have fairly generic databases with soft schemas without any form of constraints or referential integrity.  This is just handled better, from a development perspective, in the code that populates it.

Now of course the RDBMS can meet this requirement, but the issue is that the cost of doing so is higher than it needs to be.  People write code with classes, RDBMS uses SQL.  The translation between these two structures, the plumbing code, can in some cases be 50% or more of an application’s code base (be it hand-written code or code generated automatically by a modeling tool).  Why write queries if you are just retrieving an entire row based on a key?  Why have a strict data model if you are the only application using it and you maintain integrity in the code?  Why should a change in requirements require you to go through the process of building a schema change script/process whose deployment has to be sync’d with the application version?  Why have cost based optimization when all the data access paths are 100% known at the time of code compilation?
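The plumbing argument is easier to see side by side. The sketch below saves and loads the same object once through SQL and once through a key/value style store. It is illustrative only: the `Customer` class and its fields are invented, SQLite stands in for a generic RDBMS, and a plain in-memory dictionary holding JSON stands in for a NoSQL store.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class Customer:  # hypothetical application class
    id: int
    name: str
    email: str


# --- RDBMS route: schema, SQL text and column-to-field mapping ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")


def save_sql(customer):
    conn.execute(
        "INSERT OR REPLACE INTO customer (id, name, email) VALUES (?, ?, ?)",
        (customer.id, customer.name, customer.email),
    )


def load_sql(customer_id):
    row = conn.execute(
        "SELECT id, name, email FROM customer WHERE id = ?", (customer_id,)
    ).fetchone()
    return Customer(*row) if row else None


# --- Key/value route: serialise the whole object under its key ---
store = {}  # stand-in for a key/value or document store


def save_kv(customer):
    store[customer.id] = json.dumps(asdict(customer))


def load_kv(customer_id):
    raw = store.get(customer_id)
    return Customer(**json.loads(raw)) if raw else None


ada = Customer(1, "Ada", "ada@example.com")
save_sql(ada)
save_kv(ada)
print(load_sql(1), load_kv(1))
```

Adding a field to `Customer` touches the table definition and both SQL statements in the first route, but only the class in the second. That is the development-model appeal, and it comes at the price of the database no longer knowing or enforcing the shape of the data.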

Now I am still largely undecided on all of this.  I get why NoSQL can be appealing.  I get how it fits with today’s requirements; what I am unsure about is whether it is all very short sighted.  Applications being built today with NoSQL will themselves grow over time.  What may start off today as simple gets/puts within a soft schema’d datastore may over time gain reporting or analytics requirements that were unexpected when initial development began.  What might have taken a simple SQL query to meet such a requirement in an RDBMS might now require data to be extracted into something else (maybe Hadoop or MPP, or maybe just a simple SQL RDBMS) where it can be processed and loaded back into the NoSQL store in a processed form.  It might make sense if you have huge volumes of data, but for the small scale web app this could be a lot of cost and overhead to summarize data for simple reporting needs.

Of course this is all still evolving.  And RDBMS and NoSQL vendors are both on some form of convergence path.  We have already started hearing noises about RDBMS vendors looking to offer more NoSQL like interfaces to the underlying data stores, as well as NoSQL vendors looking to offer more SQL like interfaces to their repositories.  They will meet up eventually, but by then we will all be talking about something new like stream processing 🙂

Thanks Conor for the thought provoking post.


IA Ventures – Jobs shout out

My friends over at IA Ventures are looking to add both an Analyst and an Associate to their team.  If Big Data, New York and start-ups are in your blood then I can’t think of a better VC to be involved in.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize community events, and work with portfolio companies—basically, you can take on as much responsibility as you can handle."

Roger, Brad and the team continue to impress with their focus on Big Data, their strategic investments in monetizing data and knowledge of the industry in general.

Realtime Data Pipelines

In life there are really two major types of data analytics.  Firstly, we don’t know what we want to know, so we need analytics to tell us what is interesting.  This is broadly called discovery.  Secondly, we already know what we want to know, and we just need analytics to tell us this information, often repeatedly and as quickly as possible.  This is called anything from reporting or dashboarding through to more general data transformation and so on.

Typically we are using the same techniques to achieve both.  We shove lots of data into a repository of some form (SQL, MPP SQL, NoSQL, HDFS etc) then run queries/jobs/processes across that data to retrieve the information we care about.

Now this makes sense for data discovery.  If we don’t know what we want to know, having lots of data in a big pile that we can slice and dice in interesting ways is good.  But when we already know what we want to know, continued batch based processing across mounds of data to produce “updated” results, from data that is often changing constantly, can be highly inefficient.

Enter Realtime Data Pipelines.  Data is fed in one end, results are computed in real time as data flows down the pipeline and come out the other end whenever relevant changes we care about occur.  Data Pipelines / workflow / streams are becoming much more relevant for processing massive amounts of data with real time results.  Moving relevant forms of analytics out of large repositories into the actual data flow from producer to consumer, I believe, will be a fundamental step forward in big data management.
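As a rough illustration of the pipeline idea, the sketch below chains two Python generators: one keeps running totals as events arrive, and the next emits a result only when a total first crosses a threshold. The event names, amounts and threshold are invented for the example, and a production pipeline would run on a streaming platform rather than an in-memory list.

```python
def running_totals(events):
    """Update a per-key total for each incoming (key, amount) event."""
    totals = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
        yield key, totals[key]  # pass the updated total downstream


def threshold_alerts(updates, threshold):
    """Emit a key once, the first time its running total reaches the threshold."""
    alerted = set()
    for key, total in updates:
        if total >= threshold and key not in alerted:
            alerted.add(key)
            yield key, total


# Hypothetical stream of sales events.
events = [
    ("store_a", 40),
    ("store_b", 70),
    ("store_a", 70),
    ("store_b", 50),
    ("store_a", 10),
]

alerts = list(threshold_alerts(running_totals(iter(events)), threshold=100))
print(alerts)
```

Each event is processed once as it flows through, and nothing is recomputed over the full history, which is the contrast with repeatedly re-running a batch query to find out whether anything has changed.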

There are some emerging technologies looking to address this, more details to follow.