Generally, the data we retain within organisational databases is very factual. “Tom bought 6 cans of blue paint”. “Mary delivered package #abc to Jean at 2.30pm”. These are typical examples of the types of records you might find in corporate database systems. They record specific events that the company has decided ahead of time are important to keep, and for which it has invested in building applications to create and manage the data.
This data is by nature very focused on “what” has happened or is happening, which of course is useful for many business operations such as logistics, auditing, reporting and so on. However, over time most organisations turn to this data to try to understand “why” things happened, so they can influence those things to happen more (for good things) or less (for undesirable things). To try to answer this we sometimes create data warehouses that pull data from various sources into a single repository, with the view of producing analysis that explains the data rather than just reporting aggregated fact.
But this is where we start having issues. Our proprietary data represents decisions made by customers and/or staff in the context of their "real world" lives, and real-world reasoning is highly complex and influenced by many factors. So complex, in fact, that many organisations may struggle to explain these relationships in any logical fashion. Sure, Tom might buy paint, but why did he need paint? Why did he buy blue paint? Why did he buy 6 cans? Why did he purchase at 2.30pm on a Tuesday? Did Tom need anything else we sell at that time other than paint? The context needed to determine such things simply may not exist in our factual data alone, so we can stare at this data all day long and never be able to answer such questions with authority.
So how do we get to an understanding of the why? Well, experienced leaders in an organisation may be able to make gut calls on answers to these questions, but this approach is hard to articulate, lacks consistency, is hard to measure until a long way down the track and is hard to implement in software to leverage at scale. Enter machine learning. Machine learning tries to translate the “gut feeling” into its root elements and then combine these into a defined model by learning inherent relationships that exist within data. But this of course is the kicker – again, to be successful those relationships need to exist in the data. It is relatively easy to generate “better than guessing” models on most operational enterprise datasets, as usually there are some basic contextual relationships at some level in most data. However, in order to go further and get highly tuned, accurate models you can “put your money on”, you need data that relates those key factors of influence. Machine learning can't learn what isn't there.
Often, to be highly successful, we need to combine our proprietary factual data with other sources of data that will add this context. This is where contextual data comes into play. This is data published online by various authorities; some of it is open and some of it is proprietary, where organisations need to pay. But there are huge, almost limitless volumes and diversity of open data being published that can take some of the blandest corporate data sets and turn them into a rich treasure trove of insightful information: government data, health data, environmental data, mapping data, imagery, media and so on. These repositories are where we can source the real-world context to finally start to understand the “why” of things.
So how do we know which relationships are useful?
Determining which relationships are useful is one of the key tasks of the data scientist. But it is initially driven by domain knowledge, i.e. understanding and knowledge of the specific subject matter being analysed. Generally, those with domain knowledge brainstorm and expand out the features of the data to include a rich set of potentially relevant ones. Then it is over to the data scientist to crunch the numbers and, using mathematical indicators, determine which features are actually relevant in terms of predicting the desired outcome. If after this the context still isn’t in the data, then it becomes a rinse-and-repeat effort of trying to improve the available features and/or expanding the data set volumes.
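Those “mathematical indicators” can be as simple as a correlation score between each candidate feature and the outcome. The sketch below uses plain Python and made-up toy data (the feature names and values are hypothetical, and real work would use more robust measures such as mutual information):

```python
import math

# Toy dataset: each row is a customer; the outcome is whether they
# bought paint (1) or not (0). Feature names and values are made up.
features = {
    "years_since_last_paint": [5, 1, 7, 2, 6, 0, 8, 3],
    "miles_from_store":       [2, 9, 3, 8, 1, 10, 2, 7],
    "shoe_size":              [9, 10, 8, 11, 10, 9, 11, 8],  # likely irrelevant
}
bought_paint = [1, 0, 1, 0, 1, 0, 1, 0]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Rank candidate features by the strength of their relationship with
# the outcome; weakly related features are candidates to drop.
scores = {name: abs(pearson(vals, bought_paint))
          for name, vals in features.items()}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```

On this toy data the irrelevant feature scores near zero while the contextual features score highly, which is exactly the signal the data scientist is looking for when deciding whether the “why” is actually present in the data.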
For example, using domain knowledge we might expand our feature set to the point where we could explain: “Tom purchased 6 cans of blue paint because he last painted his house 5 years ago, he lives within 3 miles of the store, the next two weeks expect to be good weather, he owns the house and he likes the color blue”. If we break this down, the data to source this analysis may come from a combination of proprietary and open data.
Assuming our CRM system keeps basic demographics we may determine that Tom:
- Purchased 6 cans – we could potentially use contextual datasets to determine Tom’s property size and calculate the likely number of cans of paint using standard coverage formulas.
- His house is more than 10 years old – probably few would repaint a new house.
- He last painted his house 5 years ago – maybe we keep a history of paint purchases, or maybe we process street imagery through a model which detects changes in house color (maybe an extreme hypothetical).
- Lives within 3 miles of the store – public data sets and calculated driving distances between addresses.
- Owns the house – contextual data sets or requested demographic data.
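Two of the derived features above are easy to sketch in code. The coverage figures, coordinates and property size below are entirely made-up illustrations, and a straight-line distance is used as a crude stand-in for a real driving-distance calculation:

```python
import math

def cans_needed(wall_area_m2, coats=2, coverage_m2_per_can=35):
    """Estimate cans of paint from property wall area using a standard
    coverage formula: area * coats / coverage per can, rounded up.
    The coverage figure here is an assumed, illustrative value."""
    return math.ceil(wall_area_m2 * coats / coverage_m2_per_can)

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in miles -- a crude
    proxy for driving distance between two geocoded addresses."""
    r = 3958.8  # Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical property from a contextual dataset vs. the store.
print(cans_needed(100))  # 100 m2 of wall, two coats -> 6 cans
print(round(haversine_miles(51.50, -0.12, 51.53, -0.10), 1))
```

Neither calculation is clever on its own; the point is that both are fed by open or contextual data (property dimensions, geocoded addresses) rather than anything in our sales records.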
Feeding these contextual features into a machine learning model along with our proprietary data, we may learn the relationships that give us a model to predict whether a customer is in the market for paint. Not all features would necessarily be relevant to our model; however, if we get the right mix, we may be able to predict with high levels of confidence. This would allow us to invest appropriately in direct marketing and offers at an appropriate time to maximise our return.
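To make that concrete, here is a minimal sketch of such a model: a logistic regression trained by gradient descent in pure Python on toy, made-up feature rows (years since last paint, scaled distance from the store). A real project would use a proper ML library, but the mechanics of “learning relationships from data” are the same:

```python
import math

# Toy training data: [years_since_last_paint, scaled_miles_from_store]
# and whether the customer bought paint. All values are invented.
X = [[5, 0.2], [1, 0.9], [7, 0.3], [2, 0.8],
     [6, 0.1], [0, 1.0], [8, 0.2], [3, 0.7]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

def predict(w, b, row):
    """Sigmoid of a weighted sum: probability the customer buys paint."""
    z = b + sum(wi * xi for wi, xi in zip(w, row))
    z = max(-30.0, min(30.0, z))  # clamp for numerical safety
    return 1 / (1 + math.exp(-z))

w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(2000):                  # gradient-descent epochs
    for row, label in zip(X, y):
        err = predict(w, b, row) - label   # gradient of the log-loss
        b -= lr * err
        w = [wi - lr * err * xi for wi, xi in zip(w, row)]

# The learned weights encode the relationships present in the data.
print([round(predict(w, b, row), 2) for row in X])
```

If the contextual features genuinely relate to the outcome, as they do in this contrived set, the model separates buyers from non-buyers; if they do not, no amount of training will conjure the relationship, which is the whole point of the section above.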
While this is a simple and perhaps silly example, it should start to demonstrate how analysis can be greatly improved by introducing context from open and public datasets. However, there are some challenges in doing this which need to be addressed, including:
- There is a lot of great data freely available in open data sets. However, many open data sets are currently orientated towards human consumption. That is, they tend to be presented in aggregate form, skewed towards some analysis framed by the publisher. Publishers of open datasets need to understand that in the future it quite probably will not be a human consuming their data directly, but a computer, for which raw data is more useful.
- Open data sets are not well catalogued. There are various sites that try to list open data repositories, but none I have seen do this very well. You can currently spend a lot of time trying to find good sources of open data.
- Data integration continues to be highly time consuming. Combining data remains one of the most time-consuming tasks of any data professional, and despite the ETL tools on the market, some of the fundamental issues have not been solved. Open data needs to be more self-describing in ways that allow machines to integrate it, so the poor human can focus on better things.
* Example of pre-analysed summary data – suitable for human consumption but less so for machine consumption (the more likely consumer in the future).
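The integration chore from the last challenge above can be sketched as a join between proprietary records and an open dataset on a shared key. All field names, keys and values here are hypothetical; the join itself is trivial once keys line up, and the hard, slow part in practice is getting real datasets to share clean, machine-readable keys at all:

```python
# Proprietary sales records (invented).
sales = [
    {"customer": "Tom", "postcode": "AB1", "cans": 6},
    {"customer": "Mary", "postcode": "CD2", "cans": 2},
]

# Imagined open dataset: median property age keyed by postcode.
open_property_data = {
    "AB1": {"median_property_age": 42},
    "CD2": {"median_property_age": 8},
}

# Enrich each sales record with the contextual attributes; records
# whose postcode is missing from the open data pass through unchanged.
enriched = [
    {**row, **open_property_data.get(row["postcode"], {})}
    for row in sales
]
for row in enriched:
    print(row)
```

Self-describing open data would let this lookup-and-merge step happen mechanically across sources, instead of consuming most of a data professional's time.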
Hopefully this summary helps to explain why context is so important to data and to our ability to leverage it for making useful predictions. While this is an overly simple example, the key point is that for accurate prediction, useful relationships have to exist in the data.