Tony Bain


Cloud Databases - Lucy in the Sky with Data – Part 1

July 15, 2008 | Tony Bain

The objective of this article is to describe and then position “Databases in the Cloud” as a solution to issues being faced today, or as a way of driving new value that isn’t being realized today.  I try to avoid specific technology or implementation discussions and focus on the bigger picture.  My background is in Database Management, Enterprise Software Development, SaaS and Web 2.0 application development.



Overview


A good place to start is with some definitions.  These may or may not be considered industry standard, as many of these concepts are still under debate, but they are the definitions I will use throughout this article.

  • Hosted Database Service – An industry standard database server that is made available as a hosted service by a service provider. You connect to a server using a proprietary protocol and control the data and schema contained within a database, but operational management is performed by the service provider.
  • Database as a Service (DBaaS) – An evolution of the above with one key difference: the interface to the database uses a standard SOA protocol (SOAP, REST). Rather than connecting to the database using a proprietary client library directly over TCP, the application connects to the DBaaS using a standard web services protocol and consumes the methods and properties exposed by the service, which are used to issue commands to access and modify the data contained within.
  • Grid Computing – Grids refer to infrastructure that can scale out on demand, distributes load between nodes, has inbuilt fault tolerance, and achieves high levels of scalability through scale out. Grid nodes can often be easily provisioned without downtime and are commonly constructed from industry standard hardware. Clouds are typically built on Grids.
  • Cloud Databases – DBaaS, combined with Grid Computing, combined with utility computing, i.e. all physical infrastructure details are abstracted and the service is accessed using standard web service protocols. Typically Cloud Database service providers offer an abstracted (you access the address of the cloud, rather than a specific server), highly available service with unlimited scalability potential. Internally the cloud is built on a grid with fault tolerance, load balancing and data partitioning capabilities. The utility component allows each user to “pay” for the service based on utilization of the service.
  • Data as a Service (DaaS) – A further evolution of DBaaS with a different focus. DBaaS is focused on making the “repository” of data a service which is accessed by various applications. DaaS is implemented on top of DBaaS to expose the data itself as a service. This is an important future concept which physically will be implemented in Cloud services.
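
To illustrate the DBaaS definition, here is a minimal sketch of what "connecting over a standard web services protocol" might look like from the application side. The endpoint URL, the resource layout and the function name are all hypothetical; no particular provider's API is implied, and the request is only built, never sent.

```python
import json
import urllib.request

# Hypothetical endpoint; no real service is implied and nothing is sent.
BASE_URL = "https://db.example.com/v1/customers"


def build_update_request(record_id, fields):
    """Build (but do not send) an HTTP request that updates one record."""
    body = json.dumps(fields).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{record_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_update_request("42", {"name": "Lucy"})
print(request.get_method(), request.full_url)
```

The point is that the application needs nothing beyond a generic HTTP client, where a hosted database service would require the vendor's client library.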

Clouds vs Hosted RDBMS


For all the major Cloud Database services right now, the differences go well beyond the interface.  Most major Cloud Databases offer a simple, discrete subset of standard RDBMS functionality.  Common RDBMS features such as stored procedures, triggers, transactional consistency across tables, declarative referential integrity (DRI) and object level permissions are typically not available.  In addition, accessing the data using standard SQL is not supported.  Instead, a simple Create, Retrieve, Update and Delete (CRUD) query syntax is provided for accessing individual records, and basic SQL functions such as joins are typically not available.



In many cases a defined schema is not available; instead, an individual record or “entity” can take on its own schema.  At this point in time most Cloud Databases are not RDBMSs.  They are much more closely aligned with Object databases.
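
To make the difference concrete, here is a minimal sketch of the kind of interface described above: a key-addressed store in which each entity carries its own set of attributes and only create, retrieve, update and delete operations are exposed. The class and method names are hypothetical and are not taken from any particular provider.

```python
import uuid


class EntityStore:
    """Toy in-memory stand-in for a schemaless, CRUD-only cloud database."""

    def __init__(self):
        self._entities = {}  # key -> dict of attributes

    def create(self, attributes):
        key = str(uuid.uuid4())
        self._entities[key] = dict(attributes)
        return key

    def retrieve(self, key):
        entity = self._entities.get(key)
        return dict(entity) if entity is not None else None

    def update(self, key, changes):
        self._entities[key].update(changes)

    def delete(self, key):
        self._entities.pop(key, None)


store = EntityStore()
# Two entities in the same store with different attributes: no shared schema.
customer_key = store.create({"kind": "customer", "name": "Lucy"})
order_key = store.create({"kind": "order", "customer": customer_key, "total": 42.5})
store.update(customer_key, {"email": "lucy@example.com"})
```

Note what is missing: there is no query that spans entities, no join and no constraint tying an order to a customer that actually exists.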



Why Is This?


The focus of current cloud implementations has been on achieving the scalability and performance goals of the cloud, and to do this it has been necessary to deprecate features and functionality to ensure the scalability of the platform.



One of the main issues with providing a full relational implementation in a service model is that in an RDBMS it is largely the responsibility of the developer/DBA/architect to manage the resource impact of various commands and the follow-on effects, such as concurrency issues, resource bottlenecks, etc.  If this is not managed properly, it is possible for certain queries on large data sets to impact all queries running on a server through high levels of resource contention.  Complex queries doing large sorts, aggregations and multi-table joins can require large amounts of memory, CPU and I/O to serve.  To manage performance in these scenarios DBAs have the power to add indexes, optimize memory allocations, optimize data placement, profile and tune queries and so on (note this has nothing to do with the relational SQL model itself, just the practicalities of physically implementing it).



This doesn’t play out well in a model where, for all intents and purposes, the infrastructure and the physical location and layout of a database are invisible to the developer of the application.  The possibility of causing high levels of impact to other users of a shared platform is very undesirable to the service provider, as is the requirement to have the customer review and make decisions on the physical layout of the data on the physical infrastructure.



Instead, the current approach of the Cloud Database services is to move much of the processing overhead of intensive operations such as joins, sorts and aggregations to the “client side”, and to allow only discrete, row-focused CRUD operations whose impact is much more predictable.  This reduces the possibility that a single query can cause massive levels of impact to the shared platform.  Because of the scale of the Cloud Databases being offered (TB, PB, billions of rows etc.), the impact of unpredictable processing would be a massive problem.
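
As a rough illustration of what moving the work to the client side means, the sketch below performs the equivalent of a join, an aggregation and a sort in application code, using only single-record lookups.  The `fetch` function and the sample data are hypothetical stand-ins for calls to a cloud database.

```python
from collections import defaultdict

# Hypothetical data as it might sit in the cloud store, addressed by key.
CUSTOMERS = {"c1": {"name": "Lucy"}, "c2": {"name": "Rita"}}
ORDERS = {
    "o1": {"customer": "c1", "total": 10.0},
    "o2": {"customer": "c2", "total": 5.0},
    "o3": {"customer": "c1", "total": 7.5},
}


def fetch(collection, key):
    """Stand-in for one predictable, row-focused retrieve call."""
    return collection[key]


def totals_by_customer(order_keys):
    """Client-side equivalent of a join plus GROUP BY/SUM."""
    totals = defaultdict(float)
    for order_key in order_keys:
        order = fetch(ORDERS, order_key)
        customer = fetch(CUSTOMERS, order["customer"])  # the "join"
        totals[customer["name"]] += order["total"]  # the aggregation
    # The sort also happens in the application, not in the database.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


result = totals_by_customer(["o1", "o2", "o3"])
```

Each call to `fetch` costs the platform a small, predictable amount of work, but the application now makes two calls per order where an RDBMS would have answered one query.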



The reduced functionality is a key difference that allows Cloud databases to provide near linear, unrestricted scalability in a uniform manner.  How this plays out in the future remains to be seen.  There will be improvements made in this area as vendors find means to address such issues.  For example, it is conceivable that you will pay not only for usage but also for the size of your database processing pipeline.  If you issue a query that causes a high level of impact but only have a small pipeline, then your query will take much longer to process than if you had purchased a much larger pipeline option.




Doesn’t this Limit Cloud Databases?


Yes.  Right now you cannot simply decide to unplug an existing application from an onsite RDBMS and plug it back into a cloud database.  Existing RDBMS applications will currently be incompatible with most Cloud Database offerings, and the application code must be changed before it can use one.


What Other Issues Limit Cloud Databases?



There are a number of other reasons why use of Cloud Databases is limited currently and will remain limited for some time to come.  These reasons include:


Latency



The response time between an application and a database server is key for application performance, and directly impacts productivity in a lot of industries.  Current day database applications typically measure response time in milliseconds, and many applications process hundreds or thousands of application requests a second.  For a database on a LAN near an application latency is less of an issue, but for a Database Cloud hosted remotely, latency is a key problem that will dictate application architecture and limit Cloud usage.
In such a situation, regardless of how good your ISP is, you are not going to have the same bandwidth available to you as you currently do on a 1 Gb switched backend server network.  Once you go out to the web your request is passing through firewalls, routers and multiple providers, then to the Cloud Database provider itself and back again.  This will be hundreds if not thousands of times slower than what is possible with your database server sitting next to your application server.
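
A back-of-the-envelope calculation shows why this matters for chatty applications.  The round-trip times and the number of calls below are illustrative assumptions, not measurements.

```python
def total_wait_ms(round_trips, latency_ms):
    """Time spent waiting on the network when calls are issued one after another."""
    return round_trips * latency_ms


# Assumed figures for illustration only.
LAN_LATENCY_MS = 0.5    # database next to the application server
WAN_LATENCY_MS = 100.0  # database in a remote cloud

calls_per_page = 50  # sequential database calls to build one page
lan_wait = total_wait_ms(calls_per_page, LAN_LATENCY_MS)
wan_wait = total_wait_ms(calls_per_page, WAN_LATENCY_MS)
print(f"LAN: {lan_wait} ms, WAN: {wan_wait} ms")
```

Under these assumptions a page that spends a few tens of milliseconds waiting on a local database would spend several seconds waiting on a remote one, which is why the application has to be designed to make fewer, larger or parallel requests.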


Security



Putting your data out there in a Cloud somewhere obviously has many security concerns that go with it: concerns around who in the service provider has access to the data, concerns around the robustness of the service provider’s security model, and concerns around the security of the data in transit.


Availability


The availability of the Cloud will typically be very good, however if the Cloud is external the availability of the network and all the components between you and the Cloud provider must be considered.  You will likely pass through a bunch of ISPs, dozens of routers, firewalls, switches etc.  While your Cloud provider will likely provide you a service level commitment, you may not be getting the same level of commitment from your ISP.


So why use Cloud Databases Then?


If the major draw card for Cloud Databases was simply the ability for organisations to outsource their infrastructure requirements, then you would be quite right in concluding that hosted infrastructure services have been around for many years and have only made a minor impact, mostly in the SME market.  So what is it with Cloud Databases that is making them an area of great interest? 


Clouds for Web 2.0



Clouds fit Web 2.0 applications well for a couple of reasons.  Firstly, scale.  Clouds provide the promise of near unlimited scalability, so as a creator of such a service, if your service takes off then your Cloud Database service should be able to cope as your application grows to hundreds, then millions of users.
Implementing such scalability internally is now much harder than it used to be because of the speed of internet growth; your application can literally go ballistic overnight.  Adding infrastructure in response to customer demand in near real time is a difficult, costly and worrisome area for most start ups.  We all know what can happen if you get it wrong, such as the recent very public issues that Twitter has been having.



Secondly, cost to entry.  To implement a large scale, robust “Cloud style” infrastructure in house is going to require some very high levels of expertise and a lot of capital.  For a start up on a shoestring, investing in database infrastructure to support 1 million concurrent users when you currently have 5 concurrent users is very difficult to justify.  Using a Cloud Database service allows you to pay for what you are using and then scale up your service quickly as your business takes off.



Of course a Database Cloud is only one layer in the application architecture so scalability issues surrounding other layers also need to be addressed in your architecture.


Don’t Cloud yourself Into a Corner


My biggest concern around Cloud offerings for Web 2.0 at present is ensuring you don’t cloud yourself into a corner.  What I mean by this is that while Cloud offerings may meet your immediate application requirements, you should think ahead about the ways in which you will derive value from your data once you have amassed lots of it.  Aggregating data, analyzing data and doing things like data mining (to provide recommendations, up-sell etc.) will later on become areas that you will want to explore to provide smarter functionality back to your users.  Make sure the Cloud Database service you go with allows you to leverage your data appropriately in the future, or else you will find yourself running into a brick wall in terms of your value cycle.


Clouds for the Enterprises



Clouds for the Enterprise are interesting as there is no specific reason why a Cloud cannot exist within the walls of the enterprise, a sort of corporate Database Cloud for internal use.  From an enterprise perspective, this is really combining Grid computing with DBaaS to provide a platform which enterprise applications can hook into and be free of any specific scalability concerns.


External Clouds



External Clouds are what most people are referring to when they discuss Cloud Databases in an enterprise context.  Obviously Cloud Databases right now are not a replacement for business critical database servers in such an environment.  The latency, security and network availability issues ensure that is not practical.  So why then would a corporation consider a Cloud Database offering?
Many enterprises have requirements to interact with external users and systems through applications and integration processes (edge computing).  Whether it be a web site, web store or b2b integration process, some level of data often needs to stray outside the enterprise walls.  In these scenarios hooking into a robust cloud service may be preferable to placing corporate assets outside the firewall.



Summary


On reviewing this article I have found the following points that require further thought, investigation and elaboration:


  • Cloud Databases for Web 2.0 / web based start ups are an attractive option as they allow for rapid growth and low cost to entry. However, if you go down this path, what are the barriers to deriving value from your data assets?
  • The Cloud Database scalability requirements are met by deprecating most core RDBMS functionality, functionality which has been seen as critical in an RDBMS and has been built up over the last 30 years. So what is the true impact of this?
  • Cloud Databases for Enterprise are interesting when you bring the Cloud internal, but have weak positioning when talking about external Clouds. Most of the talk about Clouds for enterprise refers to external services, so their attraction seems understandably low. Edge computing and b2b use generates little excitement in me, so really I want to investigate this area further to see if there is a stronger potential here.
  • The benefit of adding the SOAP/REST interface isn’t really explained in any detail in this article. Yeah, it is a generic protocol that allows varied platforms to interact with the Cloud service, but there hasn’t really been an issue with all major application platforms having access to SQL Server, Oracle, DB2 or MySQL servers in the past. So what else is the value that this is providing?


So it is clear there is a need to delve into a lot of these areas in much more detail.  Many of the benefits of this model will become more solid as we start to describe more of the concepts surrounding Data as a Service (DaaS) and start to position real solutions in response to real problems being faced today.