There has been some revived talk about in-memory databases, partly due to IBM’s release of SolidDB for DB2 (and Oracle’s TimesTen). This is an area I spend a bit of time working in, so let me share a few thoughts.
Initially, in-memory databases sound like a good idea. Existing databases are disk based, disk is slow, memory is fast, so in-memory databases will offer huge performance improvements, right? Well, like everything: sometimes. You see, while existing database servers (SQL Server, Oracle, DB2 etc.) persist data on permanent storage media, usually disk, any database server worth its salt fully utilizes available memory as a cache of disk data. And there is usually a lot of intelligence around how this cache is managed: if physical RAM is smaller than the total size of the database, then the cache (or buffer pool) is managed to ensure it holds the most useful, most actively used portion of the database. To improve your caching capabilities, you increase the amount of memory available.
So normal, On Disk databases are actually doing a lot of work in memory. Many transactional databases achieve cache hit ratios of 98%+, meaning they only actually go to disk for less than 2% of the requested data. However, because every access still travels through this multi-layered caching machinery, pure In Memory databases, which dispense with those layers, can reduce the level of complexity, potentially leading to performance improvements.
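To see why even a 98%+ hit ratio leaves room for improvement, consider the average cost of a single data access. The latency figures below are rough, illustrative assumptions, not measurements of any particular system:

```python
# Assumed latencies: ~100 ns for a buffer-pool (RAM) hit,
# ~10 ms for a random disk read. Real numbers vary widely by hardware.
RAM_NS = 100
DISK_NS = 10_000_000  # 10 ms expressed in nanoseconds

def effective_access_ns(hit_ratio: float) -> float:
    """Average cost of one data access for a given buffer-pool hit ratio."""
    return hit_ratio * RAM_NS + (1 - hit_ratio) * DISK_NS

# Even at 98% hits, the rare disk reads dominate the average cost:
print(effective_access_ns(0.98))  # ~200,098 ns, roughly 2000x pure-RAM speed
print(effective_access_ns(1.00))  # 100 ns
```

The arithmetic shows why the last couple of percent matter so much: the 2% of accesses that miss the cache account for almost all of the average latency.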
Advantages of In Memory Databases
Because the buffer pool can temporarily hold any piece of disk data, every time you access a particular piece of data a mapping and lookup process has to occur to find where in the buffer pool that data is currently located, or whether it is in the buffer pool at all.
A pure In Memory database, on the other hand, usually loads the entire database into memory on startup and references all the database data using known memory addresses, greatly reducing the mapping/locating overhead that occurs on every request. This can yield significant performance improvements, but of course the entire database has to fit into memory.
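The difference between the two access paths can be sketched roughly like this. This is a toy illustration of the concept, not any vendor’s actual implementation:

```python
# On Disk model: the buffer pool is a limited cache, so every access must first
# map a (file, page) identifier to the page's current slot -- or detect a miss.
buffer_pool = {}  # (file_id, page_no) -> cached page bytes

def fetch_from_disk(file_id, page_no):
    # Stand-in for a real disk read.
    return f"page-{file_id}-{page_no}".encode()

def read_page_on_disk(file_id, page_no):
    key = (file_id, page_no)
    page = buffer_pool.get(key)      # hash lookup on every single access
    if page is None:                 # miss: read from disk, possibly evicting
        page = fetch_from_disk(file_id, page_no)
        buffer_pool[key] = page
    return page

# In Memory model: everything is loaded at startup, so records can hold direct
# references to each other, and a read involves no mapping step at all.
class Row:
    __slots__ = ("value", "next")    # 'next' is a direct reference to a Row

def read_next_row(row):
    return row.next                  # the reference *is* the location
```

The In Memory path skips the per-access hash lookup and the miss-handling branch entirely, which is exactly the overhead reduction described above.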
Disadvantages of In Memory Databases
Commonly, core enterprise databases are in the 100-500GB size range (many are much larger), while a server with a “large” amount of RAM today would be in the 32-64GB range. Obviously this vast discrepancy in data volumes makes In Memory databases impractical for most larger mission-critical databases (or those with the potential to grow beyond memory limits).
Additionally, memory is not persistent, which means that if a failure occurs on the server running the In Memory database, changes that have occurred in memory will be lost. To compensate for this, most In Memory databases can log transactions to disk to ensure recoverability. Of course this goes a long way towards undoing the benefits of the In Memory solution (depending on the type of workload), so many In Memory databases offer options to balance risk against performance by reducing the frequency of logging (which increases the amount of changed data that could be lost if a failure occurs). Again, this only suits certain workloads: for many a small window of loss is acceptable, while for many others any form of transaction loss is unacceptable.
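The risk/performance dial can be sketched as a transaction log whose flush frequency is configurable. The class and method names below are illustrative assumptions, not the API of SolidDB, TimesTen, or any other product:

```python
# A minimal sketch of the durability trade-off: flush the log to disk on every
# commit (safe but slower), or only every N commits (faster, but the unflushed
# tail is lost if the server fails).
class TransactionLog:
    def __init__(self, flush_every=1):
        self.flush_every = flush_every   # 1 = strict durability
        self.buffer = []                 # commits not yet written to disk
        self.flushed = []                # commits safely on disk

    def commit(self, txn):
        self.buffer.append(txn)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        self.flushed.extend(self.buffer)   # stands in for fsync to the log file
        self.buffer.clear()

    def at_risk(self):
        """Transactions that would be lost if the server failed right now."""
        return list(self.buffer)

strict = TransactionLog(flush_every=1)     # nothing is ever at risk
relaxed = TransactionLog(flush_every=100)  # up to 99 commits could be lost
```

With `flush_every=1` the window of loss is zero; raising it amortizes the disk write over many commits at the cost of a larger potential loss window, which is the balance described above.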
Suitable Uses of In Memory Databases
So it is clear that In Memory databases do not replace existing, more traditional database server technologies; instead they complement them, offering the ability to improve workloads with fairly specific characteristics.
Ideal workloads for In Memory databases include:
- Read-only datasets – data is read at high volume but changes are very infrequent or non-existent from the application clients (e.g. product web content, product catalogs). Updates may still occur, but if those updates are managed, the cache contents can be refreshed accordingly.
- High-volume data with accepted margins of error. Many datasets do not need to be 100% accurate all day, every day. For example, when collecting web site statistics you might accept the risk of occasionally losing some data as a trade-off for being able to process a higher transaction rate.
More commonly, In Memory databases are being used in conjunction with traditional On Disk databases to provide caching for a nominated set of tables. These tables are cached from the On Disk database; the application connects to the In Memory database, but any requests made against tables not in memory are transparently passed through to the On Disk database. This allows the cached tables to be accessed quickly, while highly transactional tables are passed through, avoiding the limitations of the In Memory database.
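The hybrid arrangement can be sketched as a thin front end that serves nominated tables from RAM and delegates everything else. The class and method names here are hypothetical, invented for illustration:

```python
class HybridFrontEnd:
    """In Memory front end with transparent pass-through to an On Disk server."""

    def __init__(self, disk_db, cached_tables):
        self.disk_db = disk_db
        # Load the nominated tables fully into memory at startup.
        self.cache = {t: disk_db.load_table(t) for t in cached_tables}

    def query(self, table, key):
        if table in self.cache:                  # fast path: answered from RAM
            return self.cache[table].get(key)
        return self.disk_db.query(table, key)    # pass-through to the disk DB
```

The application only ever talks to the front end; whether a given table is served from memory or from disk is invisible to it, which is what makes the caching "transparent".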
The real problem with the hybrid model is that the tables most suitable for caching are not usually the tables that are contributing to the bulk of the performance overhead. The tables causing the bulk of performance overhead are typically the large and highly transactional tables that do not lend themselves to an In Memory model due to their data volatility and strict persistence requirements. While the smaller, and less transactional tables can be cached, often queries issued against the “database” involve both sets of tables (reference data, transactional data) meaning the overhead is still occurring on the On Disk database.
While RAM sizes are growing, database sizes are growing just as fast, if not faster, therefore it is unlikely we will reach a stage where RAM sizes will be adequate to cache all mainstream databases anytime soon.
For In Memory to be truly universal what is needed is a solution that moves beyond caching raw data, and focuses on caching data from a different perspective, further down the processing path. This solves a bunch of issues, but also introduces a whole new bunch!
Stay tuned as I start to share about innovations being made in this area.