In-memory data grids have shown promise in speeding up processing and analysis for large, scalable applications. They often provide a useful middle tier between back-end databases and front-end Web applications. Now, they are beginning to appear in so-called big data applications as well.
Still, software architects have to carefully consider limitations when applying in-memory data grids (IMDGs). They work well where a specific set of data is accessed regularly. But where diverse data is read only infrequently, the IMDG may not be the best avenue.
Sometimes, the need for SQL data systems of record means that the IMDG will take a back seat to the traditional relational database (RDB). Hardware configurations, too, can affect how well IMDGs fit into overall system architecture. Finally, the responsibility for managing the IMDG can sometimes become an issue between different parts of IT departments.
The Co-Evolution of NoSQL and Data Grids
In many cases, in-memory data grids and NoSQL databases address the same application challenges. Cameron Purdy, vice president of development at Oracle, said, "One of the interesting trends is the convergence of data grid technology built up around domain models, real-time highly scalable access and management of information, and the other class of apps called NoSQL. There is a strong correlation between NoSQL and data grids. Some of the shared concepts have to do with partitioning, scale-out, resiliency by redundancy."
One of the fundamental differences is that data grids tend to delegate system-of-record responsibilities to a separate database. For example, 80% of Tangosol's early customers used Oracle even before Tangosol was acquired by Oracle, so they had a durable and reliable database management system on the back end. The data grid itself was the live working system.
Data grids started with high-speed in-memory and moved to being able to persist and provide durability of information. In contrast, NoSQL started out with durable storage and then grew into a high-speed in-memory system. "In some senses, the technologies are starting to converge," explained Purdy.
There are two especially important ways that IMDGs help to reduce application latency, said Purdy, who was one of the key developers of in-memory data grid technology at Tangosol, now part of Oracle. One is by reducing network and disk-based communication, and the other is by staging the data in an object format that works better with applications.
Purdy explained, "IMDGs are good for many things, but if used for the wrong problem they will work against you. They are good at working with data sets in memory. Processing petabytes of information is not something you would try and stick in memory. A lot of specialized systems are better at dealing with analysis of information on that scale."
Another factor is the data model being used. Purdy said, "If the application is working with data as objects, it works beautifully with the data grid. But if the application is treating data as SQL data, it is probably best to use a SQL database. If the app is talking SQL, using a data grid to accelerate that typically gives poor results."
Partitioning the data and data affinity are some of the concepts that lend themselves well to data grid architectures. Data grids work best when there is a good domain model to work from. "Applications with a weaker data model or no data model have the greatest problems adopting data grids," said Purdy. The problem also needs to be easily partitioned across multiple servers.
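To make the ideas of partitioning and data affinity concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular data grid product works: the `PartitionedGrid` class, the node names and the `affinity_key` parameter are all hypothetical, and each "node" is just a dictionary in one process. The point it shows is that hashing a key picks the owning node, and that storing an order under its customer's affinity key keeps related objects on the same node.

```python
import hashlib


def partition_for(affinity_key: str, partition_count: int) -> int:
    """Map a key to a partition number with a stable hash."""
    digest = hashlib.sha256(affinity_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


class PartitionedGrid:
    """Toy grid: every 'node' is a plain dict held in this process."""

    def __init__(self, node_names, partition_count=16):
        self.partition_count = partition_count
        self.node_names = list(node_names)
        self.stores = {name: {} for name in self.node_names}

    def owner_of(self, affinity_key):
        # Partitions are assigned to nodes round-robin (an assumption
        # made for this sketch, not a rule taken from any product).
        partition = partition_for(affinity_key, self.partition_count)
        return self.node_names[partition % len(self.node_names)]

    def put(self, key, value, affinity_key=None):
        # Without an explicit affinity key, the entry's own key decides.
        node = self.owner_of(affinity_key or key)
        self.stores[node][key] = value
        return node

    def get(self, key, affinity_key=None):
        node = self.owner_of(affinity_key or key)
        return self.stores[node].get(key)


grid = PartitionedGrid(["node-a", "node-b", "node-c"])
customer_node = grid.put("customer:42", {"name": "Ada"})
# The order is routed by its customer's key, so both land on one node.
order_node = grid.put("order:9001", {"total": 120}, affinity_key="customer:42")
print(customer_node == order_node)
```

With the customer and its orders on one node, work that touches both can run where the data lives instead of pulling it across the network, which is why an application with a clear domain model is easier to partition.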
"Distributed caches only make sense if the application itself is distributed -- namely, runs on more than one machine for scalability and/or availability reasons," said Ron Pressler, founder and CEO of Parallel Universe, an Israeli technology company. "If a distributed cache is to provide low-latency access to data, it must be close to the running application code -- i.e., reside on the same machines the application runs on, or be replicated to provide scalability. In either case, the cache will store its data on multiple machines."
An application's data access behavior can play an important role in determining whether data grids are the best solution. Pressler explained, "If the application displays consistent data-access patterns, like accessing a subset of the data over and over, or accessing easily identified groups of items at the same time, we say that the application has a good data locality, and caches will significantly improve its performance. For applications employing distributed caches, the biggest challenge is increasing data locality. This is usually done by thinking hard about the domain and figuring out exactly what to cache, as well as how to represent the cached data -- e.g., by de-normalizing it."
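The effect of data locality that Pressler describes can be illustrated with a small simulation. The sketch below is a simplified, single-process stand-in for a cache, not a distributed one, and every number in it (cache capacity, key-space size, the 90/10 split between hot and cold keys) is an assumption chosen for illustration. It compares the hit ratio of a least-recently-used cache under a skewed access pattern with the hit ratio under uniformly random access.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = load(key)                      # stand-in for a slow back-end read
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        return value

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(pick_key, accesses=20_000, capacity=100):
    cache = LRUCache(capacity)
    for _ in range(accesses):
        cache.get(pick_key(), load=lambda k: f"row-{k}")
    return cache.hit_ratio()


rng = random.Random(7)   # fixed seed so repeated runs behave the same
KEY_SPACE = 10_000
HOT_KEYS = 100


def skewed_key():
    # Good locality: 90% of reads go to a small hot subset.
    if rng.random() < 0.9:
        return rng.randrange(HOT_KEYS)
    return rng.randrange(HOT_KEYS, KEY_SPACE)


def uniform_key():
    # Poor locality: every key is equally likely.
    return rng.randrange(KEY_SPACE)


skewed_ratio = simulate(skewed_key)
uniform_ratio = simulate(uniform_key)
print(f"skewed access hit ratio:  {skewed_ratio:.2f}")
print(f"uniform access hit ratio: {uniform_ratio:.2f}")
```

Under these assumptions the skewed workload is served mostly from the cache, while the uniform workload almost always falls through to the back end, even though the cache is the same size in both runs.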
In-memory data grids are starting to play a role in real-time big data analysis in domains like finance, in cases where there is a live data set with terabytes of hot data. Although they could be used in other areas like reconstructing users' sessions from Web logs, Purdy said that other approaches, such as Hadoop, might be a better fit. IMDGs are also emerging as an enabling technology for scalable cloud apps.
"Data grids are one of the critical building blocks for cloud infrastructure," said Purdy. IMDGs promise to simplify the process of scaling up and down as needs change. For example, if you are using EC2 and add 100 servers, a data grid that scales across those servers becomes invaluable, because all of the servers see the same live information and know about each other instead of behaving as 100 different apps. Without the ability to manage, visualize and access data in a safe and reliable manner, building systems in the cloud becomes more difficult.
The lack of a standard interface is one significant problem. To address this need, the industry is rallying behind the forthcoming javax.cache standard under the auspices of JSR 107.
"For cloud computing, we have seen all major IaaS vendors introduce caching as a service. So it is a standard part of the infrastructure. And Java EE 7, which is aimed at cloud deployment, will include javax.cache for the same reason," said Greg Luck, Terracotta chief technology officer and original developer of Ehcache, an open source, standards-based cache.
Like others, Luck sees a natural affinity between IMDGs and cloud computing. The cloud, he notes, has a bias against disk storage and toward the solid-state memory that IMDGs use.
"Disk is slow, but in the cloud disk is usually much slower due to virtualization and sharing of NICs, network connections and physical drives," he said. But IMDGs are not a good fit for all large applications. Luck said, "Caching works well when the same data is read multiple times."
Many types of data lend themselves well to in-memory data grids, but some may require special handling. Jason Andersen, senior product manager at Red Hat responsible for the JBoss Enterprise Portal Platform, said, "If there is a regulatory or auditing requirement, there may need to be a means to push the data into a disk-based data solution like a relational database. We see this often where the data grid is used as what we call a database pacemaker, where the data grid is there to speed up access between an application and a database; but ultimately, the system of record is the database."
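The arrangement Andersen describes, with the grid speeding up access while the database remains the system of record, is often built as a write-through cache. The following Python sketch shows the pattern under loud assumptions: the `WriteThroughGrid` class, the `accounts` table and the account IDs are hypothetical, a plain dictionary stands in for the grid, and an in-memory SQLite database stands in for the relational database. It is not taken from any vendor's product.

```python
import sqlite3


class WriteThroughGrid:
    """Dictionary-backed cache in front of a relational system of record."""

    def __init__(self, connection):
        self.db = connection
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS accounts "
            "(id TEXT PRIMARY KEY, balance INTEGER)"
        )
        self.memory = {}    # stands in for the in-memory grid
        self.db_reads = 0   # counts reads that had to go to the database

    def put(self, account_id, balance):
        # Write-through: commit to the database first, then update memory,
        # so the durable copy is never behind the cached copy.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO accounts (id, balance) VALUES (?, ?)",
                (account_id, balance),
            )
        self.memory[account_id] = balance

    def get(self, account_id):
        if account_id in self.memory:
            return self.memory[account_id]
        # Read-through on a miss: load from the system of record.
        self.db_reads += 1
        row = self.db.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if row is None:
            return None
        self.memory[account_id] = row[0]
        return row[0]


db = sqlite3.connect(":memory:")
account_grid = WriteThroughGrid(db)
account_grid.put("acct-1", 250)

account_grid.memory.clear()           # simulate losing the in-memory copy
first = account_grid.get("acct-1")    # reloaded from the database
second = account_grid.get("acct-1")   # served from memory
print(first, second, account_grid.db_reads)
```

Because every write reaches the database before it is visible in memory, auditors can rely on the database alone, and the in-memory copy can be discarded and rebuilt at any time.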
Another issue lies in assigning responsibility for managing the IMDG. Andersen noted, "Data is traditionally managed by DB admins and operators while caches are typically part of a given application architecture. We have seen some cases where there has been a small struggle to determine who will be responsible for the solution."