Distributed in-memory data grids (IMDGs) let you process very large operational data sets in memory, delivering fast performance and data consistency across many workloads. Companies have most recently found uses for them in emerging "big data" applications, but they have long been used in a variety of Web and distributed applications where performance matters.
With ever-greater demand for processing capability, the emergence of cloud computing and the number crunching associated with big data, data grids have gained more attention and are now available from vendors large and small.
A data grid sits between data and application, providing a short-term repository that enhances performance by speeding access and eliminating bottlenecks. Driven by demand for middleware performance, in-memory data grids are becoming a mainstream enhancement to middleware and are being applied beyond the database tier, where they first found use.
Data grids are not yet so mainstream that case studies abound, but examples have arisen, such as the following:
- An online sports site uses a data grid to eliminate bottlenecks and ensure ample caching capacity.
- An ISP shortens 80% of its technical customer service calls by an average of one minute with an in-memory data grid.
- A bank speeds intraday credit risk assessment by deploying a data grid and achieving a latency of less than five milliseconds for 99.9% of the credit checks.
Many traits of grids seem apt for the new arena of cloud computing. Elastic scaling is often discussed in relation to the cloud, and it operates much like a data grid. John Rymer, an analyst at Forrester Research Inc., said he prefers the more descriptive term "elastic caching" to "in-memory data grid."
Elastic caching captures a particularly useful characteristic of data grids, he said, citing a report on the subject that he co-authored with Forrester analyst Mike Gualtieri. That report, "The Forrester Wave: Elastic Caching Platforms, Q2 2010," marks the data grid as an important option for accommodating sudden or periodic fluctuations in compute load.
Rymer said the products in the space are getting a lot of interest now -- not just for database bottlenecks, where they first found renown -- but also for Web applications and big data use cases. For those interested in the data grid approach, the good news, noted Rymer, is that there are a lot of good choices.
Players include Oracle, which has Coherence (a product the company acquired); IBM, which has WebSphere eXtreme Scale; Software AG, with its Terracotta Ehcache (also acquired); GigaSpaces, which is active with its eXtreme Application Platform (XAP); GridGain Systems, which offers GridGain; Red Hat, with JBoss Data Grid 6; and ScaleOut Software, with its various ScaleOut Servers.
Picking and choosing IMDGs
A lot of the grids' benefits stem from the advantages RAM has over disk storage. "With memory getting cheaper by the day, now you can easily load several terabytes of data in a moderate-size 20- to 40-node grid, depending on the amount of RAM you have available," said Dmitriy Setrakyan, CTO, GridGain Systems. "IMDGs enable you to process quite large operational data sets in memory, giving lightning fast performance and data consistency across the whole system."
"Essentially, you can use them not only to address performance bottlenecks, because in-memory reads are a lot faster than reading from disk, but also to make sure updates remain consistent within the whole system," Setrakyan told SearchSOA.com. This becomes extremely important when data needs to be updated in consistent fashion, he said.
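Setrakyan's point about fast in-memory reads combined with consistent updates maps to the cache-aside read pattern paired with write-through writes. The Python sketch below is purely illustrative: the class and key names are invented, and a single in-process dict stands in for a distributed grid.

```python
import time

class BackingStore:
    """Stand-in for a slow system of record (e.g., a relational database)."""
    def __init__(self):
        self.rows = {"acct:1001": {"balance": 250}}

    def read(self, key):
        time.sleep(0.01)  # simulate disk/network latency on every read
        return self.rows.get(key)

    def write(self, key, value):
        time.sleep(0.01)
        self.rows[key] = value

class CacheAsideGrid:
    """Toy in-memory cache: reads fall back to the store on a miss,
    writes go through to the store so cache and store stay consistent."""
    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, key):
        if key not in self.cache:           # miss: pay the slow read once
            self.cache[key] = self.store.read(key)
        return self.cache[key]              # hit: served from RAM

    def put(self, key, value):
        self.store.write(key, value)        # write-through keeps the store current
        self.cache[key] = value             # then refresh the cached copy

grid = CacheAsideGrid(BackingStore())
grid.get("acct:1001")                       # first read misses and loads from the store
grid.put("acct:1001", {"balance": 300})
print(grid.get("acct:1001"))                # later reads are fast and consistent
```

A real IMDG does this across many nodes with replication and locking; the sketch only shows why keeping the cache and the system of record in step matters.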
Jason Andersen, director of product marketing at Red Hat, said to consider multiple factors when deciding whether to implement a data grid. "The people that will use it are important, and the use case is important, as is the type of data you are using," he said. For example, if you are working with information that requires a lot of ad hoc queries and fits well with the relational data paradigm, that is not an ideal candidate for a grid, although a data grid can still work in some cases.
On the other hand, you may be working with requirements for high-performance read-write access to ephemeral data that needs to be shared by different users and workloads and can be discarded at the end of the day. That is what Andersen calls a "fantastic fit" for the IMDG.
It could be something like real-time tracking where you are trying to optimize routes in a logistics system, he explained. Another ideal use case for data grid adoption is someone running large-scale financial simulations, he said.
Although data grids have been around for a while, new applications have made them more mainstream in recent years: there is simply more demand today for performance and for data processing. Some organizations that have experimented with NoSQL data architectures have found, in some cases, that those systems can't keep up with the data being processed -- yet another reason to experiment with data grid technology.
Shawn McAllister is CTO of Solace Systems Inc., a content networking company that manufactures and sells middleware appliances. He said, as with any complex IT problem, the first step in deciding whether to consider a data grid is defining the problem you're trying to solve. What ramifications will the selection have on how you can address future similar (or even dissimilar) requirements?
Here is a series of questions to ask:
- Where are the sources of record?
- Are they localized or highly distributed?
- What big data technologies will be plugged into the data grid?
- What data dependencies need to be considered?
- How will data be synchronized between geographic locations?
- How will temporary data inconsistencies be handled?
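The last two questions on that checklist -- cross-site synchronization and temporary inconsistency -- can be made concrete with a last-writer-wins sketch. Everything below is illustrative: the `Replica` class, key names and explicit logical versions (in practice a Lamport or vector clock) are invented for this example.

```python
class Replica:
    """Toy geo-replica: each entry carries a logical version so that
    concurrent writes at different sites converge deterministically."""
    def __init__(self):
        self.data = {}   # key -> (version, value)

    def local_write(self, key, value, version):
        """Record a write stamped with a logical version (e.g., a Lamport clock)."""
        self.data[key] = (version, value)
        return self.data[key]

    def apply_remote(self, key, versioned):
        """Merge a replicated write: the higher version wins; older writes are dropped."""
        current = self.data.get(key)
        if current is None or versioned[0] > current[0]:
            self.data[key] = versioned

east, west = Replica(), Replica()
stale = east.local_write("profile:7", "v1", version=1)
fresh = west.local_write("profile:7", "v2", version=2)
east.apply_remote("profile:7", fresh)   # newer remote write replaces the local copy
west.apply_remote("profile:7", stale)   # older write is ignored
# Both sites now agree on "v2" despite the temporary divergence.
```

Real grids layer conflict resolution, WAN batching and durability on top of this idea, but the version-comparison step is the heart of how temporary inconsistencies get resolved.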
"The increasingly distributed nature of information also introduces the need to carefully consider requirements in the area of high availability, fault tolerance, disaster recovery and business continuity," he said. High availability has long been a chief selling point for the grids.
William Bain, founder and CEO of ScaleOut Software Inc., which focuses on distributed data grid software (cross platform and within the compute cloud), agrees that things are evolving very rapidly for data grids as more and more organizations run into processing challenges. "Any app that needs to be highly scalable and must deal with rapid changes in data is a candidate for in-memory data grids," he said.
"Database servers aren't designed for that, and they generally become bottlenecks," said Bain. His company offers a scale-out approach that runs on Linux and Windows and supports both Java and .NET. It is also oriented toward the cloud where, he noted, there are particular challenges for running data grids.
Full steam for data grids
Although data grids have grown in popularity, Forrester's Rymer said they aren't the only or necessarily even the most popular option. That honor goes to Memcached, an open source caching product widely used at Facebook and other large websites. Rymer explained that Memcached is distributed but not elastic. Originally developed by Danga Interactive, it works by caching data and objects in RAM to reduce the number of times an external data source (such as a database or API) must be read. It relies on a hash table distributed across multiple machines. When the table is full, subsequent inserts cause older data to be purged.
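Memcached's purge-on-full behavior can be sketched as a small least-recently-used (LRU) cache. This is an illustrative stand-in, not Memcached's actual implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Bounded cache: when full, an insert purges the least-recently-used
    entry, mirroring the eviction behavior described for Memcached."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # insertion order doubles as recency order

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # purge the oldest entry

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")          # touch "a" so it becomes most recent
cache.set("c", 3)       # capacity exceeded: "b" is evicted
print(cache.get("b"))   # None
print(cache.get("a"))   # 1
```

In the real system the table is also partitioned across servers (clients hash each key to pick a server), which is what makes Memcached distributed even though, as Rymer notes, its capacity is fixed rather than elastic.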
However, noted Rymer, "Since Memcached is not elastic, you can't expand or contract its capacity on the fly. But because it is open source and free, it is very popular." Still, traits such as expansion on the fly may be enough to keep commercial data grids growing as an option.
The bottom line, McAllister said, is that enterprise IT systems must become higher-scale, more fluid and more real-time. "The last generation of SOA-style database-accessible-by-middleware applications is insufficient for many reasons," he said.
First, the types of applications moving to data grids are combining data from many systems, which he said are awfully slow using SOA-style techniques. "The more times you fetch this data, the more of a burden you put on your systems of record, which slows all your applications down," McAllister said.
Second, data is rarely localized to one location, and modern data grids can exploit sophisticated WAN optimization middleware in ways that SOA and databases cannot. For example, he noted, "Cloud or mobile service providers will migrate operational data from a remote source to be physically close to the user to improve performance. Internet companies like Google and Amazon have been caching frequently used data in memory, scattered throughout the world, for more than a decade. Commercial data grid technologies give that same kind of architectural freedom to enterprise architects," he said.
Third, McAllister said a single change to a piece of underlying data may affect 50 or 100 cascading derivative calculations done in analytics or complex event processing engines. "Data grids make those operations in-memory calculations that can execute orders of magnitude more quickly than the SOA-style database jobs that were in vogue five years ago. [That's] when data volumes were more manageable and consumer or customer expectations were more relaxed," he said.
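McAllister's cascading-recalculation scenario amounts to a dependency graph held in memory: one change to a base value re-runs every derived calculation downstream. The sketch below is hypothetical (the `DerivedGrid` class, formulas and names are invented to illustrate the idea, not any vendor's engine):

```python
class DerivedGrid:
    """Toy in-memory dependency graph: updating one base value re-runs
    every derived calculation that depends on it, entirely in RAM."""
    def __init__(self):
        self.values = {}
        self.formulas = {}    # derived name -> (function, list of input names)
        self.dependents = {}  # input name -> set of derived names

    def set_value(self, name, value):
        self.values[name] = value
        self._recompute(name)

    def define(self, name, fn, inputs):
        """Register a derived calculation and compute its initial value."""
        self.formulas[name] = (fn, inputs)
        for i in inputs:
            self.dependents.setdefault(i, set()).add(name)
        self.values[name] = fn(*(self.values[i] for i in inputs))

    def _recompute(self, changed):
        for derived in self.dependents.get(changed, ()):
            fn, inputs = self.formulas[derived]
            self.values[derived] = fn(*(self.values[i] for i in inputs))
            self._recompute(derived)   # cascade to downstream calculations

g = DerivedGrid()
g.set_value("spot", 100.0)
g.define("exposure", lambda s: s * 1000, ["spot"])       # depends on spot
g.define("var_limit", lambda e: e * 0.05, ["exposure"])  # depends on exposure
g.set_value("spot", 102.0)   # one change cascades through both derivatives
print(g.values["var_limit"])
```

Because every node of the graph lives in RAM, the cascade is a chain of function calls rather than a chain of database round trips, which is where the orders-of-magnitude speedup McAllister describes comes from.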
Lastly, he noted that the demands of operating a large in-memory data grid are too great for the kinds of message queue (MQ) or Java Message Service (JMS) middleware that architects were using in the database era. "New approaches are needed to load the data grid and move data between data grids without bottlenecking in the middleware layers," said McAllister.