As organizations move data from monolithic architectures toward microservices and containers, traditional data stewardship practices may be at risk. To maintain data integrity, developers and architects need to restructure their microservices data management practices accordingly.
"With regard to containerization, architects need to be aware that containers are not by default designed to be persistent, [which is] a key requirement for any data-intensive workload," said Matt Aslett, research director of data platforms and analytics at 451 Research.
Standard Docker containers include data volumes that are tied to an individual server. This means that, if a container fails or is moved from one server to another, its connection to the data volume is lost, he said. However, data containers can be used to persist data in a container separate from the application, while Docker volume plug-ins enable users to take advantage of third-party storage systems to create, remove, mount and unmount data volumes, he added.
Two essential data strategies
Edwin Yuen, analyst at Enterprise Strategy Group, agrees that data management is a significant consideration when dealing with microservices and distributed systems. However, there are strategies organizations can use to prevent microservices data from becoming too dispersed to be managed, Yuen said.
First, Yuen said, architects must deal with data consolidation.
"In the monolithic world, this was problematic because of the centralization of the data in a single data center, which could have data access, performance and backup and recovery issues," Yuen said. With distributed cloud services and storage, it is now possible to use hyperscale clouds to provide the data back end for multiple applications and use the scale and replication of data within the hyperscale clouds to better manage them, he explained.
"The break from traditional data management is to move beyond thinking of data as a single set or location and think of it as a data service that can be location-independent," Yuen said.
A second, more advanced approach is to analyze the applications and services and determine whether they need access to the current or primary data set at all, Yuen said. If the application doesn't require the data updated in real time, is not production-related or could otherwise use a copy of the data, the concept of data management and enablement or copy data management can be used.
"These services can get a copy of the data set from primary or secondary storage and allow the application or services to work from that data copy," he said.
Data clones can also be made available, Yuen added. These data copies could be used by developers for data mining, sales or demonstration purposes, but they can also be derived from the primary data store.
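As a rough illustration of the copy-based approach, the Python sketch below hands out independent copies of a primary data set and records who received each one. The `CopyCatalog` class, its method names and the sample order data are hypothetical and exist only for this example; they do not represent any vendor's copy data management product.

```python
import copy
from typing import Any, Dict, List


class CopyCatalog:
    """Hands out tracked copies of a primary data set (hypothetical helper)."""

    def __init__(self, primary: Dict[str, Any]) -> None:
        self._primary = primary
        self._issued: List[Dict[str, str]] = []

    def issue_copy(self, consumer: str, purpose: str) -> Dict[str, Any]:
        # Deep copy so that changes made by the consumer never reach the primary.
        snapshot = copy.deepcopy(self._primary)
        # Record who holds a copy and why, so the copies stay tracked.
        self._issued.append({"consumer": consumer, "purpose": purpose})
        return snapshot

    def issued_copies(self) -> List[Dict[str, str]]:
        # Return a new list so callers cannot alter the tracking record.
        return list(self._issued)


# Made-up primary data set for illustration.
primary_data = {"orders": [{"id": 1, "total": 40}, {"id": 2, "total": 75}]}
catalog = CopyCatalog(primary_data)

# A sales team works from its own copy; the primary data set is untouched.
demo_copy = catalog.issue_copy("sales-team", "demonstration")
demo_copy["orders"].append({"id": 999, "total": 0})
```

In this sketch, the consumer can change its copy freely, while the catalog keeps a simple record of every copy that was issued.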
"The data copies are tracked and managed, but much of the traditional data administration is reduced, allowing for more data distribution without the distributed management requirements," Yuen said.
The importance of data planning
Randy Shoup, former vice president of engineering for Stitch Fix, an online clothing retailer, took an approach similar to Yuen's. Historically, Stitch Fix had one monolithic database, but the company is moving toward a service-based architecture, Shoup said.
"We have been slowly moving toward services, and we aren't done. But that is the normal evolution of any company," Shoup said, citing the experiences of Amazon, eBay and Twitter. "Every one of them started off with some retrospectively ugly monolith because they couldn't initially afford to build for where they would be years in the future."
Even though microservices data is often widely distributed, planning is the key, said Shoup, who is now VP of engineering at WeWork, a provider of shared office spaces.
"I have lots of microservices around my company that need to know about the customer, but I have one place that owns the customer data," he said. For example, Shoup explained that the fulfillment service at Stitch Fix needs the customer address, and the customer support team needs some of that information, too. For Shoup, the key is to have one service that acts as a system of record for the customer.
"Customer service owns the customer and has the canonical representation in its system of record, while every other place that might have that information is considered nonauthoritative, read-only cache," he said. A fulfillment service may retain the information, but it is still just cache: read-only and nonauthoritative. If another entity asks for the customer address, it is just like the domain name system, which is a nonauthoritative cache of information about how to map a path to a website, for instance.
"There is one place that is the canonical representation for Google's IP address, and every other location is just cache," Shoup said.
There are two ways to get necessary data to services that need it. The first approach is synchronous: Other services ask the customer service function for the data in real time. The second approach is event-driven: The customer service function produces an event that the other services subscribe to or listen to, so that, every time there is an address change, for instance, the event informs all the necessary parties.
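The two approaches can be sketched in a few lines of Python. This is a minimal, in-process illustration, not Stitch Fix's implementation: the class names, method names and sample addresses are assumptions made for the example, and a real system would likely use network calls and a message broker in place of direct method calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CustomerService:
    """System of record: the only place that may change a customer address."""
    _addresses: Dict[str, str] = field(default_factory=dict)
    _subscribers: List[Callable[[str, str], None]] = field(default_factory=list)

    def subscribe(self, handler: Callable[[str, str], None]) -> None:
        self._subscribers.append(handler)

    def get_address(self, customer_id: str) -> str:
        # Synchronous path: callers ask the owner in real time.
        return self._addresses[customer_id]

    def set_address(self, customer_id: str, address: str) -> None:
        self._addresses[customer_id] = address
        # Event-driven path: announce the change to every subscriber.
        for handler in self._subscribers:
            handler(customer_id, address)


@dataclass
class FulfillmentService:
    """Keeps a read-only, nonauthoritative cache of customer addresses."""
    owner: CustomerService
    _cache: Dict[str, str] = field(default_factory=dict)

    def on_address_changed(self, customer_id: str, address: str) -> None:
        # Refresh the cached copy whenever the owner reports a change.
        self._cache[customer_id] = address

    def shipping_address(self, customer_id: str) -> str:
        # Serve from the cache; on a miss, fall back to a synchronous lookup.
        if customer_id not in self._cache:
            self._cache[customer_id] = self.owner.get_address(customer_id)
        return self._cache[customer_id]


customers = CustomerService()
fulfillment = FulfillmentService(owner=customers)

customers.set_address("c1", "1 Old Street")          # stored before anyone subscribes
customers.subscribe(fulfillment.on_address_changed)

first = fulfillment.shipping_address("c1")           # cache miss, synchronous lookup
customers.set_address("c1", "2 New Avenue")          # event refreshes the cache
second = fulfillment.shipping_address("c1")          # served from the cache
```

The fulfillment side never writes an address of its own; it only mirrors what the owning service reports.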
"Most [large] systems use a combination of those strategies," Shoup said.
Separate from transactional issues, Shoup said, there is still a need to apply analytics across data. A separate big data implementation or a data warehouse can be periodically populated with canonical data, he said. But it should not become the canonical source -- that would be a return to monolithic practices.
The challenges of consolidation
The need to consolidate microservices data can present challenges in a cloud environment, Forrester analyst Randy Heffner warned.
"While fractured application landscapes have always caused data management difficulties, as data becomes spread far and wide across SaaS apps, cloud platforms [and] IoT landscapes … the cost of bringing the data together again can be driven up by cloud charges for data movement and API access packs, a quota of API calls into one's SaaS app," he said.
Architects must rethink analytical architectures and develop patterns that distribute analysis to wherever the individual pieces of the data reside, Heffner explained. Furthermore, architects must develop and adopt advanced patterns for managing transactions that affect data spread across multiple clouds, such as eventual consistency and compensating transaction models.
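A compensating transaction model can be illustrated with a short Python sketch. It assumes each step of a distributed transaction comes paired with an action that undoes it; if a later step fails, the completed steps are undone in reverse order. The `run_saga` function, the step names and the simulated shipping failure are all hypothetical and chosen only for the example.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, compensating action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run each step in order; on failure, undo the completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        _name, action, _compensate = step
        try:
            action()
        except Exception:
            # Undo the most recently completed step first.
            for _, _, undo in reversed(completed):
                undo()
            return False
        completed.append(step)
    return True


# Hypothetical order flow spread across three services.
log: List[str] = []


def fail_shipping() -> None:
    # Simulated outage in the last service.
    raise RuntimeError("shipping service unavailable")


order_steps: List[Step] = [
    ("reserve stock", lambda: log.append("stock reserved"), lambda: log.append("stock released")),
    ("charge card", lambda: log.append("card charged"), lambda: log.append("card refunded")),
    ("book shipping", fail_shipping, lambda: log.append("shipping cancelled")),
]

succeeded = run_saga(order_steps)
```

Here the failed shipping step leaves the log showing a refund and a stock release, so the customer should not end up with a charge for an order that never ships.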
"They must understand that issues of error handling and resiliency for transaction handling can affect customer experience … if part of the transaction errors out and results in a customer seeing inconsistent data," he added.
Data lake management vendors are particularly helpful when it comes to data consolidation. Data lake management environments help users create catalog-based inventories of data stored in their data lake environments, which are typically based on Hadoop, Aslett said. Key vendors in this space include Alation, Cambridge Semantics, Cask Data, Immuta, Infoworks, Podium Data, Tamr, Unifi Software, Waterline Data and Zaloni. The data lake management environments provided by these vendors have evolved, Aslett said. For example, they often include functionalities such as collaboration and automated recommendations. Some have expanded their ability to catalog data across a wider data management estate, including relational databases and cloud storage.
Aslett said the "incumbent" data management vendors -- such as IBM, Informatica, Hitachi, Talend and even Hadoop distributors -- are making plans to address data sprawl. The Hadoop distributors, in particular, are stepping up their abilities to identify and manage data in environments beyond the Hadoop Distributed File System, Aslett said.
Yuen also thinks the market will start to address these challenges, particularly when it comes to serverless and containers.
"I think we'll also [see] more and more storage-specific services for serverless and container-based systems, like container storage interface initiatives and vendors specializing in storage management for these services," he said.