An architect's guide: How to use big data
Big data, like all enterprise data, is useful only if it's projected to users through applications. For architects designing or redesigning big data applications, one key question is whether to use service-oriented architecture (SOA) or RESTful application programming interfaces (APIs) to link big data components and services to the rest of the application. Start with the interfaces the big data products expose, then define the big data interfaces from the application side in. Next, consider security and governance, and finally, carefully separate management and access services for whichever APIs are selected.
The primary question when working with big data applications is the nature of the exchange with the big data repository itself. This relationship includes the traditional questions of the level of abstraction that suits applications best, the transaction or state control needed and the necessary security.
Big data tools often have a mixture of APIs, some RESTful and some more SOA-like. This mixture can lead to a confusing picture unless big data interfaces are abstracted as services or resources, rather than exposed directly to applications. That way, only one component needs to be changed -- the big data adapter process -- if big data tools are changed down the line.
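The adapter idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference design: the class names (BigDataBackend, InMemoryBackend, BigDataAdapter), the total_sales method and the "sales/<region>" dataset naming are all hypothetical and do not come from any particular big data product.

```python
from abc import ABC, abstractmethod


class BigDataBackend(ABC):
    """Hypothetical tool-specific driver; one subclass per big data product."""

    @abstractmethod
    def run_query(self, dataset: str, metric: str) -> float:
        """Return a single aggregate for the named dataset."""


class InMemoryBackend(BigDataBackend):
    """Stand-in backend for illustration; a real one would call the product's API."""

    def __init__(self, data: dict) -> None:
        self._data = data

    def run_query(self, dataset: str, metric: str) -> float:
        values = self._data[dataset]
        if metric == "sum":
            return sum(values)
        if metric == "mean":
            return sum(values) / len(values)
        raise ValueError(f"unsupported metric: {metric}")


class BigDataAdapter:
    """The only component applications talk to (assumed design)."""

    def __init__(self, backend: BigDataBackend) -> None:
        self._backend = backend

    def total_sales(self, region: str) -> float:
        # Applications ask a business-level question; the dataset naming
        # and the query mechanics stay inside the adapter.
        return self._backend.run_query(f"sales/{region}", "sum")


adapter = BigDataAdapter(InMemoryBackend({"sales/emea": [120.0, 80.0, 50.0]}))
print(adapter.total_sales("emea"))
```

If the underlying tool changes, only a new BigDataBackend subclass is needed; code that calls total_sales is untouched.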
When to use SOA or REST for big data applications
At the interface between the big data adapter and the rest of the components, use the way big data is consumed as a guide in selecting an API. SOA is appropriate when a component such as a big data repository publishes a specific set of capabilities that are bound to applications. This model can be highly abstract, meaning applications using big data can be completely insulated from the technology and distribution of the data itself.
SOA makes sense in scenarios where applications are expected to use big data mainly through the results of specific analytic or reduction processes. If applications need to know about big data as a resource set without abstraction into high-level services, RESTful interfaces are probably more suitable.
At this high level of application review, it's critical not to fall into the trap of assuming big data applications should be built on RESTful APIs like Apache Hadoop's WebHDFS just because they're available. In general, the best way to link a big data process or appliance with an application is through the highest-level interfaces available, not ones designed for direct file-system-level manipulation. The latter type of interface creates considerable incremental development work, and it's nearly impossible to transfer applications from one big data service or appliance to another if they are written at this level.
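To make the coupling concrete, here is a small sketch of what an application has to know just to read one file through WebHDFS. The URL layout (/webhdfs/v1/<path>?op=OPEN, with user.name for simple authentication) follows the WebHDFS REST API; the host name, port 9870, the user and the file path are assumptions chosen for illustration, and the helper name webhdfs_open_url is hypothetical.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(host: str, port: int, hdfs_path: str, user: str) -> str:
    """Build the URL for a WebHDFS OPEN (read file) request.

    No request is sent here; the point is how much storage detail
    the caller must supply.
    """
    query = urlencode({"op": "OPEN", "user.name": user})
    return f"http://{host}:{port}/webhdfs/v1{quote(hdfs_path)}?{query}"


# Assumed values: the application must know the NameNode address,
# the directory layout and even the individual part-file name.
url = webhdfs_open_url(
    host="namenode.example.com",
    port=9870,
    hdfs_path="/warehouse/sales/2024/part-00000",
    user="analyst",
)
print(url)
```

Every one of those arguments is a detail of one specific deployment. An application written against a higher-level interface would pass none of them, which is what makes it portable to another big data service or appliance.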
Context, state and transactional behavior
A transaction is a context of work, a logical sequencing of steps that from a business perspective make up a process. When deciding whether SOA or REST is best for big data applications, the first question is whether transaction work is contained within the big data component or whether big data references are spread through multiple components in the transaction. In the first case, RESTful interfaces can be applied easily. In the second, some mechanism for transactional state control will be needed to use REST. SOA will work in either case.
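One possible mechanism for transactional state control over stateless REST calls is a transaction ID that every call carries, with a coordinator that tracks which steps have completed. The sketch below assumes that approach; the class names, the three step names and the commit rule are hypothetical and only illustrate the second case described above, where big data references are spread across several components.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TransactionContext:
    """State the stateless calls cannot hold themselves."""
    txn_id: str
    steps_done: list = field(default_factory=list)
    committed: bool = False


class TransactionCoordinator:
    # Assumed business process: each step is a REST call made by a
    # different component, some of which touch the big data repository.
    REQUIRED_STEPS = ("reserve_stock", "score_customer", "record_order")

    def __init__(self) -> None:
        self._contexts = {}

    def begin(self) -> str:
        """Start a transaction and return the ID each call must carry."""
        txn_id = str(uuid.uuid4())
        self._contexts[txn_id] = TransactionContext(txn_id)
        return txn_id

    def record_step(self, txn_id: str, step: str) -> None:
        """Note that a component finished its step for this transaction."""
        ctx = self._contexts[txn_id]
        if ctx.committed:
            raise RuntimeError("transaction already committed")
        if step not in self.REQUIRED_STEPS:
            raise ValueError(f"unknown step: {step}")
        if step not in ctx.steps_done:  # repeated calls are harmless
            ctx.steps_done.append(step)

    def commit(self, txn_id: str) -> bool:
        """Commit only when every required step has been recorded."""
        ctx = self._contexts[txn_id]
        ctx.committed = all(s in ctx.steps_done for s in self.REQUIRED_STEPS)
        return ctx.committed


coordinator = TransactionCoordinator()
txn = coordinator.begin()
coordinator.record_step(txn, "reserve_stock")
coordinator.record_step(txn, "score_customer")
print(coordinator.commit(txn))  # one step is still missing
coordinator.record_step(txn, "record_order")
print(coordinator.commit(txn))
```

When all transaction work is contained within the big data component, none of this is needed, which is why REST applies easily in that case.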
Security and governance are important points in making a big data SOA or REST decision. In SOA, security, access logging and control can be explicit and highly integrated with user directories and application access controls. With REST, security and access control mechanisms likely have to be applied externally. This may be a strong argument for wrapping big data access inside a SOA component even if REST is used at the big data-product level. If a RESTful model is used, determine how to create big data security that will pass formal governance reviews. Most users rely on VPN or SSL-level security and augment it with application access controls applied at the network level.
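The wrapping approach might look like the following sketch, in which a service component checks a user directory and writes an audit record before delegating to whatever performs the underlying REST call. All names here (SecuredBigDataService, AccessDenied, the directory dictionary and the fetch callable) are hypothetical, and the in-memory directory and audit list stand in for a real directory service and log store.

```python
from datetime import datetime, timezone


class AccessDenied(Exception):
    """Raised when the directory does not grant the user the dataset."""


class SecuredBigDataService:
    def __init__(self, fetch, directory: dict) -> None:
        self._fetch = fetch          # callable doing the product-level REST read
        self._directory = directory  # assumed shape: user -> set of dataset names
        self.audit_log = []          # stand-in for a durable access log

    def read(self, user: str, dataset: str):
        allowed = dataset in self._directory.get(user, set())
        # Log every attempt, granted or not, so reviews see the full picture.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })
        if not allowed:
            raise AccessDenied(f"{user} may not read {dataset}")
        return self._fetch(dataset)


service = SecuredBigDataService(
    fetch=lambda dataset: f"rows from {dataset}",  # placeholder for a REST call
    directory={"alice": {"sales"}},
)
print(service.read("alice", "sales"))
```

Because the check and the log entry live in one component, the REST interface underneath never has to be exposed to applications directly.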
Finally, be careful not to expose too much. In most cases, big data services can be divided into two groups -- the actual data access services and the data service platform management and control services. All modern big data architectures will support both, but because platform management and control is typically seen as a technical process rather than as an application process, its support often involves low-level features accessed through simple REST interfaces. Big data management services should rarely be exposed directly to application developers or users because of the risk of creating rather significant errors in results.
Where platform management tools are needed to prepare for big data use, it's best to build these functions into the big data adapter component, creating an easier interface for application developers to use and ensuring necessary platform management steps are always taken before big data analysis begins.
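As a rough sketch of that idea, the adapter can run its platform management steps itself the first time an analysis is requested, so application developers cannot skip them. The class name PreparedAdapter and the two preparation steps are hypothetical placeholders; real steps would call the platform's own management APIs.

```python
class PreparedAdapter:
    """Adapter that guarantees preparation runs before any analysis."""

    def __init__(self, prepare_steps, analyze) -> None:
        self._prepare_steps = list(prepare_steps)  # platform management calls
        self._analyze = analyze                    # the actual data access call
        self._prepared = False

    def _ensure_prepared(self) -> None:
        if self._prepared:
            return
        for step in self._prepare_steps:
            step()
        self._prepared = True

    def analyze(self, query: str):
        # Developers see only this method; preparation is implicit.
        self._ensure_prepared()
        return self._analyze(query)


calls = []
adapter = PreparedAdapter(
    prepare_steps=[
        lambda: calls.append("refresh_indexes"),   # assumed management step
        lambda: calls.append("check_replication"),  # assumed management step
    ],
    analyze=lambda query: calls.append(f"analyze:{query}") or len(calls),
)
adapter.analyze("top_customers")
adapter.analyze("churn_rate")
print(calls)
```

The preparation steps run once, in order, ahead of the first analysis; later calls go straight to the analysis.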
It will be necessary to preserve access to platform management APIs for architects and big data specialists to use in their regular repository maintenance tasks. Some users recommend abstracting low-level platform management APIs to make even platform management practices more portable across multiple implementations, but experience says that it's difficult to create useful general practices at this level. It's likely best to simply allow experts and architects to employ the specific platform management APIs when needed.
When a big data SOA/REST review is completed, always check the results against general policies for the use of these two distinctly different API styles. Any time a tool is created that requires different practices than an established API baseline supports, expect a higher level of risk. Be sure the risk is justified before proceeding.
About the author:
Tom Nolle is president of CIMI Corp., a strategic consulting firm specializing in telecommunications and data communications since 1982.