Design patterns have caught on as a way to simplify the development of software applications. As organizations begin to tackle building applications that leverage new sources and types of data, design patterns for big data promise to reduce complexity, boost integration performance and improve the results of working with new and larger forms of data.
"Design patterns, as proposed by Gang of Four [Erich Gamma, Richard Helm, Ralph Johnson and John Vlissides, authors of Design Patterns: Elements of Reusable Object-Oriented Software], relates to templates and guidance frameworks for solving recurrently occurring problems," said Derick Jose, director of Big Data Solutions at Flutura Decision Sciences and Analytics. "In the big data world, there are some recurrently occurring problems which need design pattern templates for solutioning."
The best design pattern depends on the goals of the project, so there are several different classes of techniques for big data, Jose said:
- Design patterns to mash up semistructured data (e.g., medical transcripts, call center notes) with structured data (e.g., patient vectors).
- Design patterns to look for event sequence signals in high-velocity event streams (e.g., "What sequence of alarms from firewalls led to a network breach? What sequence of patient symptoms resulted in an adverse event?").
- Design patterns to respond to signal patterns in real time and pass the response on to operational systems.
- Design patterns for matching up cloud-based data services (e.g., Google Analytics) to internally available customer behavior profiles.
Jose explained that it is helpful to think about the architecture of a big data project as an information value chain with multiple layers: input (taming beastly log files and unstructured data), pattern detection, and operationalizing or outputting data. Each of these layers has recurring challenges that call for patterns of their own.
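To make the event-sequence pattern from the list above more concrete, here is a minimal Python sketch of the pattern-detection layer. It is only an illustration under stated assumptions: the event names, the 60-second window and the function `find_sequences` are hypothetical and do not come from Jose or from any particular product.

```python
def find_sequences(events, pattern, window_seconds):
    """Yield (start_time, end_time) whenever the names in `pattern` occur
    in order within `window_seconds`.

    `events` is an iterable of (timestamp, name) tuples sorted by time.
    Other events may be interleaved between the steps of the pattern.
    """
    partial = []  # open matches as (start_time, index_of_next_expected_step)
    for ts, name in events:
        still_open = []
        for start, idx in partial:
            if ts - start > window_seconds:
                continue  # match took too long, drop it
            if name == pattern[idx]:
                idx += 1
                if idx == len(pattern):
                    yield (start, ts)  # full sequence seen
                    continue
            still_open.append((start, idx))
        if name == pattern[0]:
            if len(pattern) == 1:
                yield (ts, ts)
            else:
                still_open.append((ts, 1))  # start a new candidate match
        partial = still_open


if __name__ == "__main__":
    # Hypothetical firewall alarm stream: (seconds, alarm name).
    stream = [
        (0, "port_scan"), (5, "login_failed"), (7, "heartbeat"),
        (20, "config_change"), (100, "port_scan"), (500, "login_failed"),
    ]
    wanted = ["port_scan", "login_failed", "config_change"]
    for start, end in find_sequences(stream, wanted, window_seconds=60):
        print(f"suspicious sequence between t={start} and t={end}")
```

Because the function is a generator that keeps only the open candidate matches in memory, it can consume a stream one event at a time; a production system would more likely hand this job to a dedicated stream-processing engine.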
Keeping an eye on governance, risk management and compliance
Organizations need to think strategically about the opportunities for leveraging new data sources while minimizing new liabilities and risks. "People have not done a good job thinking about Big Data governance and archiving strategy," said Dave Beulke, database consultant and trainer, and president of Data Management Association-North Capital Region (DAMA-NCR). "Security, governance and archiving are critical for these big data things," he said. The best practices are similar to those used for traditional databases, largely focusing on defining roles and responsibilities for people in various departments.
Big data growth poses access, monitoring problems
Scaling issues associated with the growing need for access to data are a modern and tough challenge. New sources of data can be 10 or even 1,000 times as large as a traditional database. Without a good strategy in place, especially for archiving, organizations run into problems with data retention, privacy and other traditional data management issues, said Beulke. "Just because it is big data does not mean that you can bypass those security and governance requirements. This is especially important when working with healthcare data, monitor data and other types of personally identifiable information (PII)," he said.
The best design pattern really depends on how an organization uses the data within the business. Some organizations are just gauging social impact and, once they have scanned through the information, will throw it away. Other applications of big data, such as health care or monitoring, need a design pattern that accounts for the temporal aspects of the information.
For example, some enterprises in manufacturing and engineering are monitoring machines or conditions (e.g., What were the revolutions per minute (RPM) of the engine in this car or, when monitoring farm equipment, how many times did it raise its shovel?). They analyze and monitor the data to track mean time to failure and other aspects of their business. "In this case, if you get all of this data, the challenge is, how good is engine RPM information from two years ago?" Beulke asked.
There needs to be a review process when deciding to gather new information. Beulke explained, "This is where you get into deciding if it is normal RPM, is it something you want to keep around, or is it the case where you only want the extraordinary data, such as when the RPM is out of bounds? Those kinds of situations need to be analyzed for business impact. Will it help solve a problem 10 years down the road, or is it an anomaly you don't need to bother with?"
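One way to picture the retention decision Beulke describes is a filter that keeps out-of-bounds readings in full and reduces normal readings to a summary. The sketch below is a hedged illustration only: the `Reading` fields, the RPM limits and the function `split_for_retention` are assumptions made for this example, and the right thresholds would come out of the business review, not from code.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    machine_id: str
    timestamp: int
    rpm: float


def split_for_retention(readings, low, high):
    """Return (anomalies, summary).

    Readings outside [low, high] are kept as-is for long-term storage.
    Readings inside the range are only counted and averaged per machine.
    """
    anomalies = []
    totals = {}  # machine_id -> (count, sum of rpm)
    for r in readings:
        if r.rpm < low or r.rpm > high:
            anomalies.append(r)
        else:
            count, total = totals.get(r.machine_id, (0, 0.0))
            totals[r.machine_id] = (count + 1, total + r.rpm)
    summary = {
        machine: {"count": count, "mean_rpm": total / count}
        for machine, (count, total) in totals.items()
    }
    return anomalies, summary


if __name__ == "__main__":
    # Made-up sample data; the 800-6000 RPM band is an assumed normal range.
    sample = [
        Reading("engine-1", 1, 2000.0),
        Reading("engine-1", 2, 2200.0),
        Reading("engine-1", 3, 7500.0),
    ]
    anomalies, summary = split_for_retention(sample, low=800, high=6000)
    print(anomalies)
    print(summary)
```

Whether the per-machine summary is enough, or whether normal readings should also be archived, is exactly the business-impact question raised in the quote above.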
An organization should put new data through the standardized governance and security review already in place for the business, with the level of scrutiny matched to the data content. There are some things that don't need extra review, like tweets. "You are just trying to engage customer sentiments and social likes, and the security on that stuff is not important," Beulke said.
NoSQL shines for social applications where you are going to dispose of the data afterwards. Trend analysis is fine, but for people trying to do repeatable functions, the governance and security issues come into play. Follow existing development standards and database platform procedures already in place. Beulke noted, "A lot of people are adopting open source Hadoop or other NoSQL platforms, which, in some ways, is causing problems."
The reason is that staff often have little or no experience with NoSQL databases, and there are no procedures in place for handling them. "A lot of organizations don't even need NoSQL," Beulke explained. "They can do what they are trying to achieve with traditional relational or even flat-file systems."
The other aspect of this is that NoSQL databases are not necessarily faster. That is one assumption that people take for granted. Beulke said, "Oracle and DB2 have more performance built into them. You have to remember that DB2 has huge compression capabilities that can save huge amounts of I/O and CPU. You can get down to one-tenth of the storage requirements and improve analysis speed tenfold using that compression."
Analysis on NoSQL platforms often goes through a programming language interface such as R, which is far more complex than the simpler SQL interface. With SQL, existing trained staff can take care of development easily.
With NoSQL, there is a need to bring someone on board or train them on R. Beulke said, "This can be a huge liability for a complex, big project. Some of the database management maintenance aspects of things wrapped around NoSQL databases can really become real time-consuming because of the lack of NoSQL compression or additional tools."
The traditional relational databases are already starting to encapsulate those functionalities. "DB2 already has a graph store and in a way that supports governance and security," noted Beulke.
Expanding the scope of the enterprise data warehouse
The broader approach is to think about the idea of an enterprise data warehouse. Steve Wooledge, senior director of marketing at Aster Data Systems, explained, "We call it a 'unified data architecture,' where there are workload-specific platforms for servicing users and applications, depending on what they are trying to do. If someone is trying to service thousands of users and give them ad hoc access to all sorts of data, this would get ingested into a data warehouse."
The other big use case arises because those data warehouses have become so mission-critical that organizations have stopped running some of the free-form data exploration a data scientist would do on them. There is more data available now, and it is diverse in terms of structure and format. Technologies such as Hadoop provide a low-cost way to ingest this data without having to transform it in advance. The challenge lies in determining what is valuable in that data once it is captured and stored.
This approach to a unified data architecture gives all users in the organization access to new and old data, so they can do analysis through their tool of choice, Wooledge said. "It is a loosely coupled architecture that integrates all of these systems with their strengths and weaknesses, and we provide it to the enterprise in a way that is manageable and usable."
One of the key challenges lies in getting unstructured data into an organization's data warehouse. Wooledge sees Hadoop as a distributed file system under the covers rather than a relational database, so you don't need to place data into columns and tables. He explained, "We see an opportunity to store that data in its native format and use Hadoop to distill it, which we can join with other structured, known information."
Although it is possible to write Hive queries and MapReduce jobs, data that has landed in Hadoop can be difficult to explore and interact with for someone who is used to SQL or business intelligence tools. Organizations might consider using HCatalog to improve metadata. This tool maps data stored in Hadoop to a table structure that can be read by SQL tools. This means that a business user with a tool like Tableau or MicroStrategy can grab data from Hadoop and Teradata in a single query.
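The idea of laying a table structure over files kept in their native format can be sketched in a few lines of plain Python. This is not HCatalog or Hive code and uses none of their APIs; the tab-separated layout, the column names in `SCHEMA` and the helper functions are hypothetical, chosen only to show a schema being applied at read time and the result being joined with structured data.

```python
import csv
import io

# Assumed layout of a raw, tab-separated clickstream file: (column name, type).
SCHEMA = [("customer_id", int), ("page", str), ("seconds_on_page", float)]


def read_with_schema(raw_text, schema, delimiter="\t"):
    """Apply a table schema to raw delimited text at read time.

    Lines that do not fit the schema are skipped rather than loaded.
    """
    rows = []
    for fields in csv.reader(io.StringIO(raw_text), delimiter=delimiter):
        if len(fields) != len(schema):
            continue  # wrong number of columns
        try:
            rows.append(
                {name: cast(value) for (name, cast), value in zip(schema, fields)}
            )
        except ValueError:
            continue  # a value could not be converted to the declared type
    return rows


def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dict rows on a shared key column."""
    index = {row[key]: row for row in right_rows}
    return [
        {**left, **index[left[key]]}
        for left in left_rows
        if left[key] in index
    ]


if __name__ == "__main__":
    raw = "1\t/home\t12.5\n2\t/pricing\t40\nnot a valid line\n"
    customers = [{"customer_id": 1, "segment": "retail"}]  # made-up warehouse rows
    print(join_on(read_with_schema(raw, SCHEMA), customers, "customer_id"))
```

The raw file is never rewritten; only the reader knows about columns and types, which is what lets SQL-style tools treat it like a table.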
For data coming off of a transaction system, such as point of sale or inventory, the data is already stored in a relational format, with known table mappings, such as the number of goods and prices. From a data storage perspective, the value of Hadoop in this case is not great, since you might as well put it into the data warehouse in a relational format.
On the other hand, if you are trying to extract information from unstructured data, Hadoop makes more sense. For example, an insurance company might decide to do content analysis to identify words used in insurance reports associated with an increased risk of fraud.
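A very small version of such content analysis might compare how often a word shows up in reports that were later flagged as fraudulent with how often it shows up in the rest. The sketch below assumes the reports are already labeled and fit in memory; the sample texts, the smoothing and the function names are illustrative assumptions, and a real project would run an equivalent job across the cluster.

```python
import re
from collections import Counter


def word_doc_frequency(reports):
    """Count, for each word, the number of reports that contain it."""
    counts = Counter()
    for text in reports:
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    return counts


def fraud_lift(flagged, unflagged, min_docs=2):
    """Rank words by how much more often they occur in flagged reports.

    Returns (word, lift) pairs sorted from highest to lowest lift. Rates
    are smoothed by adding 1 to counts and 2 to totals so that a word
    absent from one group does not cause a division by zero.
    """
    flagged_counts = word_doc_frequency(flagged)
    unflagged_counts = word_doc_frequency(unflagged)
    scores = {}
    for word, n in flagged_counts.items():
        if n < min_docs:
            continue  # too rare to say anything about
        rate_flagged = (n + 1) / (len(flagged) + 2)
        rate_unflagged = (unflagged_counts[word] + 1) / (len(unflagged) + 2)
        scores[word] = rate_flagged / rate_unflagged
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up report snippets, for illustration only.
    flagged = ["Laptop stolen, receipt lost", "Vehicle stolen overnight"]
    unflagged = ["Hail damage to roof", "Minor collision, report attached"]
    for word, lift in fraud_lift(flagged, unflagged):
        print(f"{word}: {lift:.2f}")
```

A high lift only marks a word as worth a closer look; it does not show that a given claim is fraudulent.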