If you are facing a computing problem far beyond the capabilities of your present hardware, you are probably evaluating the possibility of cloud computing as being faster and taking much less investment than buying more hardware. Now it is time to consider how you are going to distribute your computing jobs to cloud computing resources, manage the whole thing, and get results back. In this article I am going to look at two technologies, one very mature and the other rather new.
Objects with Tuple-Spaces
Tuple-Space processing presents a model of distributed processing that is strikingly different from other schemes. A client application needing some computation done creates an object that contains the necessary information and writes it to a "space manager." A space manager is responsible for matching the client object with a worker process which can perform the desired job. When finished, the worker process writes a new object containing the results back to the space manager.
- Write - A process writes a serialized object to the Space Manager. This is analogous to a Rest PUT operation.
- Read - A process requests a copy of an object from the Space Manager by specifying the object contents it wants to match. This is analogous to a Rest GET operation getting the current state of a resource.
- Take - A process removes an object from the Space Manager. In Rest terms this would be a GET followed by a DELETE operation.
- Notify - Sends a message to a process that a object of interest has been written into the space. There is no exact Rest analogy, a Rest client might do repeated GET operations using the If-Modified-Since request header.
Tuple-Space pros and cons
Space based computing uses an object to define a computing problem. Since these objects can be quite complex, a wide variety of computation problems can be tackled. Space based computing easily adapts to problems in which a series of operations needs to take place. For example, suppose the problem is to conduct lexical analysis on large blocks of text to determine the likely author. An object specifying the text to be analyzed could first go to a simple parsing and counting program which would write the modified object back into the space to be picked up by a more complex program performing statistical analysis. The advantage for cloud computing is the extreme flexibility in scaling computing power to match demand by adding or removing worker processes.
JavaSpaces in the cloud
Java is a natural fit for tuple-space computing due to the ease of serializing Java objects, but implementations exist in many languages. I discussed the applicability of JavaSpaces to web services in this article.
Hadoop - the Open-Source MapReduce Implementation
Google famously uses a programming architecture they call "MapReduce" to index the entire web, terabytes of data, using thousands of commodity computers. A publication describing the programming model and implementation is available here. Many programmers have implemented their own versions of this architecture, with the Apache Hadoop open source project using Java the best known. The Apache Hadoop project is very active with many contributors and frequent releases but is still at the release 0.20 (April 2009) stage, so this has to be regarded as an immature technology.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is far from what your desktop operating system uses. It is a "distributed" system that achieves fault tolerance on cheap commodity hardware, typically running Linux, by extensive duplication and high throughput by choosing not to provide many typical file system functions. Files once written to the system and closed are not expected to be changed. This vastly simplifies replicating data, buffering and access control. A single HDFS instance may contain terabytes in tens of millions of files and be spread over hundreds of thousands of servers. For speed and simplification of networking, file read operations deal with large blocks of data, 64MB or more in size. In spite of all these simplifications, processes deal with HDFS using typical file names and directory structures.
Hadoop Job Processing
In order to use Hadoop, you must have a way to fit it into the MapReduce logical structure. It must be possible to break the problem into pieces which can be processed independently and which together cover all of the input data. Furthermore, it must be possible to express the results of each process in a form which lends itself to being easily combined with all of the other processes to create a final result.
- MAP - The total problem space is broken into suitable independent subsets for distribution to workers. Input data takes the form of a collection of pairs of key and value objects where both key objects and value objects must be serializable Java objects. Worker processes return a collection of pairs of serialized key and value objects. Note that "map" is being used in two senses here, the problem space is mapped into independent tasks, and the description of each task and the returned data are in the form of a map.
- REDUCE - When all workers have returned result lists, one or more "reduce" processes combines them progressively to get a final product. In order for this to work, it must be possible to sort the returned list by the keys so that the reduce process only has to look at the next item in each result list to determine if the results can be combined. The usual example given is counting the number of times different words appear in a collection of documents. The result lists have word keys and count values, sorted alphabetically by word so the reduce process can determine when counts can be added and combine the lists.
Pros and Cons of Hadoop
Hadoop is well suited to analysis of huge volumes of data by virtue of the efficiency, capacity, and fault tolerance of HDFS. Fault tolerance permits use of cheap commodity hardware In addition to the limit on the types of jobs suitable for mapping, Hadoop has a single point of failure in that there is only one file system master index process and one job distributing process.
JavaSpaces concentrates on distributing work to multiple processors with a very flexible system for defining jobs. JavaSpaces processes can use any kind of file system or database. Hadoop concentrates on reliable fast access to huge datasets through a restricted client api for file reading and can only work on problems which can be mapped to separate processes.
GigaSpaces Cloud Application Server
Wikipedia article on the original Linda model of tuple-space processing
Getting started with JavaSpaces article
Apache Hadoop Core website
Wikipedia article on Hadoop includes a list of commercial users
An account of how the New York Times used Hadoop on Amazon's EC2 service to turn 4 terabytes of data into 11 million article PDFs