Collaborative MapReduce in the browser lowers the barriers for collaborative supercomputing.
The idea of enrolling a large audience of people to donate computer time to worthwhile computing projects achieved widespread fame when UC Berkeley rolled out the SETI@home project in 1999 to search for extraterrestrial intelligence. Recently, a number of developers have proposed techniques for running MapReduce directly in the browser, which could significantly lower the barrier to swarm computing.
Since the initial release of SETI@home, millions of consumers have donated spare CPU time on their computers and in the process have created what has been claimed to be the largest computer on the planet. A more general-purpose version, the Berkeley Open Infrastructure for Network Computing (BOINC), has been created to support virtually any kind of computation, including searching for a cure for cancer, creating better climate models, looking for gravitational waves, and developing clean energy. Almost 4 million computers have participated in BOINC, enabling over 1.5 PetaFLOPS of performance, compared to 500 TeraFLOPS for Blue Gene, the largest supercomputer in the world.
Meanwhile, MapReduce has been getting a lot of attention since it was first publicly announced by Google in 2004. As noted earlier in an article on SearchSOA, MapReduce is useful in three categories of applications: text tokenization, indexing, and search; creation of other kinds of data structures (e.g., graphs); and data mining and machine learning. Data processing that requires massive parallelism is well suited to MapReduce.
MapReduce creates a framework for mapping a computation to run across thousands of low-cost PCs, and then reducing, or reassembling, the individual computations into a final answer. Although Google has not publicly disclosed its own implementation, the approach has gained widespread attention in the development community, with a variety of open source implementations including Hadoop, GridGain, Skynet, and Disco. At the same time, both Greenplum and Aster Data Systems have released commercial versions.
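As a rough illustration of the map and reduce steps described above, the following Python sketch counts words across several chunks of text. It is not taken from Google's implementation or from any of the projects named here. The function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical, and the chunks are processed one after another on a single machine rather than being farmed out to separate PCs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_words(chunk: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key so each key can be reduced on its own."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: List[int]) -> int:
    """Reduce step: collapse all values for one key into a single total."""
    return sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence.

    Assumption: in a distributed setting each chunk would be mapped on a
    different machine; here everything runs in one process for clarity.
    """
    mapped = [pair for chunk in chunks for pair in map_words(chunk)]
    grouped = shuffle(mapped)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


chunks = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = run_job(chunks)
print(word_counts)
```

Because each call to `map_words` depends only on its own chunk, the map step can be split across any number of workers without coordination. Only the grouping and reduce steps need to see results from more than one worker.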
Sean McCullough was the first to mention the general concept of running MapReduce in the browser, in January. He described a basic technique for writing programs in the MapReduce style, but did not elaborate on how to distribute and reassemble a computation across a group of machines.
Although the technique has considerable promise, it also faces numerous challenges around security, economics and speed, McCullough later wrote. Workers could intentionally poison the jobs if they have an incentive to do so. How do you know if you can trust a worker? In the case of SETI@home, some individuals gamed the system so they could rank higher and gain more status in the quest for extraterrestrials.
In a later post, Grigorik noted that the subject of performance and scalability generated a lot of conversation. There are concerns that the job servers are a single point of failure, and that the stateless nature of HTTP introduces the need for a storage layer. However, he believes that many of these problems have already been addressed in the P2P community with protocols such as BitTorrent and distributed hash tables.
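To make the storage-layer concern concrete, here is a minimal Python sketch of a job server whose request handlers keep no state of their own. It is an illustration under stated assumptions, not the design from any of the posts discussed. The `JobStore` class, the handler functions, and the status codes are all hypothetical, and no real HTTP traffic is involved: each handler call stands in for one independent request.

```python
from typing import Dict, List, Optional, Tuple


class JobStore:
    """Hypothetical storage layer: the only place job state survives between requests."""

    def __init__(self, chunks: List[str]) -> None:
        self.pending: Dict[int, str] = dict(enumerate(chunks))
        self.assigned: Dict[int, str] = {}
        self.results: Dict[int, int] = {}

    def checkout(self) -> Optional[Tuple[int, str]]:
        """Move the next pending job to 'assigned' and return it, or None if none remain."""
        if not self.pending:
            return None
        job_id = next(iter(self.pending))
        chunk = self.pending.pop(job_id)
        self.assigned[job_id] = chunk
        return job_id, chunk

    def complete(self, job_id: int, result: int) -> bool:
        """Record a result only for a job that was actually handed out."""
        if job_id not in self.assigned:
            return False
        del self.assigned[job_id]
        self.results[job_id] = result
        return True


def handle_get_job(store: JobStore) -> dict:
    """Stand-in for a 'give me work' request; remembers nothing between calls."""
    job = store.checkout()
    if job is None:
        return {"status": 204}
    job_id, chunk = job
    return {"status": 200, "job_id": job_id, "chunk": chunk}


def handle_post_result(store: JobStore, job_id: int, result: int) -> dict:
    """Stand-in for a 'here is my result' request."""
    accepted = store.complete(job_id, result)
    return {"status": 200 if accepted else 409}


# Simulate one browser worker that counts the words in each chunk it receives.
store = JobStore(["a b a", "b c"])
while True:
    response = handle_get_job(store)
    if response["status"] != 200:
        break
    word_count = len(response["chunk"].split())
    handle_post_result(store, response["job_id"], word_count)

total_words = sum(store.results.values())
print(total_words)
```

Since the handlers themselves hold nothing between calls, everything the server knows about which jobs are pending, handed out, or finished lives in `JobStore`. If that single store is lost, so is the job, which is the single-point-of-failure concern raised above.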