The Apache Hadoop distributed data processing framework has clear benefits and is gaining traction, but it also has drawbacks. Some organizations find that getting started with Hadoop requires rethinking their software architecture and acquiring new data skills.
For some, a problem with Hadoop’s batch-processing model is that it assumes there will be downtime in which to run the batch between bursts of data acquisition. That assumption holds for many businesses that operate locally and handle a large number of transactions during the day but very few (if any) at night. If that nightly window is large enough to process the data accumulated during the previous day, everything goes smoothly. For some businesses, though, the window of downtime is small or nonexistent, and even with Hadoop’s high-powered processing they take in more data each day than they can process in 24 hours.
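The arithmetic behind that shortfall can be sketched in a few lines of Python. This is only an illustration: the function name `backlog_after_days` and all of the figures (daily ingest, throughput, window length) are hypothetical and are not taken from any real Hadoop deployment.

```python
def backlog_after_days(daily_ingest_gb, throughput_gb_per_hour, window_hours, days):
    """Return the unprocessed backlog (in GB) at the end of each day.

    All parameters are illustrative assumptions: data arrives during the day,
    and the batch job can only run during the nightly downtime window.
    """
    # Maximum amount of data one nightly batch run can work through.
    capacity_gb = throughput_gb_per_hour * window_hours
    backlog_gb = 0.0
    history = []
    for _ in range(days):
        backlog_gb += daily_ingest_gb                    # the day's new data
        backlog_gb = max(0.0, backlog_gb - capacity_gb)  # the nightly batch run
        history.append(backlog_gb)
    return history


# Hypothetical numbers: 1,000 GB arrives per day, the cluster processes 100 GB/hour.
print(backlog_after_days(1000, 100, window_hours=12, days=3))  # window is large enough
print(backlog_after_days(1000, 100, window_hours=8, days=3))   # window is too small
```

With a 12-hour window the backlog stays at zero under these assumed numbers, while with an 8-hour window it grows by the same amount every day and never recovers.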
For organizations with small windows of acceptable downtime, an approach that adds components of stream-based data processing may help, writes GigaSpaces CTO Nati Shalom in a recent blog post on making Hadoop faster. By continuously processing incoming data into useful packets and removing static data that does not need to be processed (or reprocessed), enterprise organizations can significantly accelerate their big data batch processes. – James Denman