By: Huan Liu
If you ask three people what the cloud is (as in cloud computing), you’ll probably get back 10 different definitions. Even though no one can agree on a definition, most will agree that “on-demand” and “pay-per-use” are its key characteristics, and all would cite Amazon Web Services as an example of the cloud. So it is confusing when people call MapReduce a cloud technology, because MapReduce is not associated with “on-demand” or “pay-per-use”. If you’re wondering about the connection between MapReduce and the cloud, read on.
First of all, if you have not heard of MapReduce, it’s a technology first proposed by Google in 2003 to cope with the challenge of processing an exponentially growing amount of data. In the same year the technology was invented, Google’s production indexing system was converted to MapReduce. Since then, it has quickly proven to be applicable to a wide range of problems. For example, roughly 10,000 MapReduce programs had been written at Google by June 2007, and 2,217,000 MapReduce jobs were run in September 2007 alone.
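For readers new to the programming model, here is a minimal single-machine sketch of it, using word counting as the example. The names `map_words`, `reduce_counts` and `run_mapreduce` are made up for illustration and do not come from Google's system, Hadoop or Cloud MapReduce. A real implementation distributes the map, shuffle and reduce steps across many machines; this sketch only shows the shape of the two user-supplied functions.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(document: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce step: combine all values emitted for one key into a single result."""
    return word, sum(counts)


def run_mapreduce(
    inputs: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Hypothetical in-memory driver: map every record, group by key, then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return dict(reducer(key, values) for key, values in groups.items())


documents = ["the cloud is on demand", "the cloud is pay per use"]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

The programmer writes only the two small functions; the framework is responsible for everything in `run_mapreduce`, which is where the scale and fault-tolerance work lives in a real system.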
MapReduce has enjoyed wide adoption outside of Google too. Many enterprises increasingly face the same challenge of dealing with large amounts of data. They want to analyze and act on their data quickly to gain a competitive advantage, but their existing technology cannot keep up with the workload. Facebook is using MapReduce in production, and many large traditional enterprises are experimenting with the technology. It turns out that MapReduce can perform most tasks a database management system (e.g., Oracle) can, and it has several advantages over other technologies, including its scale, its ad-hoc query capability and its flexibility.
The first connection between MapReduce and the cloud is that MapReduce can benefit from cloud technology. This is demonstrated by the Cloud MapReduce project, which is an implementation of the MapReduce programming model on top of the Amazon services (EC2, S3, SQS and SimpleDB). Back in late 2008, we saw the emergence of a cloud Operating System (OS): a set of cloud services managing a large cloud infrastructure rather than an individual PC. We asked ourselves the following questions: what if we build systems on top of a cloud OS instead of directly on bare metal? Can we dramatically simplify system design? We decided to try implementing MapReduce as a proof of concept. In the course of the project, we encountered a lot of problems working with the Amazon cloud OS, most of which could be attributed to the weaker consistency model it presents. Fortunately, we were able to work through all the issues and successfully built MapReduce on top of the Amazon cloud OS. The end result surprised us somewhat, because Cloud MapReduce has several advantages over other implementations:
- It is faster. In one case, it is 60 times faster than Hadoop (the actual speedup depends on the application and the input data).
- It is more scalable. It has a fully distributed architecture, so there is no single bottleneck.
- It is more fault tolerant. Again due to its fully distributed architecture, it has no single point of failure.
- It is dramatically simpler. It has only 3,000 lines of code, two orders of magnitude smaller than Hadoop.
All these advantages translate directly into lower cost, higher reliability and faster turn-around for enterprises seeking a competitive advantage. Cloud MapReduce is now open sourced to benefit the community. I recently wrote a guest blog on why Cloud MapReduce, so I will not bore you with the details here. You can also read about the reasons to adopt Cloud MapReduce.
Cloud MapReduce advocates building more cloud services. If we can separate out a common component as a stand-alone cloud service, the component can not only be leveraged by other systems, but can also evolve independently. As we have seen in other contexts (e.g., SOA, virtualization), decoupling enables faster innovation.
The second connection between MapReduce and the cloud is that MapReduce is building the foundation of the cloud. Other MapReduce implementations, such as Hadoop, are building cloud services too, except that those services are embedded in the project today and cannot easily be used by other projects. Fortunately, those implementations are moving towards separating out their cloud services. In the recent 0.20.1 release of Hadoop, the HDFS file system is separated out as an independent component. This makes a lot of sense because HDFS is useful as a stand-alone component for storing large data sets, even if the users are not interested in MapReduce at all. In the future, MapReduce may indeed be a cloud technology, once it starts to include implementations of cloud services.