The Open Source Cloud is Ready for Hadoop, Projects Say
Two major trends in enterprise computing this year show increasing overlap: big data processing and open source cloud adoption.
To Hortonworks, the software company behind open source Apache Hadoop, the connection makes sense. Enterprise customers want the ability to spin up a cluster on demand and quickly process massive amounts of data, said Jim Walker, director of product marketing at Hortonworks, in an interview at OSCON in July. The cloud provides this kind of access by its ability to scale and handle computing tasks elastically.
The open source cloud offers the additional benefit of low-cost deployment and extra plugability you won’t get with a proprietary cloud infrastructure.
All three major open source IaaS platforms -- OpenStack, CloudStack and Eucalyptus -- have made much progress this year in testing Hadoop deployments on their stacks. And Eucalyptus is working on full integration with the platform.
Although no formal relationship exists between Hadoop and the open source IaaS platforms now, Hortonworks does see potential for collaboration given the nature of cloud computing, in general, Walker said.
“(Hadoop) could be a piece of any open cloud platform today,” he said.
Here’s what each of the three major platforms had to say recently about their progress with Hadoop on the open cloud.
In the past, deploying Hadoop in a cloud datacenter proved too challenging for business-critical applications, said Somik Behera, a founding core developer of the OpenStack Quantum project at Nicira, which has since been acquired by VMware. Big data applications require a guaranteed bandwidth, which was difficult to do, Behera said.
OpenStack’s Quantum networking project, which was recently integrated in the new Folsom release, offers an Open vSwitch pluggable networking patch to help ensure performance on Hadoop deployments, Behera said. His Quora post on the topic explains it best:
The biggest challenge for deploying Hadoop on CloudStack has been allocation of resources, said Caleb Call, manager of website systems at Openstock.com and a CloudStack contributor, via email.
“In order to crunch the data we need to in our Hadoop cluster, we currently have many bare metal boxes,” Call said. “Reproducing this same model in the cloud, even being a private cloud, has proven to be tough.”
Though CloudStack is not currently working on an Hadoop integration, the team has built its cloud environment to guarantee performance for Hadoop workloads by building a dedicated resource pool, said Call, who oversees a team of engineers on the CloudStack project’s “Move to the Cloud” initiative.
“We've also built and tuned our compute templates around Hadoop for this cluster so we don't have to throw large amounts of computing power at the problems,” Call said. “Same as you would do for a bare metal system, but now the saved resources are still left in our compute resource pool available to be used by other Hadoop processes.”
At Eucalyptus, performance challenges with Hadoop in the cloud have been largely overcome in the past year, said Andy Knosp, VP of Product at the company.
“There’s been some good research that’s shown near-native performance of Hadoop workloads in a virtualized environment,” Knosp said. This has made Hadoop “a perfect use case” for the open cloud.
Amazon Web Services currently offers the Elastic MapReduce (EMR) service, a hosted Hadoop framework that runs on EC2 and S3. Through the company’s partnership with AWS, Eucalyptus is developing a similar offering that will provide simplified deployment of Hadoop on Eucalyptus.
Customers can run Hadoop on the Eucalyptus private cloud platform as-is – no plugins required, Knosp said. But the company also has a team working on integrating Hadoop with the platform for simplified deployment.
“We want to make it as simple as possible for our community and partners to deploy,” Knosp said. “It improves time to market for Hadoop applications.”