Generations have searched for a secret formula for success. We don’t know of a universal one, but we do know how to succeed with a Hadoop implementation. Our latest proof is a Hadoop lab deployment for one of the largest educational institutions in the United States.
In this article, we cover the mix of business issues and technical details that forms the foundation of a successful Hadoop implementation project:
- On-premises vs. cloud
- Open-source vs. commercial
- Cluster size and structure
- Integration of Hadoop components
Let’s start by drawing clear boundaries around Hadoop, as the term can mean different things. In this article, Hadoop means four base modules:
- Hadoop distributed file system (HDFS) – a storage component.
- Hadoop MapReduce – a data processing framework.
- Hadoop Common – a collection of libraries and utilities supporting other Hadoop modules.
- Hadoop YARN – a resource manager.
Our definition does not cover Apache Hive, Apache HBase, Apache ZooKeeper, Apache Oozie and other elements of the Hadoop ecosystem.
Choosing between on-premises and cloud deployment seems to be a simple either-or decision, but it is, in fact, an important step. To take it, start by gathering the requirements of all the stakeholders. A classic example of what happens when this rule is neglected: your IT team plans to deploy the solution on-premises, and your finance team says that there are no CAPEX funds available to make this happen.
The list of factors to consider is close to endless. To choose between on-premises and cloud, assess each factor and let your priorities guide the decision. Our consultants have summed up several high-level factors that should be weighed before making it.
Consider Hadoop on-premises if:
- You clearly understand the scope of your project and are ready for serious investments in hardware, office space, support team development, etc.
- You would like to have full control over hardware and software and believe that security is of utmost importance.
Consider Hadoop in the cloud if:
- You are not sure about the storage resources you would need in the future.
- You need elasticity, for example, to cope with peaks (such as Black Friday sales compared to standard days).
- You don’t have a highly professional administration team to configure and support the solution.
Choosing Hadoop over other technologies does not end the selection process. You still have to opt for either vanilla Hadoop or one of the vendor distributions (for example, the ones provided by Hortonworks, Cloudera or MapR).
First, let’s clarify the terms. Vanilla Hadoop is an open-source framework by the Apache Software Foundation, while Hadoop distributions are commercial versions of Hadoop that bundle several frameworks with custom components added by a vendor. For example, Cloudera’s distribution includes Apache Hadoop, Apache Flume, Apache HBase, Apache Hive, Apache Impala, Apache Kafka, Apache Spark, Apache Kudu, Cloudera Search and many other components.
Our Hadoop consultants have contrasted these two alternatives to highlight the principal differences:
Huge and ever-growing data volumes are a defining feature of big data. Naturally, you have to plan your Hadoop cluster so that there’s enough storage space for your current and future data. We won’t overload this article with formulas. Still, here are several important factors to take into account to calculate the cluster size correctly:
- Volume of data to be ingested by Hadoop.
- Expected data flow growth.
- Replication factor (for instance, for a multi-node HDFS cluster it’s 3 by default).
- Compression rate (if applied).
- Space reserved for the intermediate output of mappers (usually 25-30% of overall disk space available).
- Space reserved for OS activities.
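For readers who want to see how these factors combine, here is a minimal sketch of the arithmetic. It is an illustration, not a sizing standard: the function names, the compound monthly growth model and the treatment of OS space as a fraction of disk are all assumptions made for this example, and the numbers in the usage lines are invented.

```python
import math


def required_raw_storage_tb(
    initial_data_tb,
    monthly_growth_rate,
    months,
    replication_factor=3,        # HDFS default for a multi-node cluster
    compression_ratio=1.0,       # 2.0 means data shrinks to half; 1.0 = no compression
    intermediate_fraction=0.25,  # share of disk kept for intermediate mapper output
    os_reserved_fraction=0.0,    # assumption: OS space modeled as a share of disk
):
    """Rough estimate of the raw disk capacity (TB) a cluster needs."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    usable_fraction = 1.0 - intermediate_fraction - os_reserved_fraction
    if usable_fraction <= 0:
        raise ValueError("reserved fractions leave no room for data")

    # Assumption: data grows at a compound monthly rate over the planning horizon.
    projected_tb = initial_data_tb * (1 + monthly_growth_rate) ** months
    # Compression shrinks the data; replication multiplies what is stored.
    stored_tb = projected_tb / compression_ratio * replication_factor
    # Only part of each disk is available for HDFS data.
    return stored_tb / usable_fraction


def nodes_needed(raw_storage_tb, disk_per_node_tb):
    """Number of data nodes, rounded up, for a given raw capacity."""
    return math.ceil(raw_storage_tb / disk_per_node_tb)


# Hypothetical inputs: 100 TB today, 5% monthly growth, 12-month horizon,
# 2x compression, 25% for intermediate output, 5% for the OS, 48 TB per node.
raw_tb = required_raw_storage_tb(100, 0.05, 12, 3, 2.0, 0.25, 0.05)
print(f"Raw storage: {raw_tb:.1f} TB, nodes: {nodes_needed(raw_tb, 48)}")
```

Treat the result as a starting point for discussion: real sizing also depends on workload, hardware and the growth pattern you actually observe.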
Companies frequently size their cluster based on assumed peak loads and end up with more cluster resources than they need. We recommend calculating cluster size with standard loads in mind. However, you should also plan how to cope with the peaks. The scenarios can differ: you can opt for the elasticity that the cloud offers, or you can design a hybrid solution.
Another thing to take into account is workload distribution. As different jobs compete for the same resources, it’s necessary to structure the cluster so that the load is spread evenly. When adding new nodes to a cluster, make sure to rebalance it (HDFS ships a balancer utility for this purpose). Otherwise, you can face the following situation: new data is concentrated on the newly added nodes, which may reduce cluster throughput or even cause a temporary system failure.
Data source: Real-World Hadoop by Ted Dunning, Ellen Friedman
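To make the idea of an unevenly loaded cluster concrete, here is a small sketch that flags nodes whose disk utilization strays too far from the cluster average. The 10 percent threshold mirrors the default of the HDFS balancer’s `-threshold` option; the function name, the node names and the capacity figures are hypothetical and serve only as an illustration.

```python
def classify_nodes(used_tb, capacity_tb, threshold_pct=10.0):
    """Label each node as 'over', 'under' or 'balanced'.

    A node counts as balanced when its utilization (in percent) is within
    threshold_pct percentage points of the cluster-wide utilization.
    """
    total_used = sum(used_tb.get(node, 0.0) for node in capacity_tb)
    total_capacity = sum(capacity_tb.values())
    cluster_util = 100.0 * total_used / total_capacity

    labels = {}
    for node, capacity in capacity_tb.items():
        node_util = 100.0 * used_tb.get(node, 0.0) / capacity
        if node_util > cluster_util + threshold_pct:
            labels[node] = "over"
        elif node_util < cluster_util - threshold_pct:
            labels[node] = "under"
        else:
            labels[node] = "balanced"
    return cluster_util, labels


# Hypothetical cluster: three older nodes that are 80% full and one new, empty node.
capacity = {"node1": 10.0, "node2": 10.0, "node3": 10.0, "node4": 10.0}
used = {"node1": 8.0, "node2": 8.0, "node3": 8.0, "node4": 0.0}

cluster_util, labels = classify_nodes(used, capacity)
print(f"Cluster utilization: {cluster_util:.0f}%")
for node, label in labels.items():
    print(node, label)
```

In this made-up cluster every node is out of balance: the three older nodes sit above the average and the new one far below it, which is the state a rebalancing run is meant to correct.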
Your solution’s architecture will include multiple elements. We’ve already clarified that Hadoop itself consists of several components. To solve their business tasks, companies may also enrich the architecture with additional frameworks. For example, one company can find Hadoop MapReduce’s functionality insufficient and strengthen its solution with Apache Spark. Another has to analyze streaming data in real time and opts for Apache Kafka as an extra component. But these examples are quite simple. In reality, companies have to choose among numerous combinations of frameworks and technologies. And, of course, all these elements should work smoothly together, which is, in fact, a big challenge.
Even if two frameworks are recognized as highly compatible (for instance, HDFS and Apache Spark), this does not mean that your big data solution will work smoothly. Choose the wrong versions, and instead of lightning-fast data processing you’ll have to cope with a system that doesn’t work at all.
And Apache Spark is, at least, a separate product. What if the trouble comes from the inner elements of your Hadoop ecosystem? Nobody expects that Apache Hive, designed to query the data stored in HDFS, can fail to integrate with the latter, but it sometimes does.
So, how to succeed?
We have shared our formula for a successful Hadoop implementation. Its components are well-thought-out decisions on deploying in the cloud or on-premises, opting for vanilla Hadoop or a commercial version, calculating cluster size and integrating the components smoothly. Obviously, this formula is a simplified one, as it covers general issues inherent to any company. However, every business is unique, and in addition to resolving standard challenges, one should be ready to deal with a lot of individual ones.