Hadoop Lab Deployment and Support

Customer

The Customer is one of the largest educational institutions in the United States.

Challenge

The Customer offers a computer science course for future data analysts and big data professionals. To give their students comprehensive training in both theory and practice, the Customer had a Hadoop lab deployed in the cloud. However, the Customer didn’t find this solution cost-effective. To cut expenses, they decided to deploy an on-premises Hadoop lab and commissioned ScienceSoft to install and configure the Hadoop cluster, as well as to provide support that would ensure the lab’s fast adoption.

Process

Consulting

Based on the projected data volume to be processed by students and the tasks to be performed, ScienceSoft’s Hadoop consulting team estimated minimal and optimal hardware requirements. Apart from that, our consultants advised the Customer on what operating system to choose and what big data technologies and frameworks to deploy so that the Hadoop lab would function as intended. Our team also analyzed what versions of the suggested technologies would make the best combination for the lab.
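A sizing estimate of this kind can be illustrated with a short Python sketch. It is only a back-of-the-envelope illustration, not the method used in the project: the function name, the 25% allowance for temporary data, the 8 TB of usable disk per node, and the three-node minimum are all assumed values chosen for the example. The only figure taken from Hadoop itself is the default HDFS replication factor of 3.

```python
import math


def estimate_hdfs_nodes(dataset_tb, replication=3, temp_overhead=0.25,
                        usable_tb_per_node=8.0, min_nodes=3):
    """Rough estimate of raw storage and worker-node count for a lab cluster.

    All defaults except `replication` (the HDFS default of 3) are
    hypothetical values picked for illustration only.
    """
    if dataset_tb < 0:
        raise ValueError("dataset_tb must be non-negative")
    # Every block is stored `replication` times; leave extra room for
    # intermediate job output and student scratch data.
    raw_tb = dataset_tb * replication * (1 + temp_overhead)
    # Round up to whole nodes and never go below the assumed minimum.
    nodes = max(min_nodes, math.ceil(raw_tb / usable_tb_per_node))
    return raw_tb, nodes


if __name__ == "__main__":
    raw, nodes = estimate_hdfs_nodes(10)
    print(f"Raw storage needed: {raw} TB across {nodes} nodes")
```

Under these assumed values, 10 TB of student data would call for 37.5 TB of raw storage on five nodes. A real estimate would also account for memory and CPU per node, which depend on the workloads the students run.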

Deployment

To keep travel costs down, our team did all the preliminary work offsite. For example, we remotely installed the operating system and configured it. Only the final step – Hadoop deployment itself – required the presence of our consultant onsite.

As decided during the consulting stage, ScienceSoft installed Hortonworks Data Platform in the Customer’s lab. The installation consisted of the following components:

  • Core Hadoop platform (Hadoop Distributed File System and Hadoop MapReduce)
  • Apache Hive (data warehouse software built on top of HDFS)
  • Apache Hadoop YARN (a resource manager and a job scheduler)
  • Apache Ambari (Hadoop management and administration service)
  • Apache Oozie (a workflow processor)
  • Apache Spark (a data processing engine)
  • Apache Pig (an ETL scripting platform)
  • Apache Zeppelin (a notebook for analytics)
  • Apache Ranger (a framework for ensuring the Hadoop cluster’s security)
  • Anaconda (a platform for data science and machine learning tasks)
  • Apache ZooKeeper (a framework that enables synchronization across the Hadoop cluster)
  • The Jupyter Notebook.

Support

Our team ensured the fast adoption of the lab. After deployment, our consultants conducted a number of remote assistance sessions, where we explained in detail how each component of the data platform should work.

ScienceSoft also created guides explaining how to work with the technologies (how to create a user, a workspace for a student, etc.).
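As an illustration of what a "create a workspace for a student" step can involve on HDFS, the Python sketch below builds the list of commands an administrator might run. It is not taken from ScienceSoft’s guides: the function name, the `students` group, the `/user/<name>` home directory layout, the 700 permissions, and the 5 GB quota are assumptions made for the example. The script only prints the commands and does not execute them.

```python
import re
import shlex


def student_workspace_commands(username, group="students", quota="5g"):
    """Build HDFS commands that set up a private home directory.

    The group name, permissions, and quota are hypothetical defaults.
    """
    # Restrict names to a conservative pattern so they are safe in paths.
    if not re.fullmatch(r"[a-z][a-z0-9_]{0,30}", username):
        raise ValueError(f"invalid username: {username!r}")
    home = f"/user/{username}"
    return [
        ["hdfs", "dfs", "-mkdir", "-p", home],                    # create the directory
        ["hdfs", "dfs", "-chown", f"{username}:{group}", home],   # hand it to the student
        ["hdfs", "dfs", "-chmod", "700", home],                   # keep it private
        ["hdfs", "dfsadmin", "-setSpaceQuota", quota, home],      # cap disk usage
    ]


if __name__ == "__main__":
    for cmd in student_workspace_commands("alice"):
        print(shlex.join(cmd))
```

Generating the commands as a list rather than running them directly lets an administrator review them first, or pass each one to `subprocess.run` on a node where the `hdfs` client is installed and the caller has HDFS superuser rights.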

As the lab is designed for training, there’s a good chance that something will go wrong. This is why ScienceSoft provided the Customer with step-by-step instructions on how to reinstall the software without involving our team or any other third party.

Results

The Customer got a smoothly functioning on-premises Hadoop lab that serves as a valuable source of practical knowledge for their students. Thanks to the training organized by ScienceSoft, the Customer quickly understood the role of every technology in the Hadoop lab and is ready to use each of them accordingly.

Technologies and Tools

Hadoop Distributed File System, Hadoop MapReduce, Apache Hive, Apache Hadoop YARN, Apache Ambari, Apache Oozie, Apache Spark, Apache Pig, Apache Zeppelin, Apache Ranger, Anaconda, Apache ZooKeeper, the Jupyter Notebook.