The Hadoop Ecosystem: A Crash Course


Hadoop has revolutionised both data processing and data warehousing, but its explosive growth has generated a lot of uncertainty, hype, and confusion.

 

This article aims to provide a concise crash course on the key components of the Hadoop ecosystem. At present, there are simply too many components to review in a single article, so we have narrowed the focus to the most widely used parts of the sprawling ecosystem.

 

What is Hadoop?

 

Apache Hadoop, to give it its full name, is an open source framework designed to handle the storage and processing of large data sets on low-cost commodity hardware. Since its 1.0 release in 2011, it has become the dominant platform for handling big data. It offers the ability to cheaply process large amounts of data (>100 gigabytes), regardless of its structure.

 

This offers a significant advantage over traditional business data warehouses. While relational databases excel at processing structured data, the requirement for structure restricts the kinds of data that can be processed. They lack the ability to explore heterogeneous data sets.

 

The amount of effort required to manually structure data for warehousing in a relational database often means that valuable data sources in organisations are left untapped. This is where Hadoop makes a big difference.

 

MapReduce: The core of Hadoop

 

Created by Google to solve the problem of building its web search indexes, the MapReduce framework is the powerhouse behind most of today’s big data processing.

 

The important innovation of MapReduce is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes. Distributing computation in this way solves the issue of data too large to fit onto a single machine. Combining this technique with commodity Linux servers offers a cost-effective alternative to massive computing arrays.

 

Essentially, Hadoop is an open source MapReduce implementation.
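
To give a feel for the programming model, here is a minimal sketch of the classic word-count job written against Hadoop’s Java MapReduce API; the class name and the command-line input and output paths are illustrative, and error handling is kept to a minimum.

```java
// Minimal word-count sketch using Hadoop's Java MapReduce API.
// Input/output paths come from the command line; error handling is omitted for brevity.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in this mapper's input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word across all mappers.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Each mapper runs in parallel over a block of the input, and the framework sorts and groups the mappers’ output by key before handing it to the reducers.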

 

As the Hadoop project matured, it acquired further components to enhance its usability and functionality. The name “Hadoop” has come to represent this continually growing ecosystem of different components.

 

Hadoop’s lower levels: Hadoop Distributed File System and MapReduce

 

In order for MapReduce to distribute computation across multiple servers, each server must have access to the relevant data. This is the role of HDFS – the Hadoop Distributed File System.

 

HDFS and MapReduce are robust. Individual servers in a Hadoop cluster can fail without aborting the computing process. HDFS ensures data is replicated, creating redundancy across the cluster. On completion of a calculation, a node writes its results back into HDFS.

 

There are no restrictions on the data that HDFS stores. Data may be unstructured and schemaless. By contrast, relational databases require that data be structured and schemas be defined before storing the data. With HDFS, making sense of the data is the responsibility of the developer’s code.

 

Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS, a process that is tedious and error-prone.
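
As a rough illustration of the file-handling side, the sketch below uses Hadoop’s FileSystem API to copy a local log file into HDFS and then list the target directory; the cluster address and paths are assumptions.

```java
// Sketch: loading a local file into HDFS with the Java FileSystem API.
// The fs.defaultFS value and all paths are illustrative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoad {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:8020");  // cluster address (illustrative)

    FileSystem fs = FileSystem.get(conf);

    // Copy a local log file into an HDFS directory.
    fs.copyFromLocalFile(new Path("/var/log/app/events.log"),
                         new Path("/data/raw/events.log"));

    // List what is now stored under /data/raw.
    for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }
    fs.close();
  }
}
```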

 

Improving programmability: Pig and Hive

 

To sidestep the problems associated with working directly with the Java APIs, Hadoop offers two solutions for making programming easier:

  1. Pig is a programming language that simplifies the routine tasks associated with working with Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig’s built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
  2. Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive’s core capabilities are extensible.

 

Choosing between Hive and Pig can be confusing. Generally speaking, Hive is more suitable for data warehousing tasks: its closeness to SQL makes it an ideal point of integration between Hadoop and business intelligence tools.
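
Because Hive exposes its SQL-like interface over JDBC (through HiveServer2), querying it from an application looks much like querying any relational database. A minimal sketch, in which the server address, credentials and the page_views table are assumptions:

```java
// Sketch: querying Hive over JDBC via HiveServer2.
// Host, port, credentials, table and column names are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         // HiveQL looks like SQL; Hive turns it into jobs that run on the cluster.
         ResultSet rs = stmt.executeQuery(
             "SELECT country, COUNT(*) AS views "
           + "FROM page_views GROUP BY country")) {
      while (rs.next()) {
        System.out.println(rs.getString("country") + "\t" + rs.getLong("views"));
      }
    }
  }
}
```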

 

Pig gives the developer more agility when exploring large datasets, allowing the development of succinct scripts that transform data flows for incorporation into larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is that it drastically cuts the amount of code needed compared with using Hadoop’s Java APIs directly.
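
As a sketch of that style, the snippet below embeds a short Pig Latin data flow in a Java application through Pig’s PigServer class; the input path, field layout and filter condition are assumptions.

```java
// Sketch: embedding a short Pig Latin data flow in Java via PigServer.
// Input path, field layout and filter condition are illustrative.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LogSummary {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Load semi-structured log lines, keep only errors, and count them per day.
    pig.registerQuery("logs = LOAD '/data/raw/events.log' "
        + "AS (day:chararray, level:chararray, message:chararray);");
    pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
    pig.registerQuery("per_day = GROUP errors BY day;");
    pig.registerQuery("counts = FOREACH per_day GENERATE group AS day, COUNT(errors);");

    // Write the result back into HDFS.
    pig.store("counts", "/data/summaries/error_counts");
    pig.shutdown();
  }
}
```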

 

Improving data access: HBase, Sqoop and Flume

 

At its heart, Hadoop is a batch-oriented system: data are loaded into HDFS, processed, and then retrieved. This is something of a computing throwback, and interactive, random access to data is often required.

 

This is where HBase comes in: a column-oriented database that runs on top of HDFS. HBase’s function is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase.

Sqoop and Flume address the related problem of getting data into Hadoop in the first place: Sqoop transfers bulk data between relational databases and HDFS, while Flume collects and channels large volumes of streaming log data into the cluster.
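
To make the HBase part concrete, here is a minimal sketch of a random write followed by an immediate read through the HBase Java client API; the users table, its profile column family and the row key are assumptions, and the table is presumed to already exist.

```java
// Sketch: random writes and reads against an HBase table.
// Table name, column family and row key are illustrative; the table is assumed to exist.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table users = connection.getTable(TableName.valueOf("users"))) {

      // Write: one row keyed by user id, one cell in the "profile" family.
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"),
                    Bytes.toBytes("someone@example.com"));
      users.put(put);

      // Read the same cell back immediately -- no batch job required.
      Result row = users.get(new Get(Bytes.toBytes("user-42")));
      byte[] email = row.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(email));
    }
  }
}
```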

 

Coordination and workflow: ZooKeeper and Oozie

 

With a growing family of services running as part of a Hadoop cluster, there is a need for effective coordination. As computing nodes can come and go, all members of the cluster need to synchronise with each other, know where to access services, and know how they should be configured. ZooKeeper provides each node with this information.
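
As a sketch of that service-registry role, a worker node might advertise itself by creating an ephemeral znode, which ZooKeeper removes automatically when the node’s session ends. The connection string, znode path and payload below are illustrative, and the /workers parent node is assumed to already exist.

```java
// Sketch: a worker registering itself with ZooKeeper via an ephemeral znode.
// The connection string, znode path and payload are illustrative;
// the /workers parent znode is assumed to already exist.
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ZooKeeper ensemble and wait for the session to be established.
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000,
        (WatchedEvent event) -> {
          if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
            connected.countDown();
          }
        });
    connected.await();

    // Ephemeral znode: disappears automatically when this worker's session ends,
    // so other cluster members always see an up-to-date list under /workers.
    zk.create("/workers/worker-1", "host1:9000".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

    // ... do work; keep the session alive ...
    zk.close();
  }
}
```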

 

Production systems utilising Hadoop often contain complex pipelines of transformations. These pipelines are interdependent. For example, the arrival of a new batch of data will trigger an import, which must then trigger recalculations in dependent datasets. The Oozie component provides features to manage such dependencies, removing the need for developers to painstakingly code custom solutions.
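
Oozie workflows themselves are defined in XML and submitted to the Oozie server. Purely as a sketch, a Java client could kick off a pre-defined import workflow as below; the server URL, the HDFS application path and the extra workflow parameter are illustrative assumptions.

```java
// Sketch: submitting a pre-defined Oozie workflow from Java.
// The Oozie URL, HDFS application path and workflow parameters are illustrative.
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class RunImportWorkflow {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://oozie-server:11000/oozie");

    // Point the submission at a workflow.xml already stored in HDFS.
    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/import-workflow");
    // Any parameters referenced inside that workflow definition are set here too (illustrative).
    conf.setProperty("inputDir", "/data/raw");

    // Submit and start the workflow, then check its status.
    String jobId = oozie.run(conf);
    WorkflowJob job = oozie.getJobInfo(jobId);
    System.out.println(jobId + " is " + job.getStatus());
  }
}
```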

 

Machine learning: Mahout

 

While every organisation’s data are unique and diverse, there is much less variation in the kinds of analyses typically performed on them. The Mahout project is a library of common analytical computations that run on Hadoop. Use cases include collaborative filtering to build tailored user recommendations, clustering, and classification.
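
For example, collaborative filtering is available through Mahout’s “Taste” recommender API. The sketch below builds user-based recommendations from a simple CSV preference file; the file name, neighbourhood size and user ID are assumptions, and Mahout also provides distributed, Hadoop-based versions of these algorithms for larger datasets.

```java
// Sketch: user-based collaborative filtering with Mahout's Taste API.
// ratings.csv (userID,itemID,preference), the neighbourhood size and the user ID are illustrative.
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class Recommendations {
  public static void main(String[] args) throws Exception {
    // Preferences loaded from a CSV file of userID,itemID,preference rows.
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 5 recommended items for user 42, based on similar users' preferences.
    List<RecommendedItem> items = recommender.recommend(42L, 5);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " scored " + item.getValue());
    }
  }
}
```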

 

The future of Hadoop

 

As millions more Internet of Things devices come online each year, more data is being collected than ever before. The vast majority of this data is unstructured. Inevitably, this means the demand for Hadoop’s processing capabilities will grow too. Its ability to process large volumes of data quickly means that Hadoop systems will become increasingly important to day-to-day business decisions.

 

Organisations of all sizes are now in a position to take advantage of the opportunities offered by big data. The open source nature of Hadoop and its ability to run on commodity hardware mean that its processing power is available to all.