Apache Hadoop: an introduction and a cheat sheet

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.


All the modules in Hadoop are designed with a fundamental assumption:

“Hardware failures are common and should be automatically handled by the framework”

The Framework

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common

Libraries and utilities needed by other Hadoop modules.

  • Hadoop Distributed File System (HDFS)

A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

  • Hadoop YARN

A resource-management platform that manages the computing resources of a cluster and uses them to schedule users’ applications.

  • Hadoop MapReduce

An implementation of the MapReduce programming model for large-scale data processing.
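
To give a feel for the MapReduce programming model mentioned above, here is a minimal word-count sketch in plain Python. It does not use Hadoop or any of its APIs: it runs in a single process, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration. On a real cluster the framework would distribute the map and reduce steps across machines and handle the grouping itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence, in memory (illustration only)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data on clusters"]))
```

The point of the split is that each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own key, which is what lets a framework run them in parallel on different machines.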

Not only a framework, but an ecosystem


The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem: a collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.

A useful table of the entire Hadoop ecosystem is available at hadoopecosystemtable.github.io.

A more complete introduction and a useful cheat sheet

On the DZone website I found a very comprehensive guide, which comes with a useful cheat sheet.

The guide can be read at this address; the cheat sheet is below:

A great video introduction from Stanford University

Amr Awadallah introduces Apache Hadoop and asserts that it is the data operating system of the future. He explains many of the data problems faced by modern data systems while highlighting the benefits and features of Hadoop.

Published: February 03 2016