Hadoop overview

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

DataNode: A DataNode stores data in the Hadoop Distributed File System (HDFS). A functional file system has more than one DataNode, with data replicated across them.

NameNode: The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
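The lookup role described above can be illustrated with a minimal sketch. This is a toy in-memory model, not the HDFS client API: the class `ToyNameNode`, its methods, and the host names are all hypothetical and exist only to show that the NameNode holds locations while the DataNodes hold the data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it keeps metadata only, never file data."""

    # Maps a file path to the DataNode hosts that hold its data.
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def add_file(self, path: str, datanodes: List[str]) -> None:
        # Record where the data lives; the bytes themselves stay on the DataNodes.
        self.locations[path] = list(datanodes)

    def locate(self, path: str) -> List[str]:
        # A successful request returns the DataNodes where the data lives.
        if path not in self.locations:
            raise FileNotFoundError(path)
        return list(self.locations[path])


namenode = ToyNameNode()
namenode.add_file("/logs/app.log", ["datanode-1", "datanode-3"])
print(namenode.locate("/logs/app.log"))
```

A client would then read the file contents from the returned DataNodes directly rather than through the NameNode.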

JobTracker: The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack. The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
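The placement preference described above (a node that has the data, otherwise a node in the same rack) can be sketched as follows. This is an illustrative simplification, not the JobTracker's actual scheduling code; the function `pick_node`, its parameters, and the node and rack names are assumptions made for the example.

```python
from typing import Dict, List


def pick_node(block_hosts: List[str], free_nodes: List[str], rack_of: Dict[str, str]) -> str:
    """Choose a node for a task: data-local first, then rack-local, then any free node."""
    if not free_nodes:
        raise ValueError("no free nodes available")
    # 1. Prefer a free node that already holds the data.
    for node in free_nodes:
        if node in block_hosts:
            return node
    # 2. Otherwise prefer a free node in the same rack as a node holding the data.
    data_racks = {rack_of[host] for host in block_hosts if host in rack_of}
    for node in free_nodes:
        if rack_of.get(node) in data_racks:
            return node
    # 3. Fall back to the first free node.
    return free_nodes[0]


racks = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}
print(pick_node(["n1"], ["n3", "n1"], racks))  # data-local choice
print(pick_node(["n1"], ["n3", "n2"], racks))  # rack-local choice
```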

SecondaryNameNode: When the NameNode goes down, the file system goes offline. An optional SecondaryNameNode can be hosted on a separate machine, but it only creates checkpoints of the namespace by merging the edits file into the fsimage file and does not provide any real redundancy.
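The checkpoint idea, merging a log of edits into a namespace snapshot, can be sketched conceptually as follows. The real fsimage and edits files use Hadoop's own on-disk formats; the set of paths and the `(operation, path)` tuples below are hypothetical simplifications chosen only to show the merge step.

```python
from typing import List, Set, Tuple

Edit = Tuple[str, str]  # (operation, path): a hypothetical, simplified edit record


def checkpoint(fsimage: Set[str], edits: List[Edit]) -> Set[str]:
    """Replay logged edits on top of a namespace snapshot and return the new snapshot."""
    merged = set(fsimage)  # copy, so the old image is left untouched
    for op, path in edits:
        if op == "create":
            merged.add(path)
        elif op == "delete":
            merged.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


old_image = {"/data/a.csv"}
edit_log = [("create", "/data/b.csv"), ("delete", "/data/a.csv")]
print(checkpoint(old_image, edit_log))
```

After such a merge the edit log can start empty again, which keeps it from growing without bound; the merged snapshot is still only a checkpoint, not a standby copy of the NameNode.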

TaskTracker: A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker.

NodeManager: A NodeManager accepts instructions from the ResourceManager and manages the resources available on a single node.

ResourceManager: A ResourceManager is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system. It works together with the per-node NodeManagers and the per-application ApplicationMasters.

JobHistory: The JobHistory server saves the history of all jobs that run in the Hadoop system. The current version of the Hadoop KM monitors only the availability of these history servers.

PATROL for Hadoop overview 

BMC PATROL for Hadoop discovers and monitors the following components of a configured Hadoop environment:

  • DataNode
  • NameNode
  • JobTracker
  • SecondaryNameNode
  • TaskTracker
  • JobHistory
  • NodeManager
  • ResourceManager

Related topics

1.0.00 features

Architecture

