Hadoop Ecosystem
Introduction: The Hadoop Ecosystem is a platform, or suite, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. The four major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common. Hadoop is a framework that enables the processing of large data sets residing across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Components that collectively form the Hadoop ecosystem:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based Data Processing
Spark: In-Memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL Database
Mahout, Spark MLlib: Machine Learning algorithm libraries
Zookeeper: Managing cluster
Oozie: Job Scheduling
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
HDFS: Hadoop Distributed File System
HDFS is a distributed, scalable, and portable filesystem written in Java for the Hadoop framework.
HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS consists of two core components:
Name Node
Data Node
Name Node:
Name Node, a master server, manages the file system namespace and regulates access to files by clients.
It maintains and manages the blocks present on the Data Nodes.
The Name Node is the prime node and holds the metadata.
Meta-data in Memory
– The entire metadata is in the main memory
Types of Metadata
– List of files
– List of Blocks for each file
– List of Data Nodes for each block
– File attributes, e.g., creation time, replication factor
A Transaction Log
– Records file creations, file deletions, etc.
Data Node:
Data Nodes, one per node in the cluster, manage storage attached to the nodes that they run on.
Data Nodes store the actual data; in the distributed environment they run on commodity hardware.
A Block Server
Stores data in the local file system
Stores meta-data of a block
Serves data and meta-data to Clients
Block Report
Periodically sends a report of all existing blocks to the Name Node
Facilitates Pipelining of Data
Forwards data to other specified Data Nodes
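To make the Name Node / Data Node division concrete, here is a minimal sketch of a client writing a file through the Hadoop Java FileSystem API and then asking for its replication factor. The Name Node address and file path are illustrative; in practice fs.defaultFS comes from core-site.xml.

```java
// Minimal sketch: write a file to HDFS and inspect its replication factor.
// The Name Node handles the metadata; Data Nodes store the actual blocks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // illustrative address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) { // Name Node allocates blocks
            out.writeUTF("Hello, HDFS!");                // Data Nodes store the replicas
        }
        // The replication factor is part of the file metadata kept by the Name Node
        System.out.println("replication = " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}
```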
YARN: Yet Another Resource Negotiator
Apache YARN is Hadoop’s cluster resource management system.
YARN was introduced in Hadoop 2.0 to improve resource utilization beyond what the MapReduce-only model allowed.
It handles the cluster of nodes and acts as Hadoop’s resource management unit. YARN allocates CPU, memory, and other resources to different applications.
YARN has two components :
Resource Manager
Global resource scheduler
Runs on the master node
Manages other Nodes
Tracks heartbeats from Node Manager
Manages Containers
Handles AM requests for resources
De-allocates containers when they expire, or the application completes
Manages Application Master
Creates a container for the AM and tracks its heartbeats
Manages Security
Node Manager
Runs on each slave node
Communicates with RM
Registers and provides info on Node resources
Sends heartbeats and container status
Manages processes and container status
Launches AM on request from RM
Launches application process on request from AM
Monitors resource usage by containers.
Provides logging services to applications
Aggregates logs for an application and saves them to HDFS
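As an illustration of how a client talks to the Resource Manager, here is a hedged sketch using the YarnClient API to list the running nodes the RM knows about from Node Manager heartbeats. Configuration is assumed to come from yarn-site.xml on the classpath.

```java
// Sketch: query the Resource Manager for the state of the cluster's nodes.
// Each NodeReport reflects what that Node Manager reported via heartbeats.
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnNodesExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // picks up yarn-site.xml
        yarn.start();
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId()
                    + "  capacity=" + node.getCapability()   // total resources
                    + "  used=" + node.getUsed());           // resources in containers
        }
        yarn.stop();
    }
}
```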
MapReduce: Programming-based Data Processing
HDFS handles the distributed file system layer.
MapReduce is a programming model for data processing.
MapReduce
– Framework for parallel computing
– Programmers get simple API
– Don’t have to worry about parallelization, data distribution, load balancing, or fault tolerance
Allows one to process huge amounts of data (terabytes and petabytes) on thousands of processors
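The canonical illustration of the model is word count: the map function emits a (word, 1) pair for every token, and the reduce function sums the counts per word. The sketch below follows the standard word-count structure from the Hadoop documentation.

```java
// Classic MapReduce word count: map emits (word, 1), reduce sums the counts.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);            // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum)); // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```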
MapReduce Concepts (Hadoop 1.0)
Job Tracker
The Job-Tracker is responsible for accepting jobs from clients, dividing those jobs into tasks, and assigning those tasks to be executed by worker nodes.
Task Tracker
The Task Tracker is a process that manages the execution of the tasks currently assigned to its node. Each Task Tracker has a fixed number of slots for executing tasks (by default, two map slots and two reduce slots).
Hadoop 2.0 Cluster Components
Hadoop 2.0 splits up the two major functions of the Job Tracker:
Cluster resource management
Application lifecycle management
Resource Manager
Global resource scheduler
Hierarchical queues
Node Manager
Per-machine agent
Manages the life cycle of the container
Container resource monitoring
Application Master
Per-application
Manages application scheduling and task execution
e.g., the MapReduce Application Master
Hadoop as a Next-Gen Platform
Spark: In-Memory data processing
Spark is an open-source distributed processing system.
It is a cluster computing platform designed to be fast.
In-memory (RAM) computation increases the processing speed of applications.
Combines different processing types, such as:
Batch processing
Streaming data
Machine learning
Structured data
Graph data (GraphX)
Batch processing: This is the processing of big data at rest, run at a particular frequency. You can filter, aggregate, and prepare very large datasets using long-running jobs that execute in parallel.
Streaming data: Streaming, or real-time, data is data in motion. It can be processed as it arrives, so that applications get useful information the moment the data is generated.
Machine learning: Spark’s library for machine learning is called MLlib (Machine Learning library). It is heavily based on scikit-learn’s ideas on pipelines. The basic concepts for creating an ML model in this library are:
Data Frame: This ML API uses Data Frame from Spark SQL as an ML dataset, which can hold a variety of data types.
Structured data: Data that has a schema, i.e., a known set of fields. When the schema and the data have no separation, the data is said to be semi-structured.
RDD (Resilient Distributed Dataset): An immutable data structure that distributes the data in partitions across the nodes in the cluster.
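A minimal sketch of in-memory processing with Spark’s Java RDD API follows, run in local mode for illustration; the data and application name are made up.

```java
// Sketch: an in-memory RDD pipeline with Spark's Java API (local mode).
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkRddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // The RDD's partitions are distributed across the cluster and,
            // once cached, kept in memory between computations.
            JavaRDD<Integer> numbers =
                    sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6)).cache();
            long evens = numbers.filter(n -> n % 2 == 0).count(); // runs in memory
            System.out.println("even count = " + evens);
        }
    }
}
```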
PIG, HIVE: Query-based processing of data services
PIG:
Pig Hadoop was developed by Yahoo to perform a large range of data administration operations. It is a query-based system built around the Pig Latin language and used with Hadoop.
It is a platform for structuring the data flow, and processing and analyzing huge data sets.
Pig does the work of executing commands while, in the background, all the MapReduce activities are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Features of PIG
Provides support for data types (long, float, chararray), as well as schemas and functions
Is extensible and supports User Defined Functions
Provides common operations like JOIN, GROUP, FILTER, SORT
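As a sketch, Pig Latin can also be driven from Java through the PigServer API; the file names and field layout below are hypothetical, and local mode is used for illustration.

```java
// Sketch: running Pig Latin statements from Java via PigServer (local mode).
// Pig compiles these statements into MapReduce jobs behind the scenes.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // LOAD + FILTER written in Pig Latin; names here are hypothetical
        pig.registerQuery("users = LOAD 'users.csv' USING PigStorage(',') "
                + "AS (name:chararray, age:int);");
        pig.registerQuery("adults = FILTER users BY age >= 18;");
        pig.store("adults", "adults_out"); // result lands in HDFS (or local FS here)
        pig.shutdown();
    }
}
```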
HIVE:
Most data warehouse applications are implemented on relational databases that use SQL as the query language. Hive is a data warehousing package built on top of Hadoop that lowers the barrier to moving these applications to Hadoop.
Hive processes both structured and semi-structured data.
Internally, a Hive query executes as a series of automatically generated MapReduce jobs.
The structured data it stores is used for data analysis.
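Because the query compiles down to MapReduce jobs behind the scenes, the client side is just SQL over JDBC. A sketch, assuming a HiveServer2 instance and an illustrative table named sales (the Hive JDBC driver must be on the classpath):

```java
// Sketch: submitting a HiveQL query over JDBC to HiveServer2.
// Host, table, and credentials are illustrative.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Default HiveServer2 port is 10000
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Hive turns this query into one or more MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```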
HBase: NoSQL Database
Apache HBase is an open-source, distributed, versioned, fault-tolerant, scalable, column-oriented store modeled after Google’s Bigtable, with random real-time read/write access to data.
It’s a NoSQL database that runs on top of Hadoop as a distributed and scalable big data store.
It combines the scalability of Hadoop, by running on the Hadoop Distributed File System (HDFS), with real-time data access as a key/value store and the deep analytic capabilities of MapReduce.
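A sketch of that random real-time read/write access with the HBase Java client; the table "users" and column family "info" are assumed to already exist.

```java
// Sketch: random read/write against an HBase table with the Java client.
// Table and column names are illustrative.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);

            // Random real-time read of the same row
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```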
Mahout, Spark MLlib: Machine Learning algorithm libraries
Mahout provides an environment for creating machine learning applications that are scalable.
Mahout brings machine learnability to a system or application.
MLlib is Spark’s open-source distributed machine learning library.
MLlib provides efficient functionality for a wide range of learning settings and includes several underlying statistical, optimization, and linear algebra primitives.
It allows invoking the algorithms we need with the help of its own libraries.
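A minimal sketch of the DataFrame-based MLlib API in Java, closely following the pattern in the Spark documentation: build a tiny training DataFrame and fit a logistic regression estimator. The data is made up purely for illustration.

```java
// Sketch: fitting a logistic regression model with Spark MLlib (local mode).
import java.util.Arrays;
import java.util.List;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.*;

public class MllibExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("mllib-demo").master("local[*]").getOrCreate();

        // Tiny hand-made training set: one label and one feature vector per row
        List<Row> data = Arrays.asList(
                RowFactory.create(0.0, Vectors.dense(0.0, 1.1)),
                RowFactory.create(1.0, Vectors.dense(2.0, 1.0)),
                RowFactory.create(0.0, Vectors.dense(0.1, 1.2)),
                RowFactory.create(1.0, Vectors.dense(2.2, 0.9)));
        StructType schema = new StructType(new StructField[]{
                new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("features", new VectorUDT(), false, Metadata.empty())});
        Dataset<Row> training = spark.createDataFrame(data, schema);

        // scikit-learn-style estimator: estimator.fit(dataset) returns a model
        LogisticRegressionModel model = new LogisticRegression().fit(training);
        System.out.println("coefficients: " + model.coefficients());
        spark.stop();
    }
}
```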
Zookeeper: Managing cluster
Apache Zookeeper is the coordinator of any Hadoop job which includes a combination of various services in a Hadoop Ecosystem.
Apache Zookeeper coordinates with various services in a distributed environment.
It is an open-source, distributed, and centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services across the cluster.
Before Zookeeper, managing coordination and synchronization among the resources and components of the Hadoop Ecosystem was a huge issue; Zookeeper was introduced to solve exactly this.
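The coordination primitive Zookeeper provides is simple: clients read and write small pieces of data at paths called znodes, and every client in the cluster sees the same view. A sketch with the ZooKeeper Java client; the connection string is illustrative.

```java
// Sketch: storing and reading shared configuration in a znode.
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Simplification: the watcher releases the latch on the first event,
        // which for a fresh client is the connection event.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000,
                event -> connected.countDown());
        connected.await();

        // Any client in the cluster can now read this coordination data
        zk.create("/app-config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```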
Oozie: Job Scheduling
Apache Oozie is a clock and alarm service inside the Hadoop Ecosystem.
Oozie acts as a job scheduler: it combines multiple jobs sequentially into one logical unit of work.
Oozie is a workflow scheduler system that allows users to link jobs written on various platforms like MapReduce, Hive, and Pig, to schedule a job in advance, and to create a pipeline of individual jobs that execute sequentially or in parallel to achieve a bigger task.
There are two kinds of Oozie jobs:
Oozie workflow: Oozie workflow is a sequential set of actions to be executed.
Oozie Coordinator: These are the Oozie jobs that are triggered by time and data availability
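As a sketch of how such a workflow is submitted, here is the Oozie Java client API in use; the Oozie server URL, the HDFS path of the packaged workflow application, and the property values are illustrative and would normally match your workflow.xml.

```java
// Sketch: submitting and starting an Oozie workflow from Java.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // Where the packaged workflow app (workflow.xml + libs) lives in HDFS
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:9000/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://namenode:9000");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(conf); // submit and start the workflow
        System.out.println("Workflow status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```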