We generate massive amounts of data every day. From Instagram stories to online shopping sessions, you create data every second you interact with technology. Multiply that by the roughly five billion people on the internet, and the volume becomes almost incomprehensible. One widely used solution for storing and processing it is Hadoop.
As a software framework, Hadoop helps us leverage the potential of data by addressing the problems of massive data storage and computation. It's an open-source platform, best known for its storage component: the Hadoop Distributed File System (HDFS).
What Is the Hadoop Distributed File System (HDFS)?
Before we explore HDFS, let's first understand what a distributed file system is. It's a system that manages storage across a network of machines: data is broken into smaller chunks and stored on multiple machines, so that huge volumes of data can be handled without overwhelming any single machine.
The Hadoop Distributed File System is a Java-based system that distributes the storage of large, often unstructured data across a cluster of machines called nodes. Imagine you have ten machines, each with a 2 TB hard drive. If you install Hadoop on all ten, HDFS presents them as a single unified storage system.
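The ten-machine example works out as a simple capacity calculation. Note that this is raw capacity; in a real cluster, usable space is reduced by the replication factor (3 by default in HDFS) and reserved disk space:

```python
# Rough illustration: the aggregate raw capacity of an HDFS cluster is the
# sum of the DataNodes' disks, presented to the client as one file system.
machines = 10
disk_per_machine_tb = 2
raw_capacity_tb = machines * disk_per_machine_tb  # 20 TB raw

# With HDFS's default replication factor of 3, each block is stored three
# times, so the effective capacity is roughly a third of the raw total.
replication_factor = 3
usable_capacity_tb = raw_capacity_tb / replication_factor

print(raw_capacity_tb, usable_capacity_tb)
```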
Key Characteristics of HDFS
Huge storage volume: It can reliably store petabytes of data.
Easy data access: Access is simple because HDFS follows a write-once, read-many philosophy.
Highly cost-effective: Because it runs on clusters of commodity hardware, it's relatively inexpensive.
Now that we have understood what HDFS is, it's time to take a deeper look into the Apache Hadoop HDFS architecture.
Apache HDFS is a Java-supported file system, where each file is divided into blocks of a predetermined size, which are stored across a cluster. Apache Hadoop Distributed File System works on master-slave architecture, which includes:
NameNode/Master Node: Holds the file system metadata in RAM and on disk
Secondary NameNode: Maintains a copy of the NameNode's metadata on disk
DataNodes/Slave Nodes: Hold the actual data in the form of blocks
The NameNode is the master server in the Apache Hadoop HDFS architecture. It maintains, manages, and executes file system namespace operations, such as opening, closing, and renaming files and directories.
The Secondary NameNode works alongside the primary NameNode as a helper. It's responsible for maintaining a copy of the metadata on disk.
DataNodes, or slave nodes, are the commodity hardware in HDFS. They store the data blocks on the local file system (typically ext3 or ext4) and are inexpensive machines of moderate quality.
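The division of labour between the NameNode and the DataNodes can be pictured with a toy model. These are illustrative classes, not Hadoop's actual Java implementation: the point is that the NameNode holds only metadata (which blocks make up a file, and where their replicas live), while DataNodes hold the block bytes themselves:

```python
# Toy model of the HDFS master-slave split: the NameNode keeps metadata
# only; the DataNodes keep the actual block contents.
from dataclasses import dataclass, field

@dataclass
class NameNode:
    # file name -> ordered list of block IDs (the file system namespace)
    namespace: dict = field(default_factory=dict)
    # block ID -> list of DataNode IDs holding a replica
    block_locations: dict = field(default_factory=dict)

@dataclass
class DataNode:
    node_id: str
    # block ID -> raw bytes stored on the local file system (ext3/ext4)
    blocks: dict = field(default_factory=dict)

nn = NameNode()
dn = DataNode("dn1")
dn.blocks["blk_0"] = b"hello"
nn.namespace["scenario.txt"] = ["blk_0"]
nn.block_locations["blk_0"] = ["dn1"]
```

A client first consults the NameNode's two dictionaries, then talks to the DataNode directly for the bytes; the NameNode never sits on the data path.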
HDFS Read and Write Architecture

As mentioned earlier, HDFS follows a write-once, read-many philosophy. We can't edit files already stored in HDFS, but we can append new data by re-opening a file, which makes HDFS reads and writes parallel activities. Both go through the NameNode, which checks the client's privileges and permits it to read or write the data blocks. Simply put, the NameNode is central to this architecture: if it fails, the file system is unavailable.

To understand what a 'block' is and how the write protocol works, suppose an HDFS client wants to write a file named "scenario.txt" of size 248 MB. Assume the block size is configured as 128 MB. The client divides "scenario.txt" into two blocks of 128 MB and 120 MB. This protocol is followed whenever data is written into HDFS, and it involves three steps:

1. Set-up of the pipeline
2. Data streaming and replication
3. Shutdown of the pipeline

1. Set-up of the Pipeline

The client creates a pipeline for each block by connecting the individual DataNodes in the list of IPs returned by the NameNode, and confirms that each DataNode is ready to receive the data.

2. Data Streaming and Replication

Once the pipeline is created, the client pushes the data into it. The DataNodes replicate the data sequentially, according to the replication factor.

3. Shutdown of the Pipeline (Acknowledgement Stage)

Once the block has been copied to all the DataNodes, a series of acknowledgements flows back in reverse order to assure the client and the NameNode that the data has been written successfully. Finally, the client closes the pipeline to end the session.
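The write path can be sketched as a toy simulation. All names here are illustrative, not Hadoop's real API: a file is split into fixed-size blocks, each block is pushed down a pipeline of DataNodes, and acknowledgements flow back in reverse order:

```python
# Illustrative sketch of the HDFS write protocol: split, stream, acknowledge.
BLOCK_SIZE_MB = 128

def split_into_blocks(size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block for a file of `size_mb` megabytes."""
    blocks = []
    while size_mb > 0:
        blocks.append(min(size_mb, block_size_mb))
        size_mb -= block_size_mb
    return blocks

def write_block(block_id, pipeline):
    """Forward a block along the DataNode pipeline, then ack in reverse."""
    for node in pipeline:                # step 2: sequential replication
        print(f"{node} stores {block_id}")
    return list(reversed(pipeline))      # step 3: acks flow back to client

blocks = split_into_blocks(248)          # "scenario.txt" -> [128, 120]
acks = write_block("blk_1", ["dn1", "dn2", "dn3"])
```

With a replication factor of 3, each 128 MB block is streamed once by the client and then copied node-to-node, so the client never uploads the same block three times.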
HDFS Read Architecture
The HDFS read architecture is easier to understand. Take the same example, where the HDFS client wants to read the file "scenario.txt". First, the client asks the NameNode for the block metadata of "scenario.txt". The NameNode returns the list of DataNodes where the blocks are stored. The client then connects to those DataNodes, reads the blocks in parallel, and finally combines them to form the file.
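The read path described above can be sketched with the same toy metadata structures (illustrative names only, not Hadoop's real API): look up the block list from the NameNode's metadata, fetch each block from one of its replica locations, and concatenate:

```python
# Illustrative sketch of an HDFS read: metadata lookup, block fetch, reassembly.
def read_file(filename, namespace, block_locations, datanodes):
    data = b""
    for block_id in namespace[filename]:        # block list from the NameNode
        node_id = block_locations[block_id][0]  # pick one replica (e.g. closest)
        data += datanodes[node_id][block_id]    # fetch bytes from that DataNode
    return data

namespace = {"scenario.txt": ["blk_1", "blk_2"]}
block_locations = {"blk_1": ["dn1"], "blk_2": ["dn2"]}
datanodes = {
    "dn1": {"blk_1": b"first block "},
    "dn2": {"blk_2": b"second block"},
}

content = read_file("scenario.txt", namespace, block_locations, datanodes)
print(content)  # b'first block second block'
```

In real HDFS the client fetches blocks from the DataNodes in parallel; the sequential loop here is just for clarity.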
While serving a client's read request, HDFS selects the replica that minimizes read latency and bandwidth consumption.
We understand that this is a lot of information and may not sink in at one go, but by now you should have a pretty good idea of the Apache Hadoop HDFS architecture.
Learn Apache Xebia’s Way!
If you have a zeal to learn Apache HDFS, talk to our experts and enroll in Xebia's Big Data & Analytics program, which helps learners become experts in HDFS, YARN, and similar software, using real-time scenarios across various domains.