Features of HDFS (Hadoop Distributed File System)

Swarupa P · 3 min read · Jun 12, 2020


The Hadoop Distributed File System (HDFS) is the storage component of the Hadoop framework. It is used to store large files, and we can perform read and write operations on those files.
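For orientation, here is a minimal sketch of writing and reading a file through the HDFS Java API. The path and contents are hypothetical, and a running cluster with the client configuration on the classpath is assumed:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/hello.txt"); // hypothetical path

        // Write a file to HDFS.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```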

Features of HDFS:

1. Distributed Storage:

  1. Data is distributed across a cluster of nodes.
  2. Each file is divided into small chunks that are stored across multiple nodes of the cluster.
  3. Splitting the data in this way lets MapReduce process large datasets in parallel on multiple nodes.
  4. Distributed storage also provides fault tolerance.

Note: The more widely the data is distributed, the more widely the processing can be distributed. The sketch below shows how a client can observe this distribution.
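As a rough sketch, the Hadoop FileSystem Java API can report which hosts hold each block of a file. The path is hypothetical and a reachable cluster is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/data/sample.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // The NameNode answers this query from its metadata.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```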

2. Blocks:

  1. Files are split into a number of blocks (the physical division of data).
  2. Unlike a typical OS file system, whose blocks are only a few kilobytes, the default HDFS block size is 64 MB (128 MB from Hadoop 2.x onwards), and it can be increased based on requirements.

Pros:

  1. Large blocks save disk seek time.
  2. Another advantage appears during processing: one mapper processes one block at a time.

Note: If the block size is 64 MB, a mapper will process 64 MB of data at a time. The larger the block size, the more data each mapper processes.

Blocks are allocated according to the file size, and the block size can be configured in hdfs-site.xml, as sketched below.
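A hedged sketch of configuring the block size: dfs.blocksize (the standard property name in Hadoop 2.x and later) sets the cluster-wide default in hdfs-site.xml, while FileSystem.create() accepts a per-file override. The file path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml:
        // <property><name>dfs.blocksize</name><value>134217728</value></property>
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Per-file override: create() accepts an explicit block size.
        Path file = new Path("/data/big-file.bin"); // hypothetical path
        short replication = 3;
        long blockSize = 256L * 1024 * 1024; // 256 MB blocks for this file only
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, replication, blockSize)) {
            out.writeBytes("payload...");
        }
        fs.close();
    }
}
```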

3. Block Replication:

  1. DataNodes replicate blocks on instruction from the master (the NameNode).
  2. The NameNode holds the metadata: file name, file path, number of blocks, block IDs, block locations, and number of replicas (a sketch of reading this metadata follows).
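As a hedged sketch, a client can read some of this metadata through the FileSystem API; the NameNode serves the answer without touching any block data. The path is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Metadata lookup only; no block data is read.
        FileStatus st = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical
        System.out.println("path        = " + st.getPath());
        System.out.println("length      = " + st.getLen());
        System.out.println("block size  = " + st.getBlockSize());
        System.out.println("replication = " + st.getReplication());
        fs.close();
    }
}
```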

4. Replication:

  1. In HDFS, all data blocks are replicated across the cluster of nodes, with each replica stored in a different location.
  2. The default replication factor is 3; it can be changed according to requirements via hdfs-site.xml (the dfs.replication property), as sketched below.
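A minimal sketch of adjusting replication from a client, with a hypothetical path: dfs.replication sets the default for files created by this client, while setReplication() changes an existing file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AdjustReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client.
        conf.setInt("dfs.replication", 2);
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing (hypothetical) file.
        Path file = new Path("/data/sample.txt");
        boolean accepted = fs.setReplication(file, (short) 3);
        System.out.println("Replication change accepted: " + accepted);
        fs.close();
    }
}
```

The same change can also be made from the shell with hdfs dfs -setrep.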

5. High Availability:

  1. As the name suggests, the data remains available in all cases, including hardware failures and crashes.
  2. HDFS provides high availability of data by keeping multiple replicas of each data block.
  3. If a piece of hardware goes down, the data can be accessed from another node, and the NameNode instructs a DataNode to create another replica in the cluster.

6. Data Reliability:

  1. Because replication keeps the data highly available, it remains stored reliably even through hardware or other crashes.
  2. We can rebalance the cluster by restoring the desired replication factor.

7. Fault Tolerance:

  1. The replication factor and distributed storage together give Hadoop its fault-tolerance layer.
  2. Fault tolerance applies to all components of the Hadoop ecosystem.

8. Scalability:

Scalability is among the most important features of any storage system. There are two ways to scale:

  1. Horizontal scaling and
  2. Vertical scaling

Horizontal scaling means adding more machines to your pool of resources. In HDFS, this means adding more nodes to the cluster.

Vertical scaling means adding more power (CPU, RAM, disks) to an existing machine. In HDFS, this means adding more disks to the existing nodes of the cluster.

In short, if we have machines on a server rack, we can add more machines in the horizontal direction and more resources per machine in the vertical direction.

9. High Throughput:

  1. In HDFS, a job is divided and shared among different nodes, each of which executes its assigned task independently and in parallel.
  2. HDFS therefore provides high-throughput access to application data. Throughput is the amount of work done per unit of time.

I hope you enjoyed it! Thanks for reading.
