1.What is HDFS?

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information.
Files are stored redundantly across multiple machines to ensure durability against failures and high availability to highly parallel applications.

2.What are the Hadoop configuration files?

hdfs-site.xml
core-site.xml
mapred-site.xml

3.How does the NameNode handle DataNode failures?

The NameNode periodically receives a Heartbeat and a Blockreport from each DataNode in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
When the NameNode has not received a heartbeat from a DataNode for a certain amount of time, it marks that DataNode as dead. The blocks that were stored on the dead DataNode are now under-replicated, so the NameNode schedules new replicas of them on other DataNodes.
The NameNode coordinates the replication of data blocks from one DataNode to another, but the data transfer happens directly between DataNodes and never passes through the NameNode.
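
The failure-detection logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's implementation: the function names, the 600-second timeout, and the sample timestamps are all assumptions made for the example.

```python
# Illustrative sketch only: names, timeout, and data are assumptions,
# not Hadoop's actual code or default values.
HEARTBEAT_TIMEOUT_S = 600
REPLICATION_FACTOR = 3

def find_dead_datanodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT_S):
    """Return the DataNodes whose last heartbeat is older than the timeout."""
    return sorted(dn for dn, seen in last_heartbeat.items() if now - seen > timeout)

def under_replicated_blocks(block_locations, dead, target=REPLICATION_FACTOR):
    """Map each block to the number of replicas missing once dead nodes are excluded."""
    missing = {}
    for block, nodes in block_locations.items():
        live = [dn for dn in nodes if dn not in dead]
        if len(live) < target:
            missing[block] = target - len(live)
    return missing

# Last heartbeat time (in seconds) recorded for each DataNode.
last_heartbeat = {"dn1": 1000, "dn2": 1590, "dn3": 1595, "dn4": 1598}
# Which DataNodes hold a replica of each block.
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
}

dead = find_dead_datanodes(last_heartbeat, now=1700)
to_replicate = under_replicated_blocks(block_locations, dead)
print(dead)
print(to_replicate)
```

Here dn1 has been silent for longer than the timeout, so the one block it held a replica of is the only block that needs a new copy.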

4.What is MapReduce in Hadoop?

Hadoop MapReduce is a framework for distributed processing of large data sets on clusters of commodity hardware.
The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.

5.What is the responsibility of the NameNode in HDFS?

The NameNode is the master daemon that maintains the metadata for the blocks stored on DataNodes. Every DataNode sends heartbeats and block reports to the NameNode.

If the NameNode does not receive a heartbeat from a DataNode, it treats that DataNode as dead. The NameNode is a single point of failure: if it goes down, the HDFS cluster is inaccessible.

6.What is the responsibility of the SecondaryNameNode in HDFS?

The SecondaryNameNode is a helper daemon that does housekeeping work for the NameNode: it periodically merges the edit log into the fsimage (checkpointing).
The SecondaryNameNode is not a backup or standby for the NameNode; it only holds a checkpointed copy of the NameNode's metadata.

7.What is the DataNode in HDFS?

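The DataNode is the slave daemon that stores the actual data blocks under the direction of the NameNode. Each DataNode stores many blocks; the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.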

8.What is the JobTracker in Hadoop?

The JobTracker is the master daemon of MapReduce (not of HDFS). It assigns tasks to TaskTrackers, preferably on the nodes that hold the data blocks of the input file.

9.How can we list all jobs running in a cluster?

$ hadoop job -list

10.How can we kill a job?

$ hadoop job -kill jobid

11.What is the default port that the JobTracker listens on?

The JobTracker web UI listens on port 50030 by default: http://localhost:50030

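12.What is the default port that the DFS NameNode web UI listens on?

The NameNode web UI listens on port 50070 by default in Hadoop 1.x and 2.x: http://localhost:50070 (Hadoop 3.x moved it to 9870).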

13.What is Hadoop Streaming?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations. The programs read records from standard input and write results to standard output.
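
As an illustration, a word count for Streaming could look like the Python sketch below. It assumes the usual Streaming convention of tab-separated key/value lines and that the reducer receives its input sorted by key. The function names and the "map"/"reduce" command-line argument are hypothetical choices for this example.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hypothetical convention: the first argument selects the role.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role in ("map", "reduce"):
        step = map_lines if role == "map" else reduce_lines
        for record in step(sys.stdin):
            print(record)
```

A script like this would be handed to the streaming jar through its -mapper and -reducer options, with -input and -output naming the HDFS paths.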

14.What is Distributed Cache in Hadoop?

Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job.
The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

15.What is the benefit of Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?

The distributed cache is much faster for files that many tasks need. The framework copies the file once to each node before the job's tasks start there.
Whether a TaskTracker runs 10 or 100 mappers or reducers, they all use the same local copy. If each task instead read the file from HDFS in the MR job, every mapper would access HDFS itself, so a TaskTracker running 100 map tasks would read the file 100 times from HDFS.
HDFS is also not very efficient when used this way.

16.Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?

Yes. The input format class (FileInputFormat) provides methods such as addInputPath and addInputPaths to add multiple directories as input to a Hadoop job.

17.What will a Hadoop job do if you try to run it with an output directory that already exists? Will it overwrite it, warn you and continue, or throw an exception and exit?

The Hadoop job will throw an exception (FileAlreadyExistsException) and exit.

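18.How can you set an arbitrary number of mappers to be created for a job in Hadoop?

This is a trick question: you cannot set it directly. The number of map tasks is determined by the number of input splits, which depends on the input size and the split size. A configured value such as mapred.map.tasks is only a hint to the framework.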

19.How can you set an arbitrary number of reducers to be created for a job in Hadoop?

You can either do it programmatically, by calling the setNumReduceTasks method of the JobConf class, or set it as a configuration setting (mapred.reduce.tasks in Hadoop 1.x).

20.How will you write a custom partitioner for a Hadoop job?

To have Hadoop use a custom partitioner, you have to do at least the following three things:

Create a new class that extends the Partitioner class.
Override the getPartition method.
In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass method, or add it through configuration (if your wrapper reads from a config file or Oozie).
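
In Hadoop itself the partitioner is a Java class, and getPartition receives the key, the value, and the number of partitions. The decision it makes can be sketched independently of the API. The Python below is only a language-neutral illustration of that logic: the function names, the "country:user" key format, and the use of CRC32 as a stable hash are assumptions made for the example.

```python
import zlib

def hash_partition(key, num_reduce_tasks):
    """Hash-style partitioning: a stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def prefix_partition(key, num_reduce_tasks):
    """Custom rule (hypothetical): partition only on the text before the first ':'
    so that 'us:alice' and 'us:bob' always reach the same reducer."""
    prefix = key.split(":", 1)[0]
    return hash_partition(prefix, num_reduce_tasks)

keys = ["us:alice", "us:bob", "de:carl", "de:dora"]
assignments = {key: prefix_partition(key, 4) for key in keys}
print(assignments)
```

The point of a custom partitioner is exactly this kind of rule: keys that must be processed together are forced into the same partition, whatever the rest of the key contains.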

21.How did you debug your Hadoop code?

There are several ways to do this; the most common are:

By using counters
The web interface provided by Hadoop framework

22.What does the term "Replication factor" mean?

The replication factor is the number of copies of each block of a file that HDFS maintains across the cluster.

23.What is the default replication factor in HDFS?

The default replication factor is 3.

24. What is the typical block size of an HDFS block?

The default HDFS block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later.

25.What is the benefit of such a large block size (compared with the block size of a Linux file system such as ext)?

It allows HDFS to decrease the amount of metadata required per file (the list of blocks per file becomes shorter as the size of individual blocks increases). Furthermore, it allows fast streaming reads of data, because large amounts of data are laid out sequentially on disk.
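
A quick calculation makes the metadata argument concrete. The sketch below counts the blocks needed for a 1 GB file; the 4 KB figure is an assumed, typical ext block size used only for comparison.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return -(-file_size_bytes // block_size_bytes)

file_size = 1024 * MB                           # a 1 GB file
hdfs_blocks = block_count(file_size, 128 * MB)  # HDFS default in Hadoop 2.x
ext_blocks = block_count(file_size, 4 * 1024)   # assumed 4 KB local block size

print(hdfs_blocks, ext_blocks)
```

With 128 MB blocks the file needs 8 block entries; with 4 KB blocks it would need 262144.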

26.Why is it recommended to have a few very large files instead of a lot of small files in HDFS?

The NameNode holds the metadata of every file in HDFS, so more files means more metadata. Because the NameNode loads all of this metadata into memory for speed, a very large number of files can make the metadata big enough to exceed the memory available on the NameNode.
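
A rough estimate shows the scale of the effect. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object; that figure, the function name, and the 1 TB data set are assumptions for illustration, not measured values.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB
BYTES_PER_OBJECT = 150  # assumed rule of thumb, not an exact figure

def namenode_bytes(total_bytes, file_size, block_size=128 * MB):
    """Estimate NameNode memory: one object per file plus one per block."""
    files = -(-total_bytes // file_size)
    blocks_per_file = -(-file_size // block_size)
    return files * (1 + blocks_per_file) * BYTES_PER_OBJECT

large_files = namenode_bytes(1 * TB, 128 * MB)  # 8192 files of 128 MB
small_files = namenode_bytes(1 * TB, 1 * MB)    # 1048576 files of 1 MB

print(large_files, small_files)
```

Under these assumptions the same 1 TB of data costs about 2.3 MB of NameNode memory as 128 MB files and about 300 MB as 1 MB files.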

27.What alternate way does HDFS provide to recover data if a NameNode without a backup fails and cannot be recovered?

There is no way. If the NameNode dies and there is no backup of its metadata, the data cannot be recovered: the blocks still exist on the DataNodes, but the mapping from files to blocks is lost.

28.Describe how an HDFS client reads a file in HDFS: does it talk to the DataNodes or the NameNode, and how does the data flow?

To open a file, a client contacts the NameNode and retrieves a list of locations for the blocks that comprise the file.
These locations identify the DataNodes that hold each block. Clients then read file data directly from the DataNode servers, possibly in parallel.
The NameNode is not directly involved in this bulk data transfer, keeping its overhead to a minimum.
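
The read path can be modelled with a small in-memory simulation. The classes and method names below are hypothetical stand-ins, not the HDFS client API; they only show that metadata comes from the NameNode while the bytes come from the DataNodes.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""
    def __init__(self, file_blocks, block_locations):
        self.file_blocks = file_blocks
        self.block_locations = block_locations

    def get_block_locations(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    """Ask the NameNode where the blocks are, then fetch each one from a DataNode."""
    data = b""
    for block_id, locations in namenode.get_block_locations(path):
        # Simplification: always use the first listed replica.
        data += datanodes[locations[0]].read_block(block_id)
    return data

datanodes = {
    "dn1": DataNode({"blk_1": b"hello "}),
    "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
}
namenode = NameNode(
    {"/data/a.txt": ["blk_1", "blk_2"]},
    {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]},
)
content = read_file("/data/a.txt", namenode, datanodes)
print(content)
```

Note that the NameNode object never touches block contents; it only answers the location query.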

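29.Using the Linux command line, how will you list the number of files in an HDFS directory?

hadoop fs -ls <path> lists the contents of a directory. To get the number of files directly, use:

hadoop fs -count <path>

The output columns are the directory count, file count, content size, and path name.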

30.Using the Linux command line, how will you create a directory in HDFS?

hadoop fs -mkdir <path>

Top 60 Hadoop interview questions and answers for freshers and experienced - Part 1

Instance Of Java