Big data Hadoop interview questions and answers for freshers and experienced - Part 2

Big data Hadoop interview questions and answers for freshers and experienced



Hadoop interview questions and answers for freshers and experienced - Part 1


31. Using the Linux command line, how will you copy a file from your local directory to HDFS?

  • hadoop fs -put localfile hdfsfile

32.What platforms and Java versions does Hadoop run on?

  •  Java 1.6.x or higher, preferably from Sun. Linux and Windows are the supported operating systems, but BSD, Mac OS/X, and OpenSolaris are known to work. (Windows requires the installation of Cygwin).



33.Is there an easy way to see the status and health of a cluster?

  • There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. 
  • By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.
  • The JobTracker status page will display the state of all nodes, as well as the job queue and status about all currently running jobs and tasks.
  • The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
  • You can also see some basic HDFS cluster health data by running:
  • $ bin/hadoop dfsadmin -report

34.Do I have to write my job in Java?

  • No. There are several ways to incorporate non-Java code.

35.How do I submit extra content (jars, static files, etc) for my job to use during runtime?

  • The distributed cache feature is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) on to the slave node before any tasks for the job are executed on that node.
  • The files are only copied once per job and so should not be modified by the application.
  • Copying content into the lib directory on the nodes is discouraged; changes in that directory require the Hadoop services to be restarted. A minimal driver sketch using the distributed cache follows.
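
The sketch below is a minimal, hedged example of shipping side data with a job using the newer mapreduce API (Job.addCacheFile / Job.addFileToClassPath); on older releases the equivalent calls live on the DistributedCache class. The class name, job name and HDFS paths are made up for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver fragment: ship a read-only lookup file and an extra jar
// to every node that runs tasks for this job.
public class CacheFileDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "job-with-side-data");
        job.setJarByClass(CacheFileDriver.class);

        // Copied once per job to each task node; tasks read it as a local file.
        job.addCacheFile(new URI("hdfs:///user/hadoop/lookup.txt"));

        // Extra jar added to the task classpath.
        job.addFileToClassPath(new Path("/user/hadoop/extra-lib.jar"));

        // ... set mapper, reducer, input and output paths as usual ...
    }
}
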
36. How do I change the final output file names to desired names rather than partition names like part-00000, part-00001?

  • You can subclass the OutputFormat class and write your own. Look at the code of TextOutputFormat, MultipleOutputFormat, etc. for reference; it may be that only minor changes to one of the existing OutputFormat classes are needed.
  • To do that, subclass the class in question and override only the methods you need to change, as in the sketch below.
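
A minimal sketch of this approach, assuming the newer mapreduce API: subclass TextOutputFormat and override getDefaultWorkFile so reducer output files are named result-00000, result-00001, and so on instead of part-r-00000. The class name and the "result" prefix are arbitrary choices for illustration.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Names each task's output file "result-<partition number>" in the job's work directory.
public class CustomNameTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context, String extension) throws IOException {
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        int partition = context.getTaskAttemptID().getTaskID().getId();
        return new Path(committer.getWorkPath(), String.format("result-%05d%s", partition, extension));
    }
}

Register it in the driver with job.setOutputFormatClass(CustomNameTextOutputFormat.class).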

37.How do you gracefully stop a running job?

  • hadoop job -kill <JOBID>

38.How the HDFS Blocks are replicated?

  • HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
  • The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. 
  • The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are 3 copies of each data block: 2 copies are stored on DataNodes in the same rack and the 3rd copy on a node in a different rack.

39.How the Client communicates with HDFS?

  • Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
  •  Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data.

40. What is the HDFS block size? How is it different from the traditional file system block size?
  • In HDFS, data is split into blocks and distributed across multiple nodes in the cluster.
  • Each block is typically 64 MB or 128 MB in size, and each block is replicated multiple times (three times by default), with replicas stored on different nodes.
  • HDFS uses the local file system to store each HDFS block as a separate file, so the HDFS block size cannot be directly compared with the traditional file system block size: an HDFS block is a much larger, logical unit of distribution and replication.

41. When are the reducers started in a MapReduce job?
  • In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.

42. If reducers do not start before all mappers finish, why does the progress of a MapReduce job show something like Map(60%) Reduce(15%)? Why is reducer progress displayed when the mappers are not finished yet?
  • Reducers start copying intermediate key-value pairs from the mappers as soon as they are available.
  • The progress calculation also takes into account this data transfer, which is done by the reduce tasks, so reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer.
  • Although the reducer progress is updated, the programmer-defined reduce method is still called only after all the mappers have finished.

43. What is the Hadoop MapReduce API contract for a key and value class?
  • The key must implement the org.apache.hadoop.io.WritableComparable interface.
  • The value must implement the org.apache.hadoop.io.Writable interface. A minimal custom key class is sketched below.
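
The following is a minimal sketch of a custom key type that satisfies this contract; the class name YearKey and its single int field are hypothetical, chosen only to keep the example short.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A key wrapping a single int. Keys need serialization (write/readFields)
// plus an ordering (compareTo) so the framework can sort them.
public class YearKey implements WritableComparable<YearKey> {

    private int year;

    public YearKey() { }                      // no-arg constructor required for deserialization

    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException { out.writeInt(year); }

    @Override
    public void readFields(DataInput in) throws IOException { year = in.readInt(); }

    @Override
    public int compareTo(YearKey other) {
        return (year < other.year) ? -1 : ((year == other.year) ? 0 : 1);
    }

    @Override
    public int hashCode() { return year; }    // used by the default HashPartitioner

    @Override
    public boolean equals(Object o) {
        return o instanceof YearKey && ((YearKey) o).year == year;
    }
}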

44. What are combiners? When should I use a combiner in my MapReduce job?
  • Combiners are used to increase the efficiency of a MapReduce program.
  • They aggregate the intermediate map output locally on each mapper node, which reduces the amount of data that needs to be transferred to the reducers.
  • You can use your reducer code as a combiner if the operation performed is commutative and associative.
  • The execution of the combiner is not guaranteed: Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce job should not depend on the combiner executing. A complete word-count driver that reuses its reducer as a combiner is sketched below.
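
A minimal, self-contained word-count driver illustrating the point, assuming the library mapper and reducer classes available in newer Hadoop releases (TokenCounterMapper and IntSumReducer); the class name and the use of command-line arguments for paths are arbitrary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

// Word count that reuses its reducer as a combiner: summing counts is
// commutative and associative, so running the combiner zero, one or many
// times cannot change the final result.
public class WordCountWithCombiner {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}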

45. Where is the mapper output (intermediate key-value data) stored?
  • The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node.
  • This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

46. Name the most common InputFormats defined in Hadoop. Which one is the default?

  • The following are the most common InputFormats defined in Hadoop:
  1. TextInputFormat
  2. KeyValueInputFormat
  3. SequenceFileInputFormat
  • TextInputFormat is the Hadoop default.

47. What is the difference between the TextInputFormat and KeyValueInputFormat classes?
  • TextInputFormat: reads lines of text files and provides the byte offset of each line as the key to the Mapper, and the actual line as the value.
  • KeyValueInputFormat: reads text files and parses each line into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line is sent as the value. A driver fragment using this format is sketched below.
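
A minimal driver sketch, assuming the new-API class org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat: it switches the job from the default TextInputFormat and changes the key/value separator from tab to a comma. The property name shown is the one used by newer releases; older releases used key.value.separator.in.input.line. The class and job names are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Use key/value style input where each line is "key,value".
public class KeyValueInputDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "kv-input-example");
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // ... set mapper, reducer, input and output paths as usual ...
    }
}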

48. What is an InputSplit in Hadoop?

  • When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper for processing. Each such chunk is called an InputSplit.

49. How is the splitting of files invoked in the Hadoop framework?
  • It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) defined by the user.

50. Consider this case scenario: in an M/R system,
  • the HDFS block size is 64 MB
  • the input format is FileInputFormat
  • we have 3 files of sizes 64 KB, 65 MB and 127 MB
  • How many input splits will be made by the Hadoop framework?
  • Hadoop will make 5 splits, as follows (see the arithmetic below):
  • 1 split for the 64 KB file
  • 2 splits for the 65 MB file
  • 2 splits for the 127 MB file
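
As a quick check of the arithmetic, assuming each file is split independently and the split size equals the 64 MB block size: the 64 KB file fits in one block, so 1 split; the 65 MB file needs ceil(65/64) = 2 splits; the 127 MB file needs ceil(127/64) = 2 splits; 1 + 2 + 2 = 5 splits in total.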

51. What is the purpose of the RecordReader in Hadoop?
  • The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

52. After the map phase finishes, the Hadoop framework does "Partitioning, Shuffle and Sort". Explain what happens in this phase.

  • Partitioning: partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine, for all of its output (key, value) pairs, which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.
  • Shuffle: after the first map tasks have completed, the nodes may still be performing several more map tasks each, but they also begin exchanging the intermediate outputs from the map tasks with the reducers that require them. This process of moving map outputs to the reducers is known as shuffling.
  • Sort: each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

53. If no custom partitioner is defined in Hadoop, how is data partitioned before it is sent to the reducer?
  • The default partitioner computes a hash value for the key and assigns the partition based on this result, as sketched below.
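
For reference, the default HashPartitioner is essentially the one-liner below (the surrounding class is written out here only so the snippet compiles on its own).

import org.apache.hadoop.mapreduce.Partitioner;

// What the built-in HashPartitioner does: mask off the sign bit of the key's
// hash code and take it modulo the number of reduce tasks.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}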

54. What is a Combiner
  • The Combiner is a "mini-reduce" process which operates only on data generated by a mapper.
  • The Combiner will receive as input all data emitted by the Mapper instances on a given node.
  • The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.

55. Give an example scenario where a combiner can be used and one where it cannot be used.
  • There can be several examples; the following are the most common ones.
  • Scenario where you can use a combiner:
  • Getting the list of distinct words in a file
  • Scenario where you cannot use a combiner:
  • Calculating the mean of a list of numbers (see the worked example below)
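
Worked example of why the mean breaks: the mean of 1, 2 and 9 is (1 + 2 + 9) / 3 = 4, but if a combiner first averaged 1 and 2 on one mapper, the reducer would compute mean(1.5, 9) = 5.25, which is wrong. Averaging is not associative, so the reducer logic cannot be reused as a combiner; such a job would have to combine (sum, count) pairs instead.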

56.What is job tracker
  • Job Tracker is the service within Hadoop that runs Map Reduce jobs on the cluster

57. What are some typical functions of Job Tracker
  • The following are some typical tasks of Job Tracker
  • Accepts jobs from clients
  • It talks to the NameNode to determine the location of the data
  • It locates TaskTracker nodes with available slots at or near the data
  • It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker

58.What is task tracker
  • Task Tracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

59. Whats the relationship between Jobs and Tasks in Hadoop
  • One job is broken down into one or many tasks in Hadoop.

60. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
  • It will restart the task on some other TaskTracker, and only if the task fails more than 4 times (the default setting, which can be changed) will it kill the job.

61. Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
  • Speculative execution

62. How does speculative execution work in Hadoop?
  • The JobTracker makes different TaskTrackers process the same input.
  • When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs.
  • The reducers then receive their inputs from whichever mapper completed successfully first. A sketch for turning speculative execution off for a single job follows.
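
Speculative execution is on by default; the hedged fragment below turns it off for one job. The property names shown are the newer ones; older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution. The class and job names are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Disable speculative execution for both map and reduce tasks of this job.
public class NoSpeculationDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);
        Job job = Job.getInstance(conf, "no-speculation-example");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}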

How to call a non static method from a static method in Java

  • A class is a template; instance variables get memory only when an object is created.
  • If we create two objects, the variables get memory in both objects, so instance variables get memory whenever an object is created.
  • When we declare variables as static, no per-object memory is allocated, because static means class level: the member belongs to the class, not to any object. We can still access static variables and static methods from objects.
  • In our scenario we are calling a non static method from a static method in Java (the reverse case, calling a static method from a non static method, is covered further below).
  • If we are calling a non static method, we need an object so that the call is made on that object's non static method.
  • Non static methods are executed on an object, so whenever we want to call a non static method from a static method we need to create an instance and call the method on it.
  • If we call a non static method directly from a static method without creating an object, the compiler throws an error.



Program #1: Java example program to call non static method from static method. 


[Screenshot: compiler error when a non static method is called directly from a static method; a reconstruction of that program follows.]
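
The code below is a hedged reconstruction of the program shown in the screenshot (based on Program #2 further down): it calls nonStaticMethod() directly from the static method, so it does not compile.

package com.instanceofjava.staticinterviewquestions;

// This version does NOT compile: a non static method is referenced
// directly from a static method without an object.
public class StaticMethodDemo {

    void nonStaticMethod() {
        System.out.println("non static method");
    }

    public static void staticMethod() {
        nonStaticMethod(); // compile error: cannot make a static reference to a non-static method
    }

    public static void main(String[] args) {
        StaticMethodDemo.staticMethod();
    }
}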

  • In the above program we are trying to call a non static method of the class from a static method, so the compiler throws an error:
  • Cannot make a static reference to the non-static method nonStaticMethod() from the type StaticMethodDemo
  • So without an object we cannot call a non static method of a class.
  • Check the example program below: it calls the non static method from the static method by creating an object of the class and calling the non static method on that object.

Program #2: Java example program to call non static method from static method.

package com.instanceofjava.staticinterviewquestions;
//www.instanceofjava.com

public class StaticMethodDemo {

    void nonStaticMethod() {
        System.out.println("non static method");
    }

    public static void staticMethod() {
        new StaticMethodDemo().nonStaticMethod();
    }

    public static void main(String[] args) {
        StaticMethodDemo.staticMethod();
    }
}
   
Output:

non static method

Calling a static method from a non static method in Java

  • Static means class level and non static means object level.
  • Non static variables get memory in each and every object, allocated dynamically when the object is created.
  • Static variables are not part of any object; all static variables get memory when the class itself is loaded.
  • Like static variables, we have static methods; we can access static methods without creating an object.
  • Static methods are class level, and we can still access static methods inside non static methods.
  • We can also call static methods without using an object, just by using the class name.
  • So the answer to the question "is it possible to call static methods from non static methods in Java" is yes.
  • Calling a static method from a non static method means calling a single, class-level method from an object's method, which is always possible.


Program #1: Java example program to call static method from non static method.


package com.instanceofjava.staticinterviewquestions;

public class StaticMethodDemo {

    void nonStaticMethod() {
        System.out.println("Hi i am non static method");
        staticMethod();
    }

    public static void staticMethod() {
        System.out.println("Hi i am static method");
    }

    public static void main(String[] args) {
        StaticMethodDemo obj = new StaticMethodDemo();
        obj.nonStaticMethod();
    }
}
Output:

Hi i am non static method
Hi i am static method

  • In the above program we created an object of the class, called a non static method on that object, and inside that non static method called a static method.
  • So it is always possible to access static variables and static methods inside non static methods.


Top 60 Hadoop interview questions and answers for freshers and experienced - Part 1

Hadoop interview questions and answers for freshers and experienced


1.What is HDFS?

  • HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes), and provide high-throughput access to this information.
  • Files are stored in a redundant fashion across multiple machines to ensure their durability in the face of failure and their high availability to highly parallel applications.


 
2.What are the Hadoop configuration files?

  1.     hdfs-site.xml
  2.     core-site.xml
  3.     mapred-site.xml


3.How NameNode Handles data node failures?

  • The NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly.
  • When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. The blocks that were stored on the dead DataNode become under-replicated, so the NameNode begins replicating them to other DataNodes.
  • The NameNode takes responsibility for the replication of the data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes; the data never passes through the NameNode.


4.What is MapReduce in Hadoop?

  • Hadoop MapReduce is a specially designed framework for distributed processing of large data sets on clusters of commodity hardware. 
  • The framework itself can take care of scheduling tasks, monitoring them and reassigning of failed tasks.

5.What is the responsibility of NameNode in HDFS ?

  • NameNode is the master daemon that maintains the metadata for the blocks stored on DataNodes. Every DataNode sends a heartbeat and a block report to the NameNode.
  • If the NameNode does not receive a heartbeat from a DataNode, it identifies that DataNode as dead. The NameNode is a single point of failure: if the NameNode goes down, the HDFS cluster is inaccessible.

6. What is the responsibility of the SecondaryNameNode in HDFS?

  • SecondaryNameNode is a master daemon that performs housekeeping work for the NameNode.
  • SecondaryNameNode is not a standby for the NameNode, but it does keep a backup of the NameNode's metadata.

7.What is the DataNode in HDFS?

  • DataNode is the slave daemon of the NameNode and stores the actual data blocks. Each DataNode stores a number of blocks (64 MB each by default).

8.What is the JobTracker in HDFS?

  • JobTracker is a master daemon that assigns tasks to TaskTrackers on the DataNodes where it can find the data blocks for the input file.

9.How can we list all job running in a cluster?

  • $ hadoop job -list

10.How can we kill a job?

  • $ hadoop job -kill jobid

11. What's the default port that the JobTracker listens on?

  •  http://localhost:50030

12. What's the default port where the DFS NameNode web UI listens?

  •     http://localhost:50070

13. What is Hadoop Streaming?

  • Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations, as in the example below.
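
A hedged command-line sketch (the exact file name and location of the streaming jar vary by release and distribution, and the input/output paths are made up):

$ hadoop jar hadoop-streaming.jar \
    -input /user/hadoop/streaming-input \
    -output /user/hadoop/streaming-output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Here /bin/cat acts as an identity mapper and /usr/bin/wc as the reducer; any executable that reads stdin and writes stdout can play either role.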


14. What is Distributed Cache in Hadoop?

  • Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job.
  • The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

15. What is the benefit of Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?

  • The distributed cache is much faster: it copies the file to all TaskTrackers once, at the start of the job.
  • If a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same local copy from the distributed cache.
  • On the other hand, if the MR job reads the file from HDFS directly, every mapper accesses it from HDFS, so a TaskTracker running 100 map tasks will read the file from HDFS 100 times. HDFS is not very efficient when used like this.


16. Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?

  • Yes. The input format class provides methods to add multiple directories as input to a Hadoop job, as in the fragment below.
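
A minimal driver sketch, assuming the new-API FileInputFormat; the directory names and class name are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Feed two input directories to the same job.
public class MultiInputDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-input-example");
        FileInputFormat.addInputPath(job, new Path("/data/logs/2013"));
        FileInputFormat.addInputPath(job, new Path("/data/logs/2014"));
        // or equivalently, as one comma-separated list:
        // FileInputFormat.addInputPaths(job, "/data/logs/2013,/data/logs/2014");
        // ... set mapper, reducer and output path as usual ...
    }
}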

17.What will a hadoop job do if you try to run it with an output directory that is already present? Will it overwrite it - warn you and continue - throw an exception and exit

  • The hadoop job will throw an exception and exit.


18. How can you set an arbitrary number of mappers to be created for a job in Hadoop?

  • This is a trick question. You cannot set it directly; the number of mappers is determined by the number of input splits.

19. How can you set an arbitrary number of reducers to be created for a job in Hadoop?

  • You can either do it programmatically by using the method setNumReduceTasks in the JobConf class, or set it up as a configuration setting. A sketch using the newer API follows.
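
A hedged fragment using the newer mapreduce API (the reducer count of 10, the class name and the job name are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Fix the number of reduce tasks for one job.
public class TenReducersDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ten-reducers-example");
        job.setNumReduceTasks(10);   // programmatic; old API: JobConf.setNumReduceTasks(10)
        // equivalent configuration setting: mapreduce.job.reduces=10
        // (older releases used mapred.reduce.tasks)
        // ... set mapper, reducer, input and output paths as usual ...
    }
}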

20. How will you write a custom partitioner for a Hadoop job?

  • To have Hadoop use a custom partitioner you will have to do, at a minimum, the following three things (a sketch follows the list):
  1. Create a new class that extends the Partitioner class
  2. Override the getPartition method
  3. In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the method setPartitionerClass, or add it to the job as a config file (if your wrapper reads from a config file or Oozie)
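
A minimal sketch following those three steps. The key/value types (Text, IntWritable) and the routing rule (partition by the key's first letter) are arbitrary choices for illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends all keys that start with the same (lower-cased) character to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String k = key.toString();
        if (numReduceTasks == 0 || k.isEmpty()) {
            return 0;
        }
        int first = Character.toLowerCase(k.charAt(0));
        return first % numReduceTasks;
    }
}

In the driver, step 3 is then a single call: job.setPartitionerClass(FirstLetterPartitioner.class).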

21.How did you debug your Hadoop code?

  • There can be several ways of doing this but most common ways are
  1.     By using counters
  2.     The web interface provided by Hadoop framework

22.What does the term "Replication factor" mean

  • Replication factor is the number of times a file needs to be replicated in HDFS


23.What is the default replication factor in HDFS

  • The default replication factor is 3

24. What is the typical block size of an HDFS block

  • The default HDFS block size is 64 MB or 128 MB.

25.What is the benefit of having such big block size (when compared to block size of linux file system like ext)

  • It allows HDFS to decrease the amount of metadata storage required per file (the list of blocks per file is smaller as the size of the individual blocks increases). Furthermore, it allows for fast streaming reads of data, by keeping large amounts of data laid out sequentially on the disk.

26.Why is it recommended to have few very large files instead of a lot of small files in HDFS

  • This is because the NameNode holds the metadata of each and every file in HDFS, and it loads all of that metadata into memory for speed. More files means more metadata, so having a lot of small files may make the metadata large enough to exceed the memory available on the NameNode.

27. What alternate way does HDFS provide to recover data in case a NameNode, without backup, fails and cannot be recovered?

  • There is no way. If Namenode dies and there is no backup then there is no way to recover data

28. Describe how an HDFS client reads a file in HDFS. Will it talk to the DataNode or the NameNode? How will the data flow?

  • To open a file, a client contacts the Name Node and retrieves a list of locations for the blocks that comprise the file.
  • These locations identify the Data Nodes which hold each block. Clients then read file data directly from the Data Node servers, possibly in parallel.
  • The Name Node is not directly involved in this bulk data transfer, keeping its overhead to a minimum.

29. Using the Linux command line, how will you list the files in an HDFS directory?

  •      hadoop fs -ls

30. Using the Linux command line, how will you create a directory in HDFS?

  •     hadoop fs -mkdir
