Hadoop interview questions and answers for freshers and experienced - Part 1
31. Using the Linux command line, how will you copy a file from your local directory to HDFS?
- hadoop fs -put localfile hdfsfile
32. What platforms and Java versions does Hadoop run on?
- Java 1.6.x or higher, preferably from Sun. Linux and Windows are the supported operating systems, but BSD, Mac OS/X, and OpenSolaris are known to work. (Windows requires the installation of Cygwin).
33. Is there an easy way to see the status and health of a cluster?
- There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system.
- By default, these are located at http://job.tracker.addr:50030/ and http://name.node.addr:50070/.
- The JobTracker status page will display the state of all nodes, as well as the job queue and status about all currently running jobs and tasks.
- The NameNode status page will display the state of all nodes and the amount of free space, and provides the ability to browse the DFS via the web.
- You can also see some basic HDFS cluster health data by running:
- $ bin/hadoop dfsadmin -report
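34. Do I have to write my job in Java?
- No. There are several ways to incorporate non-Java code, for example Hadoop Streaming (any executable that reads standard input and writes standard output can act as the mapper or reducer) and Hadoop Pipes (a C++ API).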
35. How do I submit extra content (jars, static files, etc.) for my job to use during runtime?
- The distributed cache feature is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) on to the slave node before any tasks for the job are executed on that node.
- The files are only copied once per job and so should not be modified by the application.
- Copying content into lib is not recommended and highly discouraged. Changes in that directory will require Hadoop services to be restarted.
- To write job output in a custom format, you can subclass the OutputFormat class and write your own. You can look at the code of TextOutputFormat, MultipleOutputFormat, etc. for reference. It may be that you only need minor changes to one of the existing OutputFormat classes.
- In that case you can just subclass that class and override the methods you need to change.
37. How do you gracefully stop a running job?
- hadoop job -kill <JOBID>
38. How are HDFS blocks replicated?
- HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size.
- The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
- The NameNode makes all decisions regarding replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are 3 copies of each data block in total: 2 copies are stored on DataNodes in the same rack and the 3rd copy on a different rack.
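- The placement rule above can be sketched as follows. This is a minimal illustration, not the NameNode's actual algorithm: the cluster map, the node names and the choose_replicas function are all hypothetical, and the sketch only enforces the stated outcome (3 replicas, 2 sharing one rack and 1 on a different rack).

```python
import random


def choose_replicas(cluster, rng=None):
    """Pick 3 distinct nodes: two from one rack and one from another rack.

    cluster maps a rack name to the list of DataNode names in that rack
    (a hypothetical structure used only for this sketch).
    """
    rng = rng or random.Random(0)
    # The rack holding two replicas needs at least 2 nodes.
    shared_candidates = [rack for rack, nodes in cluster.items() if len(nodes) >= 2]
    if not shared_candidates:
        raise ValueError("no rack has 2 or more nodes")
    shared_rack = rng.choice(shared_candidates)
    # The third replica goes to any other rack that has at least one node.
    other_candidates = [
        rack for rack, nodes in cluster.items() if rack != shared_rack and nodes
    ]
    if not other_candidates:
        raise ValueError("need a second rack with at least 1 node")
    other_rack = rng.choice(other_candidates)
    placement = [(shared_rack, node) for node in rng.sample(cluster[shared_rack], 2)]
    placement.append((other_rack, rng.choice(cluster[other_rack])))
    return placement


cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
replicas = choose_replicas(cluster)
print(replicas)
```

- Losing a single rack in this layout still leaves at least one replica available, which is the point of spreading the copies over two racks.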
39. How does the client communicate with HDFS?
- Client communication with HDFS happens through the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
- Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data.
40. What is HDFS block size? How is it different from traditional file system block size?
- In HDFS, data is split into blocks and distributed across multiple nodes in the cluster.
- Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times.
- The default is to replicate each block three times. Replicas are stored on different nodes.
- HDFS utilizes the local file system to store each HDFS block as a separate file.
- HDFS block size is not comparable to a traditional file system block size, which is typically only a few kilobytes.
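- As a small worked example of the block arithmetic, the sketch below computes the block sizes for a file. The function name split_into_blocks is hypothetical, and the 200 MB file is an assumed example value; only the 64 MB block size comes from the answer above.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes (in bytes) of the blocks needed to store a file."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds the leftover bytes.
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
blocks = split_into_blocks(200 * MB, 64 * MB)
print(len(blocks), [size // MB for size in blocks])  # 4 [64, 64, 64, 8]
```

- All blocks except the last are the same size, which matches the description of how HDFS stores a file as a sequence of blocks.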
41. When are the reducers started in a MapReduce job?
- In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.
42. If reducers do not start before all mappers finish, why does the progress of a MapReduce job show something like Map(60%) Reduce(15%)? Why is reducer progress displayed when the mappers have not finished yet?
- Reducers start copying intermediate key-value pairs from the mappers as soon as they are available.
- The progress calculation also takes into account the data transfer done by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer.
- Although the reducer progress is updated, the programmer-defined reduce method is still called only after all the mappers have finished.
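- A rough sketch of how a figure such as Reduce(15%) can arise. It assumes that reduce progress is the equally weighted average of three phases (copy, sort, reduce); that weighting and the 45% copy fraction are assumptions made for illustration and are not stated in the answer above.

```python
def reduce_progress(copy_fraction, sort_fraction, reduce_fraction):
    """Overall reduce progress as the average of three equally weighted phases."""
    for fraction in (copy_fraction, sort_fraction, reduce_fraction):
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("each phase fraction must be between 0 and 1")
    return (copy_fraction + sort_fraction + reduce_fraction) / 3.0


# The reducer has copied 45% of the map outputs; sort and reduce have not started.
progress = reduce_progress(0.45, 0.0, 0.0)
print(f"Reduce({progress:.0%})")  # Reduce(15%)
```

- Under this assumption the reduce progress cannot pass roughly two thirds before the reduce method itself starts running.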
43. What is the Hadoop MapReduce API contract for a key and value class?
- The Key must implement the org.apache.hadoop.io.WritableComparable interface.
- The value must implement the org.apache.hadoop.io.Writable interface.
44. What are combiners? When should I use a combiner in my MapReduce job?
- Combiners are used to increase the efficiency of a MapReduce program.
- They are used to aggregate intermediate map output locally on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers.
- You can use your reducer code as a combiner if the operation performed is commutative and associative.
- The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
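- The points above can be illustrated with a word count, where the reducer logic (summing) is commutative and associative and so can double as the combiner. This is a plain-Python simulation, not Hadoop code; the function names map_words, combine and reduce_counts are hypothetical.

```python
from collections import Counter


def map_words(line):
    """Emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Sum counts per word locally; the same logic as the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def reduce_counts(pairs):
    """Sum counts per word over everything the reducers receive."""
    return dict(combine(pairs))


mapper_outputs = [map_words("a b a a"), map_words("b b c")]

without_combiner = [pair for output in mapper_outputs for pair in output]
with_combiner = [pair for output in mapper_outputs for pair in combine(output)]
# Running the combiner a second time must not change the final result.
combined_twice = [pair for output in mapper_outputs for pair in combine(combine(output))]

print(len(without_combiner), len(with_combiner))  # 7 4
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))  # True
```

- The final counts are identical whether the combiner runs zero, one or two times, which is the property a job needs because Hadoop does not guarantee how often the combiner is executed.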
45. Where is the mapper output (intermediate key-value data) stored?
- The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node.
- This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
46. Name the most common InputFormats defined in Hadoop. Which one is the default?
- The following 2 are the most common InputFormats defined in Hadoop: TextInputFormat and KeyValueTextInputFormat.
- TextInputFormat is the Hadoop default.
47. What is the difference between the TextInputFormat and KeyValueTextInputFormat classes?
- TextInputFormat: reads lines of text files and provides the byte offset of the line as the key and the actual line as the value to the mapper.
- KeyValueTextInputFormat: reads text files and parses each line into a (key, value) pair. Everything up to the first tab character is sent as the key to the mapper and the remainder of the line is sent as the value.
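- To make the difference concrete, the sketch below imitates the records each format would hand to the mapper for the same input. It is a plain-Python approximation, not the Hadoop classes themselves; the function names and the sample data are hypothetical.

```python
def text_input_records(data):
    """Yield (byte_offset, line) pairs: the key is where the line starts."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


def key_value_records(data, separator="\t"):
    """Yield (key, value) pairs split at the first separator in each line."""
    for raw_line in data.splitlines():
        key, _, value = raw_line.decode("utf-8").partition(separator)
        # A line without a separator becomes a key with an empty value.
        yield key, value


data = b"k1\tv1\nk2\tv2 more\n"
print(list(text_input_records(data)))  # [(0, 'k1\tv1'), (6, 'k2\tv2 more')]
print(list(key_value_records(data)))   # [('k1', 'v1'), ('k2', 'v2 more')]
```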
48. What is an InputSplit in Hadoop?
- When a Hadoop job is run, it splits the input files into chunks and assigns each chunk to a mapper to process. Each such chunk is called an InputSplit.
49. How is the splitting of files invoked in the Hadoop framework?
- The Hadoop framework invokes it by calling the getSplits() method of the InputFormat class (such as FileInputFormat) configured for the job.
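50. Consider this scenario: in an M/R system,
- HDFS block size is 64 MB
- Input format is FileInputFormat
- We have 3 files of size 64 KB, 65 MB and 127 MB
- How many input splits will be made by the Hadoop framework?
- Counting by block boundaries, Hadoop will make 5 splits as follows
- 1 split for the 64 KB file
- 2 splits for the 65 MB file
- 2 splits for the 127 MB file
- Note that FileInputFormat lets the last split of a file run up to 10% over the split size, so in practice the 65 MB file is usually read as a single split, which gives 4 splits in total.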
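- The split count for this scenario can be checked with a simplified model of the split loop. It assumes that the split size equals the block size and that FileInputFormat allows the final split of a file to exceed the split size by up to 10% (a slop factor of 1.1). The function name count_splits is hypothetical, and empty files are not modelled. With slop=1.0 the function reproduces the pure block-boundary count.

```python
def count_splits(file_size, split_size, slop=1.1):
    """Count the splits for one file, letting the last split run over by `slop`."""
    splits = 0
    remaining = file_size
    # Cut full-size splits while the remainder is clearly larger than one split.
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final (possibly slightly oversized) split.
    if remaining > 0:
        splits += 1
    return splits


KB = 1024
MB = 1024 * KB
files = [64 * KB, 65 * MB, 127 * MB]

by_blocks = [count_splits(size, 64 * MB, slop=1.0) for size in files]
with_slop = [count_splits(size, 64 * MB) for size in files]
print(by_blocks, sum(by_blocks))  # [1, 2, 2] 5
print(with_slop, sum(with_slop))  # [1, 1, 2] 4
```

- The only file affected by the slop is the 65 MB one: 65/64 is below 1.1, so its trailing 1 MB is folded into the first split instead of becoming a split of its own.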
51. What is the purpose of RecordReader in Hadoop?
- The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.
52. After the Map phase finishes, the Hadoop framework does "Partitioning, Shuffle and Sort". Explain what happens in this phase.
- Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same
- After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
- Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer
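- The three steps can be simulated in a few lines. This is a plain-Python sketch, not Hadoop code: partition_for is a hypothetical stand-in for the partitioner (it sums the key's bytes), chosen only because it is deterministic and gives the same partition for a key regardless of which mapper emitted it.

```python
from collections import defaultdict


def partition_for(key, num_reducers):
    """Hypothetical stand-in partitioner: same key always maps to the same reducer."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Route every (key, value) pair to its reducer, group by key, sort the keys."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for output in mapper_outputs:
        for key, value in output:
            partitions[partition_for(key, num_reducers)][key].append(value)
    # Each reducer sees its keys in sorted order, each with all of its values.
    return [sorted(groups.items()) for groups in partitions]


mapper_outputs = [[("b", 1), ("a", 1)], [("a", 1), ("c", 1)]]
reducer_inputs = shuffle_and_sort(mapper_outputs, 2)
print(reducer_inputs[0])  # [('b', [1])]
print(reducer_inputs[1])  # [('a', [1, 1]), ('c', [1])]
```

- Both occurrences of key "a" come from different mappers but end up at the same reducer, which is the guarantee partitioning has to provide.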
53. If no custom partitioner is defined in Hadoop, how is data partitioned before it is sent to the reducer?
- The default partitioner (HashPartitioner) computes a hash value for the key and assigns the partition based on this result, taking the hash modulo the number of reduce tasks.
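- A sketch of the hash-then-modulo idea behind the default partitioner. It masks off the sign bit of the hash and takes the remainder by the number of reduce tasks. Java's String.hashCode() is re-implemented here purely as a stand-in for the key's hashCode(); real Hadoop key types such as Text compute their hash differently, so the partition numbers below are illustrative only.

```python
def java_string_hash(text):
    """Java's String.hashCode() for ASCII text, as a signed 32-bit integer."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - 0x100000000 if h >= 0x80000000 else h


def hash_partition(key, num_reduce_tasks):
    """Clear the sign bit so the result is never negative, then take the modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


print(java_string_hash("hello"))  # 99162322
print(hash_partition("ab", 4))    # 1
```

- Because the partition depends only on the key and the number of reduce tasks, every mapper sends a given key to the same reducer.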
54. What is a Combiner?
- The Combiner is a "mini-reduce" process which operates only on data generated by a mapper.
- The Combiner will receive as input all data emitted by the Mapper instances on a given node.
- The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
55. Give an example scenario where a combiner can be used and where it cannot be used.
- There can be several examples; the following are the most common ones
- Scenario where you can use combiner
- Getting list of distinct words in a file
- Scenario where you cannot use a combiner
- Calculating mean of a list of numbers
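- Both scenarios can be checked with a few numbers. The sketch below uses made-up values for two mappers. It shows that merging distinct words locally is safe, that averaging the per-mapper means gives the wrong answer, and, as a commonly used workaround that goes beyond the answer above, that a combiner emitting (sum, count) pairs keeps the mean correct. All function names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def combine_sum_count(values):
    """Combiner-friendly partial result: a (sum, count) pair instead of a mean."""
    return (sum(values), len(values))


def reduce_mean(partials):
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


# Distinct words: de-duplicating per mapper does not change the final set.
words_a, words_b = ["x", "y", "x"], ["y", "z"]
assert set(words_a) | set(words_b) == set(words_a + words_b)

# Mean: reusing the reducer (mean) as the combiner gives a wrong result.
mapper_a, mapper_b = [1, 2, 3], [4, 5]
true_mean = mean(mapper_a + mapper_b)
mean_of_means = mean([mean(mapper_a), mean(mapper_b)])
fixed_mean = reduce_mean([combine_sum_count(mapper_a), combine_sum_count(mapper_b)])
print(true_mean, mean_of_means, fixed_mean)  # 3.0 3.25 3.0
```

- The mean of means is only correct when every mapper happens to hold the same number of values, which a job cannot rely on.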
56. What is Job Tracker?
- Job Tracker is the service within Hadoop that runs MapReduce jobs on the cluster.
57. What are some typical functions of Job Tracker?
- The following are some typical tasks of Job Tracker
- Accepts jobs from clients
- It talks to the NameNode to determine the location of the data
- It locates TaskTracker nodes with available slots at or near the data
- It submits the work to the chosen Task Tracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker
58. What is Task Tracker?
- Task Tracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a Job Tracker.
59. What is the relationship between jobs and tasks in Hadoop?
- One job is broken down into one or many tasks in Hadoop.
60. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?
- It will restart the task on some other Task Tracker, and only if the same task fails 4 times (the default setting, which can be changed) will it kill the job.
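- The retry behaviour can be sketched as a loop with an attempt limit. This is a simulation under stated assumptions, not Hadoop code: the limit of 4 attempts per task is taken from the default mentioned above, and the names JobFailed, run_task_with_retries, flaky and always_fails are hypothetical.

```python
class JobFailed(Exception):
    """Raised when one task has used up all of its attempts."""


def run_task_with_retries(task, max_attempts=4):
    """Call task(attempt_number) until it succeeds or the attempts run out.

    Returns (result, attempts_used) on success.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt), attempt
        except Exception as error:  # a failed attempt; try again
            last_error = error
    raise JobFailed(f"task failed {max_attempts} times") from last_error


def flaky(attempt):
    # Simulated task that fails twice and succeeds on the third attempt.
    if attempt < 3:
        raise RuntimeError("simulated task failure")
    return "done"


def always_fails(attempt):
    raise RuntimeError("simulated task failure")


print(run_task_with_retries(flaky))  # ('done', 3)
```

- In the simulation a task that never succeeds raises JobFailed after its fourth attempt, while the other 99 tasks of the job are unaffected until that point.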
61. Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
- Speculative Execution
- The Job Tracker makes different Task Trackers process the same input.
- When tasks complete, they announce this fact to the Job Tracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the Task Trackers to abandon the tasks and discard their outputs.
- The Reducers then receive their inputs from whichever Mapper completed successfully first.
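- The "first copy to finish wins" rule can be sketched as below. The tracker names and finish times are invented for illustration, and speculative_winner is a hypothetical helper, not part of Hadoop.

```python
def speculative_winner(attempts):
    """Pick the attempt that finishes first and list the ones to abandon.

    attempts is a list of (tracker_name, finish_time_seconds) pairs, where a
    finish time of None means the attempt never completes.
    """
    finished = [(time, name) for name, time in attempts if time is not None]
    if not finished:
        raise ValueError("no attempt finished")
    _, winner = min(finished)
    # Every other copy of the task is told to stop and its output is discarded.
    abandoned = [name for name, _ in attempts if name != winner]
    return winner, abandoned


attempts = [("tracker-slow", 120), ("tracker-fast", 45), ("tracker-stuck", None)]
winner, abandoned = speculative_winner(attempts)
print(winner, abandoned)  # tracker-fast ['tracker-slow', 'tracker-stuck']
```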