Hadoop 2.x Administration Cookbook

Configuring YARN history server

Whenever a MapReduce job runs, it launches containers on multiple nodes, and the logs for each container are written only on the node where it ran. A user who needs details of the job must visit every one of those nodes to fetch the logs, which is tedious in large clusters.

A better approach is to aggregate the logs at a common location once the job finishes, so that they can be accessed through a web server or by other means. To address this, Hadoop introduced log aggregation together with the History Server, which provides a Web UI where users can see the logs for all the containers of a job in one place.

Getting ready

You need a running cluster with YARN set up, and you should have completed the previous recipe to make sure that HDFS and YARN are working correctly.

The following steps will guide you through the process of setting up the JobHistory Server.

How to do it...

  1. Connect to the ResourceManager node, which is the YARN master, and switch to the user hadoop.
  2. Navigate to the directory /opt/cluster/hadoop/etc/hadoop.
  3. Edit the yarn-site.xml file to add the configurations described in the following steps.
  4. First, enable YARN log aggregation using the following parameter:
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
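  5. Add the JobHistory Server address. The RPC configuration parameter is mapreduce.jobhistory.address, set to the host and port of the history server (the port defaults to 10020).
  6. Add the JobHistory web server address, using the mapreduce.jobhistory.webapp.address parameter (the port defaults to 19888).
  7. Configure a location to store logs on HDFS, using the yarn.nodemanager.remote-app-log-dir parameter (the default is /tmp/logs).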
  8. Copy the yarn-site.xml file to all nodes in the cluster.
  9. Start history server on the master using the following command:
    $ mr-jobhistory-daemon.sh start historyserver
    
  10. Restart YARN daemons for changes to take effect, as shown next:
    $ stop-yarn.sh
    $ start-yarn.sh
    

How it works...

Let's take a look at what we did throughout this recipe. In steps 1 through 7, we enabled YARN log aggregation, which is disabled by default, and configured the history server's RPC and web server addresses along with the HDFS location where the logs are stored. In the remaining steps, we distributed the configuration, started the history server, and restarted YARN.

Whenever a container is cleaned up, a log collection thread wakes up and uploads its logs to the configured location. That location works much like a web hosting directory: the history server publishes its contents and makes them accessible through the Web UI. The yarn.log-aggregation.retain-seconds parameter sets the retention period, that is, how long the aggregated logs are kept.
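
The retention period can be set in yarn-site.xml along the lines of the following sketch. The seven-day value is an illustrative assumption rather than a recommendation from this recipe; the default is -1, which disables deletion of aggregated logs.

```xml
<property>
    <!-- How long aggregated logs are kept before deletion; -1 (the default) disables deletion -->
    <name>yarn.log-aggregation.retain-seconds</name>
    <!-- 604800 seconds = 7 days; illustrative value only -->
    <value>604800</value>
</property>
```

Once aggregation is enabled, the logs of a finished application can also be read from the command line with yarn logs -applicationId followed by the application ID, which is handy when the Web UI is not reachable.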

There's more...

In upcoming releases, a new server called the Timeline Server is used to maintain the history logs, and the job history server might be deprecated in the future.