
Profiling MapReduce

Using Starfish to analyze Hadoop MapReduce logs.


Occasionally, one encounters situations where a MapReduce program does not perform as well as expected. In such cases, possible solutions include alternative approaches to data partitioning, redesigning the MapReduce implementation of the chosen algorithm, or choosing a different algorithm altogether. To help the debugging process, Hadoop MapReduce generates verbose log files, which the programmer can use to pinpoint issues in the implementation. Unfortunately, for large jobs these logs can grow to hundreds of megabytes in size, and finding relevant information by hand can be extremely frustrating and time-consuming, not to mention that getting a general overview by just reading the logs is nigh impossible.

To alleviate these concerns, the Starfish project provides a tool for analysing Hadoop MapReduce logs through a graphical user interface, offering views of the execution timeline, data flows, and data/time skew. Additionally, it provides hints on optimizing one's Hadoop MapReduce cluster configuration for running specific jobs. Some features of the tool (such as the data flow view) require the use of the Starfish profiler, which, as of now, is only available for older versions of Hadoop MapReduce (0.20.2). Until an update is released, one can still use most features of the tool with the more recent 1.0.x series.
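To illustrate why analysing such logs by hand scales poorly, the sketch below scans attempt records in the style of Hadoop 0.20/1.0 job history logs and tallies task statuses. The log excerpt is made up for the example, and real job history files differ in detail from cluster to cluster; this is the kind of ad-hoc scripting the Log Analyzer's views replace.

```python
import re
from collections import Counter

# Made-up excerpt in the spirit of Hadoop 0.20/1.0 job history logs;
# real logs differ in detail and can run to hundreds of megabytes.
sample_log = """\
MapAttempt TASK_TYPE="MAP" TASKID="task_201201010000_0001_m_000000" TASK_STATUS="SUCCESS"
MapAttempt TASK_TYPE="MAP" TASKID="task_201201010000_0001_m_000001" TASK_STATUS="FAILED"
ReduceAttempt TASK_TYPE="REDUCE" TASKID="task_201201010000_0001_r_000000" TASK_STATUS="SUCCESS"
"""

def count_statuses(log_text):
    """Tally TASK_STATUS values across all attempt lines."""
    statuses = re.findall(r'TASK_STATUS="(\w+)"', log_text)
    return Counter(statuses)

print(count_statuses(sample_log))  # e.g. Counter({'SUCCESS': 2, 'FAILED': 1})
```

Even this toy script only answers one narrow question; getting timelines or skew information this way would require far more parsing, which is exactly what the graphical views provide out of the box.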

Using the Log Analyzer

The Starfish Log Analyzer uses Java Web Start and will automatically set up the necessary environment with but a few clicks. The installation instructions for the Starfish Profiler can be found on the Starfish website.