Hadoop

Description: Apache Hadoop is a software framework that supports data-intensive distributed applications.
Modulefile: None yet. It should extend PATH, perhaps replace /usr/local/hadoop-1.0.4/conf/hadoop-sge.sh, and probably require the java module.
Examples: See Usage below.
Documentation: Apache Hadoop home page; Command reference; SGE integration (for ACENET tech staff)
Notes: Support for Hadoop is being introduced at ACENET experimentally. At present it can be run only on selected hosts at Glooscap, and only by selected users who have been invited to help us evaluate the rollout. This page is provided to help those users, but has intentionally not yet been linked from the Available Software page.

Usage

Here's a sample job script. To make this work you'll need:

  • Java code. We use /usr/local/hadoop-1.0.4/hadoop-examples-1.0.4.jar below, which contains the standard Hadoop word-count example.
  • Sample data. We used Gutenberg e-text 20417.
  • /usr/local/hadoop-1.0.4/bin on your PATH. The script below exports it explicitly.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -l h_rt=1:00:00,h_vmem=4G
#$ -pe hadoop 64

# Make the hadoop command available.
export PATH=/usr/local/hadoop-1.0.4/bin:$PATH

source /usr/local/hadoop-1.0.4/conf/hadoop-sge.sh
hadoop_config  # Prepare the Hadoop environment.
hadoop_format  # OPTIONAL: Create the Hadoop filesystem.
hadoop_start   # Start the Hadoop services.

# Copy input to the HDFS filesystem.
hadoop fs -put gutenberg/pg20417.txt pg20417.txt

# Run the hadoop task(s) here. Specify the jar, class, input, and output.
hadoop jar ./hadoop-examples-1.0.4.jar wordcount pg20417.txt output

# Copy the output files from the HDFS filesystem.
hadoop fs -get output hadoop-output.$JOB_ID

# PLEASE always stop the Hadoop service processes before exiting the job!
hadoop_stop

Results from this sample script should be found in hadoop-output.$JOB_ID/part-r-00000.
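Each line of the word-count output is a word and its count, separated by a tab. To take a quick look at the results (with 12345 standing in for your actual job ID):

$ head hadoop-output.12345/part-r-00000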

For debugging purposes you might find it useful to run interactively:

$ qrsh -pe hadoop 16 -l h_rt=1:0:0,h_vmem=4G -cwd bash
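Once the interactive shell starts, the same helper functions apply. A minimal session might look like this, using only the functions and commands shown above:

$ source /usr/local/hadoop-1.0.4/conf/hadoop-sge.sh
$ hadoop_config
$ hadoop_start
$ hadoop fs -ls /     # poke around HDFS
$ hadoop_stop         # still required before exiting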

Data copied into the Hadoop filesystem (HDFS) will persist between jobs, though running hadoop_format may (or may not) destroy old data. Best practice is to delete data explicitly with hadoop fs -rmr or the equivalent when you are done with it.
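For example, to remove the input file and output directory from the sample job above once you no longer need them:

$ hadoop fs -rm pg20417.txt    # remove a single file
$ hadoop fs -rmr output        # remove a directory recursively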

Please always remember to call hadoop_stop before exiting, so daemons are not left occupying memory.
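One defensive pattern (a sketch, not something hadoop-sge.sh does for you) is to register hadoop_stop in a shell trap right after hadoop_start, so the daemons are shut down even if a later command in the script fails. No trap can run if Grid Engine hard-kills the job, so a generous h_rt is still advisable.

# In the job script, immediately after hadoop_start:
trap hadoop_stop EXIT   # run hadoop_stop on any normal or error exit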

The hadoop parallel environment will only work if you request a multiple of 16 slots, e.g. -pe hadoop 16, 32, 48, or 64.

Empirically, copying 100GB into HDFS with hadoop fs -put takes about 25 minutes.
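If you want to budget h_rt for your own data, you can time the copy step directly (mydata is a placeholder for your own directory):

$ time hadoop fs -put mydata mydata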

Troubleshooting and Known Issues

Insufficient memory can produce a wide variety of errors. Always request h_vmem=4G; in combination with the hadoop parallel environment, this ensures Grid Engine makes all the RAM in the machine available to the job.

The default replication factor in HDFS is 3; one of our testers recommended a value of 4. You can change the replication factor by editing hdfs-site.xml in conf.$JOB_ID after calling the shell function hadoop_config but before calling hadoop_format. A block size of 512M or 1G in HDFS has also been recommended.
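As a sketch of how that edit might be scripted (assuming hadoop_config writes a stock hdfs-site.xml into conf.$JOB_ID, and using the standard Hadoop 1.x property names dfs.replication and dfs.block.size):

hadoop_config

# Splice a replication factor of 4 and a 512M block size
# (536870912 bytes) in before the closing tag of hdfs-site.xml.
sed -i 's|</configuration>|<property><name>dfs.replication</name><value>4</value></property>\
<property><name>dfs.block.size</name><value>536870912</value></property>\
</configuration>|' conf.$JOB_ID/hdfs-site.xml

hadoop_format
hadoop_start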