Hadoop
- Description
- Apache Hadoop is a software framework that supports data-intensive distributed applications.
- Modulefile
- None yet. It should extend PATH, perhaps replace /usr/local/hadoop-1.0.4/conf/hadoop-sge.sh, and will probably require the java module.
- Examples
- See the Usage section below
- Documentation
- Apache Hadoop home page
- Command reference
- SGE integration (for ACENET tech staff)
- Notes
- Support for Hadoop is being introduced at ACENET experimentally. At the present time it can only be run on selected hosts at Glooscap, and only by selected users who have been invited to help us evaluate our rollout. This documentation page is being made available to help those users, but has intentionally not yet been linked to the Available Software page.
Usage
Here's a sample job script. To make this work you'll need:
- Java code. We use /usr/local/hadoop-1.0.4/hadoop-examples-1.0.4.jar below, which contains the standard Hadoop word-count example.
- Sample data. We used Gutenberg e-text 20417.
- Add /usr/local/hadoop-1.0.4/bin to your PATH, as shown below.
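For example, in bash you can prepend the Hadoop bin directory to your PATH like so (add the line to your ~/.bashrc to make it permanent):

$ export PATH=/usr/local/hadoop-1.0.4/bin:$PATH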
#$ -cwd
#$ -j y
#$ -l h_rt=1:00:00,h_vmem=4G
#$ -pe hadoop 64
source /usr/local/hadoop-1.0.4/conf/hadoop-sge.sh
hadoop_config    # Prepare the Hadoop environment.
hadoop_format    # OPTIONAL: Create the Hadoop filesystem.
hadoop_start     # Start the Hadoop services.
# Copy input to the HDFS filesystem.
hadoop fs -put gutenberg/pg20417.txt pg20417.txt
# Run the hadoop task(s) here. Specify the jar, class, input, and output.
hadoop jar ./hadoop-examples-1.0.4.jar wordcount pg20417.txt output
# Copy the output files from the HDFS filesystem.
hadoop fs -get output hadoop-output.$JOB_ID
# PLEASE always stop the Hadoop service processes before exiting the job!
hadoop_stop
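If you save the script above as, say, wordcount.sh (the name is arbitrary), submit it in the usual way:

$ qsub wordcount.sh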
Results from this sample script should be found in hadoop-output.$JOB_ID/part-r-00000.
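The word-count output is plain text with one word and its count per line, so you can inspect it with ordinary tools. For example, to list the most frequent words (substitute the actual job ID):

$ sort -k2 -nr hadoop-output.$JOB_ID/part-r-00000 | head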
For debugging purposes you might find it useful to run interactively:
$ qrsh -pe hadoop 16 -l h_rt=1:0:0,h_vmem=4G -cwd bash
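From that interactive shell you can then run the same steps as in the job script one at a time, watching for errors as you go, for example:

$ source /usr/local/hadoop-1.0.4/conf/hadoop-sge.sh
$ hadoop_config
$ hadoop_start
$ hadoop fs -ls /
$ hadoop_stop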
Data copied into the Hadoop filesystem will persist between jobs, though running hadoop_format may (or may not) destroy old data. Best practice is to delete data explicitly with hadoop fs -rmr or the equivalent when you are done with it.
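For example, to remove the sample input file and the output directory from HDFS once you have retrieved your results:

$ hadoop fs -rm pg20417.txt
$ hadoop fs -rmr output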
Please always remember to call hadoop_stop before exiting, so daemons are not left occupying memory.
The hadoop parallel environment will only work if you request a multiple of 16 slots (16, 32, 48, and so on).
We have empirically measured the time to hadoop fs -put 100GB of data to be about 25 minutes.
Troubleshooting and Known Issues
Many different errors can be produced by insufficient memory. Always request h_vmem=4G. In combination with -pe hadoop, this ensures that Grid Engine makes all the RAM in the machine available.
The default replication factor in HDFS is set to 3. One of our testers recommended a value of 4. You can change the replication factor by editing 'hdfs-site.xml' in conf.$jobid after calling the shell function hadoop_config but before calling hadoop_format. A block size of 512M or 1G in HDFS has also been recommended.
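For example, a replication factor of 4 and a 512M block size would correspond to the following stanzas in hdfs-site.xml (property names as in Hadoop 1.x; dfs.block.size is given in bytes):

<property>
  <name>dfs.replication</name>
  <value>4</value>
</property>
<property>
  <name>dfs.block.size</name>
  <value>536870912</value>  <!-- 512M -->
</property>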