Measuring Code Performance


You are writing code, perhaps parallel code, and its performance and scaling properties are of interest. You need tools to let you evaluate the code's speed, but you probably also want to control the sources of variance in the speed. The scheduling policies and variety of hardware in ACENET clusters don't make this easy, but it is possible to eliminate many sources of variance by making appropriate resource requests for your jobs.

This article will address the following sources of variance:

  • Different numbers of parallel processes per node.
  • Different node hardware: CPU models, motherboard architecture.
  • Contention for node resources: Sharing of node-local buses, disk, etc.
  • Contention for cluster-wide resources: Inter-node communication network. Shared filesystems.

You are encouraged to contact your local Computational Research Consultant for help, advice and discussion of these matters.

Record the assigned hardware

Each job you run should record in its standard output the identity of the hosts on which it ran, as well as other information such as the dates and times the job began and ended. Grid Engine creates a "hostfile" listing the assigned hosts and the number of slots on each, and puts the name of the hostfile in the environment variable PE_HOSTFILE, so we recommend including lines like these in your job scripts:

#$ -l h_rt=...
echo "Contents of PE_HOSTFILE ------"
cat $PE_HOSTFILE
echo "----------- end of PE_HOSTFILE"
echo "Job begins at $(date)"
mpirun ...
echo "Job ends at $(date)"

For serial or shared-memory jobs you can just call hostname instead of dumping $PE_HOSTFILE.
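For a serial or OpenMP job the equivalent might look something like the sketch below; ./my_program is a placeholder for your own executable.

#$ -l h_rt=...
#$ -pe openmp 4                    # omit this line for a purely serial job
echo "Job runs on host $(hostname)"
echo "Job begins at $(date)"
./my_program                       # placeholder executable
echo "Job ends at $(date)"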

Use local disk

(To be written: $TMPDIR)
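Briefly: Grid Engine exports a job-private scratch directory in $TMPDIR, typically on the node's local disk, so staging files there keeps I/O off the shared filesystem. A minimal sketch, with placeholder file and program names:

cp input.dat $TMPDIR/              # input.dat is a placeholder name
cd $TMPDIR
./my_program                       # placeholder executable
cp output.dat $SGE_O_WORKDIR/      # copy results back to the submission directory before the job ends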

Get a regular distribution of processes

For distributed parallel programming (such as MPI) the normal practice at ACENET is to request -pe ompi* N, where N is the slot count. When your job is scheduled, the slots may be packed onto a minimal number of nodes, or distributed across N nodes, or anything in between. This unpredictable distribution of processes can result in performance variance because communications between processes on the same node are usually faster than communications between nodes. While you should certainly write your code to handle that situation efficiently, during development and testing you may wish to get a regular distribution of processes; say, one on every node, or four per node, or sixteen.

ACENET provides parallel environments analogous to ompi* which guarantee this, named

1per*
2per*
4per*
16per*

Not all of these are available at all sites. You can get the list of what's available at a given site with qconf -spl.

Obviously, jobs with these parallel environments may take longer to schedule than similar jobs requesting ompi* since the scheduler has to satisfy tighter constraints. Also note that with 1per*, 2per*, and 4per*, the parallel tasks may still be in contention with other users' processes on the individual nodes; see Get entire nodes below.
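For example, a request along the following lines (runtime elided) asks for 16 slots laid out as exactly four processes on each of four nodes:

#$ -l h_rt=...
#$ -pe 4per* 16      # 16 slots, 4 per node, i.e. 4 nodes
echo "Contents of PE_HOSTFILE ------"
cat $PE_HOSTFILE
echo "----------- end of PE_HOSTFILE"
mpirun ...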

Get entire nodes

When using -pe ompi* your code may share hosts with jobs belonging to other users. You can eliminate variance due to competition for host resources (e.g. memory bus bandwidth) by requesting entire hosts.

The openmp parallel environment guarantees that all the job's slots are assigned on a single host. Since ACENET has no general-production hosts with more than 16 slots, -pe openmp 16 guarantees that your job will be the only job on its host. To get exclusive control of multiple hosts, use -pe 16per* as described under Get a regular distribution of processes above.
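For example, a shared-memory job could claim a whole 16-slot host with something like the sketch below; the OMP_NUM_THREADS line assumes an OpenMP code and ./my_program is a placeholder.

#$ -l h_rt=...
#$ -pe openmp 16                 # all 16 slots on one host, so no other job shares it
export OMP_NUM_THREADS=$NSLOTS
echo "Job runs on host $(hostname)"
./my_program                     # placeholder executable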

A job requesting -pe openmp 4 might gain exclusive control of a 4-slot host, or it might be assigned to part of a 16-slot host. Similarly, -pe 4per* 8 might be assigned exclusive control of two 4-slot hosts, parts of two larger hosts, or some mixture. If you need to do this sort of experiment, see the hints under Get consistent hardware, below.

Get consistent hardware

Most of ACENET's clusters have grown over time with successive purchases of different hardware. As a result, two otherwise identical jobs may be assigned to different hardware and thus run at different speeds. You can control for this source of variance in your analysis by Recording the assigned hardware as noted above, and then comparing the recorded host identities with the hardware information given at Compute Resources and the individual cluster descriptions linked therefrom.

However, you can also submit jobs that are constrained to run only on certain machines. This requires that you identify the block of machines that suits your requirements, for which again you should consult Compute Resources. Here are some Grid Engine directives that will restrict assignment to particular blocks of hardware with matching architecture:

# At Glooscap:
#$ -q *@cl0[0-4]*|cl05[0-8]        # cl001-cl058, Hewlett-Packard 4-core nodes
#$ -q *@cl059|cl0[6-8]*|cl09[0-7]  # cl059-cl097, SunBlade x6440 16-core nodes
#$ -q *@cl098|cl099|cl1*           # cl098-cl183, SGI C2112-4G3 16-core nodes
#
# At Placentia:
#$ -q *@cl00*|cl01[0-2]|cl02[1-9]|cl030          # cl001-..-cl030, SunFire X4600 16-core nodes
#$ -q *@cl05[6-9]|cl0[6789]*|cl10[0-8]           # cl056-cl108, SunFire X2200M2 4-core nodes
#$ -q *@cl13[5-9]|cl1[4-9]*|cl2[0-5]*|cl26[0-6]  # cl135-cl266, SunBlade X6240 16-core nodes
#$ -q *@cl32[1-9]|cl33[0-7]                      # cl321-cl337, SunBlade X4140 nodes

Insert only one of these in any given job script; using more than one defeats the purpose. Also note that even within these example blocks, other details such as the amount of RAM or local disk may vary.

Detailed data on the installed CPUs (steppings, clock rates) can be obtained for a given host with ssh $host cat /proc/cpuinfo.
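For example, something like the following sketch (cl001 is a placeholder host name) extracts just the CPU model line, either for one host or for every host assigned to a running job:

# One-off check from a login node
ssh cl001 "grep -m1 'model name' /proc/cpuinfo"

# Inside a job script: report the CPU model on each assigned host
for host in $(awk '{print $1}' $PE_HOSTFILE); do
    echo -n "$host: "
    ssh $host "grep -m1 'model name' /proc/cpuinfo"
done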

Use test.q

(To be written: test.q will be useful in certain cases and at certain sites.)

Remaining sources of variance

(To be written: Network. Shared filesystems.)