The aim of this page is mainly to explain how to use the HTCondor network hosted by the nac* computers for the AHEP group at IFIC.
It is intended as an addition to the notes available here.
While the notes treat HTCondor more in general, here you will find specific instructions for the AHEP network.
Job management within the nodes of the AHEP HTCondor network can be done only through nac60,
while from all the nodes of the network you can monitor
the status of the runs (condor_q)
and of the network (condor_status).
If you don't have an account on nac60, let me know.
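For example, from any node of the network you can run something like the following (the job ID 109.0 is just a placeholder):
condor_q                         # list your jobs in the queue
condor_q -better-analyze 109.0   # explain why a given job is or is not matching machines
condor_status                    # list the machines of the network and their state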
The nac* nodes share neither a common filesystem nor the same users
(in principle most of them have the same usernames,
but not always the same IDs or permissions).
For HTCondor purposes, this is sufficient to prevent the jobs from running under the same username as the user who submitted them,
except when the job runs on nac60 itself.
HTCondor will use special users (ahepcondor[1-12]) for running all jobs.
Take this into account when you need to use packages or libraries that have only been installed for your user!
In particular, you may need to make such packages visible to the job, for example by setting
PYTHONPATH in the environment (export PYTHONPATH=...),
or by installing them locally at the beginning of the run
(in this case, use for example pip install --user ...), as in the sketch below.
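A minimal sketch of the two options, to be placed at the top of your wrapper script (the path /home/common/mypythonmodules and the package names are purely illustrative):
# option 1: point PYTHONPATH to a location readable by the ahepcondor* users
export PYTHONPATH=/home/common/mypythonmodules:$PYTHONPATH
# option 2: install the packages for the user actually running the job, at the start of the run
pip install --user numpy scipy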
If your job needs to run on specific machines, you can restrict the match in the submission file, for example adding:
requirements = TARGET.Machine=="nac49.ahep" || TARGET.Machine=="nac50.ahep"
to use either nac49 or nac50.
Another point is that jobs will run inside some HTCondor folder, whose name is different for every run.
It is better if your program works entirely within the local folder, without using absolute paths.
The list of files that HTCondor has to copy before and after the execution can be specified in the submission script.
As a general suggestion, I recommend running each job through a bash wrapper, so that input and output files can be transferred properly,
in particular if you need to copy many files or folders.
For example, you can use the following configuration in the submission script:
Universe   = vanilla
Executable = runner.sh
jobname = somename
transfer_input_files=code.tar.gz,in.tar.gz
when_to_transfer_output = ON_EXIT_OR_EVICT
should_transfer_files = YES
transfer_output_files = out.tar.gz
transfer_output_remaps= "out.tar.gz=$(JOBNAME).$(Cluster).$(process).tar.gz"
Log        = $(JOBNAME).$(Cluster).$(process).l
Output     = $(JOBNAME).$(Cluster).$(process).o
Error      = $(JOBNAME).$(Cluster).$(process).e
Queue Arguments From (
    10 a
    11 b
    12 c
    13 d
)
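Assuming the submit description above is saved in a file called job.sub (the name is arbitrary), submitting and checking the jobs would look like:
condor_submit job.sub   # queues 4 processes in a single cluster
condor_q                # verify that they appear as idle or running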
HTCondor will transfer the files listed in transfer_input_files before starting the run
(the executable will always be copied, except in docker universe jobs),
and at the end it will transfer the file specified in transfer_output_files
from the executing machine back to nac60,
renamed according to the instruction transfer_output_remaps:
in this case you will find something like somename.109.(0-3).tar.gz.
A list of log|output|error files named e.g. somename.109.(0-3).(l|o|e) will also be created in the same folder.
The Queue Arguments From (...) syntax will generate 4 different runs,
corresponding to calling, each in a different process:
runner.sh 10 a
runner.sh 11 b
runner.sh 12 c
runner.sh 13 d
runner.sh will be a script accepting two arguments, for example something like:
#!/bin/bash
# extract code and input data, compile:
tar xzf code.tar.gz
tar xzf in.tar.gz
make
# run the real job with the first argument of runner.sh
./mycommand $1
# detect if it was successful or not, if it applies:
err=$?
# compress the output folder into the output file, to transfer it back
tar czvf out.tar.gz $2/*
exit $err
code.tar.gz and in.tar.gz contain the input files that will be copied and extracted in the appropriate locations,
while the output folder will be compressed into out.tar.gz before the job concludes.
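For reference, a possible way to prepare the two tarballs before submitting could be (the folder and file names are only examples):
tar czf code.tar.gz src/ Makefile   # sources needed to build mycommand
tar czf in.tar.gz inputdata/        # input files read by the program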
In HTCondor, it is possible to define user-specified variables that will be checked at matching time to decide if a machine is appropriate to run a specific job.
For convenience, the variables HasGlobes, HasPlanckLast (Planck 2018 likelihoods in /home/common/Planck18/) and HasRoot are available at matching time.
For example, if you need Root for your run, you can ask HTCondor to use only machines where it is present, by adding
requirements = TARGET.HasRoot to your submission file.
The same can be done with Globes (requirements = TARGET.HasGlobes).
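If you need more than one package at the same time, the requirements can be combined with the usual ClassAd operators, for example:
requirements = TARGET.HasRoot && TARGET.HasGlobes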
To list all the machines that have the XXX library or code, use:
condor_status -const "HasXXX"
At the moment, only these variables are considered. In principle it is possible to define more of them, either to detect the presence of other packages or to ask for specific software versions. Ask me if you need one.
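For reference, on the machine side such a variable is normally advertised through the local HTCondor configuration of the execute node, along these lines (a sketch, with a hypothetical HasNewPackage attribute):
# in the local condor configuration of the machine (e.g. a file in /etc/condor/config.d/)
HasNewPackage = True
STARTD_ATTRS = $(STARTD_ATTRS) HasNewPackage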
Sometimes, if you need to install many packages with complicated dependencies on many different computers, or you do not have root access to install things globally, it is easier to configure a container and run your programs inside it (see this page for more details). Containers provide the same environment in all instances executing your jobs, independently of the host system, even when the required libraries are not present on the system where the container runs (they are inside the container!).
If you have a docker image (say nameoftheimage) that you want to use in your docker runs,
you should first push it to the local registry nac46.ific.uv.es:443, for example using:
docker tag nameoftheimage nac46.ific.uv.es:443/nameoftheimage
docker login -u ahepdocker -p `cat /etc/docker/ahepdocker_nac46_pass` nac46.ific.uv.es:443
docker push nac46.ific.uv.es:443/nameoftheimage
If you get a permission denied error, your user just lacks the required permissions to run docker
(sudo adduser yourusername docker && sudo adduser yourusername ahepcondor, then logging out and in again, will fix the problem).
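After logging back in, you can verify that docker works for your user with the standard test image:
docker run hello-world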
To check which images are available in the local registry, you can use:
curl https://ahepdocker:`cat /etc/docker/ahepdocker_nac46_pass`@nac46.ific.uv.es:443/v2/_catalog
Among them you should find hello-world, used below.
To submit a docker universe job in the AHEP network with HTCondor, you can use this submission script:
Universe = docker
JOBNAME = testDockerNac46
imagename = hello-world
docker_image=nac46.ific.uv.es:443/$(imagename)
transfer_input_files = logincondor.sh
+PreCmd = "./logincondor.sh"
+PreArgs = "$(imagename)"
Log = $(JOBNAME).$(Cluster).l
Output = $(JOBNAME).$(Cluster).o
Error = $(JOBNAME).$(Cluster).e
Queue
Save it in a file named docker_job and you will be able to submit with condor_submit docker_job.
logincondor.sh (must be executable: chmod +x logincondor.sh)
is required to log in to the local registry and download the image locally:
#!/bin/bash
docker login -u ahepdocker -p `cat /etc/docker/ahepdocker_nac46_pass` nac46.ific.uv.es:443
echo nac46.ific.uv.es:443/$1
docker pull nac46.ific.uv.es:443/$1