The aim of this page is mainly to explain how to use the HTCondor network hosted by the nac* computers for the AHEP group at IFIC.
It is intended as an addition to the notes available here.
While the notes treat HTCondor more in general, here you will find specific instructions for the AHEP network.
Job management within the nodes of the AHEP HTCondor network can be done only through nac60,
while from all the nodes of the network you can monitor
the status of the runs (condor_q)
and of the network (condor_status).
If you don't have an account on nac60, let me know.
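For example, from any node of the network you can run something like the following (the job ID 109.0 is just a placeholder):
condor_q                         # list your jobs in the queue
condor_q -better-analyze 109.0   # explain why a given job is or is not matching machines
condor_status                    # list the machines of the network and their state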
The nac* nodes share neither a common filesystem nor the same users
(in principle most of them have the same usernames,
but not always the same IDs or permissions).
For HTCondor purposes, this is sufficient to prevent the jobs from running under the same username as the user who submitted them,
except when the job runs on nac60 itself.
HTCondor will use special users (ahepcondor[1-12]) for running all jobs.
Take this into account when you need to use packages or libraries that have only been installed for your user!
In particular, you may need to make such packages visible to the job, for example by setting
PYTHONPATH in the environment (export PYTHONPATH=...),
or by installing them locally at the beginning of the run
(in this case, use for example pip install --user ...), as in the sketch below.
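A minimal sketch of the two options, to be placed at the top of your wrapper script (the path /home/common/mypythonmodules and the package names are purely illustrative):
# option 1: point PYTHONPATH to a location readable by the ahepcondor* users
export PYTHONPATH=/home/common/mypythonmodules:$PYTHONPATH
# option 2: install the packages for the user actually running the job, at the start of the run
pip install --user numpy scipy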
If your job needs to run on specific machines, you can restrict the match in the submission file, for example adding:
requirements = TARGET.Machine=="nac49.ahep" || TARGET.Machine=="nac50.ahep"
to use either nac49 or nac50.
Another point is that jobs will run inside some HTCondor folder, whose name is different for every run.
It is better if your program works entirely within the local folder, without using absolute paths.
The list of files that HTCondor has to copy before and after the execution can be specified in the submission script.
As a general suggestion, I recommend running each job through a bash wrapper, so that input and output files can be transferred properly,
in particular if you need to copy many files or folders.
For example, you can use the following configuration in the submission script:
Universe   = vanilla
Executable = runner.sh
jobname = somename
transfer_input_files=code.tar.gz,in.tar.gz
when_to_transfer_output = ON_EXIT_OR_EVICT
should_transfer_files = YES
transfer_output_files = out.tar.gz
transfer_output_remaps= "out.tar.gz=$(JOBNAME).$(Cluster).$(process).tar.gz"
Log        = $(JOBNAME).$(Cluster).$(process).l
Output     = $(JOBNAME).$(Cluster).$(process).o
Error      = $(JOBNAME).$(Cluster).$(process).e
Queue Arguments From (
    10 a
    11 b
    12 c
    13 d
)
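Assuming the submit description above is saved in a file called job.sub (the name is arbitrary), submitting and checking the jobs would look like:
condor_submit job.sub   # queues 4 processes in a single cluster
condor_q                # verify that they appear as idle or running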
HTCondor will transfer the files listed in transfer_input_files before starting the run
(the executable will always be copied, except in docker universe jobs),
and at the end it will transfer the file specified in transfer_output_files
from the executing machine back to nac60,
renamed according to the instruction transfer_output_remaps:
in this case you will find something like somename.109.(0-3).tar.gz.
A list of log|output|error files named e.g. somename.109.(0-3).(l|o|e) will also be created in the same folder.
The Queue Arguments From (...) syntax will generate 4 different runs,
corresponding to calling, each in a different process:
runner.sh 10 a
runner.sh 11 b
runner.sh 12 c
runner.sh 13 d
runner.sh will be a script accepting two arguments, for example something like:
#!/bin/bash
# extract code and input data, compile:
tar xzf code.tar.gz
tar xzf in.tar.gz
make
# run the real job with the first argument of runner.sh
./mycommand $1
# detect if it was successful or not, if it applies:
err=$?
# compress the output folder into the output file, to transfer it back
tar czvf out.tar.gz $2/*
exit $err
code.tar.gz and in.tar.gz contain the input files that will be copied and extracted in the appropriate locations,
while the output folder will be compressed into out.tar.gz before the job concludes.
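For reference, a possible way to prepare the two tarballs before submitting could be (the folder and file names are only examples):
tar czf code.tar.gz src/ Makefile   # sources needed to build mycommand
tar czf in.tar.gz inputdata/        # input files read by the program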
In HTCondor, it is possible to define user-specified variables that will be checked at matching time to decide if a machine is appropriate to run a specific job.
For convenience, the variables HasGlobes, HasPlanckLast (Planck 2018 likelihoods in /home/common/Planck18/) and HasRoot are available at matching time.
For example, if you need Root for your run, you can ask HTCondor to use only machines where it is present, by adding
requirements = TARGET.HasRoot to your submission file.
The same can be done with Globes (requirements = TARGET.HasGlobes).
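If you need more than one package at the same time, the requirements can be combined with the usual ClassAd operators, for example:
requirements = TARGET.HasRoot && TARGET.HasGlobes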
To list all the machines that have the XXX library or code, use:
condor_status -const "HasXXX"
At the moment, only these variables are considered. In principle it is possible to define more of them, either to detect the presence of other packages or to ask for specific software versions. Ask me if you need one.
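For reference, on the machine side such a variable is normally advertised through the local HTCondor configuration of the execute node, along these lines (a sketch, with a hypothetical HasNewPackage attribute):
# in the local condor configuration of the machine (e.g. a file in /etc/condor/config.d/)
HasNewPackage = True
STARTD_ATTRS = $(STARTD_ATTRS) HasNewPackage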
Sometimes, if you need to install many packages with complicated dependencies on many different computers, or you do not have root access to install things globally, it is easier to configure a container and run your programs inside it (see this page for more details). Containers provide the same environment in all instances executing your jobs, independently of the host system, even when the required libraries are not present on the system where the container runs (they are inside the container!).
If you have a docker image (say nameoftheimage) that you want to use in your docker runs,
you should first push it to the local registry nac46.ific.uv.es:443, for example using:
docker tag nameoftheimage nac46.ific.uv.es:443/nameoftheimage
docker login -u ahepdocker -p `cat /etc/docker/ahepdocker_nac46_pass` nac46.ific.uv.es:443
docker push nac46.ific.uv.es:443/nameoftheimage
If you get a permission denied error, your user just lacks the required permissions to run docker
(sudo adduser yourusername docker && sudo adduser yourusername ahepcondor, then logging out and in again, will fix the problem).
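After logging back in, you can verify that docker works for your user with the standard test image:
docker run hello-world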
To check which images are available in the local registry, you can use:
curl https://ahepdocker:`cat /etc/docker/ahepdocker_nac46_pass`@nac46.ific.uv.es:443/v2/_catalog
Among them you should find hello-world, used below.
To submit a docker universe job in the AHEP network with HTCondor, you can use this submission script:
Universe = docker
JOBNAME = testDockerNac46
imagename = hello-world
docker_image=nac46.ific.uv.es:443/$(imagename)
transfer_input_files = logincondor.sh
+PreCmd = "./logincondor.sh"
+PreArgs = "$(imagename)"
Log = $(JOBNAME).$(Cluster).l
Output = $(JOBNAME).$(Cluster).o
Error = $(JOBNAME).$(Cluster).e
Queue
Save it in a file named docker_job and you will be able to submit with condor_submit docker_job.
logincondor.sh (must be executable: chmod +x logincondor.sh)
is required to log in to the local registry and download the image locally:
#!/bin/bash
docker login -u ahepdocker -p `cat /etc/docker/ahepdocker_nac46_pass` nac46.ific.uv.es:443
echo nac46.ific.uv.es:443/$1
docker pull nac46.ific.uv.es:443/$1