REANA-Job-Controller


REANA-Job-Controller is a component of the REANA reusable and reproducible research data analysis platform. It takes care of executing and managing jobs on compute clouds.

Features

  • submit jobs to compute clouds

  • enquire about status of running jobs

Usage

Detailed information on how to install and use REANA can be found at docs.reana.io.

API

Compute backends

REANA-Job-Controller offers an abstract interface to submit jobs to different compute backends.

Job Manager.

class reana_job_controller.job_manager.JobManager(docker_img='', cmd=[], prettified_cmd='', env_vars={}, workflow_uuid=None, workflow_workspace=None, job_name=None)[source]

Job management interface.

after_execution()[source]

After job submission hook.

before_execution()[source]

Before job submission hook.

cache_job()[source]

Cache a job.

create_job_in_db(backend_job_id)[source]

Create job in db.

execution_hook()[source]

Add before execution hooks and DB operations.

classmethod get_logs(backend_job_id, **kwargs)[source]

Return job logs if log files are present.

Parameters:
  • backend_job_id – ID of the job in the backend.

  • kwargs – Additional parameters needed to fetch logs. These depend on the chosen compute backend.

Returns:

String containing the job logs.

get_status()[source]

Get job status.

Returns:

job status.

Return type:

str

stop()[source]

Stop a job.

update_job_status()[source]

Update job status in DB.
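To illustrate how a backend could plug into such an interface, here is a minimal, self-contained sketch. It does not import the real `reana_job_controller` package: `SketchJobManager` is a stand-in that only mirrors some of the method names documented above, and `InMemoryJobManager`, the `execute()` method, the order in which the hooks are called, and the status strings are all assumptions made for illustration.

```python
import uuid


class SketchJobManager:
    """Stand-in mirroring some documented method names (not the real class)."""

    def __init__(self, docker_img="", cmd=None, env_vars=None, job_name=None):
        self.docker_img = docker_img
        self.cmd = cmd or []
        self.env_vars = env_vars or {}
        self.job_name = job_name
        self.backend_job_id = None
        self.events = []  # records the order in which the hooks ran

    def before_execution(self):
        self.events.append("before_execution")

    def after_execution(self):
        self.events.append("after_execution")

    def create_job_in_db(self, backend_job_id):
        # A real implementation would write to the database; here we only log.
        self.events.append(f"create_job_in_db:{backend_job_id}")

    def execute(self):
        """Wrap the backend-specific submission with hooks (assumed order)."""
        self.before_execution()
        self.backend_job_id = self._submit()
        self.create_job_in_db(self.backend_job_id)
        self.after_execution()
        return self.backend_job_id

    def _submit(self):
        raise NotImplementedError("each compute backend provides its own")


class InMemoryJobManager(SketchJobManager):
    """Hypothetical backend that keeps job states in a dictionary."""

    # Shared across instances, standing in for a remote compute backend.
    _jobs = {}

    def _submit(self):
        backend_job_id = str(uuid.uuid4())
        self._jobs[backend_job_id] = "running"
        return backend_job_id

    def get_status(self):
        return self._jobs[self.backend_job_id]

    def stop(self):
        self._jobs[self.backend_job_id] = "stopped"
```

Calling `InMemoryJobManager(docker_img="docker.io/library/busybox", cmd=["echo", "hi"]).execute()` returns the generated job ID, after which `get_status()` and `stop()` operate on that job.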


Kubernetes

Kubernetes Job Manager.

class reana_job_controller.kubernetes_job_manager.KubernetesJobManager(docker_img=None, cmd=None, prettified_cmd=None, env_vars=None, workflow_uuid=None, workflow_workspace=None, cvmfs_mounts='false', shared_file_system=False, job_name=None, kerberos=False, kubernetes_uid=None, kubernetes_memory_limit=None, voms_proxy=False, rucio=False, kubernetes_job_timeout: int | None = None, **kwargs)[source]

Kubernetes job management.

MAX_NUM_JOB_RESTARTS = 0

Maximum number of job restarts in case of internal failures.

MAX_NUM_RESUBMISSIONS = 3

Maximum number of job submission/creation attempts.
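A bounded submission retry of this kind can be sketched as follows. This is only an illustration of the general pattern, not the actual REANA code: the `submit_with_retries` helper and the use of `RuntimeError` as the backend error type are hypothetical.

```python
MAX_NUM_RESUBMISSIONS = 3  # same value as the class attribute above


def submit_with_retries(submit, max_tries=MAX_NUM_RESUBMISSIONS):
    """Call ``submit`` until it succeeds or ``max_tries`` attempts were made."""
    last_error = None
    for _ in range(max_tries):
        try:
            return submit()
        except RuntimeError as error:  # hypothetical backend error type
            last_error = error
    # Every attempt failed: surface the last error to the caller.
    raise RuntimeError(f"giving up after {max_tries} tries") from last_error
```

The helper returns as soon as one attempt succeeds and raises only once the limit is reached, so the caller can mark the job as failed at that point.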

add_eos_volume()[source]

Add EOS volume to a given job spec.

add_hostpath_volumes()[source]

Add hostPath mounts from configuration to job.

add_image_pull_secrets()[source]

Attach to the container the configured image pull secrets.

add_kubernetes_job_timeout()[source]

Add job timeout to the job spec.
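One way to express a timeout in a Kubernetes Job manifest is the `activeDeadlineSeconds` field of the Job spec. The sketch below assumes that approach; whether the method above uses exactly this field is an assumption, and the `add_job_timeout` helper is hypothetical.

```python
import copy


def add_job_timeout(job_spec, timeout_seconds):
    """Return a copy of ``job_spec`` with a deadline, if one is given."""
    if timeout_seconds is None:
        # No timeout requested: leave the manifest untouched.
        return job_spec
    if timeout_seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    updated = copy.deepcopy(job_spec)
    updated.setdefault("spec", {})["activeDeadlineSeconds"] = int(timeout_seconds)
    return updated
```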

add_memory_limit(job_spec)[source]

Add limits.memory to job accordingly.

add_shared_volume()[source]

Add shared CephFS volume to a given job spec.

add_volumes(volumes)[source]

Add provided volumes to job.

Parameters:

volumes – A list of tuples, each consisting of a Kubernetes volumeMount spec followed by a Kubernetes volume spec.
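The shape of the `volumes` argument can be illustrated with plain dictionaries. This is a minimal sketch, assuming a job manifest with a single container; the `attach_volumes` helper and the example volume names are hypothetical and do not reproduce the actual method body.

```python
def attach_volumes(job_spec, volumes):
    """Attach (volumeMount, volume) pairs to the first container of a job."""
    pod_spec = job_spec["spec"]["template"]["spec"]
    container = pod_spec["containers"][0]
    for volume_mount, volume in volumes:
        # The mount goes on the container, the volume on the pod.
        container.setdefault("volumeMounts", []).append(volume_mount)
        pod_spec.setdefault("volumes", []).append(volume)
    return job_spec


job = {"spec": {"template": {"spec": {"containers": [{"name": "job"}]}}}}
volumes = [
    (
        {"name": "data", "mountPath": "/data"},  # volumeMount spec
        {"name": "data", "emptyDir": {}},  # volume spec
    ),
]
attach_volumes(job, volumes)
```

The `name` key is what ties each volumeMount to its volume, which is why the two specs are passed together as one tuple.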

add_workspace_volume()[source]

Add workspace volume to a given job spec.

classmethod get_logs(backend_job_id, **kwargs)[source]

Return job logs.

Parameters:
  • backend_job_id – ID of the job in the backend.

  • kwargs – Additional parameters needed to fetch logs. In the case of Kubernetes, the job_pod parameter can be specified to avoid fetching the pod specification from Kubernetes.

Returns:

String containing the job logs.

set_memory_limit(kubernetes_memory_limit)[source]

Set memory limit for job pods. Validate if provided format is correct.
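A format check of this kind can be sketched with a regular expression. The pattern below is a simplified assumption covering plain byte counts and the common Kubernetes quantity suffixes; it is not the pattern REANA uses, and `is_valid_memory_limit` is a hypothetical helper.

```python
import re

# Simplified assumption: a number, optionally followed by one quantity suffix.
MEMORY_LIMIT_PATTERN = re.compile(
    r"^\d+(\.\d+)?(Ei|Pi|Ti|Gi|Mi|Ki|E|P|T|G|M|k)?$"
)


def is_valid_memory_limit(value):
    """Return True if ``value`` looks like a Kubernetes memory quantity."""
    return bool(MEMORY_LIMIT_PATTERN.match(value))
```

With this pattern, values such as `256Mi` or `4Gi` are accepted, while `256MB` is rejected because `MB` is not one of the listed suffixes.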

set_user_id(kubernetes_uid)[source]

Set user id for job pods. UIDs < 100 are refused for security.
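The UID rule stated above can be sketched as a small validator. The threshold of 100 comes from the description; the `validate_user_id` helper, the constant name and the choice of `ValueError` are assumptions.

```python
MIN_ALLOWED_UID = 100  # UIDs below this value are refused


def validate_user_id(kubernetes_uid):
    """Return the UID as an integer, refusing values below the threshold."""
    uid = int(kubernetes_uid)
    if uid < MIN_ALLOWED_UID:
        raise ValueError(
            f"UID {uid} is refused: it must be at least {MIN_ALLOWED_UID}"
        )
    return uid
```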

stop(asynchronous=True)[source]

Stop Kubernetes job execution.

Parameters:
  • backend_job_id – Kubernetes job id.

  • asynchronous – Whether the function waits for the action to be performed or does it asynchronously.

Note

REANA-Job-Controller supports the Kubernetes job manager by default, so there is no need to pass any build argument.

HTCondor

Note

To build the REANA-Job-Controller Docker image with HTCondor dependencies, use the build argument COMPUTE_BACKENDS=kubernetes,htcondorcern.

$ reana-dev docker-build -c reana-job-controller \
  -b COMPUTE_BACKENDS=kubernetes,htcondorcern

Slurm

Note

To build the REANA-Job-Controller Docker image with Slurm dependencies, use the build argument COMPUTE_BACKENDS=kubernetes,slurmcern.

$ reana-dev docker-build -c reana-job-controller \
  -b COMPUTE_BACKENDS=kubernetes,slurmcern

Note

Please note that CERN Slurm cluster access is not granted by default.

REST API

The REANA Job Controller API offers different endpoints to create, manage and monitor jobs. Detailed REST API documentation can be found here.

Changelog

0.9.3 (2024-03-04)

Build

  • certificates: update expired CERN Grid CA certificate (#440) (8d6539a), closes #439

  • docker: non-editable submodules in “latest” mode (#416) (3bdda63)

  • python: bump all required packages as of 2024-03-04 (#442) (de119eb)

  • python: bump shared REANA packages as of 2024-03-04 (#442) (fc77628)

Features

  • shutdown: stop all running jobs before stopping workflow (#423) (866675b)

Bug fixes

  • database: limit the number of open database connections (#437) (980f749)

Performance improvements

  • cache: avoid caching jobs when the cache is disabled (#435) (553468f), closes #422

Code refactoring

  • db: set job status also in the main database (#423) (9d6fc99)

  • docs: move from reST to Markdown (#428) (4732884)

  • monitor: centralise logs and status updates (#423) (3685b01)

  • monitor: move fetching of logs to job-manager (#423) (1fc117e)

Code style

Continuous integration

  • commitlint: addition of commit message linter (#417) (f547d3b)

  • commitlint: allow release commit style (#443) (0fc9794)

  • commitlint: check for the presence of concrete PR number (#425) (35bc1c5)

  • pytest: move to PostgreSQL 14.10 (#429) (42622fa)

  • release-please: initial configuration (#417) (fca6f74)

  • release-please: update version in Dockerfile/OpenAPI specs (#421) (e6742f2)

  • shellcheck: fix exit code propagation (#425) (8e74a85)

Documentation

  • authors: complete list of contributors (#434) (b9f8364)

0.9.2 (2023-12-12)

  • Adds metadata labels to Dockerfile.

  • Adds automated multi-platform container image building for amd64 and arm64 architectures.

  • Changes CVMFS support to allow users to automatically mount any available repository.

  • Fixes container image building on the arm64 architecture.

  • Fixes the creation of Kubernetes jobs by retrying in case of error and by correctly handling the error after reaching the retry limit.

  • Fixes job monitoring in cases when job creation fails, for example when it is not possible to successfully mount volumes.

0.9.1 (2023-09-27)

  • Adds unique error messages to Kubernetes job monitor to more easily identify source of problems.

  • Changes Paramiko to version 3.0.0.

  • Changes HTCondor to version 9.0.17 (LTS).

  • Changes Rucio authentication helper to version 1.1.1 allowing users to override the Rucio server and authentication hosts independently of VO name.

  • Fixes intermittent Slurm connection issues by DNS-resolving the Slurm head node IPv4 address before establishing connections.

  • Fixes deletion of failed jobs not being performed when Kerberos is enabled.

  • Fixes job monitoring to consider OOM-killed jobs as failed.

  • Fixes Slurm command generation issues when using fully-qualified image names.

  • Fixes location of HTCondor build dependencies.

  • Fixes detection of default Rucio server and authentication host for ATLAS VO.

  • Fixes container image names to be Podman-compatible.

0.9.0 (2023-01-20)

  • Adds support for Rucio authentication for workflow jobs.

  • Adds support for specifying slurm_partition and slurm_time for Slurm compute backend jobs.

  • Adds Kerberos sidecar container to renew ticket periodically for long-running jobs.

  • Changes reana-auth-vomsproxy sidecar to the latest stable version to support client-side proxy file generation technique and ESCAPE VOMS.

  • Changes default Slurm partition to inf-short.

  • Changes to PostgreSQL 12.13.

  • Changes the base image of the component to Ubuntu 20.04 LTS and reduces final Docker image size by removing build-time dependencies.

0.8.1 (2022-02-07)

  • Adds support for specifying kubernetes_job_timeout for Kubernetes compute backend jobs.

  • Adds a new condition to allow processing jobs in case of receiving multiple failed events when job containers are not in a running state.

0.8.0 (2021-11-22)

  • Adds database connection closure after each REST API request.

  • Adds labels to job and run-batch pods to reduce k8s events to listen to for job-monitor.

  • Fixes auto-mounting of Kubernetes API token inside user jobs by disabling it.

  • Changes job dispatching to use only job-specific node labels.

  • Changes to PostgreSQL 12.8.

0.7.5 (2021-07-05)

  • Changes HTCondor to 8.9.11.

  • Changes myschedd package and configuration to latest versions.

  • Fixes job command formatting bug for CWL workflows on HTCondor.

0.7.4 (2021-04-28)

  • Adds configuration environment variable to set job memory limits for the Kubernetes compute backend (REANA_KUBERNETES_JOBS_MEMORY_LIMIT).

  • Fixes Kubernetes job log capture to include information about failures caused by external factors such as OOMKilled.

  • Adds support for specifying kubernetes_memory_limit for Kubernetes compute backend jobs.

0.7.3 (2021-03-17)

  • Adds new configuration to toggle Kubernetes user jobs clean up.

  • Fixes HTCondor Docker networking and machine version requirement setup.

  • Fixes HTCondor logs and workspace files retrieval on job failure.

  • Fixes Slurm job submission providing the correct shell environment to run Singularity.

  • Changes HTCondor myschedd to the latest version.

  • Changes job status succeeded to finished to use central REANA nomenclature.

  • Changes how to deserialise job commands using central REANA-Commons deserialiser function.

0.7.2 (2021-02-03)

  • Fixes minor code warnings.

  • Changes CI system to include Python flake8 and Dockerfile hadolint checkers.

0.7.1 (2020-11-10)

  • Adds support for specifying htcondor_max_runtime and htcondor_accounting_group for HTCondor compute backend jobs.

  • Fixes Docker build by properly exiting when there are problems with myschedd installation.

0.7.0 (2020-10-20)

  • Adds support for running unpacked Docker images from CVMFS on HTCondor jobs.

  • Adds support for pulling private images using image pull secrets.

  • Adds support for VOMS proxy as a new authentication method.

  • Adds pinning of all Python dependencies, making it easy to rebuild component images at later times.

  • Fixes HTCondor job submission retry technique.

  • Changes error reporting on Docker image related failures.

  • Changes runtime pods to prefix user workflows with the configured REANA prefix.

  • Changes CVMFS to be read-only mount.

  • Changes runtime job instantiation into the configured runtime namespace.

  • Changes test suite to enable running tests locally also on macOS platform.

  • Changes CERN HTCondor compute backend to use the new myschedd connection library.

  • Changes CERN Slurm compute backend to improve job status detection.

  • Changes base image to use Python 3.8.

  • Changes code formatting to respect black coding style.

  • Changes documentation to single-page layout.

0.6.1 (2020-05-25)

  • Upgrades REANA-Commons package using latest Kubernetes Python client version.

0.6.0 (2019-12-20)

  • Adds generic job manager class and provides example classes for CERN HTCondor and CERN Slurm clusters.

  • Moves job controller to the same Kubernetes pod with the REANA-Workflow-Engine-* (sidecar pattern).

  • Adds sidecar container to the Kubernetes job pod if Kerberos authentication is required.

  • Provides user secrets to the job container runtime tasks.

  • Refactors job monitoring using singleton pattern.

0.5.1 (2019-04-23)

  • Pins urllib3 due to a conflict while installing Kubernetes Python library.

  • Fixes documentation build badge.

0.5.0 (2019-04-23)

  • Adds a new endpoint to delete jobs (Kubernetes).

  • Introduces new common interface for job management which defines what the compute backends should offer to be compatible with REANA. Currently only the Kubernetes backend is supported.

  • Fixes security vulnerability which allowed users to access other people’s workspaces.

  • Makes CVMFS mounts optional and configurable at repository level.

  • Updates the creation of the CVMFS volume specification, which now uses normal persistent volume claims.

  • Increases stability and improves test coverage.

0.4.0 (2018-11-06)

  • Improves REST API documentation rendering.

  • Changes license to MIT.

0.3.2 (2018-09-26)

  • Adapts Kubernetes API adaptor to mount shared volumes on jobs as CEPH persistentVolumeClaims (managed by reana-cluster) instead of plain CEPH volumes.

0.3.1 (2018-09-07)

  • Pins REANA-Commons and REANA-DB dependencies.

0.3.0 (2018-08-10)

  • Adds uwsgi for production deployments.

  • Switches from pykube to official Kubernetes python client.

  • Adds compatibility with latest Kubernetes.

0.2.0 (2018-04-19)

  • Adds dockerignore file to ease development.

0.1.0 (2018-01-30)

  • Initial public release.

Contributing

Bug reports, issues, feature requests, and other contributions are welcome. If you find a demonstrable problem that is caused by the REANA code, please:

  1. Search for already reported problems.

  2. Check if the issue has been fixed or is still reproducible on the latest master branch.

  3. Create an issue, ideally with a test case.

If you create a pull request fixing a bug or implementing a feature, you can run the tests to ensure that everything is operating correctly:

$ ./run-tests.sh

Each pull request should preserve or increase code coverage.

License

MIT License

Copyright (C) 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024 CERN.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

In applying this license, CERN does not waive the privileges and immunities granted to it by virtue of its status as an Intergovernmental Organization or submit itself to any jurisdiction.

Authors

The list of contributors in alphabetical order: