Job

Note

This supersedes the pyslurm.job class, which will be removed in a future release.

pyslurm.Job

A Slurm Job.

All attributes in this class are read-only.

Parameters:

Name Type Description Default
job_id int

An Integer representing a Job-ID.

required

Attributes:

Name Type Description
steps JobSteps

Steps this Job has. Before you can access the Steps data for a Job, you have to call the reload() method of a Job instance or the load_steps() method of a Jobs collection.

stats JobStatistics

Real-time statistics of a Job. Before you can access the stats data for a Job, you have to call the load_stats method of a Job instance or the Jobs collection.

pids dict[str, list]

Current Process-IDs of the Job, organized by node name. Before you can access the pids data for a Job, you have to call the load_stats method of a Job instance or the Jobs collection.

name str

Name of the Job

id int

Unique ID of the Job.

association_id int

ID of the Association this Job runs with.

account str

Name of the Account this Job runs with.

user_id int

UID of the User who submitted the Job.

user_name str

Name of the User who submitted the Job.

group_id int

GID of the Group this Job runs under.

group_name str

Name of the Group this Job runs under.

priority int

Priority of the Job.

nice int

Nice Value of the Job.

qos str

QOS Name of the Job.

min_cpus_per_node int

Minimum Amount of CPUs per Node the Job requested.

state str

State this Job is currently in.

state_reason str

A Reason explaining why the Job is in its current state.

is_requeueable bool

Whether the Job is requeueable or not.

requeue_count int

Amount of times the Job has been requeued.

is_batch_job bool

Whether the Job is a batch job or not.

node_reboot_required bool

Whether the Job requires the Nodes to be rebooted first.

dependencies dict

Dependencies the Job has to other Jobs.

time_limit int

Time-Limit, in minutes, for this Job.

time_limit_min int

Minimum Time-Limit in minutes for this Job.

submit_time int

Time the Job was submitted, as unix timestamp.

eligible_time int

Time the Job is eligible to start, as unix timestamp.

accrue_time int

Job accrue time, as unix timestamp

start_time int

Time this Job has started execution, as unix timestamp.

resize_time int

Time the job was resized, as unix timestamp.

deadline int

Time when a pending Job will be cancelled, as unix timestamp.

preempt_eligible_time int

Time the Job is eligible for preemption, as unix timestamp.

preempt_time int

Time the Job was signaled for preemption, as unix timestamp.

suspend_time int

Last Time the Job was suspended, as unix timestamp.

last_sched_evaluation_time int

Last time evaluated for Scheduling, as unix timestamp.

pre_suspension_time int

Amount of seconds the Job ran prior to suspension.

mcs_label str

MCS Label for the Job

partition str

Name of the Partition the Job runs in.

submit_host str

Name of the Host this Job was submitted from.

batch_host str

Name of the Host where the Batch-Script is executed.

num_nodes int

Amount of Nodes the Job has requested or allocated.

max_nodes int

Maximum amount of Nodes the Job has requested.

allocated_nodes str

Nodes the Job is currently using. This is only valid when the Job is running. If the Job is pending, it will always return None.

required_nodes str

Nodes the Job is explicitly requiring to run on.

excluded_nodes str

Nodes that are explicitly excluded for execution.

scheduled_nodes str

Nodes the Job is scheduled on by the slurm controller.

derived_exit_code int

The derived exit code for the Job.

derived_exit_code_signal int

Signal for the derived exit code.

exit_code int

Code with which the Job has exited.

exit_code_signal int

The signal which has led to the exit code of the Job.

batch_constraints list

Features that node(s) should have for the batch script. Controls where it is possible to execute the batch-script of the job. Also see 'constraints'

federation_origin str

Federation Origin

federation_siblings_active int

Federation siblings active

federation_siblings_viable int

Federation siblings viable

cpus int

Total amount of CPUs the Job is using. If the Job is still pending, this will be the amount of requested CPUs.

cpus_per_task int

Number of CPUs per Task used.

cpus_per_gpu int

Number of CPUs per GPU used.

boards_per_node int

Number of boards per Node.

sockets_per_board int

Number of sockets per board.

sockets_per_node int

Number of sockets per node.

cores_per_socket int

Number of cores per socket.

threads_per_core int

Number of threads per core.

ntasks int

Number of parallel processes.

ntasks_per_node int

Number of parallel processes per node.

ntasks_per_board int

Number of parallel processes per board.

ntasks_per_socket int

Number of parallel processes per socket.

ntasks_per_core int

Number of parallel processes per core.

ntasks_per_gpu int

Number of parallel processes per GPU.

delay_boot_time int

Delay boot time, in minutes. See https://slurm.schedmd.com/sbatch.html#OPT_delay-boot for details.

constraints list

A list of features the Job requires nodes to have. Unlike 'batch_constraints', which only affects where the initial batch-script is placed, these features restrict the nodes the Job can execute on in general.

cluster str

Name of the cluster the job is executing on.

cluster_constraints list

A List of features that a cluster should have.

reservation str

Name of the reservation this Job uses.

resource_sharing str

Mode controlling how a job shares resources with others.

requires_contiguous_nodes bool

Whether the Job requires a set of contiguous nodes.

licenses list

List of licenses the Job needs.

network str

Network specification for the Job.

command str

The command that is executed for the Job.

working_directory str

Path to the working directory for this Job.

admin_comment str

An arbitrary comment set by an administrator for the Job.

system_comment str

An arbitrary comment set by the slurmctld for the Job.

container str

The container this Job uses.

comment str

An arbitrary comment set for the Job.

standard_input str

The path to the file for the standard input stream.

standard_output str

The path to the log file for the standard output stream.

standard_error str

The path to the log file for the standard error stream.

required_switches int

Number of switches required.

max_wait_time_switches int

Amount of seconds to wait for the switches.

burst_buffer str

Burst buffer specification

burst_buffer_state str

Burst buffer state

cpu_frequency_min Union[str, int]

Minimum CPU-Frequency requested.

cpu_frequency_max Union[str, int]

Maximum CPU-Frequency requested.

cpu_frequency_governor Union[str, int]

CPU-Frequency Governor requested.

billable_tres float

Amount of billable trackable resources.

wckey str

Name of the WCKey this Job uses.

mail_user list

Users that should receive Mails for this Job.

mail_types list

Mail Flags specified by the User.

heterogeneous_id int

Heterogeneous job id.

heterogeneous_offset int

Heterogeneous job offset.

temporary_disk_per_node int

Temporary disk space in Mebibytes available per Node.

array_id int

The master Array-Job ID.

array_tasks_parallel int

Max number of array tasks allowed to run simultaneously.

array_task_id int

Array Task ID of this Job if it is an Array-Job.

array_tasks_waiting str

Array Tasks that are still waiting.

end_time int

Time at which this Job will end, as unix timestamp.

run_time int

Amount of seconds the Job has been running.

cores_reserved_for_system int

Amount of cores reserved for System use only.

threads_reserved_for_system int

Amount of Threads reserved for System use only.

memory int

Total Amount of Memory this Job has, in Mebibytes

memory_per_cpu int

Amount of Memory per CPU this Job has, in Mebibytes

memory_per_node int

Amount of Memory per Node this Job has, in Mebibytes

memory_per_gpu int

Amount of Memory per GPU this Job has, in Mebibytes

gres_per_node dict

Generic Resources (e.g. GPU) this Job is using per Node.

profile_types list

Types for which detailed accounting data is collected.

gres_binding str

Binding Enforcement of a Generic Resource (e.g. GPU).

gres_tasks_per_sharing str

Task Sharing of a Generic Resource (e.g. GPU).

kill_on_invalid_dependency bool

Whether the Job should be killed on an invalid dependency.

spreads_over_nodes bool

Whether the Job should be spread over as many nodes as possible.

is_cronjob bool

Whether this Job is a cronjob.

cronjob_time str

The time specification for the Cronjob.

elapsed_cpu_time int

Amount of CPU-Time used by the Job so far. This is the result of multiplying the run_time with the amount of cpus requested.

run_time_remaining int

The amount of seconds the job has still left until hitting the time_limit.
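
The time-related attributes above use three different units: unix timestamps, minutes (time_limit) and seconds (run_time). A minimal sketch of how they relate, using a plain SimpleNamespace as a stand-in for a loaded Job. The sample values and the helper names to_datetime, cpu_seconds and seconds_left are hypothetical, and the formula for the remaining time is an assumption based on the descriptions above rather than taken from the pyslurm source:

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Stand-in for a loaded Job; all values are made up for illustration.
job = SimpleNamespace(
    submit_time=1700000000,  # unix timestamp
    time_limit=120,          # minutes
    run_time=1800,           # seconds
    cpus=4,
)


def to_datetime(timestamp):
    """Convert a unix timestamp attribute to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


def cpu_seconds(job):
    """elapsed_cpu_time as described above: run_time multiplied by cpus."""
    return job.run_time * job.cpus


def seconds_left(job):
    """Assumed run_time_remaining: time_limit (minutes) minus run_time (seconds)."""
    return job.time_limit * 60 - job.run_time


print(to_datetime(job.submit_time).isoformat())
print(cpu_seconds(job))
print(seconds_left(job))
```

With a real Job instance, the same helpers could be applied to the attributes of the same name.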

cancel() method descriptor

Cancel a Job.

Implements the slurm_kill_job RPC.

Raises:

Type Description
RPCError

When cancelling the Job was not successful.

Examples:

>>> import pyslurm
>>> pyslurm.Job(9999).cancel()

get_batch_script() method descriptor

Return the content of the script for a Batch-Job.

Returns:

Type Description
str

The content of the batch script.

Raises:

Type Description
RPCError

When retrieving the Batch-Script for the Job was not successful.

Examples:

>>> import pyslurm
>>> script = pyslurm.Job(9999).get_batch_script()

get_resource_layout_per_node() method descriptor

Retrieve the resource layout of this Job on each node.

Warning

Return type may still be subject to change in the future

Returns:

Type Description
dict

Resource layout, where the key is the name of the node and the value another dict with the keys cpu_ids, memory and gres.
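
A minimal sketch of walking a dict of the shape described above. The node names and the value types chosen for cpu_ids, memory and gres are assumptions made for illustration, and the helper total_memory is hypothetical. Since the return type may still change, treat this as a sketch only:

```python
# Hypothetical sample shaped like the documented return value:
# node name -> dict with the keys cpu_ids, memory and gres.
layout = {
    "node001": {"cpu_ids": "0-3", "memory": 4096, "gres": {}},
    "node002": {"cpu_ids": "0-1", "memory": 2048, "gres": {}},
}


def total_memory(layout):
    """Sum the per-node memory values across all nodes."""
    return sum(resources["memory"] for resources in layout.values())


# Print one line per node, then the total over all nodes.
for name, resources in sorted(layout.items()):
    print(name, resources["cpu_ids"], resources["memory"], resources["gres"])

print(total_memory(layout))
```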

hold(mode=None) method descriptor

Hold a currently pending Job, preventing it from being scheduled.

Parameters:

Name Type Description Default
mode str

Determines in which mode the Job should be held. Possible values are user or admin. By default, the Job is held in admin mode, meaning only an Administrator will be able to release the Job again. If you specify the mode as user, the User will also be able to release the job.

None

Raises:

Type Description
RPCError

When holding the Job was not successful.

Examples:

>>> import pyslurm
>>>
>>> # Holding a Job (in "admin" mode by default)
>>> pyslurm.Job(9999).hold()
>>>
>>> # Holding a Job in "user" mode
>>> pyslurm.Job(9999).hold(mode="user")

load(job_id) staticmethod

Load information for a specific Job.

Implements the slurm_load_job RPC.

Note

If the Job is not pending, the related Job steps will also be loaded. Job statistics are however not loaded automatically.

Parameters:

Name Type Description Default
job_id int

An Integer representing a Job-ID.

required

Returns:

Type Description
Job

Returns a new Job instance

Raises:

Type Description
RPCError

If requesting the Job information from the slurmctld was not successful.

Examples:

>>> import pyslurm
>>> job = pyslurm.Job.load(9999)

load_stats() method descriptor

Load realtime statistics for a Job and its steps.

Calling this function returns the Job statistics, and additionally populates the stats and pids attributes of the instance.

Returns:

Type Description
JobStatistics

The statistics of the job.

Raises:

Type Description
RPCError

When receiving the Statistics was not successful.

Examples:

>>> import pyslurm
>>> job = pyslurm.Job.load(9999)
>>> stats = job.load_stats()
>>>
>>> # Print the CPU Time Used
>>> print(stats.total_cpu_time)
>>>
>>> # Print the Process-IDs for the whole Job, organized by hostname
>>> print(job.pids)
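
Since pids is documented as a dict[str, list] keyed by node name, it can be summarized with ordinary dict operations. A minimal sketch with made-up sample data standing in for job.pids (the helper name process_counts is hypothetical):

```python
# Hypothetical sample shaped like the documented pids attribute:
# node name -> list of Process-IDs on that node.
pids = {"node001": [4321, 4322, 4330], "node002": [1187]}


def process_counts(pids):
    """Return the number of processes per node."""
    return {node: len(node_pids) for node, node_pids in pids.items()}


print(process_counts(pids))
```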

modify(changes) method descriptor

Modify a Job.

Implements the slurm_update_job RPC.

Parameters:

Name Type Description Default
changes JobSubmitDescription

A JobSubmitDescription object which contains all the modifications that should be done on the Job.

required

Raises:

Type Description
RPCError

When updating the Job was not successful.

Examples:

>>> import pyslurm
>>>
>>> # Setting the new time-limit to 20 days
>>> changes = pyslurm.JobSubmitDescription(time_limit="20-00:00:00")
>>> pyslurm.Job(9999).modify(changes)

notify(msg) method descriptor

Send a message to the Job's stdout.

Implements the slurm_notify_job RPC.

Parameters:

Name Type Description Default
msg str

The message that should be sent.

required

Raises:

Type Description
RPCError

When sending the message to the Job was not successful.

Examples:

>>> import pyslurm
>>> pyslurm.Job(9999).notify("Hello Friends!")

release() method descriptor

Release a currently held Job, allowing it to be scheduled again.

Raises:

Type Description
RPCError

When releasing a held Job was not successful.

Examples:

>>> import pyslurm
>>> pyslurm.Job(9999).release()

requeue(hold=False) method descriptor

Requeue a currently running Job.

Implements the slurm_requeue RPC.

Parameters:

Name Type Description Default
hold bool

Controls whether the Job should be put in a held state or not. Default for this is False, so it will not be held.

False

Raises:

Type Description
RPCError

When requeuing the Job was not successful.

Examples:

>>> import pyslurm
>>>
>>> # Requeuing a Job while allowing it to be
>>> # scheduled again immediately
>>> pyslurm.Job(9999).requeue()
>>>
>>> # Requeuing a Job while putting it in a held state
>>> pyslurm.Job(9999).requeue(hold=True)

send_signal(signal, steps='children', hurry=False) method descriptor

Send a signal to a running Job.

Implements the slurm_signal_job RPC.

Parameters:

Name Type Description Default
signal Union[str, int]

Any valid signal which will be sent to the Job. Can be either a str like SIGUSR1, or simply an int.

required
steps str

Selects which steps should be signaled. Valid values are: all, batch and children. The default value is children, where all steps except the batch-step will be signaled. The value batch, in contrast, means that only the batch-step will be signaled. With all, every step is signaled.

'children'
hurry bool

If True, no burst buffer data will be staged out. The default value is False.

False

Raises:

Type Description
RPCError

When sending the signal was not successful.

Examples:

Specifying the signal as a string:

>>> from pyslurm import Job
>>> Job(9999).send_signal("SIGUSR1")

or passing in a numeric signal:

>>> Job(9999).send_signal(9)
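
The string and integer forms name the same signals. A minimal sketch of that equivalence using only Python's standard signal module. How pyslurm converts the value internally is not shown here, and the helper name to_signal_number is hypothetical:

```python
import signal


def to_signal_number(sig):
    """Return the numeric value of a signal given by name or by number."""
    if isinstance(sig, int):
        # Raises ValueError if the number is not a known signal.
        return int(signal.Signals(sig))
    # Raises KeyError if the name is not a known signal.
    return int(signal.Signals[sig.upper()])


print(to_signal_number("SIGUSR1"))
print(to_signal_number(9))
```

Validating the value this way before calling send_signal can surface a typo in a signal name early, without a round trip to the controller.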

suspend() method descriptor

Suspend a running Job.

Implements the slurm_suspend RPC.

Raises:

Type Description
RPCError

When suspending the Job was not successful.

Examples:

>>> import pyslurm
>>> pyslurm.Job(9999).suspend()

to_dict() method descriptor

Job information formatted as a dictionary.

Returns:

Type Description
dict

Job information as dict

unsuspend() method descriptor

Unsuspend a currently suspended Job.

Implements the slurm_resume RPC.

Raises:

Type Description
RPCError

When unsuspending the Job was not successful.

Examples:

>>> import pyslurm
>>> pyslurm.Job(9999).unsuspend()

pyslurm.Jobs

Bases: pyslurm.xcollections.MultiClusterMap

A Multi Cluster collection of pyslurm.Job objects.

Parameters:

Name Type Description Default
jobs Union[list[int], dict[int, Job], str]

Jobs to initialize this collection with.

None
frozen bool

Control whether this collection is frozen when reloading Job information.

False

Attributes:

Name Type Description
memory int

Total amount of memory requested for all Jobs in this collection, in Mebibytes

cpus int

Total amount of cpus requested for all Jobs in this collection.

ntasks int

Total amount of tasks requested for all Jobs in this collection.

elapsed_cpu_time int

Total amount of CPU-Time used by all the Jobs in the collection. This is the result of multiplying the run_time with the amount of cpus requested for each job.

frozen bool

If this is set to True and the reload() method is called, then ONLY Jobs that already exist in this collection will be reloaded. New Jobs that are discovered will not be added to this collection, but old Jobs which have already been purged from the Slurm controller's memory will not be removed either. The default is False, so old Jobs will be removed and new Jobs will be added, which is basically the same behaviour as calling Jobs.load().

stats JobStatistics

Real-time statistics of all Jobs in this collection. Before you can access the stats data for this, you have to call the load_stats method on this collection.
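
The effect of frozen on reload() can be modeled with plain sets of Job-IDs. This is only a sketch of the semantics described above, not pyslurm code. The function reload_ids and the sample IDs are hypothetical:

```python
def reload_ids(current_ids, controller_ids, frozen=False):
    """Return the Job-IDs a collection would hold after a reload."""
    if frozen:
        # Frozen: keep exactly the existing members. Newly discovered Jobs
        # are not added, and purged Jobs are not removed.
        return set(current_ids)
    # Not frozen: mirror whatever the controller currently knows about.
    return set(controller_ids)


current = {1, 2, 3}
controller = {2, 3, 4}  # Job 1 was purged, Job 4 is new

print(sorted(reload_ids(current, controller, frozen=True)))
print(sorted(reload_ids(current, controller, frozen=False)))
```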

load(preload_passwd_info=False, frozen=False) staticmethod

Retrieve all Jobs from the Slurm controller

Parameters:

Name Type Description Default
preload_passwd_info bool

Decides whether to query passwd and groups information from the system. Could potentially speed up access to attributes of the Job where a UID/GID is translated to a name. If True, the information will be fetched and stored in each of the Job instances.

False
frozen bool

Decide whether this collection of Jobs should be frozen.

False

Returns:

Type Description
Jobs

A collection of Job objects.

Raises:

Type Description
RPCError

When getting all the Jobs from the slurmctld failed.

Examples:

>>> import pyslurm
>>> jobs = pyslurm.Jobs.load()
>>> print(jobs)
pyslurm.Jobs({1: pyslurm.Job(1), 2: pyslurm.Job(2)})
>>> print(jobs[1])
pyslurm.Job(1)

load_stats() method descriptor

Load realtime stats for this collection of Jobs.

This function additionally fills in the stats attribute for all Jobs in the collection, and also populates its own stats attribute. Implicitly calls load_steps().

Note

Pending Jobs will be ignored, since they don't have any Stats yet.

Returns:

Type Description
JobStatistics

The statistics of this job collection.

Raises:

Type Description
RPCError

When retrieving the stats for all the Jobs failed.

Examples:

>>> import pyslurm
>>> jobs = pyslurm.Jobs.load()
>>> stats = jobs.load_stats()
>>>
>>> # Print the CPU Time Used
>>> print(stats.total_cpu_time)

load_steps() method descriptor

Load all Job steps for this collection of Jobs.

This function fills in the steps attribute for all Jobs in the collection.

Note

Pending Jobs will be ignored, since they don't have any Steps yet.

Raises:

Type Description
RPCError

When retrieving the information for all the Steps failed.

reload() method descriptor

Reload the information for jobs in a collection.

Returns:

Type Description
Jobs

Returns self

Raises:

Type Description
RPCError

When getting the Jobs from the slurmctld failed.