Job
Note
This supersedes the pyslurm.job class, which will be removed in a future release.
pyslurm.Job
A Slurm Job.
All attributes in this class are read-only.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
job_id | int | An Integer representing a Job-ID. | required |
Attributes:
Name | Type | Description |
---|---|---|
steps | JobSteps | Steps this Job has. Before you can access the Steps data for a Job, you have to call the |
stats | JobStatistics | Real-time statistics of a Job. Before you can access the stats data for a Job, you have to call the |
pids | dict[str, list] | Current Process-IDs of the Job, organized by node name. Before you can access the pids data for a Job, you have to call the |
name | str | Name of the Job. |
id | int | Unique ID of the Job. |
association_id | int | ID of the Association this Job runs with. |
account | str | Name of the Account this Job runs with. |
user_id | int | UID of the User who submitted the Job. |
user_name | str | Name of the User who submitted the Job. |
group_id | int | GID of the Group this Job runs under. |
group_name | str | Name of the Group this Job runs under. |
priority | int | Priority of the Job. |
nice | int | Nice Value of the Job. |
qos | str | QOS Name of the Job. |
min_cpus_per_node | int | Minimum Amount of CPUs per Node the Job requested. |
state | str | State this Job is currently in. |
state_reason | str | A Reason explaining why the Job is in its current state. |
is_requeueable | bool | Whether the Job is requeueable or not. |
requeue_count | int | Number of times the Job has been requeued. |
is_batch_job | bool | Whether the Job is a batch job or not. |
node_reboot_required | bool | Whether the Job requires the Nodes to be rebooted first. |
dependencies | dict | Dependencies the Job has to other Jobs. |
time_limit | int | Time-Limit, in minutes, for this Job. |
time_limit_min | int | Minimum Time-Limit in minutes for this Job. |
submit_time | int | Time the Job was submitted, as unix timestamp. |
eligible_time | int | Time the Job is eligible to start, as unix timestamp. |
accrue_time | int | Job accrue time, as unix timestamp. |
start_time | int | Time this Job has started execution, as unix timestamp. |
resize_time | int | Time the job was resized, as unix timestamp. |
deadline | int | Time when a pending Job will be cancelled, as unix timestamp. |
preempt_eligible_time | int | Time the Job is eligible for preemption, as unix timestamp. |
preempt_time | int | Time the Job was signaled for preemption, as unix timestamp. |
suspend_time | int | Last Time the Job was suspended, as unix timestamp. |
last_sched_evaluation_time | int | Last time evaluated for Scheduling, as unix timestamp. |
pre_suspension_time | int | Number of seconds the Job ran prior to suspension. |
mcs_label | str | MCS Label for the Job. |
partition | str | Name of the Partition the Job runs in. |
submit_host | str | Name of the Host this Job was submitted from. |
batch_host | str | Name of the Host where the Batch-Script is executed. |
num_nodes | int | Amount of Nodes the Job has requested or allocated. |
max_nodes | int | Maximum amount of Nodes the Job has requested. |
allocated_nodes | str | Nodes the Job is currently using. This is only valid when the Job is running. If the Job is pending, it will always return None. |
required_nodes | str | Nodes the Job is explicitly requiring to run on. |
excluded_nodes | str | Nodes that are explicitly excluded for execution. |
scheduled_nodes | str | Nodes the Job is scheduled on by the slurm controller. |
derived_exit_code | int | The derived exit code for the Job. |
derived_exit_code_signal | int | Signal for the derived exit code. |
exit_code | int | Code with which the Job has exited. |
exit_code_signal | int | The signal which has led to the exit code of the Job. |
batch_constraints | list | Features that node(s) should have for the batch script. Controls where it is possible to execute the batch-script of the job. Also see 'constraints'. |
federation_origin | str | Federation Origin |
federation_siblings_active | int | Federation siblings active |
federation_siblings_viable | int | Federation siblings viable |
cpus | int | Total amount of CPUs the Job is using. If the Job is still pending, this will be the amount of requested CPUs. |
cpus_per_task | int | Number of CPUs per Task used. |
cpus_per_gpu | int | Number of CPUs per GPU used. |
boards_per_node | int | Number of boards per Node. |
sockets_per_board | int | Number of sockets per board. |
sockets_per_node | int | Number of sockets per node. |
cores_per_socket | int | Number of cores per socket. |
threads_per_core | int | Number of threads per core. |
ntasks | int | Number of parallel processes. |
ntasks_per_node | int | Number of parallel processes per node. |
ntasks_per_board | int | Number of parallel processes per board. |
ntasks_per_socket | int | Number of parallel processes per socket. |
ntasks_per_core | int | Number of parallel processes per core. |
ntasks_per_gpu | int | Number of parallel processes per GPU. |
delay_boot_time | int | Value of the sbatch --delay-boot option (https://slurm.schedmd.com/sbatch.html#OPT_delay-boot), in minutes. |
constraints | list | A list of features the Job requires nodes to have. Unlike 'batch_constraints', which only affects where the initial batch-script is placed, this option restricts the nodes the Job is able to execute on in general, beyond the initial batch-script. |
cluster | str | Name of the cluster the job is executing on. |
cluster_constraints | list | A List of features that a cluster should have. |
reservation | str | Name of the reservation this Job uses. |
resource_sharing | str | Mode controlling how a job shares resources with others. |
requires_contiguous_nodes | bool | Whether the Job has allocated a set of contiguous nodes. |
licenses | list | List of licenses the Job needs. |
network | str | Network specification for the Job. |
command | str | The command that is executed for the Job. |
working_directory | str | Path to the working directory for this Job. |
admin_comment | str | An arbitrary comment set by an administrator for the Job. |
system_comment | str | An arbitrary comment set by the slurmctld for the Job. |
container | str | The container this Job uses. |
comment | str | An arbitrary comment set for the Job. |
standard_input | str | The path to the file for the standard input stream. |
standard_output | str | The path to the log file for the standard output stream. |
standard_error | str | The path to the log file for the standard error stream. |
required_switches | int | Number of switches required. |
max_wait_time_switches | int | Amount of seconds to wait for the switches. |
burst_buffer | str | Burst buffer specification |
burst_buffer_state | str | Burst buffer state |
cpu_frequency_min | Union[str, int] | Minimum CPU-Frequency requested. |
cpu_frequency_max | Union[str, int] | Maximum CPU-Frequency requested. |
cpu_frequency_governor | Union[str, int] | CPU-Frequency Governor requested. |
billable_tres | float | Amount of billable trackable resources. |
wckey | str | Name of the WCKey this Job uses. |
mail_user | list | Users that should receive Mails for this Job. |
mail_types | list | Mail Flags specified by the User. |
heterogeneous_id | int | Heterogeneous job id. |
heterogeneous_offset | int | Heterogeneous job offset. |
temporary_disk_per_node | int | Temporary disk space in Mebibytes available per Node. |
array_id | int | The master Array-Job ID. |
array_tasks_parallel | int | Max number of array tasks allowed to run simultaneously. |
array_task_id | int | Array Task ID of this Job if it is an Array-Job. |
array_tasks_waiting | str | Array Tasks that are still waiting. |
end_time | int | Time at which this Job will end, as unix timestamp. |
run_time | int | Amount of seconds the Job has been running. |
cores_reserved_for_system | int | Amount of cores reserved for System use only. |
threads_reserved_for_system | int | Amount of Threads reserved for System use only. |
memory | int | Total Amount of Memory this Job has, in Mebibytes. |
memory_per_cpu | int | Amount of Memory per CPU this Job has, in Mebibytes. |
memory_per_node | int | Amount of Memory per Node this Job has, in Mebibytes. |
memory_per_gpu | int | Amount of Memory per GPU this Job has, in Mebibytes. |
gres_per_node | dict | Generic Resources (e.g. GPU) this Job is using per Node. |
profile_types | list | Types for which detailed accounting data is collected. |
gres_binding | str | Binding Enforcement of a Generic Resource (e.g. GPU). |
gres_tasks_per_sharing | str | Task Sharing of a Generic Resource (e.g. GPU). |
kill_on_invalid_dependency | bool | Whether the Job should be killed on an invalid dependency. |
spreads_over_nodes | bool | Whether the Job should be spread over as many nodes as possible. |
is_cronjob | bool | Whether this Job is a cronjob. |
cronjob_time | str | The time specification for the Cronjob. |
elapsed_cpu_time | int | Amount of CPU-Time used by the Job so far. This is the result of multiplying run_time by the number of CPUs requested. |
run_time_remaining | int | The amount of seconds the job has still left until hitting the |
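The time-related attributes in the table are plain integers. The following is a minimal sketch, using made-up sample values in place of a real Job, of converting a unix timestamp into a datetime and of the run_time times CPUs product that the elapsed_cpu_time description refers to. The helper names `to_datetime` and `elapsed_cpu_seconds` are hypothetical and not part of pyslurm.

```python
from datetime import datetime, timezone


def to_datetime(unix_ts):
    """Convert a unix timestamp (seconds since the epoch) to an aware UTC datetime."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc)


def elapsed_cpu_seconds(run_time, cpus):
    """Multiply the run time in seconds by the number of CPUs."""
    return run_time * cpus


# Made-up sample values standing in for job.submit_time, job.run_time and job.cpus.
submit_time = 1700000000
run_time = 3600
cpus = 4

print(to_datetime(submit_time).isoformat())
print(elapsed_cpu_seconds(run_time, cpus))
```

On a real Job instance the same helpers could be fed `job.submit_time`, `job.run_time` and `job.cpus` directly.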
cancel()
method descriptor
Cancel a Job.
Implements the slurm_kill_job RPC.
Raises:
Type | Description |
---|---|
RPCError | When cancelling the Job was not successful. |
Examples:
get_batch_script()
method descriptor
get_resource_layout_per_node()
method descriptor
Retrieve the resource layout of this Job on each node.
Warning
Return type may still be subject to change in the future
Returns:
Type | Description |
---|---|
dict | Resource layout, where the key is the name of the node and the value another dict with the keys |
hold(mode=None)
method descriptor
Hold a currently pending Job, preventing it from being scheduled.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mode | str | Determines in which mode the Job should be held. Possible values are | None |
Raises:
Type | Description |
---|---|
RPCError | When holding the Job was not successful. |
Examples:
load(job_id)
staticmethod
Load information for a specific Job.
Implements the slurm_load_job RPC.
Note
If the Job is not pending, the related Job steps will also be loaded. Job statistics, however, are not loaded automatically.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
job_id | int | An Integer representing a Job-ID. | required |
Returns:
Type | Description |
---|---|
Job | Returns a new Job instance |
Raises:
Type | Description |
---|---|
RPCError | If requesting the Job information from the slurmctld was not successful. |
Examples:
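A minimal sketch of reading a few of the documented read-only attributes once a Job has been loaded. The helper `summarize` is hypothetical; the commented usage assumes pyslurm is installed, a slurmctld is reachable, and 9999 is only a placeholder Job-ID.

```python
def summarize(job):
    """Return a one-line summary built from documented Job attributes."""
    return (
        f"{job.id} {job.name} [{job.state}] "
        f"user={job.user_name} partition={job.partition}"
    )


# Usage on a cluster (not executed here):
#   import pyslurm
#   job = pyslurm.Job.load(9999)   # 9999 is a placeholder Job-ID
#   print(summarize(job))
```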
load_stats()
method descriptor
Load realtime statistics for a Job and its steps.
Calling this function returns the Job statistics, and additionally populates the stats and pids attributes of the instance.
Returns:
Type | Description |
---|---|
JobStatistics | The statistics of the job. |
Raises:
Type | Description |
---|---|
RPCError | When receiving the Statistics was not successful. |
Examples:
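Since pids is documented as a dict[str, list] keyed by node name, a small sketch of working with it after load_stats() could look like this. The helper `pids_per_node` is hypothetical; the commented usage assumes pyslurm is installed, the Job is running, and 9999 is a placeholder Job-ID.

```python
def pids_per_node(pids):
    """Count the process IDs on each node in a dict shaped like Job.pids."""
    return {node: len(node_pids) for node, node_pids in pids.items()}


# Usage on a cluster (not executed here):
#   import pyslurm
#   job = pyslurm.Job.load(9999)   # placeholder Job-ID
#   job.load_stats()               # populates job.stats and job.pids
#   print(pids_per_node(job.pids))
```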
modify(changes)
method descriptor
Modify a Job.
Implements the slurm_update_job RPC.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
changes | JobSubmitDescription | A JobSubmitDescription object which contains all the modifications that should be done on the Job. | required |
Raises:
Type | Description |
---|---|
RPCError | When updating the Job was not successful. |
Examples:
notify(msg)
method descriptor
release()
method descriptor
Release a currently held Job, allowing it to be scheduled again.
Raises:
Type | Description |
---|---|
RPCError | When releasing a held Job was not successful. |
Examples:
requeue(hold=False)
method descriptor
Requeue a currently running Job.
Implements the slurm_requeue RPC.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hold | bool | Controls whether the Job should be put in a held state or not. Default for this is False. | False |
Raises:
Type | Description |
---|---|
RPCError | When requeueing the Job was not successful. |
Examples:
send_signal(signal, steps='children', hurry=False)
method descriptor
Send a signal to a running Job.
Implements the slurm_signal_job RPC.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
signal | Union[str, int] | Any valid signal which will be sent to the Job. Can be either a str like | required |
steps | str | Selects which steps should be signaled. Valid values for this are: | 'children' |
hurry | bool | If True, no burst buffer data will be staged out. The default value is False. | False |
Raises:
Type | Description |
---|---|
RPCError | When sending the signal was not successful. |
Examples:
Specifying the signal as a string:
or passing in a numeric signal:
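Both forms can be sketched as follows. The helper names are hypothetical, `job` is assumed to be a running pyslurm.Job, and "SIGUSR1" is only an illustrative choice for the string form. The numeric value is taken from Python's standard signal module; steps and hurry are left at their documented defaults.

```python
import signal


def signal_by_name(job, name="SIGUSR1"):
    """Send a signal given as a string (the name used here is an assumption)."""
    job.send_signal(name)


def signal_by_number(job, number=int(signal.SIGTERM)):
    """Send a signal given as an integer (SIGTERM by default)."""
    job.send_signal(number)


# Usage on a cluster (not executed here):
#   import pyslurm
#   job = pyslurm.Job.load(9999)   # placeholder Job-ID
#   signal_by_name(job)
#   signal_by_number(job)
```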
suspend()
method descriptor
Suspend a running Job.
Implements the slurm_suspend RPC.
Raises:
Type | Description |
---|---|
RPCError | When suspending the Job was not successful. |
Examples:
to_dict()
method descriptor
unsuspend()
method descriptor
Unsuspend a currently suspended Job.
Implements the slurm_resume RPC.
Raises:
Type | Description |
---|---|
RPCError | When unsuspending the Job was not successful. |
Examples:
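Because suspend() and unsuspend() form a pair, one way to make sure a Job is always resumed is a small context manager. This is a sketch only: the name `suspended` is hypothetical, `job` is assumed to be a running pyslurm.Job, and any error raised by either call is left to propagate.

```python
from contextlib import contextmanager


@contextmanager
def suspended(job):
    """Suspend `job` for the duration of the with-block, then unsuspend it."""
    job.suspend()
    try:
        yield job
    finally:
        # Runs even if the body of the with-block raises.
        job.unsuspend()


# Usage on a cluster (not executed here):
#   import pyslurm
#   with suspended(pyslurm.Job.load(9999)):   # placeholder Job-ID
#       ...  # work that needs the Job paused
```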
pyslurm.Jobs
Bases: pyslurm.xcollections.MultiClusterMap
A Multi Cluster collection of pyslurm.Job objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
jobs | Union[list[int], dict[int, Job], str] | Jobs to initialize this collection with. | None |
frozen | bool | Control whether this collection is | False |
Attributes:
Name | Type | Description |
---|---|---|
memory | int | Total amount of memory requested for all Jobs in this collection, in Mebibytes. |
cpus | int | Total amount of cpus requested for all Jobs in this collection. |
ntasks | int | Total amount of tasks requested for all Jobs in this collection. |
elapsed_cpu_time | int | Total amount of CPU-Time used by all the Jobs in the collection. This is the result of multiplying run_time by the number of CPUs requested, for each Job. |
frozen | bool | If this is set to True and the |
stats | JobStatistics | Real-time statistics of all Jobs in this collection. Before you can access the stats data for this, you have to call the |
load(preload_passwd_info=False, frozen=False)
staticmethod
Retrieve all Jobs from the Slurm controller.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
preload_passwd_info | bool | Decides whether to query passwd and groups information from the system. Could potentially speed up access to attributes of the Job where a UID/GID is translated to a name. If True, the information will be fetched and stored in each of the Job instances. | False |
frozen | bool | Decide whether this collection of Jobs should be frozen. | False |
Returns:
Type | Description |
---|---|
Jobs | A collection of Job objects. |
Raises:
Type | Description |
---|---|
RPCError | When getting all the Jobs from the slurmctld failed. |
Examples:
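A minimal sketch of summarizing a loaded collection by the documented state attribute. The helper `count_by_state` is hypothetical and accepts any iterable of Job-like objects. The commented usage assumes pyslurm is installed and that the collection offers a mapping-style values() view, which is an assumption not stated in this section.

```python
from collections import Counter


def count_by_state(jobs):
    """Count Jobs per state, given any iterable of objects with a .state attribute."""
    return Counter(job.state for job in jobs)


# Usage on a cluster (not executed here):
#   import pyslurm
#   jobs = pyslurm.Jobs.load()
#   print(count_by_state(jobs.values()))   # values() is assumed here
```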
load_stats()
method descriptor
Load realtime stats for this collection of Jobs.
This function additionally fills in the stats attribute for all Jobs in the collection, and also populates its own stats attribute.
Implicitly calls load_steps().
Note
Pending Jobs will be ignored, since they don't have any Stats yet.
Returns:
Type | Description |
---|---|
JobStatistics | The statistics of this job collection. |
Raises:
Type | Description |
---|---|
RPCError | When retrieving the stats for all the Jobs failed. |
Examples:
load_steps()
method descriptor
Load all Job steps for this collection of Jobs.
This function fills in the steps attribute for all Jobs in the collection.
Note
Pending Jobs will be ignored, since they don't have any Steps yet.
Raises:
Type | Description |
---|---|
RPCError | When retrieving the information for all the Steps failed. |