parse and monitor job status on slurm system

Chentao Yang

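Recently I wanted to write a script that monitors jobs submitted to the cluster, so that many jobs can be handled in batch and the script can be plugged into an existing pipeline. The job management system on all of the clusters is Slurm. One step of the script fetches the running state of every job, but at first I could never get a state back: the command simply returned the string ----------, which means the job was not found. I tried the following three approaches, and none of them solved the problem: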


import os
import subprocess

# These snippets are excerpts from a function body (hence the final return).
cmd = f"/usr/bin/sacct -o State -j {job} | tail -n 1"

## method 1: subprocess.Popen
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()
job_status = output.decode().strip()  # decode the captured stdout bytes

## method 2: subprocess.run
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
job_status = result.stdout.strip()
if result.stderr:
    # text=True already returns str, so stderr needs no decode()
    print(f"An error occurred while running the sacct command: {cmd}", result.stderr)
    check_error_list.add(job)

## method 3: os.popen
with os.popen(cmd, 'r') as p:
    job_status = p.read().strip()
return job_status
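The ---------- string is most likely the dashed separator row that sacct prints under its column header, which is what `tail -n 1` picks up when no job record is listed yet. A minimal sketch of telling that placeholder apart from a real state, assuming the default header-plus-separator output layout; `extract_state` and `query_state` are hypothetical helper names, not part of the original script:

```python
import subprocess
from typing import Optional


def extract_state(sacct_output: str) -> Optional[str]:
    """Return the last real State value, or None if no record is listed yet."""
    lines = [line.strip() for line in sacct_output.splitlines() if line.strip()]
    # Drop the "State" header and any row made only of dashes.
    states = [line for line in lines if line != "State" and set(line) != {"-"}]
    return states[-1] if states else None


def query_state(job: str) -> Optional[str]:
    # Hypothetical wrapper: an argument list avoids shell=True and the tail pipe.
    result = subprocess.run(
        ["/usr/bin/sacct", "-o", "State", "-j", str(job)],
        capture_output=True,
        text=True,
    )
    return extract_state(result.stdout)
```

With this, a `None` return value means "no record yet" rather than a state that happens to look like dashes, so the caller can decide whether to wait and retry.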

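In the end I talked it over with a colleague and found that she had run into the same problem. The cause is that the job system sometimes stalls and a submission does not go through right away, so the state only becomes available after a while. The fix is either to query several times in a loop or to wait a little before the first query.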

# If this is the first check, hold your horses:
# the system may not respond immediately.
if check_round == 0:
    time.sleep(30)
job_status = parse_job_status(job)  # any one of the three methods above works
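The other option mentioned above, querying several times in a loop, could look roughly like this. It is only a sketch: `wait_for_status`, its `attempts` and `delay` defaults, and the injectable `sleep` argument are assumptions made for illustration, and `fetch` stands for any function that returns the stripped status string (for example the author's `parse_job_status`).

```python
import time
from typing import Callable, Optional


def wait_for_status(
    job: str,
    fetch: Callable[[str], Optional[str]],
    attempts: int = 5,
    delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(job) until it returns a real state or the attempts run out."""
    for attempt in range(attempts):
        status = fetch(job)
        # Treat empty output and the dashed placeholder as "not recorded yet".
        if status and set(status) != {"-"}:
            return status
        # Sleep between attempts, but not after the last one.
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

A call such as `wait_for_status(job, parse_job_status)` would then replace the fixed 30-second sleep; a job whose record shows up quickly is not delayed, and one that never shows up yields `None` instead of a row of dashes.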

As the saying goes: more haste, less speed~

  • Title: parse and monitor job status on slurm system
  • Author: Chentao Yang
  • Created at : 2023-10-27 11:31:16
  • Updated at : 2023-10-27 03:43:22
  • Link: https://comery.github.io/2023/10/27/parse-and-monitor-job-status-on-slurm-system/
  • License: This work is licensed under CC BY-NC-SA 4.0.