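How to use TACC launcher to submit serial jobs in batches
# What is TACC launcher?
It is a simple, practical tool that lets you run many single-threaded or multi-threaded tasks from a single batch script.
For a detailed introduction, see the official site: link.
Download: link.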
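# How do you use TACC launcher?
The official documentation describes usage in detail and is the best place to start, so it is not repeated here. If English is a barrier, a web page translation tool can help.
In short:
- Download the tool
- Unpack it
- No compilation is needed!
- Set the environment variables
- Write a joblist file listing every task to run, one command per line
- Submit it with the launcher command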
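The unpack and environment steps above could be sketched as a small shell helper. This is only an illustration: `setup_launcher` is a hypothetical function, and it assumes the archive has already been downloaded and unpacks into a directory named after the tarball (for example `launcher-3.1.1.tar.gz` giving `launcher-3.1.1`), which the official documentation may or may not match.

```shell
# Hypothetical helper (not part of launcher): unpack an already-downloaded
# archive and export the variables used later in the submission script.
# Assumption: the archive unpacks into a directory named after the tarball.
setup_launcher() {
    tarball=$1        # e.g. launcher-3.1.1.tar.gz (assumed file name)
    dest=$2           # e.g. $HOME/launcher
    mkdir -p "$dest" || return 1
    tar -xzf "$tarball" -C "$dest" || return 1   # unpack only; no compile step
    name=$(basename "$tarball" .tar.gz)
    export LAUNCHER_DIR="$dest/$name"
    export LAUNCHER_PLUGIN_DIR="$LAUNCHER_DIR/plugins"
    export PATH="$LAUNCHER_DIR:$PATH"
}
```

It would be called as `setup_launcher launcher-3.1.1.tar.gz "$HOME/launcher"`, after which `$LAUNCHER_DIR/paramrun` should exist if the assumption about the directory name holds.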
# TACC launcher + slurm example
# Preparing the test case
Prepare a joblist file named myjoblist that lists the tasks to run. To keep the test simple, it contains 12 hello-world lines:
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
echo "hello, world"
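For a longer joblist, typing each line by hand gets tedious. A minimal sketch that generates the same 12-line test file with a loop (the file name myjoblist is the one used in this example):

```shell
# Write 12 identical hello-world commands, one per line, into myjoblist.
joblist=myjoblist
: > "$joblist"                          # create or truncate the file
for i in $(seq 1 12); do
    printf '%s\n' 'echo "hello, world"' >> "$joblist"
done
wc -l < "$joblist"                      # number of lines = number of jobs
```

In a real run, the loop body would typically write a different command on each line, for example one per input file.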
# Writing the submission script
Next, write a submission script, sub.sh, containing the launcher settings:
#!/bin/bash
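export LAUNCHER_JOB_FILE=/path/to/myjoblist          # joblist to run; change to the real path
export LAUNCHER_DIR=$HOME/launcher/launcher-3.1.1    # launcher install directory; change to the real path
export PATH=$LAUNCHER_DIR:$PATH
export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins
export LAUNCHER_RMI=SLURM                            # resource manager in use
export LAUNCHER_SCHED=interleaved                    # scheduling method
export LAUNCHER_WORKDIR=$(pwd)                       # run from the submission directory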
$LAUNCHER_DIR/paramrun
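Notes:
- LAUNCHER_JOB_FILE is the path to myjoblist; change it to the actual path
- LAUNCHER_DIR is the launcher installation path; change it to the actual path
- The other variables can be left as they are for now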
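Because both paths have to be edited by hand, it can help to verify them before submitting. The following is a hypothetical pre-flight check, not something launcher provides; it assumes only that `LAUNCHER_JOB_FILE` and `LAUNCHER_DIR` are set as in sub.sh and that `paramrun` lives directly under `LAUNCHER_DIR`.

```shell
# Hypothetical pre-flight check (not part of launcher): succeeds only if
# the joblist is a non-empty file and paramrun is executable.
check_launcher_env() {
    if [ ! -s "$LAUNCHER_JOB_FILE" ]; then
        echo "joblist missing or empty: $LAUNCHER_JOB_FILE" >&2
        return 1
    fi
    if [ ! -x "$LAUNCHER_DIR/paramrun" ]; then
        echo "paramrun not found or not executable in: $LAUNCHER_DIR" >&2
        return 1
    fi
    # grep -c '' counts every line, including a last line without a newline
    echo "ok: $(grep -c '' "$LAUNCHER_JOB_FILE") job line(s) in $LAUNCHER_JOB_FILE"
}
```

One option is to call `check_launcher_env || exit 1` in sub.sh just before the `paramrun` line, so a wrong path fails early with a readable message.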
# Submitting the script
yhbatch -N 2 -n 6 -p debug sub.sh
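Notes:
- yhbatch is this cluster's Slurm submission command (the counterpart of sbatch on a stock Slurm installation)
- -N 2 requests 2 nodes
- -n 6 requests 6 CPU cores (tasks) in total (6 overall, not 6 per node; n must also be divisible by N, otherwise an error is reported)
- -p debug selects the debug partition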
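The divisibility requirement is easy to check before submitting. A minimal sketch, where `procs_per_host` is a hypothetical helper name and the arithmetic is simply n divided by N:

```shell
# Hypothetical helper: print the number of tasks per node for -N nodes and
# -n tasks, or fail when n is not a multiple of N.
procs_per_host() {
    nodes=$1
    tasks=$2
    if [ $((tasks % nodes)) -ne 0 ]; then
        echo "-n $tasks is not divisible by -N $nodes" >&2
        return 1
    fi
    echo $((tasks / nodes))
}

procs_per_host 2 6    # the values used in the command above: 6 / 2 = 3
```

A combination such as `procs_per_host 4 6` returns a non-zero status instead, which can be used to stop a wrapper script before it submits.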
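# Checking the results
A job submitted through the slurm scheduler writes to a default output file, slurm-jobid.out (where jobid is the job's ID). Its contents are: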
Launcher: Setup complete.
------------- SUMMARY ---------------
Number of hosts: 2
Working directory: $HOME/workdir/test
Processes per host: 3
Total processes: 6
Total jobs: 12
Scheduling method: interleaved
-------------------------------------
Launcher: Starting parallel tasks...
Launcher: Task 1 running job 2 on cn95 (echo "hello, world")
Launcher: Task 0 running job 1 on cn95 (echo "hello, world")
hello, world
hello, world
Launcher: Task 2 running job 3 on cn95 (echo "hello, world")
hello, world
Launcher: Job 1 completed in 0 seconds.
Launcher: Task 5 running job 6 on cn96 (echo "hello, world")
Launcher: Task 4 running job 5 on cn96 (echo "hello, world")
hello, world
hello, world
Launcher: Task 3 running job 4 on cn96 (echo "hello, world")
Launcher: Job 3 completed in 0 seconds.
hello, world
Launcher: Job 2 completed in 0 seconds.
Launcher: Job 6 completed in 0 seconds.
Launcher: Job 5 completed in 0 seconds.
Launcher: Job 4 completed in 0 seconds.
Launcher: Task 0 running job 7 on cn95 (echo "hello, world")
hello, world
Launcher: Task 2 running job 9 on cn95 (echo "hello, world")
hello, world
Launcher: Task 1 running job 8 on cn95 (echo "hello, world")
hello, world
Launcher: Task 5 running job 12 on cn96 (echo "hello, world")
hello, world
Launcher: Task 3 running job 10 on cn96 (echo "hello, world")
hello, world
Launcher: Task 4 running job 11 on cn96 (echo "hello, world")
hello, world
Launcher: Job 7 completed in 0 seconds.
Launcher: Job 9 completed in 0 seconds.
Launcher: Job 8 completed in 0 seconds.
Launcher: Job 12 completed in 0 seconds.
Launcher: Job 10 completed in 0 seconds.
Launcher: Job 11 completed in 0 seconds.
Launcher: Task 0 done. Exiting.
Launcher: Task 2 done. Exiting.
Launcher: Task 1 done. Exiting.
Launcher: Task 5 done. Exiting.
Launcher: Task 3 done. Exiting.
Launcher: Task 4 done. Exiting.
Launcher: Done. Job exited without errors
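With more than a handful of jobs, reading the log line by line stops being practical. The sketch below summarises it with `grep` and `awk`; the patterns are based only on the log format shown above, so they are an assumption about other launcher versions, and both function names are hypothetical.

```shell
# Count the "Job N completed" lines in a launcher log.
count_completed() {
    grep -c 'Launcher: Job [0-9][0-9]* completed' "$1"
}

# Print "task job host" for every "Task T running job J on HOST" line,
# sorted by task number and then by job number.
jobs_by_task() {
    awk '/^Launcher: Task [0-9]+ running job/ { print $3, $6, $8 }' "$1" |
        sort -n -k1,1 -k2,2
}
```

Running `count_completed` on the slurm output file and comparing the result with the "Total jobs" value in the summary is a quick way to confirm that every job finished.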
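Notes:
Parameter | Value | Description |
---|---|---|
Number of hosts | 2 | Set by -N 2, so 2 nodes |
Working directory | $HOME/workdir/test | The directory the job was actually submitted from |
Processes per host | 3 | Processes on each node, computed as 6/2=3, which is why n must be divisible by N |
Total processes | 6 | Set by -n 6, so 6 processes in total |
Total jobs | 12 | myjoblist contains 12 lines, so there are 12 jobs |
Scheduling method | interleaved | The scheduling method; there are 3 options, see the official site for details |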
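The log suggests how interleaved scheduling distributes work: Task 0 ran jobs 1 and 7, Task 1 ran jobs 2 and 8, and so on up to Task 5 with jobs 6 and 12. That pattern matches a round-robin assignment, sketched below. The formula is inferred from this one log, not taken from launcher's source, and `interleaved_task` is a hypothetical name.

```shell
# Sketch of the job-to-task mapping seen in the log for "interleaved":
# job numbers start at 1, task numbers at 0, and jobs are dealt out
# round-robin. Inferred from the log, not from launcher's source code.
interleaved_task() {
    job=$1       # 1-based job number (line number in the joblist)
    nprocs=$2    # total number of processes (-n)
    echo $(( (job - 1) % nprocs ))
}

# Print the mapping for the 12 jobs and 6 processes used in this example.
for job in $(seq 1 12); do
    echo "job $job -> task $(interleaved_task "$job" 6)"
done
```

Under this mapping each task handles 12/6 = 2 jobs, which agrees with the two rounds of "running job" messages in the output.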