Torque does not run PBS array jobs

I am looking for help getting PBS array jobs to run on an HPC cluster. I am familiar with array jobs on other HPC systems, and I can run non-array PBS jobs on my current cluster. However, the array jobs I submit sit in state Q. When I check them with qstat -f or tracejob, they may briefly change to state R and then to C, with no sign that the job actually ran (no output or error files, and no output from the job script). My HPC manager, who is the root user, has exactly the same problem. I have not been able to find many questions and answers about PBS array jobs. For example, this question (PBS jobs stay in queue ('Q') but run with qrun) describes a similar problem. However, that problem affected all PBS jobs, whereas mine affects only array jobs.

Here is a more detailed description:

We are using Torque 6.1.2. I submitted my job with:

qsub -t 0 test_array.sh

The contents of test_array.sh are as follows:

#! /bin/sh -l
#PBS -l walltime=00:10:00
#PBS -l mem=2gb
#PBS -t 0

cd ~
echo "1" > ~/test.txt
echo ${PBS_ARRAYID} > ~/test2.txt

Here is what I get from tracejob <jobID>:

/var/spool/torque/mom_logs/20201217: No matching job records located
/var/spool/torque/sched_logs/20201217: No matching job records located

Job: 34930[].mprc00.cluster.local

12/17/2020 14:15:35.629 S    enqueuing into batch, state 1 hop 1
12/17/2020 14:15:35.630 S    enqueuing into batch, state 2 hop 1
12/17/2020 14:15:35.975 S    Job Modified at request of [email protected]
12/17/2020 14:15:35.976 S    Job Run at request of [email protected]
12/17/2020 14:15:35  A    queue=batch
12/17/2020 14:15:36.033 S    unable to run job, MOM rejected/timeout
12/17/2020 14:15:36.035 S    unable to run job, send to MOM '172.20.0.201' failed

Below is what I got from qstat -f. The scheduler seems to recognize that this is an array job and to have picked up the array ID (0) correctly.

Job Id: 34931[].mprc00.cluster.local
    Job_Name = test_array.sh
    Job_Owner = [email protected]
    job_state = R
    queue = batch
    server = mprc00.cluster.local
    Checkpoint = u
    ctime = Thu Dec 17 14:25:41 2020
    Error_Path = mprc00.cluster.local:/data/mprc_data1/home/mayizhou/Documents
    /PBSscripts/test_array.sh.e34931
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Thu Dec 17 14:25:48 2020
    Output_Path = mprc00.cluster.local:/data/mprc_data1/home/mayizhou/Document
    s/PBSscripts/test_array.sh.o34931
    Priority = 0
    Rerunable = True
    Resource_List.walltime = 00:10:00
    Resource_List.mem = 2gb
    Resource_List.nodes = 1
    Resource_List.nodect = 1
    Variable_List = PBS_O_QUEUE=batch,
    PBS_O_HOME=/data/mprc_data1/home//mayizhou,PBS_O_LOGNAME=mayizhou,
    PBS_O_PATH=/data/mprc_data1/home//mayizhou/bin:/data/mprc_data1/home/
    /mayizhou/.local/bin:/usr/local/bin:/usr/local/sbin:/opt/rh/rh-python3
    6/root/usr/bin:/opt/confluent/bin:/usr/local/bin:/usr/bin:/usr/local/s
    bin:/usr/sbin:/opt/ibutils/bin:/data/mprc_data1/fsl6.0.2/bin/,
    PBS_O_MAIL=/var/spool/mail/mayizhou,PBS_O_SHELL=/bin/bash,
    PBS_O_LANG=en_US.UTF-8,
    PBS_O_WORKDIR=/data/mprc_data1/home/mayizhou/Documents/PBSscripts,
    PBS_O_HOST=mprc00.cluster.local,PBS_O_SERVER=mprc00
    euser = <username>
    egroup = lenovo
    queue_type = E
    etime = Thu Dec 17 14:25:41 2020
    submit_args = -t 0 test_array.sh
    job_array_request = 0
    fault_tolerant = False
    job_radix = 0
    submit_host = mprc00.cluster.local
    init_work_dir = /data/mprc_data1/home/mayizhou/Documents/PBSscripts
    request_version = 1

I also used the Torque diagnostics tool (http://docs.adaptivecomputing.com/torque/4-2-8/help.htm#topics/12-appendices/diagnosticsAndErrorCodes.htm). I got the following lines in the server.txt output (slightly truncated to fit the character limit). They seem to indicate a Bad UID error. However, I doubt that this is the real cause, because 1) I can submit other PBS jobs, and 2) my HPC manager, who is the root user, got the same Bad UID error.

1554:12/11/2020 16:39:02.558;256;PBS_Server.310206;Job;34653[].mprc00.cluster.local;enqueuing into batch, state 1 hop 1
1555:12/11/2020 16:39:02.559;08;PBS_Server.310206;Job;perform_commit_work;threading job_clone_wt: job id 34653[].mprc00.cluster.local
1556:12/11/2020 16:39:02.559;08;PBS_Server.310206;Job;perform_commit_work;job_id: 34653[].mprc00.cluster.local
1558:12/11/2020 16:39:02.571;256;PBS_Server.310186;Job;34653[0].mprc00.cluster.local;enqueuing into batch, state 2 hop 1
1559:12/11/2020 16:39:02.897;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1560:12/11/2020 16:39:02.898;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1562:12/11/2020 16:39:02.930;01;PBS_Server.310207;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1563:12/11/2020 16:39:02.930;01;PBS_Server.310207;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1564:12/11/2020 16:39:02.931;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1566:12/11/2020 16:39:02.932;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1567:12/11/2020 16:39:02.933;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1568:12/11/2020 16:39:02.965;01;PBS_Server.333987;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1570:12/11/2020 16:39:03.418;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1571:12/11/2020 16:39:03.419;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1573:12/11/2020 16:39:03.453;01;PBS_Server.310205;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1574:12/11/2020 16:39:03.453;01;PBS_Server.310205;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1575:12/11/2020 16:39:03.453;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1577:12/11/2020 16:39:03.455;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1578:12/11/2020 16:39:03.456;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1579:12/11/2020 16:39:03.482;01;PBS_Server.310300;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1581:12/11/2020 16:39:03.887;08;PBS_Server.331627;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1582:12/11/2020 16:39:03.888;08;PBS_Server.331627;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1584:12/11/2020 16:39:03.901;01;PBS_Server.331627;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1585:12/11/2020 16:39:03.901;01;PBS_Server.331627;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1586:12/11/2020 16:39:03.902;08;PBS_Server.331627;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1588:12/11/2020 16:39:03.903;08;PBS_Server.331627;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1589:12/11/2020 16:39:03.904;08;PBS_Server.331627;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1590:12/11/2020 16:39:03.932;01;PBS_Server.331628;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1592:12/11/2020 16:39:04.430;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1593:12/11/2020 16:39:04.431;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1595:12/11/2020 16:39:04.445;01;PBS_Server.310761;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1596:12/11/2020 16:39:04.445;01;PBS_Server.310761;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1597:12/11/2020 16:39:04.445;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1599:12/11/2020 16:39:04.447;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1600:12/11/2020 16:39:04.448;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1601:12/11/2020 16:39:04.475;01;PBS_Server.333987;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1603:12/11/2020 16:39:04.920;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1604:12/11/2020 16:39:04.921;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1606:12/11/2020 16:39:04.936;01;PBS_Server.310760;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1607:12/11/2020 16:39:04.936;01;PBS_Server.310760;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1608:12/11/2020 16:39:04.937;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1610:12/11/2020 16:39:04.938;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1611:12/11/2020 16:39:04.939;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1612:12/11/2020 16:39:04.969;01;PBS_Server.310300;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1614:12/11/2020 16:39:05.405;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1615:12/11/2020 16:39:05.406;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1617:12/11/2020 16:39:05.421;01;PBS_Server.310759;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1618:12/11/2020 16:39:05.421;01;PBS_Server.310759;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1619:12/11/2020 16:39:05.421;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1621:12/11/2020 16:39:05.423;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1622:12/11/2020 16:39:05.424;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1629:12/11/2020 16:39:05.945;01;PBS_Server.331628;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1630:12/11/2020 16:39:05.946;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1632:12/11/2020 16:39:05.947;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1633:12/11/2020 16:39:05.948;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1634:12/11/2020 16:39:05.973;01;PBS_Server.333987;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1636:12/11/2020 16:39:06.363;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1637:12/11/2020 16:39:06.364;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1639:12/11/2020 16:39:06.379;01;PBS_Server.310205;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1640:12/11/2020 16:39:06.379;01;PBS_Server.310205;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1641:12/11/2020 16:39:06.379;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1643:12/11/2020 16:39:06.381;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1644:12/11/2020 16:39:06.382;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1645:12/11/2020 16:39:06.404;01;PBS_Server.310759;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1647:12/11/2020 16:39:06.630;08;PBS_Server.310300;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1648:12/11/2020 16:39:06.631;08;PBS_Server.310300;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1650:12/11/2020 16:39:06.657;01;PBS_Server.310300;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1651:12/11/2020 16:39:06.657;01;PBS_Server.310300;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1652:12/11/2020 16:39:06.658;08;PBS_Server.310300;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1654:12/11/2020 16:39:06.659;08;PBS_Server.310300;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1655:12/11/2020 16:39:06.660;08;PBS_Server.310300;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1656:12/11/2020 16:39:06.685;01;PBS_Server.331627;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1658:12/11/2020 16:39:07.120;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1659:12/11/2020 16:39:07.121;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1661:12/11/2020 16:39:07.136;01;PBS_Server.310207;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1662:12/11/2020 16:39:07.136;01;PBS_Server.310207;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1663:12/11/2020 16:39:07.137;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1665:12/11/2020 16:39:07.139;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1666:12/11/2020 16:39:07.139;08;PBS_Server.310207;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1667:12/11/2020 16:39:07.174;01;PBS_Server.310760;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1669:12/11/2020 16:39:07.367;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1670:12/11/2020 16:39:07.368;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1672:12/11/2020 16:39:07.381;01;PBS_Server.331628;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1673:12/11/2020 16:39:07.381;01;PBS_Server.331628;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1674:12/11/2020 16:39:07.382;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1676:12/11/2020 16:39:07.383;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1677:12/11/2020 16:39:07.384;08;PBS_Server.331628;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1678:12/11/2020 16:39:07.410;01;PBS_Server.310759;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1680:12/11/2020 16:39:07.636;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1681:12/11/2020 16:39:07.637;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1683:12/11/2020 16:39:07.651;01;PBS_Server.310205;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1684:12/11/2020 16:39:07.651;01;PBS_Server.310205;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1685:12/11/2020 16:39:07.651;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1687:12/11/2020 16:39:07.653;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1688:12/11/2020 16:39:07.654;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1689:12/11/2020 16:39:07.682;01;PBS_Server.331627;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1691:12/11/2020 16:39:08.115;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1692:12/11/2020 16:39:08.116;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1694:12/11/2020 16:39:08.131;01;PBS_Server.310761;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1695:12/11/2020 16:39:08.131;01;PBS_Server.310761;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1696:12/11/2020 16:39:08.132;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1698:12/11/2020 16:39:08.133;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1699:12/11/2020 16:39:08.134;08;PBS_Server.310761;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1700:12/11/2020 16:39:08.159;01;PBS_Server.331628;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1702:12/11/2020 16:39:08.379;08;PBS_Server.333987;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1703:12/11/2020 16:39:08.380;08;PBS_Server.333987;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1705:12/11/2020 16:39:08.395;01;PBS_Server.333987;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1706:12/11/2020 16:39:08.395;01;PBS_Server.333987;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1707:12/11/2020 16:39:08.396;08;PBS_Server.333987;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1709:12/11/2020 16:39:08.397;08;PBS_Server.333987;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1710:12/11/2020 16:39:08.398;08;PBS_Server.333987;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1711:12/11/2020 16:39:08.422;01;PBS_Server.310206;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1713:12/11/2020 16:39:08.872;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1714:12/11/2020 16:39:08.873;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1716:12/11/2020 16:39:08.888;01;PBS_Server.310759;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1717:12/11/2020 16:39:08.888;01;PBS_Server.310759;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1718:12/11/2020 16:39:08.888;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1720:12/11/2020 16:39:08.890;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1721:12/11/2020 16:39:08.891;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1722:12/11/2020 16:39:08.922;01;PBS_Server.310207;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1724:12/11/2020 16:39:09.111;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1725:12/11/2020 16:39:09.112;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1727:12/11/2020 16:39:09.127;01;PBS_Server.310205;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1728:12/11/2020 16:39:09.127;01;PBS_Server.310205;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1729:12/11/2020 16:39:09.128;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1731:12/11/2020 16:39:09.130;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1732:12/11/2020 16:39:09.132;08;PBS_Server.310205;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1733:12/11/2020 16:39:09.156;01;PBS_Server.310761;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1735:12/11/2020 16:39:09.396;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1736:12/11/2020 16:39:09.397;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1738:12/11/2020 16:39:09.412;01;PBS_Server.310760;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1739:12/11/2020 16:39:09.412;01;PBS_Server.310760;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1740:12/11/2020 16:39:09.413;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1742:12/11/2020 16:39:09.414;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1743:12/11/2020 16:39:09.415;08;PBS_Server.310760;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1744:12/11/2020 16:39:09.440;01;PBS_Server.333987;Svr;PBS_Server;LOG_ERROR::Request invalid for state of job (15018) in 34653[0].mprc00.cluster.local, obit received for job 34653[0].mprc00.cluster.local from host mprc01 with bad state (state: QUEUED)
1746:12/11/2020 16:39:09.881;08;PBS_Server.310206;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1747:12/11/2020 16:39:09.882;08;PBS_Server.310206;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1749:12/11/2020 16:39:09.897;01;PBS_Server.310206;Svr;PBS_Server;LOG_CRITICAL::Bad UID for job execution (15025) in log_commit_error, child failed in commit request for job 34653[0].mprc00.cluster.local
1750:12/11/2020 16:39:09.897;01;PBS_Server.310206;Svr;PBS_Server;LOG_ERROR::Success (0) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1751:12/11/2020 16:39:09.898;08;PBS_Server.310206;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/timeout
1753:12/11/2020 16:39:09.900;08;PBS_Server.310206;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1754:12/11/2020 16:39:09.900;08;PBS_Server.310206;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1755:12/11/2020 16:39:10.123;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Modified at request of [email protected]
1756:12/11/2020 16:39:10.124;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Job Run at request of [email protected]
1757:12/11/2020 16:39:10.147;01;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;MOM reports job already running
1758:12/11/2020 16:39:10.148;01;PBS_Server.310759;Svr;PBS_Server;LOG_ERROR::Job with requested ID already exists (15011) in send_job_over_network_with_retries, child failed in previous commit request for job 34653[0].mprc00.cluster.local
1759:12/11/2020 16:39:10.148;01;PBS_Server.310759;Svr;PBS_Server;LOG_ERROR::Job with requested ID already exists (15011) in send_job_work, child failed and will not retry job 34653[0].mprc00.cluster.local
1760:12/11/2020 16:39:10.148;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;unable to run job, MOM rejected/rc=-1
1762:12/11/2020 16:39:10.150;08;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;unable to run job, send to MOM '172.20.0.201' failed
1764:12/11/2020 16:39:10.150;128;PBS_Server.310759;Job;34653[0].mprc00.cluster.local;Request invalid for state of job EXITING
1766:12/11/2020 16:39:10.174;08;PBS_Server.310174;Job;34653[0].mprc00.cluster.local;on_job_exit valid pjob: 34653[0].mprc00.cluster.local (substate=50)
1767:12/11/2020 16:39:10.214;13;PBS_Server.310174;Job;handle_stageout;Post job file processing error; job 34653[0].mprc00.cluster.local on host mprc01
1780:12/11/2020 16:44:13.092;256;PBS_Server.310176;Job;34653[0].mprc00.cluster.local;dequeuing from batch, state COMPLETE
1781:12/11/2020 16:44:13.100;256;PBS_Server.310176;Job;34653[].mprc00.cluster.local;dequeuing from batch, state COMPLETE
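
Since the log above repeats the same cycle many times, a tally by error code may be easier to read. The following is only a minimal sketch: the function name summarize_codes is made up for illustration, and it assumes the semicolon-separated "LOG_<LEVEL>::<message> (<code>) in ..." layout seen in the lines above.

```shell
# Hypothetical helper: count PBS_Server log messages by error code.
# Assumes each error line contains "LOG_<LEVEL>::<message> (<code>) in ...",
# as in the server.txt excerpt above; other lines are ignored.
summarize_codes() {
    # Rewrite each matching line as "<code> <message>", then count duplicates
    # and list the most frequent code first.
    sed -n 's/.*LOG_[A-Z]*::\(.*\) (\([0-9][0-9]*\)) in .*/\2 \1/p' "$1" \
        | sort | uniq -c | sort -rn
}
```

Calling summarize_codes server.txt would print one line per distinct code/message pair, prefixed by its count, which should make it clearer that the Bad UID (15025) and bad-state obit (15018) errors alternate until the final 15011 failure.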

I hope this is enough information. Any input is appreciated!


Asked by Yizhou Ma on 17.12.2020