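You can use the client tool to view job logs, the job list, and job details. This topic describes the query commands, including their syntax, parameters, and usage examples.
View job logs (logs)
Description
Views the logs of a training job.
Syntax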
./dlc logs <yourJobId> <yourPodId> [--max_events_num <yourMaxNum>] [--start_time <yourStartTime>] [--end_time <yourEndTime>]
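Parameters
| Parameter | Required | Description | Type |
| --- | --- | --- | --- |
| <yourJobId> | Yes | The ID of the training job to view. | STRING |
| <yourPodId> | Yes | The ID of the instance (Pod) whose logs you want to view. A distributed job has multiple instances (Pods). | STRING |
| --max_events_num <yourMaxNum> | No | The maximum number of log lines to return. Default: 2000. | INT |
| --start_time <yourStartTime> | No | The start of the log query time range. Default: 7 days ago. Example: --start_time 2020-11-08T16:00:00Z. | STRING |
| --end_time <yourEndTime> | No | The end of the log query time range. Default: the current time. Example: --end_time 2020-11-08T17:00:00Z. | STRING |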
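The time parameters take UTC timestamps in the form shown in the examples (for instance 2020-11-08T16:00:00Z). The following is a minimal sketch, in Python, of how a script might assemble a logs command for a one-hour window. The helper names to_utc_timestamp and build_logs_command are hypothetical and not part of the client tool, and the sketch assumes the dlc binary sits in the current directory. It only builds the argument list; it does not run the command.

```python
from datetime import datetime, timedelta, timezone


def to_utc_timestamp(moment):
    """Format a timezone-aware datetime like the documented examples, e.g. 2020-11-08T16:00:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_logs_command(job_id, pod_id, max_events_num=None,
                       start_time=None, end_time=None):
    """Assemble the argument list for `./dlc logs` (hypothetical helper).

    Optional flags are appended only when a value is given, so the
    tool's own defaults apply otherwise.
    """
    command = ["./dlc", "logs", job_id, pod_id]
    if max_events_num is not None:
        command += ["--max_events_num", str(max_events_num)]
    if start_time is not None:
        command += ["--start_time", to_utc_timestamp(start_time)]
    if end_time is not None:
        command += ["--end_time", to_utc_timestamp(end_time)]
    return command


# Query the hour that ends at 2020-11-08 17:00 UTC, capped at 100 lines.
end = datetime(2020, 11, 8, 17, 0, 0, tzinfo=timezone.utc)
command = build_logs_command("<yourJobId>", "<yourPodId>",
                             max_events_num=100,
                             start_time=end - timedelta(hours=1),
                             end_time=end)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run once the placeholders are replaced with real IDs.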
Examples
Get 10 lines of logs from Worker node 0 of a distributed training job.
./dlc logs dlcdys3r9jlu**** dlcdys3r********-worker-0 --max_events_num 10
The system returns output similar to the following.
WARN: ./requirements.txt not found, skip installing requirements.
================================================
| PAI Tensorflow powered by Aliyun PAI Team. |
================================================
Network is under initialization...
Network successfully initialized.
[2021-04-16 12:27:56.368026] [INFO] [7#7] [tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
[2021-04-16 12:27:56.375586] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:80] ====================CPU Architecture=====================
[2021-04-16 12:27:56.375600] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:84] Disable AVX512.
[2021-04-16 12:27:56.375605] [INFO] [7#7] [tensorflow/core/distributed_runtime/master.cc:87] CPU Vendor ID: GenuineIntel
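View the job list and job status
Description
Gets information about training jobs. If you do not specify JOB_ID, all jobs are listed. If you specify JOB_ID, only the information for that job is shown.
Syntax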
./dlc get job [JOB_ID] [--workspace_id <yourWorkspaceId>] [--display_name <yourJobName>] [--job_type <yourJobType>] [--status <yourJobStatus>] [--start_time <yourStartTime>] [--end_time <yourEndTime>] [--page_num <yourPageNum>] [--page_size <yourPageSize>] [--max_events_num <yourMaxNum>] [--events] [--events_only]
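Parameters
| Parameter | Required | Description | Type |
| --- | --- | --- | --- |
| JOB_ID | No | The ID of the training job to view. | STRING |
| --workspace_id <yourWorkspaceId> | No | The workspace ID. | STRING |
| --display_name <yourJobName> | No | The job name. Fuzzy matching is supported; wildcards are not. Matching is case-insensitive. | STRING |
| --job_type <yourJobType> | No | The job type. All job types can be queried. Empty by default, which matches all types. | STRING |
| --status <yourJobStatus> | No | The job status. Empty by default, which matches all statuses. | STRING |
| --start_time <yourStartTime> | No | The start of the query time range, applied to the job creation time. Example: --start_time 2022-08-04T02:09:32Z. | STRING |
| --end_time <yourEndTime> | No | The end of the query time range, applied to the job creation time. Example: --end_time 2022-08-04T02:09:32Z. | STRING |
| --page_num <yourPageNum> | No | For paged queries, the page number to return. Pages are numbered from 1. Default: 1. | INT |
| --page_size <yourPageSize> | No | For paged queries, the number of entries returned per page. Default: 10. | INT |
| --max_events_num <yourMaxNum> | No | The maximum number of system event lines to return. Default: 2000. | INT |
| --events | No | Whether to query the system events of the job. Takes effect only when a single job is queried. Default: false. | BOOL |
| --events_only | No | Whether to query only the system events of the job. Takes effect only when a single job is queried. Default: false. | BOOL |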
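The get job command mixes options that take a value with switches that take none (--events and --events_only appear without an argument in the syntax line). The following Python sketch shows one way a script might keep the two kinds apart when assembling the command. build_get_job_command is a hypothetical helper, not part of the client tool, and the sketch assumes the dlc binary sits in the current directory. It only builds the argument list.

```python
# Options that take a value, and switches that take none, as listed in the syntax line.
VALUE_FLAGS = ("workspace_id", "display_name", "job_type", "status",
               "start_time", "end_time", "page_num", "page_size",
               "max_events_num")
SWITCH_FLAGS = ("events", "events_only")


def build_get_job_command(job_id=None, **options):
    """Assemble the argument list for `./dlc get job` (hypothetical helper)."""
    command = ["./dlc", "get", "job"]
    if job_id:
        command.append(job_id)
    for name, value in options.items():
        if name in SWITCH_FLAGS:
            if value:  # a switch is emitted only when truthy
                command.append("--" + name)
        elif name in VALUE_FLAGS:
            if value is not None:
                command += ["--" + name, str(value)]
        else:
            raise ValueError("unknown option: " + name)
    return command


# Second page of a fuzzy name search, 20 jobs per page.
list_command = build_get_job_command(display_name="epl", page_num=2, page_size=20)
print(" ".join(list_command))

# A single job together with up to 50 lines of its system events.
events_command = build_get_job_command("dlc02xipvt5z****", events=True,
                                       max_events_num=50)
print(" ".join(events_command))
```

Rejecting unknown option names early keeps a typo from silently turning into a flag the tool does not document.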
Examples
Query all training jobs whose names fuzzy-match a given string.
./dlc get job --display_name epl
The system returns output similar to the following.
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| Name               | JobId            | WorkspaceId | WorkspaceName    | ResourceId | ResourceName   | JobType | Priority | JobStatus | UserId           | CreateTime           | SubmittedTime        | RunningTime          | SuccessedTime        | StoppedTime | FailedTime | FinishTime           | Duration(seconds) |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
| test_epl_test-**** | dlc02xipvt5z**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T06:41:05Z | 2022-08-01T06:45:08Z | 2022-08-01T06:48:57Z | 2022-08-01T06:53:21Z |             |            | 2022-08-01T06:53:21Z | 736               |
| test_epl_****      | dlc1iyv3szl2**** | 23****      | doc_test_****    |            | public-cluster | TFJob   | 1        | Succeeded | 144963168668**** | 2022-08-01T03:23:51Z | 2022-08-01T03:27:22Z | 2022-08-01T03:27:50Z | 2022-08-01T03:33:48Z |             |            | 2022-08-01T03:33:48Z | 597               |
+--------------------+------------------+-------------+------------------+------------+----------------+---------+----------+-----------+------------------+----------------------+----------------------+----------------------+----------------------+-------------+------------+----------------------+-------------------+
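If a script needs individual fields from this listing, the bordered table can be split on the | separators. The following Python sketch assumes the layout shown in the sample output (border lines starting with +, rows starting and ending with |, and no | inside cell values). parse_table is a hypothetical helper, and the sample text is a trimmed copy of the output above with only four of its columns.

```python
def parse_table(text):
    """Parse a bordered ASCII table into a list of dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only data rows; this skips +---+ border lines and blank lines.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Drop the outer pipes, then split the remaining text into cells.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


# Trimmed sample: four columns taken from the output above.
sample = """
+--------------------+------------------+-----------+-------------------+
| Name               | JobId            | JobStatus | Duration(seconds) |
+--------------------+------------------+-----------+-------------------+
| test_epl_test-**** | dlc02xipvt5z**** | Succeeded | 736               |
| test_epl_****      | dlc1iyv3szl2**** | Succeeded | 597               |
+--------------------+------------------+-----------+-------------------+
"""

jobs = parse_table(sample)
for job in jobs:
    print(job["JobId"], job["JobStatus"], job["Duration(seconds)"])
```

All values come back as strings, so a field such as Duration(seconds) needs int() before any arithmetic.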
Query a specific training job.
./dlc get job dlc02xipvt5z****
The system returns output similar to the following.
{
  "ClusterId": "",
  "CodeSource": {
    "Branch": "main",
    "CodeSourceId": "code-29****c****c4****ae0c9ec75a5****",
    "MountPath": ""
  },
  "DataSources": [
    {
      "DataSourceId": "d-ya7gc2p2iqq240****",
      "MountPath": ""
    }
  ],
  "DisplayName": "test_epl_test-****",
  "Duration": 736,
  "ElasticSpec": {
    "AIMasterType": "",
    "EnableElasticTraining": false,
    "MaxParallelism": 0,
    "MinParallelism": 0
  },
  "EnabledDebugger": false,
  "GmtCreateTime": "2022-08-01T06:41:05Z",
  "GmtFinishTime": "2022-08-01T06:53:21Z",
  "GmtRunningTime": "2022-08-01T06:48:57Z",
  "GmtSubmittedTime": "2022-08-01T06:45:08Z",
  "GmtSuccessedTime": "2022-08-01T06:53:21Z",
  "JobId": "dlc02xipvt5z****",
  "JobSpecs": [
    {
      "AssignNodeSpec": {
        "EnableAssignNode": false,
        "NodeNames": ""
      },
      "EcsSpec": "ecs.gn6v-c8g1.2xlarge",
      "Image": "registry.cn-shanghai.aliyuncs.com/pai-dlc/tensorflow-training:1.15-gpu-py36-cu100-ubuntu1****",
      "PodCount": 2,
      "ResourceConfig": {
        "CPU": "",
        "GPU": "",
        "GPUType": "",
        "Memory": "",
        "SharedMemory": ""
      },
      "Type": "Worker",
      "UseSpotInstance": false
    }
  ],
  "JobType": "TFJob",
  "Pods": [
    {
      "GmtCreateTime": "2022-08-01T06:45:08Z",
      "GmtFinishTime": "2022-08-01T06:53:20Z",
      "GmtStartTime": "2022-08-01T06:52:06Z",
      "Ip": "10.224.xx.xx",
      "PodId": "dlc02xipvt5z****-worker-0",
      "PodUid": "",
      "Status": "Succeeded",
      "Type": "worker"
    },
    {
      "GmtCreateTime": "2022-08-01T06:45:08Z",
      "GmtFinishTime": "2022-08-01T06:53:20Z",
      "GmtStartTime": "2022-08-01T06:48:57Z",
      "Ip": "10.224.xx.xx",
      "PodId": "dlc02xipvt5z****-worker-1",
      "PodUid": "",
      "Status": "Succeeded",
      "Type": "worker"
    }
  ],
  "ReasonCode": "JobSucceeded",
  "ReasonMessage": "TFJob dlc02xipvt5z**** successfully completed.",
  "RequestId": "76FC3500-xxxx-533F-B24A-AC9B2A72****",
  "ResourceId": "",
  "Priority": 1,
  "ResourceLevel": "",
  "Settings": {
    "BusinessUserId": "",
    "Caller": "",
    "EnableErrorMonitoringInAIMaster": false,
    "EnableTideResource": false,
    "ErrorMonitoringArgs": "",
    "PipelineId": ""
  },
  "Status": "Succeeded",
  "ThirdpartyLibDir": "",
  "UserCommand": "cd /root/xxxx/xxxx/\npip install .\ncd examples/resnet\nbash scripts/xxxx_dp.sh",
  "UserId": "144963168668****",
  "WorkspaceId": "23****",
  "WorkspaceName": "doc_test_****"
}
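The Pods array in this output carries the PodId values that the logs command requires, so the two commands can be chained. The following Python sketch assumes the job detail is printed as a single JSON document, as in the sample above. logs_commands_for_job is a hypothetical helper, and the sample string is a trimmed copy of the output that keeps only the fields the sketch reads. It builds the logs commands without running them.

```python
import json


def logs_commands_for_job(job_detail_json, max_events_num=10):
    """Build one `./dlc logs` argument list per Pod listed in a job detail document."""
    detail = json.loads(job_detail_json)
    job_id = detail["JobId"]
    commands = []
    for pod in detail.get("Pods", []):
        commands.append(["./dlc", "logs", job_id, pod["PodId"],
                         "--max_events_num", str(max_events_num)])
    return commands


# Trimmed sample: only the fields used here, copied from the output above.
sample = """
{
  "JobId": "dlc02xipvt5z****",
  "Pods": [
    {"PodId": "dlc02xipvt5z****-worker-0", "Status": "Succeeded", "Type": "worker"},
    {"PodId": "dlc02xipvt5z****-worker-1", "Status": "Succeeded", "Type": "worker"}
  ],
  "Status": "Succeeded"
}
"""

commands = logs_commands_for_job(sample)
for command in commands:
    print(" ".join(command))
```

In a real script the JSON text would come from the standard output of ./dlc get job with a job ID, for example through subprocess.run with capture_output=True.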
Related documents
You can also view job details in the console. For more information, see View training details.