EXPLAIN
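During development, you often need to analyze a query statement or a table structure to locate performance bottlenecks. MaxCompute SQL provides the EXPLAIN statement for this purpose. This topic describes what EXPLAIN does, its syntax, and usage examples.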
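Description
The EXPLAIN statement displays the structure of the execution plan (the program that carries out the SQL semantics) for a MaxCompute SQL DML statement. It helps you understand how the SQL statement is processed and gives you a basis for optimizing it. One query can correspond to multiple Jobs, and one Job can correspond to multiple Tasks.
If the query is complex enough that the EXPLAIN result exceeds 4 MB, the API limit is triggered and the complete EXPLAIN result cannot be returned. In that case, split the query and run EXPLAIN on each part separately to understand the structure of the Job.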
Syntax
EXPLAIN <dml query>;
dml query: Required. A SELECT statement. For more information, see SELECT syntax.
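Return value
The EXPLAIN result contains the following information:
Dependencies between Jobs
For example, job0 is root job. If the query needs only one Job (job0), only one line is displayed.
Dependencies between Tasks
In Job job0:
root Tasks: M1, M2
J3_1_2_Stg1 depends on: M1, M2
job0 contains three Tasks: M1, M2, and J3_1_2_Stg1. The system runs M1 and M2 first and, after both finish, runs J3_1_2_Stg1.
Tasks are named according to the following rules:
MaxCompute has four Task types: MapTask, ReduceTask, JoinTask, and LocalWork. The first letter of a Task name indicates the Task type. For example, M2Stg1 is a MapTask.
The number immediately after the first letter is the Task ID. This ID is unique among all Tasks of the current query.
The numbers separated by underscores (_) are the Tasks that the current Task directly depends on. For example, J3_1_2_Stg1 means that the Task ID is 3 and that the Task depends on the Tasks with ID 1 (M1) and ID 2 (M2).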
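The naming rules above can be sketched as a small parser. This is an illustrative sketch, not part of MaxCompute: the function name parse_task_name and the TaskName type are hypothetical, the letter "L" for LocalWork is an assumption (the text names the type but not its letter), and the handling of the trailing Stg suffix is inferred only from the names M2Stg1 and J3_1_2_Stg1 shown here.

```python
import re
from typing import NamedTuple, Tuple

# First letter -> Task type. "M", "R", and "J" follow the names used in this
# topic; "L" for LocalWork is an assumption made for illustration.
TASK_TYPES = {"M": "MapTask", "R": "ReduceTask", "J": "JoinTask", "L": "LocalWork"}


class TaskName(NamedTuple):
    task_type: str
    task_id: int
    depends_on: Tuple[int, ...]


def parse_task_name(name: str) -> TaskName:
    """Split a Task name such as J3_1_2_Stg1 into type, ID, and dependencies."""
    # letter, Task ID, zero or more "_<dependency ID>", optional "Stg<n>" suffix
    match = re.fullmatch(r"([A-Z])(\d+)((?:_\d+)*)(?:_?Stg\d+)?", name)
    if match is None:
        raise ValueError(f"unrecognized Task name: {name!r}")
    letter, task_id, deps = match.groups()
    depends_on = tuple(int(d) for d in deps.split("_") if d)
    return TaskName(TASK_TYPES.get(letter, "Unknown"), int(task_id), depends_on)


print(parse_task_name("J3_1_2_Stg1"))
# TaskName(task_type='JoinTask', task_id=3, depends_on=(1, 2))
```

Names without dependencies, such as M1 or M2Stg1, yield an empty depends_on tuple.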
Dependency structure of all Operators in a Task
The Operator chain describes the execution semantics of a Task. A sample structure:
In Task M2:
    Data source: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china    # "Data source" describes the input of the current Task.
    TS: mf_mc_bj.sale_detail_jt/sale_date=2013/region=china             # TableScanOperator
        FIL: ISNOTNULL(customer_id)                                     # FilterOperator
            RS: order: +                                                # ReduceSinkOperator
                nullDirection: *
                optimizeOrderBy: False
                valueDestLimit: 0
                dist: HASH
                keys:
                      customer_id
                values:
                      customer_id (string)
                      total_price (double)
                partitions:
                      customer_id
In Task J3_1_2:
    JOIN:                                                               # JoinOperator
         StreamLineRead1 INNERJOIN StreamLineRead2
         keys:
             0:customer_id
             1:customer_id
        AGGREGATE: group by:customer_id                                 # GroupByOperator
         UDAF: SUM(total_price) (__agg_0_sum)[Complete],SUM(total_price) (__agg_1_sum)[Complete]
            RS: order: +
                nullDirection: *
                optimizeOrderBy: True
                valueDestLimit: 10
                dist: HASH
                keys:
                      customer_id
                values:
                      customer_id (string)
                      __agg_0 (double)
                      __agg_1 (double)
                partitions:
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1                                    # SelectOperator
        LIM:limit 10                                                    # LimitOperator
            FS: output: Screen                                          # FileSinkOperator
                schema:
                  customer_id (string) AS ashop
                  __agg_0 (double) AS ap
                  __agg_1 (double) AS bp
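The Operators have the following meanings:
TableScanOperator (TS): describes the logic of the FROM clause in the query. The EXPLAIN result shows the name (alias) of the input table.
SelectOperator (SEL): describes the logic of the SELECT clause in the query. The EXPLAIN result shows the columns passed to the next Operator, separated by commas. A column reference is shown as <alias>.<column_name>. The result of an expression is shown in function form, for example func1(arg1_1, arg1_2, func2(arg2_1, arg2_2)). A constant is shown as its value.
FilterOperator (FIL): describes the logic of the WHERE clause in the query. The EXPLAIN result shows a WHERE condition expression, displayed by the same rules as for SelectOperator.
JoinOperator (JOIN): describes the logic of the JOIN clause in the query. The EXPLAIN result shows which tables are joined and how.
GroupByOperator (for example, AGGREGATE): describes the aggregation logic. This structure appears if the query uses aggregate functions, and the EXPLAIN result shows those functions.
ReduceSinkOperator (RS): describes how data is distributed between Tasks. If the result of the current Task is passed to another Task, the current Task must end with a ReduceSinkOperator that distributes the data. The EXPLAIN result shows the sort order of the output, the distribution keys and values, and the columns used to compute the hash value.
FileSinkOperator (FS): describes how the final data is stored. If the query contains an INSERT clause, the EXPLAIN result shows the name of the destination table.
LimitOperator (LIM): describes the logic of the LIMIT clause in the query. The EXPLAIN result shows the LIMIT value.
MapjoinOperator (HASHJOIN): similar to JoinOperator. Describes the JOIN operation on a large table.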
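As a reading aid, the abbreviations above can be mapped back to Operator names with a few lines of Python. This is a minimal sketch: the helper operators_in is hypothetical, and it assumes the plan text has one Operator per line, each starting with its abbreviation followed by a colon, as in the sample structure in this topic.

```python
import re

# Abbreviation -> Operator name, taken from the list above.
OPERATORS = {
    "TS": "TableScanOperator",
    "SEL": "SelectOperator",
    "FIL": "FilterOperator",
    "JOIN": "JoinOperator",
    "AGGREGATE": "GroupByOperator",
    "RS": "ReduceSinkOperator",
    "FS": "FileSinkOperator",
    "LIM": "LimitOperator",
    "HASHJOIN": "MapjoinOperator",
}


def operators_in(task_text: str) -> list:
    """Return the Operator names in the order they appear in one Task's plan."""
    names = []
    for line in task_text.splitlines():
        # An Operator line starts with an upper-case tag followed by a colon.
        match = re.match(r"\s*([A-Z]+):", line)
        if match and match.group(1) in OPERATORS:
            names.append(OPERATORS[match.group(1)])
    return names


sample = """\
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        LIM:limit 10
            FS: output: Screen
                schema:
"""
print(operators_in(sample))
# ['SelectOperator', 'LimitOperator', 'FileSinkOperator']
```

Lines such as UDAF: or schema: are skipped because their tags are not in the mapping.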
Sample data
To make the examples easier to follow, this topic provides source data that all examples are based on. Create the sale_detail and sale_detail_jt tables and add data to them with the following commands:
--Create the partitioned tables sale_detail and sale_detail_jt.
CREATE TABLE IF NOT EXISTS sale_detail
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
CREATE TABLE IF NOT EXISTS sale_detail_jt
(
shop_name STRING,
customer_id STRING,
total_price DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING);
--Add partitions to the source tables.
ALTER TABLE sale_detail ADD PARTITION (sale_date='2013', region='china') PARTITION (sale_date='2014', region='shanghai');
ALTER TABLE sale_detail_jt ADD PARTITION (sale_date='2013', region='china');
--Append data to the source tables.
INSERT INTO sale_detail PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s3','c3',100.3);
INSERT INTO sale_detail PARTITION (sale_date='2014', region='shanghai') VALUES ('null','c5',null),('s6','c6',100.4),('s7','c7',100.5);
INSERT INTO sale_detail_jt PARTITION (sale_date='2013', region='china') VALUES ('s1','c1',100.1),('s2','c2',100.2),('s5','c2',100.2);
--Query the data in the sale_detail and sale_detail_jt tables.
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail;
--Result
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s3 | c3 | 100.3 | 2013 | china |
| null | c5 | NULL | 2014 | shanghai |
| s6 | c6 | 100.4 | 2014 | shanghai |
| s7 | c7 | 100.5 | 2014 | shanghai |
+------------+-------------+-------------+------------+------------+
SET odps.sql.allow.fullscan=true;
SELECT * FROM sale_detail_jt;
--Result
+------------+-------------+-------------+------------+------------+
| shop_name | customer_id | total_price | sale_date | region |
+------------+-------------+-------------+------------+------------+
| s1 | c1 | 100.1 | 2013 | china |
| s2 | c2 | 100.2 | 2013 | china |
| s5 | c2 | 100.2 | 2013 | china |
+------------+-------------+-------------+------------+------------+
--Create the table used for joins.
SET odps.sql.allow.fullscan=true;
CREATE TABLE shop AS SELECT shop_name, customer_id, total_price FROM sale_detail;
Examples
The following examples are all based on the sample data.
Example 1
Query statement:
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
To get the semantics of the query, run the following command:
EXPLAIN
SELECT a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.customer_id=b.customer_id
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
The following result is returned:
job0 is root job
In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2
In Task M1:
    Data source: doc_****.default.sale_detail/sale_date=2013/region=china
    TS: doc_****.default.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
        FIL: ISNOTNULL(customer_id)
            Statistics: Num rows: 2.7, Data size: 291.6
            RS: valueDestLimit: 0
                dist: BROADCAST
                keys:
                values:
                      customer_id (string)
                      total_price (double)
                partitions:
                Statistics: Num rows: 2.7, Data size: 291.6
In Task M2_1:
    Data source: doc_****.default.sale_detail_jt/sale_date=2013/region=china
    TS: doc_****.default.sale_detail_jt/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
        FIL: ISNOTNULL(customer_id)
            Statistics: Num rows: 2.7, Data size: 291.6
            HASHJOIN:
                     Filter1 INNERJOIN StreamLineRead1
                     keys:
                         0:customer_id
                         1:customer_id
                     non-equals:
                         0:
                         1:
                     bigTable: Filter1
                Statistics: Num rows: 3.6450000000000005, Data size: 787.32
                RS: order: +
                    nullDirection: *
                    optimizeOrderBy: False
                    valueDestLimit: 0
                    dist: HASH
                    keys:
                          customer_id
                    values:
                          customer_id (string)
                          total_price (double)
                          total_price (double)
                    partitions:
                          customer_id
                    Statistics: Num rows: 3.6450000000000005, Data size: 422.82000000000005
In Task R3_2:
    AGGREGATE: group by:customer_id
     UDAF: SUM(total_price) (__agg_0_sum)[Complete],COUNT(total_price) (__agg_1_count)[Complete]
        Statistics: Num rows: 1.0, Data size: 116.0
        RS: order: +
            nullDirection: *
            optimizeOrderBy: True
            valueDestLimit: 10
            dist: HASH
            keys:
                  customer_id
            values:
                  customer_id (string)
                  __agg_0 (double)
                  __agg_1 (bigint)
            partitions:
            Statistics: Num rows: 1.0, Data size: 116.0
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        Statistics: Num rows: 1.0, Data size: 116.0
        SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
            Statistics: Num rows: 1.0, Data size: 216.0
            FS: output: Screen
                schema:
                  ashop (string)
                  ap (double)
                  bp (bigint)
                Statistics: Num rows: 1.0, Data size: 116.0
OK
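The Task dependency lines at the top of this result determine the order in which the Tasks run. The sketch below derives that order with Python's standard graphlib module. The helper parse_dependencies is hypothetical, and the input is assumed to be the header already split into one dependency statement per line.

```python
from graphlib import TopologicalSorter


def parse_dependencies(lines):
    """Map each Task to the set of Tasks it depends on."""
    graph = {}
    for line in lines:
        line = line.strip()
        if line.startswith("root Tasks:"):
            # Root Tasks have no dependencies.
            for task in line[len("root Tasks:"):].replace(",", " ").split():
                graph.setdefault(task, set())
        elif " depends on:" in line:
            task, deps = line.split(" depends on:", 1)
            graph.setdefault(task.strip(), set()).update(deps.replace(",", " ").split())
    return graph


# Dependency lines copied from the Example 1 result.
header = [
    "root Tasks: M1",
    "M2_1 depends on: M1",
    "R3_2 depends on: M2_1",
    "R4_3 depends on: R3_2",
]
graph = parse_dependencies(header)
order = list(TopologicalSorter(graph).static_order())
print(order)
# ['M1', 'M2_1', 'R3_2', 'R4_3']
```

In this example the Tasks form a single chain, so the order is fully determined. With several root Tasks, such as M1 and M2, the relative order of the roots is not fixed.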
Example 2
Query statement:
SELECT /*+ mapjoin(a) */ a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
To get the semantics of the query, run the following command:
EXPLAIN
SELECT /*+ mapjoin(a) */ a.customer_id AS ashop, SUM(a.total_price) AS ap, COUNT(b.total_price) AS bp
FROM (SELECT * FROM sale_detail_jt WHERE sale_date='2013' AND region='china') a
INNER JOIN (SELECT * FROM sale_detail WHERE sale_date='2013' AND region='china') b
ON a.total_price<b.total_price
GROUP BY a.customer_id
ORDER BY a.customer_id
LIMIT 10;
The following result is returned:
job0 is root job
In Job job0:
root Tasks: M1
M2_1 depends on: M1
R3_2 depends on: M2_1
R4_3 depends on: R3_2
In Task M1:
    Data source: doc_****.sale_detail_jt/sale_date=2013/region=china
    TS: doc_****.sale_detail_jt/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 324.0
        RS: valueDestLimit: 0
            dist: BROADCAST
            keys:
            values:
                  customer_id (string)
                  total_price (double)
            partitions:
            Statistics: Num rows: 3.0, Data size: 324.0
In Task M2_1:
    Data source: doc_****.sale_detail/sale_date=2013/region=china
    TS: doc_****.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 24.0
        HASHJOIN:
                 StreamLineRead1 INNERJOIN TableScan2
                 keys:
                     0:
                     1:
                 non-equals:
                     0:
                     1:
                 bigTable: TableScan2
            Statistics: Num rows: 9.0, Data size: 1044.0
            FIL: LT(total_price,total_price)
                Statistics: Num rows: 6.75, Data size: 783.0
                AGGREGATE: group by:customer_id
                 UDAF: SUM(total_price) (__agg_0_sum)[Partial_1],COUNT(total_price) (__agg_1_count)[Partial_1]
                    Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
                    RS: order: +
                        nullDirection: *
                        optimizeOrderBy: False
                        valueDestLimit: 0
                        dist: HASH
                        keys:
                              customer_id
                        values:
                              customer_id (string)
                              __agg_0_sum (double)
                              __agg_1_count (bigint)
                        partitions:
                              customer_id
                        Statistics: Num rows: 2.3116438356164384, Data size: 268.1506849315069
In Task R3_2:
    AGGREGATE: group by:customer_id
     UDAF: SUM(__agg_0_sum)[Final] __agg_0,COUNT(__agg_1_count)[Final] __agg_1
        Statistics: Num rows: 1.6875, Data size: 195.75
        RS: order: +
            nullDirection: *
            optimizeOrderBy: True
            valueDestLimit: 10
            dist: HASH
            keys:
                  customer_id
            values:
                  customer_id (string)
                  __agg_0 (double)
                  __agg_1 (bigint)
            partitions:
            Statistics: Num rows: 1.6875, Data size: 195.75
In Task R4_3:
    SEL: customer_id,__agg_0,__agg_1
        Statistics: Num rows: 1.6875, Data size: 195.75
        SEL: customer_id ashop, __agg_0 ap, __agg_1 bp, customer_id
            Statistics: Num rows: 1.6875, Data size: 364.5
            FS: output: Screen
                schema:
                  ashop (string)
                  ap (double)
                  bp (bigint)
                Statistics: Num rows: 1.6875, Data size: 195.75
OK
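When you look for a bottleneck, it can help to follow the estimated row counts from one Operator to the next. The sketch below collects the Num rows values per Task. The helper row_estimates is hypothetical, and it assumes the plan text has one "In Task" or "Statistics" entry per line. The sample is a shortened excerpt of Task M2_1 from the Example 2 result.

```python
import re


def row_estimates(plan_text: str) -> dict:
    """Collect the 'Num rows' estimates of each Task, in plan order."""
    estimates = {}
    current = None
    for line in plan_text.splitlines():
        task = re.match(r"\s*In Task (\S+):", line)
        if task:
            current = task.group(1)
            estimates[current] = []
            continue
        stat = re.search(r"Statistics: Num rows: ([0-9.]+)", line)
        if stat and current is not None:
            estimates[current].append(float(stat.group(1)))
    return estimates


excerpt = """\
In Task M2_1:
    TS: doc_****.sale_detail/sale_date=2013/region=china
        Statistics: Num rows: 3.0, Data size: 24.0
        HASHJOIN:
            Statistics: Num rows: 9.0, Data size: 1044.0
            FIL: LT(total_price,total_price)
                Statistics: Num rows: 6.75, Data size: 783.0
"""
print(row_estimates(excerpt))
# {'M2_1': [3.0, 9.0, 6.75]}
```

Here the estimate grows from 3 to 9 rows at the HASHJOIN, which has no equality keys, and drops to 6.75 only after the LT filter is applied.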