色婷婷综合缴情综在线播放,欧美人精品XXXXX在线,2021全国产精品网站

如果您的數據表經常需要進行GROUP BY、JOIN操作或為了避免數據傾斜，您可以在建表時設置分布鍵（Distribution Key），合適的分布鍵可以幫助數據均勻分布在所有計算節點上，顯著提高計算和查詢性能。本文為您介紹Hologres中為表設置Distribution Key。

Distribution Key介紹

在Hologres中，Distribution Key屬性指定了表數據的分布策略，系統會保證Distribution Key相同的記錄被分配到同一個Shard上。建表時設置語法如下：

-- Hologres V2.1版本起支持的語法
CREATE TABLE <table_name> (...) WITH (distribution_key = '[<columnName>[,...]]');

-- 所有版本支持的語法
BEGIN;
CREATE TABLE <table_name> (...);
call set_table_property('<table_name>', 'distribution_key', '[<columnName>[,...]]');
COMMIT;

參數說明如下：

參數	說明
table_name	設置分布鍵的表名稱。
columnName	設置分布鍵的字段名稱。

Distribution Key是非常重要的分布式概念，合理設置Distribution Key可以達到如下效果：

顯著提高計算性能。
不同的Shard可以進行并行計算，從而提高計算性能。
顯著提高每秒查詢率（QPS）。
當您以Distribution Key做過濾條件時，Hologres可以直接篩選出數據相關的Shard進行掃描。否則Hologres需要讓所有的Shard參與計算，影響QPS。
顯著提高Join性能。
當兩張表在同一個Table Group內，并且Join的字段是Distribution Key時，那么數據分布保證表A一個Shard內的數據和表B同一Shard內的數據對應，只需要直接在本節點Join本節點數據（Local Join）即可，可以顯著提高執行效率。

使用建議

Distribution Key設置原則總結如下：

Distribution Key盡量選擇分布均勻的字段，否則容易因為數據傾斜導致負載傾斜，使得查詢效率變低，排查數據傾斜請參見查看Worker傾斜關系。
選擇Group By頻繁的字段作為Distribution Key。
Join場景中，設置Join字段為Distribution Key，實現Local Join，避免數據Shuffle。同時進行Join的表需要在同一個Table Group內。
不建議為一個表設置多個Distribution Key，建議設置的Distribution Key不超過兩個字段。設置多字段為Distribution Key，查詢時若沒有全部命中，容易出現數據Shuffle。
支持單列或者多列設置為Distribution Key，指定列時如設置單列，命令語句中不要保留多余空格；如設置多個列，則以半角逗號（,）分隔，同樣不要保留多余空格。指定多列為Distribution Key時，列的順序不影響數據的布局和查詢性能。
表設置了主鍵（PK）時，Distribution Key必須為PK或者PK中的部分字段（不能為空，即不指定任何列），因為要求同一記錄的數據只能屬于一個Shard。如果沒有額外指定Distribution Key，默認將PK設置為Distribution Key。

使用限制

設置Distribution Key需要在建表時設置，建表后如需修改Distribution Key需要重新建表并導入數據。
不支持修改Distribution Key對應列的值，如需修改請重新建表。
不支持將Float、Double、Numeric、Array、Json及其他復雜數據類型的字段設為Distribution Key。
表未設置PK時，Distribution Key沒有限制，可以為空（不指定任何列）。如果為空，即隨機Shuffle，數據隨機分布到不同Shard上。從Hologres V1.3.28版本開始，Distribution Key禁止為空，示例用法如下。
```
--從1.3.28版本開始，寫法將會被禁止
CALL SET_TABLE_PROPERTY('<tablename>', 'distribution_key', '');
```
Distribution Key列的值中有null時，當作“”（空串）看待，即Distribution Key為空。

技術原理

Distribution Key指定了表的分布策略。根據實際的業務場景，存在以下情形。

設置Distribution Key

為表設置了Distribution Key之后，數據會根據Distribution Key被分配到各個Shard上，算法為Hash(distribution_key)%shard_count，結果為對應的Shard。系統會保證Distribution Key相同的記錄會被分配到同一個Shard上，示例如下。

V2.1版本起支持的建表語法：

--設置a列為distribution key，系統會對a列的值做hash操作，再取模，即hash(a)%shard_count = shard_id，結果相同的一組數據分布在同一個Shard內
CREATE TABLE tbl (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a'
);

--設置a、b兩列為distribution key，系統對a,b兩個列的值做hash操作，再取模，即hash(a,b)%shard_count = shard_id，結果相同的一組數據分布在同一個Shard內
CREATE TABLE tbl (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a,b'
);

所有版本支持的建表語法：

--設置a列為distribution key，系統會對a列的值做hash操作，再取模，即hash(a)%shard_count = shard_id，結果相同的一組數據分布在同一個Shard內
begin;
create table tbl (
a int not null,
b text not null
);
call set_table_property('tbl', 'distribution_key', 'a');
commit;

--設置a、b兩列為distribution key，系統對a,b兩個列的值做hash操作，再取模，即hash(a,b)%shard_count = shard_id，結果相同的一組數據分布在同一個Shard內
begin;
create table tbl (
  a int not null,
  b text not null
);
call set_table_property('tbl', 'distribution_key', 'a,b');
commit;

數據分布示意圖如下：但在設置Distribution Key時，需要關注設置為Distribution Key字段的數據最好是分布均勻的。Hologres的Shard數和Worker節點數有一定的關聯關系，詳情請參見基本概念。如果設置了數據分布不均勻的字段作為Distribution Key之后，那么數據會集中分布在某些Shard上，導致大部分的計算集中到部分Worker上，出現長尾效應，查詢效率降低。排查以及處理數據的傾斜情況詳情請參見查看Worker傾斜關系。

不設置Distribution Key

不設置Distribution Key時，數據將會被隨機分布在各個Shard，相同的數據可能會在相同Shard，也可能在不同的Shard，示例如下。

--不設置distribution key
begin;
create table tbl (
a int not null,
b text not null
);
commit;

數據分布示意圖如下：不設置distribution key

Group By聚合場景設置Distribution Key

為表設置了Distribution Key，那么相同的數據就分布在相同的Shard上，同時對于Group By聚合場景，數據在計算時按照設置的Distribution Key重新分布，因此可以將Group By頻繁的字段設置為Distribution Key，這樣數據在Shard內就已經聚合，減少數據在Shard間的重分配，提高查詢性能，示例如下。

V2.1版本起支持的建表語法：

CREATE TABLE agg_tbl (
    a int NOT NULL,
    b int NOT NULL
)
WITH (
    distribution_key = 'a'
);

--示例查詢，對a列做聚合查詢
select a,sum(b) from agg_tbl group by a;

所有版本支持的建表語法：

begin;
create table agg_tbl (
a int not null,
b int not null
);
call set_table_property('agg_tbl', 'distribution_key', 'a');
commit;

--示例查詢，對a列做聚合查詢
select a,sum(b) from agg_tbl group by a;

通過查看執行計劃（explain SQL），如下所示執行計劃結果中沒有redistribution算子，說明數據沒有重分布。 QUERY PLAN

兩表關聯場景設置Distribution Key

兩表Join字段設置為Distribution Key

在兩表關聯（Join）的場景，如果兩表Join字段分別在對應表里都設置為Distribution Key，那么Join字段相同的數據就會分布在相同的Shard，這樣就能實現Local Join，從而實現查詢加速的效果，示例如下。

建表DDL。

V2.1版本起支持的建表語法：

--tbl1按照a列分布，tbl2按照c列分布，當tbl1與tbl2以a=c關聯條件join時，對應的數據分布在同一個Shard內，這種查詢可以實現Local Join的加速效果
BEGIN;
CREATE TABLE tbl1 (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a'
);
CREATE TABLE tbl2 (
    c int NOT NULL,
    d text NOT NULL
)
WITH (
    distribution_key = 'c'
);
COMMIT;

所有版本支持的建表語法：

--tbl1按照a列分布，tbl2按照c列分布，當tbl1與tbl2以a=c關聯條件join時，對應的數據分布在同一個Shard內，這種查詢可以實現Local Join的加速效果
begin;
create table tbl1(
a int not null,
b text not null
);
call set_table_property('tbl1', 'distribution_key', 'a');

create table tbl2(
c int not null,
d text not null
);
call set_table_property('tbl2', 'distribution_key', 'c');
commit;

查詢語句。

select * from tbl1  join tbl2 on tbl1.a=tbl2.c;

數據分布示意圖如下。兩表關聯join 通過查看執行計劃（explain SQL），如下所示執行計劃結果中沒有redistribution算子，說明數據沒有重分布。 join執行計劃

兩表Join字段未都設置為Distribution Key
在兩表關聯（Join）的場景，如果兩表Join字段在對應表里未都設置為Distribution Key，那么查詢時數據就會在各個Shard Shuffle（執行計劃會根據關聯的兩個表大小，來判斷是進行Shuffle還是Broadcast）。如下示例，設置tbl1的a字段為Distribution Key，tbl2的d字段為Distribution Key，而Join條件是a=c，那么c字段就會在每個Shard Shuffle一遍，從而導致查詢效率變低。
- 建表DDL。
  - V2.1版本起支持的建表語法：
    BEGIN; CREATE TABLE tbl1 ( a int NOT NULL, b text NOT NULL ) WITH ( distribution_key = 'a' ); CREATE TABLE tbl2 ( c int NOT NULL, d text NOT NULL ) WITH ( distribution_key = 'd' ); COMMIT;
  - 所有版本支持的建表語法：
    begin; create table tbl1( a int not null, b text not null ); call set_table_property('tbl1', 'distribution_key', 'a'); create table tbl2( c int not null, d text not null ); call set_table_property('tbl2', 'distribution_key', 'd'); commit;
- 查詢語句。
```
select * from tbl1  join tbl2 on tbl1.a=tbl2.c;
```
數據分布示意圖如下。通過查看執行計劃（explain SQL），如下所示執行計劃中有redistribution算子，說明數據進行了重分布，表明Distribution Key設置的不合理，需要重新設置。

多表關聯場景設置Distribution Key

多表關聯的場景比較復雜，可以遵循如下原則：

每個表的Join字段都相同，那么將Join字段都設置為Distribution Key。
每個表的Join字段不同，優先考慮大表間的Join，將大表的Join字段設置為Distribution Key。

通過以下幾種情況舉例說明（本文中以三個表Join為例說明，大于三個表的Join情況可以參考本示例）。

三個表的Join字段相同

在三個表Join的場景中，三個表的Join字段都相同，那么這種情況是最簡單的，可以直接將三個表的Join字段都設置為Distribution Key，實現Local Join的能力。

V2.1版本起支持的建表語法：

BEGIN;
CREATE TABLE join_tbl1 (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a'
);
CREATE TABLE join_tbl2 (
    a int NOT NULL,
    d text NOT NULL,
    e text NOT NULL
)
WITH (
    distribution_key = 'a'
);
CREATE TABLE join_tbl3 (
    a int NOT NULL,
    e text NOT NULL,
    f text NOT NULL,
    g text NOT NULL
)
WITH (
    distribution_key = 'a'
);
COMMIT;

--3表join查詢
SELECT * FROM join_tbl1
INNER JOIN join_tbl2 ON join_tbl2.a = join_tbl1.a
INNER JOIN join_tbl3 ON join_tbl2.a = join_tbl3.a;

所有版本支持的建表語法：

begin;
create table join_tbl1(
a int not null,
b text not null
);
call set_table_property('join_tbl1', 'distribution_key', 'a');

create table join_tbl2(
a int not null,
d text not null,
e text not null
);
call set_table_property('join_tbl2', 'distribution_key', 'a');

create table join_tbl3(
a int not null,
e text not null,
f text not null,
g text not null
);
call set_table_property('join_tbl3', 'distribution_key', 'a');
commit;

--3表join查詢
SELECT * FROM join_tbl1
INNER JOIN join_tbl2 ON join_tbl2.a = join_tbl1.a
INNER JOIN join_tbl3 ON join_tbl2.a = join_tbl3.a;

通過查看執行計劃（explain SQL），如下所示執行計劃中：

沒有redistribution算子，說明數據沒有重分布，實現了Local Join。
exchange算子代表文件級別聚合到Shard級別聚合，這樣就只需要對應Shard的數據，提升數據的查詢效率。

3表join

三個表的Join字段不同

在實際業務中，多表關聯時會有Join字段不相同的場景，這個時候可以根據如下原則來設置Distribution Key：

核心優化原則是優先考慮大表間的Join，設置大表的Join字段為Distribution Key；小表因其數據量較少，無需過多考慮。
表數據量大致相同的情況，可以設置Group By頻繁的Join字段為Distribution Key。

如下示例，有三個表相互Join，Join的字段不完全一樣，這個時候選擇大表的Join字段為Distribution Key，join_tbl_1這個表數據量有一千萬條，join_tbl_2和join_tbl_3分別有一百萬條，以join_tbl_1為主要優化對象。

V2.1版本起支持的建表語法：

BEGIN;
-- join_tbl_1為1kw數據量
CREATE TABLE join_tbl_1 (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a'
);
-- join_tbl_2為100w數據量
CREATE TABLE join_tbl_2 (
    a int NOT NULL,
    d text NOT NULL,
    e text NOT NULL
)
WITH (
    distribution_key = 'a'
);
-- join_tbl_3為100w數據量
CREATE TABLE join_tbl_3 (
    a int NOT NULL,
    e text NOT NULL,
    f text NOT NULL,
    g text NOT NULL
);
 WITH (
     distribution_key = 'a'
 );
COMMIT;

--join key不相同時，選擇大表的join key為distribution key。
SELECT * FROM join_tbl_1
INNER JOIN join_tbl_2 ON join_tbl_2.a = join_tbl_1.a
INNER JOIN join_tbl_3 ON join_tbl_2.d = join_tbl_3.f;

所有版本支持的建表語法：

begin;
--join_tbl_1為1kw數據量
create table join_tbl_1(
a int not null,
b text not null
);
call set_table_property('join_tbl_1', 'distribution_key', 'a');

--join_tbl_2為100w數據量
create table join_tbl_2(
a int not null,
d text not null,
e text not null
);
call set_table_property('join_tbl_2', 'distribution_key', 'a');

--join_tbl_3為100w數據量
create table join_tbl_3(
a int not null,
e text not null,
f text not null,
g text not null
);
--call set_table_property('join_tbl_3', 'distribution_key', 'a');
commit;

--join key不相同時，選擇大表的join key為distribution key。
SELECT * FROM join_tbl_1
INNER JOIN join_tbl_2 ON join_tbl_2.a = join_tbl_1.a
INNER JOIN join_tbl_3 ON join_tbl_2.d = join_tbl_3.f;

通過查看執行計劃（explain SQL），如下所示執行計劃表明：

在join_tbl_2和join_tbl_3表之間有redistribution算子，因為join_tbl_3是小表，Join字段與Distribution Key不一致，所以數據進行了重分布。
join_tbl_1和join_tbl_2表之間沒有redistribution算子，因為兩表將Join字段都設置為Distribution Key，因此數據不會重分布。

3表join執行計劃

使用示例

V2.1版本起支持的建表語法：

--單表設置為distribution key
CREATE TABLE tbl (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a'
);

--設置多個distribution key
CREATE TABLE tbl (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a,b'
);


--join場景，設置join key為distribution key
BEGIN;
CREATE TABLE tbl1 (
    a int NOT NULL,
    b text NOT NULL
)
WITH (
    distribution_key = 'a'
);
CREATE TABLE tbl2 (
    c int NOT NULL,
    d text NOT NULL
)
WITH (
    distribution_key = 'c'
);
COMMIT;

SELECT b, count(*) FROM tbl1 JOIN tbl2 ON tbl1.a = tbl2.c GROUP BY b;

所有版本支持的建表語法：

--設置一個distribution key
begin;
create table tbl (a int not null, b text not null);
call set_table_property('tbl', 'distribution_key', 'a');
commit;

--設置多個distribution key
begin;
create table tbl (a int not null, b text not null);
call set_table_property('tbl', 'distribution_key', 'a,b');
commit;

--join場景，設置join key為distribution key
begin;
create table tbl1(a int not null, b text not null);
call set_table_property('tbl1', 'distribution_key', 'a');
create table tbl2(c int not null, d text not null);
call set_table_property('tbl2', 'distribution_key', 'c');
commit;

select b, count(*) from tbl1 join tbl2 on tbl1.a = tbl2.c group by b;

日本熟妇hd丰满老熟妇,中文字幕一区二区三区在线不卡 ,亚洲成片在线观看,免费女同在线一区二区

分布鍵Distribution Key

Distribution Key介紹

使用建議

使用限制

技術原理

設置Distribution Key

不設置Distribution Key

Group By聚合場景設置Distribution Key

兩表關聯場景設置Distribution Key

多表關聯場景設置Distribution Key

使用示例

相關文檔