Normalization
This topic describes the Normalization component provided by Designer.
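Configure the component
You can configure the parameters of the Normalization component in either of the following ways.
Method 1: Visual configuration
Configure the component parameters on the Designer workflow page.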
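Tab | Parameter | Description |
Field settings | Select all by default | All columns are selected by default. Extra columns do not affect the prediction results. |
Field settings | Keep original columns | Processed columns receive the "normalized_" prefix. DOUBLE and BIGINT columns are supported. |
Execution tuning | Number of cores | The system automatically allocates the number of training instances based on the input data volume. |
Execution tuning | Memory per core | The system automatically allocates memory based on the input data volume. Unit: MB. |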
Method 2: PAI command
Configure the component parameters by using a PAI command. You can call PAI commands from the SQL Script component. For details, see SQL Script.
Command for dense data
PAI -name Normalize -project algo_public -DkeepOriginal="true" -DoutputTableName="test_4" -DinputTablePartitions="pt=20150501" -DinputTableName="bank_data_partition" -DselectedColNames="emp_var_rate,euribor3m"
Command for sparse data
PAI -name Normalize -project projectxlib4 -DkeepOriginal="true" -DoutputTableName="kv_norm_output" -DinputTableName=kv_norm_test -DselectedColNames="f0,f1,f2" -DenableSparse=true -DoutputParaTableName=kv_norm_model -DkvIndices=1,2,8,6 -DitemDelimiter=",";
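Parameter | Required | Description | Default value |
inputTableName | Yes | The name of the input table. | None |
selectedColNames | No | The columns in the input table that participate in training. Separate column names with commas (,). INT and DOUBLE columns are supported. If the input is in sparse format, STRING columns are supported. | All columns |
inputTablePartitions | No | The partitions in the input table that participate in training, for example pt=20150501. Separate multiple partitions with commas (,). | All partitions |
outputTableName | Yes | The output result table. | None |
outputParaTableName | No | The output configuration table. | Output table 1 is a non-partitioned table |
inputParaTableName | No | The input configuration table, for example a table written by an earlier run through outputParaTableName. | None |
keepOriginal | No | Whether to keep the original columns. true: the original columns are kept and the processed columns are added with the "normalized_" prefix. false: the selected columns are overwritten in place with their normalized values. | false |
lifecycle | No | The lifecycle of the output table. Valid values: [1, 3650]. | None |
coreNum | No | The number of cores used for computation. The value must be a positive integer. | Automatically allocated by the system |
memSizePerCore | No | The memory of each core, in MB. Valid values: (1, 65536). | Automatically allocated by the system |
enableSparse | No | Whether to enable sparse-format support. Valid values: true and false. | false |
itemDelimiter | No | The delimiter between key-value pairs. | "," |
kvDelimiter | No | The delimiter between a key and its value. | ":" |
kvIndices | No | The feature indexes in the KV table to normalize. | None |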
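To make the roles of itemDelimiter, kvDelimiter, and kvIndices concrete, here is a minimal Python sketch of how a sparse KV string could be parsed and scaled. It is an illustration only. The helper names `parse_kv` and `normalize_kv_rows`, the sample rows, and the choices to ignore absent keys and to map a constant feature to 0.0 are all assumptions, not documented behavior of the component.

```python
def parse_kv(row, item_delimiter=",", kv_delimiter=":"):
    """Parse a string such as '1:2.0,2:10.0' into {1: 2.0, 2: 10.0}."""
    pairs = {}
    for item in row.split(item_delimiter):
        if not item:
            continue  # skip empty fragments, e.g. from a trailing delimiter
        key, value = item.split(kv_delimiter, 1)
        pairs[int(key)] = float(value)
    return pairs


def normalize_kv_rows(rows, kv_indices, item_delimiter=",", kv_delimiter=":"):
    """Min-max scale only the features listed in kv_indices (hypothetical helper)."""
    parsed = [parse_kv(r, item_delimiter, kv_delimiter) for r in rows]

    # Collect the observed min and max per selected index.
    # Assumption: a key that is absent from a row is ignored, not treated as 0.
    bounds = {}
    for pairs in parsed:
        for k in kv_indices:
            if k in pairs:
                lo, hi = bounds.get(k, (pairs[k], pairs[k]))
                bounds[k] = (min(lo, pairs[k]), max(hi, pairs[k]))

    scaled_rows = []
    for pairs in parsed:
        scaled = dict(pairs)  # features outside kv_indices are left unchanged
        for k in kv_indices:
            if k in pairs:
                lo, hi = bounds[k]
                # Assumption: a constant feature maps to 0.0 to avoid dividing by zero.
                scaled[k] = 0.0 if hi == lo else (pairs[k] - lo) / (hi - lo)
        scaled_rows.append(scaled)
    return scaled_rows


rows = ["1:2.0,2:10.0", "1:4.0,8:5.0", "1:6.0,2:30.0"]
result = normalize_kv_rows(rows, kv_indices=[1, 2])
```

In this sketch, feature 8 is not listed in `kv_indices`, so its value is carried through unchanged.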
Example
Generate data
drop table if exists normalize_test_input;
create table normalize_test_input(
    col_string string,
    col_bigint bigint,
    col_double double,
    col_boolean boolean,
    col_datetime datetime
);
insert overwrite table normalize_test_input
select * from (
    select '01' as col_string, 10 as col_bigint, 10.1 as col_double, True as col_boolean, cast('2016-07-01 10:00:00' as datetime) as col_datetime
    union all
    select cast(null as string) as col_string, 11 as col_bigint, 10.2 as col_double, False as col_boolean, cast('2016-07-02 10:00:00' as datetime) as col_datetime
    union all
    select '02' as col_string, cast(null as bigint) as col_bigint, 10.3 as col_double, True as col_boolean, cast('2016-07-03 10:00:00' as datetime) as col_datetime
    union all
    select '03' as col_string, 12 as col_bigint, cast(null as double) as col_double, False as col_boolean, cast('2016-07-04 10:00:00' as datetime) as col_datetime
    union all
    select '04' as col_string, 13 as col_bigint, 10.4 as col_double, cast(null as boolean) as col_boolean, cast('2016-07-05 10:00:00' as datetime) as col_datetime
    union all
    select '05' as col_string, 14 as col_bigint, 10.5 as col_double, True as col_boolean, cast(null as datetime) as col_datetime
) tmp;
PAI command
drop table if exists normalize_test_input_output;
drop table if exists normalize_test_input_model_output;
PAI -name Normalize -project algo_public
    -DoutputParaTableName="normalize_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="normalize_test_input_output"
    -DinputTableName="normalize_test_input"
    -DselectedColNames="col_double,col_bigint"
    -DkeepOriginal="true";
drop table if exists normalize_test_input_output_using_model;
drop table if exists normalize_test_input_output_using_model_model_output;
PAI -name Normalize -project algo_public
    -DoutputParaTableName="normalize_test_input_output_using_model_model_output"
    -DinputParaTableName="normalize_test_input_model_output"
    -Dlifecycle="28"
    -DoutputTableName="normalize_test_input_output_using_model"
    -DinputTableName="normalize_test_input";
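When commands like the ones above are generated from a script, it can help to check the documented value ranges before submitting. The following Python sketch assembles a command string and validates lifecycle, coreNum, and memSizePerCore against the ranges in the parameter table. The function `build_normalize_command` is a hypothetical helper, not part of any PAI SDK, and it only builds text. It does not submit the command.

```python
def build_normalize_command(project, params):
    """Assemble a PAI Normalize command string from a dict of -D parameters.

    Hypothetical helper for illustration; it only formats and validates text.
    """
    # Both table names appear in every command in this topic.
    for required in ("inputTableName", "outputTableName"):
        if required not in params:
            raise ValueError(f"missing required parameter: {required}")

    # Range checks taken from the parameter table.
    if "lifecycle" in params and not 1 <= int(params["lifecycle"]) <= 3650:
        raise ValueError("lifecycle must be in [1, 3650]")
    if "coreNum" in params and int(params["coreNum"]) <= 0:
        raise ValueError("coreNum must be a positive integer")
    if "memSizePerCore" in params and not 1 < int(params["memSizePerCore"]) < 65536:
        raise ValueError("memSizePerCore must be in (1, 65536)")

    # Parameters are emitted in the insertion order of the dict.
    args = " ".join(f'-D{name}="{value}"' for name, value in params.items())
    return f"PAI -name Normalize -project {project} {args};"


command = build_normalize_command(
    "algo_public",
    {
        "inputTableName": "normalize_test_input",
        "outputTableName": "normalize_test_input_output",
        "selectedColNames": "col_double,col_bigint",
        "keepOriginal": "true",
        "lifecycle": 28,
    },
)
```

The resulting string can then be pasted into the SQL Script component or passed to whatever client you use to run PAI commands.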
Input description
normalize_test_input
col_string | col_bigint | col_double | col_boolean | col_datetime |
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 |
05 | 14 | 10.5 | true | NULL |
Output description
normalize_test_input_output
col_string | col_bigint | col_double | col_boolean | col_datetime | normalized_col_bigint | normalized_col_double |
01 | 10 | 10.1 | true | 2016-07-01 10:00:00 | 0.0 | 0.0 |
NULL | 11 | 10.2 | false | 2016-07-02 10:00:00 | 0.25 | 0.2499999999999989 |
02 | NULL | 10.3 | true | 2016-07-03 10:00:00 | NULL | 0.5000000000000022 |
03 | 12 | NULL | false | 2016-07-04 10:00:00 | 0.5 | NULL |
04 | 13 | 10.4 | NULL | 2016-07-05 10:00:00 | 0.75 | 0.7500000000000011 |
05 | 14 | 10.5 | true | NULL | 1.0 | 1.0 |
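The normalized values above are consistent with min-max scaling, (x - min) / (max - min), computed per column over the non-NULL values, with NULL passed through unchanged. The following Python sketch reproduces that arithmetic on the two example columns. The function name `min_max_normalize` and the handling of a constant column are assumptions for illustration, not the component's implementation.

```python
def min_max_normalize(values):
    """Scale non-None values to [0, 1]; None (NULL) entries stay None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to scale: every entry is None
    lo, hi = min(present), max(present)
    span = hi - lo
    # Assumption: a constant column maps to 0.0 to avoid dividing by zero.
    return [
        None if v is None else (0.0 if span == 0 else (v - lo) / span)
        for v in values
    ]


col_bigint = [10, 11, None, 12, 13, 14]
col_double = [10.1, 10.2, 10.3, None, 10.4, 10.5]

normalized_bigint = min_max_normalize(col_bigint)
# normalized_bigint == [0.0, 0.25, None, 0.5, 0.75, 1.0]
normalized_double = min_max_normalize(col_double)
```

Values such as 0.2499999999999989 in normalized_col_double are ordinary floating-point rounding: 10.1, 10.2, and 10.5 have no exact binary representation, so the ratio lands just off 0.25.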
normalize_test_input_model_output
feature | json |
col_bigint | {"name": "normalize", "type": "bigint", "paras": {"min": 10, "max": 14}} |
col_double | {"name": "normalize", "type": "double", "paras": {"min": 10.1, "max": 10.5}} |
normalize_test_input_output_using_model
col_string | col_bigint | col_double | col_boolean | col_datetime |
01 | 0.0 | 0.0 | true | 2016-07-01 10:00:00 |
NULL | 0.25 | 0.2499999999999989 | false | 2016-07-02 10:00:00 |
02 | NULL | 0.5000000000000022 | true | 2016-07-03 10:00:00 |
03 | 0.5 | NULL | false | 2016-07-04 10:00:00 |
04 | 0.75 | 0.7500000000000011 | NULL | 2016-07-05 10:00:00 |
05 | 1.0 | 1.0 | true | NULL |
normalize_test_input_output_using_model_model_output
feature | json |
col_bigint | {"name": "normalize", "type": "bigint", "paras": {"min": 10, "max": 14}} |
col_double | {"name": "normalize", "type": "double", "paras": {"min": 10.1, "max": 10.5}} |
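The model tables store one JSON document per feature, holding the min and max observed during the first run. Passing such a table through inputParaTableName reuses those bounds instead of recomputing them. A minimal Python sketch of that idea follows, assuming the JSON layout shown above. The function `apply_model` and the in-memory `model` dict are hypothetical stand-ins for the component's internals.

```python
import json

# One JSON string per feature, mirroring the rows of the model table.
model = {
    "col_bigint": '{"name": "normalize", "type": "bigint", "paras": {"min": 10, "max": 14}}',
    "col_double": '{"name": "normalize", "type": "double", "paras": {"min": 10.1, "max": 10.5}}',
}


def apply_model(row, model):
    """Overwrite each modeled feature in row with its min-max scaled value."""
    out = dict(row)  # columns without a model entry are copied unchanged
    for feature, spec in model.items():
        paras = json.loads(spec)["paras"]
        value = row.get(feature)
        if value is None:
            continue  # NULL stays NULL
        span = paras["max"] - paras["min"]
        # Assumption: a zero-width range maps to 0.0 to avoid dividing by zero.
        out[feature] = 0.0 if span == 0 else (value - paras["min"]) / span
    return out


row = {"col_string": "04", "col_bigint": 13, "col_double": None}
scaled = apply_model(row, model)
# scaled == {"col_string": "04", "col_bigint": 0.75, "col_double": None}
```

Because the bounds come from the stored model rather than from the rows being scored, a value outside the original range would fall outside [0, 1] in this sketch.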