In-Database AI/ML overview
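AnalyticDB PostgreSQL 7.0 supports the In-Database AI/ML feature. You can apply the algorithms and models it provides to process data inside the database, which reduces the cost of moving data between systems. The In-Database AI/ML framework is compatible with the interfaces of the PostgresML open source community and adds substantial optimizations to functionality, performance, and ease of use. It uses GPUs and CPUs to train, fine-tune, deploy, and run inference with models. The framework is implemented on top of the pgml extension, which is one of the components of the PostgresML open source community and integrates classic machine learning libraries and algorithms such as XGBoost, LightGBM, and scikit-learn.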
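Version limits
The instance is an AnalyticDB PostgreSQL 7.0 instance with kernel version v7.1.1.0 or later.
The instance resource type is elastic storage mode.
The pgml extension is installed.
Note: pgml cannot currently be installed from the console. To install it, submit a ticket and ask support staff for assistance. To uninstall the extension, submit a ticket as well.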
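Metadata overview
In AnalyticDB PostgreSQL 7.0, the In-Database AI/ML framework is implemented on top of the pgml extension. After pgml is installed on an eligible version, the system automatically creates a schema named pgml. The schema contains the following metadata tables.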
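Metadata table | Description |
projects | Project information for training tasks. |
models | Information about trained models. |
files | Storage information for model files. |
snapshots | Snapshots of the datasets used for training. |
logs | Log output produced during training. |
deployments | Deployment information for trained models. |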
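When a training task starts, its training information is automatically written to the metadata tables above.
For the custom pgml types used in the metadata tables (such as task, runtime, and sampling), see the machine learning usage documentation.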
projects
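The projects table records the project ID, project name, task type, creation time, and update time of each training task. The table structure and indexes are as follows.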
Table "pgml.projects"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------
id | bigint | | not null | nextval('projects_id_seq'::regclass)
name | text | | not null |
task | task | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"projects_pkey" PRIMARY KEY, btree (id)
"projects_name_idx" btree (name)
Triggers:
projects_auto_updated_at BEFORE UPDATE ON projects FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_projects BEFORE INSERT ON projects FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_projects()
Distributed Replicated
models
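The models table records the parameters specified for model training, along with the associated project ID and snapshot ID. The table structure and indexes are as follows.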
Table "pgml.models"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+------------------------------------
id | bigint | | not null | nextval('models_id_seq'::regclass)
project_id | bigint | | not null |
snapshot_id | bigint | | |
num_features | integer | | not null |
algorithm | text | | not null |
runtime | runtime | | | 'python'::runtime
hyperparams | jsonb | | not null |
status | text | | not null |
metrics | jsonb | | |
search | text | | |
search_params | jsonb | | not null |
search_args | jsonb | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"models_pkey" PRIMARY KEY, btree (id)
"models_project_id_idx" btree (project_id)
"models_snapshot_id_idx" btree (snapshot_id)
Triggers:
models_auto_updated_at BEFORE UPDATE ON models FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_models BEFORE INSERT ON models FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_models_fk()
Distributed Replicated
files
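After training finishes, each file in the model directory is saved in binary form to the data column of the files table. The binary stream of each file is split into slices of 100 MB before it is stored. The table structure and indexes are as follows.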
Table "pgml.files"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+-----------------------------------
id | bigint | | not null | nextval('files_id_seq'::regclass)
model_id | bigint | | not null |
path | text | | not null |
part | integer | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
data | bytea | | not null |
Indexes:
"files_pkey" PRIMARY KEY, btree (id)
"files_model_id_path_part_idx" btree (model_id, path, part)
Triggers:
files_auto_updated_at BEFORE UPDATE ON files FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_files BEFORE INSERT ON files FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_files()
Distributed Replicated
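The slicing described for the files table can be sketched in a few lines of Python. This is a minimal illustration, not the extension's actual implementation: the helper names split_into_parts and reassemble are hypothetical, 100 MB is assumed to mean 100 * 1024 * 1024 bytes, and the part column is assumed to hold a zero-based slice index.

```python
from typing import Iterable, List, Tuple

# Assumption: "100 MB" means 100 * 1024 * 1024 bytes.
CHUNK_SIZE = 100 * 1024 * 1024


def split_into_parts(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, bytes]]:
    """Return (part, chunk) pairs, one per chunk_size slice of the byte stream.

    Assumption: part numbers start at 0 and increase by 1 per slice.
    """
    return [
        (part, data[offset:offset + chunk_size])
        for part, offset in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(rows: Iterable[Tuple[int, bytes]]) -> bytes:
    """Rebuild the original file by concatenating chunks in ascending part order."""
    return b"".join(chunk for _, chunk in sorted(rows, key=lambda row: row[0]))


# Demo with a tiny chunk size so the example runs instantly.
payload = bytes(range(10))
rows = split_into_parts(payload, chunk_size=4)
restored = reassemble(reversed(rows))  # row order does not matter, part does
```

Under these assumptions, reading a model file back means collecting every row with the same model_id and path and concatenating the data values in part order, which is the access pattern the (model_id, path, part) index would serve.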
snapshots
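The snapshots table records snapshot information about the dataset used for training, such as the data table name and the test set split settings. The table structure and indexes are as follows.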
Table "pgml.snapshots"
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------------------------------------
id | bigint | | not null | nextval('snapshots_id_seq'::regclass)
relation_name | text | | not null |
y_column_name | text[] | | |
test_size | real | | not null |
test_sampling | sampling | | not null |
status | text | | not null |
columns | jsonb | | |
analysis | jsonb | | |
created_at | timestamp without time zone | | not null | clock_timestamp()
updated_at | timestamp without time zone | | not null | clock_timestamp()
materialized | boolean | | | false
Indexes:
"snapshots_pkey" PRIMARY KEY, btree (id)
Triggers:
snapshots_auto_updated_at BEFORE UPDATE ON snapshots FOR EACH ROW EXECUTE FUNCTION set_updated_at()
Distributed Replicated
logs
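The logs table records the information output during training. A single training task may produce multiple log records; sort by the created_at column in ascending order to read them in sequence. The table structure and indexes are as follows.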
Table "pgml.logs"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+----------------------------------
id | integer | | not null | nextval('logs_id_seq'::regclass)
model_id | bigint | | |
project_id | bigint | | |
created_at | timestamp without time zone | | | CURRENT_TIMESTAMP
logs | jsonb | | |
Indexes:
"logs_pkey" PRIMARY KEY, btree (id)
Distributed Replicated
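The ascending sort on created_at can be tried out with a small, self-contained sketch. It uses Python's built-in sqlite3 module as a stand-in for the real database, with a simplified copy of the logs columns. The rows, timestamps, JSON payloads, and the logs_for_model helper are all invented for illustration.

```python
import sqlite3

# Stand-in database: sqlite3 in memory, NOT an AnalyticDB PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE logs (
        id         INTEGER PRIMARY KEY,
        model_id   INTEGER,
        project_id INTEGER,
        created_at TEXT,
        logs       TEXT
    )
    """
)

# Made-up sample rows, deliberately inserted out of chronological order.
conn.executemany(
    "INSERT INTO logs (model_id, project_id, created_at, logs) VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2024-01-01 10:00:05", '{"msg": "second"}'),
        (2, 1, "2024-01-01 10:00:07", '{"msg": "other model"}'),
        (1, 1, "2024-01-01 10:00:00", '{"msg": "first"}'),
        (1, 1, "2024-01-01 10:00:09", '{"msg": "third"}'),
    ],
)


def logs_for_model(connection, model_id):
    """Return (created_at, logs) rows for one model, oldest first."""
    return connection.execute(
        "SELECT created_at, logs FROM logs "
        "WHERE model_id = ? ORDER BY created_at ASC",
        (model_id,),
    ).fetchall()


ordered = logs_for_model(conn, 1)
```

Against a real instance the same query shape would target pgml.logs, and the parameter placeholder syntax would depend on the database driver in use.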
deployments
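When a model needs to be deployed, the system creates a deployment record that associates the project ID, deployment ID, and model ID. The deployments table also records the deployment strategy. The table structure and indexes are as follows.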
Table "pgml.deployments"
Column | Type | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+-----------------------------------------
id | bigint | | not null | nextval('deployments_id_seq'::regclass)
project_id | bigint | | not null |
model_id | bigint | | not null |
strategy | strategy | | not null |
created_at | timestamp without time zone | | not null | clock_timestamp()
Indexes:
"deployments_pkey" PRIMARY KEY, btree (id)
"deployments_model_id_created_at_idx" btree (model_id)
"deployments_project_id_created_at_idx" btree (project_id)
Triggers:
deployments_auto_updated_at BEFORE UPDATE ON deployments FOR EACH ROW EXECUTE FUNCTION set_updated_at()
trigger_before_insert_pgml_deployments BEFORE INSERT ON deployments FOR EACH ROW EXECUTE FUNCTION trigger_check_pgml_deployments_fk()
Distributed Replicated