特征平臺當前提供的特征生產功能旨在簡化特征創建過程,通過固化常用的和普遍的生產步驟,您僅需進行簡單配置就能輕松生成特征,從而有效降低了特征生產的復雜性。特征生產在多個領域(包括推薦、廣告、風控以及機器學習等)都有廣泛應用,本文將以推薦場景為例,為您介紹從原始表到特征生產加工生成樣本表,再到訓練模型的完整過程。
前提條件
在開始執行操作前,請確認您已完成以下準備工作:
依賴產品 | 具體操作 |
人工智能平臺PAI |
|
云原生大數據計算服務MaxCompute |
|
大數據開發治理平臺DataWorks |
|
一、準備工作
準備原始數據
一般對于推薦場景,特征生產通常需要以下三張原始表,以這三張原始表為基礎,特征生產可以生產出成百上千個特征,方便建立模型來擬合目標。
用戶表(rec_sln_demo_user_table_preprocess_v1):包含一些基礎的用戶特征,例如性別、年齡、城市和關注數等。
物品表(rec_sln_demo_item_table_preprocess_v1):包含一些基礎的物品特征,例如類別、作者、累計點擊數和累計點贊數等。
行為表(rec_sln_demo_behavior_table_preprocess_v1):包含一些行為特征,例如某時用戶點擊某個物品等。
數據表存放在有公開讀取權限的pai_online_project中,其數據均為模擬數據生成。您需要在DataWorks中執行SQL命令,將上表數據從pai_online_project項目同步到您的MaxCompute項目中。具體操作步驟如下:
登錄DataWorks控制臺。
在左側導航欄單擊數據開發與治理 > 數據開發。
選擇已創建的DataWorks工作空間后,單擊進入數據開發。
鼠標懸停至新建,選擇新建節點 > MaxCompute > ODPS SQL,在彈出的頁面中配置節點參數。
參數
取值建議
引擎實例
選擇已創建的MaxCompute引擎。
節點類型
ODPS SQL
路徑
業務流程/Workflow/MaxCompute
名稱
可自定義名稱。
單擊確認。
在新建節點區域運行以下SQL命令,將用戶表、物品表、行為表從pai_online_project項目同步到您自己的MaxCompute中。資源組選擇已創建的獨享資源組。
同步用戶表:rec_sln_demo_user_table_preprocess_v1
CREATE TABLE IF NOT EXISTS rec_sln_demo_user_table_preprocess_v1 like pai_online_project.rec_sln_demo_user_table_preprocess_v1 STORED AS ALIORC LIFECYCLE 90; INSERT OVERWRITE TABLE rec_sln_demo_user_table_preprocess_v1 PARTITION(ds) SELECT * FROM pai_online_project.rec_sln_demo_user_table_preprocess_v1 WHERE ds >= '20240530' and ds <='20240605';
同步物品表:rec_sln_demo_item_table_preprocess_v1
CREATE TABLE IF NOT EXISTS rec_sln_demo_item_table_preprocess_v1 like pai_online_project.rec_sln_demo_item_table_preprocess_v1 STORED AS ALIORC LIFECYCLE 90; INSERT OVERWRITE TABLE rec_sln_demo_item_table_preprocess_v1 PARTITION(ds) SELECT * FROM pai_online_project.rec_sln_demo_item_table_preprocess_v1 WHERE ds >= '20240530' and ds <='20240605';
同步行為表:rec_sln_demo_behavior_table_preprocess_v1
CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_preprocess_v1 like pai_online_project.rec_sln_demo_behavior_table_preprocess_v1 STORED AS ALIORC LIFECYCLE 90; INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_preprocess_v1 PARTITION(ds) SELECT * FROM pai_online_project.rec_sln_demo_behavior_table_preprocess_v1 WHERE ds >= '20240530' and ds <='20240605';
安裝FeatureStore Python SDK
以下代碼均建議在Jupyter Notebook環境下運行。
安裝特征平臺Python SDK,要求在Python3環境下運行。
%pip install https://feature-store-py.oss-cn-beijing.aliyuncs.com/package/feature_store_py-1.8.0-py3-none-any.whl
導入需要的功能模塊:
import os from feature_store_py import FeatureStoreClient from feature_store_py.fs_datasource import MaxComputeDataSource from feature_store_py.feature_engineering import TableTransform, Condition, DayOf, ComboTransform, Feature, AggregationTransform, auto_count_feature_transform, WindowTransform, auto_window_feature_transform
二、原始表初步變換
準備流程完成后,您可以在自己的項目空間中查看已準備就緒的三張表:用戶表(rec_sln_demo_user_table_preprocess_v1)、物品表(rec_sln_demo_item_table_preprocess_v1)以及行為表(rec_sln_demo_behavior_table_preprocess_v1),您可以以這三張表為基礎進行特征生產。
在進行后續的特征生產加工之前,為了方便您后續做特征統計,您需要對數據表做以下預處理操作。您可以將以下SQL命令粘貼到已創建的ODPS SQL節點中執行,具體操作,請參見準備原始數據。
將用戶表、物品表的特征和行為表進行連接。
CREATE TABLE IF NOT EXISTS rec_sln_demo_behavior_table_preprocess_wide_v1 ( request_id bigint ,user_id string ,page string ,net_type string ,day_h bigint COMMENT '行為發生在當天的第幾小時' ,week_day bigint COMMENT '行為發生在當前周的第幾天' ,event_unix_time bigint ,item_id string ,event string ,playtime double ,gender string ,age bigint ,city string ,item_cnt bigint ,follow_cnt bigint ,follower_cnt bigint ,is_new_user bigint ,tags string ,duration double ,category string ,author bigint ,click_count bigint ,praise_count bigint ,is_new_item bigint ) PARTITIONED BY ( ds string ) LIFECYCLE 90 ; INSERT OVERWRITE TABLE rec_sln_demo_behavior_table_preprocess_wide_v1 PARTITION(ds='${bdp.system.bizdate}') SELECT sq0.request_id ,sq0.user_id ,sq0.page ,sq0.net_type ,sq0.day_h ,sq0.week_day ,sq0.event_unix_time ,sq0.item_id ,sq0.event ,sq0.playtime ,sq1.gender ,sq1.age ,sq1.city ,sq1.item_cnt ,sq1.follow_cnt ,sq1.follower_cnt ,sq1.is_new_user ,sq1.tags ,sq2.duration ,sq2.category ,sq2.author ,sq2.click_count ,sq2.praise_count ,sq2.is_new_item FROM ( SELECT * FROM rec_sln_demo_behavior_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' ) sq0 LEFT JOIN ( SELECT * FROM rec_sln_demo_user_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' ) sq1 ON sq0.user_id = sq1.user_id LEFT JOIN ( SELECT * FROM rec_sln_demo_item_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' ) sq2 ON sq0.item_id = sq2.item_id ;
因訓練模型需要有標簽,您需要對行為表進行預處理,將其轉換成Label表。本示例將點擊、播放時間、點贊作為標簽。
CREATE TABLE IF NOT EXISTS rec_sln_demo_fs_label_table_v1 ( request_id bigint ,user_id string ,page string ,net_type string ,day_h bigint COMMENT '行為發生在當天的第幾小時' ,week_day bigint COMMENT '行為發生在當前周的第幾天' ,day_min string ,event_unix_time bigint ,item_id string ,playtime double ,is_click BIGINT ,ln_playtime DOUBLE ,is_praise BIGINT ) PARTITIONED BY ( ds string ) LIFECYCLE 90 ; INSERT OVERWRITE TABLE rec_sln_demo_fs_label_table_v1 PARTITION(ds='${bdp.system.bizdate}') SELECT request_id ,user_id ,MAX(page) page ,MAX(net_type) net_type ,MAX(day_h) day_h ,MAX(week_day) week_day ,TO_CHAR(FROM_UNIXTIME(MIN(event_unix_time)),'yyyymmddhhmi') day_min ,MAX(event_unix_time) event_unix_time ,item_id ,MAX(playtime) playtime ,max(if(event='click', 1, 0)) is_click ,ln(sum(playtime) + 1) ln_playtime ,max(if(event='praise', 1, 0)) is_praise FROM rec_sln_demo_behavior_table_preprocess_v1 WHERE ds = '${bdp.system.bizdate}' GROUP BY request_id ,user_id ,item_id ;
經過上面的初步變換后,您將擁有以下兩種表:
新的行為表:rec_sln_demo_behavior_table_preprocess_wide_v1,后續的統計特征將以此表為基礎進行變換。
Label表:rec_sln_demo_fs_label_table_v1,后續構建樣本表時需要用到此表。
三、特征生產加工
您可以調用特征生產中的自動擴展函數進行特征生產,僅需幾行代碼,就可以生產出上百種特征。
特征生產加工分別需要加工用戶側特征和物品側特征。具體操作步驟如下:
用戶側特征的特征生產
以下代碼均建議在Jupyter Notebook環境下運行。
初始化Client。
access_key_id = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_ID") # 填入您的Access Key ID access_key_secret = os.environ.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET") # 填入您的Access Key Secret project = 'project_name' # 填入您的項目名 region = 'cn-hangzhou' # 填入您的項目所在區域,例如華東1(杭州)為cn-hangzhou fs_client = FeatureStoreClient(access_key_id=access_key_id, access_key_secret=access_key_secret, region=region)
指定要進行特征變換的數據源。
input_bhv_table_name = "rec_sln_demo_behavior_table_preprocess_wide_v1" ds_bhv = MaxComputeDataSource(table=input_bhv_table_name, project=project) input_user_table_name = "rec_sln_demo_user_table_preprocess_v1" ds_user = MaxComputeDataSource(table=input_user_table_name, project=project) input_item_table_name = "rec_sln_demo_item_table_preprocess_v1" ds_item = MaxComputeDataSource(table=input_item_table_name, project=project)
指定JoinTransform和AggregationTransform后輸出的用戶側特征表名稱。
agg_user_table_v1 = 'rec_sln_demo_user_table_preprocess_agg_v1'
利用自動擴展函數,對統計特征進行自動擴展。
name_prefix = "user" input_list = ["playtime", "duration", "click_count", "praise_count"] event_name = 'event' event_type = 'expr' group_by_key = "user_id" window_size = [3,7,15] user_count_feature_list = auto_count_feature_transform(name_prefix, input_list, event_name, event_type, group_by_key, window_size) print("len_count_feature_list = ", len(user_count_feature_list)) print("count_feature_list = ", user_count_feature_list)
執行生產過程,并查看表運行結果。
創建Pipeline。
agg_user_bhv_pipeline = fs_client.create_pipeline(ds_bhv).add_feature_transform(user_count_feature_list) agg_user_pipeline =fs_client.create_pipeline(ds_user, agg_user_table_v1).merge(agg_user_bhv_pipeline, keep_input_columns=False)
執行Pipeline生產過程。
execute_date = '20240605' output_agg_user_table = agg_user_pipeline.execute(execute_date, drop_table=True)
查看表運行結果。
agg_user_ret = output_agg_user_table.to_pandas(execute_date, limit=20) agg_user_ret
進行WindowTransform變換,指定WindowTransform后輸出的用戶側特征表名稱。
win_user_table_v1 = 'rec_sln_demo_user_table_preprocess_win_v1'
利用內置的自動擴展函數,自動生成WindowTransform特征定義。
name_prefix = 'user' input_list = ['day_h', 'category'] agg_field = ['duration', 'click_count'] event_name = 'event' event_type = 'expr' group_by_key = 'user_id' window_size = [7, 15, 30, 45] user_win_feature_list = auto_window_feature_transform(name_prefix, input_list, agg_field, event_name, event_type, group_by_key, window_size) print("len_user_win_feature_list = ", len(user_win_feature_list)) print("user_win_feature_list = ", user_win_feature_list)
執行生產過程,并查看表運行結果。
創建pipeline。
win_user_bhv_pipeline = fs_client.create_pipeline(ds_bhv).add_feature_transform(user_win_feature_list) win_user_pipeline = fs_client.create_pipeline(ds_user, win_user_table_v1).merge(win_user_bhv_pipeline, keep_input_columns=False)
執行pipeline生產過程。
execute_date = '20240605' output_win_user_table = win_user_pipeline.execute(execute_date, drop_table=True) # 詳細實現見功能文檔。因為有中間表存在,第一次執行時需要補數據,可能需要較長時間,運行補數據可以執行下面的命令 # output_win_user_table = win_user_pipeline.execute(execute_date, drop_table=True, backfill_partitions=True)
查看表運行結果。
win_user_ret = output_win_user_table.to_pandas(execute_date, limit=20) win_user_ret
物品側特征的特征生產
在完成用戶側特征的提取和加工之后,您可以繼續對物品表進行特征生產加工,生成基于物品類型的特征。具體操作步驟如下:
指定AggregationTransform后輸出的物品側特征表名稱。
agg_item_table_v1 = 'rec_sln_demo_item_table_preprocess_agg_v1'
利用自動擴展函數,對統計特征進行自動擴展。
name_prefix = "item" input_list = ["item_cnt", "follow_cnt", "follower_cnt"] event_name = 'event' event_type = 'expr' group_by_key = "item_id" window_size = [3,7,15] item_count_feature_list = auto_count_feature_transform(name_prefix, input_list, event_name, event_type, group_by_key, window_size) print("len_count_feature_list = ", len(item_count_feature_list)) print("count_feature_list = ", item_count_feature_list)
執行生產過程,并查看表運行結果。
創建pipeline。
agg_item_bhv_pipeline =fs_client.create_pipeline(ds_bhv).add_feature_transform(item_count_feature_list) agg_item_pipeline =fs_client.create_pipeline(ds_item, agg_item_table_v1).merge(agg_item_bhv_pipeline, keep_input_columns=False)
執行pipeline生產過程。
execute_date = '20240605' output_agg_item_table = agg_item_pipeline.execute(execute_date, drop_table=True)
查看表運行結果。
agg_item_ret = output_agg_item_table.to_pandas(execute_date, limit=20) agg_item_ret
指定WindowTransform后輸出的用戶側特征表名稱。
win_item_table_v1 = 'rec_sln_demo_item_table_preprocess_win_v1'
利用內置的自動擴展函數,自動生成WindowTransform特征定義。
name_prefix = 'item' input_list = ['day_h', 'category'] agg_field = ['click_count', 'praise_count'] event_name = 'event' event_type = 'expr' group_by_key = 'item_id' window_size = [7, 15, 30, 45] item_win_feature_list = auto_window_feature_transform(name_prefix, input_list, agg_field, event_name, event_type, group_by_key, window_size) print("len_item_win_feature_list = ", len(item_win_feature_list)) print("item_win_feature_list = ", item_win_feature_list)
執行生產過程,并查看表運行結果。
創建pipeline。
win_item_bhv_pipeline = fs_client.create_pipeline(ds_bhv).add_feature_transform(item_win_feature_list) win_item_pipeline = fs_client.create_pipeline(ds_item, win_item_table_v1).merge(win_item_bhv_pipeline, keep_input_columns=False)
執行pipeline生產過程。
execute_date = '20240605' output_win_item_table = win_item_pipeline.execute(execute_date, drop_table=True) # 詳細實現見功能文檔。因為有中間表存在,第一次執行時需要補數據,可能需要較長時間,運行補數據可以執行下面的命令 # output_win_item_table = win_item_pipeline.execute(execute_date, drop_table=True, backfill_partitions=True)
查看表運行結果。
win_item_ret = output_win_item_table.to_pandas(execute_date, limit=20) win_item_ret
四、生成樣本表
經過上述流程,您已獲取了Label表,并在物品側和用戶側分別整理出三張特征表(包括一張原始表及兩張經過特征生產加工出的表)。隨后,您可以將以下SQL命令粘貼到已創建的ODPS SQL節點中執行,將這七張表進行合并,以構建完整的樣本表。具體操作,請參見準備原始數據。
CREATE TABLE IF NOT EXISTS fs_demo_fs_engineering_v1_training_set(
request_id BIGINT,
user_id STRING,
page STRING,
net_type STRING,
day_h BIGINT,
week_day BIGINT,
day_min STRING,
event_unix_time BIGINT,
item_id STRING,
playtime DOUBLE,
is_click BIGINT,
ln_playtime DOUBLE,
is_praise BIGINT,
duration DOUBLE,
category STRING,
author BIGINT,
click_count BIGINT,
praise_count BIGINT,
is_new_item BIGINT,
item__sum_item_cnt_3d BIGINT,
item__sum_follow_cnt_3d BIGINT,
item__sum_follower_cnt_3d BIGINT,
item__max_item_cnt_3d BIGINT,
item__max_follow_cnt_3d BIGINT,
item__max_follower_cnt_3d BIGINT,
item__min_item_cnt_3d BIGINT,
item__min_follow_cnt_3d BIGINT,
item__min_follower_cnt_3d BIGINT,
item__avg_item_cnt_3d DOUBLE,
item__avg_follow_cnt_3d DOUBLE,
item__avg_follower_cnt_3d DOUBLE,
item__sum_item_cnt_7d BIGINT,
item__sum_follow_cnt_7d BIGINT,
item__sum_follower_cnt_7d BIGINT,
item__max_item_cnt_7d BIGINT,
item__max_follow_cnt_7d BIGINT,
item__max_follower_cnt_7d BIGINT,
item__min_item_cnt_7d BIGINT,
item__min_follow_cnt_7d BIGINT,
item__min_follower_cnt_7d BIGINT,
item__avg_item_cnt_7d DOUBLE,
item__avg_follow_cnt_7d DOUBLE,
item__avg_follower_cnt_7d DOUBLE,
item__sum_item_cnt_15d BIGINT,
item__sum_follow_cnt_15d BIGINT,
item__sum_follower_cnt_15d BIGINT,
item__max_item_cnt_15d BIGINT,
item__max_follow_cnt_15d BIGINT,
item__max_follower_cnt_15d BIGINT,
item__min_item_cnt_15d BIGINT,
item__min_follow_cnt_15d BIGINT,
item__min_follower_cnt_15d BIGINT,
item__avg_item_cnt_15d DOUBLE,
item__avg_follow_cnt_15d DOUBLE,
item__avg_follower_cnt_15d DOUBLE,
item__kv_day_h_click_count_sum_7d STRING,
item__kv_category_click_count_sum_7d STRING,
item__kv_day_h_praise_count_sum_7d STRING,
item__kv_category_praise_count_sum_7d STRING,
item__kv_day_h_click_count_max_7d STRING,
item__kv_category_click_count_max_7d STRING,
item__kv_day_h_praise_count_max_7d STRING,
item__kv_category_praise_count_max_7d STRING,
item__kv_day_h_click_count_min_7d STRING,
item__kv_category_click_count_min_7d STRING,
item__kv_day_h_praise_count_min_7d STRING,
item__kv_category_praise_count_min_7d STRING,
item__kv_day_h_click_count_avg_7d STRING,
item__kv_category_click_count_avg_7d STRING,
item__kv_day_h_praise_count_avg_7d STRING,
item__kv_category_praise_count_avg_7d STRING,
item__kv_day_h_click_count_sum_15d STRING,
item__kv_category_click_count_sum_15d STRING,
item__kv_day_h_praise_count_sum_15d STRING,
item__kv_category_praise_count_sum_15d STRING,
item__kv_day_h_click_count_max_15d STRING,
item__kv_category_click_count_max_15d STRING,
item__kv_day_h_praise_count_max_15d STRING,
item__kv_category_praise_count_max_15d STRING,
item__kv_day_h_click_count_min_15d STRING,
item__kv_category_click_count_min_15d STRING,
item__kv_day_h_praise_count_min_15d STRING,
item__kv_category_praise_count_min_15d STRING,
item__kv_day_h_click_count_avg_15d STRING,
item__kv_category_click_count_avg_15d STRING,
item__kv_day_h_praise_count_avg_15d STRING,
item__kv_category_praise_count_avg_15d STRING,
item__kv_day_h_click_count_sum_30d STRING,
item__kv_category_click_count_sum_30d STRING,
item__kv_day_h_praise_count_sum_30d STRING,
item__kv_category_praise_count_sum_30d STRING,
item__kv_day_h_click_count_max_30d STRING,
item__kv_category_click_count_max_30d STRING,
item__kv_day_h_praise_count_max_30d STRING,
item__kv_category_praise_count_max_30d STRING,
item__kv_day_h_click_count_min_30d STRING,
item__kv_category_click_count_min_30d STRING,
item__kv_day_h_praise_count_min_30d STRING,
item__kv_category_praise_count_min_30d STRING,
item__kv_day_h_click_count_avg_30d STRING,
item__kv_category_click_count_avg_30d STRING,
item__kv_day_h_praise_count_avg_30d STRING,
item__kv_category_praise_count_avg_30d STRING,
item__kv_day_h_click_count_sum_45d STRING,
item__kv_category_click_count_sum_45d STRING,
item__kv_day_h_praise_count_sum_45d STRING,
item__kv_category_praise_count_sum_45d STRING,
item__kv_day_h_click_count_max_45d STRING,
item__kv_category_click_count_max_45d STRING,
item__kv_day_h_praise_count_max_45d STRING,
item__kv_category_praise_count_max_45d STRING,
item__kv_day_h_click_count_min_45d STRING,
item__kv_category_click_count_min_45d STRING,
item__kv_day_h_praise_count_min_45d STRING,
item__kv_category_praise_count_min_45d STRING,
item__kv_day_h_click_count_avg_45d STRING,
item__kv_category_click_count_avg_45d STRING,
item__kv_day_h_praise_count_avg_45d STRING,
item__kv_category_praise_count_avg_45d STRING,
gender STRING,
age BIGINT,
city STRING,
item_cnt BIGINT,
follow_cnt BIGINT,
follower_cnt BIGINT,
is_new_user BIGINT,
tags STRING,
user__sum_playtime_3d DOUBLE,
user__sum_duration_3d DOUBLE,
user__sum_click_count_3d BIGINT,
user__sum_praise_count_3d BIGINT,
user__max_playtime_3d DOUBLE,
user__max_duration_3d DOUBLE,
user__max_click_count_3d BIGINT,
user__max_praise_count_3d BIGINT,
user__min_playtime_3d DOUBLE,
user__min_duration_3d DOUBLE,
user__min_click_count_3d BIGINT,
user__min_praise_count_3d BIGINT,
user__avg_playtime_3d DOUBLE,
user__avg_duration_3d DOUBLE,
user__avg_click_count_3d DOUBLE,
user__avg_praise_count_3d DOUBLE,
user__sum_playtime_7d DOUBLE,
user__sum_duration_7d DOUBLE,
user__sum_click_count_7d BIGINT,
user__sum_praise_count_7d BIGINT,
user__max_playtime_7d DOUBLE,
user__max_duration_7d DOUBLE,
user__max_click_count_7d BIGINT,
user__max_praise_count_7d BIGINT,
user__min_playtime_7d DOUBLE,
user__min_duration_7d DOUBLE,
user__min_click_count_7d BIGINT,
user__min_praise_count_7d BIGINT,
user__avg_playtime_7d DOUBLE,
user__avg_duration_7d DOUBLE,
user__avg_click_count_7d DOUBLE,
user__avg_praise_count_7d DOUBLE,
user__sum_playtime_15d DOUBLE,
user__sum_duration_15d DOUBLE,
user__sum_click_count_15d BIGINT,
user__sum_praise_count_15d BIGINT,
user__max_playtime_15d DOUBLE,
user__max_duration_15d DOUBLE,
user__max_click_count_15d BIGINT,
user__max_praise_count_15d BIGINT,
user__min_playtime_15d DOUBLE,
user__min_duration_15d DOUBLE,
user__min_click_count_15d BIGINT,
user__min_praise_count_15d BIGINT,
user__avg_playtime_15d DOUBLE,
user__avg_duration_15d DOUBLE,
user__avg_click_count_15d DOUBLE,
user__avg_praise_count_15d DOUBLE,
user__kv_day_h_duration_sum_7d STRING,
user__kv_category_duration_sum_7d STRING,
user__kv_day_h_click_count_sum_7d STRING,
user__kv_category_click_count_sum_7d STRING,
user__kv_day_h_duration_max_7d STRING,
user__kv_category_duration_max_7d STRING,
user__kv_day_h_click_count_max_7d STRING,
user__kv_category_click_count_max_7d STRING,
user__kv_day_h_duration_min_7d STRING,
user__kv_category_duration_min_7d STRING,
user__kv_day_h_click_count_min_7d STRING,
user__kv_category_click_count_min_7d STRING,
user__kv_day_h_duration_avg_7d STRING,
user__kv_category_duration_avg_7d STRING,
user__kv_day_h_click_count_avg_7d STRING,
user__kv_category_click_count_avg_7d STRING,
user__kv_day_h_duration_sum_15d STRING,
user__kv_category_duration_sum_15d STRING,
user__kv_day_h_click_count_sum_15d STRING,
user__kv_category_click_count_sum_15d STRING,
user__kv_day_h_duration_max_15d STRING,
user__kv_category_duration_max_15d STRING,
user__kv_day_h_click_count_max_15d STRING,
user__kv_category_click_count_max_15d STRING,
user__kv_day_h_duration_min_15d STRING,
user__kv_category_duration_min_15d STRING,
user__kv_day_h_click_count_min_15d STRING,
user__kv_category_click_count_min_15d STRING,
user__kv_day_h_duration_avg_15d STRING,
user__kv_category_duration_avg_15d STRING,
user__kv_day_h_click_count_avg_15d STRING,
user__kv_category_click_count_avg_15d STRING,
user__kv_day_h_duration_sum_30d STRING,
user__kv_category_duration_sum_30d STRING,
user__kv_day_h_click_count_sum_30d STRING,
user__kv_category_click_count_sum_30d STRING,
user__kv_day_h_duration_max_30d STRING,
user__kv_category_duration_max_30d STRING,
user__kv_day_h_click_count_max_30d STRING,
user__kv_category_click_count_max_30d STRING,
user__kv_day_h_duration_min_30d STRING,
user__kv_category_duration_min_30d STRING,
user__kv_day_h_click_count_min_30d STRING,
user__kv_category_click_count_min_30d STRING,
user__kv_day_h_duration_avg_30d STRING,
user__kv_category_duration_avg_30d STRING,
user__kv_day_h_click_count_avg_30d STRING,
user__kv_category_click_count_avg_30d STRING,
user__kv_day_h_duration_sum_45d STRING,
user__kv_category_duration_sum_45d STRING,
user__kv_day_h_click_count_sum_45d STRING,
user__kv_category_click_count_sum_45d STRING,
user__kv_day_h_duration_max_45d STRING,
user__kv_category_duration_max_45d STRING,
user__kv_day_h_click_count_max_45d STRING,
user__kv_category_click_count_max_45d STRING,
user__kv_day_h_duration_min_45d STRING,
user__kv_category_duration_min_45d STRING,
user__kv_day_h_click_count_min_45d STRING,
user__kv_category_click_count_min_45d STRING,
user__kv_day_h_duration_avg_45d STRING,
user__kv_category_duration_avg_45d STRING,
user__kv_day_h_click_count_avg_45d STRING,
user__kv_category_click_count_avg_45d STRING
)
PARTITIONED BY (ds STRING)
LIFECYCLE 90;
insert overwrite table fs_demo_fs_engineering_v1_training_set partition (ds = '${bdp.system.bizdate}')
select
sq0.request_id,
sq0.user_id,
sq0.page,
sq0.net_type,
sq0.day_h,
sq0.week_day,
sq0.day_min,
sq0.event_unix_time,
sq0.item_id,
sq0.playtime,
sq0.is_click,
sq0.ln_playtime,
sq0.is_praise,
sq2.duration,
sq2.category,
sq2.author,
sq2.click_count,
sq2.praise_count,
sq2.is_new_item,
sq5.item__sum_item_cnt_3d,
sq5.item__sum_follow_cnt_3d,
sq5.item__sum_follower_cnt_3d,
sq5.item__max_item_cnt_3d,
sq5.item__max_follow_cnt_3d,
sq5.item__max_follower_cnt_3d,
sq5.item__min_item_cnt_3d,
sq5.item__min_follow_cnt_3d,
sq5.item__min_follower_cnt_3d,
sq5.item__avg_item_cnt_3d,
sq5.item__avg_follow_cnt_3d,
sq5.item__avg_follower_cnt_3d,
sq5.item__sum_item_cnt_7d,
sq5.item__sum_follow_cnt_7d,
sq5.item__sum_follower_cnt_7d,
sq5.item__max_item_cnt_7d,
sq5.item__max_follow_cnt_7d,
sq5.item__max_follower_cnt_7d,
sq5.item__min_item_cnt_7d,
sq5.item__min_follow_cnt_7d,
sq5.item__min_follower_cnt_7d,
sq5.item__avg_item_cnt_7d,
sq5.item__avg_follow_cnt_7d,
sq5.item__avg_follower_cnt_7d,
sq5.item__sum_item_cnt_15d,
sq5.item__sum_follow_cnt_15d,
sq5.item__sum_follower_cnt_15d,
sq5.item__max_item_cnt_15d,
sq5.item__max_follow_cnt_15d,
sq5.item__max_follower_cnt_15d,
sq5.item__min_item_cnt_15d,
sq5.item__min_follow_cnt_15d,
sq5.item__min_follower_cnt_15d,
sq5.item__avg_item_cnt_15d,
sq5.item__avg_follow_cnt_15d,
sq5.item__avg_follower_cnt_15d,
sq6.item__kv_day_h_click_count_sum_7d,
sq6.item__kv_category_click_count_sum_7d,
sq6.item__kv_day_h_praise_count_sum_7d,
sq6.item__kv_category_praise_count_sum_7d,
sq6.item__kv_day_h_click_count_max_7d,
sq6.item__kv_category_click_count_max_7d,
sq6.item__kv_day_h_praise_count_max_7d,
sq6.item__kv_category_praise_count_max_7d,
sq6.item__kv_day_h_click_count_min_7d,
sq6.item__kv_category_click_count_min_7d,
sq6.item__kv_day_h_praise_count_min_7d,
sq6.item__kv_category_praise_count_min_7d,
sq6.item__kv_day_h_click_count_avg_7d,
sq6.item__kv_category_click_count_avg_7d,
sq6.item__kv_day_h_praise_count_avg_7d,
sq6.item__kv_category_praise_count_avg_7d,
sq6.item__kv_day_h_click_count_sum_15d,
sq6.item__kv_category_click_count_sum_15d,
sq6.item__kv_day_h_praise_count_sum_15d,
sq6.item__kv_category_praise_count_sum_15d,
sq6.item__kv_day_h_click_count_max_15d,
sq6.item__kv_category_click_count_max_15d,
sq6.item__kv_day_h_praise_count_max_15d,
sq6.item__kv_category_praise_count_max_15d,
sq6.item__kv_day_h_click_count_min_15d,
sq6.item__kv_category_click_count_min_15d,
sq6.item__kv_day_h_praise_count_min_15d,
sq6.item__kv_category_praise_count_min_15d,
sq6.item__kv_day_h_click_count_avg_15d,
sq6.item__kv_category_click_count_avg_15d,
sq6.item__kv_day_h_praise_count_avg_15d,
sq6.item__kv_category_praise_count_avg_15d,
sq6.item__kv_day_h_click_count_sum_30d,
sq6.item__kv_category_click_count_sum_30d,
sq6.item__kv_day_h_praise_count_sum_30d,
sq6.item__kv_category_praise_count_sum_30d,
sq6.item__kv_day_h_click_count_max_30d,
sq6.item__kv_category_click_count_max_30d,
sq6.item__kv_day_h_praise_count_max_30d,
sq6.item__kv_category_praise_count_max_30d,
sq6.item__kv_day_h_click_count_min_30d,
sq6.item__kv_category_click_count_min_30d,
sq6.item__kv_day_h_praise_count_min_30d,
sq6.item__kv_category_praise_count_min_30d,
sq6.item__kv_day_h_click_count_avg_30d,
sq6.item__kv_category_click_count_avg_30d,
sq6.item__kv_day_h_praise_count_avg_30d,
sq6.item__kv_category_praise_count_avg_30d,
sq6.item__kv_day_h_click_count_sum_45d,
sq6.item__kv_category_click_count_sum_45d,
sq6.item__kv_day_h_praise_count_sum_45d,
sq6.item__kv_category_praise_count_sum_45d,
sq6.item__kv_day_h_click_count_max_45d,
sq6.item__kv_category_click_count_max_45d,
sq6.item__kv_day_h_praise_count_max_45d,
sq6.item__kv_category_praise_count_max_45d,
sq6.item__kv_day_h_click_count_min_45d,
sq6.item__kv_category_click_count_min_45d,
sq6.item__kv_day_h_praise_count_min_45d,
sq6.item__kv_category_praise_count_min_45d,
sq6.item__kv_day_h_click_count_avg_45d,
sq6.item__kv_category_click_count_avg_45d,
sq6.item__kv_day_h_praise_count_avg_45d,
sq6.item__kv_category_praise_count_avg_45d,
sq1.gender,
sq1.age,
sq1.city,
sq1.item_cnt,
sq1.follow_cnt,
sq1.follower_cnt,
sq1.is_new_user,
sq1.tags,
sq3.user__sum_playtime_3d,
sq3.user__sum_duration_3d,
sq3.user__sum_click_count_3d,
sq3.user__sum_praise_count_3d,
sq3.user__max_playtime_3d,
sq3.user__max_duration_3d,
sq3.user__max_click_count_3d,
sq3.user__max_praise_count_3d,
sq3.user__min_playtime_3d,
sq3.user__min_duration_3d,
sq3.user__min_click_count_3d,
sq3.user__min_praise_count_3d,
sq3.user__avg_playtime_3d,
sq3.user__avg_duration_3d,
sq3.user__avg_click_count_3d,
sq3.user__avg_praise_count_3d,
sq3.user__sum_playtime_7d,
sq3.user__sum_duration_7d,
sq3.user__sum_click_count_7d,
sq3.user__sum_praise_count_7d,
sq3.user__max_playtime_7d,
sq3.user__max_duration_7d,
sq3.user__max_click_count_7d,
sq3.user__max_praise_count_7d,
sq3.user__min_playtime_7d,
sq3.user__min_duration_7d,
sq3.user__min_click_count_7d,
sq3.user__min_praise_count_7d,
sq3.user__avg_playtime_7d,
sq3.user__avg_duration_7d,
sq3.user__avg_click_count_7d,
sq3.user__avg_praise_count_7d,
sq3.user__sum_playtime_15d,
sq3.user__sum_duration_15d,
sq3.user__sum_click_count_15d,
sq3.user__sum_praise_count_15d,
sq3.user__max_playtime_15d,
sq3.user__max_duration_15d,
sq3.user__max_click_count_15d,
sq3.user__max_praise_count_15d,
sq3.user__min_playtime_15d,
sq3.user__min_duration_15d,
sq3.user__min_click_count_15d,
sq3.user__min_praise_count_15d,
sq3.user__avg_playtime_15d,
sq3.user__avg_duration_15d,
sq3.user__avg_click_count_15d,
sq3.user__avg_praise_count_15d,
sq4.user__kv_day_h_duration_sum_7d,
sq4.user__kv_category_duration_sum_7d,
sq4.user__kv_day_h_click_count_sum_7d,
sq4.user__kv_category_click_count_sum_7d,
sq4.user__kv_day_h_duration_max_7d,
sq4.user__kv_category_duration_max_7d,
sq4.user__kv_day_h_click_count_max_7d,
sq4.user__kv_category_click_count_max_7d,
sq4.user__kv_day_h_duration_min_7d,
sq4.user__kv_category_duration_min_7d,
sq4.user__kv_day_h_click_count_min_7d,
sq4.user__kv_category_click_count_min_7d,
sq4.user__kv_day_h_duration_avg_7d,
sq4.user__kv_category_duration_avg_7d,
sq4.user__kv_day_h_click_count_avg_7d,
sq4.user__kv_category_click_count_avg_7d,
sq4.user__kv_day_h_duration_sum_15d,
sq4.user__kv_category_duration_sum_15d,
sq4.user__kv_day_h_click_count_sum_15d,
sq4.user__kv_category_click_count_sum_15d,
sq4.user__kv_day_h_duration_max_15d,
sq4.user__kv_category_duration_max_15d,
sq4.user__kv_day_h_click_count_max_15d,
sq4.user__kv_category_click_count_max_15d,
sq4.user__kv_day_h_duration_min_15d,
sq4.user__kv_category_duration_min_15d,
sq4.user__kv_day_h_click_count_min_15d,
sq4.user__kv_category_click_count_min_15d,
sq4.user__kv_day_h_duration_avg_15d,
sq4.user__kv_category_duration_avg_15d,
sq4.user__kv_day_h_click_count_avg_15d,
sq4.user__kv_category_click_count_avg_15d,
sq4.user__kv_day_h_duration_sum_30d,
sq4.user__kv_category_duration_sum_30d,
sq4.user__kv_day_h_click_count_sum_30d,
sq4.user__kv_category_click_count_sum_30d,
sq4.user__kv_day_h_duration_max_30d,
sq4.user__kv_category_duration_max_30d,
sq4.user__kv_day_h_click_count_max_30d,
sq4.user__kv_category_click_count_max_30d,
sq4.user__kv_day_h_duration_min_30d,
sq4.user__kv_category_duration_min_30d,
sq4.user__kv_day_h_click_count_min_30d,
sq4.user__kv_category_click_count_min_30d,
sq4.user__kv_day_h_duration_avg_30d,
sq4.user__kv_category_duration_avg_30d,
sq4.user__kv_day_h_click_count_avg_30d,
sq4.user__kv_category_click_count_avg_30d,
sq4.user__kv_day_h_duration_sum_45d,
sq4.user__kv_category_duration_sum_45d,
sq4.user__kv_day_h_click_count_sum_45d,
sq4.user__kv_category_click_count_sum_45d,
sq4.user__kv_day_h_duration_max_45d,
sq4.user__kv_category_duration_max_45d,
sq4.user__kv_day_h_click_count_max_45d,
sq4.user__kv_category_click_count_max_45d,
sq4.user__kv_day_h_duration_min_45d,
sq4.user__kv_category_duration_min_45d,
sq4.user__kv_day_h_click_count_min_45d,
sq4.user__kv_category_click_count_min_45d,
sq4.user__kv_day_h_duration_avg_45d,
sq4.user__kv_category_duration_avg_45d,
sq4.user__kv_day_h_click_count_avg_45d,
sq4.user__kv_category_click_count_avg_45d
from
(
select *
from rec_sln_demo_fs_label_table_v1
where ds = '${bdp.system.bizdate}'
) sq0
left join (
select
*
from <project_name>.rec_sln_demo_user_table_preprocess_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq1 on sq0.user_id = sq1.user_id
left join (
select
*
from <project_name>.rec_sln_demo_item_table_preprocess_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq2 on sq0.item_id = sq2.item_id
left join (
select
*
from <project_name>.rec_sln_demo_user_table_preprocess_win_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq4 on sq0.user_id = sq4.user_id
left join (
select
*
from <project_name>.rec_sln_demo_item_table_preprocess_agg_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq5 on sq0.item_id = sq5.item_id
left join (
select
*
from <project_name>.rec_sln_demo_item_table_preprocess_win_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq6 on sq0.item_id = sq6.item_id
left join (
select
*
from <project_name>.rec_sln_demo_user_table_preprocess_agg_v1
where ds = TO_CHAR(DATEADD(TO_DATE('${bdp.system.bizdate}','yyyymmdd'), - 1,'dd'),'yyyymmdd')
) sq3 on sq0.user_id = sq3.user_id
;
其中<project_name>需要替換為您的項目名稱。
五、訓練模型
獲取樣本表后,您可以直接根據樣本表來進行模型訓練,詳情請參見模型訓練。
相關文檔
特征生產的詳細過程,請參見特征平臺與特征生產。
阿里云特征平臺(FeatureStore)基本適用于所有需要特征的場景,比如推薦場景、金融風控場景、用戶增長場景。同時,FeatureStore與阿里云常用數據源引擎、推薦服務引擎完成對接,可為您提供端到端高效便捷的一站式從特征注冊管理到模型開發應用的全流程操作平臺。更多關于FeatureStore的信息,請參見FeatureStore概述。
如果您在配置或使用過程中有任何問題,可以搜索釘釘群號:34415007523,進入答疑群聯系技術人員進行咨詢。