
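Text collected from social media, forums, or online chats can contain ambiguous, illogical, or garbled content. Such content lowers the accuracy of data analysis and, in turn, the quality of data-driven decisions. This article describes how to use an NLP model in Elasticsearch (ES) to identify and filter out gibberish text.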

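Preparations

Upload the model

This article uses the text_classification model madhurjindal/autonlp-Gibberish-Detector-492513457 from the Hugging Face repository. To upload the model to Alibaba Cloud ES, see "Upload a third-party NLP model by using Eland".

Because access to Hugging Face from the Chinese mainland is slow, this article uploads the model offline.

Download the model: click madhurjindal--autonlp-gibberish-detector-492513457.tar.gz.
Upload the model to an ECS instance.
- Create a folder, for example model, in the root directory of the ECS instance and upload the model to it. Do not upload the model to the /root/ directory.
- The model file is large, so uploading it with WinSCP is recommended. See "Upload or download files by using WinSCP (local host running Windows)".

On the ECS instance, run the following commands to extract the model in the model directory.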

cd /model/
tar -xzvf madhurjindal--autonlp-gibberish-detector-492513457.tar.gz
cd

On the ECS instance, run the following command to upload the model to ES.

eland_import_hub_model \
  --url 'http://es-cn-lbj3l7erv0009****.elasticsearch.aliyuncs.com:9200' \
  --hub-model-id '/model/root/.cache/huggingface/hub/models--madhurjindal--autonlp-Gibberish-Detector-492513457/snapshots/c068f552cdee957e45d8773db9f7158d43902244' \
  --task-type text_classification \
  --es-username elastic \
  --es-password '****' \
  --es-model-id models--madhurjindal--autonlp-gibberish-detector
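
When the upload is scripted, the command can be assembled as an argument list so that no shell line continuations or quoting are involved. The following is a minimal sketch under stated assumptions: the function name build_eland_command is hypothetical, the URL, directory, and password values are placeholders, and only the flags that appear in the command above are used.

```python
import shlex


def build_eland_command(es_url, snapshot_dir, es_model_id, username, password):
    """Return the eland_import_hub_model invocation as an argument list.

    Only the flags used in this article are included.
    """
    return [
        "eland_import_hub_model",
        "--url", es_url,
        "--hub-model-id", snapshot_dir,
        "--task-type", "text_classification",
        "--es-username", username,
        "--es-password", password,
        "--es-model-id", es_model_id,
    ]


# Placeholder values for illustration only; substitute your own.
cmd = build_eland_command(
    es_url="http://localhost:9200",
    snapshot_dir="/model/path/to/snapshot",
    es_model_id="models--madhurjindal--autonlp-gibberish-detector",
    username="elastic",
    password="changeme",
)

# Print a shell-safe rendering of the command for review.
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd, check=True) would run the import on a machine where Eland is installed.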

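Deploy the model

Log on to Kibana. For details, see "Log on to the Kibana console".
Click the icon in the upper-left corner of the Kibana page and choose Analytics > Machine Learning.
In the left-side navigation pane, click Model Management > Trained Models.
(Optional) At the top of the page, click Synchronize your jobs and trained models, and then click Synchronize in the panel that appears.
Move the pointer to the front of the Actions column of the target model and click the icon to start the model.
In the dialog box that appears, configure the model and click Start.
A message in the lower-right corner of the page confirms that the model has started, which indicates that the deployment succeeded.
Note
If the model fails to start, the cluster may not have enough memory. Upgrade the cluster configuration and try again. To see the specific cause of the failure, click "See the full error message" in the message dialog box.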
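
The deployment can probably also be started without Kibana. The sketch below assumes an elasticsearch-py 8.x client object, whose ml.start_trained_model_deployment method wraps the start trained model deployment API. The helper name start_model_deployment is hypothetical, and the client is passed in so that the function itself makes no connection.

```python
def start_model_deployment(es, model_id, wait_for="started"):
    """Start a trained model deployment and return the API response.

    `es` is assumed to be an elasticsearch-py 8.x client.
    `wait_for` controls which allocation state the call waits for.
    """
    return es.ml.start_trained_model_deployment(
        model_id=model_id,
        wait_for=wait_for,
    )


# Example usage (requires a reachable cluster, so it is left commented out):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://<your-es-endpoint>:9200", basic_auth=("elastic", "<password>"))
# print(start_model_deployment(es, "models--madhurjindal--autonlp-gibberish-detector"))
```

If the call fails because of insufficient memory, the same remedy as in the note above applies.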

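Test the model

On the Trained Models page, in the Actions column of the deployed model, choose Test model from the menu.
In the panel that appears, test the trained model and check whether the output meets your expectations.
Description of the output:
- word salad: confused or unintelligible wording and garbled text. This label is used to detect garbled text. The higher the score, the more likely the text is garbled.
  In the test example, word salad has the highest score, which indicates that the test text is most likely garbled.
- clean: normal text.
- mild gibberish: suspected gibberish.
- noise: gibberish.
  In the test example, noise has the highest score, which indicates that the input text is most likely gibberish.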
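
The labels above can be turned into a yes/no decision by taking the highest-scoring label. The following is a minimal sketch: the function name, the choice of which labels count as gibberish, the threshold, and the example scores are all illustrative assumptions, not values produced by the model.

```python
# Labels treated as gibberish by default. Whether "mild gibberish" should be
# included is a policy choice, so it is left out here and can be passed in.
DEFAULT_GIBBERISH_LABELS = frozenset({"noise", "word salad"})


def interpret_scores(scores, gibberish_labels=DEFAULT_GIBBERISH_LABELS, threshold=0.0):
    """Return (top_label, is_gibberish) for a mapping of label -> score."""
    top_label = max(scores, key=scores.get)
    is_gibberish = top_label in gibberish_labels and scores[top_label] >= threshold
    return top_label, is_gibberish


# Made-up scores for illustration only.
example_scores = {
    "clean": 0.05,
    "mild gibberish": 0.10,
    "noise": 0.70,
    "word salad": 0.15,
}
print(interpret_scores(example_scores))
```

Raising threshold makes the decision stricter, which may help when the top score is low.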

Detect gibberish text with Kibana Dev Tools

Click the icon in the upper-left corner of the Kibana page and choose Management > Dev Tools.

Run the following requests in sequence.

1. Create an index
PUT /gibberish_index
{
  "mappings": {
    "properties": {
      "text_field": { "type": "text" }
    }
  }
}

2. Add data
POST /gibberish_index/_doc/1
{
  "text_field": "how are you"
}

POST /gibberish_index/_doc/2
{
  "text_field": "sdfgsdfg wertwert"
}

POST /gibberish_index/_doc/3
{
  "text_field": "I am not sure this makes sense"
}

POST /gibberish_index/_doc/4
{
  "text_field": "?????????????痀?糀?????????????????????"
}

POST /gibberish_index/_doc/5
{
  "text_field": "測試"
}

POST /gibberish_index/_doc/6
{
  "text_field": "????????痀?糀???????????????????"
}
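
The individual requests above could also be sent as a single _bulk request, whose body is newline-delimited JSON: one action line followed by one source line per document, ending with a newline. The sketch below only builds that body. The function name build_bulk_body is hypothetical, and only two of the sample texts are used to keep the output short.

```python
import json


def build_bulk_body(texts, start_id=1):
    """Build an NDJSON body for POST /<index>/_bulk with sequential IDs."""
    lines = []
    for doc_id, text in enumerate(texts, start=start_id):
        # Action line: index the document under the given ID.
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: the document itself; keep non-ASCII text readable.
        lines.append(json.dumps({"text_field": text}, ensure_ascii=False))
    # The bulk API requires a trailing newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body(["how are you", "sdfgsdfg wertwert"])
print(body)
```

The printed body can be pasted after POST /gibberish_index/_bulk in Dev Tools.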

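3. Create an ingest pipeline
Fields of the inference processor:
model_id: the machine learning model used for inference
target_field: the document field in which the inference result is stored
field_map.text_field: maps the input field of the document to the field that the model expects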

PUT /_ingest/pipeline/gibberish_detection_pipeline
{
  "description": "A pipeline to detect gibberish text",
  "processors": [
    {
      "inference": {
        "model_id": "models--madhurjindal--autonlp-gibberish-detector",
        "target_field": "inference_results",
        "field_map": {
          "text_field": "text_field"
        }
      }
    }
  ]
}
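
In the pipeline above the document field and the model input field happen to share the name text_field, which can hide how field_map works: the key is the document field and the value is the field the model expects. The following sketch makes this explicit. The helper name build_pipeline and the document field comment are hypothetical.

```python
def build_pipeline(model_id, source_field, target_field="inference_results"):
    """Build an ingest pipeline body with a single inference processor.

    `source_field` is the document field that holds the text. It is mapped
    to "text_field", the input name used by the model in this article.
    """
    return {
        "description": "A pipeline to detect gibberish text",
        "processors": [
            {
                "inference": {
                    "model_id": model_id,
                    "target_field": target_field,
                    "field_map": {source_field: "text_field"},
                }
            }
        ],
    }


# Hypothetical case: the text to classify is stored in a field named "comment".
pipeline = build_pipeline(
    "models--madhurjindal--autonlp-gibberish-detector",
    source_field="comment",
)
print(pipeline["processors"][0]["inference"]["field_map"])
```

With source_field="text_field" the function reproduces the body used in step 3.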

4. Update the documents in the index through the pipeline
POST /gibberish_index/_update_by_query?pipeline=gibberish_detection_pipeline

5. Search for documents that have inference results
GET /gibberish_index/_search
{
  "query": {
    "exists": {
      "field": "inference_results"
    }
  }
}

6. Run an exact query
The value of the inference_results.predicted_value.keyword field matches the string "word salad".
The value of the inference_results.prediction_probability field is greater than or equal to 0.1.

GET /gibberish_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "inference_results.predicted_value.keyword": "word salad"
          }
        },
        {
          "range": {
            "inference_results.prediction_probability": {
              "gte": 0.1
            }
          }
        }
      ]
    }
  }
}

The exact query returns the following two documents. In both, word salad is the highest-scoring label, so they are most likely garbled text.

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.0296195,
    "hits": [
      {
        "_index": "gibberish_index",
        "_id": "4",
        "_score": 2.0296195,
        "_source": {
          "text_field": "?????????????痀?糀?????????????????????",
          "inference_results": {
            "predicted_value": "word salad",
            "prediction_probability": 0.37115721979929084,
            "model_id": "models--madhurjindal--autonlp-gibberish-detector"
          }
        }
      },
      {
        "_index": "gibberish_index",
        "_id": "6",
        "_score": 2.0296195,
        "_source": {
          "text_field": "????????痀?糀???????????????????",
          "inference_results": {
            "predicted_value": "word salad",
            "prediction_probability": 0.3489011155424212,
            "model_id": "models--madhurjindal--autonlp-gibberish-detector"
          }
        }
      }
    ]
  }
}
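
To act on such a response in code, for example to delete or quarantine the flagged documents, the hits can be reduced to a list of IDs and probabilities. This is a minimal sketch: the function name flagged_documents is hypothetical, and the sample dictionary is a trimmed copy of the response above with the text fields shortened.

```python
def flagged_documents(response, label="word salad", min_probability=0.1):
    """Return (_id, probability) pairs for hits predicted as `label`."""
    flagged = []
    for hit in response["hits"]["hits"]:
        result = hit["_source"].get("inference_results", {})
        if (
            result.get("predicted_value") == label
            and result.get("prediction_probability", 0.0) >= min_probability
        ):
            flagged.append((hit["_id"], result["prediction_probability"]))
    return flagged


# Trimmed copy of the search response shown above.
sample_response = {
    "hits": {
        "hits": [
            {
                "_id": "4",
                "_source": {
                    "text_field": "(garbled text)",
                    "inference_results": {
                        "predicted_value": "word salad",
                        "prediction_probability": 0.37115721979929084,
                    },
                },
            },
            {
                "_id": "6",
                "_source": {
                    "text_field": "(garbled text)",
                    "inference_results": {
                        "predicted_value": "word salad",
                        "prediction_probability": 0.3489011155424212,
                    },
                },
            },
        ]
    }
}

print(flagged_documents(sample_response))
```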

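Detect gibberish text with Python

You can also detect gibberish text with Python. On the ECS instance, run python3 to start the Python interpreter, and then execute the following code.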

from elasticsearch import Elasticsearch

es_username = 'elastic'
es_password = '****'

# Create an Elasticsearch client instance with the basic_auth parameter
es = Elasticsearch(
    "http://es-cn-lbj3l7erv0009****.elasticsearch.aliyuncs.com:9200",
    basic_auth=(es_username, es_password)
)

# Create the index and its mapping
create_index_body = {
  "mappings": {
    "properties": {
      "text_field": { "type": "text" }
    }
  }
}
es.indices.create(index='gibberish_index2', body=create_index_body)

# Insert the documents
docs = [
    {"text_field": "how are you"},
    {"text_field": "sdfgsdfg wertwert"},
    {"text_field": "I am not sure this makes sense"},
    {"text_field": "?????????????痀?糀?????????????????????"},
    {"text_field": "測試"},
    {"text_field": "????????痀?糀???????????????????"}
]

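for i, doc in enumerate(docs):
    es.index(index='gibberish_index2', id=i+1, body=doc)

# Refresh the index so the new documents are visible to update_by_query
es.indices.refresh(index='gibberish_index2')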

# Create the inference processor and the pipeline
pipeline_body = {
    "description": "A pipeline to detect gibberish text",
    "processors": [
      {
        "inference": {
          "model_id": "models--madhurjindal--autonlp-gibberish-detector",
          "target_field": "inference_results",
          "field_map": {
            "text_field": "text_field"
          }
        }
      }
    ]
}
es.ingest.put_pipeline(id='gibberish_detection_pipeline2', body=pipeline_body)

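# Update the existing documents through the pipeline; refresh=True makes the results searchable right away
es.update_by_query(index='gibberish_index2', body={}, pipeline='gibberish_detection_pipeline2', refresh=True)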

# Search for documents that have inference results
search_body = {
  "query": {
    "exists": {
      "field": "inference_results"
    }
  }
}
response = es.search(index='gibberish_index2', body=search_body)
print(response)

# Exact query
# 1. inference_results.predicted_value.keyword matches the string "word salad"
# 2. inference_results.prediction_probability is greater than or equal to 0.1
search_query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match": {
                        "inference_results.predicted_value.keyword": "word salad"
                    }
                },
                {
                    "range": {
                        "inference_results.prediction_probability": {
                            "gte": 0.1
                        }
                    }
                }
            ]
        }
    }
}
response = es.search(index='gibberish_index2', body=search_query)
print(response)
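
The exact query above hard-codes the label and the threshold. A small builder makes it reusable for the other labels. The sketch below is one possible variant rather than the query used in this article: it uses a term query in filter context, which matches the keyword value exactly and skips relevance scoring. The function name build_label_query is hypothetical.

```python
def build_label_query(label, min_probability):
    """Build a search body for documents predicted as `label`.

    Uses filter context: both clauses are yes/no conditions, so no
    relevance score is computed for them.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"inference_results.predicted_value.keyword": label}},
                    {
                        "range": {
                            "inference_results.prediction_probability": {
                                "gte": min_probability
                            }
                        }
                    },
                ]
            }
        }
    }


query = build_label_query("word salad", 0.1)
print(query)
```

The result can be passed to the client in the same way as search_query, for example es.search(index='gibberish_index2', body=build_label_query("noise", 0.5)).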