Elasticsearch data node restart leaves shards without a home

shazi7804

6 years ago

Today a colleague restarted an Elasticsearch data node and then found that the cluster status had turned red. This post records how to handle that situation.

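This usually happens when the number of lost data nodes exceeds index.number_of_replicas. Elasticsearch defaults to 1 replica, meaning each shard has only one backup copy, so at most 1 data node can disappear at a time. Lose more than that and some index shards go missing, and the index can no longer be assembled in full.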
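To see how many replicas a given index actually has, the index settings API can be queried. The following is a minimal sketch: ES_URL, parse_replicas and replicas_for_index are hypothetical names introduced here, and the parsing assumes the flat-settings response reports the value as a quoted number.

```shell
# Assumption: the cluster listens on localhost:9200, as in the rest of this post.
ES_URL="${ES_URL:-localhost:9200}"

# Extract the number_of_replicas value from a flat-settings JSON body on stdin.
parse_replicas() {
  sed -n 's/.*"index\.number_of_replicas" *: *"\([0-9][0-9]*\)".*/\1/p'
}

# Query one index (name passed as $1) and print its replica count.
replicas_for_index() {
  curl -s "${ES_URL}/$1/_settings/index.number_of_replicas?flat_settings=true" | parse_replicas
}
```

For example, `replicas_for_index .kibana_task_manager_1` would print the replica count for the index named in the error message.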

In this case a data node was taken offline by an operator mistake, but the index data was still on disk. Even after the data node rejoined the cluster, index shards were still reported as missing:

Search rejected due to missing shards [[.kibana_task_manager_1][0]]. Consider using `allow_partial_search_results`

Check the current cluster state at localhost:9200/_cluster/health:

{
  "cluster_name":"xxx",
  "status":"red",
  ... 
  "initializing_shards":20,
  "unassigned_shards":13983,
  "delayed_unassigned_shards":0,
  "number_of_pending_tasks":28,
  ...
}

The unassigned_shards figure stands out here. It means 13983 index shards currently have no home (no data node), which implies two things:

The index shard data still exists
These unassigned shards need to be allocated to suitable data nodes
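To see which shards are unassigned and why, the _cat/shards API can be filtered by state. This is a rough sketch: ES_URL, filter_unassigned and list_unassigned are hypothetical names, and the awk filter assumes the column order requested in the h= parameter.

```shell
# Assumption: the cluster listens on localhost:9200, as in the rest of this post.
ES_URL="${ES_URL:-localhost:9200}"

# Keep only UNASSIGNED rows from "_cat/shards" output read on stdin.
# Expected columns: index shard prirep state unassigned.reason
filter_unassigned() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Print index, shard number, primary/replica flag and reason for each unassigned shard.
list_unassigned() {
  curl -s "${ES_URL}/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | filter_unassigned
}
```

For a more detailed explanation of why a shard is not allocated, `curl -s "localhost:9200/_cluster/allocation/explain?pretty"` with no request body reports on an arbitrary unassigned shard.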

When you run into unassigned_shards, a one-off Elasticsearch transient settings update lets the shards find a data node:

curl -X PUT "localhost:9200/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}'
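It can help to check whether an explicit allocation override is in place, since "all" is already the default for cluster.routing.allocation.enable and the update above only changes behaviour if an earlier value such as "none" or "primaries" was set. A minimal sketch, with ES_URL, allocation_overrides and show_allocation_setting as hypothetical names:

```shell
# Assumption: the cluster listens on localhost:9200, as in the rest of this post.
ES_URL="${ES_URL:-localhost:9200}"

# Print every explicit "cluster.routing.allocation.enable" entry found in a
# _cluster/settings JSON body read from stdin. No output means no override is set.
allocation_overrides() {
  grep -o '"cluster\.routing\.allocation\.enable" *: *"[a-z_]*"'
}

# Fetch the persistent and transient cluster settings and show any override.
show_allocation_setting() {
  curl -s "${ES_URL}/_cluster/settings?flat_settings=true" | allocation_overrides
}
```

Running `show_allocation_setting` before and after the PUT request shows whether the value actually changed.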

Then check localhost:9200/_cluster/health again: unassigned_shards should drop steadily, and the status should go from red -> yellow -> green as the cluster returns to normal.
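Recovery can be watched with a small polling loop instead of re-running the request by hand. This is only a sketch: ES_URL, json_field and watch_recovery are hypothetical names, the 10 second interval is arbitrary, and the sed-based parsing only handles simple top-level string or number fields.

```shell
# Assumption: the cluster listens on localhost:9200, as in the rest of this post.
ES_URL="${ES_URL:-localhost:9200}"

# json_field KEY: print the value of a simple string or number field
# from a JSON body read on stdin.
json_field() {
  tr -d '\n' | sed -n 's/.*"'"$1"'" *: *"\{0,1\}\([A-Za-z0-9_]*\)"\{0,1\}.*/\1/p'
}

# Poll _cluster/health every 10 seconds until the status is green.
watch_recovery() {
  while :; do
    body=$(curl -s "${ES_URL}/_cluster/health")
    status=$(printf '%s' "$body" | json_field status)
    echo "status=${status} unassigned_shards=$(printf '%s' "$body" | json_field unassigned_shards)"
    [ "$status" = "green" ] && break
    sleep 10
  done
}
```

Alternatively, `curl -s "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s"` makes Elasticsearch itself block until the cluster is green or the timeout expires.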
