First reported on the discuss forum: https://discuss.elastic.co/t/ml-state-is-too-big/131561
I looked at my own setup and also noticed a few model state documents belonging to jobs that no longer exist in my system.

There is no current job in my system called test_kpi - although I'm sure at one time there was and it was deleted.

I'm not sure which version the user on Discuss is running, but I'm currently using v6.2.0.
Pinging @elastic/ml-core
@richcollier Can you see any model_snapshot documents in the results index with job_id set to test_kpi?
@dimitris-athanasiou - there are no documents in .ml-anomalies for job_id:test_kpi
We found the cause of this. It was a bug that was introduced in version 6.1. When we persist the model state, we persist the state documents in .ml-state index and a model_snapshot document in the results index. Later, in order to delete the state documents, we need to have the model snapshot doc. Due to the bug, during background periodic persistence, the state documents were persisted but the model snapshot document was put in a buffer. If the job was deleted from the UI before the buffer was flushed, the snapshot documents would never be indexed, meaning the state docs would be left behind after the job was deleted.
The above bug is resolved in 6.3.0 (and 6.2.5 if that version is ever released). However, in order to ensure those documents are deleted and to prevent such cases in the future, I will work on enhancing the daily maintenance service to look for left-behind state docs and clean them up.
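For anyone who wants to check their own cluster in the meantime, the cleanup idea described above (find state docs whose job no longer exists) could be sketched roughly as follows. This is only an illustration, not the actual maintenance service code. It assumes state document IDs start with the job ID followed by a marker such as `_model_state_`, `_categorizer_state` or `_quantiles`; that ID layout and both function names are assumptions made for this example. The ID list and the set of existing job IDs would have to be fetched from the cluster separately.

```python
# Sketch only: the ID markers below are an ASSUMED layout for documents in
# .ml-state, and both function names are hypothetical helpers.
STATE_MARKERS = ("_model_state_", "_categorizer_state", "_quantiles")


def job_id_from_state_doc_id(doc_id):
    """Return the job ID prefix of a state doc ID, or None if no marker matches."""
    for marker in STATE_MARKERS:
        idx = doc_id.rfind(marker)
        if idx > 0:
            return doc_id[:idx]
    return None


def find_orphaned_state_docs(state_doc_ids, existing_job_ids):
    """List state doc IDs whose job ID is not among the existing jobs."""
    existing = set(existing_job_ids)
    orphans = []
    for doc_id in state_doc_ids:
        job_id = job_id_from_state_doc_id(doc_id)
        # Skip IDs we cannot parse rather than guessing at their owner.
        if job_id is not None and job_id not in existing:
            orphans.append(doc_id)
    return orphans
```

IDs that do not match any marker are skipped on purpose, so that nothing is flagged for deletion unless its owning job can be identified.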