What happened:
We are running a training pipeline and writing 5 result metrics to mlpipeline-metrics. The metrics do not show up on the pipeline run UI. However, when we write only 2 metrics, the UI displays them.
We are running this build: https://www.github.com/kubeflow/pipelines/commit/c9382474d688a34d3dd3b6d943cd71fea8f62e76
What did you expect to happen:
All 5 metrics are shown in the pipeline run UI.
What steps did you take:
1) Create a pipeline with a single training step. The step takes care of writing 5 metrics to /mlpipeline-metrics.json as well.
2) Upload the pipeline
3) Create an experiment and run
4) The run completes successfully
5) The run UI does not show the metrics.
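For reference, the metrics-writing part of step 1 might look roughly like the sketch below. It assumes the KFP v1 metrics file layout (a top-level `metrics` list whose entries carry `name`, `numberValue`, and `format`). The metric names and values are made up, and the output path is a parameter so the snippet can run outside a container.

```python
import json
import os
import tempfile


def write_metrics(metrics, path):
    """Write metrics in the layout assumed for mlpipeline-metrics.json."""
    payload = {
        "metrics": [
            {"name": name, "numberValue": value, "format": "RAW"}
            for name, value in metrics.items()
        ]
    }
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload


# Hypothetical metric names and values, for illustration only.
example_metrics = {
    "accuracy": 0.91,
    "precision": 0.88,
    "recall": 0.86,
    "f1-score": 0.87,
    "auc": 0.93,
}

# In a real training step the path would be /mlpipeline-metrics.json.
out_path = os.path.join(tempfile.mkdtemp(), "mlpipeline-metrics.json")
written = write_metrics(example_metrics, out_path)
print(len(written["metrics"]), "metrics written to", out_path)
```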
Anything else you would like to add:
I traced the origin of the bug. Here is what I found:
1) The metrics get stored fine in the mysql backend - mlpipeline.run_metrics table.
@IronPan The JSON might be cut off for some reason. But this time the DB does not seem to be the culprit.
@Ark-kun , @IronPan , @neuromage
AFAIK, we are using GroupConcat method to gather the metrics: https://github.com/kubeflow/pipelines/blob/81341d3aa67268d25d8c5ae1dc31df2e59e610b8/backend/src/apiserver/storage/run_store.go#L143
This translates to MySQL's GROUP_CONCAT function, so we could be hitting the MySQL group_concat_max_len limit here.
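To illustrate why this limit would match the symptom (2 metrics display, 5 do not), here is a small simulation. It is not the KFP code: the row shape, the `padding` field, and the helper names are hypothetical, chosen only so that two rows fit under 1024 bytes and five do not. It mimics GROUP_CONCAT cutting its result at group_concat_max_len, which leaves the concatenated JSON unparseable.

```python
import json

GROUP_CONCAT_MAX_LEN = 1024  # the value the variable had before it was raised


def group_concat(rows, max_len=GROUP_CONCAT_MAX_LEN, sep=","):
    """Mimic GROUP_CONCAT: join the rows, then cut the result at max_len."""
    return sep.join(rows)[:max_len]


def make_row(i):
    # Hypothetical serialized metric row; the padding stands in for
    # whatever extra fields are concatenated per metric in practice.
    return json.dumps(
        {"name": f"metric-{i}", "numberValue": 0.5, "format": "RAW", "padding": "x" * 300}
    )


def try_parse(concatenated):
    """Return the parsed list of metrics, or None if the JSON was cut off."""
    try:
        return json.loads("[" + concatenated + "]")
    except json.JSONDecodeError:
        return None


two_rows = [make_row(i) for i in range(2)]
five_rows = [make_row(i) for i in range(5)]

print(try_parse(group_concat(two_rows)) is not None)   # fits under the limit
print(try_parse(group_concat(five_rows)) is not None)  # truncated mid-row
print(try_parse(group_concat(five_rows, max_len=4194304)) is not None)
```

With the larger limit the same five rows parse again, which is consistent with the metrics reappearing once group_concat_max_len is raised.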
This issue can be reproduced. It might also be the jsonpb unmarshalling in https://github.com/kubeflow/pipelines/blob/eee7834988dae6007372963e494252a78c9f0eee/backend/src/agent/persistence/worker/metrics_reporter.go#L102. This requires a little debugging.
@gaoning777 ,
I was able to debug the issue today and can confirm the earlier hypothesis that it is a MySQL GroupConcat issue.
Here is what I did to fix the issue temporarily.
========
1) Logged into the MySQL pod corresponding to KFP and increased the value of the group_concat_max_len system variable:
set GLOBAL group_concat_max_len=4194304;
2) Refreshed the Pipelines UI and navigated to a run that was not showing the metrics earlier. The run had 6 metrics associated with it. The UI now showed at least 2 metrics based on the system variable update.
3) Logged back into the MySQL pod and reset the length for group_concat_max_len to its earlier value of 1024.
I am interested in exploring potential fixes for this bug and would like to assign it to myself. What is the best practice in KFP development for setting DB-level configuration? I was thinking of setting it here: https://github.com/kubeflow/pipelines/blob/23993486c5c6c13a238587f7af48f5f73c9919f7/backend/src/apiserver/client/sql.go#L30
What do you think?
https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/client/sql.go stores the mysql connection configuration.
And, generally, the DB configuration is stored in https://github.com/kubeflow/pipelines/blob/master/backend/src/apiserver/config/config.json and accessed via viper.
Could you add it here instead?
@jingzhang36 @IronPan FYI
@gaoning777 ,
Thanks. One more thing: How do I build and test KFP from source?
Please follow https://github.com/kubeflow/pipelines/tree/master/backend for API backend image build and use the Standalone Deployment https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize to deploy the new image for fast local testing.
/cc @IronPan The DB truncation issue strikes again.
Hi @krajasek, are you still working on the fix? If you have some PRs that need review, Ning and I can help. Thanks!
Hi @jingzhang36 , Sorry, did not get a chance to work on it last week. I am on it this week and will submit a PR by the end of this week.
@jingzhang36 , @gaoning777 - I just submitted a PR #2497 for review. Thanks.
Is there a way to fix this in 0.7? I'm basically hitting the same issue.
Also: There seems to be a limit to the decimal places that is shown in the UI. Is this intended?
My 2 cents on a temporary DIY fix for your use case:
1) sh into the MySQL k8s pod (kubeflow namespace)
2) Get into mysql shell
3) Run the following command:
set GLOBAL group_concat_max_len=4194304;
4) Exit the mysql shell and pod
5) Refresh the pipelines UI to see if the metrics show up
Few caveats:
Is this solved since #2497 has been merged?