Pipelines: Kubeflow-pipeline-postsubmit-integration-test failure

Created on 19 Jan 2021 · 13 Comments · Source: kubeflow/pipelines

KFP Oncall noticed that kubeflow-pipeline-postsubmit-integration-test is failing.

/kind bug

area/backend kind/bug priority/p0

hilcj

All 13 comments

@Bobgy

hilcj on 19 Jan 2021

@hilcj do you need help from other members?
It's oncall's responsibility to do the initial investigations.

But feel free to delegate to us if you think it's out of your knowledge range

Bobgy on 20 Jan 2021

@Bobgy actually I just want to ask you if this is a known issue, because the failure started at least two weeks ago and previous oncalls may have already reported it.

If not, I'll do the investigation and get back to you.

Btw do you know where we keep track of the live issues? Seems the oncalls handover note was not updated since Dec 18, and no update on the kfp oncalls book - live issues since my last oncall in Nov.

hilcj on 20 Jan 2021

👍1

It should be the handover notes, but I guess @Ark-kun and @IronPan didn't take them.

Did you see this problem before?

Bobgy on 20 Jan 2021

The error is from the dataflow sample test, and it is related to a recent fix I made to the dataflow component. Will send a fix shortly.

chensun on 26 Jan 2021

👍1

Postsubmit is still red with multiple errors. Reopening this, and I'll investigate them one by one shortly.

chensun on 26 Jan 2021

There are other build failures similar to the one I fixed above. Reopening, and I'll make fixes shortly.

chensun on 27 Jan 2021

🚀1 👍1

Awesome, thank you @chensun!

Bobgy on 28 Jan 2021

JFYI:
The latest issue with the deprecated dataflow component container build was caused by pip 21.0 dropping support for Python 2. https://github.com/pypa/pip/issues/6148 Those container images were dynamically installing the latest version of pip, which caused the build to start failing.
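
To illustrate one way to avoid this class of breakage (a minimal sketch, not the fix that was applied to these images): pip 21.0 requires Python 3.6 or newer, so an image that has to stay on an older interpreter can cap pip below 21.0 instead of pulling the latest release. The helper name `pip_requirement` is hypothetical.

```python
import sys


def pip_requirement(version_info=sys.version_info):
    """Return a pip requirement string that the given interpreter can run.

    Hypothetical helper for illustration only.
    """
    # pip 21.0 dropped Python 2.7 and 3.5, so anything older than 3.6
    # must stay on a pip release below 21.0.
    if tuple(version_info[:2]) < (3, 6):
        return "pip<21.0"
    # Newer interpreters can take the latest pip.
    return "pip"
```

On a Python 2.7 image the result corresponds to running `python -m pip install --upgrade "pip<21.0"` rather than an unpinned install.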

Ark-kun on 28 Jan 2021

👍1

https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-standalone-component-test/1354926169316659200

Latest test error was:

Adding pip 21.0 to easy-install.pth file
Installing pip script to /usr/local/bin
Installing pip2.7 script to /usr/local/bin
Installing pip2 script to /usr/local/bin

Installed /usr/local/lib/python2.7/dist-packages/pip-21.0-py2.7.egg
Processing dependencies for pip
Finished processing dependencies for pip
Traceback (most recent call last):
  File "/usr/local/bin/pip", line 11, in <module>
    load_entry_point('pip==21.0', 'console_scripts', 'pip')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 561, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2631, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2291, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2297, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/usr/local/lib/python2.7/dist-packages/pip-21.0-py2.7.egg/pip/_internal/cli/main.py", line 60
    sys.stderr.write(f"ERROR: {exc}")

and this is due to the content of gs://ml-pipeline/sample-pipeline/xgboost/initialization_actions.sh

#!/bin/bash -e

# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Initialization actions to run in dataproc setup.
# The script will be run on each node in a dataproc cluster.

easy_install pip
pip install tensorflow==1.4.1
pip install pandas==0.18.1

I'm going to update its content to use Python 3.

chensun on 29 Jan 2021

After #5062 and updating gs://ml-pipeline/sample-pipeline/xgboost/initialization_actions.sh, the previous error is fixed.

Now we get a runtime error when submitting the Dataproc Spark job:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.<init>(XGBoostEstimator.scala:38)
    at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.<init>(XGBoostEstimator.scala:42)
    at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:182)
    at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer$.main(XGBoostTrainer.scala:120)
    at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer.main(XGBoostTrainer.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.util.MLWritable$class
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 17 more
21/01/30 03:22:56 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@72ab05ed{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
Job output is complete

Guessing we need to update this package [1] to accommodate the newer version of Spark that comes with the Dataproc 1.5 image.

[1] gs://ml-pipeline/sample-pipeline/xgboost/xgboost4j-example-0.8-SNAPSHOT-jar-with-dependencies.jar
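
For context on the error above: a class name ending in `$class` is how Scala 2.11 and earlier compile trait method bodies, and Scala 2.12 no longer emits those classes. A missing `...$class` therefore usually means a jar built against Scala 2.11 is running on a Spark build for Scala 2.12, which would be consistent with the guess above if the newer image ships such a build. Below is a minimal sketch for spotting this symptom in job output; the function name and regex are illustrative assumptions, not part of any Spark or Dataproc tooling.

```python
import re

# Matches the class name reported by either JVM error; the name may use
# '/' (NoClassDefFoundError) or '.' (ClassNotFoundException) separators.
_MISSING_CLASS = re.compile(
    r"(?:NoClassDefFoundError|ClassNotFoundException): ([\w./$]+)"
)


def missing_scala_trait_classes(log_text):
    """Return sorted, de-duplicated missing class names that end in '$class'."""
    names = set()
    for match in _MISSING_CLASS.finditer(log_text):
        # Normalize to dotted form so both error styles compare equal.
        name = match.group(1).replace("/", ".")
        if name.endswith("$class"):
            names.add(name)
    return sorted(names)
```

A non-empty result is only a hint to check which Scala version the jar and the cluster's Spark were built for; it does not confirm the cause on its own.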

chensun on 30 Jan 2021

Opened https://github.com/kubeflow/pipelines/issues/5089 to track the XGBoost issue, handing over the rest to @hongye-sun.

chensun on 4 Feb 2021

Postsubmit is now healthy

Bobgy on 5 Feb 2021
