KFP oncall noticed that kubeflow-pipeline-postsubmit-integration-test is failing.
/kind bug
@Bobgy
@hilcj do you need help from other members?
It's oncall's responsibility to do the initial investigations.
But feel free to delegate to us if you think it's outside your area of expertise.
@Bobgy actually I just want to ask you if this is a known issue, because the failure started at least two weeks ago and previous oncalls may have already reported it.
If not, I'll do the investigation and get back to you.
Btw do you know where we keep track of the live issues? It seems the oncall handover notes have not been updated since Dec 18, and there has been no update to the kfp oncalls book - live issues since my last oncall in Nov.
It should be the handover notes, but I guess @Ark-kun and @IronPan didn't take them.
Did you see this problem before?
The error is from the dataflow sample test and is related to a recent fix I made for the dataflow component. Will send a fix shortly.
Postsubmit is still red with multiple errors. Reopening this; I'll investigate them one by one shortly.
There are other build failures similar to the one I fixed above. Reopening, and I'll make fixes shortly.
Awesome, thank you @chensun!
JFYI:
The latest issue with the deprecated dataflow component container build was caused by pip 21.0 dropping support for Python 2. https://github.com/pypa/pip/issues/6148 Those container images were dynamically installing the latest version of pip, which caused the build to start failing.
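To make the version constraint concrete, here is a minimal sketch of picking a pip requirement based on the interpreter version. The helper name `pip_requirement_for` is hypothetical and not part of the KFP repo. It assumes only that pip 21.0 is the first release that no longer runs on Python 2.7 (and 3.5), so older interpreters need a pin below it.

```python
import sys


def pip_requirement_for(version_info=None):
    """Return a pip requirement string that is safe for the given interpreter.

    Hypothetical helper for illustration only. pip 21.0 dropped support for
    Python 2.7 and 3.5, so those interpreters must stay below that release.
    """
    if version_info is None:
        version_info = sys.version_info
    if tuple(version_info[:2]) < (3, 6):
        # Old interpreter: stay on the release series before 21.0.
        return "pip<21"
    # Python 3.6+: no pin is needed for the purpose of this sketch.
    return "pip"


if __name__ == "__main__":
    # Prints the requirement string for the interpreter running this script.
    print(pip_requirement_for())
```

An unpinned "install the latest pip" step is what turned a new pip release into a build failure here. A pin like this keeps a Python 2 image building, though moving the image to Python 3 is the longer-term fix.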
Latest test error was:
Adding pip 21.0 to easy-install.pth file
Installing pip script to /usr/local/bin
Installing pip2.7 script to /usr/local/bin
Installing pip2 script to /usr/local/bin
Installed /usr/local/lib/python2.7/dist-packages/pip-21.0-py2.7.egg
Processing dependencies for pip
Finished processing dependencies for pip
Traceback (most recent call last):
File "/usr/local/bin/pip", line 11, in <module>
load_entry_point('pip==21.0', 'console_scripts', 'pip')()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 561, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2631, in load_entry_point
return ep.load()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2291, in load
return self.resolve()
File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2297, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/usr/local/lib/python2.7/dist-packages/pip-21.0-py2.7.egg/pip/_internal/cli/main.py", line 60
sys.stderr.write(f"ERROR: {exc}")
and this is due to the content of gs://ml-pipeline/sample-pipeline/xgboost/initialization_actions.sh
#!/bin/bash -e
# Copyright 2018 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Initialization actions to run in dataproc setup.
# The script will be run on each node in a dataproc cluster.
easy_install pip
pip install tensorflow==1.4.1
pip install pandas==0.18.1
I'm going to update its content to use Python 3.
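For context on why the traceback stops where it does: the last frame points at an f-string in pip 21.0's `main.py`, and f-strings only exist in Python 3.6 and later, so Python 2.7 cannot even parse the file. A small sketch to reproduce that, assuming nothing beyond the line shown in the traceback (`can_parse` and `FSTRING_LINE` are hypothetical names):

```python
def can_parse(source):
    """Return True if this interpreter can compile `source`, else False."""
    try:
        compile(source, "<probe>", "exec")
    except SyntaxError:
        return False
    return True


# The statement from the last traceback frame (pip/_internal/cli/main.py, line 60).
FSTRING_LINE = 'sys.stderr.write(f"ERROR: {exc}")'

if __name__ == "__main__":
    # Python 3.6+ parses the f-string; Python 2.7 rejects it as a syntax error.
    print(can_parse(FSTRING_LINE))
```

Running this with the Python 2.7 interpreter on the node versus a Python 3 interpreter shows the difference without needing to install pip at all.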
After #5062 and updating gs://ml-pipeline/sample-pipeline/xgboost/initialization_actions.sh, the previous error is fixed.
Now we get a runtime error when submitting the Dataproc Spark job:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ml/util/MLWritable$class
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.<init>(XGBoostEstimator.scala:38)
at ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator.<init>(XGBoostEstimator.scala:42)
at ml.dmlc.xgboost4j.scala.spark.XGBoost$.trainWithDataFrame(XGBoost.scala:182)
at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer$.main(XGBoostTrainer.scala:120)
at ml.dmlc.xgboost4j.scala.example.spark.XGBoostTrainer.main(XGBoostTrainer.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ml.util.MLWritable$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 17 more
21/01/30 03:22:56 INFO org.spark_project.jetty.server.AbstractConnector: Stopped Spark@72ab05ed{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
Job output is complete
I'm guessing we need to update this package [1] to accommodate the newer version of Spark that comes with the Dataproc 1.5 image.
[1] gs://ml-pipeline/sample-pipeline/xgboost/xgboost4j-example-0.8-SNAPSHOT-jar-with-dependencies.jar
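One way to check that guess: class names ending in `$class`, like the missing `MLWritable$class`, are trait implementation classes that the Scala compiler emitted up to 2.11 and stopped emitting in 2.12. A jar built for Scala 2.11 expects Spark to provide them, and a Spark build for Scala 2.12 does not. The sketch below is a rough heuristic under that assumption, not a KFP tool. `find_trait_impl_classes` is a hypothetical name, and it only lists `$class.class` entries bundled in a local copy of a jar.

```python
import zipfile


def find_trait_impl_classes(jar):
    """List entries in a jar whose names end with '$class.class'.

    Hypothetical diagnostic helper. Such entries are trait implementation
    classes emitted by Scala 2.11 and earlier, so finding any suggests the
    jar was compiled with Scala 2.11 or older.

    `jar` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(jar) as archive:
        return sorted(
            name for name in archive.namelist() if name.endswith("$class.class")
        )
```

Calling `find_trait_impl_classes` on a downloaded copy of the jar and getting a non-empty list would be consistent with a Scala version mismatch. It does not prove it, since the Scala version of Spark on the cluster still has to be confirmed separately.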
Opened https://github.com/kubeflow/pipelines/issues/5089 to track the XGBoost issue, handing over the rest to @hongye-sun.
Postsubmit is now healthy.