Machinelearningnotebooks: PythonScriptStep fails to fail when node_count >= 2

Created on 8 Oct 2019 · 12Comments · Source: Azure/MachineLearningNotebooks

@sanpil @ninghu @sogansky @danielsc @yanrez
When I pass 2 to RunConfiguration.node_count the job will correctly execute on two nodes, but keep running forever. This does not happen if I leave node_count unmodified. This seems related to #398

@mhamilton723, perhaps a related question... what is up with this error?

log4j:ERROR Could not read configuration file from URL [file:./azureml-setup/log4j.properties].
java.io.FileNotFoundException: ./azureml-setup/log4j.properties (No such file or directory)

035afebc-174e-462c-9016-2366c8f09dc3 (Dropbox link to log files)

spark_run_config.node_count = 2
Run actually "completes" around 19/10/04 00:03:39 (see 70_driver_log_1.txt)
I manually cancel run at 2019-10-04 20:06:40Z (definitely some hours later, I'm skeptical that these timestamps are in the same timezone though...)

1c475276-e7fb-4c8b-9a97-235032dc913f (Dropbox link to log files):

no node_count declared
no 70_driver_log.txt file??

`pipeline.py` (excerpt)

def get_environment(env_name, yml_path, user_managed_dependencies, enable_docker, docker_base_image):
    env = Environment(env_name)
    cd = CondaDependencies(yml_path)
    env.python.conda_dependencies = cd
    env.python.user_managed_dependencies = user_managed_dependencies
    env.docker.enabled = enable_docker
    env.docker.base_image = docker_base_image
    return env

spark_env = get_environment(env_name='spark_env',
                              yml_path=os.path.join(os.getcwd(), 'compute/aml_config/spark_compute_dependencies.yml'),
                              user_managed_dependencies=False, enable_docker=True,
                              docker_base_image='microsoft/mmlspark:0.16')

# use pyspark framework
spark_run_config = RunConfiguration(framework="pyspark")
spark_run_config.node_count = 2 # commented out for 1c475276-e7fb-4c8b-9a97-235032dc913f
spark_run_config.environment = spark_env

roll_step = PythonScriptStep(
    name='roll.py',
    script_name='roll.py',
    arguments=['--input_dir', joined,
                '--output_dir', rolled_data,
                '--script_dir', ".",
                '--min_date', '2015-06-30',
                '--pct_rank', 'True'],
    compute_target=compute_target_spark,
    inputs=[joined],
    outputs=[rolled_data],
    runconfig=spark_run_config,
    source_directory=os.path.join(os.getcwd(), 'compute', 'roll'),
    allow_reuse=pipeline_reuse
)

`roll.py` (excerpt)

spark = (SparkSession.builder
         .master("local[*]")
         .appName("MMLSpark Docker App")
         .getOrCreate()
         )
print("spark con", spark)

print("reading TSVs")
file_location = os.path.join(args.input_dir, 'joined.tsv')

df = spark \
    .read.format("com.databricks.spark.csv") \
    .option("delimiter", "\t") \
    .option("inferSchema", "true") \
    .option("header", "true") \
    .load(file_location)

df.printSchema()
df.createOrReplaceTempView("holy_roller")
print("roll columns")

df = spark.sql(query_3m)
df.createOrReplaceTempView("holy_roller")
df = spark.sql(query_12m)
df.createOrReplaceTempView("holy_roller")
if args.pct_rank is True:
    query_rank = get_query_string(
        os.path.join(args.script_dir, 'pctrank.sql'))
    query_rankroll_3m = get_query_string(
        os.path.join(args.script_dir, 'RankRoll3m.sql'))
    query_rankroll_12m = get_query_string(
        os.path.join(args.script_dir, 'RankRoll12m.sql'))
    print("rank columns")

    df = spark.sql(query_rank)
    df.createOrReplaceTempView("holy_roller")
    print("roll the rank columns")
    df = spark.sql(query_rankroll_3m)
    df.createOrReplaceTempView("holy_roller")
    df = spark.sql(query_rankroll_12m)
    df.createOrReplaceTempView("holy_roller")

`logs/azureml/executionlogs.txt`

[2019-10-03 23:57:42Z] Experiment: ret-rollserrorinvest
[2019-10-03 23:57:42Z] Run Id:     035afebc-174e-462c-9016-2366c8f09dc3
[2019-10-03 23:57:42Z] Run target: ret-rollser-roll
[2019-10-03 23:57:43Z] Starting run in Execution Service
[2019-10-03 23:57:44Z] RunId:[035afebc-174e-462c-9016-2366c8f09dc3] ParentRunId:[cc986293-c167-4ee7-976b-85f0ce5c5ac1] ComputeTarget:[AmlCompute]
[2019-10-03 23:57:52Z] Job is running, job runstatus is Queued
[2019-10-03 23:59:47Z] Job is running, job runstatus is Queued
[2019-10-04 00:00:18Z] Job is running, job runstatus is Running
~~~1000 YEARS LATER~~~
[2019-10-04 20:03:31Z] Job is running, job runstatus is Running
[2019-10-04 20:05:35Z] Job is running, job runstatus is Running
[2019-10-04 20:06:40Z] 035afebc-174e-462c-9016-2366c8f09dc3 Job was cancelled

`azureml-logs/70_driver_log_1.txt`

starting org.apache.spark.deploy.master.Master, logging to /home/mmlspark/lib/spark/logs/spark-mmlspark-org.apache.spark.deploy.master.Master-1-55256676e2b241449e74550132a49204000001.out
failed to launch: nice -n 0 /home/mmlspark/lib/spark/bin/spark-class org.apache.spark.deploy.master.Master --host 55256676e2b241449e74550132a49204000001.ksefxc4keeku3j5c0yabdvalmb.xx.internal.cloudapp.net --port 7077 --webui-port 8080 -h 10.0.0.4 --webui-port 8080
    at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
    at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:989)
    at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
    at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:364)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    at java.lang.Thread.run(Thread.java:748)
full log in /home/mmlspark/lib/spark/logs/spark-mmlspark-org.apache.spark.deploy.master.Master-1-55256676e2b241449e74550132a49204000001.out
starting org.apache.spark.deploy.worker.Worker, logging to /home/mmlspark/lib/spark/logs/spark-mmlspark-org.apache.spark.deploy.worker.Worker-1-55256676e2b241449e74550132a49204000001.out
log4j:ERROR Could not read configuration file from URL [file:./azureml-setup/log4j.properties].
java.io.FileNotFoundException: ./azureml-setup/log4j.properties (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at java.io.FileInputStream.<init>(FileInputStream.java:93)
    at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
    at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
    at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
    at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
    at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
    at org.apache.spark.internal.Logging$class.initializeLogging(Logging.scala:120)
    at org.apache.spark.internal.Logging$class.initializeLogIfNecessary(Logging.scala:108)
    at org.apache.spark.deploy.SparkSubmit.initializeLogIfNecessary(SparkSubmit.scala:71)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:79)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
log4j:ERROR Ignoring configuration file [file:./azureml-setup/log4j.properties].
[] added as a remote repository with the name: repo-1
19/10/04 00:03:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
This is an MPI job. Rank:1
Starting the daemon thread to refresh tokens in background for process with pid = 592
all args:
{'input_dir': '/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined',
 'min_date': '2015-06-30',
 'output_dir': '/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/035afebc-174e-462c-9016-2366c8f09dc3/rolled_data',
 'pct_rank': True,
 'script_dir': '.'}
output_dir: /mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/035afebc-174e-462c-9016-2366c8f09dc3/rolled_data
reading queries
Connect to Spark
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/10/04 00:03:17 INFO SparkContext: Running Spark version 2.4.0
19/10/04 00:03:17 INFO SparkContext: Submitted application: MMLSpark Docker App
19/10/04 00:03:17 INFO SecurityManager: Changing view acls to: root
19/10/04 00:03:17 INFO SecurityManager: Changing modify acls to: root
19/10/04 00:03:17 INFO SecurityManager: Changing view acls groups to: 
19/10/04 00:03:17 INFO SecurityManager: Changing modify acls groups to: 
19/10/04 00:03:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
19/10/04 00:03:17 INFO Utils: Successfully started service 'sparkDriver' on port 38625.
19/10/04 00:03:17 INFO SparkEnv: Registering MapOutputTracker
19/10/04 00:03:17 INFO SparkEnv: Registering BlockManagerMaster
19/10/04 00:03:17 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/10/04 00:03:17 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/10/04 00:03:17 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-c7d5aacf-03f0-4f2e-b661-1292684ee735
19/10/04 00:03:17 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
19/10/04 00:03:17 INFO SparkEnv: Registering OutputCommitCoordinator
19/10/04 00:03:17 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/10/04 00:03:17 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.0.5:4040
19/10/04 00:03:17 INFO Executor: Starting executor ID driver on host localhost
19/10/04 00:03:17 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39135.
19/10/04 00:03:17 INFO NettyBlockTransferService: Server created on 10.0.0.5:39135
19/10/04 00:03:17 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/10/04 00:03:17 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.0.5, 39135, None)
19/10/04 00:03:17 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.0.5:39135 with 366.3 MB RAM, BlockManagerId(driver, 10.0.0.5, 39135, None)
19/10/04 00:03:18 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.0.5, 39135, None)
19/10/04 00:03:18 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.0.5, 39135, None)
19/10/04 00:03:18 INFO EventLoggingListener: Logging events to file:/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/workspacefilestore/azureml/035afebc-174e-462c-9016-2366c8f09dc3/azureml-logs/local-1570147397886
19/10/04 00:03:19 INFO SharedState: loading hive config file: file:/home/mmlspark/lib/spark/conf/hive-site.xml
19/10/04 00:03:19 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/workspacefilestore/azureml/035afebc-174e-462c-9016-2366c8f09dc3/spark-warehouse').
19/10/04 00:03:19 INFO SharedState: Warehouse path is 'file:/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/workspacefilestore/azureml/035afebc-174e-462c-9016-2366c8f09dc3/spark-warehouse'.
19/10/04 00:03:19 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
spark con <pyspark.sql.session.SparkSession object at 0x7f7455f01470>
reading TSVs
19/10/04 00:03:21 INFO FileSourceStrategy: Pruning directories with: 
19/10/04 00:03:21 INFO FileSourceStrategy: Post-Scan Filters: (length(trim(value#0, None)) > 0)
19/10/04 00:03:21 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
19/10/04 00:03:21 INFO FileSourceScanExec: Pushed Filters: 
19/10/04 00:03:21 INFO CodeGenerator: Code generated in 213.259158 ms
19/10/04 00:03:22 INFO CodeGenerator: Code generated in 23.640493 ms
19/10/04 00:03:22 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 279.3 KB, free 366.0 MB)
19/10/04 00:03:22 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.8 KB, free 366.0 MB)
19/10/04 00:03:22 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.0.5:39135 (size: 23.8 KB, free: 366.3 MB)
19/10/04 00:03:22 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:0
19/10/04 00:03:22 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
19/10/04 00:03:22 INFO SparkContext: Starting job: load at NativeMethodAccessorImpl.java:0
19/10/04 00:03:22 INFO DAGScheduler: Got job 0 (load at NativeMethodAccessorImpl.java:0) with 1 output partitions
19/10/04 00:03:22 INFO DAGScheduler: Final stage: ResultStage 0 (load at NativeMethodAccessorImpl.java:0)
19/10/04 00:03:22 INFO DAGScheduler: Parents of final stage: List()
19/10/04 00:03:22 INFO DAGScheduler: Missing parents: List()
19/10/04 00:03:22 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at load at NativeMethodAccessorImpl.java:0), which has no missing parents
19/10/04 00:03:22 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.8 KB, free 366.0 MB)
19/10/04 00:03:22 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 366.0 MB)
19/10/04 00:03:22 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.0.5:39135 (size: 4.5 KB, free: 366.3 MB)
19/10/04 00:03:22 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
19/10/04 00:03:22 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at load at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
19/10/04 00:03:22 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
19/10/04 00:03:22 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8475 bytes)
19/10/04 00:03:22 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/10/04 00:03:22 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 0-134217728, partition values: [empty row]
19/10/04 00:03:22 INFO CodeGenerator: Code generated in 11.01279 ms
19/10/04 00:03:24 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2477 bytes result sent to driver
19/10/04 00:03:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1800 ms on localhost (executor driver) (1/1)
19/10/04 00:03:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
19/10/04 00:03:24 INFO DAGScheduler: ResultStage 0 (load at NativeMethodAccessorImpl.java:0) finished in 1.897 s
19/10/04 00:03:24 INFO DAGScheduler: Job 0 finished: load at NativeMethodAccessorImpl.java:0, took 1.957906 s
19/10/04 00:03:24 INFO FileSourceStrategy: Pruning directories with: 
19/10/04 00:03:24 INFO FileSourceStrategy: Post-Scan Filters: 
19/10/04 00:03:24 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
19/10/04 00:03:24 INFO FileSourceScanExec: Pushed Filters: 
19/10/04 00:03:24 INFO CodeGenerator: Code generated in 7.169557 ms
19/10/04 00:03:24 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 279.3 KB, free 365.7 MB)
19/10/04 00:03:24 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.8 KB, free 365.7 MB)
19/10/04 00:03:24 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.0.0.5:39135 (size: 23.8 KB, free: 366.2 MB)
19/10/04 00:03:24 INFO SparkContext: Created broadcast 2 from load at NativeMethodAccessorImpl.java:0
19/10/04 00:03:24 INFO FileSourceScanExec: Planning scan with bin packing, max size: 134217728 bytes, open cost is considered as scanning 4194304 bytes.
19/10/04 00:03:24 INFO SparkContext: Starting job: load at NativeMethodAccessorImpl.java:0
19/10/04 00:03:24 INFO DAGScheduler: Got job 1 (load at NativeMethodAccessorImpl.java:0) with 5 output partitions
19/10/04 00:03:24 INFO DAGScheduler: Final stage: ResultStage 1 (load at NativeMethodAccessorImpl.java:0)
19/10/04 00:03:24 INFO DAGScheduler: Parents of final stage: List()
19/10/04 00:03:24 INFO DAGScheduler: Missing parents: List()
19/10/04 00:03:24 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at load at NativeMethodAccessorImpl.java:0), which has no missing parents
19/10/04 00:03:24 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 15.0 KB, free 365.7 MB)
19/10/04 00:03:24 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 8.1 KB, free 365.7 MB)
19/10/04 00:03:24 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.0.0.5:39135 (size: 8.1 KB, free: 366.2 MB)
19/10/04 00:03:24 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1161
19/10/04 00:03:24 INFO DAGScheduler: Submitting 5 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at load at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4))
19/10/04 00:03:24 INFO TaskSchedulerImpl: Adding task set 1.0 with 5 tasks
19/10/04 00:03:24 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, PROCESS_LOCAL, 8475 bytes)
19/10/04 00:03:24 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, executor driver, partition 1, PROCESS_LOCAL, 8475 bytes)
19/10/04 00:03:24 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, localhost, executor driver, partition 2, PROCESS_LOCAL, 8475 bytes)
19/10/04 00:03:24 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 4, localhost, executor driver, partition 3, PROCESS_LOCAL, 8475 bytes)
19/10/04 00:03:24 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
19/10/04 00:03:24 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
19/10/04 00:03:24 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
19/10/04 00:03:24 INFO Executor: Running task 3.0 in stage 1.0 (TID 4)
19/10/04 00:03:24 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 0-134217728, partition values: [empty row]
19/10/04 00:03:24 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 134217728-268435456, partition values: [empty row]
19/10/04 00:03:24 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 0-134217728, partition values: [empty row]
19/10/04 00:03:24 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 134217728-268435456, partition values: [empty row]
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 26
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 30
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 19
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 20
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 23
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 12
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 31
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 25
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 14
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 8
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 17
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 15
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 13
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 27
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 9
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 16
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 11
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 22
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 7
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 24
19/10/04 00:03:24 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.0.0.5:39135 in memory (size: 4.5 KB, free: 366.2 MB)
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 28
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 21
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 29
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 18
19/10/04 00:03:24 INFO ContextCleaner: Cleaned accumulator 10
19/10/04 00:03:31 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 2078 bytes result sent to driver
19/10/04 00:03:31 INFO Executor: Finished task 3.0 in stage 1.0 (TID 4). 2121 bytes result sent to driver
19/10/04 00:03:31 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 5, localhost, executor driver, partition 4, PROCESS_LOCAL, 8507 bytes)
19/10/04 00:03:31 INFO Executor: Running task 4.0 in stage 1.0 (TID 5)
19/10/04 00:03:31 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 7309 ms on localhost (executor driver) (1/5)
19/10/04 00:03:31 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 4) in 7317 ms on localhost (executor driver) (2/5)
19/10/04 00:03:31 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 268435456-316481404, partition values: [empty row]
19/10/04 00:03:31 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 2078 bytes result sent to driver
19/10/04 00:03:31 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 7358 ms on localhost (executor driver) (3/5)
19/10/04 00:03:31 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 2078 bytes result sent to driver
19/10/04 00:03:31 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 3) in 7373 ms on localhost (executor driver) (4/5)
19/10/04 00:03:33 INFO FileScanRDD: Reading File path: file:///mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/attrition_pipeline_blob/azureml/465cc0e5-3224-4d81-a7f1-acd0957954cd/joined/joined.tsv, range: 268435456-316481404, partition values: [empty row]
19/10/04 00:03:36 INFO Executor: Finished task 4.0 in stage 1.0 (TID 5). 2078 bytes result sent to driver
19/10/04 00:03:36 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 5) in 4551 ms on localhost (executor driver) (5/5)
19/10/04 00:03:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/10/04 00:03:36 INFO DAGScheduler: ResultStage 1 (load at NativeMethodAccessorImpl.java:0) finished in 11.859 s
19/10/04 00:03:36 INFO DAGScheduler: Job 1 finished: load at NativeMethodAccessorImpl.java:0, took 11.864075 s
root
19/10/04 00:03:36 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
roll columns
rank columns
roll the rank columns
~~~~~~ REDACTED PRINT STATEMENT~~~~~~

The experiment completed successfully. Finalizing run...
19/10/04 00:03:39 INFO SparkContext: Invoking stop() from shutdown hook
19/10/04 00:03:39 INFO SparkUI: Stopped Spark web UI at http://10.0.0.5:4040
19/10/04 00:03:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/10/04 00:03:39 INFO MemoryStore: MemoryStore cleared
19/10/04 00:03:39 INFO BlockManager: BlockManager stopped
19/10/04 00:03:39 INFO BlockManagerMaster: BlockManagerMaster stopped
19/10/04 00:03:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/10/04 00:03:39 INFO SparkContext: Successfully stopped SparkContext
19/10/04 00:03:39 INFO ShutdownHookManager: Shutdown hook called
19/10/04 00:03:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-9ce98f2b-bd4f-4d43-b2a8-53043e7390ae/pyspark-2218336d-11dd-4323-98a1-5f188bd47bc9
19/10/04 00:03:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-0bb8b12d-c2ee-400d-bbe0-181324eec824
19/10/04 00:03:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-9ce98f2b-bd4f-4d43-b2a8-53043e7390ae
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/workspacefilestore/azureml/035afebc-174e-462c-9016-2366c8f09dc3/azureml-setup/start_standalone_spark.py", line 54, in <module>
    os.mkdir(master_work_done_marker_path)
FileExistsError: [Errno 17] File exists: '/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/035afebc-174e-462c-9016-2366c8f09dc3/mounts/workspacefilestore/azureml/035afebc-174e-462c-9016-2366c8f09dc3/azureml-setup/standalone_spark_done'

Pipelines product-issue

Source

swanderz

All 12 comments

@swanderz, @yuezengms is investigating this.

sanpil on 8 Oct 2019

🎉1

fyi - @magelo @lobrien

j-martens on 8 Oct 2019

👍1

@swanderz we are still investigating. BTW, do you always repro or is it a random failure?

yuezengms on 9 Oct 2019

@yuezengms this happens all the time. Just ran again and its doing the same thing run_id b0d7abef-4f87-4ee4-af73-c4d1ee653146

swanderz on 10 Oct 2019

@yuezengms I'd love an update on this. My spitball theories (in no order):

log4j is improperly configured on the mmlspark docker image so isn't able to end the run.

log4j:ERROR Could not read configuration file from URL [file:./azureml-setup/log4j.properties].
java.io.FileNotFoundException: ./azureml-setup/log4j.properties (No such file or directory)

a difference b/w MPIStep and PythonScriptStep that could be causing this issue?
The MPI log for 035afebc-174e-462c-9016-2366c8f09dc3 starts with this

Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/mnt/docker/overlay2/l/K2CDFL7CDBUVNYD5YDH34RBAMN:/mnt/docker/overlay2/l/AOZOLMX37EJIL3W4RZPCKZIJV7:/mnt/docker/overlay2/l/CH7GS4ENIMAHO6TERUAP6SJMOU:/mnt/docker/overlay2/l/DRVHZ6MYJE5Y4R7ZPVHHGFQ5ZA:/mnt/docker/overlay2/l/V3TS6ECQCVRWWTRMDGZN4TRCYF:/mnt/docker/overlay2/l/WHWPWS5ZFLDYMDFU7YNEOCFHFV:/mnt/docker/overlay2/l/6OFZOWC2QRAH3WETD4C7LJNT5L:/mnt/docker/overlay2/l/ZQZCX6JOX3RZD4MOJPXI553SAO:/mnt/docker/overlay2/l/7MNGVXWL34Q77VOQQ6EIBPTHGR:/mnt/docker/overlay2/l'
Unexpected end of /proc/mounts line `/7PRZC7NMBPLULYJKYTMIY3LXIE:/mnt/docker/overlay2/l/VXQJ7MXYXORROWO4IXNHCJDBIN:/mnt/docker/overlay2/l/KEIL42RPUOCDUAFJXJOKJCOPVU:/mnt/docker/overlay2/l/TALLO33HAY5VV2H2PZKTZYSODT:/mnt/docker/overlay2/l/NPUUYZLOIKLSGTJLHTJ452FBYO:/mnt/docker/overlay2/l/MH5HRRZ74BC4ABW7DCCTHAABTK:/mnt/docker/overlay2/l/EUSJC4YRYAVMH4FW53O7PO2HZM:/mnt/docker/overlay2/l/4ORO7EE2PBN2GWWARI66XJFJ4R:/mnt/docker/overlay2/l/3K5GVUXM2AR7JX4KSQXMAHVACC:/mnt/docker/overlay2/l/D7WUWUFYDOUF5DU3VXXDJT5D67,upperdir=/mnt/docker/overlay2/d9d'
/usr/bin/ssh: /azureml-envs/azureml_3f26b7932325483a8c71f9ee8d73c8d0/lib/libcrypto.so.1.0.0: no version information available (required by /usr/bin/ssh)
/usr/bin/ssh: /azureml-envs/azureml_3f26b7932325483a8c71f9ee8d73c8d0/lib/libcrypto.so.1.0.0: no version information available (required by /usr/bin/ssh)
Warning: Permanently added '[10.0.0.5]:23' (ECDSA) to the list of known hosts.
Unexpected end of /proc/mounts line `overlay / overlay rw,relatime,lowerdir=/mnt/docker/overlay2/l/P2VVU6V3CM4JL2IG4DEFRSOHHR:/mnt/docker/overlay2/l/ZFSZHUNKTBA4LPQODWZSPOTKFG:/mnt/docker/overlay2/l/HCBNBMKVZTT7UCSU3FHEZS34OY:/mnt/docker/overlay2/l/URFPO4CYUASFAH3SV4LVLJO422:/mnt/docker/overlay2/l/OBBBCMEK5VCSFROFGCIHGCNSQA:/mnt/docker/overlay2/l/KN5WPR2XOO4MR45WUMCG5WMU4U:/mnt/docker/overlay2/l/RBFUBP2RTLNJ4CDTFZRDNDBWGU:/mnt/docker/overlay2/l/CHLBFEUMCYCY7VAAWONOPP3INI:/mnt/docker/overlay2/l/Q7DYM4CAYVYSFPYIRHFNOBVKH4:/mnt/docker/overlay2/l'
Unexpected end of /proc/mounts line `/RKXZEYW4QQZI2R75666UQEN77W:/mnt/docker/overlay2/l/RD3N55MMVR33FWHFRYA34EZTVH:/mnt/docker/overlay2/l/S5VEVSFM57UX4WXXX7CYPI56UD:/mnt/docker/overlay2/l/6BEJM6WZFZ3BFRANOP7TOOPVNO:/mnt/docker/overlay2/l/7MLGLFCHKM57P2HMVIMHSTPQLM:/mnt/docker/overlay2/l/2RCZGGLZAURK5R5H4YSKQG4DFB:/mnt/docker/overlay2/l/SYH7DQZQ5NV265EC64A7LT6Z6A:/mnt/docker/overlay2/l/EK4UPOUL7OQCZ4VSFWFC3MG4JI:/mnt/docker/overlay2/l/D5IDE6PHAW63H4LNYQKBDBX4AL:/mnt/docker/overlay2/l/6XV6PXSZN3K6XAGTPJGJAEK24B,upperdir=/mnt/docker/overlay2/c60'

swanderz on 11 Oct 2019

@swanderz I'm able to repro this issue. log4j issue is not the root cause of the stuck job. It's caused by misconfiged two master nodes competing. I'm currently debugging on the compute to find out why the nodes settings are incorrect.

yuezengms on 11 Oct 2019

🎉1

@yuezengms according to Surika, a fix has been deployed in West US II. I've used the exact same code and branch to submit the job as before. Here's the result when I pass
spark_run_config.node_count = 2.

Unfortunately :(, I'm the same behavior is happening. Fortunately, it seems the error is different this time. Below are some interesting log excerpts, and here's a gist w/ the relevant files.

Questions:
1) Is this task better suited for an MPIStep or ParallelRunStep?
2) I'm creating an MPI run anyway right? is the behavior different if I do so w/ an MPIStep?

executionlogs.txt (excerpt)

[2019-11-22 15:50:54Z] Experiment: ret-master
[2019-11-22 15:50:54Z] Run Id:     75ba19c6-a62a-4740-a554-af00df094ec3
[2019-11-22 15:50:54Z] Run target: ret-master-roll
[2019-11-22 15:50:54Z] Starting run in Execution Service
[2019-11-22 15:50:55Z] RunId:[75ba19c6-a62a-4740-a554-af00df094ec3] ParentRunId:[0b0072c6-efc9-48dd-ac6e-2a0d3f61cbab] ComputeTarget:[AmlCompute]
[2019-11-22 15:51:02Z] Job is running, job runstatus is Queued
[2019-11-22 15:52:57Z] Job is running, job runstatus is Queued
[2019-11-22 15:54:58Z] Job is running, job runstatus is Queued
[2019-11-22 15:56:15Z] Job is running, job runstatus is Running
~~~~ skip a few ~~~~
[2019-11-22 17:54:28Z] Job is running, job runstatus is Running
````
### `70_driver_log_0.txt` (excerpt)

The experiment completed successfully. Finalizing run...
19/11/22 16:04:52 INFO SparkContext: Invoking stop() from shutdown hook
19/11/22 16:04:52 INFO SparkUI: Stopped Spark web UI at http://10.0.0.4:4040
19/11/22 16:04:52 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/11/22 16:04:53 INFO MemoryStore: MemoryStore cleared
19/11/22 16:04:53 INFO BlockManager: BlockManager stopped
19/11/22 16:04:53 INFO BlockManagerMaster: BlockManagerMaster stopped
19/11/22 16:04:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/11/22 16:04:53 INFO SparkContext: Successfully stopped SparkContext
19/11/22 16:04:53 INFO ShutdownHookManager: Shutdown hook called
19/11/22 16:04:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-d9a5f056-72c6-4db9-a54d-6f4038c7587e
19/11/22 16:04:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-931c5175-5a4e-4291-87fe-451a8a7e29c5
19/11/22 16:04:53 INFO ShutdownHookManager: Deleting directory /tmp/spark-931c5175-5a4e-4291-87fe-451a8a7e29c5/pyspark-ed7625df-2b36-4311-9a3d-ae5859380f88

### `70_driver_log_1.txt` (excerpt)

The experiment completed successfully. Finalizing run...
19/11/22 16:06:01 INFO SparkContext: Invoking stop() from shutdown hook
19/11/22 16:06:01 INFO SparkUI: Stopped Spark web UI at http://10.0.0.5:4040
19/11/22 16:06:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/11/22 16:06:02 INFO MemoryStore: MemoryStore cleared
19/11/22 16:06:02 INFO BlockManager: BlockManager stopped
19/11/22 16:06:02 INFO BlockManagerMaster: BlockManagerMaster stopped
19/11/22 16:06:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/11/22 16:06:03 INFO SparkContext: Successfully stopped SparkContext
19/11/22 16:06:03 INFO ShutdownHookManager: Shutdown hook called
19/11/22 16:06:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-afefd389-46be-4c05-a130-59217011bcab/pyspark-2d31e589-c51b-486b-bb92-4295c2338045
19/11/22 16:06:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-afefd389-46be-4c05-a130-59217011bcab
19/11/22 16:06:03 INFO ShutdownHookManager: Deleting directory /tmp/spark-600924b5-6735-4512-ba86-7d62dd30bfc1
Traceback (most recent call last):
File "/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/75ba19c6-a62a-4740-a554-af00df094ec3/mounts/workspaceblobstore/azureml/75ba19c6-a62a-4740-a554-af00df094ec3/azureml-setup/start_standalone_spark.py", line 54, in
os.mkdir(master_work_done_marker_path)
FileExistsError: [Errno 17] File exists: '/mnt/batch/tasks/shared/LS_root/jobs/avadevitsmlsvc/azureml/75ba19c6-a62a-4740-a554-af00df094ec3/mounts/workspaceblobstore/azureml/75ba19c6-a62a-4740-a554-af00df094ec3/azureml-setup/standalone_spark_done'
```

swanderz on 22 Nov 2019

@swanderz , The previous release was rolled back due to other issues. We are currently rolling out the fix and should reach all regions in about one week. We apologize for the inconvenience.

yuezengms on 22 Nov 2019

@yuezengms I _think_ this is resolved now... I'm not sure how the logging works though because one of the 70_driver_log_1.txt says:

This is an MPI job. Rank:1
starting org.apache.spark.deploy.worker.Worker, logging to /home/mmlspark/lib/spark/logs/spark-mmlspark-org.apache.spark.deploy.worker.Worker-1-c96d0600f5be4e94b3a889b5621d1f30000001.out

swanderz on 9 Dec 2019

@swanderz the first line is from aml control script to log the worker. For the next line logs, I need to sync with the dev who worked on this before. Will add him to the email threa.

yuezengms on 11 Dec 2019

👍1

Multi node has been fixed and the user' issue was resolved.

yuezengms on 21 Apr 2020

Thank you for your post. This issue was resolved so we closed this thread. Should you have additional questions, please continue to post here and and we will gladly continue the discussion.