Chapel: here.runningTasks() nondeterministic

Created on 10 Aug 2018 · 6 Comments · Source: chapel-lang/chapel

Situation: two forall loops (over a Block domain) next to each other. The second loop queries here.runningTasks() to determine how many tasks to spawn on the current locale.

Issue: those runningTasks() queries may occasionally report a number >1, even though no other tasks are actually running at that moment.

This can be a performance problem: an inflated count makes the second loop spawn fewer tasks than the hardware could support, under-utilizing the CPU.
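
To make the under-utilization concrete, here is a hypothetical Chapel sketch; the sizing formula below is an assumption for illustration, not the actual module policy.

// Hypothetical sketch (assumed sizing policy, not the actual module code):
// a loop that sizes its task pool from here.runningTasks() spawns fewer
// tasks whenever the reported count is inflated, even though the CPUs
// are actually idle.
const running = here.runningTasks();
const nTasks  = max(1, here.maxTaskPar - (running - 1)); // assumed policy
coforall tid in 0..#nTasks {
  // per-task work would go here; an inflated 'running' shrinks nTasks
}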

We believe this issue caused the failure in gasnet.darwin nightly testing on 7/30/2018 in the test parallel/forall/task-private-vars/block. We are updating that test to pass dataParIgnoreRunningTasks=true so it does not query runningTasks() and thereby avoids the nondeterminism.
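
For reference, here is a minimal sketch of that kind of workaround (it mirrors the reproducer below rather than the test's actual change): constructing the Block distribution with dataParIgnoreRunningTasks=true makes its parallel iterators skip the runningTasks() query.

use BlockDist;

// Sketch of the workaround: with dataParIgnoreRunningTasks=true the
// distribution's forall iterators ignore the running-task count entirely,
// so the number of tasks used per locale is deterministic.
const D = {1..100} dmapped Block({1..100}, dataParIgnoreRunningTasks = true);
var total: int;
forall i in D with (+ reduce total) do
  total += i;
writeln(total);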

A simple reproducer is below. When I run it on 3 locales, it prints a >1 number a couple of times out of 100 runs on a linux64 desktop.

// inspired by:
//   test/parallel/forall/task-private-vars/block.chpl

use BlockDist;

config const numMessages = 36;
config const dptpl = 3;
config const ie = false;

const MessageDom = {1..numMessages};
const MessageSpace = MessageDom dmapped Block(MessageDom,
                                              dataParTasksPerLocale = dptpl,
                                              dataParIgnoreRunningTasks = ie);
forall msg in MessageSpace do
  write("."); // whatever
writeln();

coforall loc in Locales do
  on loc do
    if here.runningTasks() > 1 then
      writeln("got issue");

All 6 comments

coforall loc in Locales do
  on loc do
    writeln(here.runningTasks());   // expect 1; sometimes get 2 or 3

There is an implicit on-stmt for writeln from non-zero locales: they spawn remote tasks to locale 0, so there really are additional tasks running.

i.e., this is the same as:

coforall loc in Locales do
  on loc do
    on Locales[0] do writeln(here.runningTasks());

Take a look at one of the taskCounting tests to see how we avoid this (we just store the running-task counts into an array and print the array out at the end).
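
For illustration, a minimal sketch of that pattern (not one of the actual tests): each locale stores its count into an array element, and the array is printed once afterwards, so no writeln runs on locale 0 while the counts are being sampled.

// Sketch of the taskCounting-style check (illustrative, not the real test):
// sample each locale's running-task count first, print only afterwards.
var counts: [LocaleSpace] int;

coforall loc in Locales do
  on loc do
    counts[loc.id] = here.runningTasks();

writeln(counts);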

There is a separate discussion of whether things like writeln should be counted as running tasks (and a desire for users to mark tasks as "background" or similar), but we should open a new issue for that.

@ronawho - good point. I adjusted the test (see the main issue) and am still seeing >1 values.

Noooo, another task counting bug. This looks real, but I don't expect that it impacts real applications. Today we generate on-stmt wrappers as something like:

chpl_taskRunningCntInc();            // bump this locale's running-task count
on_fn(...);                          // run the body of the on-stmt
_downEndCount(endcount_chpl, nil);   // signal completion to the spawning coforall
chpl_taskRunningCntDec();            // drop the count, but only after the signal

The issue is that we decrement the remote end count before decrementing the local running-task count, so the coforall sync can finish and new remote tasks can be started before the previous ones have completely finished. In practice I think this will only happen with local oversubscription, but I think we can fix this by doing the decrement right before the downEndCount call.

The following program fails within a dozen trials on master:

for t in 1..50000 {
  coforall loc in Locales do on loc {
    const rt = here.runningTasks();
    if rt > 1 {
      writeln("Error trial ", t, " on locale ", loc.id, " has ", rt, " tasks");
      exit(1);
    }
  }
}

And it passes with the following patch:

diff --git a/compiler/optimizations/optimizeOnClauses.cpp b/compiler/optimizations/optimizeOnClauses.cpp
index 5577a1f054..7fdabe6994 100644
--- a/compiler/optimizations/optimizeOnClauses.cpp
+++ b/compiler/optimizations/optimizeOnClauses.cpp
@@ -588,8 +588,10 @@ static void addRunningTaskModifiers(void) {
       // will not spawn a task.
       if (fn->hasFlag(FLAG_FAST_ON) == false) {
         SET_LINENO(fn);
-        fn->insertAtHead(new CallExpr(gChplIncRunningTask));
-        fn->insertBeforeEpilogue(new CallExpr(gChplDecRunningTask));
+        if (CallExpr* downEndCount = findDownEndCount(fn)) {
+          fn->insertAtHead(new CallExpr(gChplIncRunningTask));
+          downEndCount->insertBefore(new CallExpr(gChplDecRunningTask));
+        }
       }

       // For on stmts that aren't fast or non-blocking, decrement the local

I'll try to get a PR up for this tomorrow.

A better fix is in https://github.com/chapel-lang/chapel/pull/10750 -- @vasslitvinov, do you mind testing the original issue again to verify that it's fixed?

@ronawho - I ran the tester in this issue before and after #10750. Before: a failure every 100-200 runs. After: no failures in 4900 runs. I consider the fix verified.
