Cylc-flow: graph parsing: space removal bug

Created on 9 Jan 2018 · 20Comments · Source: cylc/cylc-flow

This:

foo => bar baz

should be an error, but it is parsed as equivalent to:

foo => barbaz

Not likely to cause a problem (so long as strict validation, or rose suite-run, is used) because barbaz would have to be defined under [runtime] ... but still a bug.

bug

Source

hjoliver

Most helpful comment

@hjoliver I am unable to think of a quick fix for this one. Are you able to do so? If not, I am tempted to drop this to a later milestone.

In the longer term, I'd like to take a good look at our parsers, and probably try to separate the logic into stages, e.g. by adding a lexical analyser for the raw input, before parsing the tokens into a data structure and for further processing.

matthewrmshin on 26 Jan 2018

👍2

All 20 comments

This is where the problem is: https://github.com/cylc/cylc/blob/4e985c3dda52540fa2c2c5cc32accb65373f6b60/lib/cylc/graph_parser.py#L166-L167

matthewrmshin on 9 Jan 2018

Yes, for this reason:
https://github.com/cylc/cylc/blob/4e985c3dda52540fa2c2c5cc32accb65373f6b60/lib/cylc/graph_parser.py#L153-L155
(which sounds good in most respects, but it might be nasty to fix)

hjoliver on 9 Jan 2018

@hjoliver I am unable to think of a quick fix for this one. Are you able to do so? If not, I am tempted to drop this to a later milestone.

matthewrmshin on 26 Jan 2018

👍2

Same, no quick fix, sorry. At least users are unlikely to hit this bug (for the reason given in the description above).

hjoliver on 26 Jan 2018

I'm not going to explore this properly, but, could you just do something like:

    if not re.match("^((?!\w+\s+\w+).)*$", line):
        Throw an exception or spit out a warning or however you want to handle it
    # Apparently this is the fastest way to strip all whitespace!: 
    line = "".join(line.split())

I don't know if any extra characters would need to be explored there (or how you would deal with something like

foo => bar \
    baz

which I think would have to rely on strict validation to catch unles you've joined lines in advance?

ColemanTom on 31 Jan 2018

@ColemanTom - thanks for the suggestion, we'll look it at. We don't need to consider line continuation markers - those lines are concatenated first up when the file is read in.

hjoliver on 31 Jan 2018

Maybe I am over-thinking things here, but the issue is a bit more complicated because task names can contain non-alphanumeric characters. (My fault, I suppose. Should have left things simple.)

matthewrmshin on 31 Jan 2018

Ok. I assume there is a list of unacceptable characters? e.g. = > | & ( ) :?

So ^((?!\w+\s+\w+).)*$ could be ^((?![^=>|&():]+\s+[^=>|&():]+).)*$ adding any other unacceptable characters in? It feels like it should be that simple (unless of course there is a whole heap of extra complexity in the namespace definition/acceptance).

Tried it on

foo => bar baz #failed
foo => bar & baz #passed
foo => ba3$r& baz@!^ #passed
foo => (ba3$r baz@!^) #failed
foo => (ba3$r| baz@!^) #passed

ColemanTom on 31 Jan 2018

The good set of characters for a task name can be found in this module:
https://github.com/cylc/cylc/blob/master/lib/cylc/task_id.py

The problem is that we also have a lot of grammar in a graph node. E.g. We can have cycle offsets, parameters and trigger qualifiers. (The most problematic being the hyphen character -, which can be used in task names and parameters values, but is also used as the minus sign in a cycle offset, and in trigger qualifiers such as :finish-all.)

matthewrmshin on 1 Feb 2018

This may be a touch crude, but:
\A[\s(]*\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?(\s*[&|]\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?[\s)]*)*(\s*=\s*>\s*[\s(]*\s*!?\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?(\s*[&|]\s*!?\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?[\s)]*)*)*\Z

Or, a bit easier to follow...

Construct the regex string

import re
taskname_pattern = "\s*\w[\w\-+@%]*"
task_suffix_options = "(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?"
lhs_task_pattern =  standard_pattern + task_suffix_options
rhs_task_pattern = "\s*!?" + lhs_task_pattern
multiple_task_pattern = "[\s(]*{pattern}(\s*[&|]{pattern}[\s)]*)*"
lhs_multiple_task_pattern = multiple_task_pattern.format(pattern=lhs_task_pattern)
rhs_multiple_task_pattern = multiple_task_pattern.format(pattern=rhs_task_pattern)
final_pattern = "\A" + lhs_multiple_task_pattern + "(\s*=\s*>\s*" + rhs_multiple_task_pattern + ")*\Z"
checker = re.compile(final_pattern)

Functions used to test:

def check_pass():
  for p in pass_list:
      if not checker.match(p):
          print("Pass did not pass...:  " + p)

def check_fail():
  for f in fail_list:
      if checker.match(f):
          print("Fail should not have passed...:  " + f)

Syntax that you would expect to fail:

fails = """    foo => !
    foo => @
    foo => +
    foo => %
    foo => @b
    foo => +b
    foo => %b
    foo => bar baz
    get<run pets> => feed<run,pets>
    get<run,pets> => feed<run pets>
    OBS_GET<run>: => OBS_PROC<run>
    OBS_GET<run>:succeed-all => OBS_PROC<>
    foo[] => foo"""

Syntax you would expect to pass

passes = """    pre => init<run>
    pre => init< run>
    pre => init<run >
    get<run,pets> => feed<run,pets>
    get< run,pets> => feed<run,pets>
    get<run, pets> => feed<run,pets>
    get<run,pets > => feed<run,pets>
    get<run , pets> => feed<run,pets>
    get <run,pets> => feed<run,pets>
    get<run,pets-1> => get<run,pets>
    get<run,pets -1> => get<run,pets>
    get<run,pets- 1> => get<run,pets>
    get<run,pets - 1> => get<run,pets>
    get<run =1,pets=dogs> => boo
    get<run= 1,pets=dogs> => boo
    get<run = 1,pets=dogs> => boo
    get < run = 1 , pets = dogs > => boo
    get<run=1,pets=dogs>=>boo
    OBS_GET<run>:succeed-all => OBS_PROC<run>
    OBS_GET<run> :succeed-all => OBS_PROC<run>
    OBS_GET<run>: succeed-all => OBS_PROC<run>
    OBS_GET<run> : succeed-all => OBS_PROC<run>
    OBS_GET<run>:succeed -all => OBS_PROC<run>
    OBS_GET<run>:succeed- all => OBS_PROC<run>
    OBS_GET<run> : succeed - all => OBS_PROC<run>
    foo => bar
    foo => b-
    foo => b@
    foo => b+
    foo => b%
    foo => bar& baz
    foo => bar &baz
    foo => bar & baz
    foo => bar| baz
    foo => bar |baz
    foo => bar | baz
    model[-P1M] => model
    model[- P1M] => model
    model[-P1M ] => model
    model [ - P1M ] => model
    Analysis[+PT12H] => ObSensitivity
    Analysis[ +PT12H] => ObSensitivity
    Analysis[+ PT12H] => ObSensitivity
    Analysis[ + PT12H ] => ObSensitivity
    copy:expired => !proc
    copy:expired => ! proc
    foo
    foo => bar => test
    foo => _
    foo => bar_baz
    foo[-P2M]:out2 => baz
    foo[ - P2M ] : out2 => baz
    foo[ -P2M]:out2 => baz
    foo[- P2M]:out2 => baz
    foo[-P2M] :out2 => baz
    foo[-P2M]: out2 => baz
    foo[ -P2M ]:out2 => baz
    sim<r,m,c=2>[-P1Y] => sim<r,m,c=0> | bar<g,h,i=3>
    OBS_GET<run>:succeed - all&foo => OBS_PROC<run> => hello => test | boo"""

Results:

>>> check_fail()
Fail should not have passed...:      get<run pets> => feed<run,pets>
Fail should not have passed...:      get<run,pets> => feed<run pets>
>>> check_pass()
>>>

So, not quite perfectly capturing the expected failures because I didn't put super logic inside the <...> or [....] portions of a task regex to account for param space param. It wouldn't be too hard to do, but I don't fully know the syntax permissible for a parameter name for example, or the time stuff inside the [...]. Those probably get captured fine anyway because a parameter wouldn't exist that is expected, or a time syntax wouldn't be accepted.

I'm sure there are other syntax I haven't accounted for, but following the format/functions above, it would be easy to create a function to do this stuff and add more strings to a pytest framework.

I know this isn't a high priority thing, but its a good excuse for me to practice and improve my regex knowledge.

ColemanTom on 2 Feb 2018

@ColemanTom - thanks for this. I haven't had time to look at it in detail, sorry, and am about to go away for a week. But we can probably use what you've done if you can accommodate those two failing cases (and if the massive regex doesn't create a perfomance problem... and maybe it won't, in context).

hjoliver on 7 Feb 2018

Sure thing. Feel free to assign this Issue to me if you like and I'll fork/send a merge request when I find the time to finish this and adding tests. I am guessing here - https://github.com/cylc/cylc/tree/master/tests/validate ?

ColemanTom on 8 Feb 2018

Hi @ColemanTom Please feel free to raise a pull request. Yes, please add a test under https://github.com/cylc/cylc/blob/master/tests/validate/ - e.g. https://github.com/cylc/cylc/blob/master/tests/validate/64-circular.t should give you a good starting point. (Tests are shell scripts that outputs results in the Test Anything Protocol.)

matthewrmshin on 8 Feb 2018

I've done more exploration of this. I'm very disappointed with the regex ability in python. No (natural) atomic groups or possessive quantifiers which would have made things a bit cleaner. But, with some pseudo-atomic groups which Python can handle, it is quite quick.

I have discovered two bugs in my logic which I still need to fix. And some behaviour in cylc that I was not expecting. I am wondering if it is intentional and I just didn't realise, or its a quirk?

graph = """
    a => b
        => c
"""

I expected that would fail as c is not following anything, but it works and is treated as a => b => c. I could not see that as an example in the user guide. So I'm not entirely sure how to handle that. It implies that on a given line you don't need a LHS node, so I could make that LHS optional I guess, change a + to a *, not hard, but I wanted to make sure it is expected and reasonable behaviour. If it were only the line => c, and no line above, cylc aborts. So I am guessing it is expected?

ColemanTom on 19 Feb 2018

I can answer my own question, yes, so long as the => isn't the leading thing on the first line, it can be on subsequent lines. Noticed it in the code.

ColemanTom on 20 Feb 2018

@ColemanTom Yes, this is deliberate - to allow users to easily manipulate continuation lines.

matthewrmshin on 20 Feb 2018

See #1900

hjoliver on 20 Feb 2018

(graph lines can get very long, but backslash continuation is unsafe - trailing whitespace breaks it).

hjoliver on 20 Feb 2018

Another question, this time about triggering off of remote tasks: https://cylc.github.io/cylc/html/single/cug-html.html#12.26

It doesn't say explicitly in the docs, but the validation fails, so I'm assuming it cannot be applied directly with any other graph node suffix (e.g. time triggers, parameters, family triggers), they have to be completely independent? I just want to make sure my understanding of things is correct.

ColemanTom on 21 Feb 2018

Thanks for pointing that out, I'll get the docs updated.

Task state triggers do work, e.g. "poll-foo:fail => bar" (i.e. trigger task bar if polling of other::foo eventually fails).

Technically, cycle point offsets should work with the suite-polling syntax, but they don't. However it doesn't matter because you can just put your polling task on the same cycle point sequence as the target task in the remote suite (which is recommended in the docs).

Technically, family triggers and parameters should work with suite-polling syntax, but it would not be a sensible thing to do: it would imply all family member/parameter-expanded tasks separately doing exactly the same polling operation, which is unnecessary and inefficient. (Interestingly, these do not fail validation - the polling syntax comes first - but the suite-polling scripting does not get applied to the member tasks).

So there's a couple of things here that should be done better, but probably not worth the effort because suite-state polling tasks will become redundant pretty soon as a result of #2423
(and for your purposes, no need to support family and parameter syntax next to suite-polling)

hjoliver on 21 Feb 2018

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Remove reference test functionality from cylc.

hjoliver · 5Comments

Scheduler keeps running after CTRL + C (Py3.8)

kinow · 4Comments

cylc review: Server 404 error when trying Display Options for suites contain special characters

kinow · 3Comments

Remove executable bit of Python source code

kinow · 4Comments

empy: Python3 and the future of cylc support

oliver-sanders · 5Comments