This:
foo => bar baz
should be an error, but it is parsed as equivalent to:
foo => barbaz
Not likely to cause a problem (so long as strict validation, or rose suite-run, is used) because barbaz would have to be defined under [runtime] ... but still a bug.
This is where the problem is: https://github.com/cylc/cylc/blob/4e985c3dda52540fa2c2c5cc32accb65373f6b60/lib/cylc/graph_parser.py#L166-L167
Yes, for this reason:
https://github.com/cylc/cylc/blob/4e985c3dda52540fa2c2c5cc32accb65373f6b60/lib/cylc/graph_parser.py#L153-L155
(which sounds good in most respects, but it might be nasty to fix)
@hjoliver I am unable to think of a quick fix for this one. Are you able to do so? If not, I am tempted to drop this to a later milestone.
In the longer term, I'd like to take a good look at our parsers, and probably try to separate the logic into stages, e.g. by adding a lexical analyser for the raw input, before parsing the tokens into a data structure and for further processing.
Same, no quick fix, sorry. At least users are unlikely to hit this bug (for the reason given in the description above).
I'm not going to explore this properly, but, could you just do something like:
if not re.match("^((?!\w+\s+\w+).)*$", line):
Throw an exception or spit out a warning or however you want to handle it
# Apparently this is the fastest way to strip all whitespace!:
line = "".join(line.split())
I don't know if any extra characters would need to be explored there (or how you would deal with something like
foo => bar \
baz
which I think would have to rely on strict validation to catch unles you've joined lines in advance?
@ColemanTom - thanks for the suggestion, we'll look it at. We don't need to consider line continuation markers - those lines are concatenated first up when the file is read in.
Maybe I am over-thinking things here, but the issue is a bit more complicated because task names can contain non-alphanumeric characters. (My fault, I suppose. Should have left things simple.)
Ok. I assume there is a list of unacceptable characters? e.g. = > | & ( ) :?
So ^((?!\w+\s+\w+).)*$ could be ^((?![^=>|&():]+\s+[^=>|&():]+).)*$ adding any other unacceptable characters in? It feels like it should be that simple (unless of course there is a whole heap of extra complexity in the namespace definition/acceptance).
Tried it on
foo => bar baz #failed
foo => bar & baz #passed
foo => ba3$r& baz@!^ #passed
foo => (ba3$r baz@!^) #failed
foo => (ba3$r| baz@!^) #passed
The good set of characters for a task name can be found in this module:
https://github.com/cylc/cylc/blob/master/lib/cylc/task_id.py
The problem is that we also have a lot of grammar in a graph node. E.g. We can have cycle offsets, parameters and trigger qualifiers. (The most problematic being the hyphen character -, which can be used in task names and parameters values, but is also used as the minus sign in a cycle offset, and in trigger qualifiers such as :finish-all.)
This may be a touch crude, but:
\A[\s(]*\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?(\s*[&|]\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?[\s)]*)*(\s*=\s*>\s*[\s(]*\s*!?\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?(\s*[&|]\s*!?\s*\w[\w\-+@%]*(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?[\s)]*)*)*\Z
Or, a bit easier to follow...
Construct the regex string
import re
taskname_pattern = "\s*\w[\w\-+@%]*"
task_suffix_options = "(\s*<[^>]+>\s*)?(\s*\[[^\]]+\]\s*)?(\s*:\s*\w[\s\w\-+]+)?"
lhs_task_pattern = standard_pattern + task_suffix_options
rhs_task_pattern = "\s*!?" + lhs_task_pattern
multiple_task_pattern = "[\s(]*{pattern}(\s*[&|]{pattern}[\s)]*)*"
lhs_multiple_task_pattern = multiple_task_pattern.format(pattern=lhs_task_pattern)
rhs_multiple_task_pattern = multiple_task_pattern.format(pattern=rhs_task_pattern)
final_pattern = "\A" + lhs_multiple_task_pattern + "(\s*=\s*>\s*" + rhs_multiple_task_pattern + ")*\Z"
checker = re.compile(final_pattern)
Functions used to test:
def check_pass():
for p in pass_list:
if not checker.match(p):
print("Pass did not pass...: " + p)
def check_fail():
for f in fail_list:
if checker.match(f):
print("Fail should not have passed...: " + f)
Syntax that you would expect to fail:
fails = """ foo => !
foo => @
foo => +
foo => %
foo => @b
foo => +b
foo => %b
foo => bar baz
get<run pets> => feed<run,pets>
get<run,pets> => feed<run pets>
OBS_GET<run>: => OBS_PROC<run>
OBS_GET<run>:succeed-all => OBS_PROC<>
foo[] => foo"""
Syntax you would expect to pass
passes = """ pre => init<run>
pre => init< run>
pre => init<run >
get<run,pets> => feed<run,pets>
get< run,pets> => feed<run,pets>
get<run, pets> => feed<run,pets>
get<run,pets > => feed<run,pets>
get<run , pets> => feed<run,pets>
get <run,pets> => feed<run,pets>
get<run,pets-1> => get<run,pets>
get<run,pets -1> => get<run,pets>
get<run,pets- 1> => get<run,pets>
get<run,pets - 1> => get<run,pets>
get<run =1,pets=dogs> => boo
get<run= 1,pets=dogs> => boo
get<run = 1,pets=dogs> => boo
get < run = 1 , pets = dogs > => boo
get<run=1,pets=dogs>=>boo
OBS_GET<run>:succeed-all => OBS_PROC<run>
OBS_GET<run> :succeed-all => OBS_PROC<run>
OBS_GET<run>: succeed-all => OBS_PROC<run>
OBS_GET<run> : succeed-all => OBS_PROC<run>
OBS_GET<run>:succeed -all => OBS_PROC<run>
OBS_GET<run>:succeed- all => OBS_PROC<run>
OBS_GET<run> : succeed - all => OBS_PROC<run>
foo => bar
foo => b-
foo => b@
foo => b+
foo => b%
foo => bar& baz
foo => bar &baz
foo => bar & baz
foo => bar| baz
foo => bar |baz
foo => bar | baz
model[-P1M] => model
model[- P1M] => model
model[-P1M ] => model
model [ - P1M ] => model
Analysis[+PT12H] => ObSensitivity
Analysis[ +PT12H] => ObSensitivity
Analysis[+ PT12H] => ObSensitivity
Analysis[ + PT12H ] => ObSensitivity
copy:expired => !proc
copy:expired => ! proc
foo
foo => bar => test
foo => _
foo => bar_baz
foo[-P2M]:out2 => baz
foo[ - P2M ] : out2 => baz
foo[ -P2M]:out2 => baz
foo[- P2M]:out2 => baz
foo[-P2M] :out2 => baz
foo[-P2M]: out2 => baz
foo[ -P2M ]:out2 => baz
sim<r,m,c=2>[-P1Y] => sim<r,m,c=0> | bar<g,h,i=3>
OBS_GET<run>:succeed - all&foo => OBS_PROC<run> => hello => test | boo"""
Results:
>>> check_fail()
Fail should not have passed...: get<run pets> => feed<run,pets>
Fail should not have passed...: get<run,pets> => feed<run pets>
>>> check_pass()
>>>
So, not quite perfectly capturing the expected failures because I didn't put super logic inside the <...> or [....] portions of a task regex to account for param space param. It wouldn't be too hard to do, but I don't fully know the syntax permissible for a parameter name for example, or the time stuff inside the [...]. Those probably get captured fine anyway because a parameter wouldn't exist that is expected, or a time syntax wouldn't be accepted.
I'm sure there are other syntax I haven't accounted for, but following the format/functions above, it would be easy to create a function to do this stuff and add more strings to a pytest framework.
I know this isn't a high priority thing, but its a good excuse for me to practice and improve my regex knowledge.
@ColemanTom - thanks for this. I haven't had time to look at it in detail, sorry, and am about to go away for a week. But we can probably use what you've done if you can accommodate those two failing cases (and if the massive regex doesn't create a perfomance problem... and maybe it won't, in context).
Sure thing. Feel free to assign this Issue to me if you like and I'll fork/send a merge request when I find the time to finish this and adding tests. I am guessing here - https://github.com/cylc/cylc/tree/master/tests/validate ?
Hi @ColemanTom Please feel free to raise a pull request. Yes, please add a test under https://github.com/cylc/cylc/blob/master/tests/validate/ - e.g. https://github.com/cylc/cylc/blob/master/tests/validate/64-circular.t should give you a good starting point. (Tests are shell scripts that outputs results in the Test Anything Protocol.)
I've done more exploration of this. I'm very disappointed with the regex ability in python. No (natural) atomic groups or possessive quantifiers which would have made things a bit cleaner. But, with some pseudo-atomic groups which Python can handle, it is quite quick.
I have discovered two bugs in my logic which I still need to fix. And some behaviour in cylc that I was not expecting. I am wondering if it is intentional and I just didn't realise, or its a quirk?
graph = """
a => b
=> c
"""
I expected that would fail as c is not following anything, but it works and is treated as a => b => c. I could not see that as an example in the user guide. So I'm not entirely sure how to handle that. It implies that on a given line you don't need a LHS node, so I could make that LHS optional I guess, change a + to a *, not hard, but I wanted to make sure it is expected and reasonable behaviour. If it were only the line => c, and no line above, cylc aborts. So I am guessing it is expected?
I can answer my own question, yes, so long as the => isn't the leading thing on the first line, it can be on subsequent lines. Noticed it in the code.
@ColemanTom Yes, this is deliberate - to allow users to easily manipulate continuation lines.
See #1900
(graph lines can get very long, but backslash continuation is unsafe - trailing whitespace breaks it).
Another question, this time about triggering off of remote tasks: https://cylc.github.io/cylc/html/single/cug-html.html#12.26
It doesn't say explicitly in the docs, but the validation fails, so I'm assuming it cannot be applied directly with any other graph node suffix (e.g. time triggers, parameters, family triggers), they have to be completely independent? I just want to make sure my understanding of things is correct.
Thanks for pointing that out, I'll get the docs updated.
Task state triggers do work, e.g. "poll-foo
Technically, cycle point offsets should work with the suite-polling syntax, but they don't. However it doesn't matter because you can just put your polling task on the same cycle point sequence as the target task in the remote suite (which is recommended in the docs).
Technically, family triggers and parameters should work with suite-polling syntax, but it would not be a sensible thing to do: it would imply all family member/parameter-expanded tasks separately doing exactly the same polling operation, which is unnecessary and inefficient. (Interestingly, these do not fail validation - the polling syntax comes first - but the suite-polling scripting does not get applied to the member tasks).
So there's a couple of things here that should be done better, but probably not worth the effort because suite-state polling tasks will become redundant pretty soon as a result of #2423
(and for your purposes, no need to support family and parameter syntax next to suite-polling)
Most helpful comment
@hjoliver I am unable to think of a quick fix for this one. Are you able to do so? If not, I am tempted to drop this to a later milestone.
In the longer term, I'd like to take a good look at our parsers, and probably try to separate the logic into stages, e.g. by adding a lexical analyser for the raw input, before parsing the tokens into a data structure and for further processing.