Follow on from #2966 to cover unicode support in Cylc.
What unicode should Cylc support and where?
*native: Python 3 defaults to the OS "default encoding" when opening files, which is typically UTF-8; 3.7+ has a UTF-8 mode (-X utf8 or PYTHONUTF8=1) which overrides this. Consequently these came for free with the Python 3 upgrade.
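A minimal sketch for checking this on a given machine, using only the standard library. The function names and the file name "suite.rc" used in the round trip are illustrative assumptions, not part of Cylc's code:

```python
import locale
import sys
import tempfile
from pathlib import Path


def describe_default_encoding() -> dict:
    """Report which encoding Python uses when none is passed to open()."""
    return {
        # The locale-derived encoding that open() falls back to in text mode.
        "preferred": locale.getpreferredencoding(False),
        # 1 when UTF-8 mode is active (python -X utf8 or PYTHONUTF8=1), else 0.
        "utf8_mode": sys.flags.utf8_mode,
    }


def roundtrip(text: str) -> str:
    """Write text to a temporary file and read it back as UTF-8.

    Passing encoding explicitly avoids depending on the OS default.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "suite.rc"  # hypothetical file name
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding="utf-8")


if __name__ == "__main__":
    print(describe_default_encoding())
    print(roundtrip("àðØ ФШѸ"))
```

Passing the encoding explicitly, as roundtrip does, is one way to avoid relying on the platform default at all.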
@oliver-sanders - suite names would be a good one to nail down right away. They are currently completely unrestricted, as you've noted.
Motivation: @dwsutherland needs a delimiter for these IDs in the code:
https://github.com/cylc/cylc-flow/pull/3202/files#diff-3426a12a1f378e02eac6fb12a9610a23R102
He's currently using / but hierarchical suite names are going to break that.
The minimal set of Unicode characters is probably alphanumeric plus dash, underscore, forward-slash:
'^[\w\-_/]+$'
Then @dwsutherland could use % (say) as a delimiter in the code.
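To make the proposal concrete, here is a rough sketch of how the pattern and a % delimiter could fit together. The names make_id and split_id and the owner/suite layout of the ID are assumptions for illustration only, not the code in the linked PR:

```python
import re

# Proposed pattern from the comment above: word characters, dash,
# underscore and forward-slash. In Python 3, \w matches Unicode letters
# and digits as well as ASCII ones.
SUITE_NAME = re.compile(r'^[\w\-_/]+$')

# Hypothetical delimiter; it cannot appear in a name that passes SUITE_NAME.
DELIMITER = '%'


def make_id(owner: str, suite: str) -> str:
    """Join owner and suite name into one ID (illustrative layout)."""
    # fullmatch rejects a trailing newline, which '$' alone would let through.
    if SUITE_NAME.fullmatch(suite) is None:
        raise ValueError(f'invalid suite name: {suite!r}')
    # Assumption: the owner string never contains the delimiter.
    return f'{owner}{DELIMITER}{suite}'


def split_id(external_id: str) -> tuple:
    """Split an ID back into (owner, suite) at the first delimiter."""
    owner, suite = external_id.split(DELIMITER, 1)
    return owner, suite


if __name__ == '__main__':
    ext_id = make_id('alice', 'ocean/global/run1')
    print(ext_id)
    print(split_id(ext_id))
```

Because / is allowed in the name but % is not, a hierarchical suite name survives the round trip intact.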
Is there any compelling reason to allow more than this in suite names?
The only other characters that we may want to support are .+@.
OK, I've quickly run some characters from different unicode tables against Python re.
The regex:
^[\w_\.+@-]+$
Changes we might want to make:
=, %- and . from being the first character
Conclusions:
I think we are good to go with this Regex?
abc
ABC
0123
_.+@-
àðØ
ĐĵŌ
ɐɶʍ
ΘχϢ
𐅅𐅉𐅌
ФШѸ
ԱՔփ
☺
😅
🐼
꘍
!
⁃
-
/
\
~
:
|
>
<
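One way to reproduce this kind of check is to run sample strings through Python's re module. This is a sketch only: the helper names are made up here, and the sample list is a subset of the characters above.

```python
import re

# The regex quoted in the comment above.
PATTERN = re.compile(r'^[\w_\.+@-]+$')


def is_valid(name: str) -> bool:
    """Return True if the whole string matches the proposed pattern."""
    return PATTERN.fullmatch(name) is not None


def report(samples) -> dict:
    """Map each sample string to whether it matches."""
    return {sample: is_valid(sample) for sample in samples}


# A subset of the characters tried above.
SAMPLES = [
    'abc', 'ABC', '0123', '_.+@-', 'àðØ', 'ФШѸ',
    '😅', '!', '/', '\\', '~', ':', '|', '>', '<',
]

if __name__ == '__main__':
    for sample, ok in report(SAMPLES).items():
        print(f'{sample!r}: {"match" if ok else "no match"}')
```

Letters from non-Latin scripts pass because \w is Unicode-aware for str patterns in Python 3, while emoji and the shell-significant punctuation do not.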
Motivation: @dwsutherland needs a delimiter for these IDs in the code:
https://github.com/cylc/cylc-flow/pull/3202/files#diff-3426a12a1f378e02eac6fb12a9610a23R102
How about :, it is valid in Unix filenames, but obviously not a sensible choice.
The colon : is used in PATH-like variables, so no good for anything else.
Exactly!
See also #3117 where I investigated & documented task & (to some extent) suite names that work normally, which is now in the docs here & here.
I suggest we have the same limitations on suite names as on task names, as documented above, which is largely in agreement with the comments above.
Note there is also a maximum length that suite names can safely be, given the OS restriction on file name length. It might sound a bit over the top to point this out, but I have seen some exceedingly long suite names across the MO. Can we validate on length too for extra care?
Can we validate on length too for extra care?
We can check for number of characters easily enough (can even do it in a regex), unfortunately most file-system file-name limits are in bytes.
https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
We could just slap on a 255 character limit anyway.
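The gap between characters and bytes is easy to see in a short sketch. The 255 limit and the function names here are assumptions for illustration; the real limit depends on the file system:

```python
# Assumed limit, common on many file systems but not universal.
MAX_NAME_BYTES = 255


def name_lengths(name: str) -> tuple:
    """Return (character count, UTF-8 byte count) for a name."""
    return len(name), len(name.encode('utf-8'))


def fits(name: str, limit: int = MAX_NAME_BYTES) -> bool:
    """Check the encoded size, since file-name limits are usually in bytes."""
    return len(name.encode('utf-8')) <= limit


if __name__ == '__main__':
    for name in ('abc', 'ФШѸ', '😅'):
        chars, size = name_lengths(name)
        print(f'{name!r}: {chars} characters, {size} bytes')
    # 128 two-byte characters exceed 255 bytes despite being under 255 characters.
    print(fits('Ф' * 128))
```

A character-count limit in the regex would therefore only be a safe bound for ASCII-only names; checking the encoded length is the more conservative option.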
See open questions on https://github.com/cylc/cylc-flow/pull/3274
Could we use | as the external ID delimiter?
Note - The other reason why I was originally vying for / or . was to be consistent with CLI task/job args. But as the external ID has to contain suite and owner/user info, this issue has come about.
Could we use | as the external ID delimiter?
I suppose we could (but it looks annoyingly like pipe or OR to me). I think we can go with the possibly-temporary solution and make a final decision once allowed suite-name chars are defined (within a day or so, I expect).
All items addressed, closing.