Hi, I am trying to separate text from XML tags.
The first step is to extract the tags
Trying this
https://pythex.org/?regex=%3C(.*%3F)%3E&test_string=Hello%2C%20%3Csilence%201.0%3Emy%20name%20is%20Jonn%2C%20I%20am%20a%20%3Cspeed%200.2%3E%20blah%20blah%20blah%20blah%20blah&ignorecase=0&multiline=0&dotall=0&verbose=0
Works everywhere else but in godot. I get only the first match :(
Here is example code:
func _ready():
    print("TAGS:", extractXmlTags("Hello, <silence 1.0>my name is Jonn, I am a <speed 0.2> blah blah blah blah blah"))
    ## should return [silence 1.0,speed 0.2], but returns [silence 1.0]

func extractXmlTags(text):
    var NameRegEx = RegEx.new()
    NameRegEx.compile('<(.*?)>') ## also <(.*?)> ## <([^<]+)>
    NameRegEx.find(text)
    var result = NameRegEx.get_captures()
    return result
CC @leezh @StraToN
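For comparison, this is roughly the behavior being expected here, as seen in Python (which is what pythex runs). It is only a sketch using the standard `re` module; the name `extract_xml_tags` is illustrative and has nothing to do with Godot's API.

```python
import re

def extract_xml_tags(text):
    # With a single capture group, findall returns that group's contents
    # for every match in the string, not only the first one.
    return re.findall(r"<(.*?)>", text)

print("TAGS:", extract_xml_tags(
    "Hello, <silence 1.0>my name is Jonn, I am a <speed 0.2> blah blah blah blah blah"))
```

Here a single call collects every tag, which is the result the Godot snippet above was hoping for.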
Here is an expression I came up with - that gets both the tags and the text and makes them easy to extract to two separate outputs:
https://pythex.org/?regex=(%5B%5E%3C%5D%2B%7C)(%3C%5B%5E%3C%5D%2B%3E)(%5B%5E%3C%5D%2B%7C)&test_string=Hi%2C%20%3Csilence%201.0%3Emy%20name%20is%20John%2C%20I%20am%20a%20%3Cspeed%200.2%3E%20blah%20blah%20blah%20blah%20blah%20%3Cspeed%200.3%3E%20!!!%20Lets%20be%20quiet%20%3Csilence%202.0%3E.Ok%20done&ignorecase=0&multiline=0&dotall=0&verbose=0
NameRegEx.compile('([^<]+|)(<[^<]+>)([^<]+|)',-1)
Unfortunately it absolutely doesn't work in godot :'(
Does work in other engines though
I really need it atm for my npc story text parser
The design behind RegEx.find() was kinda inspired by the C++ string find. You do subsequent searches by specifying the start point, which you can do via RegExMatch.get_end(0)
var text = "ab1ab2ab3ab4"
var ex = RegEx.new()
ex.compile("ab.")
var res = ex.search(text)
while res != null:
    print(res.get_string(0))
    res = ex.search(text, res.get_end(0))
I really need to get a more intuitive API for this, but I've been pretty busy lately.
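The same "search again from where the last match ended" idea can be sketched in Python, whose compiled patterns also accept a start offset. This is an analogy rather than Godot code: `search_from_offsets` is a made-up name, and the loop assumes the pattern never matches an empty string (otherwise it would not advance).

```python
import re

def search_from_offsets(pattern, text):
    ex = re.compile(pattern)
    found = []
    res = ex.search(text)
    while res is not None:
        found.append(res.group(0))
        # Resume the search at the end of the previous match.
        res = ex.search(text, res.end(0))
    return found

print(search_from_offsets("ab.", "ab1ab2ab3ab4"))
```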
Ah, wait, the example I gave is with the 3.0 branch. Here's the solution for the 2.1 branch:
func extractXmlTags(text):
    var NameRegEx = RegEx.new()
    NameRegEx.compile('<(.*?)>')
    var start = 0
    while NameRegEx.find(text, start) >= 0:
        print(NameRegEx.get_captures())
        start = NameRegEx.get_capture_start(0) + NameRegEx.get_capture(0).length()
EDIT: Fixed the typo >=0
@leezh Thank you for the solution. If possible, please make this more intuitive. In my experience with regex in other programming languages, I never had to use a while loop to extract the matches. This is also not noted anywhere in the GDScript API documentation. Please note that GDScript is aimed at people who seek easier programming languages than C++. My personal reason for picking it up was really Python: it is very similar to Python, which is why I used a Python regex debugger to write my expressions.
The best case scenario really would be for the regex matching in godot to work the way it does in the easy popular programming languages - python being the best imo. Javascript is an alternative too.
Basically the way it works in the popular regex debuggers out there- most people write their regular expressions in online/offline debuggers. So if godot works the way standard regex debuggers do- it is more intuitive imo. :)
@leezh Unfortunately your solution doesn't work for
NameRegEx.compile('([^<]+|)(<[^<]+>)([^<]+|)',-1)
https://pythex.org/?regex=(%5B%5E%3C%5D%2B%7C)(%3C%5B%5E%3C%5D%2B%3E)(%5B%5E%3C%5D%2B%7C)&test_string=Hi%2C%20%3Csilence%201.0%3Emy%20name%20is%20John%2C%20I%20am%20a%20%3Cspeed%200.2%3E%20blah%20blah%20blah%20blah%20blah%20%3Cspeed%200.3%3E%20!!!%20Lets%20be%20quiet%20%3Csilence%202.0%3E.Ok%20done&ignorecase=0&multiline=0&dotall=0&verbose=0
Example code:
func _ready():
    var testString = "Hi, <silence 1.0>my name is John I am a <speed 0.2> blah blah blah blah blah <speed 0.3> !!! Lets be quiet <silence 2.0>.Ok done"
    # print("TAGS:",extractXmlTags(testString))
    for result in extractXmlTags(testString):
        print(">",result)

func extractXmlTags(text):
    var result = []
    var NameRegEx = RegEx.new()
    NameRegEx.compile('([^<]+|)(<[^<]+>)([^<]+|)')
    var start = 0
    while NameRegEx.find(text, start) >= 0:
        print(NameRegEx.get_captures())
        result.append(NameRegEx.get_captures()[0])
        start = NameRegEx.get_capture_start(0) + NameRegEx.get_capture(0).length()
    return result
How do I get it to print out what it does on the link?
What I want/expect it to print out:
>Hi,
>
><silence 1.0>
> my name is John, I am a
>
><speed 0.2>
>blah blah blah blah blah
>
><speed 0.3>
>!!! Lets be quiet
>
><silence 2.0>
> .Ok done
What I get in godot:
>Hi, <silence 1.0>
>my name is John I am a <speed 0.2>
> blah blah blah blah blah <speed 0.3>
> !!! Lets be quiet <silence 2.0>
Btw, sometimes you may have null/empty captures in between valid captures. Does your code address that too?
Yeah, that inconsistency was because the custom regex library in 2.1 did not behave 100% exactly as it should. It was fixed in 3.0 by replacing the back end with PCRE.
It was quite a steep change and wasn't backwards-compatible to pre-existing scripts, which is why you aren't seeing it in 2.1.
Is it possible to do what I am trying in 2.1? Godot 3 is not yet stable, so I don't want to migrate my code yet :)
Try: ([^<]+)?(<[^<]+>)([^<]+)?
@leezh
It still doesn't split them as expected:
>Hi, <silence 1.0>my name is John I am a
><speed 0.2> blah blah blah blah blah
><speed 0.3> !!! Lets be quiet
><silence 2.0>.Ok done
Ah, sorry, I was just following your pythex link as reference. The following function:
func processTags(text):
    var NameRegEx = RegEx.new()
    NameRegEx.compile('([^<]*)(<[^<]+>)?')
    var start = 0
    while NameRegEx.find(text, start) >= 0:
        # Do stuff with regular text
        print("> ", NameRegEx.get_capture(1))
        # Do stuff with tags
        print("= ", NameRegEx.get_capture(2))
        start = NameRegEx.get_capture_start(0) + NameRegEx.get_capture(0).length()
Should give you the output:
> Hi,
= <silence 1.0>
> my name is John I am a
= <speed 0.2>
> blah blah blah blah blah
= <speed 0.3>
> !!! Lets be quiet
= <silence 2.0>
> .Ok done
=
Hopefully that's more useful for you. Just replace the print with the actual functions you want.
EDIT: Changing ([^<]+) into ([^<]*) should deal with the case of text starting with a tag.
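To see how the ([^<]*)(<[^<]+>)? pattern pairs each run of text with the tag that follows it, here is a small Python sketch (Python only because the pattern was prototyped on pythex). The function name `split_text_and_tags` is hypothetical. Since the pattern can also match an empty string, the sketch skips empty matches, and it reports a missing tag as an empty string.

```python
import re

def split_text_and_tags(text):
    pairs = []
    for m in re.finditer(r"([^<]*)(<[^<]+>)?", text):
        if m.group(0) == "":
            # The pattern also matches the empty string (e.g. at the very end).
            continue
        # Group 2 is None when no tag follows the text.
        pairs.append((m.group(1), m.group(2) or ""))
    return pairs

for plain, tag in split_text_and_tags("Hi, <silence 1.0>my name is John<speed 0.2> done"):
    print("> " + plain)
    print("= " + tag)
```

Text that starts with a tag simply yields an empty text part for the first pair, which is the case the `*` quantifier was introduced for.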
@leezh Thank you for the solution. This works now. 👍
I really hope this is easier to do in godot 3 :)
Can we in the future get it to a point where we don't need to use a while loop, and NameRegEx.get_captures() simply returns an array of all the captures?
You know, the way it would if we did it in python ?
Less code and more results :p
Yeah, I could do something like that. Perhaps something like:
var ex = RegEx.new()
ex.compile("([^<]*)(<[^<]+>)?")
for match in ex.search_all(text):
    print(match.get_string(1))
    print(match.get_string(2))
Should be easy enough. I'll get that done when I'm free.
@leezh thank you ^_^ I will keep an eye on this issue for it. Godot 3 is going to be an awesome release indeed! 👍
Btw why not simply do this instead:
var ex = RegEx.new()
ex.compile("([^<]*)(<[^<]+>)?")
print( ex.search_all(text)) ## returns an array with ALL matches
Why the need to complicate it with a method nested within a for loop?
And what is 1 and 2? You have to teach the user a total of like 3-4 concepts to get there.
Learning regex is complicated enough, why make it more complicated? Why not just get Godot to do it in one step, like other languages?
@leezh Godot3's current implementation is broken! Sorry to say but try this:
([^<]+)(<[^<]+>)?
LINK
With this:
func splitXmlTags(text): ## borked
    var result = []
    var ex = RegEx.new()
    ex.compile("([^<]+)(<[^<]+>)?")
    var res = ex.search(text)
    while res != null:
        result.append(res.get_string(0))
        res = ex.search(text, res.get_end(0))
    return result
It fails to split them predictably!
Also if you try with:
ex.compile("([^<]*)(<[^<]+>)?")
Godot 3 will hang/crash entirely and fail to even start the project! =/
That is unfortunately the joy of while loops
Lets establish one thing:
If an expression that works in a regular expression debugger (such as pythex for example), fails in godot, godot is broken. The developer has no way of debugging. If it can even crash godot- even worse!
The current implementation creates complicated corner cases that turn it into a steep learning curve approach compared to python's. Please make it more predictable and less prone to failure and crash.
Ok using this fixes it:
func splitXmlTags(text): ## fixed
    var result = []
    var ex = RegEx.new()
    ex.compile("([^<]+)(<[^<]+>)?")
    var res = ex.search(text)
    while res != null:
        result.append(res.get_string(1))
        result.append(res.get_string(2))
        res = ex.search(text, res.get_end(0))
    return result
But I still think the current approach to regex is a huge pain in the neck. We should not be able to crash godot with regular expressions and we should not have to use crazy while loops to extract the results. This is just insane
Ah, right. The problem with ([^<]*)(<[^<]+>)? is that it's valid with zero-length strings, in any implementation, so once it gets to the end of your text it just loops infinitely (because a zero-length match is a valid match). And then from there, the array just grows until it runs out of memory and crashes.
And the problem with crashing isn't regex specific, because it essentially boils down to:
var result = []
while true:
    result.append(1)
Anyways, I've written RegEx.search_all() in #12915 that prevents infinite loops such as this by detecting when a result doesn't move.
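The general idea of such a guard can be sketched in Python. To be clear, this is not the code from #12915, whose details are not shown in this thread; it only illustrates one way a manual search loop can be kept from stalling on a zero-length match. The name `search_all_guarded` is made up.

```python
import re

def search_all_guarded(pattern, text):
    ex = re.compile(pattern)
    matches = []
    pos = 0
    while pos <= len(text):
        res = ex.search(text, pos)
        if res is None:
            break
        matches.append(res.group(0))
        if res.end(0) == res.start(0):
            # Zero-length match: searching again from end(0) would find the
            # same empty match forever, so step one character past it.
            pos = res.end(0) + 1
        else:
            pos = res.end(0)
    return matches

print(search_all_guarded(r"([^<]*)(<[^<]+>)?", "a<b>c"))
```

Without the zero-length check, the loop would keep finding the same empty match at the end of the text, which is exactly the runaway case described above.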
And yes, the current implementation is quite a bit of a pain. I'm not perfect but it is my free time that is going into this and there's only so much I can do at any given moment.
PS: I'm not saying stop the bug reports, because it is nice to know when something goes wrong. Just be less ... shout-y about it and understand that I'm doing this in my spare time. There's no need to give me (or anyone else for that matter) a tirade.
@leezh
Sorry if I came off as a bit shouty. That night had me frustrated in porting godot2 code to godot 3- getting things to work again. My criticism wasn't addressed personally to you, as much as to how regex is handled in godot.
I have used regex in JavaScript, Python and even in AutoHotkey, and none of them has frustrated me as much as Godot's. I am very happy to see that Godot 3 will have a better implementation than before thanks to you.
Thank you for taking the time to improve it. Your contribution is highly appreciated. I will try it once it gets merged in Godot and replace the while loops that I have atm with the new search_all() method.
Thank you also for explaining why the regex while loop crashes Godot entirely. I had a suspicion that was the case, which is why I complained that we have to rely on a while loop to extract the results.
Eh, to be honest, it wasn't that bad. Just that I had a stressful day that day and it just added to it. No worries.
Anyways, now that it's been merged, does that solve your regex issues?
@leezh Thank you for implementing this and getting it merged 👍 I tried it tonight and it works like a charm. It has now replaced all the while loops.
Sorry if I somehow added to a stressful day. I owe you a drink :)
Closing this now, as it solves my issue
@leezh It would be really cool if we had a way to extract all the strings in an array without using the get_string(1,2,3...) method
Is that even possible?
Just asking to see if there is a way to write a generic method that can be applied to any regular expression. The current approach forces the user to write a different extraction method for different regular expressions - this is something that only godot does which is why I still have the complaint that it is easier to extract results in other programming languages
See for example how autohotkey handles it:
https://autohotkey.com/docs/commands/RegExMatch.htm
You can extract the results in a single line of code that works for any regular expression.
My only gripe with it is that instead of an array , it creates subPat variables for result- which is kind of weird
Javascript is similar:
https://www.w3schools.com/jsref/jsref_match.asp
but puts them in a proper array, with the first entry being the full matched string (when the g flag is not used)
var str = "The rain in SPAIN stays mainly in the plain";
var res = str.match(/ain/g); //results in [ain,ain,ain]
another example you can try in your web browser console (F12):
var str = "This is cool";
var matches = str.match(/(This is)( cool)$/);
console.log( JSON.stringify(matches) ); // will print ["This is cool","This is"," cool"] ...
It is still much simpler than godot's gdscript approach, where even with the for loop, you will have to think about which get_string(n)s to get for different regular expressions
the search_all() method could be further simplified to work like the one used in javascript.
Having that would eliminate the need to write different functions for each different regular expression
I am reopening this. Perhaps someone could be interested in simplifying it more? Gdscript should be as easy or easier than javascript, not more complicated imo
The issue is related to:
https://github.com/godotengine/godot/issues/9066
I am closing the other one, as this one has more relevant information and @leezh actually partially addressed the other one with the new search_all() method
It feels like you're piling a few problems together, so let me tackle each problem one-by-one.
"It would be really cool if we had a way to extract all the strings in an array without using the get_string(1,2,3...) method. Is that even possible?"
You mean like RegExMatch.get_strings()?
"It is still much simpler than godot's gdscript approach, where even with the for loop, you will have to think about which get_string(n)s to get for different regular expressions"
I'm not sure what you mean by that. get_string(0) (or alternatively get_string() with no parameters) is the naive I-dont-care-about-the-structure-of-the-regex match result.
"Perhaps someone could be interested in simplifying it more? Gdscript should be as easy or easier than javascript, not more complicated imo"
While there is some extra boiler-plate lines necessary, I fail to understand how it's more complicated. Here's the first example re-written in gdscript:
var text = "The rain in SPAIN stays mainly in the plain"
var ex = RegEx.new()
ex.compile("ain")
var res = ex.search_all(text)
The only difference is that it's two lines extra. And those two extra lines are because:
a) Regex is an optional module. Not everyone uses it. Having String.match() creates a hard dependency in the core type.
b) In native modules, Object.new() cannot accept any parameters. It's a limitation of the engine.
And here's the second example re-written in gdscript:
var text = "This is cool"
var ex = RegEx.new()
ex.compile("(This is)( cool)$")
print(to_json(ex.search(text).get_strings()))
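For readers coming from Python, the "one generic extraction routine for any regular expression" idea discussed here might look like the sketch below. It is a comparison only: `all_strings` is a hypothetical helper, and it mimics the array-per-match shape (full match first, then each capture group) rather than any Godot API.

```python
import re

def all_strings(pattern, text):
    results = []
    for m in re.finditer(pattern, text):
        # Full match first, then every capture group; groups that did not
        # participate in the match are reported as empty strings.
        results.append([m.group(0)] + [g or "" for g in m.groups()])
    return results

print(all_strings(r"(This is)( cool)$", "This is cool"))
print(all_strings(r"ain", "The rain in SPAIN stays mainly in the plain"))
```

The helper never needs to know how many groups the pattern has, so the same function works for both examples.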
@leezh thank you for taking the time to explain this and also for the help on the regular expressions. I did not know these things - more importantly about the limitations.
I feel that I have pushed this issue as much as it could be pushed so I am closing it now.
The new method has solved it for me, as now I don't have to use while loops at least.
My other request to have it return all the matches in an array without the need to use specific get_string(1,2,3) methods for each different regular expression - I can do without