Tasmota: Tokenizing rules to save space

Created on 28 Apr 2020 · 17 Comments · Source: arendst/Tasmota

Have you looked for this feature in other issues and in the docs?
Yes

Is your feature request related to a problem? Please describe.
Many users have asked about more room for rules. I understand the limit of 1533 bytes total was set to keep the flash used down and yet allow a reasonable amount of space for rules. But has any thought been given to storing keywords like backlog and endon, and even the Tasmota commands like power, etc., as tokens instead of full words?

Describe the solution you'd like
In older interpreted BASIC, due to limited space on floppy disk and other storage media, the commands would be stored as a single-byte token (I'm showing my age and talking DOS days and older. Compiled BASIC and even newer interpreted BASIC probably don't use this.) Tasmota has too many commands for single bytes but 2-byte tokens would be plenty. Storing the key words in such tokens would free up quite a bit of space (although, yes, it'd take more room for code to convert them so this might not really help.)

Even just allowing shorter keywords (like bl for backlog, end instead of endon, v1 instead of var1) would help (but again use more code to parse so it might not be worth it), although it would make it harder to read at a glance. And these could be optional so it'd recognize either "bl" or "backlog" as being the same.
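
To make the token idea concrete, here is a minimal sketch in standard C++ (using `std::string` rather than the Arduino `String` class). It is not how Tasmota stores rules; the keyword table, the function names `tokenize` and `detokenize`, and the encoding (two bytes with the high bit set, which assumes rule text is plain 7-bit ASCII) are all assumptions made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical keyword table; a keyword's token id is its index here.
static const std::vector<std::string> kKeywords = {"backlog", "endon", "power", "ruletimer"};

// Returns the token id of a lowercase word, or -1 if it is not a keyword.
static int findKeyword(const std::string& lower) {
  for (std::size_t k = 0; k < kKeywords.size(); ++k) {
    if (kKeywords[k] == lower) return static_cast<int>(k);
  }
  return -1;
}

// Replaces each keyword (matched case-insensitively) with a 2-byte token.
// Both token bytes have the high bit set, so they cannot be confused with
// ASCII text. Assumption: the input contains only 7-bit ASCII.
std::string tokenize(const std::string& rule) {
  std::string out;
  std::size_t i = 0;
  while (i < rule.size()) {
    if (!std::isalpha(static_cast<unsigned char>(rule[i]))) {
      out += rule[i++];
      continue;
    }
    std::size_t j = i;
    std::string lower;
    while (j < rule.size() && std::isalpha(static_cast<unsigned char>(rule[j]))) {
      lower += static_cast<char>(std::tolower(static_cast<unsigned char>(rule[j])));
      ++j;
    }
    int id = findKeyword(lower);
    if (id >= 0) {
      out += static_cast<char>(0x80 | (id >> 7));    // upper 7 bits of the id
      out += static_cast<char>(0x80 | (id & 0x7F));  // lower 7 bits of the id
    } else {
      out.append(rule, i, j - i);  // not a keyword: copy the word unchanged
    }
    i = j;
  }
  return out;
}

// Expands tokens back into their lowercase keyword for display.
std::string detokenize(const std::string& stored) {
  std::string out;
  for (std::size_t i = 0; i < stored.size(); ++i) {
    unsigned char b = static_cast<unsigned char>(stored[i]);
    if (b >= 0x80 && i + 1 < stored.size()) {
      unsigned char b2 = static_cast<unsigned char>(stored[++i]);
      std::size_t id = (static_cast<std::size_t>(b & 0x7F) << 7) | (b2 & 0x7F);
      if (id < kKeywords.size()) out += kKeywords[id];
    } else {
      out += stored[i];
    }
  }
  return out;
}
```

With this table, `on x do backlog power 1 endon` shrinks from 29 to 18 bytes and expands back unchanged. The original letter case of a keyword is lost on the round trip, and a keyword that happens to appear inside a payload (an MQTT topic, say) would be tokenized as well.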

duplicated enhancement fixed

All 17 comments

Hi,

This was already discussed in an old issue. As a summary of that, the script feature already has short commands, so if you need larger and more complex rules, you can use script instead.
Rules were originally meant for simple tasks.
Changing all the rules' commands to a short version of them will save flash space and will allow you to write a few more rules, but that would be a big breaking change, so I don't see it as feasible.

On the other hand, adding new commands as a short version of the actual commands will use a lot more flash space, so I don't see that as feasible either.

What other option or solution do you have in mind?

Actually this would not be a breaking change if it's only about the internal representation of rules. I.e. the user would still enter and see backlog, whereas it would be stored as, for example, 2 bytes. You would lose case though. For example, if you entered backlog it would be displayed as Backlog instead. Which actually could be a good feature.

We could also remove spaces between tokens.

It would definitely add code to do so, but Flash is far less precious than Settings space.

I like the idea.

For readability, tokenizing would need to "un-tokenize" when outputting the results in either status messages or displaying the rule set(s).

Also, Backlog could be made "implicit" altogether (as is done in the IF/THEN/ELSE implementation). If a user concatenates commands with semicolons, then the code could just imply the Backlog avoiding having to store that command/string.

With binary compression, binary size is definitely becoming less of an issue.

Like the idea. NO change of commands representation / syntax and winning more space for rules.

BTW, also strip extra spaces when storing a rule set and any spaces after semicolon.
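
The space-stripping step suggested here could look roughly like the following. This is a sketch in standard C++ and not Tasmota code: the function name `compactRule` is hypothetical, and it only handles the space character.

```cpp
#include <string>

// Collapses runs of spaces to a single space, drops spaces that directly
// follow a semicolon, and trims leading and trailing spaces.
// Hypothetical helper for illustration only.
std::string compactRule(const std::string& rule) {
  std::string out;
  for (char c : rule) {
    // Skip a space at the start, after another space, or after a semicolon.
    if (c == ' ' && (out.empty() || out.back() == ' ' || out.back() == ';')) {
      continue;
    }
    out += c;
  }
  if (!out.empty() && out.back() == ' ') {
    out.pop_back();  // trim the single trailing space that may remain
  }
  return out;
}
```

For example, `backlog var2 0;  RuleTimer1 130;   var3 0 ` becomes `backlog var2 0;RuleTimer1 130;var3 0`. A real implementation would have to leave alone any payload where spacing matters, such as the message text of a Publish command.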

Implicit backlog would break edge cases where semicolon is used as delimiter in commands like Publish, WebSend or SerialSend. Not sure if or how often this is used in the real world, though ...

Don't we already have to escape semicolons in Publish commands?

Regarding tokens I like the idea but keep in mind:

  • The current command implementation allows very flexible extension; commands are plugged in almost anywhere in the code without keeping score of token dependency and uniqueness.
  • Token implementation would mean the current command structure needs to be stable; new commands can only be appended to the total command list(s).
  • We already have over 256 commands so a token would need to be 16-bits.

A solution I see would be:

  • A possible implementation could be unique tokens per *.ino file, with the lower 8 bits containing the command token and the upper 8 bits discriminating between commands.ino (0), xdrv = drivers (1 - 99) and xsns = sensors (101 - 199). This way new commands can still be plugged in while keeping tokens unique.
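
The 16-bit layout proposed above can be sketched as follows. The ranges (0 for commands.ino, 1 - 99 for drivers, 101 - 199 for sensors) come from the comment; the names `makeToken`, `tokenModule`, `tokenCommand`, `tokenOrigin` and the `Origin` enum are hypothetical and do not exist in Tasmota.

```cpp
#include <cstdint>

// Packs a module id (upper 8 bits) and a per-file command index (lower 8 bits).
constexpr uint16_t makeToken(uint8_t moduleId, uint8_t commandIndex) {
  return static_cast<uint16_t>((moduleId << 8) | commandIndex);
}

// Extracts the module id from the upper 8 bits.
constexpr uint8_t tokenModule(uint16_t token) {
  return static_cast<uint8_t>(token >> 8);
}

// Extracts the per-file command index from the lower 8 bits.
constexpr uint8_t tokenCommand(uint16_t token) {
  return static_cast<uint8_t>(token & 0xFF);
}

enum class Origin { Core, Driver, Sensor, Unknown };

// Maps the module id onto the ranges named in the proposal.
constexpr Origin tokenOrigin(uint16_t token) {
  return tokenModule(token) == 0 ? Origin::Core
       : tokenModule(token) <= 99 ? Origin::Driver
       : (tokenModule(token) >= 101 && tokenModule(token) <= 199) ? Origin::Sensor
       : Origin::Unknown;
}
```

Because each file owns its own lower byte, a driver can append a command without renumbering any other file's tokens, as long as it stays under 256 commands.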

It would be a nice task to re-enumerate the commands (enumeration was removed some versions ago) and have the token parser implemented. Also keep in mind, as @meingraham already noticed, that a token also has to be translated back to a user-readable command when the rule is displayed.

This can all be done, but it needs time to implement it as flexibly as possible.

If someone wants to pick this up I'll encourage it.

It looks like some of the keywords could be shortened easily by changing code like:

plen = rule.indexOf(" ENDON"); plen2 = rule.indexOf(" BREAK"); if ((plen == -1) && (plen2 == -1)) { return serviced; } // Bad syntax - No ENDON neither BREAK

to something like:

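````
// Hypothetical helper: returns the index of the earliest occurrence of either
// keyword form in s, or -1 if neither form is present.
int myIndexOf(const String &s, const String &longForm, const String &shortForm)
{
  int a = s.indexOf(longForm);
  int b = s.indexOf(shortForm);
  if (a == -1) { return b; }
  if (b == -1) { return a; }
  return min(a, b);
}
...................

plen = myIndexOf(rule, " ENDON", " EON");
plen2 = myIndexOf(rule, " BREAK", " BRK");
if ((plen == -1) && (plen2 == -1)) { return serviced; }  // Bad syntax - neither ENDON nor BREAK

````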

I.e. let the new indexOf() take two parameters instead of just one. Then it just has to check for either's existence. Each call to myIndexOf() instead of indexOf() would only grow the code by a few bytes and would allow more flexibility in Rules without breaking any existing Rules at all. The code might grow by a couple of KB, but it would preserve the 1533-byte buffer size and be completely transparent to the currently-used rules, as the user could use either form.

edited for typos

Thanks for the input. I'm also exploring a more holistic approach with general short text compression, like https://github.com/siara-cc/Unishox
It can cut the size of rules by half.

Stay tuned

"It is useful only if saving by compressing text content is over 3000 bytes since the decompressor takes as much space."

The rules buffer is only 1533 bytes in size. I don't see how Unishox would help any with that, as the code would take up more room than is being saved (the only way it'd save is that it would allow for longer rules without changing the data storage area, but if wanting to save both flash and ram, it doesn't seem like it'd do all that well). Along with the extra code size, you'd need a ram/heap buffer of 3066 bytes to hold the expanded rules in, and ram is getting tight on the esp8266 with all the drivers, etc. Then there's the issue of how backwards compatible it is. Both the token idea I suggested (I just didn't think about this part at the time) and Unishox would allow for more than 511 bytes/rule without changing the buffer size, but then programs such as TasmotaDeviceManager would break when trying to expand it back out, since they're expecting no more than 511 chars.

The short-keywords, however, shouldn't add but a few bytes to the code for each short-code (it'd need to store both the short and long version and the function call would need to pass a second parameter. But the new myIndexOf would only add a few bytes no matter how many keywords are shortened.) So this may be the most compatible method.

I've taken a rule that I was using and shortened it as shown below. Original is 519 bytes and the short-code one is only 394 (the counts include the spaces & line breaks I added here for readability so the actual sizes are a bit less.)

That's quite a bit of savings (and, of course, I could save a few more bytes by shortening the MQTT topic and the event names, but in this case I did manage to squeeze it all in without needing either the short-codes or dropping the event/MQTT). As I said before, it may not be as readable, but the user could decide whether readability or size savings is more important on a case-by-case basis.

on Rules#Timer=1 do backlog var2 0; RuleTimer1 130; var3 0 endon on tele-DS18B20#temperature do backlog var2 1; RuleTimer1 130; event heat_ready=1; event heat_demand=%value%;var1=%value% endon on event#heat_ready>%mem1% do var2 0 endon on event#heat_demand>%mem2% do backlog publish mikes_heater/cmnd/power2 0;var3 0 endon on event#heat_demand<%mem3% do backlog publish mikes_heater/cmnd/power2 %var2%;var3 %var2% endon on ds18b20#temperature do var10=%value% endon

````
// Yes, I know rules don't allow for comments :)
// ru = rules
// bl = backlog
// en = endon
// evt = event
// t- = tele-
// rt = RuleTimer
// v = var
// m = mem
// val = value
// tmr = timer

on ru#Timer=1
do bl v2 0; rt1 130; v3 0
en
on t-DS18B20#temp
do bl v2 1; rt1 130; evt heat_ready=1; evt heat_demand=%val%;v1=%val%
en
on evt#heat_ready>%m1%
do v2 0 en
on evt#heat_demand>%m2%
do bl pu mikes_heater/cmnd/power2 0;v3 0
en
on evt#heat_demand<%mem3%
do bl pu mikes_heater/cmnd/power2 %v2%;v3 %v2%
en
on ds18b20#temp
do v10=%val%
en

````

Saving bytes in RAM (where rules are stored) is much more important than saving bytes in flash (where the compiled binary is stored). Therefore, using 3000 bytes to squeeze more rules into 1533 bytes of rules RAM can make sense.

First tests with Unishox are very promising and will not require changing anything in the syntax. Preliminary tests show between 40% and 70% memory size reduction.

The number one constraint is the Setting buffer limited to 4KB. Then RAM to a lesser extent. Anyways let's not jump to conclusions before more experimentation.

Just to keep you informed, some tests on actual rules show around 50%-70% compression (the longer the rule, the better the compression ratio). You'll be able to double or triple the rule code size. It also looks like the CPU impact of decompression is minimal.

I tried with your rule above:
RUL: Compressed from 492 to 240 (-51%)

on Rules#Timer=1 
   do backlog var2 0; RuleTimer1 130; var3 0 
endon 
on tele-DS18B20#temperature 
   do backlog var2 1; RuleTimer1 130; event heat_ready=1; event heat_demand=%value%;var1=%value% 
endon
on event#heat_ready>%mem1% 
   do var2 0 endon on event#heat_demand>%mem2% do backlog publish mikes_heater/cmnd/power2 0;var3 0 
endon 
on event#heat_demand<%mem3% 
   do backlog publish mikes_heater/cmnd/power2 %var2%;var3 %var2% 
endon 
on ds18b20#temperature 
   do var10=%value%
endon
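
As a quick check of the arithmetic behind the log line above, here is a small standard C++ sketch. The function names are hypothetical, and the capacity estimate assumes the ratio measured on this one rule would hold across the whole 1533-byte buffer, which is only a rough guide since longer rules reportedly compress better.

```cpp
#include <cmath>

// Percent change in size, rounded to the nearest integer (negative = smaller).
int percentChange(int originalBytes, int compressedBytes) {
  return static_cast<int>(
      std::lround(100.0 * (compressedBytes - originalBytes) / originalBytes));
}

// Estimated number of uncompressed rule bytes that fit in a buffer,
// assuming the same compression ratio applies throughout.
int effectiveCapacity(int bufferBytes, int originalBytes, int compressedBytes) {
  return static_cast<int>(static_cast<long>(bufferBytes) * originalBytes / compressedBytes);
}
```

Here `percentChange(492, 240)` gives -51, matching the log, and `effectiveCapacity(1533, 492, 240)` gives 3142, i.e. roughly double the uncompressed rule text in the same buffer.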

I wonder what the code impact is ;-)

Roughly 3.5KB. The original size of 3KB mentioned in the Unishox documentation was for decompression alone; I did some aggressive optimizations on code size :)

Btw, it could be used to compress some other strings in Tasmota, although my preliminary tests on JavaScript showed "only" a 20% size reduction. Let me explore though; there's a chance we can do much better.

Resolved by #8397
