Description:
After the server has started and many clients have connected, the client sometimes freezes at the "Connected" message. This started after recent commits.
Current behaviour:
The client freezes at the "Connected" message.
Expected behaviour:
The client must enter the world.
Steps to reproduce the problem:
Branch(es): 3.3.5
TC rev. hash/commit: b4b031b
TDB version: TDB_full_world_335.62_2016_10_17 + updates
Operating system: Windows
That's existed for a very long time.
It has never happened to me and the number of users is the same, only the code has changed.
How recent? Can you pinpoint a commit where there were fewer client connection issues?
Feb 20 and 21, 2017
In my case it happens once about 50-60 players are online.
with https://github.com/jackpoz/BotFarm I can usually login with 1000 users at the same time
@jackpoz It has no connection with number of connections until the connections are "safe". It should be some unhandled exception or some unclosed socket.
Last night it happened on f96f1ce; my quick fix was to go back to 1beb2e5. I am not saying that the number of online users is a pattern, just that it does not happen to the first players who log in. For now that is all the information I can provide. I can do more tests tonight.
I suspect it's a client issue, really. This happened with some regularity on retail servers too, back in Wrath.
I can confirm that, not sure how long ago, but it wasn't happening before (discovered that 2 days ago when i tried to login from same PC, 6 clients).
> The client must enter the world

Does the connected message appear before or after the character selection screen?
I reproduced the bug 1 time, the client got stuck between login and character selection (no list of characters)
> does the connected message appear before or after the character selection screen ?
Between login and character selection (no list of characters).
Exactly as @Aokromes says: when the server reaches "_that state_", anyone who logs out or logs in cannot get the list of characters. Yesterday it happened again on d939018, and this time I did not go back to an earlier version (I confirm that it works fine up to 1beb2e5). I disconnected the network cable for a few seconds and reconnected it, and the problem was solved. Obviously not a good practice.
I saw this bug in my old server before 2015 (server with 1.5k players)
it's not a "new bug"
In my case, at least since September 2016, the client never got stuck until now. The problem is that it happens sporadically and I have not identified a pattern.
It happens sometimes for me too, but it's not a new bug; I saw this bug in Dec 2016 when I came back to WoW )
I've had this happen on my single-player server with just me logging in.
In my case it's usually when I haven't had the server running for days. When the client first starts up, AHBot has to expire a crap ton of auctions (I had mine set to 500,000 for testing) which makes the core unresponsive to login attempts.
It doesn't seem to be a client problem. After updating to the latest version: TrinityCore rev. 278353673639 2017-02-21 21:02:12 +0500 (3.3.5 branch) (Unix, Release, Static)
this bug occurred. After downgrading, the server worked like before and players can join.
Can all of you at least try to be helpful? I would like all reports to provide two commit hashes, the one that worked and the one that didn't work
Try to downgrade to ae9d01a3245c59a8a8d50516a79b79250337450d if it still freezes try 4eae29d421e1d7a28aaa50d401cbbf09c50bd476
If it's a random occurrence problem, any sort of "I downgraded to X and it worked again" report is fairly pointless.
When this happens, some errors appear in the server log:
```
Received unexpected opcode [CMSG_CANCEL_TRADE 0x011C (284)] Status: STATUS_LOGGEDIN_OR_RECENTLY_LOGGOUT Reason: the player has not logged in yet and not recently logout from [Player: Account: 104225]
Received unexpected opcode [CMSG_CANCEL_TRADE 0x011C (284)] Status: STATUS_LOGGEDIN_OR_RECENTLY_LOGGOUT Reason: the player has not logged in yet and not recently logout from [Player: Account: 41985]
Received unexpected opcode [CMSG_CANCEL_TRADE 0x011C (284)] Status: STATUS_LOGGEDIN_OR_RECENTLY_LOGGOUT Reason: the player has not logged in yet and not recently logout from [Player: Account: 105562]
Received unexpected opcode [CMSG_CANCEL_TRADE 0x011C (284)] Status: STATUS_LOGGEDIN_OR_RECENTLY_LOGGOUT Reason: the player has not logged in yet and not recently logout from [Player: Account: 96033]
```
again on f612b1c :(
Just today i got 2 freezes with 90 players, it doesn't freeze at the startup, but after the server has been online for a while.
Rev: 340ce38e01fed2e523784ee7a80f9fa25782d447
Can anyone share the latest Rev where this issue does not happen?
So here is what I know so far...
This is the newest rev I tried that STILL has the issue: 4eae29d421e1d7a28aaa50d401cbbf09c50bd476
This rev doesn't have the issue: d42faefe9a8c1a3f805e34cf0985dcab109ff8f5. I have been using it for 2 days now without issues.
After looking through all the commits between those two revs, I think the issue is related to https://github.com/TrinityCore/TrinityCore/commit/4c27203c8f36dd2a5df0a4ae69fbdc4c9140b29d or the fixup commits after it.
Right now I can't test any more for a couple of days, to give players a sense of stability, but I'll try updating all the way to f57132b795a097a2c4c863a8153b0c1be5e008c0 (one commit before the one I think is causing the issue) and report back here.
@Shauren
Could you try reverting 684a5fd3f1895703a52cff7e7f762883c74c5aba?
@ariel- this freeze happens rarely; you can run the server for over 7 days of uptime without any problem.
@Noryad I'm confused, reading through all the commits and doing tons of commit testing.
You originally said that you went to https://github.com/TrinityCore/TrinityCore/commit/1beb2e5fd6e85332173b1f3e414d5b385c3022fb to revert back because of the incident of freezing.
However, later you say that it works UNTIL https://github.com/TrinityCore/TrinityCore/commit/1beb2e5fd6e85332173b1f3e414d5b385c3022fb
is that correct?
I detect it after 1beb2e5
@Killyana I had it happen 2 days in a row, about 4-5 times each day, once the server reached 70+ players. Before reaching 70 I had it working without issues for 5 days.
@Noryad What rev are you using right now and how many players do you have on average ?
@ariel- Tried but having issues installing Boost 1.63 on Debian
Out of curiosity, did you try tweaking the player save interval? Maybe there's a bug with handling too many players.
@ariel- : Sadly reverting just this commit would not be enough, as this commit was fixing some compilation errors :(.
@Tonghost Can you take a look at this? It looks like your commit https://github.com/TrinityCore/TrinityCore/commit/684a5fd3f1895703a52cff7e7f762883c74c5aba is involved.
So when this issue starts, you have to wait a long time before the authserver responds and you get connected; it can take more than 40 seconds. But after some time, it will not respond at all. It looks like something is making it lag.
> But after some time, it will not respond at all. It looks like something is making it lag.
Probably a long shot but this is what happens to me if I have too many items from AHBot expiring at the same time and I haven't started the server in a while. Has anything changed in the bot?
@MrSmite I didn't have AHBot enabled, and the save interval was 90 seconds (the default) for 70-90 players on average.
The DB for that server was brand new so there wasn't much to load.
Hi, I updated my server to 8e1e081d6cd3826a46e96cac211d5c12f5d8536b two days ago, but after a few minutes players can't log in. Restarting authserver does not help; I have to restart worldserver to fix it, and it happens again very soon after each restart.
Before that I was on 91201f11f805a7bd21291717af8008c6b5e855fe and didn't have login problems.
This is a really critical bug, please fix it soon if you can.
This commit is the problem source: 3ddcf40037bb032c429c9ad6f0817f0796fda535
`std::numeric_limits<time_t>::max()` causes an overflow in linked respawn (linked spawns are set to boss + 1-59 secs); this causes a database loop in the respawn function.
The easy fix is to just clear the respawn table of entries lower than 60 sec and replace `std::numeric_limits<time_t>::max()` with `time(NULL) + YEAR`; a year should be enough.
Cheers
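To illustrate the overflow described above: adding even a small linked-respawn offset to the largest `time_t` value cannot be represented, and signed overflow is undefined behaviour in C++. The following is only a minimal sketch and not the core's actual code; `addRespawnOffset`, `farFutureRespawn` and `YEAR_SECONDS` are hypothetical names introduced here, and `YEAR_SECONDS` merely stands in for the `YEAR` constant mentioned in the comment.

```cpp
#include <cassert>
#include <ctime>
#include <limits>
#include <optional>

// Hypothetical stand-in for the YEAR constant: 365 days in seconds.
constexpr std::time_t YEAR_SECONDS = 365 * 24 * 60 * 60;

// Returns base + offset, or std::nullopt when the sum would not fit in time_t.
// The check runs before the addition because signed overflow is undefined.
std::optional<std::time_t> addRespawnOffset(std::time_t base, std::time_t offset)
{
    assert(offset >= 0); // linked respawns only add a positive delay
    if (base > std::numeric_limits<std::time_t>::max() - offset)
        return std::nullopt;
    return base + offset;
}

// A "far future" sentinel that still leaves headroom for small offsets.
std::time_t farFutureRespawn(std::time_t now)
{
    return now + YEAR_SECONDS;
}
```

With `std::numeric_limits<std::time_t>::max()` as the base, any positive offset is rejected, while a base produced by `farFutureRespawn(time(nullptr))` accepts the 1-59 second offsets without trouble.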
Expiring auctions from AHBot also have a noticeable impact on login delay. I'll cap the amount of item deletions we process at the same time (like we do with session updates)
Oh noes, the deletes are executed as stand-alone statements, instead of sending the whole batch wrapped in a transaction! 🤢
EDIT: after transaction-wrapping, cap wasn't needed :P
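As a rough illustration of the idea (not the actual commit), the difference is between sending each `DELETE` on its own and sending the whole batch between `START TRANSACTION` and `COMMIT`. The sketch below only builds the statement list; the function name is hypothetical, the table and column names are assumptions, and the core's own database API is deliberately not shown.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Builds one transaction-wrapped batch for deleting expired auctions.
// "auctionhouse" and "id" are assumed names used for illustration only.
std::vector<std::string> buildExpiredAuctionBatch(const std::vector<std::uint32_t>& auctionIds)
{
    std::vector<std::string> statements;
    statements.reserve(auctionIds.size() + 2);
    statements.push_back("START TRANSACTION");
    for (std::uint32_t id : auctionIds)
        statements.push_back("DELETE FROM auctionhouse WHERE id = " + std::to_string(id));
    statements.push_back("COMMIT");
    return statements;
}
```

For three expired auctions this yields five statements that are committed as a single unit, rather than three independently committed deletes.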
> Expiring auctions from AHBot also have a noticeable impact on login delay
Indeed, as I mentioned earlier (comment). The OP said however that he didn't have AHBot enabled. While I like your fix for wrapping deletions in a transaction, it may not help this exact issue.
That's why I made the commit a reference to this issue, instead of closing. I'm fully aware of OP not having AHBot active :)
So, we encountered this issue for a few weeks too, after we disabled the Grafana service. However, what we forgot to do was disable the posting of the metrics, which resulted in a 404 Not Found. After disabling the metrics, we did not encounter the issue anymore. I am not saying it is the cause of your particular issue, but it was the cause of ours, with the same symptoms.
@joshwhedon Just to know, what rev were you using and how many players did you have on average?
I can't check the rev right now, but 3.3.5 no older than 1 month. 350 players on peak hours, 200 on average through the day.
After a day or two, players could not log in anymore. And if the world server was restarted, there was always a huge rollback (as if no db save happened during the time where players could not log in).
We haven't had that issue since a week and half or so (after we disabled the metrics).
@joshwhedon Yeah, I can confirm that: after the login freezes, the only way to fix it is a worldserver restart, and the rollback is huge. Our rollbacks were between 10 and 20 minutes, since I was restarting the server as soon as a player reported the login issue to me.
The weird part is that I just checked my configs and didn't have any DB-stressing option enabled, no metrics, no AHBot, so I guess my issue is related to what @A-Metaphysical-Drama said.
Thanks again for the info ;)
> This commit is the problem source: 3ddcf40
> `std::numeric_limits<time_t>::max()` causes an overflow in linked respawn (Linked spawns are set to boss + 1-59 secs), this causes a database loop in the respawn function.
> The easy fix is to just clear the respawn table with entries lower than 60 sec and replace `std::numeric_limits<time_t>::max()` with `time(NULL) + YEAR`, year should be enough
> Cheers
Indeed this was the cause, thanks for taking your time :)
Very good catch, :+1:
@ariel-
``` sql
SQL(p): REPLACE INTO creature_respawn (guid, respawnTime, mapId, instanceId) VALUES (202796, 9223372036854775807, 724, 18)
```
O_o
9.223.372.036.854.775.807
We all will die, our Sun will die, everything will turn to dust, but this creature will be still not respawned.
hell, even universe will die and still not respawned.
If the wiki info is correct, (respawntime The time when the creature should be respawned in Unix time.),
https://www.epochconverter.com/
Assuming that this timestamp is in microseconds (1/1,000,000 second):
GMT: Fri, 11 Apr 2262 23:47:16 GMT
Your time zone: 12/04/2262, 01:47:16 GMT+2:00 DST
(edit) - roughly 245 years into the future....
BTW, cannot reproduce on a straight core (world DB replaced):
TrinityCore rev. 73ec3a1d3b34 2017-04-19 01:14:14 +0100 (3.3.5 branch) (Win64, Release, Static) (worldserver-daemon) ready...
TrinityCore rev. 73ec3a1d3b34 2017-04-19 01:14:14 +0100 (3.3.5 branch) (Win64, Release, Static) (authserver) <Ctrl-C> to stop.
That value is `std::numeric_limits<int64>::max()`, but the column is bigint(20), which is enough for that. Don't skip sql updates
> Assuming that this timestamp is in microseconds (1/1,000,000 second):
Those are actual seconds, meaning the real timestamp is that much further into the future :P
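For a sense of scale, here is a small back-of-the-envelope calculation, assuming a year of 365.25 days (31,557,600 seconds). The names are made up for illustration. Read as seconds, the value is on the order of 292 billion years; a result of about 292 years, which is what puts the date in 2262, only comes out if the value is read as nanoseconds.

```cpp
#include <cstdint>
#include <limits>

// 9223372036854775807, the value seen in the REPLACE statement.
constexpr std::int64_t kMaxInt64 = std::numeric_limits<std::int64_t>::max();

// Approximation: 365.25 days * 86400 seconds.
constexpr std::int64_t kSecondsPerYear = 31557600;

// Whole years covered when the value is read as seconds.
constexpr std::int64_t yearsIfSeconds()
{
    return kMaxInt64 / kSecondsPerYear;
}

// Whole years covered when the value is read as nanoseconds.
constexpr std::int64_t yearsIfNanoseconds()
{
    return kMaxInt64 / 1000000000 / kSecondsPerYear;
}
```

`yearsIfNanoseconds()` evaluates to 292 (1970 + 292 = 2262), while `yearsIfSeconds()` is roughly a billion times larger.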
@Re3os as previously stated, you're missing updates. Please set up the automatic updater so you don't need to worry about missing SQL updates.
Can someone confirm whether the issue still occurs on TrinityCore rev. d7e4dcc16e71 2017-09-17 or after https://github.com/TrinityCore/TrinityCore/commit/66755eecf117d21504b13a86410aa01cfc44c3ba ?