Update from Archi:
Notice for everybody facing this issue: enable ASF Debug mode in order to record debug log from the freeze. After you notice freeze, let ASF work for at least ~10 more minutes in such state. Once recorded, pack whole debug directory together with log.txt to zip and send it on my e-mail: JustArchi[at]JustArchi[dot]net, as it can contain sensitive information.
Okay so I'm having this weird issue with only one bot. I'm sure if I restart ASF it will work as expected, but maybe this can help fix some _potential_ bug. Ok, so all my bots started normally (even the one that I'm having issues with), and all of them are redeeming keys too, except one bot, which is returning Timeout:
[8:02:57 PM] master: !redeem buggy_bot XXXXX-XXXXX-XXXXX
<slave_1> Key: XXXXX-XXXXX-XXXXX| Status: Timeout!
The ASF output looks like this:
2016-07-30 20:02:59.9729|ERROR|ArchiSteamFarm.Logging|buggy_bot|RedeemKey()
I've had this issue a couple of times and generally I just have to restart the bot to make it work again. But this time, I tried to !stop it and I got no response (I spammed some commands three times and I still didn't got any response):
[8:03:07 PM] master: !stop buggy_bot
[8:03:11 PM] master: !start buggy_bot
[8:04:05 PM] master: !stop buggy_bot
The ASF output after the first !stop:
2016-07-30 20:03:11.1083|INFO|ArchiSteamFarm.Logging|buggy_bot|Stop() Stopping...
Since bot slave_1's chat window became unresponsive, I tried to issue the commands to slave_2, but doing that, made slave_2 freeze too:
[8:06:29 PM] master: !statusall
[8:06:31 PM] slave_2:
<a reaaaally long response that isn't important>
[8:06:54 PM] master: !stop buggy_bot
[8:06:56 PM] slave_2: That bot instance is already inactive!
[8:06:59 PM] master: !start buggy_bot
[8:07:25 PM] master: !statusall
ASF didn't show any relevant information to above commands. I'm using the latest stable version: V2.1.3.1
Okay, so I scrolled up my ASF's log window and I saw this (I have ASF running for about 1 day, btw):
<...>
2016-07-30 12:59:04.7139|ERROR|ArchiSteamFarm.Logging|buggy_bot|Init()
2016-07-30 12:59:04.7139|INFO|ArchiSteamFarm.Logging|buggy_bot|Start() Starting...
<...>
And after that I didn't receive any information about that bot (about ~8 hours ago).
Can you show me all log lines of this bot instance instead of just selected few?
Use something like grep or even Notepad++ for showing only lines containing |buggy_bot|.
I'm pretty sure this is Steam fuckup and nothing worth investigating, but I can take a look at the log at least.
Yup. Here you go:
Did you modify logging layouts? Those errors should have message with them, ASF prohibits logging of null or empty messages.
Shit, I honestly don't know. I mean I have a NLog.config that I probably got from the Wiki page (I didn't read it properly and I thought it was mandatory to have this file :S). Anyway, here it is:
<?xml version="1.0" encoding="utf-8" ?>
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<targets>
<target xsi:type="ColoredConsole" name="ColoredConsole" />
<target xsi:type="File" name="File" fileName="NLog.txt" deleteOldFileOnStartup="true" />
</targets>
<rules>
<logger name="*" minlevel="Trace" writeTo="ColoredConsole" />
<logger name="*" minlevel="Trace" writeTo="File" />
</rules>
</nlog>
EDIT: Yep, I got it from the Wiki page, I thought I had modified it, but thankfully I didn't.
Remove NLog.config if you use default settings, there is no need to override them if you don't have a reason.
Judging from the fact that it was in ArchiWebHandler, most likely it was some kind of networking problem - steam didn't announce end of connection, so SteamClient probably is trying to end it gracefully while not receiving any response.
This would make sense as SteamClient.Connect() should always disconnect the old session and initiate a connect, instead, it's stuck in your case, and SK2 still reports that SteamClient.IsConnected.
I'm not sure if there is anything that ASF can do in this case, it should be handled in SK2 library, preferably implementing a timeout and force reconnect instead of something like this. I've already experienced this problem - it's very rare, but it looks like SteamClient simply freezes and it's not possible to make it reconnect even with SteamClient.Connect() - that's why your keys redeeming requests are failing, because ASF properly tries to send the request, but Steam network never responds back, as SteamClient is frozen. It became frozen the moment it tried to log in to ISteamUserAuth the last time (and failed) - that initiated a reconnect request that was never fullified.
It's probably best to ask SK2 devs if they have any clue why SteamClient can reach such state of being unable to operate, and if perhaps it's possible to do anything to overcome this problem apart from throwing away old object and creating new SteamClient(). Maybe @Netshroud could say something about that.
Oh, good to know. So the bots not answering the following commands can't be handled by ASF? Like a 1 minute timeout to "abort" if SK2 don't respond or something... Anyway, like you said it's indeed very rare (in fact it happened twice). I'll just restart ASF then. Thanks for your attention!
If SteamClient is frozen then no request can reach ASF, even if you do make it reach ASF, through other bot instance (like you did), then redeem request will timeout anyway as SteamClient is non-operative. Thing is, during normal and even faulty operation SteamClient should NEVER reach such state, as SteamClient.Connect() should always force a reconnect, and ASF has routines to initiate reconnect if it's needed - it did initiate reconnect in your case.
This is somehow problematic as the workaround for this issue would require throwing away currently coded nice way of SteamClient.Disconnect() and SteamClient.Connect() and instead initialize new object on each reconnect attempt. I don't want to do this as this doesn't make any sense - SteamClient should be reusable and there should be never a need to throw it away, unless we're done with it for good.
So for now I don't have any real workaround for this, I might need to look into SK2 code and check how the whole connection is being handled to spot the deadlock/issue we're seeing, but hopefully @Netshroud or other dev from @SteamRE/SteamKit2 can at least confirm if my assumptions are right.
Because this is the only possible reason why SteamClient.Connect() freezes and doesn't return any callback - neither OnConnected() nor OnDisconnected(). If it did, then ASF would react accordingly, but the callback never arrived, so ASF is still waiting for SteamClient.Connect() to do something.
Can you map those line logs to SteamKit2 / callbacks / function calls? I'm not going to read through all of ASF to find all log output lines.
Also, are you using TCP or UDP?
Not sure if you are asking me, but I can only answer the second question: TCP.
Sorry @Netshroud, I'm mapping it for you right now.
It's TCP - the current web session expired, so ASF called ISteamUserAuth dynamically here with proper nonce. That request failed here so ASF initiated SteamClient reconnect here.
We reached Starting... in the log, so ASF's last call was SteamClient.Connect() here - after that no callback arrived - neither OnConnected() nor OnDisconnected().
That's what I can read from the log posted by @GUiHKX, but that already happened to me in ArchiBoT IIRC - it's basically SteamClient becoming frozen, any request results in timeout. You can't attempt a reconnect neither do anything with it, the only solution is doing new SteamClient().
This issue is very rare, so I suspect it's specific TcpClient issue or something like this, but I didn't look carefully in SK2 code yet so this is only my guess for now.
Let me know if you need any more info.
BTW, @GUiHKX can you double-check there is no exception thrown around 2016-07-30 12:58:04.6885? I have unhandled exception and unobserved exception handler, so if there is anything weird in SK2 code or anywhere else we should take a look at this.
I checked both ASF console and NLog.txt file and unfortunately nothing showed up...
This can be true as I didn't spot any exception back when I had this issue for the first time in ArchiBoT as well, thanks for checking.
Update: It seems that this can be also triggered with going into Windows hibernate mode. @Netshroud
(As in, after waking from longer hibernation lasting > 10 minutes, SteamClient becomes non-responsive)
So you have debuglog output from it?
No repro on the hiberation.
I connected to Steam then hibernated for 12 minutes. After resuming, SteamKit's TcpConnection got an IOException, cleanly disconnected and the loop I inserted into Sample1 to reconnect after disconnecting got hit and called steamClient.Connect(), which resulted in a successful reconnection.
This was even without Azuisleet's latest change around calling WaitOne.
I'll notify you once we have something to work with.
Notice for everybody facing this issue: enable ASF Debug mode in order to record debug log from the freeze. After you notice freeze, let ASF work for at least ~10 more minutes in such state. Once recorded, pack whole debug directory together with log.txt to zip and send it on my e-mail: JustArchi[at]JustArchi[dot]net, as it can contain sensitive information.
Also it's interesting because when SteamClient goes into such freeze state, it seems that all async jobs are failing instantly:
2016-08-06 16:37:58|INFO|MrMousePT|RefreshSessionIfNeeded() Refreshing our session!
2016-08-06 16:37:59|ERROR|MrMousePT|RefreshSession() SteamKit2.AsyncJobFailedException
The above is happening with await SteamUser.RequestWebAPIUserNonce();.
I noticed something very similar to this issue, but with new HeartBeat I added recently it seems to at least cause SteamClient to be "unstuck":
2016-08-23 05:51:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:51:41|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 05:51:41|INFO|archi|StopFarming() Sending signal to stop farming
2016-08-23 05:51:41|INFO|archi|StopFarming() Waiting for reaction...
2016-08-23 05:51:41|INFO|archi|Farm() Stopped farming: 463150 (BARRIER X)
2016-08-23 05:51:42|INFO|archi|StopFarming() Farming stopped!
2016-08-23 05:51:53|INFO|archi|OnConnected() Connected to Steam!
2016-08-23 05:51:53|INFO|archi|OnConnected() Logging in...
2016-08-23 05:52:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:53:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:54:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:55:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:56:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:57:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:58:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 05:59:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:00:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:01:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:02:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:03:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:04:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:05:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:06:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:07:41|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-23 06:08:19|INFO|archi|OnConnected() Connected to Steam!
2016-08-23 06:08:19|INFO|archi|OnConnected() Logging in...
2016-08-23 06:08:19|INFO|archi|OnLoggedOn() Successfully logged on!
2016-08-23 06:08:19|INFO|archi|Init() Logging in to ISteamUserAuth...
2016-08-23 06:08:21|INFO|archi|Init() Success!
SteamClient "died" for more than 15 minutes, and all scheduled requests including their callbacks got frozen for that period - it's clearly visible with HeartBeat() trying to SteamClient.Connect() every minute to finally wake it up. Then finally after long timeout (probably TCP socket internal one) all callbacks are immediately returned.
Really strange :confused:
I set up ArchiBoT with DebugLog and I hope to reproduce complete freeze eventually. It looks like some general issue with TCP socket rather than SK2 itself. Something low-level that prevents SteamClient from operating, and not SK2 code itself.
Interesting that you had 16 failed heartbeat attempts, then 16 disconnected callbacks all at the same time.
Are you sure you're calling the CallbackManager appropriately?
Feel free to check - I didn't spot any issue related to this function, recently I even added extra check that makes sure to run this function while SteamClient.IsConnected. KeepRunning is set to false only in Stop() function, but that one requires somebody to call !stop and results in appropriate Stopping... result in a log, so I'm sure that my callback loop was working properly. The loop is created here the first moment my bot instance is started.
Besides, if that loop wasn't working then I'd not get "unfreeze" after 15 minutes, because there is no place of re-creating that loop apart from Start(). My HeartBeat() was only calling `SteamClient.Connect()``.
BTW, I'm very grateful you're trying to help us Netshroud, I appreciate that. I'm also doing my own research regarding that, including running my bots with DebugLog so I can potentially check if there is something weird going on.
My main suspect for now is TCP socket that somehow "blocks" for given timeout during network issues or malfunctions, preventing CMClient from communication. That is however only wild guess based on logs and situations, I hope to reproduce it with debug log and proper log soon, but as noted this is quite rare issue that is very hard to reproduce - I can only set up a trap and hope for the best.
Finally after nearly a week of hunting I think I managed to reproduce this issue, and while socket timeout after ~15 minutes is another one, this issue is mostly caused by very rare situation of concurrent calls to SteamClient's Connect() and Disconnect(), which should be fixed in https://github.com/SteamRE/SteamKit/commit/330fe9bea8b80b3cf484ba09611b2e53948eefb3.
I wrapped ASF calls in lock() statements for now before fix above lands in SK2 release, and I consider this issue as fixed until somebody will be able to reproduce this on ASF V2.1.4.6+ (which will be released soon). I'm still actively watching this to ensure that the fix indeed works, but due to nature of the bug such thing can be stated semi-reliably only after longer period of tests.
Thanks to all who spent their time on trying to help and find the culprit, especially to @Netshroud for offering his helpful hand despite of not being directly connected with ASF project :smiley:
I consider this issue closed until somebody (including me) will be able to reproduce it once again in ASF V2.1.4.6+.
Most helpful comment
Interesting that you had 16 failed heartbeat attempts, then 16 disconnected callbacks all at the same time.
Are you sure you're calling the
CallbackManagerappropriately?