Arduino: WiFi.status() is not reflecting the true state

Created on 7 Jul 2020  ·  8 Comments  ·  Source: esp8266/Arduino

This has been an issue for quite some time, and many of the issues reported so far are probably related to this incorrect WiFi state.

So this is merely a collection issue to gather all insights and link topics, as I keep finding my own replies in lots of those topics over and over again, while still feeling lost in this problem.

Related issues:

  • #7005
  • #5912
  • #5239
  • #4810
  • #4210
  • #4152

And lots more.
In essence, any code that checks WiFi.status() and bases its actions on the result may run into these problems.

First let's have a look at the enum-mapping performed here:

wl_status_t ESP8266WiFiSTAClass::status() {
    station_status_t status = wifi_station_get_connect_status();

    switch(status) {
        case STATION_GOT_IP:
            return WL_CONNECTED;
        case STATION_NO_AP_FOUND:
            return WL_NO_SSID_AVAIL;
        case STATION_CONNECT_FAIL:
        case STATION_WRONG_PASSWORD:
            return WL_CONNECT_FAILED;
        case STATION_IDLE:
            return WL_IDLE_STATUS;
        default:
            return WL_DISCONNECTED;
    }
}

typedef enum {
    STATION_IDLE = 0,
    STATION_CONNECTING,
    STATION_WRONG_PASSWORD,
    STATION_NO_AP_FOUND,
    STATION_CONNECT_FAIL,
    STATION_GOT_IP
} station_status_t;

Note that STATION_CONNECTING is not handled explicitly, so it falls through to the default case and results in WL_DISCONNECTED.
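
To make the consequence concrete, the mapping can be replayed on a host machine. This is only a sketch: the two enums below are stand-ins that copy the names from the listings above (the real definitions live in the SDK and the core, and the WL_* values here are left to the compiler on purpose), and mapStatus() is a hypothetical name for the same switch with the raw status passed in as a parameter.

```cpp
// Stand-in for the SDK enum, copied from the listing above.
enum station_status_t {
    STATION_IDLE = 0,
    STATION_CONNECTING,
    STATION_WRONG_PASSWORD,
    STATION_NO_AP_FOUND,
    STATION_CONNECT_FAIL,
    STATION_GOT_IP
};

// Stand-in for the core's wl_status_t; only the names used by the switch.
enum wl_status_t {
    WL_IDLE_STATUS,
    WL_NO_SSID_AVAIL,
    WL_CONNECTED,
    WL_CONNECT_FAILED,
    WL_DISCONNECTED
};

// Same switch as ESP8266WiFiSTAClass::status(), but the raw SDK status is
// a parameter so the mapping can be exercised without hardware.
wl_status_t mapStatus(station_status_t status) {
    switch (status) {
        case STATION_GOT_IP:
            return WL_CONNECTED;
        case STATION_NO_AP_FOUND:
            return WL_NO_SSID_AVAIL;
        case STATION_CONNECT_FAIL:
        case STATION_WRONG_PASSWORD:
            return WL_CONNECT_FAILED;
        case STATION_IDLE:
            return WL_IDLE_STATUS;
        default:
            // STATION_CONNECTING ends up here.
            return WL_DISCONNECTED;
    }
}
```

With this mapping a caller of WiFi.status() cannot tell "still connecting" apart from "disconnected", so a node whose SDK state is stuck at STATION_CONNECTING looks disconnected for as long as it stays there.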

What I'm observing on some nodes (really hard to reproduce on some and happening almost always on others) is this:

The initial attempt to connect is stuck forever, as the WiFi status never gets to WL_CONNECTED.
I checked by calling wifi_station_get_connect_status() and saw that the state is stuck at STATION_CONNECTING.

However the web server may serve pages and the WiFiEventStationModeGotIP event has fired.
So all seems to be working already, but the state is not updated.
In one issue it was mentioned to call WiFi.setAutoReconnect(true); to fix this, but that's not the magic fix here.

My work-around for this is to keep track of how long it takes to get a successful connection and if that times out, I call my own resetWiFi() function.

void resetWiFi() {
  WifiDisconnect();
  initWiFi();
}

void initWiFi()
{
#ifdef ESP8266

  // See https://github.com/esp8266/Arduino/issues/5527#issuecomment-460537616
  // FIXME TD-er: Do not destruct WiFi object, it may cause crashes with queued UDP traffic.
  //  WiFi.~ESP8266WiFiClass();
  //  WiFi = ESP8266WiFiClass();
#endif // ifdef ESP8266

  WiFi.persistent(false); // Do not use SDK storage of SSID/WPA parameters
  WiFi.setAutoReconnect(false);
  // The WiFi.disconnect() ensures that the WiFi is working correctly. If this is not done before receiving WiFi connections,
  // those WiFi connections will take a long time to make or sometimes will not work at all.
  WiFi.disconnect();
  setWifiMode(WIFI_OFF);

#if defined(ESP32)
  WiFi.onEvent(WiFiEvent);
#else
  // WiFi event handlers
  stationConnectedHandler = WiFi.onStationModeConnected(onConnected);
  stationDisconnectedHandler = WiFi.onStationModeDisconnected(onDisconnect);
  stationGotIpHandler = WiFi.onStationModeGotIP(onGotIP);
  stationModeDHCPTimeoutHandler = WiFi.onStationModeDHCPTimeout(onDHCPTimeout);
  APModeStationConnectedHandler = WiFi.onSoftAPModeStationConnected(onConnectedAPmode);
  APModeStationDisconnectedHandler = WiFi.onSoftAPModeStationDisconnected(onDisconnectedAPmode);
#endif
}

// ********************************************************************************
// Disconnect from Wifi AP
// ********************************************************************************
void WifiDisconnect()
{
  #if defined(ESP32)
  WiFi.disconnect();
  #else // if defined(ESP32)
  ETS_UART_INTR_DISABLE();
  wifi_station_disconnect();
  ETS_UART_INTR_ENABLE();
  #endif // if defined(ESP32)
}

initWiFi() is also called as one of the first functions in my setup().
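
The timeout part of the work-around is not shown above. A minimal sketch of how it could look follows; ConnectWatchdog is a hypothetical helper (not part of the ESP8266 core and not my actual code), and the current time is passed in so the logic can be checked on a host without millis().

```cpp
#include <cstdint>

// Hypothetical helper: tracks how long a connection attempt has been running.
class ConnectWatchdog {
public:
    explicit ConnectWatchdog(uint32_t timeoutMs) : timeoutMs_(timeoutMs) {}

    // Call when a connection attempt starts, e.g. right after WiFi.begin().
    void start(uint32_t nowMs) {
        startMs_ = nowMs;
        running_ = true;
    }

    // Call once the connection is considered established.
    void stop() { running_ = false; }

    // True when a running attempt has lasted at least the timeout.
    // Unsigned subtraction keeps the comparison valid across the
    // 32-bit rollover of a millisecond counter.
    bool expired(uint32_t nowMs) const {
        return running_ && static_cast<uint32_t>(nowMs - startMs_) >= timeoutMs_;
    }

private:
    uint32_t timeoutMs_;
    uint32_t startMs_ = 0;
    bool running_ = false;
};
```

On the device, loop() would call expired(millis()) and, when it returns true, call resetWiFi() and then start(millis()) again for the next attempt. The timeout value is a tuning choice; nothing in this thread says what a good value is.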

The WiFi status is also incorrect when the unit gets disconnected.
For example, when the ESP node is kicked from the access point (a MikroTik AP allows you to disconnect a specific client via the web interface), or for whatever other reason a node may get disconnected.

This is the code I use to detect if I have an IP-address:

#ifdef CORE_POST_2_5_0
# include <AddrList.h>
#endif // ifdef CORE_POST_2_5_0


bool hasIPaddr() {
#ifdef CORE_POST_2_5_0
  bool configured = false;

  for (auto addr : addrList) {
    if ((configured = (!addr.isLocal() && (addr.ifnumber() == STATION_IF)))) {
      /*
         Serial.printf("STA: IF='%s' hostname='%s' addr= %s\n",
                    addr.ifname().c_str(),
                    addr.ifhostname(),
                    addr.toString().c_str());
       */
      break;
    }
  }
  return configured;
#else // ifdef CORE_POST_2_5_0
  return WiFi.isConnected();
#endif // ifdef CORE_POST_2_5_0
}

N.B. the CORE_POST_2_5_0 define is set by me when compiling with a specific core version.

Sometimes, when the node gets disconnected, the WiFiEventStationModeDisconnected event is fired, but the WiFi state and/or the IP address remains.
The only way to get out of this is to call my resetWiFi() function and start over to create a connection.

For some reason, TCP/IP traffic is not causing crashes in this WiFi limbo state, but UDP is causing crashes.

So it would be really helpful if we could either fix this or at least explain it, so that we can use work-arounds which don't feel like "don't know why, but it makes issues harder to reproduce", which has been the main modus operandi for the last 2 years with these WiFi issues.

Most helpful comment

This is for those who may have access to the (closed source) SDK, or who at least know what's happening in there.
It would be nice if my hypothesis could be confirmed or disproved.

My hypothesis:
It looks like the internals of the SDK also act on events to switch the WiFi status state machine.

The enum values somewhat suggest the order of how events should happen:

typedef enum {
    STATION_IDLE = 0,
    STATION_CONNECTING,
    STATION_WRONG_PASSWORD,
    STATION_NO_AP_FOUND,
    STATION_CONNECT_FAIL,
    STATION_GOT_IP
} station_status_t;

What if the events of STATION_CONNECTING and STATION_GOT_IP are processed out of order?
For example, maybe both events are pending and get processed in the same loop, but in the wrong order, which only makes a difference when they are processed in the same loop.
This could be a timing issue which only needs a slight difference in timing to give this different behavior.
Such a difference can be caused by a slightly better tuned WiFi radio, the quality of the crystal, or a different flash chip, so it is plausible that this differs among ESP nodes.

Also, different builds of the SDK can introduce some extra delays somewhere.
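
To illustrate what this hypothesis would mean, here is a toy model. It assumes a state machine that simply takes over whatever event it processed last; nothing is known here about how the closed source SDK really does it, and replay() is a hypothetical name.

```cpp
#include <vector>

// Stand-in for the SDK enum, copied from the listing above.
enum station_status_t {
    STATION_IDLE = 0,
    STATION_CONNECTING,
    STATION_WRONG_PASSWORD,
    STATION_NO_AP_FOUND,
    STATION_CONNECT_FAIL,
    STATION_GOT_IP
};

// Toy "last event wins" state machine: processes the events in the given
// order and returns the state that is left at the end.
station_status_t replay(const std::vector<station_status_t>& events) {
    station_status_t state = STATION_IDLE;
    for (station_status_t event : events) {
        state = event;
    }
    return state;
}
```

In this model, replaying CONNECTING followed by GOT_IP ends in STATION_GOT_IP, while the swapped order ends in STATION_CONNECTING and stays there until another event arrives. That matches what I observe (IP assigned, status stuck), but it does not prove that the SDK works this way.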

And now for the possible fix.
Is it possible to add a function to correct this internal state? Or even better, to make a new build of the SDK which does show the correct state of the WiFi.

All 8 comments

Can we get some info/help here?
There are countless issues reported about WiFi, and many great projects like ESPEasy suffer because of this.
@earlephilhower, @d-a-v, @devyte sorry for tagging you directly, but maybe you guys can shed some light on this?

There are #6680 and #7391 pending.
If lwIP is truly made aware of disconnections from firmware, then we can use more / new / controlled callbacks (current ones are closed source). For example we could forbid "connected" when there is no valid IP address (or until we receive the connected callback).
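
The "forbid connected without a valid IP" idea could be sketched as a small wrapper around the reported status. This is an assumption about how such a guard might look, not what #6680 or #7391 implement; guardedStatus() is a hypothetical name and the enum is a stand-in with only the values needed here.

```cpp
// Stand-in for the core's wl_status_t; only the names used in this sketch.
enum wl_status_t {
    WL_IDLE_STATUS,
    WL_NO_SSID_AVAIL,
    WL_CONNECTED,
    WL_CONNECT_FAILED,
    WL_DISCONNECTED
};

// Hypothetical guard: never report WL_CONNECTED while the station
// interface has no valid IP address. All other values pass through.
wl_status_t guardedStatus(wl_status_t raw, bool hasValidIp) {
    if (raw == WL_CONNECTED && !hasValidIp) {
        return WL_DISCONNECTED;
    }
    return raw;
}
```

On the device this could be called as guardedStatus(WiFi.status(), hasIPaddr()), using the hasIPaddr() function from the first post. Note that it only covers one direction: it does not help when an IP address is present but the status never reaches WL_CONNECTED.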

Well, I'm not entirely sure that will be the magic fix either, as it is the closed source part that's reporting the wrong state and perhaps some parts in there also use that wrong state.

As a matter of fact, there are more bugs hidden in there, which I'm not yet able to fully detect, but I know they are there.
For example, on some nodes the responsiveness to network requests is fine in some builds and terribly (unworkably) slow in other builds.
I thought it could be 'fixed' by using just another SDK build, but my latest disappointment was seeing that theory shattered, which makes it almost a "random build" that may or may not work on those units.
It can be as simple as the linking order of objects causing some extra flash activity, which may be just enough of a difference in timing to cause these issues.

So it is all very good to have a more uniform interface to "network" regardless of the physical interface, but I am afraid it won't fix the WiFi part here as it does appear to have fundamental issues in the closed source part.

Sometimes, when the node gets disconnected, the WiFiEventStationModeDisconnected event is fired, but the WiFi state and/or the IP address remains.
The only way to get out of this is to call my resetWiFi() function and start over to create a connection.

What would be nice is to read the commented firmware output in debug mode.

For some reason, TCP/IP traffic is not causing crashes in this WiFi limbo state, but UDP is causing crashes.

Because TCP buffers its data.
If we could reproduce this, we could add extra checks (valid interface) in core or in lwip2.

However the web server may serve pages and the WiFiEventStationModeGotIP event has (not?) fired.
So all seems to be working already, but the state is not updated.

That can be fixed with the above PRs in which an event is triggered when an IP is assigned.

That can be fixed with the above PRs in which an event is triggered when an IP is assigned.

But how does LWIP know the IP has been assigned?
Which part sends out the GOT_IP event?
Is that LWIP? Is sending that event based on the state reported by the closed source library?
I sometimes do get that event multiple times (very quickly after the first), but still the WiFi status is reported as not "Connected".

The events do seem to work OK, or at least more reliably than the WiFi status.
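
Since the events look more trustworthy, one option is to derive the connection state from them instead of from WiFi.status(). A minimal sketch, assuming the handlers registered in initWiFi() forward to a small tracker; WiFiEventState and its member names are hypothetical, and the real handlers receive event structs that are ignored here.

```cpp
// Hypothetical tracker, updated only from the WiFi event handlers.
struct WiFiEventState {
    bool associated = false;  // station is associated with the AP
    bool gotIp = false;       // an IP address has been assigned

    // To be called from the handler given to WiFi.onStationModeConnected().
    void onConnected() { associated = true; }

    // To be called from the handler given to WiFi.onStationModeGotIP().
    // The event may fire more than once; setting the flag again is harmless.
    void onGotIP() { gotIp = true; }

    // To be called from the handler given to WiFi.onStationModeDisconnected().
    void onDisconnected() {
        associated = false;
        gotIp = false;
    }

    // To be called from the handler given to WiFi.onStationModeDHCPTimeout().
    void onDHCPTimeout() { gotIp = false; }

    // The network is treated as usable only when both flags are set.
    bool usable() const { return associated && gotIp; }
};
```

This keeps working when the reported status stays at WL_DISCONNECTED after the GotIP event has fired, but it relies entirely on no event being missed.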

But how does LWIP know the IP has been assigned?

The logic is:

The link layer (driver) calls an lwIP function when the link is up (netif_set_link_up).
Then there are two cases:

  • static IP: lwIP uses the link layer callback to call another callback to set the status
  • DHCP: lwIP sends the DHCP request, later receives the IP (it's a callback), and calls another callback to set the status
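
The flow above can be sketched as a toy model. ToyNetif is not the real lwIP struct netif and none of its members are lwIP API; it only mirrors the order of the callbacks described in the two cases.

```cpp
#include <cstdint>
#include <functional>

// Toy stand-in for a network interface, only to show the callback order.
struct ToyNetif {
    bool linkUp = false;
    bool useDhcp = true;
    uint32_t staticIp = 0;  // used only when useDhcp is false
    uint32_t ip = 0;        // 0 means "no address yet"

    // The "other callback" that sets the status once an address is known.
    std::function<void(const ToyNetif&)> statusCallback;

    void setIp(uint32_t addr) {
        ip = addr;
        if (statusCallback) {
            statusCallback(*this);
        }
    }

    // Called by the link layer (driver) when the link comes up.
    void onLinkUp() {
        linkUp = true;
        if (!useDhcp) {
            // Static IP: the status is set straight from the link callback.
            setIp(staticIp);
        }
        // DHCP: a request would be sent here; the address arrives later
        // through onDhcpBound().
    }

    // Called when the DHCP exchange has delivered an address.
    void onDhcpBound(uint32_t addr) {
        if (linkUp && useDhcp) {
            setIp(addr);
        }
    }
};
```

In the static case the status callback runs inside onLinkUp(), while in the DHCP case it runs only after onDhcpBound(), so "link up" and "has an IP" are two separate moments.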

Which part sends out the GOT_IP event?

It's the NONOS SDK. But we can add our own callbacks (open source, full control) and use them.

What has to be done will be clearer after #6680 is merged (so we can make/fix things for any kind of interface).
Goal is to have wifi, ethernet (, ... ppp) and keep compatible with the current api (wifi.status()).

Goal is to have wifi, ethernet (, ... ppp) and keep compatible with the current api (wifi.status()).

That's a sensible goal :)
