DietPi: update from 6.26.3 to 6.30.0 makes RPi hang after a while

Created on 30 Jun 2020 · 53 Comments · Source: MichaIng/DietPi

cat /boot/dietpi/.version

#!/bin/bash

G_DIETPI_VERSION_CORE=6
G_DIETPI_VERSION_SUB=26
G_DIETPI_VERSION_RC=3
G_GITBRANCH='master'
G_GITOWNER='MichaIng'

To reproduce, I just update to 6.30.0.

Actual behaviour

RPi hangs after some time. I have not timed it, but it happens within 24 hours.

Thidsa

All 53 comments

Many thanks for your report.

What exactly hangs?
Does the system fully crash or can you still login via SSH or at least ping it?
Did you monitor CPU temperature and RAM usage?

free -m

Which services/software are you using?

dietpi-services status

And a few general logs to have a look at to check for kernel or service errors:

dmesg | tail -10
journalctl

Ah, and you could send a bug report so I can have a closer look at those outputs as well:

dietpi-bugreport 1

and tell me the UUID:

echo $G_HW_UUID

MichaIng on 30 Jun 2020

I'm sorry, but I just noticed that it's not caused by updating DietPi, but by apt-get update/upgrade...
Since dietpi-update does that too, I was tricked :) It has been like this for months, so I got tired of it and didn't update DietPi this time but used apt-get update/upgrade instead, and when it lasted longer than 24 hours I thought I was in the clear.
That's why I wrote that it happens with the DietPi update.
But now it's stuck again.
I'm pretty sure I've pinged it before without result, but now I can.
Everything else network-based is dead as far as I can see.
I don't know, should we proceed?
Everything else you asked about I have to do after a restart.

Thidsa on 1 Jul 2020

Probably the OOM killer kills SSH and other network-related stuff, while ping (ICMP echo) is handled by the kernel. Please restart and do the other steps, so we can see if something is generally wrong. Monitor CPU temperature and RAM usage to see how they develop and whether either reaches a critical limit that causes the crash.

MichaIng on 1 Jul 2020

dietpi_michalng.txt

Thidsa on 1 Jul 2020

Just some observations:

You basically have two web servers running, apache2 as well as lighttpd. Both are trying to listen on port 80. As this is not possible, lighttpd is failing. You would need to decide between the two or ensure that one of the web servers is running on another port, like 81.
The config file of pihole-FTL changed, please run systemctl daemon-reload to reload all service files.
You are using DHCP on a system that is running Pi-hole. Best practice is to use a static IP to ensure that Pi-hole is always reachable on the same IP address.
In general it would be interesting to see how system memory behaves over time. Right now it seems to be fine.
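
To confirm which server actually holds port 80, something like the following could help. It is only a sketch: listeners_on_port is a made-up helper name, and it assumes ss from iproute2 is installed and prints its usual columns.

```shell
# Sketch: list sockets listening on a given TCP port.
# listeners_on_port is a hypothetical helper; it reads `ss -tlnp` output on stdin.
listeners_on_port() {
    port=$1
    # Column 4 of `ss -tlnp` is "Local Address:Port"; the last column names the process.
    awk -v p=":$port" '
        $1 == "LISTEN" && substr($4, length($4) - length(p) + 1) == p { print $4, $NF }'
}
```

Usage would be: sudo ss -tlnp | listeners_on_port 80. On a setup like the one described it should list a single server, since the second one fails to bind.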

Joulinar on 1 Jul 2020

Aside from what Joulinar said, also the Ethernet interface fails to be configured. I guess the most important part (DHCP) is done, so network is up regardless, but there is an ifup script that tries to send a mail which fails:

Jul 01 12:43:45 DietPi sh[354]: msmtp: authentication failed (method PLAIN)
Jul 01 12:43:45 DietPi sh[354]: msmtp: server message: 454 4.7.0 Too many login attempts, please try again later. r13sm2074956lfp.80 - gsmtp
Jul 01 12:43:45 DietPi sh[354]: msmtp: could not send mail (account default from /etc/msmtprc)
Jul 01 12:43:45 DietPi sh[354]: ERROR: Email could not be sent, please check your logs
Jul 01 12:43:45 DietPi sh[354]: run-parts: /etc/network/if-up.d/berryio_email_ip exited with return code 1
Jul 01 12:43:45 DietPi sh[354]: ifup: failed to bring up eth0
Jul 01 12:43:45 DietPi systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
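
If the mail notification is not needed, one option is to disable the hook instead of fixing the mail account. A minimal sketch, assuming the non-zero exit code of the hook is what makes ifup report the failure (as the log suggests); disable_hook is a hypothetical helper name. run-parts only executes files that have the execute bit set, so removing it keeps the script on disk but stops it from running:

```shell
# Sketch: disable a hook script without deleting it.
# disable_hook is a hypothetical helper, not a DietPi or ifupdown tool.
disable_hook() {
    hook=$1
    [ -f "$hook" ] || { echo "no such file: $hook" >&2; return 1; }
    # Clear the execute bit for everyone; run-parts skips non-executable files.
    chmod a-x "$hook"
}
```

As root, the call would be: disable_hook /etc/network/if-up.d/berryio_email_ip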

Since Apache2 is currently the active webserver (because it is started before Lighttpd), to keep things as they are (regarding Pi-hole web UI and in case you have other websites), you could purge Lighttpd and PHP-FPM:

apt purge lighttpd php7.3-fpm
apt autopurge

Of course you could purge Apache2 instead, but this means a different webserver than before will serve your sites, so I would first test whether everything still works fine (or even better):

systemctl stop apache2
systemctl start lighttpd

Check websites, and if everything works fine/better:

apt purge apache2
apt autopurge

MichaIng on 1 Jul 2020

You basically have two web servers running, apache2 as well as lighttpd. Both are trying to listen on port 80. As this is not possible, lighttpd is failing. You would need to decide between the two or ensure that one of the web servers is running on another port, like 81.
Yes, I have noticed, but since everything has been working OK, I've let it be.

please run systemctl daemon-reload to reload all service files
Done.

You are using DHCP on a system that is running Pi-hole. Best practice is to use a static IP to ensure that Pi-hole is always reachable on the same IP address.

The IP is enforced through the router, it's always the same.

In general it would be interesting to see how system memory behaves over time. Right now it seems to be fine.

Heh, yes, but within a few hours I guess it stops again as usual.
I have a system backup from late December that I use, and that runs as long as I want it to, if I don't update.

Thidsa on 1 Jul 2020

Jul 01 12:43:45 DietPi sh[354]: msmtp: authentication failed (method PLAIN)
Jul 01 12:43:45 DietPi sh[354]: msmtp: server message: 454 4.7.0 Too many login attempts, please try again later. r13sm2074956lfp.80 - gsmtp
Jul 01 12:43:45 DietPi sh[354]: msmtp: could not send mail (account default from /etc/msmtprc)
Jul 01 12:43:45 DietPi sh[354]: ERROR: Email could not be sent, please check your logs
Jul 01 12:43:45 DietPi sh[354]: run-parts: /etc/network/if-up.d/berryio_email_ip exited with return code 1
Jul 01 12:43:45 DietPi sh[354]: ifup: failed to bring up eth0
Jul 01 12:43:45 DietPi systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE

This has also been like that for a long time, can't remember if I configured berryio to send mail.
I'll see if I can disable the mail thing.

Since Apache2 is currently the active webserver (because it is started before Lighttpd), to keep things as they are (regarding Pi-hole web UI and in case you have other websites), you could purge Lighttpd and PHP-FPM:
apt purge lighttpd php7.3-fpm
apt autopurge

OK, I'll remove it.

Thidsa on 1 Jul 2020

heh, yes but within a few hours i guess it stops again as usual.

Yeah the unused PHP-FPM service takes a few unnecessary resources but not that much, so while it makes sense to fix the mentioned things, I also don't see a reason why those should cause a crash or resolving them should resolve it.

Keep an eye on htop and/or free -m for memory usage (for the latter "available" memory is what you need to look at) and dmesg | tail -10 if any yellow or red lines appear (kernel warnings/errors).

Ip is enforced through the router, its always the same.

Generally, also for faster network setup, it makes sense to assign a static IP in such cases, as long as there is no changing DNS nameserver, changing gateway or similar that you need to re-sync regularly. However, there might be routers which do not handle devices well that use an IP inside the DHCP range but do not request it via DHCP, not sure. I do it both ways:

On first boot of a new device, I let it receive an IP via DHCP, which ensures that the router also stores its hostname and allows me to reserve the IP, and I do reserve it there.
Then on the device I switch to static IP by copying IP+mask+gateway that was assigned via DHCP before.
So the router has a reserved entry but the new device will bring up network much faster and does not need to run a DHCP client.
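
The switch from DHCP to static described above could be sketched like this. static_stanza is a hypothetical helper, the stanza format is the one ifupdown reads from /etc/network/interfaces, and DNS settings are not covered. On DietPi, dietpi-config is probably the safer way to apply the change; this only shows where the values come from.

```shell
# Sketch: print a static ifupdown stanza from the values DHCP assigned.
# static_stanza is a hypothetical helper. Arguments:
#   $1 interface name, $2 output of `ip -4 -o addr show dev IFACE`,
#   $3 output of `ip route show default`.
static_stanza() {
    iface=$1
    # Field 4 of the one-line address output is ADDRESS/PREFIX.
    cidr=$(printf '%s\n' "$2" | awk '$3 == "inet" { print $4; exit }')
    # "default via GATEWAY dev ..." carries the gateway in field 3.
    gw=$(printf '%s\n' "$3" | awk '$1 == "default" { print $3; exit }')
    [ -n "$cidr" ] && [ -n "$gw" ] || return 1
    printf 'iface %s inet static\n    address %s\n    gateway %s\n' "$iface" "$cidr" "$gw"
}
```

Usage would be: static_stanza eth0 "$(ip -4 -o addr show dev eth0)" "$(ip route show default)"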

MichaIng on 1 Jul 2020

Keep an eye on htop and/or free -m for memory usage (for the latter "available" memory is what you need to look at) and dmesg | tail -10 if any yellow or red lines appear (kernel warnings/errors).

Any way I can log this? Can't sit here and watch it until it hangs :)

On first boot of a new device, I let it receive an IP via DHCP, which ensures that the router also stores its hostname and allows me to reserve the IP, and I do reserve it there.

Then on the device I switch to static IP by copying IP+mask+gateway that was assigned via DHCP before.

So the router has a reserved entry but the new device will bring up network much faster and does not need to run a DHCP client.

Ok fixed

Thidsa on 1 Jul 2020

OK, I'll have something running here and see what happens.
Thx so far.

Thidsa on 1 Jul 2020

nohup cat /proc/kmsg > dmesg.log & disown
cat << _EOF_ > ./monitor.bash
#!/bin/bash
while :
do
    free -m
    cat /sys/class/thermal/thermal_zone0/temp
    sleep 5
done
_EOF_
chmod +x ./monitor.bash
nohup ./monitor.bash > monitor.log & disown

Not beautiful but the quickest idea I have currently 😄.
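
The loop above writes no timestamps, so the last entry before a hang cannot be dated. A possible variant, as a sketch only: log_sample is a hypothetical helper, and both file paths are parameters that default to the usual Linux locations.

```shell
# Sketch: print one timestamped sample per call.
# log_sample is a hypothetical helper, not part of DietPi.
log_sample() {
    meminfo=${1:-/proc/meminfo}
    tempfile=${2:-/sys/class/thermal/thermal_zone0/temp}
    # MemAvailable is reported in kB.
    avail_kb=$(awk '$1 == "MemAvailable:" { print $2 }' "$meminfo")
    if [ -r "$tempfile" ]; then
        # The kernel reports millidegrees Celsius.
        temp_c=$(awk '{ printf "%.1f", $1 / 1000 }' "$tempfile")
    else
        temp_c=n/a
    fi
    echo "$(date '+%F %T') avail=$(( ${avail_kb:-0} / 1024 ))MiB temp=${temp_c}C"
}
```

It could replace the loop body, for example: while :; do log_sample; sleep 5; done >> monitor.log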

MichaIng on 1 Jul 2020

If needed, I can guide you to a full-blown monitoring solution using Telegraf, Grafana and InfluxDB. It might be oversized, but you will get everything with nice graphs 🤣

Joulinar on 1 Jul 2020

I'll see where this gets me first :)

Thidsa on 1 Jul 2020

https://nextcloud.thidsa.net:446/s/NFBtJno7H4zi2JS
OK, there the network went poof. Can't connect to anything. Two of the running SSH sessions went blank, but your script and htop are still running. What now?

Thidsa on 3 Jul 2020

Hmm, since four shown SSH sessions are still up, it is neither a system crash nor a complete network crash. Were the two disconnected (?) SSH sessions from a different client?
Of course you can quit the running htop and check the content of the produced log files; memory usage is already shown to be perfectly fine in htop.

Could it be that Pi-hole and the router/DHCP server do not play well together and break connections from the local network? From the running SSH session (htop or another one of those), check:

ip r
ip a

Check if there is a route for the local network and that the IP(s) are as expected. Additionally check for recent kernel/sys logs which might be related: journalctl (or, since you have rsyslogd running, /var/log/syslog, /var/log/daemon.log and such).
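
A rough way to evaluate the ip r output at a glance, as a sketch: check_routes is a hypothetical helper that reads the output on stdin and only looks for a default route and a directly connected ("scope link") subnet route. It says nothing about whether the addresses themselves are the expected ones.

```shell
# Sketch: look for a default route and a directly connected subnet route.
# check_routes is a hypothetical helper; it reads `ip r` output on stdin.
check_routes() {
    awk '
        $1 == "default" { def = 1 }
        index($1, "/") && /scope link/ { net = 1 }
        END {
            print "default route: " (def ? "present" : "MISSING")
            print "local subnet route: " (net ? "present" : "MISSING")
            exit (def && net) ? 0 : 1
        }'
}
```

Usage would be: ip r | check_routes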

Generally do you use Pi-hole only for DNS resolving or as DHCP server as well? DHCP disabled in router? Ah actually deriving from above posts the router is the DHCP server, so DHCP functionality in Pi-hole is disabled, right?

MichaIng on 3 Jul 2020

Yes, I see that now, but I thought so since I could not connect to anything. No, same client.
Remember, this does not happen if I put the backup from late December on the SD card, so something gets updated through apt-get and causes this.
The running SSH sessions are unusable, see the picture... can't run any command, it seems.
Yes, the router does DHCP and Pi-hole does not; this config is also the same on the mentioned SD card backup I revert to when I get tired of this :) This is difficult for me since I'm not very good with Linux things...
Maybe you get a clue from the picture.
dietpi_ssh_notfound

I'll see if I can find the logs when I reboot.

Thidsa on 3 Jul 2020

So SSH connection itself works fine. Since the commands are not found, I guess your PATH variable is wrong. Can you please paste from a broken SSH session:

echo $PATH

And fix it for the current session with:

export PATH='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'

MichaIng on 3 Jul 2020

I have not rebooted yet, these are still the old SSH sessions from the last reboot.
The PATH command gets executed, but it made no difference.
dietpi_path

Thidsa on 3 Jul 2020

Okay then there is something seriously broken. Does the following show something?

command -v ls

Does calling the binary manually work (does the file exist)?

/bin/ls -l

Any difference when logging in with root user?

MichaIng on 3 Jul 2020

Could it be that the RootFS disappears or gets corrupted?

Joulinar on 3 Jul 2020

dietpi_ssh2
Can't log in as anything right now, remember...
If I reboot, all goes back to normal until the next occurrence of this.
I don't want to take too much of your time with this either. I can reboot, try to gather my configs, do a clean install and see what happens. Or restore my workable backup and use it for as long as I can without using apt-get.

Thidsa on 3 Jul 2020

Could it be that the RootFS disappears or gets corrupted?

If you ask me, I have no clue, I don't know what it is :)

Thidsa on 3 Jul 2020

I guess something like lsblk is not working either, is it?

Joulinar on 3 Jul 2020

dietpi_ssh3

Thidsa on 3 Jul 2020

what about /bin/lsblk

Joulinar on 3 Jul 2020

dietpi@DietPi:~$ /bin/lsblk
-bash: /bin/lsblk: Input/output error

Thidsa on 3 Jul 2020

Found this one fitting your issue. https://unix.stackexchange.com/questions/542554/got-input-output-error-when-execute-any-commands
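
An Input/output error on every binary usually points at the storage below the root filesystem rather than at the software. In that state external tools cannot be loaded, but shell builtins may still work because bash is already in memory. The following sketch uses builtins only; root_mount_state is a hypothetical helper name, and a read-only root is just one possible symptom (it depends on the filesystem's error behaviour), not a confirmed diagnosis for this case.

```shell
# Sketch: report whether the root filesystem is mounted read-only,
# using only shell builtins (read, case, echo, [).
# root_mount_state is a hypothetical helper; it reads a mounts table,
# by default /proc/mounts.
root_mount_state() {
    mounts=${1:-/proc/mounts}
    state=unknown
    # Fields per line: device, mount point, type, comma-separated options, ...
    while read -r dev mp fstype opts rest; do
        [ "$mp" = "/" ] || continue
        case ",$opts," in
            *,ro,*) state=ro ;;
            *)      state=rw ;;
        esac
    done < "$mounts"
    echo "$state"
}
```

Pasted into a still-open session, root_mount_state prints ro, rw or unknown.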

Joulinar on 3 Jul 2020

Yes, seems about the same. The focus there was only on HDDs. If there was something wrong with my SD card, why does the backup image work like it should? I don't have the Linux knowledge to fiddle with this...

Thidsa on 3 Jul 2020

Just guessing. Maybe the apt upgrade writes new files to bad or damaged areas on your SD card.

@MichaIng any other ideas?

Joulinar on 3 Jul 2020

Yes, it could be, but I have tried another, brand-new SD card too. The same happens.
This time it took about 30 hours for it to get like this. The time varies, but it's usually around 20 hours, give or take. Can't say for sure.

Thidsa on 3 Jul 2020

I don't know what the script MichaIng gave me did, maybe I should reboot and try to find the log.

Thidsa on 3 Jul 2020

Ah yes, I totally forgot about the dmesg.log, and the last CPU temperature would also be interesting. Please reboot and have a look at those:

cat ~/dmesg.log
tail -50 ~/monitor.log

MichaIng on 3 Jul 2020

I don't know if it indicates something, but the green light (read/write) on the Pi was steady on for 5 seconds, off for one, steady on for 5, and so on.
I've seen it before when this had happened.

dietpi@DietPi:~$ sudo cat ~/dmesg.log
cat: /proc/kmsg: Permission denied

Thidsa on 3 Jul 2020

Joulinar on 3 Jul 2020

root@DietPi:~# sudo cat ~/dmesg.log
cat: /root/dmesg.log: No such file or directory

I've looked at the dmesg.log without root and all that was in it was this: cat: /proc/kmsg: Permission denied
It was only 35 bytes.
I must have executed the script wrong...
I just did nano file, paste,
bash file.
It was executed and it didn't release the cursor in SSH.
Please tell me how to run it and I'll wait one more time until the Pi acts up again.

Thidsa on 3 Jul 2020

monitor.log was there tho
monitor.log
but no help there :(

Thidsa on 3 Jul 2020

Probably it would need to be executed as user root. I guess you did as user dietpi

Joulinar on 3 Jul 2020

tried sudo bash file now and the dmesg.log stays at 0 and the monitor.log file grows, is that right?

Thidsa on 3 Jul 2020

maybe and yes, i didnt know :)

Thidsa on 3 Jul 2020

Ah, makes sense, at least dmesg is not permitted for non-root users by default. And yes, cat /proc/kmsg prints the existing output only once; when running it repeatedly I don't get it again. Instead, only new kernel messages are printed as they arrive, so a dmesg.log that stays at 0 bytes simply means there were no new kernel messages yet.

MichaIng on 3 Jul 2020

Are you interested in looking at it through TeamViewer? I know you developers are curious about finding causes :)
Just say when, it's ready. At least tell me how to run the script again.

Thidsa on 3 Jul 2020

sudo nohup cat /proc/kmsg > dmesg.log & disown

MichaIng on 3 Jul 2020

as root or?

Thidsa on 3 Jul 2020

With sudo, the command runs as root. The alternative is to log in as root and skip sudo.

MichaIng on 3 Jul 2020

Yes, I kind of knew but wasn't sure. OK, thx :)
And does it have to be exited when the RPi acts up again?

Thidsa on 3 Jul 2020

dietpi@DietPi:~$ sudo nohup cat /proc/kmsg > dmesg.log & disown
[1] 27891
dietpi@DietPi:~$ nohup: ignoring input and redirecting stderr to stdout

running, got this result, later and thx again

Thidsa on 3 Jul 2020

If the SSH session the command was started from has exited, does that kill the nohup cat command?
It's OK, I found the process still running.

Thidsa on 4 Jul 2020

The way it is started prevents it from being killed when the SSH session closes. It is completely detached from the shell.

MichaIng on 5 Jul 2020

Strange... it's been up for over 7 days now. Right now I'm updating to 6.31.2.

Thidsa on 11 Jul 2020

Updated and rebooted, and it's been up for almost 4 more days... something has changed for the better.
The only thing I did before the 7 days of uptime was correct the timezone, it was 3 hours off. I can't imagine that it had something to do with this getting better.
If it continues like this, I'll close this.

Thidsa on 21 Jul 2020

Great to hear 😃.

MichaIng on 21 Jul 2020

Still going strong, so I'll close this. Thx.

Thidsa on 27 Jul 2020
