cat /boot/dietpi/.version
G_DIETPI_VERSION_CORE=6
G_DIETPI_VERSION_SUB=26
G_DIETPI_VERSION_RC=3
G_GITBRANCH='master'
G_GITOWNER='MichaIng'
Distro version | 10.4 | Buster
Kernel version | Linux DietPi 4.19.118-v7+ #1311 SMP Mon Apr 27 14:21:24 BST 2020 armv7l GNU/Linux
SBC model | RPI3
Power supply used | Tried several, using 5v 2.5A
SDcard used | SanDisk Ultra 32 GB, tried another one but the same thing happens
To reproduce, I just update to 6.30.0.
The RPi hangs after some time. I have not timed it, but it happens within 24 hours.
Many thanks for your report.
What exactly hangs?
Does the system fully crash or can you still login via SSH or at least ping it?
Did you monitor CPU temperature and RAM usage?
free -m
Which services/software are you using?
dietpi-services status
And a few general logs to have a look at to check for kernel or service errors:
dmesg | tail -10
journalctl
Ah, and you could send a bug report so I can have a closer look at those outputs as well:
dietpi-bugreport 1
and tell me the UUID:
echo $G_HW_UUID
I'm sorry, but I just noticed that it's not caused by updating DietPi, but by apt-get update/upgrade...
As dietpi-update does that as well, I was tricked :) It has been like this for months, so I got tired of it and didn't update DietPi this time but used apt-get update/upgrade, and when it lasted longer than 24 hours I thought I was in the clear.
That's why I wrote that it happens with the DietPi update.
But now it's stuck again.
I'm pretty sure I've pinged it before without result, but now I can.
Everything else network-based is dead as far as I can see.
I don't know, should we proceed?
Everything else you asked about I have to do after a restart.
Probably the OOM killer kills SSH and other network-related services, while ping (ICMP echo) is answered by the kernel itself. Please restart and do the other steps, so we can see if something is generally wrong. Monitor CPU temperature and RAM usage to see how they develop and whether either of them reaches a critical limit that causes the crash.
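To check the OOM killer theory, the kernel log can be searched for its typical messages. This is a minimal sketch, assuming the log text is still reachable (for example from a session that still responds, or from a log file that survived the reboot). The function name oom_lines is made up for this example.

```shell
#!/bin/bash
# Filter kernel log text (read from stdin) for typical OOM killer messages.
# oom_lines is a hypothetical helper name, not a DietPi command.
oom_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches
    grep -iE 'out of memory|oom-killer|oom_reaper|killed process' || true
}

# Usage (as root, since reading the kernel log may be restricted):
#   dmesg | oom_lines
#   journalctl -k --no-pager | oom_lines
```

No output means no OOM kill was logged in the text that was searched, which would point away from a memory problem.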
Just some observations:
- You basically have 2 web servers running, apache2 as well as lighttpd. Both are trying to LISTEN on port 80. As this is not possible, lighttpd is failing. You would need to decide between both or ensure that one of the web servers is running on another port like 81.
- Please run systemctl daemon-reload to reload all service files.
- You are using DHCP on a system that is running PiHole. Best practice is to use a STATIC IP to ensure that PiHole will always be reachable on the same IP address.
- In general it would be interesting to see how system memory behaves over time. Right now it seems to be fine.
Aside from what Joulinar said, also the Ethernet interface fails to be configured. I guess the most important part (DHCP) is done, so network is up regardless, but there is an ifup script that tries to send a mail which fails:
Jul 01 12:43:45 DietPi sh[354]: msmtp: authentication failed (method PLAIN)
Jul 01 12:43:45 DietPi sh[354]: msmtp: server message: 454 4.7.0 Too many login attempts, please try again later. r13sm2074956lfp.80 - gsmtp
Jul 01 12:43:45 DietPi sh[354]: msmtp: could not send mail (account default from /etc/msmtprc)
Jul 01 12:43:45 DietPi sh[354]: ERROR: Email could not be sent, please check your logs
Jul 01 12:43:45 DietPi sh[354]: run-parts: /etc/network/if-up.d/berryio_email_ip exited with return code 1
Jul 01 12:43:45 DietPi sh[354]: ifup: failed to bring up eth0
Jul 01 12:43:45 DietPi systemd[1]: ifup@eth0.service: Main process exited, code=exited, status=1/FAILURE
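If the BerryIO mail notification is not needed, one possible way to stop the failing hook is to remove its executable bit, since run-parts only runs executable files. This is a minimal sketch. disable_hook is a hypothetical helper name, and whether BerryIO restores the script later is not known here.

```shell
#!/bin/bash
# Make a hook script non-executable so run-parts skips it.
# The file itself is kept, so the change is easy to revert with chmod a+x.
disable_hook() {
    hook="$1"
    if [ ! -f "$hook" ]; then
        echo "not found: $hook" >&2
        return 1
    fi
    chmod a-x "$hook"
}

# Usage (as root):
#   disable_hook /etc/network/if-up.d/berryio_email_ip
```

After that, the next ifup run should no longer call the mail script, so its exit code can no longer mark eth0 as failed.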
Since Apache2 is currently the active webserver (because it is started before Lighttpd), to keep things as they are (regarding Pi-hole web UI and in case you have other websites), you could purge Lighttpd and PHP-FPM:
apt purge lighttpd php7.3-fpm
apt autopurge
Of course you could purge Apache2 instead, but this means a different webserver than before will serve your sites, so I would test first if everything still works fine (or even better):
systemctl stop apache2
systemctl start lighttpd
Check websites, and if everything works fine/better:
apt purge apache2
apt autopurge
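Before and after purging one of the two, it may help to confirm which process actually holds port 80. This is a minimal sketch that filters the output of ss -tlnp. The helper name listeners_on_port is made up for this example.

```shell
#!/bin/bash
# Print listening TCP sockets bound to the given port.
# Expects the output of `ss -tlnp` on stdin; column 4 is "local address:port".
listeners_on_port() {
    port="$1"
    awk -v p=":$port" '$1 == "LISTEN" && $4 ~ (p "$")'
}

# Usage (as root, so that process names are shown):
#   ss -tlnp | listeners_on_port 80
```

Exactly one webserver should show up in the result once the conflict is resolved.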
- you basically have 2 web servers running. apache2 as well as lighttpd. Both are trying to LISTEN on port 80. As this is not possible, lighttpd is failing. You would need to decide between both or ensure that one of the web servers is running on another port like 81.
Yes, I have noticed, but since everything has been working OK I've let it be.
- pls run systemctl daemon-reload to reload all service files
done
- you are using DHCP on a system that is running PiHole. Best practice is to use STATIC IP to ensure that PiHole will always be reachable on the same IP address
The IP is enforced through the router, it's always the same.
- in general it would be interesting to see how system memory behaves over time. Right now it seems to be fine.
Heh, yes, but within a few hours I guess it stops again as usual.
I have a system backup from late December that I use, and that runs as long as I want it to if I don't update.
Jul 01 12:43:45 DietPi sh[354]: msmtp: authentication failed (method PLAIN)
Jul 01 12:43:45 DietPi sh[354]: msmtp: server message: 454 4.7.0 Too many login attempts, please try again later. r13sm2074956lfp.80 - gsmtp
Jul 01 12:43:45 DietPi sh[354]: msmtp: could not send mail (account default from /etc/msmtprc)
Jul 01 12:43:45 DietPi sh[354]: ERROR: Email could not be sent, please check your logs
Jul 01 12:43:45 DietPi sh[354]: run-parts: /etc/network/if-up.d/berryio_email_ip exited with return code 1
Jul 01 12:43:45 DietPi sh[354]: ifup: failed to bring up eth0
Jul 01 12:43:45 DietPi systemd[1]: ifup@eth0.service: Main process exited, code=exited, status=1/FAILURE
This has also been like that for a long time, I can't remember if I configured BerryIO to send mail.
I'll see if I can disable the mail thing.
Since Apache2 is currently the active webserver (because it is started before Lighttpd), to keep things as they are (regarding Pi-hole web UI and in case you have other websites), you could purge Lighttpd and PHP-FPM:
apt purge lighttpd php7.3-fpm
apt autopurge
ok ill remove it
heh, yes but within a few hours i guess it stops again as usual.
Yeah the unused PHP-FPM service takes a few unnecessary resources but not that much, so while it makes sense to fix the mentioned things, I also don't see a reason why those should cause a crash or resolving them should resolve it.
Keep an eye on htop and/or free -m for memory usage (for the latter "available" memory is what you need to look at) and dmesg | tail -10 if any yellow or red lines appear (kernel warnings/errors).
Ip is enforced through the router, its always the same.
Generally it also makes sense to assign a static IP in such cases for faster network setup, as long as there is no changing DNS nameserver, changing gateway or similar that you need to re-sync regularly. However, there might be routers which do not handle devices well that use an IP inside the DHCP range but do not request the IP via DHCP, not sure. I do it both ways:
Keep an eye on htop and/or free -m for memory usage (for the latter "available" memory is what you need to look at) and dmesg | tail -10 if any yellow or red lines appear (kernel warnings/errors).
Any way I can log this? I can't sit here and watch it until it hangs :)
- On first boot of a new device, I let it receive an IP via DHCP, which ensures that the router stores its hostname as well and allows me to reserve the IP there.
- Then on the device I switch to static IP by copying IP+mask+gateway that was assigned via DHCP before.
- So the router has a reserved entry but the new device will bring up network much faster and does not need to run a DHCP client.
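To copy the values that DHCP assigned, they first have to be read from the running system. This is a minimal sketch that extracts address/prefix and gateway from the output of the ip tool. The helper names current_cidr and current_gateway are hypothetical, and eth0 is assumed to be the interface in use.

```shell
#!/bin/bash
# Extract "address/prefix" from `ip -o -4 addr show dev <iface>` output on stdin.
current_cidr() {
    awk '$3 == "inet" { print $4; exit }'
}

# Extract the gateway from `ip route show default` output on stdin.
current_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Usage:
#   ip -o -4 addr show dev eth0 | current_cidr
#   ip route show default | current_gateway
```

The two printed values are what would then be entered as the static address, netmask and gateway.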
Ok fixed
ok ill have something running here. See what happens
Thx so far
nohup cat /proc/kmsg > dmesg.log & disown
cat << '_EOF_' > ./monitor.bash
#!/bin/bash
# Log a timestamp, memory usage and CPU temperature every 5 seconds
while :
do
date
free -m
cat /sys/class/thermal/thermal_zone0/temp
sleep 5
done
_EOF_
chmod +x ./monitor.bash
nohup ./monitor.bash > monitor.log & disown
Not beautiful but the quickest idea I have currently 😄.
If needed I can guide you to a full-blown monitoring solution using Telegraf, Grafana and InfluxDB. It might be oversized, but you will get everything with nice graphs 🤣
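The temperature lines that the script writes to monitor.log are raw sysfs values in millidegrees Celsius, which are hard to read at a glance. This is a small sketch for converting them afterwards. The helper name to_celsius is made up, and it assumes that temperature readings are the only lines consisting purely of digits.

```shell
#!/bin/bash
# Convert millidegree readings (one per line on stdin) to degrees Celsius.
# Lines that are not a plain number, such as the free -m output, are skipped.
to_celsius() {
    awk '/^[0-9]+$/ { printf "%.1f\n", $1 / 1000 }'
}

# Usage:
#   to_celsius < monitor.log | tail -5
```

The last few values before a hang show whether the CPU was anywhere near a critical temperature.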
ill see where this brings me first :)
https://nextcloud.thidsa.net:446/s/NFBtJno7H4zi2JS
OK, there the network went poof. Can't connect to anything. Two of the running SSH sessions went blank, but your script and htop are still running. What now?
Hmm, since four shown SSH sessions are still up, it is neither a system crash nor a complete network crash. Were the two disconnected (?) SSH sessions from a different client?
Of course you can quit the running htop and check the content of the produced log files; memory usage is already shown to be perfectly fine in htop.
Is it possibly Pi-hole and/or the router/DHCP server which do not play well together and break connections from the local network? From the running SSH session (htop or another one of those), check:
ip r
ip a
Check if there is a route for the local network and that the IP(s) are as expected. Additionally check for recent kernel/sys logs which might be related: journalctl (or, since you have rsyslogd running, /var/log/syslog, /var/log/daemon.log and such).
Generally, do you use Pi-hole only for DNS resolving or as DHCP server as well? DHCP disabled in router? Ah, actually, judging from the posts above the router is the DHCP server, so DHCP functionality in Pi-hole is disabled, right?
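For the log check mentioned above, scrolling through a whole syslog file is tedious. This is a rough sketch that shows only the most recent lines that look like problems. The helper name recent_problems and the keyword list are assumptions, not something DietPi provides.

```shell
#!/bin/bash
# Show the last N lines of a log file that mention errors, failures or warnings.
# Arguments: log file path, optional line count (default 20).
recent_problems() {
    file="$1"
    count="${2:-20}"
    grep -iE 'error|fail|warn' "$file" | tail -n "$count"
}

# Usage:
#   recent_problems /var/log/syslog 30
#   recent_problems /var/log/daemon.log
```

The keyword match is deliberately broad, so harmless lines will appear too. The timestamps close to the moment the network dropped are the interesting ones.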
Yes, I see that now, but I thought it was because I could not connect to anything. No, same client.
Remember, this does not happen if I put the backup from late December on the SD card, so something gets updated through apt-get and causes this.
The running SSH sessions are unusable, see pictures... I can't run any command, it seems.
Yes, the router does DHCP and Pi-hole does not. This config is also the same on the mentioned SD card backup I revert to when I get tired of this :) This is difficult for me since I'm not very good with Linux things...
Maybe you get a clue from the pictures... actually it's one picture.
I'll see if I can find the logs when I reboot.
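Since the suspicion is that something installed through apt-get triggers the problem, the APT history log can show which packages each upgrade run touched. This is a minimal sketch, assuming the log is still present at its default Debian location. apt_upgrades is a hypothetical helper name.

```shell
#!/bin/bash
# List when APT ran, with which command line, and which packages it upgraded.
# Argument: path to an APT history log (defaults to the usual Debian location).
apt_upgrades() {
    grep -E '^(Start-Date|Commandline|Upgrade):' "${1:-/var/log/apt/history.log}"
}

# Usage:
#   apt_upgrades
#   apt_upgrades /var/log/apt/history.log | less
```

Comparing that package list with the state of the working December backup could narrow down which upgrade is involved.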
So SSH connection itself works fine. Since the commands are not found, I guess your PATH variable is wrong. Can you please paste from a broken SSH session:
echo $PATH
And fix it for the current session with:
export PATH='/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
I have not rebooted yet, these are still the old SSH sessions from the last reboot.
The PATH command got executed, but it made no difference.
Okay then there is something seriously broken. Does the following show something?
command -v ls
Does calling the binary manually work (does the file exist)?
/bin/ls -l
Any difference when logging in with root user?
Could it be that the RootFS disappears or gets corrupted?
I can't log in as anything right now, remember...
If I reboot, all goes back to normal until the next occurrence of this.
I don't want to take too much of your time with this either. I can reboot, try to gather my configs, do a clean install and see what happens. Or restore my working backup and use it for as long as I can without using apt-get.
could it be that the RootFS disappears or gets corrupted?
If you ask me, I have no clue, I don't know what it is :)
I guess something like lsblk is not working either, is it?
What about /bin/lsblk?
dietpi@DietPi:~$ /bin/lsblk
-bash: /bin/lsblk: Input/output error
Found this one fitting your issue. https://unix.stackexchange.com/questions/542554/got-input-output-error-when-execute-any-commands
Yes, it seems about the same, but the focus there was only on HDDs. If there was something wrong with my SD card, why does the backup image work like it should? I don't have the Linux knowledge to fiddle with this...
Just guessing. Maybe the apt upgrade will write new files to bad or damaged areas on your SD card.
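An Input/output error on every binary usually leaves traces in the kernel log, and the root filesystem is often remounted read-only at that point. This is a minimal sketch for checking both. The helper names storage_errors and root_is_readonly are made up, and the message patterns are typical examples, not an exhaustive list.

```shell
#!/bin/bash
# Filter kernel log text (stdin) for typical SD card and filesystem failures.
storage_errors() {
    grep -iE 'i/o error|ext4-fs error|remounting filesystem read-only|mmc[0-9]+: .*(error|timeout)' || true
}

# Succeed if a comma-separated mount option string contains the "ro" option.
root_is_readonly() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (from a session that still works, as root):
#   dmesg | storage_errors
#   root_is_readonly "$(findmnt -no OPTIONS /)" && echo "root is read-only"
```

If such lines appear on two different cards, the card reader, power supply or a driver change becomes a more likely suspect than the cards themselves.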
@MichaIng any other ideas?
Yes, it could be, but I have tried another, brand new SD card too. The same thing happens.
This time it took about 30 hours for it to end up like this. The time varies, but it's usually around 20 hours, give or take. I can't say for sure.
I don't know what the script Michal gave me did, maybe I should reboot and try to find the log.
Ah yes, I totally forgot about the dmesg.log, and the last CPU temperature would also be interesting. Please reboot and have a look at those:
cat ~/dmesg.log
tail -50 ~/monitor.log
I don't know if it indicates something, but the green light (read/write) on the Pi was steady on for 5 seconds, off for one, steady on for 5, and so on.
I've seen that before when this happened.
dietpi@DietPi:~$ sudo cat ~/dmesg.log
cat: /proc/kmsg: Permission denied
Login as user root
root@DietPi:~# sudo cat ~/dmesg.log
cat: /root/dmesg.log: No such file or directory
I've looked at dmesg.log without root, and all that was in it was this: cat: /proc/kmsg: Permission denied
It was only 35 bytes.
I must have executed the script wrong...
I just did nano file, paste,
bash file
It was executed and it didn't release the cursor in SSH.
Please tell me how to run it and I'll wait one more time until the Pi acts up again.
monitor.log was there though.
monitor.log
but no help there :(
Probably it would need to be executed as user root. I guess you did as user dietpi
tried sudo bash file now and the dmesg.log stays at 0 and the monitor.log file grows, is that right?
maybe and yes, i didnt know :)
Ah, makes sense, reading the kernel log is not permitted for non-root users by default. And yes, cat /proc/kmsg prints the existing output only once; when running it again I don't get that output again. Instead, only new kernel messages are printed as they arrive, so an empty dmesg.log is expected for now.
are you interested in looking at it through teamviewer? i know you developers are curious at finding causes :)
just say when, its ready. Atleast tell me how to run the script again
sudo nohup cat /proc/kmsg > dmesg.log & disown
as root or?
When using sudo, the command runs as root. The alternative is to log in as root and skip sudo.
yes, i kinda knew but wasnt sure. ok thx :)
and does it have to be exited when the rpi acts up again?
dietpi@DietPi:~$ sudo nohup cat /proc/kmsg > dmesg.log & disown
[1] 27891
dietpi@DietPi:~$ nohup: ignoring input and redirecting stderr to stdout
running, got this result, later and thx again
if the ssh the command was started from exited, does that kill the nohup cat command?
its ok, i found the process still running
The way it is started prevents it from being killed when the SSH session closes. It is completely detached from shell.
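The same detaching pattern can be wrapped in a small helper so the log path is explicit and the PID is printed for later checks. This is a sketch under assumptions: run_detached is a made-up name, and dmesg --follow is shown only as a possible alternative to cat /proc/kmsg. As far as I know it reads /dev/kmsg, which does not take messages away from other readers such as rsyslogd.

```shell
#!/bin/bash
# Start a command detached from the terminal, log its output, print its PID.
# Arguments: log file path, then the command and its arguments.
run_detached() {
    logfile="$1"
    shift
    # nohup makes the command ignore the hangup signal sent when SSH closes
    nohup "$@" > "$logfile" 2>&1 < /dev/null &
    echo "$!"
}

# Usage (as root):
#   run_detached /root/dmesg.log dmesg --follow
#   run_detached /root/monitor.log ./monitor.bash
```

The printed PID can later be passed to ps -p to confirm that the logger is still running.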
Strange... it's been up for over 7 days now. Right now I'm updating to 6.31.2.
Updated and rebooted, and it's been up for almost 4 more days... something has changed for the better.
The only thing I did before the 7 days of uptime was correct the timezone, it was 3 hours off. I can't imagine that this had something to do with it getting better.
If it continues like this I'll close this.
Great to hear 😃.
Still going strong, so I'll close this. Thx