If the mining card is overclocked to the limit, the miner just stops with some "cuda error". It would be nice to have an option that simply quits the miner on error, so we can use a restart script.
A much safer solution would be to have your restart script monitor GPU usage and restart the miner when the usage of any GPU drops below 50% for several seconds. This is easily done on NVIDIA.
@DLS-bau Do you have a working example of a script? Maybe you can share it with the rest of us, thanks.
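A minimal sketch of the "below 50% for several seconds" idea, assuming bash and a working nvidia-smi. The names all_below, watch_gpu and RESTART_CMD are hypothetical and not part of any miner; the sample count and polling interval are arbitrary choices.

```bash
#!/bin/bash
# Hypothetical restart command; point it at your own script.
RESTART_CMD=${RESTART_CMD:-./restart_miner.sh}

# all_below THRESHOLD READING...
# Succeeds only if at least one reading is given and every reading
# is a plain integer strictly below THRESHOLD.
all_below() {
    local threshold=$1 r
    shift
    [ "$#" -gt 0 ] || return 1
    for r in "$@"; do
        # Empty or non-numeric readings are not counted as "low"
        case $r in ''|*[!0-9]*) return 1 ;; esac
        [ "$r" -lt "$threshold" ] || return 1
    done
    return 0
}

# watch_gpu: poll GPU 0 every 2 seconds and restart the miner once
# the last 5 readings are all below 50%. Runs until interrupted.
watch_gpu() {
    local threshold=50 needed=5 samples=() util
    while sleep 2; do
        util=$(nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader,nounits)
        samples+=("$util")
        # Keep only the most recent $needed samples
        if [ "${#samples[@]}" -gt "$needed" ]; then
            samples=("${samples[@]:1}")
        fi
        if [ "${#samples[@]}" -eq "$needed" ] && all_below "$threshold" "${samples[@]}"; then
            echo "[alert] GPU 0 below ${threshold}% for $needed samples, restarting"
            "$RESTART_CMD"
            samples=()
        fi
    done
}
```

To use it, call watch_gpu at the end of the script. Requiring several consecutive low readings avoids restarting on a single dip, for example during a DAG switch.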
I think there should be an exit command regardless of error or not.. Press Q for example and it shuts down properly instead of just killing the process..
Killing the process actually screws up my system more than when it crashes..
Yeah, an option like -k that quits the miner on any error, so that a simple batch loop can restart it. I use that for most miners.
reb0rn21 can you share a sample bat file for batch loop restart?
:loop
ethminer.exe
goto loop
Just disable Windows error reporting, that's what I do.
See [Issue 72]. There are some solutions with batch, PowerShell or PHP available.
My very basic (but effective) solution is to monitor the miner with a bash watchdog script.
You have to redirect ethminer output (stdout & stderr) to a log file and then run this script.
#!/bin/bash
#
# minerwd.sh
# Author: Andrea Lanfranchi
#
# Monitors ethminer output log in search of errors.
# If any is found in last 10 rows then mining rig is restarted
#
# Prerequisites
# apt-get install inotify-tools
#
while inotifywait -e modify ~/miner.log > /dev/null 2>&1 ; do
    # Lookup last 10 rows of log file in search of errors
    # Feel free to integrate grep pattern or create more conditions
    if tail -n10 ~/miner.log | grep -io "cuda error\|error cuda"; then
        # Send mail
        echo "Miner requires restart due to error" | mail -s "Miner WatchDog Restart" prospector@localhost
        # Restart mining rig
        sudo /sbin/shutdown -f -r +2
        # Abandon WatchDog
        exit
    fi
done
Here's something I am using for my nvidia cards.
Feel free to modify it to your needs.
#!/bin/sh
PREP_GPUS="/home/linus/set_overclocking.sh"
MINER_SCRIPT="/home/linus/start_miner.sh"
# Query the utilization (in percent) of GPU 0; empty if nvidia-smi prints nothing
gpu0_utilization=$(nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader,nounits)
# Quote the value and default an empty reading to 0 so the test cannot fail with a syntax error
if [ "${gpu0_utilization:-0}" -lt 50 ]
then
    echo "[alert] GPU seems to be down, restarting."
    "$PREP_GPUS"
    "$MINER_SCRIPT"
    echo "Done restarting miner script, going to sleep now"
else
    echo "[info] All normal"
fi
I'm using this with nvidia cards and tmux:
#!/bin/bash
file=/tmp/ethminer-restarts.log
POWER_THRESHOLD=50
PROBE_DELAY=30
STARTUP_DELAY=60
RED='\033[0;31m'
GREEN='\033[0;32m'
NC='\033[0m' # no color
while true
do
    sleep $PROBE_DELAY
    power_draw=$(nvidia-smi --id=0 --query-gpu=power.draw --format=csv,noheader,nounits)
    if (( $(echo "$power_draw < $POWER_THRESHOLD" | bc -l) ))
    then
        echo -ne " $RED$(date +'%H:%M') ✘$NC " | tee -a "$file"
        tmux respawn-pane -k -t ethminer:0.0
        sleep $STARTUP_DELAY
    else
        echo -ne "$(date +'%M') ${GREEN}✔$NC "
    fi
done
This method doesn't work every time. If the GPU fails, nvidia-smi is executed in a loop without output. I am currently working on finding a better way to implement a watchdog function.
@ddobreff When nvidia-smi stops working, the driver will log a XID error. You can check with:
journalctl _TRANSPORT=kernel | grep NVRM
So far I have not found a reliable way to recover from those failures. I just trigger a reboot on them (https://jjacky.com/journal-triggerd/)
We shouldn't be using this function at all; it may cause other difficulties. For example, I forgot that I had stopped the miner, and the system rebooted while I was compiling. A better approach is to let the miner itself instruct the watchdog.
This method doesn't work every time. If the GPU fails, nvidia-smi is executed in a loop without output.
True. I haven't tried it but I think checking exit code from nvidia-smi should allow to catch this. Another thing that should be accounted for is when nvidia-smi hangs (I think I've seen such cases).
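A rough sketch of that idea, untested against a real GPU failure: wrap the query in timeout (GNU coreutils) so a hang is detected, and check both the exit code and whether anything was printed. The function name probe and the PROBE_TIMEOUT variable are made up for illustration.

```bash
#!/bin/bash
# Seconds to wait before the probed command is considered hung.
PROBE_TIMEOUT=${PROBE_TIMEOUT:-10}

# probe CMD [ARGS...]
# Runs CMD under a time limit and prints its output when it looks healthy.
# Return codes: 0 = healthy, 1 = failed or printed nothing, 2 = timed out.
probe() {
    local out status
    out=$(timeout "$PROBE_TIMEOUT" "$@" 2>/dev/null)
    status=$?
    # GNU timeout exits with 124 when the time limit was hit
    if [ "$status" -eq 124 ]; then
        return 2
    fi
    if [ "$status" -ne 0 ] || [ -z "$out" ]; then
        return 1
    fi
    printf '%s\n' "$out"
}
```

Usage would look like `util=$(probe nvidia-smi -i 0 --query-gpu=utilization.gpu --format=csv,noheader,nounits) || echo "GPU probe failed"`. Note that a process stuck in uninterruptible sleep may not react to the signal sent by timeout, so this is not guaranteed to catch every hang.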
After #757 (added --exit parameter to exit whenever an error occurred) you can use a watchdog.
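With the miner exiting on error, the watchdog can be as simple as the batch loop posted earlier. Below is a bash sketch of the same loop with a restart limit and a pause between attempts. The function name run_with_restart and its arguments are hypothetical; only the --exit flag name comes from this thread.

```bash
#!/bin/bash
# run_with_restart MAX_RESTARTS DELAY CMD [ARGS...]
# Runs CMD, and whenever it exits, waits DELAY seconds and starts it
# again, at most MAX_RESTARTS times. Returns 1 once the limit is reached.
run_with_restart() {
    local max_restarts=$1 delay=$2 restarts=0
    shift 2
    while :; do
        "$@"
        echo "[watchdog] '$1' exited with status $?" >&2
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "[watchdog] restart limit reached, giving up" >&2
            return 1
        fi
        restarts=$((restarts + 1))
        sleep "$delay"
    done
}
```

For example, `run_with_restart 100 10 ./start_miner.sh`, where start_miner.sh is your own script that launches ethminer with --exit and your usual options. The limit keeps a card that fails immediately on every start from being restarted forever.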
Try ETHminerWatchDogDmW Windows7/8/10 [32/64] & Linux (Any Dist/Any Ver/Any Arch) (#735).
Please check it and give feedback.
Thank you!