How to auto-restart a CentOS Linux server with a
software watchdog (softdog) to reduce server downtime
I'm in charge of a dozen Linux servers these days, and restarting
many of them requires opening a support ticket, because many of the
data centers where the servers are co-located do not provide a web
interface or an IP KVM connected to the server for that purpose.
As a result, a restart request after a crash is processed in half
an hour at best and sometimes takes a few hours.
I'm aware of hardware watchdog devices, which can detect that a
server has hung and restart it automatically; however, the servers
I administer have no hardware support for a watchdog timer.
Thankfully there is a free software project called
Watchdog
which is easy to configure and mitigates the terrible downtime
caused every now and then by a server crash and the resulting
delays from tech support in data centers.
I recently blogged on the topic of
Debian Linux auto-restart in case of kernel panic; now, however,
I had to configure
watchdog on a dozen or so CentOS Linux
servers.
It turned out that installing and configuring Watchdog on CentOS
is a piece of cake and comes down to following a few easy steps,
which I'll explain quickly in this post:
1. Install watchdog on CentOS with yum
[root@centos:/etc/init.d ]# yum install watchdog
...
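2. Add a monitored file and the location of the watchdog device
to the configuration
The quickest way to add these two settings is to append them to
/etc/watchdog.conf with echo. Note that the file directive does not
select a log destination (the daemon logs through syslog); it names
a file whose accessibility watchdog checks as an additional health
test: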
[root@centos:/etc/init.d ]# echo 'file = /var/log/messages' >> /etc/watchdog.conf
[root@centos:/etc/init.d ]# echo 'watchdog-device = /dev/watchdog' >> /etc/watchdog.conf
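Running the two echo commands above a second time would add duplicate lines to the file. A minimal sketch of an idempotent variant follows; ensure_line is a hypothetical helper name of my own, not something shipped with the watchdog package:

```shell
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE unless an identical line exists.
# (ensure_line is a hypothetical helper name, used only for this sketch.)
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Intended use as root on the server:
# ensure_line /etc/watchdog.conf 'file = /var/log/messages'
# ensure_line /etc/watchdog.conf 'watchdog-device = /dev/watchdog'
```

The helper only compares whole lines, so a differently spelled entry for the same setting (for example with other spacing) would not be detected.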
3. Load the softdog kernel module to initialize the software
watchdog via /dev/watchdog
[root@centos:/etc/init.d ]# /sbin/modprobe softdog
Initialization of
softdog should be indicated by a line in the
dmesg kernel log like the one below:
[root@centos:/etc/init.d ]# dmesg |grep -i watchdog
Software Watchdog Timer: 0.07 initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0)
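The soft_margin value in that line is the number of seconds softdog waits without a write to /dev/watchdog before it resets the machine. If you want to read it in a script, a rough sketch could look like this; soft_margin_from_log is a hypothetical function name, and it assumes the log line has the format shown above:

```shell
#!/bin/sh
# soft_margin_from_log: read kernel log text on stdin and print the first
# soft_margin value (in seconds) found in it; prints nothing if none is found.
# (soft_margin_from_log is a hypothetical name, used only for this sketch.)
soft_margin_from_log() {
    sed -n 's/.*soft_margin=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Intended use on the server:
# dmesg | soft_margin_from_log
```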
4. Make the softdog kernel module load at CentOS boot
This is necessary because otherwise softdog would not be
initialized after a reboot, and without it there is no
/dev/watchdog
device for the watchdog daemon to use. (The daemon keeps writing to
/dev/watchdog, and softdog reboots the server if those writes stop
for longer than soft_margin seconds.)
It's better not to load the
softdog module from
/etc/rc.local but to use the default CentOS mechanism for loading
modules,
/etc/rc.modules:
[root@centos:/etc/init.d ]# echo modprobe softdog >> /etc/rc.modules
[root@centos:/etc/init.d ]# chmod +x /etc/rc.modules
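On a machine where /etc/rc.modules does not exist yet, the commands above create it without a shebang line. That normally still runs, but a slightly more defensive sketch is shown here; add_boot_module is a hypothetical helper name, and starting the file with #!/bin/sh is my own addition:

```shell
#!/bin/sh
# add_boot_module FILE MODULE: make FILE an executable script that loads MODULE.
# (add_boot_module is a hypothetical helper name, used only for this sketch.)
add_boot_module() {
    file=$1
    module=$2
    # Create the script with a shebang line if it does not exist yet.
    [ -e "$file" ] || printf '#!/bin/sh\n' > "$file"
    # Append the modprobe line only if it is not already present.
    grep -qxF -- "/sbin/modprobe $module" "$file" ||
        printf '/sbin/modprobe %s\n' "$module" >> "$file"
    chmod +x "$file"
}

# Intended use as root on the server:
# add_boot_module /etc/rc.modules softdog
```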
5. Start the watchdog daemon service
The successful initialization of softdog in step 3 should have
provided the system with
/dev/watchdog. Before proceeding
with starting up the watchdog daemon, it's wise to first check that
/dev/watchdog exists on the system. Here is how:
[root@centos:/etc/init.d ]# ls -al /dev/watchdog
crw------- 1 root root 10, 130 Aug 10 14:03 /dev/watchdog
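The leading c in the ls output marks a character device. The same check can be scripted with the shell's -c test; this is only a sketch, and is_char_device is a hypothetical helper name:

```shell
#!/bin/sh
# is_char_device PATH: succeed only if PATH exists and is a character device.
# (is_char_device is a hypothetical helper name, used only for this sketch.)
is_char_device() {
    [ -c "$1" ]
}

# Intended use as root before starting the daemon:
# is_char_device /dev/watchdog && /sbin/service watchdog start
```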
Being sure that
/dev/watchdog is there, I'll start the
watchdog service:
[root@centos:/etc/init.d ]# service watchdog restart
...
A very important note here: you should never ever configure the
watchdog service to start at boot time with
chkconfig. In other words, chkconfig should show watchdog as off
on all runlevels, like so:
[root@centos:/etc/init.d ]# chkconfig --list |grep -i watchdog
watchdog 0:off 1:off 2:off 3:off 4:off 5:off 6:off
Enabling watchdog through
chkconfig will cause it
to restart the system automatically, as the watchdog daemon will
probably be started before the softdog module is initialized. As
watchdog will be unable to read /dev/watchdog, it will think
the system has hung even though it is merely still booting.
The server will therefore end up in an endless loop of reboots,
which can only be fixed in Linux single-user mode! Once again,
BEWARE: never ever activate watchdog via chkconfig!
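If you manage many servers, the chkconfig check can be scripted. The following is only a sketch that assumes the one-line chkconfig --list output format shown above; all_runlevels_off is a hypothetical function name:

```shell
#!/bin/sh
# all_runlevels_off: read `chkconfig --list` output for one service on stdin
# and succeed only if no runlevel is marked "on".
# Note: empty input (service not listed at all) also counts as "off".
# (all_runlevels_off is a hypothetical name, used only for this sketch.)
all_runlevels_off() {
    ! grep -q ':on'
}

# Intended use as root on the server:
# chkconfig --list watchdog | all_runlevels_off || chkconfig watchdog off
```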
As a next step, to be absolutely sure that the watchdog daemon is
running, check it with the normal
ps command:
[root@centos:/etc/init.d ]# ps axu|grep -i watch|grep -v grep
root 18692 0.0 0.0 1816 1812 ? SNLs 14:03 0:00 /usr/sbin/watchdog
root 25225 0.0 0.0 0 0 ? ZN 17:25 0:00 [watchdog] <defunct>
You have probably noticed the
defunct state of the second watchdog process;
consider that absolutely normal. The output above indicates that
watchdog is now running properly on the host and ready to reboot
it automatically if the system hangs.
As a last step, after making sure it's initialized properly,
it's necessary to make watchdog start at boot time via the
/etc/rc.local post-init script, like so:
[root@centos:/etc/init.d ]# echo '/sbin/service watchdog start' >> /etc/rc.local
Now enjoy: watchdog is up and running and will automatically
restart the CentOS host if it hangs.
Even though
this small tutorial is dedicated to running watchdog on CentOS,
I've also tested the install procedure on CloudLinux and it works
perfectly fine.
I assume it should also work on other Red Hat based
distributions, as well as on RHEL.
I'll be glad if somebody tests it on RHEL or other RPM-based
distributions and can confirm whether this watchdog install
tutorial is applicable.
Feedback is very welcome!