Posts Tagged ‘log messages’

List and fix failed systemd services after a Linux OS upgrade and how to get full info about a systemd service from the journal log

Friday, February 25th, 2022

systemd-logo-unix-linux-list-failed-systemd-services

I have recently upgraded a number of machines from Debian 10 Buster to Debian 11 Bullseye. As always, the update had some issues on some machines, such as problems with package dependencies, changing a number of external package repositories to match the Bullseye deb packages etc. On some machines the update was less painful than on others, but the common outcome was that most of the machines ended up after the update with one or more failed systemd services. It could be that some of the machines already had these failed services present and I never checked them since the previous update from Debian 9 -> Debian 10, or it is just some mess I've left behind in a hurry when doing software installations in the past. That doesn't matter anyway; the fact was that I had to deal with a number of failed systemctl services, which I managed to track via the Failed service message on system boot on one of the physical machines, and via the OpenXen VTY Console for the rest of the Virtual Machines that showed Failed messages after the update. Thus I've spent a good amount of time, an overall of a day or two, fixing strange failed services. This is how this small article was born, in an attempt to help sysadmins or any home Linux desktop users who have updated their Debian Linux / Ubuntu or any other deb-based distribution but, due to the chaotic nature of Linux, have ended up with the same strange Failed services and look for a way to find the source of the failures and get rid of the problems.
Systemd is a very complicated system and in my sysadmin opinion it creates more problems than it solves, but okay, for today's people's megalomania mindset it matches well.

Systemd_components-systemd-journalctl-cgroups-loginctl-nspawn-analyze.svg

 

1. Check the journal for errors, running service irregularities and so on
 

The first thing to do to track down errors right after the update is to take some minutes and closely check the journalctl output for any strange errors. Even on well maintained Unix machines, this journal log can point you to a problem that is not fatal, but where some process or component is still malfunctioning in the background, which you would like to solve:
 

root@pcfreak:~# journalctl -x
Jan 10 10:10:01 pcfreak CRON[17887]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17887]: USER_END pid=17887 uid=0 auid=0 ses=340858 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17888]: CRED_DISP pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17888]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17888]: USER_END pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17884]: CRED_DISP pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17884]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17884]: USER_END pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17886]: CRED_DISP pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/c>
Jan 10 10:10:01 pcfreak CRON[17886]: pam_unix(cron:session): session closed for user www-data
Jan 10 10:10:01 pcfreak audit[17886]: USER_END pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permi>
Jan 10 10:10:08 pcfreak NetworkManager[696]: [1641802208.0899] device (eth1): carrier: link connected
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up - 100Mbps/Full - flow control rx/tx
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:19 pcfreak NetworkManager[696]: [1641802219.7920] device (eth1): carrier: link connected
Jan 10 10:10:19 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up - 100Mbps/Full - flow control rx/tx
Jan 10 10:10:20 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:22 pcfreak NetworkManager[696]: [1641802222.2772] device (eth1): carrier: link connected
Jan 10 10:10:22 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up - 100Mbps/Full - flow control rx/tx
Jan 10 10:10:23 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:33 pcfreak sshd[18142]: Unable to negotiate with 66.212.17.162 port 19255: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diff>
Jan 10 10:10:41 pcfreak NetworkManager[696]: [1641802241.0186] device (eth1): carrier: link connected
Jan 10 10:10:41 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up - 100Mbps/Full - flow control rx/tx

If you want to only check the latest journal log messages, use the -x (add catalog explanations) and -e (jump to the end of the pager) options together:

root@pcfreak:~# journalctl -xe

Feb 25 13:08:29 pcfreak audit[2284920]: USER_LOGIN pid=2284920 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=login acct=28696E76616C>
Feb 25 13:08:29 pcfreak sshd[2284920]: Received disconnect from 177.87.57.145 port 40927:11: Bye Bye [preauth]
Feb 25 13:08:29 pcfreak sshd[2284920]: Disconnected from invalid user ubuntuuser 177.87.57.145 port 40927 [preauth]
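If the journal is too noisy, you can also filter it by message priority. For example, to show only messages of priority err or worse since the last boot, or all warnings logged today (a quick sketch using standard journalctl options):

root@pcfreak:~# journalctl -p err -b
root@pcfreak:~# journalctl -p warning --since today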

The next thing to do after the update was to get a list of the failed services only.


2. List all failed systemd services which were supposed to be running

root@pcfreak:/root # systemctl list-units | grep -i failed
● certbot.service                                                                                                       loaded failed failed    Certbot
● logrotate.service                                                                                                     loaded failed failed    Rotate log files
● maldet.service                                                                                                        loaded failed failed    LSB: Start/stop maldet in monitor mode
● named.service                                                                                                         loaded failed failed    BIND Domain Name Server


An alternative way is with the --failed option:

hipo@jeremiah:~$ systemctl list-units --failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

 


To get the full list of unit states that you can pass to systemctl, run:

# systemctl --state=help

This prints all the possible load / active / sub state values that --state accepts.
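For instance, to list only units whose unit files are masked or missing (states can be combined comma-separated, a small example):

# systemctl list-units --all --state=masked,not-found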
Show service properties


Check whether a service has failed or has another status, and check the default systemd variables set for it:

root@jeremiah:~# systemctl is-failed vboxweb.service
inactive
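Because is-failed exits with code 0 when the unit is in failed state, it is handy in shell scripts; a minimal hypothetical sketch that restarts a unit only if it has failed:

root@jeremiah:~# systemctl is-failed --quiet named.service && systemctl restart named.service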

# systemctl show haproxy
Type=notify
Restart=always
NotifyAccess=main
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=304858
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success

Full output of the above command is dumped in show_systemctl_properties.txt
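If you are interested in just a few of the properties instead of the full dump, systemctl show accepts the -p (--property) option; for example, picking the values already seen above:

# systemctl show haproxy -p MainPID,Result,Restart
MainPID=304858
Result=success
Restart=always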


3. List all running systemd services for a better overview of what's going on on the machine
 

To get a list of all properly loaded systemd services you can use --state running.

hipo@jeremiah:~$ systemctl list-units --state running|head -n 10
  UNIT                              LOAD   ACTIVE SUB     DESCRIPTION
  proc-sys-fs-binfmt_misc.automount loaded active running Arbitrary Executable File Formats File System Automount Point
  cups.path                         loaded active running CUPS Scheduler
  init.scope                        loaded active running System and Service Manager
  session-2.scope                   loaded active running Session 2 of user hipo
  accounts-daemon.service           loaded active running Accounts Service
  anydesk.service                   loaded active running AnyDesk
  apache-htcacheclean.service       loaded active running Disk Cache Cleaning Daemon for Apache HTTP Server
  apache2.service                   loaded active running The Apache HTTP Server
  avahi-daemon.service              loaded active running Avahi mDNS/DNS-SD Stack

 

Another useful thing is to list all unit files configured in systemd together with their state; you can do it with:

 


root@pcfreak:~# systemctl list-unit-files
UNIT FILE                                                                 STATE           VENDOR PRESET
proc-sys-fs-binfmt_misc.automount                                         static          –            
-.mount                                                                   generated       –            
backups.mount                                                             generated       –            
dev-hugepages.mount                                                       static          –            
dev-mqueue.mount                                                          static          –            
media-cdrom0.mount                                                        generated       –            
mnt-sda1.mount                                                            generated       –            
proc-fs-nfsd.mount                                                        static          –            
proc-sys-fs-binfmt_misc.mount                                             disabled        disabled     
run-rpc_pipefs.mount                                                      static          –            
sys-fs-fuse-connections.mount                                             static          –            
sys-kernel-config.mount                                                   static          –            
sys-kernel-debug.mount                                                    static          –            
sys-kernel-tracing.mount                                                  static          –            
var-www.mount                                                             generated       –            
acpid.path                                                                masked          enabled      
cups.path                                                                 enabled         enabled      

 

 


root@pcfreak:~# systemctl list-units --type service --all
  UNIT                                   LOAD      ACTIVE   SUB     DESCRIPTION
  accounts-daemon.service                loaded    inactive dead    Accounts Service
  acct.service                           loaded    active   exited  Kernel process accounting
● alsa-restore.service                   not-found inactive dead    alsa-restore.service
● alsa-state.service                     not-found inactive dead    alsa-state.service
  apache2.service                        loaded    active   running The Apache HTTP Server
● apparmor.service                       not-found inactive dead    apparmor.service
  apt-daily-upgrade.service              loaded    inactive dead    Daily apt upgrade and clean activities
 apt-daily.service                      loaded    inactive dead    Daily apt download activities
  atd.service                            loaded    active   running Deferred execution scheduler
  auditd.service                         loaded    active   running Security Auditing Service
  auth-rpcgss-module.service             loaded    inactive dead    Kernel Module supporting RPCSEC_GSS
  avahi-daemon.service                   loaded    active   running Avahi mDNS/DNS-SD Stack
  certbot.service                        loaded    inactive dead    Certbot
  clamav-daemon.service                  loaded    active   running Clam AntiVirus userspace daemon
  clamav-freshclam.service               loaded    active   running ClamAV virus database updater
..

 


linux-systemd-components-diagram-linux-kernel-system-targets-systemd-libraries-daemons

 

4. Finding out more on why a systemd configured service has failed


Getting info about a failed systemd service is usually done with systemctl status servicename.service.
However, in case of troubles with a service unable to start, you can get more info about why it failed with the (-l) or (--full) options:


root@pcfreak:~# systemctl -l status logrotate.service
● logrotate.service – Rotate log files
     Loaded: loaded (/lib/systemd/system/logrotate.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 00:00:06 EET; 13h ago
TriggeredBy: ● logrotate.timer
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
    Process: 2045320 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)
   Main PID: 2045320 (code=exited, status=1/FAILURE)
        CPU: 2.479s

Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: For now we will assume you meant to write /32
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| ERROR: '0.0.0.0/0.0.0.0' needs to be replaced by the term 'all'.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| SECURITY NOTICE: Overriding config setting. Using 'all' instead.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: (B) '::/0' is a subnetwork of (A) '::/0'
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: because of this '::/0' is ignored to keep splay tree searching predictable
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: You should probably remove '::/0' from the ACL named 'all'
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Failed with result 'exit-code'.
Feb 25 00:00:06 pcfreak systemd[1]: Failed to start Rotate log files.
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Consumed 2.479s CPU time.


systemctl -l, however, provides only the last log messages a started / stopped / failed service has generated. Sometimes systemctl -l servicename.service shows the split error message incompletely, as there is a limit on line length on the console; see below:

 

root@pcfreak:~# systemctl status -l certbot.service
● certbot.service – Certbot
     Loaded: loaded (/lib/systemd/system/certbot.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 09:28:33 EET; 4h 0min ago
TriggeredBy: ● certbot.timer
       Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
             https://certbot.eff.org/docs
    Process: 290017 ExecStart=/usr/bin/certbot -q renew (code=exited, status=1/FAILURE)
   Main PID: 290017 (code=exited, status=1/FAILURE)
        CPU: 9.771s

Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with --manual-auth-hook when using th>
Feb 25 09:28:33 pcfrxen certbot[290017]: All renewals failed. The following certificates could not be renewed:
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/mail.pcfreak.org-0003/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/www.eforia.bg-0005/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/zabbix.pc-freak.net/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]: 3 renew failure(s), 5 parse failure(s)
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Failed with result 'exit-code'.
Feb 25 09:28:33 pcfrxen systemd[1]: Failed to start Certbot.
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Consumed 9.771s CPU time.

 

5. Get a complete log of journal to make sure everything configured on server host runs as it should

Thus, to get a more complete list of the messages, and to be able to later google whether someone has come up with a solution on the internet, use:

root@pcfrxen:~#  journalctl --catalog --unit=certbot

-- Journal begins at Sat 2022-01-22 21:14:05 EET, ends at Fri 2022-02-25 13:32:01 EET. --
Jan 23 09:58:18 pcfrxen systemd[1]: Starting Certbot…
░░ Subject: A start job for unit certbot.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit certbot.service has begun execution.
░░ 
░░ The job identifier is 5754.
Jan 23 09:58:20 pcfrxen certbot[124996]: Traceback (most recent call last):
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/renewal.py", line 71, in _reconstitute
Jan 23 09:58:20 pcfrxen certbot[124996]:     renewal_candidate = storage.RenewableCert(full_path, config)
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 471, in __init__
Jan 23 09:58:20 pcfrxen certbot[124996]:     self._check_symlinks()
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 537, in _check_symlinks

root@server:~# journalctl --catalog --unit=certbot|grep -i pluginerror|tail -1
Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with --manual-auth-hook when using the manual plugin non-interactively.')


Or if you want to list and read only the last messages in the journal log regarding a service

root@server:~# journalctl --catalog --pager-end --unit=certbot

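To follow a unit's journal messages in real time, for example while re-running the failed service from another terminal, add the --follow flag:

root@server:~# journalctl --unit=certbot --follow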

If you have disabled a failed service because you don't need it to run at all on the machine with:

root@rhel:~# systemctl stop rngd.service
root@rhel:~# systemctl disable rngd.service

And you want to clear up any failed service information that is kept in the systemctl service log you can do it with:
 

root@rhel:~# systemctl reset-failed
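reset-failed also accepts a unit name, in case you want to clear the failed state of just one service and keep the records of the rest:

root@rhel:~# systemctl reset-failed rngd.service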

Another useful systemctl option is cat; you can use it to quickly print the unit file of a service. It is an actual shortcut that saves you from typing the full path to the service, e.g. cat /lib/systemd/system/certbot.service:

root@server:~# systemctl cat certbot
# /lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://certbot.eff.org/docs
[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true

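If you need to adjust such a unit (say, its ExecStart line), rather than editing the file under /lib/systemd/system directly it is cleaner to let systemd create an override under /etc/systemd/system/; systemctl edit opens one for you (just a pointer, the override content is of course up to your use case), and systemctl cat then shows the base unit together with the override:

root@server:~# systemctl edit certbot.service
root@server:~# systemctl cat certbot.service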

After the failed systemd services are fixed, it is best to reboot the machine and put some more time into closely inspecting the complete journal log, to make sure no error was left behind.


Closure
 

As you can see, updating a machine from one major version to another, even if you follow the official documentation and have plenty of experience, is always more or less a pain in the ass, which can eat up much of your time banging your head solving problems with failed daemons, issues with /etc/rc.local (which I have faced because of the #!/bin/sh -e shebang, which makes /etc/rc.local immediately quit if any command returns an exit code $? different from 0) etc. The logical questions come then:
1. Is it really worth it to update regularly at all, especially if you don't know of a famous major vulnerability 🙂 ?
2. Or is it worth it to update from one OS major release to the next at all?
3. Or should you only patch the services that are exposed to an externally reachable computer network or the internet, and stay on the same OS release until the period in which your software distribution receives official security patches is over (called Long Term Support (LTS) in Debian and End Of Life (EOL) cycle in RPM-based distros)?

Anyone could take any approach, but for the small network of my own managed systems at home my practice was always to try to keep everything up-to-date every 3 or 6 months at maximum. This has caused me multiple days of irritation and stress and perhaps many white hairs and nerves spent on shit.


4. Based on the company where I'm employed, the better strategy is to patch while EOL support is still offered and keep the rule First Things First (FTF); once the EOL is reached, just make a copy of all the servers' data and configuration to external data storage, bring up a new physical or virtual machine and migrate the services.
Test after the migration that all works as expected; if all is as it should be, change the DNS records or the leading infrastructure proxies or whatever to point to the new service, and that's it! Yes, it is true that a migration based on a full OS reinstall is more time consuming and requires much more planning, but usually the result is much more predictable, plus it is much less stressful for the guy doing the job.

How to configure haproxy logging to separate file on Redhat Enterprise Linux 8.5 Ootpa

Thursday, February 3rd, 2022

haproxy-rsyslog-architecture-logging-picture

Configuring proper logging for haproxy is always a pain in the ass on Linux, because of the varying rsyslogd config syntax among versions, because of bugs in the OS etc.
Today we have been given 2 Redhat 8.5 Linux servers where we had the task to start configuring haproxies. To have an idea of what is going on, of course, we had to enable proper haproxy logging to a separate log file under a separate local facility; for a test one can use haproxy's

log /dev/log local6

config; this is the general way to configure logging, which I've described earlier in the article How to enable haproxy logging to a separate log /var/log/haproxy.log / prevent duplicate messages to appear in /var/log/messages.
However, this time I wanted to avoid /dev/log, as this device is also used by systemd / journald and could theoretically be used by other services, and multiple services logging to the same place could possibly lead to issues; thus I wanted to send and process the haproxy messages directly via rsyslog on RHEL 8.5.

Create a custom file that is loaded with the rest of configuration from /etc/rsyslog.conf with a line like:
 

# Include all config files in /etc/rsyslog.d/
include(file="/etc/rsyslog.d/*.conf" mode="optional")


Create 49_haproxy.conf with the below content:

[root@haproxy: ~]# vim /etc/rsyslog.d/49_haproxy.conf

$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514
#2022/02/02: HAProxy logs to local6, save the messages
local6.*                                                /var/log/haproxy.log
if ($programname == 'haproxy') then -/var/log/haproxy.log
& stop

touch /var/log/haproxy.log
chown haproxy:haproxy /var/log/haproxy.log
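With the config and log file in place, validate the rsyslog syntax and restart the service so the new rules take effect (the standard rsyslog validation commands work on RHEL 8.5 too):

[root@haproxy: ~]# rsyslogd -N1
[root@haproxy: ~]# systemctl restart rsyslog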

In /etc/haproxy/haproxy.cfg, under the global section, to print messages in verbose mode (i.e. to check that haproxy is receiving the sent traffic properly), configure something like:

 

global
  log          127.0.0.1 local6 debug


Eventually you might want to remove the debug word from the config once everything is properly tested and configured, if you don't want to log too verbosely.
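Before reloading haproxy with the changed log directive, it is a good habit to let haproxy validate its own configuration first, using its built-in -c (check) mode; a test request against the configured frontend, like the curl below, should then start producing records in the new log:

[root@haproxy: ~]# haproxy -c -f /etc/haproxy/haproxy.cfg
Configuration file is valid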

[root@haproxy: ~]# curl -v -k 10.10.192.135:15010
* Rebuilt URL to: 10.10.192.135:15010/
*   Trying 10.10.192.135…
* TCP_NODELAY set
* Connected to 10.10.192.135 (10.10.192.135) port 15010 (#0)
> GET / HTTP/1.1
> Host: 10.10.192.135:15010
> User-Agent: curl/7.61.1
> Accept: */*

* Empty reply from server
* Connection #0 to host 10.10.192.135 left intact
curl: (52) Empty reply from server

 

In /var/log/haproxy.log you should get some messages like:
 

Feb  3 14:16:44 localhost.localdomain haproxy[25029]: proxy IN_Traffic_Bak has no server available!
Feb  3 14:16:44 localhost.localdomain haproxy[25029]: proxy IN_Traffic_Bak has no server available!
Feb  3 15:59:50 localhost.localdomain haproxy[25029]: [03/Feb/2022:15:59:50.162] 10.44.192.135:1348 -:- IN_Traffic/<NOSRV>:- -1/-1/0 0 SC 1/1/0/0/0 0/0
Feb  3 15:59:50 localhost.localdomain haproxy[25029]: [03/Feb/2022:15:59:50.162] 10.44.192.135:1348 -:- IN_Traffic/<NOSRV>:- -1/-1/0 0 SC 1/1/0/0/0 0/0
Feb  3 15:59:50 localhost.localdomain haproxy[25029]: [03/Feb/2022:15:59:50.162] 10.44.192.135:1348 -:- IN_Traffic/<NOSRV>:- -1/-1/0 0 SC 1/1/0/0/0 0/0

 

Log rsyslog script incoming tagged string message to separate external file to prevent /var/log/message from string flood

Wednesday, December 22nd, 2021

rsyslog_logo-log-external-tag-scripped-messages-to-external-file-linux-howto

If you're using some external bash script to log messages via rsyslogd to one of the multiple rsyslog data tubes (called facility levels in rsyslog language) and you want rsyslog to move the message string to an external log file, then you have had the same task as me a few days ago.

For example, you have a bash shell script that is writing a message to the rsyslog daemon to one of the predefined facility levels, be it:
 

kern, user, cron, auth etc. or some localX

and your logged script data ends up under the wrong file location: /var/log/messages, /var/log/secure, /var/log/cron etc. However, you need to log everything coming from that service to a separate file based on its localX facility level. The usual way to do it is via some config like the following, as you would usually do it with rsyslog facility selectors:
 

local1.info                                            /var/log/custom-log.log

# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none;local0.none;local1.none        /var/log/messages


Note that local1.none instructs rsyslog not to log anything from the local1 facility to /var/log/messages.
But what if, due to some weirdness in the rsyslog configuration on the server, or even due to some weird misconfiguration in

/etc/systemd/journald.conf such as:

[Journal]
Storage=persistent
RateLimitInterval=0s
RateLimitBurst=0
SystemMaxUse=128M
SystemMaxFileSize=32M
MaxRetentionSec=1month
MaxFileSec=1week
ForwardToSyslog=yes
SplitFiles=none

Due to that config, and especially the ForwardToSyslog=yes, the messages sent via the logger tool to local1 still end up inside /var/log/messages, not nice huh ..

The result of that is that anything sent with a predefined TAGGED string via the whatever.sh script, which uses the logger command (if you've never used it, check man logger) to enter a message into rsyslog with a cmd like:
 

# logger -p local1.info -t TAG_STRING

# logger -p local2.warn test
# tail -2 /var/log/messages
Dec 22 18:58:23 pcfreak rsyslogd: -- MARK --
Dec 22 19:07:12 pcfreak hipo: test


was nevertheless logged to /var/log/messages.
Of course /var/log/messages became so overfilled with "junk" shell script data not related to basic Operating System administration that any critical or important messages that should normally go to /var/log/messages / /var/log/syslog got lost among the big quantities of other tagged data reaching the log.

After many attempts to resolve the issue by modifying /etc/rsyslog.conf as well as the messed-up /etc/systemd/journald.conf (which, by the way, was generated with these strange values by an OS install time ansible automation), it took me a while until I found the solution for telling rsyslog to log the tagged message strings into a separate external file. From my 20 minutes of research online I have seen multitudes of people on different Linux OS versions experiencing the same or similar issues, and this triggered me to write this small article on the rsyslog solution.

The solution turned out to be pretty easy but requires some further digging into rsyslog. Red Hat's basic rsyslog configuration documentation is a very nice read for starters; in my case I've used the property-based compare-operation contains to select my tagged message string.
 

1. Add msg contains compare-operations to output log file and discard the messages

[root@centos bin]# vi /etc/rsyslog.conf

# config to log everything logged to rsyslog to a separate file
:msg, contains, "tag_string:/"         /var/log/custom-script-log.log
:msg, contains, "tag_string:/"    ~

Substitute the quoted tag_string:/ with whatever your tag is, and mind that this config is better placed somewhere near the beginning of /etc/rsyslog.conf. Also touch the file /var/log/custom-script-log.log and give it some decent permissions such as 755, i.e.
 

1.1 Discarding a message


The tilde sign ~

placed at the end of the msg, contains line is what actually tells rsyslog to discard the matched string, so that it does not also end up in /var/log/messages.

An alternative rsyslog config to discard the unwanted message once you have it logged is with the
rawmsg variable, like so:

 

# config to log everything logged to rsyslog to a separate file
:msg, contains, "tag_string:/"         /var/log/custom-script-log.log
:rawmsg, isequal, "tag_string:/" stop

Another way to stop logging immediately after the log is written to the custom file, across some older versions of rsyslog, is via & stop:

:msg, contains, "tag_string:/"         /var/log/custom-script-log.log
& stop

I don't know about other versions, but unfortunately the & stop does not work on RHEL 7.9 with the installed rpm package rsyslog-8.24.0-57.el7_9.1.x86_64.

1.2 More with property-based filters: basic exclusion of a string

Property-based filters can do much more; you can, for example, do regular expression based matches of strings coming to rsyslog and forward them somewhere.

To select syslog messages which do not contain any mention of the words fatal and error with any or no text between them (for example, fatal lib error), type:

:msg, !regex, "fatal .* error"
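A matched message can also be forwarded to a remote syslog host instead of (or in addition to) a local file; a minimal sketch, assuming a hypothetical log host logs.example.com (a single @ forwards via UDP, a double @@ via TCP):

:msg, contains, "tag_string:/" @@logs.example.com:514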

 

2. Create file where tagged data should be logged and set proper permissions
 

[root@centos bin]# touch /var/log/custom-script-log.log
[root@centos bin]# chmod 755 /var/log/custom-script-log.log


3. Test rsyslogd configuration for errors and reload rsyslog

[root@centos ]# rsyslogd -N1
rsyslogd: version 8.24.0-57.el7_9.1, config validation run (level 1), master config /etc/rsyslog.conf
rsyslogd: End of config validation run. Bye.

[root@centos ]# systemctl restart rsyslog
[root@centos ]#  systemctl status rsyslog 
● rsyslog.service – System Logging Service
   Loaded: loaded (/usr/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-12-22 13:40:11 CET; 3h 5min ago
     Docs: man:rsyslogd(8)
           http://www.rsyslog.com/doc/
 Main PID: 108600 (rsyslogd)
   CGroup: /system.slice/rsyslog.service
           └─108600 /usr/sbin/rsyslogd -n
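Once rsyslog is restarted, you can quickly verify that the new rule catches the tagged strings by injecting a test message with logger (assuming the tag_string:/ match from the config above):

[root@centos ]# logger -p local1.info "tag_string:/ test message"
[root@centos ]# tail -1 /var/log/custom-script-log.log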

 

4. Property-based compare-operations supported by rsyslog table
 

Compare-operation   Description
contains            Checks whether the provided string matches any part of the text provided by the property. To perform case-insensitive comparisons, use contains_i.
isequal             Compares the provided string against all of the text provided by the property. The two values must be exactly equal to match.
startswith          Checks whether the provided string is found exactly at the beginning of the text provided by the property. To perform case-insensitive comparisons, use startswith_i.
regex               Compares the provided POSIX BRE (Basic Regular Expression) against the text provided by the property.
ereregex            Compares the provided POSIX ERE (Extended Regular Expression) against the text provided by the property.
isempty             Checks if the property is empty. The value is discarded. This is especially useful when working with normalized data, where some fields may be populated based on the normalization result.

 


5. Rsyslog understanding Facility levels

Here is a list of facility levels that can be used.

Note: The mapping between Facility Number and Keyword is not uniform across different operating systems and syslog implementations, so among separate Linux distributions there might be differences in naming and numbering.

Facility Number   Keyword    Facility Description
0                 kern       kernel messages
1                 user       user-level messages
2                 mail       mail system
3                 daemon     system daemons
4                 auth       security/authorization messages
5                 syslog     messages generated internally by syslogd
6                 lpr        line printer subsystem
7                 news       network news subsystem
8                 uucp       UUCP subsystem
9                 -          clock daemon
10                authpriv   security/authorization messages
11                ftp        FTP daemon
12                -          NTP subsystem
13                -          log audit
14                -          log alert
15                cron       clock daemon
16                local0     local use 0 (local0)
17                local1     local use 1 (local1)
18                local2     local use 2 (local2)
19                local3     local use 3 (local3)
20                local4     local use 4 (local4)
21                local5     local use 5 (local5)
22                local6     local use 6 (local6)
23                local7     local use 7 (local7)


6. rsyslog Severity levels (sublevels) accepted by facility level

As defined in RFC 5424, there are eight severity levels as of year 2021:

Code   Severity        Keyword          Description
0      Emergency       emerg (panic)    System is unusable. A "panic" condition usually affecting multiple apps/servers/sites; at this level it would usually notify all tech staff on call.
1      Alert           alert            Action must be taken immediately. Should be corrected immediately, therefore notify staff who can fix the problem. An example would be the loss of a primary ISP connection.
2      Critical        crit             Critical conditions. Should be corrected immediately, but indicates failure in a primary system; an example is a loss of a backup ISP connection.
3      Error           err (error)      Error conditions. Non-urgent failures; these should be relayed to developers or admins, and each item must be resolved within a given time.
4      Warning         warning (warn)   Warning conditions. Not an error, but an indication that an error will occur if action is not taken, e.g. file system 85% full; each item must be resolved within a given time.
5      Notice          notice           Normal but significant condition. Events that are unusual but not error conditions; might be summarized in an email to developers or admins to spot potential problems. No immediate action required.
6      Informational   info             Informational messages. Normal operational messages; may be harvested for reporting, measuring throughput, etc. No action required.
7      Debug           debug            Debug-level messages. Info useful to developers for debugging the application; not useful during operations.


7. Sample well tuned configuration using severity and facility levels and immark, imuxsock, impstats
 

Below is sample config using severity and facility levels
 

# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none;local0.none;local1.none        /var/log/messages


Note that the local0.none; local1.none entries tell rsyslog not to log anything from those facility levels to /var/log/messages.

If you need a complete set of rsyslog configuration fine-tuned for proper logging, with increased queues, included configuration for logging to a remote log aggregator service, and other measures to prevent the system disk from filling up in case something goes wild with a logging service and leads to repeated messages, you can always contact me and I can help 🙂
 Other than that, sysadmins might benefit from a sample set of configuration prepared with the Automated rsyslog config builder, or use some fine-tuned config for rsyslog-8.24.0-57.el7_9.1.x86_64 on Redhat 7.9 (Maipo): rsyslog_config_redhat-2021.tar.gz.

To sum it up rsyslog though looks simple and not an important thing to pre

Stop haproxy log requests to /var/log/messages / Disable haproxy double logging

Friday, June 25th, 2021

haproxy-logo

On a CentOS Linux release 7.9.2009 (Core) I have haproxies running on two KVM virtual machines that are configured in a High Availability cluster with Corosync and Pacemaker. The machines are inherited from another admin (I did not install the server hardware and OS) but I have received the systems for support.
The old sysadmins seem to not have cared much about the system, so they left haproxy with double logging: once under the separately configured log /var/log/haproxy/haproxyprod.log, while each haproxy TCP mode request was also logged to /var/log/messages. As you can guess, this shouldn't be so, because we're wasting hard drive space, so to fix that I had to stop the haproxy double logging to /var/log/messages.

The logging is done under a separate local facility, local6; the /etc/haproxy/haproxyprod.cfg goes as follows:
 

[root@haproxy01 ~]# cat /etc/haproxy/haproxyprod.cfg

global
    # log <address> [len ] [max level [min level]]
    log 127.0.0.1 local6 debug

 

The logging is handled by rsyslog via local6, so obviously the goal is to keep that logging out of /var/log/messages.
The rsyslog configuration for logging to the separate log file is as follows:

local6.*                                                /var/log/haproxy/haproxyprod.log

It turned out to be really easy to prevent haproxy from getting its requests logged to /var/log/messages; all I had to change is in /etc/rsyslog.conf:

a local6.none config has to be placed on the /var/log/messages line; the full line configuration in /etc/rsyslog.conf that stopped the double logging is:

# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none;local5.none;local6.none                /var/log/messages

 

How to check Linux server power supply state is Okay / How to find out a Linux Power Supply is broken

Wednesday, January 6th, 2021

2U-power-supplies-get-status-if-Power-supply-broken-information-linux-ipmitool

If you're a sysadmin managing Linux servers remotely, every now and then when a machine is hanging without a reason it is useful to check the server Power Supply state. I say that because often, when a machine is mysteriously hanging, a standard Root Cause Analysis (RCA) on /var/log/messages, /var/log/dmesg, /var/log/boot etc. does not bring you to any conclusion. The next step, after you send a technician to reboot the machine, is to check on the Linux OS level whether the Power Supply Unit (PSU) hardware on the machine has some issues.
As blogged earlier on how to use ipmitool to manage remote ILO boards etc., ipmitool can also be used to check the status of server PSUs.

Below is example output from a server with 2 PSUs whose Power Supplies are functioning normally.
 

[root@linux-server ~]# ipmitool sdr type "Power Supply"

PS Heavy Load    | 2Bh | ok  | 19.1 | State Deasserted
Power Supply 1   | 70h | ok  | 10.1 | Presence detected
Power Supply 2   | 71h | ok  | 10.2 | Presence detected
PS Configuration | 72h | ok  | 19.1 |
PS 1 Therm Fault | 75h | ok  | 10.1 | Transition to OK
PS 2 Therm Fault | 76h | ok  | 10.2 | Transition to OK
PS1 12V OV Fault | 77h | ok  | 10.1 | Transition to OK
PS2 12V OV Fault | 78h | ok  | 10.2 | Transition to OK
PS1 12V UV Fault | 79h | ok  | 10.1 | Transition to OK
PS2 12V UV Fault | 7Ah | ok  | 10.2 | Transition to OK
PS1 12V OC Fault | 7Bh | ok  | 10.1 | Transition to OK
PS2 12V OC Fault | 7Ch | ok  | 10.2 | Transition to OK
PS1 12Vaux Fault | 7Dh | ok  | 10.1 | Transition to OK
PS2 12Vaux Fault | 7Eh | ok  | 10.2 | Transition to OK
Power Unit       | 7Fh | ok  | 19.1 | Fully Redundant

Now if you have a server, let's say an old ProLiant DL360e Gen8, whose Power Supply is damaged, you will get from ipmitool something similar to:

[root@linux-server  systemd]# ipmitool sdr type "Power Supply"
Power Supply 1   | 30h | ok  | 10.1 | 100 Watts, Presence detected
Power Supply 2   | 31h | ok  | 10.2 | 0 Watts, Presence detected, Failure detected, Power Supply AC lost
Power Supplies   | 33h | ok  | 10.3 | Redundancy Lost
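Since the ipmitool output is plain text, it is trivial to wrap a periodic check around it; a minimal cron-able sketch (the grep pattern is an assumption based on the failure strings shown above, and admin@example.com is a placeholder):

#!/bin/sh
# alert by mail if any PSU sensor reports failure / lost AC
if ipmitool sdr type "Power Supply" | grep -Eiq 'failure|lost'; then
    echo "PSU problem detected on $(hostname)" | mail -s "PSU ALERT $(hostname)" admin@example.com
fi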


If you don't have ipmitool installed, due to security or whatever, but you have the hardware detection software dmidecode, you can use it too to get the Power Supply state:

[root@linux-server  systemd]# dmidecode -t chassis
# dmidecode 3.2
Getting SMBIOS data from sysfs.
SMBIOS 2.8 present.

 

Handle 0x0300, DMI type 3, 21 bytes
Chassis Information
        Manufacturer: HP
        Type: Rack Mount Chassis
        Lock: Not Present
        Version: Not Specified
        Serial Number: CZJ38201ZH
        Asset Tag:
        Boot-up State: Critical
        Power Supply State: Critical

        Thermal State: Safe
        Security Status: Unknown
        OEM Information: 0x00000000
        Height: 1 U
        Number Of Power Cords: 2
        Contained Elements: 0

To find only the Power Supply info status on a server with dmidecode:

# dmidecode --type 39
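DMI type 39 is the System Power Supply record; to grep out just the interesting fields (field names may vary a bit between hardware vendors):

# dmidecode --type 39 | grep -E 'Location|Name|Status'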

monitoring-power-supply-hardware-information-linux-ipmitool

Plug between the power supply and the mainboard voltage / coms ATX specification

This can also be used on normal Linux desktop PCs, which usually have only one PSU (power supply unit). On many Ubuntu and Linux desktops where lshw (list hardware information) is installed, you can get the machine's power status with lshw:

 root@ubuntu:~# lshw -c power
  *-battery               
       product: 45N1111
       vendor: SONY
       physical id: 1
       slot: Front
       capacity: 23200mWh
       configuration: voltage=11.1V


Finally, to get extensive information on the voltages of the Power Supply you can use the good old lm_sensors:

# apt-get install lm-sensors
# sensors-detect 
# service kmod start

# sensors
# watch sensors


As manually monitoring Power Supplies and other various data is dubious, you might finally want to use some centralized monitoring. For one example of that, you might want to check my prior article Zabbix to Monitor Hardware Hard Drive / Temperature and Disk with lm_sensors / smartd on Linux with Zabbix.

How to enable HaProxy logging to a separate log /var/log/haproxy.log / prevent HAProxy duplicate messages to appear in /var/log/messages

Wednesday, February 19th, 2020

haproxy-logging-basics-how-to-log-to-separate-file-prevent-duplicate-messages-haproxy-haproxy-weblogo-squares
haproxy logging can be managed in different ways; the most straightforward way is to directly use /dev/log, or you can configure it to use some log management service such as syslog or rsyslogd.

If you don't use rsyslog yet to install it: 

# apt install -y rsyslog

Then, to activate logging via rsyslogd, we should add, either to /etc/rsyslog.conf or in a separate file included via /etc/rsyslog.conf, the following content:
 

Enable haproxy logging from rsyslogd


To log haproxy messages to a separate log file you can use some of the usual syslog local0 to local7 facility descriptors inside the conf (be aware that if you try to use some wrong value like local8 or local9 as a logging facility, you will end up with an empty haproxy.log, even though the permissions of /var/log/haproxy.log are readable and it is owned by the haproxy user).

When logging to a local syslog service, writing to a UNIX socket can be faster than targeting the TCP loopback address. Generally, on Linux systems, a UNIX socket listening for syslog messages is available at /dev/log, because this is where the syslog() function of the GNU C library sends messages by default. To address the UNIX socket in haproxy.cfg use:

log /dev/log local2 


If you want to log each of multiple running haproxy instances with different haproxy*.cfg into a separate log, add to /etc/rsyslog.conf lines like:

local2.* -/var/log/haproxylog2.log
local3.* -/var/log/haproxylog3.log


One important note to make here: since rsyslogd is used for haproxy logging, you need to have imudp enabled in rsyslogd and a UDP port listener on the machine.

E.g. somewhere in rsyslog.conf or via rsyslog include file from /etc/rsyslog.d/*.conf needs to have defined following lines:

$ModLoad imudp
$UDPServerRun 514


I prefer to use external /etc/rsyslog.d/20-haproxy.conf include file that is loaded and enabled rsyslogd via /etc/rsyslog.conf:

# vim /etc/rsyslog.d/20-haproxy.conf

$ModLoad imudp
$UDPServerRun 514
local2.* -/var/log/haproxy2.log


It is also possible to produce different haproxy log output based on the severity, to differentiate between important and less important messages; to do so you'll need to add to rsyslog.conf something like:
 

# Creating separate log files based on the severity
local0.* /var/log/haproxy-traffic.log
local0.notice /var/log/haproxy-admin.log

 

Prevent Haproxy duplicate messages to appear in /var/log/messages

If you use local2 and some default rsyslog configuration, then you will end up with the messages coming from haproxy towards local2 producing doubled simultaneous records both to your pre-defined /var/log/haproxy.log and to /var/log/messages, on proxy servers that receive a few thousand simultaneous connections per second.
This is a problem, since doubling the log will produce too much data, and on systems with a smaller /var/ partition you will quickly run out of space; plus this haproxy request logging to /var/log/messages makes the file quite unreadable for normal system events, which are so important for tracking clearly what is happening on the server daily.

To prevent the haproxy duplicate messages you need to define local2.none somewhere in rsyslogd, usually in /etc/rsyslog.conf, on the line of facilities configured to log to the file:

*.info;mail.none;authpriv.none;cron.none;local2.none     /var/log/messages

This configuration should work, but it is more rarely used, as most people prefer to have the haproxy log written not directly to /dev/log, which is used by other services such as syslogd / rsyslogd.

To use /dev/log to output logs from haproxy configuration in global section use config like:
 

global
        log /dev/log local2 debug
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

The log global directive basically says: use the log line that was set in the global section for the whole config till the end of the file. Putting a log global directive into the defaults section is equivalent to putting it into all of the subsequent proxy sections.

Using global logging rules is the most common HAProxy setup, but you can put them directly into a frontend section instead. It can be useful to have a different logging configuration as a one-off. For example, you might want to point to a different target Syslog server, use a different logging facility, or capture different severity levels depending on the use case of the backend application. 

Instead of using the /dev/log interface, which on many distributions is heavily used by systemd to store / manage and distribute logs, many haproxy sysadmins nowadays prefer to use rsyslogd as the default logging facility that will manage the haproxy logs.
Admins prefer to use some kind of mediator service to manage log writing, such as rsyslogd or syslog; the reasons behind this might vary, but perhaps the most important one is that by using rsyslogd it is possible to write logs simultaneously locally on disk and also forward logs to a remote logging server running the rsyslogd service.

Logging is defined in /etc/haproxy/haproxy.cfg (or the respective configuration) through the global section, but could also be configured to do separate logging based on each of the defined frontends / backends or the defaults section.
A sample excerpt from this section looks something like:

#---------------------------------------------------------------------
# Global settings
#---------------------------------------------------------------------
global
    log         127.0.0.1 local2

    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
defaults
    mode                    tcp
    log                     global
    option                  tcplog
    #option                  dontlognull
    #option http-server-close
    #option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 7
    #timeout http-request    10s
    timeout queue           10m
    timeout connect         30s
    timeout client          20m
    timeout server          10m
    #timeout http-keep-alive 10s
    timeout check           30s
    maxconn                 3000

 

 

# HAProxy Monitoring Config
#---------------------------------------------------------------------
listen stats 192.168.0.5:8080                #Haproxy Monitoring run on port 8080
    mode http
    option httplog
    option http-server-close
    stats enable
    stats show-legends
    stats refresh 5s
    stats uri /stats                            #URL for HAProxy monitoring
    stats realm Haproxy\ Statistics
    stats auth hproxyauser:Password___          #User and Password for login to the monitoring dashboard

 

#---------------------------------------------------------------------
# frontend which proxies to the backends
#---------------------------------------------------------------------
frontend ft_DKV_PROD_WLPFO
    mode tcp
    bind 192.168.233.5:30000-31050
    option tcplog
    log-format %ci:%cp\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
    default_backend Default_Bakend_Name


#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
backend bk_DKV_PROD_WLPFO
    mode tcp
    # (0) Load Balancing Method.
    balance source
    # (4) Peer Sync: a sticky session is a session maintained by persistence
    stick-table type ip size 1m peers hapeers expire 60m
    stick on src
    # (5) Server List
    # (5.1) Backend
    server Backend_Server1 10.10.10.1 check port 18088
    server Backend_Server2 10.10.10.2 check port 18088 backup


The log directive in above config instructs HAProxy to send logs to the Syslog server listening at 127.0.0.1:514. Messages are sent with facility local2, which is one of the standard, user-defined Syslog facilities. It’s also the facility that our rsyslog configuration is expecting. You can add more than one log statement to send output to multiple Syslog servers.

Once rsyslog and haproxy logging is configured, as a minimum you need to restart rsyslog (assuming that the haproxy config is already properly loaded):

# systemctl restart rsyslog.service

To make sure rsyslog reloaded successfully:

# systemctl status rsyslog.service


Restarting HAproxy

If the rsyslogd logging to 127.0.0.1 port 514 was recently added, a HAProxy restart should also be done; you can do it with:
 

# /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -D -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)


Or to restart use systemctl script (if haproxy is not used in a cluster with corosync / heartbeat).

# systemctl restart haproxy.service

You can control how much information is logged by adding a syslog level to the log directive:

    log         127.0.0.1 local2 info


The accepted values are the standard syslog security level severity:

Value  Severity        Keyword          Description
0      Emergency       emerg (panic)    System is unusable; a panic condition.
1      Alert           alert            Action must be taken immediately; a condition that should be corrected immediately, such as a corrupted system database.
2      Critical        crit             Critical conditions, e.g. hard device errors.
3      Error           err (error)      Error conditions.
4      Warning         warning (warn)   Warning conditions.
5      Notice          notice           Normal but significant conditions; not errors, but may require special handling.
6      Informational   info             Informational messages.
7      Debug           debug            Debug-level messages; information normally of use only when debugging a program.

 

Logging only errors / timeouts / retries is done with the option:

option dontlog-normal

placed in the defaults or frontend section. You most likely want to enable this only during certain times, such as when performing benchmarking tests.

Note that if rsyslog is configured to listen on a different port for some weird reason, you should not forget to set the proper port in the log directive too, e.g.:

  log         127.0.0.1:514 local2 info

The log output format itself can be changed with the log-format (or log-format-sd for structured-data syslog) directive in your defaults or frontend section.
 

Haproxy Logging shortly explained


The type of logging you’ll see is determined by the proxy mode that you set within HAProxy. HAProxy can operate either as a Layer 4 (TCP) proxy or as Layer 7 (HTTP) proxy. TCP mode is the default. In this mode, a full-duplex connection is established between clients and servers, and no layer 7 examination will be performed. When in TCP mode, which is set by adding mode tcp, you should also add option tcplog. With this option, the log format defaults to a structure that provides useful information like Layer 4 connection details, timers, byte count and so on.

Below is example of configured logging with some explanations:

log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq"

haproxy-logged-fields-explained
Example of the log-format configuration as shown above, taken from a haproxy config:

log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r"

haproxy_http_log_format-explained1

To understand the meaning of these abbreviations you'll have to closely read haproxy-log-format.txt. More in-depth info is to be found in the HTTP Log format documentation.


haproxy_logging-explained

Logging HTTP request headers

An HTTP request header can be logged via:
 

 http-request capture

frontend website
    bind :80
    http-request capture req.hdr(Host) len 10
    http-request capture req.hdr(User-Agent) len 100
    default_backend webservers


The log will show headers between curly braces and separated by pipe symbols. Here you can see the Host and User-Agent headers for a request:

192.168.150.1:57190 [20/Dec/2018:22:20:00.899] website~ webservers/server1 0/0/1/0/1 200 462 - - ---- 1/1/0/0/0 0/0 {mywebsite.com|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/71.0.3578.80 } "GET / HTTP/1.1"

 

Haproxy Stats Monitoring Web interface


Haproxy has a simplistic stats interface which, if enabled, produces some generally useful information like in the above screenshot, through which you can get very basic in-browser statistics and track potential issues with the proxied traffic for all configured backends / frontends: incoming / outgoing network packets, configured nodes, experienced downtimes etc.

haproxy-statistics-report-picture

The basic configuration to make the stats interface accessible, as pointed out in the above config, for example enabling a network listener on address

https://192.168.0.5:8080/stats

with the hproxyauser / password pair, would be:

# HAProxy Monitoring Config
#---------------------------------------------------------------------
listen stats 192.168.0.5:8080                #Haproxy Monitoring run on port 8080
    mode http
    option httplog
    option http-server-close
    stats enable
    stats show-legends
    stats refresh 5s
    stats uri /stats                            #URL for HAProxy monitoring
    stats realm Haproxy\ Statistics
    stats auth hproxyauser:Password___          #User and Password for login to the monitoring dashboard

 

 

Sessions states and disconnect errors on new application setup

Both TCP and HTTP logs include a termination state code that tells you the way in which the TCP or HTTP session ended. It’s a two-character code. The first character reports the first event that caused the session to terminate, while the second reports the TCP or HTTP session state when it was closed.

Here are some essential termination codes to watch for in the log, most commonly seen on TCP connection establishment errors:

Two-character code    Meaning
—    Normal termination on both sides.
cD    The client did not send nor acknowledge any data and eventually the client timeout expired.
SC    The server explicitly refused the TCP connection.
PC    The proxy refused to establish a connection to the server because the process’ socket limit was reached while attempting to connect.


To catch all improperly closed sessions, the easiest way is to just grep out everything carrying the normal termination code --, like this:

tail -f /var/log/haproxy.log | grep -v ' -- '


This should output in real time every TCP connection that is exiting improperly.

There’s a wide variety of reasons a connection may have been closed. Detailed information about all possible termination codes can be found in the HAProxy documentation.
For a better understanding, a very useful read on debugging haproxy errors is haproxy-logging.txt; in that small file are collected all the cryptic error message codes you might find in your logs when you're configuring a Haproxy frontend / backend and the backend application behind it for the first time.

Another useful tool which can be used to analyze Layer 7 HTTP traffic is halog; for more on it just google around.

Mail send from command line on Linux and *BSD servers – useful for scripting

Monday, September 10th, 2018

mail-send-email-from-command-line-on-linux-and-freebsd-operating-systems-logo

Historically, Email sending has been very different from how most people use it in the Office today. There were no heavy Email clients such as Outlook Express, no MS Exchange, no e-mail client capabilities for Calendar and Meeting schedules as in most of the modern corporate offices that depend on products such as Office 365 (I would call it a connectedHell 365 days a year !).

There were no free webmail and pop3 / imap providers such as Mail.Yahoo.com, Gmail.com, Hotmail.com, Yandex.com, RediffMail, Mail.com; the innumerable list goes on and on.
Nope, back in the day emails were doing what they were originally supposed to do, like the post services in real life: simply send and receive messages.

For those who remember those charming times, people used to be using BBS-es (which were basically shared set-up home systems acting as servers) or some of the few University internal student Email accounts, or accounts set up by crazy sysadmins who received their notification and warning logs about daemon (services) messages via local DMZ-ed network email servers, and it was common to read the email directly with the mail (mailx) text command or custom written scripts … It was not uncommon also that mailx was used heavily to send notification messages on triggered events from logs. Oh, life was simple and clear back then, and even though today email could be used in a similar fashion by hard-core old school sysadmins and DevOps for simple shell scripting tasks or report cron jobs, such usage is already deep in history.

The number of ways one could send email in text format directly from the GNU / Linux / *BSD server to another remote mail MTA node (assuming it had a properly configured Relay server, be it Exim or Postfix) were plenty.

In this article I will try to rewind back some of the UNIX history by pinpointing a few of the most common ways one used to send quick emails directly from a remote server terminal connection, or let's say from a cheap few-cents VPS server, through something like SSH or Telnet etc.
 

1. Using the mail command client (part of bsd-mailx on Debian).
 

In my previous article Linux: "bash mail command not found" error fix
I ended the article with a short explanation on how this is done but I will repeat myself one more time here for the sake of clearness of this article.

root@linux:~# echo "Your Sample Message Body" | mail -s "Whatever … Message Subject" remote_receiver@remote-server-email-address.com


The mail command will connect on TCP port 25 to the locally configured MTA and send through it. If the local MTA is misconfigured, doesn't have proper MX / PTR DNS records etc., or is not configured as an SMTP relay, mail to remote recipients will not get delivered; otherwise the sent Email should be properly delivered to the remote recipient address.
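Before blaming the mail command itself, a quick hedged sanity check is to verify that a local MTA is actually listening on port 25 at all:

root@linux:~# netstat -ltn | grep ':25 '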

How to send HTML formatted emails using the mailx command from the Linux console / terminal shell on a remote server through SSH ?

Connect to remote SSH server (VPS), dedicated server, home Linux router etc. and run:

 

root@linux:~# mailx -a 'Content-Type: text/html' \
      -s "This is advanced mailx indeed!" < email_content.html \
      "first_email_to_send_to@gmail.com, mail_recipient_2@yahoo.com"

 


email_content.html should be properly formatted (at best w3c standard compliant) HTML.

Here is an example email_content.html (skeleton file)

 

    To: your_customer@gmail.com
    Subject: This is an HTML message
    From: marketing@your_company.com
    Content-Type: text/html; charset="utf8"

    <html>
    <body>
    <div style="
        background-color:
        #abcdef; width: 300px;
        height: 300px;
        ">
    </div>
Whatever text mixed with valid email HTML tags here.
    </body>
    </html>


The above command sends to two email addresses; however, if you have a text formatted list of recipients you can easily use that file with a bash shell script for loop and send to multiple addresses read from, let's say, email_addresses_list.txt, as in the sketch below.
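A minimal hedged sketch of such a loop (assuming one recipient address per line in email_addresses_list.txt):

while read -r recipient; do
    mailx -a 'Content-Type: text/html' -s "This is advanced mailx indeed!" "$recipient" < email_content.html
done < email_addresses_list.txt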

To further advance the one-liner you may also want to provide an email attachment, let's say the file email_archive.rar, by using the -A email_archive.rar argument.

 

root@linux:~# mailx -a 'Content-Type: text/html' \
      -s "This is advanced mailx indeed!" -A ~/email_archive.rar < email_content.html \
      "first_email_to_send_to@gmail.com, mail_recipient_2@yahoo.com"

 

For those familiar with Dan Bernstein's Qmail MTA (which, even though a bit obsolete, is still a Security and Stability Beast across email servers), the mailx command had to be substituted with a custom qmail one in order to be capable of sending via the qmail MTA daemon.
 

2. Using sendmail command to send email
 

Do you remember that heavy, hard to configure MTA monster sendmail ? It was, and until this very day is, the default Mail Transport Agent for Slackware Linux.

Here is how we were supposed to send mail with it:

 

[root@sendmail-host ~]# vim email_content_to_be_delivered.txt

 

The content of the file should be something like:

Subject: This Email is sent from UNIX Terminal Email

Hi, this Email was typed in a file and sent via the sendmail console email client
(part of the sendmail mail server)

It is really fun to go back in the pre-history of Mail Content creation 🙂

 

[root@sendmail-host ~]# sendmail -v user_name@remote-mail-domain.com  < /tmp/email_content_to_be_delivered.txt

 

The -v argument will make the communication between your mail transfer agent and the remote mail server visible.
 

3. Using ssmtp command to send mail
 

The ssmtp MTA and its included shell command were used historically as they were pretty straightforward: you just launch it on the command line, type your whole email with its subject, and ship it (by pressing the CTRL + D key combination).

To give it a try you can do:

 

root@linux:~# apt-get install ssmtp
Reading package lists… Done
Building dependency tree       
Reading state information… Done
The following additional packages will be installed:
  libgnutls-openssl27
The following packages will be REMOVED:
  exim4-base exim4-config exim4-daemon-heavy
The following NEW packages will be installed:
  libgnutls-openssl27 ssmtp
0 upgraded, 2 newly installed, 3 to remove and 1 not upgraded.
Need to get 239 kB of archives.
After this operation, 3,697 kB disk space will be freed.
Do you want to continue? [Y/n] Y
Get:1 http://ftp.us.debian.org/debian stretch/main amd64 ssmtp amd64 2.64-8+b2 [54.2 kB]
Get:2 http://ftp.us.debian.org/debian stretch/main amd64 libgnutls-openssl27 amd64 3.5.8-5+deb9u3 [184 kB]
Fetched 239 kB in 2s (88.5 kB/s)         
Preconfiguring packages …
dpkg: exim4-daemon-heavy: dependency problems, but removing anyway as you requested:
 mailutils depends on default-mta | mail-transport-agent; however:
  Package default-mta is not installed.
  Package mail-transport-agent is not installed.
  Package exim4-daemon-heavy which provides mail-transport-agent is to be removed.

 

(Reading database … 169307 files and directories currently installed.)
Removing exim4-daemon-heavy (4.89-2+deb9u3) …
dpkg: exim4-config: dependency problems, but removing anyway as you requested:
 exim4-base depends on exim4-config (>= 4.82) | exim4-config-2; however:
  Package exim4-config is to be removed.
  Package exim4-config-2 is not installed.
  Package exim4-config which provides exim4-config-2 is to be removed.
 exim4-base depends on exim4-config (>= 4.82) | exim4-config-2; however:
  Package exim4-config is to be removed.
  Package exim4-config-2 is not installed.
  Package exim4-config which provides exim4-config-2 is to be removed.

Removing exim4-config (4.89-2+deb9u3) …
Selecting previously unselected package ssmtp.
(Reading database … 169247 files and directories currently installed.)
Preparing to unpack …/ssmtp_2.64-8+b2_amd64.deb …
Unpacking ssmtp (2.64-8+b2) …
(Reading database … 169268 files and directories currently installed.)
Removing exim4-base (4.89-2+deb9u3) …
Selecting previously unselected package libgnutls-openssl27:amd64.
(Reading database … 169195 files and directories currently installed.)
Preparing to unpack …/libgnutls-openssl27_3.5.8-5+deb9u3_amd64.deb …
Unpacking libgnutls-openssl27:amd64 (3.5.8-5+deb9u3) …
Processing triggers for libc-bin (2.24-11+deb9u3) …
Setting up libgnutls-openssl27:amd64 (3.5.8-5+deb9u3) …
Setting up ssmtp (2.64-8+b2) …
Processing triggers for man-db (2.7.6.1-2) …
Processing triggers for libc-bin (2.24-11+deb9u3) …

 

As you see from the above output, the local default Debian Linux Exim MTA is removed …

Lets send a simple test email …

 

hipo@linux:~# ssmtp user@remote-mail-server.com
Subject: Simply Test SSMTP Email
This Email was send just as a test using SSMTP obscure client
via SMTP server.
^d

 

What is notable about ssmtp is that, even though it is so obsolete today, it supports STARTTLS (email communication encryption), which is configured via its config file

 

/etc/ssmtp/ssmtp.conf
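For illustration, a minimal hedged sketch of what such a relay configuration with STARTTLS might look like (the mailhub host and account below are placeholders, not from a real setup):

# /etc/ssmtp/ssmtp.conf
mailhub=smtp.example.com:587
UseSTARTTLS=YES
AuthUser=your_account@example.com
AuthPass=your_password
FromLineOverride=YES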

 

4. Send Email from terminal using Mutt client
 

Mutt was and still is one of the most used console text email clients, along with Alpine and Fetchmail; to know more about it read here.

Mutt supports reading / sending mail from multiple mailboxes, is capable of fetching mail over the IMAP and POP3 protocols, and was a serious step forward over mailx. Its syntax pretty much resembles mailx commands.

 

root@linux:~# mutt -s "Test Email" user@example.com < /dev/null

 

Send an email including an attachment, a 15 megabyte MySQL backup of Squirrel Webmail:

 

root@linux:~# mutt  -s "This is last backup small sized database" -a /home/backups/backup_db.sql user@remote-mail-server.com < /dev/null

 


5. Using simple telnet to test and send email (verify existence of email on remote SMTP)
 

As a Mail Server SysAdmin this is one of my preferred ways to test whether a server is properly configured, and sometimes, just for the sake of fun, I have used it as a hack to send my mail 🙂
telnet is and will always be a great tool for troubleshooting SMTP issues.
 

It is very useful to test whether a remote SMTP TCP port 25 is opened or a local / remote server firewall prevents connections to MTA.

Below is an example connect and send example using telnet to my local SMTP on www.pc-freak.net (QMail powered (R) 🙂 )

sending-email-using-telnet-command-howto-screenshot

 

root@pcfreak:~# telnet localhost 25
Trying 127.0.0.1…
Connected to localhost.
Escape character is '^]'.
220 This is Mail Pc-Freak.NET ESMTP
HELO mail.www.pc-freak.net
250 This is Mail Pc-Freak.NET
MAIL FROM:<hipo@www.pc-freak.net>
250 ok
RCPT TO:<roots_bg@yahoo.com>
250 ok
DATA
354 go ahead
Subject: This is a test subject

 

This is just a test mail send through telnet
.
250 ok 1536440787 qp 28058
^]
telnet>

 

Note that the returned messages are native to qmail; a postfix would return slightly different content. Here is another test example against a remote SMTP running sendmail or postfix.

 

root@pcfreak:~# telnet mail.servername.com 25
Trying 127.0.0.1…
Connected to localhost.localdomain (127.0.0.1).
Escape character is '^]'.
220 mail.servername.com ESMTP Sendmail 8.13.8/8.13.8; Tue, 22 Oct 2013 05:05:59 -0400
HELO yahoo.com
250 mail.servername.com Hello mail.servername.com [127.0.0.1], pleased to meet you
mail from: systemexec@gmail.com
250 2.1.0 hipo@www.pc-freak.net… Sender ok
rcpt to: hip0d@yandex.ru
250 2.1.5 hip0d@yandex.ru… Recipient ok
data
354 Enter mail, end with "." on a line by itself
Hey
This is test email only

 

Thanks
.
250 2.0.0 r9M95xgc014513 Message accepted for delivery
quit
221 2.0.0 mail.servername.com closing connection
Connection closed by foreign host.


It is handy if you want to know whether a remote MTA server has a certain Emailbox existing or not: with telnet simply try to send to a certain email address and check the Email server's returned output. (Note that the returned message depends on the remote MTA version, and many qmails are configured not to give out information during the initial SMTP handshake but instead to return a MAILER DAEMON failure error to your sender address.) Some MX servers are still vulnerable to this enumeration attack, historically dreamhost.com among them. The below attack screenshot was made in the times before dreamhost.com fixed the brute force email issue.

Terminal-Verify-existing-Email-with-telnet

6. Using simple netcat TCP/IP Swiss Army Knife to test and send email in console

netcat-logo-a-swiff-army-knife-of-the-hacker-and-security-expert-logo
Another tool besides telnet for testing a remote / local SMTP is the netcat tool (for reading and writing data across TCP and UDP connections).

The way to do it is analogous, but since netcat is not present on most Linux OSes by default, you need to install it through the package manager first, be it apt or yum etc.

# apt-get --yes install netcat


 

First let's create a new file test_email_content.txt using bash's echo cmd.
 

 

# echo 'EHLO hostname
MAIL FROM: hip0d@yandex.ru
RCPT TO:   solutions@www.pc-freak.net
DATA
From: A tester <hip0d@yandex.ru>
To:   <solutions@www.pc-freak.net>
Date: date
Subject: A test message from test hostname

 

Delete me, please
.
QUIT
' >>test_email_content.txt

 

# netcat -C localhost 25 < test_email_content.txt

 

220 This is Mail Pc-Freak.NET ESMTP
250-This is Mail Pc-Freak.NET
250-STARTTLS
250-SIZE 80000000
250-PIPELINING
250 8BITMIME
250 ok
250 ok
354 go ahead
451 See http://pobox.com/~djb/docs/smtplf.html.

Because of its simplicity and the fact that it has a bit more capabilities for reading / writing data over the network, it was no surprise it was among the favorite tools not only of crackers and penetration testers but also a precious debug tool for the average sysadmin. netcat's advantage over telnet is that you can push-pull non-interactive input over the remote SMTP port (25).
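As a hedged aside, netcat is also handy for a quick non-interactive check of whether a remote SMTP port is reachable at all (using the -z scan flag; the hostname below is a placeholder):

# nc -zv mail.example.com 25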


7. Using openssl to connect and send email via encrypted channel

 

root@linux:~# openssl s_client -connect smtp.gmail.com:465 -crlf -ign_eof

    ===
               Certificate negotiation output from openssl command goes here
        ===

        220 smtp.gmail.com ESMTP j92sm925556edd.81 – gsmtp
            EHLO localhost
        250-smtp.gmail.com at your service, [78.139.22.28]
        250-SIZE 35882577
        250-8BITMIME
        250-AUTH LOGIN PLAIN XOAUTH2 PLAIN-CLIENTTOKEN OAUTHBEARER XOAUTH
        250-ENHANCEDSTATUSCODES
        250-PIPELINING
        250-CHUNKING
        250 SMTPUTF8
            AUTH PLAIN *passwordhash*
        235 2.7.0 Accepted
            MAIL FROM: <hipo@pcfreak.org>
        250 2.1.0 OK j92sm925556edd.81 – gsmtp
            rcpt to: <systemexec@gmail.com>
        250 2.1.5 OK j92sm925556edd.81 – gsmtp
            DATA
        354  Go ahead j92sm925556edd.81 – gsmtp
            Subject: This is openssl mailing

            Hello nice user
            .
        250 2.0.0 OK 1339757532 m46sm11546481eeh.9
            quit
        221 2.0.0 closing connection m46sm11546481eeh.9
        read:errno=0


8. Using CURL (URL transfer) tool to send SSL / TLS secured crypted channel emails via Gmail / Yahoo servers and MailGun Mail send API service


Using curl, the advanced URL transfer tool, for sending email might be shocking news to many, as its whole idea is just to transfer data from or to a server.
curl is mostly used in conjunction with PHP website scripts for the reason that it has a native PHP implementation, and many PHP based websites widely use it for download / upload of user data.
Interestingly, besides support for HTTP and FTP, it has support for the POP3 and SMTP email protocols as well.
If you don't have it installed on your server and you want to give it a try, install it first with apt:
 

root@linux:~# apt-get install curl

 


To learn more about curl capabilities make sure you check the cURL --manual arg.
 

root@linux:~# curl --manual

 

a) Sending Emails via Gmail and other Mail Public services

Curl is capable of sending emails from the terminal using the Gmail and Yahoo Mail services, if you want to give that a try.

gmail-settings-google-allow-less-secure-apps-sign-in-to-google-screenshot

Go to the myaccount.google.com URL, login from the web interface, choose Sign-in & Security, set Allow less secure apps to ON, and thus turn on access for less secure apps in Gmail. Though I have not tested it myself so far with Yahoo! Mail, I suppose it should have similar security settings somewhere.

Here is how to use curl to send email via Gmail.

Gmail-password-Allow-less-secure-apps-ON-screenshot-howto-to-be-able-to-send-email-with-text-commands-with-encryption-and-outlook

 

 

root@linux:~# curl --url 'smtps://smtp.gmail.com:465' --ssl-reqd \
  --mail-from 'your_email@gmail.com' --mail-rcpt 'remote_recipient@mail.com' \
  --upload-file mail.txt --user 'your_email@gmail.com:your_account_password'
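The mail.txt file uploaded above has to be a complete raw message with its headers included; a minimal hedged skeleton could be:

From: Your Name <your_email@gmail.com>
To: Remote Recipient <remote_recipient@mail.com>
Subject: Test mail sent with curl

Hello, this message body was shipped to the SMTPS server by curl.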


b) Sending Emails using Mailgun.com (Transactional Email Service API for developers)

To use Mailgun to script sending automated emails, go to Mailgun.com, create an account and generate a new API key.

Then use curl in a similar way like below example:

 

curl -sv --user 'api:key-7e55d003b…f79accd31a' \
    https://api.mailgun.net/v3/sandbox21a78f824…3eb160ebc79.mailgun.org/messages \
    -F from='Excited User <developer@yourcompany.com>' \
    -F to=sandbox21a78f824…3eb160ebc79.mailgun.org \
    -F to=user_acc@gmail.com \
    -F subject='Hello' \
    -F text='Testing Mailgun service!' \
   --form-string html='<h1>EDMdesigner Blog</h1><br /><cite>This tutorial helps me understand email sending from Linux console</cite>' \
    -F attachment=@logo_picture.jpg

 

The -F option, heavily present in the above command, makes curl emulate a filled-in form in which the user has pressed the submit button.
For more info on the options check out man curl.
 

 

9. Using the swaks command to send emails from the terminal

 

root@linux:~# apt-cache show swaks|grep "Description" -B 10
Package: swaks
Version: 20170101.0-1
Installed-Size: 221
Maintainer: Andreas Metzler <ametzler@debian.org>
Architecture: all
Depends: perl
Recommends: libnet-dns-perl, libnet-ssleay-perl
Suggests: perl-doc, libauthen-sasl-perl, libauthen-ntlm-perl
Description-en: SMTP command-line test tool
 swaks (Swiss Army Knife SMTP) is a command-line tool written in Perl
 for testing SMTP setups; it supports STARTTLS and SMTP AUTH (PLAIN,
 LOGIN, CRAM-MD5, SPA, and DIGEST-MD5). swaks allows one to stop the
 SMTP dialog at any stage, e.g to check RCPT TO: without actually
 sending a mail.
 .
 If you are spending too much time iterating "telnet foo.example 25"
 swaks is for you.
Description-md5: f44c6c864f0f0cb3896aa932ce2bdaa8

 

 

 

root@linux:~# apt-get install --yes swaks

root@linux:~# swaks --to mailbox@example.com -s smtp.gmail.com:587 \
      -tls -au <user-account> -ap <account-password>

 


The -tls argument forces the session through STARTTLS (in order to use Gmail's encrypted TLS channel on port 587).

If you want to hide the password and not provide it on the command line (in order not to log it into the user's shell history), use the -a option, which makes swaks prompt for the credentials instead.
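As the package description above hints, swaks can also stop the SMTP dialog at any stage; a hedged example of verifying that a mailbox is accepted without actually sending anything (the hostname is a placeholder) is:

root@linux:~# swaks --to mailbox@example.com --server mail.example.com --quit-after RCPT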

10. Using qmail-inject on Qmail mail servers to send simple emails

Create a new file with content like:
 

root@qmail:~# vim email_file_content.text
To: user@mail-example.com
Subject: Test


This is a test message.
 

root@qmail:~# cat email_file_content.text | /var/qmail/bin/qmail-inject


qmail-inject is part of the ordinary qmail installation, so it is very simple; it doesn't even return error codes, it just ships whatever is given as content to the remote MTA.
If the linux host where you invoke it has a properly configured qmail installation the email will get immediately delivered. The advantage of qmail-inject over the other tools is that it is really lightweight and will deliver the simple message more quickly than the prior heavy tools, but again it is more a message injection tool for quick debugging, if the MTA is not working, than for daily email writing.

It is also very useful for simply testing whether email sending works properly without composing any email content (I used qmail-inject to test that local email delivery works, like so):
 

root@linux:~# echo 'To: mailbox_acc@mail-server.com' | /var/qmail/bin/qmail-inject

 

11. Debugging why Email send with text tool is not being send properly to remote recipient

If you use some of the above described methods and email is not delivered to the remote recipient email addresses, check /var/log/mail.log (a general email log used by postfix MTAs and present on many of the Linux distributions) and /var/log/messages, or /var/log/qmail (on Qmail installations) or /var/log/exim4 (on servers running Exim as MTA).
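A quick hedged way to fish delivery problems out of those logs (the exact paths vary per distribution and MTA):

root@linux:~# grep -iE 'error|fail|defer' /var/log/mail.log | tail -20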

https://www.pc-freak.net/images/linux-email-log-debug-var-log-mail-output

 Closure

The ways to send email via the Linux terminal are probably innumerable, as there are plenty of scripted tools in various programming languages; I am sure this article is also missing a lot of pre-bundled installable distro packages. If you know other interesting ways / tools to send email via the terminal I would like to hear about them.

Hope you enjoyed, happy mailing !

Linux Bond network interfaces to merge multiple interfaces ISPs traffic – Combine many interfaces NIC into one on Debian / Ubuntu / CentOS / Fedora / RHEL Linux

Tuesday, December 16th, 2014

how-to-create-bond-linux-agregated-network-interfaces-for-increased-network-thoroughput-debian-ubuntu-centos-fedora-rhel
Bonding Network Traffic (link aggregation), or NIC teaming, is used to increase connection throughput and as a way to provide redundancy for services / applications in case some of the network connections (eth interfaces) fail. Network Bonding is mostly used by large computer network providers (ISPs), infrastructures, university labs or other big network accessible infrastructures, and even by enthusiasts running a home server, assuring its >= ~99% connectivity to the internet by bonding a few Internet Providers' links into a single bonded network interface. One of the most common uses of Link Aggregation nowadays is of course in Cloud environments.

Bonding Network Traffic is a must-know (daily use) skill for the sysadmin of both a small company office network environment and a large professional distributed computing network. As novice GNU / Linux sysadmins have probably never heard of it and sooner or later will have to deal with it, I've created this article as a quick and dirty guide on configuring Linux bonding across the most commonly used Linux distributions.

It is assumed that the server where you need network bonding configured has at least 2 or more PCI Gigabit NICs with a hardware driver for Linux supporting Jumbo Frames, and some relatively fresh up2date Debian Linux >=6.0.*, Ubuntu 10+ distro, CentOS 6.4, RHEL 5.1, SuSE etc.
 

1. Bond Network ethernet interfaces on Debian / Ubuntu and Deb based distributions

To make network bonding possible on Debian and derivatives you need to install support for it through the ifenslave package:

apt-cache show ifenslave-2.6|grep -i descript -A 8
Description: Attach and detach slave interfaces to a bonding device
 This is a tool to attach and detach slave network interfaces to a bonding
 device. A bonding device will act like a normal Ethernet network device to
 the kernel, but will send out the packets via the slave devices using a simple
 round-robin scheduler. This allows for simple load-balancing, identical to
 "channel bonding" or "trunking" techniques used in switches.
 .
 The kernel must have support for bonding devices for ifenslave to be useful.
 This package supports 2.6.x kernels and the most recent 2.4.x kernels.

 

apt-get --yes install ifenslave-2.6

 

A bonding interface works by creating a "virtual" network interface on the Linux kernel level; it sends and receives packets via special slave devices using a simple round-robin scheduler. This makes possible a very simple network load balancing, also known as "channel bonding" or "trunking", supported by all intelligent network switches.

Below is a text diagram showing a tiny Linux office network router configured to bond ISP interfaces for increased throughput:

 

Internet
 |                  204.58.3.10 (eth0)
ISP Router/Firewall 10.10.10.254 (eth1)
   
                              | -----+------ Server 1 (Debian FTP file server w/ eth0 & eth1) 10.10.10.1
      +------------------+ --- |
      | Gigabit Ethernet       |------+------ Server 2 (MySQL) 10.10.10.2
      | with Jumbo Frame       |
      +------------------+     |------+------ Server 3 (Apache Webserver) 10.10.10.3
                               |
                               |------+-----  Server 4 (Squid Proxy / Qmail SMTP / DHCP) 10.10.10.4
                               |
                               |------+-----  Server 5 (Nginx CDN static content Webserver) 10.10.10.5
                               |
                               |------+-----  WINDOWS Desktop PCs / Printers & Scanners, Other network devices 

 

Next, configure the just installed ifenslave bonding:
 

vim /etc/modprobe.d/bonding.conf

alias bond0 bonding
  options bonding mode=0 arp_interval=100 arp_ip_target=10.10.10.254,10.10.10.2,10.10.10.3,10.10.10.4,10.10.10.5


Where:

  1. mode=0 : Sets the bonding policy to balance-rr (round robin). This is the default mode; it provides load balancing and fault tolerance.
  2. arp_interval=100 : Sets the ARP link monitoring frequency to 100 milliseconds. Without this option you will get various warnings when starting bond0 via /etc/network/interfaces.
  3. arp_ip_target=10.10.10.254,10.10.10.2, … : Uses the 10.10.10.254 (router IP) and 10.10.10.2-5 IP addresses as ARP monitoring peers when arp_interval is > 0. This is used to determine the health of the link to the targets. Multiple IP addresses must be separated by commas (no spaces). At least one IP address must be given (usually I set it to the router IP) for ARP monitoring to function. The maximum number of targets that can be specified is 16.

Next to make bonding work its necessery to load the bonding kernel module:

modprobe -v bonding mode=0 arp_interval=100 arp_ip_target=10.10.10.254,10.10.10.2,10.10.10.3,10.10.10.4,10.10.10.5

 

Loading the bonding module should spit some good output in /var/log/messages (check it out with tail -f /var/log/messages)

Now to make bonding active it is necessary to reload the networking configuration. This is extremely risky if you don't have some way of console access such as an IPKVM / ILO / iDRAC or a Java Web Console / VPN, so before reloading the network be absolutely sure to either do it through a cronjob or at job which will automatically restart networking with the new settings and revert back to the old configuration in case the network becomes inaccessible (a hedged example of such a safety net is shown right after the restart commands below), or assure physical access to the server console if the server is at your disposal.

Whatever the case make sure you backup:

 cp /etc/network/interfaces /etc/network/interfaces.bak

vim /etc/network/interfaces

############ WARNING ####################
# You do not need an "iface eth0" nor an "iface eth1" stanza.
# Setup IP address / netmask / gateway as per your requirements.
#######################################
auto lo
iface lo inet loopback
 
# The primary network interface
auto bond0
iface bond0 inet static
    address 10.10.10.1
    netmask 255.255.255.0
    network 10.10.10.0
    gateway 10.10.10.254
    slaves eth0 eth1
    # jumbo frame support
    mtu 9000
    # Load balancing and fault tolerance
    bond-mode balance-rr
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    dns-nameservers 10.10.10.254
    dns-search nixcraft.net.in

 


As you can see from the config, there are some bond specific configuration variables that can be tuned; they can have a positive / negative impact in some cases on network throughput. As you can see, the bonding interface has slaves (these are all the other ethXX interfaces). Bonded traffic will be available via one single interface. Such a configuration is great for webhosting providers with multiple hosted sites, as hosting thousands of websites on the same server, or one single big news site, requires a lot of bandwidth and of course requires redundancy of data (guaranteeing it is up, if possible, 24/7).

Here is what each of the config options stands for:

 
  • mtu 9000 : Set MTU size to 9000. This is related to Jumbo Frames.
  • bond-mode balance-rr : Set bounding mode profiles to "Load balancing and fault tolerance". See below for more information.
  • bond-miimon 100 : Set the MII link monitoring frequency to 100 milliseconds. This determines how often the link state of each slave is inspected for link failures.
  • bond-downdelay 200 : Set the time, to 200 milliseconds, to wait before disabling a slave after a link failure has been detected. This option is only valid for bond-miimon.
  • bond-updelay 200 : Set the time, to 200 milliseconds, to wait before enabling a slave after a link recovery has been detected. This option is only valid for the bond-miimon.
  • dns-nameservers 10.10.10.254 : Use 10.10.10.254 as the dns server.
  • dns-search nixcraft.net.in : Use nixcraft.net.in as default host-name lookup (optional).

To get the best network throughput you might want to play with different bonding policies. To learn more and get the list of all bonding policies check out the Linux ethernet Bonding driver howto.

To make the new bonding active restart the network:
 

/etc/init.d/networking stop
sleep 5;
/etc/init.d/networking start
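If you are restarting the network over SSH without out-of-band console access, a hedged safety net (reusing the backup made earlier) is to schedule an automatic rollback with at before restarting, and remove the job with atrm once you confirm you still have access:

echo 'cp /etc/network/interfaces.bak /etc/network/interfaces && /etc/init.d/networking restart' | at now + 5 minutes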


2. Fedora / CentOS RHEL Linux network Bond 

Configuring eth0, eth1, eth2 into a single bond0 virtual NIC device is done within a few easy steps:

a) Create the following bond0 configuration file:
 

vim /etc/sysconfig/network-scripts/ifcfg-bond0

 

DEVICE=bond0
IPADDR=10.10.10.20
NETWORK=10.10.10.0
NETMASK=255.255.255.0
GATEWAY=10.10.10.1
USERCTL=no
BOOTPROTO=none
ONBOOT=yes


b) Modify the ifcfg-eth0 and ifcfg-eth1 files in /etc/sysconfig/network-scripts/

– Edit ifcfg-eth0

vim /etc/sysconfig/network-scripts/ifcfg-eth0

DEVICE=eth0
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none

– Edit ifcfg-eth1

vim /etc/sysconfig/network-scripts/ifcfg-eth1

DEVICE=eth1
USERCTL=no
ONBOOT=yes
MASTER=bond0
SLAVE=yes
BOOTPROTO=none


c) Load bond driver through modprobe.conf

vim /etc/modprobe.conf

alias bond0 bonding
options bond0 mode=balance-alb miimon=100


Manually load the bonding kernel driver to make it effective without a server reboot:
 

modprobe bonding

d) Restart networking to load just configured bonding 
 

service network restart


3. Testing Bond Success / Fail status

Periodically, if you have to administrate a Linux server with a bonded interface, it is useful to check the bond's link status:

cat /proc/net/bonding/bond0
 

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:d6:6c:8f

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:1e:0b:d6:6c:8c

To check out which interfaces are bonded you can either use (on older Linux kernels)
 

/sbin/ifconfig -a


If ifconfig is not returning IP addresses / interfaces of teamed up eths, to check NICs / IPs:

/bin/ip a show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:1e:0b:d6:6c:8c brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:1e:0b:d6:6c:8c brd ff:ff:ff:ff:ff:ff
7: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 00:1e:0b:d6:6c:8c brd ff:ff:ff:ff:ff:ff
    inet 10.239.15.173/27 brd 10.239.15.191 scope global bond0
    inet 10.239.15.181/27 brd 10.239.15.191 scope global secondary bond0:7156web
    inet6 fe80::21e:bff:fed6:6c8c/64 scope link
       valid_lft forever preferred_lft forever


In case of Bonding interface failure you will get output like:

Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:xx:yy:zz:tt:31
Slave Interface: eth1
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:xx:yy:zz:tt:30

Failure to start / stop bonding is also logged in /var/log/messages so it's a good idea to check there too once launched:
 

tail -f /var/log/messages
Dec  15 07:18:15 nas01 kernel: [ 6271.468218] e1000e: eth1 NIC Link is Down
Dec 15 07:18:15 nas01 kernel: [ 6271.548027] bonding: bond0: link status down for interface eth1, disabling it in 200 ms.
Dec  15 07:18:15 nas01 kernel: [ 6271.748018] bonding: bond0: link status definitely down for interface eth1, disabling it



4. Adding / removing interfaces to the bond interactively
 

You can set the mode through sysfs virtual filesystem with:

echo active-backup > /sys/class/net/bond0/bonding/mode

If you want to try adding an ethernet interface to the bond, type:

echo +ethN > /sys/class/net/bond0/bonding/slaves

To remove an interface type:

echo -ethN > /sys/class/net/bond0/bonding/slaves


In case you're wondering how many bonding devices you can have, well, "the sky is the limit"; it is only limited by the number of NIC cards your Linux kernel / distro supports and of course by how many physical NIC slots are on your server.

To monitor (in real time) adding  / removal of new ifaces to the bond use:
 

watch -n 1 'cat /proc/net/bonding/bond0'

 

Resolving “nf_conntrack: table full, dropping packet.” flood message in dmesg Linux kernel log

Wednesday, March 28th, 2012

nf_conntrack_table_full_dropping_packet
On many busy servers, you might encounter in /var/log/syslog or dmesg kernel log messages like

nf_conntrack: table full, dropping packet

to appear repeatedly:

[1737157.057528] nf_conntrack: table full, dropping packet.
[1737157.160357] nf_conntrack: table full, dropping packet.
[1737157.260534] nf_conntrack: table full, dropping packet.
[1737157.361837] nf_conntrack: table full, dropping packet.
[1737157.462305] nf_conntrack: table full, dropping packet.
[1737157.564270] nf_conntrack: table full, dropping packet.
[1737157.666836] nf_conntrack: table full, dropping packet.
[1737157.767348] nf_conntrack: table full, dropping packet.
[1737157.868338] nf_conntrack: table full, dropping packet.
[1737157.969828] nf_conntrack: table full, dropping packet.
[1737157.969928] nf_conntrack: table full, dropping packet
[1737157.989828] nf_conntrack: table full, dropping packet
[1737162.214084] __ratelimit: 83 callbacks suppressed

There are two types of servers I've encountered this message on:

1. Xen OpenVZ / VPS (Virtual Private Servers)
2. ISPs – Internet Providers with heavy traffic NAT network routers
 

I. What is the meaning of nf_conntrack: table full dropping packet error message

In short, this message is received because the assigned maximum number of nf_conntrack kernel connection tracking entries gets reached.
The common reason for that is heavy traffic passing through the server, or very often a DoS or DDoS (Distributed Denial of Service) attack. Sometimes encountering the error is a result of bad server planning (incorrect data about the expected traffic load by a company / companies) or simply a sysadmin error…

– Checking the current maximum nf_conntrack value assigned on host:

linux:~# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
65536

– Alternative way to check the current kernel values for nf_conntrack is through:

linux:~# /sbin/sysctl -a|grep -i nf_conntrack_max
error: permission denied on key 'net.ipv4.route.flush'
net.netfilter.nf_conntrack_max = 65536
error: permission denied on key 'net.ipv6.route.flush'
net.nf_conntrack_max = 65536

– Check the current sysctl nf_conntrack active connections

To check the presently opened connection tracking entries on a system:

linux:~# /sbin/sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 12742

The shown connections are assigned dynamically on each new successful TCP / IP NAT-ted connection. Btw, on systems that work normally without the dmesg log being flooded with the message, the output of lsmod is:

linux:~# /sbin/lsmod | egrep 'ip_tables|conntrack'
ip_tables 9899 1 iptable_filter
x_tables 14175 1 ip_tables

On servers which are encountering the nf_conntrack: table full, dropping packet error you can see, when issuing lsmod, extra modules related to nf_conntrack shown as loaded:

linux:~# /sbin/lsmod | egrep 'ip_tables|conntrack'
nf_conntrack_ipv4 10346 3 iptable_nat,nf_nat
nf_conntrack 60975 4 ipt_MASQUERADE,iptable_nat,nf_nat,nf_conntrack_ipv4
nf_defrag_ipv4 1073 1 nf_conntrack_ipv4
ip_tables 9899 2 iptable_nat,iptable_filter
x_tables 14175 3 ipt_MASQUERADE,iptable_nat,ip_tables

 

II. Remove completely nf_conntrack support if it is not really necessary

It is a good practice to limit or try to omit completely the use of any iptables NAT rules to prevent yourself from ending up flooding your kernel log with the messages, and respectively to stop your system from dropping connections.

Another option is to completely remove any modules related to nf_conntrack, iptables_nat and nf_nat.
To remove nf_conntrack support from the Linux kernel, if for instance the system is not used for Network Address Translation use:

/sbin/rmmod iptable_nat
/sbin/rmmod ipt_MASQUERADE
/sbin/rmmod nf_nat
/sbin/rmmod nf_conntrack_ipv4
/sbin/rmmod nf_conntrack
/sbin/rmmod nf_defrag_ipv4

Once the modules are removed, be sure not to use any iptables -t nat .. rules. Even an attempt to list the NAT related rules with iptables -t nat -L -n will force the kernel to load the nf_conntrack modules again.
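To verify the conntrack modules have stayed unloaded without accidentally re-triggering them through iptables, check lsmod instead:

linux:~# /sbin/lsmod | egrep 'conntrack|nf_nat'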

Btw, the nf_conntrack: table full, dropping packet. message is observable across all GNU / Linux distributions, so this is not some kind of local distribution bug or Linux kernel (distro) customization.
 

III. Fixing the nf_conntrack … dropping packets error

– One temporary fix, if you need to keep your iptables NAT rules, is:

linux:~# sysctl -w net.netfilter.nf_conntrack_max=131072

I say temporary, because raising nf_conntrack_max doesn't guarantee things will go smoothly from now on.
However on many not-so-heavily traffic loaded servers just raising net.netfilter.nf_conntrack_max=131072 to a high enough value will be enough to resolve the hassle.

– Increasing the size of nf_conntrack hash-table

The hash table hashsize value, which stores lists of conntrack entries, should be increased proportionally whenever net.netfilter.nf_conntrack_max is raised.

linux:~# echo 32768 > /sys/module/nf_conntrack/parameters/hashsize
The rule to calculate the right value to set is:
hashsize = nf_conntrack_max / 4
(e.g. for nf_conntrack_max = 131072 that gives 131072 / 4 = 32768, the value echoed above)

– To permanently store the made changes: a) put into /etc/sysctl.conf:

linux:~# echo 'net.netfilter.nf_conntrack_max = 131072' >> /etc/sysctl.conf
linux:~# /sbin/sysctl -p

b) put in /etc/rc.local (before the exit 0 line):

echo 32768 > /sys/module/nf_conntrack/parameters/hashsize

Note: Be careful with this variable; according to my experience, raising it to a too high value (especially on XEN patched kernels) could freeze the system. Raising the value too high can also freeze a regular Linux server running on old hardware.

– For the diagnosis of nf_conntrack there is the /proc/sys/net/netfilter kernel memory stored directory. There you can find some dynamically stored values which give info concerning nf_conntrack operations in "real time":

linux:~# cd /proc/sys/net/netfilter
linux:/proc/sys/net/netfilter# ls -al nf_log/

total 0
dr-xr-xr-x 0 root root 0 Mar 23 23:02 ./
dr-xr-xr-x 0 root root 0 Mar 23 23:02 ../
-rw-r--r-- 1 root root 0 Mar 23 23:02 0
-rw-r--r-- 1 root root 0 Mar 23 23:02 1
-rw-r--r-- 1 root root 0 Mar 23 23:02 10
-rw-r--r-- 1 root root 0 Mar 23 23:02 11
-rw-r--r-- 1 root root 0 Mar 23 23:02 12
-rw-r--r-- 1 root root 0 Mar 23 23:02 2
-rw-r--r-- 1 root root 0 Mar 23 23:02 3
-rw-r--r-- 1 root root 0 Mar 23 23:02 4
-rw-r--r-- 1 root root 0 Mar 23 23:02 5
-rw-r--r-- 1 root root 0 Mar 23 23:02 6
-rw-r--r-- 1 root root 0 Mar 23 23:02 7
-rw-r--r-- 1 root root 0 Mar 23 23:02 8
-rw-r--r-- 1 root root 0 Mar 23 23:02 9

 

IV. Decreasing other nf_conntrack NAT time-out values to protect the server against DoS attacks

Generally, the default values for the nf_conntrack_* time-outs are (unnecessarily) large.
Therefore, for large flows of traffic, even if you increase nf_conntrack_max, you can still shortly get an nf_conntrack table overflow resulting in dropped server connections. To prevent this from happening, check and decrease the other nf_conntrack connection tracking timeout values:

linux:~# sysctl -a | grep conntrack | grep timeout
net.netfilter.nf_conntrack_generic_timeout = 600
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.ipv4.netfilter.ip_conntrack_generic_timeout = 600
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2 = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_udp_timeout = 30
net.ipv4.netfilter.ip_conntrack_udp_timeout_stream = 180
net.ipv4.netfilter.ip_conntrack_icmp_timeout = 30

All the timeouts are in seconds. net.netfilter.nf_conntrack_generic_timeout, as you see, is quite high: 600 secs (10 minutes).
This kind of value means any NAT-ted connection not responding can stay hanging for 10 minutes!

The value net.netfilter.nf_conntrack_tcp_timeout_established = 432000 is quite high too (5 days!)
If these values are not lowered, the server will be an easy target for anyone who would like to flood it with excessive connections; once this happens the server will quickly reach even the raised value for net.nf_conntrack_max and the initial connection dropping will re-occur again …

With all said, to protect the server from malicious users situated behind the NAT plaguing you with Denial of Service attacks:

Lower net.ipv4.netfilter.ip_conntrack_generic_timeout to 60 – 120 seconds and net.ipv4.netfilter.ip_conntrack_tcp_timeout_established to something like 54000:

linux:~# sysctl -w net.ipv4.netfilter.ip_conntrack_generic_timeout=120
linux:~# sysctl -w net.ipv4.netfilter.ip_conntrack_tcp_timeout_established=54000

These timeouts should work fine on the router without creating interruptions for regular NAT users. After changing the values and monitoring for at least a few days, make the changes permanent by adding them to /etc/sysctl.conf:

linux:~# echo 'net.ipv4.netfilter.ip_conntrack_generic_timeout = 120' >> /etc/sysctl.conf
linux:~# echo 'net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 54000' >> /etc/sysctl.conf

How to auto restart CentOS Linux server with software watchdog (softdog) to reduce server downtime

Wednesday, August 10th, 2011

How to auto restart centos with software watchdog daemon to mitigate server downtimes, watchdog linux artistic logo

I’m in charge of dozens of Linux servers these days and am therefore required to restart many of the servers with a support ticket (because many of the Data Centers where the servers are co-located do not have a web interface or IPKVM connected to the server for that purpose). Therefore the server restart requests in case of a crash sometimes get processed in a few hours, or in the best case in at least half an hour.

I’m aware of the existence of Hardware Watchdog devices, which are capable of detecting whether a server has hanged and auto-restarting it, however the servers I administrate do not have hardware support for a Watchdog timer.

Thankfully there is a free software project called Watchdog which is easily configured and mitigates the terrible downtimes caused every now and then by a server crash and the respective delays by tech support in Data Centers.

I’ve recently blogged on the topic of Debian Linux auto-restart in case of kernel panic, however now I had to configure watchdog on some dozens of CentOS Linux servers.

It appeared installation & configuration of Watchdog on CentOS is a piece of cake and comes down to simply following a few easy steps, which I’ll explain quickly in this post:

1. Install with yum watchdog to CentOS

[root@centos:/etc/init.d ]# yum install watchdog
...

2. Add to the configuration a log file to log watchdog activities and the location of the watchdog device

The quickest way to add these two is to use echo to append them to /etc/watchdog.conf:

[root@centos:/etc/init.d ]# echo 'file = /var/log/messages' >> /etc/watchdog.conf
[root@centos:/etc/init.d ]# echo 'watchdog-device = /dev/watchdog' >> /etc/watchdog.conf

3. Load the softdog kernel module to initialize the software watchdog via /dev/watchdog

[root@centos:/etc/init.d ]# /sbin/modprobe softdog

Initialization of softdog should be indicated by a line in the dmesg kernel log like the one below:

[root@centos:/etc/init.d ]# dmesg |grep -i watchdog
Software Watchdog Timer: 0.07 initialized. soft_noboot=0 soft_margin=60 sec (nowayout= 0)

4. Include the softdog kernel module to load on CentOS boot up

This is necessary, because otherwise after a reboot softdog would not be auto-initialized, and without it being initialized the watchdog daemon service could not function, as it automatically reboots the server if /dev/watchdog disappears.

It’s better that the softdog module is not loaded via /etc/rc.local but via the default CentOS methodology of loading modules from /etc/rc.modules:

[root@centos:/etc/init.d ]# echo modprobe softdog >> /etc/rc.modules
[root@centos:/etc/init.d ]# chmod +x /etc/rc.modules

5. Start the watchdog daemon service

The successful initialization of softdog in step 3 should have provided the system with /dev/watchdog; before proceeding with starting up the watchdog daemon it’s wise to first check whether /dev/watchdog is existent on the system. Here is how:

[root@centos:/etc/init.d ]# ls -al /dev/watchdog
crw------- 1 root root 10, 130 Aug 10 14:03 /dev/watchdog

Being sure that /dev/watchdog is there, I’ll start the watchdog service.

[root@centos:/etc/init.d ]# service watchdog restart
...

Very important note to make here is that you should never ever configure watchdog service to run on boot time with chkconfig. In other words the status from chkconfig for watchdog boot on all levels should be off like so:

[root@centos:/etc/init.d ]# chkconfig --list |grep -i watchdog
watchdog 0:off 1:off 2:off 3:off 4:off 5:off 6:off

Enabling watchdog from chkconfig will cause watchdog to automatically restart the system, as it will probably start the watchdog daemon before the softdog module is initialized. As watchdog will be unable to read /dev/watchdog, it will think the system has hanged even though the system might be in the middle of the boot process. Therefore it will end up in an endless loop of reboots, which can only be fixed in Linux single user mode!!! Once again BEWARE, never ever activate watchdog via chkconfig!
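If the service has ever been enabled for boot, a safe way to make sure it is switched off on all runlevels is:

[root@centos:/etc/init.d ]# chkconfig watchdog off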

The next step, to be absolutely sure that the watchdog daemon is running, is to check it with the normal ps command:

[root@centos:/etc/init.d ]# ps aux|grep -i watchdog|grep -v grep
root 18692 0.0 0.0 1816 1812 ? SNLs 14:03 0:00 /usr/sbin/watchdog
root 25225 0.0 0.0 0 0 ? ZN 17:25 0:00 [watchdog] <defunct>

You have probably noticed the defunct state of watchdog; consider that absolutely normal. The above output indicates that watchdog is now properly running on the host and waiting to auto reboot it in case of a sudden /dev/watchdog disappearance.

As a last step, after being sure it’s initialized properly, it’s necessary to add watchdog to run on boot time via the /etc/rc.local post init script, like so:

[root@centos:/etc/init.d ]# echo '/sbin/service watchdog start' >> /etc/rc.local

Now enjoy, watchdog is up and running and will automatically restart the CentOS host 😉