Posts Tagged ‘server’

Linux: extending the life of a damaged hard drive with tricks on a live server. Force fsck on next reboot. Read-only file system error solutions

Friday, February 17th, 2023


In our daily work as system administrators we maintain some very old legacy systems running Clustered High Availability proxies managed with CRM (Cluster Resource Manager), some of them still using Heartbeat to manage the cluster instead of the newer and more modern Corosync variant.

The HA cluster is just a 2-node Linux setup running an obscure, long-unsupported version of Redhat 5.11, which became stable in a distant year (yeah, the years were good) and whose EOL (End of Life) was reached long ago, so the OS is no longer supported. However, for about 14 years the machines had been running perfectly fine, until problems started on one of the cluster nodes whose resources are managed by ocf::heartbeat:IPaddr2, that is the /etc/ha.d/resource.d/IPaddr2 shell script. Yeah, for the newbies: a Heartbeat Application Cluster in Linux works like that, it uses a number of extendable shell scripts written for the HA management of different kinds of Network / Web / Mail / SQL or whatever services.

The first configured node, however, started failing with errors like:
 

EXT3-fs error (device dm-1): ext3_journal_start_sb: Detected aborted journal
sd 0:2:0:0: rejecting I/O to offline device
Aborting journal on device sda1.
sd 0:2:0:0: rejecting I/O to offline device
printk: 159 messages suppressed.
Buffer I/O error on device sda1, logical block 526
lost page write due to I/O error on sda1
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
megaraid_sas: FW was restarted successfully, initiating next stage…
megaraid_sas: HBA recovery state machine, state 2 starting…
megasas: Waiting for FW to come to ready state
megasas: FW in FAULT state!!
FW state [-268435456] hasn't changed in 180 secs
megaraid_sas: out: controller is not in ready state
megasas: waiting_for_outstanding: after issue OCR. 
megasas: waiting_for_outstanding: before issue OCR. FW state = f0000000
megaraid_sas: pending commands remain even after reset handling. megasas[0]: Dumping Frame Phys Address of all pending cmds in FW
megasas[0]: Total OS Pending cmds : 0 megasas[0]: 64 bit SGLs were sent to FW
megasas[0]: Pending OS cmds in FW :

The result of that was that the machine's filesystem frequently got re-mounted Read-Only, and of course that is
quite bad if you have running haproxy processes that should be living there and taking up some Web traffic
for high availability, and you end up running all the traffic only on the 2nd machine of the pair.

This of course was a clear sign of a failing disk, some hit bad-block regions or, as the messages indicate, some
problem with the system hardware or the RAID SAS array.

The physical raid on the system, just like rest of the hardware is very old stuff as well.

[root@haproxy_lb_node1 ~]# lspci |grep -i RAI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05)

The produced errors not only made the machine auto-mount its root / filesystem in Read-Only mode, but most likely
also made the machine automatically reboot every few days, or a few times a day in a row.

The second Load Balancer, node2, operated perfectly, and we thought we might just keep the broken machine in that half-running
and inconsistent state for a few weeks until we had built the new machines with a pre-installed new haproxy cluster on a modern
RedHat Linux 8.6 distribution, but since we have to follow SLAs (Service Level Agreements) with Customers, the end services behind the
High Availability (HA) Haproxy cluster were in danger …

We as sysadmins had the task to do our best to stabilize the unstable node with disk errors, so the system could survive
and be able to serve traffic normally (in case node2, which is in a separate Data center, fails due to hardware or electricity issues etc.).

Here are a few steps we took that have hopefully improved the situation.

1. Make backups of the most important files

Always, before doing anything with a broken system, prepare a backup of the most important files. If that is a cluster, that should be a backup of the cluster configurations (if you don't already have one), a backup of /etc/hosts, backups of any important service configs such as /etc/haproxy/haproxy.cfg or /etc/postfix/main.cf (like it was in my case), preferably a backup of the whole /etc/, any important files from /root/ or /home/users* directories, and a backup of at least the latest logs from /var/log etc.
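For example, a minimal sketch of such a backup (the backup host, user and paths are just placeholders, adjust them to your environment):

# tar -czpf /root/etc-backup-$(date +%F).tar.gz /etc
# tar -czpf /root/varlog-backup-$(date +%F).tar.gz /var/log
# scp /root/*-backup-*.tar.gz backupuser@backuphost:/backups/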
 

2. Clear up all unnecessary service scripts from the server

Any additional Software / Services, integrity-checking tools (daemons), scripts and cron jobs were immediately stopped and, where unused, removed.

E.g. we went through /etc/cron* to check what's there:

# ls -ld /etc/cron.*
drwx------ 2 root root 4096 Feb  7 18:13 /etc/cron.d
drwxr-xr-x 2 root root 4096 Feb  7 17:59 /etc/cron.daily
-rw-r--r-- 1 root root    0 Jul 20  2010 /etc/cron.deny
drwxr-xr-x 2 root root 4096 Jan  9  2013 /etc/cron.hourly
drwxr-xr-x 2 root root 4096 Jan  9  2013 /etc/cron.monthly
drwxr-xr-x 2 root root 4096 Aug 26  2015 /etc/cron.weekly

 

And, like good professional butchers, we removed everything unnecessary that could trigger any extra disk reads / writes to the HDD.

E.g. just create the backup directories:

# mkdir -p /root/etc_old/{/etc/cron.d,\
/etc/cron.daily,/etc/cron.hourly,/etc/cron.monthly\
,/etc/cron.weekly}

 

And move all unnecessary cron job scripts there, like the following (example mv commands are shown right after the list):

1. nmon (old-school console tool for monitoring and tuning server network / memory / hard disk parameters)
2. clamscan / freshclam crons
3. mlocate (the cron script that periodically runs the updatedb command so the locate command can easily search
for files inside its DB; it saves you from running something like find / -iname '*whatever_you_look_for*' every time,
but the DB update itself puts extra read load on the disk)
4. cups cron jobs
5. logwatch cron
6. rkhunter stuff
7. logrotate (yes, we stopped even the log rotation trigger job, as we found the server sometimes crashed at the very moment
the logrotate job rotating the logs inside /var/log/* was running, perhaps because the rotation hit the I/O read error (bad blocks)).
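For example (the exact script names are only illustrative, first check with ls what is actually present on your system):

# ls /etc/cron.daily /etc/cron.d
# mv /etc/cron.daily/mlocate.cron /root/etc_old/etc/cron.daily/
# mv /etc/cron.daily/logrotate /root/etc_old/etc/cron.daily/
# mv /etc/cron.daily/rkhunter /root/etc_old/etc/cron.daily/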


We also inspected the root user's crontab for any unwanted scripts and stopped two reporting bash scripts that were part of the tightened PCI Security procedures.
There we found a script responsible for periodically reporting the list of installed packages and whether they had changed, as well as a script to periodically report via email the list of
users created in /etc/passwd and /etc/shadow, historically used to keep an eye on the user list and easily spot if someone
has created new users on the machine. Those were enabled via /var/spool/cron/root cron jobs. If it happens to you on other machines,
it is a good idea to check all the existing user cron jobs and stop anything that might be putting extra Read / Write pressure on the machine's attached hard drives.

# ls -al /var/spool/cron/
total 20
drwx------  2 root root 4096 Nov 13  2015 .
drwxr-xr-x 12 root root 4096 May 11  2011 ..
-rw-------  1 root root  133 Nov 13  2015 root


3. Clear up old log files and any unnecessary files

Under /var/log, /home, /var/tmp and /var/spool/tmp, immediately try to clear up the old log files.
From my past experience this has many times freed up FS inodes that are stored on an unbroken part (good blocks) of the hard drive,
ready to be reused by the files newly written by the rsyslog / syslogd services.

!!! Note that during the removal of some files you might hit files stored on bad blocks, which might lead to an unexpected system reboot.

But that's okay, don't worry: most likely after a hard reset by a technician in the Datacenter the machine will boot again, and you can continue
removing the remaining files and sending them off to old-file heaven.
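A small sketch of the kind of clean-up meant here (the patterns and the 30-day retention are only examples, review the list before deleting):

# find /var/log /var/tmp /var/spool/tmp -type f -name '*.gz' -mtime +30 -print
# find /var/log /var/tmp /var/spool/tmp -type f -name '*.gz' -mtime +30 -delete

If a daemon still keeps an open handle to a huge log, truncating it (e.g. > /var/log/some-huge-old.log) frees the space without removing the file from under the daemon.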

 

4. Trigger an automatic filesystem check with fsck on next boot

The standard way to force Linux to automatically recheck its root filesystem is to simply create the /forcefsck file in the root of the root partition, or of any other secondary disk partition you would like checked.

# touch /forcefsck

# reboot


However, on some occasions you might be unable to do that because the / (root fs) has been remounted in Read-Only mode, yikes …

Luckily, old Linux distributions like this RHEL 5.11 have another way to force a filesystem check on reboot, identify any
unknown bad blocks and hopefully succeed in isolating them, so you don't hit the same auto-reboots again (provided the hard drive or Software / Hardware RAID
is not in a terrible state): you can use an option built into the /sbin/shutdown command, '-F':

   -F     Force fsck on reboot.


Hence to make the machine reboot and trigger immediately fsck:

# shutdown -rF now


Just in case you wonder why you have to reboot before checking the filesystems: simply because they need to be unmounted before you check them.

In that specific case this produced a good result so far: the machine booted just fine, and we crossed our fingers and prayed that it would work flawlessly in the coming few weeks, until we finalize the configuration of the substitute machines, to which this old infrastructure will be migrated as a newly built cluster with a new Haproxy and Corosync / Pacemaker on a brand new RHEL.

NB! On newer machines this won't work, however, as the shutdown command has been stripped of this option: it belonged to the SystemV (sysvinit) / Upstart era and is not present on the newer systemd service architecture.
 

5. Hints on checking the hard drives with fsck

If you happen to have physical access to the remote hardware machine via a TTY[1-9] console, that's even better and is the standard way to do it, but in this specific case we had no easy way to get access to the physical server console.

It is even better to go there and, via either a connected Monitor (Display) or a KVM switch (for those who hear of a KVM switch for the first time: this is a great device in server rooms that lets you attach the Keyboard, Video and Mouse of many servers to a single console), boot one of the multitude of USB Linux recovery distros, or, on older machines like this one, a CDROM / DVD with Redhat's recovery mode on it.
After booting the recovery environment, with the partitions left unmounted, simply check each of the disks,
e.g.:

# fsck -y /dev/sdb
# fsck -y /dev/sdc

Or, if you don't want to waste time checking each hard drive one by one, but directly check all the ones that are attached and known to the Linux distro via the /etc/fstab definitions, run:

# fsck -AR

If necessary, and you have a mixture of filesystems, for example EXT3, EXT4, REISERFS, you can tell it to omit some filesystem, for example ext3, like that:

# fsck -AR -t noext3 -y


To make fsck skip mounted partitions:

# fsck -M /dev/sdb


One remark to make here about fsck: to complete its job on the various filesystems it uses external helper binaries, usually stored as /sbin/fsck.*:

ls -al /sbin/fsck*
-rwxr-xr-x 1 root root  55576 20 Jan 2022 /sbin/fsck*
-rwxr-xr-x 1 root root  43272 20 Jan 2022 /sbin/fsck.cramfs*
lrwxrwxrwx 1 root root      9  4 Jul 2020 /sbin/fsck.exfat -> exfatfsck*
lrwxrwxrwx 1 root root      6  7 Jun 2021 /sbin/fsck.ext2 -> e2fsck*
lrwxrwxrwx 1 root root      6  7 Jun 2021 /sbin/fsck.ext3 -> e2fsck*
lrwxrwxrwx 1 root root      6  7 Jun 2021 /sbin/fsck.ext4 -> e2fsck*
-rwxr-xr-x 1 root root  84208  8 Feb 2021 /sbin/fsck.fat*
-rwxr-xr-x 2 root root 393040 30 Nov 2009 /sbin/fsck.jfs*
-rwxr-xr-x 1 root root 125184 20 Jan 2022 /sbin/fsck.minix*
lrwxrwxrwx 1 root root      8  8 Feb 2021 /sbin/fsck.msdos -> fsck.fat*
-rwxr-xr-x 1 root root    333 16 Dec 2021 /sbin/fsck.nfs*
lrwxrwxrwx 1 root root      8  8 Feb 2021 /sbin/fsck.vfat -> fsck.fat*


6. Using tune2fs to adjust tunable filesystem parameters on ext2/ext3/ext4 filesystems (a few examples)

a) To check whether the filesystem was really checked at boot time, or to check any filesystem on the server for its last fsck check-up date:

#  tune2fs -l /dev/sda1 | grep checked
Last checked:             Wed Apr 17 11:04:44 2019

On some distributions like old Debian and Ubuntu, it is even possible to make fsck log its operations during the check on reboot by changing the verbosity from no to yes:

# sed -i "s/#VERBOSE=no/VERBOSE=yes/" /etc/default/rcS


If you're having the issues on old Debian Linuxes and not on RHEL, it is also possible to:

b) Enable all automatic fsck repairs on boot

by running:
 

# sed -i "s/FSCKFIX=no/FSCKFIX=yes/" /etc/default/rcS


c) Forcing an fsck check for server-attached Hard Drive Partitions with tune2fs

# tune2fs -c 1 /dev/sdXY

Note that:
tune2fs can force a fsck on each reboot for EXT4, EXT3 and EXT2 filesystems only.

tune2fs can trigger a forced fsck on every reboot using the -c (max-mount-counts) option.
This option sets the number of mounts after which the filesystem will be checked, so setting it to 1 will run fsck each time the computer boots.
Setting it to -1 or 0 resets this (the number of times the filesystem is mounted will be disregarded by e2fsck and the kernel).
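To later undo the forced per-boot check and verify the counter (the device name is just an example):

# tune2fs -c -1 /dev/sdXY
# tune2fs -l /dev/sdXY | grep 'Maximum mount count'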


 For example you could:

d) Set fsck to run a filesystem check every 30 boots, by using -c 30 
 

# tune2fs -c 30 /dev/sdXY


e) Checking whether a Hard Drive has really been checked on boot

 

#  tune2fs -l /dev/sda1 | grep checked
Last checked:             Wed Apr 17 11:04:44 2019


f) Check when was the last time the filesystem /dev/sdX was checked:
 

# tune2fs -l /dev/sdX | grep Last\ c
Last checked:             Thu Jan 12 20:28:34 2017


g) Check how many times our /dev/sdX filesystem has been mounted

# tune2fs -l /dev/sdX | grep Mount
Mount count:              157

h) Check how many mounts are allowed to pass before a filesystem check is forced
 

# tune2fs -l /dev/sdX | grep Max
Maximum mount count:      -1


7. Repairing disks / partitions via the GRUB fsck.mode and fsck.repair kernel options

It is also possible to force an fsck repair on boot via GRUB, but that is usually not an option someone would like to keep permanently, as the machine might fail to boot if the repair turns out to be hard; however, in difficult situations with failing disks, temporarily enabling it is a good idea.

This can be done by including in the GRUB default config (/etc/default/grub):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash fsck.mode=force fsck.repair=yes"

fsck.mode=force – will force an fsck each time the system boots; keeping that value enabled for a long time inside GRUB is a bad idea for servers, as booting could sometimes be severely prolonged because of the checks, especially on servers with many or slow old hard drives.

fsck.repair=yes – will make fsck try to repair if it finds bad blocks when checking (be absolutely sure you know what you're doing if passing this option).

The options can also be set by editing the kernel command line on the GRUB boot screen, if you have physical access to the server and don't want to regenerate the grub config and possibly make the machine unbootable on next boot.
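If you go the /etc/default/grub way on a RHEL-family system, the change normally has to be regenerated into the active config before it takes effect (the path below is the usual BIOS-install location; UEFI installs keep it under /boot/efi):

# grub2-mkconfig -o /boot/grub2/grub.cfg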
 

8. A few more details on how the /etc/fstab fsck check parameter values work on systemd Linux machines

The "proper" way on systemd (if we can talk about a proper way on Linux) to run fsck for each filesystem that has an fsck helper is to set a number greater than 0 in the
last column of /etc/fstab, so make sure you edit your /etc/fstab if that's not the case.

The root partition should be set to 1 (first to be checked), while other partitions you want to be checked should be set to 2.

Example /etc/fstab:
 

# /etc/fstab: static file system information.

/dev/sda1  /      ext4  errors=remount-ro  0  1
/dev/sda5  /home  ext4  defaults           0  2

The meaning of the values you can put here as the last number is as follows:
0 – disabled, that is, do not check the filesystem
1 – a partition with this PASS value has the highest priority and is checked first. This value is usually set for the root / partition
2 – partitions with this PASS value will be checked after it

a) Check the log produced by fsck

Unfortunately, on older Linux distro versions with SystemV, the fsck log output might not be generated anywhere except on the physical console, so if you have some kind of console duplicator device / physical tty attached to the display port of the server, you might capture some bad-block reports or fixed-error messages; if you don't, you might just cross your fingers and hope that any FS irregularities found were recovered.

On systemd Linux machines the fsck log should be produced either in /run/initramfs/fsck.log or in some other location depending on the Linux distro, and you should be able to see something from fsck inside the /var/log/* logs:

# grep -rli fsck /var/log/*


Close it up

Having a system with a failing disk is really one of the worst sysadmin nightmares to get. The good news is that in most cases we're prepared with some working backup or some workaround, like the few steps explained above, to mitigate the amount of Reads / Writes to the failing machine's HDDs. If the failing disk holds the primary Linux filesystem, everything becomes even worse, as on every next reboot you have no guarantee that the kernel / initrd or some of the other system components required to run the core Linux system won't break the normal boot.

Thus, making changes on the hard drives is a risky business. On the other side, if you're in a situation where you have a mirror system, or the failing system has a cluster pair rather than being a standalone Linux server, then this is not such a big deal, as you can guarantee at least one of the nodes stays up, running and serving. Still, doing too many operations on the HDD is always a danger, so even though the described steps in most cases lead to an improvement in how the system behaves, the system should be considered totally unreliable and closely monitored, not only by some monitoring stack like Zabbix / Prometheus or whatever, but also by regularly checking the system state via normal SSH logins.

It is important, if you have some valuable data or logs on the system that are not synchronized to another node, to copy them off before doing any of the described operations. Once at least the minimum is backed up, proceed to clear up everything that can be cleared while still letting the machine provide most of its functionality, trigger the automatic fsck HDD check on next reboot, reboot, check what is going on and monitor the machine from there on.

Hopefully the few described steps have helped some sysadmin. There are plenty of things that might still go wrong; even following the described steps might not help if the machine's Storage Drives / SAS / SSD have too much damage. But, as said, in most cases following these few steps will improve the machine's state.

Wish you the best of luck!

 

How to disable haproxy log for certain frontend / backend or stop haproxy logging completely

Wednesday, September 14th, 2022


In my previous article I've briefly explained how it is possible to configure multiple haproxy instances to log into separate log files, as well as how to configure a specific frontend to log into a separate file. Sometimes it is simply unnecessary to keep any kind of log file for haproxy, to spare disk space or even for anonymity of the traffic. Hence this tiny article will explain how to disable haproxy logging globally and how logging for a certain frontend or backend can be stopped.

1. Disable haproxy service logging globally
 

Disabling logging globally for haproxy, in case you don't need the log, is achieved by redirecting the log variable to the /dev/null handler. To also mute the recurring alert, notice and info messages that are produced on some extraordinary events during start / stop of haproxy or on missing backends etc., you can send those messages to the local0 and local1 facilities, which will later be discarded by the rsyslogd configuration. For example, this can be achieved with a configuration like:
 

global
    log /dev/log    local0 info alert
    log /dev/log    local1 notice alert

defaults
    log global
    mode http
    option httplog
    option dontlognull

 

<level>    is optional and can be specified to filter outgoing messages. By
           default, all messages are sent. If a level is specified, only
           messages with a severity at least as important as this level
           will be sent. An optional minimum level can be specified. If it
           is set, logs emitted with a more severe level than this one will
           be capped to this level. This is used to avoid sending "emerg"
           messages on all terminals on some default syslog configurations.
           Eight levels are known :
             emerg  alert  crit   err    warning notice info  debug

         

By using the log level you can also tell haproxy to omit errors from the log if, for some reason, haproxy receives a lot of errors and this is flooding your logs, like this:

    backend Backend_Interface
  http-request set-log-level err
  no log


But sometimes you might need to disable it for a single frontend only, and so comes the question:


2. How to disable logging for a single frontend interface?

I thought that might be more complex, but it was pretty easy with the option dontlog-normal haproxy.cfg directive:

Here is a sample configuration with a frontend and a backend, showing how to instruct the haproxy frontend to stop logging normal (successful) connections on that frontend
 

frontend ft_Frontend_Interface
#        log  127.0.0.1 local4 debug
        bind 10.44.192.142:12345
       
option dontlog-normal
        mode tcp
        option tcplog

              timeout client 350000
        log-format [%t]\ %ci:%cp\ %fi:%fp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        default_backend bk_WLP_echo_port_service

backend bk_Backend_Interface
                        timeout server 350000
                        timeout connect 35000
        server serverhost1 10.10.192.12:12345 weight 1 check port 12345
        server serverhost2 10.10.192.13:12345 weight 3 check port 12345

 


As you can see from this config, we have also enabled a health check on port 12345, which is the application service port: if something goes wrong with the application and 12345 no longer responds, the respective server will be excluded automatically by haproxy and only one of the machines will serve. The weight tells haproxy which server will be preferred to serve the traffic; with this weight ratio, 1 request will end up on one machine and 3 requests on the other.


3. How to disable logging for a single configured backend but still keep a log for the frontend
 

Omit the use of option dontlog-normal in the frontend; inside the backend, just set no log:

backend bk_Backend_Interface
                       
 no log
                        timeout server 350000
                        timeout connect 35000
        server serverhost1 10.10.192.12:12345 weight 1 check port 12345
        server serverhost2 10.10.192.13:12345 weight 3 check port 12345

That's all; reload the haproxy service on the machine and the backend will no longer log to your default configured log file via the respective local0 – local6 facility.
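For example, on a systemd machine (on older SysV init setups the equivalent would be the init script's reload action):

# systemctl reload haproxy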

How to configure haproxy logging to separate file on Redhat Enterprise Linux 8.5 Ootpa

Thursday, February 3rd, 2022


Configuring proper logging for haproxy is always a pain in the ass in Linux, because of rsyslogd's varying config syntax between versions, because of bugs in the OS etc.
Today we were given 2 Redhat 8.5 Linux servers where we had the task of configuring haproxies. To have an idea of what is going on, of course we had to enable proper haproxy logging to a separate log file under a separate local facility; for a test one can use haproxy's

log /dev/log local6

config; this is the general way to configure logging, which I've described earlier in the article How to enable haproxy logging to a separate log /var/log/haproxy.log / prevent duplicate messages to appear in /var/log/messages.
However, this time I did not want to use /dev/log, as this device is also used by systemd / journald and could theoretically be used by other services, with multiple services logging to the same place possibly leading to issues; thus I wanted to send and process the haproxy messages directly via rsyslog's own UDP listener on RHEL 8.5.

Make sure /etc/rsyslog.conf loads custom files from /etc/rsyslog.d/ with a line like:
 

# Include all config files in /etc/rsyslog.d/
include(file="/etc/rsyslog.d/*.conf" mode="optional")


Then create /etc/rsyslog.d/49_haproxy.conf with the content below:

[root@haproxy: ~]# vim /etc/rsyslog.d/49_haproxy.conf

$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514
#2022/02/02: HAProxy logs to local6, save the messages
local6.*                                                /var/log/haproxy.log
if ($programname == 'haproxy') then -/var/log/haproxy.log
& stop

touch /var/log/haproxy.log
chown haproxy:haproxy /var/log/haproxy.log
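For the new rules to take effect, rsyslog has to re-read its configuration; a restart on this RHEL 8.5 setup does the job:

[root@haproxy: ~]# systemctl restart rsyslog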

In /etc/haproxy/haproxy.cfg, under the global section, to print messages in verbose mode (i.e. to check that haproxy is receiving the sent traffic properly), configure something like:

 

global
  log          127.0.0.1 local6 debug


Eventually you might want to remove the debug word from the config, if you don't want to log too verbosely once everything is properly tested and configured.

[root@haproxy: ~]# curl -v -c -k 10.10.192.135:15010
* Rebuilt URL to: 10.10.192.135:15010/
*   Trying 10.10.192.135…
* TCP_NODELAY set
* Connected to 10.10.192.135 (10.10.192.135) port 15010 (#0)
> GET / HTTP/1.1
> Host: 10.10.192.135:15010
> User-Agent: curl/7.61.1
> Accept: */*

* Empty reply from server
* Connection #0 to host 10.10.192.135 left intact
curl: (52) Empty reply from server

 

In /var/log/haproxy.log you should get some messages like:
 

Feb  3 14:16:44 localhost.localdomain haproxy[25029]: proxy IN_Traffic_Bak has no server available!
Feb  3 14:16:44 localhost.localdomain haproxy[25029]: proxy IN_Traffic_Bak has no server available!
Feb  3 15:59:50 localhost.localdomain haproxy[25029]: [03/Feb/2022:15:59:50.162] 10.44.192.135:1348 -:- IN_Traffic/<NOSRV>:- -1/-1/0 0 SC 1/1/0/0/0 0/0
Feb  3 15:59:50 localhost.localdomain haproxy[25029]: [03/Feb/2022:15:59:50.162] 10.44.192.135:1348 -:- IN_Traffic/<NOSRV>:- -1/-1/0 0 SC 1/1/0/0/0 0/0
Feb  3 15:59:50 localhost.localdomain haproxy[25029]: [03/Feb/2022:15:59:50.162] 10.44.192.135:1348 -:- IN_Traffic/<NOSRV>:- -1/-1/0 0 SC 1/1/0/0/0 0/0

 

Fix Out of inodes on Postfix Linux Mail Cluster. How to clean up filesystem running out of Inodes, Filesystem inodes on partition is 100% full

Wednesday, August 25th, 2021


Recently we faced a strange issue with one of our Clustered Postfix Mail servers (the cluster has 2 nodes, each with a configured Postfix mail daemon, running in an OpenVZ virtualized environment, plus a heartbeat that checks the liveability of the cluster and switches nodes in case one of the two breaks for some reason), pretty much a standard SMTP cluster.

So far so good, but since the cluster is kind of abandoned and pretty much legacy nowadays, used just for some monitoring emails from different scripts and systems on servers, it was not really checked thoroughly for years, and logically, all of a sudden the alarm emails sent via the cluster stopped working.

The normal sysadmin job here was to analyze what is going on with the cluster and fix it ASAP. After some very basic analysis we caught that the problem was caused by an "inodes full" condition (100% of the available inodes were occupied), i.e. the filesystem ran out of inodes on both machines, perhaps due to a pengine heartbeat process bug leading to the production of a huge number of .bz2 pengine recovery archive files stored in /var/lib/pengine.

Below are the few steps taken to analyze and fix the problem.
 

1. Finding out the system has run out of inodes


After logging on to the system I did not immediately find that something was wrong with the inodes; all I could see from crm_mon was that the cluster was broken.
Plenty of emails were left inside the postfix mail queue, visible with the standard command

[root@smtp1: ~ ]# postqueue -p

It took me a while to find out the problem was with the inodes, because a simple df -h was showing the systems had enough space, yet the cluster quorum was still not complete.
A bit of further investigation led me to a simple df -i, reporting that the inodes on the local filesystems on both our SMTP1 and SMTP2 were all occupied.

[root@smtp1: ~ ]# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/simfs            500000   500000  0   100% /
none                   65536      61   65475    1% /dev

As you can see, the inodes on the Virtual Machine are unfortunately depleted.

The next step was to check which directories occupy the most inodes, as these are the places from which files could be temporarily moved to a remote server filesystem or to another partition with space on locally attached drives.
The command below gives an ordered list of the directories under the mail server's root filesystem / with their respective number of contained files / occupied inodes;
the more files under a directory, the more inodes are occupied on the filesystem.

 

1.1 Getting which directory consumes most of the inodes on the systems

 

[root@smtp1: ~ ]# { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
…
    586 /usr/lib64/python2.4
    664 /usr/lib64
    671 /usr/share/man/man8
    860 /usr/bin
   1006 /usr/share/man/man1
   1124 /usr/share/man/man3p
   1246 /var/lib/Pegasus/prev_repository_2009-03-10-1236698426.308128000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2009-05-18-1242636104.524113000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2009-11-06-1257494054.380244000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2010-08-04-1280907760.750543000.rpmsave/root#cimv2/classes
   1381 /var/lib/Pegasus/prev_repository_2010-11-15-1289811714.398469000.rpmsave/root#cimv2/classes
   1381 /var/lib/Pegasus/prev_repository_2012-03-19-1332151633.572875000.rpmsave/root#cimv2/classes
   1398 /var/lib/Pegasus/repository/root#cimv2/classes
   1696 /usr/share/man/man3
   400816 /var/lib/pengine

Note that the above command orders the directories from least to most files, and obviously the bottleneck directory that is over-eating filesystem inodes with an excessive amount of files is
/var/lib/pengine
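On newer systems (GNU coreutils 8.22+, so likely not on a box as old as this one) a similar ranking can also be obtained with du's --inodes option, a quick sketch:

# du --inodes -x / 2>/dev/null | sort -n | tail -20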
 

2. Back up the multitude of old files, just in case something goes wrong with the cluster after some files are wiped out


The next logical step of course is to check what is going on inside /var/lib/pengine, only to find that a very, very large amount of pe-input-*NUMBER*.bz2 files had suddenly been produced.

 

[root@smtp1: ~ ]# ls -1 pe-input*.bz2 | wc -l
 400816


The files are produced by the pengine process, which is one of the processes controlling the heartbeat cluster state; presumably it is done by the running process:

[root@smtp1: ~ ]# ps -ef|grep -i pengine
24        5649  5521  0 Aug10 ?        00:00:26 /usr/lib64/heartbeat/pengine


Hence, in order to fix the issue while preventing any inconsistencies in the cluster due to the file deletion, I copied the whole directory to another mounted partition (you can mount one remotely with sshfs, for example) or use a local one if you have it:

[root@smtp1: ~ ]# cp -rpf /var/lib/pengine /mnt/attached_storage


and proceeded to clean up the multitude of old files that are older than about 2 years (720 days):


3. Clean up /var/lib/pengine files that are older than two years with a short loop and the find command

 


First I made a list of all the files to be removed in an external text file and quickly reviewed it by less-ing it, like so:

[root@smtp1: ~ ]#  cd /var/lib/pengine
[root@smtp1: ~ ]# find . -type f -mtime +720 | grep -v pe-error.last | grep -v pe-input.last | grep -v pe-warn.last > /home/myuser/pengine_older_than_720days.txt
[root@smtp1: ~ ]# less /home/myuser/pengine_older_than_720days.txt


Once the list was reviewed, I used the loop below to go over everything older than 2 years that is different from pe-error.last / pe-input.last / pe-warn.last, which might be needed for proper cluster operation (it only echoes the file names; swap echo for rm -f once you're sure).

[root@smtp1: ~ ]#  for i in $(find . -type f -mtime +720 -exec echo '{}' \;|grep -v pe-error.last | grep -v pe-input.last |grep -v pe-warn.last); do echo $i; done
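Alternatively, since the candidate file names were already saved to the review list a step earlier, a sketch that removes exactly what was reviewed (run it from inside /var/lib/pengine, as the list contains relative paths):

[root@smtp1: ~ ]# cd /var/lib/pengine && xargs -d '\n' rm -f < /home/myuser/pengine_older_than_720days.txt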


Another approach to the situation is to simply review all the files inside /var/lib/pengine and delete files based on year of creation, for example to delete all files in /var/lib/pengine from 2010, you can run something like:
 

[root@smtp1: ~ ]# for i in $(ls -al|grep -i ' 2010 ' | awk '{ print $9 }' |grep -v 'pe-warn.last'); do rm -f $i; done


4. Monitor the inode freeing in real time

While doing the clearance of the old unnecessary pengine heartbeat archives, you can open another ssh console to the server and watch how the inodes get freed up with a command like:

 

# check if inodes is not being rapidly decreased

[root@csmtp1: ~ ]# watch 'df -i'


5. Restart basic Linux services producing pid files, logs etc. to make them work again (some services might not notice that inodes on the hard drive have been freed up)

Because the filesystem on the system was full, some services started misbehaving and /var/log logging was impacted, so I had to restart them as well; in our case this is the heartbeat itself,
which checks the cluster nodes' availability, as well as the logging daemon service rsyslog.

 

# restart rsyslog and heartbeat services
[root@csmtp1: ~ ]# /etc/init.d/heartbeat restart
[root@csmtp1: ~ ]# /etc/init.d/rsyslog restart

The systems also ran the legacy data-integrity service samhain, so I had to restart this service as well to force the /var/log/samhain log file to continuously start writing data to the HDD again.

# Restart samhain service init script 
[root@csmtp1: ~ ]# /etc/init.d/samhain restart


6. Check up enough inodes are freed up with df

[root@smtp1 log]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/simfs 500000 410531 19469 91% /
none 65536 61 65475 1% /dev


I had to repeat the same process on the second Postfix cluster node smtp2, and after all the steps, check the status of the smtp2 node and the postfix queue as below; following the same procedure brought the second smtp2 cluster member back as expected 🙂

 

7. Check the cluster node quorum is complete, e.g. postfix cluster is operating normally

 

# Test if email cluster is ok with pacemaker resource cluster manager – lt-crm_mon
 

[root@csmtp1: ~ ]# crm_mon -1
============
Last updated: Tue Aug 10 18:10:48 2021
Stack: Heartbeat
Current DC: smtp2.fqdn.com (bfb3d029-89a8-41f6-a9f0-52d377cacd83) – partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ smtp2.fqdn.com smtp1.fqdn.com ]

failover-ip (ocf::heartbeat:IPaddr2): Started csmtp1.ikossvan.de
Clone Set: postfix_clone
Started: [ smtp2.fqdn.com smtp1fqdn.com ]
Clone Set: pingd_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]
Clone Set: mailto_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]

 

8. Force resending of the few hundred thousand emails left in the email queue


After some inodes got freed up due to the file deletion, I forced the queued mails a couple of times to be immediately resent to their remote mail destinations with the command:

 

# force emails in queue to be resent with postfix

[root@smtp1: ~ ]# sendmail -q


– It was useful to watch in real time how the email queue quickly decreases (queued mails being successfully sent to the destination addresses) with:

 

# Monitor the decreasing size of the email queue
[root@smtp1: ~ ]# watch 'postqueue -p|grep -i '@'|wc -l'

How to configure bond0 bonding and network bridging for KVM Virtual machines on Redhat / CentOS / Fedora Linux

Tuesday, February 16th, 2021

1. Intro to Redhat RPM-based distro /etc/sysconfig/network-scripts/* config vars, shortly explained

On RPM-based Linux distributions, network configuration has a very specific structure. As a sysadmin, I recently had the task to configure networking on 2 machines to be used as Hypervisors, so the servers could communicate normally with other networks via different intelligent switches connected to each of the server's interfaces. The idea is for the 2 RedHat 8.3 machines to be used as Hypervisors (HV), and for each of the 2 HVs to host 2 guest Virtual Machines with another set of Redhat 8.3 Ootpa preinstalled. I've recently blogged on how to automate installing the KVM Virtual Machines a bit, using a predefined kickstart.cfg file.

The next step after the install was setting up the network. Redhat has a very specific network configuration, well known under /etc/sysconfig/network-scripts/ifcfg-eno*, or, if you have configured the Redhats to fix the changing LAN card naming (ens, eno, em1) back to the legacy eth0, eth1, eth2 as on CentOS Linux, e.g. named as /etc/sysconfig/network-scripts/{ifcfg-eth0,1,2,3}.

The first step to configure the network from that point is to come up with some network infrastructure that will be ready on the HV nodes server-node1 and server-node2 for the Virtual Machines server-vm1 and server-vm2 to use.

Thus, for the sake of myself and some others, I decided to list here the most important recognized variables that can be placed inside each of ifcfg-eth0, ifcfg-eth1, ifcfg-eth2 …

A standard ifcfg-eth0 config would look something like this:
 

[root@redhat1 :~ ]# cat /etc/sysconfig/network-scripts/ifcfg-eth0
TYPE=Ethernet
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV4_FAILURE_FATAL=no
NAME=eth0
UUID=…
ONBOOT=yes
HWADDR=0e:a4:1a:b6:fc:86
IPADDR0=10.31.24.10
PREFIX0=23
GATEWAY0=10.31.24.1
DNS1=192.168.50.3
DNS2=10.215.105.3
DOMAIN=example.com
IPV6_PEERDNS=yes
IPV6_PEERROUTES=yes


Let's say a few words about each of the variables, to make it clearer for people who have never configured the network on Redhat without the help of one of the console ncurses graphical-like tools such as nmtui, or who want to completely stop NetworkManager from managing the network and thus cannot take advantage of using nmcli (a command-line tool for controlling NetworkManager).

Here is a short description of each of above configuration parameters:

TYPE=device_type: The type of network interface device
BOOTPROTO=protocol: Where protocol is one of the following:

  • none: No boot-time protocol is used.
  • bootp: Use BOOTP (bootstrap protocol).
  • dhcp: Use DHCP (Dynamic Host Configuration Protocol).
  • static: if configuring static IP

DEFROUTE|IPV6_DEFROUTE=answer

  • yes: This interface is set as the default route for IPv4|IPv6 traffic.
  • no: This interface is not set as the default route.

Usually most people still don't use IPv6, so it is better to disable it.

IPV6INIT=answer: Where answer is one of the following:

  • yes: Enable IPv6 on this interface. If IPV6INIT=yes, the following parameters could also be set in this file:

IPV6ADDR=IPv6 address

IPV6_DEFAULTGW=The default route through the specified gateway

  • no: Disable IPv6 on this interface.

IPV4_FAILURE_FATAL|IPV6_FAILURE_FATAL=answer: Where answer is one of the following:

  • yes: This interface is disabled if IPv4 or IPv6 configuration fails.
  • no: This interface is not disabled if configuration fails.

ONBOOT=answer: Where answer is one of the following:

  • yes: This interface is activated at boot time.
  • no: This interface is not activated at boot time.

HWADDR=MAC-address: The hardware address of the Ethernet device
IPADDRN=address: The IPv4 address assigned to the interface
PREFIXN=N: Length of the IPv4 netmask value
GATEWAYN=address: The IPv4 gateway address assigned to the interface. Because an interface can be associated with several combinations of IP address, network mask prefix length, and gateway address, these are numbered starting from 0.
DNSN=address: The address of the Domain Name Servers (DNS)
DOMAIN=DNS_search_domain: The DNS search domain (this is the search Domain-name.com you usually find in /etc/resolv.conf)

Other interesting file that affects how routing is handled on a Redhat Linux is

/etc/sysconfig/network

[root@redhat1 :~ ]# cat /etc/sysconfig/network
# Created by anaconda
GATEWAY=10.215.105.

Having this GATEWAY defined adds a default gateway.

This file specifies global network settings. For example, you can specify the default gateway here; if you want to apply some network settings such as routes, alias IPs etc. that should be valid for all configured and active configurations read by the systemctl-started network scripts (or by NetworkManager if such is used), just place them in that file.

Other interesting files worth checking, which control how name resolving is handled on the server, are

/etc/nsswitch.conf

and

/etc/hosts

If you want /etc/hosts to be read in preference to /etc/resolv.conf and DNS resolving, for example, you need to have that order inside it; below is its default behavior.
 

root@redhat1 :~ ]#   grep -i hosts /etc/nsswitch.conf
#     hosts: files dns
#     hosts: files dns  # from user file
# Valid databases are: aliases, ethers, group, gshadow, hosts,
hosts:      files dns myhostname

As you can see, the default order is to read files first (meaning /etc/hosts) and then dns (/etc/resolv.conf):
hosts: files dns
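A quick way to see resolution actually following that order is getent, which queries through the nsswitch mechanism (the host name here is just the example node name used earlier in this article):

# getent hosts server-node2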

That concludes this short intro to the basic values accepted in Redhat's /etc/sysconfig/network-scripts/ifcfg* configurations.


I will now give a practical example of configuring a bond0 interface with 2 members, which was prepared based on Redhat's official documentation found in the URLs below:

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_networking/configuring-network-bonding_configuring-and-managing-networking
 

# Bonding on RHEL 7 documentation
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-network_bonding_using_the_command_line_interface

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-verifying_network_configuration_bonding_for_redundancy

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/s2-networkscripts-interfaces_network-bridge

# Network Bridge with Bond documentation
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sec-Configuring_a_VLAN_over_a_Bond

https://docs.fedoraproject.org/en-US/Fedora/24/html/Networking_Guide/sec-Network_Bridge_with_Bond.html


2. Configuring a single bond connection on eth0 / eth2 and setting up 3 bridge interfaces: bond0 -> br0, eth1 -> br1, eth3 -> br2

The task on my machines was to set up, out of the 4 LAN cards, one bonded interface of the active-backup bond type with bonded links on eth0 and eth2, while the other 2, eth1 and eth3, would be used for a private communication network connected via special dedicated switches and separate VLANs 50, 51 over tagged dedicated gigabit ports.

As said, the 2 servers each had 4 Broadcom network card interfaces, paired two by two into a single card: a solid Broadcom NetXtreme Dual Port 10GbE SFP+ and a Dell Broadcom 5720 Dual Port 1Gigabit card.


On each of server-node1 and server-node2 we had 4 Ethernet adapters properly detected by Redhat:

root@redhat1 :~ ]# lspci |grep -i net
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
19:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)
19:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57412 NetXtreme-E 10Gb RDMA Ethernet Controller (rev 01)


As a prerequisite I had already configured net.ifnames=0 in the GRUB config and disabled the NetworkManager service on the host. Hence, to not use NetworkManager, you'll see in the configuration below that NM_CONTROLLED="no" tells the Redhat servers not to let NetworkManager manage the interface; for more on that check my previous article Disable NetworkManager automatic Ethernet Interface Management on Redhat Linux, CentOS 6 / 7 / 8.

3. Types of Network Bonding

mode=0 (balance-rr)

This mode is based on Round-robin policy and it is the default mode. This mode offers fault tolerance and load balancing features. It transmits the packets in Round robin fashion that is from the first available slave through the last.

mode-1 (active-backup)

This mode is based on the Active-backup policy. Only one slave is active in this bond, and another one will act only when the active one fails. The MAC address of this bond is exposed on only one network adapter at a time, to avoid confusing the switch. This mode also provides fault tolerance.

mode=2 (balance-xor)

This mode sets an XOR (exclusive or) policy: the source MAC address is XOR'd with the destination MAC address to provide load balancing and fault tolerance. For each destination MAC address, the same slave is selected.

mode=3 (broadcast)

This method is based on broadcast policy that is it transmitted everything on all slave interfaces. It provides fault tolerance. This can be used only for specific purposes.

mode=4 (802.3ad)

This mode is known as Dynamic Link Aggregation mode; it creates aggregation groups sharing the same speed and duplex settings. It requires a switch that supports IEEE 802.3ad dynamic link aggregation. The slave selection for outgoing traffic is done based on a transmit hashing method, which may be changed from the default XOR method via the xmit_hash_policy option.

mode=5 (balance-tlb)

This mode is called Adaptive transmit load balancing. The outgoing traffic is distributed based on the current load on each slave and the incoming traffic is received by the current slave. If the incoming traffic fails, the failed receiving slave is replaced by the MAC address of another slave. This mode does not require any special switch support.

mode=6 (balance-alb)

This mode is called adaptive load balancing. This mode does not require any special switch support.
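Once the bond is configured and up (next step below), the mode actually loaded can be double-checked at runtime through sysfs; the file prints the mode name and its numeric id (e.g. active-backup 1):

# cat /sys/class/net/bond0/bonding/mode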

Let's create the necessary configuration for the bond and bridges:

[root@redhat1 :~ ]# cat ifcfg-bond0
DEVICE=bond0
NAME=bond0
TYPE=Bond
BONDING_MASTER=yes
#IPADDR=10.50.21.16
#PREFIX=26
#GATEWAY=10.50.0.1
#DNS1=172.20.88.2
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=1 miimon=100 primary=eth0"
NM_CONTROLLED="no"
BRIDGE=br0


[root@redhat1 :~ ]# cat ifcfg-bond0.10
DEVICE=bond0.10
BOOTPROTO=none
ONPARENT=yes
#IPADDR=10.50.21.17
#NETMASK=255.255.255.0
VLAN=yes

[root@redhat1 :~ ]# cat ifcfg-br0
STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=br0
UUID=4451286d-e40c-4d8c-915f-7fc12a16d595
DEVICE=br0
ONBOOT=yes
IPADDR=10.50.50.16
PREFIX=26
GATEWAY=10.50.0.1
DNS1=172.20.0.2
NM_CONTROLLED=no

[root@redhat1 :~ ]# cat ifcfg-br1
STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=no
IPV4_FAILURE_FATAL=no
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6INIT=no
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=br1
UUID=40360c3c-47f5-44ac-bbeb-77f203390d29
DEVICE=br1
ONBOOT=yes
##IPADDR=10.50.51.241
PREFIX=28
##GATEWAY=10.50.0.1
##DNS1=172.20.0.2
NM_CONTROLLED=no

[root@redhat1 :~ ]# cat ifcfg-br2
STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=no
IPV4_FAILURE_FATAL=no
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6INIT=no
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=br2
UUID=fbd5c257-2f66-4f2b-9372-881b783276e0
DEVICE=br2
ONBOOT=yes
##IPADDR=10.50.51.243
PREFIX=28
##GATEWAY=10.50.0.1
##DNS1=172.20.10.1
NM_CONTROLLED=no

[root@redhat1 :~ ]# cat ifcfg-eth0
TYPE=Ethernet
NAME=bond0-slaveeth0
BOOTPROTO=none
#UUID=61065574-2a9d-4f16-b16e-00f495e2ee2b
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

[root@redhat1 :~ ]# cat ifcfg-eth1
TYPE=Ethernet
NAME=eth1
UUID=b4c359ae-7a13-436b-a904-beafb4edee94
DEVICE=eth1
ONBOOT=yes
BRIDGE=br1
NM_CONTROLLED=no

[root@redhat1 :~ ]#  cat ifcfg-eth2
TYPE=Ethernet
NAME=bond0-slaveeth2
BOOTPROTO=none
#UUID=821d711d-47b9-490a-afe7-190811578ef7
DEVICE=eth2
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

[root@redhat1 :~ ]#  cat ifcfg-eth3
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
#BOOTPROTO=dhcp
BOOTPROTO=none
DEFROUTE=no
IPV4_FAILURE_FATAL=no
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6INIT=no
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
BRIDGE=br2
NAME=eth3
UUID=61065574-2a9d-4f16-b16e-00f495e2ee2b
DEVICE=eth3
ONBOOT=yes
NM_CONTROLLED=no

[root@redhat2 :~ ]# cat ifcfg-bond0
DEVICE=bond0
NAME=bond0
TYPE=Bond
BONDING_MASTER=yes
#IPADDR=10.50.21.16
#PREFIX=26
#GATEWAY=10.50.21.1
#DNS1=172.20.88.2
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=1 miimon=100 primary=eth0"
NM_CONTROLLED="no"
BRIDGE=br0

# cat ifcfg-bond0.10
DEVICE=bond0.10
BOOTPROTO=none
ONPARENT=yes
#IPADDR=10.50.21.17
#NETMASK=255.255.255.0
VLAN=yes
NM_CONTROLLED=no
BRIDGE=br0

[root@redhat2 :~ ]# cat ifcfg-br0
STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
#IPV6INIT=yes
#IPV6_AUTOCONF=yes
#IPV6_DEFROUTE=yes
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=br0
#UUID=f87e55a8-0fb4-4197-8ccc-0d8a671f30d0
UUID=4451286d-e40c-4d8c-915f-7fc12a16d595
DEVICE=br0
ONBOOT=yes
IPADDR=10.50.21.17
PREFIX=26
GATEWAY=10.50.21.1
DNS1=172.20.88.2
NM_CONTROLLED=no

[root@redhat2 :~ ]#  cat ifcfg-br1
STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=no
IPV4_FAILURE_FATAL=no
#IPV6INIT=no
#IPV6_AUTOCONF=no
#IPV6_DEFROUTE=no
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6INIT=no
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=br1
UUID=40360c3c-47f5-44ac-bbeb-77f203390d29
DEVICE=br1
ONBOOT=yes
##IPADDR=10.50.21.242
PREFIX=28
##GATEWAY=10.50.21.1
##DNS1=172.20.88.2
NM_CONTROLLED=no

[root@redhat2 :~ ]# cat ifcfg-br2
STP=yes
BRIDGING_OPTS=priority=32768
TYPE=Bridge
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=none
DEFROUTE=no
IPV4_FAILURE_FATAL=no
#IPV6INIT=no
#IPV6_AUTOCONF=no
#IPV6_DEFROUTE=no
#IPV6_FAILURE_FATAL=no
#IPV6_ADDR_GEN_MODE=stable-privacy
IPV6INIT=no
IPV6_AUTOCONF=no
IPV6_DEFROUTE=no
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=br2
UUID=fbd5c257-2f66-4f2b-9372-881b783276e0
DEVICE=br2
ONBOOT=yes
##IPADDR=10.50.21.244
PREFIX=28
##GATEWAY=10.50.21.1
##DNS1=172.20.88.2
NM_CONTROLLED=no

[root@redhat2 :~ ]# cat ifcfg-eth0
TYPE=Ethernet
NAME=bond0-slaveeth0
BOOTPROTO=none
#UUID=ee950c07-7eb2-463b-be6e-f97e7ad9d476
DEVICE=eth0
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no

[root@redhat2 :~ ]# cat ifcfg-eth1
TYPE=Ethernet
NAME=eth1
UUID=ffec8039-58f0-494a-b335-7a423207c7e6
DEVICE=eth1
ONBOOT=yes
BRIDGE=br1
NM_CONTROLLED=no

[root@redhat2 :~ ]# cat ifcfg-eth2
TYPE=Ethernet
NAME=bond0-slaveeth2
BOOTPROTO=none
#UUID=2c097475-4bef-47c3-b241-f5e7f02b3395
DEVICE=eth2
ONBOOT=yes
MASTER=bond0
SLAVE=yes
NM_CONTROLLED=no


Notice that the bond0 configuration does not have an IP assigned; this is done on purpose, as we're using the interface channel bonding together with the attached bridge for the VMs. Plain bonding with an IP directly on bond0 is perhaps the better choice on normal physical hardware hosts where no virtualization use is planned. If you try to set up an IP address in the specific configuration shown here and then reboot the machine, you will end up with a machine inaccessible over the network, like I did, and you will need to fix the configuration via some kind of ILO / iDRAC interface.

4. Generating UUID for ethernet devices bridges and bonds

One thing to note is the uuidgen command; you might need it to generate UUID identifiers to put into the new network config files.

Example:
 

[root@redhat2 :~ ]#uuidgen br2
e7995e15-7f23-4ea2-80d6-411add78d703
[root@redhat2 :~ ]# uuidgen br1
05e0c339-5998-414b-b720-7adf91a90103
[root@redhat2 :~ ]# uuidgen br0
e6d7ff74-4c15-4d93-a150-ff01b7ced5fb


5. How to make KVM Virtual Machines see configured Network bridges (modify VM XML)

To make the installed Virtual Machines see the bridges, I had to edit each VM definition:

[root@redhat1 :~ ]#virsh edit VM_name1
[root@redhat1 :~ ]#virsh edit VM_name2

[root@redhat2 :~ ]#virsh edit VM_name1
[root@redhat2 :~ ]#virsh edit VM_name2

Find the interface network configuration and change it to something like:

    <interface type='bridge'>
      <mac address='22:53:00:56:5d:ac'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='22:53:00:2a:5f:01'/>
      <source bridge='br1'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x07' slot='0x00' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='22:34:00:4a:1b:6c'/>
      <source bridge='br2'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </interface>
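virsh edit only changes the stored XML definition; for an already running guest the new bridge interfaces are typically picked up on the next full stop / start of the VM, e.g. (VM_name1 being the placeholder name used above):

[root@redhat1 :~ ]# virsh shutdown VM_name1
[root@redhat1 :~ ]# virsh start VM_name1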


6. Testing the bond  is up and works fine

# ip addr show bond0
The result is the following:

 

4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 52:54:00:cb:25:82 brd ff:ff:ff:ff:ff:ff


The bond should be visible in the normal network interfaces with ip address show or /sbin/ifconfig

 

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 00:0c:29:ab:2a:fa
Slave queue ID: 0

 

According to the output eth0 is the active slave.

The active slave's device files (eth0 in this case) are found in the virtual file system /sys/:

# find /sys -name *eth0
/sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0/net/eth0
/sys/devices/virtual/net/bond0/lower_eth0
/sys/class/net/eth0


You can remove a bond member, say eth0, by:

 

cd-ing into its PCI device directory.
Example: /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0 (matching the path from the find output above)

 

# echo 1 > remove


At this point the eth0 device directory structure that was previously located under /sys/devices/pci0000:00/0000:00:15.0/0000:03:00.0 is no longer there. It was removed and the device no longer exists as seen by the OS.

You can verify this is the case with a simple ifconfig which will no longer list the eth0 device.
You can also repeat the cat /proc/net/bonding/bond0 command from Step 1 to see that eth0 is no longer listed as active or available.
You can also see the change in the messages file.  It might look something like this:

2021-02-12T14:13:23.363414-06:00 redhat1  device eth0: device has been deleted
2021-02-12T14:13:23.368745-06:00 redhat1 kernel: [81594.846099] bonding: bond0: releasing active interface eth0
2021-02-12T14:13:23.368763-06:00 redhat1 kernel: [81594.846105] bonding: bond0: Warning: the permanent HWaddr of eth0 – 00:0c:29:ab:2a:f0 – is still in use by bond0. Set the HWaddr of eth0 to a different address to avoid conflicts.
2021-02-12T14:13:23.368765-06:00 redhat1 kernel: [81594.846132] bonding: bond0: making interface eth1 the new active one.

 

Another way to test that the bonding correctly fails over between LAN cards in case of Ethernet hardware failure is to bring down one of the two (or more) bonded interfaces. For example, to force a switch away from the currently active slave (eth0 in this setup), do:
 

# ip link set dev eth0 down


That concludes the test for fail over on active slave failure.
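A compact way to run the whole fail-over cycle (bring the active slave down, check who took over, bring it back up), assuming eth0 is the currently active slave as in the output above:

# ip link set dev eth0 down
# grep "Currently Active Slave" /proc/net/bonding/bond0
# ip link set dev eth0 up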

7. Bringing a bond member up / down (rescan) with no need for server reboot

Bonding is tedious stuff that sometimes breaks badly, so the only way to fix a broken bond may seem to be an init 6 (reboot), but actually that is not so.

You can also get the deleted device back with a simple pci rescan command:

# echo 1 > /sys/bus/pci/rescan


The eth0 interface should now be back.
You can see that it is back with an ifconfig command, and you can verify that the bond sees it with this command:

# cat /proc/net/bonding/bond0


That concludes the test of the bond code seeing the device when it comes back again.

The same steps can be repeated only this time using the eth1 device and file structure to fail the active slave in the bond back over to eth0.

8. Testing the bond with ifenslave command (ifenslave command examples)

Below is a set of useful ifenslave commands to test that the bonding works as expected (on this system the ifenslave tool came from the "iputils-20071127" package).

– To show information for all the interfaces

                  # ifenslave -a
                  # ifenslave --all-interfaces 

 

– To change the active slave

                  # ifenslave -c bond0 eth1
                  # ifenslave --change-active bond0 eth1 

 

– To remove the slave interface from the bonding device

                  # ifenslave -d eth1
                  # ifenslave --detach bond0 eth1 

 

– To show master interface info

                  # ifenslave bond0 

 

– To set the bond device down and automatically release all the slaves

                  # ifenslave bond1 down 

– To get the usage info

                  # ifenslave -u
                  # ifenslave --usage 

– To set to verbose mode

                  # ifenslave -v
                  # ifenslave --verbose 

9. Testing the bridge works fine

Historically, over the years all kinds of bridges were handled with brctl, part of the bridge-utils .deb / .rpm installable package.

The classical way to check that a bridge is working is to do:

# brctl show
# brctl show br0; brctl show br1; brctl show br2

# brctl showmacs br0
 

etc.

Unfortunately with Red Hat 8 this command is no longer available, so to get information about configured bridges you need to use instead:

 

# bridge link show
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master bridge0 state forwarding priority 32 cost 100
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master bridge0 state listening priority 32 cost 100
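If you also miss brctl showmacs, the learned MAC table and the detailed bridge state can be read with the iproute2 tools; this is a rough equivalent rather than a 1:1 replacement:

# bridge fdb show br br0
# ip -d link show br0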


10. Troubleshooting network connectivity issues on bond bridges and LAN cards

Testing that the bond connection and the bridges route traffic properly can sometimes be a real hassle, so here the good old tcpdump comes to help.

You may end up with issues where some of the ethernet interfaces between HV1 and HV2 are unable to talk to each other, and you may suspect that a colleague from the network team has messed up a copper (UTP) cable, or that there is a fiber optics connectivity issue. To check the VLAN tagged traffic headers towards the switch, you can listen on each and every one of bond0, br0, br1, br2, eth0, eth1, eth2, eth3 configured on the server like so:

# tcpdump -i bond0 -nn -e vlan
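If you want to capture a handful of tagged frames on every interface in one go, a small loop sketch like the one below can be used (interface names are the ones from this setup; the timeout command comes from coreutils):

# for IF in bond0 br0 br1 br2 eth0 eth1 eth2 eth3; do echo "=== $IF ==="; timeout 10 tcpdump -i $IF -nn -e vlan -c 20; done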


Some further investigation on where normal ICMP traffic flows once everything is set up is a sensible thing to do, so just try to route a normal ping via the different server interfaces:

# ping -I bond0 DSTADDR

# ping -I eth0 DSTADDR

# ping -I eth1 DSTADDR

# ping -I eth2 DSTADDR


After conducting the ping, do the usual network test with big ICMP packets (64k) to make sure there are no packet losses etc., e.g.:

# ping -I eth3 -s 64536  DSTADDR


If for 10 – 20 seconds the ping does not report packet loss, then you should be good.
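For instance, to let it run about 20 packets and look only at the summary lines (DSTADDR is a placeholder for the real destination):

# ping -I bond0 -s 64536 -c 20 DSTADDR | grep -E 'packet loss|rtt'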

How to restart Microsoft IIS with command via Windows command line

Friday, August 19th, 2011

I'm tuning a Windows 2003 server for better performance and securing it against Denial of Service attacks. After applying all the changes I needed to restart the WebServer for the new configuration to take effect.
As I'm not a GUI kind of guy, I found it handy that there is a quick command to restart the Microsoft Internet Information Server. The command to restart IIS is:

c:> iisreset
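If you need finer-grained control, iisreset also accepts explicit switches, and the WWW publishing service alone can be cycled with the standard net command (nothing third-party here, just built-in Windows commands):

c:> iisreset /stop
c:> iisreset /start
c:> iisreset /status

c:> net stop w3svc
c:> net start w3svc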

Check server Internet connectivity Speedtest from Linux terminal CLI

Friday, August 7th, 2020

check-server-console-speedtest

If you are a system administrator of a dedicated server, have no access to an Xserver graphical GNOME / KDE etc. environment, and wonder how you can check the Internet bandwidth / connectivity speed of the remote system, here are a few ways to do a speedtest from a modern Linux distribution.
 

1. Use speedtest-cli command line tool to test connectivity

 


speedtest-cli is a tiny tool written in Python, hence to use it you need to have Python installed on the server.
It is available both for Redhat Linux distros and Debians / Ubuntus etc. in the list of standard installable packages.

a) Install speedtest-cli on Fedora / CentOS / RHEL
 

On CentOS / RHEL / Scientific Linux lower than ver 8:

 

 

$ sudo yum install python

On CentOS 8 / RHEL 8 type the following command to install Python 3 or 2:

 

 

$ sudo yum install python3
$ sudo yum install python2

 

 

 


On Fedora Linux version 22+

 

 

$ sudo dnf install python
$ sudo dnf install python3

 


Once Python is in place, download speedtest.py, or in case the link is not reachable, download the mirrored version of speedtest.py on www.pc-freak.net here
 

 

 

$ wget -O speedtest-cli https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py
$ chmod +x speedtest-cli

 


Then it is time to run the script.

speedtest-screenshot-linux-terminal-console-cli-cmd

To test the enabled bandwidth on the server:

 

 

$ python speedtest-cli
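For a terser output that is easier to grep or feed into monitoring, the script also supports a --simple flag, which prints only the Ping / Download / Upload lines:

$ python speedtest-cli --simple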


b) Install speedtest-cli on Debian

On the latest Debian 10 Buster, speedtest is available out of the box in the regular .deb repositories, so fetch it with apt:
 

 

# apt install --yes speedtest-cli

 


You can now give speedtest-cli a try with the --bytes argument to get speed values in bytes instead of bits, or, if you want to generate an image with the test results just like the one you would get using speedtest.net inside a GUI browser, use the --share option.

speedtest-screenshot-linux-terminal-console-cli-cmd-options
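For example:

$ speedtest-cli --bytes
$ speedtest-cli --share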

 

 

 

2. Getting connectivity results of all defined speedtest test City Locations


Speedtest has a list of servers through which upload and download speeds are tested. To run speedtest-cli against each and every server and get a better picture of what kind of connectivity to expect from your server towards the closest region capital cities, fetch the speedtest-servers.php list and use a small shell loop. Below is how:

 

 

 

 

 

root@pcfreak:~#  wget http://www.speedtest.net/speedtest-servers.php
--2020-08-07 16:31:34--  http://www.speedtest.net/speedtest-servers.php
Resolving www.speedtest.net (www.speedtest.net)… 151.101.2.219, 151.101.66.219, 151.101.130.219, …
Connecting to www.speedtest.net (www.speedtest.net)|151.101.2.219|:80… connected.
HTTP request sent, awaiting response… 301 Moved Permanently
Location: https://www.speedtest.net/speedtest-servers.php [following]
--2020-08-07 16:31:34--  https://www.speedtest.net/speedtest-servers.php
Connecting to www.speedtest.net (www.speedtest.net)|151.101.2.219|:443… connected.
HTTP request sent, awaiting response… 307 Temporary Redirect
Location: https://c.speedtest.net/speedtest-servers-static.php [following]
--2020-08-07 16:31:35--  https://c.speedtest.net/speedtest-servers-static.php
Resolving c.speedtest.net (c.speedtest.net)… 151.101.242.219
Connecting to c.speedtest.net (c.speedtest.net)|151.101.242.219|:443… connected.
HTTP request sent, awaiting response… 200 OK
Length: 211695 (207K) [text/xml]
Saving to: ‘speedtest-servers.php’
speedtest-servers.php                  100%[==========================================================================>] 206,73K  --.-KB/s    in 0,1s
2020-08-07 16:31:35 (1,75 MB/s) – ‘speedtest-servers.php’ saved [211695/211695]

Once the file is there, with the below loop we extract all the server id=""s defined in the file and run a test against each:
 

root@pcfreak:~# for i in $(cat speedtest-servers.php | egrep -Eo 'id="[0-9]{4}"' |sed -e 's#id="##' -e 's#"##g'); do speedtest-cli --server $i; done
Retrieving speedtest.net configuration…
Testing from Vivacom (83.228.93.76)…
Retrieving speedtest.net server list…
Retrieving information for the selected server…
Hosted by Telecoms Ltd. (Varna) [38.88 km]: 25.947 ms
Testing download speed……………………………………………………………………..
Download: 57.71 Mbit/s
Testing upload speed…………………………………………………………………………………………
Upload: 93.85 Mbit/s
Retrieving speedtest.net configuration…
Testing from Vivacom (83.228.93.76)…
Retrieving speedtest.net server list…
Retrieving information for the selected server…
Hosted by GMB Computers (Constanta) [94.03 km]: 80.247 ms
Testing download speed……………………………………………………………………..
Download: 35.86 Mbit/s
Testing upload speed…………………………………………………………………………………………
Upload: 80.15 Mbit/s
Retrieving speedtest.net configuration…
Testing from Vivacom (83.228.93.76)…

…..

 


etc.

For better readability you might want to redirect the output to a file, or even run it periodically from cron if you have some suspicion that your server's dedicated Internet lines die out towards some locations from time to time.
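A rough sketch of such a periodic run via cron, logging to a file (the schedule, binary path and log file below are just an example, adjust to taste):

# cat /etc/cron.d/speedtest-check
0 */6 * * * root /usr/bin/speedtest-cli --simple >> /var/log/speedtest.log 2>&1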
 

3. Testing the uplink speed by downloading some big file from a source location


In the past, a classical way to test the bandwidth of your Internet Service Provider connection was to fetch some big file; Linux guys should remember it was almost a standard to roll a download of the Linux kernel source .tar file with some text browser such as elinks / lynx / w3c.
speedtest-screenshot-kernel-org-shot1 speedtest-screenshot-kernel-org-shot2
or, if those were not at hand, to test connectivity towards remote free shell servers with whatever file downloader was available, such as wget or curl.
An analogical method is still possible, for example use wget to get an idea about the bandwidth; let it roll the below 500 MB from speedtest.wdc01.softlayer.com to /dev/null a few times:

 

$ wget --output-document=/dev/null http://speedtest.wdc01.softlayer.com/downloads/test500.zip

$ wget --output-document=/dev/null http://speedtest.wdc01.softlayer.com/downloads/test500.zip

$ wget --output-document=/dev/null http://speedtest.wdc01.softlayer.com/downloads/test500.zip

 

# wget -O /dev/null --progress=dot:mega http://cachefly.cachefly.net/10mb.test ; date
--2020-08-07 13:56:49--  http://cachefly.cachefly.net/10mb.test
Resolving cachefly.cachefly.net (cachefly.cachefly.net)… 205.234.175.175
Connecting to cachefly.cachefly.net (cachefly.cachefly.net)|205.234.175.175|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 10485760 (10M) [application/octet-stream]
Saving to: ‘/dev/null’

     0K …….. …….. …….. …….. …….. …….. 30%  142M 0s
  3072K …….. …….. …….. …….. …….. …….. 60%  179M 0s
  6144K …….. …….. …….. …….. …….. …….. 90%  204M 0s
  9216K …….. ……..                                    100%  197M=0.06s

2020-08-07 13:56:50 (173 MB/s) – ‘/dev/null’ saved [10485760/10485760]

Fri 07 Aug 2020 01:56:50 PM UTC


To be sure you have a real picture of a remote machine's Internet speed, it is always a good idea to run the download of random big files from a few locations that are well known to have very stable bandwidth to the Internet backbone routers.

4. Using Simple shell script to test Internet speed


Fetch and use speedtest.sh

 


wget https://raw.github.com/blackdotsh/curl-speedtest/master/speedtest.sh && chmod u+x speedtest.sh && bash speedtest.sh

 

 

5. Using iperf to test connectivity between two servers 

 

iperf is another good tool worth mentioning that can be used to test the speed between a client and a server.

To use iperf, install it with apt and run on the server machine towards which the bandwidth will be tested:

 

# iperf -s 

 

On the client machine do:

 

# iperf -c 192.168.1.1 

 

where 192.168.1.1 is the IP of the server where iperf was spawned to listen.
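A couple of commonly used variations worth knowing (iperf2 option names; for the UDP test the server side has to be started with iperf -s -u as well):

# iperf -c 192.168.1.1 -P 4 -t 30
# iperf -c 192.168.1.1 -u -b 100M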

6. Using Netflix fast to determine Internet connection speed on host


Fast

fast is a service provided by Netflix. Its web interface is located at Fast.com, and it has a command-line interface available through npm (npm is a package manager for nodejs), so if you don't have npm you will have to install it first with:

# apt install --yes npm

 

Note that if you run this on Debian it will install some 249 new nodejs packages, which you might not want to have on the system, so this is useful mainly for machines that already make use of nodejs.
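The npm package that provides the fast command is, as far as I know, fast-cli, so once npm is present install it globally (the package name is an assumption on my side, double check on npmjs.com if the install fails):

# npm install --global fast-cli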

 

$ fast

 

     82 Mbps ↓


The command returns your Internet download speed. To get your upload speed, use the -u flag:

 

$ fast -u

 

   ⠧ 80 Mbps ↓ / 8.2 Mbps ↑

 

7. Use speedometer / iftop to measure incoming and outgoing traffic on interface


If you're measuring connectivity on a live production server, then consider that the measured output might not be exactly correct, especially if you're measuring the uplink / downlink on a heavily loaded webserver / Mail Server / Samba or DNS server.
If this is the case, very useful tools to see the traffic already passing through your incoming and outgoing (TX / RX) network interfaces
are speedometer and iftop; they're present and installable via yum / apt or the respective package manager, depending on the OS.

 


To install on Debian server:

 

 

 

# apt install --yes iftop speedometer

 


The most basic use, to check the live received traffic in a nice ncurses-like text graph, is: 

 

 

 

 

# speedometer -r eth0 


speedometer-check-received-transmitted-network-traffic-on-linux1

To generate a real time ASCII art graph of RX / TX traffic do:

 

 

# speedometer -r eth0 -t eth0


speedometer-check-received-transmitted-network-traffic-on-linux

 

 

 

 

# iftop -P -i eth0

 

 


iftop-show-statistics-on-connections-screenshot-pcfreak

 

 

 

 

 

Linux Send Monitoring Alert Emails without Mail Server via relay SMTP with ssmtp / msmtp

Friday, July 10th, 2020

ssmtp-linux-server-sending-email-without-a-local-mail-server-mta-relay-howto

Say you have to set up a new Linux server where you need to monitor certain locally running daemons with custom scripts feeding Nagios / Zabbix / Grafana etc., have them notify you when a certain criterion is matched for a locally running program or service, or you simply want your existing local UNIX accounts to be able to send outbound Emails to the Internet.

Then usually you would need to install a fully functional SMTP Email server, Sendmail or QMAIL back in the early 21st century and usually Postfix or Exim these days, and configure it to use some kind of SMTP relay mail server.

The common relay smtp setting would be something such as Google's smtp.gmail.com, Yahoo!'s smtp.mail.yahoo.com relay host, mail.com, an externally configured MTA physical server with proper PTR / MX records, or an SMTP hosted on a virtual machine living in Amazon's AWS or m$ Azure that is capable of delivering EMails to the Internet.

Configuring the locally installed Mail Transport Agent (MTA) as a relay client is a relatively easy task, but why should you run a fully stacked MTA service with a number of unnecessary facilities such as an Email queue, locally created mailboxes, firewall rules, DNS records, SMTP Auth, DKIM keys etc., and even the ability to accept emails back, if you just want to simply send-and-forget with a confirmation that the remote email was sent successfully?

This is often the case for some machines, and especially with the inclusion of technologies such as Kubernetes / clustered environments / virtual machines, small programs such as ssmtp / msmtp that can send mail without a fully functional mail server installed on localhost ( 127.0.0.1 ) are true gems.

The ssmtp program, a simple send-only sendmail emulator, has been around in Debian GNU / Linux, Ubuntu, CentOS and mostly all Linuxes for quite some time, but recently the Debian package has been orphaned, so to install it on a deb based server host you need to use msmtp instead.
 

1. Install ssmtp on CentOS / Fedora / RHEL Linux

On RPM distributions you can't install it until the epel-release repository is enabled.

[root@centos:~]# yum --enablerepo=extras install epel-release

[root@centos:~]# yum install ssmtp


2. Install ssmtp / msmtp on Debian / Ubuntu Linux

If you run an older version of a Debian based distribution the package to install is ssmtp, e.g.:

root@debian:~# apt-get install --yes ssmtp


On newer Debians, as of Debian 10.0 Buster onwards, install instead:

root@debian:~# apt install --yes msmtp-mta


Using these can save you a lot of effort keeping an eye on a separate MTA hanging around and running as a local service, eating up resources that could be spared.
 

3. Configure Relay host for ssmtp


A simple configuration to make ssmtp use gmail.com SMTP servers as a relay host below:

linux:~# cat << EOF > /etc/ssmtp/ssmtp.conf
# /etc/ssmtp/ssmtp.conf
# The user that gets all the mails (UID < 1000, usually the admin)
root=user@host.name
# The full hostname.  Must be correctly formed, fully qualified domain name or GMail will reject connection.
hostname=host.name
# The mail server (where the mail is sent to), both port 465 or 587 should be acceptable
# See also https://support.google.com/mail/answer/78799
mailhub=smtp.gmail.com:587
#mailhub=smtp.host.name:465

# The address where the mail appears to come from for user authentication.
rewriteDomain=gmail.com
# Email 'From header's can override the default domain?

FromLineOverride=YES

# Username/Password
AuthUser=username@gmail.com
AuthPass=password
AuthMethod=LOGIN
# Use SSL/TLS before starting negotiation
UseTLS=YES
UseSTARTTLS=YES

EOF

This configuration is very basic and is useful only if you don't need to receive mails back on the host; receiving is something a full MTA also supports, even though it is rarely needed in this scenario.

One downside of ssmtp is that the mail password will be stored in plain text, so make sure you set proper permissions on /etc/ssmtp/ssmtp.conf.
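Something along these lines should do; the exact owner / group may differ per distribution (Debian, for example, traditionally uses the mail group for ssmtp):

linux:~# chown root:mail /etc/ssmtp/ssmtp.conf
linux:~# chmod 640 /etc/ssmtp/ssmtp.conf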
 

– If your Gmail account is secured with two-factor authentication, you need to generate a unique App Password to use in ssmtp.conf. You can do so on your App Passwords page. Use Gmail username (not the App Name) in the AuthUser line and use the generated 16-character password in the AuthPass line, spaces in the password can be omitted.

– If you do not use two-factor authentication, you need to allow access for less secure apps in your Google account settings.
 

4. Configuring different msmtp for separate user profiles


msmtp is capable of using different relays for different local UNIX users, assuming each of them has a separate home under /home/your-username.

To make a certain user, let's say georgi, relay emails sent with the mail or mailx command through a given SMTP, create ~/.msmtprc:

 

linux:~# vim ~/.msmtprc


Append configuration like:

# Set default values for all following accounts.
defaults
port 587
tls on
tls_trust_file /etc/ssl/certs/ca-certificates.crt
account gmail
host smtp.gmail.com
from <user>@gmail.com
auth on
user <user>
passwordeval gpg --no-tty -q -d ~/.msmtp-gmail.gpg
# Set a default account

account default : gmail
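The passwordeval line above reads the SMTP password from a gpg-encrypted file; one way to create that file, assuming the user has a working gpg key (the key id below is a placeholder, the file name matches the config):

linux:~$ echo 'your-app-password' | gpg --encrypt -r your-gpg-key-id -o ~/.msmtp-gmail.gpg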


To add it for any different user modify the respective fields and set the different Mail hostname etc.
 

5. Using mail address aliases


msmtp also supports mail aliases; to make them work you will need to have in /etc/msmtprc the line:
 

aliases               /etc/aliases


Standard aliases should then work:

linux:~# cat /etc/aliases
# Example aliases file
     
# Send root to Joe and Jane
root: georgi_georgiev@example.com, georgi@example.com
   
# Send everything else to admin
default: admin@domain.example

 

6. Get updated when your Debian servers have new packages to update 

msmtp can be used for multiple things; one example use would be together with cron to get daily notifications when there are new Debian security or errata updates pending. To do so you can use the apticron shell script.

To use it on Debian install the apticron package:
 

root@debian:~# apt-get install --yes apticron

apticron has the capability to:

 * send daily emails about pending upgrades in your system;
 * give you the choice of receiving only those upgrades not previously notified;
 * automatically integrate to apt-listchanges in order to give you by email the
   new changes of the pending upgrade packages;
 * handle and warn you about packages put on hold via aptitude/dselect,
   avoiding unexpected package upgrades (see #137771);
 * give you all these stuff in a simple default installation;

 

To configure it you have to place a config: copy the one from /usr/lib/apticron/apticron.conf to /etc/apticron/apticron.conf.

The only important value to modify in the config is the email address to which the apt-listchanges info about newly installable debs (as offered by the apt-get dist-upgrade command) will be sent. The output will be delivered to the address configured in the EMAIL field of apticron.conf.
 

EMAIL="<your-user@email-addr-domain.com>"


The timing at which the offered new pending package update reminder will be sent is controlled by /etc/cron.d/apticron
 

debian:~# cat /etc/cron.d/apticron
# cron entry for apticron

48 * * * * root if test -x /usr/sbin/apticron; then /usr/sbin/apticron --cron; else true; fi

apticron will use the previously installed local ssmtp / msmtp program to deliver to the configured mailbox.
To manually trigger apticron run:
 

root@debian:~# if test -x /usr/sbin/apticron; then /usr/sbin/apticron --cron; else true; fi


7. Test whether local mail send works to the Internet

To test mail sending we can use either the mail / mailx or sendmail commands, or some more advanced mailer such as alpine or mutt.

Below are a few examples.

linux:~$ echo -e "Subject: this is the subject\n\nthis is the body" | mail user@your-recipient-domain.com

To test that sending file content in a mail also works, run:

linux:~$ mail -s "Subject" recipient-email@domain.com < mail-content-to-attach.txt

or

Prepare the mail you want to send and send it with sendmail

linux:~$ vim test-mail.txt
To: username@example.com
From: youraccount@gmail.com
Subject: Test Email
This is a test mail.

linux:~$ sendmail -t < test-mail.txt

Sending encoded attachments with uuencode is also possible, but you will need the sharutils Deb / RPM package installed.

To attach, let's say, a simple text file uuencoded:

linux:~$ uuencode file.txt myfile.txt | sendmail user@example.com

linux:~$ echo "To: username@domain.com
From: username@gmail.com
Subject: A test

Hello there." > test.mail

linux:~$ cat test.mail | msmtp -a default <username>@domain.com


That's all folks, hope you learned something. If you know some better tool like ssmtp, please share it.

Automatic network restart and reboot Linux server script if ping timeout to gateway is not responding as a way to reduce connectivity downtimes

Monday, December 10th, 2018

automatic-server-network-restart-and-reboot-script-if-connection-to-server-gateway-inavailable-tux-penguing-ascii-art-bin-bash

Inability of the server to come back online automatically after an electricity / network outage

These days my home server is experiencing a lot of issues due to electricity power outages; construction dig operations to fix / change the water pipe tubes near my home are in action, and perhaps the power cables got ruptured by the digger machine.
The effect of all this was that my server's network accessibility was affected, and as I had no network I couldn't access it remotely anymore. At a certain point the electricity was restored (and the UPS charge could keep the server up), however the server accessibility was not restored until I asked a relative to restart it, or, in more complicated cases, a tech-acquainted guy had to help – Alexander (Alex), a close friend from my school years (check his old site here – alex.www.pc-freak.net), helps a lot – to restart the machine physically, run quick restoration commands on a root TTY terminal, or generally check whether the default router is reachable.

This kind of Pc-Freak.net downtime has become too frequent over the last month (the machine was down about 5 times for 2 to 5 hours, and that was too much), and weirdly enough it was not accessible from the Internet even after the electricity and network were restored, and the only solution was a physical server restart (from the Power Button).

To decrease the number of cases in which relatives or friends have to physically go to the server and restart it after each network or electricity outage, I wrote a small script that checks accessibility towards the default network gateway of my server with a few ICMP packets sent with the good old PING command,
and triggers a network restart and then a system reboot
(in case the network restart fails).

1. Create the reboot-if-nwork-is-down.sh script under /usr/sbin or another dir

Here is the script itself:

 

#!/bin/sh
# Script pings the DEF GW (5 ICMP packets, 10 tries) and if the gateway
# is unreachable triggers a networking restart (/etc/init.d/networking restart).
# Then it does another round of pings and if the ping command still returns errors,
# it reboots the machine.
# A counter kept in /tmp/rebooted.txt limits the reboots to 5 in a row; once reached,
# the script sleeps for 30 minutes and resets the counter.
# This script is useful if you run a home router with Linux and you have
# electricity outages after which the machine doesn't come up unless rebooted.

GATEWAY_HOST='192.168.0.1';
COUNTER_FILE='/tmp/rebooted.txt';

run_ping () {
for i in $(seq 1 10); do
    # consider the gateway reachable as soon as one ping run succeeds
    ping -c 5 $GATEWAY_HOST && return 0
done
return 1
}

reboot_f () {
if run_ping; then
    echo "$(date "+%Y-%m-%d %H:%M:%S") Ping to $GATEWAY_HOST OK" >> /var/log/reboot.log
else
    /etc/init.d/networking restart
    echo "$(date "+%Y-%m-%d %H:%M:%S") Restarted Network Interfaces" >> /var/log/reboot.log
    # initialize the reboot counter if it is not there yet
    [ -f "$COUNTER_FILE" ] || echo 0 > "$COUNTER_FILE"
    n=$(cat "$COUNTER_FILE")
    if ! run_ping && [ "$n" -lt 5 ]; then
        echo "$(date "+%Y-%m-%d %H:%M:%S") Ping to $GATEWAY_HOST FAILED !!! REBOOTING." >> /var/log/reboot.log
        # increment the counter (up to 5) before rebooting
        echo $(( n + 1 )) > "$COUNTER_FILE"
        /sbin/reboot
    fi
    # if already rebooted 5 times, sleep 30 mins and reset the counter
    if [ "$n" -ge 5 ]; then
        sleep 1800
        cat /dev/null > "$COUNTER_FILE"
    fi
fi
}
reboot_f;

You can download a copy of reboot-if-nwork-is-down.sh script here.

As you see in the script, successful runs as well as failures are logged on the server in /var/log/reboot.log with the respective timestamp.
Also a counter is kept in /tmp/rebooted.txt, incremented on every script run that ends in a reboot; once the counter reaches 5,

a sleep of 30 minutes is executed and the counter is reset.
The counter check against 5 guarantees the server will not keep restarting if access to the gateway stays down for a long time, i.e. it prevents the system from being rebooted like crazy all the time.
 

2. Create a cron job to run reboot-if-nwork-is-down.sh every 15 minutes or so 

I've set the script to re-run from a scheduled (root user) cron job every 15 minutes with the following job:

To add the script to the existing cron rules without rewriting my old cron jobs and without opening an interactive editor with crontab -u root -e (i.e. to add the cron job in a non-interactive mode with a single bash one-liner), I had to run the following command:

 

{ crontab -l; echo "*/15 * * * * /usr/sbin/reboot-if-nwork-is-down.sh 2>&1 >/dev/null"; } | crontab -
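To confirm the job really landed in root's crontab, list it back:

# crontab -l | grep reboot-if-nwork-is-down.sh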


I know that restarting a server to restore accessibility is a crude practice, but for home use or small client servers on unguaranteed networks with a cheap Uninterruptable Power Supply (UPS) device it is useful.

Summary

Time will show how efficient such a "self-healing" script practice is.
Even though I'm pretty sure that even in corporate businesses and large Public / Private / Hybrid Clouds, where access to remotely mounted NFS / XFS / ZFS filesystems fails now and then, a modification of the script could save you a lot of nerves and troubles and unhappy customers / managers screaming at you on the phone 🙂


I'll be interested to hear from others who have better ideas on how to restore ( resurrect ) access to an inaccessible Linux server after an outage.