Archive for the ‘System Administration’ Category

Zabbix: Monitor Linux rsyslog configured central log server is rechable with check_log_server_status.sh userparameter script

Wednesday, June 8th, 2022

zabbix-monitor-central-log-server-is-reachable-from-host-with-a-userparamater-script-zabbix-logo

On modern Linux OS servers on Redhat / CentOS / Fedora and Debian based distros log server service is usually running on the system  such as rsyslog (rsyslogd) to make sure the logging from services is properly logged in separate logs under /var/log.

A very common practice on critical server machines in terms of data security, where logs produced by rsyslog daermon needs to be copied over network via TCP or UDP protocol immediately is to copy over the /var/log produced logs to another configured central logging server. Then later every piece of bit generated by rsyslogd could be  overseen by a third party auditor person and useful for any investigation in case of logs integrity is required or at worse case if there is a suspicion that system in question is hacked by a malicious hax0r and logs have been "cleaned" up from any traces leading to the intruder (things usually done locally by hackers) or by any automated script exploit tools since yesr.

This doubled logging of system events to external log server  ipmentioned is very common practice by companies to protect their log data and quite useful for logs to be recovered easily later on from the central logging server machine that could be also setup for example to use rsyslogd to receive logs from other Linux machines in circumstances where some log disappears just like that (things i've seen happen) for any strange reason or gets destroyed by the admins mistake locally on machine / or by any other mean such as filesystem gets damaged. a very common practice by companies to protect their log data.  

Monitor remote logging server is reachable with userparameter script

Assuming that you already have setup a logging from the server hostname A towards the Central logging server log storepool and everything works as expected the next logical step is to have at least some basic way to monitor remote logging server configured is still reachable all the time and respectively rsyslog /var/log/*.* logs gets properly produced on remote side for example with something like a simple TCP remote server port check and reported in case of troubles in zabbix.

To solve that simple task for company where I'm employed, I've developed below check_log_server_status.sh:
 

#!/bin/bash
# @@ for TCP @ for UDP
# check_log_server_status.sh Script to check if configured TCP / UDP logging server in /etc/rsyslog.conf is rechable
# report to zabbix
DELIMITER='@@';
GREP_PORT='5145';
CONNECT_TIMEOUT=5;

PORT=$(grep -Ei "*.* $DELIMITER.*:$GREP_PORT" /etc/rsyslog.conf|awk -F : '{ print $2 }'|sort -rn |uniq);

#for i in $(grep -Ei "*.* $DELIMITER.*:$GREP_PORT" /etc/rsyslog.conf |grep -v '\#'|awk -F"$DELIMITER" '{ print $2 }' | awk -F ':' '{ print $1 }'|sort -rn); do
HOST=$(grep -Ei "*.* $DELIMITER.*:$GREP_PORT" /etc/rsyslog.conf |grep -v '\#'|awk -F"$DELIMITER" '{ print $2 }' | awk -F ':' '{ print $1 }'|sort -rn)

# echo $PORT

if [[ ! -z $PORT ]] && [[ ! -z $HOST ]]; then
SSH_RETURN=$(/bin/ssh $HOST -p $PORT -o ConnectTimeout=$CONNECT_TIMEOUT 2>&1);
else
echo "PROBLEM Port $GREP_PORT not defined in /etc/rsyslog.conf";
fi

##echo SSH_RETURN $SSH_RETURN;
#exit 1;
if [[ $(echo $SSH_RETURN |grep -i ‘Connection timed out during banner exchange’ | wc -l) -eq ‘1’ ]]; then
echo "rsyslogd $HOST:$PORT OK";
fi

if [[ $(echo $SSH_RETURN |grep -i ‘Connection refused’ | wc -l) -eq ‘1’ ]]; then
echo "rsyslogd $HOST:$PORT PROBLEM";
fi

#sleep 2;
#done


You can download a copy of the script check_log_server_status.sh here

Depending on the port the remote rsyslogd central logging server is using configure it in the script with respective port through the DELIMITER='@@', GREP_PORT='5145', CONNECT_TIMEOUT=5 values.

The delimiter is setup as usually in /etc/rsyslog.conf this the remote logging server for TCP IP is configured with @@ prefix to indicated TCP mode should be used.

Below is example from /etc/rsyslog.conf of how the rsyslogd server is configured:

[root@Server-hostA /root]# grep -i @@ /etc/rsyslogd.conf
# central remote Log server IP / port
*.* @@10.10.10.1:5145

To use the script on a machine, where you have a properly configured zabbix-agentd service host connected and reporting data to a zabbix-server monitoring server.

1. Set up the script under /usr/local/bin/check_log_server_status.sh

[root@Server-hostA /root ]# vim /usr/local/bin/check_log_server_status.sh

[root@Server-hostA /root ]# chmod +x /usr/local/bin/check_log_server_status.sh

2. Prepare userparameter_check_log_server.conf with log_server.check Item key

[root@Server-hostA zabbix_agentd.d]# cat userparameter_check_log_server.conf 
UserParameter=log_server.check, /usr/local/bin/check_log_server_status.sh

3. Set in Zabbix some Item such as on below screenshot

 

check-log-server-status-screenshot-linux-item-zabbix.png4. Create a Zabbix trigger 

check-log-server-status-trigger-logserver-is-unreachable-zabbix


The redded hided field in Expression field should be substituted with your actual hostname on which the monitor script will run.

How to RPM update Hypervisors and Virtual Machines running Haproxy High Availability cluster on KVM, Virtuozzo without a downtime on RHEL / CentOS Linux

Friday, May 20th, 2022

virtuozzo-kvm-virtual-machines-and-hypervisor-update-manual-haproxy-logo


Here is the scenario, lets say you have on your daily task list two Hypervisor (HV) hosts running CentOS or RHEL Linux with KVM or Virutozzo technology and inside the HV hosts you have configured at least 2 pairs of virtual machines one residing on HV Host 1 and one residing on HV Host 2 and you need to constantly keep the hosts to the latest distribution major release security patchset.

The Virtual Machines has been running another set of Redhat Linux or CentOS configured to work in a High Availability Cluster running Haproxy / Apache / Postfix or any other kind of HA solution on top of corosync / keepalived or whatever application cluster scripts Free or Open Source technology that supports a switch between clustered Application nodes.

The logical question comes how to keep up the CentOS / RHEL Machines uptodate without interfering with the operations of the Applications running on the cluster?

Assuming that the 2 or more machines are configured to run in Active / Passive App member mode, e.g. one machine is Active at any time and the other is always Passive, a switch is possible between the Active and Passive node.

HAProxy--Load-Balancer-cluster-2-nodes-your-Servers

In this article I'll give a simple step by step tested example on how you I succeeded to update (for security reasons) up to the latest available Distribution major release patchset on one by one first the Clustered App on Virtual Machines 1 and VM2 on Linux Hypervisor Host 1. Then the App cluster VM 1 / VM 2 on Hypervisor Host 2.
And finally update the Hypervisor1 (after moving the Active resources from it to Hypervisor2) and updating the Hypervisor2 after moving the App running resources back on HV1.
I know the procedure is a bit monotonic but it tries to go through everything step by step to try to mitigate any possible problems. In case of failure of some rpm dependencies during yum / dnf tool updates you can always revert to backups so in anyways don't forget to have a fully functional backup of each of the HV hosts and the VMs somewhere on a separate machine before proceeding further, any possible failures due to following my aritcle literally is your responsibility 🙂

 

0. Check situation before the update on HVs / get VM IDs etc.

Check the virsion of each of the machines to be updated both Hypervisor and Hosted VMs, on each machine run:
 

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)


The machine setup I'll be dealing with is as follows:
 

hypervisor-host1 -> hypervisor-host1.fqdn.com 
•    virt-mach-centos1
•    virt-machine-zabbix-proxy-centos (zabbix proxy)

hypervisor-host2 -> hypervisor-host2.fqdn.com
•    virt-mach-centos2
•    virt-machine-zabbix2-proxy-centos (zabbix proxy)

To check what is yours check out with virsh cmd –if on KVM or with prlctl if using Virutozzo, you should get something like:

[root@hypervisor-host2 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host1 running
 2 virt-mach-centos2 running

 # virsh list –all

[root@hypervisor-host1 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host2 running
 3 virt-mach-centos1 running

[root@hypervisor-host1 ~]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{dc37c201-08c9-589d-aa20-9386d63ce3f3}  running      –               VM virt-mach-centos1
{76e8a5f8-caa8-5442-830e-aa4bfe8d42d9}  running      –               VM vm-host2
[root@hypervisor-host1 ~]#

If you have stopped VMs with Virtuozzo to list the stopped ones as well.
 

# prlctl list -a

[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# vzlist -a
                                CTID      NPROC STATUS    IP_ADDR         HOSTNAME
[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{92075803-a4ce-5ec0-a3d8-9ee83d85fc76}  running      –               VM virt-mach-centos2
{74a7bbe8-9245-5385-ac0d-d10299100789}  running      –               VM vm-host1

# prlctl list -a


If due to Virtuozzo version above command does not return you can manually check in the VM located folder, VM ID etc.
 

[root@hypervisor-host2 vmprivate]# ls
74a7bbe8-9245-4385-ac0d-d10299100789  92075803-a4ce-4ec0-a3d8-9ee83d85fc76
[root@hypervisor-host2 vmprivate]# pwd
/vz/vmprivate
[root@hypervisor-host2 vmprivate]#


[root@hypervisor-host1 ~]# ls -al /vz/vmprivate/
total 20
drwxr-x—. 5 root root 4096 Feb 14  2019 .
drwxr-xr-x. 7 root root 4096 Feb 13  2019 ..
drwxr-x–x. 4 root root 4096 Feb 18  2019 1c863dfc-1deb-493c-820f-3005a0457627
drwxr-x–x. 4 root root 4096 Feb 14  2019 76e8a5f8-caa8-4442-830e-aa4bfe8d42d9
drwxr-x–x. 4 root root 4096 Feb 14  2019 dc37c201-08c9-489d-aa20-9386d63ce3f3
[root@hypervisor-host1 ~]#


Before doing anything with the VMs, also don't forget to check the Hypervisor hosts has enough space, otherwise you'll get in big troubles !
 

[root@hypervisor-host2 vmprivate]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/centos_hypervisor-host2-root   20G  1.8G   17G  10% /
devtmpfs                          20G     0   20G   0% /dev
tmpfs                             20G     0   20G   0% /dev/shm
tmpfs                             20G  2.0G   18G  11% /run
tmpfs                             20G     0   20G   0% /sys/fs/cgroup
/dev/sda1                        992M  159M  766M  18% /boot
/dev/mapper/centos_hypervisor-host2-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos_hypervisor-host2-var   9.8G  355M  8.9G   4% /var
/dev/mapper/centos_hypervisor-host2-vz    755G   25G  692G   4% /vz

 

[root@hypervisor-host1 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   50G  1.8G   45G   4% /
devtmpfs                  20G     0   20G   0% /dev
tmpfs                     20G     0   20G   0% /dev/shm
tmpfs                     20G  2.1G   18G  11% /run
tmpfs                     20G     0   20G   0% /sys/fs/cgroup
/dev/sda2                992M  153M  772M  17% /boot
/dev/mapper/centos-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos-var   9.8G  406M  8.9G   5% /var
/dev/mapper/centos-vz    689G   12G  643G   2% /vz

Another thing to do before proceeding with update is to check and tune if needed the amount of CentOS repositories used, before doing anything with yum.
 

[root@hypervisor-host2 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 110 root root 12288 Oct  7 11:13 ..
-rw-r–r–.   1 root root  4382 Mar 14  2019 CentOS7.repo
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo
[root@hypervisor-host2 yum.repos.d]#

 

[root@hypervisor-host1 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 112 root root 12288 Oct  7 11:09 ..
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   300 Mar 14  2019 obsoleted_tmpls.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo


1. Dump VM definition XMs (to have it in case if it gets wiped during update)

There is always a possibility that something will fail during the update and you might be unable to restore back to the old version of the Virtual Machine due to some config misconfigurations or whatever thus a very good idea, before proceeding to modify the working VMs is to use KVM's virsh and dump the exact set of XML configuration that makes the VM roll properly.

To do so:
Check a little bit up in the article how we have listed the IDs that are part of the directory containing the VM.
 

[root@hypervisor-host1 ]# virsh dumpxml (Id of VM virt-mach-centos1 ) > /root/virt-mach-centos1_config_bak.xml
[root@hypervisor-host2 ]# virsh dumpxml (Id of VM virt-mach-centos2) > /root/virt-mach-centos2_config_bak.xml

 


2. Set on standby virt-mach-centos1 (virt-mach-centos1)

As I'm upgrading two machines that are configured to run an haproxy corosync cluster, before proceeding to update the active host, we have to switch off
the proxied traffic from node1 to node2, – e.g. standby the active node, so the cluster can move up the traffic to other available node.
 

[root@virt-mach-centos1 ~]# pcs cluster standby virt-mach-centos1


3. Stop VM virt-mach-centos1 & backup on Hypervisor host (hypervisor-host1) for VM1

Another prevention step to make sure you don't get into damaged VM or broken haproxy cluster after the upgrade is to of course backup 

 

[root@hypervisor-host1 ]# prlctl backup virt-mach-centos1

or
 

[root@hypervisor-host1 ]# prlctl stop virt-mach-centos1
[root@hypervisor-host1 ]# cp -rpf /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3 /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3-bak
[root@hypervisor-host1 ]# tar -czvf virt-mach-centos1_vm_virt-mach-centos1.tar.gz /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3

[root@hypervisor-host1 ]# prlctl start virt-mach-centos1


4. Remove package version locks on all hosts

If you're using package locking to prevent some other colleague to not accidently upgrade the machine (if multiple sysadmins are managing the host), you might use the RPM package locking meachanism, if that is used check RPM packs that are locked and release the locking.

+ List actual list of locked packages

[root@hypervisor-host1 ]# yum versionlock list  

…..
0:libtalloc-2.1.16-1.el7.*
0:libedit-3.0-12.20121213cvs.el7.*
0:p11-kit-trust-0.23.5-3.el7.*
1:quota-nls-4.01-19.el7.*
0:perl-Exporter-5.68-3.el7.*
0:sudo-1.8.23-9.el7.*
0:libxslt-1.1.28-5.el7.*
versionlock list done
                          

+ Clear the locking            

# yum versionlock clear                               


+ List actual list / == clear all entries
 

[root@virt-mach-centos2 ]# yum versionlock list; yum versionlock clear
[root@virt-mach-centos1 ]# yum versionlock list; yum versionlock clear
[root@hypervisor-host1 ~]# yum versionlock list; yum versionlock clear
[root@hypervisor-host2 ~]# yum versionlock list; yum versionlock clear


5. Do yum update virt-mach-centos1


For some clarity if something goes wrong, it is really a good idea to make a dump of the basic packages installed before the RPM package update is initiated,
The exact versoin of RHEL or CentOS as well as the list of locked packages, if locking is used.

Enter virt-mach-centos1 (ssh virt-mach-centos1) and run following cmds:
 

# cat /etc/redhat-release  > /root/logs/redhat-release-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


+ Only if needed!!
 

# yum versionlock clear
# yum versionlock list


Clear any previous RPM packages – careful with that as you might want to keep the old RPMs, if unsure comment out below line
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

Proceed with the update and monitor closely the output of commands and log out everything inside files using a small script that you should place under /root/status the script is given at the end of the aritcle.:
 

yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
yum check-update | wc -l
yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

6. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


7. Stop VM virt-mach-centos2 & backup  on Hypervisor host (hypervisor-host2)

Same backup step as prior 

# prlctl backup virt-mach-centos2


or
 

# prlctl stop virt-mach-centos2
# cp -rpf /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76 /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76-bak
## tar -czvf virt-mach-centos2_vm_virt-mach-centos2.tar.gz /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76

# prctl start virt-mach-centos2


8. Do yum update on virt-mach-centos2

Log system state, before the update
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list

 

Clean old install update / packages if required
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Initiate the update

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l 
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


9. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now

 

10. Stop VM vm-host2 & backup
 

# prlctl backup vm-host2


or

# prlctl stop vm-host2

Or copy the actual directory containig the Virtozzo VM (use the correct ID)
 

# cp -rpf /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9 /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9-bak
## tar -czvf vm-host2.tar.gz /vz/vmprivate/76e8a5f8-caa8-4442-830e-aa5bfe8d42d9

# prctl start vm-host2


11. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Clear only if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Do the rpm upgrade

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


12. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now


13. Do yum update hypervisor-host2

 

 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock   if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Update rpms
 

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


14. Stop VM vm-host1 & backup


Some as ealier
 

# prlctl backup vm-host1

or
 

# prlctl stop vm-host1

# cp -rpf /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789 /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789-bak
# tar -czvf vm-host1.tar.gz /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789

# prctl start vm-host1


15. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


16. Check if everything is running fine after upgrade

+ Reboot VM

# shutdown -r now


17. Do yum update hypervisor-host1

Same procedure for HV host 1 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock
 

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


18. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host1 all VMs run as expected 


19. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host2 all VMs run as expected afterwards


20. Check once more VMs and haproxy or any other contained services in VMs run as expected

Login to hosts and check processes and logs for errors etc.
 

21. Haproxy Unstandby virt-mach-centos1

Assuming that the virt-mach-centos1 and virt-mach-centos2 are running a Haproxy / corosync cluster you can try to standby node1 and check the result
hopefully all should be fine and traffic should come to host node2.

[root@virt-mach-centos1 ~]# pcs cluster unstandby virt-mach-centos1


Monitor logs and make sure HAproxy works fine on virt-mach-centos1


22. If necessery to redefine VMs (in case they disappear from virsh) or virtuosso is not working

[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos1_config_bak.xml
[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos2_config_bak.xml


23. Set versionlock to RPMs to prevent accident updates and check OS version release

[root@virt-mach-centos2 ]# yum versionlock \*
[root@virt-mach-centos1 ]# yum versionlock \*
[root@hypervisor-host1 ~]# yum versionlock \*
[root@hypervisor-host2 ~]# yum versionlock \*

[root@hypervisor-host2 ~]# cat /etc/redhat-release 
CentOS Linux release 7.8.2003 (Core)

Other useful hints

[root@hypervisor-host1 ~]# virsh console dc37c201-08c9-489d-aa20-9386d63ce3f3
Connected to domain virt-mach-centos1
..

! Compare packages count before the upgrade on each of the supposable identical VMs and HVs – if there is difference in package count review what kind of packages are different and try to make the machines to look as identical as possible  !

Packages to update on hypervisor-host1 Count: XXX
Packages to update on hypervisor-host2 Count: XXX
Packages to update virt-mach-centos1 Count: – 254
Packages to update virt-mach-centos2 Count: – 249

The /root/status script

+++

#!/bin/sh
echo  '=======================================================   '
echo  '= Systemctl list-unit-files –type=service | grep enabled '
echo  '=======================================================   '
systemctl list-unit-files –type=service | grep enabled

echo  '=======================================================   '
echo  '= systemctl | grep ".service" | grep "running"            '
echo  '=======================================================   '
systemctl | grep ".service" | grep "running"

echo  '=======================================================   '
echo  '= chkconfig –list                                        '
echo  '=======================================================   '
chkconfig –list

echo  '=======================================================   '
echo  '= netstat -tulpn                                          '
echo  '=======================================================   '
netstat -tulpn

echo  '=======================================================   '
echo  '= netstat -r                                              '
echo  '=======================================================   '
netstat -r


+++

That's all folks, once going through the article, after some 2 hours of efforts or so you should have an up2date machines.
Any problems faced or feedback is mostly welcome as this might help others who have the same setup.

Thanks for reading me 🙂

How to remove GNOME environment and Xorg server on CentOS 7 / 8 / 9 Linux

Wednesday, April 13th, 2022

centos-linux-remove-gnome-gui-remove-howto-logo

If you have installed recent version of CentOS, you have noticed by default the Installator did setup Xserver and GNOME as Graphical Environment as well the surrounding GUI Administration tools. That's really not needed on "headless" monitorless Linux servers as this wastes up for nothing a very tiny amount of the machine CPU and RAM and Disk resource on keeping services up and running. Even worse a Graphical Environment on a Production server poses a security breach as their are much more services running on the OS that could be potentially hacked.

Removal of GUI across CentOS is similar but slightly differs. Hence in this article, I'll show how it can be removed on CentOS Linux 7 / 8 and 9. Removal of Graphics is usual operation for sysadmins thus there is plenty of info on the net,how this is done on CentOS 7 and COS 8 but unfortunately as of time of writting this article, couldn't find anything on the net on how to Remove GUI environment on CentOS 9.

The reason for this article is mostly for documentation purposes for myself

First list the available meta-package groups installed on the OS:

1. List machine installed package groups

 

yum-groupinstall-gnome

[root@centos ~]# yum grouplist
Last metadata expiration check: 3:55:48 ago on Mon 11 Apr 2022 03:26:06 AM EDT.
Available Environment Groups:
   Server
   Minimal Install
   Workstation
   KDE Plasma Workspaces
   Custom Operating System
   Virtualization Host
Installed Environment Groups:
   Server with GUI
Installed Groups:
   Container Management
   Headless Management
Available Groups:
   Legacy UNIX Compatibility
   Console Internet Tools
   Development Tools
   .NET Development
   Graphical Administration Tools
   Network Servers
   RPM Development Tools
   Scientific Support
   Security Tools
   Smart Card Support
   System Tools
   Fedora Packager


On CentOS 8 and CentOS 9 to list the installed package groups, you can use also:

[root@centos ~]# dnf grouplist

Installed Environment Groups:
   Server with GUI

2. Remove GNOME and Xorg GUIs on CentOS 7

[root@centos ~]# yum groupremove "Server with GUI" –skip-broken

[root@centos ~]# yum groupremove "GNOME Desktop" -y

3. Remove GNOME and X on CentOS 8

[root@centos ~]# dnf groupremove 'X Window System' 'GNOME' -y

4. Remove Graphical Environment on CentOS 9

 

centos9-linux-groupremove-command-screenshot

[root@centos ~]# yum groupremove GNOME 'Graphical Administration Tools' -y

 

Removing Groups:
 GNOME

Transaction Summary
====================================================
Remove  123 Packages

Freed space: 416 M
Is this ok [y/N]: y
Is this ok [y/N]: y
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.

 


  xorg-x11-drv-libinput-1.0.1-3.el9.x86_64
  xorg-x11-server-Xorg-1.20.11-10.el9.x86_64
  xorg-x11-server-Xwayland-21.1.3-2.el9.x86_64
  xorg-x11-server-common-1.20.11-10.el9.x86_64
  xorg-x11-server-utils-7.7-44.el9.x86_64
  xorg-x11-xauth-1:1.1-10.el9.x86_64
  xorg-x11-xinit-1.4.0-11.el9.x86_64

Complete!


Graphical Administration Tools – is a group of tools that 

Or alternatively you can do

[root@centos ~]# yum remove gnome* xorg* -y


5. Change the Graphical boot to text multiuser

[root@centos ~]# systemctl set-default multi-user.target


6. Install GNOME / X GUI on the CentOS 7 / 8 / 9

Sometimes GNOME Desktop environment and Xorg are missing on previously delpoyed installs but you need it back for some reason.For example it was earlier removed a year ago on the server as it was not needed, but the machine use type changes and now you need to have installed an Oracle Server / Oracle Client which usually depends on having at least a minimal working version of X environment ont the Linux.


To install back the GNOME and X back on the machine:

[root@centos ~]# yum groupistall "Server with GUI" –skip-broken

[root@centos9 network-scripts]# yum groupinstall "Server with GUI" –skip-broken
Last metadata expiration check: 0:09:26 ago on Mon 11 Apr 2022 07:43:11 AM EDT.
No match for group package "insights-client"
No match for group package "redhat-release"
No match for group package "redhat-release-eula"
Dependencies resolved.
===================================================
 Package                                       Arch       Version              Repository     Size
===================================================
Installing group/module packages:
 NetworkManager-wifi                           x86_64     1:1.37.2-1.el9       baseos         75 k
 cheese                                        x86_64     2:3.38.0-6.el9       appstream      96 k
 chrome-gnome-shell                            x86_64     10.1-14.el9          appstream      33 k
 eog                                           x86_64     40.3-2.el9           appstream     3.6 M
 evince                                        x86_64     40.4-4.el9           appstream     2.8 M
 evince-nautilus                               x86_64     40.4-4.el9           appstream      20 k
 gdm                                           x86_64     1:40.1-13.el9        appstream     894 k
 gnome-bluetooth                               x86_64     1:3.34.5-3.el9       appstream      44 k
 gnome-calculator                              x86_64     40.1-2.el9           appstream     1.4 M
 gnome-characters                              x86_64     40.0-3.el9           appstream     236 k
 gnome-classic-session                         noarch     40.6-1.el9           appstream      36 k
 gnome-color-manager                           x86_64     3.36.0-7.el9         appstream     1.1 M
 gnome-control-center                          x86_64     40.0-22.el9          appstream     5.7 M
 gnome-disk-utility                            x86_64     40.2-2.el9           appstream     1.1 M
 gnome-font-viewer                             x86_64     40.0-3.el9           appstream     233 k
 gnome-initial-setup                           x86_64     40.1-2.el9           appstream     1.1 M
 gnome-logs                                    x86_64     3.36.0-6.el9         appstream     416 k

Installing dependencies:
 cheese-libs                                   x86_64     2:3.38.0-6.el9       appstream     941 k
 clutter                                       x86_64     1.26.4-7.el9         appstream     1.1 M
 clutter-gst3                                  x86_64     3.0.27-7.el9         appstream      85 k
 clutter-gtk                                   x86_64     1.8.4-13.el9         appstream      47 k
 cogl                                          x86_64     1.22.8-5.el9         appstream     505 k
 colord-gtk                                    x86_64     0.2.0-7.el9          appstream      33 k
 dbus-daemon                                   x86_64     1:1.12.20-5.el9      appstream     202 k
 dbus-tools                                    x86_64     1:1.12.20-5.el9      baseos         52 k
 evince-previewer                              x86_64     40.4-4.el9           appstream      29 k

Installing weak dependencies:
 gnome-tour                                    x86_64     40.1-1.el9           appstream     722 k
 nm-connection-editor                          x86_64     1.26.0-1.el9         appstream     838 k
 p11-kit-server                                x86_64     0.24.1-2.el9         appstream     199 k
 pinentry-gnome3                               x86_64     1.1.1-8.el9          appstream      41 k
Installing Environment Groups:
 Server with GUI
Installing Groups:
 base-x
 Container Management
 core
 fonts
 GNOME
 guest-desktop-agents
 Hardware Monitoring Utilities
 hardware-support
 Headless Management
 Internet Browser
 multimedia
 networkmanager-submodules
 print-client
 Server product core
 standard

Transaction Summary
=======================================================
Install  114 Packages

Total download size: 96 M
Installed size: 429 M
Is this ok [y/N]: y

or yum groupinstall GNOME

[root@centos9 ~]# yum grouplist
Last metadata expiration check: 3:55:48 ago on Mon 11 Apr 2022 03:26:06 AM EDT.
Available Environment Groups:

Installed Environment Groups:
   Server with GUI

Next you should change the OS default run level to 5 to make CentOS automatically start the Xserver and gdm.

To see the list of available default Login targets do:
 


[root@centos ~]# find / -name "runlevel*.target"
/usr/lib/systemd/system/runlevel0.target
/usr/lib/systemd/system/runlevel1.target
/usr/lib/systemd/system/runlevel2.target
/usr/lib/systemd/system/runlevel3.target
/usr/lib/systemd/system/runlevel4.target
/usr/lib/systemd/system/runlevel5.target
/usr/lib/systemd/system/runlevel6.target

The meaning of each runlevel is as follows:

Run Level Target Units Description
0 runlevel0.target, poweroff.target Shut down and power off
1 runlevel1.target, rescue.target Set up a rescue shell
2,3,4 runlevel[234].target, multi- user.target Set up a nongraphical multi-user shell
5 runlevel5.target, graphical.target Set up a graphical multi-user shell
6 runlevel6.target, reboot.target Shut down and reboot the system


If this does not work you can try:

yum-groupinstall-gnome

[root@centos ~]#  yum -y groups install "GNOME Desktop"


7. To check the OS configured boot target
 

[root@centos ~]# systemctl get-default
multi-user.target


multi-user.target is a mode of operation that is text mode only with multiple logins supported on tty and remotely.

To change it to graphical

[root@centos ~]# systemctl set-default graphical.target


or simply link it yourself
 

[root@centos ~]# ln -sf /lib/systemd/system/runlevel5.target /etc/systemd/system/default.target

[root@centos ~]# reboot


If the X was not used so far ever, you will get a few graphial screens to accept the License Information and Finish the configuration,i .e.

1. Accept the license by clicking on the “LICENSE INFORMATION“.

2. Tick mark the “accept the license agreement” and click on “Done“.

3. Click on “FINISH CONFIGURATION” to complete the setup.
And voila GDM (Graphical Login) Greater should shine up.
 

You could also go the manual route by adding an .xinitrc file in your home directory (instead of making the graphical login screen the default, as done above with the sudo systemctl set-default graphical.target command). To do this, issue the command:

[root@centos ~]# echo "exec gnome-session" >> ~/.xinitrc

 

How to disable Windows pagefile.sys and hiberfil.sys to temporary or permamently save disk space if space is critically low

Monday, March 28th, 2022

howto-pagefile-hiberfil.sys-remove-reduce-increase-increase-size-windows-logo

Sometimes you have to work with Windows 7 / 8 / 10 PCs  etc. that has a very small partition C:\
drive or othertimes due to whatever the disk got filled up with time and has only few megabytes left
and this totally broke up the windows performance as Windows OS becomes terribly sluggish and even
simple things as opening Internet Browser (Chrome / Firefox / Opera ) or Windows Explorer stones the PC performance.

You might of course try to use something like Spacesniffer tool (a great tool to find lost data space on PC s short description on it is found in my previous article how to
delete temporary Internet Files and Folders to to speed up and free disk space
 ) or use CCleaner to clean up a bit the pc.
Sometimes this is not enough though or it is not possible to do at all the main
partition disk C:\ is anyhow too much low (only 30-50MB are available on HDD) or the Physical or Virtual Machine containing the OS is filled with important data
and you couldn't risk to remove anything including Internet Temporary files, browsing cookies … whatever.

Lets say you are the fate chosen guy as sysadmin to face this uneasy situation and have no easy
way to add disk space from another present free space partition or could not add a new SATA hard drive
SSD drive, what should you do?
 

The solution wipe off pagefile.sys and hiberfil.sys

Usually every Windows installation has a pagefile.sys and hiberfil.sys.

  • pagefile.sys – is the default file that is used as a swap file, immediately once the machine runs out of memory. For Unix / Linux users better understanding pagefile.sys is the equivallent of Linux's swap partition. Of course as the pagefile is in a file and not in separate partition the swapping in Windows is perhaps generally worse than in Linux.
  • hiberfil.sys – is used to store data from the machine on machine Hibernation (for those who use the feature)


Pagefile.sys which depending on the configured RAM memory on the OS could takes up up to 5 – 8 GB, there hanging around doing nothing but just occupying space. Thus a temporary workaround that could free you some space even though it will degrade performance and on servers and production machines this is not a good solution on just user machines, where you temporary need to free space any other important task you can free up space
by seriously reducing the preconfigured default size of pagefile.sys (which usually is 1.5 times the active memory on the OS – hence if you have 4GB you would have a 6Gigabytes of pagefile.sys).

Other possibility especially on laptop and movable devices running Win OS is to disable hiberfil.sys, read below how this is done.


The temporary solution here is to simply free space by either reducing the pagefile.sys or completely disabling it


1. Disable pagefile.sys on Windows XP, Windows 7 / 8 / 10 / 11


The GUI interface to disable pagefile across all NT based Windows OS-es is quite similar, the only difference is newer versions of Windows has slightly more options.


1.1 Disable pagefile on Windows XP


Quickest way is to find pagefile.sys settings from GUI menus

1. Computer (My Computer) – right click mouse
2. Properties (System Preperties will appear)
3. Advanced (tab) 
4. Settings
5. Advanced (tab)
6. Change button

windows-xp-pagefile-disable-screenshot

1.2 Disable pagefile on Windows 7

 

advanced-system-settings-control-panel-system-and-security-screenshot

windows-system-properties-screenshot-properties-advanced-change-Virtual-memory-pagefile-screenshot

system-properties-performance-options
 

Once applied you'll be required to reboot the PC

How-to-turn-off-Virtual-Memory-Paging_File-in-Windows-7-restart

 

1.3 Disable Increase / Decrease pagefile.sys on Windows 10 / Win 11
 

open-system-properties-advanced-win10

win10-performance-options-menu-screenshot

configure-virtual-memory-win-10-screenshot


1.4 Make Windows clear pagefile.sys on shutdown

On home PCs it might be useful thing to clear up ( nullify) pagefile.sys on shutdown, that could save you some disk space on every reboot, until file continuously grows to its configured Maximum.

Run

regedit

Modify registry key at location

 

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management

windows-clean-up-pagefile-sys-file-on-shutdown-or-reboot-registry-editor-value-screenshot

You can apply the value also via a registry file you can get the Enable Clearpagefile at shutdown here .reg.
 

2. Manipulating pagefile.sys size and file delete from command line with wmic tool 

For scripting purposes you might want to use the wmic pagefile which can do increase / decrase or delete the file without GUI, that is very helpful if you have to admin a Windows Domain (Active Directory)
 

[hipo.WINDOWS-PC] ➤ wmic pagefile /?

PAGEFILE – Virtual memory file swapping management.

HINT: BNF for Alias usage.
(<alias> [WMIObject] | [] | [] ) [].

USAGE:

PAGEFILE ASSOC []
PAGEFILE CREATE <assign list>
PAGEFILE DELETE
PAGEFILE GET [] []
PAGEFILE LIST [] []

 

[hipo.WINDOWS-PC] ➤ wmic pagefile
AllocatedBaseSize  Caption          CurrentUsage  Description      InstallDate                Name             PeakUsage  Status  TempPageFile
4709               C:\pagefile.sys  499           C:\pagefile.sys  20200912061902.938000+180  C:\pagefile.sys  525                FALSE

 

[hipo.WINDOWS-PC] ➤ wmic pagefile list /format:list

AllocatedBaseSize=4709
CurrentUsage=499
Description=C:\pagefile.sys
InstallDate=20200912061902.938000+180
Name=C:\pagefile.sys
PeakUsage=525
Status=
TempPageFile=FALSE

wmic-pagefile-command-line-tool-for-windows-default-output-screenshot

 

  • To change the Initial Size or Maximum Size of Pagefile use:
     

➤ wmic pagefileset where name="C:\\pagefile.sys" set InitialSize=2048,MaximumSize=2048

  • To move the pagefile / change location of pagefile to less occupied disk drive partition (i.e. D:\ drive)

     

     

    Sometimes you might have multiple drives on the PC and some of them might be having multitudes of gigabytes while main drive C:\ could be fully occupied due to initial install bad drive organization, in that case a good work arount to save you space so you can work normally with the server is just to temporary or permanently move pagefile to another drive.

wmic pagefileset where name="D:\\pagefile.sys" set InitialSize=2048,MaximumSize=2048


!! CONSIDER !!! 

That if you have the option to move the pagefile.sys for best performance it is advicable to place the file inside another physical disk, preferrably a Solid State Drive one, SATA disks are too slow and reduced Input / Output disk operations will lead to degraded performance, if there is lack of memory (i.e. pagefile.sys is actively open read and wrote in).
 

  • To delete pagefile.sys 
     

➤ wmic pagefileset where name="C:\\pagefile.sys" delete

 

If for some reason you prefer to not use wmic but simple del command you can delete pagefile.sys also by:

Removing file default "Hidden" and "system" file attributes – set for security reasons as the file is a system file usually not touched by user. This will save you from "permission denied" errors:
 

➤ attrib -s -h %systemdrive%\pagefile.sys


Delete the file:
 

➤ del /a /q %systemdrive%\pagefile.sys


3. Disable hibernation on Windows 7 / 8 and Win 10 / 11

Disabling hibernation file hiberfil.sys can also free up some space, especially if the hibernation has been actively used before and the file is written with data. Of course, that is more common on notebooks.
Windows hibernation has significantly improved over time though i didn't have very pleasant experience in the past and I prefer to disable it just in case.
 

3.1 Disable Windows 7 / 8 / 10 / 11 hibernation from GUI 

Disable it through:

Control Panel -> All Control Panel Items -> Power Options -> Edit Plan Settings -> Change advanced power settings


 like shown in below screenshot:

Windqows-power-options-Advanced-settings-Allow-Hybrid-sleep-option-menu-screenshot

 

3.2 Disable Windows 7 / 8 / 10 / 11 hibernation from command line

Disable hibernation Is done in the same way through the powercfg.exe command, to disable it
if you're cut of disk space and you want to save space from it:

run as Administrator in Command Line Windows (cmd.exe)
 

powercfg.exe /hibernate off

If you later need to switch on hibernation
 

powercfg.exe /hibernate on


disable-hiberfile-windows-screenshot

3.3 Disable Windows hibernation on legacy Windows XP

On XP to disable hibernation open

1. Power Options Properties
2. Select Hibernate
3. Select Enable Hibernation to clear the checkbox and disable Hibernation mode. 
4. Select OK to apply the change.

Close the Power Options Properties box. 

enable-disable-hibernate-windows-xp-menu-screenshot

To sum it up

We have learned some basics on Windows swapping and hibernation and i've tried to give some insight on how thiese files if misconfigured could lead to degraded Win OS performance. In any case using SSD as of 2022 to store both files is a best practice for machines that has plenty of memory always try to completely disable / remove the files. It was shown how  to manage pagefile.sys and hiberfil.sys across Windows Operating Systems different versions both from GUI and via command line as well as how you can configure pagefile.sys to be cleared up on pc reboot.
 

Create Linux High Availability Load Balancer Cluster with Keepalived and Haproxy on Linux

Tuesday, March 15th, 2022

keepalived-logo-linux

Configuring a Linux HA (High Availibiltiy) for an Application with Haproxy is already used across many Websites on the Internet and serious corporations that has a crucial infrastructure has long time
adopted and used keepalived to provide High Availability Application level Clustering.
Usually companies choose to use HA Clusters with Haproxy with Pacemaker and Corosync cluster tools.
However one common used alternative solution if you don't have the oportunity to bring up a High availability cluster with Pacemaker / Corosync / pcs (Pacemaker Configuration System) due to fact machines you need to configure the cluster on are not Physical but VMWare Virtual Machines which couldn't not have configured a separate Admin Lans and Heartbeat Lan as we usually do on a Pacemaker Cluster due to the fact the 5 Ethernet LAN Card Interfaces of the VMWare Hypervisor hosts are configured as a BOND (e.g. all the incoming traffic to the VMWare vSphere  HV is received on one Virtual Bond interface).

I assume you have 2 separate vSphere Hypervisor Physical Machines in separate Racks and separate switches hosting the two VMs.
For the article, I'll call the two brand new brought Virtual Machines with some installation automation software such as Terraform or Ansible – vm-server1 and vm-server2 which would have configured some recent version of Linux.

In that scenario to have a High Avaiability for the VMs on Application level and assure at least one of the two is available at a time if one gets broken due toe malfunction of the HV, a Network connectivity issue, or because the VM OS has crashed.
Then one relatively easily solution is to use keepalived and configurea single High Availability Virtual IP (VIP) Address, i.e. 10.10.10.1, which would float among two VMs using keepalived so at a time at least one of the two VMs would be reachable on the Network.

haproxy_keepalived-vip-ip-diagram-linux

Having a VIP IP is quite a common solution in corporate world, as it makes it pretty easy to add F5 Load Balancer in front of the keepalived cluster setup to have a 3 Level of security isolation, which usually consists of:

1. Physical (access to the hardware or Virtualization hosts)
2. System Access (The mechanism to access the system login credetials users / passes, proxies, entry servers leading to DMZ-ed network)
3. Application Level (access to different programs behind L2 and data based on the specific identity of the individual user,
special Secondary UserID,  Factor authentication, biometrics etc.)

 

1. Install keepalived and haproxy on machines

Depending on the type of Linux OS:

On both machines
 

[root@server1:~]# yum install -y keepalived haproxy

If you have to install keepalived / haproxy on Debian / Ubuntu and other Deb based Linux distros

[root@server1:~]# apt install keepalived haproxy –yes

2. Configure haproxy (haproxy.cfg) on both server1 and server2

 

Create some /etc/haproxy/haproxy.cfg configuration

 

[root@server1:~]vim /etc/haproxy/haproxy.cfg

#———————————————————————
# Global settings
#———————————————————————
global
    log          127.0.0.1 local6 debug
    chroot       /var/lib/haproxy
    pidfile      /run/haproxy.pid
    stats socket /var/lib/haproxy/haproxy.sock mode 0600 level admin 
    maxconn      4000
    user         haproxy
    group        haproxy
    daemon
    #debug
    #quiet

#———————————————————————
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#———————————————————————
defaults
    mode        tcp
    log         global
#    option      dontlognull
#    option      httpclose
#    option      httplog
#    option      forwardfor
    option      redispatch
    option      log-health-checks
    timeout connect 10000 # default 10 second time out if a backend is not found
    timeout client 300000
    timeout server 300000
    maxconn     60000
    retries     3

#———————————————————————
# round robin balancing between the various backends
#———————————————————————

listen FRONTEND_APPNAME1
        bind 10.10.10.1:15000
        mode tcp
        option tcplog
#        #log global
        log-format [%t]\ %ci:%cp\ %bi:%bp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        balance roundrobin
        timeout client 350000
        timeout server 350000
        timeout connect 35000
        server app-server1 10.10.10.55:30000 weight 1 check port 68888
        server app-server2 10.10.10.55:30000 weight 2 check port 68888

listen FRONTEND_APPNAME2
        bind 10.10.10.1:15000
        mode tcp
        option tcplog
        #log global
        log-format [%t]\ %ci:%cp\ %bi:%bp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        balance roundrobin
        timeout client 350000
        timeout server 350000
        timeout connect 35000
        server app-server1 10.10.10.55:30000 weight 5
        server app-server2 10.10.10.55:30000 weight 5 

 

You can get a copy of above haproxy.cfg configuration here.
Once configured roll it on.

[root@server1:~]#  systemctl start haproxy
 
[root@server1:~]# ps -ef|grep -i hapro
root      285047       1  0 Mar07 ?        00:00:00 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
haproxy   285050  285047  0 Mar07 ?        00:00:26 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid

Bring up the haproxy also on server2 machine, by placing same configuration and starting up the proxy.
 

[root@server1:~]vim /etc/haproxy/haproxy.cfg


 

3. Configure keepalived on both servers

We'll be configuring 2 nodes with keepalived even though if necessery this can be easily extended and you can add more nodes.
First we make a copy of the original or existing server configuration keepalived.conf (just in case we need it later on or if you already had something other configured manually by someone – that could be so on inherited servers by other sysadmin)
 

[root@server1:~]# mv /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.orig
[root@server2:~]# mv /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.orig

a. Configure keepalived to serve as a MASTER Node

 

[root@server1:~]# vim /etc/keepalived/keepalived.conf

Master Node
global_defs {
  router_id server1-fqdn # The hostname of this host.
  
  enable_script_security
  # Synchro of the state of the connections between the LBs on the eth0 interface
   lvs_sync_daemon eth0
 
notification_email {
        linuxadmin@notify-domain.com     # Email address for notifications 
    }
 notification_email_from keepalived@server1-fqdn        # The from address for the notifications
    smtp_server 127.0.0.1                       # SMTP server address
    smtp_connect_timeout 15
}

vrrp_script haproxy {
  script "killall -0 haproxy"
  interval 2
  weight 2
  user root
}

vrrp_instance LB_VIP_QA {
  virtual_router_id 50
  advert_int 1
  priority 51

  state MASTER
  interface eth0
  smtp_alert          # Enable Notifications Via Email
  
  authentication {
              auth_type PASS
              auth_pass testp141

    }
### Commented because running on VM on VMWare
##    unicast_src_ip 10.44.192.134 # Private IP address of master
##    unicast_peer {
##        10.44.192.135           # Private IP address of the backup haproxy
##   }

#        }
# master node with higher priority preferred node for Virtual IP if both keepalived up
###  priority 51
###  state MASTER
###  interface eth0
  virtual_ipaddress {
     10.10.10.1 dev eth0 # The virtual IP address that will be shared between MASTER and BACKUP
  }
  track_script {
      haproxy
  }
}

 

 To dowload a copy of the Master keepalived.conf configuration click here

Below are few interesting configuration variables, worthy to mention few words on, most of them are obvious by their names but for more clarity I'll also give a list here with short description of each:

 

  • vrrp_instance – defines an individual instance of the VRRP protocol running on an interface.
  • state – defines the initial state that the instance should start in (i.e. MASTER / SLAVE )state –
  • interface – defines the interface that VRRP runs on.
  • virtual_router_id – should be unique value per Keepalived Node (otherwise slave master won't function properly)
  • priority – the advertised priority, the higher the priority the more important the respective configured keepalived node is.
  • advert_int – specifies the frequency that advertisements are sent at (1 second, in this case).
  • authentication – specifies the information necessary for servers participating in VRRP to authenticate with each other. In this case, a simple password is defined.
    only the first eight (8) characters will be used as described in  to note is Important thing
    man keepalived.conf – keepalived.conf variables documentation !!! Nota Bene !!! – Password set on each node should match for nodes to be able to authenticate !
  • virtual_ipaddress – defines the IP addresses (there can be multiple) that VRRP is responsible for.
  • notification_email – the notification email to which Alerts will be send in case if keepalived on 1 node is stopped (e.g. the MASTER node switches from host 1 to 2)
  • notification_email_from – email address sender from where email will originte
    ! NB ! In order for notification_email to be working you need to have configured MTA or Mail Relay (set to local MTA) to another SMTP – e.g. have configured something like Postfix, Qmail or Postfix

b. Configure keepalived to serve as a SLAVE Node

[root@server1:~]vim /etc/keepalived/keepalived.conf
 

#Slave keepalived
global_defs {
  router_id server2-fqdn # The hostname of this host!

  enable_script_security
  # Synchro of the state of the connections between the LBs on the eth0 interface
  lvs_sync_daemon eth0
 
notification_email {
        linuxadmin@notify-host.com     # Email address for notifications
    }
 notification_email_from keepalived@server2-fqdn        # The from address for the notifications
    smtp_server 127.0.0.1                       # SMTP server address
    smtp_connect_timeout 15
}

vrrp_script haproxy {
  script "killall -0 haproxy"
  interval 2
  weight 2
  user root
}

vrrp_instance LB_VIP_QA {
  virtual_router_id 50
  advert_int 1
  priority 50

  state BACKUP
  interface eth0
  smtp_alert          # Enable Notifications Via Email

authentication {
              auth_type PASS
              auth_pass testp141
}
### Commented because running on VM on VMWare    
##    unicast_src_ip 10.10.192.135 # Private IP address of master
##    unicast_peer {
##        10.10.192.134         # Private IP address of the backup haproxy
##   }

###  priority 50
###  state BACKUP
###  interface eth0
  virtual_ipaddress {
     10.10.10.1 dev eth0 # The virtual IP address that will be shared betwee MASTER and BACKUP.
  }
  track_script {
    haproxy
  }
}

 

Download the keepalived.conf slave config here

 

c. Set required sysctl parameters for haproxy to work as expected
 

[root@server1:~]vim /etc/sysctl.conf
#Haproxy config
# haproxy
net.core.somaxconn=65535
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.ip_nonlocal_bind = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3

4. Test Keepalived keepalived.conf configuration syntax is OK

 

[root@server1:~]keepalived –config-test
(/etc/keepalived/keepalived.conf: Line 7) Unknown keyword 'lvs_sync_daemon_interface'
(/etc/keepalived/keepalived.conf: Line 21) Unable to set default user for vrrp script haproxy – removing
(/etc/keepalived/keepalived.conf: Line 31) (LB_VIP_QA) Specifying lvs_sync_daemon_interface against a vrrp is deprecated.
(/etc/keepalived/keepalived.conf: Line 31)              Please use global lvs_sync_daemon
(/etc/keepalived/keepalived.conf: Line 35) Truncating auth_pass to 8 characters
(/etc/keepalived/keepalived.conf: Line 50) (LB_VIP_QA) track script haproxy not found, ignoring…

I've experienced this error because first time I've configured keepalived, I did not mention the user with which the vrrp script haproxy should run,
in prior versions of keepalived, leaving the field empty did automatically assumed you have the user with which the vrrp script runs to be set to root
as of RHELs keepalived-2.1.5-6.el8.x86_64, i've been using however this is no longer so and thus in prior configuration as you can see I've
set the user in respective section to root.
The error Unknown keyword 'lvs_sync_daemon_interface'
is also easily fixable by just substituting the lvs_sync_daemon_interface and lvs_sync_daemon and reloading
keepalived etc.

Once keepalived is started and you can see the process on both machines running in process list.

[root@server1:~]ps -ef |grep -i keepalived
root     1190884       1  0 18:50 ?        00:00:00 /usr/sbin/keepalived -D
root     1190885 1190884  0 18:50 ?        00:00:00 /usr/sbin/keepalived -D

Next step is to check the keepalived statuses as well as /var/log/keepalived.log

If everything is configured as expected on both keepalived on first node you should see one is master and one is slave either in the status or the log

[root@server1:~]#systemctl restart keepalived

 

[root@server1:~]systemctl status keepalived|grep -i state
Mar 14 18:59:02 server1-fqdn Keepalived_vrrp[1192003]: (LB_VIP_QA) Entering MASTER STATE

[root@server1:~]systemctl status keepalived

● keepalived.service – LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2022-03-14 18:15:51 CET; 32min ago
  Process: 1187587 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1187589 (code=exited, status=0/SUCCESS)

Mar 14 18:15:04 server1lb-fqdn Keepalived_vrrp[1187590]: Sending gratuitous ARP on eth0 for 10.44.192.142
Mar 14 18:15:50 server1lb-fqdn systemd[1]: Stopping LVS and VRRP High Availability Monitor…
Mar 14 18:15:50 server1lb-fqdn Keepalived[1187589]: Stopping
Mar 14 18:15:50 server1lb-fqdn Keepalived_vrrp[1187590]: (LB_VIP_QA) sent 0 priority
Mar 14 18:15:50 server1lb-fqdn Keepalived_vrrp[1187590]: (LB_VIP_QA) removing VIPs.
Mar 14 18:15:51 server1lb-fqdn Keepalived_vrrp[1187590]: Stopped – used 0.002007 user time, 0.016303 system time
Mar 14 18:15:51 server1lb-fqdn Keepalived[1187589]: CPU usage (self/children) user: 0.000000/0.038715 system: 0.001061/0.166434
Mar 14 18:15:51 server1lb-fqdn Keepalived[1187589]: Stopped Keepalived v2.1.5 (07/13,2020)
Mar 14 18:15:51 server1lb-fqdn systemd[1]: keepalived.service: Succeeded.
Mar 14 18:15:51 server1lb-fqdn systemd[1]: Stopped LVS and VRRP High Availability Monitor

[root@server2:~]systemctl status keepalived|grep -i state
Mar 14 18:59:02 server2-fqdn Keepalived_vrrp[297368]: (LB_VIP_QA) Entering BACKUP STATE

[root@server1:~]# grep -i state /var/log/keepalived.log
Mar 14 18:59:02 server1lb-fqdn Keepalived_vrrp[297368]: (LB_VIP_QA) Entering MASTER STATE
 

a. Fix Keepalived SECURITY VIOLATION – scripts are being executed but script_security not enabled.
 

When configurating keepalived for a first time we have faced the following strange error inside keepalived status inside keepalived.log 
 

Feb 23 14:28:41 server1 Keepalived_vrrp[945478]: SECURITY VIOLATION – scripts are being executed but script_security not enabled.

 

To fix keepalived SECURITY VIOLATION error:

Add to /etc/keepalived/keepalived.conf on the keepalived node hosts
inside 

global_defs {}

After chunk
 

enable_script_security

include

# Synchro of the state of the connections between the LBs on the eth0 interface
  lvs_sync_daemon_interface eth0

 

5. Prepare rsyslog configuration and Inlcude additional keepalived options
to force keepalived log into /var/log/keepalived.log

To force keepalived log into /var/log/keepalived.log on RHEL 8 / CentOS and other Redhat Package Manager (RPM) Linux distributions

[root@server1:~]# vim /etc/rsyslog.d/48_keepalived.conf

#2022/02/02: HAProxy logs to local6, save the messages
local7.*                                                /var/log/keepalived.log
if ($programname == 'Keepalived') then -/var/log/keepalived.log
if ($programname == 'Keepalived_vrrp') then -/var/log/keepalived.log
& stop

[root@server:~]# touch /var/log/keepalived.log

Reload rsyslog to load new config
 

[root@server:~]# systemctl restart rsyslog
[root@server:~]# systemctl status rsyslog

 

rsyslog.service – System Logging Service
   Loaded: loaded (/usr/lib/systemd/system/rsyslog.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/rsyslog.service.d
           └─rsyslog-service.conf
   Active: active (running) since Mon 2022-03-07 13:34:38 CET; 1 weeks 0 days ago
     Docs: man:rsyslogd(8)

           https://www.rsyslog.com/doc/
 Main PID: 269574 (rsyslogd)
    Tasks: 6 (limit: 100914)
   Memory: 5.1M
   CGroup: /system.slice/rsyslog.service
           └─269574 /usr/sbin/rsyslogd -n

Mar 15 08:15:16 server1lb-fqdn rsyslogd[269574]: — MARK —
Mar 15 08:35:16 server1lb-fqdn rsyslogd[269574]: — MARK —
Mar 15 08:55:16 server1lb-fqdn rsyslogd[269574]: — MARK —

 

If once keepalived is loaded but you still have no log written inside /var/log/keepalived.log

[root@server1:~]# vim /etc/sysconfig/keepalived
 KEEPALIVED_OPTIONS="-D -S 7"

[root@server2:~]# vim /etc/sysconfig/keepalived
 KEEPALIVED_OPTIONS="-D -S 7"

[root@server1:~]# systemctl restart keepalived.service
[root@server1:~]#  systemctl status keepalived

● keepalived.service – LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-02-24 12:12:20 CET; 2 weeks 4 days ago
 Main PID: 1030501 (keepalived)
    Tasks: 2 (limit: 100914)
   Memory: 1.8M
   CGroup: /system.slice/keepalived.service
           ├─1030501 /usr/sbin/keepalived -D
           └─1030502 /usr/sbin/keepalived -D

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

[root@server2:~]# systemctl restart keepalived.service
[root@server2:~]# systemctl status keepalived

6. Monitoring VRRP traffic of the two keepaliveds with tcpdump
 

Once both keepalived are up and running a good thing is to check the VRRP protocol traffic keeps fluently on both machines.
Keepalived VRRP keeps communicating over the TCP / IP Port 112 thus you can simply snoop TCP tracffic on its protocol.
 

[root@server1:~]# tcpdump proto 112

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:08:07.356187 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:08.356297 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:09.356408 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:10.356511 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:11.356655 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20

[root@server2:~]# tcpdump proto 112

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
​listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:08:07.356187 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:08.356297 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:09.356408 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:10.356511 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20
11:08:11.356655 IP server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20

As you can see the VRRP traffic on the network is originating only from server1lb-fqdn, this is so because host server1lb-fqdn is the keepalived configured master node.

It is possible to spoof the password configured to authenticate between two nodes, thus if you're bringing up keepalived service cluster make sure your security is tight at best the machines should be in a special local LAN DMZ, do not configure DMZ on the internet !!! 🙂 Or if you eventually decide to configure keepalived in between remote hosts, make sure you somehow use encrypted VPN or SSH tunnels to tunnel the VRRP traffic.

[root@server1:~]tcpdump proto 112 -vv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:36:25.530772 IP (tos 0xc0, ttl 255, id 59838, offset 0, flags [none], proto VRRP (112), length 40)
    server1lb-fqdn > vrrp.mcast.net: vrrp server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20, addrs: VIPIP_QA auth "testp431"
11:36:26.530874 IP (tos 0xc0, ttl 255, id 59839, offset 0, flags [none], proto VRRP (112), length 40)
    server1lb-fqdn > vrrp.mcast.net: vrrp server1lb-fqdn > vrrp.mcast.net: VRRPv2, Advertisement, vrid 50, prio 53, authtype simple, intvl 1s, length 20, addrs: VIPIP_QA auth "testp431"

Lets also check what floating IP is configured on the machines:

[root@server1:~]# ip -brief address show
lo               UNKNOWN        127.0.0.1/8 
eth0             UP             10.10.10.5/26 10.10.10.1/32 

The 10.10.10.5 IP is the main IP set on LAN interface eth0, 10.10.10.1 is the floating IP which as you can see is currently set by keepalived to listen on first node.

[root@server2:~]# ip -brief address show |grep -i 10.10.10.1

An empty output is returned as floating IP is currently configured on server1

To double assure ourselves the IP is assigned on correct machine, lets ping it and check the IP assigned MAC  currently belongs to which machine.
 

[root@server2:~]# ping 10.10.10.1
PING 10.10.10.1 (10.10.10.1) 56(84) bytes of data.
64 bytes from 10.10.10.1: icmp_seq=1 ttl=64 time=0.526 ms
^C
— 10.10.10.1 ping statistics —
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.526/0.526/0.526/0.000 ms

[root@server2:~]# arp -an |grep -i 10.44.192.142
? (10.10.10.1) at 00:48:54:91:83:7d [ether] on eth0
[root@server2:~]# ip a s|grep -i 00:48:54:91:83:7d
[root@server2:~]# 

As you can see from below output MAC is not found in configured IPs on server2.
 

[root@server1-fqdn:~]# /sbin/ip a s|grep -i 00:48:54:91:83:7d -B1 -A1
 eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:48:54:91:83:7d brd ff:ff:ff:ff:ff:ff
inet 10.10.10.1/26 brd 10.10.1.191 scope global noprefixroute eth0

Pretty much expected MAC is on keepalived node server1.

 

7. Testing keepalived on server1 and server2 maachines VIP floating IP really works
 

To test the overall configuration just created, you should stop keeaplived on the Master node and in meantime keep an eye on Slave node (server2), whether it can figure out the Master node is gone and switch its
state BACKUP to save MASTER. By changing the secondary (Slave) keepalived to master the floating IP: 10.10.10.1 will be brought up by the scripts on server2.

Lets assume that something went wrong with server1 VM host, for example the machine crashed due to service overload, DDoS or simply a kernel bug or whatever reason.
To simulate that we simply have to stop keepalived, then the broadcasted information on VRRP TCP/IP proto port 112 will be no longer available and keepalived on node server2, once
unable to communicate to server1 should chnage itself to state MASTER.

[root@server1:~]# systemctl stop keepalived
[root@server1:~]# systemctl status keepalived

● keepalived.service – LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Tue 2022-03-15 12:11:33 CET; 3s ago
  Process: 1192001 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1192002 (code=exited, status=0/SUCCESS)

Mar 14 18:59:07 server1lb-fqdn Keepalived_vrrp[1192003]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:32 server1lb-fqdn systemd[1]: Stopping LVS and VRRP High Availability Monitor…
Mar 15 12:11:32 server1lb-fqdn Keepalived[1192002]: Stopping
Mar 15 12:11:32 server1lb-fqdn Keepalived_vrrp[1192003]: (LB_VIP_QA) sent 0 priority
Mar 15 12:11:32 server1lb-fqdn Keepalived_vrrp[1192003]: (LB_VIP_QA) removing VIPs.
Mar 15 12:11:33 server1lb-fqdn Keepalived_vrrp[1192003]: Stopped – used 2.145252 user time, 15.513454 system time
Mar 15 12:11:33 server1lb-fqdn Keepalived[1192002]: CPU usage (self/children) user: 0.000000/44.555362 system: 0.001151/170.118126
Mar 15 12:11:33 server1lb-fqdn Keepalived[1192002]: Stopped Keepalived v2.1.5 (07/13,2020)
Mar 15 12:11:33 server1lb-fqdn systemd[1]: keepalived.service: Succeeded.
Mar 15 12:11:33 server1lb-fqdn systemd[1]: Stopped LVS and VRRP High Availability Monitor.

 

On keepalived off, you will get also a notification Email on the Receipt Email configured from keepalived.conf from the working keepalived node with a simple message like:

=> VRRP Instance is no longer owning VRRP VIPs <=

Once keepalived is back up you will get another notification like:

=> VRRP Instance is now owning VRRP VIPs <=

[root@server2:~]# systemctl status keepalived
● keepalived.service – LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2022-03-14 18:13:52 CET; 17h ago
  Process: 297366 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 297367 (keepalived)
    Tasks: 2 (limit: 100914)
   Memory: 2.1M
   CGroup: /system.slice/keepalived.service
           ├─297367 /usr/sbin/keepalived -D -S 7
           └─297368 /usr/sbin/keepalived -D -S 7

Mar 15 12:11:33 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:33 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:33 server2lb-fqdn Keepalived_vrrp[297368]: Remote SMTP server [127.0.0.1]:25 connected.
Mar 15 12:11:33 server2lb-fqdn Keepalived_vrrp[297368]: SMTP alert successfully sent.
Mar 15 12:11:38 server2lb-fqdn Keepalived_vrrp[297368]: (LB_VIP_QA) Sending/queueing gratuitous ARPs on eth0 for 10.10.10.1
Mar 15 12:11:38 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:38 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:38 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:38 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1
Mar 15 12:11:38 server2lb-fqdn Keepalived_vrrp[297368]: Sending gratuitous ARP on eth0 for 10.10.10.1

[root@server2:~]#  ip addr show|grep -i 10.10.10.1
    inet 10.10.10.1/32 scope global eth0
    

As you see the VIP is now set on server2, just like expected – that's OK, everything works as expected. If the IP did not move double check the keepalived.conf on both nodes for errors or misconfigurations.

To recover the initial order of things so server1 is MASTER and server2 SLAVE host, we just have to switch on the keepalived on server1 machine.

[root@server1:~]# systemctl start keepalived

The automatic change of server1 to MASTER node and respective move of the VIP IP is done because of the higher priority (of importance we previously configured on server1 in keepalived.conf).
 

What we learned?
 

So what we learned in  this article?
We have seen how to easily install and configure a High Availability Load balancer with Keepalived with single floating VIP IP address with 1 MASTER and 1 SLAVE host and a Haproxy example config with few frontends / App backends. We have seen how the config can be tested for potential errors and how we can monitor whether the VRRP2 network traffic flows between nodes and how to potentially debug it further if necessery.
Further on rawly explained some of the keepalived configurations but as keepalived can do pretty much more,for anyone seriously willing to deal with keepalived on a daily basis or just fine tune some already existing ones, you better read closely its manual page "man keepalived.conf" as well as the official Redhat Linux documentation page on setting up a Linux cluster with Keepalived (Be prepare for a small nightmare as the documentation of it seems to be a bit chaotic, and even I would say partly missing or opening questions on what does the developers did meant – not strange considering the havoc that is pretty much as everywhere these days.)

Finally once keepalived hosts are prepared, it was shown how to test the keepalived application cluster and Floating IP does move between nodes in case if one of the 2 keepalived nodes is inaccessible.

The same logic can be repeated multiple times and if necessery you can set multiple VIPs to expand the HA reachable IPs solution.

high-availability-with-two-vips-example-diagram

The presented idea is with haproxy forward Proxy server to proxy requests towards Application backend (servince machines), however if you need to set another set of server on the flow to  process HTML / XHTML / PHP / Perl / Python  programming code, with some common Webserver setup ( Nginx / Apache / Tomcat / JBOSS) and enable SSL Secure certificate with lets say Letsencrypt, this can be relatively easily done. If you want to implement letsencrypt and a webserver check this redundant SSL Load Balancing with haproxy & keepalived article.

That's all folks, hope you enjoyed.
If you need to configure keepalived Cluster or a consultancy write your query here 🙂

How to Install ssh client / server on Windows 10, Windows Server 2019 and Windows Server 2022 using PowerShell commands

Wednesday, March 2nd, 2022

How-to-install-OpenSSH-Client-and-Server-on-Windows-10-Windows-Server-2022-Windows-2019-via-command-line-Powershell

Historically to have a running ssh client on Windows it was required to install CygWin or MobaXterm as told in my previous articles Some Standard software programs to install on Windows to make your Desktop feel  more Linux / Unix Desktop and Must have software on Freshly installed Windows OS.
Interesting things have been developed on the Windows scene since then and as of year 2022 on Windows 10 (build 1809 and later) and on Windows 2019, Windows Server 2022, the task to have a running ssh client to use from cmd.exe (command line) became trivial and does not need to have a CygWin Collection of GNU and Open Source tools installed but this is easily done via Windows embedded Apps & Features GUI tool:

To install it from there on 3 easy steps:

 

  1. Via  Settings, select Apps > Apps & Features, then select Optional Features.
  2. Find OpenSSH Client, then click Install
  3. Find OpenSSH Server, then click Install


For Windows domain administrators of a small IT company that requires its employees for some automated script to run stuff for example to tunnel encrypted traffic from Workers PC towards a server port for example to secure the 110 POP Email clients to communicate with the remote Office server in encrypted form or lets say because ssh client is required to be on multiple domain belonging PCs used as Windows Desktops by a bunch of developers in the company it also possible to use PowerShell script to install the client on the multiple Windows machines.

Install OpenSSH using PowerShell
 

To install OpenSSH using PowerShell, run PowerShell as an Administrator. To make sure that OpenSSH is available, run the following cmdlet in PowerShell

Get-WindowsCapability -Online | Where-Object Name -like 'OpenSSH*'


This should return the following output if neither are already installed:

 

Name  : OpenSSH.Client~~~~0.0.1.0
State : NotPresent

Name  : OpenSSH.Server~~~~0.0.1.0
State : NotPresent


Then, install the server or client components as needed:

Copy in PS cmd window

# Install the OpenSSH Client
Add-WindowsCapability -Online -Name OpenSSH.Client~~~~0.0.1.0

# Install the OpenSSH Server
Add-WindowsCapability -Online -Name OpenSSH.Server~~~~0.0.1.0


Both of these should return the following output:
 

Path          :
Online        : True
RestartNeeded : False


If you want to also allow remote access via OpenSSH sshd daemon, this is also easily possible without installing especially an openssh-server Windows variant !

Start and configure OpenSSH Server

To start and configure OpenSSH Server for initial use, open PowerShell as an administrator, then run the following commands to start the sshd service:

# Start the sshd service
Start-Service sshd

# OPTIONAL but recommended:
Set-Service -Name sshd -StartupType 'Automatic'

# Confirm the Firewall rule is configured. It should be created automatically by setup. Run the following to verify
if (!(Get-NetFirewallRule -Name "OpenSSH-Server-In-TCP" -ErrorAction SilentlyContinue | Select-Object Name, Enabled)) {
    Write-Output "Firewall Rule 'OpenSSH-Server-In-TCP' does not exist, creating it…"
    New-NetFirewallRule -Name 'OpenSSH-Server-In-TCP' -DisplayName 'OpenSSH Server (sshd)' -Enabled True -Direction Inbound -Protocol TCP -Action Allow -LocalPort 22
} else {
    Write-Output "Firewall rule 'OpenSSH-Server-In-TCP' has been created and exists."
}


Connect to OpenSSH Server
 

Once installed, you can connect to OpenSSH Server from a Windows 10 or Windows Server 2019 device with the OpenSSH client installed using PowerShell or Command Line tool as Administrator and use the ssh client like you would use it on any *NIX host.

C:\Users\User> ssh username@servername


The authenticity of host 'servername (10.10.10.1)' can't be established.
ECDSA key fingerprint is SHA256:(<a large string>).
Are you sure you want to continue connecting (yes/no)?
Selecting yes adds that server to the list of known SSH hosts on your Windows client.

You are prompted for the password at this point. As a security precaution, your password will not be displayed as you type.

Once connected, you will see the Windows command shell prompt:

Domain\username@SERVERNAME C:\Users\username>

 

List and fix failed systemd failed services after Linux OS upgrade and how to get full info about systemd service from jorunal log

Friday, February 25th, 2022

systemd-logo-unix-linux-list-failed-systemd-services

I have recently upgraded a number of machines from Debian 10 Buster to Debian 11 Bullseye. The update as always has some issues on some machines, such as problem with package dependencies, changing a number of external package repositories etc. to match che Bullseye deb packages. On some machines the update was less painful on others but the overall line was that most of the machines after the update ended up with one or more failed systemd services. It could be that some of the machines has already had this failed services present and I never checked them from the previous time update from Debian 9 -> Debian 10 or just some mess I've left behind in the hurry when doing software installation in the past. This doesn't matter anyways the fact was that I had to deal to a number of systemctl services which I managed to track by the Failed service mesage on system boot on one of the physical machines and on the OpenXen VTY Console the rest of Virtual Machines after update had some Failed messages. Thus I've spend some good amount of time like an overall of a day or two fixing strange failed services. This is how this small article was born in attempt to help sysadmins or any home Linux desktop users, who has updated his Debian Linux / Ubuntu or any other deb based distribution but due to the chaotic nature of Linux has ended with same strange Failed services and look for a way to find the source of the failures and get rid of the problems. 
Systemd is a very complicated system and in my many sysadmin opinion it makes more problems than it solves, but okay for today's people's megalomania mindset it matches well.

Systemd_components-systemd-journalctl-cgroups-loginctl-nspawn-analyze.svg

 

1. Check the journal for errors, running service irregularities and so on
 

First thing to do to track for errors, right after the update is to take some minutes and closely check,, the journalctl for any strange errors, even on well maintained Unix machines, this journal log would bring you to a problem that is not fatal but still some process or stuff is malfunctioning in the background that you would like to solve:
 

root@pcfreak:~# journalctl -x
Jan 10 10:10:01 pcfreak CRON[17887]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17887]: USER_END pid=17887 uid=0 auid=0 ses=340858 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17888]: CRED_DISP pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17888]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17888]: USER_END pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17884]: CRED_DISP pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17884]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17884]: USER_END pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17886]: CRED_DISP pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/c>
Jan 10 10:10:01 pcfreak CRON[17886]: pam_unix(cron:session): session closed for user www-data
Jan 10 10:10:01 pcfreak audit[17886]: USER_END pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permi>
Jan 10 10:10:08 pcfreak NetworkManager[696]:  [1641802208.0899] device (eth1): carrier: link connected
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:19 pcfreak NetworkManager[696]:
 [1641802219.7920] device (eth1): carrier: link connected
Jan 10 10:10:19 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:20 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:22 pcfreak NetworkManager[696]:
 [1641802222.2772] device (eth1): carrier: link connected
Jan 10 10:10:22 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:23 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:33 pcfreak sshd[18142]: Unable to negotiate with 66.212.17.162 port 19255: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diff>
Jan 10 10:10:41 pcfreak NetworkManager[696]:
 [1641802241.0186] device (eth1): carrier: link connected
Jan 10 10:10:41 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx

If you want to only check latest journal log messages use the -x -e (pager catalog) opts

root@pcfreak;~# journalctl -xe

Feb 25 13:08:29 pcfreak audit[2284920]: USER_LOGIN pid=2284920 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=login acct=28696E76616C>
Feb 25 13:08:29 pcfreak sshd[2284920]: Received disconnect from 177.87.57.145 port 40927:11: Bye Bye [preauth]
Feb 25 13:08:29 pcfreak sshd[2284920]: Disconnected from invalid user ubuntuuser 177.87.57.145 port 40927 [preauth]

Next thing to after the update was to get a list of failed service only.


2. List all systemd failed check services which was supposed to be running

root@pcfreak:/root # systemctl list-units | grep -i failed
● certbot.service                                                                                                       loaded failed failed    Certbot
● logrotate.service                                                                                                     loaded failed failed    Rotate log files
● maldet.service                                                                                                        loaded failed failed    LSB: Start/stop maldet in monitor mode
● named.service                                                                                                         loaded failed failed    BIND Domain Name Server


Alternative way is with the –failed option

hipo@jeremiah:~$ systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

 

root@jeremiah:/etc/apt/sources.list.d#  systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon


To get a full list of objects of systemctl you can pass as state:
 

# systemctl –state=help
Full list of possible load states to pass is here
Show service properties


Check whether a service is failed or has other status and check default set systemd variables for it.

root@jeremiah~:# systemctl is-failed vboxweb.service
inactive

# systemctl show haproxy
Type=notify
Restart=always
NotifyAccess=main
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=304858
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success

Full output of the above command is dumped in show_systemctl_properties.txt


3. List all running systemd services for a better overview on what's going on on machine
 

To get a list of all properly systemd loaded services you can use –state running.

hipo@jeremiah:~$ systemctl list-units –state running|head -n 10
  UNIT                              LOAD   ACTIVE SUB     DESCRIPTION
  proc-sys-fs-binfmt_misc.automount loaded active running Arbitrary Executable File Formats File System Automount Point
  cups.path                         loaded active running CUPS Scheduler
  init.scope                        loaded active running System and Service Manager
  session-2.scope                   loaded active running Session 2 of user hipo
  accounts-daemon.service           loaded active running Accounts Service
  anydesk.service                   loaded active running AnyDesk
  apache-htcacheclean.service       loaded active running Disk Cache Cleaning Daemon for Apache HTTP Server
  apache2.service                   loaded active running The Apache HTTP Server
  avahi-daemon.service              loaded active running Avahi mDNS/DNS-SD Stack

 

It is useful thing is to list all unit-files configured in systemd and their state, you can do it with:

 


root@pcfreak:~# systemctl list-unit-files
UNIT FILE                                                                 STATE           VENDOR PRESET
proc-sys-fs-binfmt_misc.automount                                         static          –            
-.mount                                                                   generated       –            
backups.mount                                                             generated       –            
dev-hugepages.mount                                                       static          –            
dev-mqueue.mount                                                          static          –            
media-cdrom0.mount                                                        generated       –            
mnt-sda1.mount                                                            generated       –            
proc-fs-nfsd.mount                                                        static          –            
proc-sys-fs-binfmt_misc.mount                                             disabled        disabled     
run-rpc_pipefs.mount                                                      static          –            
sys-fs-fuse-connections.mount                                             static          –            
sys-kernel-config.mount                                                   static          –            
sys-kernel-debug.mount                                                    static          –            
sys-kernel-tracing.mount                                                  static          –            
var-www.mount                                                             generated       –            
acpid.path                                                                masked          enabled      
cups.path                                                                 enabled         enabled      

 

 


root@pcfreak:~# systemctl list-units –type service –all
  UNIT                                   LOAD      ACTIVE   SUB     DESCRIPTION
  accounts-daemon.service                loaded    inactive dead    Accounts Service
  acct.service                           loaded    active   exited  Kernel process accounting
● alsa-restore.service                   not-found inactive dead    alsa-restore.service
● alsa-state.service                     not-found inactive dead    alsa-state.service
  apache2.service                        loaded    active   running The Apache HTTP Server
● apparmor.service                       not-found inactive dead    apparmor.service
  apt-daily-upgrade.service              loaded    inactive dead    Daily apt upgrade and clean activities
 apt-daily.service                      loaded    inactive dead    Daily apt download activities
  atd.service                            loaded    active   running Deferred execution scheduler
  auditd.service                         loaded    active   running Security Auditing Service
  auth-rpcgss-module.service             loaded    inactive dead    Kernel Module supporting RPCSEC_GSS
  avahi-daemon.service                   loaded    active   running Avahi mDNS/DNS-SD Stack
  certbot.service                        loaded    inactive dead    Certbot
  clamav-daemon.service                  loaded    active   running Clam AntiVirus userspace daemon
  clamav-freshclam.service               loaded    active   running ClamAV virus database updater
..

 


linux-systemd-components-diagram-linux-kernel-system-targets-systemd-libraries-daemons

 

4. Finding out more on why a systemd configured service has failed


Usually getting info about failed systemd service is done with systemctl status servicename.service
However, in case of troubles with service unable to start to get more info about why a service has failed with (-l) or (–full) options


root@pcfreak:~# systemctl -l status logrotate.service
● logrotate.service – Rotate log files
     Loaded: loaded (/lib/systemd/system/logrotate.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 00:00:06 EET; 13h ago
TriggeredBy: ● logrotate.timer
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
    Process: 2045320 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)
   Main PID: 2045320 (code=exited, status=1/FAILURE)
        CPU: 2.479s

Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: For now we will assume you meant to write /32
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| ERROR: '0.0.0.0/0.0.0.0' needs to be replaced by the term 'all'.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| SECURITY NOTICE: Overriding config setting. Using 'all' instead.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: (B) '::/0' is a subnetwork of (A) '::/0'
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: because of this '::/0' is ignored to keep splay tree searching predictable
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: You should probably remove '::/0' from the ACL named 'all'
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Failed with result 'exit-code'.
Feb 25 00:00:06 pcfreak systemd[1]: Failed to start Rotate log files.
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Consumed 2.479s CPU time.


systemctl -l however is providing only the last log from message a started / stopped or whatever status service has generated. Sometimes systemctl -l servicename.service is showing incomplete the splitted error message as there is a limitation of line numbers on the console, see below

 

root@pcfreak:~# systemctl status -l certbot.service
● certbot.service – Certbot
     Loaded: loaded (/lib/systemd/system/certbot.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 09:28:33 EET; 4h 0min ago
TriggeredBy: ● certbot.timer
       Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
             https://certbot.eff.org/docs
    Process: 290017 ExecStart=/usr/bin/certbot -q renew (code=exited, status=1/FAILURE)
   Main PID: 290017 (code=exited, status=1/FAILURE)
        CPU: 9.771s

Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using th>
Feb 25 09:28:33 pcfrxen certbot[290017]: All renewals failed. The following certificates could not be renewed:
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/mail.pcfreak.org-0003/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/www.eforia.bg-0005/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/zabbix.pc-freak.net/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]: 3 renew failure(s), 5 parse failure(s)
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Failed with result 'exit-code'.
Feb 25 09:28:33 pcfrxen systemd[1]: Failed to start Certbot.
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Consumed 9.771s CPU time.

 

5. Get a complete log of journal to make sure everything configured on server host runs as it should

Thus to get more complete list of the message and be able to later google and look if has come with a solution on the internet  use:

root@pcfrxen:~#  journalctl –catalog –unit=certbot

— Journal begins at Sat 2022-01-22 21:14:05 EET, ends at Fri 2022-02-25 13:32:01 EET. —
Jan 23 09:58:18 pcfrxen systemd[1]: Starting Certbot…
░░ Subject: A start job for unit certbot.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit certbot.service has begun execution.
░░ 
░░ The job identifier is 5754.
Jan 23 09:58:20 pcfrxen certbot[124996]: Traceback (most recent call last):
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/renewal.py", line 71, in _reconstitute
Jan 23 09:58:20 pcfrxen certbot[124996]:     renewal_candidate = storage.RenewableCert(full_path, config)
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 471, in __init__
Jan 23 09:58:20 pcfrxen certbot[124996]:     self._check_symlinks()
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 537, in _check_symlinks

root@server:~# journalctl –catalog –unit=certbot|grep -i pluginerror|tail -1
Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using the manual plugin non-interactively.')


Or if you want to list and read only the last messages in the journal log regarding a service

root@server:~# journalctl –catalog –pager-end –unit=certbot


If you have disabled a failed service because you don't need it to run at all on the machine with:

root@rhel:~# systemctl stop rngd.service
root@rhel:~# systemctl disable rngd.service

And you want to clear up any failed service information that is kept in the systemctl service log you can do it with:
 

root@rhel:~# systemctl reset-failed

Another useful systemctl option is cat, you can use it to easily list a service it is useful to quickly check what is a service, an actual shortcut to save you from giving a full path to the service e.g. cat /lib/systemd/system/certbot.service

root@server:~# systemctl cat certbot
# /lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://certbot.eff.org/docs
[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true


After failed SystemD services are fixed, it is best to reboot the machine and check put some more time to inspect rawly the complete journal log to make sure, no error  was left behind.


Closure
 

As you can see updating a machine from a major to a major version even if you follow the official documentation and you have plenty of experience is always more or a less a pain in the ass, which can eat up much of your time banging your head solving problems with failed daemons issues with /etc/rc.local (which I have faced becase of #/bin/sh -e (which would make /etc/rc.local) to immediately quit if any error from command $? returns different from 0 etc.. The  logical questions comes then;
1. Is it really worthy to update at all regularly, especially if you don't know of a famous major Vulnerability 🙂 ?
2. Or is it worthy to update from OS major release to OS major release at all?  
3. Or should you only try to patch the service that is exposed to an external reachable computer network or the internet only and still the the same OS release until End of Life (LTS = Long Term Support) as called in Debian or  End Of Life  (EOL) Cycle as called in RPM based distros the period until the OS major release your software distro has official security patches is reached.

Anyone could take any approach but for my own managed systems small network at home my practice was always to try to keep up2date everything every 3 or 6 months maximum. This has caused me multiple days of irritation and stress and perhaps many white hairs and spend nerves on shit.


4. Based on the company where I'm employed the better strategy is to patch to the EOL is still offered and keep the rule First Things First (FTF), once the EOL is reached, just make a copy of all servers data and configuration to external Data storage, bring up a new Physical or VM and migrate the services.
Test after the migration all works as expected if all is as it should be change the DNS records or Leading Infrastructure Proxies whatever to point to the new service and that's it! Yes it is true that migration based on a full OS reinstall is more time consuming and requires much more planning, but usually the result is much more expected, plus it is much less stressful for the guy doing the job.

How to filter dhcp traffic between two networks running separate DHCP servers to prevent IP assignment issues and MAC duplicate addresses

Tuesday, February 8th, 2022

how-to-filter-dhcp-traffic-2-networks-running-2-separate-dhcpd-servers-to-prevent-ip-assignment-conflicts-linux
Tracking the Problem of MAC duplicates on Linux routers
 

If you have two networks that see each other and they're not separated in VLANs but see each other sharing a common netmask lets say 255.255.254.0 or 255.255.252.0, it might happend that there are 2 dhcp servers for example (isc-dhcp-server running on 192.168.1.1 and dhcpd running on 192.168.0.1 can broadcast their services to both LANs 192.168.1.0.1/24 (netmask 255.255.255.0) and Local Net LAN 192.168.1.1/24. The result out of this is that some devices might pick up their IP address via DHCP from the wrong dhcp server.

Normally if you have a fully controlled little or middle class home or office network (10 – 15 electronic devices nodes) connecting to the LAN in a mixed moth some are connected via one of the Networks via connected Wifi to 192.168.1.0/22 others are LANned and using static IP adddresses and traffic is routed among two ISPs and each network can see the other network, there is always a possibility of things to go wrong. This is what happened to me so this is how this post was born.

The best practice from my experience so far is to define each and every computer / phone / laptop host joining the network and hence later easily monitor what is going on the network with something like iptraf-ng / nethogs  / iperf – described in prior  how to check internet spepeed from console and in check server internet connectivity speed with speedtest-cliiftop / nload or for more complex stuff wireshark or even a simple tcpdump. No matter the tools network monitoring is only part on solving network issues. A very must have thing in a controlled network infrastructure is defining every machine part of it to easily monitor later with the monitoring tools. Defining each and every host on the Hybrid computer networks makes administering the network much easier task and  tracking irregularities on time is much more likely. 

Since I have such a hybrid network here hosting a couple of XEN virtual machines with Linux, Windows 7 and Windows 10, together with Mac OS X laptops as well as MacBook Air notebooks, I have followed this route and tried to define each and every host based on its MAC address to pick it up from the correct DHCP1 server  192.168.1.1 (that is distributing IPs for Internet Provider 1 (ISP 1), that is mostly few computers attached UTP LAN cables via LiteWave LS105G Gigabit Switch as well from DHCP2 – used only to assigns IPs to servers and a a single Wi-Fi Access point configured to route incoming clients via 192.168.0.1 Linux NAT gateway server.

To filter out the unwanted IPs from the DHCPD not to propagate I've so far used a little trick to  Deny DHCP MAC Address for unwanted clients and not send IP offer for them.

To give you more understanding,  I have to clear it up I don't want to have automatic IP assignments from DHCP2 / LAN2 to DHCP1 / LAN1 because (i don't want machines on DHCP1 to end up with IP like 192.168.0.50 or DHCP2 (to have 192.168.1.80), as such a wrong IP delegation could potentially lead to MAC duplicates IP conflicts. MAC Duplicate IP wrong assignments for those older or who have been part of administrating large ISP network infrastructures  makes the network communication unstable for no apparent reason and nodes partially unreachable at times or full time …

However it seems in the 21-st century which is the century of strangeness / computer madness in the 2022, technology advanced so much that it has massively started to break up some good old well known sysadmin standards well documented in the RFCs I know of my youth, such as that every electronic equipment manufactured Vendor should have a Vendor Assigned Hardware MAC Address binded to it that will never change (after all that was the idea of MAC addresses wasn't it !). 
Many mobile devices nowadays however, in the developers attempts to make more sophisticated software and Increase Anonimity on the Net and Security, use a technique called  MAC Address randomization (mostly used by hackers / script kiddies of the early days of computers) for their Wi-Fi Net Adapter OS / driver controlled interfaces for the sake of increased security (the so called Private WiFi Addresses). If a sysadmin 10-15 years ago has seen that he might probably resign his profession and turn to farming or agriculture plant growing, but in the age of digitalization and "cloud computing", this break up of common developed network standards starts to become the 'new normal' standard.

I did not suspected there might be a MAC address oddities, since I spare very little time on administering the the network. This was so till recently when I accidently checked the arp table with:

Hypervisor:~# arp -an
192.168.1.99     5c:89:b5:f2:e8:d8      (Unknown)
192.168.1.99    00:15:3e:d3:8f:76       (Unknown)

..


and consequently did a network MAC Address ARP Scan with arp-scan (if you never used this little nifty hacker tool I warmly recommend it !!!)
If you don't have it installed it is available in debian based linuces from default repos to install

Hypervisor:~# apt-get install –yes arp-scan


It is also available on CentOS / Fedora / Redhat and other RPM distros via:

Hypervisor:~# yum install -y arp-scan

 

 

Hypervisor:~# arp-scan –interface=eth1 192.168.1.0/24

192.168.1.19    00:16:3e:0f:48:05       Xensource, Inc.
192.168.1.22    00:16:3e:04:11:1c       Xensource, Inc.
192.168.1.31    00:15:3e:bb:45:45       Xensource, Inc.
192.168.1.38    00:15:3e:59:96:8e       Xensource, Inc.
192.168.1.34    00:15:3e:d3:8f:77       Xensource, Inc.
192.168.1.60    8c:89:b5:f2:e8:d8       Micro-Star INT'L CO., LTD
192.168.1.99     5c:89:b5:f2:e8:d8      (Unknown)
192.168.1.99    00:15:3e:d3:8f:76       (Unknown)

192.168.x.91     02:a0:xx:xx:d6:64        (Unknown)
192.168.x.91     02:a0:xx:xx:d6:64        (Unknown)  (DUP: 2)

N.B. !. I found it helpful to check all available interfaces on my Linux NAT router host.

As you see the scan revealed, a whole bunch of MAC address mess duplicated MAC hanging around, destroying my network topology every now and then 
So far so good, the MAC duplicates and strangely hanging around MAC addresses issue, was solved relatively easily with enabling below set of systctl kernel variables.
 

1. Fixing Linux ARP common well known Problems through disabling arp_announce / arp_ignore / send_redirects kernel variables disablement

 

Linux answers ARP requests on wrong and unassociated interfaces per default. This leads to the following two problems:

ARP requests for the loopback alias address are answered on the HW interfaces (even if NOARP on lo0:1 is set). Since loopback aliases are required for DSR (Direct Server Return) setups this problem is very common (but easy to fix fortunately).

If the machine is connected twice to the same switch (e.g. with eth0 and eth1) eth2 may answer ARP requests for the address on eth1 and vice versa in a race condition manner (confusing almost everything).

This can be prevented by specific arp kernel settings. Take a look here for additional information about the nature of the problem (and other solutions): ARP flux.

To fix that generally (and reboot safe) we  include the following lines into

 

Hypervisor:~# cp -rpf /etc/sysctl.conf /etc/sysctl.conf_bak_07-feb-2022
Hypervisor:~# cat >> /etc/sysctl.conf

# LVS tuning
net.ipv4.conf.lo.arp_ignore=1
net.ipv4.conf.lo.arp_announce=2
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.all.arp_announce=2

net.ipv4.conf.all.send_redirects=0
net.ipv4.conf.eth0.send_redirects=0
net.ipv4.conf.eth1.send_redirects=0
net.ipv4.conf.default.send_redirects=0

Press CTRL + D simultaneusly to Write out up-pasted vars.


To read more on Load Balancer using direct routing and on LVS and the arp problem here


2. Digging further the IP conflict / dulicate MAC Problems

Even after this arp tunings (because I do have my Hypervisor 2 LAN interfaces connected to 1 switch) did not resolved the issues and still my Wireless Connected devices via network 192.168.1.1/24 (ISP2) were randomly assigned the wrong range IPs 192.168.0.XXX/24 as well as the wrong gateway 192.168.0.1 (ISP1).
After thinking thoroughfully for hours and checking the network status with various tools and thanks to the fact that my wife has a MacBook Air that was always complaining that the IP it tried to assign from the DHCP was already taken, i"ve realized, something is wrong with DHCP assignment.
Since she owns a IPhone 10 with iOS and this two devices are from the same vendor e.g. Apple Inc. And Apple's products have been having strange DHCP assignment issues from my experience for quite some time, I've thought initially problems are caused by software on Apple's devices.
I turned to be partially right after expecting the logs of DHCP server on the Linux host (ISP1) finding that the phone of my wife takes IP in 192.168.0.XXX, insetad of IP from 192.168.1.1 (which has is a combined Nokia Router with 2.4Ghz and 5Ghz Wi-Fi and LAN router provided by ISP2 in that case Vivacom). That was really puzzling since for me it was completely logical thta the iDevices must check for DHCP address directly on the Network of the router to whom, they're connecting. Guess my suprise when I realized that instead of that the iDevices does listen to the network on a wide network range scan for any DHCPs reachable baesd on the advertised (i assume via broadcast) address traffic and try to connect and take the IP to the IP of the DHCP which responds faster !!!! Of course the Vivacom Chineese produced Nokia router responded DHCP requests and advertised much slower, than my Linux NAT gateway on ISP1 and because of that the Iphone and iOS and even freshest versions of Android devices do take the IP from the DHCP that responds faster, even if that router is not on a C class network (that's invasive isn't it??). What was even more puzzling was the automatic MAC Randomization of Wifi devices trying to connect to my ISP1 configured DHCPD and this of course trespassed any static MAC addresses filtering, I already had established there.

Anyways there was also a good think out of tthat intermixed exercise 🙂 While playing around with the Gigabit network router of vivacom I found a cozy feature SCHEDULEDING TURNING OFF and ON the WIFI ACCESS POINT  – a very useful feature to adopt, to stop wasting extra energy and lower a bit of radiation is to set a swtich off WIFI AP from 12:30 – 06:30 which are the common sleeping hours or something like that.
 

3. What is MAC Randomization and where and how it is configured across different main operating systems as of year 2022?

Depending on the operating system of your device, MAC randomization will be available either by default on most modern mobile OSes or with possibility to have it switched on:

  • Android Q: Enabled by default 
  • Android P: Available as a developer option, disabled by default
  • iOS 14: Available as a user option, disabled by default
  • Windows 10: Available as an option in two ways – random for all networks or random for a specific network

Lately I don't have much time to play around with mobile devices, and I do not my own a luxury mobile phone so, the fact this ne Androids have this MAC randomization was unknown to me just until I ended a small mess, based on my poor configured networks due to my tight time constrains nowadays.

Finding out about the new security feature of MAC Randomization, on all Android based phones (my mother's Nokia smartphone and my dad's phone, disabled the feature ASAP:


4. Disable MAC Wi-Fi Ethernet device Randomization on Android

MAC Randomization creates a random MAC address when joining a Wi-Fi network for the first time or after “forgetting” and rejoining a Wi-Fi network. It Generates a new random MAC address after 24 hours of last connection.

Disabling MAC Randomization on your devices. It is done on a per SSID basis so you can turn off the randomization, but allow it to function for hotspots outside of your home.

  1. Open the Settings app
  2. Select Network and Internet
  3. Select WiFi
  4. Connect to your home wireless network
  5. Tap the gear icon next to the current WiFi connection
  6. Select Advanced
  7. Select Privacy
  8. Select "Use device MAC"
     

5. Disabling MAC Randomization on MAC iOS, iPhone, iPad, iPod

To Disable MAC Randomization on iOS Devices:

Open the Settings on your iPhone, iPad, or iPod, then tap Wi-Fi or WLAN

 

  1. Tap the information button next to your network
  2. Turn off Private Address
  3. Re-join the network


Of course next I've collected their phone Wi-Fi adapters and made sure the included dhcp MAC deny rules in /etc/dhcp/dhcpd.conf are at place.

The effect of the MAC Randomization for my Network was terrible constant and strange issues with my routings and networks, which I always thought are caused by the openxen hypervisor Virtualization VM bugs etc.

That continued for some months now, and the weird thing was the issues always started when I tried to update my Operating system to the latest packetset, do a reboot to load up the new piece of software / libraries etc. and plus it happened very occasionally and their was no obvious reason for it.

 

6. How to completely filter dhcp traffic between two network router hosts
IP 192.168.0.1 / 192.168.1.1 to stop 2 or more configured DHCP servers
on separate networks see each other

To prevent IP mess at DHCP2 server side (which btw is ISC DHCP server, taking care for IP assignment only for the Servers on the network running on Debian 11 Linux), further on I had to filter out any DHCP UDP traffic with iptables completely.
To prevent incorrect route assignments assuming that you have 2 networks and 2 routers that are configurred to do Network Address Translation (NAT)-ing Router 1: 192.168.0.1, Router 2: 192.168.1.1.

You have to filter out UDP Protocol data on Port 67 and 68 from the respective source and destination addresses.

In firewall rules configuration files on your Linux you need to have some rules as:

# filter outgoing dhcp traffic from 192.168.1.1 to 192.168.0.1
-A INPUT -p udp -m udp –dport 67:68 -s 192.168.1.1 -d 192.168.0.1 -j DROP
-A OUTPUT -p udp -m udp –dport 67:68 -s 192.168.1.1 -d 192.168.0.1 -j DROP
-A FORWARD -p udp -m udp –dport 67:68 -s 192.168.1.1 -d 192.168.0.1 -j DROP

-A INPUT -p udp -m udp –dport 67:68 -s 192.168.0.1 -d 192.168.1.1 -j DROP
-A OUTPUT -p udp -m udp –dport 67:68 -s 192.168.0.1 -d 192.168.1.1 -j DROP
-A FORWARD -p udp -m udp –dport 67:68 -s 192.168.0.1 -d 192.168.1.1 -j DROP

-A INPUT -p udp -m udp –sport 67:68 -s 192.168.1.1 -d 192.168.0.1 -j DROP
-A OUTPUT -p udp -m udp –sport 67:68 -s 192.168.1.1 -d 192.168.0.1 -j DROP
-A FORWARD -p udp -m udp –sport 67:68 -s 192.168.1.1 -d 192.168.0.1 -j DROP


You can download also filter_dhcp_traffic.sh with above rules from here


Applying this rules, any traffic of DHCP between 2 routers is prohibited and devices from Net: 192.168.1.1-255 will no longer wrongly get assinged IP addresses from Network range: 192.168.0.1-255 as it happened to me.


7. Filter out DHCP traffic based on MAC completely on Linux with arptables

If even after disabling MAC randomization on all devices on the network, and you know physically all the connecting devices on the Network, if you still see some weird MAC addresses, originating from a wrongly configured ISP traffic router host or whatever, then it is time to just filter them out with arptables.

## drop traffic prevent mac duplicates due to vivacom and bergon placed in same network – 255.255.255.252
dchp1-server:~# arptables -A INPUT –source-mac 70:e2:83:12:44:11 -j DROP


To list arptables configured on Linux host

dchp1-server:~# arptables –list -n


If you want to be paranoid sysadmin you can implement a MAC address protection with arptables by only allowing a single set of MAC Addr / IPs and dropping the rest.

dchp1-server:~# arptables -A INPUT –source-mac 70:e2:84:13:45:11 -j ACCEPT
dchp1-server:~# arptables -A INPUT  –source-mac 70:e2:84:13:45:12 -j ACCEPT


dchp1-server:~# arptables -L –line-numbers
Chain INPUT (policy ACCEPT)
1 -j DROP –src-mac 70:e2:84:13:45:11
2 -j DROP –src-mac 70:e2:84:13:45:12

Once MACs you like are accepted you can set the INPUT chain policy to DROP as so:

dchp1-server:~# arptables -P INPUT DROP


If you later need to temporary, clean up the rules inside arptables on any filtered hosts flush all rules inside INPUT chain, like that
 

dchp1-server:~#  arptables -t INPUT -F

How to disable appArmor automatically installed and loaded after Linux Debian 10 to 11 Upgrade. Disable Apparmour on Deb based Linux

Friday, January 28th, 2022

check-apparmor-status-linux-howto-disable-Apparmor_on-debian-ubuntu-mint-and-other-deb-based-linux-distributions

I've upgraded recently all my machines from Debian Buster Linux 10 to Debian 11 Bullseye (if you wonder what Bullseye is) this is one of the heroes of Disneys Toy Stories which are used for a naming of General Debian Distributions.
After the upgrade most of the things worked expected, expect from some stuff like MariaDB (MySQL) and other weirdly behaving services. After some time of investigation being unable to find out what was causing the random issues observed on the machines. I finally got the strange daemon improper functioning and crashing was caused by AppArmor.

AppArmor ("Application Armor") is a Linux kernel security module that allows the system administrator to restrict programs' capabilities with per-program profiles. Profiles can allow capabilities like network access, raw socket access, and the permission to read, write, or execute files on matching paths. AppArmor supplements the traditional Unix discretionary access control (DAC) model by providing mandatory access control (MAC). It has been partially included in the mainline Linux kernel since version 2.6.36 and its development has been supported by Canonical since 2009.

The general idea of apparmor is wonderful as it could really strengthen system security, however it should be setup on install time and not setup on update time. For one more time I got convinced myself that upgrading from version to version to keep up to date with security is a hard task and often the results are too much unexpected and a better way to upgrade from General version to version any modern Linux / Unix distribution (and their forked mobile equivalents Android etc.) is to just make a copy of the most important configuration, setup the services on a freshly new installed machine be it virtual or a physical Server and rebuild the whole system from scratch, test and then run the system in production, substituting the old server general version with the new machine. 

The rest is leading to so much odd issues like this time with AppArmors causing distractions on the servers hosted applications.

But enough rent if you're unlucky and unwise enough to try to Upgrade Debian / Ubuntu 20, 21 / Mint 18, 19 etc. or whatever Deb distro from older general release to a newer One. Perhaps the best first thing to do onwards is stop and remove AppArmor (those who are hardcore enthusiasts could try to enable the failing services due to apparmor), by disabling the respective apparmor hardening profile but i did not have time to waste on stupid stuff and experiment so I preferred to completely stop it. 

To identify the upgrade oddities has to deal with apparmors service enabled security protections you should be able to find respective records inside /var/log/messages as well as in /var/log/audit/audit.log

 

# dmesg

[   64.210463] audit: type=1400 audit(1548120161.662:21): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0
[  144.364055] audit: type=1400 audit(1548120241.595:22): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0
[  144.465883] audit: type=1400 audit(1548120241.699:23): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0
[  144.566363] audit: type=1400 audit(1548120241.799:24): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0
[  144.666722] audit: type=1400 audit(1548120241.899:25): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0
[  144.767069] audit: type=1400 audit(1548120241.999:26): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0
[  144.867432] audit: type=1400 audit(1548120242.099:27): apparmor="DENIED" operation="sendmsg" info="Failed name lookup – disconnected path" error=-13 profile="/usr/sbin/mysqld" name="run/systemd/notify" pid=2527 comm="mysqld" requested_mask="w" denied_mask="w" fsuid=113 ouid=0


1. How to check if AppArmor is running on the system

If you have a system with enabled apparmor you should get some output like:

root@haproxy2:~# apparmor_status 
apparmor module is loaded.
5 profiles are loaded.
5 profiles are in enforce mode.
   /usr/sbin/ntpd
   lsb_release
   nvidia_modprobe
   nvidia_modprobe//kmod
   tcpdump
0 profiles are in complain mode.
1 processes have profiles defined.
1 processes are in enforce mode.
   /usr/sbin/ntpd (387) 
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.


Also if you check the service you will find out that Debian's Major Release upgrade from 10 Buster to 11 BullsEye with.

apt update -y && apt upgrade -y && apt dist-update -y

automatically installed apparmor and started the service, e.g.:

# systemctl status apparmor
● apparmor.service – Load AppArmor profiles
     Loaded: loaded (/lib/systemd/system/apparmor.service; enabled; vendor pres>
     Active: active (exited) since Sat 2022-01-22 23:04:58 EET; 5 days ago
       Docs: man:apparmor(7)
             https://gitlab.com/apparmor/apparmor/wikis/home/
    Process: 205 ExecStart=/lib/apparmor/apparmor.systemd reload (code=exited, >
   Main PID: 205 (code=exited, status=0/SUCCESS)
        CPU: 43ms

яну 22 23:04:58 haproxy2 apparmor.systemd[205]: Restarting AppArmor
яну 22 23:04:58 haproxy2 apparmor.systemd[205]: Reloading AppArmor profiles
яну 22 23:04:58 haproxy2 systemd[1]: Starting Load AppArmor profiles…
яну 22 23:04:58 haproxy2 systemd[1]: Finished Load AppArmor profiles.

 

# dpkg -l |grep -i apparmor
ii  apparmor                          2.13.6-10                      amd64        user-space parser utility for AppArmor
ii  libapparmor1:amd64                2.13.6-10                      amd64        changehat AppArmor library
ii  libapparmor-perl:amd64               2.13.6-10


In case AppArmor is disabled, you will get something like:

root@pcfrxenweb:~# aa-status 
apparmor module is loaded.
0 profiles are loaded.
0 profiles are in enforce mode.
0 profiles are in complain mode.
0 processes have profiles defined.
0 processes are in enforce mode.
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.


2. How to disable AppArmor for particular running services processes

In my case after the upgrade of a system running a MySQL Server suddenly out of nothing after reboot the Database couldn't load up properly and if I try to restart it with the usual

root@pcfrxen: /# systemctl restart mariadb

I started getting errors like:

DBI connect failed : Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)

To get an idea of what kind of profile definitions, could be enabled disabled on apparmor enabled system do:
 

root@pcfrxen:/var/log# ls -1 /etc/apparmor.d/
abstractions/
force-complain/
local/
lxc/
lxc-containers
samba/
system_tor
tunables/
usr.bin.freshclam
usr.bin.lxc-start
usr.bin.man
usr.bin.tcpdump
usr.lib.telepathy
usr.sbin.clamd
usr.sbin.cups-browsed
usr.sbin.cupsd
usr.sbin.ejabberdctl
usr.sbin.mariadbd
usr.sbin.mysqld
usr.sbin.named
usr.sbin.ntpd
usr.sbin.privoxy
usr.sbin.squid

Lets say you want to disable any protection AppArmor profile for MySQL you can do it with:

root@pcfrxen:/ #  ln -s /etc/apparmor.d/usr.sbin.mysqld /etc/apparmor.d/disable/
root@pcfrxen:/ # apparmor_parser -R /etc/apparmor.d/usr.sbin.mysqld 


To make the system know you have disabled a profile you should restart apparmor service:
 

root@pcfrxen:/ # systemctl restart apparmor.service


3. Disable completely AppArmor to save your time weird system behavior and hang bangs

In my opinion the best thing to do anyways, especially if you don't run Containerized applications, that runs only one single application / service at at time is to completely disable apparmor, otherwise you would have to manually check each of the running applications before the upgrade and make sure that apparmor did not bring havoc to some of it.
Hence my way was to simple get rid of apparmor by disable and remove the related package completely out of the system to do so:

root@pcfrxen:/ # systemctl stop apparmor
root@pcfrxen:/ # systemctl disable apparmor
root@pcfrxen:/ # apt-get remove -y apparmor

Once  disabled to make the system completely load out anything loaded related to apparmor loaded into system memory, you should do machine reboot.

root@pcfrxen:/ # shutdown -r now

Hopefully if you run into same issue after removal of apparmor most of the things should be working fine after the upgrade. Anyways I had to go through each and every app everywhere and make sure it is working as expected. The major release upgrade has also automatically enabled me some of the already disable services, thus if you have upgraded like me I would advice you do a close check on every enabled / running service everywhere:

root@pcfrxen:/# systemctl list-unit-files|grep -i enabled

Beware of AppArmor  !!! 🙂