Posts Tagged ‘check’

How to RPM update Hypervisors and Virtual Machines running Haproxy High Availability cluster on KVM, Virtuozzo without a downtime on RHEL / CentOS Linux

Friday, May 20th, 2022

virtuozzo-kvm-virtual-machines-and-hypervisor-update-manual-haproxy-logo


Here is the scenario, lets say you have on your daily task list two Hypervisor (HV) hosts running CentOS or RHEL Linux with KVM or Virutozzo technology and inside the HV hosts you have configured at least 2 pairs of virtual machines one residing on HV Host 1 and one residing on HV Host 2 and you need to constantly keep the hosts to the latest distribution major release security patchset.

The Virtual Machines has been running another set of Redhat Linux or CentOS configured to work in a High Availability Cluster running Haproxy / Apache / Postfix or any other kind of HA solution on top of corosync / keepalived or whatever application cluster scripts Free or Open Source technology that supports a switch between clustered Application nodes.

The logical question comes how to keep up the CentOS / RHEL Machines uptodate without interfering with the operations of the Applications running on the cluster?

Assuming that the 2 or more machines are configured to run in Active / Passive App member mode, e.g. one machine is Active at any time and the other is always Passive, a switch is possible between the Active and Passive node.

HAProxy--Load-Balancer-cluster-2-nodes-your-Servers

In this article I'll give a simple step by step tested example on how you I succeeded to update (for security reasons) up to the latest available Distribution major release patchset on one by one first the Clustered App on Virtual Machines 1 and VM2 on Linux Hypervisor Host 1. Then the App cluster VM 1 / VM 2 on Hypervisor Host 2.
And finally update the Hypervisor1 (after moving the Active resources from it to Hypervisor2) and updating the Hypervisor2 after moving the App running resources back on HV1.
I know the procedure is a bit monotonic but it tries to go through everything step by step to try to mitigate any possible problems. In case of failure of some rpm dependencies during yum / dnf tool updates you can always revert to backups so in anyways don't forget to have a fully functional backup of each of the HV hosts and the VMs somewhere on a separate machine before proceeding further, any possible failures due to following my aritcle literally is your responsibility 🙂

 

0. Check situation before the update on HVs / get VM IDs etc.

Check the virsion of each of the machines to be updated both Hypervisor and Hosted VMs, on each machine run:
 

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)


The machine setup I'll be dealing with is as follows:
 

hypervisor-host1 -> hypervisor-host1.fqdn.com 
•    virt-mach-centos1
•    virt-machine-zabbix-proxy-centos (zabbix proxy)

hypervisor-host2 -> hypervisor-host2.fqdn.com
•    virt-mach-centos2
•    virt-machine-zabbix2-proxy-centos (zabbix proxy)

To check what is yours check out with virsh cmd –if on KVM or with prlctl if using Virutozzo, you should get something like:

[root@hypervisor-host2 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host1 running
 2 virt-mach-centos2 running

 # virsh list –all

[root@hypervisor-host1 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host2 running
 3 virt-mach-centos1 running

[root@hypervisor-host1 ~]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{dc37c201-08c9-589d-aa20-9386d63ce3f3}  running      –               VM virt-mach-centos1
{76e8a5f8-caa8-5442-830e-aa4bfe8d42d9}  running      –               VM vm-host2
[root@hypervisor-host1 ~]#

If you have stopped VMs with Virtuozzo to list the stopped ones as well.
 

# prlctl list -a

[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# vzlist -a
                                CTID      NPROC STATUS    IP_ADDR         HOSTNAME
[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{92075803-a4ce-5ec0-a3d8-9ee83d85fc76}  running      –               VM virt-mach-centos2
{74a7bbe8-9245-5385-ac0d-d10299100789}  running      –               VM vm-host1

# prlctl list -a


If due to Virtuozzo version above command does not return you can manually check in the VM located folder, VM ID etc.
 

[root@hypervisor-host2 vmprivate]# ls
74a7bbe8-9245-4385-ac0d-d10299100789  92075803-a4ce-4ec0-a3d8-9ee83d85fc76
[root@hypervisor-host2 vmprivate]# pwd
/vz/vmprivate
[root@hypervisor-host2 vmprivate]#


[root@hypervisor-host1 ~]# ls -al /vz/vmprivate/
total 20
drwxr-x—. 5 root root 4096 Feb 14  2019 .
drwxr-xr-x. 7 root root 4096 Feb 13  2019 ..
drwxr-x–x. 4 root root 4096 Feb 18  2019 1c863dfc-1deb-493c-820f-3005a0457627
drwxr-x–x. 4 root root 4096 Feb 14  2019 76e8a5f8-caa8-4442-830e-aa4bfe8d42d9
drwxr-x–x. 4 root root 4096 Feb 14  2019 dc37c201-08c9-489d-aa20-9386d63ce3f3
[root@hypervisor-host1 ~]#


Before doing anything with the VMs, also don't forget to check the Hypervisor hosts has enough space, otherwise you'll get in big troubles !
 

[root@hypervisor-host2 vmprivate]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/centos_hypervisor-host2-root   20G  1.8G   17G  10% /
devtmpfs                          20G     0   20G   0% /dev
tmpfs                             20G     0   20G   0% /dev/shm
tmpfs                             20G  2.0G   18G  11% /run
tmpfs                             20G     0   20G   0% /sys/fs/cgroup
/dev/sda1                        992M  159M  766M  18% /boot
/dev/mapper/centos_hypervisor-host2-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos_hypervisor-host2-var   9.8G  355M  8.9G   4% /var
/dev/mapper/centos_hypervisor-host2-vz    755G   25G  692G   4% /vz

 

[root@hypervisor-host1 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   50G  1.8G   45G   4% /
devtmpfs                  20G     0   20G   0% /dev
tmpfs                     20G     0   20G   0% /dev/shm
tmpfs                     20G  2.1G   18G  11% /run
tmpfs                     20G     0   20G   0% /sys/fs/cgroup
/dev/sda2                992M  153M  772M  17% /boot
/dev/mapper/centos-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos-var   9.8G  406M  8.9G   5% /var
/dev/mapper/centos-vz    689G   12G  643G   2% /vz

Another thing to do before proceeding with update is to check and tune if needed the amount of CentOS repositories used, before doing anything with yum.
 

[root@hypervisor-host2 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 110 root root 12288 Oct  7 11:13 ..
-rw-r–r–.   1 root root  4382 Mar 14  2019 CentOS7.repo
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo
[root@hypervisor-host2 yum.repos.d]#

 

[root@hypervisor-host1 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 112 root root 12288 Oct  7 11:09 ..
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   300 Mar 14  2019 obsoleted_tmpls.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo


1. Dump VM definition XMs (to have it in case if it gets wiped during update)

There is always a possibility that something will fail during the update and you might be unable to restore back to the old version of the Virtual Machine due to some config misconfigurations or whatever thus a very good idea, before proceeding to modify the working VMs is to use KVM's virsh and dump the exact set of XML configuration that makes the VM roll properly.

To do so:
Check a little bit up in the article how we have listed the IDs that are part of the directory containing the VM.
 

[root@hypervisor-host1 ]# virsh dumpxml (Id of VM virt-mach-centos1 ) > /root/virt-mach-centos1_config_bak.xml
[root@hypervisor-host2 ]# virsh dumpxml (Id of VM virt-mach-centos2) > /root/virt-mach-centos2_config_bak.xml

 


2. Set on standby virt-mach-centos1 (virt-mach-centos1)

As I'm upgrading two machines that are configured to run an haproxy corosync cluster, before proceeding to update the active host, we have to switch off
the proxied traffic from node1 to node2, – e.g. standby the active node, so the cluster can move up the traffic to other available node.
 

[root@virt-mach-centos1 ~]# pcs cluster standby virt-mach-centos1


3. Stop VM virt-mach-centos1 & backup on Hypervisor host (hypervisor-host1) for VM1

Another prevention step to make sure you don't get into damaged VM or broken haproxy cluster after the upgrade is to of course backup 

 

[root@hypervisor-host1 ]# prlctl backup virt-mach-centos1

or
 

[root@hypervisor-host1 ]# prlctl stop virt-mach-centos1
[root@hypervisor-host1 ]# cp -rpf /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3 /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3-bak
[root@hypervisor-host1 ]# tar -czvf virt-mach-centos1_vm_virt-mach-centos1.tar.gz /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3

[root@hypervisor-host1 ]# prlctl start virt-mach-centos1


4. Remove package version locks on all hosts

If you're using package locking to prevent some other colleague to not accidently upgrade the machine (if multiple sysadmins are managing the host), you might use the RPM package locking meachanism, if that is used check RPM packs that are locked and release the locking.

+ List actual list of locked packages

[root@hypervisor-host1 ]# yum versionlock list  

…..
0:libtalloc-2.1.16-1.el7.*
0:libedit-3.0-12.20121213cvs.el7.*
0:p11-kit-trust-0.23.5-3.el7.*
1:quota-nls-4.01-19.el7.*
0:perl-Exporter-5.68-3.el7.*
0:sudo-1.8.23-9.el7.*
0:libxslt-1.1.28-5.el7.*
versionlock list done
                          

+ Clear the locking            

# yum versionlock clear                               


+ List actual list / == clear all entries
 

[root@virt-mach-centos2 ]# yum versionlock list; yum versionlock clear
[root@virt-mach-centos1 ]# yum versionlock list; yum versionlock clear
[root@hypervisor-host1 ~]# yum versionlock list; yum versionlock clear
[root@hypervisor-host2 ~]# yum versionlock list; yum versionlock clear


5. Do yum update virt-mach-centos1


For some clarity if something goes wrong, it is really a good idea to make a dump of the basic packages installed before the RPM package update is initiated,
The exact versoin of RHEL or CentOS as well as the list of locked packages, if locking is used.

Enter virt-mach-centos1 (ssh virt-mach-centos1) and run following cmds:
 

# cat /etc/redhat-release  > /root/logs/redhat-release-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


+ Only if needed!!
 

# yum versionlock clear
# yum versionlock list


Clear any previous RPM packages – careful with that as you might want to keep the old RPMs, if unsure comment out below line
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

Proceed with the update and monitor closely the output of commands and log out everything inside files using a small script that you should place under /root/status the script is given at the end of the aritcle.:
 

yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
yum check-update | wc -l
yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

6. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


7. Stop VM virt-mach-centos2 & backup  on Hypervisor host (hypervisor-host2)

Same backup step as prior 

# prlctl backup virt-mach-centos2


or
 

# prlctl stop virt-mach-centos2
# cp -rpf /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76 /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76-bak
## tar -czvf virt-mach-centos2_vm_virt-mach-centos2.tar.gz /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76

# prctl start virt-mach-centos2


8. Do yum update on virt-mach-centos2

Log system state, before the update
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list

 

Clean old install update / packages if required
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Initiate the update

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l 
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


9. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now

 

10. Stop VM vm-host2 & backup
 

# prlctl backup vm-host2


or

# prlctl stop vm-host2

Or copy the actual directory containig the Virtozzo VM (use the correct ID)
 

# cp -rpf /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9 /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9-bak
## tar -czvf vm-host2.tar.gz /vz/vmprivate/76e8a5f8-caa8-4442-830e-aa5bfe8d42d9

# prctl start vm-host2


11. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Clear only if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Do the rpm upgrade

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


12. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now


13. Do yum update hypervisor-host2

 

 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock   if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Update rpms
 

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


14. Stop VM vm-host1 & backup


Some as ealier
 

# prlctl backup vm-host1

or
 

# prlctl stop vm-host1

# cp -rpf /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789 /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789-bak
# tar -czvf vm-host1.tar.gz /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789

# prctl start vm-host1


15. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


16. Check if everything is running fine after upgrade

+ Reboot VM

# shutdown -r now


17. Do yum update hypervisor-host1

Same procedure for HV host 1 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock
 

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


18. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host1 all VMs run as expected 


19. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host2 all VMs run as expected afterwards


20. Check once more VMs and haproxy or any other contained services in VMs run as expected

Login to hosts and check processes and logs for errors etc.
 

21. Haproxy Unstandby virt-mach-centos1

Assuming that the virt-mach-centos1 and virt-mach-centos2 are running a Haproxy / corosync cluster you can try to standby node1 and check the result
hopefully all should be fine and traffic should come to host node2.

[root@virt-mach-centos1 ~]# pcs cluster unstandby virt-mach-centos1


Monitor logs and make sure HAproxy works fine on virt-mach-centos1


22. If necessery to redefine VMs (in case they disappear from virsh) or virtuosso is not working

[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos1_config_bak.xml
[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos2_config_bak.xml


23. Set versionlock to RPMs to prevent accident updates and check OS version release

[root@virt-mach-centos2 ]# yum versionlock \*
[root@virt-mach-centos1 ]# yum versionlock \*
[root@hypervisor-host1 ~]# yum versionlock \*
[root@hypervisor-host2 ~]# yum versionlock \*

[root@hypervisor-host2 ~]# cat /etc/redhat-release 
CentOS Linux release 7.8.2003 (Core)

Other useful hints

[root@hypervisor-host1 ~]# virsh console dc37c201-08c9-489d-aa20-9386d63ce3f3
Connected to domain virt-mach-centos1
..

! Compare packages count before the upgrade on each of the supposable identical VMs and HVs – if there is difference in package count review what kind of packages are different and try to make the machines to look as identical as possible  !

Packages to update on hypervisor-host1 Count: XXX
Packages to update on hypervisor-host2 Count: XXX
Packages to update virt-mach-centos1 Count: – 254
Packages to update virt-mach-centos2 Count: – 249

The /root/status script

+++

#!/bin/sh
echo  '=======================================================   '
echo  '= Systemctl list-unit-files –type=service | grep enabled '
echo  '=======================================================   '
systemctl list-unit-files –type=service | grep enabled

echo  '=======================================================   '
echo  '= systemctl | grep ".service" | grep "running"            '
echo  '=======================================================   '
systemctl | grep ".service" | grep "running"

echo  '=======================================================   '
echo  '= chkconfig –list                                        '
echo  '=======================================================   '
chkconfig –list

echo  '=======================================================   '
echo  '= netstat -tulpn                                          '
echo  '=======================================================   '
netstat -tulpn

echo  '=======================================================   '
echo  '= netstat -r                                              '
echo  '=======================================================   '
netstat -r


+++

That's all folks, once going through the article, after some 2 hours of efforts or so you should have an up2date machines.
Any problems faced or feedback is mostly welcome as this might help others who have the same setup.

Thanks for reading me 🙂

Fix weird double logging in haproxy.log file due to haproxy.cfg misconfiguration

Tuesday, March 8th, 2022

haproxy-logging-front-network-back-network-diagram

While we were building a new machine that will serve as a Haproxy server proxy frontend to tunnel some traffic to a number of backends, 
came across weird oddity. The call requests sent to the haproxy and redirected to backend servers, were being written in the log twice.

Since we have two backend Application servers hat are serving the request, my first guess was this is caused by the fact haproxy
tries to connect to both nodes on each sent request, or that the double logging is caused by the rsyslogd doing something strange on each
received query. The rsyslog configuration configured to send via local6 facility to rsyslog is like that:

$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514
#2022/02/02: HAProxy logs to local6, save the messages
local6.*                                                /var/log/haproxy.log


The haproxy basic global and defaults and frontend section config is like that:
 

global
    log          127.0.0.1 local6 debug
    chroot       /var/lib/haproxy
    pidfile      /run/haproxy.pid
    stats socket /var/lib/haproxy/haproxy.sock mode 0600 level admin
    maxconn      4000
    user         haproxy
    group        haproxy
    daemon

 

defaults
    mode        tcp
    log         global
#    option      dontlognull
#    option      httpclose
#    option      httplog
#    option      forwardfor
    option      redispatch
    option      log-health-checks
    timeout connect 10000 # default 10 second time out if a backend is not found
    timeout client 300000
    timeout server 300000
    maxconn     60000
    retries     3
 

listen FRONTEND1
        bind 10.71.80.5:63750
        mode tcp
        option tcplog
        log global
        log-format [%t]\ %ci:%cp\ %bi:%bp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        balance roundrobin
        timeout client 350000
        timeout server 350000
        timeout connect 35000
        server backend-host1 10.80.50.1:13750 weight 1 check port 15000
        server backend-host2 10.80.50.2:13750 weight 2 check port 15000

After quick research online on why this odd double logging of request ends in /var/log/haproxy.log it turns out this is caused by the 

log global  defined double under the defaults section as well as in the frontend itself, hence to resolve simply had to comment out the log global in Frontend, so it looks like so:

listen FRONTEND1
        bind 10.71.80.5:63750
        mode tcp
        option tcplog
  #      log global
        log-format [%t]\ %ci:%cp\ %bi:%bp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        balance roundrobin
        timeout client 350000
        timeout server 350000
        timeout connect 35000
        server backend-host1 10.80.50.1:13750 weight 1 check port 15000
        server backend-host2 10.80.50.2:13750 weight 2 check port 15000

 


Next just reloaded haproxy and one request started leaving only one trail inside haproxy.log as expected 🙂
 

List and fix failed systemd failed services after Linux OS upgrade and how to get full info about systemd service from jorunal log

Friday, February 25th, 2022

systemd-logo-unix-linux-list-failed-systemd-services

I have recently upgraded a number of machines from Debian 10 Buster to Debian 11 Bullseye. The update as always has some issues on some machines, such as problem with package dependencies, changing a number of external package repositories etc. to match che Bullseye deb packages. On some machines the update was less painful on others but the overall line was that most of the machines after the update ended up with one or more failed systemd services. It could be that some of the machines has already had this failed services present and I never checked them from the previous time update from Debian 9 -> Debian 10 or just some mess I've left behind in the hurry when doing software installation in the past. This doesn't matter anyways the fact was that I had to deal to a number of systemctl services which I managed to track by the Failed service mesage on system boot on one of the physical machines and on the OpenXen VTY Console the rest of Virtual Machines after update had some Failed messages. Thus I've spend some good amount of time like an overall of a day or two fixing strange failed services. This is how this small article was born in attempt to help sysadmins or any home Linux desktop users, who has updated his Debian Linux / Ubuntu or any other deb based distribution but due to the chaotic nature of Linux has ended with same strange Failed services and look for a way to find the source of the failures and get rid of the problems. 
Systemd is a very complicated system and in my many sysadmin opinion it makes more problems than it solves, but okay for today's people's megalomania mindset it matches well.

Systemd_components-systemd-journalctl-cgroups-loginctl-nspawn-analyze.svg

 

1. Check the journal for errors, running service irregularities and so on
 

First thing to do to track for errors, right after the update is to take some minutes and closely check,, the journalctl for any strange errors, even on well maintained Unix machines, this journal log would bring you to a problem that is not fatal but still some process or stuff is malfunctioning in the background that you would like to solve:
 

root@pcfreak:~# journalctl -x
Jan 10 10:10:01 pcfreak CRON[17887]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17887]: USER_END pid=17887 uid=0 auid=0 ses=340858 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17888]: CRED_DISP pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17888]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17888]: USER_END pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17884]: CRED_DISP pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17884]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17884]: USER_END pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17886]: CRED_DISP pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/c>
Jan 10 10:10:01 pcfreak CRON[17886]: pam_unix(cron:session): session closed for user www-data
Jan 10 10:10:01 pcfreak audit[17886]: USER_END pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permi>
Jan 10 10:10:08 pcfreak NetworkManager[696]:  [1641802208.0899] device (eth1): carrier: link connected
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:19 pcfreak NetworkManager[696]:
 [1641802219.7920] device (eth1): carrier: link connected
Jan 10 10:10:19 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:20 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:22 pcfreak NetworkManager[696]:
 [1641802222.2772] device (eth1): carrier: link connected
Jan 10 10:10:22 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:23 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:33 pcfreak sshd[18142]: Unable to negotiate with 66.212.17.162 port 19255: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diff>
Jan 10 10:10:41 pcfreak NetworkManager[696]:
 [1641802241.0186] device (eth1): carrier: link connected
Jan 10 10:10:41 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx

If you want to only check latest journal log messages use the -x -e (pager catalog) opts

root@pcfreak;~# journalctl -xe

Feb 25 13:08:29 pcfreak audit[2284920]: USER_LOGIN pid=2284920 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=login acct=28696E76616C>
Feb 25 13:08:29 pcfreak sshd[2284920]: Received disconnect from 177.87.57.145 port 40927:11: Bye Bye [preauth]
Feb 25 13:08:29 pcfreak sshd[2284920]: Disconnected from invalid user ubuntuuser 177.87.57.145 port 40927 [preauth]

Next thing to after the update was to get a list of failed service only.


2. List all systemd failed check services which was supposed to be running

root@pcfreak:/root # systemctl list-units | grep -i failed
● certbot.service                                                                                                       loaded failed failed    Certbot
● logrotate.service                                                                                                     loaded failed failed    Rotate log files
● maldet.service                                                                                                        loaded failed failed    LSB: Start/stop maldet in monitor mode
● named.service                                                                                                         loaded failed failed    BIND Domain Name Server


Alternative way is with the –failed option

hipo@jeremiah:~$ systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

 

root@jeremiah:/etc/apt/sources.list.d#  systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon


To get a full list of objects of systemctl you can pass as state:
 

# systemctl –state=help
Full list of possible load states to pass is here
Show service properties


Check whether a service is failed or has other status and check default set systemd variables for it.

root@jeremiah~:# systemctl is-failed vboxweb.service
inactive

# systemctl show haproxy
Type=notify
Restart=always
NotifyAccess=main
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=304858
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success

Full output of the above command is dumped in show_systemctl_properties.txt


3. List all running systemd services for a better overview on what's going on on machine
 

To get a list of all properly systemd loaded services you can use –state running.

hipo@jeremiah:~$ systemctl list-units –state running|head -n 10
  UNIT                              LOAD   ACTIVE SUB     DESCRIPTION
  proc-sys-fs-binfmt_misc.automount loaded active running Arbitrary Executable File Formats File System Automount Point
  cups.path                         loaded active running CUPS Scheduler
  init.scope                        loaded active running System and Service Manager
  session-2.scope                   loaded active running Session 2 of user hipo
  accounts-daemon.service           loaded active running Accounts Service
  anydesk.service                   loaded active running AnyDesk
  apache-htcacheclean.service       loaded active running Disk Cache Cleaning Daemon for Apache HTTP Server
  apache2.service                   loaded active running The Apache HTTP Server
  avahi-daemon.service              loaded active running Avahi mDNS/DNS-SD Stack

 

It is useful thing is to list all unit-files configured in systemd and their state, you can do it with:

 


root@pcfreak:~# systemctl list-unit-files
UNIT FILE                                                                 STATE           VENDOR PRESET
proc-sys-fs-binfmt_misc.automount                                         static          –            
-.mount                                                                   generated       –            
backups.mount                                                             generated       –            
dev-hugepages.mount                                                       static          –            
dev-mqueue.mount                                                          static          –            
media-cdrom0.mount                                                        generated       –            
mnt-sda1.mount                                                            generated       –            
proc-fs-nfsd.mount                                                        static          –            
proc-sys-fs-binfmt_misc.mount                                             disabled        disabled     
run-rpc_pipefs.mount                                                      static          –            
sys-fs-fuse-connections.mount                                             static          –            
sys-kernel-config.mount                                                   static          –            
sys-kernel-debug.mount                                                    static          –            
sys-kernel-tracing.mount                                                  static          –            
var-www.mount                                                             generated       –            
acpid.path                                                                masked          enabled      
cups.path                                                                 enabled         enabled      

 

 


root@pcfreak:~# systemctl list-units –type service –all
  UNIT                                   LOAD      ACTIVE   SUB     DESCRIPTION
  accounts-daemon.service                loaded    inactive dead    Accounts Service
  acct.service                           loaded    active   exited  Kernel process accounting
● alsa-restore.service                   not-found inactive dead    alsa-restore.service
● alsa-state.service                     not-found inactive dead    alsa-state.service
  apache2.service                        loaded    active   running The Apache HTTP Server
● apparmor.service                       not-found inactive dead    apparmor.service
  apt-daily-upgrade.service              loaded    inactive dead    Daily apt upgrade and clean activities
 apt-daily.service                      loaded    inactive dead    Daily apt download activities
  atd.service                            loaded    active   running Deferred execution scheduler
  auditd.service                         loaded    active   running Security Auditing Service
  auth-rpcgss-module.service             loaded    inactive dead    Kernel Module supporting RPCSEC_GSS
  avahi-daemon.service                   loaded    active   running Avahi mDNS/DNS-SD Stack
  certbot.service                        loaded    inactive dead    Certbot
  clamav-daemon.service                  loaded    active   running Clam AntiVirus userspace daemon
  clamav-freshclam.service               loaded    active   running ClamAV virus database updater
..

 


linux-systemd-components-diagram-linux-kernel-system-targets-systemd-libraries-daemons

 

4. Finding out more on why a systemd configured service has failed


Usually getting info about failed systemd service is done with systemctl status servicename.service
However, in case of troubles with service unable to start to get more info about why a service has failed with (-l) or (–full) options


root@pcfreak:~# systemctl -l status logrotate.service
● logrotate.service – Rotate log files
     Loaded: loaded (/lib/systemd/system/logrotate.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 00:00:06 EET; 13h ago
TriggeredBy: ● logrotate.timer
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
    Process: 2045320 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)
   Main PID: 2045320 (code=exited, status=1/FAILURE)
        CPU: 2.479s

Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: For now we will assume you meant to write /32
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| ERROR: '0.0.0.0/0.0.0.0' needs to be replaced by the term 'all'.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| SECURITY NOTICE: Overriding config setting. Using 'all' instead.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: (B) '::/0' is a subnetwork of (A) '::/0'
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: because of this '::/0' is ignored to keep splay tree searching predictable
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: You should probably remove '::/0' from the ACL named 'all'
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Failed with result 'exit-code'.
Feb 25 00:00:06 pcfreak systemd[1]: Failed to start Rotate log files.
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Consumed 2.479s CPU time.


systemctl -l however is providing only the last log from message a started / stopped or whatever status service has generated. Sometimes systemctl -l servicename.service is showing incomplete the splitted error message as there is a limitation of line numbers on the console, see below

 

root@pcfreak:~# systemctl status -l certbot.service
● certbot.service – Certbot
     Loaded: loaded (/lib/systemd/system/certbot.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 09:28:33 EET; 4h 0min ago
TriggeredBy: ● certbot.timer
       Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
             https://certbot.eff.org/docs
    Process: 290017 ExecStart=/usr/bin/certbot -q renew (code=exited, status=1/FAILURE)
   Main PID: 290017 (code=exited, status=1/FAILURE)
        CPU: 9.771s

Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using th>
Feb 25 09:28:33 pcfrxen certbot[290017]: All renewals failed. The following certificates could not be renewed:
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/mail.pcfreak.org-0003/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/www.eforia.bg-0005/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/zabbix.pc-freak.net/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]: 3 renew failure(s), 5 parse failure(s)
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Failed with result 'exit-code'.
Feb 25 09:28:33 pcfrxen systemd[1]: Failed to start Certbot.
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Consumed 9.771s CPU time.

 

5. Get a complete log of journal to make sure everything configured on server host runs as it should

Thus to get more complete list of the message and be able to later google and look if has come with a solution on the internet  use:

root@pcfrxen:~#  journalctl –catalog –unit=certbot

— Journal begins at Sat 2022-01-22 21:14:05 EET, ends at Fri 2022-02-25 13:32:01 EET. —
Jan 23 09:58:18 pcfrxen systemd[1]: Starting Certbot…
░░ Subject: A start job for unit certbot.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit certbot.service has begun execution.
░░ 
░░ The job identifier is 5754.
Jan 23 09:58:20 pcfrxen certbot[124996]: Traceback (most recent call last):
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/renewal.py", line 71, in _reconstitute
Jan 23 09:58:20 pcfrxen certbot[124996]:     renewal_candidate = storage.RenewableCert(full_path, config)
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 471, in __init__
Jan 23 09:58:20 pcfrxen certbot[124996]:     self._check_symlinks()
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 537, in _check_symlinks

root@server:~# journalctl –catalog –unit=certbot|grep -i pluginerror|tail -1
Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using the manual plugin non-interactively.')


Or if you want to list and read only the last messages in the journal log regarding a service

root@server:~# journalctl –catalog –pager-end –unit=certbot


If you have disabled a failed service because you don't need it to run at all on the machine with:

root@rhel:~# systemctl stop rngd.service
root@rhel:~# systemctl disable rngd.service

And you want to clear up any failed service information that is kept in the systemctl service log you can do it with:
 

root@rhel:~# systemctl reset-failed

Another useful systemctl option is cat, you can use it to easily list a service it is useful to quickly check what is a service, an actual shortcut to save you from giving a full path to the service e.g. cat /lib/systemd/system/certbot.service

root@server:~# systemctl cat certbot
# /lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://certbot.eff.org/docs
[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true


After failed SystemD services are fixed, it is best to reboot the machine and check put some more time to inspect rawly the complete journal log to make sure, no error  was left behind.


Closure
 

As you can see updating a machine from a major to a major version even if you follow the official documentation and you have plenty of experience is always more or a less a pain in the ass, which can eat up much of your time banging your head solving problems with failed daemons issues with /etc/rc.local (which I have faced becase of #/bin/sh -e (which would make /etc/rc.local) to immediately quit if any error from command $? returns different from 0 etc.. The  logical questions comes then;
1. Is it really worthy to update at all regularly, especially if you don't know of a famous major Vulnerability 🙂 ?
2. Or is it worthy to update from OS major release to OS major release at all?  
3. Or should you only try to patch the service that is exposed to an external reachable computer network or the internet only and still the the same OS release until End of Life (LTS = Long Term Support) as called in Debian or  End Of Life  (EOL) Cycle as called in RPM based distros the period until the OS major release your software distro has official security patches is reached.

Anyone could take any approach but for my own managed systems small network at home my practice was always to try to keep up2date everything every 3 or 6 months maximum. This has caused me multiple days of irritation and stress and perhaps many white hairs and spend nerves on shit.


4. Based on the company where I'm employed the better strategy is to patch to the EOL is still offered and keep the rule First Things First (FTF), once the EOL is reached, just make a copy of all servers data and configuration to external Data storage, bring up a new Physical or VM and migrate the services.
Test after the migration all works as expected if all is as it should be change the DNS records or Leading Infrastructure Proxies whatever to point to the new service and that's it! Yes it is true that migration based on a full OS reinstall is more time consuming and requires much more planning, but usually the result is much more expected, plus it is much less stressful for the guy doing the job.

Zabbix rkhunter monitoring check if rootkits trojans and viruses or suspicious OS activities are detected

Wednesday, December 8th, 2021

monitor-rkhunter-with-zabbix-zabbix-rkhunter-check-logo

If you're using rkhunter to monitor for malicious activities, a binary changes, rootkits, viruses, malware, suspicious stuff and other famous security breach possible or actual issues, perhaps you have configured your machines to report to some Email.
But what if you want to have a scheduled rkhunter running on the machine and you don't want to count too much on email alerting (especially because email alerting) makes possible for emails to be tracked by sysadmin pretty late?

We have been in those situation and in this case me and my dear colleague Georgi Stoyanov developed a small rkhunter Zabbix userparameter check to track and Alert if any traces of "Warning"''s are mateched in the traditional rkhunter log file /var/log/rkhunter/rkhunter.log

To set it up and use it is pretty use you will need to have a recent version of zabbix-agent installed on the machine and connected to a Zabbix server, in my case this is:

[root@centos ~]# rpm -qa |grep -i zabbix-agent
zabbix-agent-4.0.7-1.el7.x86_64

 placed inside /etc/zabbix/zabbix_agentd.d/userparameter_rkhunter_warning_check.conf

[root@centos /etc/zabbix/zabbix_agentd.d ]# cat userparameter_rkhunter_warning_check.conf
# userparameter script to check if any Warning is inside /var/log/rkhunter/rkhunter.log and if found to trigger Zabbix alert
UserParameter=rkhunter.warning, (TODAY=$(date |awk '{ print $1" "$2" "$3 }'); if [ $(cat /var/log/rkhunter/rkhunter.log | awk “/$TODAY/,EOF” | /bin/grep -i ‘\[ Warning \]’ | /usr/bin/wc -l) != ‘0’ ]; then echo 1; else echo 0; fi)
UserParameter=rkhunter.suspected,(/bin/grep -i 'Suspect files: ' /var/log/rkhunter/rkhunter.log|tail -n 1| awk '{ print $4 }')
UserParameter=rkhunter.rootkits,(/bin/grep -i 'Possible rootkits: ' /var/log/rkhunter/rkhunter.log|tail -n 1| awk '{ print $4 }')


2. Prepare Rkhunter Template, Triggers and Items


In Zabbix Server that you access from web control interface, you will have to prepare a new template called lets say Rkhunter with the necessery Triggers and Items


2.1 Create Rkhunter Items
 

On Zabbix Server side, uou will have to configure 3 Items for the 3 configured userparameter above script keys, like so:

rkhunter-items-zabbix-screenshot.png

  • rkhunter.suspected Item configuration


rkhunter-suspected-files

  • rkhunter.warning Zabbix Item config

rkhunter-warning-found-check-zabbix

  • rkhunter.rootkits Zabbix Item config

     

     

    rkhunter-zabbix-possible-rootkits-item1

    2.2 Create Triggers
     

You need to have an overall of 3 triggers like in below shot:

zabbix-rkhunter-all-triggers-screenshot
 

  • rkhunter.rootkits Trigger config

rkhunter-rootkits-trigger-zabbix1

  • rkhunter.suspected Trigger cfg

rkhunter-suspected-trigger-zabbix1

  • rkhunter warning Trigger cfg

rkhunter-warning-trigger1

3. Reload zabbix-agent and test the keys


It is necessery to reload zabbix agent for the new userparameter to start to be sent to remote zabbix server (through a proxy if you have one configured).

[root@centos ~]# systemctl restart zabbix-agent


To make the zabbix-agent send the keys to the server you can use zabbix_sender to have the test tool you will have to have installed (zabbix-sender) on the server.

To trigger a manualTest if you happen to have some problems with the key which shouldn''t be the case you can sent a value to the respectve key with below command:

[root@centos ~ ]# zabbix_sender -vv -c "/etc/zabbix/zabbix_agentd.conf" -k "khunter.warning" -o "1"


Check on Zabbix Server the sent value is received, for any oddities as usual check what is inside  /var/log/zabbix/zabbix_agentd.log for any errors or warnings.

How to test RAM Memory for errors in Linux / UNIX OS servers. Find broken memory RAM banks

Friday, December 3rd, 2021

test-ram-memory-for-errors-linux-unix-find-broken-memory-logo

 

1. Testing the memory with motherboard integrated tools
 

Memory testing has been integral part of Computers for the last 50 years. In the dawn of computers those older perhaps remember memory testing was part of the computer initialization boot. And this memory testing was delaying the boot with some seconds and the user could see the memory numbers being counted up to the amount of memory. With the increased memory modern computers started to have and the annoyance to wait for a memory check program to check the computer hardware memory on modern computers this check has been mitigated or completely removed on some hardware.
Thus under some circumstances sysadmins or advanced computer users might need to check the memory, especially if there is some suspicion for memory damages or if for example a home PC starts crashing with Blue screens of Death on Windows without reason or simply the PC or some old arcane Linux / UNIX servers gets restarted every now and then for now apparent reason. When such circumstances occur it is an idea to start debugging the hardware issue with a simple memory check.

There are multiple ways to test installed memory banks on a server laptop or local home PC both integrated and using external programs.
On servers that is usually easily done from ILO or IPMI or IDRAC access (usually web) interface of the vendor, on laptops and home usage from BIOS or UEFI (Unified Extensible Firmware Interface) acces interface on system boot that is possible as well.

memtest-hp
HP BIOS Setup

An old but gold TIP, more younger people might not know is the

 

Prolonged SHIFT key press which once held with the user instructs the machine to initiate a memory test before the computer starts reading what is written in the boot loader.

So before anything else from below article it might be a good idea to just try HOLD SHIFT for 15-20 seconds after a complete Shut and ON from the POWER button.

If this test does not triggered or it is triggered and you end up with some corrupted memory but you're not sure which exact Memory bank is really crashing and want to know more on what memory Bank and segments are breaking up you might want to do a more thorough testing. In below article I'll try to explain shortly how this can be done.


2. Test the memory using a boot USB Flash Drive / DVD / CD 
 

Say hello to memtest86+. It is a Linux GRUB boot loader bootable utility that tests physical memory by writing various patterns to it and reading them back. Since memtest86+ runs directly off the hardware it does not require any operating system support for execution. Perhaps it is important to mention that memtest86 (is PassMark memtest86)and memtest86+ (An Advanced Memory diagnostic tool) are different tools, the first is freeware and second one is FOSS software.

To use it all you'll need is some version of Linux. If you don't already have some burned in somewhere at your closet, you might want to burn one.
For Linux / Mac users this is as downloading a Linux distribution ISO file and burning it with

# dd if=/path/to/iso of=/dev/sdbX bs=80M status=progress


Windows users can burn a Live USB with whatever Linux distro or download and burn the latest versionof memtest86+ from https://www.memtest.org/  on Windows Desktop with some proggie like lets say UnetBootIn.
 

2.1. Run memtest86+ on Ubuntu

Many Linux distributions such as Ubuntu 20.0 comes together with memtest86+, which can be easily invoked from GRUB / GRUB2 Kernel boot loader.
Ubuntu has a separate menu pointer for a Memtest.

ubuntu-grub-2-04-boot-loader-memtest86-menu-screenshot

Other distributions RPM based distributions such as CentOS, Fedora Linux, Redhat things differ.

2.2. memtest86+ on Fedora


Fedora used to have the memtest86+ menu at the GRUB boot selection prompt, but for some reason removed it and in newest Fedora releases as of time such as Fedora 35 memtest86+ is preinstalled and available but not visible, to start on  already and to start a memtest memory test tool:

  •   Boot a Fedora installation or Rescue CD / USB. At the prompt, type "memtest86".

boot: memtest86

2.3 memtest86+ on RHEL Linux

The memtest86+tool is available as an RPM package from Red Hat Network (RHN) as well as a boot option from the Red Hat Enterprise Linux rescue disk.
And nowadays Red Hat Enterprise Linux ships by default with the tool.

Prior redhat (now legacy) releases such as on RHEL 5.0 it has to be installed and configure it with below 3 commands.

[root@rhel ~]# yum install memtest86+
[root@rhel ~]# memtest-setup
[root@rhel ~]# grub2-mkconfig -o /boot/grub2/grub.cfg


    Again as with CentOS to boot memtest86+ from the rescue disk, you will need to boot your system from CD 1 of the Red Hat Enterprise Linux installation media, and type the following at the boot prompt (before the Linux kernel is started):

boot: memtest86

memtestx86-8gigabytes-of-memory-boot-screenshot
memtest86+ testing 5 memory slots

As you see all on above screenshot the Memory banks are listed as Slots. There are a number of Tests to be completed until
it can be said for sure memory does not have any faulty cells. 
The

Pass: 0
Errors: 0 

Indicates no errors, so in the end if memtest86 does not find anything this values should stay at zero.
memtest86+ is also usable to detecting issues with temperature of CPU. Just recently I've tested a PC thinking that some memory has defects but it turned out the issue on the Computer was at the CPU's temperature which was topping up at 80 – 82 Celsius.

If you're unfortunate and happen to get some corrupted memory segments you will get some red fields with the memory addresses found to have corrupted on Read / Write test operations:

memtest86-returning-memory-address-errors-screenshot


2.4. Install and use memtest and memtest86+ on Debian / Mint Linux

You can install either memtest86+ or just for the fun put both of them and play around with both of them as they have a .deb package provided out of debian non-free /etc/apt/sources.list repositories.


root@jeremiah:/home/hipo# apt-cache show memtest86 memtest86+
Package: memtest86
Version: 4.3.7-3
Installed-Size: 302
Maintainer: Yann Dirson <dirson@debian.org>
Architecture: amd64
Depends: debconf (>= 0.5) | debconf-2.0
Recommends: memtest86+
Suggests: hwtools, memtester, kernel-patch-badram, grub2 (>= 1.96+20090523-1) | grub (>= 0.95+cvs20040624), mtools
Description-en: thorough real-mode memory tester
 Memtest86 scans your RAM for errors.
 .
 This tester runs independently of any OS – it is run at computer
 boot-up, so that it can test *all* of your memory.  You may want to
 look at `memtester', which allows testing your memory within Linux,
 but this one won't be able to test your whole RAM.
 .
 It can output a list of bad RAM regions usable by the BadRAM kernel
 patch, so that you can still use you old RAM with one or two bad bits.
 .
 This is the last DFSG-compliant version of this software, upstream
 has opted for a proprietary development model starting with 5.0.  You
 may want to consider using memtest86+, which has been forked from an
 earlier version of memtest86, and provides a different set of
 features.  It is available in the memtest86+ package.
 .
 A convenience script is also provided to make a grub-legacy-based
 floppy or image.

Description-md5: 0ad381a54d59a7d7f012972f613d7759
Homepage: http://www.memtest86.com/
Section: misc
Priority: optional
Filename: pool/main/m/memtest86/memtest86_4.3.7-3_amd64.deb
Size: 45470
MD5sum: 8dd2a4c52910498d711fbf6b5753bca9
SHA256: 09178eca21f8fd562806ccaa759d0261a2d3bb23190aaebc8cd99071d431aeb6

Package: memtest86+
Version: 5.01-3
Installed-Size: 2391
Maintainer: Yann Dirson <dirson@debian.org>
Architecture: amd64
Depends: debconf (>= 0.5) | debconf-2.0
Suggests: hwtools, memtester, kernel-patch-badram, memtest86, grub-pc | grub-legacy, mtools
Description-en: thorough real-mode memory tester
 Memtest86+ scans your RAM for errors.
 .
 This tester runs independently of any OS – it is run at computer
 boot-up, so that it can test *all* of your memory.  You may want to
 look at `memtester', which allows to test your memory within Linux,
 but this one won't be able to test your whole RAM.
 .
 It can output a list of bad RAM regions usable by the BadRAM kernel
 patch, so that you can still use your old RAM with one or two bad bits.
 .
 Memtest86+ is based on memtest86 3.0, and adds support for recent
 hardware, as well as a number of general-purpose improvements,
 including many patches to memtest86 available from various sources.
 .
 Both memtest86 and memtest86+ are being worked on in parallel.
Description-md5: aa685f84801773ef97fdaba8eb26436a
Homepage: http://www.memtest.org/

Tag: admin::benchmarking, admin::boot, hardware::storage:floppy,
 interface::text-mode, role::program, scope::utility, use::checking
Section: misc
Priority: optional
Filename: pool/main/m/memtest86+/memtest86+_5.01-3_amd64.deb
Size: 75142
MD5sum: 4f06523532ddfca0222ba6c55a80c433
SHA256: ad42816e0b17e882713cc6f699b988e73e580e38876cebe975891f5904828005
 

 

root@jeremiah:/home/hipo# apt-get install –yes memtest86+

root@jeremiah:/home/hipo# apt-get install –yes memtest86

Reading package lists… Done
Building dependency tree       
Reading state information… Done
Suggested packages:
  hwtools kernel-patch-badram grub2 | grub
The following NEW packages will be installed:
  memtest86
0 upgraded, 1 newly installed, 0 to remove and 21 not upgraded.
Need to get 45.5 kB of archives.
After this operation, 309 kB of additional disk space will be used.
Get:1 http://ftp.de.debian.org/debian buster/main amd64 memtest86 amd64 4.3.7-3 [45.5 kB]
Fetched 45.5 kB in 0s (181 kB/s)     
Preconfiguring packages …
Selecting previously unselected package memtest86.
(Reading database … 519985 files and directories currently installed.)
Preparing to unpack …/memtest86_4.3.7-3_amd64.deb …
Unpacking memtest86 (4.3.7-3) …
Setting up memtest86 (4.3.7-3) …
Generating grub configuration file …
Found background image: saint-John-of-Rila-grub.jpg
Found linux image: /boot/vmlinuz-4.19.0-18-amd64
Found initrd image: /boot/initrd.img-4.19.0-18-amd64
Found linux image: /boot/vmlinuz-4.19.0-17-amd64
Found initrd image: /boot/initrd.img-4.19.0-17-amd64
Found linux image: /boot/vmlinuz-4.19.0-8-amd64
Found initrd image: /boot/initrd.img-4.19.0-8-amd64
Found linux image: /boot/vmlinuz-4.19.0-6-amd64
Found initrd image: /boot/initrd.img-4.19.0-6-amd64
Found linux image: /boot/vmlinuz-4.19.0-5-amd64
Found initrd image: /boot/initrd.img-4.19.0-5-amd64
Found linux image: /boot/vmlinuz-4.9.0-8-amd64
Found initrd image: /boot/initrd.img-4.9.0-8-amd64
Found memtest86 image: /boot/memtest86.bin
Found memtest86+ image: /boot/memtest86+.bin
Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
File descriptor 3 (pipe:[66049]) leaked on lvs invocation. Parent PID 22581: /bin/sh
done
Processing triggers for man-db (2.8.5-2) …

 

After this both memory testers memtest86+ and memtest86 will appear next to the option of booting a different version kernels and the Advanced recovery kernels, that you usually get in the GRUB boot prompt.

2.5. Use memtest embedded tool on any Linux by adding a kernel variable

Edit-Grub-Parameters-add-memtest-4-to-kernel-boot

2.4.1. Reboot your computer

# reboot

2.4.2. At the GRUB boot screen (with UEFI, press Esc).

2.4.3 For 4 passes add temporarily the memtest=4 kernel parameter.
 

memtest=        [KNL,X86,ARM,PPC,RISCV] Enable memtest
                Format: <integer>
                default : 0 <disable>
                Specifies the number of memtest passes to be
                performed. Each pass selects another test
                pattern from a given set of patterns. Memtest
                fills the memory with this pattern, validates
                memory contents and reserves bad memory
                regions that are detected.


3. Install and use memtester Linux tool
 

At some condition, memory is the one of the suspcious part, or you just want have a quick test. memtester  is an effective userspace tester for stress-testing the memory subsystem.  It is very effective at finding intermittent and non-deterministic faults.

The advantage of memtester "live system check tool is", you can check your system for errors while it's still running. No need for a restart, just run that application, the downside is that some segments of memory cannot be thoroughfully tested as you already have much preloaded data in it to have the Operating Sytstem running, thus always when possible try to stick to rule to test the memory using memtest86+  from OS Boot Loader, after a clean Machine restart in order to clean up whole memory heap.

Anyhow for a general memory test on a Critical Legacy Server  (if you lets say don't have access to Remote Console Board, or don't trust the ILO / IPMI Hardware reported integrity statistics), running memtester from already booted is still a good idea.


3.1. Install memtester on any Linux distribution from source

wget http://pyropus.ca/software/memtester/old-versions/memtester-4.2.2.tar.gz
# tar zxvf memtester-4.2.2.tar.gz
# cd memtester-4.2.2
# make && make install

3.2 Install on RPM based distros

 

On Fedora memtester is available from repositories however on many other RPM based distros it is not so you have to install it from source.

[root@fedora ]# yum install -y memtester

 

3.3. Install memtester on Deb based Linux distributions from source
 

To install it on Debian / Ubuntu / Mint etc. , open a terminal and type:
 

root@linux:/ #  apt install –yes memtester

The general run syntax is:

memtester [-p PHYSADDR] [ITERATIONS]


You can hence use it like so:

hipo@linux:/ $ sudo memtester 1024 5

This should allocate 1024MB of memory, and repeat the test 5 times. The more repeats you run the better, but as a memtester run places a great overall load on the system you either don't increment the runs too much or at least run it with  lowered process importance e.g. by nicing the PID:

hipo@linux:/ $ nice -n 15 sudo memtester 1024 5

 

  • If you have more RAM like 4GB or 8GB, it is upto you how much memory you want to allocate for testing.
  • As your operating system, current running process might take some amount of RAM, Please check available Free RAM and assign that too memtester.
  • If you are using a 32 Bit System, you cant test more than 4 GB even though you have more RAM( 32 bit systems doesnt support more than 3.5 GB RAM as you all know).
  • If your system is very busy and you still assigned higher than available amount of RAM, then the test might get your system into a deadlock, leads to system to halt, be aware of this.
  • Run the memtester as root user, so that memtester process can malloc the memory, once its gets hold on that memory it will try to apply lock. if specified memory is not available, it will try to reduce required RAM automatically and try to lock it with mlock.
  • if you run it as a regular user, it cant auto reduce the required amount of RAM, so it cant lock it, so it tries to get hold on that specified memory and starts exhausting all system resources.


If you have 8 Gigas of RAM plugged into the PC motherboard you have to multiple 1024*8 this is easily done with bc (An arbitrary precision calculator language) tool:

root@linux:/ # bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
8*1024
8192


 for example you should run:

root@linux:/ # memtester 8192 5

memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 8192MB (2083520512 bytes)
got  8192MB (2083520512 bytes), trying mlock …Loop 1/1:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.

 

4. Shell Script to test server memory for corruptions
 

If for some reason the machine you want to run a memory test doesn't have connection to the external network such as the internet and therefore you cannot configure a package repository server and install memtester, the other approach is to use a simple memory test script such as memtestlinux.sh
 

#!/bin/bash
# Downloaded from https://www.srv24x7.com/memtest-linux/
echo "ByteOnSite Memory Test"
cpus=`cat /proc/cpuinfo | grep processor | wc -l`
if [ $cpus -lt 6 ]; then
threads=2
else
threads=$(($cpus / 2))
fi
echo "Detected $cpus CPUs, using $threads threads.."
memory=`free | grep 'Mem:' | awk {'print $2'}`
memoryper=$(($memory / $threads))
echo "Detected ${memory}K of RAM ($memoryper per thread).."
freespace=`df -B1024 . | tail -n1 | awk {'print $4'}`
if [ $freespace -le $memory ]; then
echo You do not have enough free space on the current partition. Minimum: $memory bytes
exit 1
fi
echo "Clearing RAM Cache.."
sync; echo 3 > /proc/sys/vm/drop_cachesfile
echo > dump.memtest.img
echo "Writing to dump file (dump.memtest.img).."
for i in `seq 1 $threads`;
do
# 1044 is used in place of 1024 to ensure full RAM usage (2% over allocation)
dd if=/dev/urandom bs=$memoryper count=1044 >> dump.memtest.img 2>/dev/null &
pids[$i]=$!
echo $i
done
for pid in "${pids[@]}"
do
wait $pid
done

echo "Reading and analyzing dump file…"
echo "Pass 1.."
md51=`md5sum dump.memtest.img | awk {'print $1'}`
echo "Pass 2.."
md52=`md5sum dump.memtest.img | awk {'print $1'}`
echo "Pass 3.."
md53=`md5sum dump.memtest.img | awk {'print $1'}`
if [ “$md51” != “$md52” ]; then
fail=1
elif [ “$md51” != “$md53” ]; then
fail=1
elif [ “$md52” != “$md53” ]; then
fail=1
else
fail=0
fi
if [ $fail -eq 0 ]; then
echo "Memory test PASSED."
else
echo "Memory test FAILED. Bad memory detected."
fi
rm -f dump.memtest.img
exit $fail

Nota Bene !: Again consider the restults might not always be 100% trustable if possible restart the server and test with memtest86+

Consider also its important to make sure prior to script run,  you''ll have enough disk space to produce the dump.memtest.img file – file is created as a test bed for the memory tests and if not scaled properly you might end up with a full ( / ) root directory!

 

4.1 Other memory test script with dd and md5sum checksum

I found this solution on the well known sysadmin site nixCraft cyberciti.biz, I think it makes sense and quicker.

First find out memory site using free command.
 

# free
             total       used       free     shared    buffers     cached
Mem:      32867436   32574160     293276          0      16652   31194340
-/+ buffers/cache:    1363168   31504268
Swap:            0          0          0


It shows that this server has 32GB memory,
 

# dd if=/dev/urandom bs=32867436 count=1050 of=/home/memtest


free reports by k and use 1050 is to make sure file memtest is bigger than physical memory.  To get better performance, use proper bs size, for example 2048 or 4096, depends on your local disk i/o,  the rule is to make bs * count > 32 GB.
run

# md5sum /home/memtest; md5sum /home/memtest; md5sum /home/memtest


If you see md5sum mismatch in different run, you have faulty memory guaranteed.
The theory is simple, the file /home/memtest will cache data in memory by filling up all available memory during read operation. Using md5sum command you are reading same data from memory.


5. Other ways to test memory / do a machine stress test

Other good tools you might want to check for memory testing is mprime – ftp://mersenne.org/gimps/ 
(https://www.mersenne.org/ftp_root/gimps/)

  •  (mprime can also be used to stress test your CPU)

Alternatively, use the package stress-ng to run all kind of stress tests (including memory test) on your machine.
Perhaps there are other interesting tools for a diagnosis of memory if you know other ones I miss, let me know in the comment section.

How to fresh Upgrade mistakenly installed 32-bit Windows 10 Professional to 64-bit Windows / A failure to Disk Clone old SSD 120GB to 512GB HDD due to failed Solid State Drive

Wednesday, November 17th, 2021

upgrade-windows-10-32-bit-to-64-bit-howto-picture

I've been Setting up a new PC with Windows OS that is a bit old a 11 years old Lenovo ThinkCentre model M90P with 8 GB of Memory, Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz   3.19 GHz, Intel Q57 Express Chipset. The machine came to me with Windows 7 preinstalled and the intial goal was to migrate Windows as it is with its data from the old 120GB SSD to new 512 SSD and then to keep the machine at least a bit more up to date to upgrade the old Windows 7 to Windows 10.

This as usual seemed like a very trivial task for a System Administrator, and even if you haven't touched much of Windows as me it makes it look a piece of cake, however as always with computers, once you think you'll be done in 2 hours usually it takes 20+ . Some call it Murphy's law "If something could go wrong then it will go wrong". But putting this situation that I thought all well that's easy lets do it is a kind of a proud Thought for man and the to save us from this Passion of Proudness which according to Church fathers is the worst passion one can have and humiliate us a bit.

God allows some unforseen stuff to happen   🙂 The case with this machine whose original idea I had is to OK I Simply Duplicate the Old Hard Drive to the New one and Place the new one on the ThinkCentre is not a big deal turned to a small adventure 🙂

For this machine hardware I have to say, the old English saying "Old but Gold" is pretty true, especially after I've attached the Samsung 512GB NVME SSD Drive, which my dear friend and brother in Christ "Uncle Emilian" had received as a gift from another friend called Angel. To put even more rant, here name Emilian stems from the Greek Emilianos which translated to English means Adversary.. But anyways The old Intel SSD 120 GB drive which besides being already completely Full of Data,  turned to have Memory DATA Chips (that perhaps burn out / wasted),  so parts of the Drive were Unreadable.
I've realized the fauly SSD fact after, 
trying to first clone the drives with my Hardware Disk Clone device Orico Dual Bay 2.5 6629US3-C device and then using a simple bit to bit copy with dd command.

orico-6629us3-c2-bay-usb3-type-b2.5-type3-5.inch-sata


At first for some weird reason the Cloning of 120GB SSD HDD towards -> 512 GB newer one was unsuccessful – one of the 2 lamp indicators on Source and Destination Drives was continuiously blinking orange as it seemed data could not be read, even though I tried few times and wait for about 1 hour of time for the cloning to complete, so I first suspected that might be an issue with my  last year bought Disk Clone hardware device. So I've attached the 2 Hard Drives towards my Debian GNU / Linux 10 as USB attached drives using the "Toaster" device  and tried a classical copy   from terminal with Disk Druid e.g.


# dd if=/dev/sdb2 of=/dev/sdbc2 bs=180M status=progress conv=noerror, sync

 
dd: error reading '/dev/sdb2': Input/output error
1074889+17746 records in
1092635+0 records out
559429120 bytes (559 MB, 534 MiB) copied, 502933 s, 1.1 kB/s
dd: writing to '/dev/dc2': Input/output error
1074889+17747 records in
1092635+0 records out
559429120 bytes (559 MB, 534 MiB) copied, 502933 s, 1.1 kB/s

Finally I did a manual copy of files from /dev/sdb2 /dev/sdc2 with rsync and part of the files managed to be succesfully copied, about 55Gigabytes out of 110 managed to copy.  Luckily the data on the broken Intel 320 Series 120GB was not top secret stuff so wasting some bits wasn't the end of the world 🙂

Next, I've removed the broken 120Gb SSD which perhaps was about at least 9+ years old and attached to the Lenovo ThinkCentre, the new drive and as my dear friend wanted to have Windows again (his computer has Microsoft "Certificate of Authenticity"), e.g. that OEM Registration Serial Key for Windows 7.

Lenovo-ThinkCentre-M90p-certificate-of-authenticity

I've jumped in and used some old Flash USB Stick Drive to place again Windows 7 (in order to use the same active license) and from there on, I've used another old Windows 10 Installation Bootable stick of mine to upgrade the Windows 7 to Windows 10 (by using this Win 7 to Win 10 upgrade trick it is possible to still continue use your old Windows 7 License Key on Windows 10). So far so good, now I've had Windows 10 Professional Edition installed on the machine, but faced another issue the Memory of the Machine which is 8GB did not get fully detected the machine had detected only 3.22 GB of Memory, for some weird reason.

only-2-80-gb-usable-windows-10-problem-32-bit-cpu-cause-screenshot

After few minutes of investigation online, I've realized, I've installed by mistake a 32 Bit version of Windows 10 Pro…So the next step was of course to upgrade to 64 bit to work around the unrecognized 5.2GB memory… To make sure my Windows 10 Installation is up-to-date I've downloaded the latest one from the Media Creation Installation Tool from Microsoft's website used the tool to burn the Downloaded Image to an Empty USB Stick (mine is 16GB but minimum required would be 4Gb) and proceeded to reboot the Lenovo Desktop machine and boot from the Windows 10 Install Flash Drive. From there on I've had to select I need to install a 64 Bit version of Windows and Skip the Licensing Key fill in Prompt Twice (act as I have no license) as Windows already could recognize the older OEM installed 32 bit install Windows key and automatically fetches the key from there.

Before proceeding to install the 64 Bit Windows, of course double check  that the Machine you have at hand has already the License Key recognized by Microsoft  is 64 Bit capable:

To check 32 bit version of Windows before attempted upgrade is Properly Licensed :

Settings > Update & security > Activation

check-if-windows-is-already-activated-settings-update-and-security-Activation-menus

 

To check whether Hardware is 64 Capable:

Settings -> System -> About

 

is-hardware-processor-64-bit-capable-windows-screenshot

32 bit Windows on x64based processor (Machine supports 64 bit OS)

 

windows10-OS-Installation-media-install-tool

Media Creation Tool Windows 10 MS Installer tool (make sure you select 64-bit (x86) instead of the default

From the Installer, I've installed Windows just like I install a brand new fersh Win OS and after asking the few trivial Installation Program questions landed to the new working OS and proceeded to install the usual software which are a must have on a freshly installed Windows for some of them check my previous article Essential Must have software to install on Fresh  new Windows installation host.

Fix Out of inodes on Postfix Linux Mail Cluster. How to clean up filesystem running out of Inodes, Filesystem inodes on partition is 100% full

Wednesday, August 25th, 2021

Inode_Entry_inode-table-content

Recently we have faced a strange issue with with one of our Clustered Postfix Mail servers (the cluster is with 2 nodes that each has configured Postfix daemon mail servers (running on an OpenVZ virtualized environment).
A heartbeat that checks liveability of clusters and switches nodes in case of one of the two gets broken due to some reason), pretty much a standard SMTP cluster.

So far so good but since the cluster is a kind of abondoned and is pretty much legacy nowadays and used just for some Monitoring emails from different scripts and systems on servers, it was not really checked thoroughfully for years and logically out of sudden the alarming email content sent via the cluster stopped working.

The normal sysadmin job here  was to analyze what is going on with the cluster and fix it ASAP. After some very basic analyzing we catched the problem is caused by a  "inodes full" (100% of available inodes were occupied) problem, e.g. file system run out of inodes on both machines perhaps due to a pengine heartbeat process  bug  leading to producing a high number of .bz2 pengine recovery archive files stored in /var/lib/pengine>

Below are the few steps taken to analyze and fix the problem.
 

1. Finding out about the the system run out of inodes problem


After logging on to system and not finding something immediately is wrong with inodes, all I can see from crm_mon is cluster was broken.
A plenty of emails were left inside the postfix mail queue visible with a standard command

[root@smtp1: ~ ]# postqueue -p

It took me a while to find ot the problem is with inodes because a simple df -h  was showing systems have enough space but still cluster quorum was not complete.
A bit of further investigation led me to a  simple df -i reporting the number of inodes on the local filesystems on both our SMTP1 and SMTP2 got all occupied.

[root@smtp1: ~ ]# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/simfs            500000   500000  0   100% /
none                   65536      61   65475    1% /dev

As you can see the number of inodes on the Virual Machine are unfortunately depleted

Next step was to check directories occupying most inodes, as this is the place from where files could be temporary moved to a remote server filesystem or moved to another partition with space on a server locally attached drives.
Below command gives an ordered list with directories locally under the mail root filesystem / and its respective occupied number files / inodes,
the more files under a directory the more inodes are being occupied by the files on the filesystem.

 

run-out-if-inodes-what-is-inode-find-out-which-filesystem-or-directory-eating-up-all-your-system-inodes-linux_inode_diagram.gif
1.1 Getting which directory consumes most of the inodes on the systems

 

[root@smtp1: ~ ]# { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
….
…..

…….
    586 /usr/lib64/python2.4
    664 /usr/lib64
    671 /usr/share/man/man8
    860 /usr/bin
   1006 /usr/share/man/man1
   1124 /usr/share/man/man3p
   1246 /var/lib/Pegasus/prev_repository_2009-03-10-1236698426.308128000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2009-05-18-1242636104.524113000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2009-11-06-1257494054.380244000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2010-08-04-1280907760.750543000.rpmsave/root#cimv2/classes
   1381 /var/lib/Pegasus/prev_repository_2010-11-15-1289811714.398469000.rpmsave/root#cimv2/classes
   1381 /var/lib/Pegasus/prev_repository_2012-03-19-1332151633.572875000.rpmsave/root#cimv2/classes
   1398 /var/lib/Pegasus/repository/root#cimv2/classes
   1696 /usr/share/man/man3
   400816 /var/lib/pengine

Note, the above command orders the files from bottom to top order and obviosuly the bottleneck directory that is over-eating Filesystem inodes with an exceeding amount of files is
/var/lib/pengine
 

2. Backup old multitude of files just in case of something goes wrong with the cluster after some files are wiped out


The next logical step of course is to check what is going on inside /var/lib/pengine just to find a very ,very large amount of pe-input-*NUMBER*.bz2 files were suddenly produced.

 

[root@smtp1: ~ ]# ls -1 pe-input*.bz2 | wc -l
 400816


The files are produced by the pengine process which is one of the processes that is controlling the heartbeat cluster state, presumably it is done by running process:

[root@smtp1: ~ ]# ps -ef|grep -i pengine
24        5649  5521  0 Aug10 ?        00:00:26 /usr/lib64/heartbeat/pengine


Hence in order to fix the issue, to prevent some inconsistencies in the cluster due to the file deletion,  copied the whole directory to another mounted parition (you can mount it remotely with sshfs for example) or use a local one if you have one:

[root@smtp1: ~ ]# cp -rpf /var/lib/pengine /mnt/attached_storage


and proceeded to clean up some old multitde of files that are older than 2 years of times (720 days):


3. Clean  up /var/lib/pengine files that are older than two years with short loop and find command

 


First I made a list with all the files to be removed in external text file and quickly reviewed it by lessing it like so

[root@smtp1: ~ ]#  cd /var/lib/pengine
[root@smtp1: ~ ]# find . -type f -mtime +720|grep -v pe-error.last | grep -v pe-input.last |grep -v pe-warn.last -fprint /home/myuser/pengine_older_than_720days.txt
[root@smtp1: ~ ]# less /home/myuser/pengine_older_than_720days.txt


Once reviewing commands I've used below command to delete the files you can run below command do delete all older than 2 years that are different from pe-error.last / pe-input.last / pre-warn.last which might be needed for proper cluster operation.

[root@smtp1: ~ ]#  for i in $(find . -type f -mtime +720 -exec echo '{}' \;|grep -v pe-error.last | grep -v pe-input.last |grep -v pe-warn.last); do echo $i; done


Another approach to the situation is to simply review all the files inside /var/lib/pengine and delete files based on year of creation, for example to delete all files in /var/lib/pengine from 2010, you can run something like:
 

[root@smtp1: ~ ]# for i in $(ls -al|grep -i ' 2010 ' | awk '{ print $9 }' |grep -v 'pe-warn.last'); do rm -f $i; done


4. Monitor real time inodes freeing

While doing the clerance of old unnecessery pengine heartbeat archives you can open another ssh console to the server and view how the inodes gets freed up with a command like:

 

# check if inodes is not being rapidly decreased

[root@csmtp1: ~ ]# watch 'df -i'


5. Restart basic Linux services producing pid files and logs etc. to make then workable (some services might not be notified the inodes on the Hard drive are freed up)

Because the hard drive on the system was full some services started to misbehaving and /var/log logging was impacted so I had to also restart them in our case this is the heartbeat itself
that  checks clusters nodes availability as well as the logging daemon service rsyslog

 

# restart rsyslog and heartbeat services
[root@csmtp1: ~ ]# /etc/init.d/heartbeat restart
[root@csmtp1: ~ ]# /etc/init.d/rsyslog restart

The systems had been a data integrity legacy service samhain so I had to restart this service as well to reforce the /var/log/samhain log file to again continusly start writting data to HDD.

# Restart samhain service init script 
[root@csmtp1: ~ ]# /etc/init.d/samhain restart


6. Check up enough inodes are freed up with df

[root@smtp1 log]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/simfs 500000 410531 19469 91% /
none 65536 61 65475 1% /dev


I had to repeat the same process on the second Postfix cluster node smtp2, and after all the steps like below check the status of smtp2 node and the postfix queue, following same procedure made the second smtp2 cluster member as expected 🙂

 

7. Check the cluster node quorum is complete, e.g. postfix cluster is operating normally

 

# Test if email cluster is ok with pacemaker resource cluster manager – lt-crm_mon
 

[root@csmtp1: ~ ]# crm_mon -1
============
Last updated: Tue Aug 10 18:10:48 2021
Stack: Heartbeat
Current DC: smtp2.fqdn.com (bfb3d029-89a8-41f6-a9f0-52d377cacd83) – partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ smtp2.fqdn.com smtp1.fqdn.com ]

failover-ip (ocf::heartbeat:IPaddr2): Started csmtp1.ikossvan.de
Clone Set: postfix_clone
Started: [ smtp2.fqdn.com smtp1fqdn.com ]
Clone Set: pingd_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]
Clone Set: mailto_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]

 

8.  Force resend a few hundred thousands of emails left in the email queue


After some inodes gets freed up due to the file deletion, i've reforced a couple of times the queued mail servers to be immediately resent to remote mail destinations with cmd:

 

# force emails in queue to be resend with postfix

[root@smtp1: ~ ]# sendmail -q


– It was useful to watch in real time how the queued emails are quickly decreased (queued mails are successfully sent to destination addresses) with:

 

# Monitor  the decereasing size of the email queue
[root@smtp1: ~ ]# watch 'postqueue -p|grep -i '@'|wc -l'

OpenVZ enable or disable auto start on Linux Hypervisor host boot for Virtual Machine containers

Wednesday, July 7th, 2021

howto-add-virtual-machine-to-auto-start-with-vz-openvz-linux-containers-4-logo-slogan-vertical-big

To make OpenVZ / Virtuozzo Hypervisor servers and you are not sure whether your configured container virtual machines are configured to automatically boot on Linux Physical OS host boot in case of restart after patch update set or after unexpected shutdown due to Kernel / OS bug a hang or due to some electricity Power outage.

To check what is your current configuration for Virtual Environment on CentOS Linux you need to check inside /etc/sysconfig/vz-scripts/VEID.conf
You need to check the value for inside the file

ONBOOT="" 

To get the exact ID of "VEID.conf of the current openvz guest VM containers exec:

[root@openvz vz-scripts]# vzlist -a
      CTID      NPROC STATUS    IP_ADDR         HOSTNAME
       300         23 running   10.10.10.1     VirtualMachine1
       301         25 running   10.10.10.2     VirtualMachine2

[root@openvz ~]# cd /etc/sysconfig/vz-scripts
[root@gbapp2 vz-scripts]# pwd
/etc/sysconfig/vz-scripts

[root@openvz vz-scripts]# grep -i ONBOOT 300.conf 301.conf
300.conf:ONBOOT="yes"
301.conf:ONBOOT="yes"

If you happen to have configured ONBOOT="no" you will need to the change to respective VEID.conf:

vi /etc/sysconfig/vz-scripts/VEID.conf

search for

ONBOOT=”no”

and change to

ONBOOT=”yes”

OpenVZ_virtuozzo-standard-process-tree-landscape

OpenVZ server process tree. The colors of the virtual severs are indicated by colors.

OpenVZ Quick cheat sheet commands

This change will auto-start the VPS container next time the host Hypervisor node is rebooted.
If you happen to have daily work with OpenVZ legacy systems like I do you might find also useful the following OpenVZ Cheatsheet pdf document.

A miniature quick cheatsheet for OpenVZ Virtualion, in case if you are like me and you have to use various virtualization technologies and tend to forget is as below:

vzlist                               # List running instances
vzlist -a                            # List all instances

 

vzctl stop <instance>
vzctl start <instance>
vzctl status <instance>

vzctl exec <instance> <command>      # Run a command

vzctl enter <instance>               # Get console

vzyum <instance> install <package>   # Install a package


# Change properties
vzctl set <instance> –hostname <hostname> –save
vzctl set <instance> –ipadd <IP> –save
vzctl set <instance> –userpasswd root:<password> –save

If need to get more insight on how OpenVZ Virtualization does work on a low level and stretch out its possibilities, an old but useful document you might want to check is OpenVZ-Users-Guide PDF.


If you need it to hava e copy of it openvz_cheat_sheet.txt.

ASCII Art studio – A powerful ASCII art editor for Windows / Playscii a cool looking text editor for Linux

Monday, June 28th, 2021

This post is just informative for Text Geeks who are in love with ASCII Art, it is a bit of rant as I will say nothing new, but I thought it might be of interest to some console maniac out there 🙂

ascii art studio aas program windows xp professional drawing program screenshot

While checking stuff on Internet I've stumbled on interesting ASCII arts freak software – >ASCII Art Studio. ASCII Art Studio is unfortunately needs licensing is not Free Software. But anyways, for anyone willing to draw pro ASCII art pictures it is a must see. Check it out;

Isn't it like a Plain Text pro Photoshop ? 🙂 Its a pity we don't have a Linux / BSD Release of this wonderful piece of software. I've tried with WINE (Windows Emulator) on Linux to make the Ascii Art Studio work but that was a fail. It seems only way to make it work is have Windows as a worst case install a Virtual Machine with VirtualBox / Vmware and run it inside if you don't have a Windows PC at hand.

Of course there are stuff on Linux to ascii art edit you can use if you want to have a native software to edit ASCIIs such as Playscii. Unfortunately Playscii is not an easy one to install and the software doesn't have a prepared rpm or deb binary you can easily roll on the OS and you have to manually build all required python modules and have a working version of python3 to be able to make it work.

I did not have much time to test to install it and since I faced issues with plascii install I just abandoned it. If some geek has some more time anyways I guess it is worse to give it a try below is 2 screenshots from PLAYSCII official download page. 

playscii_shot1-official.

As you see authors of the open source playscii whose source is available via github choose to have an amazing looking ascii art text menus, though for daily ASCII art editing it is perhaps much more complicated to use than the simlistic ASCII Art Studio

playscii_shot2-official

There is other stuff for Linux to do ASCII Art files text edit like:
JaVE (this one I don't personally like because it is Java Based),  Ascii Art Maker or Pablow Draw Linux (unfortunately this 2 ones are proprietary).

Christ is Risen – Truly He is Risen – Happy Easter !

Tuesday, May 4th, 2021

admin

One more year the Holy Fire has descended and we have been blessed to great Each other with the All Joyful Paschal Greeting !

Христос възкресе! Воистина възкресе! (Hristos vozkrese! Voistina vozkrese!)


Христос Воскресе – Воистину Воскресе! Христос възкресе! Воистина възкресе! (Hristos vozkrese! Voistina vozkrese!)
Христос васкрсе! Ваистину васкрсе! (Hristos vaskrse! Vaistinu vaskrse!)
Χριστὸς ἀνέστη! Ἀληθῶς ἀνέστη! 

Christus ist auferstanden! Er ist wahrhaftig auferstanden!Christ is Risen ! Truly He is Risen !

For complete list of Paschal Greeting as a referrence to get idea how other weird languages sound like and how it is used in the major Eastern Orthodox Churches all around the world check out my previous article Christ is Risen Eastern Orthodox Resurrection Paschal Greeting in Different Languages

Kh