Posts Tagged ‘check’

Configure aide file integrity check server monitoring in Zabbix to track for file changes on servers

Tuesday, March 28th, 2023

zabbix-aide-log-monitoring-logo

Earlier I've written a small article on how to setup AIDE monitoring for Server File integrity check on Linux, which put the basics on how this handy software to improve your server overall Security software can be installed and setup without much hassle.

Once AIDE is setup and a preset custom configuration is prepared for AIDE it is pretty useful to configure AIDE to monitor its critical file changes for better server security by monitoring the AIDE log output for new record occurs with Zabbix. Usually if no files monitored by AIDE are modified on the machine, the log size will not grow, but if some file is modified once Advanced Linux Intrusion Detecting (aide) binary runs via the scheduled Cron job, the /var/log/app_aide.log file will grow zabbix-agentd will continuously check the file for size increases and will react.

Before setting up the Zabbix required Template, you will have to set few small scripts that will be reading a preconfigured list of binaries or application files etc. that aide will monitor lets say via /etc/aide-custom.conf
 

1. Configure aide to monitor files for changes


Before running aide, it is a good idea to prepare a file with custom defined directories and files that you plan to monitor for integrity checking e.g. future changes with aide, for example to capture bad intruders who breaks into server which runs aide and modifies critical files such as /etc/passwd /etc/shadow /etc/group or / /usr/local/etc/* or /var/* / /usr/* critical files that shouldn't be allowed to change without the admin to soon find out.

# cat /etc/aide-custom.conf

# Example configuration file for AIDE.
@@define DBDIR /var/lib/aide
@@define LOGDIR /var/log/aide
# The location of the database to be read.
database=file:@@{DBDIR}/app_custom.db.gz
database_out=file:@@{DBDIR}/app_aide.db.new.gz
gzip_dbout=yes
verbose=5

report_url=file:@@{LOGDIR}/app_custom.log
#report_url=syslog:LOG_LOCAL2
#report_url=stderr
#NOT IMPLEMENTED report_url=mailto:root@foo.com
#NOT IMPLEMENTED report_url=syslog:LOG_AUTH

# These are the default rules.
#
#p:      permissions
#i:      inode:
#n:      number of links
#u:      user
#g:      group
#s:      size
#b:      block count
#m:      mtime
#a:      atime
#c:      ctime
#S:      check for growing size
#acl:           Access Control Lists
#selinux        SELinux security context
#xattrs:        Extended file attributes
#md5:    md5 checksum
#sha1:   sha1 checksum
#sha256:        sha256 checksum
#sha512:        sha512 checksum
#rmd160: rmd160 checksum
#tiger:  tiger checksum

#haval:  haval checksum (MHASH only)
#gost:   gost checksum (MHASH only)
#crc32:  crc32 checksum (MHASH only)
#whirlpool:     whirlpool checksum (MHASH only)

FIPSR = p+i+n+u+g+s+m+c+acl+selinux+xattrs+sha256

#R:             p+i+n+u+g+s+m+c+acl+selinux+xattrs+md5
#L:             p+i+n+u+g+acl+selinux+xattrs
#E:             Empty group
#>:             Growing logfile p+u+g+i+n+S+acl+selinux+xattrs

# You can create custom rules like this.
# With MHASH…
# ALLXTRAHASHES = sha1+rmd160+sha256+sha512+whirlpool+tiger+haval+gost+crc32
ALLXTRAHASHES = sha1+rmd160+sha256+sha512+tiger
# Everything but access time (Ie. all changes)
EVERYTHING = R+ALLXTRAHASHES

# Sane, with multiple hashes
# NORMAL = R+rmd160+sha256+whirlpool
NORMAL = FIPSR+sha512

# For directories, don't bother doing hashes
DIR = p+i+n+u+g+acl+selinux+xattrs

# Access control only
PERMS = p+i+u+g+acl+selinux

# Logfile are special, in that they often change
LOG = >

# Just do sha256 and sha512 hashes
LSPP = FIPSR+sha512

# Some files get updated automatically, so the inode/ctime/mtime change
# but we want to know when the data inside them changes
DATAONLY =  p+n+u+g+s+acl+selinux+xattrs+sha256

##############TOUPDATE
#To delegate to app team create a file like /app/aide.conf
#and uncomment the following line
#@@include /app/aide.conf
#Then remove all the following lines
/etc/zabbix/scripts/check.sh FIPSR
/etc/zabbix/zabbix_agentd.conf FIPSR
/etc/sudoers FIPSR
/etc/hosts FIPSR
/etc/keepalived/keepalived.conf FIPSR
# monitor haproxy.cfg
/etc/haproxy/haproxy.cfg FIPSR
# monitor keepalived
/home/keepalived/.ssh/id_rsa FIPSR
/home/keepalived/.ssh/id_rsa.pub FIPSR
/home/keepalived/.ssh/authorized_keys FIPSR

/usr/local/bin/script_to_run.sh FIPSR
/usr/local/bin/another_script_to_monitor_for_changes FIPSR

#  cat /usr/local/bin/aide-config-check.sh
#!/bin/bash
/sbin/aide -c /etc/aide-custom.conf -D

# cat /usr/local/bin/aide-init.sh
#!/bin/bash
/sbin/aide -c /etc/custom-aide.conf -B database_out=file:/var/lib/aide/custom-aide.db.gz -i

 

# cat /usr/local/bin/aide-check.sh

#!/bin/bash
/sbin/aide -c /etc/custom-aide.conf -Breport_url=stdout -B database=file:/var/lib/aide/custom-aide.db.gz -C|/bin/tee -a /var/log/aide/custom-aide-check.log|/bin/logger -t custom-aide-check-report
/usr/local/bin/aide-init.sh

 

# cat /usr/local/bin/aide_app_cron_daily.txt

#!/bin/bash
#If first time, we need to init the DB
if [ ! -f /var/lib/aide/app_aide.db.gz ]
   then
    logger -p local2.info -t app-aide-check-report  "Generating NEW AIDE DATABASE for APPLICATION"
    nice -n 18 /sbin/aide –init -c /etc/aide_custom.conf
    mv /var/lib/aide/app_aide.db.new.gz /var/lib/aide/app_aide.db.gz
fi

nice -n 18 /sbin/aide –update -c /etc/aide_app.conf
#since the option for syslog seems not fully implemented we need to push logs via logger
/bin/logger -f /var/log/aide/app_aide.log -p local2.info -t app-aide-check-report
#Acknoledge the new database as the primary (every results are sended to syslog anyway)
mv /var/lib/aide/app_aide.db.new.gz /var/lib/aide/app_aide.db.gz

What above cron job does is pretty simple, as you can read it yourself. If the configuration predefined aide database store file /var/lib/aide/app_aide.db.gz, does not
exists aide will create its fresh empty database and generate a report for all predefined files with respective checksums to be stored as a comparison baseline for file changes. 

Next there is a line to write aide file changes via rsyslog through the logger and local2.info handler


2. Setup Zabbix Template with Items, Triggers and set Action

2.1 Create new Template and name it YourAppName APP-LB File integrity Check

aide-itengrity-check-zabbix_ Configuration of templates

Then setup the required Items, that will be using zabbix's Skip embedded function to scan file in a predefined period of file, this is done by the zabbix-agent that is
supposed to run on the server.

2.2 Configure Item like

aide-zabbix-triggers-screenshot
 

*Name: check aide log file

Type: zabbix (active)

log[/var/log/aide/app_aide.log,^File.*,,,skip]

Type of information: Log

Update Interval: 30s

Applications: File Integrity Check

Configure Trigger like

Enabled: Tick On

images/aide-zabbix-screenshots/check-aide-log-item


2.3 Create Triggers with the respective regular expressions, that would check the aide generated log file for file modifications


aide-zabbix-screenshot-minor-config

Configure Trigger like
 

Enabled: Tick On


*Name: Someone modified {{ITEM.VALUE}.regsub("(.*)", \1)}

*Expression: {PROD APP-LB File Integrity Check:log[/var/log/aide/app_aide.log,^File.*,,,skip].strlen()}>=1

Allow manual close: yes tick

*Description: Someone modified {{ITEM.VALUE}.regsub("(.*)", \1)} on {HOST.NAME}

 

2.4 Configure Action

 

aide-zabbix-file-monitoring-action-screensho

Now assuming the Zabbix server has  a properly set media for communication and you set Alerting rules zabbix-server can be easily set tosend mails to a Support email to get Notifications Alerts, everytime a monitored file by aide gets changed.

That's all folks ! Enjoy being notified on every file change on your servers  !
 

How to update expiring OpenSSL certificates without downtime on haproxy Pacemaker / Corosync PCS Cluster

Tuesday, July 19th, 2022

pcm-active-passive-scheme-corosync-pacemaker-openssl-renew-fix-certificate

Lets say you have a running PCS Haproxy cluster with 2 nodes and you have already a configuration in haproxy with a running VIP IP and this proxies
are tunneling traffic to a webserver such as Apache or directly to an Application and you end up in the situation where the configured certificates,
are about to expire soon. As you can guess having the cluster online makes replacing the old expiring SSL certificate with a new one relatively easy
task. But still there are a couple of steps to follow which seems easy but systemizing them and typing them down takes some time and effort.
In short you need to check the current certificates installed on the haproxy inside the Haproxy configuration files,
in my case the haproxy cluster was running 2 haproxy configs haproxyprod.cfg and haproxyqa.cfg and the certificates configured are places inside this
configuration.

Hence to do the certificate update, I had to follow few steps:

A. Find the old certificate key or generate a new one that will be used later together with the CSR (Certificate Request File) to generate the new Secure Socket Layer
certificate pair.
B. Either use the old .CSR (this is usually placed inside the old .CRT certificate file) or generate a new one
C. Copy those .CSR file to the Copy / Paste buffer and place it in the Website field on the step to fill in a CSR for the new certificate on the Domain registrer
such as NameCheap / GoDaddy / BlueHost / Entrust etc.
D. Registrar should then be able to generate files like the the new ServerCertificate.crt, Public Key Root Certificate Authority etc.
E. You should copy and store these files in some database for future perhaps inside some database such as .xdb
for example you can se the X – Certificate and Key management xca (google for xca download).
F. Copy this certificate and place it on the top of the old .crt file that is configured on the haproxies for each domain for which you have configured it on node2
G. standby node1 so the cluster sends the haproxy traffic to node2 (where you should already have the new configured certificate)
H. Prepare the .crt file used by haproxy by including the new ServerCertificate.crt content on top of the file on node1 as well
I. unstandby node1
J. Check in browser by accessing the URL the certificate is the new one based on the new expiry date that should be extended in future
K. Check the status of haproxy
L. If necessery check /var/log/haproxy.log on both clusters to check all works as expected

haserver_cluster_sample

Below are the overall commands to use to complete below jobs

Old extracted keys and crt files are located under /home/username/new-certs

1. Check certificate expiry start / end dates


[root@haproxy-serv01 certs]# openssl s_client -connect 10.40.18.88:443 2>/dev/null| openssl x509 -noout -enddate
notAfter=Aug 12 12:00:00 2022 GMT

2. Find Certificate location taken from /etc/haproxy/haproxyprod.cfg / /etc/haproxy/haproxyqa.cfg

# from Prod .cfg
   bind 10.40.18.88:443 ssl crt /etc/haproxy/certs/www.your-domain.com.crt ca-file /etc/haproxy/certs/ccnr-ca-prod.crt 
 

# from QA .cfg

    bind 10.50.18.87:443 ssl crt /etc/haproxy/certs/test.your-domain.com.crt ca-file /etc/haproxy/certs

3. Check  CRT cert expiry


# for haproxy-serv02 qa :443 listeners

[root@haproxy-serv01 certs]# openssl s_client -connect 10.50.18.87:443 2>/dev/null| openssl x509 -noout -enddate 
notAfter=Dec  9 13:24:00 2029 GMT

 

[root@haproxy-serv01 certs]# openssl x509 -enddate -noout -in /etc/haproxy/certs/www.your-domain.com.crt
notAfter=Aug 12 12:00:00 2022 GMT

[root@haproxy-serv01 certs]# openssl x509 -noout -dates -in /etc/haproxy/certs/www.your-domain.com.crt 
notBefore=May 13 00:00:00 2020 GMT
notAfter=Aug 12 12:00:00 2022 GMT


[root@haproxy-serv01 certs]# openssl x509 -noout -dates -in /etc/haproxy/certs/other-domain.your-domain.com.crt 
notBefore=Dec  6 13:52:00 2019 GMT
notAfter=Dec  9 13:52:00 2022 GMT

4. Check public website cert expiry in a Chrome / Firefox or Opera browser

In a Chrome browser go to updated URLs:

https://www.your-domain/login

https://test.your-domain/login

https://other-domain.your-domain/login

and check the certs

5. Login to one of haproxy nodes haproxy-serv02 or haproxy-serv01

Check what crm_mon (the cluster resource manager) reports of the consistancy of cluster and the belonging members
you should get some output similar to below:

[root@haproxy-serv01 certs]# crm_mon
Stack: corosync
Current DC: haproxy-serv01 (version 1.1.23-1.el7_9.1-9acf116022) – partition with quorum
Last updated: Fri Jul 15 16:39:17 2022
Last change: Thu Jul 14 17:36:17 2022 by root via cibadmin on haproxy-serv01

2 nodes configured
6 resource instances configured

Online: [ haproxy-serv01 haproxy-serv02 ]

Active resources:

 ccnrprodlbvip  (ocf::heartbeat:IPaddr2):       Started haproxy-serv01
 ccnrqalbvip    (ocf::heartbeat:IPaddr2):       Started haproxy-serv01
 Clone Set: haproxyqa-clone [haproxyqa]
     Started: [ haproxy-serv01 haproxy-serv02 ]
 Clone Set: haproxyprod-clone [haproxyprod]
     Started: [ haproxy-serv01 haproxy-serv02 ]


6. Create backup of existing certificates before proceeding to regenerate expiring
On both haproxy-serv01 / haproxy-serv02 run:

 

# cp -vrpf /etc/haproxy/certs/ /home/username/etc-haproxy-certs_bak_$(date +%d_%y_%m)/


7. Find the .key file etract it from latest version of file CCNR-Certificates-DB.xdb

Extract passes from XCA cert manager (if you're already using XCA if not take the certificate from keypass or wherever you have stored it.

+ For XCA cert manager ccnrlb pass
Find the location of the certificate inside the .xdb place etc.

+++++ www.your-domain.com.key file +++++

—–BEGIN PUBLIC KEY—–

—–END PUBLIC KEY—–


# Extracted from old file /etc/haproxy/certs/www.your-domain.com.crt
 

—–BEGIN RSA PRIVATE KEY—–

—–END RSA PRIVATE KEY—–


+++++

8. Renew Generate CSR out of RSA PRIV KEY and .CRT

[root@haproxy-serv01 certs]# openssl x509 -noout -fingerprint -sha256 -inform pem -in www.your-domain.com.crt
SHA256 Fingerprint=24:F2:04:F0:3D:00:17:84:BE:EC:BB:54:85:52:B7:AC:63:FD:E4:1E:17:6B:43:DF:19:EA:F4:99:L3:18:A6:CD

# for haproxy-serv01 prod :443 listeners

[root@haproxy-serv02 certs]# openssl x509 -x509toreq -in www.your-domain.com.crt -out www.your-domain.com.csr -signkey www.your-domain.com.key


9. Move (Standby) traffic from haproxy-serv01 to ccnrl0b2 to test cert works fine

[root@haproxy-serv01 certs]# pcs cluster standby haproxy-serv01


10. Proceed the same steps on haproxy-serv01 and if ok unstandby

[root@haproxy-serv01 certs]# pcs cluster unstandby haproxy-serv01


11. Check all is fine with openssl client with new certificate


Check Root-Chain certificates:

# openssl verify -verbose -x509_strict -CAfile /etc/haproxy/certs/ccnr-ca-prod.crt -CApath  /etc/haproxy/certs/other-domain.your-domain.com.crt{.pem?)
/etc/haproxy/certs/other-domain.your-domain.com.crt: OK

# openssl verify -verbose -x509_strict -CAfile /etc/haproxy/certs/thawte-ca.crt -CApath  /etc/haproxy/certs/www.your-domain.com.crt
/etc/haproxy/certs/www.your-domain.com.crt: OK

################# For other-domain.your-domain.com.crt ##############
Do the same

12. Check cert expiry on /etc/haproxy/certs/other-domain.your-domain.com.crt

# for haproxy-serv02 qa :15443 listeners
[root@haproxy-serv01 certs]# openssl s_client -connect 10.40.18.88:15443 2>/dev/null| openssl x509 -noout -enddate
notAfter=Dec  9 13:52:00 2022 GMT

[root@haproxy-serv01 certs]#  openssl x509 -enddate -noout -in /etc/haproxy/certs/other-domain.your-domain.com.crt 
notAfter=Dec  9 13:52:00 2022 GMT


Check also for 
+++++ other-domain.your-domain.com..key file +++++
 

—–BEGIN PUBLIC KEY—–

—–END PUBLIC KEY—–

 


# Extracted from /etc/haproxy/certs/other-domain.your-domain.com.crt
 

—–BEGIN RSA PRIVATE KEY—–

—–END RSA PRIVATE KEY—–


+++++

13. Standby haproxy-serv01 node 1

[root@haproxy-serv01 certs]# pcs cluster standby haproxy-serv01

14. Renew Generate CSR out of RSA PRIV KEY and .CRT for second domain other-domain.your-domain.com

# for haproxy-serv01 prod :443 renew listeners
[root@haproxy-serv02 certs]# openssl x509 -x509toreq -in other-domain.your-domain.com.crt  -out domain-certificate.com.csr -signkey domain-certificate.com.key


And repeat the same steps e.g. fill the CSR inside the domain registrer and get the certificate and move to the proxy, check the fingerprint if necessery
 

[root@haproxy-serv01 certs]# openssl x509 -noout -fingerprint -sha256 -inform pem -in other-domain.your-domain.com.crt
SHA256 Fingerprint=60:B5:F0:14:38:F0:1C:51:7D:FD:4D:C1:72:EA:ED:E7:74:CA:53:A9:00:C6:F1:EB:B9:5A:A6:86:73:0A:32:8D


15. Check private key's SHA256 checksum

# openssl pkey -in terminals-priv.KEY -pubout -outform pem | sha256sum
# openssl x509 -in other-domain.your-domain.com.crt -pubkey -noout -outform pem | sha256sum

# openssl pkey -in  www.your-domain.com.crt-priv-KEY -pubout -outform pem | sha256sum

# openssl x509 -in  www.your-domain.com.crt -pubkey -noout -outform pem | sha256sum


16. Check haproxy config is okay before reload cert


# haproxy -c -V -f /etc/haproxy/haproxyprod.cfg
Configuration file is valid


# haproxy -c -V -f /etc/haproxy/haproxyqa.cfg
Configuration file is valid

Good so next we can the output of status of certificate

17.Check old certificates are reachable via VIP IP address

Considering that the cluster VIP Address is lets say 10.40.18.88 and running one of the both nodes cluster to check it do something like:
 

# curl -vvI https://10.40.18.88:443|grep -Ei 'start date|expire date'


As output you should get the old certificate


18. Reload Haproxies for Prod and QA on node1 and node2

You can reload the haproxy clusters processes gracefully something similar to kill -HUP but without loosing most of the current established connections with below cmds:

Login on node1 (haproxy-serv01) do:

# /usr/sbin/haproxy -f /etc/haproxy/haproxyprod.cfg -D -p /var/run/haproxyprod.pid  -sf $(cat /var/run/haproxyprod.pid)
# /usr/sbin/haproxy -f /etc/haproxy/haproxyqa.cfg -D -p /var/run/haproxyqa.pid  -sf $(cat /var/run/haproxyqa.pid)

repeat the same commands on haproxy-serv02 host

19.Check new certificates online and the the haproxy logs

# curl -vvI https://10.50.18.88:443|grep -Ei 'start date|expire date'

*       start date: Jul 15 08:19:46 2022 GMT
*       expire date: Jul 15 08:19:46 2025 GMT


You should get the new certificates Issueing start date and expiry date.

On both nodes (if necessery) do:

# tail -f /var/log/haproxy.log

How to RPM update Hypervisors and Virtual Machines running Haproxy High Availability cluster on KVM, Virtuozzo without a downtime on RHEL / CentOS Linux

Friday, May 20th, 2022

virtuozzo-kvm-virtual-machines-and-hypervisor-update-manual-haproxy-logo


Here is the scenario, lets say you have on your daily task list two Hypervisor (HV) hosts running CentOS or RHEL Linux with KVM or Virutozzo technology and inside the HV hosts you have configured at least 2 pairs of virtual machines one residing on HV Host 1 and one residing on HV Host 2 and you need to constantly keep the hosts to the latest distribution major release security patchset.

The Virtual Machines has been running another set of Redhat Linux or CentOS configured to work in a High Availability Cluster running Haproxy / Apache / Postfix or any other kind of HA solution on top of corosync / keepalived or whatever application cluster scripts Free or Open Source technology that supports a switch between clustered Application nodes.

The logical question comes how to keep up the CentOS / RHEL Machines uptodate without interfering with the operations of the Applications running on the cluster?

Assuming that the 2 or more machines are configured to run in Active / Passive App member mode, e.g. one machine is Active at any time and the other is always Passive, a switch is possible between the Active and Passive node.

HAProxy--Load-Balancer-cluster-2-nodes-your-Servers

In this article I'll give a simple step by step tested example on how you I succeeded to update (for security reasons) up to the latest available Distribution major release patchset on one by one first the Clustered App on Virtual Machines 1 and VM2 on Linux Hypervisor Host 1. Then the App cluster VM 1 / VM 2 on Hypervisor Host 2.
And finally update the Hypervisor1 (after moving the Active resources from it to Hypervisor2) and updating the Hypervisor2 after moving the App running resources back on HV1.
I know the procedure is a bit monotonic but it tries to go through everything step by step to try to mitigate any possible problems. In case of failure of some rpm dependencies during yum / dnf tool updates you can always revert to backups so in anyways don't forget to have a fully functional backup of each of the HV hosts and the VMs somewhere on a separate machine before proceeding further, any possible failures due to following my aritcle literally is your responsibility 🙂

 

0. Check situation before the update on HVs / get VM IDs etc.

Check the virsion of each of the machines to be updated both Hypervisor and Hosted VMs, on each machine run:
 

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)


The machine setup I'll be dealing with is as follows:
 

hypervisor-host1 -> hypervisor-host1.fqdn.com 
•    virt-mach-centos1
•    virt-machine-zabbix-proxy-centos (zabbix proxy)

hypervisor-host2 -> hypervisor-host2.fqdn.com
•    virt-mach-centos2
•    virt-machine-zabbix2-proxy-centos (zabbix proxy)

To check what is yours check out with virsh cmd –if on KVM or with prlctl if using Virutozzo, you should get something like:

[root@hypervisor-host2 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host1 running
 2 virt-mach-centos2 running

 # virsh list –all

[root@hypervisor-host1 ~]# virsh list
 Id Name State
—————————————————-
 1 vm-host2 running
 3 virt-mach-centos1 running

[root@hypervisor-host1 ~]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{dc37c201-08c9-589d-aa20-9386d63ce3f3}  running      –               VM virt-mach-centos1
{76e8a5f8-caa8-5442-830e-aa4bfe8d42d9}  running      –               VM vm-host2
[root@hypervisor-host1 ~]#

If you have stopped VMs with Virtuozzo to list the stopped ones as well.
 

# prlctl list -a

[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# vzlist -a
                                CTID      NPROC STATUS    IP_ADDR         HOSTNAME
[root@hypervisor-host2 74a7bbe8-9245-5385-ac0d-d10299100789]# prlctl list
UUID                                    STATUS       IP_ADDR         T  NAME
{92075803-a4ce-5ec0-a3d8-9ee83d85fc76}  running      –               VM virt-mach-centos2
{74a7bbe8-9245-5385-ac0d-d10299100789}  running      –               VM vm-host1

# prlctl list -a


If due to Virtuozzo version above command does not return you can manually check in the VM located folder, VM ID etc.
 

[root@hypervisor-host2 vmprivate]# ls
74a7bbe8-9245-4385-ac0d-d10299100789  92075803-a4ce-4ec0-a3d8-9ee83d85fc76
[root@hypervisor-host2 vmprivate]# pwd
/vz/vmprivate
[root@hypervisor-host2 vmprivate]#


[root@hypervisor-host1 ~]# ls -al /vz/vmprivate/
total 20
drwxr-x—. 5 root root 4096 Feb 14  2019 .
drwxr-xr-x. 7 root root 4096 Feb 13  2019 ..
drwxr-x–x. 4 root root 4096 Feb 18  2019 1c863dfc-1deb-493c-820f-3005a0457627
drwxr-x–x. 4 root root 4096 Feb 14  2019 76e8a5f8-caa8-4442-830e-aa4bfe8d42d9
drwxr-x–x. 4 root root 4096 Feb 14  2019 dc37c201-08c9-489d-aa20-9386d63ce3f3
[root@hypervisor-host1 ~]#


Before doing anything with the VMs, also don't forget to check the Hypervisor hosts has enough space, otherwise you'll get in big troubles !
 

[root@hypervisor-host2 vmprivate]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/centos_hypervisor-host2-root   20G  1.8G   17G  10% /
devtmpfs                          20G     0   20G   0% /dev
tmpfs                             20G     0   20G   0% /dev/shm
tmpfs                             20G  2.0G   18G  11% /run
tmpfs                             20G     0   20G   0% /sys/fs/cgroup
/dev/sda1                        992M  159M  766M  18% /boot
/dev/mapper/centos_hypervisor-host2-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos_hypervisor-host2-var   9.8G  355M  8.9G   4% /var
/dev/mapper/centos_hypervisor-host2-vz    755G   25G  692G   4% /vz

 

[root@hypervisor-host1 ~]# df -h
Filesystem               Size  Used Avail Use% Mounted on
/dev/mapper/centos-root   50G  1.8G   45G   4% /
devtmpfs                  20G     0   20G   0% /dev
tmpfs                     20G     0   20G   0% /dev/shm
tmpfs                     20G  2.1G   18G  11% /run
tmpfs                     20G     0   20G   0% /sys/fs/cgroup
/dev/sda2                992M  153M  772M  17% /boot
/dev/mapper/centos-home  9.8G   37M  9.2G   1% /home
/dev/mapper/centos-var   9.8G  406M  8.9G   5% /var
/dev/mapper/centos-vz    689G   12G  643G   2% /vz

Another thing to do before proceeding with update is to check and tune if needed the amount of CentOS repositories used, before doing anything with yum.
 

[root@hypervisor-host2 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 110 root root 12288 Oct  7 11:13 ..
-rw-r–r–.   1 root root  4382 Mar 14  2019 CentOS7.repo
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo
[root@hypervisor-host2 yum.repos.d]#

 

[root@hypervisor-host1 yum.repos.d]# ls -al
total 68
drwxr-xr-x.   2 root root  4096 Oct  6 13:13 .
drwxr-xr-x. 112 root root 12288 Oct  7 11:09 ..
-rw-r–r–.   1 root root  1664 Sep  5  2019 CentOS-Base.repo
-rw-r–r–.   1 root root  1309 Sep  5  2019 CentOS-CR.repo
-rw-r–r–.   1 root root   649 Sep  5  2019 CentOS-Debuginfo.repo
-rw-r–r–.   1 root root   314 Sep  5  2019 CentOS-fasttrack.repo
-rw-r–r–.   1 root root   630 Sep  5  2019 CentOS-Media.repo
-rw-r–r–.   1 root root  1331 Sep  5  2019 CentOS-Sources.repo
-rw-r–r–.   1 root root  6639 Sep  5  2019 CentOS-Vault.repo
-rw-r–r–.   1 root root  1303 Mar 14  2019 factory.repo
-rw-r–r–.   1 root root   300 Mar 14  2019 obsoleted_tmpls.repo
-rw-r–r–.   1 root root   666 Sep  8 10:13 openvz.repo


1. Dump VM definition XMs (to have it in case if it gets wiped during update)

There is always a possibility that something will fail during the update and you might be unable to restore back to the old version of the Virtual Machine due to some config misconfigurations or whatever thus a very good idea, before proceeding to modify the working VMs is to use KVM's virsh and dump the exact set of XML configuration that makes the VM roll properly.

To do so:
Check a little bit up in the article how we have listed the IDs that are part of the directory containing the VM.
 

[root@hypervisor-host1 ]# virsh dumpxml (Id of VM virt-mach-centos1 ) > /root/virt-mach-centos1_config_bak.xml
[root@hypervisor-host2 ]# virsh dumpxml (Id of VM virt-mach-centos2) > /root/virt-mach-centos2_config_bak.xml

 


2. Set on standby virt-mach-centos1 (virt-mach-centos1)

As I'm upgrading two machines that are configured to run an haproxy corosync cluster, before proceeding to update the active host, we have to switch off
the proxied traffic from node1 to node2, – e.g. standby the active node, so the cluster can move up the traffic to other available node.
 

[root@virt-mach-centos1 ~]# pcs cluster standby virt-mach-centos1


3. Stop VM virt-mach-centos1 & backup on Hypervisor host (hypervisor-host1) for VM1

Another prevention step to make sure you don't get into damaged VM or broken haproxy cluster after the upgrade is to of course backup 

 

[root@hypervisor-host1 ]# prlctl backup virt-mach-centos1

or
 

[root@hypervisor-host1 ]# prlctl stop virt-mach-centos1
[root@hypervisor-host1 ]# cp -rpf /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3 /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3-bak
[root@hypervisor-host1 ]# tar -czvf virt-mach-centos1_vm_virt-mach-centos1.tar.gz /vz/vmprivate/dc37c201-08c9-489d-aa20-9386d63ce3f3

[root@hypervisor-host1 ]# prlctl start virt-mach-centos1


4. Remove package version locks on all hosts

If you're using package locking to prevent some other colleague to not accidently upgrade the machine (if multiple sysadmins are managing the host), you might use the RPM package locking meachanism, if that is used check RPM packs that are locked and release the locking.

+ List actual list of locked packages

[root@hypervisor-host1 ]# yum versionlock list  

…..
0:libtalloc-2.1.16-1.el7.*
0:libedit-3.0-12.20121213cvs.el7.*
0:p11-kit-trust-0.23.5-3.el7.*
1:quota-nls-4.01-19.el7.*
0:perl-Exporter-5.68-3.el7.*
0:sudo-1.8.23-9.el7.*
0:libxslt-1.1.28-5.el7.*
versionlock list done
                          

+ Clear the locking            

# yum versionlock clear                               


+ List actual list / == clear all entries
 

[root@virt-mach-centos2 ]# yum versionlock list; yum versionlock clear
[root@virt-mach-centos1 ]# yum versionlock list; yum versionlock clear
[root@hypervisor-host1 ~]# yum versionlock list; yum versionlock clear
[root@hypervisor-host2 ~]# yum versionlock list; yum versionlock clear


5. Do yum update virt-mach-centos1


For some clarity if something goes wrong, it is really a good idea to make a dump of the basic packages installed before the RPM package update is initiated,
The exact versoin of RHEL or CentOS as well as the list of locked packages, if locking is used.

Enter virt-mach-centos1 (ssh virt-mach-centos1) and run following cmds:
 

# cat /etc/redhat-release  > /root/logs/redhat-release-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


+ Only if needed!!
 

# yum versionlock clear
# yum versionlock list


Clear any previous RPM packages – careful with that as you might want to keep the old RPMs, if unsure comment out below line
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

Proceed with the update and monitor closely the output of commands and log out everything inside files using a small script that you should place under /root/status the script is given at the end of the aritcle.:
 

yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
yum check-update | wc -l
yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

 

6. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


7. Stop VM virt-mach-centos2 & backup  on Hypervisor host (hypervisor-host2)

Same backup step as prior 

# prlctl backup virt-mach-centos2


or
 

# prlctl stop virt-mach-centos2
# cp -rpf /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76 /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76-bak
## tar -czvf virt-mach-centos2_vm_virt-mach-centos2.tar.gz /vz/vmprivate/92075803-a4ce-4ec0-a3d8-9ee83d85fc76

# prctl start virt-mach-centos2


8. Do yum update on virt-mach-centos2

Log system state, before the update
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list

 

Clean old install update / packages if required
 

# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Initiate the update

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l 
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


9. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now

 

10. Stop VM vm-host2 & backup
 

# prlctl backup vm-host2


or

# prlctl stop vm-host2

Or copy the actual directory containig the Virtozzo VM (use the correct ID)
 

# cp -rpf /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9 /vz/vmprivate/76e8a5f8-caa8-5442-830e-aa4bfe8d42d9-bak
## tar -czvf vm-host2.tar.gz /vz/vmprivate/76e8a5f8-caa8-4442-830e-aa5bfe8d42d9

# prctl start vm-host2


11. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Clear only if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Do the rpm upgrade

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


12. Check if everything is running fine after upgrade
 

Reboot VM
 

# shutdown -r now


13. Do yum update hypervisor-host2

 

 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock   if needed

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


Update rpms
 

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out 2>&1
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


14. Stop VM vm-host1 & backup


Some as ealier
 

# prlctl backup vm-host1

or
 

# prlctl stop vm-host1

# cp -rpf /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789 /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789-bak
# tar -czvf vm-host1.tar.gz /vz/vmprivate/74a7bbe8-9245-4385-ac0d-d10299100789

# prctl start vm-host1


15. Do yum update vm-host2
 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum versionlock clear == if needed!!
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


16. Check if everything is running fine after upgrade

+ Reboot VM

# shutdown -r now


17. Do yum update hypervisor-host1

Same procedure for HV host 1 

# cat /etc/redhat-release  > /root/logs/redhat-release-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# cat /etc/grub.d/30_os-prober > /root/logs/grub2-efi-vorher-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

Clear lock
 

# yum versionlock clear
# yum versionlock list
# yum clean all |tee /root/logs/yumcleanall-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out

# yum check-update |tee /root/logs/yumcheckupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# yum check-update | wc -l
# yum update |tee /root/logs/yumupdate-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out
# sh /root/status |tee /root/logs/status-before-$(hostname)-$(date '+%Y-%m-%d_%H-%M-%S').out


18. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host1 all VMs run as expected 


19. Check if everything is running fine after upgrade

Reboot VM
 

# shutdown -r now


Check hypervisor-host2 all VMs run as expected afterwards


20. Check once more VMs and haproxy or any other contained services in VMs run as expected

Login to hosts and check processes and logs for errors etc.
 

21. Haproxy Unstandby virt-mach-centos1

Assuming that the virt-mach-centos1 and virt-mach-centos2 are running a Haproxy / corosync cluster you can try to standby node1 and check the result
hopefully all should be fine and traffic should come to host node2.

[root@virt-mach-centos1 ~]# pcs cluster unstandby virt-mach-centos1


Monitor logs and make sure HAproxy works fine on virt-mach-centos1


22. If necessery to redefine VMs (in case they disappear from virsh) or virtuosso is not working

[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos1_config_bak.xml
[root@virt-mach-centos1 ]# virsh define /root/virt-mach-centos2_config_bak.xml


23. Set versionlock to RPMs to prevent accident updates and check OS version release

[root@virt-mach-centos2 ]# yum versionlock \*
[root@virt-mach-centos1 ]# yum versionlock \*
[root@hypervisor-host1 ~]# yum versionlock \*
[root@hypervisor-host2 ~]# yum versionlock \*

[root@hypervisor-host2 ~]# cat /etc/redhat-release 
CentOS Linux release 7.8.2003 (Core)

Other useful hints

[root@hypervisor-host1 ~]# virsh console dc37c201-08c9-489d-aa20-9386d63ce3f3
Connected to domain virt-mach-centos1
..

! Compare packages count before the upgrade on each of the supposable identical VMs and HVs – if there is difference in package count review what kind of packages are different and try to make the machines to look as identical as possible  !

Packages to update on hypervisor-host1 Count: XXX
Packages to update on hypervisor-host2 Count: XXX
Packages to update virt-mach-centos1 Count: – 254
Packages to update virt-mach-centos2 Count: – 249

The /root/status script

+++

#!/bin/sh
echo  '=======================================================   '
echo  '= Systemctl list-unit-files –type=service | grep enabled '
echo  '=======================================================   '
systemctl list-unit-files –type=service | grep enabled

echo  '=======================================================   '
echo  '= systemctl | grep ".service" | grep "running"            '
echo  '=======================================================   '
systemctl | grep ".service" | grep "running"

echo  '=======================================================   '
echo  '= chkconfig –list                                        '
echo  '=======================================================   '
chkconfig –list

echo  '=======================================================   '
echo  '= netstat -tulpn                                          '
echo  '=======================================================   '
netstat -tulpn

echo  '=======================================================   '
echo  '= netstat -r                                              '
echo  '=======================================================   '
netstat -r


+++

That's all folks, once going through the article, after some 2 hours of efforts or so you should have an up2date machines.
Any problems faced or feedback is mostly welcome as this might help others who have the same setup.

Thanks for reading me 🙂

Fix weird double logging in haproxy.log file due to haproxy.cfg misconfiguration

Tuesday, March 8th, 2022

haproxy-logging-front-network-back-network-diagram

While we were building a new machine that will serve as a Haproxy server proxy frontend to tunnel some traffic to a number of backends, 
came across weird oddity. The call requests sent to the haproxy and redirected to backend servers, were being written in the log twice.

Since we have two backend Application servers hat are serving the request, my first guess was this is caused by the fact haproxy
tries to connect to both nodes on each sent request, or that the double logging is caused by the rsyslogd doing something strange on each
received query. The rsyslog configuration configured to send via local6 facility to rsyslog is like that:

$ModLoad imudp
$UDPServerAddress 127.0.0.1
$UDPServerRun 514
#2022/02/02: HAProxy logs to local6, save the messages
local6.*                                                /var/log/haproxy.log


The haproxy basic global and defaults and frontend section config is like that:
 

global
    log          127.0.0.1 local6 debug
    chroot       /var/lib/haproxy
    pidfile      /run/haproxy.pid
    stats socket /var/lib/haproxy/haproxy.sock mode 0600 level admin
    maxconn      4000
    user         haproxy
    group        haproxy
    daemon

 

defaults
    mode        tcp
    log         global
#    option      dontlognull
#    option      httpclose
#    option      httplog
#    option      forwardfor
    option      redispatch
    option      log-health-checks
    timeout connect 10000 # default 10 second time out if a backend is not found
    timeout client 300000
    timeout server 300000
    maxconn     60000
    retries     3
 

listen FRONTEND1
        bind 10.71.80.5:63750
        mode tcp
        option tcplog
        log global
        log-format [%t]\ %ci:%cp\ %bi:%bp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        balance roundrobin
        timeout client 350000
        timeout server 350000
        timeout connect 35000
        server backend-host1 10.80.50.1:13750 weight 1 check port 15000
        server backend-host2 10.80.50.2:13750 weight 2 check port 15000

After quick research online on why this odd double logging of request ends in /var/log/haproxy.log it turns out this is caused by the 

log global  defined double under the defaults section as well as in the frontend itself, hence to resolve simply had to comment out the log global in Frontend, so it looks like so:

listen FRONTEND1
        bind 10.71.80.5:63750
        mode tcp
        option tcplog
  #      log global
        log-format [%t]\ %ci:%cp\ %bi:%bp\ %b/%s:%sp\ %Tw/%Tc/%Tt\ %B\ %ts\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq
        balance roundrobin
        timeout client 350000
        timeout server 350000
        timeout connect 35000
        server backend-host1 10.80.50.1:13750 weight 1 check port 15000
        server backend-host2 10.80.50.2:13750 weight 2 check port 15000

 


Next just reloaded haproxy and one request started leaving only one trail inside haproxy.log as expected 🙂
 

List and fix failed systemd failed services after Linux OS upgrade and how to get full info about systemd service from jorunal log

Friday, February 25th, 2022

systemd-logo-unix-linux-list-failed-systemd-services

I have recently upgraded a number of machines from Debian 10 Buster to Debian 11 Bullseye. The update as always has some issues on some machines, such as problem with package dependencies, changing a number of external package repositories etc. to match che Bullseye deb packages. On some machines the update was less painful on others but the overall line was that most of the machines after the update ended up with one or more failed systemd services. It could be that some of the machines has already had this failed services present and I never checked them from the previous time update from Debian 9 -> Debian 10 or just some mess I've left behind in the hurry when doing software installation in the past. This doesn't matter anyways the fact was that I had to deal to a number of systemctl services which I managed to track by the Failed service mesage on system boot on one of the physical machines and on the OpenXen VTY Console the rest of Virtual Machines after update had some Failed messages. Thus I've spend some good amount of time like an overall of a day or two fixing strange failed services. This is how this small article was born in attempt to help sysadmins or any home Linux desktop users, who has updated his Debian Linux / Ubuntu or any other deb based distribution but due to the chaotic nature of Linux has ended with same strange Failed services and look for a way to find the source of the failures and get rid of the problems. 
Systemd is a very complicated system and in my many sysadmin opinion it makes more problems than it solves, but okay for today's people's megalomania mindset it matches well.

Systemd_components-systemd-journalctl-cgroups-loginctl-nspawn-analyze.svg

 

1. Check the journal for errors, running service irregularities and so on
 

First thing to do to track for errors, right after the update is to take some minutes and closely check,, the journalctl for any strange errors, even on well maintained Unix machines, this journal log would bring you to a problem that is not fatal but still some process or stuff is malfunctioning in the background that you would like to solve:
 

root@pcfreak:~# journalctl -x
Jan 10 10:10:01 pcfreak CRON[17887]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17887]: USER_END pid=17887 uid=0 auid=0 ses=340858 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17888]: CRED_DISP pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17888]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17888]: USER_END pid=17888 uid=0 auid=0 ses=340860 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17884]: CRED_DISP pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="root" exe="/usr/sbin/cron" >
Jan 10 10:10:01 pcfreak CRON[17884]: pam_unix(cron:session): session closed for user root
Jan 10 10:10:01 pcfreak audit[17884]: USER_END pid=17884 uid=0 auid=0 ses=340855 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permit>
Jan 10 10:10:01 pcfreak audit[17886]: CRED_DISP pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/c>
Jan 10 10:10:01 pcfreak CRON[17886]: pam_unix(cron:session): session closed for user www-data
Jan 10 10:10:01 pcfreak audit[17886]: USER_END pid=17886 uid=0 auid=33 ses=340859 subj==unconfined msg='op=PAM:session_close grantors=pam_loginuid,pam_env,pam_env,pam_permi>
Jan 10 10:10:08 pcfreak NetworkManager[696]:  [1641802208.0899] device (eth1): carrier: link connected
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:08 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:19 pcfreak NetworkManager[696]:
 [1641802219.7920] device (eth1): carrier: link connected
Jan 10 10:10:19 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:20 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:22 pcfreak NetworkManager[696]:
 [1641802222.2772] device (eth1): carrier: link connected
Jan 10 10:10:22 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx
Jan 10 10:10:23 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Down
Jan 10 10:10:33 pcfreak sshd[18142]: Unable to negotiate with 66.212.17.162 port 19255: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diff>
Jan 10 10:10:41 pcfreak NetworkManager[696]:
 [1641802241.0186] device (eth1): carrier: link connected
Jan 10 10:10:41 pcfreak kernel: r8169 0000:03:00.0 eth1: Link is Up – 100Mbps/Full – flow control rx/tx

If you want to only check latest journal log messages use the -x -e (pager catalog) opts

root@pcfreak;~# journalctl -xe

Feb 25 13:08:29 pcfreak audit[2284920]: USER_LOGIN pid=2284920 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='op=login acct=28696E76616C>
Feb 25 13:08:29 pcfreak sshd[2284920]: Received disconnect from 177.87.57.145 port 40927:11: Bye Bye [preauth]
Feb 25 13:08:29 pcfreak sshd[2284920]: Disconnected from invalid user ubuntuuser 177.87.57.145 port 40927 [preauth]

Next thing to after the update was to get a list of failed service only.


2. List all systemd failed check services which was supposed to be running

root@pcfreak:/root # systemctl list-units | grep -i failed
● certbot.service                                                                                                       loaded failed failed    Certbot
● logrotate.service                                                                                                     loaded failed failed    Rotate log files
● maldet.service                                                                                                        loaded failed failed    LSB: Start/stop maldet in monitor mode
● named.service                                                                                                         loaded failed failed    BIND Domain Name Server


Alternative way is with the –failed option

hipo@jeremiah:~$ systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
7 loaded units listed.

 

root@jeremiah:/etc/apt/sources.list.d#  systemctl list-units –failed
  UNIT                        LOAD   ACTIVE SUB    DESCRIPTION
● haproxy.service             loaded failed failed HAProxy Load Balancer
● libvirt-guests.service      loaded failed failed Suspend/Resume Running libvirt Guests
● libvirtd.service            loaded failed failed Virtualization daemon
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● sqwebmail.service           masked failed failed sqwebmail.service
● tpm2-abrmd.service          loaded failed failed TPM2 Access Broker and Resource Management Daemon
● wd_keepalive.service        loaded failed failed LSB: Start watchdog keepalive daemon


To get a full list of objects of systemctl you can pass as state:
 

# systemctl –state=help
Full list of possible load states to pass is here
Show service properties


Check whether a service is failed or has other status and check default set systemd variables for it.

root@jeremiah~:# systemctl is-failed vboxweb.service
inactive

# systemctl show haproxy
Type=notify
Restart=always
NotifyAccess=main
RestartUSec=100ms
TimeoutStartUSec=1min 30s
TimeoutStopUSec=1min 30s
TimeoutAbortUSec=1min 30s
TimeoutStartFailureMode=terminate
TimeoutStopFailureMode=terminate
RuntimeMaxUSec=infinity
WatchdogUSec=0
WatchdogTimestampMonotonic=0
RootDirectoryStartOnly=no
RemainAfterExit=no
GuessMainPID=yes
SuccessExitStatus=143
MainPID=304858
ControlPID=0
FileDescriptorStoreMax=0
NFileDescriptorStore=0
StatusErrno=0
Result=success
ReloadResult=success
CleanResult=success

Full output of the above command is dumped in show_systemctl_properties.txt


3. List all running systemd services for a better overview on what's going on on machine
 

To get a list of all properly systemd loaded services you can use –state running.

hipo@jeremiah:~$ systemctl list-units –state running|head -n 10
  UNIT                              LOAD   ACTIVE SUB     DESCRIPTION
  proc-sys-fs-binfmt_misc.automount loaded active running Arbitrary Executable File Formats File System Automount Point
  cups.path                         loaded active running CUPS Scheduler
  init.scope                        loaded active running System and Service Manager
  session-2.scope                   loaded active running Session 2 of user hipo
  accounts-daemon.service           loaded active running Accounts Service
  anydesk.service                   loaded active running AnyDesk
  apache-htcacheclean.service       loaded active running Disk Cache Cleaning Daemon for Apache HTTP Server
  apache2.service                   loaded active running The Apache HTTP Server
  avahi-daemon.service              loaded active running Avahi mDNS/DNS-SD Stack

 

It is useful thing is to list all unit-files configured in systemd and their state, you can do it with:

 


root@pcfreak:~# systemctl list-unit-files
UNIT FILE                                                                 STATE           VENDOR PRESET
proc-sys-fs-binfmt_misc.automount                                         static          –            
-.mount                                                                   generated       –            
backups.mount                                                             generated       –            
dev-hugepages.mount                                                       static          –            
dev-mqueue.mount                                                          static          –            
media-cdrom0.mount                                                        generated       –            
mnt-sda1.mount                                                            generated       –            
proc-fs-nfsd.mount                                                        static          –            
proc-sys-fs-binfmt_misc.mount                                             disabled        disabled     
run-rpc_pipefs.mount                                                      static          –            
sys-fs-fuse-connections.mount                                             static          –            
sys-kernel-config.mount                                                   static          –            
sys-kernel-debug.mount                                                    static          –            
sys-kernel-tracing.mount                                                  static          –            
var-www.mount                                                             generated       –            
acpid.path                                                                masked          enabled      
cups.path                                                                 enabled         enabled      

 

 


root@pcfreak:~# systemctl list-units –type service –all
  UNIT                                   LOAD      ACTIVE   SUB     DESCRIPTION
  accounts-daemon.service                loaded    inactive dead    Accounts Service
  acct.service                           loaded    active   exited  Kernel process accounting
● alsa-restore.service                   not-found inactive dead    alsa-restore.service
● alsa-state.service                     not-found inactive dead    alsa-state.service
  apache2.service                        loaded    active   running The Apache HTTP Server
● apparmor.service                       not-found inactive dead    apparmor.service
  apt-daily-upgrade.service              loaded    inactive dead    Daily apt upgrade and clean activities
 apt-daily.service                      loaded    inactive dead    Daily apt download activities
  atd.service                            loaded    active   running Deferred execution scheduler
  auditd.service                         loaded    active   running Security Auditing Service
  auth-rpcgss-module.service             loaded    inactive dead    Kernel Module supporting RPCSEC_GSS
  avahi-daemon.service                   loaded    active   running Avahi mDNS/DNS-SD Stack
  certbot.service                        loaded    inactive dead    Certbot
  clamav-daemon.service                  loaded    active   running Clam AntiVirus userspace daemon
  clamav-freshclam.service               loaded    active   running ClamAV virus database updater
..

 


linux-systemd-components-diagram-linux-kernel-system-targets-systemd-libraries-daemons

 

4. Finding out more on why a systemd configured service has failed


Usually getting info about failed systemd service is done with systemctl status servicename.service
However, in case of troubles with service unable to start to get more info about why a service has failed with (-l) or (–full) options


root@pcfreak:~# systemctl -l status logrotate.service
● logrotate.service – Rotate log files
     Loaded: loaded (/lib/systemd/system/logrotate.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 00:00:06 EET; 13h ago
TriggeredBy: ● logrotate.timer
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
    Process: 2045320 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=1/FAILURE)
   Main PID: 2045320 (code=exited, status=1/FAILURE)
        CPU: 2.479s

Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: For now we will assume you meant to write /32
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| ERROR: '0.0.0.0/0.0.0.0' needs to be replaced by the term 'all'.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| SECURITY NOTICE: Overriding config setting. Using 'all' instead.
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: (B) '::/0' is a subnetwork of (A) '::/0'
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: because of this '::/0' is ignored to keep splay tree searching predictable
Feb 25 00:00:06 pcfreak logrotate[2045577]: 2022/02/25 00:00:06| WARNING: You should probably remove '::/0' from the ACL named 'all'
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Failed with result 'exit-code'.
Feb 25 00:00:06 pcfreak systemd[1]: Failed to start Rotate log files.
Feb 25 00:00:06 pcfreak systemd[1]: logrotate.service: Consumed 2.479s CPU time.


systemctl -l however is providing only the last log from message a started / stopped or whatever status service has generated. Sometimes systemctl -l servicename.service is showing incomplete the splitted error message as there is a limitation of line numbers on the console, see below

 

root@pcfreak:~# systemctl status -l certbot.service
● certbot.service – Certbot
     Loaded: loaded (/lib/systemd/system/certbot.service; static)
     Active: failed (Result: exit-code) since Fri 2022-02-25 09:28:33 EET; 4h 0min ago
TriggeredBy: ● certbot.timer
       Docs: file:///usr/share/doc/python-certbot-doc/html/index.html
             https://certbot.eff.org/docs
    Process: 290017 ExecStart=/usr/bin/certbot -q renew (code=exited, status=1/FAILURE)
   Main PID: 290017 (code=exited, status=1/FAILURE)
        CPU: 9.771s

Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using th>
Feb 25 09:28:33 pcfrxen certbot[290017]: All renewals failed. The following certificates could not be renewed:
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/mail.pcfreak.org-0003/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/www.eforia.bg-0005/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]:   /etc/letsencrypt/live/zabbix.pc-freak.net/fullchain.pem (failure)
Feb 25 09:28:33 pcfrxen certbot[290017]: 3 renew failure(s), 5 parse failure(s)
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Main process exited, code=exited, status=1/FAILURE
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Failed with result 'exit-code'.
Feb 25 09:28:33 pcfrxen systemd[1]: Failed to start Certbot.
Feb 25 09:28:33 pcfrxen systemd[1]: certbot.service: Consumed 9.771s CPU time.

 

5. Get a complete log of journal to make sure everything configured on server host runs as it should

Thus to get more complete list of the message and be able to later google and look if has come with a solution on the internet  use:

root@pcfrxen:~#  journalctl –catalog –unit=certbot

— Journal begins at Sat 2022-01-22 21:14:05 EET, ends at Fri 2022-02-25 13:32:01 EET. —
Jan 23 09:58:18 pcfrxen systemd[1]: Starting Certbot…
░░ Subject: A start job for unit certbot.service has begun execution
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░ 
░░ A start job for unit certbot.service has begun execution.
░░ 
░░ The job identifier is 5754.
Jan 23 09:58:20 pcfrxen certbot[124996]: Traceback (most recent call last):
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/renewal.py", line 71, in _reconstitute
Jan 23 09:58:20 pcfrxen certbot[124996]:     renewal_candidate = storage.RenewableCert(full_path, config)
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 471, in __init__
Jan 23 09:58:20 pcfrxen certbot[124996]:     self._check_symlinks()
Jan 23 09:58:20 pcfrxen certbot[124996]:   File "/usr/lib/python3/dist-packages/certbot/_internal/storage.py", line 537, in _check_symlinks

root@server:~# journalctl –catalog –unit=certbot|grep -i pluginerror|tail -1
Feb 25 09:28:33 pcfrxen certbot[290017]: The error was: PluginError('An authentication script must be provided with –manual-auth-hook when using the manual plugin non-interactively.')


Or if you want to list and read only the last messages in the journal log regarding a service

root@server:~# journalctl –catalog –pager-end –unit=certbot


If you have disabled a failed service because you don't need it to run at all on the machine with:

root@rhel:~# systemctl stop rngd.service
root@rhel:~# systemctl disable rngd.service

And you want to clear up any failed service information that is kept in the systemctl service log you can do it with:
 

root@rhel:~# systemctl reset-failed

Another useful systemctl option is cat, you can use it to easily list a service it is useful to quickly check what is a service, an actual shortcut to save you from giving a full path to the service e.g. cat /lib/systemd/system/certbot.service

root@server:~# systemctl cat certbot
# /lib/systemd/system/certbot.service
[Unit]
Description=Certbot
Documentation=file:///usr/share/doc/python-certbot-doc/html/index.html
Documentation=https://certbot.eff.org/docs
[Service]
Type=oneshot
ExecStart=/usr/bin/certbot -q renew
PrivateTmp=true


After failed SystemD services are fixed, it is best to reboot the machine and check put some more time to inspect rawly the complete journal log to make sure, no error  was left behind.


Closure
 

As you can see updating a machine from a major to a major version even if you follow the official documentation and you have plenty of experience is always more or a less a pain in the ass, which can eat up much of your time banging your head solving problems with failed daemons issues with /etc/rc.local (which I have faced becase of #/bin/sh -e (which would make /etc/rc.local) to immediately quit if any error from command $? returns different from 0 etc.. The  logical questions comes then;
1. Is it really worthy to update at all regularly, especially if you don't know of a famous major Vulnerability 🙂 ?
2. Or is it worthy to update from OS major release to OS major release at all?  
3. Or should you only try to patch the service that is exposed to an external reachable computer network or the internet only and still the the same OS release until End of Life (LTS = Long Term Support) as called in Debian or  End Of Life  (EOL) Cycle as called in RPM based distros the period until the OS major release your software distro has official security patches is reached.

Anyone could take any approach but for my own managed systems small network at home my practice was always to try to keep up2date everything every 3 or 6 months maximum. This has caused me multiple days of irritation and stress and perhaps many white hairs and spend nerves on shit.


4. Based on the company where I'm employed the better strategy is to patch to the EOL is still offered and keep the rule First Things First (FTF), once the EOL is reached, just make a copy of all servers data and configuration to external Data storage, bring up a new Physical or VM and migrate the services.
Test after the migration all works as expected if all is as it should be change the DNS records or Leading Infrastructure Proxies whatever to point to the new service and that's it! Yes it is true that migration based on a full OS reinstall is more time consuming and requires much more planning, but usually the result is much more expected, plus it is much less stressful for the guy doing the job.

Zabbix rkhunter monitoring check if rootkits trojans and viruses or suspicious OS activities are detected

Wednesday, December 8th, 2021

monitor-rkhunter-with-zabbix-zabbix-rkhunter-check-logo

If you're using rkhunter to monitor for malicious activities, a binary changes, rootkits, viruses, malware, suspicious stuff and other famous security breach possible or actual issues, perhaps you have configured your machines to report to some Email.
But what if you want to have a scheduled rkhunter running on the machine and you don't want to count too much on email alerting (especially because email alerting) makes possible for emails to be tracked by sysadmin pretty late?

We have been in those situation and in this case me and my dear colleague Georgi Stoyanov developed a small rkhunter Zabbix userparameter check to track and Alert if any traces of "Warning"''s are mateched in the traditional rkhunter log file /var/log/rkhunter/rkhunter.log

To set it up and use it is pretty use you will need to have a recent version of zabbix-agent installed on the machine and connected to a Zabbix server, in my case this is:

[root@centos ~]# rpm -qa |grep -i zabbix-agent
zabbix-agent-4.0.7-1.el7.x86_64

 placed inside /etc/zabbix/zabbix_agentd.d/userparameter_rkhunter_warning_check.conf

[root@centos /etc/zabbix/zabbix_agentd.d ]# cat userparameter_rkhunter_warning_check.conf
# userparameter script to check if any Warning is inside /var/log/rkhunter/rkhunter.log and if found to trigger Zabbix alert
UserParameter=rkhunter.warning, (TODAY=$(date |awk '{ print $1" "$2" "$3 }'); if [ $(cat /var/log/rkhunter/rkhunter.log | awk “/$TODAY/,EOF” | /bin/grep -i ‘\[ Warning \]’ | /usr/bin/wc -l) != ‘0’ ]; then echo 1; else echo 0; fi)
UserParameter=rkhunter.suspected,(/bin/grep -i 'Suspect files: ' /var/log/rkhunter/rkhunter.log|tail -n 1| awk '{ print $4 }')
UserParameter=rkhunter.rootkits,(/bin/grep -i 'Possible rootkits: ' /var/log/rkhunter/rkhunter.log|tail -n 1| awk '{ print $4 }')


2. Prepare Rkhunter Template, Triggers and Items


In Zabbix Server that you access from web control interface, you will have to prepare a new template called lets say Rkhunter with the necessery Triggers and Items


2.1 Create Rkhunter Items
 

On Zabbix Server side, uou will have to configure 3 Items for the 3 configured userparameter above script keys, like so:

rkhunter-items-zabbix-screenshot.png

  • rkhunter.suspected Item configuration


rkhunter-suspected-files

  • rkhunter.warning Zabbix Item config

rkhunter-warning-found-check-zabbix

  • rkhunter.rootkits Zabbix Item config

     

     

    rkhunter-zabbix-possible-rootkits-item1

    2.2 Create Triggers
     

You need to have an overall of 3 triggers like in below shot:

zabbix-rkhunter-all-triggers-screenshot
 

  • rkhunter.rootkits Trigger config

rkhunter-rootkits-trigger-zabbix1

  • rkhunter.suspected Trigger cfg

rkhunter-suspected-trigger-zabbix1

  • rkhunter warning Trigger cfg

rkhunter-warning-trigger1

3. Reload zabbix-agent and test the keys


It is necessery to reload zabbix agent for the new userparameter to start to be sent to remote zabbix server (through a proxy if you have one configured).

[root@centos ~]# systemctl restart zabbix-agent


To make the zabbix-agent send the keys to the server you can use zabbix_sender to have the test tool you will have to have installed (zabbix-sender) on the server.

To trigger a manualTest if you happen to have some problems with the key which shouldn''t be the case you can sent a value to the respectve key with below command:

[root@centos ~ ]# zabbix_sender -vv -c "/etc/zabbix/zabbix_agentd.conf" -k "khunter.warning" -o "1"


Check on Zabbix Server the sent value is received, for any oddities as usual check what is inside  /var/log/zabbix/zabbix_agentd.log for any errors or warnings.

How to test RAM Memory for errors in Linux / UNIX OS servers. Find broken memory RAM banks

Friday, December 3rd, 2021

test-ram-memory-for-errors-linux-unix-find-broken-memory-logo

 

1. Testing the memory with motherboard integrated tools
 

Memory testing has been integral part of Computers for the last 50 years. In the dawn of computers those older perhaps remember memory testing was part of the computer initialization boot. And this memory testing was delaying the boot with some seconds and the user could see the memory numbers being counted up to the amount of memory. With the increased memory modern computers started to have and the annoyance to wait for a memory check program to check the computer hardware memory on modern computers this check has been mitigated or completely removed on some hardware.
Thus under some circumstances sysadmins or advanced computer users might need to check the memory, especially if there is some suspicion for memory damages or if for example a home PC starts crashing with Blue screens of Death on Windows without reason or simply the PC or some old arcane Linux / UNIX servers gets restarted every now and then for now apparent reason. When such circumstances occur it is an idea to start debugging the hardware issue with a simple memory check.

There are multiple ways to test installed memory banks on a server laptop or local home PC both integrated and using external programs.
On servers that is usually easily done from ILO or IPMI or IDRAC access (usually web) interface of the vendor, on laptops and home usage from BIOS or UEFI (Unified Extensible Firmware Interface) acces interface on system boot that is possible as well.

memtest-hp
HP BIOS Setup

An old but gold TIP, more younger people might not know is the

 

Prolonged SHIFT key press which once held with the user instructs the machine to initiate a memory test before the computer starts reading what is written in the boot loader.

So before anything else from below article it might be a good idea to just try HOLD SHIFT for 15-20 seconds after a complete Shut and ON from the POWER button.

If this test does not triggered or it is triggered and you end up with some corrupted memory but you're not sure which exact Memory bank is really crashing and want to know more on what memory Bank and segments are breaking up you might want to do a more thorough testing. In below article I'll try to explain shortly how this can be done.


2. Test the memory using a boot USB Flash Drive / DVD / CD 
 

Say hello to memtest86+. It is a Linux GRUB boot loader bootable utility that tests physical memory by writing various patterns to it and reading them back. Since memtest86+ runs directly off the hardware it does not require any operating system support for execution. Perhaps it is important to mention that memtest86 (is PassMark memtest86)and memtest86+ (An Advanced Memory diagnostic tool) are different tools, the first is freeware and second one is FOSS software.

To use it all you'll need is some version of Linux. If you don't already have some burned in somewhere at your closet, you might want to burn one.
For Linux / Mac users this is as downloading a Linux distribution ISO file and burning it with

# dd if=/path/to/iso of=/dev/sdbX bs=80M status=progress


Windows users can burn a Live USB with whatever Linux distro or download and burn the latest versionof memtest86+ from https://www.memtest.org/  on Windows Desktop with some proggie like lets say UnetBootIn.
 

2.1. Run memtest86+ on Ubuntu

Many Linux distributions such as Ubuntu 20.0 comes together with memtest86+, which can be easily invoked from GRUB / GRUB2 Kernel boot loader.
Ubuntu has a separate menu pointer for a Memtest.

ubuntu-grub-2-04-boot-loader-memtest86-menu-screenshot

Other distributions RPM based distributions such as CentOS, Fedora Linux, Redhat things differ.

2.2. memtest86+ on Fedora


Fedora used to have the memtest86+ menu at the GRUB boot selection prompt, but for some reason removed it and in newest Fedora releases as of time such as Fedora 35 memtest86+ is preinstalled and available but not visible, to start on  already and to start a memtest memory test tool:

  •   Boot a Fedora installation or Rescue CD / USB. At the prompt, type "memtest86".

boot: memtest86

2.3 memtest86+ on RHEL Linux

The memtest86+tool is available as an RPM package from Red Hat Network (RHN) as well as a boot option from the Red Hat Enterprise Linux rescue disk.
And nowadays Red Hat Enterprise Linux ships by default with the tool.

Prior redhat (now legacy) releases such as on RHEL 5.0 it has to be installed and configure it with below 3 commands.

[root@rhel ~]# yum install memtest86+
[root@rhel ~]# memtest-setup
[root@rhel ~]# grub2-mkconfig -o /boot/grub2/grub.cfg


    Again as with CentOS to boot memtest86+ from the rescue disk, you will need to boot your system from CD 1 of the Red Hat Enterprise Linux installation media, and type the following at the boot prompt (before the Linux kernel is started):

boot: memtest86

memtestx86-8gigabytes-of-memory-boot-screenshot
memtest86+ testing 5 memory slots

As you see all on above screenshot the Memory banks are listed as Slots. There are a number of Tests to be completed until
it can be said for sure memory does not have any faulty cells. 
The

Pass: 0
Errors: 0 

Indicates no errors, so in the end if memtest86 does not find anything this values should stay at zero.
memtest86+ is also usable to detecting issues with temperature of CPU. Just recently I've tested a PC thinking that some memory has defects but it turned out the issue on the Computer was at the CPU's temperature which was topping up at 80 – 82 Celsius.

If you're unfortunate and happen to get some corrupted memory segments you will get some red fields with the memory addresses found to have corrupted on Read / Write test operations:

memtest86-returning-memory-address-errors-screenshot


2.4. Install and use memtest and memtest86+ on Debian / Mint Linux

You can install either memtest86+ or just for the fun put both of them and play around with both of them as they have a .deb package provided out of debian non-free /etc/apt/sources.list repositories.


root@jeremiah:/home/hipo# apt-cache show memtest86 memtest86+
Package: memtest86
Version: 4.3.7-3
Installed-Size: 302
Maintainer: Yann Dirson <dirson@debian.org>
Architecture: amd64
Depends: debconf (>= 0.5) | debconf-2.0
Recommends: memtest86+
Suggests: hwtools, memtester, kernel-patch-badram, grub2 (>= 1.96+20090523-1) | grub (>= 0.95+cvs20040624), mtools
Description-en: thorough real-mode memory tester
 Memtest86 scans your RAM for errors.
 .
 This tester runs independently of any OS – it is run at computer
 boot-up, so that it can test *all* of your memory.  You may want to
 look at `memtester', which allows testing your memory within Linux,
 but this one won't be able to test your whole RAM.
 .
 It can output a list of bad RAM regions usable by the BadRAM kernel
 patch, so that you can still use you old RAM with one or two bad bits.
 .
 This is the last DFSG-compliant version of this software, upstream
 has opted for a proprietary development model starting with 5.0.  You
 may want to consider using memtest86+, which has been forked from an
 earlier version of memtest86, and provides a different set of
 features.  It is available in the memtest86+ package.
 .
 A convenience script is also provided to make a grub-legacy-based
 floppy or image.

Description-md5: 0ad381a54d59a7d7f012972f613d7759
Homepage: http://www.memtest86.com/
Section: misc
Priority: optional
Filename: pool/main/m/memtest86/memtest86_4.3.7-3_amd64.deb
Size: 45470
MD5sum: 8dd2a4c52910498d711fbf6b5753bca9
SHA256: 09178eca21f8fd562806ccaa759d0261a2d3bb23190aaebc8cd99071d431aeb6

Package: memtest86+
Version: 5.01-3
Installed-Size: 2391
Maintainer: Yann Dirson <dirson@debian.org>
Architecture: amd64
Depends: debconf (>= 0.5) | debconf-2.0
Suggests: hwtools, memtester, kernel-patch-badram, memtest86, grub-pc | grub-legacy, mtools
Description-en: thorough real-mode memory tester
 Memtest86+ scans your RAM for errors.
 .
 This tester runs independently of any OS – it is run at computer
 boot-up, so that it can test *all* of your memory.  You may want to
 look at `memtester', which allows to test your memory within Linux,
 but this one won't be able to test your whole RAM.
 .
 It can output a list of bad RAM regions usable by the BadRAM kernel
 patch, so that you can still use your old RAM with one or two bad bits.
 .
 Memtest86+ is based on memtest86 3.0, and adds support for recent
 hardware, as well as a number of general-purpose improvements,
 including many patches to memtest86 available from various sources.
 .
 Both memtest86 and memtest86+ are being worked on in parallel.
Description-md5: aa685f84801773ef97fdaba8eb26436a
Homepage: http://www.memtest.org/

Tag: admin::benchmarking, admin::boot, hardware::storage:floppy,
 interface::text-mode, role::program, scope::utility, use::checking
Section: misc
Priority: optional
Filename: pool/main/m/memtest86+/memtest86+_5.01-3_amd64.deb
Size: 75142
MD5sum: 4f06523532ddfca0222ba6c55a80c433
SHA256: ad42816e0b17e882713cc6f699b988e73e580e38876cebe975891f5904828005
 

 

root@jeremiah:/home/hipo# apt-get install –yes memtest86+

root@jeremiah:/home/hipo# apt-get install –yes memtest86

Reading package lists… Done
Building dependency tree       
Reading state information… Done
Suggested packages:
  hwtools kernel-patch-badram grub2 | grub
The following NEW packages will be installed:
  memtest86
0 upgraded, 1 newly installed, 0 to remove and 21 not upgraded.
Need to get 45.5 kB of archives.
After this operation, 309 kB of additional disk space will be used.
Get:1 http://ftp.de.debian.org/debian buster/main amd64 memtest86 amd64 4.3.7-3 [45.5 kB]
Fetched 45.5 kB in 0s (181 kB/s)     
Preconfiguring packages …
Selecting previously unselected package memtest86.
(Reading database … 519985 files and directories currently installed.)
Preparing to unpack …/memtest86_4.3.7-3_amd64.deb …
Unpacking memtest86 (4.3.7-3) …
Setting up memtest86 (4.3.7-3) …
Generating grub configuration file …
Found background image: saint-John-of-Rila-grub.jpg
Found linux image: /boot/vmlinuz-4.19.0-18-amd64
Found initrd image: /boot/initrd.img-4.19.0-18-amd64
Found linux image: /boot/vmlinuz-4.19.0-17-amd64
Found initrd image: /boot/initrd.img-4.19.0-17-amd64
Found linux image: /boot/vmlinuz-4.19.0-8-amd64
Found initrd image: /boot/initrd.img-4.19.0-8-amd64
Found linux image: /boot/vmlinuz-4.19.0-6-amd64
Found initrd image: /boot/initrd.img-4.19.0-6-amd64
Found linux image: /boot/vmlinuz-4.19.0-5-amd64
Found initrd image: /boot/initrd.img-4.19.0-5-amd64
Found linux image: /boot/vmlinuz-4.9.0-8-amd64
Found initrd image: /boot/initrd.img-4.9.0-8-amd64
Found memtest86 image: /boot/memtest86.bin
Found memtest86+ image: /boot/memtest86+.bin
Found memtest86+ multiboot image: /boot/memtest86+_multiboot.bin
File descriptor 3 (pipe:[66049]) leaked on lvs invocation. Parent PID 22581: /bin/sh
done
Processing triggers for man-db (2.8.5-2) …

 

After this both memory testers memtest86+ and memtest86 will appear next to the option of booting a different version kernels and the Advanced recovery kernels, that you usually get in the GRUB boot prompt.

2.5. Use memtest embedded tool on any Linux by adding a kernel variable

Edit-Grub-Parameters-add-memtest-4-to-kernel-boot

2.4.1. Reboot your computer

# reboot

2.4.2. At the GRUB boot screen (with UEFI, press Esc).

2.4.3 For 4 passes add temporarily the memtest=4 kernel parameter.
 

memtest=        [KNL,X86,ARM,PPC,RISCV] Enable memtest
                Format: <integer>
                default : 0 <disable>
                Specifies the number of memtest passes to be
                performed. Each pass selects another test
                pattern from a given set of patterns. Memtest
                fills the memory with this pattern, validates
                memory contents and reserves bad memory
                regions that are detected.


3. Install and use memtester Linux tool
 

At some condition, memory is the one of the suspcious part, or you just want have a quick test. memtester  is an effective userspace tester for stress-testing the memory subsystem.  It is very effective at finding intermittent and non-deterministic faults.

The advantage of memtester "live system check tool is", you can check your system for errors while it's still running. No need for a restart, just run that application, the downside is that some segments of memory cannot be thoroughfully tested as you already have much preloaded data in it to have the Operating Sytstem running, thus always when possible try to stick to rule to test the memory using memtest86+  from OS Boot Loader, after a clean Machine restart in order to clean up whole memory heap.

Anyhow for a general memory test on a Critical Legacy Server  (if you lets say don't have access to Remote Console Board, or don't trust the ILO / IPMI Hardware reported integrity statistics), running memtester from already booted is still a good idea.


3.1. Install memtester on any Linux distribution from source

wget http://pyropus.ca/software/memtester/old-versions/memtester-4.2.2.tar.gz
# tar zxvf memtester-4.2.2.tar.gz
# cd memtester-4.2.2
# make && make install

3.2 Install on RPM based distros

 

On Fedora memtester is available from repositories however on many other RPM based distros it is not so you have to install it from source.

[root@fedora ]# yum install -y memtester

 

3.3. Install memtester on Deb based Linux distributions from source
 

To install it on Debian / Ubuntu / Mint etc. , open a terminal and type:
 

root@linux:/ #  apt install –yes memtester

The general run syntax is:

memtester [-p PHYSADDR] [ITERATIONS]


You can hence use it like so:

hipo@linux:/ $ sudo memtester 1024 5

This should allocate 1024MB of memory, and repeat the test 5 times. The more repeats you run the better, but as a memtester run places a great overall load on the system you either don't increment the runs too much or at least run it with  lowered process importance e.g. by nicing the PID:

hipo@linux:/ $ nice -n 15 sudo memtester 1024 5

 

  • If you have more RAM like 4GB or 8GB, it is upto you how much memory you want to allocate for testing.
  • As your operating system, current running process might take some amount of RAM, Please check available Free RAM and assign that too memtester.
  • If you are using a 32 Bit System, you cant test more than 4 GB even though you have more RAM( 32 bit systems doesnt support more than 3.5 GB RAM as you all know).
  • If your system is very busy and you still assigned higher than available amount of RAM, then the test might get your system into a deadlock, leads to system to halt, be aware of this.
  • Run the memtester as root user, so that memtester process can malloc the memory, once its gets hold on that memory it will try to apply lock. if specified memory is not available, it will try to reduce required RAM automatically and try to lock it with mlock.
  • if you run it as a regular user, it cant auto reduce the required amount of RAM, so it cant lock it, so it tries to get hold on that specified memory and starts exhausting all system resources.


If you have 8 Gigas of RAM plugged into the PC motherboard you have to multiple 1024*8 this is easily done with bc (An arbitrary precision calculator language) tool:

root@linux:/ # bc -l
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'. 
8*1024
8192


 for example you should run:

root@linux:/ # memtester 8192 5

memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 8192MB (2083520512 bytes)
got  8192MB (2083520512 bytes), trying mlock …Loop 1/1:
  Stuck Address       : ok        
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok        
  Block Sequential    : ok        
  Checkerboard        : ok        
  Bit Spread          : ok        
  Bit Flip            : ok        
  Walking Ones        : ok        
  Walking Zeroes      : ok        
  8-bit Writes        : ok
  16-bit Writes       : ok

Done.

 

4. Shell Script to test server memory for corruptions
 

If for some reason the machine you want to run a memory test doesn't have connection to the external network such as the internet and therefore you cannot configure a package repository server and install memtester, the other approach is to use a simple memory test script such as memtestlinux.sh
 

#!/bin/bash
# Downloaded from https://www.srv24x7.com/memtest-linux/
echo "ByteOnSite Memory Test"
cpus=`cat /proc/cpuinfo | grep processor | wc -l`
if [ $cpus -lt 6 ]; then
threads=2
else
threads=$(($cpus / 2))
fi
echo "Detected $cpus CPUs, using $threads threads.."
memory=`free | grep 'Mem:' | awk {'print $2'}`
memoryper=$(($memory / $threads))
echo "Detected ${memory}K of RAM ($memoryper per thread).."
freespace=`df -B1024 . | tail -n1 | awk {'print $4'}`
if [ $freespace -le $memory ]; then
echo You do not have enough free space on the current partition. Minimum: $memory bytes
exit 1
fi
echo "Clearing RAM Cache.."
sync; echo 3 > /proc/sys/vm/drop_cachesfile
echo > dump.memtest.img
echo "Writing to dump file (dump.memtest.img).."
for i in `seq 1 $threads`;
do
# 1044 is used in place of 1024 to ensure full RAM usage (2% over allocation)
dd if=/dev/urandom bs=$memoryper count=1044 >> dump.memtest.img 2>/dev/null &
pids[$i]=$!
echo $i
done
for pid in "${pids[@]}"
do
wait $pid
done

echo "Reading and analyzing dump file…"
echo "Pass 1.."
md51=`md5sum dump.memtest.img | awk {'print $1'}`
echo "Pass 2.."
md52=`md5sum dump.memtest.img | awk {'print $1'}`
echo "Pass 3.."
md53=`md5sum dump.memtest.img | awk {'print $1'}`
if [ “$md51” != “$md52” ]; then
fail=1
elif [ “$md51” != “$md53” ]; then
fail=1
elif [ “$md52” != “$md53” ]; then
fail=1
else
fail=0
fi
if [ $fail -eq 0 ]; then
echo "Memory test PASSED."
else
echo "Memory test FAILED. Bad memory detected."
fi
rm -f dump.memtest.img
exit $fail

Nota Bene !: Again consider the restults might not always be 100% trustable if possible restart the server and test with memtest86+

Consider also its important to make sure prior to script run,  you''ll have enough disk space to produce the dump.memtest.img file – file is created as a test bed for the memory tests and if not scaled properly you might end up with a full ( / ) root directory!

 

4.1 Other memory test script with dd and md5sum checksum

I found this solution on the well known sysadmin site nixCraft cyberciti.biz, I think it makes sense and quicker.

First find out memory site using free command.
 

# free
             total       used       free     shared    buffers     cached
Mem:      32867436   32574160     293276          0      16652   31194340
-/+ buffers/cache:    1363168   31504268
Swap:            0          0          0


It shows that this server has 32GB memory,
 

# dd if=/dev/urandom bs=32867436 count=1050 of=/home/memtest


free reports by k and use 1050 is to make sure file memtest is bigger than physical memory.  To get better performance, use proper bs size, for example 2048 or 4096, depends on your local disk i/o,  the rule is to make bs * count > 32 GB.
run

# md5sum /home/memtest; md5sum /home/memtest; md5sum /home/memtest


If you see md5sum mismatch in different run, you have faulty memory guaranteed.
The theory is simple, the file /home/memtest will cache data in memory by filling up all available memory during read operation. Using md5sum command you are reading same data from memory.


5. Other ways to test memory / do a machine stress test

Other good tools you might want to check for memory testing is mprime – ftp://mersenne.org/gimps/ 
(https://www.mersenne.org/ftp_root/gimps/)

  •  (mprime can also be used to stress test your CPU)

Alternatively, use the package stress-ng to run all kind of stress tests (including memory test) on your machine.
Perhaps there are other interesting tools for a diagnosis of memory if you know other ones I miss, let me know in the comment section.

How to fresh Upgrade mistakenly installed 32-bit Windows 10 Professional to 64-bit Windows / A failure to Disk Clone old SSD 120GB to 512GB HDD due to failed Solid State Drive

Wednesday, November 17th, 2021

upgrade-windows-10-32-bit-to-64-bit-howto-picture

I've been Setting up a new PC with Windows OS that is a bit old a 11 years old Lenovo ThinkCentre model M90P with 8 GB of Memory, Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz   3.19 GHz, Intel Q57 Express Chipset. The machine came to me with Windows 7 preinstalled and the intial goal was to migrate Windows as it is with its data from the old 120GB SSD to new 512 SSD and then to keep the machine at least a bit more up to date to upgrade the old Windows 7 to Windows 10.

This as usual seemed like a very trivial task for a System Administrator, and even if you haven't touched much of Windows as me it makes it look a piece of cake, however as always with computers, once you think you'll be done in 2 hours usually it takes 20+ . Some call it Murphy's law "If something could go wrong then it will go wrong". But putting this situation that I thought all well that's easy lets do it is a kind of a proud Thought for man and the to save us from this Passion of Proudness which according to Church fathers is the worst passion one can have and humiliate us a bit.

God allows some unforseen stuff to happen   🙂 The case with this machine whose original idea I had is to OK I Simply Duplicate the Old Hard Drive to the New one and Place the new one on the ThinkCentre is not a big deal turned to a small adventure 🙂

For this machine hardware I have to say, the old English saying "Old but Gold" is pretty true, especially after I've attached the Samsung 512GB NVME SSD Drive, which my dear friend and brother in Christ "Uncle Emilian" had received as a gift from another friend called Angel. To put even more rant, here name Emilian stems from the Greek Emilianos which translated to English means Adversary.. But anyways The old Intel SSD 120 GB drive which besides being already completely Full of Data,  turned to have Memory DATA Chips (that perhaps burn out / wasted),  so parts of the Drive were Unreadable.
I've realized the fauly SSD fact after, 
trying to first clone the drives with my Hardware Disk Clone device Orico Dual Bay 2.5 6629US3-C device and then using a simple bit to bit copy with dd command.

orico-6629us3-c2-bay-usb3-type-b2.5-type3-5.inch-sata


At first for some weird reason the Cloning of 120GB SSD HDD towards -> 512 GB newer one was unsuccessful – one of the 2 lamp indicators on Source and Destination Drives was continuiously blinking orange as it seemed data could not be read, even though I tried few times and wait for about 1 hour of time for the cloning to complete, so I first suspected that might be an issue with my  last year bought Disk Clone hardware device. So I've attached the 2 Hard Drives towards my Debian GNU / Linux 10 as USB attached drives using the "Toaster" device  and tried a classical copy   from terminal with Disk Druid e.g.


# dd if=/dev/sdb2 of=/dev/sdbc2 bs=180M status=progress conv=noerror, sync

 
dd: error reading '/dev/sdb2': Input/output error
1074889+17746 records in
1092635+0 records out
559429120 bytes (559 MB, 534 MiB) copied, 502933 s, 1.1 kB/s
dd: writing to '/dev/dc2': Input/output error
1074889+17747 records in
1092635+0 records out
559429120 bytes (559 MB, 534 MiB) copied, 502933 s, 1.1 kB/s

Finally I did a manual copy of files from /dev/sdb2 /dev/sdc2 with rsync and part of the files managed to be succesfully copied, about 55Gigabytes out of 110 managed to copy.  Luckily the data on the broken Intel 320 Series 120GB was not top secret stuff so wasting some bits wasn't the end of the world 🙂

Next, I've removed the broken 120Gb SSD which perhaps was about at least 9+ years old and attached to the Lenovo ThinkCentre, the new drive and as my dear friend wanted to have Windows again (his computer has Microsoft "Certificate of Authenticity"), e.g. that OEM Registration Serial Key for Windows 7.

Lenovo-ThinkCentre-M90p-certificate-of-authenticity

I've jumped in and used some old Flash USB Stick Drive to place again Windows 7 (in order to use the same active license) and from there on, I've used another old Windows 10 Installation Bootable stick of mine to upgrade the Windows 7 to Windows 10 (by using this Win 7 to Win 10 upgrade trick it is possible to still continue use your old Windows 7 License Key on Windows 10). So far so good, now I've had Windows 10 Professional Edition installed on the machine, but faced another issue the Memory of the Machine which is 8GB did not get fully detected the machine had detected only 3.22 GB of Memory, for some weird reason.

only-2-80-gb-usable-windows-10-problem-32-bit-cpu-cause-screenshot

After few minutes of investigation online, I've realized, I've installed by mistake a 32 Bit version of Windows 10 Pro…So the next step was of course to upgrade to 64 bit to work around the unrecognized 5.2GB memory… To make sure my Windows 10 Installation is up-to-date I've downloaded the latest one from the Media Creation Installation Tool from Microsoft's website used the tool to burn the Downloaded Image to an Empty USB Stick (mine is 16GB but minimum required would be 4Gb) and proceeded to reboot the Lenovo Desktop machine and boot from the Windows 10 Install Flash Drive. From there on I've had to select I need to install a 64 Bit version of Windows and Skip the Licensing Key fill in Prompt Twice (act as I have no license) as Windows already could recognize the older OEM installed 32 bit install Windows key and automatically fetches the key from there.

Before proceeding to install the 64 Bit Windows, of course double check  that the Machine you have at hand has already the License Key recognized by Microsoft  is 64 Bit capable:

To check 32 bit version of Windows before attempted upgrade is Properly Licensed :

Settings > Update & security > Activation

check-if-windows-is-already-activated-settings-update-and-security-Activation-menus

 

To check whether Hardware is 64 Capable:

Settings -> System -> About

 

is-hardware-processor-64-bit-capable-windows-screenshot

32 bit Windows on x64based processor (Machine supports 64 bit OS)

 

windows10-OS-Installation-media-install-tool

Media Creation Tool Windows 10 MS Installer tool (make sure you select 64-bit (x86) instead of the default

From the Installer, I've installed Windows just like I install a brand new fersh Win OS and after asking the few trivial Installation Program questions landed to the new working OS and proceeded to install the usual software which are a must have on a freshly installed Windows for some of them check my previous article Essential Must have software to install on Fresh  new Windows installation host.

Fix Out of inodes on Postfix Linux Mail Cluster. How to clean up filesystem running out of Inodes, Filesystem inodes on partition is 100% full

Wednesday, August 25th, 2021

Inode_Entry_inode-table-content

Recently we have faced a strange issue with with one of our Clustered Postfix Mail servers (the cluster is with 2 nodes that each has configured Postfix daemon mail servers (running on an OpenVZ virtualized environment).
A heartbeat that checks liveability of clusters and switches nodes in case of one of the two gets broken due to some reason), pretty much a standard SMTP cluster.

So far so good but since the cluster is a kind of abondoned and is pretty much legacy nowadays and used just for some Monitoring emails from different scripts and systems on servers, it was not really checked thoroughfully for years and logically out of sudden the alarming email content sent via the cluster stopped working.

The normal sysadmin job here  was to analyze what is going on with the cluster and fix it ASAP. After some very basic analyzing we catched the problem is caused by a  "inodes full" (100% of available inodes were occupied) problem, e.g. file system run out of inodes on both machines perhaps due to a pengine heartbeat process  bug  leading to producing a high number of .bz2 pengine recovery archive files stored in /var/lib/pengine>

Below are the few steps taken to analyze and fix the problem.
 

1. Finding out about the the system run out of inodes problem


After logging on to system and not finding something immediately is wrong with inodes, all I can see from crm_mon is cluster was broken.
A plenty of emails were left inside the postfix mail queue visible with a standard command

[root@smtp1: ~ ]# postqueue -p

It took me a while to find ot the problem is with inodes because a simple df -h  was showing systems have enough space but still cluster quorum was not complete.
A bit of further investigation led me to a  simple df -i reporting the number of inodes on the local filesystems on both our SMTP1 and SMTP2 got all occupied.

[root@smtp1: ~ ]# df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/simfs            500000   500000  0   100% /
none                   65536      61   65475    1% /dev

As you can see the number of inodes on the Virual Machine are unfortunately depleted

Next step was to check directories occupying most inodes, as this is the place from where files could be temporary moved to a remote server filesystem or moved to another partition with space on a server locally attached drives.
Below command gives an ordered list with directories locally under the mail root filesystem / and its respective occupied number files / inodes,
the more files under a directory the more inodes are being occupied by the files on the filesystem.

 

run-out-if-inodes-what-is-inode-find-out-which-filesystem-or-directory-eating-up-all-your-system-inodes-linux_inode_diagram.gif
1.1 Getting which directory consumes most of the inodes on the systems

 

[root@smtp1: ~ ]# { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
….
…..

…….
    586 /usr/lib64/python2.4
    664 /usr/lib64
    671 /usr/share/man/man8
    860 /usr/bin
   1006 /usr/share/man/man1
   1124 /usr/share/man/man3p
   1246 /var/lib/Pegasus/prev_repository_2009-03-10-1236698426.308128000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2009-05-18-1242636104.524113000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2009-11-06-1257494054.380244000.rpmsave/root#cimv2/classes
   1246 /var/lib/Pegasus/prev_repository_2010-08-04-1280907760.750543000.rpmsave/root#cimv2/classes
   1381 /var/lib/Pegasus/prev_repository_2010-11-15-1289811714.398469000.rpmsave/root#cimv2/classes
   1381 /var/lib/Pegasus/prev_repository_2012-03-19-1332151633.572875000.rpmsave/root#cimv2/classes
   1398 /var/lib/Pegasus/repository/root#cimv2/classes
   1696 /usr/share/man/man3
   400816 /var/lib/pengine

Note, the above command orders the files from bottom to top order and obviosuly the bottleneck directory that is over-eating Filesystem inodes with an exceeding amount of files is
/var/lib/pengine
 

2. Backup old multitude of files just in case of something goes wrong with the cluster after some files are wiped out


The next logical step of course is to check what is going on inside /var/lib/pengine just to find a very ,very large amount of pe-input-*NUMBER*.bz2 files were suddenly produced.

 

[root@smtp1: ~ ]# ls -1 pe-input*.bz2 | wc -l
 400816


The files are produced by the pengine process which is one of the processes that is controlling the heartbeat cluster state, presumably it is done by running process:

[root@smtp1: ~ ]# ps -ef|grep -i pengine
24        5649  5521  0 Aug10 ?        00:00:26 /usr/lib64/heartbeat/pengine


Hence in order to fix the issue, to prevent some inconsistencies in the cluster due to the file deletion,  copied the whole directory to another mounted parition (you can mount it remotely with sshfs for example) or use a local one if you have one:

[root@smtp1: ~ ]# cp -rpf /var/lib/pengine /mnt/attached_storage


and proceeded to clean up some old multitde of files that are older than 2 years of times (720 days):


3. Clean  up /var/lib/pengine files that are older than two years with short loop and find command

 


First I made a list with all the files to be removed in external text file and quickly reviewed it by lessing it like so

[root@smtp1: ~ ]#  cd /var/lib/pengine
[root@smtp1: ~ ]# find . -type f -mtime +720|grep -v pe-error.last | grep -v pe-input.last |grep -v pe-warn.last -fprint /home/myuser/pengine_older_than_720days.txt
[root@smtp1: ~ ]# less /home/myuser/pengine_older_than_720days.txt


Once reviewing commands I've used below command to delete the files you can run below command do delete all older than 2 years that are different from pe-error.last / pe-input.last / pre-warn.last which might be needed for proper cluster operation.

[root@smtp1: ~ ]#  for i in $(find . -type f -mtime +720 -exec echo '{}' \;|grep -v pe-error.last | grep -v pe-input.last |grep -v pe-warn.last); do echo $i; done


Another approach to the situation is to simply review all the files inside /var/lib/pengine and delete files based on year of creation, for example to delete all files in /var/lib/pengine from 2010, you can run something like:
 

[root@smtp1: ~ ]# for i in $(ls -al|grep -i ' 2010 ' | awk '{ print $9 }' |grep -v 'pe-warn.last'); do rm -f $i; done


4. Monitor real time inodes freeing

While doing the clerance of old unnecessery pengine heartbeat archives you can open another ssh console to the server and view how the inodes gets freed up with a command like:

 

# check if inodes is not being rapidly decreased

[root@csmtp1: ~ ]# watch 'df -i'


5. Restart basic Linux services producing pid files and logs etc. to make then workable (some services might not be notified the inodes on the Hard drive are freed up)

Because the hard drive on the system was full some services started to misbehaving and /var/log logging was impacted so I had to also restart them in our case this is the heartbeat itself
that  checks clusters nodes availability as well as the logging daemon service rsyslog

 

# restart rsyslog and heartbeat services
[root@csmtp1: ~ ]# /etc/init.d/heartbeat restart
[root@csmtp1: ~ ]# /etc/init.d/rsyslog restart

The systems had been a data integrity legacy service samhain so I had to restart this service as well to reforce the /var/log/samhain log file to again continusly start writting data to HDD.

# Restart samhain service init script 
[root@csmtp1: ~ ]# /etc/init.d/samhain restart


6. Check up enough inodes are freed up with df

[root@smtp1 log]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/simfs 500000 410531 19469 91% /
none 65536 61 65475 1% /dev


I had to repeat the same process on the second Postfix cluster node smtp2, and after all the steps like below check the status of smtp2 node and the postfix queue, following same procedure made the second smtp2 cluster member as expected 🙂

 

7. Check the cluster node quorum is complete, e.g. postfix cluster is operating normally

 

# Test if email cluster is ok with pacemaker resource cluster manager – lt-crm_mon
 

[root@csmtp1: ~ ]# crm_mon -1
============
Last updated: Tue Aug 10 18:10:48 2021
Stack: Heartbeat
Current DC: smtp2.fqdn.com (bfb3d029-89a8-41f6-a9f0-52d377cacd83) – partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============

Online: [ smtp2.fqdn.com smtp1.fqdn.com ]

failover-ip (ocf::heartbeat:IPaddr2): Started csmtp1.ikossvan.de
Clone Set: postfix_clone
Started: [ smtp2.fqdn.com smtp1fqdn.com ]
Clone Set: pingd_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]
Clone Set: mailto_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]

 

8.  Force resend a few hundred thousands of emails left in the email queue


After some inodes gets freed up due to the file deletion, i've reforced a couple of times the queued mail servers to be immediately resent to remote mail destinations with cmd:

 

# force emails in queue to be resend with postfix

[root@smtp1: ~ ]# sendmail -q


– It was useful to watch in real time how the queued emails are quickly decreased (queued mails are successfully sent to destination addresses) with:

 

# Monitor  the decereasing size of the email queue
[root@smtp1: ~ ]# watch 'postqueue -p|grep -i '@'|wc -l'