Recently we faced a strange issue with one of our clustered Postfix mail servers. The cluster consists of 2 nodes, each with a configured Postfix mail daemon running on an OpenVZ virtualized environment, plus a heartbeat that checks the liveability of the cluster and switches nodes in case one of the two breaks for some reason; pretty much a standard SMTP cluster.
So far so good, but since the cluster is kind of abandoned, pretty much legacy nowadays and used just for some monitoring emails from different scripts and systems on the servers, it was not really checked thoroughly for years, and logically, all of a sudden the alarm emails sent via the cluster stopped going out.
The normal sysadmin job here was to analyze what is going on with the cluster and fix it ASAP. After some very basic analysis we caught that the problem was caused by an "inodes full" condition (100% of the available inodes were occupied), i.e. the file system ran out of inodes on both machines, apparently due to a pengine heartbeat process bug producing a huge number of .bz2 pengine recovery archive files stored in /var/lib/pengine.
Below are the steps taken to analyze and fix the problem.
1. Finding out the system has run out of inodes
After logging on to the system and not immediately finding anything obviously wrong, all I could see from crm_mon was that the cluster was broken.
Plenty of emails were left inside the postfix mail queue, visible with a standard command:
[root@smtp1: ~ ]# postqueue -p
It took me a while to find out the problem is with inodes, because a simple df -h was showing the systems have enough disk space, yet the cluster quorum was still not complete.
A bit of further investigation led me to a simple df -i, which reported that the inodes on the local filesystems of both our SMTP1 and SMTP2 had all been used up.
[root@smtp1: ~ ]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/simfs 500000 500000 0 100% /
none 65536 61 65475 1% /dev
As you can see, the number of inodes on the Virtual Machine is unfortunately depleted.
The next step was to check which directories occupy most of the inodes, as those are the places from which files could be temporarily moved to a remote server filesystem or to another partition with free space on the server's locally attached drives.
The command below gives an ordered list of the directories under the mail root filesystem / together with the respective number of files / inodes they occupy;
the more files under a directory, the more inodes are occupied by those files on the filesystem.
1.1 Getting which directory consumes most of the inodes on the systems
[root@smtp1: ~ ]# { find / -xdev -printf '%h\n' | sort | uniq -c | sort -k 1 -n; } 2>/dev/null
…
586 /usr/lib64/python2.4
664 /usr/lib64
671 /usr/share/man/man8
860 /usr/bin
1006 /usr/share/man/man1
1124 /usr/share/man/man3p
1246 /var/lib/Pegasus/prev_repository_2009-03-10-1236698426.308128000.rpmsave/root#cimv2/classes
1246 /var/lib/Pegasus/prev_repository_2009-05-18-1242636104.524113000.rpmsave/root#cimv2/classes
1246 /var/lib/Pegasus/prev_repository_2009-11-06-1257494054.380244000.rpmsave/root#cimv2/classes
1246 /var/lib/Pegasus/prev_repository_2010-08-04-1280907760.750543000.rpmsave/root#cimv2/classes
1381 /var/lib/Pegasus/prev_repository_2010-11-15-1289811714.398469000.rpmsave/root#cimv2/classes
1381 /var/lib/Pegasus/prev_repository_2012-03-19-1332151633.572875000.rpmsave/root#cimv2/classes
1398 /var/lib/Pegasus/repository/root#cimv2/classes
1696 /usr/share/man/man3
400816 /var/lib/pengine
Note, the above command orders the directories from the smallest to the largest consumer (bottom of the list), and obviously the bottleneck directory that is over-eating filesystem inodes with an excessive amount of files is
/var/lib/pengine
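On boxes with a newer GNU coreutils the same per-directory inode ranking can be obtained with du directly; this is just an alternative sketch and assumes du supports the --inodes option (coreutils 8.22 or later), which an older legacy system like this one may not have:
# per-directory inode usage on the / filesystem, biggest consumers at the bottom (needs GNU coreutils >= 8.22)
[root@smtp1: ~ ]# du --inodes -x / 2>/dev/null | sort -n | tail -20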
2. Back up the old multitude of files, just in case something goes wrong with the cluster after some files are wiped out
The next logical step of course is to check what is going on inside /var/lib/pengine, only to find that a very, very large amount of pe-input-*NUMBER*.bz2 files had suddenly been produced.
[root@smtp1: ~ ]# ls -1 /var/lib/pengine | grep -c '^pe-input.*\.bz2$'
400816
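Before touching anything it is also worth checking how much disk space (not just inodes) the archives take up, so you know how big the backup target has to be:
# overall size of the pengine archive directory
[root@smtp1: ~ ]# du -sh /var/lib/pengine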
The files are produced by the pengine process, which is one of the processes controlling the heartbeat cluster state; presumably they come from the running process:
[root@smtp1: ~ ]# ps -ef|grep -i pengine
24 5649 5521 0 Aug10 ? 00:00:26 /usr/lib64/heartbeat/pengine
Hence, in order to fix the issue and prevent any inconsistencies in the cluster due to the file deletion, I first copied the whole directory to another mounted partition (you can mount one remotely with sshfs, for example, as sketched below) or use a locally attached one if you have it:
[root@smtp1: ~ ]# cp -rpf /var/lib/pengine /mnt/attached_storage
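If there is no spare locally attached partition, /mnt/attached_storage can first be prepared by mounting a remote filesystem over SSH; a quick sketch, assuming sshfs is installed, with the remote host and path below being just placeholders:
# mount a remote backup location over SSH (host and path are placeholders)
[root@smtp1: ~ ]# mkdir -p /mnt/attached_storage
[root@smtp1: ~ ]# sshfs backupuser@backup-host:/backups/smtp1-pengine /mnt/attached_storage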
With the backup in place, I proceeded to clean up the multitude of old files, the ones older than roughly two years (720 days):
3. Clean up /var/lib/pengine files that are older than two years with a short loop and the find command
First I made a list of all the files to be removed in an external text file and quickly reviewed it by lessing it like so:
[root@smtp1: ~ ]# cd /var/lib/pengine
[root@smtp1: ~ ]# find . -type f -mtime +720 | grep -v -e pe-error.last -e pe-input.last -e pe-warn.last > /home/myuser/pengine_older_than_720days.txt
[root@smtp1: ~ ]# less /home/myuser/pengine_older_than_720days.txt
Once the list was reviewed, I used the command below to delete the files; you can run it to delete everything older than 2 years that is different from pe-error.last / pe-input.last / pe-warn.last, which might be needed for proper cluster operation.
[root@smtp1: ~ ]# for i in $(find . -type f -mtime +720 | grep -v -e pe-error.last -e pe-input.last -e pe-warn.last); do rm -f "$i"; done
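The same clean-up can also be done with find alone, without the shell loop; this variant is only a sketch and relies on GNU find's ! -name and -delete options:
# delete archives older than 720 days, keeping the three *.last state files
[root@smtp1: ~ ]# find /var/lib/pengine -type f -mtime +720 ! -name 'pe-error.last' ! -name 'pe-input.last' ! -name 'pe-warn.last' -delete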
Another approach to the situation is to simply review all the files inside /var/lib/pengine and delete them based on the year of creation; for example, to delete all files in /var/lib/pengine from 2010, you can run something like:
[root@smtp1: ~ ]# for i in $(ls -al|grep -i ' 2010 ' | awk '{ print $9 }' |grep -v 'pe-warn.last'); do rm -f $i; done
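A cleaner variant of the same per-year idea, assuming the GNU find on the box is recent enough to support -newermt (older releases may not have it), would be:
# remove only the archives last modified during 2010, keeping the *.last files
[root@smtp1: ~ ]# find /var/lib/pengine -maxdepth 1 -type f -newermt '2010-01-01' ! -newermt '2011-01-01' ! -name '*.last' -delete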
4. Monitor the inodes being freed up in real time
While doing the clearance of the old unnecessary pengine heartbeat archives, you can open another ssh console to the server and watch how the inodes get freed up with a command like:
# check if inodes is not being rapidly decreased
[root@csmtp1: ~ ]# watch 'df -i'
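It is also handy to watch the number of leftover archive files itself shrink while the deletion loop runs, for example:
# count the remaining files in /var/lib/pengine as they get deleted
[root@csmtp1: ~ ]# watch 'ls /var/lib/pengine | wc -l'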
5. Restart basic Linux services producing pid files, logs etc. to make them workable again (some services might not notice that the inodes on the hard drive have been freed up)
Because the filesystem on the system had run out of inodes, some services started misbehaving and logging to /var/log was impacted, so I had to also restart them; in our case this is the heartbeat itself, which checks the cluster nodes' availability, as well as the logging daemon service rsyslog.
# restart rsyslog and heartbeat services
[root@csmtp1: ~ ]# /etc/init.d/heartbeat restart
[root@csmtp1: ~ ]# /etc/init.d/rsyslog restart
The systems also ran a legacy data integrity service, samhain, so I had to restart it as well to force the /var/log/samhain log file to again be continuously written to disk.
# Restart samhain service init script
[root@csmtp1: ~ ]# /etc/init.d/samhain restart
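A quick sanity check that logging is flowing again after the restarts is to simply tail the main log files and watch new lines appear:
# verify rsyslog is writing to /var/log again
[root@csmtp1: ~ ]# tail -f /var/log/messages /var/log/maillog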
6. Check that enough inodes have been freed up with df
[root@smtp1 log]# df -i
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/simfs 500000 410531 19469 91% /
none 65536 61 65475 1% /dev
I had to repeat the same process on the second Postfix cluster node smtp2 and, after all the steps, check the status of the smtp2 node and the postfix queue just like below; following the same procedure brought the second smtp2 cluster member back to normal as expected 🙂
7. Check the cluster node quorum is complete, i.e. the postfix cluster is operating normally
# Test if the email cluster is OK with the pacemaker resource cluster manager crm_mon
[root@csmtp1: ~ ]# crm_mon -1
============
Last updated: Tue Aug 10 18:10:48 2021
Stack: Heartbeat
Current DC: smtp2.fqdn.com (bfb3d029-89a8-41f6-a9f0-52d377cacd83) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
4 Resources configured.
============
Online: [ smtp2.fqdn.com smtp1.fqdn.com ]
failover-ip (ocf::heartbeat:IPaddr2): Started csmtp1.ikossvan.de
Clone Set: postfix_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]
Clone Set: pingd_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]
Clone Set: mailto_clone
Started: [ smtp2.fqdn.com smtp1.fqdn.com ]
8. Force resending the few hundred thousand emails left in the email queue
After some inodes got freed up by the file deletion, I forced the queued mails to be immediately resent to the remote mail destinations a couple of times with the command:
# force emails in queue to be resend with postfix
[root@smtp1: ~ ]# sendmail -q
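The Postfix-native equivalent of sendmail -q is postqueue -f, which asks the queue manager to flush the queued mail; either of the two does the job:
# alternatively flush the postfix queue with the native tool
[root@smtp1: ~ ]# postqueue -f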
It was useful to watch in real time how the number of queued emails quickly decreases (queued mails are being successfully sent to the destination addresses) with:
# Monitor the decreasing size of the email queue
[root@smtp1: ~ ]# watch "postqueue -p | grep -i '@' | wc -l"